Authors:
Monika Wysoczańska 1,2 and Tomasz Trzciński 3,2
Affiliations:
1 Sport Algorithmics and Gaming, Poland
2 Warsaw University of Technology, Poland
3 Tooploox, Poland
Keyword(s):
Multimodal Learning, Activity Recognition, Music Genre Classification, Multimodal Fusion.
Abstract:
Video content analysis is still an emerging technology, and most work in this area extends from the still-image domain. Dance videos are especially difficult to analyse and recognise, as the performed human actions are highly dynamic. In this work, we introduce a multimodal approach to dance video recognition. Our method combines visual and audio information by fusing their representations to improve classification accuracy. For the visual part, we focus on motion representation, as it is the key factor in distinguishing dance styles. For the audio representation, we emphasise capturing long-term dependencies such as tempo, a crucial dance discriminator. Finally, we fuse the two distinct modalities using a late fusion approach. We compare our model with the corresponding unimodal approaches through an exhaustive evaluation on the Let's Dance dataset. Our method yields significantly better results than either single-modality approach. The results presented in this work not only demonstrate the strength of integrating complementary sources of information in the recognition task, but also indicate the potential of applying multimodal approaches within specific research areas.
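The late fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' released code; the logits, class count, and equal modality weighting are all illustrative assumptions. The idea is that each unimodal classifier produces its own class probabilities, and the fused prediction is a weighted average of those probabilities rather than a combination of raw features:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def late_fusion(visual_logits, audio_logits, w_visual=0.5):
    # Weighted average of per-modality class probabilities (late fusion).
    p_visual = softmax(visual_logits)
    p_audio = softmax(audio_logits)
    return [w_visual * v + (1.0 - w_visual) * a
            for v, a in zip(p_visual, p_audio)]

# Hypothetical logits for three dance classes from each unimodal model.
visual = [2.0, 0.5, 0.1]
audio = [1.5, 1.4, 0.2]

fused = late_fusion(visual, audio)
pred = max(range(len(fused)), key=fused.__getitem__)
```

Because the fusion operates on probabilities, each modality can use whatever backbone suits it (motion features for video, long-range temporal features for audio), and the weight `w_visual` can be tuned on a validation set.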