Skating-Mixer: Multimodal MLP for Scoring Figure Skating

03/08/2022
by Jingfei Xia, et al.

Figure skating scoring is a challenging task because it requires judging both the skaters' technical moves and their coordination with the background music. Prior learning-based work does not solve it well for two reasons: 1) each move in figure skating changes quickly, so simply applying traditional frame sampling loses a lot of valuable information, especially in a video lasting 3-5 minutes, which makes extremely long-range representation learning necessary; 2) prior methods rarely model the critical audio-visual relationship. We therefore introduce a multimodal MLP architecture named Skating-Mixer. It extends the MLP-Mixer framework to the multimodal setting and effectively learns long-term representations through our designed memory recurrent unit (MRU). Aside from the model, we also collected a high-quality audio-visual dataset, FS1000, which contains over 1000 videos covering 8 types of programs with 7 different rating metrics, surpassing other datasets in both quantity and diversity. Experiments show that the proposed method outperforms state-of-the-art methods on all major metrics on the public Fis-V dataset and on our FS1000 dataset. In addition, we include an analysis applying our method to recent competitions from the Beijing 2022 Winter Olympic Games, which demonstrates the robustness of our method.
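To make the idea of clip-by-clip multimodal mixing more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not describe the internals of the MRU, so the memory update below (carrying one mixed "memory token" from clip to clip) is a hypothetical stand-in. The function names, feature sizes, and random weights are likewise assumptions chosen for illustration. What the sketch does show is a standard Mixer-style block (token-mixing MLP, then channel-mixing MLP, each with a skip connection) applied to concatenated audio and visual tokens, so that a long video can be processed without discarding most of its frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (no learned scale/shift, for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    # Two-layer MLP applied along the last axis.
    return gelu(x @ w1) @ w2

def init_block(num_tokens, dim, hidden, scale=0.02):
    # Random weights only; a real model would learn these.
    return {
        "tok_w1": rng.normal(0, scale, (num_tokens, hidden)),
        "tok_w2": rng.normal(0, scale, (hidden, num_tokens)),
        "ch_w1": rng.normal(0, scale, (dim, hidden)),
        "ch_w2": rng.normal(0, scale, (hidden, dim)),
    }

def mixer_block(x, p):
    # x has shape (tokens, dim).
    # Token mixing: transpose so the MLP acts across tokens, then transpose back.
    x = x + mlp(layer_norm(x).T, p["tok_w1"], p["tok_w2"]).T
    # Channel mixing: the MLP acts across feature channels of each token.
    x = x + mlp(layer_norm(x), p["ch_w1"], p["ch_w2"])
    return x

def score_video(video_clips, audio_clips, params, w_out):
    # video_clips: (clips, Tv, dim); audio_clips: (clips, Ta, dim).
    dim = video_clips.shape[-1]
    memory = np.zeros((1, dim))  # hypothetical memory token, NOT the paper's MRU
    for v, a in zip(video_clips, audio_clips):
        # Audio and visual tokens are mixed jointly with the carried memory.
        tokens = np.concatenate([memory, v, a], axis=0)
        mixed = mixer_block(tokens, params)
        memory = mixed[:1]  # assumed update rule: keep the mixed memory token
    # Linear read-out of the final memory as a scalar score.
    return float(memory[0] @ w_out)

# Usage with made-up sizes: 6 clips, 8 visual and 8 audio tokens per clip.
num_clips, tv, ta, dim = 6, 8, 8, 16
video = rng.normal(size=(num_clips, tv, dim))
audio = rng.normal(size=(num_clips, ta, dim))
params = init_block(1 + tv + ta, dim, hidden=32)
w_out = rng.normal(0, 0.02, dim)
score = score_video(video, audio, params, w_out)
print(score)
```

Because every clip passes through the same block and only a small memory is carried forward, the cost grows linearly with video length. The printed score is meaningless here since the weights are random. The sketch only illustrates the data flow.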


