Multimodal Transformer Distillation for Audio-Visual Synchronization

10/27/2022
by Xuanjun Chen, et al.

Audio-visual synchronization aims to determine whether the mouth movements and speech in a video are synchronized. VocaLiST reaches state-of-the-art performance by incorporating multimodal Transformers to model audio-visual interaction information. However, it requires high computing resources, making it impractical for real-world applications. This paper proposes MTDVocaLiST, a model trained with our proposed multimodal Transformer distillation (MTD) loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention distribution and value-relation in the Transformer of VocaLiST. Our proposed method is effective in two respects. From the distillation method perspective, the MTD loss outperforms other strong distillation baselines. From the distilled model's performance perspective: 1) MTDVocaLiST outperforms similar-size SOTA models, SyncNet and PM models, by 15.69% [...]; 2) MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining similar performance.
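
The abstract names two quantities that the student mimics, the cross-attention distribution and the value-relation, but does not give the loss formula. The following is a minimal sketch of one plausible reading, not the paper's definition. It assumes a single attention head and a single layer, KL divergence as the matching criterion, a value-relation defined as the row-wise softmax of scaled dot products between value vectors, and equal weighting of the two terms. All function names (`softmax`, `kl_rows`, `value_relation`, `mtd_style_loss`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def kl_rows(p_rows, q_rows, eps=1e-12):
    """Mean over rows of KL(p || q); each row is a probability distribution."""
    total = 0.0
    for p, q in zip(p_rows, q_rows):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(p_rows)


def value_relation(values):
    """Assumed definition: softmax(V V^T / sqrt(d)), computed row by row.

    The result is (sequence length x sequence length), so teacher and
    student may use different hidden sizes d.
    """
    scale = math.sqrt(len(values[0]))
    return [
        softmax([sum(a * b for a, b in zip(vi, vj)) / scale for vj in values])
        for vi in values
    ]


def mtd_style_loss(teacher_attn, student_attn, teacher_values, student_values):
    """Sum of the attention-distribution term and the value-relation term.

    Equal weighting of the two terms is an assumption of this sketch.
    """
    attn_term = kl_rows(teacher_attn, student_attn)
    value_term = kl_rows(value_relation(teacher_values), value_relation(student_values))
    return attn_term + value_term


# Toy inputs: 2 query positions attending over 3 key positions,
# with 3 value vectors (teacher hidden size 4, student hidden size 2).
teacher_attn = [softmax([1.0, 2.0, 3.0]), softmax([0.5, 0.5, 0.5])]
student_attn = [softmax([1.0, 1.0, 1.0]), softmax([0.2, 0.9, 0.1])]
teacher_values = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.3, 0.3, 0.3, 0.3]]
student_values = [[0.5, -0.5], [1.0, 2.0], [0.0, 0.1]]

loss = mtd_style_loss(teacher_attn, student_attn, teacher_values, student_values)
print(f"MTD-style loss on toy inputs: {loss:.4f}")
```

Because the value-relation is computed between sequence positions rather than hidden units, this kind of matching does not require the student to share the teacher's hidden width, which suits a distillation setup where the student is much smaller.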
