MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

03/09/2023
by   Xize Cheng, et al.

Multimedia communication facilitates global interaction among people. However, although researchers have explored cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, cross-lingual studies on visual speech remain scarce, mainly because datasets pairing visual speech with translated text are lacking. In this paper, we present AVMuST-TED, the first dataset for Audio-Visual Multilingual Speech Translation, derived from TED talks. Visual speech, however, is less distinguishable than audio speech, making it difficult to learn a mapping from source-speech phonemes to target-language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that uses audio speech to regularize the training of visual speech tasks. To further reduce the cross-modality gap and its impact on knowledge transfer, we adopt mixed speech, created by interpolating the audio and visual streams, together with a curriculum learning strategy that adjusts the mixing ratio as training proceeds. MixSpeech improves speech translation in noisy environments, raising BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2, and achieves state-of-the-art lip-reading performance on CMLR (11.1%), LRS2 (25.5%), and LRS3 (28.0%).
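The mixed-speech idea above can be sketched as a frame-wise interpolation of aligned audio and visual feature streams, with the audio weight annealed by a curriculum schedule. The abstract does not specify the actual schedule or feature spaces, so the names (`curriculum_ratio`, `mix_streams`) and the linear annealing below are illustrative assumptions, not the paper's exact method:

```python
def curriculum_ratio(step, total_steps, start=0.9, end=0.3):
    """Illustrative linear curriculum: mixed inputs lean heavily on
    audio early in training, then shift toward the visual stream."""
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

def mix_streams(audio_feats, visual_feats, ratio):
    """Frame-wise linear interpolation of two aligned feature streams,
    given as equal-shaped lists of feature vectors."""
    return [
        [ratio * a + (1.0 - ratio) * v for a, v in zip(af, vf)]
        for af, vf in zip(audio_feats, visual_feats)
    ]

# Toy streams: 2 frames, 3-dim features (values are illustrative only)
audio = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
visual = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

lam = curriculum_ratio(step=50, total_steps=100)  # ~0.6 halfway through
mixed = mix_streams(audio, visual, lam)           # frames pulled toward audio by weight lam
```

In this sketch, a high initial ratio lets the model rely on the more discriminative audio phonemes early on, and the decaying schedule gradually forces it to extract the same information from the visual stream alone.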


