LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

12/09/2021
by Leyuan Qu, et al.

The aim of this work is to investigate the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The LipSound2 model is first pre-trained on ∼2400 h of multilingual (English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve significant improvements in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the transferability of the approach. Finally, we build a cascaded lip-reading (video-to-text) system by fine-tuning a pre-trained speech recognition model on the generated audio, achieving state-of-the-art performance on both English and Chinese benchmark datasets.
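To make the attention mechanism concrete: location-aware attention (Chorowski et al., 2015) extends additive attention by feeding a convolution of the previous alignment back into the energy computation, which encourages the decoder to move monotonically over the video frames. The NumPy sketch below shows one decoding step under illustrative shapes and random weights; all parameter names and sizes are assumptions for demonstration, not the authors' released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 75, 256, 512, 128   # illustrative sizes (assumed)

def location_aware_attention(enc_out, dec_state, prev_align,
                             W, V, U, w, conv_filt):
    """One decoding step of location-aware attention: the energy for each
    encoder frame combines the decoder state, the encoder output and a
    'location' feature obtained by convolving the previous alignment."""
    k = len(conv_filt)
    padded = np.pad(prev_align, k // 2)
    # Location features: sliding dot product of the filter over the
    # previous attention weights (a simple 1-D convolution)
    loc_feat = np.array([conv_filt @ padded[t:t + k]
                         for t in range(len(prev_align))])
    # Additive energy over all encoder frames, then a softmax
    energy = np.tanh(enc_out @ W.T + dec_state @ V.T
                     + np.outer(loc_feat, U)) @ w
    align = np.exp(energy - energy.max())
    align /= align.sum()
    context = align @ enc_out            # weighted sum of encoder frames
    return context, align

# Random weights stand in for trained parameters.
W = rng.standard_normal((d_att, d_enc)) * 0.01
V = rng.standard_normal((d_att, d_dec)) * 0.01
U = rng.standard_normal(d_att) * 0.01
w = rng.standard_normal(d_att)
conv_filt = rng.standard_normal(31) * 0.1

enc_out = rng.standard_normal((T, d_enc))    # video encoder outputs
dec_state = rng.standard_normal(d_dec)       # current decoder state
prev_align = np.full(T, 1.0 / T)             # uniform initial alignment

context, align = location_aware_attention(enc_out, dec_state, prev_align,
                                          W, V, U, w, conv_filt)
print(context.shape)   # (256,)
```

In the full model the returned context vector would be concatenated with the decoder state to predict the next mel-spectrogram frame; the alignment `align` is carried forward as `prev_align` for the next step.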


