RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

07/03/2023
by Neha Sahipjohn, et al.

Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that direct mel-spectrogram prediction hampers training and model efficiency because it entangles speech content with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, achieving state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/
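The two-stage design described in the abstract can be sketched as follows. This is a minimal illustrative pipeline under stated assumptions, not the authors' implementation: the linear "content predictor" stands in for the real non-autoregressive sequence-to-sequence model, the sinusoid "vocoder" stands in for a neural vocoder, and all dimensions and feature names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (sketch): map self-supervised visual features to discrete
# speech-content units. A single linear projection plus argmax over a
# unit codebook is a placeholder for the paper's seq2seq model.
def predict_content_units(visual_feats, proj):
    logits = visual_feats @ proj          # (T, n_units)
    return logits.argmax(axis=-1)         # (T,) discrete unit IDs

# Stage 2 (sketch): a toy "vocoder" that renders each content unit as a
# short sinusoid frame; a real system would use a neural vocoder.
def toy_vocoder(units, sr=16000, frame_len=320):
    t = np.arange(frame_len) / sr
    frames = [np.sin(2 * np.pi * (100.0 + 50.0 * u) * t) for u in units]
    return np.concatenate(frames)

T, d_visual, n_units = 50, 512, 100
visual_feats = rng.standard_normal((T, d_visual))  # stand-in for SSL features
proj = rng.standard_normal((d_visual, n_units))

units = predict_content_units(visual_feats, proj)
wav = toy_vocoder(units)
print(units.shape, wav.shape)  # (50,) (16000,)
```

The point of the modular split is that stage 1 only has to predict speaker- and noise-free content units, while stage 2 alone is responsible for acoustic detail.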

