Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

09/01/2022
by Sindhu B Hegde, et al.

In this work, we address the problem of generating speech from silent lip videos for any speaker in the wild. In stark contrast to previous works, our method (i) is not restricted to a fixed number of speakers, (ii) does not explicitly impose constraints on the domain or the vocabulary, and (iii) deals with videos that are recorded in the wild as opposed to within laboratory settings. The task presents a host of challenges, the key one being that many features of the desired target speech, such as voice, pitch and linguistic content, cannot be entirely inferred from the silent face video. To handle these stochastic variations, we propose a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations. With the help of multiple powerful discriminators that guide the training process, our generator learns to synthesize speech sequences in any voice for the lip movements of any person. Extensive experiments on multiple datasets show that we outperform all baselines by a large margin. Further, our network can be fine-tuned on videos of specific identities to achieve performance comparable to single-speaker models that are trained on 4× more data. We conduct numerous ablation studies to analyze the effect of different modules of our architecture. We also provide a demo video that demonstrates several qualitative results, along with the code and trained models, on our website: <http://cvit.iiit.ac.in/research/projects/cvit-projects/lip-to-speech-synthesis>
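To make the VAE-GAN idea concrete, the following is a minimal sketch of how a generic VAE-GAN generator objective is commonly assembled: a reconstruction term, a KL term that regularizes the latent variable, and an adversarial term from a discriminator. It is not the authors' implementation. The function names, the L1 reconstruction choice, the loss weights, and the use of a single discriminator score are all assumptions made for illustration; the abstract only states that a VAE-GAN with multiple discriminators is used.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dim."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, log_var))


def generator_loss(mel_pred, mel_true, mu, log_var, disc_score,
                   kl_weight=0.01, adv_weight=0.1):
    """Hypothetical combined objective; the weights are illustrative assumptions.

    mel_pred / mel_true: flat lists of predicted and target spectrogram values.
    disc_score: discriminator's probability in (0, 1] that mel_pred is real.
    """
    # L1 reconstruction error, averaged over all spectrogram values.
    recon = sum(abs(p - t) for p, t in zip(mel_pred, mel_true)) / len(mel_true)
    # KL term pulls the latent posterior towards the standard normal prior.
    kl = kl_to_standard_normal(mu, log_var)
    # Non-saturating adversarial term: -log D(G(x)), clipped for stability.
    adv = -math.log(max(disc_score, 1e-7))
    return recon + kl_weight * kl + adv_weight * adv


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.2, -0.1], [0.0, -1.0]
    z = reparameterize(mu, log_var, rng)
    print(len(z), generator_loss([0.5, 0.4], [0.5, 0.6], mu, log_var, 0.7))
```

The sampled latent is what lets a model of this kind produce different plausible voices or pitches for the same lip sequence, which is the kind of stochastic variation the abstract refers to.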
