Audio-visual video-to-speech synthesis with synthesized input audio

07/31/2023
by Triantafyllos Kefalas, et al.

Video-to-speech synthesis involves reconstructing the speech signal of a speaker from a silent video. The task implicitly assumes that the audio signal is either missing or so heavily corrupted by noise that it is not useful for processing. Previous work either uses video inputs only, or employs both video and audio inputs during training and discards the audio input pathway during inference. In this work, we investigate the effect of using both video and audio inputs for video-to-speech synthesis during training and inference alike. In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals, and then train an audio-visual-to-speech synthesis model that takes both the silent video and the synthesized speech as inputs and predicts the final reconstructed speech. Our experiments demonstrate that this approach is successful with both raw waveforms and mel spectrograms as target outputs.
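
The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two "models" are toy stand-in functions rather than neural networks, and every name here (`toy_video_to_speech`, `toy_audio_visual_to_speech`, `reconstruct_speech`, `SAMPLES_PER_FRAME`) is hypothetical. The sketch only shows that the second stage receives both the silent video and the speech synthesized by the first stage.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # hypothetical: one feature vector per video frame
Video = Sequence[Frame]
Audio = List[float]       # hypothetical: a list of waveform samples

SAMPLES_PER_FRAME = 4     # assumed audio/video rate ratio, for illustration only


def toy_video_to_speech(video: Video) -> Audio:
    """Stand-in for a pre-trained video-to-speech model (not a real network)."""
    audio: Audio = []
    for frame in video:
        level = sum(frame) / len(frame)          # one value per frame
        audio.extend([level] * SAMPLES_PER_FRAME)
    return audio


def toy_audio_visual_to_speech(video: Video, audio: Audio) -> Audio:
    """Stand-in for the audio-visual model: it consumes BOTH inputs."""
    if len(audio) != len(video) * SAMPLES_PER_FRAME:
        raise ValueError("audio and video are not time-aligned")
    output: Audio = []
    for i, sample in enumerate(audio):
        frame = video[i // SAMPLES_PER_FRAME]    # frame aligned with this sample
        visual_cue = max(frame)
        output.append(0.5 * sample + 0.5 * visual_cue)
    return output


def reconstruct_speech(
    video: Video,
    v2s: Callable[[Video], Audio],
    av2s: Callable[[Video, Audio], Audio],
) -> Audio:
    """Stage 1 synthesizes the missing audio; stage 2 fuses it with the video."""
    synthesized = v2s(video)
    return av2s(video, synthesized)


silent_video = [[0.0, 1.0], [0.5, 0.5]]
speech = reconstruct_speech(
    silent_video, toy_video_to_speech, toy_audio_visual_to_speech
)
```

Because the synthesized audio comes from the video alone, the same `reconstruct_speech` path can be used at training time and at inference time, which is the property the abstract emphasizes.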

research
06/27/2023

Large-scale unsupervised audio pre-training for video-to-speech synthesis

Video-to-speech synthesis is the task of reconstructing the speech signa...
research
03/01/2023

On the Audio-visual Synchronization for Lip-to-Speech Synthesis

Most lip-to-speech (LTS) synthesis models are trained and evaluated unde...
research
04/01/2022

Residual-guided Personalized Speech Synthesis based on Face Image

Previous works derive personalized speech features by training the model...
research
05/04/2022

SVTS: Scalable Video-to-Speech Synthesis

Video-to-speech synthesis (also known as lip-to-speech) refers to the tr...
research
08/19/2018

Dynamic Temporal Alignment of Speech to Lips

Many speech segments in movies are re-recorded in a studio during postpr...
research
04/06/2020

Vocoder-Based Speech Synthesis from Silent Videos

Both acoustic and visual information influence human perception of speec...
research
03/19/2015

Deep Transform: Time-Domain Audio Error Correction via Probabilistic Re-Synthesis

In the process of recording, storage and transmission of time-domain aud...
