Video-Driven Speech Reconstruction using Generative Adversarial Networks

06/14/2019
by   Konstantinos Vougioukas, et al.

Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without converting to and from intermediate features. Our proposed approach, based on GANs, is capable of producing natural-sounding, intelligible speech which is synchronised with the video. The performance of our model is evaluated on the GRID dataset for both speaker-dependent and speaker-independent scenarios. To the best of our knowledge, this is the first method that maps video directly to raw audio and the first to produce intelligible speech when tested on previously unseen speakers. We evaluate the synthesised audio based not only on sound quality but also on the accuracy of the spoken words.

research
04/27/2021

End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks

Video-to-speech is the process of reconstructing the audio speech from a...
research
05/23/2018

End-to-End Speech-Driven Facial Animation with Temporal GANs

Speech-driven facial animation is the process which uses speech signals ...
research
03/11/2023

The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

The Multi-modal Information based Speech Processing (MISP) challenge aim...
research
03/25/2019

Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks

Speech is a rich biometric signal that contains information about the id...
research
11/14/2020

Speech Prediction in Silent Videos using Variational Autoencoders

Understanding the relationship between the auditory and visual signals i...
research
01/10/2023

Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

In this paper we propose a method for end-to-end speech driven video edi...
research
05/31/2017

Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers

In this paper, we present a system that associates faces with voices in ...
