Visual Speech Enhancement Without A Real Visual Stream

12/20/2020
by Sindhu B Hegde et al.

In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lip movements, showing that we can exploit the advantages of lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies, qualitative comparisons, and a demo video illustrating the effectiveness of our approach are available on our website: <http://cvit.iiit.ac.in/research/projects/cvit-projects/visual-speech-enhancement-without-a-real-visual-stream>. The code and models are also released for future research: <https://github.com/Sindhu-Hegde/pseudo-visual-speech-denoising>.
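The pipeline described above can be sketched at inference time: a student network maps noisy audio to pseudo lip features (having been trained to mimic a speech-driven lip-synthesis teacher), and a denoiser fuses the noisy audio with those pseudo-lips to produce enhanced speech. The sketch below is a minimal illustration with toy linear "networks" and assumed feature dimensions; the names `student` and `denoiser` and all shapes are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative feature sizes (not from the paper):
AUDIO_DIM = 80   # e.g. mel-spectrogram bins per frame
LIP_DIM = 32     # e.g. lip-embedding size

# Toy stand-ins for the trained networks.
W_student = rng.normal(size=(AUDIO_DIM, LIP_DIM)) * 0.1
W_denoise = rng.normal(size=(AUDIO_DIM + LIP_DIM, AUDIO_DIM)) * 0.1

def student(noisy_audio):
    """Map noisy audio frames to pseudo lip features.

    In the paper, this network is trained with a lip-synthesis
    teacher so its outputs act as a 'visual noise filter'.
    """
    return np.tanh(noisy_audio @ W_student)

def denoiser(noisy_audio, lips):
    """Fuse audio with (pseudo) lip features to predict enhanced speech frames."""
    fused = np.concatenate([noisy_audio, lips], axis=-1)
    return fused @ W_denoise

# Inference: no real video stream is required.
T = 100  # number of frames
noisy = rng.normal(size=(T, AUDIO_DIM))
pseudo_lips = student(noisy)
enhanced = denoiser(noisy, pseudo_lips)
print(enhanced.shape)  # (100, 80)
```

The key design point is that the visual branch is synthesized from the noisy audio itself, so the audio-visual enhancer can be applied even when the camera feed is absent or unreliable.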


Related research

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis (03/31/2022)
Since facial actions such as lip movements contain significant informati...

A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild (08/23/2020)
In this work, we investigate the problem of lip-syncing a talking face v...

AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement (09/14/2023)
Speech enhancement systems are typically trained using pairs of clean an...

Visual Speech Enhancement using Noise-Invariant Training (11/23/2017)
Visual speech enhancement is used on videos shot in noisy environments t...

ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement (12/21/2022)
Prior works on improving speech quality with visual input typically stud...

Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments (08/28/2018)
Human speech processing is inherently multimodal, where visual cues (lip...

Puppet Dubbing (02/12/2019)
Dubbing puppet videos to make the characters (e.g. Kermit the Frog) conv...
