LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

11/20/2022
by Rodrigo Mira, et al.

Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.
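The abstract refers to evaluating under different levels of background noise and speech interference. As a rough illustration of what a "low-SNR" condition means, the sketch below mixes a clean signal with an interfering signal at a chosen signal-to-noise ratio using the standard definition SNR_dB = 10 log10(P_speech / P_noise). It is not taken from the paper: the function names (mean_power, mix_at_snr, measured_snr_db) are hypothetical, and the authors' actual data pipeline may differ.

```python
import math


def mean_power(signal):
    """Mean of squared samples for a non-empty sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech + gain * noise, with gain chosen to hit snr_db.

    Hypothetical helper (an assumption, not the paper's code).
    From SNR_dB = 10 * log10(P_s / (gain^2 * P_n)) it follows that
    gain = sqrt(P_s / (P_n * 10^(SNR_dB / 10))).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power and cannot be scaled")
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture given the clean reference."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Toy signals standing in for clean speech and interference.
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [((t * 37) % 11 - 5) / 5 for t in range(200)]

# A negative SNR means the interference carries more power than the speech.
mixture = mix_at_snr(speech, noise, snr_db=-5.0)
```

Lower (more negative) values of snr_db give mixtures where the target speech is increasingly buried, which is the regime the abstract describes as "very noisy scenarios".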

research
09/14/2023

AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

Speech enhancement systems are typically trained using pairs of clean an...
research
03/31/2022

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Since facial actions such as lip movements contain significant informati...
research
08/28/2018

Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments

Human speech processing is inherently multimodal, where visual cues (lip...
research
03/04/2022

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Active speaker detection and speech enhancement have become two increasi...
research
05/09/2023

Temporal Convolution Network Based Onset Detection and Query by Humming System Design

Onsets are a key factor to split audio into several notes. In this paper...
research
11/21/2020

Deep Network Perceptual Losses for Speech Denoising

Contemporary speech enhancement predominantly relies on audio transforms...
research
04/14/2022

RadioSES: mmWave-Based Audioradio Speech Enhancement and Separation System

Speech enhancement and separation have been a long-standing problem, esp...
