AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement

09/14/2023
by Ju-chieh Chou, et al.

Speech enhancement systems are typically trained using pairs of clean and noisy speech. In audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data available; most audio-visual datasets are collected in real-world environments with background noise and reverberation, hampering the development of AVSE. In this work, we introduce AV2Wav, a resynthesis-based audio-visual speech enhancement approach that can generate clean speech despite the challenges of real-world training data. We obtain a subset of nearly clean speech from an audio-visual corpus using a neural quality estimator, and then train a diffusion model on this subset to generate waveforms conditioned on continuous speech representations from AV-HuBERT with noise-robust training. We use continuous rather than discrete representations to retain prosody and speaker information. With this vocoding task alone, the model can perform speech enhancement better than a masking-based baseline. We further fine-tune the diffusion model on clean/noisy utterance pairs to improve the performance. Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test and is close in quality to the target speech in the listening test. Audio samples can be found at https://home.ttic.edu/~jcchou/demo/avse/avse_demo.html.
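To make the resynthesis idea more concrete, the following is a minimal, hypothetical PyTorch sketch of the core training step described in the abstract: a denoising diffusion model learns to predict the noise added to a clean waveform, conditioned on a sequence of continuous speech features standing in for AV-HuBERT outputs. All module, function, and variable names here are illustrative assumptions, not the authors' implementation, and the toy denoiser omits the architectural details, noise-robust conditioning, and fine-tuning stage of the actual AV2Wav system.

```python
# Hypothetical sketch: DDPM-style vocoder training step conditioned on
# continuous speech features (e.g., AV-HuBERT outputs). Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a waveform given a diffusion
    timestep and an upsampled continuous conditioning sequence."""
    def __init__(self, feat_dim=768, hidden=128, num_steps=1000):
        super().__init__()
        self.cond_proj = nn.Linear(feat_dim, hidden)
        self.time_emb = nn.Embedding(num_steps, hidden)
        self.net = nn.Sequential(
            nn.Conv1d(1 + hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 3, padding=1),
        )

    def forward(self, noisy_wav, t, feats):
        # noisy_wav: (B, T_wav); t: (B,); feats: (B, T_feat, feat_dim)
        cond = self.cond_proj(feats).transpose(1, 2)          # (B, H, T_feat)
        cond = F.interpolate(cond, size=noisy_wav.shape[-1])  # match waveform rate
        cond = cond + self.time_emb(t)[:, :, None]            # add timestep embedding
        x = torch.cat([noisy_wav.unsqueeze(1), cond], dim=1)  # (B, 1+H, T_wav)
        return self.net(x).squeeze(1)                         # predicted noise

# Standard DDPM forward process and epsilon-prediction loss.
NUM_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, NUM_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, clean_wav, feats):
    B = clean_wav.shape[0]
    t = torch.randint(0, NUM_STEPS, (B,))
    noise = torch.randn_like(clean_wav)
    a = alphas_cumprod[t].sqrt()[:, None]
    s = (1.0 - alphas_cumprod[t]).sqrt()[:, None]
    noisy_wav = a * clean_wav + s * noise                     # sample from q(x_t | x_0)
    return F.mse_loss(model(noisy_wav, t, feats), noise)

# Example usage with random tensors standing in for real data.
model = CondDenoiser()
clean_wav = torch.randn(2, 16000)    # 1 s of 16 kHz audio per example
feats = torch.randn(2, 50, 768)      # 50 frames of continuous features
loss = diffusion_loss(model, clean_wav, feats)
loss.backward()
```

At inference, the same conditioning would be applied at every reverse-diffusion step, so enhancement reduces to resynthesizing a waveform from the (noise-robust) continuous features of the noisy input.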

Related research

11/20/2022 · LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
12/21/2022 · ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement
05/09/2023 · Temporal Convolution Network Based Onset Detection and Query by Humming System Design
11/21/2020 · Deep Network Perceptual Losses for Speech Denoising
04/25/2020 · Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement
12/20/2020 · Visual Speech Enhancement Without A Real Visual Stream
02/19/2021 · Speech enhancement with weakly labelled data from AudioSet
