Audio-Visual Speech Enhancement with Score-Based Generative Models

06/02/2023
by Julius Richter, et al.

This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligned, and incorporated into the noise conditional score network. Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality and reduces generative artifacts such as phonetic confusions with respect to the audio-only equivalent. The latter is supported by the word error rate of a downstream automatic speech recognition model, which decreases noticeably, especially at low input signal-to-noise ratios.
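
The aggregation and time-alignment steps mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how the layer-wise features are combined or resampled, so everything below is an assumption made for illustration: a softmax-weighted sum over layers and linear interpolation from the video frame rate to the number of audio (spectrogram) frames. The function names `aggregate_layers` and `time_align` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_feats, layer_logits):
    """Weighted sum across encoder layers (assumed aggregation scheme).

    layer_feats[l][t][d] holds the feature of layer l at video frame t.
    Returns a list of shape [T_video][D].
    """
    weights = softmax(layer_logits)
    n_layers = len(layer_feats)
    n_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    return [
        [
            sum(weights[l] * layer_feats[l][t][d] for l in range(n_layers))
            for d in range(dim)
        ]
        for t in range(n_frames)
    ]


def time_align(feats, n_target):
    """Linearly interpolate [T_video][D] features to n_target frames
    (assumed alignment scheme, e.g. to match the spectrogram frame count)."""
    n_src = len(feats)
    if n_src == 1 or n_target == 1:
        return [list(feats[0]) for _ in range(n_target)]
    out = []
    for i in range(n_target):
        pos = i * (n_src - 1) / (n_target - 1)  # fractional source index
        lo = int(math.floor(pos))
        hi = min(lo + 1, n_src - 1)
        frac = pos - lo
        out.append(
            [(1 - frac) * a + frac * b for a, b in zip(feats[lo], feats[hi])]
        )
    return out


# Toy example: 2 layers, 3 video frames, feature dimension 2.
layers = [
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
]
fused = aggregate_layers(layers, layer_logits=[0.0, 0.0])  # equal weights
aligned = time_align(fused, n_target=5)  # upsample to 5 audio frames
```

In a setup like this, the aligned features would be passed to the score network as an extra conditioning input alongside the noisy speech representation; how the paper actually injects them is not described in the abstract.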


research
08/07/2019

Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoder

Variational auto-encoders (VAEs) are deep generative latent variable mod...
research
06/07/2022

Universal Speech Enhancement with Score-based Diffusion

Removing background noise from speech audio has been the subject of cons...
research
03/31/2022

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

Score-based generative models (SGMs) have recently shown impressive resu...
research
10/26/2018

Scaling Speech Enhancement in Unseen Environments with Noise Embeddings

We address the problem of speech enhancement generalisation to unseen en...
research
04/19/2022

Audio-Visual Wake Word Spotting System For MISP Challenge 2021

This paper presents the details of our system designed for the Task 1 of...
research
10/30/2021

Cross-attention conformer for context modeling in speech enhancement for ASR

This work introduces cross-attention conformer, an attention-based archi...
research
07/31/2020

Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones

A novel framework for meeting transcription using asynchronous microphon...
