Towards Robust Real-time Audio-Visual Speech Enhancement

12/16/2021
by   Mandar Gogate, et al.

The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks, including vision and hearing. For example, in a cocktail party situation, the human auditory cortex contextually integrates audio-visual (AV) cues in order to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal-to-noise ratio (SNR) environments compared to audio-only SE models. However, despite significant research in the area of AV SE, the development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low-latency, speaker-independent AV SE that can generalise across a range of visual and acoustic noises. In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network (DNN) based real-time AV SE model that takes into account the cleaned visual speech output from the GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listening tests. Comparative simulation results show that our real-time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN-based SE models.


research
09/23/2019

CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Noisy situations cause huge problems for sufferers of hearing loss as hear...
research
11/18/2021

Towards Intelligibility-Oriented Audio-Visual Speech Enhancement

Existing deep learning (DL) based speech enhancement approaches are gene...
research
09/01/2017

Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE ...
research
09/06/2017

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

Improving speech system performance in noisy environments remains a chal...
research
06/13/2020

SE-MelGAN – Speaker Agnostic Rapid Speech Enhancement

Recent advancement in Generative Adversarial Networks in speech synthesi...
research
07/17/2022

Improving spatial cues for hearables using a parameterized binaural CDR estimator

We investigate a speech enhancement method based on the binaural coheren...
research
04/12/2021

L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing

The L3DAS21 Challenge is aimed at encouraging and fostering collaborativ...
