DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

07/31/2018
by Mandar Gogate, et al.

The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. This selective attention process is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audio-visual (AV) mask estimation model. The proposed model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited to leverage the complementary strengths of a stacked long short-term memory (LSTM) network and a convolutional LSTM network. Comparative simulation results, in terms of speech quality and intelligibility, demonstrate a significant performance improvement of the proposed AV mask estimation model over audio-only and visual-only mask estimation approaches in both speaker-dependent and speaker-independent scenarios.
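To make the estimation target concrete, the sketch below computes an ideal binary mask in its commonly used form: a time-frequency unit is kept when its local speech-to-noise ratio exceeds a threshold. The abstract does not state the exact definition or threshold used in the paper, so the function names, the 0 dB default for `lc_db`, and the toy magnitudes are assumptions for illustration only. It is not the proposed DNN, which estimates such a mask from noisy audio and visual features.

```python
import math

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0, eps=1e-12):
    """Return a 0/1 mask per time-frequency unit.

    speech_mag, noise_mag: lists of frames, each a list of magnitudes.
    lc_db: local SNR threshold in dB (assumed value, not from the paper).
    """
    mask = []
    for s_frame, n_frame in zip(speech_mag, noise_mag):
        row = []
        for s, n in zip(s_frame, n_frame):
            # Local SNR in dB from the power ratio; eps avoids division by zero.
            snr_db = 10.0 * math.log10((s * s + eps) / (n * n + eps))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

def apply_mask(mixture_mag, mask):
    """Keep speech-dominated units of the mixture and zero out the rest."""
    return [[m * b for m, b in zip(m_frame, b_frame)]
            for m_frame, b_frame in zip(mixture_mag, mask)]

# Toy 2-frame, 2-bin magnitudes (hypothetical values).
speech = [[1.0, 0.1], [0.5, 2.0]]
noise = [[0.5, 0.4], [0.6, 1.0]]
mixture = [[1.2, 0.4], [0.9, 2.5]]

mask = ideal_binary_mask(speech, noise)
separated = apply_mask(mixture, mask)
```

Computing this mask requires the clean speech and noise separately, which are only available for training data. A mask estimation model is trained to predict it from the noisy input.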

research
06/15/2016

Multi-Modal Hybrid Deep Neural Network for Speech Enhancement

Deep Neural Networks (DNN) have been successful in enhancing noisy spe...
research
12/05/2019

Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras

In this work, we propose a new method to address audio-visual target spe...
research
05/24/2018

VisemeNet: Audio-Driven Animator-Centric Speech Animation

We present a novel deep-learning based approach to producing animator-ce...
research
07/31/2018

Lip-Reading Driven Deep Learning Approach for Speech Enhancement

This paper proposes a novel lip-reading driven deep learning framework f...
research
09/23/2019

CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Noisy situations cause huge problems for sufferers of hearing loss as hear...
research
03/24/2019

Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It a...
research
12/12/2017

Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

The task of estimating the maximum number of concurrent speakers from si...
