Multi-Modal Hybrid Deep Neural Network for Speech Enhancement

06/15/2016
by Zhenzhou Wu, et al.

Deep Neural Networks (DNNs) have been successful in enhancing noisy speech signals. Enhancement is achieved by learning a nonlinear mapping from the features of the corrupted speech signal to those of the reference clean speech signal. The quality of the predicted features can be improved by providing additional side-channel information that is robust to noise, such as visual cues. In this paper we propose a novel deep learning model inspired by insights from human audio-visual perception. In the proposed unified hybrid architecture, features from a Convolutional Neural Network (CNN) that processes the visual cues and features from a fully connected DNN that processes the audio signal are integrated using a Bidirectional Long Short-Term Memory (BiLSTM) network. The parameters of the hybrid model are jointly learned using backpropagation. We compare the quality of enhanced speech from the hybrid models with that from traditional DNN and BiLSTM models.
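The architecture the abstract describes can be sketched as follows. This is an illustrative PyTorch reconstruction, not the authors' implementation: all layer sizes, kernel sizes, and feature dimensions here are assumptions chosen for clarity, and the actual paper may differ in every such detail.

```python
import torch
import torch.nn as nn

class HybridAVEnhancer(nn.Module):
    """Sketch of a hybrid audio-visual enhancer: a CNN branch for lip
    frames, a fully connected DNN branch for noisy audio features, and
    a BiLSTM that fuses the two streams over time. All dimensions are
    illustrative assumptions."""

    def __init__(self, audio_dim=257, visual_channels=1, lstm_hidden=128):
        super().__init__()
        # CNN branch: one lip-region frame per time step
        self.cnn = nn.Sequential(
            nn.Conv2d(visual_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),  # -> 32 * 4 * 4 = 512 visual features
        )
        # Fully connected DNN branch: noisy audio features per frame
        self.dnn = nn.Sequential(
            nn.Linear(audio_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )
        # BiLSTM integrates the concatenated feature streams over time
        self.bilstm = nn.LSTM(512 + 256, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Linear readout predicts clean-speech features per frame
        self.out = nn.Linear(2 * lstm_hidden, audio_dim)

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim)
        # video: (batch, time, channels, height, width)
        b, t = audio.shape[:2]
        v = self.cnn(video.reshape(b * t, *video.shape[2:]))
        v = v.reshape(b, t, -1)
        a = self.dnn(audio)
        fused, _ = self.bilstm(torch.cat([a, v], dim=-1))
        return self.out(fused)

model = HybridAVEnhancer()
enhanced = model(torch.randn(2, 10, 257),       # noisy audio features
                 torch.randn(2, 10, 1, 32, 32)) # lip frames
# enhanced has shape (batch=2, time=10, audio_dim=257)
```

Because the fusion network is a single module, the whole model can be trained end to end with backpropagation against the clean reference features, e.g. with an MSE loss between `enhanced` and the clean-speech feature targets, matching the joint learning the abstract describes.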

Related research

- 07/31/2018: DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation
- 08/28/2018: Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments
- 10/12/2018: A Fully Time-domain Neural Model for Subband-based Speech Synthesizer
- 04/25/2020: Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement
- 03/08/2022: Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems
- 10/26/2022: Acoustically-Driven Phoneme Removal That Preserves Vocal Affect Cues
- 06/29/2021: Towards a generalized monaural and binaural auditory model for psychoacoustics and speech intelligibility
