Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network

09/01/2017
by   Jen-Cheng Hou, et al.
0

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus on addressing audio information only. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. In the proposed AVDCNN SE model, audio and visual data are first processed using individual CNNs, and then, fused into a joint network to generate enhanced speech at the output layer. The AVDCNN model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five objective criteria. Results show that the AVDCNN yields notably better performance, compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process.

READ FULL TEXT
research
03/30/2017

Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE ...
research
02/14/2022

EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement

Multimodal learning has been proven to be an effective method to improve...
research
03/30/2017

Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE ...
research
05/24/2020

Lite Audio-Visual Speech Enhancement

Previous studies have confirmed the effectiveness of incorporating visua...
research
11/15/2018

On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Audio-visual speech enhancement (AV-SE) is the task of improving speech ...
research
12/16/2021

Towards Robust Real-time Audio-Visual Speech Enhancement

The human brain contextually exploits heterogeneous sensory information ...
research
05/11/2023

Extending Audio Masked Autoencoders Toward Audio Restoration

Audio classification and restoration are among major downstream tasks in...

Please sign up or login with your details

Forgot password? Click here to reset