MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

01/15/2021
by   Xinmeng Xu, et al.

The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movements and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. To fuse audio and visual information, an audio-visual fusion strategy is proposed that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to a more powerful representation that increases intelligibility in noisy conditions. The proposed model fuses audio-visual features layer by layer and feeds the fused features to each corresponding decoding layer. Experimental results show relative improvements across six SNR levels; in particular, PESQ increases from 1.21 to 2.06 in the -15 dB SNR experiment.
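The layer-by-layer fusion described above can be pictured as an encoder-decoder network in which the audio and visual encoder streams are fused at every depth and the fused features are passed as skip connections to the matching decoder layers. The PyTorch sketch below illustrates that structure only; the channel sizes, the concatenate-plus-1x1-convolution fusion, the strided convolutions, and the input shapes are all illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Fuses audio and visual features at one encoder depth.

    The concatenate-then-1x1-convolution fusion is an assumption; the paper
    only states that fusion goes beyond simple concatenation.
    """

    def __init__(self, audio_ch, visual_ch):
        super().__init__()
        self.proj = nn.Conv2d(audio_ch + visual_ch, audio_ch, kernel_size=1)

    def forward(self, a, v):
        # Resample visual features to the audio time-frequency resolution
        # so the two streams can be concatenated.
        v = F.interpolate(v, size=a.shape[-2:], mode="nearest")
        return self.proj(torch.cat([a, v], dim=1))


class MultiLayerFusionNet(nn.Module):
    """Minimal sketch of multi-layer audio-visual fusion: fused features are
    produced at every encoder depth and fed to the corresponding decoder
    layer (depths and channel sizes are illustrative)."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        self.audio_enc, self.visual_enc, self.fuse = (
            nn.ModuleList(), nn.ModuleList(), nn.ModuleList())
        a_in = v_in = 1
        for ch in channels:
            self.audio_enc.append(nn.Sequential(
                nn.Conv2d(a_in, ch, 3, stride=2, padding=1), nn.ReLU()))
            self.visual_enc.append(nn.Sequential(
                nn.Conv2d(v_in, ch, 3, stride=2, padding=1), nn.ReLU()))
            self.fuse.append(FusionBlock(ch, ch))
            a_in = v_in = ch

        self.dec = nn.ModuleList()
        rev = list(reversed(channels))                 # e.g. [64, 32, 16]
        in_ch = rev[0]
        for i in range(len(rev)):
            out_ch = rev[i + 1] if i + 1 < len(rev) else rev[-1]
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                nn.ReLU()))
            skip_ch = rev[i + 1] if i + 1 < len(rev) else 0
            in_ch = out_ch + skip_ch                   # next layer also receives a fused skip

        self.out = nn.Conv2d(channels[0], 1, kernel_size=1)

    def forward(self, audio, video):
        fused, a, v = [], audio, video
        for a_enc, v_enc, fuse in zip(self.audio_enc, self.visual_enc, self.fuse):
            a, v = a_enc(a), v_enc(v)
            fused.append(fuse(a, v))                   # fuse at every encoder layer
        x = fused[-1]                                  # deepest fused features as bottleneck
        skips = list(reversed(fused[:-1]))             # shallower fused features
        for i, dec in enumerate(self.dec):
            x = dec(x)
            if i < len(skips):
                x = torch.cat([x, skips[i]], dim=1)    # feed fused features to this decoder layer
        return self.out(x)                             # enhanced spectrogram (or mask) estimate


if __name__ == "__main__":
    net = MultiLayerFusionNet()
    noisy_spec = torch.randn(2, 1, 64, 64)     # hypothetical noisy spectrogram patches
    visual_feat = torch.randn(2, 1, 64, 64)    # hypothetical lip-region feature maps
    print(net(noisy_spec, visual_feat).shape)  # torch.Size([2, 1, 64, 64])
```

In this sketch the deepest fused feature map serves as the bottleneck, and each shallower fused map plays the role the abstract describes: an audio-visual skip connection feeding its corresponding decoding layer.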


Related research

06/30/2022 · Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
Audio-visual speech enhancement system is regarded as one of promising s...

08/21/2020 · An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose pu...

02/04/2021 · VSEGAN: Visual Speech Enhancement Generative Adversarial Network
Speech enhancement is an essential task of improving speech quality in n...

01/15/2021 · AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
Audio-visual speech enhancement system is regarded to be one of promisin...

11/09/2020 · An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments
Audio-visual speech enhancement (AVSE) methods use both audio and visual...

07/09/2022 · Dual-path Attention is All You Need for Audio-Visual Speech Extraction
Audio-visual target speech extraction, which aims to extract a certain s...

03/04/2022 · Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
Active speaker detection and speech enhancement have become two increasi...
