An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments

11/09/2020
by Shrishti Saha Shetu, et al.

Audio-visual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement, and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately by different sub-networks, and the learned features are then fused to utilize the information from both modalities. There have been various studies on suitable audio input features and network architectures; however, to the best of our knowledge, no published study has investigated which visual features are best suited for this specific task. In this work, we perform an empirical study of the most commonly used visual features for DNN-based AVSE, examine the pre-processing requirements of each of these features, and investigate their influence on performance. Our study shows that, despite the overall better performance of embedding-based features, their computationally intensive pre-processing makes them difficult to use in low-resource systems. For such systems, optical flow or raw pixel-based features might be better suited.
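
The two-stream idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `frame_difference` and `fuse_by_concatenation`, the toy frames, and the audio vectors are all hypothetical. Frame differencing is used as a crude motion cue in place of optical flow (it is not optical flow), and plain concatenation is assumed as the fusion step. In a real AVSE model, each stream would first pass through its own sub-network, and the learned features would be fused instead of the raw ones shown here.

```python
from typing import List

Vector = List[float]
Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def frame_difference(prev: Frame, curr: Frame) -> Vector:
    """Flattened absolute pixel difference between two grayscale frames.

    Assumption: a crude motion cue standing in for optical flow.
    """
    return [
        abs(c - p)
        for prev_row, curr_row in zip(prev, curr)
        for p, c in zip(prev_row, curr_row)
    ]


def fuse_by_concatenation(audio: List[Vector], visual: List[Vector]) -> List[Vector]:
    """Concatenate audio and visual feature vectors frame by frame.

    Assumption: both streams are already aligned to the same frame rate.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have the same number of frames")
    return [a + v for a, v in zip(audio, visual)]


# Toy data (hypothetical): three 2x2 video frames and two audio feature vectors.
frames: List[Frame] = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
visual_features = [
    frame_difference(frames[i], frames[i + 1]) for i in range(len(frames) - 1)
]
audio_features = [[0.5, 0.25, 0.125], [0.1, 0.2, 0.3]]

fused = fuse_by_concatenation(audio_features, visual_features)
print(len(fused), len(fused[0]))
```

Each fused vector here holds three audio values followed by four visual values. Cheap visual inputs of this kind need little pre-processing, which is the property the abstract highlights for low-resource systems.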

research
06/30/2022

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Audio-visual speech enhancement system is regarded as one of promising s...
research
01/15/2021

MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

The purpose of speech enhancement is to extract target speech signal fro...
research
11/06/2018

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments

In this paper, we address the problem of enhancing the speech of a speak...
research
09/21/2020

Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement

In this paper, we propose a visual embedding approach to improving embed...
research
01/15/2021

AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

Audio-visual speech enhancement system is regarded to be one of promisin...
research
09/20/2016

Deep CTR Prediction in Display Advertising

Click through rate (CTR) prediction of image ads is the core task of onl...
research
10/02/2020

An Empirical Study of DNNs Robustification Inefficacy in Protecting Visual Recommenders

Visual-based recommender systems (VRSs) enhance recommendation performan...