An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

08/21/2020
by   Daniel Michelsanti, et al.

Speech enhancement and speech separation are two related tasks, whose purpose is to extract one target speech signal (enhancement) or several (separation) from a mixture of sounds generated by multiple sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. More recently, visual information from the target speakers, such as lip movements and facial expressions, has been introduced to speech enhancement and speech separation systems, because the visual aspect of speech is essentially unaffected by the acoustic environment. To fuse acoustic and visual information efficiently, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving state-of-the-art performance. The steady stream of new techniques for extracting features and fusing multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions. We also survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation.
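Among the training targets the survey covers, mask-based targets are especially common in deep-learning-based enhancement. As an illustration only, the following minimal NumPy sketch computes the ideal ratio mask (IRM) from toy time-frequency magnitudes and applies it to a mixture; the array shapes and values are hypothetical, not taken from any system described in the paper.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask, a widely used training target for
    mask-based speech enhancement.

    IRM(t, f) = sqrt(|S|^2 / (|S|^2 + |N|^2)),
    bounded in [0, 1] per time-frequency bin.
    """
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

# Toy magnitude spectrograms (time frames x frequency bins);
# in practice these would come from an STFT of real audio.
rng = np.random.default_rng(0)
speech = rng.random((4, 5))
noise = rng.random((4, 5))

mask = ideal_ratio_mask(speech, noise)

# At inference time, a network estimates the mask from the noisy
# input (and, in audio-visual systems, from visual features too),
# and the estimate is applied to the mixture magnitude.
mixture = speech + noise
enhanced = mask * mixture
```

In an audio-visual system, the mask estimator would additionally condition on visual features (e.g. lip-region embeddings), but the training target itself is unchanged.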

