On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

11/15/2018
by   Daniel Michelsanti, et al.

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the training target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs as well in terms of estimated speech quality.
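
To make the distinction between a target and an objective function concrete, the sketch below builds two possible training targets from toy magnitude spectrograms and scores an estimate with a mean squared error. It is an illustration only, not the paper's implementation: the function names, the clipping value, the small constant `eps`, and the toy data are all assumptions, and the mask shown here (a clipped clean-to-noisy magnitude ratio) is just one common way to define a mask target.

```python
import numpy as np

def ideal_amplitude_mask(clean_mag, noisy_mag, eps=1e-8, max_value=10.0):
    """Mask target: ratio of clean to noisy magnitude, clipped to a finite range.

    The clipping value and eps are illustrative assumptions.
    """
    return np.clip(clean_mag / (noisy_mag + eps), 0.0, max_value)

def log_magnitude(mag, eps=1e-8):
    """Direct-estimation target: log magnitude spectrum of the clean speech."""
    return np.log(mag + eps)

def mse(estimate, target):
    """Objective function: mean squared error between estimate and target."""
    return float(np.mean((estimate - target) ** 2))

# Toy magnitude spectrograms (frequency bins x time frames), made up for illustration.
clean_mag = np.array([[1.0, 2.0], [0.5, 4.0]])
noisy_mag = np.array([[2.0, 2.0], [1.0, 8.0]])

# Two different quantities a network could be trained to estimate.
mask_target = ideal_amplitude_mask(clean_mag, noisy_mag)
log_target = log_magnitude(clean_mag)

# Applying a perfectly estimated mask to the noisy magnitude recovers the clean one.
reconstructed = mask_target * noisy_mag

# The objective compares a (here, hypothetical all-ones) mask estimate with the target.
loss_for_trivial_mask = mse(np.ones_like(mask_target), mask_target)
```

In a real system the estimates would come from a network fed with noisy audio and video of the talker; here they are replaced by fixed arrays so that the role of each quantity stays visible.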

research
11/18/2021

Towards Intelligibility-Oriented Audio-Visual Speech Enhancement

Existing deep learning (DL) based speech enhancement approaches are gene...
research
08/21/2020

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Speech enhancement and speech separation are two related tasks, whose pu...
research
09/01/2017

Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE ...
research
05/24/2023

Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech ...
research
05/29/2019

Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

When speaking in presence of background noise, humans reflexively change...
research
08/21/2020

CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

In this paper, we present a deep learning-based speech signal-processing...
research
03/30/2017

Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE ...
