Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments

11/06/2018
by   Giovanni Morrone, et al.
0

In this paper, we address the problem of enhancing the speech of a speaker of interest in a cocktail party scenario when visual information of the speaker of interest is available. Contrary to most previous studies, we do not learn visual features on the typically small audio-visual datasets, but use an already available face landmark detector (trained on a separate image dataset). The landmarks are used by LSTM-based models to generate time-frequency masks which are applied to the acoustic mixed-speech spectrogram. Results show that: (i) landmark motion features are very effective features for this task, (ii) similarly to previous work, reconstruction of the target speaker's spectrogram mediated by masking is significantly more accurate than direct spectrogram reconstruction, and (iii) the best masks depend on both motion landmark features and the input mixed-speech spectrogram. To the best of our knowledge, our proposed models are the first models trained and evaluated on the limited size GRID and TCD-TIMIT datasets, that achieve speaker-independent speech enhancement in a multi-talker setting.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/16/2019

Joined Audio-Visual Speech Enhancement and Recognition in the Cocktail Party: The Tug Of War Between Enhancement and Recognition Losses

In this paper we propose an end-to-end LSTM-based model that performs si...
research
11/09/2020

An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments

Audio-visual speech enhancement (AVSE) methods use both audio and visual...
research
07/11/2019

My lips are concealed: Audio-visual speech enhancement through obstructions

Our objective is an audio-visual model for separating a single speaker f...
research
06/13/2019

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

Speech recognition in cocktail-party environments remains a significant ...
research
11/23/2017

Visual Speech Enhancement using Noise-Invariant Training

Visual speech enhancement is used on videos shot in noisy environments t...
research
03/14/2023

TEA-PSE 3.0: Tencent-Ethereal-Audio-Lab Personalized Speech Enhancement System For ICASSP 2023 DNS Challenge

This paper introduces the Unbeatable Team's submission to the ICASSP 202...
research
02/04/2023

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

Lipreading refers to understanding and further translating the speech of...

Please sign up or login with your details

Forgot password? Click here to reset