Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

06/13/2019
by   Guan-Lin Chao, et al.
0

Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract an acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We complement the acoustic features in a hybrid DNN-HMM model with information of the target speaker's identity as well as visual features from the mouth region of the target speaker. Experimentation was performed using simulated cocktail-party data generated from the GRID audio-visual corpus by overlapping two speakers's speech on a single acoustic channel. Our audio-only baseline achieved a WER of 26.3 model improved the WER to 4.4 even more pronounced effect, improving the WER to 3.6 approaches, however, did not significantly improve performance further. Our work demonstrates that speaker-targeted models can significantly improve the speech recognition in cocktail party environments.

READ FULL TEXT
research
09/15/2023

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Previous Multimodal Information based Speech Processing (MISP) challenge...
research
05/15/2019

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Speech-driven visual speech synthesis involves mapping features extracte...
research
10/31/2021

Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

In typical multi-talker speech recognition systems, a neural network-bas...
research
02/23/2021

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Estimating the positions of multiple speakers can be helpful for tasks l...
research
09/04/2019

VoipLoc: Establishing VoIP call provenance using acoustic side-channels

We develop a novel technique to determine call provenance in anonymous V...
research
11/06/2018

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments

In this paper, we address the problem of enhancing the speech of a speak...
research
01/31/2023

Neural Target Speech Extraction: An Overview

Humans can listen to a target speaker even in challenging acoustic condi...

Please sign up or login with your details

Forgot password? Click here to reset