DeepAI AI Chat
Log In Sign Up

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

by   Guan-Lin Chao, et al.
Carnegie Mellon University

Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract an acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We complement the acoustic features in a hybrid DNN-HMM model with information of the target speaker's identity as well as visual features from the mouth region of the target speaker. Experimentation was performed using simulated cocktail-party data generated from the GRID audio-visual corpus by overlapping two speakers's speech on a single acoustic channel. Our audio-only baseline achieved a WER of 26.3 model improved the WER to 4.4 even more pronounced effect, improving the WER to 3.6 approaches, however, did not significantly improve performance further. Our work demonstrates that speaker-targeted models can significantly improve the speech recognition in cocktail party environments.


Speaker conditioning of acoustic models using affine transformation for multi-speaker speech recognition

This study addresses the problem of single-channel Automatic Speech Reco...

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Speech-driven visual speech synthesis involves mapping features extracte...

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Estimating the positions of multiple speakers can be helpful for tasks l...

Audio-visual multi-channel speech separation, dereverberation and recognition

Despite the rapid advance of automatic speech recognition (ASR) technolo...

Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

In typical multi-talker speech recognition systems, a neural network-bas...

VoipLoc: Establishing VoIP call provenance using acoustic side-channels

We develop a novel technique to determine call provenance in anonymous V...

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments

In this paper, we address the problem of enhancing the speech of a speak...