Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

05/10/2022
by Otavio Braga, et al.

Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that both problems can be solved simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of some accuracy on active speaker detection. This work closes that gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training, we reduce the ASD classification error by approximately 25%, while simultaneously improving the ASR performance when compared to the multi-person baseline trained exclusively for ASR.
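To make the idea of attending over competing face tracks and training with a multi-task loss more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attend_over_tracks`, `asd_cross_entropy`, `multitask_loss`), the scaled dot-product scoring, the use of the attention weights as ASD probabilities, and the `asd_weight` mixing coefficient are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_over_tracks(audio, video_tracks):
    """Soft-select one visual feature per frame from N candidate face tracks.

    audio:        (T, D) audio features used as queries (assumed shape).
    video_tracks: (N, T, D) visual features, one row per candidate face.
    Returns the attended visual features (T, D) and the weights (T, N).
    """
    d = audio.shape[-1]
    # Scaled dot-product score between each audio frame and each face track.
    scores = np.einsum("td,ntd->tn", audio, video_tracks) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Weighted sum over the N tracks at every time step.
    attended = np.einsum("tn,ntd->td", weights, video_tracks)
    return attended, weights


def asd_cross_entropy(weights, active_track):
    """Per-frame cross-entropy, treating attention weights as ASD probabilities.

    active_track: (T,) integer index of the speaking face at each frame.
    """
    frames = np.arange(weights.shape[0])
    return float(-np.log(weights[frames, active_track] + 1e-12).mean())


def multitask_loss(asr_loss, asd_loss, asd_weight=0.5):
    """Weighted sum of the two task losses; asd_weight is a hypothetical knob."""
    return asr_loss + asd_weight * asd_loss


# Toy usage: two face tracks, three frames, four feature dimensions.
rng = np.random.default_rng(0)
audio = rng.normal(size=(3, 4))
tracks = np.stack([5.0 * audio, -5.0 * audio])  # track 0 matches the audio
attended, weights = attend_over_tracks(audio, tracks)
asd_loss = asd_cross_entropy(weights, np.zeros(3, dtype=int))
total = multitask_loss(asr_loss=1.0, asd_loss=asd_loss)
```

In this sketch the ASR loss is passed in as a plain number, since the recognizer itself is out of scope here. The point is only that a single set of attention weights can both feed the recognizer (through `attended`) and be supervised directly with speaker labels.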

research
05/11/2022

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection

Audio-visual automatic speech recognition is a promising approach to rob...
research
04/01/2022

End-to-end multi-talker audio-visual ASR using an active speaker attention module

This paper presents a new approach for end-to-end audio-visual multi-tal...
research
05/11/2022

End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

Traditionally, audio-visual automatic speech recognition has been studie...
research
07/02/2021

Multi-user VoiceFilter-Lite via Attentive Speaker Embedding

In this paper, we propose a solution to allow speaker conditioned speech...
research
11/23/2022

Whose Emotion Matters? Speaker Detection without Prior Knowledge

The task of emotion recognition in conversations (ERC) benefits from the...
research
02/23/2021

Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Estimating the positions of multiple speakers can be helpful for tasks l...
research
06/01/2023

Encoder-decoder multimodal speaker change detection

The task of speaker change detection (SCD), which detects points where s...
