AVATAR: Unconstrained Audiovisual Speech Recognition

06/15/2022
by Valentin Gabeur, et al.

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that focus only on lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background, etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible. To solve this task, we propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR), which is trained end-to-end from spectrograms and full-frame RGB. To prevent the audio stream from dominating training, we propose different word-masking strategies, thereby encouraging our model to pay attention to the visual stream. We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise, and show that our model outperforms all prior work by a large margin. Finally, we also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
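The abstract does not spell out the word-masking strategies, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a spectrogram stored as a list of frames and a hypothetical list of word-aligned frame spans, and it zeroes the audio frames of randomly selected words so that a model would have to rely on the visual stream for them. The function name `mask_words`, the `mask_prob` parameter, and the zero-fill choice are all illustrative assumptions.

```python
import random


def mask_words(spectrogram, word_spans, mask_prob=0.3, seed=None):
    """Zero the audio frames of randomly chosen words (illustrative sketch).

    spectrogram: list of frames, each a list of floats (time x frequency).
    word_spans:  list of (start_frame, end_frame) pairs, end exclusive.
                 These alignments are assumed inputs, not part of the source.
    Returns (masked_copy, indices_of_masked_words).
    """
    rng = random.Random(seed)
    # Copy every frame so the caller's spectrogram is left untouched.
    masked = [list(frame) for frame in spectrogram]
    masked_words = []
    for index, (start, end) in enumerate(word_spans):
        # Each word is masked independently with probability mask_prob.
        if rng.random() < mask_prob:
            masked_words.append(index)
            # Clamp the span to the valid frame range before zeroing.
            for t in range(max(start, 0), min(end, len(masked))):
                masked[t] = [0.0] * len(masked[t])
    return masked, masked_words


if __name__ == "__main__":
    spec = [[1.0, 2.0] for _ in range(6)]
    spans = [(0, 2), (3, 5)]
    masked, chosen = mask_words(spec, spans, mask_prob=1.0, seed=0)
    print(chosen)
    print(masked)
```

Setting `mask_prob` to 1.0 masks every aligned word and 0.0 masks none, which makes the behaviour easy to check; in a training setting one would presumably pick a value in between.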

research
01/05/2022

Robust Self-Supervised Audio-Visual Speech Recognition

Audio-based automatic speech recognition (ASR) degrades significantly in...
research
05/12/2020

Discriminative Multi-modality Speech Recognition

Vision is often used as a complementary modality for audio speech recogn...
research
06/14/2021

Learning Audio-Visual Dereverberation

Reverberation from audio reflecting off surfaces and objects in the envi...
research
09/20/2021

Audio-Visual Speech Recognition is Worth 32×32×8 Voxels

Audio-visual automatic speech recognition (AV-ASR) introduces the video ...
research
12/14/2020

AV Taris: Online Audio-Visual Speech Recognition

In recent years, Automatic Speech Recognition (ASR) technology has appro...
research
04/01/2022

End-to-end multi-talker audio-visual ASR using an active speaker attention module

This paper presents a new approach for end-to-end audio-visual multi-tal...
research
05/11/2022

End-to-End Multi-Person Audio/Visual Automatic Speech Recognition

Traditionally, audio-visual automatic speech recognition has been studie...