Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

01/25/2022
by Dmitriy Serdyuk, et al.

Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality. In particular, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for the image classification task. In this work, we propose to replace the 3D convolution with a video transformer as the video feature extractor. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. We then evaluate the performance on a labeled subset of YouTube as well as on the public corpus LRS3-TED. Our best video-only model achieves the performance of 34.9 relative improvements over the convolutional baseline. We achieve state-of-the-art audio-visual recognition performance on LRS3-TED after fine-tuning our model (1.6
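
The abstract does not say how the video transformer turns a clip into input tokens. Image transformers (arXiv:2010.11929) split an image into fixed-size patches, and a natural video analogue is to split the clip into non-overlapping spatio-temporal patches. The sketch below illustrates that idea only; the function name `extract_tubelets`, the patch sizes, and the toy clip are assumptions for illustration, not the authors' configuration. The learned linear projection and the transformer layers that would consume the tokens are omitted.

```python
from itertools import product


def extract_tubelets(video, pt, ph, pw):
    """Split a video (nested lists indexed [t][y][x]) into non-overlapping
    pt x ph x pw spatio-temporal patches, each flattened into one token.

    Hypothetical helper for illustration; not taken from the paper.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    tokens = []
    # Walk over the top-left-front corner of every patch.
    for t0, y0, x0 in product(range(0, T, pt), range(0, H, ph), range(0, W, pw)):
        tokens.append([
            video[t][y][x]
            for t in range(t0, t0 + pt)
            for y in range(y0, y0 + ph)
            for x in range(x0, x0 + pw)
        ])
    return tokens


# Toy grayscale clip: 4 frames of 4x4 pixels, pixel value = its linear index.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = extract_tubelets(video, pt=2, ph=2, pw=2)
print(len(tokens), len(tokens[0]))  # number of tokens, values per token
```

In a transformer front-end of this kind, each flattened patch would typically be mapped to an embedding by a learned linear layer and combined with a positional encoding before self-attention is applied.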


research
09/20/2021

Audio-Visual Speech Recognition is Worth 32×32×8 Voxels

Audio-visual automatic speech recognition (AV-ASR) introduces the video ...
research
05/12/2020

Discriminative Multi-modality Speech Recognition

Vision is often used as a complementary modality for audio speech recogn...
research
11/08/2019

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

This work presents a large-scale audio-visual speech recognition system ...
research
12/01/2017

Visual Features for Context-Aware Speech Recognition

Automatic transcriptions of consumer-generated multi-media content such ...
research
06/01/2023

Encoder-decoder multimodal speaker change detection

The task of speaker change detection (SCD), which detects points where s...
research
06/18/2017

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

Audio-visual recognition (AVR) has been considered as a solution for spe...
research
10/05/2020

Fine-Grained Grounding for Multimodal Speech Recognition

Multimodal automatic speech recognition systems integrate information fr...
