Audio-Visual Speech Recognition is Worth 32×32×8 Voxels

09/20/2021
by Dmitriy Serdyuk, et al.

Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolutional network (e.g. VGG) as widely used in the computer vision community. Recently, image transformers [2] have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based front-end shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer front-end performs as well as (or better than) the convolutional baseline. Fine-tuning our model on the LRS3-TED training set matches previous state of the art. Thus, we experimentally show the viability of the convolution-free model for AV-ASR.
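As a rough illustration of the idea suggested by the title, the sketch below cuts a video into non-overlapping blocks of 32×32 pixels by 8 frames, each of which would become one input token for a transformer front-end. The abstract does not describe the patching scheme, so the block layout, the single-channel input, the dropping of incomplete blocks, and the name `split_into_tubelets` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

# A single-channel video indexed as video[t][y][x] (assumed layout).
Video = List[List[List[float]]]


def split_into_tubelets(
    video: Video, patch_t: int = 8, patch_h: int = 32, patch_w: int = 32
) -> List[List[float]]:
    """Cut a non-empty (T, H, W) video into non-overlapping 3D blocks.

    Each block spans patch_t frames and patch_h x patch_w pixels and is
    flattened into one list of values (one "token"). Tokens are ordered by
    time, then row, then column. Trailing frames or pixels that do not
    fill a whole block are dropped (an assumption of this sketch).
    """
    num_frames = len(video)
    height = len(video[0])
    width = len(video[0][0])

    tokens = []
    for t0 in range(0, num_frames - patch_t + 1, patch_t):
        for y0 in range(0, height - patch_h + 1, patch_h):
            for x0 in range(0, width - patch_w + 1, patch_w):
                # Flatten the block in (time, row, column) order.
                block = [
                    video[t][y][x]
                    for t in range(t0, t0 + patch_t)
                    for y in range(y0, y0 + patch_h)
                    for x in range(x0, x0 + patch_w)
                ]
                tokens.append(block)
    return tokens


if __name__ == "__main__":
    # Hypothetical input: 16 frames of 64x64 pixels, each pixel set to its frame index.
    video = [[[float(t) for _ in range(64)] for _ in range(64)] for t in range(16)]
    tokens = split_into_tubelets(video)
    print(len(tokens), len(tokens[0]))
```

In a transformer front-end, each flattened block would typically be passed through a learned linear projection and combined with a positional embedding before the attention layers; those steps are omitted here.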

research
01/25/2022

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) extends the speech re...
research
04/29/2020

Multiresolution and Multimodal Speech Recognition with Transformers

This paper presents an audio visual automatic speech recognition (AV-ASR...
research
10/07/2019

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Instructional videos get high-traffic on video sharing platforms, and pr...
research
12/01/2017

Visual Features for Context-Aware Speech Recognition

Automatic transcriptions of consumer-generated multi-media content such ...
research
07/22/2022

Fact sheet: Automatic Self-Reported Personality Recognition Track

We propose an informed baseline to help disentangle the various contextu...
research
06/15/2022

AVATAR: Unconstrained Audiovisual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) is an extension of AS...
research
09/18/2021

Violence Detection in Videos

In the recent years, there has been a tremendous increase in the amount ...
