Visual Speech Recognition for Multiple Languages in the Wild

02/26/2022
by Pingchuan Ma, et al.

Visual speech recognition (VSR) aims to recognise the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to larger training sets rather than to model design. In this work, we demonstrate that designing better models is as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. Furthermore, we show that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.
