Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

03/25/2023
by Pingchuan Ma, et al.

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets. However, accurate labelling of datasets is time-consuming and expensive. Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use publicly-available pre-trained ASR models to automatically transcribe unlabelled datasets such as AVSpeech and VoxCeleb2. Then, we train ASR, VSR and AV-ASR models on the augmented training set, which consists of the LRS2 and LRS3 datasets as well as the additional automatically-transcribed data. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced word error rate (WER) despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3. In particular, it achieves a WER of 0.9% on LRS3, a relative improvement of 30% over the current state-of-the-art approach, and outperforms methods that have been trained on non-publicly available datasets with 26 times more training data.
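
The two ideas the abstract relies on, pseudo-labelling unlabelled clips with a pre-trained ASR model and scoring by WER, can be illustrated with a minimal Python sketch. It is not the authors' implementation: `build_augmented_set`, the `transcribe` callable, and the clip names are hypothetical stand-ins, and WER is computed here as plain word-level edit distance divided by the number of reference words.

```python
from typing import Callable, Iterable, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


def build_augmented_set(
    labelled: Iterable[Tuple[str, str]],
    unlabelled_clips: Iterable[str],
    transcribe: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Append (clip, automatic transcript) pairs to the labelled pairs.

    `transcribe` is a hypothetical stand-in for a pre-trained ASR model;
    its output is used as a (possibly noisy) training label.
    """
    pseudo_labelled = [(clip, transcribe(clip)) for clip in unlabelled_clips]
    return list(labelled) + pseudo_labelled


if __name__ == "__main__":
    # Hypothetical clip names and a dummy transcriber, for illustration only.
    train = build_augmented_set(
        labelled=[("lrs3_clip.mp4", "hello world")],
        unlabelled_clips=["voxceleb2_clip.mp4"],
        transcribe=lambda clip: "automatic transcript",
    )
    print(len(train))
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In this sketch the automatic transcripts are treated exactly like human labels once they enter the training list, which mirrors the abstract's claim that extra data helps even when the transcriptions are noisy.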

research
08/09/2022

Thai Wav2Vec2.0 with CommonVoice V8

Recently, Automatic Speech Recognition (ASR), a system that converts aud...
research
02/26/2022

Visual Speech Recognition for Multiple Languages in the Wild

Visual speech recognition (VSR) aims to recognise the content of speech ...
research
01/04/2023

Audio-Visual Efficient Conformer for Robust Speech Recognition

End-to-end Automatic Speech Recognition (ASR) systems based on neural ne...
research
08/11/2023

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Visual Speech Recognition (VSR) differs from the common perception tasks...
research
05/27/2023

Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on the False Alarms in Automated Speech Recognition Testing

Recent studies have proposed the use of Text-To-Speech (TTS) systems to ...
research
04/18/2022

Extracting Targeted Training Data from ASR Models, and How to Mitigate It

Recent work has designed methods to demonstrate that model updates in AS...
research
02/18/2018

Visual-Only Recognition of Normal, Whispered and Silent Speech

Silent speech interfaces have been recently proposed as a way to enable ...
