ASR is all you need: cross-modal distillation for lip reading

11/28/2019
by Triantafyllos Afouras, et al.

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines CTC with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets when training only on publicly available data.
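The combined objective can be illustrated with a minimal PyTorch-style sketch: a CTC term computed against the ASR model's transcriptions (used in place of human-annotated ground truth) plus a frame-wise cross-entropy against the ASR teacher's per-frame posteriors. The function name, the blank index of 0, and the mixing weight alpha below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' implementation): CTC on ASR-generated
# transcriptions combined with a frame-wise cross-entropy against the ASR
# teacher's posteriors. Shapes follow torch.nn.functional.ctc_loss conventions.
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_posteriors,
                                  asr_tokens, input_lengths, target_lengths,
                                  alpha=0.5):
    # student_logits:     (T, B, C) raw per-frame outputs of the lip-reading model
    # teacher_posteriors: (T, B, C) per-frame class probabilities from the ASR model
    # asr_tokens:         (B, S) token indices transcribed by the ASR model
    log_probs = F.log_softmax(student_logits, dim=-1)

    # CTC term: the ASR transcription stands in for ground truth labels.
    # Blank index 0 is an assumption made for this sketch.
    ctc = F.ctc_loss(log_probs, asr_tokens, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)

    # Frame-wise cross-entropy between teacher and student distributions
    # (equivalent to a KL term up to the teacher's entropy).
    ce = -(teacher_posteriors * log_probs).sum(dim=-1).mean()

    # alpha is an illustrative mixing weight, not a value from the paper.
    return alpha * ctc + (1.0 - alpha) * ce
```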

Related research

01/30/2023
Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation
Large-scale pre-trained language models (PLMs) with powerful language mo...

05/31/2023
ViLaS: Integrating Vision and Language into Automatic Speech Recognition
Employing additional multimodal information to improve automatic speech ...

11/26/2019
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers
Lip reading has witnessed unparalleled development in recent years thank...

06/25/2021
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition
Cued Speech (CS) is a visual communication system for the deaf or hearin...

07/04/2021
Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition
We propose a cross-modal transformer-based neural correction models that...

03/25/2022
Chain-based Discriminative Autoencoders for Speech Recognition
In our previous work, we proposed a discriminative autoencoder (DcAE) fo...

07/03/2022
Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR
Leveraging context information is an intuitive idea to improve performan...
