ASR is all you need: cross-modal distillation for lip reading

11/28/2019 ∙ by Triantafyllos Afouras, et al.

The goal of this work is to train strong models for visual speech recognition without requiring human-annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines CTC with a frame-wise cross-entropy loss. Our contributions are fourfold: (i) we show that ground truth transcriptions are not necessary to train a lip reading system; (ii) we show how arbitrary amounts of unlabelled video data can be leveraged to improve performance; (iii) we demonstrate that distillation significantly speeds up training; and (iv) we obtain state-of-the-art results on the challenging LRS2 and LRS3 datasets among models trained only on publicly available data.




1 Introduction

Visual speech recognition (VSR) has received increasing amounts of attention in recent years due to the success of deep learning models trained on corpora of aligned text and face videos [Assael16, Chung16, Chung17]. In many machine learning applications, training on very large datasets has proven to have huge benefits, and indeed [Shillingford18] recently demonstrated significant performance improvements by training on a very large-scale proprietary dataset. However, the largest publicly available datasets for training and evaluating visual speech recognition, LRS2 and LRS3 [Chung17, Afouras18d], are orders of magnitude smaller than their audio-only counterparts used for training Automatic Speech Recognition (ASR) models [panayotov2015librispeech, Baumann2018]. This indicates that there are potential gains to be made from a scalable method that could exploit vast amounts of unlabelled video data.

In this direction, we propose to train a VSR model by distilling from an ASR model with a teacher-student approach. This opens up the opportunity to train VSR models on audio-visual datasets that are an order of magnitude larger than LRS2 and LRS3, such as VoxCeleb2 [Chung18a] and AVSpeech [ephrat2018looking], but lack text annotations. More generally, the VSR model can be trained from any available video of talking heads, e.g. from YouTube. Training by distillation eliminates the need for professionally transcribed subtitles, and also removes the costly step of forced alignment between the subtitles and speech required to create VSR training data [Chung16].

Our aim is to pretrain on large unlabelled datasets in order to boost lip reading performance. In the process we also discover that human-generated captions are actually not necessary to train a good model. The approach we follow, as shown in Fig. 1, combines a distillation loss with conventional Connectionist Temporal Classification (CTC) [Graves06]. An alternative way to exploit the extra data would have been to train solely with CTC on the ASR transcriptions. However, we find that, compared to that approach, distillation significantly accelerates training.

Figure 1: Cross-modal distillation of an ASR teacher into a student VSR model. CTC loss on the ASR-generated transcripts is combined with minimizing the KL-divergence between the student and teacher posterior distributions.

1.1 Related Work

Supervised lip reading. There have been a number of recent works on lip reading using datasets such as LRS2 [Chung17] and LRS3 [Afouras18d]. Works on word-level lip reading [Chung16] have proposed CNN models and temporal fusion methods for word-level classification. [Stafylakis17] combines a deeper residual network and an LSTM classifier to achieve the state of the art on the same task. Of more relevance to this work is open-set character-level lip reading, for which recent work can be divided into two groups. The first uses CTC, where the model predicts frame-wise labels and is trained to minimize the loss resulting from all possible input-output alignments under a monotonicity constraint. LipNet [Assael16] and more recently LSVSR [Shillingford18] are based on this approach. The latter demonstrates state-of-the-art performance by training on proprietary data that is orders of magnitude larger than any public dataset. The second group consists of sequence-to-sequence models that predict the output sequence one token at a time in an autoregressive manner, attending to different parts of the input sequence on every step. Some examples are the sequence-to-sequence LSTM with attention model used by [Chung17] and the Transformer-based model used by [Afouras18b]. [petridis2018audio, Afouras19] take a hybrid approach that combines the two ideas, namely using a CTC loss with attention-based models. Both approaches can use external language models during inference to boost performance [Kannan17, Maas15].

Knowledge distillation (KD). Distilling knowledge between two neural networks is by now a popular technique. Supervision provided by the teacher is used to train the student on potentially unlabelled data, usually from a larger network into a smaller network to reduce model size. There are two popular ways of distilling information: training the student to regress the teacher's pre-softmax logits, and minimising the cross-entropy between the probability outputs [li2014learning, hinton2015distilling].

Sequence and CTC distillation. KD has also been studied in the context of sequence modeling. For example, it has been used to compress sequence-to-sequence models for neural machine translation [kim2016sequence] and ASR [kim2019knowledge]. Distillation of acoustic models trained with CTC has also been investigated for distilling a BLSTM model into a uni-directional LSTM so that it can be used online [kim2017improved], transferring a deep BLSTM model into a shallower one [ding2019compression], and the posterior fusion of multiple models to improve performance [Kurata2019GuidingCP].

Cross-modal distillation. Our approach falls into a group of works that use networks trained on one modality to transfer knowledge to another, in a teacher-student manner. There have been many interesting variations on this idea, such as using a visual recognition network (trained on RGB images) as a teacher for student networks which take depth or optical flow [Gupta2016CrossMD], or audio [aytar16soundnet] as inputs. More specific examples include using the output of a pre-trained face emotion classifier to train a student network that can recognize emotions in speech [Albanie18] or visual recognition of human pose to train a network to recognize pose from radio signals [Zhao18]. The closest work to ours is Li et al. [li2019improving], who apply cross-modal distillation from ASR for learning audio-visual speech recognition. An interesting finding is that the student surpasses the teacher's performance by exploiting the extra information available in the video modality. However, their method is focused on improving ASR by incorporating visual information, rather than learning to lip read from the video signal alone, and they train the teacher model with ground truth supervision on the same dataset as the student. Consequently, their method does not apply naturally to using unlabelled audio-visual data.

2 Datasets

A summary of audio-visual speech datasets found in the literature is given in Table 1. In particular, LSVSR and MV-LRS contain aligned ground truth transcripts and have been used to train state-of-the-art lip reading models [Shillingford18, Afouras19]. However, these datasets are not publicly available which hinders reproduction and comparison. In this paper we focus on using only publicly available datasets. LRS2 and LRS3 are public audio-visual datasets that contain transcriptions but are relatively small. Librispeech is large, transcribed, diverse regarding the number of speakers, but audio-only. On the other hand VoxCeleb2, which is similar in scale, is audio-visual but lacks transcriptions. We use our distillation method to pretrain on VoxCeleb2 and then fine-tune and evaluate the resulting model on LRS2 and LRS3.

Dataset | # Utter. | # Hours | Mod. | Tran. | Public
LSVSR [Shillingford18] | 2.9M | 3,800 | AV | yes | no
MV-LRS [Chung16] | 500k | 775 | AV | yes | no
Librispeech [panayotov2015librispeech] | 292k | 1,000 | A | yes | yes
VoxCeleb2 [Chung18a] | 1.1M | 2,300 | AV | no | yes
LRS2 [Chung17] | 118k | 224 | AV | yes | yes
LRS3 [Afouras18d] | 165k | 475 | AV | yes | yes
VoxCeleb2 (clean) | 140k | 334 | AV | no | yes

Table 1: Statistics of modern audio-visual datasets. Tran.: Indicates if the dataset is labelled, i.e. includes aligned transcriptions; Mod.: Modalities included (A=audio-only, AV=audio + video). VoxCeleb2 (clean) refers to the subset of VoxCeleb2 we obtain after filtering according to Section 2.

To enable the use of an unlabelled speech dataset for training lip reading models for English, we first filter out unsuitable videos. For example, in VoxCeleb2, the language spoken is not always English, while the audio in many samples can be noisy and therefore hard for an ASR model to comprehend. We first run the trained teacher ASR model (details in Section 3) to obtain transcriptions on all the unlabelled videos. We then use a simple proxy to select good samples: for each utterance we calculate the percentage of words with 4 characters or more in the ASR output that are valid English words, and keep only the samples for which this is 90% or more.
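The word-validity proxy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy vocabulary, and the choice to reject utterances that contain no word of 4 or more characters are all hypothetical, since the text does not specify the word list used or how such edge cases are handled.

```python
def valid_word_ratio(transcript, vocabulary, min_chars=4):
    """Fraction of words with at least `min_chars` characters found in `vocabulary`."""
    long_words = [w for w in transcript.lower().split() if len(w) >= min_chars]
    if not long_words:
        # Assumption: an utterance with no long words gives no evidence, so score it 0.
        return 0.0
    valid = sum(1 for w in long_words if w in vocabulary)
    return valid / len(long_words)


def keep_utterance(transcript, vocabulary, threshold=0.9):
    """Keep the sample when at least `threshold` of its long words are valid."""
    return valid_word_ratio(transcript, vocabulary) >= threshold


# Toy vocabulary for illustration only; a real run would need a full English word list.
TOY_VOCAB = {"this", "speech", "model", "reads", "lips", "from", "video"}

english_like = "this model reads lips from video"
noisy = "thiz modl reads lipz frum video"

print(keep_utterance(english_like, TOY_VOCAB))  # all six long words are valid
print(keep_utterance(noisy, TOY_VOCAB))         # only two of six long words are valid
```

Short words are ignored by the ratio, which makes the proxy less sensitive to function words that exist in many languages.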

As a second refinement stage, we obtain transcriptions from a separate ASR model. We use a model similar to wav2letter [liptchinsky17] trained on Librispeech. We then compare the generated transcriptions with the ones from the teacher model and only keep an utterance when the Word Error Rate between the two is below 28%. For VoxCeleb2, the above process discards a large part of the dataset, resulting in approximately 140k clean utterances (see Table 1) out of the 1M in total.
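A minimal sketch of this agreement check follows, assuming the teacher transcript is treated as the reference and the second model's output as the hypothesis (the text does not say which side is the reference). The function names are hypothetical; the word-level edit distance is the standard dynamic-programming formulation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / max(len(ref), 1)


def transcripts_agree(teacher_text, second_text, max_wer=0.28):
    """Keep an utterance only when the two ASR outputs differ by less than `max_wer`."""
    return word_error_rate(teacher_text, second_text) < max_wer


print(transcripts_agree("the cat sat on the mat", "the cat sat on a mat"))  # 1 error in 6 words
print(transcripts_agree("a b c d", "a x y d"))                              # 2 errors in 4 words
```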

3 Cross-modal distillation

3.1 Teacher acoustic model

As a teacher, we used the state-of-the-art Jasper 10x5 acoustic model [li2019jasper], a deep 1D-convolutional residual network.

3.2 Student lip reading model

For lip reading we use a student model with an architecture similar to the teacher's. More specifically, we adapt the Jasper acoustic model for lip reading as shown in Table 2. The input to this network consists of visual features extracted from a spatio-temporal residual CNN [Stafylakis17] that has been pre-trained on word-level lip reading.

# Blocks | Block | Kernel | # Output Channels | Dropout | # Sub Blocks
1 | Conv1 | 11, stride=0.5 | 256 | 0.2 | 1
1 | B1 | 11 | 256 | 0.2 | 3
1 | B2 | 13 | 384 | 0.2 | 3
1 | B3 | 17 | 512 | 0.2 | 3
1 | B4 | 21 | 640 | 0.3 | 3
1 | B5 | 25 | 768 | 0.3 | 3
1 | Conv2 | 29, dilation=2 | 896 | 0.4 | 1
1 | Conv3 | 1 | 1024 | 0.4 | 1
1 | Conv4 | 1 | # graphemes + 1 | 0 | 1
Table 2: Architecture of Jasper-lip 5x3. To modify the Jasper model for lip-reading, we replace the first strided convolutional layer with a transposed convolution (stride=0.5).
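
As a rough illustration, the rows of Table 2 can be written down as a plain Python specification. The sketch below counts convolutional layers under the assumption that every sub-block contains exactly one 1D convolution, and shows that a stride of 0.5 (a transposed convolution with stride 2) doubles the sequence length. The padding and output-padding values are assumptions chosen to make the doubling exact; the text does not state them.

```python
# Rows of Table 2: (block name, kernel size, output channels, dropout, sub-blocks).
JASPER_LIP_5X3 = [
    ("Conv1", 11, 256, 0.2, 1),   # transposed convolution, stride=0.5
    ("B1", 11, 256, 0.2, 3),
    ("B2", 13, 384, 0.2, 3),
    ("B3", 17, 512, 0.2, 3),
    ("B4", 21, 640, 0.3, 3),
    ("B5", 25, 768, 0.3, 3),
    ("Conv2", 29, 896, 0.4, 1),   # dilation=2
    ("Conv3", 1, 1024, 0.4, 1),
    ("Conv4", 1, None, 0.0, 1),   # output channels = number of graphemes + 1
]


def total_conv_layers(spec):
    """Number of convolutions, assuming one convolution per sub-block."""
    return sum(row[-1] for row in spec)


def transposed_conv_length(length_in, kernel=11, stride=2, padding=5, output_padding=1):
    """Output length of a 1D transposed convolution with dilation 1.

    The padding and output_padding defaults are assumptions, not values from the paper.
    """
    return (length_in - 1) * stride - 2 * padding + (kernel - 1) + output_padding + 1


print(total_conv_layers(JASPER_LIP_5X3))   # 1 + 5 * 3 + 3 layers
print(transposed_conv_length(75))          # 75 input frames become 150
```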

3.3 CTC loss on transcriptions

CTC provides a loss function that enables training networks on sequence-to-sequence tasks without the need for explicit alignment of training targets to input frames. The CTC output token set $C'$ consists of the output grapheme alphabet $C$ augmented with a blank symbol '-': $C' = C \cup \{-\}$. The network consumes the input sequence $x$ and outputs a probability distribution $p^{ctc}_t$ over $C'$ for each frame $t$. A CTC path $\pi \in C'^T$ is a sequence of grapheme and blank labels with the same length $T$ as the input. Paths can be mapped to possible output sequences with a many-to-one function $B$ that merges consecutive repeated labels and then removes the blank labels. The probability of an output sequence $y$ given input sequence $x$ is obtained by marginalizing over all the paths that are mapped to $y$ through $B$:

$$p(y \mid x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{T} p^{ctc}_t(\pi_t \mid x)$$

[Graves06] computes and differentiates this sum w.r.t. the posteriors efficiently, enabling one to train the network by minimizing the CTC loss over input-output sequence pairs $(x, y^*)$:

$$L_{CTC}(x, y^*) = -\log p(y^* \mid x)$$
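
To make the path marginalization concrete, the following sketch enumerates every CTC path for a tiny example and sums the probabilities of those that collapse to the target. It is a brute-force illustration only: its cost is exponential in the number of frames, whereas [Graves06] uses an efficient dynamic program. The names `collapse`, `sequence_probability` and `ctc_loss`, and the use of '-' for the blank, are choices made for this example.

```python
import itertools
import math

BLANK = "-"


def collapse(path):
    """The mapping B: merge consecutive repeated labels, then drop blanks."""
    output = []
    previous = None
    for label in path:
        if label != previous and label != BLANK:
            output.append(label)
        previous = label
    return "".join(output)


def sequence_probability(posteriors, target):
    """p(y|x) by brute force: sum over all paths whose collapse equals `target`.

    `posteriors` is a list with one dict per frame, mapping each token to its probability.
    """
    labels = list(posteriors[0].keys())
    total = 0.0
    for path in itertools.product(labels, repeat=len(posteriors)):
        if collapse(path) == target:
            total += math.prod(frame[label] for frame, label in zip(posteriors, path))
    return total


def ctc_loss(posteriors, target):
    """Negative log-likelihood of the target under the CTC model."""
    return -math.log(sequence_probability(posteriors, target))


# Two frames, tokens {a, blank}, uniform posteriors.
uniform = [{"a": 0.5, BLANK: 0.5}, {"a": 0.5, BLANK: 0.5}]
print(collapse("aa-ab-"))                   # "aab"
print(sequence_probability(uniform, "a"))   # paths aa, a-, -a contribute 0.25 each
```

With two frames the target "aa" has probability zero, because emitting the same grapheme twice requires a blank between the repeats and hence at least three frames.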

3.4 Distillation loss

To distill the acoustic model into the target lip-reading model, we minimize the KL-divergence between the teacher and student CTC posterior distributions or, equivalently, the frame-level cross-entropy loss:

$$L_{KD}(x) = -\sum_{t} \sum_{c \in C'} p^{a}_t(c) \log p^{v}_t(c)$$

where $p^{a}_t$ and $p^{v}_t$ denote the CTC posteriors for frame $t$ obtained from the teacher and student model respectively, and $C'$ is the CTC token set (graphemes plus blank). This type of distillation has been used by other authors when distilling acoustic CTC models within the same modality (audio) and is referred to as frame-wise KD [takashima2019investigation, sak2015acoustic, kim2017improved].
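
The equivalence between minimizing the KL-divergence and minimizing the frame-level cross-entropy follows from the identity CE(teacher, student) = KL(teacher || student) + H(teacher), where the teacher entropy does not depend on the student. The sketch below checks this numerically on made-up posteriors. It assumes the teacher and student emit the same number of frames and that all student probabilities are strictly positive; the function names are hypothetical.

```python
import math


def frame_cross_entropy(teacher, student):
    """Cross-entropy between teacher and student distributions for one frame."""
    return -sum(p * math.log(student[c]) for c, p in teacher.items() if p > 0)


def frame_kl(teacher, student):
    """KL(teacher || student) for one frame."""
    return sum(p * math.log(p / student[c]) for c, p in teacher.items() if p > 0)


def entropy(dist):
    """Shannon entropy (in nats) of one frame's distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def kd_loss(teacher_posteriors, student_posteriors):
    """Frame-wise KD loss: cross-entropy summed over frames."""
    return sum(
        frame_cross_entropy(t, s)
        for t, s in zip(teacher_posteriors, student_posteriors)
    )


# Made-up posteriors over {a, blank} for two frames.
teacher = [{"a": 0.9, "-": 0.1}, {"a": 0.2, "-": 0.8}]
student = [{"a": 0.6, "-": 0.4}, {"a": 0.5, "-": 0.5}]

for t, s in zip(teacher, student):
    # Cross-entropy and KL differ only by the teacher entropy, a constant for the student.
    print(frame_cross_entropy(t, s) - frame_kl(t, s), entropy(t))
```

The loss is smallest when the student reproduces the teacher posteriors exactly, in which case it equals the summed teacher entropy.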

3.5 Combined loss

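As shown in Fig. 1, given the transcription $y^*$ of an utterance $x$ and the corresponding teacher posteriors, we combine the CTC and KD loss terms into a common objective:

$$L(x, y^*) = \lambda_{CTC} L_{CTC}(x, y^*) + \lambda_{KD} L_{KD}(x)$$

where $\lambda_{CTC}$ and $\lambda_{KD}$ are hyperparameters that balance the two terms.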

4 Experimental Setup

We train on the VoxCeleb2, LRS2 and LRS3 datasets and evaluate on LRS2 and LRS3. In this context, we investigate the following training scenarios:

Full supervision. We use annotated datasets only and train the model with CTC loss on the ground truth transcriptions, similarly to [Assael16, Afouras18b]. This is the baseline method.

No supervision. We do not use any ground truth transcriptions and rely solely on the transcriptions and posteriors of the ASR teacher model for the training signal.

Unsupervised pre-training and fine-tuning. We first pre-train the model using distillation on data without ground truth transcriptions. We then fine-tune the model on a transcribed target dataset (either LRS2 or LRS3) with full supervision. We perform two sets of experiments in this setting: i) we use the ground truth annotations of all the samples in the dataset that we are fine-tuning on, or ii) we only use the ground truth of the "main" and "trainval" subsets of LRS2 and LRS3 respectively, which contain a small fraction of the total samples.

4.1 Implementation details

Our implementation is based on the Nvidia OpenSeq2Seq framework [openseq2seq]. As a teacher model, we use the 10x5 Jasper model pretrained on Librispeech. For extracting the visual features from the input video we use the publicly available visual frontend from [Afouras18b], which is trained on word-level lip reading. We train the student model with the NovoGrad optimizer and the settings of [li2019jasper] on 4 GPUs with 11GB memory and a batch size of 64 on each. Both loss weights, $\lambda_{CTC}$ and $\lambda_{KD}$, are set to constants. Decoding is performed with an 8192-width beam search that uses a 6-gram language model trained on the Librispeech corpus text.

5 Experiments

Method | Trained on Vox. | Trained on LRS2 | Trained on LRS3 | Evaluated on LRS2 | Evaluated on LRS3
LSVSR [Shillingford18] | | | | - | 55.1
TM-seq2seq [Afouras19] | | GT | GT | 48.3 | 58.9
Hyb. CTC/Att. [petridis2018audio] | | GT | | 63.5 | -
CTC | | GT | | 58.5 | -
CTC + KD | | ASR | | 58.2 | -
CTC + KD | | ASR/GT | | 57.9 | -
CTC | | | GT | - | 68.8
CTC + KD | | | ASR | - | 65.6
CTC + KD | | | ASR/GT | - | 65.1
CTC + KD | ASR | ASR | ASR | 55.6 | 62.8
CTC + KD | ASR | GT | ASR | 53.2 | -
CTC + KD | ASR | ASR | GT | - | 60.9

Table 3: Word Error Rate % (WER, lower is better) evaluation. CTC: Model trained with CTC loss. CTC + KD: Combined loss. GT denotes using all the ground truth transcriptions of the dataset, ASR the transcriptions obtained from the teacher ASR model, and ASR/GT first pre-training with the ASR transcriptions and then fine-tuning with a small fraction of the ground truth data. Vox.: VoxCeleb2 (clean). LSVSR and TM-seq2seq are trained on large non-public labelled datasets: LSVSR for [Shillingford18] and MV-LRS for [Afouras19] (see Table 1).

Figure 2: Progression of the greedy WER (validation) during training. Our method accelerates training significantly compared to training with CTC alone.

We summarize our results in Table 3. The baseline method (CTC, GT) obtains 58.5% WER on LRS2 and 68.8% on LRS3 when trained and evaluated on each dataset separately. In the same setting, and without any ground truth transcriptions, our method achieves similar performance on LRS2 (58.2%) and even better on LRS3 (65.6%). This result demonstrates that human-annotated videos are not necessary in order to effectively train lip reading models. Fine-tuning with limited ground truth transcriptions, as described in Section 4, reduces this to 57.9% for LRS2 and 65.1% for LRS3. For training on LRS2 alone, these results outperform the previous state of the art, which was 63.5% by [petridis2018audio], and set a strong benchmark for a method trained solely on LRS3.

Using our method to train on all the available data, i.e. VoxCeleb2, LRS2 and LRS3 without any ground truth transcriptions, we further reduce the WER to 55.6% and 62.8% for LRS2 and LRS3 respectively. If we moreover fine-tune with a small amount of ground truth transcriptions, the WER drops further on both LRS2 and LRS3. Finally, training on each dataset with full supervision after unsupervised pre-training on the other two yields the best results, 53.2% for LRS2 and 60.9% for LRS3. Comparing these numbers to the results we obtained when training on each dataset individually, one concludes that using extra unlabelled audio-visual speech data is indeed an effective way to boost performance.

Distillation significantly accelerates training, even when compared to using ground truth transcriptions. In Fig. 2 we indicatively compare the learning curves of the baseline model, trained with CTC loss on ground truth transcriptions, and our proposed method, trained on transcriptions and posteriors from the teacher model. Our intuition is that the acceleration is due to the distillation providing explicit alignment information to the model, contrary to CTC which only provides an implicit signal.

6 Discussion and future work

In this paper we demonstrated an effective strategy to train strong models for visual speech recognition by distilling knowledge from a pre-trained ASR model. This training method does not require manually annotated data and is therefore suitable for pre-training on unlabelled datasets. The resulting model can optionally be fine-tuned on a small amount of annotations, and achieves performance that exceeds all existing lip reading systems aside from those trained using proprietary data.

There are many languages for which the annotated data for visual speech recognition is very limited. Since our method is applicable to any video with a talking head, given access to a pretrained ASR model and unlabelled data for a new language, we could naturally extend to lip reading that language.

We note that several authors [takashima2019investigation, Kurata2019GuidingCP, ding2019compression, sak2015acoustic] have reported difficulties distilling acoustic models trained with CTC, stemming from the misalignment between the teacher and student spike timings. Of the solutions proposed in the literature, we only experimented with sequence-level KD [takashima2019investigation] but did not observe any improvements. Investigating the extent of this problem in the cross-modal distillation domain is left to future work.

The method we have proposed can be scaled to arbitrarily large amounts of data. Given time and resource constraints we only utilized VoxCeleb2 and trained a relatively small network (5x3 instead of the 10x5 Jasper). In future work we plan to scale up in terms of both dataset and model size to develop models that can match and surpass the ones trained on very large-scale annotated datasets.