Visual speech recognition (VSR) has received increasing attention in recent years due to the success of deep learning models trained on corpora of aligned text and face videos [Assael16, Chung16, Chung17]. In many machine learning applications, training on very large datasets has proven highly beneficial, and indeed [Shillingford18] recently demonstrated significant performance improvements by training on a very large-scale proprietary dataset. However, the largest publicly available datasets for training and evaluating visual speech recognition, LRS2 and LRS3 [Chung17, Afouras18d], are orders of magnitude smaller than their audio-only counterparts used for training Automatic Speech Recognition (ASR) models [panayotov2015librispeech, Baumann2018]. This suggests that there are potential gains to be made from a scalable method that could exploit vast amounts of unlabelled video data.
In this direction, we propose to train a VSR model by distilling from an ASR model with a teacher-student approach. This opens up the opportunity to train VSR models on audio-visual datasets that are an order of magnitude larger than LRS2 and LRS3, such as VoxCeleb2 [Chung18a] and AVSpeech [ephrat2018looking], but lack text annotations. More generally, the VSR model can be trained from any available video of talking heads, e.g. from YouTube. Training by distillation eliminates the need for professionally transcribed subtitles, and also removes the costly step of forced alignment between the subtitles and speech required to create VSR training data [Chung16].
Our aim is to pretrain on large unlabelled datasets in order to boost lip reading performance. In the process we also discover that human-generated captions are not necessary to train a good model. The approach we follow, as shown in Fig. 1, combines a distillation loss with conventional Connectionist Temporal Classification (CTC) [Graves06]. An alternative way to exploit the extra data would have been to train solely with CTC on the ASR transcriptions. However, we find that, compared to that approach, distillation significantly accelerates training.
1.1 Related Work
Supervised lip reading. There have been a number of recent works on lip reading using datasets such as LRS2 [Chung17] and LRS3 [Afouras18d]. Works on word-level lip reading [Chung16] have proposed CNN models and temporal fusion methods for word-level classification. [Stafylakis17] combines a deeper residual network and an LSTM classifier to achieve the state of the art on the same task. Of more relevance to this work is open-set character-level lip reading, for which recent work can be divided into two groups. The first uses CTC, where the model predicts frame-wise labels and is trained to minimize the loss resulting from all possible input-output alignments under a monotonicity constraint. LipNet [Assael16] and more recently LSVSR [Shillingford18] are based on this approach. The latter demonstrates state-of-the-art performance by training on proprietary data that is orders of magnitude larger than any public dataset. The second group consists of sequence-to-sequence models that predict the output sequence one token at a time in an autoregressive manner, attending to different parts of the input sequence at every step. Some examples are the sequence-to-sequence LSTM with attention model used by [Chung17] and the Transformer-based model used by [Afouras18b]. [petridis2018audio, Afouras19] take a hybrid approach that combines the two ideas, namely using a CTC loss with attention-based models. Both approaches can use external language models during inference to boost performance [Kannan17, Maas15].
Knowledge distillation (KD). Distilling knowledge between two neural networks was popularised by [hinton2015distilling]. Supervision provided by the teacher is used to train the student on potentially unlabelled data; typically, knowledge is transferred from a larger network into a smaller one to reduce model size. There are two popular ways of distilling information: training the student to regress the teacher’s pre-softmax logits [ba2014deep], and minimising the cross-entropy between the probability outputs [li2014learning, hinton2015distilling].
Sequence and CTC distillation. KD has also been studied in the context of sequence modeling. For example, it has been used to compress sequence-to-sequence models for neural machine translation [kim2016sequence] and ASR [kim2019knowledge]. Distillation of acoustic models trained with CTC has also been investigated for distilling a BLSTM model into a uni-directional LSTM so that it can be used online [kim2017improved], transferring a deep BLSTM model into a shallower one [ding2019compression], and the posterior fusion of multiple models to improve performance [Kurata2019GuidingCP].
Cross-modal distillation. Our approach falls into a group of works that use networks trained on one modality to transfer knowledge to another, in a teacher-student manner. There have been many interesting variations on this idea, such as using a visual recognition network (trained on RGB images) as a teacher for student networks which take depth or optical flow [Gupta2016CrossMD], or audio [aytar16soundnet] as inputs. More specific examples include using the output of a pre-trained face emotion classifier to train a student network that can recognize emotions in speech [Albanie18] or visual recognition of human pose to train a network to recognize pose from radio signals [Zhao18]. The closest work to ours is Li et al. [li2019improving], who apply cross-modal distillation from ASR for learning audio-visual speech recognition. An interesting finding is that the student surpasses the teacher’s performance by exploiting the extra information available in the video modality. However, their method is focused on improving ASR by incorporating visual information, rather than learning to lip read from the video signal alone, and they train the teacher model with ground truth supervision on the same dataset as the student. Consequently, their method does not apply naturally to unlabelled audio-visual data.
A summary of audio-visual speech datasets found in the literature is given in Table 1. In particular, LSVSR and MV-LRS contain aligned ground truth transcripts and have been used to train state-of-the-art lip reading models [Shillingford18, Afouras19]. However, these datasets are not publicly available, which hinders reproduction and comparison. In this paper we focus on using only publicly available datasets. LRS2 and LRS3 are public audio-visual datasets that contain transcriptions but are relatively small. Librispeech is large, transcribed, and diverse in terms of speakers, but audio-only. VoxCeleb2, on the other hand, is similar in scale and audio-visual, but lacks transcriptions. We use our distillation method to pretrain on VoxCeleb2 and then fine-tune and evaluate the resulting model on LRS2 and LRS3.
|Dataset||# Utter.||# Hours||Mod.||Tran.||Public|
To enable the use of an unlabelled speech dataset for training lip reading models for English, we first filter out unsuitable videos. For example, in VoxCeleb2, the language spoken is not always English, while the audio in many samples can be noisy and therefore hard for an ASR model to comprehend. We first run the trained teacher ASR model (details in Section 3) to obtain transcriptions for all the unlabelled videos. We then use a simple proxy to select good samples: for each utterance we calculate the percentage of words with 4 characters or more in the ASR output that are valid English words, and keep only the samples for which this is 90% or more.
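The word-validity proxy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function names, the tokenisation rule, the toy vocabulary `TOY_VOCAB`, and the choice to reject utterances that contain no words of 4 or more characters are all hypothetical.

```python
import re


def valid_word_ratio(transcript, vocabulary, min_len=4):
    """Fraction of words with at least `min_len` characters that are in `vocabulary`."""
    # Assumption: lower-case the text and treat runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", transcript.lower())
    long_words = [w for w in words if len(w) >= min_len]
    if not long_words:
        # Assumption: with no long words there is nothing to judge, so score 0.
        return 0.0
    hits = sum(1 for w in long_words if w in vocabulary)
    return hits / len(long_words)


def keep_utterance(transcript, vocabulary, threshold=0.9):
    """Keep the utterance when at least 90% of its long words are valid."""
    return valid_word_ratio(transcript, vocabulary) >= threshold


# Hypothetical stand-in for a real English word list.
TOY_VOCAB = {"this", "that", "speech", "recognition", "works", "well"}

if __name__ == "__main__":
    print(keep_utterance("this speech works well", TOY_VOCAB))
    print(keep_utterance("this xqzt works", TOY_VOCAB))
```

In practice the vocabulary would be a full English dictionary; the toy set here only keeps the example self-contained.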
As a second refinement stage, we obtain transcriptions from a separate ASR model. We use a model similar to wav2letter [liptchinsky17] trained on Librispeech. We then compare the generated transcriptions with the ones from the teacher model and keep an utterance only when the Word Error Rate between the two transcriptions is below 28%. For VoxCeleb2, the above process discards a large part of the dataset, resulting in approximately clean utterances out of the 1M in total.
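The agreement check between the two ASR systems can be illustrated with a word-level edit distance. The sketch below is an assumption-laden example: the function names are hypothetical, and treating the teacher transcription as the reference when normalising the error rate is our choice, since the text does not say which side is the reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Assumption: an empty reference agrees only with an empty hypothesis.
        return 0.0 if not hyp else float("inf")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)


def transcripts_agree(teacher_text, second_text, max_wer=0.28):
    """Keep the utterance only when the two transcriptions differ by less than 28% WER."""
    return word_error_rate(teacher_text, second_text) < max_wer


if __name__ == "__main__":
    print(transcripts_agree("the cat sat on the mat", "the cat sat on a mat"))
    print(transcripts_agree("hello world", "goodbye moon"))
```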
3 Cross-modal distillation
3.1 Teacher acoustic model
As a teacher, we used the state-of-the-art Jasper 10x5 acoustic model [li2019jasper], a deep 1D-convolutional residual network.
3.2 Student lip reading model
For lip reading we use a student model with an architecture similar to the teacher’s. More specifically, we adapt the Jasper acoustic model for lip reading as shown in Table 2. The input to this network consists of visual features extracted from a spatio-temporal residual CNN [Stafylakis17] that has been pre-trained on word-level lip reading.
|# Blocks||Block||Kernel||# Output Channels||Dropout||# Sub Blocks|
|1||Conv4||1||# graphemes + 1||0||1|
3.3 CTC loss on transcriptions
CTC provides a loss function that enables training networks on sequence-to-sequence tasks without the need for explicit alignment of training targets to input frames. The CTC output token set $C'$ consists of the output grapheme alphabet $C$ augmented with a blank symbol ‘$-$’: $C' = C \cup \{-\}$. The network consumes the input sequence $x$ and outputs a probability distribution $p_t$ over $C'$ for each frame $t = 1, \dots, T$. A CTC path $\pi \in C'^{T}$ is a sequence of grapheme and blank labels with the same length as the input. Paths can be mapped to possible output sequences with a many-to-one function $B$ that collapses repeated labels and then removes the blank labels. The probability of an output sequence $y$ given input sequence $x$ is obtained by marginalizing over all the paths that are mapped to $y$ through $B$: $p(y|x) = \sum_{\pi \in B^{-1}(y)} p(\pi|x)$, where $p(\pi|x) = \prod_{t=1}^{T} p_t(\pi_t)$. [Graves06] computes and differentiates this sum w.r.t. the posteriors efficiently, enabling one to train the network by minimizing the CTC loss over input-output sequence pairs $(x, y^*)$:
$$\mathcal{L}_{CTC}(x, y^*) = -\log p(y^*|x)$$
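To make the path-collapsing map and the marginalisation concrete, the following toy sketch enumerates every CTC path by brute force. It is for illustration only: it is exponential in the number of frames, whereas [Graves06] computes the same sum efficiently with dynamic programming. The function names and the choice of `"-"` as the blank symbol are assumptions made for this example.

```python
import itertools
import math

BLANK = "-"  # assumption: blank symbol used in this toy example


def collapse(path):
    """Many-to-one map: merge repeated labels, then drop blanks."""
    output = []
    previous = None
    for label in path:
        if label != previous and label != BLANK:
            output.append(label)
        previous = label
    return "".join(output)


def sequence_probability(posteriors, tokens, target):
    """Sum the probabilities of all paths that collapse to `target`.

    posteriors[t][k] is the probability of tokens[k] at frame t.
    """
    total = 0.0
    for path in itertools.product(range(len(tokens)), repeat=len(posteriors)):
        labels = [tokens[k] for k in path]
        if collapse(labels) == target:
            total += math.prod(posteriors[t][k] for t, k in enumerate(path))
    return total


def ctc_loss(posteriors, tokens, target):
    """Negative log-probability of the target sequence."""
    return -math.log(sequence_probability(posteriors, tokens, target))


if __name__ == "__main__":
    tokens = [BLANK, "a"]
    posteriors = [[0.5, 0.5], [0.5, 0.5]]  # two frames, uniform over {blank, a}
    # Paths "-a", "a-" and "aa" all collapse to "a".
    print(sequence_probability(posteriors, tokens, "a"))
    print(ctc_loss(posteriors, tokens, "a"))
```

With two uniform frames over a blank and a single grapheme, three of the four equally likely paths collapse to "a", which is why the marginal probability is larger than that of any single path.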
3.4 Distillation loss
To distill the acoustic model into the target lip-reading model, we minimize the KL-divergence between the teacher and student CTC posterior distributions or, equivalently (the two differ only by the teacher entropy, which does not depend on the student), the frame-level cross-entropy loss:
$$\mathcal{L}_{KD}(x) = -\sum_{t} \sum_{c \in C'} p^{a}_t(c) \log p^{v}_t(c)$$
where $p^{a}_t$ and $p^{v}_t$ denote the CTC posteriors for frame $t$ obtained from the teacher and student model respectively, and $C'$ is the CTC token set (graphemes plus blank). This type of distillation has been used by other authors when distilling acoustic CTC models within the same modality (audio) and is referred to as frame-wise KD [takashima2019investigation, sak2015acoustic, kim2017improved].
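A minimal sketch of the frame-wise cross-entropy is given below. It assumes that the teacher and student posteriors have already been brought to the same number of frames and share the same token ordering; the function name and the small constant `eps` (added for numerical safety) are our own additions.

```python
import math


def framewise_kd_loss(teacher_posteriors, student_posteriors, eps=1e-12):
    """Cross-entropy between teacher and student posteriors, summed over frames.

    Both arguments are lists of per-frame probability vectors over the same
    token set (graphemes plus blank), in the same order.
    """
    if len(teacher_posteriors) != len(student_posteriors):
        # Assumption: the two streams are aligned frame by frame.
        raise ValueError("teacher and student must have the same number of frames")
    loss = 0.0
    for p_teacher, p_student in zip(teacher_posteriors, student_posteriors):
        for prob_teacher, prob_student in zip(p_teacher, p_student):
            loss -= prob_teacher * math.log(prob_student + eps)
    return loss


if __name__ == "__main__":
    teacher = [[0.9, 0.1], [0.2, 0.8]]
    matching_student = [[0.9, 0.1], [0.2, 0.8]]
    uniform_student = [[0.5, 0.5], [0.5, 0.5]]
    # The loss is smallest when the student reproduces the teacher posteriors.
    print(framewise_kd_loss(teacher, matching_student))
    print(framewise_kd_loss(teacher, uniform_student))
```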
3.5 Combined loss
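One natural way to combine the two terms, sketched here as an assumption rather than as the exact objective, is a weighted sum of the CTC loss of Section 3.3 and the distillation loss of Section 3.4. The weight symbols $\lambda_{CTC}$ and $\lambda_{KD}$ are illustrative names introduced for this sketch, and $y$ stands for whichever transcription is available (from the ASR teacher or from the ground truth).

```latex
\mathcal{L}(x, y) = \lambda_{CTC}\, \mathcal{L}_{CTC}(x, y) + \lambda_{KD}\, \mathcal{L}_{KD}(x)
```

Under this form, setting $\lambda_{KD} = 0$ would recover plain CTC training on the transcriptions, while setting $\lambda_{CTC} = 0$ would train from the teacher posteriors alone.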
4 Experimental Setup
We train on the VoxCeleb2, LRS2 and LRS3 datasets and evaluate on LRS2 and LRS3. In this context, we investigate the following training scenarios:
Full supervision. We use annotated datasets only and train the model with CTC loss on the ground truth transcriptions, similarly to [Assael16, Afouras18b]. This is the baseline method.
No supervision. We do not use any ground truth transcriptions and rely solely on the transcriptions and posteriors of the ASR teacher model for the training signal.
Unsupervised pre-training and fine-tuning. We first pre-train the model using distillation on data without ground truth transcriptions. We then fine-tune the model on a transcribed target dataset (either LRS2 or LRS3) with full supervision. We perform two sets of experiments in this setting: i) we use the ground truth annotations of all the samples in the dataset that we are fine-tuning on, or ii) we only use the ground truth of the “main” and “trainval” subsets of LRS2 and LRS3 respectively, which contain a small fraction of the total samples.
4.1 Implementation details
Our implementation is based on the NVIDIA OpenSeq2Seq framework [openseq2seq]. As a teacher model, we use the 10x5 Jasper model pretrained on Librispeech. For extracting the visual features from the input video we use the publicly available visual frontend from [Afouras18b], which is trained on word-level lip reading. We train the student model with the NovoGrad optimizer and the settings of [li2019jasper] on 4 GPUs with 11GB memory and a batch size of 64 on each. We set and . Decoding is performed with an 8192-width beam search that uses a 6-gram language model trained on the Librispeech corpus text.
|Method||Trained on: VoxCeleb2||Trained on: LRS2||Trained on: LRS3||Evaluated on (WER %): LRS2||Evaluated on (WER %): LRS3|
|Hyb. CTC/Att. [petridis2018audio]||✗||GT||✗||63.5||-|
|CTC + KD||✗||ASR||✗||58.2||-|
|CTC + KD||✗||ASR/GT||✗||57.9||-|
|CTC + KD||✗||✗||ASR||-||65.6|
|CTC + KD||✗||✗||ASR/GT||-||65.1|
CTC + KD
|CTC + KD||ASR||ASR/GT||ASR||54.0||-|
|CTC + KD||ASR||GT||ASR||53.2||-|
|CTC + KD||ASR||ASR||ASR/GT||-||62.7|
|CTC + KD||ASR||ASR||GT||-||60.9|
We summarize our results in Table 3. The baseline method (CTC, GT) obtains WER on LRS2 and on LRS3 when trained and evaluated on each dataset separately. In the same setting, and without any ground truth transcriptions, our method achieves similar performance on LRS2 (58.2%) and even better on LRS3 (65.6%). This result demonstrates that human-annotated videos are not necessary in order to effectively train lip reading models. Fine-tuning with limited ground truth transcriptions, as described in Section 4, reduces this to 57.9% for LRS2 and 65.1% for LRS3. For training on LRS2 alone, these results outperform the previous state of the art, which was 63.5% by [petridis2018audio], and set a strong benchmark for a method trained solely on LRS3.
Using our method to train on all the available data, i.e. VoxCeleb2, LRS2 and LRS3 without any ground truth transcriptions, we further reduce the WER to and for LRS2 and LRS3 respectively. If we moreover fine-tune with a small amount of ground truth transcriptions, the WER drops to 54.0% (LRS2) and 62.7% (LRS3). Finally, training on each dataset with full supervision after unsupervised pre-training on the other two yields the best results, 53.2% for LRS2 and 60.9% for LRS3. Comparing these numbers to the results we obtained when training on each dataset individually, we conclude that using extra unlabelled audio-visual speech data is indeed an effective way to boost performance.
Distillation significantly accelerates training, even when compared to using ground truth transcriptions. In Fig. 2 we compare, as an illustration, the learning curves of the baseline model, trained with CTC loss on ground truth transcriptions, and our proposed method, trained on transcriptions and posteriors from the teacher model. Our intuition is that the acceleration is due to the distillation providing explicit frame-level alignment information to the model, in contrast to CTC, which only provides an implicit signal.
6 Discussion and future work
In this paper we demonstrated an effective strategy to train strong models for visual speech recognition by distilling knowledge from a pre-trained ASR model. This training method does not require manually annotated data and is therefore suitable for pre-training on unlabelled datasets. The resulting model can optionally be fine-tuned on a small amount of annotations, and achieves performance that exceeds all existing lip reading systems aside from those trained using proprietary data.
There are many languages for which the annotated data for visual speech recognition is very limited. Since our method is applicable to any video with a talking head, given access to a pretrained ASR model and unlabelled data for a new language, we could naturally extend to lip reading that language.
We note that several authors [takashima2019investigation, Kurata2019GuidingCP, ding2019compression, sak2015acoustic] have reported difficulties distilling acoustic models trained with CTC, stemming from the misalignment between the teacher and student spike timings. Of the solutions proposed in the literature, we only experimented with sequence-level KD [takashima2019investigation] but did not observe any improvements. Investigating the extent of this problem in the cross-modal distillation domain is left to future work.
The method we have proposed can be scaled to arbitrarily large amounts of data. Given time and resource constraints we only utilized VoxCeleb2 and trained a relatively small network (5x3 instead of the 10x5 Jasper). In future work we plan to scale up in terms of both dataset and model size to develop models that can match and surpass the ones trained on very large-scale annotated datasets.