Visual speech recognition (VSR), or lipreading, is the task of recognising speech based on the visual stream only. Lipreading has attracted a lot of attention recently, mainly due to its robustness in noisy environments, where the audio signal may be heavily corrupted.
The traditional lipreading approach was based on the Discrete Cosine Transform and Hidden Markov Models (HMMs) [Potamianos2003, Dupont2000, zhou14]. Recently, the focus has shifted to deep models due to their superior performance. Such models consist of fully connected [petridis2017deepVisualSpeech, petridis2018visualWhisper, petridis2017end, wand16, petridis17AV] or convolutional layers [stafylakis17, shillingford2018large, afouras2018deep, chung16b] which extract features from the mouth region of interest, followed by recurrent layers or attention [chung16b, petridis2018audio] / self-attention architectures [afouras2018deep]. A few works have also focused on the computational complexity of visual speech recognition [koumparoulis2019mobilipnet, shrivastava2019mobivsr]; however, efficient methods have lagged considerably behind full-fledged ones in terms of accuracy.
The state-of-the-art approach for recognition of isolated words is the one proposed in [martinez2020lipreading]. It consists of a 3D convolutional layer followed by an 18-layer Residual Network (ResNet) [He_2016_CVPR], a Temporal Convolutional Network (TCN) and a softmax layer. It achieves state-of-the-art performance on the LRW [chung16b] and LRW-1000 [lrw1000] datasets, which are the largest publicly available datasets for isolated word recognition.
In this work we focus on improving the performance of the state-of-the-art model and on training lightweight models without a considerable decrease in performance. Lipreading is a challenging task due to the nature of the signal: a model must distinguish between, e.g., "million" and "millions" based solely on visual information. We resort to Knowledge Distillation (KD) [hinton2014distilling]
since it provides an extra supervisory signal with inter-class similarity information. For example, if two classes are very similar, as in the case above, the KD loss penalises the algorithm less when it confuses them. We leverage this insight to produce a sequence of teacher-student classifiers in the same manner as [furlanello2018born, yang2018knowledge], in which student and teacher share the same architecture, and each student becomes the teacher of the next generation until no further improvement is observed (see Fig. 1).
Our second contribution is a novel lightweight architecture. The ResNet-18 backbone can be readily exchanged for an efficient one, such as a member of the MobileNet [howard2017mobilenets] or ShuffleNet [ma2018shufflenet] families. Furthermore, both architectures expose a parameter controlling the width of the network, effectively controlling the computational complexity of the backbone. However, no such equivalent exists for the head classifier. The key to designing these efficient backbones is the use of depthwise separable convolutions (a depthwise convolution followed by a pointwise convolution) [chollet2017xception] in place of standard convolutions. This operation dramatically reduces the number of parameters and FLOPs. We therefore devise a novel variant of the Temporal Convolutional Network that relies on depthwise separable convolutions instead. The resulting efficient lipreading architecture is shown in Fig. 2. We conduct experiments on replacing only the backbone, only the back-end, and both, and on giving each component different capacity. The result is a full range of lightweight models with varied efficiency-accuracy trade-offs.
Our third contribution is to use the KD framework to recover some of the performance of these efficient networks. Unlike the full-fledged case, it is now possible to use a higher-capacity network to drive the optimisation. However, we find that simply using the best-performing model as the teacher, which is the standard practice in the literature, yields sub-optimal performance. Instead, we use intermediate networks whose architecture is in-between the full-fledged and the efficient one. Thus, similar to [Martinez2020Training], we generate a sequence of teacher-student pairs that progressively bridges the architectural gap.
We provide experimental evidence showing that a) we achieve a new state-of-the-art performance on LRW [chung16b] and LRW-1000 [lrw1000] by a wide margin and without any increase in computational cost †† The models and code are available at https://sites.google.com/view/audiovisual-speech-recognition and b) our lightweight models can achieve competitive performance. For example, we match the current state-of-the-art on LRW [martinez2020lipreading] using 8× fewer FLOPs and 4× fewer parameters.
Knowledge Distillation: Knowledge Distillation (KD) [hinton2014distilling] was initially proposed to transfer knowledge for model compression, where the student's capacity is much smaller than the teacher's. Recent studies [furlanello2018born, bagherinezhad2018label, yang2018knowledge] have experimentally shown that the student can still benefit even when the teacher and student networks have identical architectures. This naturally gave rise to the idea of training in generations, in which the student of one generation is used as the teacher of the subsequent generation. This process is iterated until no further improvement is observed. Finally, an ensemble can optionally be used to combine the predictions from multiple generations [furlanello2018born]. The training pipeline is detailed in Fig. 1.
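The generation loop described above can be sketched as follows. Here `train_model` and `evaluate` are hypothetical placeholders standing in for a full training run (with or without a teacher's soft targets) and for validation-accuracy measurement; this is a sketch of the control flow, not the authors' implementation.

```python
# Sketch of training in generations (born-again distillation).
# train_model(teacher) trains a fresh student; teacher=None means plain
# cross-entropy training, otherwise the teacher's soft targets are used.
def train_in_generations(train_model, evaluate, max_generations=10):
    teacher = None
    best_acc, models = 0.0, []
    for _ in range(max_generations):
        student = train_model(teacher)
        acc = evaluate(student)
        if acc <= best_acc:            # stop once no further improvement is seen
            break
        best_acc, teacher = acc, student   # the student becomes the next teacher
        models.append(student)
    return models, best_acc            # the kept models can also be ensembled
```

The returned list of generations is what an optional ensemble would average over.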
Depthwise Separable Convolution: Standard convolutions rely on a kernel of dimensionality $k \times k \times C_{in} \times C_{out}$, where $k$ is the spatial extent of the kernel and $C_{in}$ and $C_{out}$ are the number of channels of the input and output tensors. Thus, the cost of the convolution is proportional to $k^2 C_{in} C_{out}$. Instead, a depthwise separable convolution separates the spatial and the channel-wise operations to reduce the computational cost. That is, it first applies a convolution with kernel size $k \times k \times 1$ to each channel independently to perform the spatial convolution, where channel interactions are ignored, and then a convolution with kernel size $1 \times 1 \times C_{in} \times C_{out}$ to capture the correlations between channels. The cost is therefore proportional to $k^2 C_{in} + C_{in} C_{out}$, where typically the second term dominates. Depthwise separable convolutions are widely used in existing efficient network architectures including Xception [chollet2017xception], ShuffleNet [zhang2018shufflenet, ma2018shufflenet], and MobileNet [howard2017mobilenets, sandler2018mobilenetv2].
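A quick back-of-envelope check of the two costs, using illustrative sizes (kernel extent 3, 256 input and output channels; the sizes are ours, chosen only for illustration):

```python
# Multiply-accumulates per output position for each convolution type.
def standard_conv_cost(k, c_in, c_out):
    # a k x k kernel spans all input channels for every output channel
    return k * k * c_in * c_out

def separable_conv_cost(k, c_in, c_out):
    # depthwise pass (k x k per channel) + pointwise 1x1 pass across channels
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256
std = standard_conv_cost(k, c_in, c_out)    # 589824
sep = separable_conv_cost(k, c_in, c_out)   # 67840
print(f"reduction: {std / sep:.1f}x")       # roughly 8.7x fewer operations
```

As the formula suggests, the pointwise term ($C_{in} C_{out}$) dominates the separable cost at these channel counts.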
3 Towards Practical Lipreading
Base architecture: We use the visual speech recognition architecture proposed in [martinez2020lipreading] as our base architecture. The details are shown in Fig. 2. It consists of a modified ResNet-18 backbone in which the first convolution has been substituted by a 3D convolution with a kernel size of $5 \times 7 \times 7$ (time × height × width). The rest of the network follows a standard design up to the global average pooling layer. A multi-scale temporal convolutional network (MS-TCN) follows, modelling short-term and long-term temporal information simultaneously.
Efficient backbone: The efficient spatial backbone is produced by replacing the ResNet-18 with an efficient network based on depthwise separable convolutions. For the purpose of this study, we use ShuffleNet v2 ($\alpha$) as the backbone, where $\alpha$ is the width multiplier [ma2018shufflenet]. In addition to the use of depthwise separable convolutions, this lightweight architecture employs a channel shuffle operation designed to enable information exchange between different groups of channels. Specifically, ShuffleNet v2 (0.5) has 8.5× fewer parameters and 36× fewer FLOPs than ResNet-18, and ShuffleNet v2 (1.0) has 5.13× fewer parameters and 12.1× fewer FLOPs.
Depthwise Separable TCN: We note that the cost of the temporal convolutional network in the MS-TCN is non-negligible. To build an efficient architecture, in addition to replacing standard convolutions with depthwise separable ones, we reduce the number of branches in the MS-TCN and leave only one branch with a kernel size of 3. We denote this architecture the depthwise separable temporal convolutional network (DS-TCN). The detailed architecture is shown in Fig. 2.
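As an illustration, a single depthwise separable temporal convolution (the building block of a DS-TCN) can be sketched in NumPy as below. The function name and shapes are our own choices for the sketch, not the authors' implementation, and dilation/residual connections are omitted:

```python
import numpy as np

def ds_temporal_conv(x, dw_kernel, pw_weight):
    """Depthwise separable temporal convolution (illustrative sketch).

    x:         (channels, time) input sequence
    dw_kernel: (channels, k) one temporal kernel per channel (depthwise)
    pw_weight: (out_channels, channels) 1x1 convolution (pointwise)
    """
    c, t = x.shape
    k = dw_kernel.shape[1]
    pad = k // 2                      # 'same' padding for odd k
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # depthwise: each channel is convolved with its own kernel only
    # (kernel reversed because np.convolve flips it; we want correlation)
    dw = np.stack([np.convolve(xp[i], dw_kernel[i][::-1], mode="valid")
                   for i in range(c)])
    # pointwise: mix information across channels at every time step
    return pw_weight @ dw
```

The depthwise pass never mixes channels; all cross-channel interaction happens in the cheap pointwise matrix product, mirroring the cost analysis above.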
3.2 Distillation Loss
This work aims to minimise a combination of the cross-entropy loss ($\mathcal{L}_{CE}$) on the hard targets and the KL divergence loss ($\mathcal{L}_{KD}$) on the soft targets. Let us denote the labels as $y$, the parameters of the student and teacher models as $\theta_s$ and $\theta_t$, respectively, and the predictions of the student and teacher models as $z_s$ and $z_t$, respectively. $\sigma$ denotes the softmax function, and the overall loss is $\mathcal{L} = \mathcal{L}_{CE}(y, \sigma(z_s)) + \mathcal{L}_{KD}(\sigma(z_t), \sigma(z_s))$.
Note that we have omitted the temperature term, which is commonly used to soften the logits in the $\mathcal{L}_{KD}$ term, since we found that the proposed approach works well even without it.
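A minimal NumPy sketch of such a combined loss (omitting the temperature, as in the text) could look as follows; the balancing weight `alpha` is an illustrative assumption of ours, not a value from this work:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stabilised softmax
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(z_s, z_t, y, alpha=0.5):
    """Cross-entropy on hard labels + KL divergence to teacher soft targets.

    z_s, z_t: student / teacher logits, shape (batch, classes)
    y:        integer class labels, shape (batch,)
    alpha:    illustrative weight balancing the two terms
    """
    p_s, p_t = softmax(z_s), softmax(z_t)
    ce = -np.log(p_s[np.arange(len(y)), y]).mean()          # hard targets
    kd = (p_t * (np.log(p_t) - np.log(p_s))).sum(-1).mean()  # KL(teacher || student)
    return (1 - alpha) * ce + alpha * kd
```

When the teacher and student agree exactly, the KL term vanishes and only the cross-entropy term drives training.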
4 Experimental Setup
Datasets: Lip Reading in the Wild (LRW) [chung16b] and LRW-1000 [lrw1000] are the largest publicly available lipreading datasets. Both are very challenging as they contain a large number of speakers and large variations in head pose, illumination and background noise. LRW is based on a collection of over 1,000 speakers from BBC programmes. There are 488,763, 25,000 and 25,000 utterances of 500 target words in the training, validation and test sets, respectively. Each utterance is composed of 29 frames (1.16 seconds), where the target word is surrounded by other context words. LRW-1000 is the largest Mandarin lipreading dataset, collected from more than 2,000 speakers with a duration of approximately 57 hours. It has 1,000 Mandarin syllable-based classes with a total of 718,018 utterances, which vary in length from 0.01 up to 2.25 seconds.
| Method | Top-1 Acc. (%) |
| --- | --- |
| ResNet34 + BLSTM [stafylakis17] | 83.0 |
| ResNet34 + DenseNet52 + ConvLSTM [wang2019] | 83.3 |
| ResNet34 + BGRU [petridis18] | 83.4 |
| 2-stream 3D-CNN + BLSTM [weng19] | 84.1 |
| ResNet18 + BLSTM [stafylakis2018pushing] | 84.3 |
| ResNet18 + BGRU + Cutout [zhang2020can] | 85.0 |
| ResNet18 + MS-TCN [martinez2020lipreading] | 85.3 |
| | Top-1 Acc. (%) | Top-1 Acc. (%) |
| --- | --- | --- |
| Initial model trained with | Adam | AdamW |
| Student models trained with | AdamW | AdamW |
| ResNet18 + MS-TCN - Teacher | 85.6 | 87.0 |
| ResNet18 + MS-TCN - Student 1 | 87.4 | 87.8 |
| ResNet18 + MS-TCN - Student 2 | 87.8 | 87.5 |
| ResNet18 + MS-TCN - Student 3 | 87.9 | - |
| ResNet18 + MS-TCN - Student 4 | 87.7 | - |
Pre-processing: For the video sequences in the LRW dataset, 68 facial landmarks are detected and tracked using dlib [king2009dlib]. The faces are aligned to a neutral reference frame using a similarity transformation to remove differences in rotation and scale. A bounding box of 96 × 96 pixels is used to crop the mouth region of interest once the centre of the mouth is located. It should be noted that the video sequences in the LRW-1000 dataset are already cropped, so no pre-processing is needed.
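The cropping step can be sketched as follows, assuming an already-aligned grayscale frame and a mouth centre computed from the landmarks; the function name and the default crop size are illustrative choices of ours, and boundary handling is simplified:

```python
import numpy as np

def crop_mouth(frame, mouth_center, size=96):
    """Crop a size x size patch around the mouth centre (illustrative sketch).

    frame:        aligned (H, W) grayscale image
    mouth_center: (row, col), e.g. the mean of the mouth landmarks
    """
    r, c = mouth_center
    half = size // 2
    # clamp the top-left corner so the patch stays inside the frame
    r0 = max(0, min(r - half, frame.shape[0] - size))
    c0 = max(0, min(c - half, frame.shape[1] - size))
    return frame[r0:r0 + size, c0:c0 + size]
```

In a full pipeline this would run per frame after the similarity-transform alignment, so the mouth stays centred across the sequence.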
The lipreading model is trained in an end-to-end manner for 80 epochs using an initial learning rate of 3e-4, a weight decay of 1e-4 and a mini-batch size of 32. We decay the learning rate with a cosine annealing schedule [sgdr17]. We should note that all models are trained from random initialisation, without using any external datasets.
During training, each sequence is flipped horizontally with a probability of 0.5 and randomly cropped to a size of 88 × 88, and mixup [zhang2017mixup] is used with a weight of 0.4. During testing, we use the centre patch of the image sequence. To improve robustness, we train all models with variable-length augmentation similarly to [martinez2020lipreading], where each sequence is segmented temporally at a random point before and after the boundary of the target word.
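The mixup step can be sketched as below. This is a two-sample illustration using the weight 0.4 from the text; a real pipeline would mix shuffled pairs within each mini-batch, and the function signature is our own:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """mixup augmentation: convex combination of two samples and their labels.

    y1, y2 are one-hot label vectors; lam is drawn from Beta(alpha, alpha).
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

Because the labels are mixed with the same coefficient as the inputs, the loss on a mixed sample is a matching convex combination of the two class targets.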
5.1 Born-Again Distillation
| Method | Top-1 Acc. (%) |
| --- | --- |
| ResNet34 + DenseNet52 + ConvLSTM [wang2019] | 36.9 |
| ResNet34 + BLSTM [stafylakis17] | 38.2 |
| ResNet18 + BGRU [zhang2020can] | 38.6 |
| ResNet18 + MS-TCN [martinez2020lipreading] | 41.4 |
| ResNet18 + BGRU + Cutout [zhang2020can] | 45.2 |
| | Top-1 Acc. (%) |
| --- | --- |
| Initial model trained with | Adam |
| Student models trained with | AdamW |
| ResNet18 + MS-TCN - Teacher | 43.2 |
| ResNet18 + MS-TCN - Student 1 | 45.3 |
| ResNet18 + MS-TCN - Student 2 | 44.7 |
In this set of experiments, we apply born-again distillation [furlanello2018born], so that student and teacher have identical architectures. An ensemble of student models is also created, as suggested by [furlanello2018born]. Results on the LRW dataset are shown in Table 1. We notice that when training a model without distillation, the AdamW optimiser [loshchilov2017decoupled] leads to a significant increase in performance over the Adam optimiser. However, the best results after self-distillation differ by only 0.1%, regardless of the accuracy of the initial teacher model. This leads to a new state-of-the-art performance on LRW, a 2.6% margin over the previous one, without an increase in computational cost. Furthermore, an ensemble of the models reaches an accuracy of 88.6%, which further pushes the state-of-the-art on LRW.
Results on the LRW-1000 dataset are shown in Table 2. In this case, our best single model yields an absolute improvement of 3.9% over the previous state-of-the-art accuracy on LRW-1000 among works using only the publicly available data. Furthermore, the ensemble model yields a further 1.3% gain, resulting in a 5.2% overall improvement. These results confirm that inter-class similarity information is indeed crucial for lipreading.
5.2 Efficient Lipreading
| Student Backbone (width mult.) | Student Back-end (width mult.) | Distillation | Top-1 Acc. (%) | Params (M) | FLOPs (G) |
| --- | --- | --- | --- | --- | --- |
| ResNet-18 [martinez2020lipreading] | MS-TCN (3) | - | 85.3 | 36.4 | 10.31 |
| ResNet-34 [petridis18] | BGRU (512) | - | 83.4 | 29.7 | 18.71 |
| ShuffleNet v2 (1) | MS-TCN (3) | ✗ | 84.4 | 28.8 | 2.23 |
| ShuffleNet v2 (0.5) | MS-TCN (3) | ✗ | 83.1 | 27.9 | 1.69 |
| ShuffleNet v2 (1) | TCN (2) | ✗ | 82.7 | 9.1 | 1.31 |
| ShuffleNet v2 (1) | DS-MS-TCN (3) | ✗ | 84.5 | 9.3 | 1.26 |
| ShuffleNet v2 (1) | TCN (1) | ✗ | 81.0 | 3.8 | 1.12 |
| ShuffleNet v2 (0.5) | TCN (2) | ✗ | 81.6 | 8.2 | 0.77 |
| ShuffleNet v2 (0.5) | TCN (1) | ✗ | 78.1 | 2.9 | 0.58 |
| ShuffleNet v2 (0.5) | DS-TCN (2) | ✗ | 76.2 | 3.5 | 0.58 |
One of the major limitations preventing the use of current lipreading models in practical applications is their computational cost. Many speech recognition applications rely on on-device computing, where computational capacity is limited and memory footprint and battery consumption are also important factors. We aim to bridge this gap by constructing very efficient models that perform on par with competing lipreading methods. To this end, we explore replacing the ResNet-18 backbone and the TCN-based classifier head with efficient alternatives. We chose ShuffleNet v2 [ma2018shufflenet] to replace the ResNet-18, as preliminary experiments showed superior performance over the MobileNetV2 [sandler2018mobilenetv2] and EfficientNet-B0 [TanL19_efficientnet] alternatives. To control the backbone complexity, we further consider channel width multipliers of 0.5 and 1.
The TCN-based head classifier has the following variants: TCN and MS-TCN. TCN denotes the vanilla TCN [BaiTCN2018] with a kernel of size 3. We chose this kernel size as it yields comparable performance to larger kernels at a lower computational cost, while a smaller kernel results in large accuracy drops. MS-TCN denotes the multi-scale variant presented in [martinez2020lipreading]. Finally, for the purpose of model efficiency, we introduce novel depthwise separable variants of these models, denoted DS-TCN and DS-MS-TCN respectively. We can similarly add capacity to the different TCN variants with a width multiplier applied to the base size of 256 channels.
We explored multiple options for the teacher network. Since higher-capacity models are now available as teachers, we do not need to resort to self-distillation. We explore the following options: using the best-performing network as the teacher; using an intermediate-capacity network that shares either the backbone or the head with the student (e.g., ShuffleNet + MS-TCN as teacher, ShuffleNet + DS-MS-TCN as student); and combining either option with further training in generations. Since each strategy works to varying degrees for the different architectures, we use performance on the validation set to choose the best-performing model and report the accuracy on the test partition. In general, we found that using an intermediate-capacity network as the teacher, followed by distillation in generations, is often the best option. Since the intermediate architecture itself is also trained through distillation, we end up with a progressive teacher-student sequence as in [Martinez2020Training].
The results on the LRW dataset are shown in Table 3. Remarkably, replacing the state-of-the-art ResNet18 + MS-TCN with ShuffleNet v2 + DS-MS-TCN provides the same accuracy as the previous state-of-the-art MS-TCN of [martinez2020lipreading], while requiring 8.2× fewer FLOPs and 3.9× fewer parameters. This is particularly remarkable since the MS-TCN is already quite efficient, having a slightly lower computational cost than the lightweight architecture of MobiVSR-1 [shrivastava2019mobivsr]. Another notable combination is ShuffleNet v2 (0.5) + TCN, which achieves 79.9% accuracy on LRW with as little as 0.58 GFLOPs and 2.9M parameters, a reduction of 17.8× and 12.5× respectively compared to the ResNet18 + MS-TCN model of [martinez2020lipreading].
The same pattern is also observed on the LRW-1000 dataset, as shown in Table 4. ShuffleNet v2 (0.5) + DS-TCN (1) provides higher performance (a 4.3% absolute improvement) while requiring 9.4× fewer parameters and 36.1× fewer FLOPs than the 3D DenseNet [lrw1000]. An additional absolute improvement of 1.1% is achieved for the ShuffleNet v2 (0.5) + DS-TCN (1) model by using DS-TCN (1) as the teacher model. A visual depiction of the number of parameters vs. accuracy is given in Fig. 3.
| Student Backbone (width mult.) | Student Back-end (width mult.) | Distillation | Top-1 Acc. (%) | Params (M) | FLOPs (G) |
| --- | --- | --- | --- | --- | --- |
| 3D DenseNet [lrw1000] | BGRU (256) | - | 34.8 | 15.0 | 30.32 |
| ShuffleNet v2 (1) | TCN (1) | ✗ | 40.7 | 3.9 | 1.73 |
| ShuffleNet v2 (1) | DS-TCN (1) | ✗ | 39.1 | 2.5 | 1.68 |
| ShuffleNet v2 (0.5) | TCN (1) | ✗ | 40.5 | 3.0 | 0.89 |
| ShuffleNet v2 (0.5) | DS-TCN (1) | ✗ | 39.1 | 1.6 | 0.84 |
In this work, we present state-of-the-art results on isolated word recognition via knowledge distillation. We also investigate efficient models for visual speech recognition, achieving results similar to the current state-of-the-art while reducing the computational cost by 8×. In future work, it would be interesting to investigate how cross-modal distillation affects the performance of audiovisual speech recognition models.