FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire

Lipreading has seen a definite improvement in accuracy in recent years. However, existing lipreading methods mainly build on autoregressive (AR) models, which generate target tokens one by one and suffer from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. NAR lipreading is a challenging task with several difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper, we introduce three methods to reduce the gap between FastLR and the AR model: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I&F) module to model the correspondence between source video frames and the output text sequence; 2) to tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with an extra CTC loss, and we also add an auxiliary autoregressive decoder to help the feature extraction of the encoder; 3) to overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) for I&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with a slight absolute WER increase of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method.


1. Introduction

Lipreading aims to recognize sentences being spoken by a talking face, which is widely used now in many scenarios including dictating instructions or messages in a noisy environment, transcribing archival silent films, resolving multi-talker speech (Afouras et al., 2018b) and understanding dialogue from surveillance videos. However, it is widely considered a challenging task and even experienced human lipreaders cannot master it perfectly (Assael et al., 2016; Shillingford et al., 2018). Thanks to the rapid development of deep learning in recent years, there has been a line of works studying lipreading and salient achievements have been made.

Existing methods mainly adopt autoregressive (AR) models, based either on RNNs (Stafylakis and Tzimiropoulos, 2017; Zhao et al., 2019) or on the Transformer (Afouras et al., 2018a; Afouras et al., 2018b). Those systems generate each target token conditioned on the sequence of tokens generated previously, which hinders parallelizability. Thus, they all suffer from high inference latency, especially when dealing with massive video data containing hundreds of hours (like long films and surveillance videos) or with real-time applications such as dictating messages in a noisy environment.

To tackle the low parallelizability of AR generation, many non-autoregressive (NAR) models (Gu et al., 2017; Lee et al., 2018; Guo et al., 2019; Wang et al., 2019; Ma et al., 2019; Liu et al., 2020; Ren et al., 2020b) have been proposed in the machine translation field. The most typical one is NAT-FT (Gu et al., 2017), which modifies the Transformer (Vaswani et al., 2017) by adding a fertility module to predict the number of words in the target sequence aligned to each source word. Besides NAR translation, many researchers bring NAR generation into other sequence-to-sequence tasks, such as video captioning (Ren et al., 2019, 2020a), speech recognition (Chen et al., 2019) and speech synthesis (Oord et al., 2017; Ren et al., 2019). These works focus on generating the target sequence in parallel and mostly achieve more than an order of magnitude lower inference latency than their corresponding AR models.

However, generating the whole target sequence simultaneously is very challenging in the lipreading task, for the following reasons:

  • The considerable discrepancy of sequence length between the input video frames and the target text tokens makes it difficult to estimate the length of the output sequence or to define a proper decoder input during the inference stage. This is different from machine translation model, which can even simply adopt the way of uniformly mapping the source word embedding as the decoder input (Wang et al., 2019) due to the analogous text sequence length.

  • The true target sequence distributions show a strong correlation across time, but the NAR model usually generates target tokens conditionally independent of each other. This is a poor approximation and may generate repeated words. Gu et al. (2017) term this the “multimodality problem”.

  • The feature representation ability of encoder could be weak when just training the raw NAR model due to lack of effective alignment mechanism.

  • The removal of the autoregressive decoder, which usually acts as a language model, makes the model much more difficult to tackle the inherent ambiguity problem in lipreading.

In our work, we propose FastLR, a non-autoregressive lipreading model based on Transformer. To handle the challenges mentioned above and reduce the gap between FastLR and AR model, we introduce three methods as follows:

  • To estimate the length of the output sequence and alleviate the problem of time correlation in the target sequence, we leverage an integrate-and-fire (I&F) module to encode the continuous video signal into discrete token embeddings by locating the acoustic boundary, which is inspired by Dong and Xu (2019). These discrete embeddings retain the timing information and correspond to the target tokens directly.

  • To enhance the feature representation ability of encoder, we add the connectionist temporal classification (CTC) decoder on the top of encoder and optimize it with CTC loss, which could force monotonic alignments. Besides, we add an auxiliary AR decoder during training to facilitate the feature extraction ability of encoder.

  • To tackle the inherent ambiguity problem and reduce the spelling errors in NAR inference, we first propose a novel Noisy Parallel Decoding (NPD) for I&F method. The rescoring method in NPD takes advantage of the language model in the well-trained AR lipreading teacher without harming the parallelizability. Then we bring Byte-Pair Encoding (BPE) into lipreading, which compresses the target sequence and makes each token contain more language information, reducing the dependency among tokens compared with character-level encoding.

The core contribution of this work is that we are the first to propose a non-autoregressive lipreading system, and we present the methods mentioned above to bridge the gap between FastLR and state-of-the-art autoregressive lipreading models.

The experimental results show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with a slight WER increase of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method. We also conduct ablation experiments to verify the significance of all proposed methods in FastLR.

2. Related Works

2.1. Autoregressive Deep Lipreading

Prior works utilizing deep learning for lipreading mainly adopt the autoregressive model. The first typical approach is LipNet (Assael et al., 2016), based on CTC (Graves et al., 2006), which takes advantage of a spatio-temporal convolutional front-end feature generator and GRU (Chung et al., 2014). Further, Stafylakis and Tzimiropoulos (2017) propose a network combining a modified 3D/2D-ResNet architecture with LSTM. Afouras et al. (2018b) introduce the Transformer self-attention architecture into lipreading, and build TM-seq2seq and TM-CTC. The former surpasses the performance of all previous work on the LRS2-BBC dataset by a large margin. To boost the performance of lipreading, Petridis et al. (2018) present a hybrid CTC/Attention architecture aiming to obtain a better alignment than the attention-only mechanism, and Zhao et al. (2019) propose transferring knowledge from an audio speech recognition model to a lipreading model by distillation.

However, these methods, whether based on recurrent neural networks or on the Transformer, all adopt an autoregressive decoding method which takes in the input video sequence and generates the tokens of the target sentence one by one during the inference process. They all suffer from high latency.

2.2. Non-Autoregressive Decoding

An autoregressive model takes in a source sequence and then generates the words of the target sentence one by one with a causal structure during the inference process (Sutskever et al., 2014; Vaswani et al., 2017). To reduce the inference latency, Gu et al. (2017) introduce a non-autoregressive model based on Transformer into the machine translation field, which generates all target words in parallel. The conditional probability can be defined as

$$P(y \mid x) = P(\tau \mid x; \theta) \prod_{t=1}^{\tau} P(y_t \mid x; \theta),$$

where $\tau$ is the length of the target sequence gained from the fertility prediction function conditioned on the source sentence. Due to the multimodality problem (Gu et al., 2017), the performance of the NAR model is usually inferior to that of the AR model. Recently, a line of works aiming to bridge the performance gap between NAR and AR models for the translation task has been presented (Ghazvininejad et al., 2019; Guo et al., 2019).
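As a minimal sketch of this factorization, the snippet below scores a sequence as a length term plus independent per-position terms, and decodes every position on its own. The function names and the toy probabilities are illustrative assumptions, not part of any cited system.

```python
import math

def nar_log_prob(length_prob, token_probs):
    """log P(y|x) = log P(tau|x) + sum_t log P(y_t|x) for one sequence.

    length_prob: probability assigned to the chosen target length tau.
    token_probs: probability of the chosen token at each of the tau positions.
    """
    return math.log(length_prob) + sum(math.log(p) for p in token_probs)

def nar_decode(per_position_dists):
    """Pick the most likely token at each position independently.

    Each position only depends on the source, so all positions could be
    computed in parallel; the list comprehension just emulates that.
    """
    return [max(dist, key=dist.get) for dist in per_position_dists]

# Toy distributions over a two-token vocabulary (hypothetical values).
dists = [{"a": 0.6, "b": 0.4}, {"a": 0.1, "b": 0.9}]
tokens = nar_decode(dists)
score = nar_log_prob(0.5, [dists[0]["a"], dists[1]["b"]])
```

Because no position sees its neighbours, nothing prevents two adjacent positions from choosing the same token, which is one way the multimodality problem shows up.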

Besides the study of NAR translation, many works bring NAR model into other sequence-to-sequence tasks, such as video caption (Yang et al., 2019), speech recognition (Chen et al., 2019) and speech synthesis (Oord et al., 2017; Ren et al., 2019).

Figure 1. The overview of the model architecture for FastLR.

2.3. Spike Neural Network

The integrate-and-fire neuron model describes the membrane potential of a neuron according to the synaptic inputs and the injected current (Burkitt, 2006). It is biologically motivated and widely used in spiking neural networks. Concretely, the neuron integrates the input signal forwardly and increases the membrane potential. Once the membrane potential reaches a threshold, a spike signal is generated, which means an event takes place. Henceforth, the membrane potential is reset and then grows in response to the subsequent input signal again. It enables the encoding from continuous signal sequences to discrete signal sequences, while retaining the timing information.

Recently, Dong and Xu (2019) introduce the integrate-and-fire model into speech recognition task. They use continuous functions that support back-propagation to simulate the process of integrate-and-fire. In this work, the fired spike represents the event that locates an acoustic boundary.

3. Methods

In this section, we introduce FastLR and describe our methods thoroughly. As shown in Figure 1, FastLR is composed of a spatio-temporal convolutional neural network for video feature extraction (visual front-end) and a sequence processing model (main model) based on Transformer with an enhanced encoder, a non-autoregressive decoder and an I&F module. To further tackle the challenges in non-autoregressive lipreading, we propose the NPD method for I&F and bring byte-pair encoding into our method. The details of our model and methods are described in the following subsections (we introduce the visual front-end in Section 4.2, as it varies from one dataset to another).

3.1. Enhanced Encoder

The encoder of FastLR is composed of stacked self-attention and feed-forward layers, which are the same as those in Transformer (Vaswani et al., 2017) and in the autoregressive lipreading model TM-seq2seq (Afouras et al., 2018b). Because the encoders are identical, we can add an auxiliary autoregressive decoder, shown in the left panel of Figure 1, and optimize the AR lipreading task together with FastLR on one shared encoder during the training stage. This transfers knowledge from the AR model to FastLR, which facilitates the optimization. Besides, we add a connectionist temporal classification (CTC) decoder with CTC loss on the encoder to force monotonic alignments, which is a widely used technique in the speech recognition field. Both adjustments improve the feature representation ability of our encoder.

3.2. Integrate-and-fire module

To estimate the length of the output sequence and alleviate the problem of time correlation in the target sequence, we adopt the continuous integrate-and-fire (I&F) module (Dong and Xu, 2019) for FastLR. This is a soft and monotonic alignment which can be employed in the encoder-decoder sequence processing model. First, the encoder output hidden sequence $h = (h_1, h_2, \ldots, h_m)$ is fed to a 1-dimensional convolutional layer followed by a fully connected layer with sigmoid activation function, which yields the weight embedding sequence $w = (w_1, w_2, \ldots, w_m)$ representing the weight of information carried in $h$. Second, the I&F module scans $w$ and accumulates the weights from left to right until the sum reaches the threshold (we set it to 1.0), which means an acoustic boundary is detected. Third, I&F divides $w_i$ at this point into two parts, $w_{i,1}$ and $w_{i,2}$: $w_{i,1}$ completes the integration of the current embedding $f_j$ to be fired, while $w_{i,2}$ is used for the next integration of $f_{j+1}$. Then, I&F resets the accumulation and continues to scan the rest of $w$, which begins with $w_{i,2}$, for the next integration. This procedure is noted as “accumulate and detect”. Finally, I&F multiplies every $w_k$ (or $w_{k,1}$, $w_{k,2}$) in $w$ by the corresponding $h_k$ and integrates them according to the detected boundaries. An example is shown in Figure 2.
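The “accumulate and detect” procedure can be sketched in plain Python as follows. This is only an illustration under stated assumptions: hidden states are lists of floats, the weights are already given (no convolution or sigmoid layer is modelled), any residual weight left below the threshold at the end is simply discarded, and the name `integrate_and_fire` is ours.

```python
def integrate_and_fire(hidden, weights, threshold=1.0, eps=1e-9):
    """Turn frame-level hidden states into fired token-level embeddings.

    hidden:  list of m vectors (each a list of floats).
    weights: list of m non-negative floats, one per hidden state.
    Returns the list of fired embeddings, one per detected boundary.
    """
    dim = len(hidden[0])
    fired = []
    acc = 0.0                  # weight accumulated since the last boundary
    state = [0.0] * dim        # weighted sum of hidden states being integrated
    for h, w in zip(hidden, weights):
        remaining = w
        # A boundary is detected whenever the accumulated weight reaches the threshold.
        while acc + remaining >= threshold - eps:
            w1 = threshold - acc             # part that completes the current embedding
            fired.append([s + w1 * x for s, x in zip(state, h)])
            remaining = max(remaining - w1, 0.0)   # part carried to the next embedding
            acc = 0.0
            state = [0.0] * dim
        acc += remaining
        state = [s + remaining * x for s, x in zip(state, h)]
    return fired

# Toy input: four 1-dimensional hidden states whose weights sum to 2.0.
example = integrate_and_fire([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.75, 0.25, 0.5])
```

In this toy case the second weight (0.75) is split into 0.5 and 0.25 at the first boundary, and two embeddings are fired, matching the total weight of 2.0.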

3.3. Non-autoregressive Decoder

Different from the Transformer decoder, the self-attention of FastLR’s decoder can attend to the entire sequence, owing to the conditionally independent property of the NAR model. We also remove the inter-attention mechanism since FastLR already has an alignment mechanism (I&F) between source and target. The decoder takes in the fired embedding sequence of I&F and generates the text tokens in parallel during both the training and inference stages.

3.4. Noisy parallel decoding (NPD) for I&F

The absence of AR decoding procedure makes the model much more difficult to tackle the inherent ambiguity problem in lipreading. So, we design a novel NPD for I&F method to leverage the language information in well-trained AR lipreading model.

In section 3.2, it is not hard to see that $\lfloor S \rfloor$ represents the length of the predicted sequence $f$ (or $y$), where $S$ is the total sum of the weight embedding sequence $w$. Dong and Xu (2019) propose a scaling strategy which multiplies $w$ by the scalar $\tilde{S} / \sum_{i=1}^{m} w_i$ to generate $w' = (w'_1, \ldots, w'_m)$, where $\tilde{S}$ is the length of the target label $\tilde{y}$. By doing so, the total sum of $w'$ equals $\tilde{S}$, which teacher-forces I&F to predict $f$ with the true length $\tilde{S}$ and benefits the cross-entropy training.
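The scaling strategy amounts to a single rescaling of the weights. The helper below is a hypothetical sketch (the name `scale_weights` and the toy values are ours), assuming the weights are plain floats.

```python
def scale_weights(weights, target_length):
    """Rescale the weights so that their sum equals the target length."""
    total = sum(weights)
    return [w * target_length / total for w in weights]

# Toy weights summing to 2.0, rescaled for a target of 3 tokens.
scaled = scale_weights([0.5, 0.75, 0.25, 0.5], 3)
```

After rescaling, accumulating the weights against a threshold of 1.0 yields exactly as many boundaries as there are target tokens, which is what makes token-level cross-entropy training well defined.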

However, we do not stop at this point. Besides training, we also scale $w$ during the inference stage to generate multiple candidates of weight embedding with different length bias $b$. With beam size $B$,

$$w'_b = \frac{\sum_{i=1}^{m} w_i + b}{\sum_{i=1}^{m} w_i} \cdot w, \quad b \in [-B, B] \cap \mathbb{Z},$$

where $w$ is the output of the I&F module during inference and the length bias $b$ is provided by the “Length Controller” module in Figure 1. Then, we utilize the re-scoring method used in Noisy Parallel Decoding (NPD), which is a common practice in non-autoregressive neural machine translation, to select the best sequence from these $2B + 1$ candidates via an AR lipreading teacher:

$$w_{NPD} = \arg\max_{w'_b} \; p_{AR}\big(G(x, w'_b; \theta) \mid x; \theta\big),$$

where $p_{AR}(A \mid x; \theta)$ is the probability of the sequence $A$ under the autoregressive model, $G(x, w; \theta)$ means the optimal generation of FastLR given a source sentence $x$ and weight embedding $w$, and $\theta$ represents the parameters of the model.

The selection process could leverage information in the language model (decoder) of the well-trained autoregressive lipreading teacher, which alleviates the ambiguity problem and gives a chance to adjust the weight embedding generated by I&F module for predicting a better sequence length. Note that these candidates can be computed independently, which won’t hurt the parallelizability (only doubles the latency due to the selection process). The experiments demonstrate that the re-scored sequence is more accurate.
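The selection step can be sketched as below. This is a toy illustration, assuming that each candidate is obtained by rescaling the weights so that their sum shifts by an integer length bias; `generate` and `teacher_score` are hypothetical stand-ins for the FastLR decoder and the AR teacher, and the loop is sequential only for readability (the candidates are independent and could be batched).

```python
def candidate_weights(weights, beam_size):
    """Rescaled weight sequences, one per length bias b in [-beam_size, beam_size]."""
    total = sum(weights)
    return {
        b: [w * (total + b) / total for w in weights]
        for b in range(-beam_size, beam_size + 1)
        if total + b > 0   # skip biases that would give a non-positive length
    }

def npd_select(weights, beam_size, generate, teacher_score):
    """Return (score, bias, hypothesis) of the candidate the teacher prefers."""
    best = None
    for b, scaled in candidate_weights(weights, beam_size).items():
        hyp = generate(scaled)          # non-autoregressive generation
        score = teacher_score(hyp)      # re-scoring by the AR teacher
        if best is None or score > best[0]:
            best = (score, b, hyp)
    return best

# Hypothetical stand-ins: the "decoder" emits one token per unit of weight,
# and the "teacher" prefers hypotheses of length 3.
toy_generate = lambda scaled: ["tok"] * round(sum(scaled))
toy_teacher = lambda hyp: -abs(len(hyp) - 3)
best_score, best_bias, best_hyp = npd_select([0.5, 0.75, 0.25, 0.5], 2, toy_generate, toy_teacher)
```

Here the unbiased weights would give a two-token output, and the teacher selects the candidate with bias +1 instead.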

3.5. Byte-Pair Encoding

Byte-Pair Encoding (Sennrich et al., 2015) is widely used in the NMT (Vaswani et al., 2017) and ASR (Dong and Xu, 2019) fields, but rare in lipreading tasks. BPE makes each token contain more language information and reduces the dependency among tokens compared with character-level encoding, which alleviates the problems of non-autoregressive generation discussed before. In this work, we tokenize the sentence with the Moses tokenizer and then use the BPE algorithm to segment each target word into sub-words.
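For readers unfamiliar with BPE, the following is a compact sketch of how merge operations are learned, in the spirit of Sennrich et al. (2015). It is not the tooling used in this work; the function name, the `</w>` end-of-word marker and the toy word counts are illustrative assumptions.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary.

    Returns the ordered list of merged symbol pairs and the final segmentation.
    """
    # Every word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus (hypothetical counts).
merges, segmented = learn_bpe({"abab": 4, "cab": 1}, num_merges=2)
```

Each merge shortens the target sequence, so a sub-word token covers more video frames than a single character would.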

Figure 2. An example to illustrate how the I&F module works. $h$ represents the encoder output hidden sequence. In this case .

3.6. Training of FastLR

We optimize the CTC decoder with CTC loss, which maps the frame-level video feature sequence to a target text sequence. CTC introduces a set of intermediate representation paths $\phi(\tilde{y})$, termed CTC paths, for one target text sequence $\tilde{y}$. Each CTC path $c$ is composed of scattered target text tokens and blanks, and reduces to the target text sequence by removing the repeated tokens and blanks. The likelihood of $\tilde{y}$ can be calculated as the sum of the probabilities of all CTC paths corresponding to it:

$$P_{ctc}(\tilde{y} \mid x) = \sum_{c \in \phi(\tilde{y})} P_{ctc}(c \mid x)$$

Thus, CTC loss can be formulated as:

$$\mathcal{L}_{ctc} = -\sum_{(x, \tilde{y}) \in (\mathcal{X} \times \mathcal{Y})} \log \sum_{c \in \phi(\tilde{y})} P_{ctc}(c \mid x)$$

where $(\mathcal{X} \times \mathcal{Y})$ denotes the set of source video and target text sequence pairs in one batch.
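To make the notion of CTC paths concrete, the sketch below collapses a path and computes the CTC likelihood by brute-force enumeration. It is only meant for tiny examples (practical implementations use dynamic programming); the blank symbol `_`, the function names and the toy probabilities are assumptions of this illustration.

```python
from itertools import product

BLANK = "_"

def collapse(path):
    """Reduce a CTC path: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

def ctc_likelihood(frame_probs, target):
    """Sum the probabilities of all paths that collapse to the target.

    frame_probs: one {symbol: probability} dict per frame.
    Enumerates every path, so the cost is exponential in the number of frames.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]
            total += p
    return total

# Two frames, uniform over {"a", blank}: paths "aa", "a_" and "_a" all collapse to "a".
likelihood = ctc_likelihood([{"a": 0.5, BLANK: 0.5}] * 2, "a")
```

The CTC loss for one pair is then the negative logarithm of this likelihood.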

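We optimize the auxiliary autoregressive task with cross-entropy loss, which can be formulated as:

$$\mathcal{L}_{AR} = -\sum_{(x, \tilde{y}) \in (\mathcal{X} \times \mathcal{Y})} \sum_{t=1}^{\tilde{S}} \log P_{AR}(\tilde{y}_t \mid \tilde{y}_{<t}, x)$$

And most importantly, we optimize the main task FastLR with cross-entropy loss $\mathcal{L}_{FLR}$, which can be formulated as:

$$\mathcal{L}_{FLR} = -\sum_{(x, \tilde{y}) \in (\mathcal{X} \times \mathcal{Y})} \sum_{t=1}^{\tilde{S}} \log P_{FLR}(\tilde{y}_t \mid x)$$

Here $(x, \tilde{y})$ ranges over the source video and target text sequence pairs in one batch, and $\tilde{S}$ is the length of $\tilde{y}$. Then, the total loss function for training our model is:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{ctc} + \lambda_2 \mathcal{L}_{AR} + \lambda_3 \mathcal{L}_{FLR}$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are hyperparameters to trade off the three losses.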

4. Experiments and Results

4.1. Datasets


The GRID dataset (Cooke et al., 2006) consists of 34 subjects, each of whom utters 1,000 phrases. It is a clean dataset and easy to learn. We adopt the same split as Assael et al. (2016), where 255 random sentences from each speaker are selected for evaluation. In order to better recognize lip movements, we transform the images into gray scale, and crop the video images to a fixed size containing the mouth region with the Dlib face detector. Since the vocabulary size of the GRID dataset is quite small and most words are simple, we do not apply Byte-Pair Encoding (Sennrich et al., 2015) on GRID, and just encode the target sequence at the character level.


The LRS2 dataset contains sentences of up to 100 characters from BBC videos (Afouras et al., 2018a), which have a range of viewpoints from frontal to profile. We adopt the original split of LRS2 for the train/dev/test sets, which contain 46k, 1,082 and 1,243 sentences respectively. We also make use of the pre-train dataset provided by LRS2, which contains 96k sentences, for pretraining. Following previous works (Afouras et al., 2018a; Afouras et al., 2018b; Zhao et al., 2019), the input video frames are converted to grey scale and centrally cropped into images. As for the text sentence, we split each word token into subwords using BPE (Sennrich et al., 2015), and set the vocabulary size to 1k considering the vocabulary size of LRS2.

The statistics of both datasets are listed in Table 1.

Dataset            Utt.   Word inst.   Vocab   Hours
GRID               33k    165k         51      27.5
LRS2 (Train-dev)   47k    337k         18k     29
Table 1. The statistics on GRID and LRS2 lip reading datasets. Utt: Utterance.

4.2. Visual feature extraction

For the GRID dataset, we use a spatio-temporal CNN to extract visual features, following Torfi et al. (2017). The visual front-end network is composed of four 3D convolution layers with 3D max pooling and ReLU, and two fully connected layers. The kernel size of 3D convolution and pooling is , and the hidden sizes of the fully connected layer and the output dense layer are both 256. We directly train this visual front-end together with our main model end-to-end on GRID, based on the implementation by Torfi et al. (2017).

For the LRS2 dataset, we adopt the same structure as Afouras et al. (2018a), which uses a 3D convolution on the input frame sequence with a filter width of 5 frames, and a 2D ResNet decreasing the spatial dimensions progressively with depth. The network converts the frame sequence into feature sequence, where is frame number, frame height, frame width respectively. It is worth noting that training the visual front-end together with the main model can yield poor results on LRS2, as observed in previous works (Afouras et al., 2018b). Thus, as Zhao et al. (2019) do, we utilize the frozen visual front-end provided by Afouras et al. (2018b), which is pre-trained on the non-public dataset MV-LRS (Chung and Zisserman, 2017), to extract the visual features. We then train FastLR on these features end-to-end. The pre-trained model can be found in

4.3. Model Configuration

We adopt the Transformer (Vaswani et al., 2017) as the basic model structure for FastLR because it is parallelizable and achieves state-of-the-art accuracy in lipreading (Afouras et al., 2018b). The model hidden size, number of encoder-layers, number of decoder-layers, and number of heads are set to for LRS2 dataset and for GRID dataset respectively. We replace the fully-connected network in the original Transformer with a 2-layer 1D convolution network with ReLU activation, which is commonly used in speech tasks and is the same as in TM-seq2seq (Afouras et al., 2018b) for lipreading. The kernel size and filter size of 1D convolution are set to and 9 respectively. The CTC decoder consists of two fully-connected layers with ReLU activation function and one fully-connected layer without activation function. The hidden sizes of these fully-connected layers equal to . The auxiliary decoder is an ordinary Transformer decoder with the same configuration as FastLR, which takes in the target text sequence shifted right by one step for teacher-forcing.

4.4. Training setup

As mentioned in section 3.1, to boost the feature representation ability of the encoder, we add an auxiliary connectionist temporal classification (CTC) decoder and an autoregressive decoder to FastLR and optimize them together. We set to 0.5, to during warm-up training stage, and set to during main training stage for simplicity. The training steps of each training stage are listed in detail in Table 2. Note that the experiment on the GRID dataset needs more training steps, since the model is trained together with its visual front-end from scratch, unlike the experiments on the LRS2 dataset. Moreover, the first 45k steps of the warm-up stage for LRS2 are trained on the LRS2-pretrain sub-dataset and all the remaining steps are trained on the LRS2-main sub-dataset (Afouras et al., 2018a; Afouras et al., 2018b; Zhao et al., 2019).

We train our model FastLR using Adam following the optimizer settings and learning rate schedule in Transformer (Vaswani et al., 2017). The training procedure runs on 2 NVIDIA 1080Ti GPUs. Our code is based on tensor2tensor (Vaswani et al., 2018).

Stage     GRID   LRS2
Warm-up   300k   55k
Main      160k   120k
Table 2. The training steps of FastLR for different datasets for each training stage.

4.5. Inference and Evaluation

During the inference stage, the auxiliary CTC decoder and the autoregressive decoder are discarded. Given the beam size $B$, FastLR generates $2B + 1$ candidates of weight embedding sequence which correspond to $2B + 1$ text sequences, and these text sequences are sent to the decoder of a well-trained autoregressive lipreading model (TM-seq2seq) for selection as described in section 3.4. The result of the selected best text sequence is marked with “NPD9”. We conduct the experiments on both “NPD9” and “without NPD”. To be specific, “without NPD” means directly using the candidate with zero length bias without a selection process, which has a lower latency.

The recognition quality is evaluated by Word Error Rate (WER) and Character Error Rate (CER). Both error rates can be defined as:

$$\text{Error Rate} = \frac{S + D + I}{N}$$

where S, D, I and N are the number of substitutions, deletions, insertions and reference tokens (word or character) respectively.
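As a small illustration, the error rate can be computed from the token-level edit distance, whose minimum number of edits equals S + D + I. The function names are ours, and the example sentences are made up in the style of GRID commands.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion of r
                           cur[j - 1] + 1,            # insertion of h
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def error_rate(ref_tokens, hyp_tokens):
    """(S + D + I) / N, with N the number of reference tokens."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# WER uses word tokens, CER uses characters.
wer = error_rate("place blue at a one now".split(), "place blue in one now".split())
cer = error_rate(list("abc"), list("abd"))
```

Since insertions are counted against the reference length, the error rate can exceed 1 when the hypothesis is much longer than the reference.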

When evaluating the latency, we run FastLR on 1 NVIDIA 1080Ti GPU in inference.

Method                           WER     CER
Autoregressive Models
LSTM (Wand et al., 2016)         20.4%   /
LipNet (Assael et al., 2016)     4.8%    1.9%
WAS (Chung et al., 2017)         3.0%    /
Non-Autoregressive Models
NAR-LR (base)                    25.8%   13.6%
FastLR (Ours)                    4.5%    2.4%
Table 3. The word error rate (WER) and character error rate (CER) on GRID.

Method                               WER     CER
Autoregressive Models
WAS (Chung et al., 2017)             70.4%   /
BLSTM+CTC (Afouras et al., 2018a)    76.5%   40.6%
FC-15 (Afouras et al., 2018a)        64.8%   33.9%
LIBS (Zhao et al., 2019)             65.3%   45.5%
TM-seq2seq (Afouras et al., 2018b)   61.7%   43.5%
Non-Autoregressive Models
NAR-LR (base)                        81.3%   57.9%
FastLR (Ours)                        67.2%   46.9%
Table 4. The word error rate (WER) and character error rate (CER) on LRS2. denotes baselines from our reproduction.

4.6. Main Results

We conduct experiments with FastLR, and compare it with the autoregressive lipreading baseline and some mainstream state-of-the-art AR lipreading models on the GRID and LRS2 datasets respectively. As for TM-seq2seq (Afouras et al., 2018b), it has the same Transformer settings as FastLR and works as the AR teacher for NPD selection. We also apply CTC loss and the BPE technique to TM-seq2seq for a fair comparison. Our reproduction has a weaker performance than the results reported in (Afouras et al., 2018a; Afouras et al., 2018b), because we do not have access to MV-LRS, a non-public dataset containing individual word excerpts of frequent words that is used by (Afouras et al., 2018a; Afouras et al., 2018b). Thus, we do not adopt the curriculum learning strategy of Afouras et al. (2018a).

The results on the two datasets are listed in Tables 3 and 4. We can see that 1) WAS (Chung et al., 2017) and TM-seq2seq (Afouras et al., 2018a; Afouras et al., 2018b) obtain the best results among autoregressive lipreading models on GRID and LRS2 respectively. Compared with them, FastLR only has a slight absolute WER increase of 1.5% and 5.5% respectively. 2) Moreover, on the GRID dataset, FastLR outperforms LipNet (Assael et al., 2016) by 0.3% WER, and exceeds LSTM (Wand et al., 2016) by a notable margin; on the LRS2 dataset, FastLR achieves better WER scores than WAS and BLSTM+CTC (Afouras et al., 2018a) and keeps comparable performance with LIBS (Zhao et al., 2019) and FC-15 (Afouras et al., 2018a). In addition, compared with LIBS, we do not introduce any distillation method in the training stage, and compared with WAS and TM-seq2seq, we do not leverage information from other datasets beyond GRID and LRS2.

We also propose a baseline non-autoregressive lipreading model without Integrate-and-Fire module termed as NAR-LR (base), and conduct experiments for comparison. As the result shows, FastLR outperforms this NAR baseline distinctly. The overview of the design for NAR-LR (base) is shown in Figure 3.

Figure 3. The NAR-LR (base) model. It is also based on Transformer (Vaswani et al., 2017), but generates outputs in the non-autoregressive manner (Gu et al., 2017). It sends a series of duplicated trainable tensors into the decoder to generate target tokens. The repeat count of this trainable tensor is denoted as “m”. For training, “m” is set to the ground truth length, but for inference, we estimate it by a linear function of the input length, and the parameters are obtained using the least squares method on the train set. The auxiliary AR decoder is the same as FastLR’s. The CTC decoder contains FC layers and CTC loss.
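The length estimate of the NAR-LR (base) model can be sketched as an ordinary least-squares line fit. The snippet below is only an illustration: the function names are ours, and the (frame count, token count) pairs are hypothetical toy values rather than statistics of any dataset.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_length(n_frames, slope, intercept):
    """Estimated repeat count m for an input of n_frames, at least one token."""
    return max(1, round(slope * n_frames + intercept))

# Hypothetical training pairs: number of video frames and target length.
slope, intercept = fit_line([20, 40, 60, 80], [5, 10, 15, 20])
m = predict_length(100, slope, intercept)
```

Such an estimate depends only on the input length, whereas the I&F module derives the output length from the content of the encoder states.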

4.7. Speedup

In this section, we compare the average inference latency of FastLR with that of the autoregressive Transformer lipreading model. And then, we analyze the relationship between speedup and the length of the predicted sequence.

4.7.1. Average Latency Comparison

The average latency is measured as the average time in seconds required to decode one sentence on the test set of the LRS2 dataset. We record the inference latency and corresponding recognition accuracy of TM-seq2seq (Afouras et al., 2018a; Afouras et al., 2018b), FastLR without NPD and FastLR with NPD9, which are listed in Table 5.

The result shows that FastLR speeds up the inference by 11.94× without NPD, and by 5.81× with NPD9 on average, compared with TM-seq2seq, which has a similar number of model parameters. Note that the latency is calculated excluding the computation cost of data pre-processing and the visual front-end.

Method                               WER     Latency (s)   Speedup
TM-seq2seq (Afouras et al., 2018b)   61.7%   0.215         1.00×
FastLR (no NPD)                      73.2%   0.018         11.94×
FastLR (NPD 9)                       67.2%   0.037         5.81×
Table 5. The comparison of average inference latency and corresponding recognition accuracy. The evaluation is conducted on a server with 1 NVIDIA 1080Ti GPU, 12 Intel Xeon CPU. The batch size is set to 1. The average lengths of the generated sub-word sequences are all about 14.
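The speedup column is simply the baseline latency divided by each model's latency. A short check using the latencies reported in Table 5 (the dictionary layout is ours):

```python
# Average decoding latency per sentence in seconds, as reported in Table 5.
latency = {
    "TM-seq2seq": 0.215,
    "FastLR (no NPD)": 0.018,
    "FastLR (NPD 9)": 0.037,
}

baseline = latency["TM-seq2seq"]
# Speedup relative to the autoregressive baseline.
speedup = {name: baseline / seconds for name, seconds in latency.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

With these values, adding NPD roughly doubles the latency of FastLR (0.018 s to 0.037 s) while still leaving it well below the autoregressive baseline.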

4.7.2. Relationship between Speedup and Length

During inference, the autoregressive model generates the target tokens one by one, but the non-autoregressive model speeds up the inference by increasing parallelization in the generation process. Thus, the longer the target sequence is, the higher the speedup. We visualize the relationship between the inference latency and the length of the predicted sub-word sequence in Figure 4. It can be seen that the inference latency increases distinctly with the predicted text length for TM-seq2seq, while it stays nearly constant at a small value for FastLR.

Then, we bucket the test sequences of length within , and calculate their average inference latency for TM-seq2seq and FastLR to obtain the maximum speedup on the LRS2 test set. The results are 0.494s and 0.045s for TM-seq2seq and FastLR (NPD9) respectively, which shows that FastLR (NPD9) achieves a speedup of up to 10.97× on the LRS2 test set, thanks to the parallel generation, which is insensitive to sequence length.

(a) TM-seq2seq (Afouras et al., 2018b)
(b) FastLR (NPD9)
Figure 4. Relationship between Inference time (second) and Predicted Text Length for TM-seq2seq (Afouras et al., 2018b) and FastLR.

5. Analysis

In this section, we first conduct ablation experiments on LRS2 to verify the significance of all proposed methods in FastLR. The experiments are listed in Table 6. Then we visualize the encoder-decoder attention map of the well-trained AR model (TM-seq2seq) and the acoustic boundary detected by the I&F module in FastLR to check whether the I&F module works well.

Model                       WER     CER
Naive Model with I&F        >1      75.2%
+Aux                        93.1%   64.9%
+Aux+BPE                    75.7%   52.7%
+Aux+BPE+CTC                73.2%   51.4%
+Aux+BPE+CTC+NPD (FastLR)   67.2%   46.9%
Table 6. The ablation studies on LRS2 dataset. Naive Model with I&F is the naive lipreading model only with Integrate-and-Fire. “+Aux” means adding the auxiliary autoregressive task. We add our methods and evaluate their effectiveness progressively.

5.1. The Effectiveness of Auxiliary AR Task

As shown in Table 6, the naive lipreading model with Integrate-and-Fire is not able to converge well, due to the difficulty of learning the weight embedding in the I&F module from meaningless encoder hidden states. Thus, the autoregressive lipreading model works as the auxiliary model to enhance the feature representation ability of the encoder, and guides the non-autoregressive model with Integrate-and-Fire to learn the right alignments (weight embedding). With this, the model with I&F begins to generate meaningful target sequences, with a WER of 93.1% and a CER of 64.9% (Row 3).

5.2. The Effectiveness of Byte-Pair Encoding

BPE makes each token contain more language information and reduces the dependency among tokens compared with character-level encoding. In addition, we observe that the speaking rate in the BBC videos is rather fast, so that one target token (a character, if BPE is not used) corresponds to only a few video frames. BPE compresses the target sequence, which makes it easier for the Integrate-and-Fire module to find the acoustic-level alignments.
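To illustrate how BPE shortens the target sequence, here is a toy sketch of learning and applying merge rules. The corpus, the number of merges, and the function names are all hypothetical, and the sub-word vocabulary used in the experiments is not reproduced here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Greedily learn merge rules from the most frequent adjacent pairs."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; real sub-word vocabularies are learned from far more text.
corpus = ["lower", "lowest", "low", "newer", "newest"]
merges = learn_bpe(corpus, num_merges=10)
tokens = encode("lowest", merges)
```

With fewer tokens per word, each token is spread over more video frames, which is the effect the paragraph above attributes to BPE.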

From Table 6 (Row 4), it can be seen that BPE reduces the word error rate and character error rate to 75.7% and 52.7% respectively, which means that BPE helps the model gain the ability to generate understandable sentences.

5.3. The Effectiveness of CTC

The results (Row 5) show that adding an auxiliary connectionist temporal classification (CTC) decoder with CTC loss further boosts the feature representation ability of the encoder and yields a 2.5% absolute decrease in WER. At this point, the model gains considerable recognition accuracy compared with the traditional autoregressive method.
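As a reminder of how CTC relates per-frame outputs to a shorter label sequence, the sketch below applies the standard CTC collapsing rule (merge consecutive repeats, then drop blanks) to a frame-level label path. The blank symbol and the example path are assumptions for illustration; the auxiliary CTC decoder itself is not reproduced here.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a frame-level CTC path to its output label sequence."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only separate repeated labels.
    return [label for label in merged if label != blank]


path = list("hh_e_ll_llo")
print("".join(ctc_collapse(path)))  # hello
```

The blank between the two runs of "l" is what keeps the double letter from being merged into one.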

5.4. The Effectiveness of NPD for I&F

Table 6 (Row 6) shows that using NPD for I&F boosts the performance effectively. We also study the effect of increasing the number of candidates for FastLR on the LRS2 dataset, as shown in Figure 5. It can be seen that, when setting the number of candidates to , the accuracy peaks. Finally, FastLR achieves considerable accuracy compared with the state-of-the-art autoregressive lipreading model.
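The rescoring step common to noisy parallel decoding can be sketched as below: several candidates are generated in parallel, scored by a stronger model, and the best one is kept. How FastLR produces its candidates is not described in this section, so the candidate list, the length normalisation, and the toy scorer standing in for an AR model are all assumptions.

```python
def select_best(candidates, score_fn):
    """Return the candidate with the highest length-normalised score.

    `score_fn` maps a token list to a log-probability-like score; in NPD it
    would be a rescoring model, here it is a hypothetical stand-in.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: score_fn(c) / max(len(c), 1))


# Hypothetical unigram log-probabilities used as a toy rescoring model.
TOY_LOGP = {"the": -1.0, "cat": -2.0, "sat": -1.0}


def toy_score(tokens):
    """Sum of toy log-probabilities; unknown tokens get a large penalty."""
    return sum(TOY_LOGP.get(t, -10.0) for t in tokens)


candidates = [
    ["the", "cat"],
    ["the", "cat", "sat"],
    ["the", "cap", "sat"],
]
best = select_best(candidates, toy_score)
```

More candidates give the scorer more options, but each extra candidate also adds decoding and rescoring work, which is the trade-off studied in Figure 5.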

Figure 5. The effect of the number of candidates on WER and CER for the FastLR model.

5.5. The Visualization of Boundary Detection

We visualize the encoder-decoder attention map in Figure 6, which is obtained from the well-trained AR TM-seq2seq. The attention map illustrates the alignment between source video frames and the corresponding target sub-word sequence.

The figure shows that the video frames between two adjacent horizontal red lines roughly coincide with the frames that the corresponding target token attends to. This means that the "accumulate and detect" part of the I&F module locates the acoustic boundaries well and predicts the sequence length correctly.

Figure 6. An example of the visualization of the encoder-decoder attention map and the acoustic boundaries. The horizontal red lines represent the acoustic boundaries detected by the I&F module in FastLR, which split the video frames into discrete segments.
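The "accumulate and detect" behaviour can be sketched as follows. This is a minimal illustration in the style of continuous integrate-and-fire, assuming a firing threshold of 1.0 and per-frame weights in [0, 1] (so at most one boundary per frame); the function name and the plain-list representation are hypothetical and do not reproduce the FastLR implementation.

```python
def integrate_and_fire(features, weights, threshold=1.0):
    """Accumulate per-frame weights and fire a token embedding at each
    threshold crossing.

    Returns the fired embeddings and the frame indices of the detected
    boundaries; the number of firings is the predicted sequence length.
    """
    dim = len(features[0])
    acc_w = 0.0             # weight accumulated for the current token
    acc_vec = [0.0] * dim   # weighted sum of frames for the current token
    fired, boundaries = [], []
    for t, (h, w) in enumerate(zip(features, weights)):
        if acc_w + w < threshold:
            # Still inside the current token: keep accumulating.
            acc_w += w
            acc_vec = [a + w * x for a, x in zip(acc_vec, h)]
        else:
            # Boundary detected: part of this frame completes the token.
            used = threshold - acc_w
            fired.append([a + used * x for a, x in zip(acc_vec, h)])
            boundaries.append(t)
            # The remainder of the frame's weight opens the next token.
            acc_w = w - used
            acc_vec = [acc_w * x for x in h]
    return fired, boundaries


# Toy one-dimensional features and weights (hypothetical values).
frames = [[1.0], [1.0], [2.0], [2.0], [3.0]]
alphas = [0.5, 0.5, 0.25, 0.75, 0.5]
embeddings, bounds = integrate_and_fire(frames, alphas)
```

In this toy run the boundaries fall at frames 1 and 3, so two tokens are predicted; the leftover weight of 0.5 at the end does not reach the threshold and produces no token.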

6. Conclusion

In this work, we developed FastLR, a non-autoregressive lipreading system with an Integrate-and-Fire module, which recognizes silent source video and generates all target text tokens in parallel. FastLR consists of a visual front-end, a visual feature encoder and a text decoder for simultaneous generation. To bridge the accuracy gap between FastLR and the state-of-the-art autoregressive lipreading model, we introduce the I&F module to encode the continuous visual features into discrete token embeddings by locating the acoustic boundaries. In addition, we propose several methods, including an auxiliary AR task and CTC loss, to boost the feature representation ability of the encoder. Finally, we design NPD for I&F and bring Byte-Pair Encoding into lipreading; both methods alleviate the problems caused by the removal of the AR language model. Experiments on the GRID and LRS2 lipreading datasets show that FastLR outperforms the NAR-LR baseline and has only a slight WER increase compared with the state-of-the-art AR model, which demonstrates the effectiveness of our method for NAR lipreading.

In the future, we will continue to work on better approximating the true target distribution for the NAR lipreading task, and on designing more flexible policies that bridge the gap between AR and NAR models while keeping the fast speed of NAR generation.


Acknowledgments

This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), Zhejiang Natural Science Foundation (LR19F020006), National Natural Science Foundation of China (Grant No. 61836002, No. U1611461 and No. 61751209) and the Fundamental Research Funds for the Central Universities (2020QNA5024). This work was also partially supported by the Language and Speech Innovation Lab of HUAWEI Cloud.