Lip reading aims to infer speech content from visual information such as lip movements, and is robust to ubiquitous acoustic noise [luo2020pseudo-convolutional]. This special property makes it important for noisy and silent scenarios [chung2017lip, zhang2019spatio-temporal, StafylakisT17, zhang2020can]. With the rapid development of deep learning technologies and the recent emergence of several large-scale lip reading datasets [chung2016lip, yang2019lrw-1000, chung2017lip, zhao2019cascade], there has been exciting progress in lip reading in recent years, and many appealing results have been achieved [assael2017lipnet, afouras2019deep, zhang2019spatio-temporal, zhao2019cascade, zhao2019hearing]. However, almost all of the existing methods focus on the problem of monolingual lip reading. In this paper, we explore multilingual lip reading, which, to the best of our knowledge, has not been considered before.
Due to the structure of our vocal organs, the set of distinguishable pronunciations we can make is finite. So the set of distinguishable pronunciations in each language is also finite, leading to many common pronunciations shared among different languages. For example, there are as many as 32 phonemes that exist in both English and Mandarin words, as shown in Figure 1.(a). Some example words with their corresponding phoneme-based representations are listed in Figure 1.(b). The same phonemes generate the same or similar lip movements even when the speakers speak different languages, as shown in Figure 1.(c). This fact makes it possible to perform synergized learning for multilingual lip reading.
Each language has its own rules for composing different units (characters or phonemes) into a valid word. If we could make a lip reading model master the rules of a language, it should perform well when applied to that language. Learning the composition rules of a language can be viewed as a fill-in-the-blank problem: if the model can accurately predict any missing unit given its previous and subsequent units, no matter which language the input is in, it should be effective for multilingual lip reading. Therefore, we cast the problem of learning the rules of a specific language as learning to infer the target unit given its bidirectional context. A synchronous bidirectional learning (SBL) block is introduced to construct the decoder module that performs this prediction.
Overall, the main contributions of this paper could be summarized as follows.
We explore the possibility of multilingual lip reading. As far as we know, this is the first work to tackle the lip reading problem in a multilingual setting using large-scale lip reading datasets.
We introduce phonemes as the modeling units, which act as a bridge to link different languages. Then a synchronous bidirectional learning (SBL) framework is proposed to enhance the learning of the rules of each language. Finally, an extra task of judging the language type is introduced to make the learning more targeted at different languages.
With a thorough evaluation and comparison, our method shows obvious advantages for multilingual lip reading, and outperforms the existing state-of-the-art performance by a large margin on both the English and the Mandarin benchmarks.
2 Related Work
2.1 Lip Reading
Lip reading has made substantial progress in recent years [zhang2019spatio-temporal, zhang2020can, zhao2020mutual, luo2020pseudo-convolutional, afouras2019deep, chung2017lip, chung2016lip, petridis2018end, zhao2019cascade, zhao2019hearing, StafylakisT17, assael2017lipnet, wang2019multi-grained, xiao2020deformation]. Existing lip reading methods can be broadly divided into two categories: decoding based methods and classification based methods.
In the first category, lip reading is treated as a sequence (image sequence) to sequence (text sequence) problem, and seq2seq methods based on RNNs or the Transformer [vaswani2017attention] are used. For example, Chung et al. [chung2017lip] were the first to use an RNN based encoder-decoder framework for lip reading and achieved several appealing results. Luo et al. [luo2020pseudo-convolutional] proposed to introduce the CER (character error rate) into the RNN based seq2seq model to minimize the gap between the optimization target and the final evaluation measure. Zhang et al. [zhang2019spatio-temporal] introduced a temporal focal block to capture short-range dependencies on top of the Transformer-seq2seq model, and reported the current best accuracy among decoding based methods on the English benchmark LRW after pre-training on LRS2-BBC and LRS3-TED.
In the second category, the whole input image sequence is taken as a single object belonging to a word class, and lip reading is considered as a video classification problem. In 2017, Stafylakis et al. [StafylakisT17] proposed an effective architecture to predict the word label of the input sequence. Their model includes three parts, a 3D convolutional layer, a ResNet module, and an LSTM module, which has been used widely in subsequent lip reading methods [petridis2018end, zhang2019spatio-temporal, luo2020pseudo-convolutional, xiao2020deformation, zhao2020mutual]. In 2019, Wang [wang2019multi-grained] proposed a multi-grained spatio-temporal modeling method to tackle the lip reading task at three different granularities. Later, the work in [zhang2020can] performed a comprehensive study to evaluate the effects of different facial regions and obtained accuracies of 85.02% and 45.24% on LRW and LRW-1000 respectively, which are the current best results on these two datasets.
2.2 Multilingual Learning
Multilingual learning has been studied for a long time in the fields of speech recognition and natural language processing. Dalmia et al. [dalmia2018sequence-based] found that end-to-end multilingual training of seq2seq models is beneficial to low-resource cross-lingual speech recognition. In 2018, Zhou et al. [zhou2018multilingual] proposed to use sub-words as the modeling unit based on the Transformer architecture [vaswani2017attention] and achieved good results for multilingual speech recognition. Tan et al. [tanxu2019-iclr] proposed to first train a separate model for each language and then perform knowledge distillation from each language-specific model to the multilingual model for multilingual translation. Wang et al. [sokolov-nmt-G2P] presented a Grapheme-to-Phoneme (G2P) model which shares the same encoder and decoder across multiple languages by utilizing a combination of universal symbol inventories of Latin-like alphabets and cross-linguistically shared feature representations. Toshniwal et al. [Toshniwal-multilingual-asr] took a union of language-specific grapheme sets and trained a grapheme-based sequence-to-sequence model on combined data from different languages for speech recognition. Inspired by these methods, we explore synergized learning of multilingual lip reading and introduce a synchronous bidirectional learning framework to solve the multilingual lip reading problem, which has not been touched before.
3 The Proposed SBL Framework
As shown in Figure 2.(a), we build our model based on the Transformer architecture. The whole model can be divided into two main parts: the video encoder and the synchronous bidirectional decoder. The video encoder is responsible for encoding the input image sequence to generate a preliminary representation of the sequence. Then a synchronous bidirectional decoder is introduced to predict the left-to-right and the right-to-left output sequence simultaneously, based on the outputs of the encoder. In the decoding process, the model has to learn from the bidirectional context of both the previous and future time steps to generate correct predictions.
3.1 The Video Encoder
As shown in Figure 2.(a), the video encoder mainly consists of two modules, the CNN based front-end and stacked self-attention blocks. The CNN based front-end is used to capture the short-term spatial-temporal patterns in the image sequence, and stacked self-attention blocks are used to weight the patterns at different time steps in the visual sequence to obtain the final representation of the encoder.
Specifically, we denote the input image sequence as $x = (x_1, x_2, \dots, x_T)$, where $T$ is the number of frames in the sequence. We use $H$ and $W$ to denote the height and width of the frames respectively. The image sequence is first input to a 3D-convolutional layer, followed by a max-pooling layer. The spatial dimension is reduced to a quarter of the input size, while the temporal dimension is kept the same as the input; that is, the output has spatial-temporal dimensions $T \times \frac{H}{4} \times \frac{W}{4}$. Then a ResNet-18 [hekaiming_resnet_16] module is introduced to output a 512-d vector at each time step, which is then added to the corresponding positional encodings and used as the input to the subsequent self-attention blocks. The final output of the last self-attention block is employed as the final representation of the input sequence. We denote this output as $f_{enc} = \mathrm{Enc}(x; \theta_{enc})$, where $\theta_{enc}$ represents the parameters of the whole encoder, and $f_{enc}$ is composed of $T$ 512-d vectors.
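As a minimal sketch, the front-end described above can be written in PyTorch as follows. The exact kernel sizes, channel widths, and the small per-frame trunk standing in for the ResNet-18 are illustrative assumptions, chosen only to reproduce the stated shapes: spatial size reduced to a quarter, temporal length preserved, and one 512-d vector per frame.

```python
import torch
import torch.nn as nn

class VisualFrontEnd(nn.Module):
    # Hypothetical sketch: a 3D conv + max-pool reduce H and W to a quarter
    # while keeping T, then a per-frame trunk (a placeholder standing in for
    # the ResNet-18) maps each frame to a 512-d vector.
    def __init__(self):
        super().__init__()
        self.conv3d = nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                                stride=(1, 2, 2), padding=(2, 3, 3))
        self.pool = nn.MaxPool3d(kernel_size=(1, 3, 3),
                                 stride=(1, 2, 2), padding=(0, 1, 1))
        # Placeholder for ResNet-18: global average pool + linear to 512-d.
        self.trunk = nn.Sequential(nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1),
                                   nn.Flatten(),
                                   nn.Linear(64, 512))

    def forward(self, x):                  # x: (B, 1, T, H, W)
        x = self.pool(self.conv3d(x))      # (B, 64, T, H/4, W/4)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x)                  # (B*T, 512)
        return x.view(b, t, 512)           # one 512-d vector per frame

frontend = VisualFrontEnd()
feats = frontend(torch.randn(2, 1, 29, 88, 88))
print(feats.shape)  # torch.Size([2, 29, 512])
```

Positional encodings are then added to these per-frame vectors before the self-attention blocks.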
The structure of each self-attention block is the same as in [vaswani2017attention]. As shown in Figure 2.(b), the output of each self-attention block can be described as:

$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(QW_i^Q)(KW_i^K)^T}{\sqrt{d_k}}\right)(VW_i^V) \quad (1)$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$

where $\mathrm{head}_i$ and $\mathrm{MultiHead}(Q, K, V)$ denote the output of a single attention head and of the multi-head attention respectively, and $Q$, $K$ and $V$ are the same, corresponding to the input of the block. Each head has its own learnable parameters $W_i^Q$, $W_i^K$ and $W_i^V$, and $W^O$ is another learnable parameter to combine the outputs from all the heads. In this paper, we employ $h$ attention heads in the encoder, and the dimensions $d_k$ and $d_v$ of the query matrix $Q$, the key matrix $K$, and the value matrix $V$ are set to $d_{model}/h$.
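For illustration, one self-attention step of an encoder block can be reproduced with PyTorch's built-in multi-head attention module. Here $d_{model} = 512$ matches the 512-d frame features above, while $h = 8$ heads is an assumed value, not necessarily the paper's configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of encoder self-attention (Vaswani et al. style);
# d_model = 512 matches the 512-d frame features, heads = 8 is assumed.
d_model, heads = 512, 8
attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

x = torch.randn(2, 29, d_model)   # (batch, T, d_model): block input
# Self-attention: query, key and value are all the block's own input.
out, weights = attn(x, x, x)
print(out.shape, weights.shape)   # (2, 29, 512) and (2, 29, 29)
```

The returned weights are the per-step attention distributions averaged over heads, i.e. how strongly each time step attends to every other step.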
3.2 The Synchronous Bidirectional Decoder
Given the representation of each input sequence produced by the encoder, the synchronous bidirectional (SB) decoder is introduced to predict the phoneme at each output time step $t$. As shown in Figure 2.(a), the decoder is composed of several stacked synchronous bidirectional learning (SBL) blocks. Each block combines the context from both the previous and future time steps to generate its output for the next SBL block. We use the context of the left-to-right (L2R) and the right-to-left (R2L) directions in the label sequence to express the previous and future context.
As shown in Figure 2.(a), each SBL block contains two branches: the L2R branch and the R2L branch. Each branch consists of a self-attention block and a vanilla-attention block. The self-attention block performs a weighted sum of its input at different time steps, where the weights are obtained from the input itself, as shown in Eq.(1) with $Q$, $K$ and $V$ all equal to the input. The vanilla-attention block is similar to the self-attention block, and also outputs a weighted sum of its input (the output of the preceding self-attention block) at different time steps, but the weights are generated according to the output of the encoder: as shown in Figure 2, $K$ and $V$ are equal to the output of the encoder and $Q$ is the output of the preceding self-attention block.
To unify the context from the L2R and the R2L branches, there are some differences between the first block and the subsequent blocks. We denote the ground-truth label sequence as $y = (y_1, y_2, \dots, y_L)$, where each sequence is padded to the same length $L$. The architecture can be described as follows.
For the first SBL block, the sequence of phoneme units before the current time step, together with their corresponding positional encodings, is used as the input. For example, when predicting the target unit at time step $t$, the partial L2R and R2L sequences of previously emitted units are used as the input to the first L2R and R2L branch respectively, as shown in Figure 2.(a). In the test process, each input unit is equal to the corresponding previous prediction result of its branch. For training, we introduce probabilistic teacher forcing, where each input unit is equal to the ground-truth unit with a probability $p$, and to the previous prediction result with a probability $1-p$.
For the SBL blocks after the first one, the input would be reversed at first to generate the R2L branch’s input.
For all the SBL blocks, the output of the R2L branch is reversed and summed element-wise with the output of the L2R branch, and the summation is used as the output of the corresponding block.
Finally, two fully connected layers are introduced to project the output of the two branches of the last SBL block to the unified phoneme space respectively.
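The behaviour of an SBL block after the first one can be sketched as below: the block input is reversed in time to form the R2L branch's input, each branch runs self-attention followed by vanilla attention over the encoder output, and the R2L output is reversed back and summed element-wise with the L2R output. Causal masking, residual connections, layer normalization, and positional encodings are omitted for brevity, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SBLBlock(nn.Module):
    # Illustrative sketch of a synchronous bidirectional learning block
    # (for blocks after the first); masking and normalization omitted.
    def __init__(self, d_model=512, heads=8):
        super().__init__()
        self.self_l2r = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.self_r2l = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_l2r = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_r2l = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, x, enc):
        # The R2L branch input is the block input reversed in time.
        x_rev = torch.flip(x, dims=[1])
        l2r, _ = self.self_l2r(x, x, x)
        r2l, _ = self.self_r2l(x_rev, x_rev, x_rev)
        # Vanilla attention: keys and values come from the encoder output.
        l2r, _ = self.cross_l2r(l2r, enc, enc)
        r2l, _ = self.cross_r2l(r2l, enc, enc)
        # Reverse the R2L output and sum it element-wise with the L2R output.
        return l2r + torch.flip(r2l, dims=[1])

block = SBLBlock()
out = block(torch.randn(2, 10, 512),   # decoder-side input, 10 steps
            torch.randn(2, 29, 512))   # encoder output, 29 frames
print(out.shape)  # torch.Size([2, 10, 512])
```

The summed output then serves as the input of the next SBL block, which again reverses it to build its own R2L input.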
To make the learning process more targeted and effective for each specific language, we introduce an extra task that makes the model predict the language identity of the input by adding an extra indicator label to the ground-truth sequence. With this prediction task, the model can be guided to learn in a more targeted and effective manner for different languages.
3.3 Learning Process
Given the above pipeline, the model is learned by minimizing the cross-entropy loss at each time step. Specifically, we use $\hat{y}^{\rightarrow}_t$ and $\hat{y}^{\leftarrow}_t$ to denote the prediction results of the L2R and R2L branch at time step $t$ respectively. The model is then optimized to minimize $\mathcal{L}$ as follows:

$\mathcal{L} = \lambda_1 \mathcal{L}_{L2R} + \lambda_2 \mathcal{L}_{R2L}$

where $\mathcal{L}_{L2R}$ and $\mathcal{L}_{R2L}$ are the sums over time steps of the cross-entropy between $\hat{y}^{\rightarrow}_t$ (resp. $\hat{y}^{\leftarrow}_t$) and the corresponding ground-truth unit. $\lambda_1$ and $\lambda_2$ are used to balance the learning of the two branches, and both of them are set to 0.5 in our experiments.
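A sketch of this bidirectional loss, assuming the R2L branch is supervised with the time-reversed target sequence; the batch size, sequence length, and phoneme-set size used here are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the bidirectional training loss: one cross-entropy term per
# branch, balanced by lambda1 = lambda2 = 0.5 as in the experiments.
def sbl_loss(logits_l2r, logits_r2l, targets, lam1=0.5, lam2=0.5):
    # logits_*: (B, T, num_phonemes); targets: (B, T) phoneme indices.
    loss_l2r = F.cross_entropy(logits_l2r.transpose(1, 2), targets)
    # Assumed: the R2L branch sees the time-reversed target sequence.
    loss_r2l = F.cross_entropy(logits_r2l.transpose(1, 2),
                               torch.flip(targets, dims=[1]))
    return lam1 * loss_l2r + lam2 * loss_r2l

loss = sbl_loss(torch.randn(2, 10, 56), torch.randn(2, 10, 56),
                torch.randint(0, 56, (2, 10)))
print(loss.item())
```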
For the test process, we introduce the entropy of the prediction results to measure the quality of the L2R and the R2L branch. A small entropy of the prediction distribution indicates a strong belief of the model; in the ideal case, the prediction is like a one-hot vector. We define $E^{\rightarrow}_t$ and $E^{\leftarrow}_t$ as the entropy of the prediction distribution of the L2R and R2L branch at time step $t$ respectively. The combined result is denoted as C-Bi, where C means combining. It is achieved by taking, at each time step, the prediction of the branch with the smaller entropy:

$C\text{-}Bi_t = \begin{cases} \hat{y}^{\rightarrow}_t, & \text{if } E^{\rightarrow}_t \le E^{\leftarrow}_t \\ \hat{y}^{\leftarrow}_t, & \text{otherwise.} \end{cases}$

In the setting where an extra language indicator variable is introduced, the above combination is performed only when the judgements of the language identity from the two branches are the same. If the judgements differ, we directly adopt the result from the branch with the smaller entropy over the language-identity judgement.
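The entropy-based combination can be sketched as follows, assuming the R2L distributions have already been reversed to align in time with the L2R ones; the shapes used here are illustrative.

```python
import torch

# Sketch of entropy-based fusion at test time: at each step the prediction
# of the branch with the lower entropy (stronger belief) is kept.
def entropy(p, eps=1e-8):                   # p: (B, T, C) probabilities
    return -(p * (p + eps).log()).sum(-1)   # (B, T) per-step entropy

def combine_bidirectional(p_l2r, p_r2l):
    e_l2r, e_r2l = entropy(p_l2r), entropy(p_r2l)
    pick_l2r = (e_l2r <= e_r2l).unsqueeze(-1)   # (B, T, 1) selector
    probs = torch.where(pick_l2r, p_l2r, p_r2l)
    return probs.argmax(-1)                      # (B, T) phoneme indices

p1 = torch.softmax(torch.randn(2, 10, 56), dim=-1)
p2 = torch.softmax(torch.randn(2, 10, 56), dim=-1)
pred = combine_bidirectional(p1, p2)
print(pred.shape)  # torch.Size([2, 10])
```

A near one-hot distribution in one branch always wins against a near uniform one in the other, matching the intuition that low entropy signals strong belief.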
4 Experiments

Based on the existing available large-scale lip reading datasets, we evaluate the proposed SBL framework on two lip reading benchmarks: the English dataset LRW and the Mandarin dataset LRW-1000.
English Lip Reading Dataset: LRW [chung2016lip], released in 2016, is the first large-scale English word-level lip reading dataset, covering 500 English words with 1000 training samples per word class. All the videos are collected from BBC TV broadcasts, resulting in various types of speaking conditions in the wild. It has become a popular and influential benchmark for the evaluation of many existing lip reading methods.
Mandarin Lip Reading Dataset: LRW-1000 [yang2019lrw-1000], released in 2019, is a naturally distributed large-scale benchmark for Mandarin word-level lip reading. It contains 1000 Mandarin words and phrases and more than 700 thousand samples in total. The lengths and frequencies of the words are naturally distributed without extra constraints, forcing the model to adapt to the practical case where some words indeed appear more frequently than others.
4.1 Implementation Details
We crop the mouth regions of each frame in LRW in the same way as [petridis2018end]. The images in LRW-1000 are already well cropped and we use them directly without other pre-processing. Each frame is converted to grayscale, resized to 112×112, and then randomly cropped to 88×88 before being fed to the model. Each word in both the English dataset LRW and the Mandarin dataset LRW-1000 is converted to a sequence of phonemes, which is used as the target label sequence. In our paper, we use 40, 47, and 56 phonemes for English only, Mandarin only, and the union of English and Mandarin respectively.
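The frame augmentation can be sketched as below; only the random 88×88 crop is shown, with grayscale conversion and resizing to 112×112 assumed to have happened upstream, and the clip length is illustrative.

```python
import torch

# Illustrative random-crop step of the preprocessing pipeline:
# grayscale clips resized to 112x112 are randomly cropped to 88x88.
def random_crop(frames, size=88):
    # frames: (T, H, W) grayscale video clip.
    t, h, w = frames.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return frames[:, top:top + size, left:left + size]

clip = torch.rand(29, 112, 112)   # 29 resized grayscale frames
cropped = random_crop(clip)
print(cropped.shape)  # torch.Size([29, 88, 88])
```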
In the training phase, the Adam [adam14] optimizer is employed with default parameters. The learning rate is changed automatically in the training process according to the same schedule strategy as [vaswani2017attention]. To speed up training and improve the generalization performance of the model, we set the teacher forcing rate $p$ to 0.5. The implementation is based on PyTorch. Dropout with probability 0.5 is applied to each layer in our model.
To compare with other methods more easily, we also adopt the word-level accuracy (Acc.) to measure our performance. For our model, Acc. = 1 − WER, where WER is the word error rate of the prediction.
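This measure can be computed as below, with WER obtained from the standard word-level Levenshtein (edit) distance between the reference and the hypothesis.

```python
# Word error rate via the standard word-level Levenshtein distance;
# accuracy is then reported as Acc = 1 - WER.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

acc = 1 - wer("about", "about")
print(acc)  # 1.0
```

For word-level benchmarks such as LRW, each sample contains a single target word, so WER reduces to a 0/1 error per sample.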
4.2 Ablation Study of the Proposed SBL Framework
In this section, we try to answer two questions by experiments. (1) Is it possible to perform multilingual lip reading at all, given that monolingual lip reading is itself already a very challenging task? (2) Is the proposed SBL framework effective for the synergy of multilingual lip reading, and how much improvement can it bring to the recognition of each language?
I. For Question-1: We construct the following three frameworks for comparison.
I-A. TM (Baseline): For the baseline, we use the video encoder shown in Figure 2.(a), with the decoder as the traditional Transformer [vaswani2017attention]. The model is trained separately on LRW and LRW-1000 to obtain two variants for English and Mandarin lip reading respectively.
I-B. TM-ML: Based on the above baseline model, we train the model on LRW and LRW-1000 combined into a new mixed dataset, where ML refers to training on different languages simultaneously.
I-C. TM-ML-Flag: Based on I-B, we add an indicator variable to introduce an extra task of predicting the language type. Here, we use Flag to express the introduction of this task.
The results are shown in Table 1, where “EN/CHN” and “EN+CHN” mean that the corresponding model is trained with only a single dataset or with both datasets respectively. According to Table 1 and Table 2, we can find that there is a significant improvement when using the mixed data for training. This shows that the joint learning of different languages helps improve the model's capacity and performance for phonemes, thereby enhancing its performance on each individual language. This conclusion is consistent with the results in the related ASR and NLP domains [dalmia2018sequence-based, graves2006connectionist, zhou2018multilingual, li2020towards].
As can be seen from Table 1 and Table 2, there is a further improvement when we introduce an extra language-type prediction task into the learning process. This suggests that explicitly introducing the task of predicting the language type helps the model learn the rules of different languages more effectively.
[Table: Method | Languages | EN_LRW (PER) | CHN_LRW-1000 (PER)]
II. For Question-2: We perform the comparison from two aspects. First, we evaluate our idea that the composition rules of each language can be learned more easily by using bidirectional context and could thus help multilingual lip reading. Then we compare these models with the proposed SBL framework to verify its effectiveness. To this end, we first perform the following comparison.
II-A. TM-ML-BD: Based on the setting of I-B, we introduce an extra decoder module which is targeted to make predictions in a right-to-left direction. We use BD to denote that bi-directional information is used in the learning process.
II-B. TM-ML-BD-Flag: Based on the model II-A, an extra prediction task of language type as setting I-C is introduced in this setting.
The results are shown in Table 1 and Table 2. We can see that there is an obvious improvement when the bi-directional context is introduced: the accuracy increases from 81.03% and 44.58% to 84.12% and 52.61% on LRW and LRW-1000 respectively. This improvement verifies the effectiveness of our idea that the rules of each language can be learned by learning to infer the target phoneme given its bidirectional context. When we introduce the extra task of predicting the language type, the performance is further improved.
[Table: Method | Languages | EN_LRW (Acc) | CHN_LRW-1000 (Acc)]
Based on the above evaluation, we make a further comparison of the above models with our proposed SBL, which unifies the two-directional context together in a single block, instead of two separate single-directional modules. For this target, we perform the following experiments.
II-C. SBL-First: In this setting, we only introduce the SBL module at the first block of the decoder, and keep the subsequent blocks as the traditional blocks in the vanilla Transformer decoder [vaswani2017attention].
II-D. SBL-All: In this setting, the architecture is totally the same as shown in Figure 2.(a), where each block in the decoder is designed to combine the bidirectional context together.
II-E. SBL-All-Flag: This setting is almost the same as II-D, except that an extra task of predicting the language type is introduced to the learning process.
The results are shown in Table 3. As we can see, performance is much better even when we introduce the SBL block only at the first layer: SBL-First achieves 85.26% and 54.35% on LRW and LRW-1000 respectively, already outperforming TM-ML-BD-Flag, which introduces two separate uni-directional decoder branches together with the extra language-type prediction task. When we introduce the SBL block throughout the whole decoder together with the extra language-type prediction task, SBL-All-Flag outperforms the others by a large margin on both datasets.
[Table: Method | Languages | EN_LRW (Acc) | CHN_LRW-1000 (Acc)]
4.3 Comparison with the State of the Art
In this part, we perform a comparison with other related state-of-the-art lip reading methods, including both seq2seq based decoding methods and classification based methods. The results in Table 4 show that our SBL framework outperforms the state-of-the-art performance by a large margin, especially on LRW-1000.
5 Conclusion

Inspired by the related multilingual studies in the fields of automatic speech recognition and NLP, we explore the possibility of multilingual synergized lip reading. We introduce phonemes as the modeling units, and a new synchronous bidirectional learning manner to unify the two-directional context and enhance the learning of each language. With a thorough evaluation of the proposed method from several aspects, we demonstrate the effectiveness of the proposed SBL framework. Our work also achieves new state-of-the-art performance on both of the challenging datasets LRW and LRW-1000.