Mutual Information Maximization for Effective Lip Reading

03/13/2020 ∙ by Xing Zhao, et al.

Lip reading has received increasing research interest in recent years due to the rapid development of deep learning and its wide range of potential applications. Good performance on the lip reading task depends heavily on how effectively the representation captures lip movement information while resisting the noise caused by changes in pose, lighting conditions, the speaker's appearance and so on. To this end, we propose to introduce mutual information constraints at both the local feature level and the global sequence level to strengthen the relation between the features and the speech content. On the one hand, we constrain the features generated at each time step to carry a strong relation with the speech content by imposing a local mutual information maximization constraint (LMIM), which improves the model's ability to discover fine-grained lip movements and the fine-grained differences among words with similar pronunciation, such as “spend” and “spending”. On the other hand, we introduce a mutual information maximization constraint at the global sequence level (GMIM), so that the model pays more attention to discriminative key frames related to the speech content and less to the various kinds of noise that appear in the speaking process. By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading. To verify this method, we evaluate it on two large-scale benchmarks. We perform a detailed analysis and comparison on several aspects, including the comparison of the LMIM and GMIM with the baseline, the visualization of the learned representation and so on. The results not only demonstrate the effectiveness of the proposed method but also set new state-of-the-art performance on both benchmarks.




I Introduction

Lip reading is the task of inferring the speech content in a video using only visual information, especially the lip movements. It has many important practical applications, such as assisting audio-based speech recognition [4], biometric authentication [2], aiding hearing-impaired people [24], and so on. Following the huge success of deep learning based models on several related computer vision tasks, recent works have begun to introduce powerful deep models for effective lip reading [2, 20, 19, 16]. For example, [20] proposed an end-to-end deep learning architecture for word-level visual speech recognition that combines a convolutional network with a bidirectional Long Short-Term Memory network, yielding an improvement of 6.8% in accuracy over previous results. Besides the great impetus of deep learning technologies, several large-scale lip reading datasets have been released in recent years, such as LRW [6], LRW-1000 [25], LRS2 [5], LRS3 [1], and so on. These datasets have also contributed significantly to the recent progress of lip reading.

(a) An example of a video sample with the annotated label “ABOUT”
(b) Another two samples of “ABOUT”
Fig. 1: Word-level lip reading is a challenging task. (a) The actual frames of the annotated word “ABOUT” include only the frames at time steps T = 12 to 19. (b) Samples with the same word label often show greatly diversified appearance changes.

In this paper, we focus on word-level lip reading, which is a basic but important branch of the lip reading domain. For this task, each input video is annotated with a single word label even when there are other words in the same video, as shown in Fig. 1. For example, the video sample in Fig. 1(a), which includes 29 frames in total, is annotated as “ABOUT”, but the actual frames of the word “ABOUT” are only those at time steps T = 12 to 19, shown in the red boxes. The frames before and after this interval correspond to the words “JUST” and “TEN” respectively, not “ABOUT”. This is consistent with the practical case, where the exact boundary of a single word is always hard to obtain. This property requires a good lip reading model to learn the latent but consistent patterns reflected in different videos with the same word label, and thus to pay more attention to the valid key frames and less to other unrelated frames. Besides the challenge of inaccurate word boundaries, the video samples corresponding to the same word label often show greatly diversified appearance changes, as shown in Fig. 1(b). All these properties require the lip reading model to resist the noise in the sequence and capture the consistent latent patterns under various speech conditions.

Meanwhile, due to the limited effective area of lip movements, different words may look similar while being spoken. In particular, the existence of homophones, where different words may look the same or quite similar, adds considerable difficulty to this task. These properties require the model to discover the fine-grained, frame-level differences related to different words in order to distinguish each word from the others.

To solve the above issues, we introduce mutual information maximization (MIM) at different levels to help the model learn both robust and discriminative representations for effective lip reading. On the one hand, the representation at the global sequence level is required to have maximal mutual information with the speech content, which forces the model to learn the latent, consistent global patterns of the same word label across different samples while being robust to variations of pose, lighting and other label-unrelated conditions. On the other hand, the features at the local frame level are required to maximize their mutual information with the speech content, which strengthens the word-related fine-grained movements at each time step and further enlarges the differences between words. By combining these two types of constraints, the model can automatically find and distinguish the valid, important frames for the target word and ignore other unrelated frames. Finally, we evaluate the proposed approach on two large-scale benchmarks, LRW and LRW-1000, whose samples are all collected from various TV shows with a wide variation of speaking conditions. The results show new state-of-the-art performance on both challenging datasets when compared with other related work under the same condition of using no extra data or extra pre-trained models.

The proposed method could also be easily adapted to existing models for other tasks, which may bring some useful insights to the community.

Fig. 2: The base architecture.

II Related Work

In this section, we provide an overview of the related literature on two closely related aspects, lip reading and mutual information based methods.

II-A Lip Reading

Before deep learning became popular, many methods achieved encouraging results using specifically designed, hand-engineered features, such as optical flow [18], lip landmark tracking [8], and so on. Classification was often done by a Support Vector Machine together with Hidden Markov Models (HMMs) [3, 15]. We refer to [26, 17] for a detailed review of these non-deep methods for lip reading. These earlier works provided an important impetus to the advancement of lip reading at the early stage.

With the rapid development of deep learning in recent years, more and more researchers have turned to deep neural networks for the lip reading task.

The 2D-CNN was the first type of network applied to lip reading, where it is used to extract features from each frame. [13] proposed a system consisting of a CNN and a hidden Markov model with a Gaussian mixture observation model (GMM-HMM). The outputs of the CNN are regarded as visual feature sequences, and the GMM-HMM is applied to these sequences for word classification. In later works [21, 5], long short-term memory (LSTM) or gated recurrent unit (GRU) networks are used to model the patterns in the temporal dimension. CNN-LSTM based models, which can be trained in an end-to-end manner, have gradually become a standard processing pipeline for lip reading.

However, the mouth regions in different frames are not always aligned at exactly the same position, so the context shown in nearby frames plays an important role in effective lip reading. Several methods introduce the 3D convolution operation to tackle this problem [16, 19, 25]. For example, LipNet [2] employed a 3D-CNN at the front-end on the visual frames and obtained remarkable performance for lip reading. Stafylakis et al. [20] combined a 3D-CNN and a 2D-CNN based network to obtain robust features, which achieved a much higher accuracy on the LRW dataset than before.

Besides directly applying different types of deep networks to lip reading, some recent works have begun to design particular modules that address the shortcomings of existing networks. For example, Stafylakis et al. [19] introduced additional word boundary information to improve the performance on the word-level LRW dataset. [5] employed the attention mechanism to select key frames in a sequence-to-sequence model. Wand et al. [22] improved the accuracy of lip reading by domain-adversarial training, which is expected to yield speaker-independent features that benefit the final word classification. However, their method is hard to apply to a large-scale dataset with a large number of speakers. Recently, Wang [23] extracted both frame-level fine-grained features and short-term medium-grained features with a 2D-CNN and a 3D-CNN respectively. In this paper, we propose a new way to achieve effective lip reading. Specifically, we introduce constraints at both the local feature level and the global representation level, so that the model learns fine-grained features and pays attention to key frames respectively.

II-B Mutual Information Mechanism

Mutual information (MI) is a fundamental quantity for measuring the relationship between two random variables. It quantifies the “amount of information” obtained about one random variable when the other random variable is given. Based on this property, the mutual information of two random variables is commonly used as a measure of their mutual dependence. Moreover, unlike the Pearson correlation coefficient, which only captures the degree of linear relationship, mutual information also captures nonlinear statistical dependencies [10], and therefore has a wide range of applications.
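The contrast with linear correlation can be illustrated on a toy discrete example. The sketch below is not part of the original method: the function names `mutual_information` and `covariance` and the toy variable pair (X uniform on {-1, 0, 1}, Y = X²) are assumptions chosen for illustration. Here Y is fully determined by X, yet their covariance, and hence their Pearson correlation, is zero.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information (in nats) of the empirical joint distribution of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def covariance(pairs):
    """Population covariance; it is zero exactly when the Pearson correlation is zero."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n


# Toy data (an assumption): X uniform on {-1, 0, 1} and Y = X^2.
pairs = [(x, x * x) for x in (-1, 0, 1)]
print("covariance:", covariance(pairs))
print("mutual information (nats):", mutual_information(pairs))
```

The covariance vanishes because the dependence is symmetric around zero, while the mutual information equals the entropy of Y, which is strictly positive.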

For example, Ranjay et al. [11] address the visual question answering problem by maximizing the MI between the image, the expected answer and the generated question, which enables the model to select the corresponding informative features. Li et al. [12] maximized the MI between the source and target sentences in the neural machine translation task to improve the diversity of the translation results.

One work somewhat related to ours is Zhu et al. [27], who performed talking face generation by maximizing the MI between the word distribution and the facial/audio distribution. In our work, by contrast, we maximize the MI between the word distribution and the representations at different levels, to guide the model towards learning both robust and discriminative features for the lip reading task, which is quite different from [27].

Fig. 3: The process of training the base network with the proposed LMIM. The total loss is computed by averaging over all the time steps and patches. The gradients from the LMIM will be back-propagated to the Front-end through the features sampled from the ResNet18. The LMIM will be dropped after training.

III The Proposed Mutual Information Maximization for Lip Reading

In this section, we first give an overview of the overall architecture. Then we present how the mutual information constraints are imposed at the different levels. Finally, we describe the optimization process used to learn the model.

III-A The Overall Architecture

Let $X = (x_1, x_2, \ldots, x_T)$ denote the input sequence with $T$ frames in total, where $x_t$ is the feature vector of the $t$-th frame. The task of the model is to classify the input sequence X into one of the $C$ classes, where $C$ is the number of classes. Let $Y$ denote the annotated word label of the sequence, where Y is a $C$-dimensional one-hot vector with a single 1 at the position corresponding to its word label index. We construct our base architecture from two principal components, named the front-end and the back-end respectively, which together allow the whole network to be trained end-to-end.

The Front-end includes a 3D-CNN layer, a spatial pooling layer, a ResNet18 network, and a GAP layer, as shown in Fig. 2. Specifically, given the input image sequence X, a 3D-CNN layer is first applied to the raw frames to perform an initial spatio-temporal alignment of the sequence for effective recognition. A spatial max-pooling layer then follows to compact the features in the spatial domain. Note that we keep the temporal dimension unchanged in this procedure to avoid a further loss of movement information in the sequence, because the duration of each word is always very short. In the next step, we divide the features into $T$ parts and employ a ResNet18 module at each time step to separately extract discriminative features. To improve the ability to capture fine-grained movements related to the spoken word, we impose the mutual information constraint on the pairs of ResNet18 outputs and the annotated label. With their relation to the annotated label maximized, all the features obtained from the ResNet18 module are fed into a global average pooling (GAP) layer and compressed into a $T \times D$-dimensional output, where D is the number of channels of the last layer, which is 512 in this paper.

With the initial representation from the Front-end, the Back-end, as shown in Fig. 2, includes a 3-layer Bi-GRU network and a linear layer to capture and classify the latent patterns of the sequence. A Bi-GRU contains two independent unidirectional GRUs. The input sequence is fed into one GRU in the normal order and into the other GRU in the reverse order. The outputs of the two GRUs are concatenated at each time step to represent the whole sequence. The output of the Bi-GRU is expected to be a global representation of the whole input sequence with dimension $T \times 2N_h$, where $N_h$ is the number of hidden neurons in each GRU. The representation is finally sent to a linear layer for classification. To improve its ability to resist noise and select key frames in the sequence, we impose the second mutual information constraint on this global representation.

III-B Local Mutual Information Maximization (LMIM)

As stated in the previous section, the performance of lip reading is heavily affected by the model's ability to capture local fine-grained lip movements and thus to generate discriminative features that distinguish different words from each other. The MI-based constraint is a promising tool for learning good features because it requires no extra data for training. We therefore introduce Local Mutual Information Maximization (LMIM) on ResNet18 to help the model focus more on the relevant spatial regions at each time step and produce more discriminative features. For lip reading, the local features near the mouth region are crucial for accurate recognition. Therefore, unlike most existing work [11, 27], we maximize the MI on each patch of the feature maps rather than on the whole feature maps.

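Because mutual information is notoriously hard to compute for unknown distributions, we estimate it with the help of a deep network. We follow the Jensen-Shannon (JS) MI estimator [9, 14]:

$$\mathcal{I}^{(JSD)}(X; Y) = \mathbb{E}_{p(x, y)}\left[-\mathrm{sp}\left(-T(x, y)\right)\right] - \mathbb{E}_{p(x)p(y)}\left[\mathrm{sp}\left(T(x, y)\right)\right] \qquad (1)$$

where $\mathrm{sp}(z) = \log(1 + e^{z})$ is the softplus function, $X$ and $Y$ are the two variables between which we want to estimate the MI, and $T$ is a continuous function that we approximate directly with a network. Here $p(x, y)$ is the joint distribution of paired samples $(x, y)$, and $p(x)p(y)$ is the product of the marginals, from which unpaired samples are obtained by sampling $x$ and $y$ independently. Since $-\mathrm{sp}(-z) = \log \sigma(z)$ and $-\mathrm{sp}(z) = \log(1 - \sigma(z))$ for the sigmoid $\sigma$, maximizing the JS MI estimator in (1) is equivalent to minimizing a binary cross-entropy loss in which $\sigma(T(x, y))$ is trained to output 1 for paired samples and 0 for unpaired samples.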
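The link between the JS MI estimator and the binary cross-entropy loss can be checked numerically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the critic outputs T(x, y) are given directly as lists of toy numbers rather than produced by a network, and the expectations are replaced by sample means.

```python
import math


def softplus(z):
    """Numerically stable log(1 + e^z)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def jsd_mi_estimate(paired_scores, unpaired_scores):
    """Sample estimate of E_joint[-sp(-T)] - E_marginals[sp(T)]."""
    pos = sum(-softplus(-t) for t in paired_scores) / len(paired_scores)
    neg = sum(softplus(t) for t in unpaired_scores) / len(unpaired_scores)
    return pos - neg


def bce_loss(paired_scores, unpaired_scores):
    """Binary cross-entropy with target 1 for paired and 0 for unpaired samples."""
    pos = sum(-math.log(sigmoid(t)) for t in paired_scores) / len(paired_scores)
    neg = sum(-math.log(1.0 - sigmoid(t)) for t in unpaired_scores) / len(unpaired_scores)
    return pos + neg


# Toy critic outputs (assumed values, for illustration only).
paired = [2.0, 0.5, 1.3]
unpaired = [-1.0, 0.3, -0.7]
print(jsd_mi_estimate(paired, unpaired), -bce_loss(paired, unpaired))
```

The two printed values agree up to floating-point error, so maximizing the estimator and minimizing the binary cross-entropy are the same optimization problem.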

With the estimation above, the process of the LMIM is shown in Fig. 3. We denote the feature map in the last layer of ResNet18 (which will be sent to the GAP layer) as F with a shape of $H \times W \times D$, where $H$, $W$ and $D$ are the height, the width and the number of channels respectively. Then we divide the feature map F into $H \times W$ local patches, which corresponds to separating the original frame into $H \times W$ patches when the receptive field of each feature is mapped back to the original frame. The label of each sample is expanded by repetition from a one-hot vector of dimension $C$ to the same height and width as $F$. Then we concatenate the labels and features to obtain a representation of dimension $H \times W \times (D + C)$, which is used as the input to the Local Mutual Information Maximization network (LMIM). To obtain the local mutual information at each of the $H \times W$ locations, we employ two convolutional layers with kernel size $1 \times 1$ on the concatenated representation. Then a sigmoid activation is applied to the last layer to produce the value of the mutual information estimate. Note that the network in this step can take any other form, because it is only used to approximate a continuous function $T$. However, the output layer should always be based on a sigmoid activation function in order to employ the binary cross-entropy based estimation. The dimension of the output of the LMIM is $H \times W$, with each number indicating how strongly the corresponding patch is related to the given word label. In the learning process, we expect the output for every patch to be close to 1 (Real) if the features and the label come from the same sample (paired samples), and close to 0 if the label differs from the annotated label of the input sequence (unpaired samples). To collect unpaired samples, we randomly concatenate the features with other labels from the same batch in our implementation.

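Therefore, the optimization for LMIM can be written as a binary cross-entropy loss:

$$\mathcal{L}_{LMIM} = -\mathbb{E}_{p(F, Y)}\left[\log \mathrm{LMIM}(F, Y)\right] - \mathbb{E}_{p(F)p(Y)}\left[\log\left(1 - \mathrm{LMIM}(F, Y)\right)\right],$$

where $\mathrm{LMIM}(F, Y)$ is the sigmoid output of the LMIM network for a patch of the feature map F concatenated with the label Y, the first expectation is taken over paired samples and the second over unpaired samples.

Note that at this stage there is no special processing in the temporal dimension. The features of the T time steps of an input video are sent to the LMIM successively. In the end, the loss is averaged over all time steps to obtain the gradients for the subsequent update.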
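The construction of the paired and unpaired LMIM inputs described in this section can be sketched with plain nested lists. This is an illustrative sketch rather than the authors' implementation: the helper names `concat_label`, `rotate_labels` and `lmim_targets` are hypothetical, and rotating the batch by one position is a deterministic stand-in (an assumption) for the random in-batch label sampling mentioned in the text.

```python
def concat_label(feature_map, one_hot):
    """Append the one-hot label to the feature vector of every patch.

    feature_map is an H x W x D nested list; the result is H x W x (D + C),
    where C is the length of one_hot.
    """
    return [[list(patch) + list(one_hot) for patch in row] for row in feature_map]


def rotate_labels(labels):
    """Give each sample the label of the next sample in the batch.

    Deterministic stand-in for random in-batch sampling (an assumption);
    the caller should skip pairs whose two labels happen to coincide.
    """
    return labels[1:] + labels[:1]


def lmim_targets(height, width, paired):
    """Per-patch targets for the binary cross-entropy: 1 if paired, 0 if unpaired."""
    value = 1.0 if paired else 0.0
    return [[value] * width for _ in range(height)]


# Toy sizes (assumed): a 2 x 3 feature map with 4 channels and 3 word classes.
feature_map = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
paired_input = concat_label(feature_map, [0, 1, 0])
print(len(paired_input), len(paired_input[0]), len(paired_input[0][0]))
```

Each spatial position of the concatenated input carries both the patch feature and the full label, so a network applied independently at every position can score each patch against the label.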

Fig. 4: The process of training the network with the proposed GMIM. When the GMIM is applied, a single-layer LSTM and a linear layer are also added to the Back-end to compute the weight of each frame; they are retained after training, while the GMIM is dropped.

III-C Global Mutual Information Maximization (GMIM)

In each sequence, different frames do not provide an equal amount of valuable information for robust lip reading. In many practical cases, a given sequence contains many frames that correspond to words other than the target word. One popular approach in current related methods is to average over all time steps to obtain the final representation, which leads to inferior performance in practice.

In this paper, we introduce global mutual information maximization on the global representation obtained by the Bi-GRU. Specifically, we introduce an additional LSTM together with a linear layer over the outputs of the Front-end. This additional LSTM assigns a different weight $\beta_t$ to each of the $T$ frames according to the target word. The total architecture is shown in Fig. 4.

Based on the outputs $g_t$ of the 3-layer Bi-GRU, each of dimension $2N_h$ with $N_h$ the number of hidden neurons in each GRU, and the weights $\beta_t$, the final global representation $O$ is obtained as the weighted average of the outputs over the $T$ time steps.

The representation $O$ of dimension $2N_h$ is then sent to a linear layer that transforms it from $2N_h$ to $C$ dimensions, where $C$ is the number of classes, which gives the classification score $\hat{Y}$ of the whole sequence.

For related, valuable key frames, the weight $\beta_t$ should be positive and can take any value in our method. For unrelated frames, we only want the weight to be close to zero rather than negative. Therefore we use ReLU to obtain the weight:

$$h_t = \mathrm{LSTM}(f_t, h_{t-1}), \qquad \beta_t = \mathrm{ReLU}(W h_t + b),$$

where $f_t$ is the output of the GAP layer at time step $t$, $W$ and $b$ are the parameters of the linear layer and $h_t$ denotes the hidden state at time step $t$ of the extra LSTM layer.

To guide the learning of the weights, we constrain the weighted average vector to contain most of the information about the target word. Specifically, we maximize the MI between the weighted average representation $O$ and the annotated label Y, both of which are fed into the global mutual information maximization module (GMIM), which consists of two linear layers and outputs a scalar after a sigmoid activation. Similarly to LMIM, if $O$ and Y come from paired samples, we expect the output of the GMIM to be as large as possible and close to 1 (Real). In other cases, the output is expected to be close to 0 (Fake). So the objective function can be written as:

$$\mathcal{L}_{GMIM} = -\mathbb{E}_{p(O, Y)}\left[\log \mathrm{GMIM}(O, Y)\right] - \mathbb{E}_{p(O)p(Y)}\left[\log\left(1 - \mathrm{GMIM}(O, Y)\right)\right].$$
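The frame weighting described in this section can be sketched as follows. This is a minimal sketch, not the authors' code: the function names are hypothetical, the LSTM hidden states are supplied as toy vectors instead of being computed, and normalizing by the sum of the weights (with a small `eps` to avoid division by zero) is an assumption about how the weighted average is formed.

```python
def relu(z):
    return max(0.0, z)


def frame_weights(hidden_states, w, b):
    """One non-negative weight per frame: ReLU(w . h_t + b)."""
    return [relu(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden_states]


def weighted_average(outputs, weights, eps=1e-8):
    """Weighted average of per-frame output vectors over time.

    Dividing by the sum of the weights is an assumption made for this sketch.
    """
    total = sum(weights) + eps
    dim = len(outputs[0])
    return [sum(wt * o[d] for wt, o in zip(weights, outputs)) / total for d in range(dim)]


# Toy values (assumed): 3 frames, 1-dim hidden states, 2-dim Bi-GRU outputs.
hidden_states = [[1.0], [-1.0], [3.0]]
outputs = [[1.0, 0.0], [100.0, 100.0], [5.0, 4.0]]
weights = frame_weights(hidden_states, w=[1.0], b=0.0)
print(weights)
print(weighted_average(outputs, weights))
```

In this toy case the second frame receives weight zero after the ReLU, so its large output values do not influence the pooled representation, which is the behaviour wanted for frames unrelated to the target word.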


III-D Loss Function

Combining the cross-entropy loss with the LMIM and GMIM objectives, the final loss function for the whole model is:

$$\mathcal{L} = -\sum_{i} Y_i \log \hat{Y}_i + \mathcal{L}_{LMIM} + \mathcal{L}_{GMIM},$$

where the first term is the cross-entropy loss, $Y$ is the one-hot label and $\hat{Y}$ is the predicted class probability vector. Because the three terms in the above equation have similar magnitudes in our experiments, we did not assign different weights to them in our implementation.
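The unweighted sum of the three terms can be written out on toy numbers. This is a sketch under stated assumptions: the function names are hypothetical, the class probabilities and the LMIM/GMIM sigmoid outputs are given directly instead of being produced by networks, and each MI term is taken as the mean binary cross-entropy over the supplied (score, target) pairs.

```python
import math


def cross_entropy(probs, label_index):
    """Cross-entropy of predicted class probabilities against a one-hot label."""
    return -math.log(probs[label_index])


def binary_cross_entropy(score, target):
    """Binary cross-entropy for one sigmoid output; target is 1 (paired) or 0 (unpaired)."""
    return -(target * math.log(score) + (1 - target) * math.log(1 - score))


def total_loss(probs, label_index, lmim_pairs, gmim_pairs):
    """Classification loss plus the two MI terms, all with weight 1."""
    ce = cross_entropy(probs, label_index)
    lmim = sum(binary_cross_entropy(s, t) for s, t in lmim_pairs) / len(lmim_pairs)
    gmim = sum(binary_cross_entropy(s, t) for s, t in gmim_pairs) / len(gmim_pairs)
    return ce + lmim + gmim


# Toy values (assumed): two classes and uninformative scores of 0.5 everywhere.
loss = total_loss([0.5, 0.5], 0, [(0.5, 1), (0.5, 0)], [(0.5, 1), (0.5, 0)])
print(loss)
```

With every score at 0.5 each of the three terms equals log 2, which gives a convenient reference value for an untrained model with two classes.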

IV Experiments

In this section, we first evaluate the performance of our base architecture (baseline), which can be trained more easily than previous methods. Then we conduct a thorough ablation study of the proposed LMIM and GMIM (together, GLMIM) and examine how each of them helps the model obtain better results. We also compare with other state-of-the-art lip reading methods on two large word-level benchmarks. Finally, we visualize the discriminative representations learned with the GLMIM. Codes will be available at

IV-A Datasets

We evaluate our method on two large-scale word-level lip reading benchmarks, LRW and LRW-1000. The samples in both datasets are collected from TV shows and cover a wide range of speaking conditions, including lighting conditions, resolution, pose, gender, make-up, etc.

LRW [6]: Released in 2016, it includes 500 word classes spoken by more than a thousand speakers and displays substantial diversity in speaking conditions. The training set contains 488766 instances, and the validation and test sets contain 25000 instances each. LRW remains a challenging dataset and has been widely used by most existing lip reading methods.

LRW-1000 [25]: This dataset is a large-scale, naturally distributed word-level benchmark with 1000 word classes in total. There are more than 70,000 sample instances in total, with a duration of about 57 hours. The dataset aims to cover the natural variability over different speech modes and imaging conditions in order to incorporate the challenges encountered in practical applications. The samples of the same word are therefore not limited to a pre-specified length range, which allows for various speech rates; this is consistent with the practical case and also brings more challenges.

IV-B Implementation Details

The input frames in our implementation are all cropped or resized to (Each video in LRW contains full face and the resolution is larger than , we cropped the mouth region by directly; LRW1000 only contains the mouth region but the resolution is not fixed, we resized them to

). The kernel size, stride and padding of the first 3D-CNN are

, and respectively. Each GRU or LSTM layer has 1024 hidden units (which means each Bi-GRU contains 2048 neurons). The Adam optimizer is applied for fast convergence. In the training process, the learning rate would decay from 0.0001 to 0.00001 when the accuracy doesn’t increase. Dropout is utilized at each Bi-GRU layer to mitigate the overfitting problem.
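The learning-rate rule described above can be sketched as a single decay step. This is a minimal sketch, not the authors' training code: the function name `next_learning_rate` is hypothetical, and comparing the current accuracy with the best accuracy seen so far once per evaluation is an assumption, since the text only states that the rate decays from 0.0001 to 0.00001 when the accuracy stops increasing.

```python
def next_learning_rate(current_lr, accuracy, best_accuracy, decayed_lr=1e-5):
    """Keep the current rate while accuracy improves; otherwise switch to decayed_lr."""
    if accuracy > best_accuracy:
        return current_lr
    return decayed_lr


# Toy accuracy history (assumed values, for illustration only).
lr = 1e-4
best = 0.0
for acc in [0.60, 0.70, 0.75, 0.75]:
    lr = next_learning_rate(lr, acc, best)
    best = max(best, acc)
print(lr)
```

In the toy history the accuracy stalls at the fourth evaluation, so the rate drops to the lower value at that point and stays there.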

Method | Accuracy
Petridis [16] | 82.00%
Petridis [16] (our re-implementation) | 81.70%
The modified baseline architecture | 82.14%
TABLE I: Comparison of the modified baseline.

IV-C Baseline

We adopt [16] as the base architecture. The accuracy of our re-implementation on LRW is slightly lower than the value reported in the original paper, so we use the modified network described in III-A as our baseline when no MI constraint is used. Unlike [16], we introduce the GAP layer into the modified network so that the front-end and the back-end no longer need to be trained separately. As shown in Table I, our modified architecture achieves an accuracy of 82.14% on the LRW dataset, which is higher than both the accuracy reported in [16] and that of our re-implementation.

IV-D Effect of the LMIM

To evaluate the effectiveness of the proposed LMIM, we train the baseline network with and without the LMIM separately. The LMIM is dropped at test time, which means that the two networks have exactly the same structure during testing. Comparing the accuracy of the two networks, we find that the network trained with the LMIM performs better. Besides the overall accuracy, we further analyze the accuracy of each class. As shown in Table II, most classes show a higher accuracy with the LMIM, and there is a clear improvement on words with similar spellings or pronunciations, such as MAKES/MAKING and POLITICAL/POLITICS. This result shows that the proposed LMIM indeed enables the model to extract local fine-grained features, which is important for distinguishing words with similar pronunciations.
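A per-class comparison of this kind can be computed as in the following sketch. The function names and the toy labels and predictions are assumptions made for illustration and are not taken from the experiments reported here.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Accuracy of each class, computed over the samples whose true label is that class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def improvement(baseline_acc, new_acc):
    """Per-class accuracy difference (new minus baseline) for classes present in both."""
    return {c: new_acc[c] - baseline_acc[c] for c in baseline_acc if c in new_acc}


# Toy data (assumed): four samples from two visually similar words.
labels = ["SPEND", "SPEND", "SPENDING", "SPENDING"]
baseline_pred = ["SPENDING", "SPEND", "SPENDING", "SPENDING"]
lmim_pred = ["SPEND", "SPEND", "SPENDING", "SPENDING"]
gain = improvement(per_class_accuracy(labels, baseline_pred),
                   per_class_accuracy(labels, lmim_pred))
print(gain)
```

The differences are in absolute accuracy, which is how the improvement column of a per-class table is usually read.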

Class | Baseline | Baseline with LMIM | Improvement
MAKES | 62% | 74% | 12%
MAKING | 80% | 92% | 12%
POLITICAL | 82% | 90% | 8%
POLITICS | 84% | 92% | 8%
STAND | 48% | 60% | 12%
STAGE | 70% | 80% | 10%
NORTH | 78% | 90% | 12%
NOTHING | 78% | 86% | 8%
SPEND | 36% | 46% | 10%
SPENDING | 78% | 82% | 4%

TABLE II: Examples of the improvement over words with similar pronunciations.
Fig. 5: We randomly sample three words and show the weights of each frame learned with GMIM. The blue line shows the learned weight for each frame. The red dashed line denotes the word boundary for the target word when its value is 1.

IV-E Effect of the GMIM

The ability to select key frames is essential for lip reading because it is hard to cut a video so that it contains exactly one word. This is why we apply GMIM to make the model pay different amounts of attention to the frames and select the valid key frames. We base the experiments in this part directly on the model trained with LMIM in IV-D because of its excellent ability to extract fine-grained features. For the sake of fairness, the Front-end is fixed and only the Back-end is trained with GMIM. Without any additional word boundary information, we observe that the model learns the key frames precisely and the accuracy increases further. When the Front-end is trained together with the Back-end, we obtain a new state-of-the-art result.

The weights learned with the proposed GMIM are shown in Fig. 5. The horizontal axis represents the temporal dimension of the video, corresponding to its 29 frames. The vertical axis represents the value of the learned weights. The blue line shows the curve formed by the learned weights of the frames. The red dashed line takes the value 1 over the range delimited by the annotated word boundary of the target word. Our model trained with GMIM not only learns the key frames successfully and pays more attention to the frames inside the word boundary, but also allocates a small amount of weight to the frames close to the word boundary to capture context information.

IV-F Comparison with State-of-the-Art Methods

Method | Accuracy
Chung (2018) [7] | 71.50%
Chung (2017) [5] | 76.20%
Petridis (2018) [16] | 82.00%
Stafylakis (2017) [20] | 83.00%
Wang (2019) [23] | 83.34%
Baseline | 82.14%
Baseline+LMIM | 83.33%
The Proposed GLMIM | 84.41%
TABLE III: Comparison with other related work on LRW.

Method | Accuracy
LSTM-5 | 25.76%
D3D (2018) [25] | 34.76%
3D+2D | 38.19%
Wang (2019) [23] | 36.91%
Baseline | 38.35%
Baseline+LMIM | 38.69%
The Proposed GLMIM | 38.79%
TABLE IV: Comparison with other related work on LRW-1000.

In this part, we compare the proposed GLMIM with the current state-of-the-art methods on the two challenging benchmarks, LRW and LRW-1000. On the LRW dataset, although our baseline is not the best, the accuracy improves by about 1.2 percentage points (from 82.14% to 83.33%) after introducing the LMIM, which is expected to capture more discriminative and fine-grained features for the main task. The GMIM further improves the accuracy to 84.41%, mainly owing to its ability to pay different amounts of attention to different frames. Compared with other lip reading methods that also use no extra inputs besides the visual information, as shown in Table III, we obtain the best result and set a new state of the art on the LRW dataset.

LRW-1000 is another challenging large-scale benchmark, with a large variation of speech conditions including lighting conditions, resolution, speaker's age, pose, gender, make-up, etc. The best previously reported result is only 38.19%. Although it is challenging to obtain good performance on this dataset, we achieve an accuracy of 38.79%, which outperforms the existing state-of-the-art results. Table IV gives the accuracy of our models. The improvement from the GMIM is smaller than on LRW. This may be because the number of useless frames in each word sample is smaller in LRW-1000 than in LRW, which reduces the benefit of selecting key frames for each word.

IV-G Visualization

(a) Results before applying the GLMIM
(b) Results after applying the GLMIM
Fig. 6: An example of the visualization of the final representations from the Bi-GRU. With the help of the GLMIM, the architecture produces more discriminative results.

In this section, we further explore the effect of the proposed GLMIM through visualization. Specifically, we randomly choose 6 classes, each containing 20 samples. We feed them to our baseline architecture trained with and without the proposed GLMIM respectively. Then we extract the final representations O, which are sent to the linear layer for classification. We apply PCA to reduce them to 2 dimensions for better visualization. As shown in Fig. 6, the variance among these classes covers only a narrow range before applying GLMIM, while it is enlarged to a clearly wider range after applying GLMIM. The variance among the classes is thus greatly increased by the proposed GLMIM, which makes it easier to distinguish different classes.

V Conclusion

In this paper, we propose a mutual information maximization based method for both local fine-grained feature extraction and global key frame selection. We also modify an existing lip reading model so that it can be trained more easily. We perform a detailed ablation study and obtain the best results on the two largest word-level lip reading datasets.

VI Acknowledgments

This work is partially supported by National Key R&D Program of China (No. 2017YFA0700804) and National Natural Science Foundation of China (No. 61702486, 61876171).