Towards Pose-invariant Lip-Reading

11/14/2019 ∙ by Shiyang Cheng, et al.

Lip-reading models have been significantly improved recently thanks to powerful deep learning architectures. However, most works have focused on frontal or near-frontal views of the mouth. As a consequence, lip-reading performance seriously deteriorates in non-frontal mouth views. In this work, we present a framework for training pose-invariant lip-reading models on synthetic data instead of collecting and annotating non-frontal data, which is costly and tedious. The proposed model significantly outperforms previous approaches on non-frontal views while retaining the superior performance on frontal and near-frontal mouth views. Specifically, we propose to use a 3D Morphable Model (3DMM) to augment LRW, an existing large-scale but mostly frontal dataset, by generating synthetic facial data in arbitrary poses. The newly derived dataset is used to train a state-of-the-art neural network for lip-reading. We conducted a cross-database experiment for isolated word recognition on the LRS2 dataset, and reported an absolute improvement of 2.55%. The benefit of the proposed approach becomes clearer in extreme poses, where an absolute improvement of up to 20.64% is achieved.




1 Introduction

Recently, several deep learning approaches for lip-reading [1, 2, 3, 4, 5] have been presented, replacing the traditional feature extraction process by automatically extracting features from the pixels, and significantly outperforming the traditional approaches. The performance has been further improved by the introduction of end-to-end approaches which attempt to jointly learn the extracted features and perform visual speech classification [6, 7, 8, 9].

The vast majority of the aforementioned works focused on frontal-view lip-reading. As a consequence, the performance of such systems degrades in realistic in-the-wild scenarios where the face might not be frontal. To alleviate this, two different approaches have been followed in the literature. The first one trains classifiers using data from all available views in order to build a generic classifier [10, 11]. The second approach applies a mapping to transform features from non-frontal views to the frontal view. Lucey et al. [12] apply a linear mapping to transform profile-view features to frontal-view features. This approach has been extended to map other views, such as 30°, 45° and 60°, to the frontal view [13] or to the 30° view [14]. However, the performance degrades as the number of features to be generated by the linear mapping increases [12]. A similar approach [15] has been presented recently in which the mouth ROIs are frontalised using generative adversarial networks (instead of predicting frontal-view features). One recent work [16] on multi-view lip-reading combines multiple views of the mouth to improve the performance. However, as it requires multiple cameras, its usage is limited to certain scenarios such as meetings and car environments.

All the above works have been applied to small datasets only. Collecting and annotating a large non-frontal lip-reading database requires tremendous time and effort. As an alternative, in this work, we present an approach that leverages the 3DMM [17] which, starting from the mostly frontal database of LRW [5], enables the generation of synthetic lip-reading data in arbitrary poses. This allows the model to be trained on a large range of poses, which results in a significant performance improvement on non-frontal views.

Figure 1: Pose augmentation pipeline.

The main goal of pose-invariant lip-reading is to reduce the impact of different poses as it is known that the performance decreases when a classifier is trained and tested on different poses.

Our contributions can be summarised as follows: (i) We describe a method to construct a large-pose synthetic database for lip-reading. Our method capitalizes on robust 3DMM fitting [18], which allows us to take as input a frontal facial image and render it in any arbitrary pose. Using this method, we derive a database that extends the large-scale but mostly frontal database of LRW, which we call LRW in Large Poses (LP). (ii) We investigate the effect of image augmentation as a way to further boost performance. (iii) We use the synthetic database to train a state-of-the-art model [19, 20], and show that the new model significantly outperforms its counterpart trained on LRW. We conducted a cross-database experiment for isolated word recognition on the LRS2 database and achieved an improvement of 2.55%. We also show the benefit of the proposed approach in extreme poses, where an improvement of up to 20.64% is achieved.

2 Methodology

2.1 Pose augmentation

Our core idea is to generate large-pose lip-reading data by augmenting LRW [5] which is a large-scale but mostly frontal database. A simple but effective approach to generate profile faces is through the use of 3DMM [17]. The pose augmentation pipeline, demonstrated in Fig. 1, consists of 2 steps: (1) fit the 3DMM into the 2D image; (2) rotate the fitted 3D face to a new angle and render a new image.

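3DMM fitting. Following prior works, we use a combined 3DMM consisting of the Basel [21] and FaceWarehouse [22] models. A typical 3DMM can be expressed as:

$$\mathbf{S} = \bar{\mathbf{S}} + \mathbf{A}_{id}\,\boldsymbol{\alpha}_{id} + \mathbf{A}_{exp}\,\boldsymbol{\alpha}_{exp} \qquad (1)$$

where $\mathbf{S}$ is the reconstructed 3D face, $\bar{\mathbf{S}}$ is the mean 3D face, $\mathbf{A}_{id}$ and $\mathbf{A}_{exp}$ are the eigenbases for facial identity and expression respectively, and $\boldsymbol{\alpha}_{id}$ and $\boldsymbol{\alpha}_{exp}$ are the corresponding parameters. By applying a weak perspective projection on the 3D model, we project the mesh onto the image:

$$V(\mathbf{p}) = f \cdot \mathbf{Pr} \cdot \mathbf{R} \cdot \left(\bar{\mathbf{S}} + \mathbf{A}_{id}\,\boldsymbol{\alpha}_{id} + \mathbf{A}_{exp}\,\boldsymbol{\alpha}_{exp}\right) + \mathbf{t}_{2d} \qquad (2)$$

where $V(\mathbf{p})$ denotes the 2D coordinates of the projected 3D mesh on the image plane, $\mathbf{Pr}$ is the orthographic projection matrix, and $f$, $\mathbf{R}$ and $\mathbf{t}_{2d}$ denote the scale, rotation and translation respectively. We can group the parameters from Eq. 1 and 2 into a single set $\mathbf{p} = [f, \mathbf{R}, \mathbf{t}_{2d}, \boldsymbol{\alpha}_{id}, \boldsymbol{\alpha}_{exp}]$. 3DMM fitting is then defined as the process of recovering $\mathbf{p}$ given the 3DMM and an input 2D facial image. The goal is to recover $\mathbf{p}$ such that the error between the 2D projection and the given image is minimized.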
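As a minimal sketch of the linear 3DMM and the weak perspective projection described above, the following pure-Python code builds a face shape from a mean and two bases and projects it onto the image plane. The function names, the flat `[x0, y0, z0, x1, ...]` vertex layout and the toy two-vertex "face" are assumptions made for illustration; they are not taken from the paper or from the 3DDFA code.

```python
def reconstruct_shape(mean, basis_id, alpha_id, basis_exp, alpha_exp):
    """Shape = mean + identity basis * alpha_id + expression basis * alpha_exp.

    All shapes are flat lists [x0, y0, z0, x1, ...]; each basis is a list of
    such flat vectors, one per component (layout is an assumption).
    """
    shape = list(mean)
    for basis, alpha in ((basis_id, alpha_id), (basis_exp, alpha_exp)):
        for vector, coeff in zip(basis, alpha):
            for i, value in enumerate(vector):
                shape[i] += coeff * value
    return shape


def project_weak_perspective(shape, scale, rotation, t2d):
    """Rotate each vertex, drop z (orthographic projection), scale, translate."""
    points = []
    for i in range(0, len(shape), 3):
        x, y, z = shape[i:i + 3]
        # Only the first two rows of the rotation survive the projection.
        u = rotation[0][0] * x + rotation[0][1] * y + rotation[0][2] * z
        v = rotation[1][0] * x + rotation[1][1] * y + rotation[1][2] * z
        points.append((scale * u + t2d[0], scale * v + t2d[1]))
    return points


# Toy example: two vertices, one identity and one expression component.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_shape = reconstruct_shape(
    mean=[1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    basis_id=[[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]], alpha_id=[2.0],
    basis_exp=[[0.0, 0.0, 0.0, 0.0, 1.0, 0.0]], alpha_exp=[-1.0],
)
toy_points = project_weak_perspective(toy_shape, scale=2.0, rotation=identity, t2d=(1.0, 1.0))
```

Fitting then amounts to searching for the parameters that make the projected points agree with the image; in the paper this search is delegated to the 3DDFA network rather than solved directly.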

Fitting 3DMM into 2D images is a difficult optimization problem, an extensive discussion of which falls beyond the scope of this work. In our case, we use the state-of-the-art 3DDFA method proposed in [18, 23], which trains a fully convolutional network to regress the parameters from an input 2D image in a cascaded manner. In order to provide a better initialization for the 3DMM fitting, we detect 68 facial landmarks for every frame using FAN [24], which is also used to accurately crop the input image.

Rendering of new poses. After fitting the 3DMM to a given facial image, we can use the result to render the same face in a new pose. Unlike most previous face augmentation methods [25, 26, 27] that only synthesize profile faces without background context, we generate profile faces while trying to preserve the original image context. This is essential for training a deep lip-reading neural network that works well with real-world images at test time. To this end, we use the face rendering method of [18]. In particular, the fitted 3D face is used for estimating the depth of anchors in the image background (computed using the method from [28]). Then, the whole image is triangulated to create a new 3D mesh using the estimated depth. Finally, we rotate this mesh in 3D space to generate new facial images in arbitrary poses.

LRW in Large Poses (LP). We apply the techniques described above to obtain a new database for lip-reading, which we call LP. In particular, the 3DMM is fitted to each frame of each video in LRW, after which we estimate the 3D face pose of each frame. For each video, we randomly select two pose increment angles, one in yaw (-45° to 45°) and another in pitch (-30° to 30°). To avoid rendering occluded facial parts with random content, we force the sign of each increment angle to match the sign of the pose in the video's first frame. Finally, we rotate all the frames of this video by the same increment values and render them into a new video. Although we only augment each sequence once (namely doubling the size of LRW), the data for each word still cover a full and continuous range of poses, because each word has nearly 1,000 examples.
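The per-video sampling of pose increments can be sketched as follows. This is an illustration under stated assumptions: the increments are drawn uniformly (the text only says "randomly"), a first-frame angle of exactly zero is treated as positive, and the function names are hypothetical. Rendering itself is not shown.

```python
import random


def sample_pose_increment(first_yaw, first_pitch, rng=random,
                          max_yaw=45.0, max_pitch=30.0):
    """Draw one (yaw, pitch) increment whose signs follow the first frame's pose."""
    def signed(max_abs, reference):
        magnitude = rng.uniform(0.0, max_abs)  # uniform sampling is an assumption
        return magnitude if reference >= 0 else -magnitude

    return signed(max_yaw, first_yaw), signed(max_pitch, first_pitch)


def augment_video_poses(frame_poses, rng=random):
    """Add the same increment to the (yaw, pitch) of every frame in one video."""
    d_yaw, d_pitch = sample_pose_increment(*frame_poses[0], rng=rng)
    return [(yaw + d_yaw, pitch + d_pitch) for yaw, pitch in frame_poses]


# Example: a short clip that starts slightly to the left and looking up.
clip = [(-5.0, 3.0), (-6.0, 2.5), (-4.5, 3.5)]
rotated = augment_video_poses(clip, rng=random.Random(0))
```

Because the sign follows the first frame, a face already turned to one side is only turned further to that side, which is what keeps the renderer from having to invent the hidden half of the face.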

2.2 2D image augmentation

We investigate whether standard image augmentation techniques, widely used in image classification [29], are also beneficial for improving the accuracy of lip-reading models. To our knowledge, the only augmentation techniques used in lip-reading so far are random cropping and flipping. In addition to these techniques, we randomly augment the data during training by applying (a) random scaling, (b) random image degradation, by downsampling the mouth region to 0.4–0.8 of its size and then upsampling it back to the original size, and (c) random placement of rectangular noise patches of size 0.1–0.4 of the mouth region.
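A rough sketch of augmentations (b) and (c) on a grayscale image stored as a list of rows is given below. Several details are assumptions rather than the paper's settings: nearest-neighbour resampling, a single patch per image, patch side lengths (rather than areas) of 0.1–0.4 of the region, and uniform noise in [0, 255].

```python
import random


def degrade(image, factor):
    """Nearest-neighbour downsample to `factor` of the size, then upsample back."""
    h, w = len(image), len(image[0])
    sh, sw = max(1, round(h * factor)), max(1, round(w * factor))
    small = [[image[min(h - 1, int(r * h / sh))][min(w - 1, int(c * w / sw))]
              for c in range(sw)] for r in range(sh)]
    return [[small[min(sh - 1, int(r * sh / h))][min(sw - 1, int(c * sw / w))]
             for c in range(w)] for r in range(h)]


def add_noise_patch(image, rng, min_frac=0.1, max_frac=0.4):
    """Return a copy with one random rectangle overwritten by uniform noise."""
    h, w = len(image), len(image[0])
    ph = max(1, int(h * rng.uniform(min_frac, max_frac)))
    pw = max(1, int(w * rng.uniform(min_frac, max_frac)))
    top, left = rng.randint(0, h - ph), rng.randint(0, w - pw)
    out = [row[:] for row in image]
    for r in range(top, top + ph):
        for c in range(left, left + pw):
            out[r][c] = rng.randint(0, 255)
    return out


rng = random.Random(1)
mouth = [[7] * 10 for _ in range(10)]              # toy 10x10 mouth region
blurred = degrade(mouth, rng.uniform(0.4, 0.8))    # augmentation (b)
occluded = add_noise_patch(mouth, rng)             # augmentation (c)
```

In a real pipeline these operations would normally run on arrays with a proper interpolation routine; the list-based version is only meant to make the two steps explicit.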

3 Visual Speech Recognition

The deep learning model used for visual speech recognition is shown in Fig. 2. It consists of a residual network (ResNet) [29] for automatic feature extraction and a 2-layer Bi-GRU to model the temporal dynamics of the features. The architecture is similar to the ones proposed in [19, 20] which achieve state-of-the-art performance on the LRW database.

Figure 2: The block diagram of our lip reading model.

The first part of our network performs spatio-temporal convolution, which is capable of capturing the short-term dynamics of the mouth region. It consists of a convolutional layer with 64 3D kernels of size 5×7×7 (time/width/height), followed by batch normalization and rectified linear units. This is followed by an 18-layer ResNet to extract the visual features. We have opted for ResNet-18 because preliminary experiments showed it leads to the same performance as ResNet-34 (which is used in the previous studies [19, 20]) but the training time is reduced by 30%. Finally, the output of ResNet-18 is fed to a 2-layer BiGRU, each layer consisting of 512 cells.

4 Experimental Setup

4.1 Databases

Lip Reading in the Wild (LRW) database [5]. LRW is a large-scale audio-visual database that contains 500 different words from over 1,000 speakers. Each utterance has 29 frames, centered around the target word. The database is divided into training, validation and test sets. The training set contains at least 800 utterances for each class, while the validation and test sets contain 50 utterances per class.

Lip Reading Sentences 2 (LRS2) database [30]. LRS2 consists of 224.5 hours of audio-video-text pairs of speaking faces collected from BBC TV shows and news. It is very challenging due to the large variation in utterance length and speaker head pose. LRS2 provides word segmentation for the training set, so we can extract single-word utterances from the database to train and test our models.

4.2 Evaluation protocol

LRW already consists of isolated word utterances, so there is no need for further processing. For LRS2, however, we first need to extract the individual words using the word boundaries provided in the training set. We find the center frame of each word and select a symmetric 29-frame window around it. To filter out unwanted sequences, we apply two extra rules during this procedure: (1) remove segments that are considerably short or considerably long, using thresholds of 5 and 31 frames respectively; (2) remove segments that appear at the very beginning or the very end of the sentence, as determined from the middle frame of the word and the length of the sentence.
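The word extraction rules above can be sketched as a small helper. The exact comparisons are not recoverable from this copy of the text, so the choices here are assumptions: lengths of at most 5 or at least 31 frames are rejected, and a word is treated as too close to the sentence boundary when its 29-frame window would not fit inside the sentence. The function name and the frame-index convention (inclusive, zero-based) are also hypothetical.

```python
def word_window(start, end, num_frames, window=29, min_len=5, max_len=31):
    """Return (first, last) frame indices of a symmetric window around a word.

    `start` and `end` are the inclusive frame indices of the word inside a
    sentence of `num_frames` frames. Returns None when the word is filtered out.
    """
    length = end - start + 1
    if length <= min_len or length >= max_len:   # rule (1), inclusivity assumed
        return None
    centre = (start + end) // 2
    half = window // 2
    first, last = centre - half, centre + half
    if first < 0 or last >= num_frames:          # rule (2), as interpreted here
        return None
    return first, last


kept = word_window(40, 50, num_frames=100)        # word in the middle of a sentence
too_early = word_window(2, 12, num_frames=100)    # window would start before frame 0
too_short = word_window(40, 43, num_frames=100)   # only 4 frames long
```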

We only select the same 500 words as in LRW, resulting in 60,207 word instances. In the cross-database experiment, we use the whole set from LRS2 as our test data. Additionally, we split this data into training, validation and test sets with a ratio of 8:1:1. However, this data is very imbalanced, e.g., some words have over 2,000 examples while others have only a few. To balance the split, we limit the number of training and validation examples to 90 and 10 per word, respectively. The final balanced set contains 23,175, 3,119 and 33,909 examples for training, validation and testing respectively. We refer to it as LRS2-Ba.
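One way to realise such a capped split is sketched below. It assumes that the 8:1:1 split is applied per word and that examples beyond the 90/10 caps are moved to the test set; the latter is inferred from the reported set sizes, not stated in the text. The function and argument names are illustrative.

```python
import random
from collections import defaultdict


def balanced_split(samples, rng, max_train=90, max_val=10, ratio=(8, 1, 1)):
    """Split (word, sample_id) pairs per word, capping train and val sizes.

    Examples that exceed the caps are placed in the test set (an assumption).
    """
    by_word = defaultdict(list)
    for word, sample_id in samples:
        by_word[word].append(sample_id)

    train, val, test = [], [], []
    total = sum(ratio)
    for word, ids in by_word.items():
        ids = ids[:]
        rng.shuffle(ids)
        n_train = min(max_train, len(ids) * ratio[0] // total)
        n_val = min(max_val, len(ids) * ratio[1] // total)
        train += [(word, i) for i in ids[:n_train]]
        val += [(word, i) for i in ids[n_train:n_train + n_val]]
        test += [(word, i) for i in ids[n_train + n_val:]]
    return train, val, test


# Toy data: one very frequent word and one rare word.
toy = [("about", i) for i in range(2000)] + [("zero", i) for i in range(10)]
train_set, val_set, test_set = balanced_split(toy, random.Random(0))
```

With these toy counts the frequent word contributes 90 training and 10 validation examples, and its remaining 1,900 examples end up in the test set, which mirrors why the LRS2-Ba test set is larger than its training set.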

Models              Accuracy (%) on different test sets
                    LRW      LP       LRS2     LRS2-Ba
M[LRW]              82.78    69.86    57.05    54.39
M[LP]               81.67    79.08    57.25    54.43
M[LRW+LP]           83.08    79.38    58.86    56.02
M[LRW]+Aug2D        83.20    72.14    58.84    56.07
M[LRW+LP]+Aug2D     83.08    79.53    59.60    56.78
M[LRW+LRS2-Ba]      82.73    69.62    -        59.59
Table 1: Results on the LRW, LP, LRS2 and LRS2-Ba test sets. Aug2D: 2D image augmentations are applied during training. LRS2-Ba: Balanced partition of LRS2 database.

4.3 Data preprocessing

68 facial points are detected using FAN [24]. All the faces are aligned to a neutral reference frame to remove rotation and scale differences. This is done via an affine transform using 5 landmarks (i.e., eye corners and nose tip). Based on the mouth landmarks, we extract the mouth ROI using a 96×96 bounding box. The same procedure is repeated for the entire data to normalise the faces.

We always include two augmentation techniques that have been shown to be useful in lip-reading [30], i.e., random cropping (with an 88×88 bounding box) and horizontal flipping (with a probability of 0.5). We optionally include the three data augmentation methods introduced in Section 2.2 to investigate their impact on visual speech recognition. All these data augmentation methods are applied at the video level, so the same augmentation setting is used across all frames of a video.
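Applying an augmentation "at the video level" means that the random choices are drawn once per clip and then reused for every frame. A minimal sketch for the crop and flip, with frames stored as lists of rows and a hypothetical function name, could look like this:

```python
import random


def crop_and_flip_video(frames, rng, crop=88, flip_prob=0.5):
    """Sample one crop offset and one flip decision, then apply them to all frames."""
    h, w = len(frames[0]), len(frames[0][0])
    top, left = rng.randint(0, h - crop), rng.randint(0, w - crop)
    flip = rng.random() < flip_prob
    out = []
    for frame in frames:
        rows = [row[left:left + crop] for row in frame[top:top + crop]]
        if flip:
            rows = [row[::-1] for row in rows]   # horizontal mirror
        out.append(rows)
    return out


# Three identical 96x96 toy frames whose pixel value encodes its position.
frame = [[r * 96 + c for c in range(96)] for r in range(96)]
clip = crop_and_flip_video([frame, frame, frame], random.Random(0))
```

Sampling per clip rather than per frame keeps the mouth from jittering between frames, which would otherwise inject motion that has nothing to do with speech.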

4.4 Training details

Training is divided into 3 phases. We first train a model with a temporal convolutional backend. After this, we replace the backend with a 2-layer Bidirectional Gated Recurrent Unit (Bi-GRU), and we train only the Bi-GRU backend, with the weights of the spatio-temporal convolutional layers and ResNet-18 fixed, until the model converges. Finally, we train the full model end-to-end. We employ the Adam optimiser [31] with an initial learning rate of 0.0012. Training uses two NVidia Titan 1080Ti GPUs with a total batch size of 160.

5 Results

5.1 Overall results

Results on the LRW and LRS2 databases are shown in Table 1. We name our models as follows: (1) M[] denotes the training data composition; (2) a model trained with 2D augmentations (described in Section 2.2) is marked with Aug2D. For instance, M[LRW]+Aug2D is trained using only the original LRW data but with additional image augmentations, and M[LP] is trained using only LP data. On the other hand, M[LRW+LP] is trained on the combined set of LRW and LP, with the same number of iterations (in one epoch) as that of the baseline model M[LRW]. In this case, we randomly choose examples from either database and we make sure that the total number of training examples remains the same as in the baseline model M[LRW].
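A simple way to mix the two databases while keeping the epoch size fixed is sketched below. It assumes that every LRW clip has exactly one LP counterpart (as the single augmentation per sequence suggests) and that each clip is drawn from LRW or LP with equal probability; the paper only states that examples are chosen randomly from either database, so both points are assumptions.

```python
import random


def mixed_epoch(lrw_items, lp_items, rng, lrw_prob=0.5):
    """For each clip index, pick either the original (LRW) or its rendered (LP) version."""
    if len(lrw_items) != len(lp_items):
        raise ValueError("expected one LP clip per LRW clip")
    return [lrw if rng.random() < lrw_prob else lp
            for lrw, lp in zip(lrw_items, lp_items)]


lrw = [("lrw", i) for i in range(100)]
lp = [("lp", i) for i in range(100)]
epoch = mixed_epoch(lrw, lp, random.Random(0))
```

The epoch has the same length as the LRW training set, so the number of iterations matches the baseline and any gain cannot be attributed to simply seeing more examples per epoch.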

We observe that combining the LP data with the LRW data improves the performance on the LRS2 database by 1.81% (M[LRW+LP]: 58.86% vs. M[LRW]: 57.05%). Applying extra 2D image augmentations during training (viz. M[LRW]+Aug2D) also achieves a better result (58.84%) than the baseline model. Finally, if we employ both pose and 2D image augmentations (viz. M[LRW+LP]+Aug2D), we obtain the best performance in the cross-database experiments, i.e., 59.60% on LRS2 and 56.78% on LRS2-Ba. Similar conclusions can be reached when testing on the LRW and LP databases.

Furthermore, we report one possible upper-bound performance (59.59%) on the balanced LRS2 (LRS2-Ba), which is obtained by a model trained on the combined training set of LRW and LRS2-Ba. We call this model M[LRW+LRS2-Ba]. Results from all other models are also reported for LRS2-Ba test set. Our best model (viz. M[LRW+LP]+Aug2D) results in an absolute improvement of 2.39% over the baseline model (M[LRW]: 54.39%) while the upper-bound model (M[LRW+LRS2-Ba]) achieves 59.59%. Clearly, both pose and 2D image augmentation improve the performance of the model trained on LRW and tested on LRS2-Ba, without any laborious efforts in collecting new data.

5.2 Pose-wise results on LRS2-Ba

We further report the pose-wise accuracy achieved by our models on the balanced LRS2 data (see Tables 2 and 3). Specifically, for each word utterance, we estimate the 3D pose of every frame and compute the average pose of the sequence. For simplicity, we take the absolute value of the poses, based on which we divide all the sequences into different pose groups. Taking the yaw angle as an example, we create five groups of increasing absolute yaw.
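The grouping step can be sketched as follows. The bin edges used here are placeholders, since the actual group boundaries are not recoverable from this copy of the text, and taking the absolute value of the per-sequence mean (rather than the mean of absolute values) is likewise an assumption.

```python
from bisect import bisect_right


def sequence_pose_group(frame_angles, edges=(10.0, 20.0, 30.0, 40.0)):
    """Return a group index (0 = most frontal) for one word utterance.

    `frame_angles` holds the per-frame yaw (or pitch) in degrees; `edges` are
    hypothetical boundaries between the five groups.
    """
    mean_angle = sum(frame_angles) / len(frame_angles)
    return bisect_right(edges, abs(mean_angle))


frontal = sequence_pose_group([0.0, 2.0, 4.0])
left_profile = sequence_pose_group([-35.0, -33.0, -34.0])
extreme = sequence_pose_group([50.0, 50.0, 50.0])
```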

From these results, it can be observed that: (1) Incorporating our synthetic large-pose database into the training set improves performance across all pose ranges. This is particularly evident in large poses: for instance, M[LRW+LP] results in an absolute improvement over M[LRW] of 6.43% and 15.92% for the two largest yaw groups, respectively. The corresponding absolute improvements for the two largest pitch groups are 20.64% and 15.79%, respectively. (2) Applying extra 2D image augmentations (M[LRW]+Aug2D) improves the accuracy most of the time, though the improvement is not as significant as that of M[LRW+LP] in large poses. (3) Surprisingly, in the case of large poses, the models trained on combined LRW and LP data sometimes outperform even the upper-bound model M[LRW+LRS2-Ba]. This showcases the effectiveness and usefulness of our pose augmentation approach.

Models              Accuracy (%) on different poses (yaw groups, smallest to largest absolute angle)
                    Group 1  Group 2  Group 3  Group 4  Group 5
M[LRW]              58.11    55.18    49.62    41.73    23.07
M[LRW]+Aug2D        60.26    55.64    50.55    44.64    29.41
M[LRW+LP]           59.19    55.77    50.67    48.16    38.99
M[LRW+LP]+Aug2D     59.69    56.24    52.86    49.54    39.68
M[LRW+LRS2-Ba]      63.42    59.24    54.29    49.34    35.76
Table 2: LRS2-Ba test accuracy (%) divided by Yaw angle.

Models              Accuracy (%) on different poses (pitch groups, smallest to largest absolute angle)
                    Group 1  Group 2  Group 3  Group 4  Group 5
M[LRW]              55.77    44.86    28.22    9.52     7.89
M[LRW]+Aug2D        57.35    47.20    32.78    9.52     10.53
M[LRW+LP]           57.08    48.28    38.59    30.16    23.68
M[LRW+LP]+Aug2D     57.77    49.95    37.34    26.98    21.05
M[LRW+LRS2-Ba]      60.79    51.60    34.44    22.22    10.53
Table 3: LRS2-Ba test accuracy (%) divided by Pitch angle.
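
The absolute improvements quoted in the text can be reproduced directly from the M[LRW] and M[LRW+LP] rows of Tables 2 and 3. The short script below is only a worked check of that arithmetic; the dictionary layout and function name are illustrative.

```python
# Accuracy (%) per pose group, copied from Tables 2 and 3 (smallest to largest pose).
yaw = {
    "M[LRW]": [58.11, 55.18, 49.62, 41.73, 23.07],
    "M[LRW+LP]": [59.19, 55.77, 50.67, 48.16, 38.99],
}
pitch = {
    "M[LRW]": [55.77, 44.86, 28.22, 9.52, 7.89],
    "M[LRW+LP]": [57.08, 48.28, 38.59, 30.16, 23.68],
}


def absolute_gain(table, model="M[LRW+LP]", baseline="M[LRW]"):
    """Per-group difference in accuracy, in percentage points."""
    return [round(a - b, 2) for a, b in zip(table[model], table[baseline])]


yaw_gain = absolute_gain(yaw)      # last two entries: 6.43 and 15.92
pitch_gain = absolute_gain(pitch)  # last two entries: 20.64 and 15.79
```

The gain grows from about one point in the most frontal yaw group to roughly 16 to 21 points in the largest pose groups, which is the pattern the section describes.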

6 Conclusion

We have presented a method for pose-invariant lip-reading by constructing large-pose synthetic data. The proposed approach is based on a 3DMM, which allows us to take a frontal facial image and render the face in any arbitrary pose. Augmenting the training set with this method results in improved performance when training on the mostly frontal LRW database and testing on the LRS2 database, which contains a variety of poses. It is worth pointing out that a substantial improvement is observed in extreme poses, beyond 45° in yaw and pitch. In future work we will investigate the performance of the proposed approach on other databases with more extreme poses, such as LRS3, and on continuous visual speech recognition.


  • [1] H. Ninomiya, N. Kitaoka, S. Tamura, Y. Iribe, and K. Takeda, “Integration of deep bottleneck features for audio-visual speech recognition,” in Interspeech, 2015.
  • [2] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y Ng, “Multimodal deep learning,” in ICML, 2011.
  • [3] S. Petridis and M. Pantic, “Deep complementary bottleneck features for visual speech recognition,” in ICASSP, 2016.
  • [4] C. Sui, R. Togneri, and M. Bennamoun, “Extracting deep bottleneck features for visual speech recognition,” in ICASSP, 2015.
  • [5] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in ACCV, 2016.
  • [6] S. Petridis, Z. Li, and M. Pantic, “End-to-end visual speech recognition with LSTMs,” in ICASSP, 2017.
  • [7] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in CVPR, 2017.
  • [8] M. Wand, J. Koutník, and J. Schmidhuber, “Lipreading with long short-term memory,” in ICASSP, 2016.
  • [9] Y. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, “Lipnet: End-to-end sentence-level lipreading,” arXiv preprint arXiv:1611.01599, 2016.
  • [10] P. Lucey, S. Sridharan, and D. B. Dean, “Continuous pose-invariant lipreading,” in Interspeech, 2008.
  • [11] J. S. Chung and A. P Zisserman, “Lip reading in profile,” in BMVC, 2017.
  • [12] P. Lucey, G. Potamianos, and S. Sridharan, “An extended pose-invariant lipreading system,” in AVSP-W, 2007.
  • [13] V. Estellers and J. P. Thiran, “Multipose audio-visual speech recognition,” in EURASIP, 2011.
  • [14] Y. Lan, B. J. Theobald, and R. Harvey, “View independent computer lip-reading,” in ICME, 2012.
  • [15] A. Koumparoulis and G. Potamianos, “Deep view2view mapping for view-invariant lipreading,” in IEEE SLT-W, 2018.
  • [16] S. Petridis, Y. Wang, Z. Li, and M. Pantic, “End-to-end multi-view lipreading,” in BMVC, 2017.
  • [17] V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE T-PAMI, vol. 25, no. 9, pp. 1063–1074, 2003.
  • [18] X. Zhu, X. Liu, Z. Lei, and S. Z Li, “Face alignment in full pose range: A 3d total solution,” IEEE T-PAMI, vol. 41, no. 1, pp. 78–92, 2019.
  • [19] T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” in Interspeech, 2017.
  • [20] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, “End-to-end audiovisual speech recognition,” in ICASSP, 2018.
  • [21] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, “A 3D face model for pose and illumination invariant face recognition,” in IEEE AVSS, 2009.
  • [22] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou, “Facewarehouse: A 3D facial expression database for visual computing,” IEEE TVCG, vol. 20, no. 3, pp. 413–425, 2014.
  • [23] J. Guo, X. Zhu, and Z. Lei, “3DDFA,” 2018.
  • [24] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3d facial landmarks),” in ICCV, 2017.
  • [25] J. Zhao, L. Xiong, P. K. Jayashree, J. Li, F. Zhao, Z. Wang, P. S. Pranata, P. S. Shen, S. Yan, and J. Feng, “Dual-agent gans for photorealistic and identity preserving profile face synthesis,” in NIPS, 2017.
  • [26] I. Masi, A. T. Tran, T. Hassner, J. T. Leksut, and G. Medioni, “Do we really need to collect millions of faces for effective face recognition?,” in ECCV, 2016.
  • [27] J. Deng, S. Cheng, N. Xue, Y. Zhou, and S. Zafeiriou, “Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition,” in CVPR, 2018.
  • [28] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z Li, “High-fidelity pose and expression normalization for face recognition in the wild,” in CVPR, 2015.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [30] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE T-PAMI, 2018.
  • [31] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, 2015.