End-to-End Visual Speech Recognition for Small-Scale Datasets

04/02/2019 ∙ by Stavros Petridis, et al. ∙ Imperial College London

Traditional visual speech recognition systems consist of two stages: feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification remains limited. In addition, most of the existing methods require large amounts of data in order to achieve state-of-the-art performance; otherwise they under-perform. In this work, we present an end-to-end visual speech recognition system based on fully-connected layers and Long Short-Term Memory (LSTM) networks which is suitable for small-scale datasets. The model consists of two streams which extract features directly from the mouth and difference images, respectively. The temporal dynamics in each stream are modelled by a Bidirectional LSTM (BLSTM) and the fusion of the two streams takes place via another BLSTM. An absolute improvement of 0.6%, 3.4%, 3.9% and 11.4% over the state-of-the-art is reported on the OuluVS2, CUAVE, AVLetters and AVLetters2 databases, respectively.

1 Introduction

Visual speech recognition or lip-reading is the process of recognising speech by observing only the lip movements, i.e., the audio signal is ignored. The first works in the field (Zhao et al., 2009; Potamianos et al., 2003; Dupont and Luettin, 2000; Matthews et al., 2002) extract features from a mouth region of interest (ROI) and attempt to model their dynamics in order to recognise speech. Lip-reading systems can enable the use of silent interfaces and also enhance acoustic speech recognition in noisy environments since the visual signal is not affected by noise.

Traditionally, visual speech recognition systems have consisted of two stages: feature extraction from the mouth region of interest (ROI) and classification (Potamianos et al., 2003; Dupont and Luettin, 2000; Zhou et al., 2011). Dimensionality reduction/compression methods, like the Discrete Cosine Transform (DCT), are the most common feature extraction approach and result in a compact representation of the mouth ROI. In the second stage, the temporal evolution of the features is modelled by a dynamic classifier, like Hidden Markov Models (HMMs) or Long Short-Term Memory (LSTM) recurrent neural networks.

Several deep learning approaches (Ninomiya et al., 2015; Ngiam et al., 2011; Petridis and Pantic, 2016; Sui et al., 2015b; Chung and Zisserman, 2016a) have recently been presented which automatically extract features from the pixels and replace the traditional feature extraction stage. A few end-to-end approaches have also been proposed which attempt to jointly learn the extracted features and perform visual speech classification (Petridis et al., 2017a; Chung et al., 2017; Wand et al., 2016; Assael et al., 2016; Stafylakis and Tzimiropoulos, 2017). This has led to a new generation of deep-learning-based lipreading systems which significantly outperform the traditional approaches.

The vast majority of modern deep learning approaches require large amounts of data in order to achieve state-of-the-art performance, and their success on smaller datasets has been modest. This has led some researchers to claim that deep learning methods do not perform well on simple tasks and small-scale datasets, and hence that traditional visual speech recognition methods are a better choice when large datasets are not available (Fernandez-Lopez and Sukno, 2018).

In this paper, an end-to-end visual speech recognition system is presented which simultaneously learns the feature extraction and classification stages and is suitable for small-scale datasets where large deep models do not perform well. The model is an improved version of the one presented in our previous work (Petridis et al., 2017a) and consists of two streams. One stream encodes static information and uses raw mouth ROIs as input. The other stream encodes local temporal dynamics and takes difference (diff) images as input. The temporal dynamics in each stream are modelled by a BLSTM and stream fusion takes place via another BLSTM.

We perform experiments on four different datasets, OuluVS2, CUAVE, AVLetters and AVLetters2, which were the main lip-reading benchmarks before the introduction of very large lip-reading datasets and on which traditional lip-reading methods still achieve competitive results. A significant absolute improvement over the state-of-the-art classification rate is reported on all datasets.

Figure 1: Overview of the end-to-end visual speech recognition system. Two streams are used for feature extraction directly from the raw images. The first stream extracts features from the raw mouth ROI and the second stream from the diff mouth ROI in order to capture local temporal dynamics. The Δ and ΔΔ features are also computed and appended to the bottleneck layer. The encoding layers are pre-trained using RBMs. The temporal dynamics are modelled by a BLSTM in each stream. A BLSTM is used to fuse the information from both streams and provides a label for each input frame.

2 Related Work

In the first generation of deep models, deep bottleneck architectures (Ngiam et al., 2011; Hu et al., 2016; Ninomiya et al., 2015; Mroueh et al., 2015; Takashima et al., 2016; Petridis and Pantic, 2016) were used to reduce the dimensionality of various visual and audio features extracted from the mouth ROIs and the audio signal. These features were then fed to a classifier like a Support Vector Machine (SVM) or an HMM. Ngiam et al. (Ngiam et al., 2011) applied principal component analysis (PCA) to the mouth ROIs and extracted bottleneck features with a deep autoencoder. The utterance features were then fed to an SVM, ignoring the temporal dynamics of the speech. Ninomiya et al. (Ninomiya et al., 2015) followed a similar approach but the temporal dynamics were taken into account by an HMM. Another similar approach was proposed by Sui et al. (Sui et al., 2015b), who extracted bottleneck features from local binary patterns which were concatenated with DCT features and fed to an HMM. Similar ideas have also been proposed for audiovisual speech recognition (Huang and Kingsbury, 2013; Mroueh et al., 2015; Sui et al., 2015a), where a shared representation of the input audio and visual features is extracted from the bottleneck layer.

In the second generation of deep models, deep bottleneck architectures were used which extract bottleneck features directly from the pixels. Li et al. (Li et al., 2016) extracted bottleneck features from dynamic representations of images with a convolutional neural network (CNN), which were then fed to an HMM for classification. In our previous work (Petridis and Pantic, 2016), bottleneck features were extracted directly from raw mouth ROIs by a deep feedforward network and then fed to an LSTM network for classification. Noda et al. (Noda et al., 2015) predicted the phoneme that corresponds to an input mouth ROI using a CNN, and then an HMM was used together with audio features in order to classify an utterance.

In the third generation of deep models, a few end-to-end works have been presented which extract features directly from the mouth ROI pixels and perform classification. The main approaches can be divided into two groups. In the first one, fully connected layers are used to extract features and LSTM layers model the temporal dynamics of the sequence (Petridis et al., 2017a; Wand et al., 2016). In the second group, either 3D CNNs (Assael et al., 2016; Shillingford et al., 2018) or 3D convolutional layers followed by residual networks (ResNets) (Stafylakis and Tzimiropoulos, 2017) are used and then combined with LSTMs or Gated Recurrent Units (GRUs).

These works have also been extended to audio-visual models. Chung et al. (Chung et al., 2017) applied an attention mechanism to both the mouth ROIs and MFCCs for continuous speech recognition. Petridis et al. (Petridis et al., 2017b) used fully connected layers together with LSTMs in order to extract features directly from raw images and spectrograms and perform classification on the OuluVS2 database (Anina et al., 2015). This method has been extended to extract features directly from raw images and audio waveforms using ResNets and bidirectional GRUs (Petridis et al., 2018).

3 Databases

The databases used in this study are OuluVS2 (Anina et al., 2015), AVLetters (Matthews et al., 2002), CUAVE (Patterson et al., 2002) and AVLetters2 (Cox et al., 2008). The OuluVS2 database contains 52 speakers, each repeating 10 utterances 3 times, i.e., there are 156 examples per utterance. The following utterances are included in the dataset: “Excuse me”, “Goodbye”, “Hello”, “How are you”, “Nice to meet you”, “See you”, “I am sorry”, “Thank you”, “Have a good time”, “You are welcome”. The provided mouth ROIs are used and they are downscaled to 26 by 44.

The AVLetters database contains 10 speakers, each saying the letters A to Z 3 times, so in total there are 30 utterances per letter. The mouth ROIs are provided and they are downscaled to 30 by 40.

The CUAVE dataset contains 36 subjects who repeat each digit (from 0 to 9) 5 times, i.e., there are 180 examples per digit. The normal portion of the database is used, which contains frontal-facing speakers. The Dlib facial point tracker (Kazemi and Sullivan, 2014) is used to track 68 points on the face. The faces are then registered to a neutral reference frame in order to normalise them for rotation and size differences. An affine transform is used for this purpose based on 5 stable points: two eye corners in each eye and the tip of the nose. The centre of the mouth is located based on the tracked mouth points and a bounding box of size 90 by 150 is used to extract the mouth ROI, which is then downscaled to 30 by 50.

The AVLetters2 database contains 5 speakers, each saying the letters A to Z 7 times, so in total there are 35 utterances per letter. The faces are first tracked and aligned using the same approach as for the CUAVE dataset. Then a bounding box around the mouth centre is extracted and downscaled to 30 by 45.
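The alignment and mouth cropping described above (used for CUAVE and AVLetters2) can be sketched roughly as follows, assuming dlib's 68-point shape predictor and OpenCV; the reference landmark coordinates, the reference frame size and the model file path are illustrative placeholders, not values taken from the paper.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard dlib 68-point landmark model (path is a placeholder).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# The 5 stable points used for registration: two corners of each eye
# and the tip of the nose (indices in the 68-point iBUG scheme).
STABLE_IDX = [36, 39, 42, 45, 30]
# Positions of these points in the neutral reference frame (illustrative values).
REFERENCE = np.float32([[60, 60], [95, 60], [125, 60], [160, 60], [110, 100]])

def extract_mouth_roi(frame, box=(90, 150), out_size=(50, 30)):
    """Return a downscaled mouth ROI from a single video frame, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.float32([[shape.part(i).x, shape.part(i).y] for i in range(68)])

    # Affine registration to the neutral reference frame using the 5 stable points.
    M, _ = cv2.estimateAffinePartial2D(pts[STABLE_IDX], REFERENCE)
    aligned = cv2.warpAffine(gray, M, (220, 220))
    mouth = cv2.transform(pts[48:68].reshape(-1, 1, 2), M).reshape(-1, 2)

    # Fixed-size bounding box around the mouth centre (90 x 150 for CUAVE),
    # then downscaled (to 30 x 50 for CUAVE, 30 x 45 for AVLetters2).
    cx, cy = mouth.mean(axis=0)
    h, w = box
    x0, y0 = int(cx - w / 2), int(cy - h / 2)
    roi = aligned[y0:y0 + h, x0:x0 + w]
    return cv2.resize(roi, out_size)
```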

4 End-To-End Visual Speech Recognition

The proposed deep learning model for visual speech recognition consists of two independent streams, as shown in Fig. 1, which extract features directly from the raw input. Static information is mainly encoded by the first stream which extracts features directly from the raw mouth ROI. Local temporal dynamics are modelled by the second stream which extracts features from the diff mouth ROI (computed by taking the difference between two consecutive frames).

Both streams consist of two parts: an encoder and a BLSTM. The encoder follows a bottleneck architecture which compresses the high-dimensional input image to a low-dimensional representation. It consists of 3 fully connected hidden layers of sizes 2000, 1000 and 500, respectively, with rectified linear units used as activation functions, similarly to (Hinton and Salakhutdinov, 2006). This is followed by a linear bottleneck layer of size 50. The first and second derivatives (Δ and ΔΔ features, respectively) (Young et al., 2002) are also computed, based on the bottleneck features, and they are appended to the bottleneck layer. In this way, the encoding layers are forced to learn compact representations which are not only discriminative for the task at hand but also produce discriminative Δ and ΔΔ features. This is in contrast to the traditional approaches, which have no control over the discriminative power of the Δ and ΔΔ features since these are pre-computed at the input level.
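For reference, the Δ features are computed from the bottleneck features with the standard HTK regression (Young et al., 2002) shown below, and the ΔΔ features are obtained by applying the same regression to the Δ stream; the window size Θ is not stated in the text and is left as a parameter here.

```latex
\Delta c_t = \frac{\sum_{\theta=1}^{\Theta} \theta \, \left( c_{t+\theta} - c_{t-\theta} \right)}
                  {2 \sum_{\theta=1}^{\Theta} \theta^{2}}
```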

The BLSTM layer is added on top of the encoding layers in order to model the temporal dynamics of the features in each stream. The two streams are fused by concatenating the BLSTM outputs of each stream and feeding them to another BLSTM. A softmax layer is used as the output layer which provides a label for each input frame. The entire system is trained end-to-end enabling joint learning of feature extraction and classification layers. In other words, the encoding layers are trained to extract features from mouth ROI pixels which are useful for classification using BLSTMs.
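A minimal PyTorch sketch of this architecture is given below. The class names, the BLSTM hidden size (250) and the simple finite-difference Δ/ΔΔ computation are our own assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

def deltas(x):
    """First-order temporal difference, zero-padded to keep the sequence length.
    x: (batch, time, feat). A simple stand-in for the regression-based deltas."""
    d = x[:, 1:, :] - x[:, :-1, :]
    return torch.cat([torch.zeros_like(x[:, :1, :]), d], dim=1)

class Encoder(nn.Module):
    """Bottleneck encoder: 2000-1000-500 ReLU layers and a linear 50-d bottleneck."""
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 2000), nn.ReLU(),
            nn.Linear(2000, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(),
            nn.Linear(500, 50),            # linear bottleneck layer
        )

    def forward(self, x):                  # x: (batch, time, H*W)
        b, t, d = x.shape
        z = self.net(x.reshape(b * t, d)).reshape(b, t, 50)
        d1 = deltas(z)
        d2 = deltas(d1)
        return torch.cat([z, d1, d2], dim=-1)   # bottleneck + Delta + Delta-Delta

class Stream(nn.Module):
    """One stream: encoder followed by a BLSTM over the feature sequence."""
    def __init__(self, input_dim, hidden=250):
        super().__init__()
        self.encoder = Encoder(input_dim)
        self.blstm = nn.LSTM(150, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        out, _ = self.blstm(self.encoder(x))
        return out                          # (batch, time, 2*hidden)

class TwoStreamLipReader(nn.Module):
    """Raw-image stream + diff-image stream, fused by a second BLSTM and a softmax output."""
    def __init__(self, input_dim, num_classes, hidden=250):
        super().__init__()
        self.raw_stream = Stream(input_dim, hidden)
        self.diff_stream = Stream(input_dim, hidden)
        self.fusion = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, raw, diff):
        fused, _ = self.fusion(torch.cat([self.raw_stream(raw),
                                          self.diff_stream(diff)], dim=-1))
        return self.classifier(fused)       # per-frame logits (softmax applied in the loss)
```

For a 26 by 44 OuluVS2 ROI, `input_dim` would be 26 × 44 = 1144 and `num_classes` would be 10.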

4.1 Single Stream Training

Initialisation: Each stream is first trained independently. Restricted Boltzmann Machines (RBMs) (Hinton, 2012) are used to pre-train the encoding layers in a greedy layer-wise manner. Four Gaussian RBMs are used since the input (pixels) is real-valued and the hidden layers are either rectified linear or linear (bottleneck layer). Each RBM is trained for 20 epochs using contrastive divergence with a mini-batch size of 100 and a fixed learning rate of 0.001. In addition, L2 regularisation is applied with a coefficient of 0.0002.
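A simplified sketch of this pre-training step is shown below, using a Gaussian-Bernoulli RBM trained with one-step contrastive divergence (CD-1) and the hyperparameters stated above; the paper's rectified-linear hidden units are replaced by binary ones here for brevity, so this is an approximation rather than the exact procedure.

```python
import numpy as np

def pretrain_rbm(data, n_hidden, epochs=20, batch_size=100, lr=0.001, l2=0.0002):
    """CD-1 for a Gaussian-Bernoulli RBM (real-valued visibles, binary hiddens).
    data: (num_samples, n_visible), assumed z-normalised; shuffled in place."""
    rng = np.random.default_rng(0)
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            v0 = data[i:i + batch_size]
            # Positive phase.
            p_h0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
            # Negative phase: one Gibbs step, visible reconstruction uses the mean.
            v1 = h0 @ W.T + b_v
            p_h1 = sigmoid(v1 @ W + b_h)
            # Gradient step with L2 weight decay.
            W += lr * ((v0.T @ p_h0 - v1.T @ p_h1) / len(v0) - l2 * W)
            b_v += lr * (v0 - v1).mean(axis=0)
            b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```

The weights of each trained RBM initialise one encoding layer, and its hidden activations serve as the training data for the next RBM in the stack.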

End-to-End Training: A BLSTM is added on top of the pre-trained encoding layers and its weights are initialised using Glorot initialisation (Glorot and Bengio, 2010). The model is then trained end-to-end using Adam with a mini-batch size of 10 utterances. A learning rate of 0.0003 was used since the default one of 0.001 led to unstable training. In order to avoid overfitting, early stopping with a delay of 5 epochs was also used. In addition, gradient clipping was applied to the BLSTM layers.
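The single-stream end-to-end stage could then be sketched as follows (the data loader, the `evaluate` helper, the model exposing its BLSTM as `.blstm`, and the clipping threshold are assumptions made for illustration):

```python
import torch
import torch.nn as nn

def train_single_stream(model, train_loader, val_loader, max_epochs=100):
    # Adam with the reduced learning rate of 0.0003; mini-batches of 10 utterances
    # are assumed to be produced by the data loader.
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    criterion = nn.CrossEntropyLoss()
    best_val, patience, wait = float("inf"), 5, 0   # early stopping with a delay of 5 epochs

    for epoch in range(max_epochs):
        model.train()
        for frames, frame_labels in train_loader:    # frames: (batch, time, H*W)
            logits = model(frames)                   # per-frame logits
            loss = criterion(logits.flatten(0, 1), frame_labels.flatten())
            opt.zero_grad()
            loss.backward()
            # Gradient clipping applied to the BLSTM layer (threshold is an assumption).
            torch.nn.utils.clip_grad_norm_(model.blstm.parameters(), max_norm=5.0)
            opt.step()

        val_loss = evaluate(model, val_loader, criterion)   # hypothetical validation helper
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:                     # stop if no improvement for 5 epochs
                break
```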

4.2 Two-Stream Training

Initialisation: Each stream in the final model is initialised from the corresponding single stream which has already been trained. Then a BLSTM is added on top of the streams in order to fuse their outputs. The BLSTM weights are initialised using Glorot initialisation.

End-to-End Training: Finally, the two-stream model is fine-tuned using Adam with a learning rate of 0.0001. Similarly to single-stream training, early stopping and gradient clipping were also applied.
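In code, this stage could look like the short sketch below, reusing the hypothetical classes from the earlier sketch; `trained_raw_stream` and `trained_diff_stream` denote the two independently trained single streams.

```python
import torch

# Initialise the fusion model from the two independently trained streams
# (any single-stream classifier head is simply discarded).
fusion_model = TwoStreamLipReader(input_dim=26 * 44, num_classes=10)
fusion_model.raw_stream.load_state_dict(trained_raw_stream.state_dict())
fusion_model.diff_stream.load_state_dict(trained_diff_stream.state_dict())

# Fine-tune end-to-end with the smaller learning rate of 0.0001;
# early stopping and gradient clipping are applied as in single-stream training.
optimizer = torch.optim.Adam(fusion_model.parameters(), lr=1e-4)
```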

Method Mean (Std) Max
End-to-End (Raw Image) 91.8 (1.1) 94.7
End-to-End (Diff Image) 90.3 (1.2) 92.2
End-to-End (Raw + Diff Images) 93.6 (1.0) 95.6
Multitask CNN + BLSTM (Han et al., 2017) - 95.0
CNN pretrained on LRW dataset + DA + LSTM (Chung and Zisserman, 2016b) - 94.1
CNN pretrained on LRW dataset + DA (Chung and Zisserman, 2016a) - 93.2
Autoencoder + TDNN + LSTM (Koumparoulis and Potamianos, 2018) - 90.0
maxout-CNN-BLSTM (Fung and Mak, 2018) - 87.6
CNN + DA (Saitoh et al., 2016) - 85.6
CNN + LSTM, Cross-view Training (Lee et al., 2016) - 82.8
End-to-end CNN + LSTM (Lee et al., 2016) - 81.1
DCT + HMM (oul, b) - 74.8
PCA Network + LSTM + GMM-HMM (Zimmermann et al., 2016) - 74.1
Raw Pixels + LVM (oul, b) - 73.0
Table 1: Classification Accuracy on the OuluVS2 database. The end-to-end models are evaluated using the protocol suggested in (oul, a), where 40 subjects are used for training and validation and 12 subjects are used for testing. “Mean (Std)” refers to the mean classification accuracy over ten runs and the corresponding standard deviation, while “Max” reports the maximum classification accuracy. In cross-view training, the model is first trained with data from all views and then fine-tuned with data from the corresponding view. The models of (Chung and Zisserman, 2016a, b) are pretrained on the LRW dataset (Chung and Zisserman, 2016a), which is a large database, and then fine-tuned on OuluVS2. DA: Data Augmentation, TDNN: Time-Delay Neural Network, LVM: Latent Variable Models.
Method Mean (Std) Max
End-to-End (Raw Image) 85.5 (0.7) 86.4
End-to-End (Diff Image) 82.8 (1.0) 83.9
End-to-End (Raw + Diff Images) 87.3 (0.7) 88.4
SVM + MKL (Benhaim et al., 2013) - 85.0
Visemic AAM + HMM (Papandreou et al., 2009) - 83.0
Patch-based Features + HMM (Lucey and Sridharan, 2006) - 77.1
AAM +HMM (Papandreou et al., 2007) - 75.7
Deep Boltzmann Machines + SVM (Srivastava and Salakhutdinov, 2014) 69.0 (1.5) -
Deep Autoencoder + SVM (Ngiam et al., 2011) 68.7 (1.8) -
Table 2: Classification Accuracy on the CUAVE database. The end-to-end model is trained using the same protocol as (Ngiam et al., 2011; Srivastava and Salakhutdinov, 2014) where 18 subjects are used for training and validation and 18 for testing. “Mean (Std)” refers to the mean classification accuracy over ten runs and the corresponding standard deviation, while “Max” reports the maximum classification accuracy. This model is trained on 28 subjects and tested on 8 subjects. These models are trained and tested using a 6-fold cross validation. This model is trained and tested using a 9-fold cross validation.
Method Mean (Std) Max
End-to-End (Raw Image) 65.9 (2.1) 68.9
End-to-End (Diff Image) 57.3 (1.8) 60.0
End-to-End (Raw + Diff Images) 66.3 (2.0) 69.2
Manifold Kernel PLS (Bakry and Elgammal, 2013) - 65.3
Deep Boltzmann Machines + SVM (Srivastava and Salakhutdinov, 2014) 64.7 (2.5) -
RTMRBM (Hu et al., 2016) - 64.6
Deep Autoencoder + SVM (Ngiam et al., 2011) 64.4 (2.4) -
LBP-TOP + SVM (Zhao et al., 2009) - 58.9
DCT + DBNF (Petridis and Pantic, 2016) - 58.1
CNN + LSTM (Feng et al., 2017) 57.7 (0.8) -
Multiscale Spatial Analysis (Matthews et al., 2002) - 44.6
Table 3: Classification Accuracy on the AVLETTERS database. The end-to-end models are trained using the standard evaluation protocol (Matthews et al., 2002), where the first 2 utterances of each subject are used for training and the last one for testing. “Mean (Std)” refers to the mean classification accuracy over ten runs and the corresponding standard deviation, while “Max” reports the maximum classification accuracy. PLS: Partial Least Squares, DBNF: Deep BottleNeck Features, LBP-TOP: Local Binary Patterns-Three Orthogonal Planes.
Method Mean (Std) Max
End-to-End (Raw Image) 36.8 (2.9) 42.6
End-to-End (Diff Image) 28.9 (2.0) 32.2
End-to-End (Raw + Diff Images) 35.0 (1.6) 37.8
RTMRBM (Hu et al., 2016) - 31.2
LBP-TOP + KSRC (Frisky et al., 2015) - 25.9
AAM + HMM (Cox et al., 2008) - 8.3
Table 4: Classification Accuracy on the AVLETTERS2 database. The end-to-end models are trained using the speaker-independent evaluation protocol (Cox et al., 2008), where a 5-fold cross-validation is used. “Mean (Std)” refers to the mean classification accuracy over ten runs and the corresponding standard deviation, while “Max” reports the maximum classification accuracy. RTMRBM: Recurrent Temporal Multimodal Restricted Boltzmann Machine, LBP-TOP: Local Binary Patterns-Three Orthogonal Planes, KSRC: Kernel Sparse Representation Classifier.

5 Experimental Setup

5.1 Evaluation Protocol

First, all datasets are divided into training, validation and test sets. The standard evaluation protocol for the OuluVS2 database is followed, where 40 subjects are used for training and validation and 12 for testing (oul, c). The 40 subjects are then randomly divided into 35 subjects for training and 5 for validation. This means that there are 1050, 150 and 360 training, validation and test utterances, respectively.

For experiments on the CUAVE database, the evaluation protocol suggested in (Ngiam et al., 2011) was used. The odd-numbered subjects (18 in total) are used for testing and the even-numbered subjects are used for training. The latter are further divided into 12 subjects for training and 6 for validation. This means that there are 590, 300 and 900 training, validation and test utterances, respectively.

The same protocol as the one used in (Ngiam et al., 2011; Matthews et al., 2002) is followed for the AVLetters dataset. The first two utterances of each subject are used for training and the last utterance is used for testing. This means that there are 520 training utterances and 260 test utterances.

The speaker-independent protocol suggested in (Cox et al., 2008) is used for the AVLetters2 dataset. A 5-fold cross-validation is used, where three speakers are used for training, one for validation and one for testing. This means that in each iteration of the cross-validation there are 546, 182 and 182 training, validation and test utterances, respectively.

The target classes are a one-hot encoding of the 10 utterances (in the case of CUAVE and OuluVS2) or 26 letters (in the case of AVLetters and AVLetters2). The label of each utterance is used to label each frame and the end-to-end model is trained with these frame labels. The majority label over each utterance is used for labelling the entire sequence.
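For clarity, the utterance-level decision from the per-frame outputs amounts to a simple majority vote, e.g.:

```python
import numpy as np

def utterance_label(frame_logits):
    """Majority vote over per-frame predictions.
    frame_logits: (time, num_classes) array of per-frame scores."""
    frame_preds = frame_logits.argmax(axis=1)
    return np.bincount(frame_preds).argmax()
```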

Every time a deep network is trained, the results vary due to random initialisation. Hence, in order to present a more objective evaluation, each experiment is repeated 10 times and the mean and standard deviation of the classification accuracy at the utterance level are reported.

5.2 Preprocessing

The impact of subject-dependent characteristics first needs to be reduced since almost all the experiments are subject independent (only the evaluation protocol on AVLetters is subject dependent). This is achieved by subtracting the mean image, computed over the entire utterance, from each frame.

The next step is the normalisation of the data. All images are z-normalised, i.e., the mean and standard deviation are set to 0 and 1, respectively, as suggested in (Hinton, 2012), before pre-training the encoding layers.
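A short sketch of these two preprocessing steps is given below; whether the z-normalisation statistics are computed per image or over the whole dataset is not stated in the text, so per-image normalisation is assumed here.

```python
import numpy as np

def preprocess_utterance(frames):
    """frames: (time, height, width) array of grayscale mouth ROIs for one utterance."""
    frames = frames.astype(np.float32)
    # Reduce subject-dependent appearance: subtract the mean image of the utterance.
    frames -= frames.mean(axis=0, keepdims=True)
    # z-normalise each image to zero mean and unit standard deviation (assumed per image).
    mean = frames.mean(axis=(1, 2), keepdims=True)
    std = frames.std(axis=(1, 2), keepdims=True)
    frames = (frames - mean) / (std + 1e-8)
    return frames.reshape(len(frames), -1)   # flatten each frame for the encoder
```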

6 Results

In this section we present results for the two-stream end-to-end model, shown in Fig. 1, and also for each individual stream separately. We report the mean classification accuracy and standard deviation of the 10 models trained on each database, OuluVS2, CUAVE, AVLetters and AVLetters2, in Tables 1 to 4, respectively. Almost all previous works provide just a single accuracy value (with no standard deviation), which is most likely the maximum performance achieved. Hence, in order to facilitate a fair comparison, the maximum performance achieved over the 10 runs is also reported.

Results for the OuluVS2 database are shown in Table 1. The best overall result is achieved by the end-to-end 2-stream model, with a mean classification accuracy of 93.6%. Notably, even the mean performance is consistently higher than the maximum performance of most previous works. When it comes to maximum performance, the proposed end-to-end architecture sets the new state-of-the-art on OuluVS2 with 95.6%. We should also point out that the proposed 2-stream model outperforms even the CNN models (Chung and Zisserman, 2016a, b) trained with external data. Both models are pre-trained on a large dataset, LRW (Chung and Zisserman, 2016a), and fine-tuned on OuluVS2. In addition, the proposed model also outperforms the CNN model (Han et al., 2017) trained on all views in a multitask scenario where the goal is to correctly predict both the phrase and the view of the given sequence.

Results for the CUAVE database are shown in Table 2. Comparison between different works is difficult since there is no standard evaluation protocol for this database. The evaluation protocol followed in this study is only used by (Ngiam et al., 2011) and (Srivastava and Salakhutdinov, 2014). The best overall performance is achieved by the end-to-end 2-stream model, with a mean classification accuracy of 87.3%, which is an absolute improvement of 18.3% over (Srivastava and Salakhutdinov, 2014). The maximum classification accuracy of 88.4% achieved by this model is the new state-of-the-art performance on the CUAVE dataset, an absolute improvement of 3.4% over (Benhaim et al., 2013).

Results for the AVLetters database are shown in Table 3. The best overall performance is achieved by the end-to-end 2-stream model, with a mean classification accuracy of 66.3%, which is an absolute improvement of 1.6% over the previous state-of-the-art model (Srivastava and Salakhutdinov, 2014). However, we should note that in this case the improvement over the single stream which uses raw images as input is not statistically significant. The two-stream end-to-end model also sets the new state-of-the-art for the maximum classification accuracy with 69.2%, which is an absolute improvement of 3.9% over (Bakry and Elgammal, 2013). At this point we should mention the work of Pei et al. (Pei et al., 2013), which reports a maximum performance of 69.6%. However, that work uses a non-standard evaluation protocol where the data are randomly divided into 60% and 40% for training and testing, respectively.

Figure 2: Per-subject performance on OuluVS2 database.
Figure 3: Per-subject performance on AVLETTERS database.

Results for the AVLetters2 database are shown in Table 4. In this case, the best overall performance is achieved by the end-to-end single-stream model based on raw images, with a mean classification accuracy of 36.8%. The main reason the 2-stream model does not perform as well is bad tracking of facial points for some subjects. As a consequence, the extracted mouth ROIs are jittery, which affects the performance of the diff stream. The single-stream end-to-end model also sets the new state-of-the-art for the maximum classification accuracy with 42.6%, which is an absolute improvement of 11.4% over (Hu et al., 2016). We should emphasise that we use a subject-independent evaluation and, due to the small number of subjects, the classification accuracy is much lower than on the other databases. Much higher results have been reported in the literature for a subject-dependent evaluation protocol, with the highest performance of 91.2% reported in (Pei et al., 2013).

Fig. 2 shows the classification accuracy per subject for the OuluVS2 dataset. It is clear that the deviation across different test subjects is not very large. Almost all subjects achieve a classification accuracy over 80% with 8 of them achieving over 95%. A similar pattern is also observed in the CUAVE dataset (Figure is not shown due to lack of space).

Fig. 3 shows the classification accuracy per subject for AVLetters. Contrary to OuluVS2 and CUAVE, the performance varies considerably between subjects, with minimum and maximum accuracies of 54% and 81% for subjects S06 and S08, respectively. This could be a consequence of the small size of the dataset, which does not allow for good generalisation across all subjects, or due to differences in the cropped mouth regions. Since the cropped regions are provided, it is not easy to verify that all regions were cropped consistently. The same observation about performance variance can also be made for the AVLetters2 dataset (figure not shown due to lack of space), with minimum and maximum accuracies of 26% and 50% for subjects S05 and S02, respectively.

The most common confusion pair (confusion matrices are not included due to lack of space) for the OuluVS2 dataset is between “Hello” (3rd phrase) and “Thank you” (8th phrase), which is consistent with the confusions presented in (Petridis et al., 2017a; Lee et al., 2016). The most frequently confused pairs in the CUAVE dataset are zero and two, and six and nine, which is consistent with (Petridis et al., 2017a).

The most common confusions for the AVLetters dataset are between B and P, D and T, and U and Q. This is not surprising since the letters in each pair have the same visual representation: each consists of two phonemes, where the first phonemes belong to the same viseme class and the second phoneme is identical. The letters which are classified correctly most of the time are the following: M, O, R, W, Y. Similar confusions are observed on AVLetters2 as well.

Finally, we should also mention that we experimented with CNNs for the encoders, but this led to worse performance than the proposed model. This is consistent with the previous CNN-based results reported on the OuluVS2 and AVLetters databases, which are much lower than those of the proposed system (see the works of (Fung and Mak, 2018; Saitoh et al., 2016; Lee et al., 2016) in Table 1 and (Feng et al., 2017) in Table 3). This is also reported in (Wand et al., 2016) and is likely due to the small training sets. Only works which have used external data, like (Chung and Zisserman, 2016a, b), or used all views (Han et al., 2017) have been able to report results based on CNN models on OuluVS2 close to the results presented in this work.

In order to further test this assumption, we compare the performance of the end-to-end two-stream model with a state-of-the-art lip-reading model as a function of the amount of training data. The model we consider is based on a ResNet and BGRUs (Petridis et al., 2018; Stafylakis and Tzimiropoulos, 2017) and achieves state-of-the-art performance on the LRW database. The model is trained using the same training protocol as in (Petridis et al., 2018). Fig. 4(a) and 4(b) show the classification accuracy of the two models for varying training set sizes, from 10% to 100%, on the OuluVS2 and CUAVE datasets, respectively. In the former case, the ResNet model quickly reaches the same level of performance as the proposed end-to-end model. In the latter case, the performance gap between the ResNet model and the proposed model decreases as the training set size increases. However, even when the entire training set is used, its performance remains below that of the proposed model. This probably happens due to the small size of the CUAVE training set, which is about half the size of the OuluVS2 training set. This is another indication that CNN models do not reach their full potential for lip-reading applications when trained on small-scale datasets and that alternative models, like the one proposed here, can be better suited in this scenario.

Figure 4: The performance of our approach and the state-of-the-art model based on ResNets and BGRUs (Petridis et al., 2018) as a function of the size of the training set. (a) OuluVS2, (b) CUAVE.

7 Conclusion

In this work, we present an end-to-end visual speech recognition system suitable for small-scale datasets which jointly learns to extract features directly from the pixels and to perform classification using LSTM networks. Results on four datasets, OuluVS2, CUAVE, AVLetters and AVLetters2, demonstrate that the proposed model achieves state-of-the-art performance on all of them, significantly outperforming all other approaches reported in the literature, even CNNs pre-trained on external databases. A natural next step would be to extend the system so that it can recognise sentences instead of isolated words.

References

  • oul (a) http://ouluvs2.cse.oulu.fi.
  • oul (b) http://ouluvs2.cse.oulu.fi/ACCVW.html.
  • oul (c) http://www.ee.oulu.fi/research/imag/OuluVS2/ACCVW.html.
  • Anina et al. (2015) Anina, I., Zhou, Z., Zhao, G., Pietikäinen, M., 2015. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis, in: IEEE FG.
  • Assael et al. (2016) Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N., 2016. Lipnet: Sentence-level lipreading. arXiv preprint arXiv:1611.01599 .
  • Bakry and Elgammal (2013) Bakry, A., Elgammal, A., 2013. MKPLS: Manifold kernel partial least squares for lipreading and speaker identification, in: IEEE CVPR.
  • Benhaim et al. (2013) Benhaim, E., Sahbi, H., Vitte, G., 2013. Designing relevant features for visual speech recognition, in: IEEE ICASSP, pp. 2420–2424.
  • Chung et al. (2017) Chung, J.S., Senior, A., Vinyals, O., Zisserman, A., 2017. Lip reading sentences in the wild, in: IEEE CVPR.
  • Chung and Zisserman (2016a) Chung, J.S., Zisserman, A., 2016a. Lip reading in the wild, in: ACCV, Springer. pp. 87–103.
  • Chung and Zisserman (2016b) Chung, J.S., Zisserman, A., 2016b. Out of time: automated lip sync in the wild, in: Workshop on Multiview Lipreading, ACCV, pp. 251–263.
  • Cox et al. (2008) Cox, S.J., Harvey, R.W., Lan, Y., Newman, J.L., Theobald, B.J., 2008. The challenge of multispeaker lip-reading., in: AVSP, pp. 179–184.
  • Dupont and Luettin (2000) Dupont, S., Luettin, J., 2000. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. on Multimedia 2, 141–151.
  • Feng et al. (2017) Feng, W., Guan, N., Li, Y., Zhang, X., Luo, Z., 2017. Audio visual speech recognition with multimodal recurrent neural networks, in: IEEE IJCNN, pp. 681–688.
  • Fernandez-Lopez and Sukno (2018) Fernandez-Lopez, A., Sukno, F., 2018. Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing .
  • Frisky et al. (2015) Frisky, A.Z.K., Wang, C.Y., Santoso, A., Wang, J.C., 2015. Lip-based visual speech recognition system, in: IEEE ICCST, pp. 315–319.
  • Fung and Mak (2018) Fung, I., Mak, B., 2018. End-to-end low-resource lip-reading with maxout CNN and LSTM, in: IEEE ICASSP, pp. 2511–2515.
  • Glorot and Bengio (2010) Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks., in: Aistats, pp. 249–256.
  • Han et al. (2017) Han, H., Kang, S., Yoo, C.D., 2017. Multi-view visual speech recognition based on multi task learning, in: IEEE ICIP, pp. 3983–3987.
  • Hinton (2012) Hinton, G., 2012. A practical guide to training restricted boltzmann machines, in: Neural Networks: Tricks of the Trade. Springer, pp. 599–619.
  • Hinton and Salakhutdinov (2006) Hinton, G., Salakhutdinov, R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507.
  • Hu et al. (2016) Hu, D., Li, X., Lu, X., 2016. Temporal multimodal learning in audiovisual speech recognition, in: IEEE CVPR, pp. 3574–3582.
  • Huang and Kingsbury (2013) Huang, J., Kingsbury, B., 2013. Audio-visual deep learning for noise robust speech recognition, in: IEEE ICASSP, pp. 7596–7599.
  • Kazemi and Sullivan (2014) Kazemi, V., Sullivan, J., 2014. One millisecond face alignment with an ensemble of regression trees, in: IEEE CVPR, pp. 1867–1874.
  • Koumparoulis and Potamianos (2018) Koumparoulis, A., Potamianos, G., 2018. Deep view2view mapping for view-invariant lipreading, in: IEEE SLT Workshop, pp. 588–594.
  • Lee et al. (2016) Lee, D., Lee, J., Kim, K.E., 2016. Multi-view automatic lip-reading using neural network, in: Workshop on Multi-view Lip-reading Challenges, ACCV. pp. 290–302.
  • Li et al. (2016) Li, Y., Takashima, Y., Takiguchi, T., Ariki, Y., 2016. Lip reading using a dynamic feature of lip images and convolutional neural networks, in: IEEE/ACIS Intl. Conf. on Computer and Information Science, pp. 1–6.
  • Lucey and Sridharan (2006) Lucey, P., Sridharan, S., 2006. Patch-based representation of visual speech, in: Proc. of the HCSNet Workshop on Use of Vision in HCI, pp. 79–85.
  • Matthews et al. (2002) Matthews, I., Cootes, T.F., Bangham, J.A., Cox, S., Harvey, R., 2002. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 198–213.
  • Mroueh et al. (2015) Mroueh, Y., Marcheret, E., Goel, V., 2015. Deep multimodal learning for audio-visual speech recognition, in: IEEE ICASSP, pp. 2130–2134.
  • Ngiam et al. (2011) Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y., 2011. Multimodal deep learning, in: Proc. of ICML, pp. 689–696.
  • Ninomiya et al. (2015) Ninomiya, H., Kitaoka, N., Tamura, S., Iribe, Y., Takeda, K., 2015. Integration of deep bottleneck features for audio-visual speech recognition, in: Conf. of the International Speech Communication Association.
  • Noda et al. (2015) Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T., 2015. Audio-visual speech recognition using deep learning. Applied Intelligence 42, 722–737.
  • Papandreou et al. (2007) Papandreou, G., Katsamanis, A., Pitsikalis, V., Maragos, P., 2007. Multimodal fusion and learning with uncertain features applied to audiovisual speech recognition, in: Workshop on Multimedia Signal Processing, pp. 264–267.
  • Papandreou et al. (2009) Papandreou, G., Katsamanis, A., Pitsikalis, V., Maragos, P., 2009. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Trans. on Audio, Speech, and Language Processing 17, 423–435.
  • Patterson et al. (2002) Patterson, E., Gurbuz, S., Tufekci, Z., Gowdy, J., 2002. Moving-talker, speaker-independent feature study, and baseline results using the cuave multimodal speech corpus. EURASIP J. Appl. Signal Process. 2002, 1189–1201.
  • Pei et al. (2013) Pei, Y., Kim, T.K., Zha, H., 2013. Unsupervised random forest manifold alignment for lipreading, in: ICCV, IEEE. pp. 129–136.

  • Petridis et al. (2017a) Petridis, S., Li, Z., Pantic, M., 2017a. End-to-end visual speech recognition with LSTMs, in: IEEE ICASSP, pp. 2592–2596.
  • Petridis and Pantic (2016) Petridis, S., Pantic, M., 2016. Deep complementary bottleneck features for visual speech recognition, in: IEEE ICASSP, pp. 2304–2308.
  • Petridis et al. (2018) Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., Pantic, M., 2018. End-to-end audiovisual speech recognition, in: IEEE ICASSP, pp. 6548–6552.
  • Petridis et al. (2017b) Petridis, S., Wang, Y., Li, Z., Pantic, M., 2017b. End-to-end audiovisual fusion with LSTMs, in: AVSP.
  • Potamianos et al. (2003) Potamianos, G., Neti, C., Gravier, G., Garg, A., Senior, A.W., 2003. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91, 1306–1326.
  • Saitoh et al. (2016) Saitoh, T., Zhou, Z., Zhao, G., Pietikäinen, M., 2016. Concatenated frame image based CNN for visual speech recognition, in: Workshop on Multi-view Lip-reading Challenges, ACCV. pp. 277–289.
  • Shillingford et al. (2018) Shillingford, B., Assael, Y., Hoffman, M.W., Paine, T., Hughes, C., Prabhu, U., Liao, H., Sak, H., Rao, K., Bennett, L., et al., 2018. Large-scale visual speech recognition. arXiv preprint arXiv:1807.05162 .
  • Srivastava and Salakhutdinov (2014) Srivastava, N., Salakhutdinov, R., 2014. Multimodal learning with deep boltzmann machines. J. Mach. Learn. Res. 15, 2949–2980.
  • Stafylakis and Tzimiropoulos (2017) Stafylakis, T., Tzimiropoulos, G., 2017. Combining residual networks with LSTMs for lipreading, in: Interspeech, pp. 3652–3656.
  • Sui et al. (2015a) Sui, C., Bennamoun, M., Togneri, R., 2015a. Listening with your eyes: Towards a practical visual speech recognition system using deep boltzmann machines, in: IEEE ICCV, pp. 154–162.
  • Sui et al. (2015b) Sui, C., Togneri, R., Bennamoun, M., 2015b. Extracting deep bottleneck features for visual speech recognition, in: ICASSP, pp. 1518–1522.
  • Takashima et al. (2016) Takashima, Y., Aihara, R., Takiguchi, T., Ariki, Y., Mitani, N., Omori, K., Nakazono, K., 2016. Audio-visual speech recognition using bimodal-trained bottleneck features for a person with severe hearing loss, pp. 277–281.
  • Wand et al. (2016) Wand, M., Koutník, J., Schmidhuber, J., 2016. Lipreading with long short-term memory, in: IEEE ICASSP, pp. 6115–6119.
  • Young et al. (2002) Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al., 2002. The HTK book 3, 175.
  • Zhao et al. (2009) Zhao, G., Barnard, M., Pietikainen, M., 2009. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia 11, 1254–1265.
  • Zhou et al. (2011) Zhou, Z., Zhao, G., Pietikäinen, M., 2011. Towards a practical lipreading system, in: IEEE CVPR, pp. 137–144.
  • Zimmermann et al. (2016) Zimmermann, M., Ghazi, M.M., Ekenel, H.K., Thiran, J.P., 2016. Visual speech recognition using PCA networks and LSTMs in a tandem GMM-HMM system, in: Workshop on Multi-view Lip-reading Challenges, ACCV, pp. 264–276.