Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

by Chunlin Tian et al.

Audio-visual Speech Recognition (AVSR), which employs both video and audio information for Automatic Speech Recognition (ASR), is an application of multimodal learning that makes ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNNs) have become an important toolkit in many traditional classification tasks, including ASR, image classification and natural language processing. Some DNN models used in AVSR, such as Multimodal Deep Autoencoders (MDAEs), the Multimodal Deep Belief Network (MDBN) and the Multimodal Deep Boltzmann Machine (MDBM), indeed work better than traditional methods. However, such DNN models have several shortcomings: (1) they do not balance modal fusion and temporal fusion, or lack temporal fusion altogether; (2) their architecture is not end-to-end, which makes training and testing cumbersome. We propose a DNN model, the Auxiliary Multimodal LSTM (am-LSTM), to overcome these weaknesses. The am-LSTM is trained and tested once, is easy to train, and prevents overfitting automatically; extensibility and flexibility are also taken into consideration. Experiments on three datasets show that am-LSTM is much better than traditional methods and other DNN models.





1 Introduction

Automatic Speech Recognition (ASR) has been investigated for many years, and there is a wealth of literature. Recent progress includes Deep Speech 2 [3], which utilizes deep Convolutional Neural Networks (CNNs) [10], LSTM [9] and CTC [7], as well as sequence-to-sequence models [26]. Although ASR has achieved excellent results, it has some intrinsic problems, including insufficient tolerance of noise and disturbance. Besides, an illusion occurs when the auditory component of one sound is paired with the visual component of another sound, leading to the perception of a third sound [15]. This is known as the McGurk effect (McGurk & MacDonald, 1976).

Benefiting from multimodal learning, Audio-visual Speech Recognition (AVSR) is a supplement to ASR that mixes audio and visual information together, and a large body of work has verified that AVSR strengthens ASR systems [17] [2]. The core of AVSR is multimodal learning. In earlier decades, many models aimed to fuse multimodal data into more representative and discriminative forms; these included multimodal extensions of Hidden Markov Models (HMMs) and other statistical models, but strong prior assumptions limit such models. With the revival of neural networks, several deep models have been proposed for multimodal learning. Different from traditional models, deep models usually offer two main advantages:

(1) Many researchers view Deep Neural Networks (DNNs) as performing a kind of good representation learning [5]; that is, the feature-extraction ability of DNNs can be exploited for a variety of tasks, especially CNNs for image feature extraction. Aural information is refined by deep models including the Deep Belief Network (DBN), Deep Autoencoder (DAE), Restricted Boltzmann Machines (RBMs) and Deep Bottleneck Features (DBNF) [14][18][27].

(2) The other merit of deep models is well-performed fusion, conquering the biggest disadvantage of traditional methods. The Multimodal DBN (MDBN) [23], Multimodal DAEs (MDAEs) [17] and Multimodal Deep Boltzmann Machine (MDBM) [24] are all unsupervised DNNs that perform fusion.

However, the aforementioned deep models have two primary shortcomings:

(1) They do not balance modal fusion and temporal fusion, or lack temporal fusion altogether. Many methods simply concatenate the features of frames in a single video to perform temporal fusion. Moreover, in modal fusion, such methods do not consider the correlation among different frames.

(2) The architecture of these models is not end-to-end, which makes training and testing cumbersome.

Inspired by the Multimodal Recurrent Neural Network [11] for image captioning, we propose an end-to-end DNN architecture, the Auxiliary Multimodal LSTM (am-LSTM), for AVSR and lipreading. The am-LSTM overcomes the two main weaknesses mentioned above. It is composed of LSTMs, projection and recognition components, and is trained once; modal fusion and temporal fusion are therefore accomplished at the same time, i.e., the two fusion processes are mixed and balanced. To avoid overfitting and make the DNN converge faster, we use a connection strategy similar to Deep Residual Learning [8], the auxiliary connection. Early stopping [20] and dropout [25] regularization are used against overfitting as well. Because of its CNN and LSTM components, am-LSTM is also an instance of the well-known CNN-RNN architecture.

We conducted experiments on three AVSR datasets: AVLetters (Patterson et al., 2002), AVLetters2 (Cox et al., 2008) and AVDigits (Di Hu et al., 2015). The results suggest that am-LSTM outperforms both classic AVSR models and earlier deep models.

2 Related work

Traditional AVSR Systems. Humans understand the multimodal world in a seemingly effortless manner, although the brain dedicates vast information-processing resources to the corresponding tasks [12]. It is difficult for computers to understand multimodal information, because different modalities have different characters in nature: the same information can be expressed in different modalities in very dissimilar ways, let alone distinct information. Audio is a one-dimensional temporal signal, video is a three-dimensional temporal signal, and text carries semantic information.

Multimodal fusion is the most important part of multimodal learning. There exist three kinds of fusion strategies: early fusion, late fusion and hybrid fusion. For early fusion, it suffices to concatenate all monomodal features into a single aggregate multimodal descriptor. In late fusion, each modality is classified independently; integration is done at the decision level and is usually based on heuristic rules. Hybrid fusion lies in between early and late fusion and is specifically geared towards modeling multimodal time-evolving data [12] [4].
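As a toy illustration of the first two strategies (the feature shapes and the equal weighting are illustrative assumptions, not taken from the paper), early fusion concatenates monomodal features, while late fusion combines per-modality decisions:

```python
import numpy as np

def early_fusion(feat_audio, feat_video):
    """Early fusion: concatenate monomodal features into one descriptor."""
    return np.concatenate([feat_audio, feat_video])

def late_fusion(scores_audio, scores_video, w=0.5):
    """Late fusion: integrate independent per-modality decisions
    at the decision level with a simple heuristic weighting rule."""
    return w * np.asarray(scores_audio) + (1 - w) * np.asarray(scores_video)
```

Hybrid fusion would interleave both, e.g. fusing features within short temporal windows while combining decisions across them.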

Representative early work in this field is multistream HMMs (mHMMs) [16], which were confirmed to be adaptable and flexible for modeling sequential and temporal data [21]. But probabilistic models have some explicit limitations, especially strong priors. Researchers are also interested in jointly mapping data from dissimilar spaces into one common space.

Deep Learning for AVSR.

Deep learning provides a powerful toolkit for machine learning. In AVSR, DNNs also display an obvious increase in accuracy. Noda et al. [18] use a pre-trained CNN to extract visual features and denoising autoencoders to improve aural features, and then mHMMs for fusion and classification; as mentioned in Section 1, this mainly exploits the feature-extraction ability of DNNs.

Other methods perform multimodal fusion with unsupervised DNNs. One type is based on DAEs; the representative work is MDAEs [17]. Others are based on Boltzmann Machines: [23] uses the MDBN and [24] uses the MDBM. Generally, DAEs-based models are easy to train but lack theoretical support and flexibility; RBM-based models are hard to train because of the partition function [5], but enjoy sufficient support from probabilistic models and are simple to extend.

3 The Proposed Model

The am-LSTM fuses the audio-visual data while considering modal fusion, temporal fusion and the connections between frames simultaneously. In this section, we introduce the am-LSTM and show its simplicity and extensibility.

3.1 Simple LSTM

There are two widely known issues with properly training a vanilla RNN: the vanishing and the exploding gradient [19]. Rather than relying on training tricks, the LSTM uses gates to avoid gradient vanishing and exploding. A typical LSTM has three gates, the input gate, output gate and forget gate, which slow down the disappearance of past information and make Backpropagation Through Time (BPTT) easier. The gates work as follows:
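A minimal NumPy sketch of one step of the standard LSTM formulation [9] (the parameter shapes and gate stacking order are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W (4H x D), U (4H x H) and b (4H,) stack the parameters for the
    input gate, forget gate, cell candidate and output gate."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate: slows forgetting of the past
    g = np.tanh(z[2*H:3*H])      # cell candidate
    o = sigmoid(z[3*H:4*H])      # output gate
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```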

Figure 1: Simple LSTM

3.2 am-LSTM

Figure 2: am-LSTM

The am-LSTM is an extension of the LSTM which contains two LSTMs and some additional components. The two LSTMs are the video LSTM and the audio LSTM; their fundamental architecture is the same as the simple LSTM, and the formulae of the video LSTM are given in (7)-(12) (similar formulae can be written for the audio LSTM).


After the video LSTM and audio LSTM, the data are projected into the same-dimensional space by a projection layer followed by an activation function, which also makes the model more nonlinear. The projection matrices are trained within the DNN, and g is the activation function.
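A sketch of this projection step, with hypothetical dimensions; the random matrices below stand in for the projection parameters that would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
h_video = rng.standard_normal(50)              # video LSTM output
h_audio = rng.standard_normal(50)              # audio LSTM output
P_v = rng.standard_normal((50, 50)) * 0.1      # video projection (learned in practice)
P_a = rng.standard_normal((50, 50)) * 0.1      # audio projection (learned in practice)
g = np.tanh                                    # activation adds nonlinearity

# Both modalities end up in the same 50-dimensional space.
z_v, z_a = g(P_v @ h_video), g(P_a @ h_audio)
```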


The features produced by the video LSTM, audio LSTM and projection can be regarded as well-fused features: the influence of different frames in a single video is summed in both the video and audio modalities. Then a classical Multi-layer Perceptron (MLP) with batch normalization is used as the classifier. The training loss is the squared multi-label margin loss.

Figure 3: Batch and loss. The loss plateaus at about 0.2 because of strong overfitting.

However, experiments showed that overfitting is very strong in the architecture described above. Hence, we introduce the auxiliary connection, which accelerates convergence and prevents overfitting. The auxiliary connection lies after the video LSTM and audio LSTM; the data are then summed and mapped to the target space. Training takes three parts into account: the main loss, the video auxiliary loss and the audio auxiliary loss.

The implication is that the audio-visual loss, video loss and audio loss are minimized together. The auxiliary networks help the main network achieve a win-win situation: like ResNet, they convey information from the original networks, which makes am-LSTM consider rich, hierarchical information.

Figure 4: The details of the auxiliary connection.

3.3 Training am-LSTM

To train the am-LSTM, we adopt a combination of squared multi-label margin losses. Let l(p, y) denote the squared multi-label margin loss between prediction p and classification target y. The real training loss of am-LSTM combines the loss of the main part with the losses of the two auxiliary parts:

L = L_main + λ_v · L_video + λ_a · L_audio

where each term is a squared multi-label margin loss against the target y, and λ_v and λ_a are hyperparameters controlling the influence of the auxiliary parts; we choose fixed values for λ_v and λ_a. The auxiliary connection is free enough that the am-LSTM remains flexible and extensible.
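The combined objective can be sketched as follows; the single-target simplification of the margin loss and the λ values are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def sq_margin_loss(scores, target):
    """Squared multi-label margin loss, simplified to one target class:
    penalize every non-target score within margin 1 of the target score."""
    margins = 1.0 - (scores[target] - np.delete(scores, target))
    return np.mean(np.maximum(0.0, margins) ** 2)

def am_lstm_loss(s_main, s_video, s_audio, target, lam_v=0.3, lam_a=0.3):
    """Total loss = main loss + weighted video/audio auxiliary losses.
    lam_v and lam_a are illustrative hyperparameter values."""
    return (sq_margin_loss(s_main, target)
            + lam_v * sq_margin_loss(s_video, target)
            + lam_a * sq_margin_loss(s_audio, target))
```

Minimizing all three terms jointly is what lets the auxiliary branches regularize the main network.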

4 Experiments

In this section, we describe our experiments, including the datasets, data pre-processing, implementation details and results, which indicate that am-LSTM is a robust, well-performing and flexible model for AVSR and, likewise, lipreading.

4.1 Datasets

We conducted our experiments on three datasets: AVLetters (Patterson et al., 2002), AVLetters2 (Cox et al., 2008) and AVDigits (Di Hu et al., 2015).

Datasets Speakers Content Times Miscellaneous
AVLetters 10 A-Z 3 pre-extracted
AVLetters2 5 A-Z 7 previous split
AVDigits 6 0-9 9 /
Table 1: Details of the datasets

4.2 Data Pre-processing

If the video and audio are not already split, we split each sample into video and audio and make them the same length by truncation or completion. Finally, the data are centered.

Video Pre-processing. First, the Viola-Jones algorithm [28] is employed to extract the Region-of-Interest surrounding the mouth. After the region is resized to a fixed pixel size, the pre-trained VGG-16 [22] is used to extract image features; we use the features of the last fully connected layer. Finally, we reduce them to 100 principal components with PCA whitening.
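A plain SVD-based sketch of PCA whitening to k components (the paper does not specify its implementation, so this is one standard way to do it):

```python
import numpy as np

def pca_whiten(X, k=100, eps=1e-8):
    """Project rows of X onto the top-k principal components and
    rescale them to (approximately) unit variance."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                           # top-k component scores
    return scores / (S[:k] / np.sqrt(len(X) - 1) + eps)
```

After whitening, the retained components are decorrelated with unit variance, which helps the downstream LSTMs.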

Audio Pre-processing. The audio signal features are extracted as a spectrogram with a 20ms Hamming window and 10ms overlap. With 251 points of the Fast Fourier Transform and 50 principal components from PCA, the spectral coefficient vector is regarded as the audio features.
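A sketch of this spectrogram extraction; the 16 kHz sample rate and 500-sample FFT length (which yields the 251 spectral points mentioned above) are assumptions:

```python
import numpy as np

def spectrogram(signal, fs=16000, win_ms=20, hop_ms=10, n_fft=500):
    """Magnitude spectrogram with a Hamming window.
    rfft over n_fft=500 samples gives 251 spectral points per frame."""
    win = int(fs * win_ms / 1000)        # 20 ms window -> 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)        # 10 ms hop
    w = np.hamming(win)
    frames = np.array([signal[i:i + win] * w
                       for i in range(0, len(signal) - win + 1, hop)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```

PCA to 50 components (as in the video pipeline) would then be applied to these spectral vectors.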

Data Augmentation. Four contiguous audio frames correspond to one video frame at each time step. We randomly shift the aural and visual features in every video by up to 10 frames forwards or backwards, simultaneously in both modalities, to double the data. As a result, the model generalises better in the time domain.
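One such random shift can be sketched as below; zero padding at the vacated end is an assumption, since the paper does not say how the boundary is filled:

```python
import numpy as np

def shift_augment(frames, max_shift=10, rng=None):
    """Shift a feature sequence in time by a random offset in
    [-max_shift, max_shift], zero-padding the vacated positions."""
    rng = rng or np.random.default_rng()
    s = int(rng.integers(-max_shift, max_shift + 1))
    out = np.zeros_like(frames)
    if s >= 0:
        out[s:] = frames[:len(frames) - s]   # shift forwards
    else:
        out[:s] = frames[-s:]                # shift backwards
    return out
```

Applying the same offset to the aligned audio and video sequences keeps the modalities synchronized while doubling the training data.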

4.3 Implementation Details

The am-LSTM has two general LSTMs with dropout, mapping 100-dimensional data to 50 dimensions. Since the video and audio features have the same dimension, the projection is an identity matrix that could be omitted. The MLP in the main phase has three layers with batch normalization and the ReLU activation function. The auxiliary phase has a simple structure mapping 50-dimensional data to the 10 target classes. The training loss is the combination of the main and auxiliary losses described in Section 3.3.


The am-LSTM is able to be trained and tested once. No redundant training process is needed. Lipreading is treated as cross modality speech recognition, that is, in the training phase, video and audio modalities are both needed, whereas in the testing phase, only video modality is presented.

4.4 Results

The evaluation of am-LSTM is split into two parts: the AVSR task and cross-modality lipreading. AVSR is trained and evaluated on both modalities, whereas cross-modality lipreading is trained on both video and audio but evaluated on video only. Our AVSR experiments are conducted on AVLetters2 and AVDigits, and the cross-modality lipreading experiments on AVLetters.

4.4.1 Audio-visual Speech Recognition

AVSR is the main task of our work. We conducted experiments on AVLetters2 and AVDigits, comparing against MDBN, MDAEs and RTMRBM. The results indicate that am-LSTM is much better than these models.

Datasets Model Mean Accuracy
AVLetters2 MDBN [23] 54.10%
MDAE [17] 67.89%
RTMRBM (Di Hu et al., 2015) 74.77%
am-LSTM 89.11%
AVDigits MDBN [23] 55.00%
MDAE [17] 66.74%
RTMRBM (Di Hu et al., 2015) 71.77%
am-LSTM 85.23%
Table 2: AVSR performance on AVLetters2 and AVDigits. The results indicate that our model performs better than MDBN, MDAEs and RTMRBM.

4.4.2 Cross Modality Lipreading

Cross-modality lipreading is the secondary task. As mentioned before, we trained am-LSTM on both modalities but evaluated it on the visual modality only. The experiments have two modes, video-only and cross-modality, which show the superiority of cross-modality lipreading. In these experiments, am-LSTM performs much better than MDAEs, CRBM and RTMRBM. The lipreading here is word-level lipreading.

Mode Model Mean Accuracy
Only Video Multiscale Spatial Analysis [13] 44.60%
Local Binary Pattern [29] 58.85%

Cross Modality
MDAEs [17] 64.21%
CRBM [2] 65.78%
RTMRBM (Di Hu et al., 2015) 66.21%
am-LSTM 88.83%
Table 3: Cross-modality lipreading performance. The experiments suggest that cross-modality lipreading is better than single-modality lipreading, and that our method performs much better than the other models.

5 Conclusion

We proposed an end-to-end deep model for AVSR and lipreading which clearly increases mean accuracy. Our experiments suggest that am-LSTM performs much better than other models in AVSR and cross-modality lipreading. The benefits of am-LSTM are that it is trained and tested once, and that it is extensible and flexible; no other training processes are needed. Because am-LSTM is simple, it can serve as an efficient tool for modeling fused information. Meanwhile, am-LSTM considers temporal connections, so it is suitable for sequential features. In the future, we plan to apply am-LSTM to other multimodal temporal tasks and make it more flexible.


I would like to acknowledge my friends Yi Sun, Wenqiang Yang and Peng Xie for their helpful support. I would like to thank the developers of Torch [6] and Tensorflow [1]. I would also like to thank my laboratory for providing computational resources.


  • Abadi et al. [2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
  • Amer et al. [2014] Mohamed R Amer, Behjat Siddiquie, Saad Khan, Ajay Divakaran, and Harpreet Sawhney. Multimodal fusion using dynamic hybrid models. pages 556–563, 2014.
  • Amodei et al. [2015] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.
  • Atrey et al. [2010] Pradeep K Atrey, M Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S Kankanhalli. Multimodal fusion for multimedia analysis: a survey. Multimedia systems, 16(6):345–379, 2010.
  • Bengio et al. [2015] Yoshua Bengio, Ian J Goodfellow, and Aaron Courville. Deep learning. An MIT Press book in preparation. Draft chapters available at http://www.iro.umontreal.ca/~bengioy/dlbook, 2015.
  • Collobert et al. [2011] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
  • Graves et al. [2006] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Mao et al. [2014] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632, 2014.
  • Maragos et al. [2008] Petros Maragos, Alexandros Potamianos, and Patrick Gros. Multimodal processing and interaction: audio, video, text, volume 33. Springer Science & Business Media, 2008.
  • Matthews et al. [2002] Iain Matthews, Timothy F Cootes, J Andrew Bangham, Stephen Cox, and Richard Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002.
  • Mroueh et al. [2015] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Deep multimodal learning for audio-visual speech recognition. pages 2130–2134, 2015.
  • Nath and Beauchamp [2012] Audrey R Nath and Michael S Beauchamp. A neural basis for interindividual differences in the mcgurk effect, a multisensory speech illusion. NeuroImage, 59(1):781–787, 2012.
  • Nefian et al. [2002] Ara V Nefian, Luhong Liang, Xiaobo Pi, Xiaoxing Liu, and Kevin Murphy. Dynamic bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002(11):1–15, 2002.
  • Ngiam et al. [2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.
  • Noda et al. [2015] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G Okuno, and Tetsuya Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
  • Pascanu et al. [2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.
  • Prechelt [1998] Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.
  • Rabiner [1989] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Srivastava and Salakhutdinov [2012a] Nitish Srivastava and Ruslan Salakhutdinov. Learning representations for multimodal data with deep belief nets. In International conference on machine learning workshop, 2012a.
  • Srivastava and Salakhutdinov [2012b] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012b.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • Tamura et al. [2015] Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, and Satoru Hayamizu. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 575–582. IEEE, 2015.
  • Viola and Jones [2001] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511. IEEE, 2001.
  • Zhao et al. [2009] Guoying Zhao, Mark Barnard, and Matti Pietikainen. Lipreading with local spatiotemporal descriptors. IEEE Transactions on Multimedia, 11(7):1254–1265, 2009.