Humans integrate cues from multiple sensory organs, such as the ears and eyes, for reliable perception of real-world data. When data from one sensory organ, such as the ear, is corrupted by noise, the human brain uses other senses, such as sight, to reduce the uncertainty. In conversational interfaces, speech is the primary mode of communication, with visual cues augmenting the information exchange. The McGurk effect [1] is one example in speech perception where humans integrate audio and visual cues. Visual cues typically provide information about place of articulation [2] and lip shapes that aid in discriminating phonemes with similar acoustic characteristics. In acoustically noisy environments, the visual cues help in disambiguating the target speaker from the surrounding audio sources.
There are various computational models of multi-modal information fusion [3] for audio-visual speech processing. Deep learning provides an elegant framework for designing data-driven models for multi-modal and cross-modal feature learning [4, 5]. In [4], stacks of Restricted Boltzmann Machines (RBMs) [6] were trained to learn joint representations between acoustic features of phonemes and images of the mouth region. The resulting bimodal deep autoencoder, with a shared hidden-layer representation, was able to capture the higher-level correlation between acoustic features and visual cues.
In all of the above representations, features of both modalities are learned through fully connected DNNs. Thus, these models are homogeneous in their architecture even though their input modalities are heterogeneous. It is well known that human visual processing is better modeled by CNNs [7]. Higher-level feature processing in the human brain also typically involves units that model long-term dependencies among lower-level features. Deep learning models with memory cells, such as LSTM [8] and BiLSTM [9] networks, have outperformed fully connected DNNs and CNNs in noise-robust speech recognition [10, 11]. In this paper, we propose a novel hybrid deep learning architecture in which the acoustic features are first extracted by a fully connected DNN and the visual cues by a CNN. Higher-level long-term dependencies among these auditory and visual features are modeled by a BiLSTM. The parameters of this multi-modal hybrid network are jointly optimized using backpropagation. The models are validated on an artificially corrupted audio-visual database [12].
In the following sections, we present a brief background of existing multi-modal and hybrid deep learning models (Section 2). Subsequently, the architecture of the proposed hybrid model is presented in detail (Section 3). Finally, we report experimental details (Sections 4 and 5) and conclude (Section 6).
2 Related Work
Recently, there has been increased interest in heterogeneous deep learning architectures [13, 14]. These architectures combine the strengths of constituent deep learning models to learn better high-level abstractions of features. In [14], an ensemble model for phoneme recognition was proposed in which a CNN and an RNN were first independently trained to compute “low-level” features. A linear ensemble model was then trained to combine the posterior probabilities from these lower-level classifiers. This model followed the strategy of stacking classifiers [15] to achieve better discrimination and generalization. In [13], CNNs, LSTMs and DNNs are combined into a unified framework. First, a CNN is used to reduce spectral variability, and its output features are fed into an LSTM to reduce temporal variability. Finally, the output of the LSTM is processed by a DNN, and the whole model is trained jointly. The multi-modal deep learning model proposed in [4] used sparse RBMs to combine the different lower-level modalities. The model we propose combines the strengths of the above models. It has a fully connected DNN that takes a few frames of acoustic features as input, and an image-processing CNN that computes a higher-level image representation of the lip movements over the same window. The features from these models are concatenated to form a shared representation, which is fed into a BiLSTM to capture the temporal and spatial inter-dependencies between the audio and visual features. We train the entire model jointly to reconstruct cleaned spectral features. We call this model a BiModal-BiLSTM. The next section explains the proposed model in more detail.
3 BiModal-BiLSTM Model
In the BiModal-BiLSTM model, we take in an image channel and an audio channel at each time-step. For the image channel, we use a CNN to extract a high-level feature representation, and for the audio channel, we use a DNN to transform the audio features into a learned representation at the upper layer of the DNN. We then concatenate the two features and pass the joint representation into a BiLSTM model, which consists of a forward LSTM (FLSTM) and a backward LSTM (BLSTM).
The FLSTM and BLSTM are standard LSTM models, as defined in [9, 16], except that they unroll in opposite time directions. The concatenated feature contains bimodal information from audio and image. The FLSTM output feature contains information from past frames, and the BLSTM output feature contains information from future frames. Therefore, when we sum these two features and use the sum to reconstruct the enhanced speech frame with a fully connected layer, the enhanced frame has access to both past and future context from both the audio and image channels, which helps in speech enhancement. Figure 1 shows the schematic of the hybrid model.
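The data flow described in this section can be summarised in equations. The following is only a sketch consistent with the prose: all symbols ($x^{v}_t$, $x^{a}_t$, $h^{v}_t$, $h^{a}_t$, $f_t$, $b_t$, $W$, $c$) are notation assumed here for illustration, and the LSTM cell state is omitted for brevity.

```latex
\begin{align}
h^{v}_t   &= \mathrm{CNN}(x^{v}_t)               && \text{image feature at time-step } t \\
h^{a}_t   &= \mathrm{DNN}(x^{a}_t)               && \text{audio feature at time-step } t \\
h_t       &= [\, h^{a}_t \,;\, h^{v}_t \,]       && \text{concatenated shared representation} \\
f_t       &= \mathrm{FLSTM}(h_t,\, f_{t-1})      && \text{unrolled forward in time} \\
b_t       &= \mathrm{BLSTM}(h_t,\, b_{t+1})      && \text{unrolled backward in time} \\
\hat{y}_t &= W\,(f_t + b_t) + c                  && \text{fully connected reconstruction of the enhanced frame}
\end{align}
```

Because $f_t$ depends only on frames up to $t$ and $b_t$ only on frames from $t$ onwards, their sum is the point at which past and future context from both modalities are combined.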
4 Baseline Models
To understand the effectiveness of the BiModal-BiLSTM model, we designed two baseline models with a similar number of parameters to answer two questions:
Does having an additional image modality help in model generalization for speech enhancement?
Does the BiLSTM work better than a purely feed-forward neural network?
The second question has already been answered in speech recognition and speech enhancement [10, 11] on speech datasets, but it will be interesting to compare the models alongside our BiModal-BiLSTM model.
The Single-Channel-BiLSTM has the same architecture as our BiModal-BiLSTM model, except that we removed the CNN image feature extractor (Equation 1), and only use the noisy audio channel as input. Everything else is kept the same to ensure that any difference in the final generalization result is due to the CNN image feature extractor.
In the Single-Channel-DNN, we take the noisy audio as input and enhance it directly with a DNN [17]. The Single-Channel-DNN has the same DNN architecture as the BiModal-BiLSTM and Single-Channel-BiLSTM (Equation 2). However, to ensure that the total number of parameters in the Single-Channel-DNN matches that of the Single-Channel-BiLSTM, we appended two extra fully connected layers, so that differences in the final generalization result are due to the difference in network architecture rather than to a different number of parameters.
5 Experimental Details
5.1 Experimental Data
We conducted our experiments on an audiovisual dataset consisting of 14 native American English speakers [12]. There are 94 recorded files for each speaker, ranging from short single-word clips to long recordings of multiple full sentences. We extracted non-speech environmental noises from an online corpus [18] (all noise samples in the same category were concatenated).
For our test set, we used two of the longer audio files (CID Sentences List A and NU Auditory Test No.6 List I) for each speaker. Other samples in the dataset were used to construct the training set. We corrupted each sample with each of the noise types at a selected Signal-to-Noise Ratio (SNR); corruption starts at a randomly selected point in each noise clip, and noise clips that are too short are repeated. For the training samples, we randomly selected an integer SNR in the range [-5,5] dB. In total, this gave us roughly 20.7 hours of stereo training data. For the test data, we corrupted with SNRs in steps of 3 dB in the range [-6,9] dB. The training noise types were: alarm, animal, crowd, water and water; traffic noise was only used for (unseen) testing.
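The corruption procedure can be illustrated with a short NumPy sketch. It assumes the SNR is defined as the ratio of mean signal power to mean noise power over the whole clip; the function name `mix_at_snr` and the synthetic signals are hypothetical and only serve the illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng):
    """Add `noise` to `clean` so that the mixture has the requested SNR in dB."""
    # Start at a random point in the noise clip and wrap around if it is too short.
    start = rng.integers(0, len(noise))
    idx = (start + np.arange(len(clean))) % len(noise)
    noise_seg = noise[idx]

    # Scale the noise so that 10 * log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise_seg ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise_seg

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # synthetic stand-in for speech
noise = rng.standard_normal(4000)                           # deliberately shorter than the clip
snr_db = int(rng.integers(-5, 6))                           # integer SNR in [-5, 5], as for training
noisy = mix_at_snr(clean, noise, snr_db, rng)
```

The modulo indexing implements both the random starting point and the repetition of short noise clips in a single step.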
We extracted the log power spectrum from the audio component of each sample using a 320-point STFT with a 0.02s window and 0.01s overlap. For the input to our network, we further extracted the first and second temporal derivatives for each frame and then reduced the number of dimensions to 100 using Principal Component Analysis (PCA). For the models that use visual inputs, we manually took a 100 by 160 crop around the mouth region of each speaker and further down-sampled the crop to 64 by 64 for training.
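A minimal NumPy sketch of this audio feature pipeline follows. It is a stand-in for illustration, not the off-the-shelf code used in the experiments. Assumptions: a 16 kHz sampling rate (so that 320 samples span 0.02s and the hop is 160 samples), a Hann window, simple finite differences for the derivatives, and a PCA fitted on a single synthetic signal rather than on the training set.

```python
import numpy as np

def log_power_spectrum(signal, n_fft=320, hop=160):
    """Frame the signal, apply a Hann window and return the log power per frequency bin."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    spectrum = np.fft.rfft(frames * np.hanning(n_fft), axis=1)
    return np.log(np.abs(spectrum) ** 2 + 1e-10)      # shape: (n_frames, n_fft // 2 + 1)

def add_deltas(features):
    """Append first and second temporal derivatives (finite differences along time)."""
    delta = np.gradient(features, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([features, delta, delta2], axis=1)

def pca_reduce(features, n_components=100):
    """Project mean-centred features onto their top principal components."""
    centred = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

rng = np.random.default_rng(0)
audio = rng.standard_normal(2 * 16000)           # 2 s of synthetic audio
lps = log_power_spectrum(audio)                  # 161 frequency bins per frame
network_input = pca_reduce(add_deltas(lps))      # 3 * 161 = 483 dims reduced to 100
```

In practice the PCA projection would be estimated once on the training data and then reused unchanged for validation and test samples.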
Our models are trained to recover the log power spectrum of the clean audio samples from the corrupted input samples. To complete the reconstruction, we perform an inverse STFT using the recovered power spectrum together with the phase spectrum of the corrupted input. All data manipulation was done using off the shelf packages [19, 20].
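The reconstruction step can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions: a natural-log power spectrum, a periodic Hann analysis window with 50% overlap (so overlapping windows sum to one and plain overlap-add inverts the transform), and a placeholder in place of the network output. The helper names `stft`, `istft` and `reconstruct` are hypothetical.

```python
import numpy as np

N_FFT, HOP = 320, 160
WINDOW = np.hanning(N_FFT + 1)[:-1]   # periodic Hann: shifted copies sum to 1 at 50% overlap

def stft(signal):
    """Complex spectrum of each windowed frame, shape (n_frames, N_FFT // 2 + 1)."""
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + N_FFT] for i in range(n_frames)])
    return np.fft.rfft(frames * WINDOW, axis=1)

def istft(spectrum, length):
    """Invert each frame and overlap-add the results into a signal of the given length."""
    frames = np.fft.irfft(spectrum, n=N_FFT, axis=1)
    out = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
    return out

def reconstruct(enhanced_log_power, noisy_signal):
    """Combine the recovered power spectrum with the phase of the corrupted input."""
    noisy_phase = np.angle(stft(noisy_signal))
    magnitude = np.sqrt(np.exp(enhanced_log_power))
    return istft(magnitude * np.exp(1j * noisy_phase), len(noisy_signal))

rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)
# Placeholder for the network output: the log power spectrum of the input itself.
log_power = np.log(np.abs(stft(noisy)) ** 2 + 1e-12)
restored = reconstruct(log_power, noisy)
```

With the placeholder above, the output matches the input everywhere except the first and last half-window, which are covered by only one frame.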
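Table 1: Specification of the CNN image feature extractor (columns: Kernel, Stride, Number of Filters).
Table 2: Number of parameters for each model (columns: Model, No. of parameters).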
5.2 Model Specification
In order to ensure a fair comparison, we chose model sizes such that all models have roughly the same number of parameters. Table 2 shows the number of parameters for each model. The DNN audio feature extractor in Equation 2 has architecture 100k-500-300-d, where 100 is the PCA dimension for one frame, k is the number of frames stacked together and d is the dimensionality of the audio feature representation. We set d to 350 for BiModal-BiLSTM and 400 for Single-Channel-BiLSTM and Single-Channel-DNN. Table 1 shows the specification of the CNN image feature extractor from Equation 1. The Single-Channel-DNN consists of the DNN audio feature extractor and two hidden layers of dimensions 1000-500. The Single-Channel-BiLSTM also has the DNN audio feature extractor, followed by one BiLSTM layer with 400 input dimensions and 200 output dimensions, and a fully connected layer of 200. The BiModal-BiLSTM has the same audio architecture as the Single-Channel-BiLSTM, but with an additional CNN image feature extractor, as depicted in Figure 1. Since we expect the audio component to contain much more information about the speech than the lip movements in the image, we bias the concatenated shared representation to have 350 dimensions from the audio DNN but only 50 dimensions from the image CNN. In all fully connected and convolutional layers, we used batch normalization [21] to reduce the internal covariate shift of the outputs from one layer to another. In our experiments, we found that this ensures stable convergence.
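Matching parameter counts across architectures is simple arithmetic, sketched below. The helpers are hypothetical and make simplifying assumptions: standard LSTM cells without peephole connections, and no batch-normalization or output-layer parameters, so the figures will not reproduce the totals in Table 2.

```python
def dense_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def lstm_params(n_in, n_hidden):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

def bilstm_params(n_in, n_hidden):
    """A forward and a backward LSTM of identical size."""
    return 2 * lstm_params(n_in, n_hidden)

# Audio DNN of the Single-Channel-BiLSTM with one input frame: 100-500-300-400.
audio_dnn = dense_params([100, 500, 300, 400])
# BiLSTM layer with 400 input dimensions and 200 output dimensions.
recurrent = bilstm_params(400, 200)
print(audio_dnn, recurrent)
```

Under these assumptions the recurrent layer holds considerably more parameters than the audio DNN that feeds it, which is why the feed-forward baseline needs extra layers to reach a comparable size.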
5.3 Model Training
All the models were trained on NVIDIA Tesla K20 GPUs using Theano [22] and Mozi (https://github.com/hycis/Mozi.git). We used Adam [23] as the learning algorithm and Mean-Squared-Error as the objective to be minimized. We held out 10% of the training data as the validation set and stopped training when the validation error had not improved by at least 1% over 5 epochs. This guards against over-fitting to the training data. We normalised all audio input dimensions to have zero mean and unit variance, and scaled the image pixel intensities to [0,1]. This pre-processing step is important to reduce covariate shift across dimensions and to ensure that each dimension passes a signal of equal intensity to the network.
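The stopping rule and the input normalisation can be written compactly. The sketch below reflects one reading of the criterion, namely that training stops when the best error of the last 5 epochs is not at least 1% below the best error seen before them; the function names are hypothetical.

```python
import numpy as np

def should_stop(val_errors, patience=5, min_rel_improvement=0.01):
    """True once the last `patience` epochs fail to beat the earlier best by 1%."""
    if len(val_errors) <= patience:
        return False
    best_before = min(val_errors[:-patience])
    best_recent = min(val_errors[-patience:])
    return best_recent > best_before * (1.0 - min_rel_improvement)

def standardize(train, other):
    """Zero mean and unit variance per dimension, using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8   # guard against constant dimensions
    return (train - mean) / std, (other - mean) / std

def scale_pixels(images_uint8):
    """Map 8-bit pixel intensities to the range [0, 1]."""
    return images_uint8.astype(np.float32) / 255.0
```

Computing the mean and standard deviation on the training portion alone keeps the validation set independent of the statistics used to normalise it.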
For the Single-Channel-DNN model, we used a window of 11 frames of the noisy spectrum for each output frame of the clean spectrum. For the BiLSTM models, each input time-step takes in 1 frame of speech and image. We also tried windows of 3 to 7 frames for each input time-step, but we found that 1 frame worked best.
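Building the 11-frame input window amounts to concatenating each frame with its neighbours. The sketch below assumes the window is centred on the target frame (5 frames on each side) and that edge frames are repeated at the utterance boundaries; neither detail is stated above, and `stack_context` is a hypothetical helper.

```python
import numpy as np

def stack_context(features, n_context=5):
    """Concatenate each frame with its n_context left and right neighbours."""
    n_frames = features.shape[0]
    # Repeat the first and last frames so that every frame has a full window.
    padded = np.pad(features, ((n_context, n_context), (0, 0)), mode="edge")
    windows = [padded[i:i + n_frames] for i in range(2 * n_context + 1)]
    return np.concatenate(windows, axis=1)

frames = np.arange(20 * 100, dtype=float).reshape(20, 100)  # 20 frames of 100 PCA dims
stacked = stack_context(frames)                             # 11 * 100 dims per frame
```

Setting `n_context=0` returns the features unchanged, which corresponds to the single-frame input used for the BiLSTM models.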
For the BiLSTM models, we unrolled the model with 21 time-steps and trained with back-propagation through time [24]. We found that this gave a good balance between training time and model accuracy. Table 3 shows the final Mean-Squared-Error (MSE) on the validation set. It can be seen that the proposed model has the lowest error, which indicates that the visual cues help in denoising the acoustic features.
We use the Perceptual Evaluation of Speech Quality (PESQ) [25], which has a high correlation with subjective evaluation scores, as our objective measure for evaluating the quality of denoised speech. Figure 2 shows the average PESQ score of speech enhanced by different models on test utterances corrupted with seen noise (alarm and crowd) and unseen noise (traffic) at different SNRs. Table 4 shows the mean PESQ score across all speakers and all SNRs for the various models. We note that the mean PESQ scores are consistent with the MSE on the validation set. The BiModal-BiLSTM performs best across all seen noises and SNRs, but its performance is closer to that of the Single-Channel-BiLSTM under the (unseen) traffic noise conditions. Both BiLSTM models significantly outperform the DNN model. Figure 3 shows the spectrogram of speech corrupted by alarm noise and enhanced by the different models. It can be seen that the noise is highly non-stationary and overlaps significantly with the speech spectral characteristics. All the models denoise reasonably well. This shows that visual information from lip movements indeed provides additional information for enhancing speech, and that a recurrent neural network is an effective model for learning this bimodal audio-visual information. Since the information provided by the visual stream can only discriminate the place of articulation, we initially suspected that most of the gains were coming from the suppression of noise in the silence frames. However, as can be seen from the spectrogram, the BiModal-BiLSTM also provides more detail in the speech segments.
6 Conclusion
Higher-level information processing in human perception involves multi-sensory integration and modeling of long-term dependencies among the sensory data. Strategies involve integrating cues from multiple senses based on their reliability or Signal-to-Noise Ratio (SNR). In this paper, motivated by the insights gleaned from human sensory perception, we have proposed a novel multi-modal hybrid deep neural network architecture. The model captures intermediate-level representations of speech and images through a fully connected DNN and a CNN, respectively. The long-term dependencies in the intermediate representation are modeled by a BiLSTM. We validated the model on audio-visual speech enhancement, where the task is to estimate clean speech spectra from noisy input speech spectra and images of the corresponding lip region. It is expected that the hybrid model learns to adjust the importance of the audio and visual streams intrinsically, based on the uncertainty in the audio stream. The hybrid model is trained jointly using the backpropagation algorithm. We show that the proposed model achieves a higher average PESQ score over a range of non-stationary noises and SNRs.
-  Harry McGurk and John MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, pp. 746–748, 1976.
-  Quentin Summerfield, “Lipreading and audio-visual speech perception,” Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 335, no. 1273, pp. 71–78, 1992.
-  Gerasimos Potamianos, Chalapathy Neti, Juergen Luettin, and Iain Matthews, “Audio-visual automatic speech recognition: An overview,” Issues in visual and audio-visual speech processing, vol. 22, pp. 23, 2004.
-  Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696.
-  Nitish Srivastava and Ruslan R Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in neural information processing systems, 2012, pp. 2222–2230.
-  Geoffrey E Hinton and Ruslan R Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  Alex Graves and Jürgen Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
-  Andrew L Maas, Quoc V Le, Tyler M O’Neil, Oriol Vinyals, Patrick Nguyen, and Andrew Y Ng, “Recurrent neural networks for noise reduction in robust ASR,” in INTERSPEECH, 2012, pp. 22–25.
-  Carolyn Richie, Sarah Warburton, and Megan Carter, “Audiovisual database of spoken American English LDC2009V01,” Philadelphia: Linguistic Data Consortium, 2009, Web Download.
-  Tara N Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.
-  Li Deng and John C Platt, “Ensemble deep learning for speech recognition,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  David H Wolpert, “Stacked generalization,” Neural networks, vol. 5, no. 2, pp. 241–259, 1992.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
-  Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “A regression approach to speech enhancement based on deep neural networks,” Audio, Speech, and Language Processing, IEEE/ACM Transactions on, vol. 23, no. 1, pp. 7–19, 2015.
-  Guoning Hu, “100 nonspeech sounds,” http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html, Web Download, Accessed: 2016-03-25.
-  Brian McFee, Matt McVicar, Colin Raffel, Dawen Liang, Oriol Nieto, Eric Battenberg, Josh Moore, Dan Ellis, Ryuichi Yamamoto, Rachel Bittner, Douglas Repetto, Petr Viktorin, João Felipe Santos, and Adrian Holovaty, “librosa: 0.4.1,” Oct. 2015.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
-  Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 448–456.
-  James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010, Oral Presentation.
-  Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Paul J Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
-  Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP’01). 2001 IEEE International Conference on. IEEE, 2001, vol. 2, pp. 749–752.