Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments

08/28/2018 ∙ by Ahsan Adeel, et al. ∙ 0

Human speech processing is inherently multimodal, where visual cues (lip movements) help to better understand the speech in noise. Lip-reading driven speech enhancement significantly outperforms benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, at high SNRs or low levels of background noise, visual cues become fairly less effective for speech enhancement. Therefore, a more optimal, context-aware audio-visual (AV) system is required, that contextually utilises both visual and noisy audio features and effectively accounts for different noisy conditions. In this paper, we introduce a novel contextual AV switching component that contextually exploits AV cues with respect to different operating conditions to estimate clean audio, without requiring any SNR estimation. The switching module switches between visual-only (V-only), audio-only (A-only), and both AV cues at low, high and moderate SNR levels, respectively. The contextual AV switching component is developed by integrating a convolutional neural network and long-short-term memory network. For testing, the estimated clean audio features are utilised by the developed novel enhanced visually derived Wiener filter for clean audio power spectrum estimation. The contextual AV speech enhancement method is evaluated under real-world scenarios using benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality is used to evaluate the quality of the restored speech. For subjective testing, the standard mean-opinion-score method is used. The critical analysis and comparative study demonstrate the outperformance of proposed contextual AV approach, over A-only, V-only, spectral subtraction, and log-minimum mean square error based speech enhancement methods at both low and high SNRs, revealing its capability to tackle spectro-temporal variation in any real-world noisy condition.



There are no comments yet.


page 11

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The ability to partially understand speech by looking at the speaker’s lips is known as lip reading. Lip reading is helpful to all sighted people, both with normal and impaired hearing. It improves speech intelligibility in noisy environment, especially, when audio only perception is compared with AV perception. Lip reading confers benefits because the visible articulators, such as lips, teeth, and tongue are correlated with the resonance of the vocal tract. The correlation between the visible properties of the articulatory organs and speech reception has been previously shown in numerous behavioural studies sumby1954visual summerfield1979use mcgurk1976hearing patterson2003two . Early studies such as middelweerd1987effect macleod1990procedure have shown that lip reading could help tolerate an extra 4-6 dB of noise, where one decibel SNR gain is equivalent to 10-15% better intelligibility. In high-noise environments (and also in reverberation), the auditory representation in mid to high frequencies is often degraded summerfield1979use . In such cases, people with hearing impairment require the SNR to be improved by about 1 dB for every 10 dB of hearing loss, if they are to perceive speech as accurately as listeners with normal hearing duquesnoy1983effect .

In the literature, extensive research has been carried out to develop multimodal speech processing methods, which establishes the importance of multimodal information in speech processing katsaggelos2015audiovisual

. Researchers have proposed novel visual feature extraction methods

chen2013pattern waibel1990readings bundy1984linear zhou2014review sui2012discrimination , fusion approaches (early integration nefian2002dynamic , late integration snoek2005early , hybrid integration wu2006multi ), multi-modal datasets cooke2006audio patterson2002cuave , and fusion techniques wu2006multi iyengar2003discriminative noulas2006detection . The multimodal audiovisual speech processing methods have shown significant performance improvement in ASR, speech enhancement and speech separation berthommier2004phonetically chung2016lip wu2016multi noda2015audio gogate2018dnn . Recently, the authors in hou2018audio developed an audio-visual deep CNN (AVDCNN) speech enhancement model that integrates audio and visual cues into a unified network model. The proposed AVDCNN approach is structured as an audio-visual encoder-decoder where both audio and visual cues are processed using two separate CNNs, and later fused into a joint network to generate enhanced speech. However, the proposed AVDCNN approach relies only on deep CNN models. For testing, the authors used self-prepared dataset that contained video recordings of 320 utterances of Mandarin sentences spoken by a native speaker (only one speaker). In addition, the used noises include usual car engine, baby crying, pure music, music with lyrics, siren, one background talker (1T), two background talkers (2T), and three background talkers (3T). In summary, for testing only one Speaker (with 40 clean utterances mixed with 10 different noise types at 5 dB, 0 dB, and -5 dB SIRs) was considered.

In contrast to hou2018audio , the proposed speech enhancement framework leverages the complementary strengths of both deep learning and analytical acoustic modelling (filtering based approach). In addition, the introduced contextual AV component effectively accounts for different noisy conditions by contextually utilising both visual and noisy audio features. It is believed that our brain works in a same way by contextually exploiting and integrating multimodal cues such as lip-reading, facial expressions, body language, gestures, audio, and textual information to further enhance the speech quality and intelligibility. However, limited work has been conducted to contextually exploit multimodal AV switching for speech enhancement applications in different noisy scenarios.

The rest of the paper is organised as follows: Section 2 presents the proposed speech enhancement framework including contextual audio-visual switching and EVWF. Section 3 presents the employed AV dataset and audiovisual feature extraction methodology. In Section 4, comparative experimental results are presented. Finally, Section 5 concludes this work with some future research directions.

Figure 1: Contextual audio-Visual switching for speech enhancement. The system contextually exploits audio and visual cues to estimate clean audio features. Afterwards, the estimated clean audio features are feeded into the proposed enhanced visually derived Wiener filter for the estimation of clean speech spectrum.
Figure 2: Contextual audio-visual switching module: An integration of CLSTM and LSTM with 5 Prior Audio and Visual Frames (taking into account the current AV frame as well as the temporal information of previous 5 AV frames , , , , )

2 Speech Enhancement Framework

A schematic block diagram of the proposed context-aware speech enhancement framework is depicted in Figure 1. The proposed architecture comprises an audio and video feature extraction module, a context-aware AV switching component, a speech filtering component, and a waveform reconstruction module. The contextual AV switching component exploits both audio and visual features to learn the correlation between input (noisy audio and visual features) and output (clean speech) in different noisy scenarios. The designed context-aware AV switching component is composed of two deep learning architectures, an LSTM and CLSTM. The LSTM and CLSTM driven estimated clean audio features are then feeded into the designed EVWF to calculate the optimal Wiener filter, which is then applied to the magnitude spectrum of the noisy input audio signal, followed by the waveform reconstruction process. More details are comprehensively presented in Section 2.1 and Section 2.2.

2.1 Contextual Audio-Visual switching

The proposed deep learning driven context-aware AV switching architecture is depicted in Figure 2 that leverages the complementary strengths of CNN and LSTM. The term context-aware signifies the learning and adaptable capabilities of the proposed AV switching component, that helps exploiting the multimodal cues contextually to better deal with all possible speech scenarios (e.g. low, medium, high background noises, different environments (home, restaurant, social gathering etc)). The CLSTM network for visual cues processing consists of input layer, four Convolution layers (with 16, 32, 64, and 128 filters of size 3

5 respectively), four Max Pooling layers of size 2

2, and one LSTM layer (100 cells). The LSTM network for audio cues processing consists two LSTM layer. Visual features of time instance (k is the current time instance and 5 is the number of prior visual frames) were feeded into the CLSTM model to extract optimised features which were then exploited by 2 LSTM layers. Audio features of time instance

were feeded into the LSTM model with two LSTM layers. The first LSTM layer has 250 cells, which encoded the input and passed its hidden state to the second LSTM layer, which has 300 cells. Finally, the optimised latent features from both CLSTM and LSTM were fused using two dense layers. The architecture was trained with the objective to minimise the mean squared error (MSE) between the predicted and the actual audio features. The MSE (1) between the estimated audio logFB features and clean audio features was minimised using stochastic gradient decent algorithm and RMSProp optimiser. RMSprop is an adaptive learning rate optimiser which divides the learning rate by moving average of the magnitudes of recent gradients to make learning more efficient. Moreover, to reduce the overfitting, dropout (0.20) was applied after every LSTM layer. The MSE cost function

can be written as:


where and are the estimated and clean audio features respectively.

(a) State-of-the-art VWF almajai2011visually .
(b) Proposed enhanced VWF (EVWF)
Figure 5: Visually-derived Wiener filtering (VWF)

2.2 Enhanced Visually Derived Wiener Filter

The EVWF transforms the estimated low dimensional clean audio features into high dimensional clean audio power spectrum using inverse filter-bank (FB) transformation and calculates Wiener filter. The Wiener filter is then applied to the magnitude spectrum of the noisy input audio signal, followed by the inverse fast Fourier transform (IFFT), overlap, and combining processes to produce enhanced magnitude spectrum. The state-of-the-art VWF and designed EVWF are depicted in Figure

5 (a) and (b) respectively. The authors in almajai2011visually

presented a hidden Markov model-Gaussian mixture model (HMM/GMM) based two-level state-of-the-art VWF for speech enhancement. However, the use of HMM/GMM and cubic spline interpolation for clean audio features estimation and low to high dimensional transformation are not optimal choices. This is due to HMM/GMM models limited generalisation capability and poor audio power spectrum estimation of cubic spline interpolation method that fails to estimate the missing power spectral values. In contrast, the proposed EVWF exploited LSTM based FB estimation and an inverse FB transformation for audio power spectrum estimation which addressed both power spectrum estimation and generalisation issues in VWF. In addition, the proposed EVWF replaced the need of voice activity detector and noise estimation.

The frequency domain Wiener Filter is defined as:


where is the noisy audio power spectrum (i.e. clean power spectrum + noisy power spectrum) and

is the clean audio power spectrum. The calculation of the noisy audio power spectrum is fairly easy because of the available noisy audio vector. However, the calculation of the clean audio power spectrum is challenging which restricts the use of Wiener filter widely. Hence, for successful Wiener filtering, it is necessary to acquire the clean audio power spectrum. In this paper, the clean audio power spectrum is calculated using deep learning based lip reading driven speech model.

The FB domain Wiener filter () is given as almajai2011visually :


where is the FB domain lip-reading driven approximated clean audio feature and is the FB domain noise signal. The subscripts k and t represents the channel and audio frame.

The lip-reading driven approximated clean audio feature vector is a low dimensional vector. For high dimensional Wiener filter calculation, it is necessary to transform the estimated low dimensional FB domain audio coefficients into a high dimensional power spectral domain. It is to be noted that the approximated clean audio and noise features () are replaced with the noisy audio (a combination of real clean speech ( and noise )). The low dimension to high dimension transformation can be written as:


Where and are the low and high dimensional audio features respectively, and M is the number of audio frames. The transformation from (4) to (5) is very crucial to the performance of filtering. Therefore, the authors in almajai2011visually proposed the use of cubic spline interpolation method to determine the missing spectral values. In contrast, this article proposes the use of inverse FB transformation and is calculated as follows:



where are the boundary points of the filters and corresponds to the coefficient of the k-points DFT. The boundary points are calculated using:


where is the mel scale frequency

After substitutions, the obtained k-bin Wiener filter (5) is applied to the magnitude spectrum of the noisy audio signal (i.e. ) to estimate the enhanced magnitude audio spectrum (). The enhanced magnitude audio spectrum is given as:


The acquired time-domain enhanced speech signal (10) is then followed by the IFFT, overlap, and combining processes.

3 AudioVisual Corpus & Feature Extraction

In this section we present the developed AV ChiME3 corpus and our feature extraction pipeline.

3.1 Dataset

The AV ChiME3 corpus is developed by mixing the clean Grid videos cooke2006audio with the ChiME3 noises barker2015third (cafe, street junction, public transport (BUS), pedestrian area) for SNRs ranging from -12 to 12dB. The preprocessing includes sentence alignment and incorporation of prior visual frames. The sentence alignment is performed to remove the silence time from the video and prevent model from learning redundant or insignificant information. Prior multiple visual frames are used to incorporate temporal information to improve mapping between visual and audio features. The Grid corpus comprised of 34 speakers each speaker reciting 1000 sentences. Out of 34 speakers, a subset of 5 speakers is selected (two white females, two white males, and one black male) with total 900 command sentences each. The subset fairly ensures the speaker independence criteria. A summary of the acquired visual dataset is presented in Table 1, where the full and aligned sentences, total number of sentences, used sentences, and removed sentences are clearly defined.

Full Sentences Aligned Sentences
Speaker ID Grid ID No. of Sentences Removed Used Removed Used
Speaker 1 S1 1000 11 989 11 989
Speaker 2 S15 1000 164 836 164 836
Speaker 3 S26 1000 16 984 71 929
Speaker 4 S6 1000 9 991 9 991
Speaker 5 S7 1000 11 989 11 989
Table 1: Used Grid Corpus Sentences
Speakers Train Validation Test Total
1 692 99 198 989
2 585 84 167 836
3 650 93 186 929
4 693 99 199 991
5 692 99 198 989
All 3312 474 948 4734
Table 2: Summary of the Train, Test, and Validation Sentences

3.2 Audio feature extraction

The audio features are extracted using widely used log-FB vectors. For log-FB vectors calculation, the input audio signal is sampled at 50kHz and segmented into N 16ms frames with 800 samples per frame and 62.5% increment rate. Afterwards, a hamming window and Fourier transformation is applied to produce 2048-bin power spectrum. Finally, a 23-dimensional log-FB is applied, followed by the logarithmic compression to produce 23-D log-FB signal.

3.3 Visual feature extraction

The visual features are extracted from the Grid Corpus videos recorded at 25 fps using a 2D-DCT based standard and widely used visual feature extraction method. Firstly, the video files are processed to extract a sequence of individual frames. Secondly, a Viola-Jones lip detector viola2001rapid

is used to identify lip-region by defining the Region-of-Interest (ROI) in terms of bounding box. Object detection is performed using Haar feature-based cascade classifiers. The method is based on machine learning where cascade function is trained with positive and negative images. Finally, the object tracker

ross2008incremental is used to track the lip regions across the sequence of frames. The visual extraction procedure produced a set of corner points for each frame, where lip regions are then extracted by cropping the raw image. In addition, to ensure good lip tracking, each sentence is manually validated by inspecting few frames from each sentence. The aim of manual validation is to delete those sentences in which lip regions are not correctly identified abel2016data . Lip tracking optimisation lies outside the scope of the present work. The development of a full autonomous system and its testing on challenging datasets is in progress.

LSTM - V-only
Visual Frames
1 0.092 0.104
2 0.087 0.097
4 0.073 0.085
8 0.066 0.082
14 0.0024 0.0020
18 0.0016 0.019
Table 3: LSTM (V-only model) training and testing accuracy comparison for different prior visual frames

4 Experimental Results

Figure 6: Validation Results for A-only (A2A), V-only (V2A), and contextual AV (AV2A) mapping with 14 prior AV frames - All Speakers. The figure presents an overall behaviour of A2A, V2A, and AV2A mapping. It is to be noted that contextual AV mapping outperforms both A-only and V-only mappings.
Figure 7: PESQ results with ChiME3 noises. At both low and high SNR levels, EVWF with contextual AV module significantly outperformed SS, LMMSE, A-only, and V-only speech enhancement methods. V-only outperforms A-only at low SNRs and A-only outperforms V-only at high SNRs
(a) Clean
(b) SS enhanced
(c) Log MMSE Enhanced
(d) V-Only EVWF Enhanced
(e) A-Only EVWF Enhanced
(f) Contextual AV EVWF Enhanced
Figure 14: Spectrogram of a randomly selected utterance of -9dB SNR from AV ChiME3 corpus (X-axis: Time; Y-axis: Frequency (Hz)): (a) Clean (b) Spectral Subtraction Enhanced (c) Log MMSE Enhanced (d) Visual-only EVWF (e) Audio-only EVWF (f) Contextual AV EVWF. Note that EVWF with A-only, V-only, and AV outperformed SS and LMMSE. In addition, V-only recovered some of the frequency components better than A-only at low SNR
Figure 15: MOS for overall speech quality with ChiME3 noises. Similar to the PESQ results, at both low and high SNR levels, EVWF with contextual AV module significantly outperformed SS, LMMSE, A-only, and V-only speech enhancement methods. It is also to be noted that V-only outperforms A-only at low SNRs and A-only outperforms V-only at high SNRs. However, both A-only and V-only outperform SS and LMMSE at low SNRs and perform comparably at high SNRs.

4.1 Methodology

The AV switching module was trained over a wide-range of noisy AV ChiME3 corpus, ranging from -9dB to 9dB. A subset of dataset was used for neural network training (70%) and remaining 30% was used for network testing (20%) and validation (10%). Table 2 summarises the Train, Test, and Validation sentences. For generalisation (SNR independent) testing, the proposed framework was trained on SNRs ranging from -9 to 6dB, and tested on 9dB SNR.

4.2 Training and Testing Results

In this subsection, the training and testing performances of the proposed contextual AV model are compared with the A-only and V-only models. For preliminary data analysis, V-only model was first trained and tested with different number of prior visual frames, ranging from 1, 2, 4, 8, to 14, and 18 visual frames. The MSE results are depicted in Table 3. It can be seen that by moving from 1 visual frame to 18 visual frames, a significant performance improvement could be achieved. The LSTM model with 1 visual frame achieved the MSE of 0.092, whereas with 18 visual frames, the model achieved the least MSE of 0.058. The LSTM based learning model exploited the temporal information (i.e. prior visual frames) effectively and showed consistent reduction in MSE while going from 1 to 18 visual frames. This is mainly because of its inherent recurrent architectural property and the ability of retaining state over long time spans by using cell gates. The validation results of A-only (A2A), V-only (V2A), and AV (AV2A) are presented in Figure 6. Please note that all three models were trained with a prior 14 frames dataset (an optimal number of prior visual frames for lip movements temporal correlation exploitation). It can be seen that AV contextual model outperformed both A-only and V-only models in achieving least MSE.

4.3 Speech Enhancement

4.3.1 Objective Testing

For objective testing, PESQ test is used for restored speech quality evaluation. The PESQ score ranges from -0.5 to 4.5 representing least and best possible reconstructed speech quality. The PESQ scores for EVWF (with A-only, V-only, and contextual AV mapping), SS, and LMMSE for different SNRs are depicted in Figure 7. It can be seen that, at low SNR levels, EVWF with AV, A-only, and V-only significantly outperformed SS boll1979spectral and LMMSE ephraim1985speech based speech enhancement methods. At high SNR, A-only and V-only performed comparably to SS and LMMSE methods. It is also to be noted that at low SNR, V-only outperforms A-only and at high SNR, A-only outperforms V-only, validating the phenomena of less effective visual cues at high SNR and less effective audio cues at low SNR. In contrast, the contextual AV EVWF significantly outperformed SS and LMMSE based speech enhancement methods both at low and high SNRs. The low PESQ scores with ChiME3 corpus, particularly at high SNRs, can be attributed to the nature of the ChiMe3 noise. The latter is characterised by spectro-temporal variation, potentially reducing the ability of enhancement algorithms to restore the signal. However, the AV contextual switching component dealt better with the spectro-temporal variations. Figure 14 presents the spectrogram of the clean, SS, LMMSE, V-only, A-only, and contextual AV enhanced audio signals. It can be seen that EVWF with A-only, V-only, and AV outperformed SS and LMMSE at low SNR (-9db). Apparently, there is not much difference in the spectrogram of A-only, V-only, and AV. However, a closer inspection reveals that the V-only method approximating some of the frequency components better than A-only. The energy at specific time-frequency units are highlighted in the snazzy 3-D color spectrogram.

4.3.2 Subjective Listening Tests

To examine the effectiveness of the proposed contextual AV speech enhancement algorithm, subjective listening tests were conducted in terms of MOS with self-reported normal-hearing listeners using AV ChiME3 corpus. The listeners were presented with a single stimulus (i.e. enhanced speech only) and were asked to rate the re-constructed speech on a scale of 1 to 5. The five rating choices were: (5) Excellent (when the listener feels unnoticeable difference compared to the target clean speech) (4) Good (perceptible but not annoying) (3) Fair (slightly annoying) (2) Poor (annoying), and (1) Bad (very annoying). The EVWF with contextual AV module was compared with A-only, V-only, SS, and LMMSE. A total of 15 listeners took part in the evaluation session. The clean speech signal was corrupted with ChiME3 noises (at SNRs of -9dB, -6dB, -3dB, 0dB, 3dB, 6dB, and 9dB). Figure 15 depicts the performances of five different speech enhancement methods in terms of MOS with ChiME3 noises. Similar to the PESQ results, it can be seen that, at low SNR levels, EVWF with contextual AV, A-only, and V-only models significantly outperformed SS and LMMSE methods. At high SNRs, A-only and V-only performed comparably to SS and LMMSE methods. At low SNRs, V-only outperforms A-only and at high SNRs, A-only performs better than V-only, validating the phenomena of less effective visual cues at high SNR and less effective audio cues at low SNR. However, EVWF with contextual AV model significantly outperformed SS, LMMSE, A-only, and V-only methods even at high SNRs exploiting the complementary strengths of both audio and visual cues contextually.

5 Conclusion

In this paper, a novel AV switching component is presented that contextually utilises both visual and noisy audio features to approximate clean audio features in different noise exposures. The performance of our proposed contextual AV switch is investigated using novel EVWF and compared with A-only and V-only EVWF. The EVWF ability to reconstruct clean speech signal is directly proportional to AV mapping accuracy (i.e clean audio features estimation), better the mapping accuracy, more optimal Wiener filter can be designed for high quality speech reconstruction. Experimental results revealed that the proposed contextual AV switching module successfully exploited both audio and visual cues to better estimate the clean audio features in a variety of noisy environments. The EVWF with contextual AV model outperformed A-only EVWF, V-only EVWF, SS, and LMMSE based speech enhancement methods in terms of both speech quality and intelligibility at both low and high SNRs. At low SNRs, A-only and V-only outperformed SS and LMMSE but performed comparably at high SNRs. In addition, at low SNRs, V-only outperformed A-only and at high SNRs, A-only outperformed V-only. It validates the phenomena of less effective visual cues at high SNRs and less effective audio cues at low SNRs. However, the EVWF with contextual AV model better tackled the spectro-temporal variation at both low and high SNRS. Interestingly, in the snazzy 3-D colour spectrogram results, it is observed that visual cues helped approximating some of the frequency components better than A-only, where larger intensities at specific time-frequency are apparent. In future, we intend to further investigate the performance of our proposed contextual AV module and EVWF in real-world environments with speaker independent scenarios. Though, for the proof of concept, the current study fairly satisfies the speaker independence criteria where a subset of 5 diverse subjects (two white females, two white males, and one black male) with a total of 900 command sentences per speaker was considered.


This work was supported by the UK Engineering and Physical Sciences Research Council (EPSRC) Grant No. EP/M026981/1. The authors would like to acknowledge Dr Andrew Abel from Xian Jiaotong-Liverpool University for contribution in audio-visual feature extraction. The authors would also like to acknowledge Ricard Marxer and Jon Barker from the University of Sheffield, Roger Watt from the University of Stirling, and Peter Derleth from Sonova, AG, Staefa, Switzerland for their contributions. The authors would also like to thank Kia Dashtipour for conducting MOS test. Lastly, We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.


  • (1) W. H. Sumby and I. Pollack, “Visual contribution to speech intelligibility in noise,” The journal of the acoustical society of america, vol. 26, no. 2, pp. 212–215, 1954.
  • (2) Q. Summerfield, “Use of visual information for phonetic perception,” Phonetica, vol. 36, no. 4-5, pp. 314–331, 1979.
  • (3) H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” 1976.
  • (4) M. L. Patterson and J. F. Werker, “Two-month-old infants match phonetic information in lips and voice,” Developmental Science, vol. 6, no. 2, pp. 191–196, 2003.
  • (5) M. Middelweerd and R. Plomp, “The effect of speechreading on the speech-reception threshold of sentences in noise,” The Journal of the Acoustical Society of America, vol. 82, no. 6, pp. 2145–2147, 1987.
  • (6) A. MacLeod and Q. Summerfield, “A procedure for measuring auditory and audiovisual speech-reception thresholds for sentences in noise: Rationale, evaluation, and recommendations for use,” British journal of audiology, vol. 24, no. 1, pp. 29–43, 1990.
  • (7) A. Duquesnoy, “Effect of a single interfering noise or speech source upon the binaural sentence intelligibility of aged persons,” The Journal of the Acoustical Society of America, vol. 74, no. 3, pp. 739–743, 1983.
  • (8) A. K. Katsaggelos, S. Bahaadini, and R. Molina, “Audiovisual fusion: Challenges and new approaches,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1635–1653, 2015.
  • (9) C.-h. Chen, Pattern recognition and artificial intelligence.   Elsevier, 2013.
  • (10) A. Waibel and K.-F. Lee, Readings in speech recognition.   Morgan Kaufmann, 1990.
  • (11) A. Bundy and L. Wallen, “Linear predictive coding,” in Catalogue of Artificial Intelligence Tools.   Springer, 1984, pp. 61–61.
  • (12) Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, “A review of recent advances in visual speech decoding,” Image and vision computing, vol. 32, no. 9, pp. 590–605, 2014.
  • (13) C. Sui, R. Togneri, S. Haque, and M. Bennamoun, “Discrimination comparison between audio and visual features,” in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on.   IEEE, 2012, pp. 1609–1612.
  • (14)

    A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, “Dynamic bayesian networks for audio-visual speech recognition,”

    EURASIP Journal on Advances in Signal Processing, vol. 2002, no. 11, p. 783042, 2002.
  • (15) C. G. Snoek, M. Worring, and A. W. Smeulders, “Early versus late fusion in semantic video analysis,” in Proceedings of the 13th annual ACM international conference on Multimedia.   ACM, 2005, pp. 399–402.
  • (16) Z. Wu, L. Cai, and H. Meng, “Multi-level fusion of audio and visual features for speaker identification,” in International Conference on Biometrics.   Springer, 2006, pp. 493–499.
  • (17) M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
  • (18) E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, “Cuave: A new audio-visual database for multimodal human-computer interface research,” in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 2.   IEEE, 2002, pp. II–2017.
  • (19) G. Iyengar and H. J. Nock, “Discriminative model fusion for semantic concept detection and annotation in video,” in Proceedings of the eleventh ACM international conference on Multimedia.   ACM, 2003, pp. 255–258.
  • (20) A. K. Noulas and B. J. Kröse, “Em detection of common origin of multi-modal cues,” in Proceedings of the 8th international conference on Multimodal interfaces.   ACM, 2006, pp. 201–208.
  • (21) F. Berthommier, “A phonetically neutral model of the low-level audio-visual interaction,” Speech Communication, vol. 44, no. 1, pp. 31–41, 2004.
  • (22) J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” arXiv preprint arXiv:1611.05358, 2016.
  • (23) Z. Wu, S. Sivadas, Y. K. Tan, M. Bin, and R. S. M. Goh, “Multi-modal hybrid deep neural network for speech enhancement,” arXiv preprint arXiv:1606.04750, 2016.
  • (24) K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Audio-visual speech recognition using deep learning,” Applied Intelligence, vol. 42, no. 4, pp. 722–737, 2015.
  • (25) M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, “Dnn driven speaker independent audio-visual mask estimation for speech separation,” arXiv preprint arXiv:1808.00060, 2018.
  • (26) J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual speech enhancement using multimodal deep convolutional neural networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
  • (27) I. Almajai and B. Milner, “Visually derived wiener filters for speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1642–1651, 2011.
  • (28) J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘chime’speech separation and recognition challenge: Dataset, task and baselines,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on.   IEEE, 2015, pp. 504–511.
  • (29) P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1.   IEEE, 2001, pp. I–I.
  • (30) D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
  • (31) A. Abel, R. Marxer, J. Barker, R. Watt, B. Whitmer, P. Derleth, and A. Hussain, “A data driven approach to audiovisual speech mapping,” in Advances in Brain Inspired Cognitive Systems: 8th International Conference, BICS 2016, Beijing, China, November 28-30, 2016, Proceedings 8.   Springer, 2016, pp. 331–342.
  • (32) S. Boll, “A spectral subtraction algorithm for suppression of acoustic noise in speech,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’79., vol. 4.   IEEE, 1979, pp. 200–203.
  • (33) Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.