Audio-visual speech recognition (AVSR) is motivated by the natural ability of humans to integrate cross-modal information. When people listen to speech in a noisy environment, they often unconsciously focus on the speaker's lips, which is of great benefit to human listening and comprehension. Even in clean speech, seeing the lips of the speaker influences perception, as demonstrated by the McGurk effect. It has been shown in many studies [31, 16, 10] that machine AVSR systems can also successfully improve performance on small-vocabulary tasks, when compared to their audio-only speech recognition (ASR) counterparts with otherwise equivalent setups. However, large-vocabulary tasks are still difficult for lipreading, because many phoneme pairs correspond to identical visemes, which makes certain words, such as “do” and “to”, virtually indistinguishable to a vision-only system.
This problem also leads to an inherent difficulty of AVSR on large-vocabulary tasks [30, 28], which is exacerbated by the fact that many multi-stream fusion approaches perform badly when the performance of the streams varies widely. In this work, we address this shortcoming by introducing a new stream fusion strategy that is impervious to such disparate single-stream recognition rates and can still benefit from low-quality streams in improving the results of highly reliable, clean audio data. To evaluate it in a realistic manner, we use a large-vocabulary dataset, the Lip Reading Sentences (LRS2) corpus, for all experiments, which we further augment by adding realistic noise and reverberation.
An effective fusion strategy for AVSR is decision fusion, which combines the decisions of multiple classifiers into a common decision. Decision fusion comes in different forms, such as dynamic stream weighting or state-based decision fusion (SBDF), e.g. in [1, 21, 14, 18]. An alternative fusion approach is the idea of fusing representations rather than decisions, e.g. via multi-modal attention. Another example in this direction is gating, where, for instance, a newly designed Gated Multimodal Unit is used for dynamically fusing feature streams within each cell of a network.
In this work, we argue that the ideas of representation fusion and decision fusion can be unified in a different fashion, namely, by using the posterior probabilities of single-modality hybrid models as our representation of the uni-modal streams.
This viewpoint opens up a range of new possibilities, centered around these single-modality representations. On the one hand, we can base the multi-modal model on pre-trained hybrid ASR models. On the other hand, we can learn recurrent and dynamic fusion networks, which can benefit from the reliability information that is inherent in the posterior probabilities, such as instantaneous entropy and dispersion, as well as from temporal context.
Overall, in the following, we compare our new approach with the performance of 4 baseline and oracle fusion strategies, which are detailed in Section II. The proposed fusion strategy is introduced in Section III. Section IV describes the set of reliability measures that are employed in all of the dynamic fusion approaches. The experiments are presented in Section V, while Section VI introduces and analyzes the results. Finally, in Section VII, we discuss the lessons learned and give an outlook on future work.
II Related work
There are many different fusion strategies in AVSR research. In this section, we give a brief introduction to the fusion strategies that are considered as baseline models in this work. In these baselines as well as in our own model, M = 3 single-modality models are combined, one acoustic and two visual: the acoustic model uses audio features, while the two visual models use shape-based and appearance-based video features, respectively; see Section V-B for details.
II-A Early integration
Early integration simply fuses the audio and visual information at the level of the input features, by concatenating the feature vectors of all streams in each frame into a single joint feature vector.
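In symbols, this concatenation can be sketched as follows. The notation is an assumption made for illustration and is not necessarily the notation of the original equation: $\mathbf{o}_t^{\mathrm{A}}$, $\mathbf{o}_t^{\mathrm{VS}}$ and $\mathbf{o}_t^{\mathrm{VA}}$ stand for the audio, shape-based video and appearance-based video feature vectors at frame $t$, and the superscript $\top$ denotes the transpose.

```latex
\mathbf{o}_t = \left[ \left(\mathbf{o}_t^{\mathrm{A}}\right)^{\top},\;
                      \left(\mathbf{o}_t^{\mathrm{VS}}\right)^{\top},\;
                      \left(\mathbf{o}_t^{\mathrm{VA}}\right)^{\top} \right]^{\top}
```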
II-B Dynamic stream weighting
Stream weighting is an effective method to fuse different streams. It addresses the problem that the various streams may be reliable and informative in very different ways. Hence, a number of works employ the strategy of weighting different modalities [10, 11, 18]. Many utilize static weights; for example, audio and video speech recognition models can be trained separately, and the DNN state posteriors of all modalities are then combined by constant stream weights. For each state and each time frame, the estimated combined log-posterior is the weighted sum, over all streams, of the log-posteriors of that state in the individual streams.
The problem of weight determination, however, cannot be neglected. It is clear that in good lighting conditions, the visual information may be more useful, while audio information is most beneficial in frames with a sufficiently high SNR. Therefore, the weights should be estimated dynamically to obtain optimal fusion results. As a baseline approach, we therefore consider dynamic stream weighting, which implements this idea. Here, the DNN state posteriors of all modalities are combined by a weighted sum of their log-posteriors, with stream weights that are estimated anew for every time frame; this combination rule is referred to as Equation (3) below.
The stream weights are estimated by a feedforward network from a set of reliability measures, introduced in detail in Sec. IV.
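The weighted combination of log-posteriors can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name, the stream labels "A", "VS" and "VA", and the toy numbers are hypothetical, and the weights are simply passed in rather than estimated by a network. Passing the same weights for every frame corresponds to static weighting, while passing frame-specific weights corresponds to dynamic weighting.

```python
import math


def fuse_log_posteriors(stream_log_posts, weights):
    """Weighted sum of stream-wise log-posteriors for a single frame.

    stream_log_posts: dict mapping a stream name to a list of
        log-posteriors, one entry per HMM state.
    weights: dict mapping the same stream names to the stream weights
        of this frame (assumed to be given here).
    Returns the combined log-pseudo-posterior per state; the result is
    not renormalized in this sketch.
    """
    names = list(stream_log_posts)
    n_states = len(stream_log_posts[names[0]])
    fused = [0.0] * n_states
    for name in names:
        for s in range(n_states):
            fused[s] += weights[name] * stream_log_posts[name][s]
    return fused


# Toy frame with two states and three streams (hypothetical values).
frame = {
    "A": [math.log(0.9), math.log(0.1)],
    "VS": [math.log(0.6), math.log(0.4)],
    "VA": [math.log(0.5), math.log(0.5)],
}
frame_weights = {"A": 0.7, "VS": 0.2, "VA": 0.1}
print(fuse_log_posteriors(frame, frame_weights))
```

Because the combination is linear in the log-posteriors, a stream with weight zero has no influence on the fused result for that frame.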
Reliability information has proven beneficial for multi-modal integration in many studies [16, 10, 12], where it is used to inform the integration model about the degree of uncertainty in all information streams over time. Different criteria for training the integration model have also been considered in earlier work. In this paper, we use two of them as our baselines, namely the mean squared error (MSE) and the cross-entropy (CE).
This learning-based approach to weighted stream integration can significantly improve the recognition rates in lower SNR conditions. Also, in contrast to many other stream integration strategies, such as [24, 29, 23], it does not suffer from a loss of performance relative to the best single modality when the modalities differ widely in their performance; rather, it gains accuracy even from the inclusion of less informative streams. This is a feature of great importance for the case at hand, as we need to design a system that allows for the inclusion of the visual modality even under clean conditions, where audio is far more informative than video data, without losing performance and, ideally, while still gaining some.
II-C Oracle weighting
We also compute optimal, or oracle, stream weights. These optimal dynamic stream weights are computed in such a way as to minimize the cross-entropy with respect to the ground-truth forced alignment information. Since this method requires a known text transcription of the test set, it is only useful for obtaining a theoretical upper performance bound for standard stream-weighting approaches. To minimize the cross-entropy, we use convex optimization via CVX.
The obtained oracle stream weights are then used to calculate the estimated log-posterior through Equation (3). As oracle stream weights yield the minimum cross-entropy between the fused posteriors and the ground-truth one-hot posteriors of the reference transcription computed by forced alignment, the corresponding results can be considered as the best achievable word error rate (WER) of a stream-weighting-based hybrid recognition system.
II-D End-to-end model
In recent years, end-to-end speech recognition has quickly gained widespread popularity. The end-to-end model predicts character sequences directly from the audio signal. Comparing the end-to-end model and the hybrid speech recognition model, the end-to-end model has a lower complexity and is more easily amenable to multi-lingual ASR. But there are also some advantages to using a hybrid model. For example, the hybrid model can be learned from and adapted to comparatively little data and it can easily integrate with task-specific WFST language models. Importantly for this work, hybrid models allow for integration at the level of the pseudo-posteriors, which is a place for interpretable stream integration.
Hence, in this work, we use the hybrid approach to train the single modality models. To compare the performance of our proposed system to that of end-to-end AVSR, we consider the end-to-end “Watch, Listen, Attend and Spell” model (WLAS)  as a baseline. In this model, the audio and video encoders are LSTM networks. The decoder is an LSTM transducer, which fuses the encoded audio and video sequences through a dual attention mechanism.
III System overview
In the following, we propose an architecture that centers around a decision fusion net (DFN), which learns to combine all modalities dynamically.
As shown in Fig. 1, it is based on the state posteriors of each modality, derived from one hybrid recognition model per stream, which we consider as our representation of instantaneous feature inputs. In addition, we provide the DFN with multiple reliability indicators as auxiliary inputs, which help in estimating the multi-modal log-posteriors for the decoder. As mentioned above, we consider M = 3 single-modality models, one acoustic and two visual. In each time frame, the DFN computes the fused log-posterior from the state posteriors of the audio model, of the appearance-based video model, and of the shape-based video model, together with a vector composed of the reliability measures at that time, which we describe in Sec. IV; this mapping is referred to as Eq. (4) below.
As an alternative to the posteriors of each stream, we have also considered a fusion of the log-posteriors, but settled on the linear posteriors due to better model convergence.
DFN training is then performed with the cross-entropy loss between the estimated log-posteriors and the target state probabilities, which are obtained by forced alignment of the clean acoustic training data. The estimated vector of log-posteriors is obtained from Eq. (4). Finally, the decoder uses these estimated log-posteriors to find the optimum word sequence by a graph search through the decoding graph.
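As an illustration of this training criterion, the following minimal Python sketch computes a cross-entropy loss for one utterance. It assumes one-hot targets given as state indices from the forced alignment and averages over frames; the averaging, the function name and the argument layout are assumptions made for this sketch, not details confirmed by the text.

```python
import math


def cross_entropy_loss(log_posteriors, target_states):
    """Mean cross-entropy between estimated log-posteriors and one-hot targets.

    log_posteriors: list over frames; each entry is a list with the
        estimated log-posterior of every state in that frame.
    target_states: list over frames with the index of the target state
        from the forced alignment (one-hot targets are assumed).
    """
    if len(log_posteriors) != len(target_states):
        raise ValueError("need exactly one target state per frame")
    if not target_states:
        raise ValueError("need at least one frame")
    total = 0.0
    for frame_log_posts, state in zip(log_posteriors, target_states):
        # With a one-hot target, only the target state contributes.
        total -= frame_log_posts[state]
    return total / len(target_states)


# Two frames, two states (hypothetical values).
estimates = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.3), math.log(0.7)],
]
print(cross_entropy_loss(estimates, [0, 1]))
```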
IV Reliability measures
To support the estimation of the dynamic stream weights, we extract a range of model-based and signal-based reliability measures (see Tab. I), generally computed as in previous work. All of these reliability indicators are used in the dynamic stream weighting baseline as well as in our proposed model.
The model-based measures are entropy, dispersion, posterior difference, temporal divergence, and the entropy and dispersion ratios. All model-based measures are computed from the log-posterior outputs of their respective single-modality models.
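Two of the model-based measures can be sketched in a few lines of Python. The definitions below are common textbook forms and are assumptions here, since the exact formulas are not given in the text: entropy is the Shannon entropy of the state posterior, and dispersion is taken as the mean pairwise difference between the K largest log-posteriors. The value K = 15 is a hypothetical default.

```python
import math


def entropy(log_posteriors):
    """Shannon entropy (in nats) of one frame, given its log-posteriors."""
    return -sum(math.exp(lp) * lp for lp in log_posteriors if lp > -math.inf)


def dispersion(log_posteriors, k=15):
    """Mean pairwise difference between the k largest log-posteriors.

    This is one common definition, assumed here for illustration.
    """
    top = sorted((lp for lp in log_posteriors if lp > -math.inf), reverse=True)[:k]
    n = len(top)
    if n < 2:
        return 0.0
    total = 0.0
    for l in range(n - 1):
        for m in range(l + 1, n):
            total += top[l] - top[m]
    return 2.0 * total / (n * (n - 1))


# A confident frame has low entropy and high dispersion (toy values).
confident = [math.log(p) for p in (0.97, 0.01, 0.01, 0.01)]
uncertain = [math.log(0.25)] * 4
print(entropy(confident), dispersion(confident))
print(entropy(uncertain), dispersion(uncertain))
```

Under these definitions, a flat posterior yields maximal entropy and zero dispersion, which is why both quantities can serve as indicators of how much a stream can be trusted in a given frame.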
Signal-based reliability measures for the audio data comprise the first 5 MFCC coefficients with their temporal derivatives ΔMFCC. The SNR is strongly related to the intelligibility of an audio signal. However, due to the realistic, highly non-stationary environmental noises (discussed in Sec. V) used in data augmentation and testing, conventional SNR estimation algorithms do not perform robustly. Therefore, the deep learning approach DeepXi is used instead to estimate the frame-wise SNR.
The pitch and its temporal derivative are also used as reliability indicators. It has been shown that a high pitch of the speech signal negatively affects the MFCC coefficients, due to insufficient smoothing of the pitch harmonics in the speech spectrum by the filterbank.
The probability of voicing is used as an additional cue. It is computed from the Normalized Cross-Correlation Function (NCCF) values for each frame.
OpenFace is used for face detection and facial landmark extraction. This allows us to use the confidence of the face detector in each frame as an indicator of the visual feature quality. The other signal-based video reliability measures, the Inverse Discrete Cosine Transform (IDCT) and the image distortion estimates, are the same as in previous work.
V Experimental Setup
The Oxford-BBC Lip Reading Sentences (LRS2) corpus is used for all experiments. It contains more than 144k sentences from British television. Table II gives an overview of the dataset size and partitioning. The pre-train set is usually used in AVSR tasks for video or audio-visual model pretraining. In this work, we combine the pre-train and training set to train all acoustic, visual, and AV models.
For the AVSR task, the results are often dominated by the acoustic model. To analyze the performance in different noisy environments and to counter the audio-visual model imbalance, we add acoustic noise to the LRS2 database. The ambient subset of the MUSAN corpus  is used as the noise source. It contains noises such as wind, footsteps, paper rustling, rain, as well as indistinct crowd noises. Seven different SNRs are selected randomly, from -9 dB to 9 dB in steps of 3 dB. We also generated data for a far-field AVSR scenario. As the LRS2 database does not contain highly reverberant data, we artificially reverberate the acoustic data by convolutions with measured impulse responses. These impulse responses also come from the MUSAN corpus. Both types of augmentation use Kaldi’s Voxceleb example recipe.
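The noise augmentation at a chosen SNR can be sketched as follows. This is a simplified illustration and not the Kaldi recipe itself: it assumes that speech and noise are equal-length lists of samples with non-zero power, and all function names are hypothetical.

```python
import math
import random

# The seven SNR values from -9 dB to 9 dB in steps of 3 dB.
SNRS_DB = list(range(-9, 10, 3))


def power(signal):
    """Mean power of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain to apply to the noise so that the mixture has the target SNR."""
    return math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))


def mix_at_snr(speech, noise, snr_db):
    """Add scaled noise to the speech signal, sample by sample."""
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals; a random SNR is drawn per utterance, as in the augmentation.
rng = random.Random(0)
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
snr_db = rng.choice(SNRS_DB)
noisy = mix_at_snr(speech, noise, snr_db)
```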
V-B Feature extraction
The audio model uses 40 log Mel features together with two pitch features (the pitch and its temporal derivative) and the probability of voicing, yielding 43-dimensional feature vectors. The audio features are extracted with a 25 ms frame size and a 10 ms frame shift. The video features are extracted per frame, i.e., every 40 ms. The video appearance model (VA) uses 43-dimensional IDCT coefficients of the grayscale mouth region of interest (ROI) as features. The video shape model (VS) is based on 34-dimensional non-rigid shape parameters.
Since the audio and video features have different frame rates, Bresenham's algorithm is used to align the video features before training the visual models. This algorithm gives the best first-order approximation for aligning audio and video frames given only a difference in frame rates.
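The alignment idea can be illustrated with a short Python sketch of the integer-only Bresenham line algorithm, applied to frame indices. It is a sketch under stated assumptions rather than the code used in this work: it assumes that video frames are repeated to match the audio frame rate, that there are at least as many audio frames as video frames, and the function name is hypothetical.

```python
def align_video_to_audio(n_video, n_audio):
    """Assign one video frame index to every audio frame.

    Walks the line from (0, 0) to (n_audio - 1, n_video - 1) with
    Bresenham's integer error accumulation, so that video frames are
    repeated as evenly as possible.
    """
    if n_video < 1 or n_audio < n_video:
        raise ValueError("expected 1 <= n_video <= n_audio")
    dx = n_audio - 1
    dy = n_video - 1
    err = 2 * dy - dx
    video_index = 0
    assignment = []
    for _ in range(n_audio):
        assignment.append(video_index)
        if err > 0:
            video_index += 1
            err -= 2 * dx
        err += 2 * dy
    return assignment


# One second of data: 25 video frames (40 ms) and 100 audio frames (10 ms).
indices = align_video_to_audio(25, 100)
print(indices[:12])
```

With a 40 ms video frame period and a 10 ms audio frame shift, each video frame is used for about four consecutive audio frames.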
V-C Implementation details
All our experiments are based on the Kaldi toolkit. As mentioned in Section V-A, both the pre-train and training sets are used together to train the acoustic and visual models. The initial HMM-GMM training follows the standard Kaldi AMI recipe, namely, monophone training followed by triphone training. A linear discriminant analysis (LDA) is applied to a stacked context of features to obtain discriminative short-term features. Finally, speaker adaptive training (SAT) is used to compensate for speaker variability. Each step produces a better forced alignment for later network training. HMM-DNN training uses the nnet2 p-norm network recipe, which is efficiently parallelizable.
Once HMM-DNN training has been performed, the acoustic model DNN and two visual observation models are available. They output estimated log-posteriors for each stream, which are then used in our proposed DFN. Its input consists of all stream-wise state-posteriors and the reliability measures.
As mentioned in Section III, the decoder obtains the best word sequence by graph search through a decoding graph using the estimated log-pseudo-posteriors. To ensure that all experiments and modalities search through the same decoding graph, we share the phonetic decision tree between all single modalities. Thus, the number of states for each modality is identical, specifically 3,856.
In addition, there are 41 reliability indicators, which leads to an overall input dimension of 3 × 3,856 + 41 = 11,609. The first three hidden layers have 8,192, 4,096, and 1,024 units, respectively, each using the ReLU activation function, layer normalization (LN), and a dropout rate of 0.15. They are followed by 3 BLSTM layers with 1,024 memory cells for each direction, using tanh as the activation function. Finally, a fully connected (FC) layer projects the data to the output dimension of 3,856. A log-softmax function is applied to obtain the estimated log-posteriors.
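The input dimension follows directly from the numbers above, as the following small Python check shows. The parameter count of the three fully connected layers is an added back-of-the-envelope estimate that counts only weights and biases and ignores layer normalization; the variable and function names are hypothetical.

```python
NUM_STREAMS = 3          # one acoustic and two visual models
NUM_STATES = 3856        # shared number of HMM states per stream
NUM_RELIABILITY = 41     # reliability indicators per frame

# Stream-wise state posteriors plus reliability measures.
input_dim = NUM_STREAMS * NUM_STATES + NUM_RELIABILITY

# Sizes of the input and the first three hidden layers.
layer_sizes = [input_dim, 8192, 4096, 1024]


def dense_parameter_count(sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


print("input dimension:", input_dim)
print("parameters in the three hidden layers:", dense_parameter_count(layer_sizes))
```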
Early stopping is used to avoid overfitting. We check for early stopping every 7,900 iterations, using the validation set. The training process is stopped if the validation loss does not decrease for 23,700 iterations, i.e., for three consecutive checks. Finally, the performance is evaluated on the test set. We performed two experiments with the proposed DFN strategy. The first uses the BLSTM-DFN, exactly as described above, while the second is an LSTM-DFN, which replaces the BLSTM layers by unidirectional (forward-only) LSTM layers.
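The early-stopping rule can be made concrete with a minimal Python sketch. It assumes that "does not decrease" is measured against the best validation loss seen so far, which is one plausible reading of the rule; the function name and the toy loss values are hypothetical.

```python
def early_stopping_iteration(val_losses, check_interval=7900, patience=23700):
    """Return the iteration at which training stops, or None.

    val_losses: validation losses measured at successive checks, i.e.
        after check_interval, 2 * check_interval, ... iterations.
    Training stops at the first check where the best validation loss is
    at least `patience` iterations old.
    """
    best_loss = float("inf")
    best_iteration = 0
    for check, loss in enumerate(val_losses, start=1):
        iteration = check * check_interval
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            return iteration
    return None


# The loss improves twice and then stalls for three checks (toy values).
print(early_stopping_iteration([1.00, 0.90, 0.95, 0.92, 0.91, 0.80]))
```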
In this section, we compare the performance of all baseline models and fusion strategies. Figure 2 gives an overview of the results of the audio-only model and compares the results of all baselines and our proposed BLSTM-DFN.
Comparing the audio-only model and the BLSTM-DFN integration, our fusion strategy is able to reduce the word error rate (WER) for every SNR, even for clean acoustic data. For lower SNRs, the DFN can reduce the WER by over 10% absolute. Our proposed BLSTM-DFN also achieves better results in many cases than the realistically unachievable oracle weighting (OW), which is based on ground-truth transcription information of the test set and can be considered as the upper limit for the dynamic stream-weighting approach of Equation (3). The end-to-end WLAS model is not able to improve the recognition rates compared to the audio-only model, which may in part be due to the fact that it does not employ an explicit language model.
Table III lists all the results of our experiments under additive noise. As expected, the audio-only model (AO) performs much better than the video-appearance (VA) and video-shape (VS) models. The average WERs of the visual models are over 80%, which illustrates that lipreading is still hard in large-vocabulary tasks. We have also employed a pre-trained spatio-temporal visual front-end to extract high-level visual features, without seeing improvements. We assume that this is due to insufficient training data as well as to insufficient generalization across speakers and recording conditions.
Early integration (EI) can also improve the results, but the improvement is not as significant as that of the proposed DFN approach. Comparing the BLSTM-DFN and the LSTM-DFN, the former shows a notable advantage in accuracy, albeit at the price of non-real-time performance. Both the LSTM-DFN and the BLSTM-DFN use recurrent layers with 1,024 memory cells. As the number of parameters in a BLSTM layer is almost double that of an LSTM layer, we also trained a BLSTM-DFN using 512 memory cells per layer. The average WER of this model is 16.14%, still better than that of the LSTM-DFN with a similar number of parameters. If we increase the number of cells for the LSTM-DFN to 2,048, with the same learning rate, the model suffers from convergence issues.
The dynamic stream weighting results, using the MSE or CE loss, are better than those reported previously, for three reasons. Firstly, improved reliability measures are used in this work. Secondly, the earlier work trains the acoustic and visual models only on the training set, whereas here, they are trained on both the pre-train and training data. This gives a significant performance boost to the single-modality systems and also to early integration, but is of less added benefit to the dynamic stream-weight estimation. Thirdly, the weight estimator in the earlier work was trained on the validation set, whereas here, it is also trained on the pre-train and training sets. We assume that the relatively small performance gain of dynamic stream weighting is due to the limited flexibility of its composition function.
Comparing the average WER over all acoustic conditions, the proposed BLSTM-DFN is greatly beneficial, outperforming the not realistically achievable OW system, and surpassing all feasible stream integration approaches by a clear margin. Thus, our proposed method outperforms even optimal dynamic stream weighting and therefore provides a fundamentally superior architecture compared to instantaneous stream weighting.
AO     EI     MSE    CE     OW     WLAS   LSTM-DFN  BLSTM-DFN
23.61  19.15  19.54  19.44  12.70  44.24  15.67     15.28
We also checked the case of far-field AVSR by using data augmentation to produce artificially reverberated speech, see Table VI for the results. The BLSTM-DFN still outperforms the other fusion strategies, but it is not as close to the OW. We suspect the reason is an insufficient amount of reverberant acoustic training signals.
Overall, it can be concluded that the introduced DFN is generally superior to instantaneous dynamic stream weighting. The latter can be considered as fusion at the frame level. Frame by frame, it sums the log-posteriors of each stream in a weighted fashion. Hence, its estimated combined log-posterior is a linear transformation of the single-modality log-posteriors. In contrast, the DFN can be considered as a cross-temporal fusion strategy at the state level, as the combined log-posterior is estimated through a non-linear transformation with memory. This allows for a more accurate estimation, in which the BLSTM-DFN has an added advantage over the LSTM-DFN, since it has access to both past and future contextual information. In this work, the BLSTM-DFN shows a relative WER reduction of 42.18% compared to the audio-only system, while the LSTM-DFN yields a relative WER improvement of 27.09%, showing the benefit of being able to lipread even for noisy LVCSR.
There are still many difficulties for large-vocabulary speech recognition under adverse conditions, but the fusion of acoustic and visual information can bring a significant benefit to these challenging and interesting tasks. In this paper, we propose a new architecture, the decision fusion net (DFN), in which we consider state posteriors of acoustic and visual models as appropriate stream representations for fusion. These are combined by the DFN, which uses stream reliability indicators to estimate the optimal state-posteriors for hybrid speech recognition. It comes in two flavors, a BLSTM-DFN with optimal performance, as well as an LSTM-DFN, which provides the option of real-time decoding.
We compare the performance of our proposed model to early integration as well as to conventional dynamic stream weighting models. In experimental results on noisy as well as on reverberant data, our proposed model shows significant improvements, with the BLSTM version giving a relative word-error-rate reduction of 42.18% over audio-only recognition, and outperforming all baseline models. The hybrid architecture with the proposed DFN clearly outperforms the end-to-end WLAS model, which we attribute to its factorization of stream evaluation, stream integration, and subsequent, language-model-supported, search. It is worth mentioning that, on average, the hybrid DFN model is even superior to a hybrid model with oracle stream weighting, which is an interesting result on its own, given that the latter provides a theoretical upper bound for instantaneous stream weighting approaches.
The natural next goal of our work is to focus on end-to-end audio-visual speech recognition models. Here, we are specifically interested in investigating reliability-supported fusion within the attention mechanism in CTC and transformer systems and in the possibilities that come with probabilistic intermediate representations for these architectures.
-  (2015) Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on 23 (5), pp. 863–876. Cited by: §I.
-  (2018) Deep audio-visual speech recognition. arXiv:1809.02108. Cited by: §I.
-  (2016) OpenFace: A general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: §IV, §V-B.
-  (2020) Gated multimodal networks. Neural Computing and Applications, pp. 1–20. Cited by: §I.
-  (2017) Lip reading sentences in the wild. In , pp. 6447–6456. Cited by: §II-D.
-  (2016) Eye can hear clearly now: inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration. Journal of Neuroscience 36 (38), pp. 9888–9895. Cited by: §I.
Robust feature extraction for continuous speech recognition using the MVDR spectrum estimation method. IEEE Transactions on Audio, Speech, and Language Processing 15 (1), pp. 224–234. Cited by: §IV.
A pitch extraction algorithm tuned for automatic speech recognition. In Proc. ICASSP, pp. 2494–2498. Cited by: §IV.
-  (2014-03) CVX: Matlab Software for Disciplined Convex Programming, version 2.1. Note: http://cvxr.com/cvx Cited by: §II-C.
-  (2008) Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition. In Proc. ICMI, pp. 237–240. Cited by: §I, §I, §II-B, §II-B.
-  (2002) Noise adaptive stream weighting in audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing 2002 (11), pp. 1–14. Cited by: §II-B.
-  (2013) Multistream recognition of speech: Dealing with unknown unknowns. Proceedings of the IEEE 101 (5), pp. 1076–1088. Cited by: §II-B.
-  (2006) Experiential sampling in multimedia systems. IEEE Transactions on Multimedia 8 (5), pp. 937–946. Cited by: §II-B.
-  (2001) Asynchronous stream modeling for large vocabulary audio-visual speech recognition. Vol. 1, pp. 169–172. Cited by: §I.
-  (1976) Hearing lips and seeing voices. Nature 264 (5588), pp. 746–748. Cited by: §I.
Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates. In Proc. ICASSP, pp. 5320–5324. Cited by: §I, §II-B.
-  (2008) Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing, pp. 559–584. Cited by: §III.
Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing 2002 (11), pp. 1–15. Cited by: §I, §II-B.
-  (2019) Deep learning for minimum mean-square error approaches to speech enhancement. Speech Communication 111, pp. 44–55. Cited by: §IV.
-  (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §V-C.
-  (2003) Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91 (9), pp. 1306–1326. Cited by: §I.
-  (2011) The Kaldi speech recognition toolkit. In Proc. IEEE, Cited by: §II-D, §V-C.
-  (2016) Turbo automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (5), pp. 846–862. Cited by: §II-B.
-  (2005) A new posterior based audio-visual integration method for robust speech recognition. In Proc. Interspeech, Cited by: §II-B.
-  (2015) MUSAN: A Music, Speech, and Noise Corpus. arXiv:1510.08484v1. Cited by: §V-A.
-  (1982) Using program transformations to derive line-drawing algorithms. ACM Transactions on Graphics (TOG) 1 (4), pp. 259–273. Cited by: §V-B.
-  (2017) Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105. Cited by: §VI.
-  (2020) How to teach DNNs to pay attention to the visual modality in speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 1052–1064. Cited by: §I.
-  (2013) Robust audio-visual speech recognition under noisy audio-video conditions. IEEE Transactions on Cybernetics 44 (2), pp. 175–184. Cited by: §I, §II-B.
-  (2018) Building large-vocabulary speaker-independent lipreading systems. In Proc. Interspeech, Cited by: §I.
-  (2017) Improving speaker-independent lipreading with domain-adversarial training. In Proc. Interspeech, Cited by: §I.
-  (2005) A multimodal fusion system for people detection and tracking. Int. J. Imaging Syst. Technol. 15, pp. 131–142. Cited by: §II-B.
-  (2020a) Audio-visual recognition of overlapped speech for the LRS2 dataset. In Proc. ICASSP, pp. 6984–6988. Cited by: §I.
-  (2020b) Multimodal integration for large-vocabulary audio-visual speech recognition. In Proc. EUSIPCO, pp. 341–345. Cited by: §II-B, §II-B, §II-C, §IV, §IV, §IV, §VI.
-  (2014) Improving deep neural network acoustic models using generalized maxout networks. In Proc. ICASSP, pp. 215–219. Cited by: §V-C.
-  (2019) Modality attention for end-to-end audio-visual speech recognition. In Proc. ICASSP. Cited by: §I.