Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition

07/28/2020
by Wentao Yu, et al.

For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve recognition rates over audio-only systems. However, there is still an ongoing debate about the best strategy for combining multimodal information, one that would carry these gains over to large-vocabulary recognition. While integration at the level of state-posterior probabilities, using dynamic stream weighting, is almost universally helpful for small-vocabulary systems, recognition accuracy in large-vocabulary speech recognition remains difficult to improve. In the following, we specifically consider the large-vocabulary task of the LRS2 database, and we investigate a broad range of integration strategies, comparing early integration and end-to-end learning with many versions of hybrid recognition and dynamic stream weighting. One aspect that is shown to provide much benefit here is the use of dynamic stream reliability indicators, which allow hybrid architectures to profit strongly from the inclusion of visual information whenever the audio channel is distorted even slightly.
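
To make the idea of integration at the level of state-posterior probabilities more concrete, the following is a minimal sketch of one common formulation of stream weighting: a per-frame, log-linear combination of the audio and video state posteriors. It is an illustration under stated assumptions, not the system described above. The function names `fuse_log_posteriors` and `fuse_sequence` are hypothetical, and the per-frame audio weights are simply taken as given; how such weights might be estimated from stream reliability indicators is not shown.

```python
import math


def fuse_log_posteriors(log_p_audio, log_p_video, audio_weight):
    """Fuse the per-state log-posteriors of two streams for a single frame.

    Assumption: a log-linear combination in which the audio stream gets
    `audio_weight` (in [0, 1]) and the video stream gets 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(log_p_audio) != len(log_p_video):
        raise ValueError("both streams must score the same set of states")

    fused = [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(log_p_audio, log_p_video)
    ]

    # Renormalize with a numerically stable log-sum-exp so that the fused
    # scores again form a log-probability distribution over the states.
    peak = max(fused)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in fused))
    return [x - log_norm for x in fused]


def fuse_sequence(audio_frames, video_frames, audio_weights):
    """Apply the fusion frame by frame; the weight may change at every
    frame, which is what makes the stream weighting dynamic."""
    return [
        fuse_log_posteriors(a, v, w)
        for a, v, w in zip(audio_frames, video_frames, audio_weights)
    ]
```

With an audio weight of 1 the fused posterior equals the audio posterior, and with a weight of 0 it equals the video posterior. A weight that drops when the audio channel is distorted therefore shifts the decision toward the visual stream for exactly those frames.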


