Multimodal Speech Recognition with Unstructured Audio Masking

10/16/2020
by Tejas Srinivasan, et al.

Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words is systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more general noisy scenarios.
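
To make the idea of masking "any word segment" concrete, here is a minimal sketch of random word-level masking. It is not the authors' implementation: the function name `rand_word_mask`, the `mask_prob` parameter, the `(word, start_frame, end_frame)` alignment format, and the choice to mask by zeroing feature frames are all assumptions made for illustration.

```python
import random


def rand_word_mask(features, segments, mask_prob=0.2, seed=None):
    """Zero out the feature frames of randomly chosen word segments.

    features: list of frames, each frame a list of floats (hypothetical format).
    segments: list of (word, start_frame, end_frame), end exclusive
              (hypothetical word-alignment format).
    Returns (masked_features, masked_words); the input is not modified.
    """
    rng = random.Random(seed)
    # Copy every frame so the caller's features stay untouched.
    masked = [list(frame) for frame in features]
    masked_words = []
    for word, start, end in segments:
        # Any word may be masked, each independently with probability mask_prob.
        if rng.random() < mask_prob:
            for t in range(start, min(end, len(masked))):
                # Assumption: "masking" is modeled as replacing frames with zeros.
                masked[t] = [0.0] * len(masked[t])
            masked_words.append(word)
    return masked, masked_words


# Toy example: six 2-dimensional frames aligned to three words.
features = [[1.0, 1.0] for _ in range(6)]
segments = [("a", 0, 2), ("dog", 2, 4), ("runs", 4, 6)]
masked, words = rand_word_mask(features, segments, mask_prob=0.5, seed=0)
```

In contrast, the fixed-set masking that the abstract calls unrealistic would choose which segments to mask from a predetermined word list rather than by a random draw per segment.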

research
02/13/2020

Looking Enhances Listening: Recovering Missing Speech Using Images

Speech is understood better by using visual context; for this reason, th...
research
06/30/2019

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Multimodal learning allows us to leverage information from multiple sour...
research
10/05/2020

Fine-Grained Grounding for Multimodal Speech Recognition

Multimodal automatic speech recognition systems integrate information fr...
research
07/04/2014

Recognition of Isolated Words using Zernike and MFCC features for Audio Visual Speech Recognition

Automatic Speech Recognition (ASR) by machine is an attractive research ...
research
02/12/2021

Multimodal Punctuation Prediction with Contextual Dropout

Automatic speech recognition (ASR) is widely used in consumer electronic...
research
02/28/2023

Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English

Considering the bimodal nature of human speech perception, lips, and tee...
research
01/16/2017

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

The Audio-visual Speech Recognition (AVSR) which employs both the video ...
