Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

06/30/2019
by   Tejas Srinivasan, et al.

Multimodal learning allows us to leverage information from multiple sources (visual, acoustic, and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of auxiliary visual context in Multimodal Automatic Speech Recognition (MMASR) in adversarial settings, where we deprive the models of part of the audio signal at inference time. Our experiments show that while MMASR models achieve significant gains over traditional speech-to-text architectures (up to 4.2% improvement in WER), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and that better visually grounded adaptation techniques are needed.
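The paper's exact corruption procedure is not detailed in this abstract; a minimal sketch of one plausible setup is to zero out a contiguous span of audio frames at inference time, simulating partial loss of the audio signal. The function name, frame-level feature layout, and masking fraction below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mask_audio(features: np.ndarray, mask_fraction: float = 0.4, seed: int = 0) -> np.ndarray:
    """Zero out one contiguous span of audio frames to simulate signal loss.

    features: (n_frames, n_dims) array of acoustic features (e.g. filterbanks).
    mask_fraction: fraction of frames to silence (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    n_frames = features.shape[0]
    span = int(n_frames * mask_fraction)
    start = rng.integers(0, n_frames - span + 1)
    masked = features.copy()          # leave the original utterance untouched
    masked[start:start + span] = 0.0  # "deprive the model" of this region
    return masked
```

Feeding such masked features to both a unimodal ASR model and an MMASR model, and comparing the WER degradation of each, is one way to probe whether the visual modality compensates for the missing audio.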

Related research:

10/16/2020  Multimodal Speech Recognition with Unstructured Audio Masking
Visual context has been shown to be useful for automatic speech recognit...

05/02/2020  MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
We address a challenging and practical task of labeling questions in spe...

01/16/2017  Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading
The Audio-visual Speech Recognition (AVSR) which employs both the video ...

04/27/2022  Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Multimodal speech recognition aims to improve the performance of automat...

01/29/2020  Audio-Visual Decision Fusion for WFST-based and seq2seq Models
Under noisy conditions, speech recognition systems suffer from high Word...

01/22/2015  Deep Multimodal Learning for Audio-Visual Speech Recognition
In this paper, we present methods in deep multimodal learning for fusing...

11/14/2020  Speech Prediction in Silent Videos using Variational Autoencoders
Understanding the relationship between the auditory and visual signals i...
