Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition

09/05/2018
by George Sterpu, et al.

Automatic speech recognition can potentially benefit from lip motion patterns, which complement acoustic speech and improve overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations that increase recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large-vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state-of-the-art Sequence-to-Sequence architectures, showing that our method can be easily integrated. Results show relative improvements of 7% or more over the acoustic modality alone, depending on the acoustic noise level. We anticipate that the fusion strategy can easily generalise to many other multimodal tasks which involve correlated modalities.
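The fusion described in the abstract can be pictured as cross-modal attention in which acoustic encoder states query the visual encoder states before decoding. The sketch below is a minimal illustration of that idea in PyTorch, not the authors' implementation: the class name, feature dimensions, number of heads, and the concatenate-then-project fusion step are all assumptions made for clarity.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Illustrative cross-modal attention fusion (dimensions are assumptions)."""

    def __init__(self, audio_dim=256, video_dim=256, num_heads=4):
        super().__init__()
        # Audio states act as queries; video states supply keys and values.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=audio_dim, kdim=video_dim, vdim=video_dim,
            num_heads=num_heads, batch_first=True)
        # Fuse the attended visual context with the audio features.
        self.proj = nn.Linear(audio_dim + audio_dim, audio_dim)

    def forward(self, audio_states, video_states):
        # audio_states: (batch, T_audio, audio_dim)
        # video_states: (batch, T_video, video_dim)
        attended, _ = self.cross_attn(
            query=audio_states, key=video_states, value=video_states)
        fused = torch.cat([audio_states, attended], dim=-1)
        return self.proj(fused)  # (batch, T_audio, audio_dim)

# Usage with random tensors standing in for encoder outputs
# (audio is typically sampled at a higher frame rate than video).
fusion = AudioVisualFusion()
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 25, 256)
print(fusion(audio, video).shape)  # torch.Size([2, 100, 256])
```

The fused sequence keeps the temporal resolution of the audio stream, so it can be passed to a Sequence-to-Sequence decoder in place of audio-only encoder states.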

Related research

11/13/2018
Modality Attention for End-to-End Audio-visual Speech Recognition
Audio-visual speech recognition (AVSR) system is thought to be one of th...

04/17/2020
How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby explo...

12/24/2013
Speech Recognition Front End Without Information Loss
Speech representation and modelling in high-dimensional spaces of acoust...

11/21/2016
Robust end-to-end deep audiovisual speech recognition
Speech is one of the most effective ways of communication among humans. ...

01/22/2015
Deep Multimodal Learning for Audio-Visual Speech Recognition
In this paper, we present methods in deep multimodal learning for fusing...

05/02/2020
MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech
We address a challenging and practical task of labeling questions in spe...

05/15/2020
A Novel Fusion of Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech
Motivated by the attention mechanism of the human visual system and rece...
