How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

04/17/2020
by   George Sterpu, et al.

Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit, the dynamic relationship between a human voice and the corresponding mouth movements. A recently proposed multimodal fusion strategy, AV Align, based on state-of-the-art sequence-to-sequence neural networks, attempts to model this relationship by explicitly aligning the acoustic and visual representations of speech. This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns. Our experiments are performed on two of the largest publicly available AVSR datasets, TCD-TIMIT and LRS2. We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern. We also identify why AV Align initially shows no improvement over audio-only speech recognition on the more challenging LRS2. We propose a regularisation method which involves predicting lip-related Action Units from visual representations. Our regularisation method leads to better exploitation of the visual modality, with performance improvements of at least 7%. Furthermore, we show that the alternative Watch, Listen, Attend, and Spell network is affected by the same problem as AV Align, and that our proposed approach can effectively help it learn visual representations. Our findings validate the suitability of the regularisation method to AVSR and encourage researchers to rethink the multimodal convergence problem when having one dominant modality.
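The two ideas in the abstract can be sketched concretely. Below is a minimal numpy illustration (not the authors' code) of (a) cross-modal attention in the spirit of AV Align, where each acoustic frame queries the visual sequence to produce an audio-visual alignment, and (b) an auxiliary Action Unit (AU) regularisation loss on the visual representations. Scaled dot-product attention, the feature sizes, and the number of lip-related AUs are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions: audio frames, video frames, feature size.
T_a, T_v, d = 6, 4, 8
audio = rng.standard_normal((T_a, d))   # acoustic encoder outputs
video = rng.standard_normal((T_v, d))   # visual encoder outputs

# (a) Cross-modal attention: every audio frame attends over the video frames.
scores = audio @ video.T / np.sqrt(d)       # (T_a, T_v) alignment logits
alignment = softmax(scores, axis=-1)        # each row is a distribution over video frames
visual_context = alignment @ video          # (T_a, d) visual summary per audio frame
fused = np.concatenate([audio, visual_context], axis=-1)  # (T_a, 2d) fused features

# (b) AU regularisation: a linear head predicts lip-related AU activations
# from the visual representations (multi-label, sigmoid + binary cross-entropy).
n_aus = 3                                   # e.g. a few lip AUs; the exact set is an assumption
W_au = rng.standard_normal((d, n_aus)) * 0.1
au_pred = 1.0 / (1.0 + np.exp(-(video @ W_au)))           # (T_v, n_aus) in (0, 1)
au_target = rng.integers(0, 2, size=(T_v, n_aus)).astype(float)
eps = 1e-9
au_loss = -np.mean(au_target * np.log(au_pred + eps)
                   + (1.0 - au_target) * np.log(1.0 - au_pred + eps))
```

In training, `au_loss` would be added to the main recognition loss so that the visual encoder is forced to carry speech-relevant information even when the acoustic modality dominates, which is the convergence problem the abstract describes.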


