Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

08/06/2020
by Liangfa Wei, et al.

Audio-visual information fusion improves speech recognition in complex acoustic scenarios, e.g., noisy environments. An effective fusion strategy is needed that handles both audio-visual alignment and modality reliability. Unlike previous end-to-end approaches, where audio-visual fusion is performed after each modality has been encoded, in this paper we propose to integrate an attentive fusion block into the encoding process. We show that performing fusion inside the encoder module enriches the audio-visual representations, because the relevance between the two modalities is leveraged. In line with the transformer-based architecture, we implement the embedded fusion block as multi-head attention based audio-visual fusion with one-way or two-way interactions. The proposed method combines the two streams more thoroughly and weakens the over-reliance on the audio modality. Experiments on the LRS3-TED dataset demonstrate that the proposed method can increase the recognition rate by 0.55 conditions, respectively, compared to the state-of-the-art approach.
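
As a rough illustration of the fusion block described in the abstract, the following NumPy sketch shows multi-head cross-attention between an audio stream and a visual stream, with a switch between one-way and two-way interaction. It is not the authors' implementation. The function names (`cross_attention`, `fuse`), the random projection matrices standing in for learned weights, the residual connections, and the choice that the one-way variant lets audio attend to video are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, num_heads, rng):
    """Multi-head attention in which `query` frames attend over `context` frames.

    query:   (T_q, d) features of one modality
    context: (T_c, d) features of the other modality
    Returns the attended features (T_q, d) and the weights (heads, T_q, T_c).
    """
    t_q, d = query.shape
    t_c, _ = context.shape
    assert d % num_heads == 0, "model dimension must be divisible by num_heads"
    d_h = d // num_heads

    # Hypothetical stand-ins for learned projections (randomly initialised).
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    # Project and split into heads: (heads, T, d_h).
    q = (query @ w_q).reshape(t_q, num_heads, d_h).transpose(1, 0, 2)
    k = (context @ w_k).reshape(t_c, num_heads, d_h).transpose(1, 0, 2)
    v = (context @ w_v).reshape(t_c, num_heads, d_h).transpose(1, 0, 2)

    # Scaled dot-product attention; each query frame gets a soft alignment
    # over the context frames, so the two streams may differ in length.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)
    weights = softmax(scores, axis=-1)
    heads = weights @ v

    # Merge heads back to (T_q, d) and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(t_q, d)
    return merged @ w_o, weights


def fuse(audio, video, num_heads=4, two_way=True, seed=0):
    """One-way: audio attends to video. Two-way: video also attends to audio."""
    rng = np.random.default_rng(seed)
    attended_audio, _ = cross_attention(audio, video, num_heads, rng)
    audio_out = audio + attended_audio  # residual connection (assumed)
    if not two_way:
        return audio_out, video
    attended_video, _ = cross_attention(video, audio, num_heads, rng)
    video_out = video + attended_video
    return audio_out, video_out


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    audio = rng.standard_normal((12, 8))  # 12 audio frames, 8-dim features
    video = rng.standard_normal((5, 8))   # 5 video frames, 8-dim features
    a, v = fuse(audio, video, num_heads=2, two_way=True)
    print(a.shape, v.shape)
```

Because each query frame computes its own attention distribution over the other stream, the sketch needs no explicit resampling step to align streams with different frame rates. In a full encoder, a block of this kind would sit between encoder layers rather than after the final one.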

research
04/19/2021

Fusing information streams in end-to-end audio-visual speech recognition

End-to-end acoustic speech recognition has quickly gained widespread pop...
research
06/18/2017

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

Audio-visual recognition (AVR) has been considered as a solution for spe...
research
05/19/2020

Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

The audio-visual speech fusion strategy AV Align has shown significant p...
research
01/29/2020

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Under noisy conditions, speech recognition systems suffer from high Word...
research
05/02/2018

Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction

We propose a tri-modal architecture to predict Big Five personality trai...
research
09/09/2022

Learning Audio-Visual embedding for Person Verification in the Wild

It has already been observed that audio-visual embedding is more robust ...
research
07/02/2021

Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion

We propose an audio-visual spatial-temporal deep neural network with: (1...
