Audio-Visual Decision Fusion for WFST-based and seq2seq Models

01/29/2020
by   Rohith Aralikatti, et al.

Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER). In such cases, information from the visual modality, namely the speaker's lip movements, can help improve performance. In this work, we propose novel methods to fuse information from the audio and visual modalities at inference time, which allows the acoustic and visual models to be trained independently. First, we train separate RNN-HMM based acoustic and visual models; a common WFST, generated by taking a special union of the two HMM components, is used for decoding with a modified Viterbi algorithm. Second, we train separate seq2seq acoustic and visual models; decoding is performed simultaneously for both modalities using shallow fusion while maintaining a common hypothesis beam. We also present results for a novel seq2seq fusion method that requires no weighting parameter. We report results at varying SNR levels and show that our methods yield significant improvements over acoustic-only WER.


