Robust end-to-end deep audiovisual speech recognition

11/21/2016
by   Ramon Sanabria, et al.
0

Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition however has not yet found wide-spread use, mostly because the temporal alignment and fusion of the different information sources is challenging. This paper presents an end-to-end audiovisual speech recognizer (AVSR), based on recurrent neural networks (RNN) with a connectionist temporal classification (CTC) loss function. CTC creates sparse "peaky" output activations, and we analyze the differences in the alignments of output targets (phonemes or visemes) between audio-only, video-only, and audio-visual feature representations. We present the first such experiments on the large vocabulary IBM ViaVoice database, which outperform previously published approaches on phone accuracy in clean and noisy conditions.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
03/13/2018

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Today's Automatic Speech Recognition systems only rely on acoustic signa...
research
09/05/2018

Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition

Automatic speech recognition can potentially benefit from the lip motion...
research
07/28/2020

Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition

For many small- and medium-vocabulary tasks, audio-visual speech recogni...
research
09/10/2021

Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

Audio-visual speech recognition (AVSR) can effectively and significantly...
research
01/22/2015

Deep Multimodal Learning for Audio-Visual Speech Recognition

In this paper, we present methods in deep multimodal learning for fusing...
research
09/12/2018

End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

Speech activity detection (SAD) plays an important role in current speec...
research
12/10/2019

A Novel Topology for End-to-end Temporal Classification and Segmentation with Recurrent Neural Network

Connectionist temporal classification (CTC) has matured as an alignment ...

Please sign up or login with your details

Forgot password? Click here to reset