End-to-End Audiovisual Fusion with LSTMs

09/12/2017
by Stavros Petridis, et al.

Several end-to-end deep learning approaches have recently been presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the pixels and spectrograms and performs classification of speech and nonlinguistic vocalisations. The model consists of multiple identical streams, one per modality, which extract features directly from mouth regions and spectrograms. The temporal dynamics in each stream/modality are modeled by a BLSTM, and the fusion of multiple streams/modalities takes place via another BLSTM. An absolute improvement of 1.9% over audio-only classification of nonlinguistic vocalisations is reported on the AVIC database. At the same time, the proposed end-to-end audiovisual fusion system improves the state-of-the-art performance on the AVIC database, leading to a 9.7% absolute improvement. We also perform speech recognition experiments on the OuluVS2 database using different views of the mouth, from frontal to profile. The proposed audiovisual system significantly outperforms the audio-only model for all views when the acoustic noise is high.
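The fusion topology the abstract describes — one BLSTM per modality, with the per-step outputs of all streams concatenated and fed to a further fusion BLSTM — can be sketched minimally in NumPy. All dimensions, the random initialisation, and the final softmax read-out below are illustrative assumptions; the paper's learned feature-extraction front-ends (pixels and spectrograms) are replaced here by precomputed per-frame feature vectors, so this shows only the fusion structure, not the full model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(input_dim, hidden_dim):
    # one LSTM cell; the four gate weight blocks (i, f, o, g) are stacked
    return {"W": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim)),
            "b": np.zeros(4 * hidden_dim), "H": hidden_dim}

def lstm_forward(p, xs):
    H = p["H"]
    h, c, outs = np.zeros(H), np.zeros(H), []
    for x in xs:                                   # xs: (T, input_dim)
        z = p["W"] @ np.concatenate([x, h]) + p["b"]
        i, f, o, g = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outs.append(h)
    return np.array(outs)                          # (T, H)

def blstm(input_dim, hidden_dim):
    return {"fwd": init_lstm(input_dim, hidden_dim),
            "bwd": init_lstm(input_dim, hidden_dim)}

def blstm_forward(p, xs):
    # forward pass plus time-reversed pass, concatenated per step -> (T, 2H)
    out_f = lstm_forward(p["fwd"], xs)
    out_b = lstm_forward(p["bwd"], xs[::-1])[::-1]
    return np.concatenate([out_f, out_b], axis=1)

# Two-stream audiovisual fusion; all sizes are hypothetical.
T, D_AUDIO, D_VIDEO, H, C = 20, 26, 40, 32, 5      # frames, feature dims, hidden, classes
audio_stream = blstm(D_AUDIO, H)
video_stream = blstm(D_VIDEO, H)
fusion = blstm(4 * H, H)                           # input: concat of two 2H-dim streams
W_out = rng.normal(0.0, 0.1, (C, 2 * H))

def classify(audio_feats, video_feats):
    a = blstm_forward(audio_stream, audio_feats)   # (T, 2H)
    v = blstm_forward(video_stream, video_feats)   # (T, 2H)
    fused = blstm_forward(fusion, np.concatenate([a, v], axis=1))
    logits = W_out @ fused[-1]                     # classify from the last fused state
    e = np.exp(logits - logits.max())
    return e / e.sum()                             # class probabilities, shape (C,)

probs = classify(rng.normal(size=(T, D_AUDIO)), rng.normal(size=(T, D_VIDEO)))
```

With untrained random weights the output is of course meaningless; the sketch only makes concrete how the per-modality BLSTM outputs are concatenated frame-by-frame before the fusion BLSTM, rather than being merged at the decision level.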

