End-To-End Visual Speech Recognition With LSTMs

01/20/2017
by   Stavros Petridis, et al.
0

Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification is very limited. In this work, we present an end-to-end visual speech recognition system based on Long-Short Memory (LSTM) networks. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and perform classification and also achieves state-of-the-art performance in visual speech classification. The model consists of two streams which extract features directly from the mouth and difference images, respectively. The temporal dynamics in each stream are modelled by an LSTM and the fusion of the two streams takes place via a Bidirectional LSTM (BLSTM). An absolute improvement of 9.7 CUAVE database when compared with other methods which use a similar visual front-end.

READ FULL TEXT

page 2

page 4

research
04/02/2019

End-to-End Visual Speech Recognition for Small-Scale Datasets

Traditional visual speech recognition systems consist of two stages, fea...
research
09/12/2017

End-to-End Audiovisual Fusion with LSTMs

Several end-to-end deep learning approaches have been recently presented...
research
09/07/2016

A three-dimensional approach to Visual Speech Recognition using Discrete Cosine Transforms

Visual speech recognition aims to identify the sequence of phonemes from...
research
07/27/2022

End-To-End Audiovisual Feature Fusion for Active Speaker Detection

Active speaker detection plays a vital role in human-machine interaction...
research
02/12/2021

End-to-end Audio-visual Speech Recognition with Conformers

In this work, we present a hybrid CTC/Attention model based on a ResNet-...
research
03/12/2019

End-To-End Speech Recognition Using A High Rank LSTM-CTC Based Model

Long Short Term Memory Connectionist Temporal Classification (LSTM-CTC) ...
research
11/11/2020

WaDeNet: Wavelet Decomposition based CNN for Speech Processing

Existing speech processing systems consist of different modules, individ...

Please sign up or login with your details

Forgot password? Click here to reset