DeepAI AI Chat
Log In Sign Up

End-To-End Visual Speech Recognition With LSTMs

by   Stavros Petridis, et al.
Imperial College London

Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification is very limited. In this work, we present an end-to-end visual speech recognition system based on Long-Short Memory (LSTM) networks. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and perform classification and also achieves state-of-the-art performance in visual speech classification. The model consists of two streams which extract features directly from the mouth and difference images, respectively. The temporal dynamics in each stream are modelled by an LSTM and the fusion of the two streams takes place via a Bidirectional LSTM (BLSTM). An absolute improvement of 9.7 CUAVE database when compared with other methods which use a similar visual front-end.


page 2

page 4


End-to-End Visual Speech Recognition for Small-Scale Datasets

Traditional visual speech recognition systems consist of two stages, fea...

End-to-End Audiovisual Fusion with LSTMs

Several end-to-end deep learning approaches have been recently presented...

A three-dimensional approach to Visual Speech Recognition using Discrete Cosine Transforms

Visual speech recognition aims to identify the sequence of phonemes from...

End-To-End Audiovisual Feature Fusion for Active Speaker Detection

Active speaker detection plays a vital role in human-machine interaction...

End-to-end Audio-visual Speech Recognition with Conformers

In this work, we present a hybrid CTC/Attention model based on a ResNet-...

End-To-End Speech Recognition Using A High Rank LSTM-CTC Based Model

Long Short Term Memory Connectionist Temporal Classification (LSTM-CTC) ...

WaDeNet: Wavelet Decomposition based CNN for Speech Processing

Existing speech processing systems consist of different modules, individ...