Combining Residual Networks with LSTMs for Lipreading

03/12/2017
by Themos Stafylakis, et al.

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory (LSTM) networks. We train and evaluate it on the Lipreading In-The-Wild (LRW) benchmark, a challenging database of 500 target words appearing in 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains 83.0% word accuracy, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
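To make the three-stage pipeline described above concrete, here is a minimal PyTorch sketch: a 3D spatiotemporal convolutional front-end, a small per-frame residual trunk, and a bidirectional LSTM that classifies over the 500-word vocabulary. The layer widths, block count, and the 29-frame 112x112 input shape are illustrative assumptions, not the authors' exact configuration (the paper uses a deeper ResNet backbone).

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic 2D residual block; a 1x1 projection shortcut handles shape changes."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


class LipreadingNet(nn.Module):
    """Spatiotemporal conv front-end -> residual trunk -> BiLSTM -> word classifier."""

    def __init__(self, num_classes=500, hidden=256):
        super().__init__()
        # 3D front-end: captures short-term mouth motion across neighboring frames.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Shallow residual trunk applied to each frame independently
        # (assumed depth; the paper uses a full ResNet here).
        self.trunk = nn.Sequential(
            ResidualBlock(64, 128, stride=2),
            ResidualBlock(128, 256, stride=2),
            nn.AdaptiveAvgPool2d(1),  # one 256-d feature vector per frame
        )
        # Bidirectional LSTM aggregates the per-frame features over time.
        self.bilstm = nn.LSTM(256, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                    # x: (B, 1, T, H, W) grayscale clips
        f = self.frontend(x)                 # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, -1)  # (B, T, 256)
        seq, _ = self.bilstm(f)              # (B, T, 2 * hidden)
        # Average over time so no word-boundary information is needed,
        # then score the 500-word vocabulary.
        return self.classifier(seq.mean(dim=1))


if __name__ == "__main__":
    clip = torch.randn(2, 1, 29, 112, 112)   # two ~1.28 s clips at 25 fps
    print(LipreadingNet()(clip).shape)        # torch.Size([2, 500])
```

Averaging the BiLSTM outputs over all time steps, rather than reading out at a detected word boundary, is what lets the model be trained and tested without word-boundary annotations, as the abstract notes.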

