LipNet: End-to-End Sentence-level Lipreading

11/05/2016
by Yannis Assael, et al.

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
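The connectionist temporal classification (CTC) loss mentioned above is what lets a model emit one prediction per video frame without frame-level alignments. The following is a minimal pure-Python sketch of the general CTC idea, not LipNet's implementation: it shows how a frame-level path collapses to text, and how the probability of a target string is summed over all paths that collapse to it. The blank symbol `_`, the function names, and the toy probabilities are assumptions chosen for illustration.

```python
BLANK = "_"  # assumed blank symbol for this sketch; not taken from the paper


def ctc_collapse(path):
    """Map a frame-level path to text: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_probability(probs, target, alphabet):
    """Sum the probability of every frame-level path that collapses to `target`.

    probs[t][k] is the probability of alphabet[k] at frame t.
    `alphabet` must contain BLANK. This is the standard CTC forward recursion.
    """
    index = {sym: k for k, sym in enumerate(alphabet)}

    # Interleave blanks around the target: "ab" -> _ a _ b _
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]

    # A path may start with the leading blank or with the first character.
    alpha = [0.0] * len(ext)
    alpha[0] = probs[0][index[ext[0]]]
    if len(ext) > 1:
        alpha[1] = probs[0][index[ext[1]]]

    for t in range(1, len(probs)):
        new = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = alpha[s]  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]  # advance by one position
            # Skip over a blank only between two different characters.
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][index[sym]]
        alpha = new

    # A valid path ends on the last character or on the trailing blank.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


if __name__ == "__main__":
    print(ctc_collapse("__bb_i_nn__"))
    uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3  # 3 frames over the alphabet _, a, b
    print(ctc_probability(uniform, "ab", ["_", "a", "b"]))
```

In training, the CTC loss is the negative logarithm of this probability, with the per-frame distributions produced by the network rather than fixed by hand. Practical implementations work in log space for numerical stability; this sketch uses plain probabilities to keep the recursion readable.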

research
05/03/2015

Sequence to Sequence -- Video to Text

Real-world videos often have complex dynamics; and methods for generatin...
research
01/06/2018

Visual Text Correction

This paper tackles the Text Correction (TC) problem, i.e., finding and r...
research
07/09/2019

Multi-Speaker End-to-End Speech Synthesis

In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end...
research
09/12/2017

End-to-End Audiovisual Fusion with LSTMs

Several end-to-end deep learning approaches have been recently presented...
research
09/05/2017

Sequence Prediction with Neural Segmental Models

Segments that span contiguous parts of inputs, such as phonemes in speec...
research
06/16/2021

Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI

Speech sounds of spoken language are obtained by varying configuration o...
research
01/25/2019

A Neurally-Inspired Hierarchical Prediction Network for Spatiotemporal Sequence Learning and Prediction

In this paper we developed a hierarchical network model, called Hierarch...
