LipNet: End-to-End Sentence-level Lipreading

by Yannis Assael, et al.

Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end performs only word classification, rather than sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy on the sentence-level, overlapped speaker split task, outperforming experienced human lipreaders and the previous 86.4% word-level state-of-the-art accuracy (Gergen et al., 2016).
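The connectionist temporal classification (CTC) loss named in the abstract is what allows a variable-length sequence of frames to be mapped to a shorter text sequence without frame-level alignment. The following is a minimal pure-Python sketch of the general CTC idea, not the authors' implementation: the function names, the choice of index 0 for the blank symbol, and the use of plain probability lists (rather than log-space tensors from a deep learning framework) are assumptions made for illustration.

```python
import math

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def ctc_collapse(path, blank=BLANK):
    """Map a per-frame label path to an output sequence: merge repeats, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def ctc_neg_log_likelihood(probs, target, blank=BLANK):
    """Negative log-probability of `target` given per-frame distributions `probs`.

    probs[t][k] is the probability of label k at frame t. The forward
    recursion sums over every frame-level path that collapses to `target`.
    """
    # Interleave blanks around the target: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # A valid path starts on the leading blank or on the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping over a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # A valid path ends on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf
```

For example, `ctc_collapse` turns a frame-level path such as "blank, a, a, blank, a, b, b, blank" into "a, a, b", and `ctc_neg_log_likelihood` is the quantity a CTC-trained network would minimise. A practical implementation would work in log space for numerical stability and would typically rely on a framework-provided CTC loss.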


Sequence to Sequence -- Video to Text

Real-world videos often have complex dynamics; and methods for generatin...

Visual Text Correction

This paper tackles the Text Correction (TC) problem, i.e., finding and r...

On the Importance of Video Action Recognition for Visual Lipreading

We focus on the word-level visual lipreading, which requires to decode t...

End-to-End Audiovisual Fusion with LSTMs

Several end-to-end deep learning approaches have been recently presented...

Sequence Prediction with Neural Segmental Models

Segments that span contiguous parts of inputs, such as phonemes in speec...

Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI

Speech sounds of spoken language are obtained by varying configuration o...

Visual Rhythm Prediction with Feature-Aligning Network

In this paper, we propose a data-driven visual rhythm prediction method,...

Code Repositories


Exploring Facial Feature generation
