Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

05/04/2019
by Xinshuo Weng, et al.

We focus on word-level visual lipreading, which requires recognizing the word being spoken given only the video and not the audio. State-of-the-art methods use end-to-end neural networks: a shallow (up to three layers) 3D convolutional neural network (CNN) followed by a deep 2D CNN (e.g., ResNet) serves as the front-end to extract visual features, and a recurrent neural network (e.g., a bidirectional LSTM) serves as the back-end for classification. In this work, we propose to replace the shallow 3D CNN + deep 2D CNN front-end with a recently successful deep 3D CNN: two-stream I3D, where the two streams are grayscale video and optical flow. We evaluate different combinations of front-end and back-end modules with the grayscale video and optical flow inputs on the LRW dataset. The experiments show that, compared to the shallow 3D CNN + deep 2D CNN front-end, the deep 3D CNN front-end pre-trained on large-scale image and video datasets (e.g., ImageNet and Kinetics) improves classification accuracy. We also demonstrate that using the optical flow input alone achieves performance comparable to using the grayscale video input alone. Moreover, the two-stream network, which uses both the grayscale video and optical flow inputs, further improves performance. Overall, our two-stream I3D front-end with a Bi-LSTM back-end yields an absolute improvement of 5.3% over prior art.
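To make the two-stream idea concrete, the following is a minimal sketch of combining a grayscale-stream prediction with an optical-flow-stream prediction for a single word clip. The abstract does not say how the two streams are fused, so the score-level weighted averaging shown here is an assumption chosen for illustration (a common two-stream technique), not necessarily the authors' method; the actual model may instead combine stream features before the Bi-LSTM back-end. The function names, the `gray_weight` parameter, the three-word vocabulary, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(gray_logits, flow_logits, gray_weight=0.5):
    """Weighted average of per-stream class probabilities.

    ASSUMPTION: score-level (late) fusion. The source abstract does not
    specify the fusion scheme used by the two-stream I3D front-end.
    """
    if len(gray_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= gray_weight <= 1.0:
        raise ValueError("gray_weight must lie in [0, 1]")
    p_gray = softmax(gray_logits)
    p_flow = softmax(flow_logits)
    return [
        gray_weight * g + (1.0 - gray_weight) * f
        for g, f in zip(p_gray, p_flow)
    ]


def predict_word(gray_logits, flow_logits, vocabulary, gray_weight=0.5):
    """Return the most probable word and its fused probability."""
    fused = fuse_two_streams(gray_logits, flow_logits, gray_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return vocabulary[best], fused[best]


# Hypothetical three-word vocabulary and made-up per-stream scores.
vocabulary = ["ABOUT", "ABSOLUTELY", "ABUSE"]
gray_logits = [2.0, 1.0, 0.0]   # grayscale stream leans towards "ABOUT"
flow_logits = [0.0, 3.0, 0.0]   # optical flow stream favours "ABSOLUTELY"

word, prob = predict_word(gray_logits, flow_logits, vocabulary)
print(word, round(prob, 3))
```

With equal weights, the more confident optical flow stream outvotes the grayscale stream in this toy case; setting `gray_weight=1.0` reduces the sketch to the grayscale stream alone.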


