On the Importance of Video Action Recognition for Visual Lipreading

03/22/2019
by   Xinshuo Weng, et al.
0

We focus on the word-level visual lipreading, which requires to decode the word from the speaker's video. Recently, many state-of-the-art visual lipreading methods explore the end-to-end trainable deep models, involving the use of 2D convolutional networks (e.g., ResNet) as the front-end visual feature extractor and the sequential model (e.g., Bi-LSTM or Bi-GRU) as the back-end. Although a deep 2D convolution neural network can provide informative image-based features, it ignores the temporal motion existing between the adjacent frames. In this work, we investigate the spatial-temporal capacity power of I3D (Inflated 3D ConvNet) for visual lipreading. We demonstrate that, after pre-trained on the large-scale video action recognition dataset (e.g., Kinetics), our models show a considerable improvement of performance on the task of lipreading. A comparison between a set of video model architectures and input data representation is also reported. Our extensive experiments on LRW shows that a two-stream I3D model with RGB video and optical flow as the inputs achieves the state-of-the-art performance.

READ FULL TEXT
research
05/04/2019

Learning Spatio-Temporal Features with Two-Stream Deep 3D CNNs for Lipreading

We focus on the word-level visual lipreading, which requires recognizing...
research
11/11/2017

End-to-end Video-level Representation Learning for Action Recognition

From the frame/clip-level feature learning to the video-level representa...
research
10/17/2021

TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

Most of existing video action recognition models ingest raw RGB frames. ...
research
11/17/2014

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Models based on deep convolutional networks have dominated recent image ...
research
08/03/2020

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

Human action recognition is regarded as a key cornerstone in domains suc...
research
02/18/2020

Knowledge Integration Networks for Action Recognition

In this work, we propose Knowledge Integration Networks (referred as KIN...
research
10/30/2018

Random Temporal Skipping for Multirate Video Analysis

Current state-of-the-art approaches to video understanding adopt tempora...

Please sign up or login with your details

Forgot password? Click here to reset