TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

10/17/2021
by   Zhengwei Wang, et al.

Most existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces superfluous information by representing the raw video stream as a sequence of Groups of Pictures (GOPs). Each GOP is composed of an initial I-frame (i.e., an RGB image) followed by a number of P-frames, which are represented by motion vectors and residuals that can be regarded and used as pre-extracted features. In this work, we 1) introduce sampling the input for the network from partially decoded videos at the GOP level, and 2) propose a plug-and-play mulTi-modal lEArning Module (TEAM) for training the network using information from I-frames and P-frames in an end-to-end manner. We demonstrate the superior performance of TEAM-Net compared to the baseline using RGB only. TEAM-Net also achieves state-of-the-art performance in the area of video action recognition with partial decoding. Code is provided at https://github.com/villawang/TEAM-Net.
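
To make the GOP structure and GOP-level sampling described above more concrete, the following is a minimal Python sketch. It is not taken from the TEAM-Net code base: the function names `split_into_gops` and `sample_gops`, the list-of-frame-types input, and the choice of one randomly picked GOP per equal-length segment are all assumptions made for illustration, and the authors' actual sampling scheme may differ.

```python
import random
from typing import List, Tuple


def split_into_gops(frame_types: List[str]) -> List[List[int]]:
    """Group frame indices into GOPs; each GOP starts at an I-frame.

    `frame_types` is a hypothetical input such as ["I", "P", "P", ...].
    """
    gops: List[List[int]] = []
    for idx, ftype in enumerate(frame_types):
        if ftype == "I":
            gops.append([idx])
        elif gops:
            # P-frames that precede the first I-frame are skipped,
            # since they have no reference frame in this sketch.
            gops[-1].append(idx)
    return gops


def sample_gops(
    gops: List[List[int]], num_segments: int, rng: random.Random
) -> List[Tuple[int, int]]:
    """Pick one GOP per segment and return (I-frame index, P-frame index) pairs.

    Assumption: the video is divided into `num_segments` roughly equal runs
    of GOPs, and one GOP is drawn at random from each run.
    """
    if not gops:
        raise ValueError("no GOPs found")
    samples = []
    for s in range(num_segments):
        lo = s * len(gops) // num_segments
        hi = max(lo + 1, (s + 1) * len(gops) // num_segments)
        gop = gops[rng.randrange(lo, min(hi, len(gops)))]
        # Fall back to the I-frame itself if the GOP has no P-frames.
        p_idx = rng.choice(gop[1:]) if len(gop) > 1 else gop[0]
        samples.append((gop[0], p_idx))
    return samples


if __name__ == "__main__":
    types = ["I", "P", "P", "P"] * 3  # three GOPs of four frames each
    print(sample_gops(split_into_gops(types), num_segments=3, rng=random.Random(0)))
```

In this sketch only the sampled I-frame would need full decoding to RGB, while the sampled P-frame would contribute its motion vectors and residuals; how those modalities are fused is the role of the TEAM module and is not modeled here.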

research
03/22/2019

On the Importance of Video Action Recognition for Visual Lipreading

We focus on the word-level visual lipreading, which requires to decode t...
research
04/02/2017

Hidden Two-Stream Convolutional Networks for Action Recognition

Analyzing videos of human actions involves understanding the temporal re...
research
11/20/2018

Reversing Two-Stream Networks with Decoding Discrepancy Penalty for Robust Action Recognition

We discuss the robustness and generalization ability in the realm of act...
research
12/02/2017

Compressed Video Action Recognition

Training robust deep video representations has proven to be much more ch...
research
09/28/2020

PERF-Net: Pose Empowered RGB-Flow Net

In recent years, many works in the video action recognition literature h...
research
11/19/2019

Mimic The Raw Domain: Accelerating Action Recognition in the Compressed Domain

Video understanding usually requires expensive computation that prohibit...
research
08/03/2020

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

Human action recognition is regarded as a key cornerstone in domains suc...
