Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification

12/01/2020
by   Youngwan Lee, et al.

Two strands of video classification research have recently attracted attention: temporal modeling and efficient 3D architectures. However, existing temporal modeling methods are not efficient, and efficient 3D architectures pay little attention to temporal modeling. To bridge the gap between them, we propose VoV3D, an efficient 3D architecture for temporal modeling that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA is devised to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model both short-range and long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, which decomposes a 3D depthwise convolution into a spatial depthwise convolution and a temporal depthwise convolution, making the network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two VoV3D networks, VoV3D-M and VoV3D-L. Thanks to its efficient and effective temporal modeling, VoV3D-L surpasses a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400 with 6x fewer model parameters and 16x less computation. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture with comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
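The abstract does not give the paper's exact layer configurations, but the efficiency argument behind D(2+1)D, and the receptive-field growth that stacking exploits in T-OSA, can be sketched with simple parameter-count arithmetic. The helper names below are hypothetical, not the authors' code:

```python
def depthwise3d_params(channels, t, k):
    # Full 3D depthwise convolution: one t x k x k kernel per channel.
    return channels * t * k * k

def d2plus1d_params(channels, t, k):
    # D(2+1)D-style factorization: a 1 x k x k spatial depthwise
    # convolution followed by a t x 1 x 1 temporal depthwise convolution.
    return channels * (k * k + t)

def stacked_temporal_rfs(num_convs, t):
    # Temporal receptive field of each intermediate feature when temporal
    # convolutions of size t are stacked sequentially; aggregating all of
    # them at once (as in one-shot aggregation) combines features with
    # several different receptive fields.
    rfs, rf = [], 1
    for _ in range(num_convs):
        rf += t - 1
        rfs.append(rf)
    return rfs

print(depthwise3d_params(64, 3, 3))   # 1728
print(d2plus1d_params(64, 3, 3))      # 768
print(stacked_temporal_rfs(3, 3))     # [3, 5, 7]
```

For a 3x3x3 depthwise kernel over 64 channels, the factorized form uses 2.25x fewer parameters; the actual savings reported in the paper depend on the full network configuration.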


Related research

04/03/2020 · TEA: Temporal Excitation and Aggregation for Action Recognition
Temporal modeling is key for action recognition in videos. It normally c...

11/26/2019 · Learning Efficient Video Representation with Video Shuffle Networks
3D CNN shows its strong ability in learning spatiotemporal representatio...

07/23/2019 · Compact Global Descriptor for Neural Networks
Long-range dependencies modeling, widely used in capturing spatiotempora...

09/06/2023 · ResFields: Residual Neural Fields for Spatiotemporal Signals
Neural fields, a category of neural networks trained to represent high-f...

08/14/2023 · Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
Zero-shot video recognition (ZSVR) is a task that aims to recognize vide...

09/15/2020 · Comparison of Spatiotemporal Networks for Learning Video Related Tasks
Many methods for learning from video sequences involve temporally proces...

01/21/2020 · A Comprehensive Study on Temporal Modeling for Online Action Detection
Online action detection (OAD) is a practical yet challenging task, which...
