More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

by Quanfu Fan, et al.

Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets, which require large GPU clusters to train and evaluate. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures while using only a fraction of the resources. The proposed architecture combines a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames, allowing for high efficiency and accuracy at the same time. We demonstrate that our approach reduces FLOPs by 3∼4 times and memory usage by ∼2 times compared to the baseline. This enables training deeper models with more input frames under the same computational budget. To further obviate the need for large-scale 3D convolutions, we propose a temporal aggregation module that models temporal dependencies in a video at very small additional computational cost. Our models achieve strong performance on several action recognition benchmarks, including Kinetics, Something-Something and Moments in Time. The code and models are available at
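To make the efficiency argument concrete, the following is a minimal sketch, not the authors' implementation. It counts multiply-accumulates for a single 3×3 convolution split into a deep branch at half resolution and a compact branch at full resolution with fewer channels, and it shows a per-channel (depthwise) weighted sum over neighboring frames as one plausible reading of temporal aggregation. The function names, the halved resolution, the channel divisor `alpha`, the three-frame window and the zero padding are all illustrative assumptions that the abstract does not specify.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulates of one stride-1, same-padded 2D convolution."""
    return height * width * c_in * c_out * kernel * kernel


def two_branch_macs(height, width, channels, alpha=2, kernel=3):
    """Cost of a hypothetical two-branch layer (assumed settings).

    Deep branch: half spatial resolution, full channel width.
    Compact branch: full resolution, channels divided by alpha.
    """
    deep = conv_macs(height // 2, width // 2, channels, channels, kernel)
    compact = conv_macs(height, width, channels // alpha, channels // alpha, kernel)
    return deep + compact


def temporal_aggregate(frames, weights):
    """Depthwise weighted sum over frames t-1, t and t+1.

    frames:  list of T feature vectors, each a list of C floats.
    weights: three lists of C floats, for offsets -1, 0 and +1.
    Frames outside the clip are treated as zeros (assumed padding).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    out = []
    for t in range(num_frames):
        row = [0.0] * num_channels
        for k, offset in enumerate((-1, 0, 1)):
            s = t + offset
            if 0 <= s < num_frames:
                for c in range(num_channels):
                    # Each channel has its own weight; channels never mix.
                    row[c] += weights[k][c] * frames[s][c]
        out.append(row)
    return out


# Toy layer: 56x56 feature map with 64 channels (illustrative sizes only).
baseline = conv_macs(56, 56, 64, 64)
split = two_branch_macs(56, 56, 64, alpha=2)
ratio = baseline / split  # 2.0 for these toy settings
```

With these toy settings each branch costs a quarter of the baseline layer, so the pair costs half. This single-layer figure only illustrates the mechanism and is not meant to reproduce the 3∼4 times reduction reported above. The aggregation adds only three multiplies per channel per frame, which is consistent with the claim of a very small additional cost.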



An Image is Worth 16x16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill infor...

Rethinking Resolution in the Context of Efficient Video Recognition

In this paper, we empirically study how to make the most of low-resoluti...

MoViNets: Mobile Video Networks for Efficient Video Recognition

We present Mobile Video Networks (MoViNets), a family of computation and...

Context-Aware RCNN: A Baseline for Action Detection in Videos

Video action detection approaches usually conduct actor-centric action r...

Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

In this paper, we address the challenges posed by the substantial traini...

Temporal Sequence Distillation: Towards Few-Frame Action Recognition in Videos

Video Analytics Software as a Service (VA SaaS) has been rapidly growing...

Learn to cycle: Time-consistent feature discovery for action recognition

Temporal motion has been one of the essential components for effectively...
