2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition

12/29/2020
by   Hengduo Li, et al.

3D convolutional networks are prevalent for video recognition. While achieving excellent recognition performance on standard benchmarks, they operate on a sequence of frames with 3D convolutions and thus are computationally demanding. Exploiting large variations among different videos, we introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine which frames and convolution layers are used in a 3D network. These policies are derived with a two-head lightweight selection network conditioned on each input video clip. Then, only the frames and convolutions chosen by the selection network are used in the 3D model to generate predictions. The selection network is optimized with policy gradient methods to maximize a reward that encourages making correct predictions with limited computation. We conduct experiments on three video recognition benchmarks and demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets. We also show that learned policies are transferable and that Ada3D is compatible with different backbones and modern clip selection approaches. Our qualitative analysis indicates that our method allocates fewer 3D convolutions and frames for "static" inputs, yet uses more for motion-intensive clips.
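The abstract does not spell out the reward or the policy parameterization, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes independent Bernoulli keep/skip decisions (one per frame or 3D convolution layer), a hypothetical reward of the form "one minus squared usage when the prediction is correct, a fixed penalty otherwise", and a plain REINFORCE gradient. The names `sample_actions`, `reward`, `reinforce_grad`, and the `penalty` parameter are all illustrative.

```python
import math
import random


def sigmoid(z):
    """Map a logit to a keep-probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_actions(logits, rng):
    """Sample one binary decision per unit: 1 = use it, 0 = skip it."""
    return [1 if rng.random() < sigmoid(z) else 0 for z in logits]


def reward(actions, correct, penalty=1.0):
    """Hypothetical reward (an assumption, not taken from the paper):
    reward computation savings only when the prediction is correct."""
    usage = sum(actions) / len(actions)  # fraction of units actually used
    return 1.0 - usage ** 2 if correct else -penalty


def reinforce_grad(logits, actions, r):
    """REINFORCE estimate: r * d/dz log Bernoulli(a; sigmoid(z)) = r * (a - p)."""
    return [r * (a - sigmoid(z)) for z, a in zip(logits, actions)]


# One illustrative update step on four units with an assumed-correct prediction.
rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
actions = sample_actions(logits, rng)
r = reward(actions, correct=True)
grad = reinforce_grad(logits, actions, r)
logits = [z + 0.1 * g for z, g in zip(logits, grad)]  # gradient ascent on reward
```

In a real system the logits would come from the selection network's two heads and `correct` from the 3D model's prediction on the selected frames and layers; here both are stand-ins so the update rule can be read in isolation.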

research
11/29/2018

AdaFrame: Adaptive Frame Selection for Fast Video Recognition

We present AdaFrame, a framework that adaptively selects relevant frames...
research
12/03/2019

LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition

This paper presents LiteEval, a simple yet effective coarse-to-fine fram...
research
04/27/2021

FrameExit: Conditional Early Exiting for Efficient Video Recognition

In this paper, we propose a conditional early exiting framework for effi...
research
01/12/2022

OCSampler: Compressing Videos to One Clip with Single-step Sampling

In this paper, we propose a framework named OCSampler to explore a compa...
research
11/18/2022

Look More but Care Less in Video Recognition

Existing action recognition methods typically sample a few frames to rep...
research
04/23/2021

Skip-Convolutions for Efficient Video Processing

We propose Skip-Convolutions to leverage the large amount of redundancie...
research
04/01/2016

Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video

Current approaches for activity recognition often ignore constraints on ...
