OCSampler: Compressing Videos to One Clip with Single-step Sampling

01/12/2022
by   Jintao Lin, et al.

In this paper, we propose a framework named OCSampler to explore a compact yet effective video representation with one short clip for efficient video recognition. Recent works formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance. We instead present a new paradigm that learns instance-specific video condensation policies to select, in a single step, the informative frames that represent the entire video. Our basic motivation is that efficient video recognition lies in processing a whole sequence at once rather than picking frames sequentially. Accordingly, these policies are derived within one step from a lightweight skim network together with a simple yet effective policy network. Moreover, we extend the proposed method with a frame number budget, enabling the framework to produce correct predictions with high confidence using as few frames as possible. Experiments on four benchmarks, i.e., ActivityNet, Mini-Kinetics, FCVID, and Mini-Sports1M, demonstrate the effectiveness of OCSampler over previous methods in terms of accuracy, theoretical computational expense, and actual inference speed. We also evaluate its generalization power across different classifiers, sampled frames, and search spaces. In particular, we achieve 76.9% mAP and 21.7 GFLOPs on ActivityNet with an impressive throughput of 123.9 videos/s on a single TITAN Xp GPU.
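
To make the contrast with sequential sampling concrete, the following is a minimal sketch of single-step selection, not the authors' implementation. It assumes that some policy has already produced one importance score per candidate frame. In the paper these scores would come from the learned skim and policy networks, whereas here they are plain numbers supplied by hand. The function name `select_clip` and its parameters are hypothetical.

```python
def select_clip(scores, k):
    """Choose k frame indices in one step from per-frame importance scores.

    `scores` is a hypothetical stand-in for the output of a learned policy:
    one number per candidate frame, higher meaning more informative.
    All k frames are chosen together from the full score vector, rather than
    one at a time with a decision step per frame.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank all candidate frames at once and keep the k highest-scoring ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in temporal order so they form one short clip.
    return sorted(ranked[:k])


# Six candidate frames, condensed to a three-frame clip.
clip = select_clip([0.1, 0.9, 0.3, 0.8, 0.2, 0.7], k=3)
print(clip)  # [1, 3, 5]
```

In this sketch the frame number budget corresponds to the argument `k`. How the paper chooses or learns that budget is not covered here.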


