FASTER Recurrent Networks for Video Classification

06/10/2019
by Linchao Zhu, et al.

Video classification methods often divide the video into short clips, run inference on each clip independently, and then aggregate the clip-level predictions into the final classification result. Treating these highly correlated clips as independent both ignores the temporal structure of the signal and carries a large computational cost: the model must process each clip from scratch. To reduce this cost, recent efforts have focused on designing more efficient clip-level network architectures. Less attention, however, has been paid to the overall framework, including how to exploit correlations between neighboring clips and how to improve the aggregation strategy itself. In this paper we leverage the correlation between adjacent video clips to reduce the computational cost of video classification at the aggregation stage. More specifically, given the feature representation of one clip, computing the representation of the next clip becomes much easier. We propose a novel recurrent architecture for video-level classification, called FASTER, that combines high-quality, expensive clip representations, which capture the action in detail, with lightweight representations, which capture scene changes in the video and avoid redundant computation. We also propose a novel processing unit that learns to integrate clip-level representations along with their temporal structure. We call this unit FAST-GRU, as it is based on the Gated Recurrent Unit (GRU). The proposed framework achieves a significantly better FLOPs vs. accuracy trade-off at inference time. Compared to existing approaches, it reduces FLOPs by more than 10x while maintaining similar accuracy on popular datasets such as Kinetics, UCF101 and HMDB51.
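The PyTorch sketch below illustrates the high-level idea described in the abstract: run an expensive clip network on a sparse subset of clips, a lightweight network on the remaining clips, and fold the resulting clip features into a single video-level prediction with a GRU-style recurrent aggregator. The backbone stubs, feature dimensions, interleaving schedule, and the plain `nn.GRUCell` standing in for the proposed FAST-GRU unit are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a FASTER-style pipeline (assumed details, not the paper's code):
# a few clips go through an expensive network, the rest through a cheap one,
# and a GRU cell aggregates clip features into one video-level prediction.
import torch
import torch.nn as nn

class FasterStyleClassifier(nn.Module):
    def __init__(self, clip_feat_dim=256, hidden_dim=512, num_classes=400,
                 expensive_every=4):
        super().__init__()
        # Stand-ins for the expensive and lightweight clip-level networks
        # (e.g. a heavy 3D CNN vs. a much smaller model in the paper's setting).
        self.expensive_net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, clip_feat_dim))
        self.cheap_net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, clip_feat_dim))
        # Plain GRU cell as a stand-in for the proposed FAST-GRU aggregator.
        self.aggregator = nn.GRUCell(clip_feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.expensive_every = expensive_every

    def forward(self, clips):
        # clips: (batch, num_clips, 3, frames, height, width)
        batch, num_clips = clips.shape[:2]
        h = clips.new_zeros(batch, self.aggregator.hidden_size)
        for i in range(num_clips):
            clip = clips[:, i]
            # Use the expensive network sparsely; rely on the cheap one elsewhere.
            if i % self.expensive_every == 0:
                feat = self.expensive_net(clip)
            else:
                feat = self.cheap_net(clip)
            h = self.aggregator(feat, h)  # integrate clip features over time
        return self.classifier(h)         # single video-level prediction

# Example: batch of 2 videos, each split into 8 clips of 8 frames at 32x32.
model = FasterStyleClassifier()
video = torch.randn(2, 8, 3, 8, 32, 32)
logits = model(video)  # shape: (2, 400)
```

Under this interleaving, most clips incur only the cost of the lightweight network, which is where the FLOPs savings in such a scheme would come from; the recurrent aggregator is what lets the cheap features reuse context from the expensive ones.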
