An Image is Worth 16x16 Words, What is a Video Worth?

03/25/2021
by Gilad Sharir, et al.

Leading methods in action recognition aim to distill information from both the spatial and temporal dimensions of an input video. Methods that reach state-of-the-art (SotA) accuracy usually rely on 3D convolution layers to extract temporal information from video frames. Using such convolutions requires sampling short clips from the input video, where each clip is a collection of closely spaced frames. Because each short clip covers only a small fraction of the input video, multiple clips are sampled at inference time to cover its whole temporal extent. This increases the computational load and is impractical for real-world applications. We address this computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames and thus better exploits the salient information in each frame. As a result, our approach is highly input-efficient and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation, and latency. Specifically, on Kinetics-400 we reach 78.8 top-1 accuracy with ×30 fewer frames per video and ×40 faster inference than the current leading method. Code is available at: https://github.com/Alibaba-MIIL/STAM
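The core idea above — a temporal transformer that applies global attention across per-frame features, so every sampled frame can attend to every other — can be illustrated with a minimal NumPy sketch. This is not the actual STAM implementation (see the linked repository for that); the function and weight names here are illustrative, and the sketch assumes each frame has already been reduced to a single embedding (e.g. the [CLS] token of a per-frame vision transformer).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_global_attention(frame_embeddings, Wq, Wk, Wv):
    """Single-head global self-attention over the temporal (frame) axis.

    frame_embeddings: (T, D) array, one D-dim embedding per sampled frame.
    Wq, Wk, Wv: (D, D) projection matrices (illustrative stand-ins for
    learned transformer weights).
    """
    Q = frame_embeddings @ Wq
    K = frame_embeddings @ Wk
    V = frame_embeddings @ Wv
    # (T, T) score matrix: every frame attends to every other frame,
    # which is what "global attention over video frames" buys over
    # short-clip 3D convolutions with a limited temporal window.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V  # (T, D) temporally mixed features

# Toy usage: 16 sampled frames, 64-dim frame embeddings
rng = np.random.default_rng(0)
T, D = 16, 64
x = rng.standard_normal((T, D))
Wq, Wk, Wv = (0.1 * rng.standard_normal((D, D)) for _ in range(3))
out = temporal_global_attention(x, Wq, Wk, Wv)
print(out.shape)  # (16, 64)
```

Because attention is global over the T sampled frames, a small T (here 16, matching the "16x16 words" framing) can still aggregate evidence from the whole video span, which is the basis of the frame-efficiency claim in the abstract.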

