Efficient Video Transformers with Spatial-Temporal Token Selection

11/23/2021
by Junke Wang, et al.

Video transformers have achieved impressive results on major video recognition benchmarks; however, they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a few informative tokens in both the temporal and spatial dimensions, conditioned on the input video sample. Specifically, we formulate token selection as a ranking problem: a lightweight selection network estimates the importance of each token, and only those with top scores are used for downstream evaluation. In the temporal dimension, we keep the frames that are most relevant for recognizing action categories, while in the spatial dimension, we identify the most discriminative region in the feature maps without affecting the spatial context used in a hierarchical way in most video transformers. Since the decision of token selection is non-differentiable, we employ a perturbed-maximum-based differentiable Top-K operator for end-to-end training. We conduct extensive experiments on Kinetics-400 with a recently introduced video transformer backbone, MViT. Our framework achieves similar results while requiring 20% less computation, and our approach is compatible with other transformer architectures.
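The differentiable Top-K operator mentioned above follows the perturbed-maximum idea: add random noise to the scores, take a hard top-k on each noisy copy, and average the resulting 0/1 indicators so the expectation varies smoothly with the scores. The sketch below is an illustrative NumPy version, not the paper's implementation; the function name, sample count, and noise scale are assumptions.

```python
import numpy as np

def perturbed_topk(scores, k, num_samples=500, sigma=0.05, seed=0):
    """Soft Top-K via the perturbed-maximum trick (illustrative sketch).

    Gaussian noise is added to the token scores, a hard top-k is taken on
    each noisy copy, and the 0/1 selection indicators are averaged. The
    resulting expectation is smooth in the scores, which is what makes the
    selection trainable end-to-end in an autodiff framework.
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[-1]
    noise = rng.normal(scale=sigma, size=(num_samples, n))
    perturbed = scores[None, :] + noise                       # (samples, n)
    # Indices of the k largest entries in each perturbed copy.
    topk_idx = np.argpartition(-perturbed, k - 1, axis=-1)[:, :k]
    indicator = np.zeros((num_samples, n))
    np.put_along_axis(indicator, topk_idx, 1.0, axis=-1)      # hard 0/1 masks
    return indicator.mean(axis=0)                              # soft weights in [0, 1]
```

With well-separated scores the soft weights approach 1 for the k highest-scoring tokens and 0 elsewhere, so the selected tokens dominate downstream computation while gradients still reach the selection network.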
