Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

08/08/2023
by Shuangrui Ding, et al.

Transformers have become the primary backbone in the computer vision community due to their impressive performance. However, their heavy computation cost impedes their potential in the video recognition domain. To optimize the speed-accuracy trade-off, we propose the Semantic-aware Temporal Accumulation score (STA) to prune spatio-temporal tokens integrally. The STA score considers two critical factors: temporal redundancy and semantic importance. The former determines whether a specific region is a new occurrence or a seen entity by aggregating token-to-token similarity across consecutive frames, while the latter evaluates each token by its contribution to the overall prediction. As a result, tokens with higher STA scores carry more temporal redundancy and less semantic information, and are therefore pruned. Based on the STA score, we progressively prune tokens without introducing any additional parameters or requiring further re-training. We directly apply the STA module to off-the-shelf ViT and VideoSwin backbones, and the empirical results on Kinetics-400 and Something-Something V2 achieve over 30% computation reduction with a negligible ~0.2% accuracy drop. Code is available at https://github.com/Mark12Ding/STA.
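To make the idea concrete, below is a minimal PyTorch sketch of how an STA-style score could combine the two factors and drive pruning. The function name `sta_prune`, the use of class-token attention as the semantic term, and the simple `redundancy - semantics` combination are illustrative assumptions based on the abstract, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sta_prune(tokens_prev, tokens_cur, cls_attn, keep_ratio=0.7):
    """Illustrative STA-style pruning sketch (not the paper's exact method).

    tokens_prev: (B, N, C) tokens of the previous frame
    tokens_cur:  (B, N, C) tokens of the current frame
    cls_attn:    (B, N) attention each current-frame token receives from the
                 class token, used here as a proxy for semantic importance
    keep_ratio:  fraction of current-frame tokens to keep
    """
    # Temporal redundancy: how well each current token is already "explained"
    # by the previous frame (token-to-token cosine similarity, max over prev tokens).
    prev_n = F.normalize(tokens_prev, dim=-1)
    cur_n = F.normalize(tokens_cur, dim=-1)
    sim = cur_n @ prev_n.transpose(1, 2)            # (B, N, N)
    redundancy = sim.max(dim=-1).values             # seen entity -> high, new occurrence -> low

    # Semantic importance: tokens that contribute more to the prediction score lower for pruning.
    semantics = cls_attn / (cls_attn.sum(dim=-1, keepdim=True) + 1e-6)

    # High redundancy and low semantics -> high STA score -> pruned.
    sta_score = redundancy - semantics

    num_keep = max(1, int(keep_ratio * tokens_cur.shape[1]))
    keep_idx = sta_score.topk(num_keep, dim=-1, largest=False).indices  # keep the lowest scores
    keep_idx, _ = keep_idx.sort(dim=-1)
    kept = torch.gather(
        tokens_cur, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, tokens_cur.shape[-1])
    )
    return kept, keep_idx
```

In this sketch the pruning is applied frame by frame and can be repeated at successive blocks to prune progressively; no parameters are learned, so it can be attached to a frozen backbone without re-training, in line with the abstract's claim.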


