SVT: Supertoken Video Transformer for Efficient Video Understanding
Whether they process videos at a fixed resolution from start to end or incorporate pooling and down-scaling strategies, existing video transformers process the whole video content throughout the network without specifically handling the large portions of redundant information. In this paper, we present a Supertoken Video Transformer (SVT) that incorporates a Semantic Pooling Module (SPM) to aggregate latent representations along the depth of the vision transformer based on their semantics, thus reducing the redundancy inherent in video inputs. Qualitative results show that our method effectively reduces redundancy by merging latent representations with similar semantics, thereby increasing the proportion of salient information available to downstream tasks. Quantitatively, our method improves the performance of both ViT and MViT while requiring significantly less computation on the Kinetics and Something-Something-V2 benchmarks. More specifically, with our SPM, we improve the accuracy of MAE-pretrained ViT-B and ViT-L by 1.5% and 0.2%, and improve the accuracy of MViTv2-B by 0.2%, on Kinetics-400 and Something-Something-V2, respectively.
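The abstract does not spell out how the SPM groups latent representations, so the following is only a minimal sketch of one way semantics-based token pooling could work, assuming a k-means-style grouping on cosine similarity. The function name `semantic_pool`, the token and supertoken counts, and the clustering procedure are illustrative assumptions, not the paper's actual module.

```python
import torch
import torch.nn.functional as F

def semantic_pool(tokens: torch.Tensor, num_supertokens: int, iters: int = 3) -> torch.Tensor:
    """Merge tokens with similar semantics into a smaller set of "supertokens".

    tokens: (N, D) latent representations for one video clip.
    Returns: (num_supertokens, D) pooled representations.
    Illustrative k-means-style grouping on cosine similarity (an assumption).
    """
    # Initialize centroids from evenly spaced tokens.
    idx = torch.linspace(0, tokens.shape[0] - 1, num_supertokens).long()
    centroids = tokens[idx].clone()
    for _ in range(iters):
        # Assign each token to its most similar centroid by cosine similarity.
        sim = F.normalize(tokens, dim=-1) @ F.normalize(centroids, dim=-1).T  # (N, K)
        assign = sim.argmax(dim=-1)                                           # (N,)
        # Recompute each centroid as the mean of the tokens assigned to it.
        for k in range(num_supertokens):
            members = tokens[assign == k]
            if members.numel() > 0:
                centroids[k] = members.mean(dim=0)
    return centroids

# Example: 1568 patch tokens (hypothetical 8x14x14 video grid) pooled to 392 supertokens.
x = torch.randn(1568, 768)
supertokens = semantic_pool(x, num_supertokens=392)
print(supertokens.shape)  # torch.Size([392, 768])
```

In this sketch, later transformer blocks would attend over the 392 supertokens instead of all 1568 tokens, which is where the computational savings would come from; the paper's actual merging criterion and schedule may differ.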