SVT: Supertoken Video Transformer for Efficient Video Understanding

04/01/2023
by Chenbin Pan, et al.

Whether by processing videos at a fixed resolution from start to end or by incorporating pooling and down-scaling strategies, existing video transformers process the whole video content throughout the network without specially handling its large portions of redundant information. In this paper, we present the Supertoken Video Transformer (SVT), which incorporates a Semantic Pooling Module (SPM) that aggregates latent representations along the depth of the visual transformer based on their semantics, thus reducing the redundancy inherent in video inputs. Qualitative results show that our method effectively reduces redundancy by merging latent representations with similar semantics, increasing the proportion of salient information available to downstream tasks. Quantitatively, our method improves the performance of both ViT and MViT while requiring significantly less computation on the Kinetics and Something-Something-V2 benchmarks. More specifically, with our SPM we improve the accuracy of MAE-pretrained ViT-B and ViT-L by 1.5% and 0.2%, and the accuracy of MViTv2-B by 0.2%, on Kinetics-400 and Something-Something-V2, respectively.
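The abstract describes the SPM as merging latent representations with similar semantics into supertokens. The paper's exact mechanism is not reproduced here; as a rough, simplified illustration of the idea, the sketch below greedily merges token vectors whose cosine similarity to an existing supertoken centroid exceeds a threshold. The function name, the greedy strategy, and the threshold value are all illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def semantic_pool(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily merge tokens with similar semantics into supertokens.

    tokens:    (N, D) array of latent token representations.
    threshold: cosine-similarity cutoff for merging into an existing group.
    Returns an (M, D) array of supertokens (group means), with M <= N.
    """
    groups: list[list[int]] = []   # member token indices per supertoken
    centroids: list[np.ndarray] = []  # running mean of each group

    for i, t in enumerate(tokens):
        merged = False
        for j, c in enumerate(centroids):
            # Cosine similarity between the token and the group centroid.
            sim = float(t @ c) / (np.linalg.norm(t) * np.linalg.norm(c) + 1e-8)
            if sim >= threshold:
                groups[j].append(i)
                centroids[j] = tokens[groups[j]].mean(axis=0)
                merged = True
                break
        if not merged:
            # Semantically distinct token: start a new supertoken.
            groups.append([i])
            centroids.append(t.astype(np.float64))

    return np.stack(centroids)
```

With redundant inputs, e.g. two near-identical tokens and one distinct token, this collapses the duplicates into a single supertoken, reducing the token count that later layers must process.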


Related research

VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling (11/24/2021)
A great challenge in video-language (VidL) modeling lies in the disconne...

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (04/22/2021)
We present a framework for learning multimodal representations from unla...

Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning (01/12/2022)
It is a challenging task to learn rich and multi-scale spatiotemporal se...

Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022 (07/22/2022)
We implemented Video Swin Transformer as a base architecture for the tas...

Patch-based Object-centric Transformers for Efficient Video Generation (06/08/2022)
In this work, we present Patch-based Object-centric Video Transformer (P...

SBAT: Video Captioning with Sparse Boundary-Aware Transformer (07/23/2020)
In this paper, we focus on the problem of applying the transformer struc...

Spatiotemporal Transformer for Video-based Person Re-identification (03/30/2021)
Recently, the Transformer module has been transplanted from natural lang...
