SViTT: Temporal Learning of Sparse Video-Text Transformers

04/18/2023
by Yi Li, et al.

Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed a strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in the temporal learning of video-text transformers: the spatiotemporal trade-off imposed by limited network size; the curse of dimensionality in multi-frame modeling; and the diminishing returns of semantic information as clip length is extended. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning at significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question-answering benchmarks at a fraction of the computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
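To make the two sparsity mechanisms concrete, here is a minimal, illustrative PyTorch sketch, not the authors' implementation: edge sparsity is shown as a masked local attention window, and node sparsity as top-k token pruning ranked by received attention. The function names, window size, and keep ratio are assumptions for exposition.

```python
import torch

def edge_sparse_attention(q, k, v, window=4):
    # Edge sparsity: each query token attends only to keys within a local
    # temporal window; all other query-key edges are masked out.
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (n, n) attention logits
    idx = torch.arange(n)
    blocked = (idx[None, :] - idx[:, None]).abs() > window  # True = pruned edge
    scores = scores.masked_fill(blocked, float("-inf"))
    attn = scores.softmax(dim=-1)
    return attn @ v, attn

def node_sparse_prune(tokens, attn, keep_ratio=0.5):
    # Node sparsity: score each token by the attention mass it receives and
    # keep only the top fraction, discarding uninformative visual tokens.
    importance = attn.mean(dim=-2)                          # (n,) incoming attention
    k = max(1, int(tokens.shape[-2] * keep_ratio))
    keep = importance.topk(k).indices.sort().values         # preserve token order
    return tokens[keep]

# Toy usage: 16 visual tokens of dimension 64.
x = torch.randn(16, 64)
out, attn = edge_sparse_attention(x, x, x, window=4)
pruned = node_sparse_prune(out, attn, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([8, 64])
```

Under the curriculum described in the abstract, sparsity grows with clip length; in this toy version that would correspond to shrinking the (assumed) `window` and `keep_ratio` parameters as more frames are sampled.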


