VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling

11/24/2021
by   Tsu-Jui Fu, et al.

A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data. Recent studies try to mitigate this disconnection via end-to-end training. To make it computationally feasible, prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled frames are fed into a 2D CNN, followed by a simple mean-pooling or concatenation to obtain the overall video representations. Although achieving promising results, such simple approaches may lose temporal information that is essential for performing downstream VidL tasks. In this work, we present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs. Further, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling. Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches. Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM. As a result, VIOLET achieves new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.
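To make the MVM objective described above more concrete, the following is a minimal PyTorch sketch of the idea: patch features produced by a video transformer are classified into a discrete visual-token vocabulary, and a cross-entropy loss is applied only at masked positions. The vocabulary size, hidden dimension, masking ratio, and the random tensors standing in for a real visual tokenizer and transformer are illustrative assumptions, not the paper's actual implementation.

# Sketch of a Masked Visual-token Modeling (MVM) objective (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192    # assumed size of the discrete visual-token vocabulary
HIDDEN_DIM = 768     # assumed hidden size of the video transformer
NUM_PATCHES = 196    # e.g., 14 x 14 patches per sampled frame (assumption)


class MVMHead(nn.Module):
    """Maps each patch feature to a distribution over discrete visual tokens."""

    def __init__(self, hidden_dim: int = HIDDEN_DIM, vocab_size: int = VOCAB_SIZE):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, hidden_dim)
        return self.classifier(patch_features)  # (batch, num_patches, vocab_size)


def mvm_loss(logits: torch.Tensor, target_tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy computed only at masked patch positions.

    logits:        (batch, num_patches, vocab_size) predictions from the MVM head
    target_tokens: (batch, num_patches) token ids from a frozen visual tokenizer
    mask:          (batch, num_patches) bool tensor, True where the input patch was masked
    """
    return F.cross_entropy(logits[mask], target_tokens[mask])


if __name__ == "__main__":
    # Random tensors stand in for real transformer outputs and tokenizer targets.
    batch = 2
    features = torch.randn(batch, NUM_PATCHES, HIDDEN_DIM)
    tokens = torch.randint(0, VOCAB_SIZE, (batch, NUM_PATCHES))
    mask = torch.rand(batch, NUM_PATCHES) < 0.15  # ~15% masking ratio (assumption)

    head = MVMHead()
    loss = mvm_loss(head(features), tokens, mask)
    print(loss.item())

In this sketch the visual tokenizer is frozen and only supplies the prediction targets; the video transformer is trained to recover the token ids of masked patches, which is the recovery objective the abstract describes.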


Related research

09/04/2022 · An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
Masked visual modeling (MVM) has been recently proven effective for visu...

11/25/2021 · SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
The canonical approach to video captioning dictates a caption generation...

04/01/2023 · SVT: Supertoken Video Transformer for Efficient Video Understanding
Whether by processing videos with fixed resolution from start to end or ...

05/17/2022 · A CLIP-Hitchhiker's Guide to Long Video Retrieval
Our goal in this paper is the adaptation of image-text models for long v...

04/18/2023 · SViTT: Temporal Learning of Sparse Video-Text Transformers
Do video-text transformers learn to model temporal relationships across ...

10/10/2022 · Turbo Training with Token Dropout
The objective of this paper is an efficient training method for video ta...

11/19/2022 · Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection
Self-supervised Video Representation Learning (VRL) aims to learn transf...
