An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

09/04/2022
by Tsu-Jui Fu, et al.

Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, the pre-extracted video features in previous studies cannot be refined through MVM during pre-training, which leads to unsatisfactory downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), which mitigates the disconnection between fixed video representations and MVM training. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights on the factors leading to effective MVM training. Empirically, we show that VIOLET pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering and video captioning to text-to-video retrieval.
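To make the MVM objective concrete, the sketch below shows the general pattern of masking video patches and regressing a chosen reconstruction target (e.g., pixel values, HOG features, depth, or optical flow) only at the masked positions. This is a minimal illustration under stated assumptions, not the VIOLET implementation; names such as `MVMHead`, `mvm_loss`, and `mask_ratio` are hypothetical, and the encoder is any module mapping flattened patches to per-patch features.

```python
# Minimal sketch of a masked-visual-modeling (MVM) objective.
# Assumptions: `encoder` maps (B, N, D_in) patch embeddings to (B, N, H) features;
# `targets` holds the chosen reconstruction target per patch (pixels, HOG, depth, ...).
import torch
import torch.nn as nn


class MVMHead(nn.Module):
    """Predicts a reconstruction target for each video patch."""

    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, target_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)


def mvm_loss(encoder: nn.Module,
             head: MVMHead,
             video_patches: torch.Tensor,   # (B, N, D_in) flattened video patches
             targets: torch.Tensor,         # (B, N, D_tgt) reconstruction targets
             mask_ratio: float = 0.15) -> torch.Tensor:
    """Mask a random subset of patches, encode end-to-end, regress the targets."""
    B, N, _ = video_patches.shape
    mask = torch.rand(B, N, device=video_patches.device) < mask_ratio

    # Blank out the masked patches (a learned [MASK] embedding is another option).
    masked_input = video_patches.masked_fill(mask.unsqueeze(-1), 0.0)

    features = encoder(masked_input)   # (B, N, H)
    predictions = head(features)       # (B, N, D_tgt)

    # The loss is computed only on masked positions, as in masked modeling.
    return nn.functional.mse_loss(predictions[mask], targets[mask])
```

Because the video encoder itself is updated through this loss, the visual representations can be refined during pre-training rather than staying fixed, which is the disconnection the end-to-end setup is meant to remove.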


