SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

11/21/2022
by Yuanze Lin, et al.

Video-language pre-training is crucial for learning powerful multi-modal representations, but it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundational component of SMAUG is the masked autoencoder. Unlike prior works that mask only textual inputs, our masking strategy considers both the visual and textual modalities, providing better cross-modal alignment and saving more pre-training cost. On top of that, we introduce a space-time token sparsification module, which leverages context information to select only "important" spatial regions and temporal frames for pre-training. Coupling all these designs allows our method to achieve competitive performance on text-to-video retrieval and video question answering tasks while reducing pre-training cost by 1.9X or more. For example, SMAUG needs only about 50 NVIDIA A6000 GPU hours of pre-training to attain competitive performance on these two video-language tasks across six popular benchmarks.
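To make the two ideas in the abstract concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' code) of (1) masking both video patch tokens and text tokens before encoding and (2) space-time sparsification that keeps only the highest-scoring frames and spatial patches. The tensor shapes, mask ratios, and the norm-based "importance" score are illustrative assumptions; the paper derives its importance scores from context information.

```python
import torch

def mask_tokens(tokens: torch.Tensor, mask_ratio: float):
    """Randomly drop a fraction of tokens along dim=1 (works for video patches or text)."""
    B, N, D = tokens.shape
    keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=tokens.device)      # random score per token
    keep_idx = noise.argsort(dim=1)[:, :keep]            # indices of the kept tokens
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)), keep_idx

def sparsify_space_time(video: torch.Tensor, top_frames: int, top_patches: int):
    """Keep the top-k frames and, within them, the top-k spatial patches.

    video: [B, T, N, D]. Token L2-norm stands in for the context-based
    importance score used by the paper's sparsification module (assumption).
    """
    B, T, N, D = video.shape
    frame_score = video.norm(dim=-1).mean(dim=-1)         # [B, T]
    f_idx = frame_score.topk(top_frames, dim=1).indices   # [B, top_frames]
    frames = torch.gather(video, 1, f_idx[..., None, None].expand(-1, -1, N, D))
    patch_score = frames.norm(dim=-1)                      # [B, top_frames, N]
    p_idx = patch_score.topk(top_patches, dim=-1).indices
    return torch.gather(frames, 2, p_idx.unsqueeze(-1).expand(-1, -1, -1, D))

# Toy usage: 2 clips, 8 frames, 196 patches, 768-dim tokens; 32 text tokens.
video = torch.randn(2, 8, 196, 768)
text = torch.randn(2, 32, 768)
sparse_video = sparsify_space_time(video, top_frames=4, top_patches=98)
vis_tokens, _ = mask_tokens(sparse_video.flatten(1, 2), mask_ratio=0.75)
txt_tokens, _ = mask_tokens(text, mask_ratio=0.15)
print(vis_tokens.shape, txt_tokens.shape)  # torch.Size([2, 98, 768]) torch.Size([2, 27, 768])
```

In this sketch, only the surviving (sparsified and unmasked) tokens would be fed to the encoder, which is where the computational savings come from; the masked tokens would be reconstructed by a lightweight decoder, as in standard masked autoencoding.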


Related research

12/02/2022 · Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
We present a simple yet effective end-to-end Video-language Pre-training...

10/13/2022 · RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Video language pre-training methods have mainly adopted sparse sampling ...

03/28/2023 · Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Video Foundation Models (VFMs) have received limited exploration due to ...

12/30/2022 · HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
Video-language pre-training has advanced the performance of various down...

05/20/2021 · VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
We present a simplified, task-agnostic multi-modal pre-training approach...

05/18/2023 · CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training
Improving text representation has attracted much attention to achieve ex...

11/09/2022 · Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
The pre-training of masked language models (MLMs) consumes massive compu...
