Unmasked Teacher: Towards Training-Efficient Video Foundation Models

03/28/2023
by   Kunchang Li, et al.
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performance on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.
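The core idea above (mask most low-semantics tokens, then align only the surviving unmasked tokens with the frozen image-teacher's features) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, the saliency-based token selection, and the negative-cosine alignment loss are all assumptions made here for clarity.

```python
import numpy as np

def umt_alignment_loss(student_tokens, teacher_tokens, saliency, keep_ratio=0.2):
    """Sketch of a UMT-style objective (hypothetical interface).

    student_tokens, teacher_tokens: (num_tokens, dim) feature arrays.
    saliency: per-token semantic scores; the top `keep_ratio` fraction
    of tokens is kept unmasked, the rest is discarded for efficiency.
    """
    n = student_tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(-saliency)[:k]  # indices of the unmasked, high-semantics tokens

    # L2-normalize both sides and align via negative cosine similarity
    # (one plausible choice of alignment loss, assumed for this sketch)
    s = student_tokens[keep]
    t = teacher_tokens[keep]
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return float(1.0 - (s * t).sum(axis=1).mean())
```

In this sketch the teacher is frozen, so only the student's unmasked tokens receive gradients; when student and teacher features agree perfectly, the loss is zero.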

