VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

03/29/2023
by Limin Wang, et al.

Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it remains challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder further reduces the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm: an initial pre-training on a diverse, multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the Kinetics datasets (90.0% on Kinetics-400 and 89.9% on Kinetics-600). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners.
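The dual masking idea described above can be sketched in a few lines. This is a minimal, illustrative NumPy version, not the paper's implementation: the function name, the stand-in encoder/decoder callables, and the masking ratios (roughly 10% visible tokens for the encoder, 50% of masked positions for the decoder) are assumptions chosen for clarity.

```python
import numpy as np

def dual_masking(tokens, encoder, decoder, visible_ratio=0.1, decode_ratio=0.5, rng=None):
    """Illustrative sketch of dual masking (names and ratios are assumptions).

    The encoder sees only a small visible subset of the video tokens; the
    decoder then reconstructs only a subset of the *masked* positions rather
    than all of them, further cutting decoder compute on top of the high
    encoder masking ratio.
    """
    rng = rng or np.random.default_rng(0)
    B, N, D = tokens.shape                           # batch, tokens, embed dim
    perm = rng.permutation(N)
    n_vis = max(1, int(N * visible_ratio))
    vis_idx, masked_idx = perm[:n_vis], perm[n_vis:]
    n_dec = max(1, int(masked_idx.size * decode_ratio))
    dec_idx = masked_idx[:n_dec]                     # masked positions to decode

    latent = encoder(tokens[:, vis_idx])             # encode visible tokens only
    # placeholder queries for decoded positions (a learned mask token in practice)
    mask_queries = np.zeros((B, n_dec, latent.shape[-1]))
    recon = decoder(np.concatenate([latent, mask_queries], axis=1))[:, -n_dec:]
    return recon, dec_idx
```

With `N = 100` tokens and the default ratios, the encoder processes 10 tokens and the decoder reconstructs 45 of the 90 masked positions, so both stages operate on far fewer tokens than the full video.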


