Masked Autoencoders As Spatiotemporal Learners

05/18/2022
by Christoph Feichtenhofer, et al.

This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to the information redundancy of the data. A high masking ratio leads to a large speedup, e.g., more than 4x in wall-clock time. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
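To make the core idea concrete, the sketch below illustrates spacetime-agnostic random masking of patch tokens at a 90% ratio. It is a minimal illustration, not the authors' released code; the function name, tensor shapes, and return values are assumptions made for the example. The key point it shows is that only the small visible subset of tokens would be passed to the encoder, which is where the wall-clock speedup comes from.

```python
# Minimal sketch (not the authors' implementation) of spacetime-agnostic
# random masking over flattened video patch tokens.
import torch

def random_spacetime_masking(tokens, mask_ratio=0.9):
    """Keep a random subset of spacetime patch tokens.

    tokens: (batch, num_patches, dim) patch embeddings, where
            num_patches covers all temporal and spatial patch positions.
    Returns the visible tokens, a binary mask (1 = masked, 0 = visible),
    and the indices needed to restore the original token order.
    """
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))  # e.g. keep 10% at a 90% masking ratio

    # One random permutation per sample; masking ignores space/time structure.
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # Gather the visible tokens; only these would be fed to the encoder.
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))

    # Binary mask over all patches, in the original order.
    mask = torch.ones(B, N, device=tokens.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```

In an MAE-style setup, a lightweight decoder would then take the visible tokens plus learned mask tokens (reordered via ids_restore) and be trained to reconstruct the pixels of the masked patches.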

Related research

- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training (03/23/2022). Pre-training video transformers on extra large-scale datasets is general...
- A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning (04/29/2021). We present a large-scale study on unsupervised spatiotemporal representa...
- AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders (11/16/2022). Masked Autoencoders (MAEs) learn generalizable representations for image...
- SimVTP: Simple Video Text Pre-training with Masked Autoencoders (12/07/2022). This paper presents SimVTP: a Simple Video-Text Pretraining framework vi...
- Scaling Language-Image Pre-training via Masking (12/01/2022). We present Fast Language-Image Pre-training (FLIP), a simple and more ef...
- Exploring Target Representations for Masked Autoencoders (09/08/2022). Masked autoencoders have become popular training paradigms for self-supe...
- Advancing Radiograph Representation Learning with Masked Record Modeling (01/30/2023). Modern studies in radiograph representation learning rely on either self...