Motion-Guided Masking for Spatiotemporal Representation Learning

08/24/2023
by David Fan, et al.

Several recent works have directly extended the image masked autoencoder (MAE) with random masking to the video domain, achieving promising results. However, unlike image understanding, video understanding depends on both spatial and temporal information. This suggests that the random masking strategy inherited from the image MAE is less effective for video MAE, and it motivates the design of a novel masking algorithm that makes more efficient use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be obtained directly from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +1.3% improvement compared to previous state-of-the-art methods. Additionally, our MGM matches the performance of previous video MAE using up to 66% fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +4.9% improvement compared to baseline methods.
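As a rough illustration of the idea of letting motion vectors guide the mask position over time, the sketch below places one rectangular mask per frame around the patch with the largest motion magnitude. It is not the authors' implementation: the function name `motion_guided_mask`, the `(T, H, W, 2)` array layout, the single-box mask, and the argmax-centering rule are all assumptions made for illustration, and the motion vectors are taken as already decoded into a NumPy array rather than read from a compressed video.

```python
import numpy as np


def motion_guided_mask(motion, box_h, box_w):
    """Build a per-frame rectangular mask that follows the strongest motion.

    motion: array of shape (T, H, W, 2) holding one (dx, dy) motion vector
            per patch and frame (hypothetical layout, assumed for this sketch).
    Returns a boolean array of shape (T, H, W); True marks a masked patch.
    """
    T, H, W, _ = motion.shape
    if box_h > H or box_w > W:
        raise ValueError("mask box must fit inside the patch grid")

    mask = np.zeros((T, H, W), dtype=bool)
    magnitude = np.linalg.norm(motion, axis=-1)  # shape (T, H, W)

    for t in range(T):
        # Patch with the largest motion magnitude in frame t.
        # If a frame has no motion at all, argmax falls back to patch (0, 0).
        cy, cx = np.unravel_index(np.argmax(magnitude[t]), (H, W))
        # Center the box on that patch, clipped so it stays inside the grid.
        top = int(np.clip(cy - box_h // 2, 0, H - box_h))
        left = int(np.clip(cx - box_w // 2, 0, W - box_w))
        mask[t, top:top + box_h, left:left + box_w] = True
    return mask


# Toy input: a single moving patch that drifts one row down per frame.
demo_motion = np.zeros((4, 8, 8, 2))
for t in range(4):
    demo_motion[t, 2 + t, 3] = (1.0, 0.0)
demo_mask = motion_guided_mask(demo_motion, box_h=4, box_w=4)
```

In this toy setup the masked box shifts downward with the moving patch, whereas a random mask would be independent of where the motion occurs.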


research
12/02/2018

Disentangling Propagation and Generation for Video Prediction

Learning to predict future video frames is a challenging task. Recent ap...
research
11/25/2020

Can Temporal Information Help with Contrastive Self-Supervised Learning?

Leveraging temporal information has been regarded as essential for devel...
research
10/09/2022

Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders

Masked autoencoders (MAEs) have emerged recently as art self-supervised ...
research
03/31/2022

Deformable Video Transformer

Video transformers have recently emerged as an effective alternative to ...
research
05/12/2017

Single Image Action Recognition by Predicting Space-Time Saliency

We propose a novel approach based on deep Convolutional Neural Networks ...
research
08/21/2023

MGMAE: Motion Guided Masking for Video Masked Autoencoding

Masked autoencoding has shown excellent performance on self-supervised v...
research
01/05/2023

EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding

Recent advances in egocentric video understanding models are promising, ...
