M^3Video: Masked Motion Modeling for Self-Supervised Video Representation Learning

10/12/2022
by Xinyu Sun, et al.

We study self-supervised video representation learning, which seeks to learn video features from unlabeled videos and is widely used for video analysis, since labeling videos is labor-intensive. Current methods often mask some video regions and then train a model to reconstruct spatial information in these regions (e.g., original pixels). However, the model can easily reconstruct this information from the content of a single frame. As a result, it may neglect to learn the interactions between frames, which are critical for video analysis. In this paper, we present a new self-supervised learning task, called Masked Motion Modeling (M^3Video), which learns representations by requiring the model to predict the motion of moving objects in the masked regions. To generate motion targets for this task, we track the objects using optical flow. The motion targets consist of position transitions and shape changes of the tracked objects, so the model must consider multiple frames comprehensively. Moreover, to help the model capture fine-grained motion details, we require it to predict high-temporal-resolution trajectory motion targets from a low-temporal-resolution video. After pre-training on our M^3Video task, the model can anticipate fine-grained motion details even when given a sparsely sampled video as input. We conduct extensive experiments on four benchmark datasets. Remarkably, when pre-training for 400 epochs, we improve the accuracy from 67.6% to 69.2% on Something-Something V2 and from 78.8% to 79.7% on Kinetics-400.
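To make the target-generation idea concrete, the sketch below tracks a patch center through a sequence of optical-flow fields and derives position-transition targets from the resulting trajectory. This is a minimal illustration, not the paper's implementation: the function names (`track_patch`, `motion_targets`) are hypothetical, the flow fields are assumed to be precomputed dense (dx, dy) maps, and shape-change targets are omitted.

```python
import numpy as np

def track_patch(flows, start_xy):
    """Follow a patch center through consecutive optical-flow fields.

    flows: array of shape (T-1, H, W, 2), per-frame displacement (dx, dy).
    start_xy: (x, y) center of the tracked patch in frame 0.
    Returns a list of (x, y) centers, one per frame.
    """
    x, y = start_xy
    trajectory = [(x, y)]
    for flow in flows:
        h, w = flow.shape[:2]
        # Sample the flow at the nearest valid pixel to the current center.
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        dx, dy = flow[yi, xi]
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory

def motion_targets(trajectory):
    """Position transitions between consecutive tracked centers,
    i.e. the quantities a model would be asked to predict."""
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(trajectory, trajectory[1:])]

# Toy example: a constant rightward flow of one pixel per frame.
flows = np.zeros((3, 8, 8, 2))
flows[..., 0] = 1.0
traj = track_patch(flows, (2.0, 2.0))
targets = motion_targets(traj)
```

With the constant flow above, the trajectory steps one pixel right per frame and every target is (1, 0); a real pipeline would instead chain flows estimated at a higher frame rate than the sampled input video, which is what forces the model to infer fine-grained motion from sparse frames.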


Related research

11/21/2016
Self-Supervised Video Representation Learning With Odd-One-Out Networks
We propose a new self-supervised CNN pre-training technique based on a n...

08/21/2023
MGMAE: Motion Guided Masking for Video Masked Autoencoding
Masked autoencoding has shown excellent performance on self-supervised v...

08/18/2020
Self-supervised Sparse to Dense Motion Segmentation
Observable motion in videos can give rise to the definition of objects m...

01/30/2021
Video Reenactment as Inductive Bias for Content-Motion Disentanglement
We introduce a self-supervised motion-transfer VAE model to disentangle ...

11/19/2022
Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection
Self-supervised Video Representation Learning (VRL) aims to learn transf...

09/12/2020
Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning
Self-supervised learning has shown great potential in improving the vid...

06/12/2023
Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection
We present a novel approach for the detection of deepfake videos using a...
