Motion-aware Self-supervised Video Representation Learning via Foreground-background Merging

09/30/2021
by   Shuangrui Ding, et al.
Following the success of contrastive learning in the image domain, current self-supervised video representation learning methods usually employ a contrastive loss. When two augmented views of a video are naively pulled closer, however, the model tends to learn the common static background as a shortcut and fails to capture motion information, a phenomenon dubbed background bias. This bias weakens generalization and degrades performance on downstream tasks such as action recognition. To alleviate it, we propose Foreground-background Merging (FAME), which deliberately composes the foreground region of a selected video onto the backgrounds of others. Specifically, without any off-the-shelf detector, we extract the foreground and background regions via frame difference and color statistics, and shuffle the background regions among the videos. By leveraging the semantic consistency between the original clips and the fused ones, the model focuses more on the foreground motion pattern and is thus more robust to the background context. Extensive experiments demonstrate that FAME significantly boosts performance on different downstream tasks with various backbones. When integrated with MoCo, FAME reaches 84.8 HMDB51, respectively, achieving state-of-the-art performance.
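
The merging step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `keep_ratio` parameter, and the top-k thresholding rule are assumptions made for illustration, and the color-statistics refinement mentioned in the abstract is omitted, so the foreground is approximated from frame differences alone.

```python
import numpy as np


def foreground_mask(clip, keep_ratio=0.5):
    """Approximate a foreground mask for one clip of shape (T, H, W, C).

    Assumption: pixels with the largest summed absolute frame difference
    are treated as foreground. Returns a binary mask of shape (1, H, W, 1).
    """
    # Sum |frame[t+1] - frame[t]| over time and channels -> (H, W).
    diff = np.abs(clip[1:] - clip[:-1]).sum(axis=(0, 3))
    # Keep the top `keep_ratio` fraction of pixels (at least one pixel).
    k = max(1, int(keep_ratio * diff.size))
    threshold = np.sort(diff.ravel())[::-1][k - 1]
    mask = (diff >= threshold).astype(clip.dtype)
    return mask[None, :, :, None]


def merge_foreground_background(clips, keep_ratio=0.5, rng=None):
    """Paste each clip's foreground onto another clip from the same batch.

    clips: float array of shape (B, T, H, W, C).
    Background donors are chosen by shuffling the batch; a clip may
    occasionally be paired with itself.
    """
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(clips))
    masks = np.stack([foreground_mask(c, keep_ratio) for c in clips])
    # Foreground pixels come from the clip itself, the rest from the donor.
    return masks * clips + (1.0 - masks) * clips[perm]
```

In a contrastive setup, the merged clips would serve as one view and the original clips as the other, so that the static background no longer suffices to match the two views.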


