Dual Contrastive Learning for Spatio-temporal Representation

07/12/2022
by   Shuangrui Ding, et al.

Contrastive learning has shown promising potential in self-supervised spatio-temporal representation learning. Most works naively sample different clips to construct positive and negative pairs. However, we observe that this formulation biases the model toward the background scene. The underlying reasons are twofold. First, the scene difference is usually more noticeable and easier to discriminate than the motion difference. Second, clips sampled from the same video often share similar backgrounds but have distinct motions; simply regarding them as positive pairs draws the model toward the static background rather than the motion pattern. To tackle this challenge, this paper presents a novel dual contrastive formulation. Concretely, we decouple the input RGB video sequence into two complementary modes: static scene and dynamic motion. The original RGB features are then pulled closer to the static features and to the aligned dynamic features. In this way, the static scene and the dynamic motion are simultaneously encoded into a compact RGB representation. We further decouple the feature space via activation maps to distill static- and dynamic-related features. We term our method Dual Contrastive Learning for spatio-temporal Representation (DCLR). Extensive experiments demonstrate that DCLR learns effective spatio-temporal representations and obtains state-of-the-art or comparable performance on the UCF-101, HMDB-51, and Diving-48 datasets.
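The dual contrastive objective described in the abstract can be sketched as two InfoNCE terms that pull the RGB representation toward a static view and a dynamic view of the same video. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: using the temporally averaged frame as a stand-in for the static scene, frame differences as a stand-in for the dynamic motion, and hypothetical function names (`decouple_video`, `info_nce`, `dual_contrastive_loss`).

```python
import numpy as np

def decouple_video(video):
    """Split a (T, H, W, C) clip into crude static/dynamic modes.
    Static: temporal average (scene proxy); dynamic: frame differences (motion proxy).
    These proxies are illustrative assumptions, not the paper's decoupling."""
    static = video.mean(axis=0)        # (H, W, C) scene proxy
    dynamic = np.diff(video, axis=0)   # (T-1, H, W, C) motion proxy
    return static, dynamic

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: pull the anchor toward the positive, push it from negatives."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()             # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def dual_contrastive_loss(rgb, static_pos, dyn_pos, static_negs, dyn_negs, tau=0.1):
    """Sum of the two contrastive terms: RGB vs. static and RGB vs. dynamic features."""
    return (info_nce(rgb, static_pos, static_negs, tau)
            + info_nce(rgb, dyn_pos, dyn_negs, tau))
```

In the paper's setting, `static_pos` and `dyn_pos` would be encoded features of the decoupled modes of the same video, while the negatives would come from other videos in the batch.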


Related research

03/30/2022 · Controllable Augmentations for Video Representation Learning
This paper focuses on self-supervised video representation learning. Mos...

12/16/2021 · Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation
Spatio-temporal representation learning is critical for video self-super...

09/12/2020 · Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion
One significant factor we expect the video representation learning to ca...

09/30/2021 · Motion-aware Self-supervised Video Representation Learning via Foreground-background Merging
In light of the success of contrastive learning in the image domain, cur...

09/12/2020 · Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning
Self-supervised learning has shown great potentials in improving the vid...

07/26/2022 · Static and Dynamic Concepts for Self-supervised Video Representation Learning
In this paper, we propose a novel learning scheme for self-supervised vi...

12/07/2021 · Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning
Despite the great progress in video understanding made by deep convoluti...
