Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation

12/16/2021
by Yujia Zhang, et al.

Spatio-temporal representation learning is critical for self-supervised video representation. Recent approaches mainly rely on contrastive learning and pretext tasks. However, these approaches learn representations by discriminating sampled instances via feature similarity in the latent space while ignoring the intermediate states of the learned representations, which limits overall performance. In this work, we treat the degree of similarity between sampled instances as the intermediate state and propose a novel pretext task: spatio-temporal overlap rate (STOR) prediction. It stems from the observation that humans can discriminate the overlap rates of videos in space and time. The task encourages the model to discriminate the STOR of two generated samples in order to learn representations. Moreover, we jointly optimize pretext tasks and contrastive learning to further enhance spatio-temporal representation learning. We also study the mutual influence of each component in the proposed scheme. Extensive experiments demonstrate that the proposed STOR task benefits both contrastive learning and pretext tasks, and that the joint optimization scheme significantly improves spatio-temporal representation in video understanding. The code is available at https://github.com/Katou2/CSTP.
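
To make the idea of an overlap rate between two sampled clips concrete, here is a minimal sketch. The abstract does not give the exact definition of STOR, so this is an assumption rather than the authors' formulation: each sample is treated as an axis-aligned space-time cuboid, and the overlap rate is taken to be the intersection-over-union of the two cuboids. The names `Clip` and `overlap_rate` are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """A space-time cuboid cut from a video (hypothetical representation)."""
    t0: int  # first frame index, inclusive
    t1: int  # last frame index, exclusive
    x0: int  # left pixel column, inclusive
    x1: int  # right pixel column, exclusive
    y0: int  # top pixel row, inclusive
    y1: int  # bottom pixel row, exclusive

    def volume(self) -> int:
        # Number of (frame, pixel) cells covered by the clip.
        return (self.t1 - self.t0) * (self.x1 - self.x0) * (self.y1 - self.y0)


def _overlap_1d(a0: int, a1: int, b0: int, b1: int) -> int:
    # Length of the intersection of [a0, a1) and [b0, b1); zero if disjoint.
    return max(0, min(a1, b1) - max(a0, b0))


def overlap_rate(a: Clip, b: Clip) -> float:
    """Assumed overlap measure: intersection over union of two cuboids."""
    inter = (
        _overlap_1d(a.t0, a.t1, b.t0, b.t1)
        * _overlap_1d(a.x0, a.x1, b.x0, b.x1)
        * _overlap_1d(a.y0, a.y1, b.y0, b.y1)
    )
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 16-frame clips with the same 112x112 crop, shifted by 8 frames.
    first = Clip(0, 16, 0, 112, 0, 112)
    second = Clip(8, 24, 0, 112, 0, 112)
    print(overlap_rate(first, second))
```

Under this assumed definition, the two clips in the usage example share 8 of 24 distinct frames over an identical crop, so the rate is 1/3. A value like this could serve as the target that a prediction head is trained to estimate from the two clips' features.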


research
08/06/2020

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

We propose a self-supervised method to learn feature representations fro...
research
03/31/2022

Video-Text Representation Learning via Differentiable Weak Temporal Alignment

Learning generic joint representations for video and text by a supervise...
research
04/07/2019

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-an...
research
01/11/2022

Motion-Focused Contrastive Learning of Video Representations

Motion, as the most distinct phenomenon in a video to involve the change...
research
07/12/2022

Dual Contrastive Learning for Spatio-temporal Representation

Contrastive learning has shown promising potential in self-supervised sp...
research
09/06/2023

Spatio-Temporal Contrastive Self-Supervised Learning for POI-level Crowd Flow Inference

Accurate acquisition of crowd flow at Points of Interest (POIs) is pivot...
research
06/20/2020

Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning

In self-supervised spatio-temporal representation learning, the temporal...
