Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

09/16/2022
by   Rui Li, et al.

Learning temporal correspondence from unlabeled videos is of vital importance in computer vision and has been tackled by various self-supervised pretext tasks. For such self-supervised learning, recent studies suggest using large-scale video datasets despite the high training cost. We propose a spatial-then-temporal pretext task to reduce this training-data cost. The task consists of two steps. First, we apply contrastive learning to unlabeled still-image data to obtain appearance-sensitive features. Then we switch to unlabeled video data and learn motion-sensitive features by reconstructing frames. In the second step, we propose a global correlation distillation loss to retain the appearance sensitivity learned in the first step, as well as a local correlation distillation loss in a pyramid structure to combat temporal discontinuity. Experimental results demonstrate that our method surpasses state-of-the-art self-supervised methods on a series of correspondence-based tasks, and ablation studies verify the effectiveness of the proposed two-step task and loss functions.
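The global correlation distillation idea can be illustrated with a small sketch: compare the correlation (affinity) maps computed from a frozen step-1 "teacher" encoder and the step-2 "student" encoder, and penalize their difference so the student keeps the teacher's appearance sensitivity. All function names and shapes below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def correlation_map(feat_a, feat_b):
    """Cosine correlation between all spatial positions of two feature maps.

    feat_a, feat_b: (C, H, W) arrays; returns an (H*W, H*W) affinity matrix.
    """
    a = feat_a.reshape(feat_a.shape[0], -1).T  # (HW, C)
    b = feat_b.reshape(feat_b.shape[0], -1).T  # (HW, C)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def global_correlation_distillation(student_pair, teacher_pair):
    """Hypothetical distillation loss: MSE between the student's and the
    frozen teacher's global correlation maps for the same frame pair."""
    corr_s = correlation_map(*student_pair)
    corr_t = correlation_map(*teacher_pair)
    return float(np.mean((corr_s - corr_t) ** 2))
```

The local variant described in the abstract would restrict the correlation to a neighborhood window and apply the same comparison at multiple pyramid levels; the global form above conveys the core mechanism.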


