Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

11/23/2020
by   Zehua Zhang, et al.

We present a novel approach to self-supervised video representation learning that: (a) decouples the learning objective into two contrastive subtasks emphasizing spatial and temporal features, respectively, and (b) performs them hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we introduce, for the first time, spatial-temporal feature decoupling and hierarchical learning to the context of unsupervised video learning. In particular, our method directs the network to capture spatial and temporal features separately, building on contrastive learning and manipulating augmentations as a form of regularization, and solves these proxy tasks hierarchically by optimizing a compound contrastive loss. Experiments show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) achieves substantial gains over directly learning spatial-temporal features as a whole and significantly outperforms other state-of-the-art unsupervised methods on the downstream action recognition benchmarks UCF101 and HMDB51. We will release our code and pretrained weights.
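To make the compound objective concrete, below is a minimal sketch of how decoupled spatial and temporal contrastive terms could be summed over hierarchy levels. It assumes a standard InfoNCE formulation; the function names (info_nce, compound_hdc_loss), the temperature value, the per-level weighting, and the way spatial/temporal positive pairs are formed from augmented clips are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    # Standard InfoNCE: pull the query toward its positive, push it away from negatives.
    # query, positive_key: (B, D); negative_keys: (K, D)
    query = F.normalize(query, dim=-1)
    positive_key = F.normalize(positive_key, dim=-1)
    negative_keys = F.normalize(negative_keys, dim=-1)
    pos_logit = (query * positive_key).sum(dim=-1, keepdim=True)   # (B, 1)
    neg_logits = query @ negative_keys.t()                         # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

def compound_hdc_loss(spatial_pairs, temporal_pairs, negatives_per_level, level_weights):
    """Hypothetical compound loss over hierarchy levels.

    spatial_pairs / temporal_pairs: lists over hierarchy levels of (query, positive)
    feature tuples. In this sketch, a spatial positive shares appearance but differs
    in temporal augmentation, and a temporal positive shares motion but differs in
    spatial augmentation (an assumed pairing scheme for illustration).
    """
    total = 0.0
    for (sq, sk), (tq, tk), negs, w in zip(spatial_pairs, temporal_pairs,
                                           negatives_per_level, level_weights):
        # Each level contributes a spatially-focused and a temporally-focused term.
        total = total + w * (info_nce(sq, sk, negs) + info_nce(tq, tk, negs))
    return total
```

In practice the per-level features would come from intermediate stages of the video backbone, and the actual pairing and weighting used by HDC may differ from this sketch.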


