Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

11/23/2020
by Zehua Zhang, et al.

We present a novel approach to self-supervised video representation learning that (a) decouples the learning objective into two contrastive subtasks, emphasizing spatial and temporal features respectively, and (b) performs this learning hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we are the first to introduce spatial-temporal feature learning decoupling and hierarchical learning to unsupervised video learning. In particular, our method directs the network to capture spatial and temporal features separately on the basis of contrastive learning, by manipulating augmentations as regularization, and further solves these proxy tasks hierarchically by optimizing a compound contrastive loss. Experiments show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) achieves substantial gains over directly learning spatial-temporal features as a whole and significantly outperforms other state-of-the-art unsupervised methods on the downstream action recognition benchmarks UCF101 and HMDB51. We will release our code and pretrained weights.
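
As a rough illustration of what a compound contrastive loss over decoupled subtasks and hierarchy levels might look like, the sketch below sums one spatial and one temporal InfoNCE term per level. It is not the authors' implementation: the abstract does not specify the loss form, so the InfoNCE choice, the unweighted sum, the temperature value, and the names `info_nce`, `compound_loss`, and the `levels` structure are all assumptions made for illustration. How positives and negatives are built from augmentations is also left out, since the abstract does not describe it.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single query (assumed loss form, not from the paper).

    Inputs are non-zero feature vectors given as lists of floats.
    Returns -log of the softmax probability assigned to the positive.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    # The positive's logit comes first, followed by one logit per negative.
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def compound_loss(levels, temperature=0.1):
    """Hypothetical compound loss: unweighted sum over levels and subtasks.

    `levels` is a list with one dict per hierarchy level. Each dict maps
    "spatial" and "temporal" to a (query, positive, negatives) triple.
    """
    total = 0.0
    for level in levels:
        for task in ("spatial", "temporal"):
            query, positive, negatives = level[task]
            total += info_nce(query, positive, negatives, temperature)
    return total


# Toy usage with two hierarchy levels and 2-D features.
levels = [
    {
        "spatial": ([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]]),
        "temporal": ([0.0, 1.0], [0.1, 0.9], [[1.0, 0.0]]),
    },
    {
        "spatial": ([1.0, 1.0], [1.0, 0.8], [[-1.0, 1.0]]),
        "temporal": ([1.0, -1.0], [0.8, -1.0], [[1.0, 1.0]]),
    },
]
loss = compound_loss(levels)
```

In this sketch each subtask contributes its own term, so a feature that separates temporal positives from negatives cannot compensate for a poorly separated spatial term, which is one plausible reading of "decoupling". A real system would likely weight the terms and compute them in batches on network features from several depths.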

