Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

by   Zehua Zhang, et al.

We present a novel way for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we first introduce spatial-temporal feature learning decoupling and hierarchical learning to the context of unsupervised video learning. In particular, our method directs the network to separately capture spatial and temporal features on the basis of contrastive learning via manipulating augmentations as regularization, and further solve such proxy tasks hierarchically by optimizing towards a compound contrastive loss. Experiments show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) achieves substantial gains over directly learning spatial-temporal features as a whole and significantly outperforms other state-of-the-art unsupervised methods on downstream action recognition benchmarks on UCF101 and HMDB51. We will release our code and pretrained weights.



There are no comments yet.


page 1

page 5

page 8


Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

We propose a novel self-supervised method, referred to as Video Cloze Pr...

Composable Augmentation Encoding for Video Representation Learning

We focus on contrastive methods for self-supervised video representation...

Skeleton-Contrastive 3D Action Representation Learning

This paper strives for self-supervised learning of a feature space suita...

Spatial-temporal Multi-Task Learning for Within-field Cotton Yield Prediction

Understanding and accurately predicting within-field spatial variability...

Can Temporal Information Help with Contrastive Self-Supervised Learning?

Leveraging temporal information has been regarded as essential for devel...

Motion-Focused Contrastive Learning of Video Representations

Motion, as the most distinct phenomenon in a video to involve the change...

Video-ception Network: Towards Multi-Scale Efficient Asymmetric Spatial-Temporal Interactions

Previous video modeling methods leverage the cubic 3D convolution filter...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.