Back to the Future: Cycle Encoding Prediction for Self-supervised Contrastive Video Representation Learning

10/14/2020
by Xinyu Yang, et al.

In this paper we show that learning video feature spaces in which temporal cycles are maximally predictable benefits action classification. In particular, we propose a novel learning approach termed Cycle Encoding Prediction (CEP) that effectively represents the high-level spatio-temporal structure of unlabelled video content. CEP builds a latent space in which closed forward-backward and backward-forward temporal loops are approximately preserved. As a self-supervision signal, CEP leverages the bi-directional temporal coherence of the video stream and applies loss functions that encourage both temporal cycle closure and contrastive feature separation. Architecturally, the underpinning network structure utilises a single feature encoder for all video snippets and adds two predictive modules that learn temporal forward and backward transitions. We apply our framework to the pretext training of networks for action recognition tasks. We report significantly improved results on the standard datasets UCF101 and HMDB51. Detailed ablation studies support the effectiveness of the proposed components. We publish source code for the CEP components in full with this paper.
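The two loss ideas named in the abstract, temporal cycle closure and contrastive feature separation, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the squared-distance cycle term, the InfoNCE-style contrastive term, the use of plain linear maps in place of the learned forward and backward predictive modules, and all function names below are assumptions made for illustration only.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cycle_closure_loss(z, forward_w, backward_w):
    """Penalise latent loops that fail to return to their starting feature z.

    forward_w and backward_w are hypothetical linear stand-ins for the
    forward and backward predictive modules (assumption).
    """
    # Forward-backward loop: step forward in time, then back again.
    fb = matvec(backward_w, matvec(forward_w, z))
    # Backward-forward loop: step backward in time, then forward again.
    bf = matvec(forward_w, matvec(backward_w, z))
    return sq_dist(fb, z) + sq_dist(bf, z)


def info_nce(query, candidates, positive_index, temperature=1.0):
    """InfoNCE-style loss: low when the query is most similar to its positive."""
    logits = [
        sum(q * c for q, c in zip(query, cand)) / temperature
        for cand in candidates
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[positive_index]


# Toy usage: a 90-degree rotation and its inverse close both loops exactly.
rotate = [[0.0, -1.0], [1.0, 0.0]]
rotate_back = [[0.0, 1.0], [-1.0, 0.0]]
closed = cycle_closure_loss([1.0, 2.0], rotate, rotate_back)

# Replacing the backward step with the identity leaves both loops open.
identity = [[1.0, 0.0], [0.0, 1.0]]
open_loop = cycle_closure_loss([1.0, 2.0], rotate, identity)
```

In this toy setting the cycle term vanishes when the backward map inverts the forward map and grows when it does not, while the contrastive term is smaller when the query aligns with its designated positive than with another candidate. In a trained system both predictors and the shared encoder would be neural networks optimised jointly.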

research
03/05/2020

Self-Supervised Spatio-Temporal Representation Learning Using Variable Playback Speed Prediction

We propose a self-supervised learning method by predicting the variable ...
research
01/07/2021

Learning Temporal Dynamics from Cycles in Narrated Video

Learning to model how the world changes as time elapses has proven a cha...
research
10/28/2020

Cycle-Contrast for Self-Supervised Video Representation Learning

We present Cycle-Contrastive Learning (CCL), a novel self-supervised met...
research
11/11/2020

Unsupervised Video Representation Learning by Bidirectional Feature Prediction

This paper introduces a novel method for self-supervised video represent...
research
11/14/2017

Prediction Under Uncertainty with Error-Encoding Networks

In this work we introduce a new framework for performing temporal predic...
research
07/20/2023

Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Recent contrastive language image pre-training has led to learning highl...