Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

11/24/2018
by   Dahun Kim, et al.

Self-supervised tasks such as colorization, inpainting, and jigsaw puzzles have been used for visual representation learning on still images when labeled images are scarce or entirely absent. Recently, this line of work has been extended to the video domain, where human labeling is even more expensive. However, most existing methods are still based on 2D CNN architectures, which cannot directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles to train 3D CNNs on a large-scale video dataset. The task requires a network to arrange permuted 3D spatio-temporal crops. By solving Space-Time Cubic Puzzles, the network learns both the spatial appearance and the temporal relations of video frames, which is our final goal. In experiments, we demonstrate that our learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.
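The pretext task of arranging permuted 3D spatio-temporal crops can be illustrated with a small sketch. The abstract does not specify the grid size, the number of crops, or how the target is encoded, so everything below is an assumption made for illustration: the video is a nested list indexed as `video[t][y][x]`, it is split on a hypothetical `(2, 2, 1)` grid into four cuboids, and the classification label is the index of the applied permutation among all 24 orderings. The function names (`split_into_cuboids`, `make_puzzle`, `solve_with_label`) are likewise hypothetical and are not taken from the paper.

```python
import itertools
import random


def split_into_cuboids(video, grid):
    """Split video[t][y][x] into grid[0] * grid[1] * grid[2] equal cuboids."""
    gt, gy, gx = grid
    num_frames, height, width = len(video), len(video[0]), len(video[0][0])
    dt, dy, dx = num_frames // gt, height // gy, width // gx
    cuboids = []
    for it, iy, ix in itertools.product(range(gt), range(gy), range(gx)):
        # Slice a block of frames, then rows, then columns.
        cuboid = [
            [row[ix * dx:(ix + 1) * dx] for row in frame[iy * dy:(iy + 1) * dy]]
            for frame in video[it * dt:(it + 1) * dt]
        ]
        cuboids.append(cuboid)
    return cuboids


def make_puzzle(video, grid=(2, 2, 1), rng=random):
    """Return (shuffled cuboids, permutation label, permutation).

    Assumption: the label is the index of the permutation in the
    lexicographically ordered list of all permutations.
    """
    cuboids = split_into_cuboids(video, grid)
    permutations = list(itertools.permutations(range(len(cuboids))))
    label = rng.randrange(len(permutations))
    order = permutations[label]
    # Position p of the puzzle holds the original cuboid order[p].
    shuffled = [cuboids[i] for i in order]
    return shuffled, label, order


def solve_with_label(shuffled, label):
    """Undo the shuffle given the label a network would have to predict."""
    order = list(itertools.permutations(range(len(shuffled))))[label]
    restored = [None] * len(shuffled)
    for position, source_index in enumerate(order):
        restored[source_index] = shuffled[position]
    return restored
```

In this sketch a 3D CNN would take the shuffled cuboids as input and be trained to classify `label`; `solve_with_label` only shows that a correct prediction is enough to recover the original arrangement. With four cuboids the output space has 24 classes, whereas a full 2x2x2 grid would already give 40,320 orderings, which is one reason to keep the number of crops small in a setup like this.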


