Video Representation Learning by Dense Predictive Coding

09/10/2019
by   Tengda Han, et al.

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions. First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. DPC learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations. Second, we propose a curriculum training scheme that predicts further into the future with progressively less temporal context. This encourages the model to encode only slowly varying spatio-temporal signals, which leads to semantic representations. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.
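The idea of recurrently predicting future representations, and of a curriculum that shifts from more context to longer prediction horizons, can be illustrated with a toy sketch. This is not the authors' implementation: the tanh recurrent update, the weight matrices `W_agg` and `W_pred`, the dot-product contrastive loss, and the `curriculum` helper are all hypothetical choices made for illustration, since the abstract does not specify the architecture or the training loss.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def step(W_agg, context, z):
    """Assumed recurrent update: tanh of a linear map of [context, z]."""
    return [math.tanh(v) for v in matvec(W_agg, context + z)]


def predict_future(blocks, n_context, n_predict, W_agg, W_pred):
    """Aggregate n_context block embeddings, then predict n_predict future ones.

    Each prediction is fed back into the context, so later predictions
    depend on earlier ones (recurrent prediction).
    """
    context = [0.0] * len(blocks[0])
    for z in blocks[:n_context]:
        context = step(W_agg, context, z)
    predictions = []
    for _ in range(n_predict):
        z_hat = matvec(W_pred, context)
        predictions.append(z_hat)
        context = step(W_agg, context, z_hat)
    return predictions


def contrastive_loss(predictions, targets):
    """Assumed loss: each prediction must pick its own target among all targets."""
    total = 0.0
    for i, p in enumerate(predictions):
        scores = [sum(a * b for a, b in zip(p, t)) for t in targets]
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(predictions)


def curriculum(total_blocks, predict_steps):
    """Hypothetical schedule: (n_context, n_predict) pairs with shrinking context."""
    return [(total_blocks - k, k) for k in predict_steps]


# Toy usage with random stand-ins for encoder outputs (8 blocks, dimension 4).
rng = random.Random(0)
dim = 4
blocks = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
W_agg = [[rng.gauss(0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
W_pred = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

for n_context, n_predict in curriculum(8, [3, 4]):
    preds = predict_future(blocks, n_context, n_predict, W_agg, W_pred)
    targets = blocks[n_context:n_context + n_predict]
    print(n_context, n_predict, contrastive_loss(preds, targets))
```

In this sketch, each curriculum stage gives the model fewer observed blocks and asks for more predicted ones. A real system would replace the random embeddings with the output of a learned video encoder and train the weights by gradient descent.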


research
06/16/2018

Two Stream Self-Supervised Learning for Action Recognition

We present a self-supervised approach using spatio-temporal signals betw...
research
01/02/2020

Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

We propose a novel self-supervised method, referred to as Video Cloze Pr...
research
06/09/2021

Pretrained Encoders are All You Need

Data-efficiency and generalization are key challenges in deep learning a...
research
04/10/2022

SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition

Learning an egocentric action recognition model from video data is chall...
research
08/05/2020

Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition

Self-supervised learning has shown great potentials in improving the dee...
research
11/30/2022

Spatio-Temporal Crop Aggregation for Video Representation Learning

We propose Spatio-temporal Crop Aggregation for video representation LEa...
research
08/23/2023

MOFO: MOtion FOcused Self-Supervision for Video Understanding

Self-supervised learning (SSL) techniques have recently produced outstan...