Representation Learning with Video Deep InfoMax

07/27/2020
by R Devon Hjelm, et al.

Self-supervised learning has made unsupervised pretraining relevant again for difficult computer vision tasks. The most effective self-supervised methods involve prediction tasks based on features extracted from diverse views of the data. DeepInfoMax (DIM) is a self-supervised method that leverages the internal structure of deep networks to construct such views, forming prediction tasks between local features, which depend on small patches in an image, and global features, which depend on the whole image. In this paper, we extend DIM to the video domain by leveraging similar structure in spatio-temporal networks, producing a method we call Video Deep InfoMax (VDIM). We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields results on Kinetics-pretrained action recognition tasks that match or outperform prior state-of-the-art methods that use more costly large-time-scale transformer models. We also examine the effects of data augmentation and fine-tuning methods, achieving state-of-the-art (SoTA) results by a large margin when training only on the UCF-101 dataset.
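
The local-to-global prediction task described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `temporal_views` and `local_global_infonce`, the clip length, the stride, and the choice of an InfoNCE-style contrastive loss with dot-product scores are all assumptions made for illustration, and the spatio-temporal encoder that would produce the features is omitted entirely.

```python
import numpy as np


def temporal_views(video, clip_len=8, stride=2):
    """Return two views of a video with shape (T, ...).

    Hypothetical helper: one clip at the natural frame rate and one
    temporally downsampled clip that keeps every `stride`-th frame.
    """
    natural = video[:clip_len]
    downsampled = video[::stride][:clip_len]
    return natural, downsampled


def local_global_infonce(local_feats, global_feats):
    """InfoNCE-style loss between local and global features (assumed form).

    local_feats:  (B, N, D) array, N local features per sample.
    global_feats: (B, D) array, one global feature per sample.
    Each local feature must pick out the global feature of its own
    sample among the B global features in the batch.
    """
    batch = local_feats.shape[0]
    # scores[b, n, k] = <local feature n of sample b, global feature of sample k>
    scores = np.einsum("bnd,kd->bnk", local_feats, global_feats)
    # Numerically stable log-softmax over the candidate global features.
    scores = scores - scores.max(axis=-1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=-1, keepdims=True))
    idx = np.arange(batch)
    positives = log_probs[idx, :, idx]  # shape (B, N): matching-sample terms
    return -positives.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(32, 16, 16, 3))  # stand-in for decoded frames
    natural, downsampled = temporal_views(video)
    # Random stand-ins for encoder outputs computed from the two views.
    local_feats = rng.normal(size=(4, 6, 32))
    global_feats = rng.normal(size=(4, 32))
    print(natural.shape, downsampled.shape)
    print(local_global_infonce(local_feats, global_feats))
```

In a full pipeline under these assumptions, the local features would come from intermediate layers of the encoder applied to one view and the global features from its final layer applied to another view, so that minimizing the loss encourages features from different views of the same video to agree.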


