Video Representation Learning with Visual Tempo Consistency

06/28/2020
by Ceyuan Yang, et al.

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning. We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning (VTHCL). Specifically, by sampling the same instance at slow and fast frame rates, we obtain slow and fast video frames that share the same semantics but differ in visual tempo. Video representations learned with VTHCL achieve competitive performance under the self-supervision evaluation protocol for action recognition on UCF-101 (82.1%) and HMDB-51 (49.2%). Moreover, we show that the learned representations also generalize well to other downstream tasks, including action detection on AVA and action anticipation on Epic-Kitchen. Finally, our empirical analysis suggests that a more thorough evaluation protocol is needed to verify the effectiveness of self-supervised video representations across network structures and downstream tasks.
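
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that a "slow" frame rate corresponds to a larger sampling stride and a "fast" frame rate to a smaller one, and it uses the InfoNCE objective as a stand-in for the contrastive loss, since the abstract does not name the exact formulation. The function names (`sample_frame_indices`, `info_nce`, `hierarchical_loss`), the temperature value, and the plain averaging over feature levels are all hypothetical choices made for illustration.

```python
import math


def sample_frame_indices(num_frames, clip_len, stride, start=0):
    """Pick clip_len frame indices spaced `stride` apart, clamped to the last frame."""
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _l2_normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def info_nce(slow_feats, fast_feats, temperature=0.07):
    """InfoNCE loss over a batch of feature vectors.

    The i-th slow and i-th fast features come from the same video (positive
    pair); every other fast feature in the batch acts as a negative.
    """
    slow = [_l2_normalize(v) for v in slow_feats]
    fast = [_l2_normalize(v) for v in fast_feats]
    total = 0.0
    for i, s in enumerate(slow):
        logits = [_dot(s, f) / temperature for f in fast]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(slow)


def hierarchical_loss(levels, temperature=0.07):
    """Average the contrastive loss over several feature levels.

    `levels` is a list of (slow_feats, fast_feats) pairs, one per network
    depth. Plain averaging is an assumption made for this sketch.
    """
    return sum(info_nce(s, f, temperature) for s, f in levels) / len(levels)


# Two clips of the same 64-frame video at different tempos (strides assumed).
slow_idx = sample_frame_indices(num_frames=64, clip_len=8, stride=8)
fast_idx = sample_frame_indices(num_frames=64, clip_len=8, stride=2)
```

In practice the feature vectors would come from a video encoder applied to the two clips. Minimizing a loss of this form pulls together the slow and fast representations of the same video while pushing apart those of different videos, which is one common way to maximize a lower bound on their mutual information.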


