Long Short View Feature Decomposition via Contrastive Video Representation Learning

09/23/2021
by   Nadine Behrmann, et al.

Self-supervised video representation methods typically focus on the representation of temporal attributes in videos. However, the role of stationary versus non-stationary attributes is less explored. Stationary features, which remain similar throughout the video, enable the prediction of video-level action classes. Non-stationary features, which represent temporally varying attributes, are more beneficial for downstream tasks involving more fine-grained temporal understanding, such as action segmentation. We argue that a single representation to capture both types of features is sub-optimal, and propose to decompose the representation space into stationary and non-stationary features via contrastive learning from long and short views, i.e., long video sequences and their shorter sub-sequences. Stationary features are shared between the short and long views, while non-stationary features aggregate the short views to match the corresponding long view. To empirically verify our approach, we demonstrate that our stationary features work particularly well on an action recognition downstream task, while our non-stationary features perform better on action segmentation. Furthermore, we analyse the learned representations and find that stationary features capture more temporally stable, static attributes, while non-stationary features encompass more temporally varying ones.
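
The abstract describes the two matching rules only in words, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes an InfoNCE-style contrastive loss, a feature vector whose first `d_stat` entries are treated as stationary and the rest as non-stationary, and a plain sum as the aggregation over short views. The function names (`info_nce`, `long_short_losses`) and all parameters are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss: queries[i] and keys[i] are positives, other keys are negatives."""
    qs = [normalize(q) for q in queries]
    ks = [normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(qs):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in ks]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(qs)


def split(z, d_stat):
    """Assumed decomposition: first d_stat entries stationary, rest non-stationary."""
    return z[:d_stat], z[d_stat:]


def long_short_losses(z_long, z_short, d_stat, temperature=0.1):
    """z_long: B long-view vectors; z_short: B lists of N short-view vectors."""
    stat_long, nonstat_long = zip(*(split(z, d_stat) for z in z_long))
    stat_short, nonstat_agg = [], []
    for views in z_short:
        parts = [split(v, d_stat) for v in views]
        # Stationary part: a single short view should already match the long view.
        stat_short.append(parts[0][0])
        # Non-stationary part: aggregate (here: sum) all short views first.
        nonstat_agg.append([sum(col) for col in zip(*(p[1] for p in parts))])
    loss_stat = info_nce(stat_short, stat_long, temperature)
    loss_nonstat = info_nce(nonstat_agg, nonstat_long, temperature)
    return loss_stat, loss_nonstat
```

In this sketch the two losses act on disjoint parts of the feature vector, which is one simple way to keep the stationary and non-stationary sub-spaces separate. In practice the vectors would come from a video encoder applied to a long clip and its sub-clips.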


research
04/01/2021

Composable Augmentation Encoding for Video Representation Learning

We focus on contrastive methods for self-supervised video representation...
research
04/08/2022

Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning

Recent self-supervised video representation learning methods focus on ma...
research
02/27/2018

Real-World Repetition Estimation by Div, Grad and Curl

We consider the problem of estimating repetition in video, such as perfo...
research
01/17/2017

Intrinsically Motivated Acquisition of Modular Slow Features for Humanoids in Continuous and Non-Stationary Environments

A compact information-rich representation of the environment, also calle...
research
07/27/2020

Representation Learning with Video Deep InfoMax

Self-supervised learning has made unsupervised pretraining relevant agai...
research
11/22/2021

Towards Tokenized Human Dynamics Representation

For human action understanding, a popular research direction is to analy...
research
06/07/2016

Hand Action Detection from Ego-centric Depth Sequences with Error-correcting Hough Transform

Detecting hand actions from ego-centric depth sequences is a practically...
