Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations

12/06/2022
by Minghao Chen, et al.

Previous work on action representation learning has focused on global representations of short video clips. In contrast, many practical applications, such as video alignment, require dense, frame-level representations of long videos. In this paper, we introduce a new framework, contrastive action representation learning (CARL), to learn frame-wise action representations in a self-supervised or weakly supervised manner, especially for long videos. Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context by combining convolution and transformer. Inspired by recent progress in self-supervised learning, we propose a new sequence contrast loss (SCL), applied to two related views obtained through a series of spatio-temporal data augmentations, in two versions. One is the self-supervised version, which optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of two augmented views and a prior Gaussian distribution of timestamp distance. The other is the weakly supervised version, which uses video-level labels to build more sample pairs among videos by dynamic time warping (DTW). Experiments on the FineGym, PennAction, and Pouring datasets show that our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification while also offering faster inference. Surprisingly, despite not being trained on paired videos as in previous works, our self-supervised version also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
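The self-supervised version of the loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name, the temperature `tau`, the Gaussian width `sigma`, the one-directional form (view 1 to view 2), and the use of cosine similarity are all assumptions made for illustration; the abstract only states that a KL-divergence is minimized between the sequence similarity of two views and a Gaussian prior over timestamp distance.

```python
import numpy as np

def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """Illustrative KL(prior || predicted) between two views of one video.

    z1: (T1, D) frame embeddings of view 1; z2: (T2, D) for view 2.
    t1: (T1,) and t2: (T2,) are the frames' timestamps in the source video.
    sigma and tau are hypothetical hyperparameters, not values from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # Predicted distribution: log-softmax of similarities over view-2 frames.
    logits = z1 @ z2.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_pred = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Prior distribution: Gaussian of timestamp distance, normalised per row.
    dist_sq = (t1[:, None] - t2[None, :]) ** 2
    prior = np.exp(-dist_sq / (2.0 * sigma ** 2))
    prior = prior / prior.sum(axis=1, keepdims=True)

    # KL divergence per view-1 frame, averaged over the sequence.
    kl = (prior * (np.log(prior + 1e-12) - log_pred)).sum(axis=1)
    return kl.mean()

# Usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
view1 = rng.normal(size=(8, 16))
view2 = rng.normal(size=(8, 16))
loss = sequence_contrastive_loss(view1, view2, np.arange(8.0), np.arange(8.0) + 0.5)
```

The Gaussian prior makes frames that are close in time soft positives rather than hard negatives, which is the property the abstract attributes to SCL.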

research
03/28/2022

Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Prior works on action representation learning mainly focus on designing ...
research
02/08/2023

Weakly-supervised Representation Learning for Video Alignment and Analysis

Many tasks in video analysis and understanding boil down to the need for...
research
03/22/2023

Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Sequential video understanding, as an emerging video understanding task,...
research
04/13/2023

Video alignment using unsupervised learning of local and global features

In this paper, we tackle the problem of video alignment, the process of ...
research
03/31/2021

Learning by Aligning Videos in Time

We present a self-supervised approach for learning video representations...
research
06/08/2023

Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

The egocentric and exocentric viewpoints of a human activity look dramat...
research
11/22/2021

Towards Tokenized Human Dynamics Representation

For human action understanding, a popular research direction is to analy...
