Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

03/28/2022
by   Minghao Chen, et al.

Prior work on action representation learning has mainly focused on designing architectures that extract global representations of short video clips. In contrast, many practical applications, such as video alignment, require dense, frame-level representations of long videos. In this paper, we introduce a novel contrastive action representation learning (CARL) framework to learn frame-wise action representations, especially for long videos, in a self-supervised manner. Concretely, we introduce a simple yet efficient video encoder that incorporates spatio-temporal context to extract frame-wise representations. Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views obtained through a series of spatio-temporal data augmentations. SCL optimizes the embedding space by minimizing the KL-divergence between the sequence similarity of the two augmented views and a prior Gaussian distribution over timestamp distance. Experiments on the FineGym, PennAction, and Pouring datasets show that our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification. Surprisingly, although trained without paired videos, our approach also performs strongly on video alignment and fine-grained frame retrieval. Code and models are available at https://github.com/minghchen/CARL_code.
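To make the loss concrete, here is a minimal PyTorch sketch of an SCL-style objective as described in the abstract: the predicted similarity distribution between frame embeddings of the two augmented views is pulled toward a Gaussian prior over timestamp distance via a KL term. The function name, tensor shapes, and the sigma/tau defaults are illustrative assumptions, not the paper's exact implementation (see the linked repository for that).

```python
# Sketch of a sequence contrastive loss (SCL) as summarized above.
# All names, shapes, and hyperparameter values are illustrative assumptions.
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, t1, t2, sigma=1.0, tau=0.1):
    """z1: (N, D) frame embeddings of augmented view 1.
    z2: (M, D) frame embeddings of augmented view 2.
    t1: (N,) timestamps of view-1 frames in the source video.
    t2: (M,) timestamps of view-2 frames in the source video."""
    # Cosine similarity between every view-1 / view-2 frame pair.
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = (z1 @ z2.t()) / tau                      # (N, M)

    # Prior: Gaussian over timestamp distance, normalized per view-1 frame.
    dist2 = (t1[:, None] - t2[None, :]) ** 2          # (N, M)
    prior = torch.softmax(-dist2 / (2 * sigma ** 2), dim=-1)

    # KL(prior || predicted similarity distribution), averaged over frames.
    log_pred = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_pred, prior, reduction="batchmean")
```

Note that the prior places most of its mass on frame pairs that are close in time, so nearby frames across the two views are treated as soft positives rather than a single hard positive per frame.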

Related research

12/06/2022 · Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations
Previous work on action representation learning focused on global repres...

03/31/2021 · Learning by Aligning Videos in Time
We present a self-supervised approach for learning video representations...

06/08/2023 · Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment
The egocentric and exocentric viewpoints of a human activity look dramat...

05/06/2023 · Transform-Equivariant Consistency Learning for Temporal Sentence Grounding
This paper addresses the temporal sentence grounding (TSG). Although exi...

11/22/2021 · Towards Tokenized Human Dynamics Representation
For human action understanding, a popular research direction is to analy...

03/22/2023 · Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos
Sequential video understanding, as an emerging video understanding task,...

09/15/2021 · SupCL-Seq: Supervised Contrastive Learning for Downstream Optimized Sequence Representations
While contrastive learning is proven to be an effective training strateg...
