Learning by Aligning Videos in Time

03/31/2021
by   Sanjay Haresh, et al.
1

We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network. Specifically, the temporal alignment loss (i.e., Soft-DTW) aims for the minimum cost for temporally aligning videos in the embedding space. However, optimizing solely for this term leads to trivial solutions, particularly, one where all frames get mapped to a small cluster in the embedding space. To overcome this problem, we propose a temporal regularization term (i.e., Contrastive-IDM) which encourages different frames to be mapped to different points in the embedding space. Extensive evaluations on various tasks, including action phase classification, action phase progression, and fine-grained frame retrieval, on three datasets, namely Pouring, Penn Action, and IKEA ASM, show superior performance of our approach over state-of-the-art methods for self-supervised representation learning from videos. In addition, our method provides significant performance gain where labeled data is lacking.

READ FULL TEXT

page 6

page 8

page 9

page 14

research
03/28/2022

Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning

Prior works on action representation learning mainly focus on designing ...
research
12/06/2022

Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations

Previous work on action representation learning focused on global repres...
research
05/27/2021

Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering

We present a novel approach for unsupervised activity segmentation, whic...
research
04/26/2022

Context-Aware Sequence Alignment using 4D Skeletal Augmentation

Temporal alignment of fine-grained human actions in videos is important ...
research
11/17/2021

Learning to Align Sequential Actions in the Wild

State-of-the-art methods for self-supervised sequential action alignment...
research
08/04/2019

Fully Automatic Video Colorization with Self-Regularization and Diversity

We present a fully automatic approach to video colorization with self-re...
research
06/08/2023

Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

The egocentric and exocentric viewpoints of a human activity look dramat...

Please sign up or login with your details

Forgot password? Click here to reset