Learning by Aligning 2D Skeleton Sequences in Time

by   Quoc-Huy Tran, et al.

This paper presents a novel self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding applications. In contrast with the state-of-the-art method of CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input. Given 2D skeleton heatmaps, we utilize a video transformer which performs self-attention in the spatial and temporal domains for extracting effective spatiotemporal and contextual features. In addition, we introduce simple heatmap augmentation techniques based on 2D skeletons for self-supervised learning. Despite the lack of 3D information, our approach achieves not only higher accuracy but also better robustness against missing and noisy keypoints than CASA. Extensive evaluations on three public datasets, i.e., Penn Action, IKEA ASM, and H2O, demonstrate that our approach outperforms previous methods in different fine-grained human activity understanding tasks, i.e., phase classification, phase progression, video alignment, and fine-grained frame retrieval.


page 1

page 4

page 10

page 11

page 12


Action Segmentation Using 2D Skeleton Heatmaps

This paper presents a 2D skeleton-based action segmentation method with ...

Context-Aware Sequence Alignment using 4D Skeletal Augmentation

Temporal alignment of fine-grained human actions in videos is important ...

Unsupervised Video Understanding by Reconciliation of Posture Similarities

Understanding human activity and being able to explain it in detail surp...

Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment Alignment

This paper presents a novel transformer-based framework for unsupervised...

Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment

The egocentric and exocentric viewpoints of a human activity look dramat...

SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition

Contrastive learning has achieved great success in skeleton-based action...

Spatial Mixture-of-Experts

Many data have an underlying dependence on spatial location; it may be w...

Please sign up or login with your details

Forgot password? Click here to reset