
Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment Alignment

by Quoc-Huy Tran, et al.

This paper presents a novel transformer-based framework for unsupervised activity segmentation that leverages both frame-level and segment-level cues, in contrast with previous methods, which often rely on frame-level information only. Our approach begins with a frame-level prediction module, which estimates framewise action classes via a transformer encoder and is trained in an unsupervised manner via temporal optimal transport. To exploit segment-level information, we introduce a segment-level prediction module and a frame-to-segment alignment module. The former includes a transformer decoder for estimating video transcripts, while the latter matches frame-level features with segment-level features, yielding permutation-aware segmentation results. Moreover, inspired by temporal optimal transport, we develop simple yet effective pseudo labels for unsupervised training of these two modules. Our experiments on four public datasets (50 Salads, YouTube Instructions, Breakfast, and Desktop Assembly) show that our approach performs comparably to or better than previous methods in unsupervised activity segmentation.
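The frame-to-segment alignment idea can be illustrated with a minimal sketch. The abstract does not give the actual matching rule, so the code below makes several assumptions: similarity is a plain dot product, each frame is assigned independently to its best-matching segment, and the function name `align_frames_to_segments` and all feature values are hypothetical. It is meant only to show why matching frames against a per-video transcript yields labels that follow that video's own action order, not to reproduce the paper's module.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def align_frames_to_segments(
    frame_feats: Sequence[Sequence[float]],
    segment_feats: Sequence[Sequence[float]],
    transcript: Sequence[int],
) -> List[int]:
    """Label each frame with the action of its most similar segment.

    Assumption: one feature vector per transcript entry, and an
    independent per-frame argmax (ties go to the earliest segment).
    """
    if len(segment_feats) != len(transcript):
        raise ValueError("need exactly one segment feature per transcript entry")
    labels = []
    for frame in frame_feats:
        scores = [dot(frame, seg) for seg in segment_feats]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(transcript[best])
    return labels


# Toy video whose transcript orders the actions as 2, 0, 1.
transcript = [2, 0, 1]
segment_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_feats = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
]
print(align_frames_to_segments(frame_feats, segment_feats, transcript))
# prints [2, 2, 0, 1]
```

Because the labels are read off the estimated transcript rather than a fixed global ordering, a second video with transcript `[0, 2, 1]` would receive a differently ordered segmentation from the same procedure. An independent argmax does not enforce temporal contiguity; a real system would likely add such a constraint.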



