Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering

by Sateesh Kumar, et al.

We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works, where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport and a temporal coherence loss. In particular, we incorporate a temporal regularization term into the standard optimal transport module, which preserves the temporal order of the activity, yielding the temporal optimal transport module for computing pseudo-label cluster assignments. Next, the temporal coherence loss encourages neighboring video frames to be mapped to nearby points in the embedding space, while distant video frames are mapped to points farther apart. The combination of these two components results in effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e., 50-Salads, YouTube Instructions, and Breakfast, and on our own dataset, i.e., Desktop Assembly, show that our approach performs on par with or better than previous methods for unsupervised activity segmentation, while requiring significantly less memory.
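To make the idea of temporal optimal transport more concrete, the following is a minimal NumPy sketch of how a temporal prior might be combined with Sinkhorn-style normalization to produce pseudo-label assignments. It is an illustration under stated assumptions, not the authors' implementation: the function names (`temporal_prior`, `temporal_sinkhorn`), the Gaussian-shaped prior around the diagonal, the uniform cluster marginals, and the parameters `rho`, `eps`, `sigma`, and `n_iters` are all hypothetical choices made for this example.

```python
import numpy as np


def temporal_prior(n_frames, n_clusters, sigma=0.3):
    """Hypothetical prior that is largest when a frame's relative position
    in the video matches a cluster's relative position in the ordering."""
    i = (np.arange(n_frames)[:, None] + 0.5) / n_frames      # frame position in (0, 1)
    j = (np.arange(n_clusters)[None, :] + 0.5) / n_clusters  # cluster position in (0, 1)
    return np.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))


def temporal_sinkhorn(scores, rho=0.1, eps=0.1, n_iters=50):
    """Soft frame-to-cluster assignments from (n_frames, n_clusters) scores.

    Assumptions of this sketch: the temporal prior is added to the scores
    with weight rho, and both marginals are uniform (every cluster receives
    the same total mass).
    """
    n, k = scores.shape
    logits = (scores + rho * temporal_prior(n, k)) / eps
    q = np.exp(logits - logits.max())          # subtract max for numerical stability
    for _ in range(n_iters):
        q *= (1.0 / k) / q.sum(axis=0, keepdims=True)  # match column (cluster) marginal
        q *= (1.0 / n) / q.sum(axis=1, keepdims=True)  # match row (frame) marginal
    return q / q.sum(axis=1, keepdims=True)    # each row becomes a distribution


# Usage: 12 frames, 3 clusters, scores that favor a fixed temporal order.
scores = np.zeros((12, 3))
scores[np.arange(12), np.arange(12) // 4] = 1.0
pseudo_labels = temporal_sinkhorn(scores).argmax(axis=1)
```

In this sketch, `rho` controls how strongly the temporal order influences the assignments and `eps` controls how soft they are; setting `rho=0` reduces it to plain entropy-regularized optimal transport with uniform marginals.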



Permutation-Aware Action Segmentation via Unsupervised Frame-to-Segment Alignment

This paper presents a novel transformer-based framework for unsupervised...

Learning by Aligning Videos in Time

We present a self-supervised approach for learning video representations...

Differentiable Deep Clustering with Cluster Size Constraints

Clustering is a fundamental unsupervised learning approach. Many cluster...

Timestamp-Supervised Action Segmentation with Graph Convolutional Networks

We introduce a novel approach for temporal activity segmentation with ti...

InstanceFormer: An Online Video Instance Segmentation Framework

Recent transformer-based offline video instance segmentation (VIS) appro...

Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences

Understanding the structure of complex activities in videos is one of th...

Few-Shot Action Recognition with Compromised Metric via Optimal Transport

Although vital to computer vision systems, few-shot action recognition i...