SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos

Self-supervised methods have significantly closed the gap with end-to-end supervised learning for image classification. For human action videos, however, where both appearance and motion are significant factors of variation, this gap remains large. One key reason is that sampling pairs of similar video clips, a required step for many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives. A typical assumption is that similar clips occur only temporally close together within a single video, which yields too few examples of motion similarity. To mitigate this, we propose SLIC, a clustering-based self-supervised contrastive learning method for human action videos. Our key contribution is to improve upon traditional intra-video positive sampling by using iterative clustering to group similar video instances. This enables our method to leverage pseudo-labels from the cluster assignments to sample harder positives and negatives. SLIC outperforms state-of-the-art video retrieval baselines by +15.4 [...] directly transferred to HMDB51. With end-to-end finetuning for action classification, SLIC achieves 83.2 [...] on HMDB51 (+1.6) [...] classification after self-supervised pretraining on Kinetics400.
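The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means with a deterministic farthest-point initialization is swapped in as an assumption, and the function names `kmeans` and `sample_triplet` and the 2-D toy embeddings are hypothetical. The sketch shows a single round: clip embeddings are clustered, the cluster indices serve as pseudo-labels, and each anchor draws a positive from its own cluster and a negative from a different one. In an iterative scheme the embeddings would presumably be recomputed by the updated encoder and re-clustered; that loop is omitted here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    Returns one cluster index per point; these act as pseudo-labels.
    """
    # Deterministic farthest-point initialization: start from the first point,
    # then repeatedly add the point farthest from all chosen centroids.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def sample_triplet(anchor, labels, rng):
    """Return (anchor, positive, negative) indices based on pseudo-labels."""
    same = [i for i, lab in enumerate(labels)
            if lab == labels[anchor] and i != anchor]
    other = [i for i, lab in enumerate(labels) if lab != labels[anchor]]
    if not same or not other:
        return None  # singleton cluster or a single cluster: no valid triplet
    return anchor, rng.choice(same), rng.choice(other)


# Toy "clip embeddings" (hypothetical data): two well-separated groups.
embeddings = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
              (5.0, 5.1), (5.1, 5.0), (4.9, 5.0)]
labels = kmeans(embeddings, k=2)
rng = random.Random(1)
triplets = [sample_triplet(i, labels, rng) for i in range(len(embeddings))]
```

Unlike intra-video sampling, the positive here may come from a different video than the anchor, which is what would allow clips with similar motion but different appearance to be paired. The quality of such pairs depends entirely on the quality of the cluster assignments.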


