iBoot: Image-bootstrapped Self-Supervised Video Representation Learning

06/16/2022
by   Fatemeh Saleh, et al.
0

Learning visual representations through self-supervision is an extremely challenging task as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the amount of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on the video labeled data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to subsume the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e. in less epochs and with a smaller batch) and results in a new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.

READ FULL TEXT
research
06/17/2020

Self-Supervised Representation Learning for Visual Anomaly Detection

Self-supervised learning allows for better utilization of unlabelled dat...
research
11/25/2020

Can Temporal Information Help with Contrastive Self-Supervised Learning?

Leveraging temporal information has been regarded as essential for devel...
research
01/11/2022

Boosting Video Representation Learning with Multi-Faceted Integration

Video content is multifaceted, consisting of objects, scenes, interactio...
research
04/29/2021

MarioNette: Self-Supervised Sprite Learning

Visual content often contains recurring elements. Text is made up of gly...
research
09/03/2022

Equivariant Self-Supervision for Musical Tempo Estimation

Self-supervised methods have emerged as a promising avenue for represent...
research
09/20/2023

Weak Supervision for Label Efficient Visual Bug Detection

As video games evolve into expansive, detailed worlds, visual quality be...
research
04/06/2023

Self-Supervised Video Similarity Learning

We introduce S^2VS, a video similarity learning approach with self-super...

Please sign up or login with your details

Forgot password? Click here to reset