PreViTS: Contrastive Pretraining with Video Tracking Supervision

12/01/2021
by Brian Chen, et al.

Videos are a rich source for self-supervised learning (SSL) of visual representations due to the presence of natural temporal transformations of objects. However, current methods typically randomly sample video clips for learning, which results in a poor supervisory signal. In this work, we propose PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects. PreViTS further uses the tracking signal to spatially constrain the frame regions to learn from and trains the model to locate meaningful objects by providing supervision on Grad-CAM attention maps. To evaluate our approach, we train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS. Training with PreViTS outperforms representations learnt by MoCo alone on both image recognition and video classification downstream tasks, obtaining state-of-the-art performance on action classification. PreViTS helps learn feature representations that are more robust to changes in background and context, as seen by experiments on image and video datasets with background changes. Learning from large-scale uncurated videos with PreViTS could lead to more accurate and robust visual feature representations.
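The core ideas in the abstract, sampling two clips that the tracking signal says contain the same object, spatially constraining each view to the tracked region, and scoring the pair with a MoCo-style contrastive objective, can be illustrated with a minimal sketch. The `crop_to_track` helper, its box format, and the toy feature vectors are illustrative assumptions, not the paper's implementation; the InfoNCE loss itself is the standard objective used by MoCo.

```python
import numpy as np

def crop_to_track(frame, box):
    """Spatially constrain a view to the tracked-object region.
    `box` is assumed to be (y0, y1, x0, x1) in pixel coordinates;
    PreViTS derives such regions from an unsupervised tracker."""
    y0, y1, x0, x1 = box
    return frame[y0:y1, x0:x1]

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE loss over L2-normalised features, as in MoCo.
    q and k_pos come from two clips containing the same tracked
    object; k_negs is the queue of negative features."""
    logits = np.concatenate([[q @ k_pos], k_negs @ q]) / tau
    logits -= logits.max()  # numerical stability before softmax
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[0])    # cross-entropy against the positive

# Toy usage: a well-matched positive pair yields a lower loss than
# a mismatched one, which is the supervisory signal clip selection
# by tracking is meant to improve.
q = np.array([1.0, 0.0, 0.0, 0.0])
negs = np.stack([np.array([0.0, 1.0, 0.0, 0.0]),
                 np.array([0.0, 0.0, 1.0, 0.0])])
loss_good = info_nce(q, q, negs)
loss_bad = info_nce(q, np.array([0.0, 0.0, 0.0, 1.0]), negs)
```

Here `loss_good < loss_bad`, reflecting that clips sharing the same object give a cleaner positive pair than randomly sampled clips.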


