Self-supervised video pretraining yields strong image representations

10/12/2022
by Nikhil Parthasarathy, et al.

Videos contain far more information than still images and hold the potential for learning rich representations of the visual world. Yet pretraining on image datasets has remained the dominant paradigm for learning representations that capture spatial information, and previous attempts at video pretraining have fallen short on image understanding tasks. In this work we revisit self-supervised learning of image representations from the dynamic evolution of video frames. To that end, we propose a dataset curation procedure that addresses the domain mismatch between video and image datasets, and develop a contrastive learning framework that handles the complex transformations present in natural videos. This simple paradigm for distilling knowledge from videos to image representations, called VITO, performs surprisingly well on a variety of image-based transfer learning tasks. For the first time, our video-pretrained model closes the gap with ImageNet pretraining on semantic segmentation on PASCAL and ADE20K and object detection on COCO and LVIS, suggesting that video pretraining could become the new default for learning image representations.
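As a rough illustration of what contrastive learning across video frames can look like, the sketch below computes a generic InfoNCE-style loss in which two frames from the same video form a positive pair and frames from other videos act as negatives. This is not the VITO objective, whose details the abstract does not give; the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of frame-pair embeddings.

    anchors[i] and positives[i] are assumed to embed two different frames
    of video i; positives[j] for j != i serve as negatives for anchors[i].
    The temperature of 0.1 is an illustrative choice, not a value from the paper.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the matching frame as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: two videos, two frames each (hypothetical values).
frames_t0 = [[1.0, 0.0], [0.0, 1.0]]
frames_t1 = [[0.9, 0.1], [0.1, 0.9]]
loss_matched = info_nce(frames_t0, frames_t1)
loss_mismatched = info_nce(frames_t0, list(reversed(frames_t1)))
```

In this toy setup the loss is lower when each anchor is paired with a frame from its own video than when the pairs are swapped, which is the signal a contrastive objective of this kind uses to pull temporally related frames together.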


