DINOv2: Learning Robust Visual Features without Supervision

04/14/2023
by Maxime Oquab, et al.

The recent breakthroughs in natural language processing from pretraining models on large quantities of data have opened the way for similar foundation models in computer vision. Such models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset, instead of the uncurated data typically used in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
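The distillation step mentioned above, in which a large teacher's outputs supervise a smaller student, can be illustrated with a minimal sketch. This is not the DINOv2 training code (which distills self-supervised features); it is a simplified logit-level example with an illustrative temperature, showing the KL-divergence objective that the student minimizes to match the teacher's softened distribution.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits; higher
    # temperatures produce softer, more informative distributions.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence KL(p_teacher || q_student) between the softened
    # distributions; the student is trained to drive this toward zero.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that exactly matches the teacher incurs zero loss;
# a mismatched student incurs a positive loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

In practice the student is optimized by gradient descent on this loss over a large dataset; DINOv2 applies the same teacher-student idea to its self-supervised feature objective rather than to classification logits.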

