DeiT III: Revenge of the ViT

04/14/2022
by Hugo Touvron, et al.

A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or about specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new, simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning, and semantic segmentation show that our procedure outperforms previous fully supervised training recipes for ViT by a large margin. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.
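To make the "only 3 augmentations" concrete, here is a minimal torchvision sketch, assuming the three operations are grayscale, solarization, and Gaussian blur, with one of them picked uniformly at random per image and combined with color jitter and a simple crop. The specific parameters (kernel size, threshold, jitter strength) are illustrative assumptions, not values from the paper.

```python
import torch
from torchvision import transforms

# Sketch of a 3-augmentation training pipeline in the spirit of the
# paper's recipe: one of three simple ops is applied per image.
# Parameter values below are illustrative, not the paper's.
three_augment = transforms.RandomChoice([
    transforms.Grayscale(num_output_channels=3),      # keep 3 channels for the model
    transforms.RandomSolarize(threshold=128, p=1.0),  # invert pixels above threshold
    transforms.GaussianBlur(kernel_size=9),           # mild blur
])

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    three_augment,
    transforms.ColorJitter(0.3, 0.3, 0.3),            # brightness/contrast/saturation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```

Each operation is cheap and label-preserving, so the pipeline stays far simpler than learned policies such as RandAugment, which is consistent with the abstract's claim that the recipe is closer to the practice in self-supervised learning.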

