Better plain ViT baselines for ImageNet-1k

05/03/2022
by Lucas Beyer, et al.

It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach a strong 80% in less than one day.
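The "standard data augmentation" in question amounts to Inception-style random cropping, horizontal flips, and modest RandAugment plus Mixup. As a rough illustration only (not the authors' code, which is in JAX/big_vision), here is a minimal PyTorch/torchvision sketch of such a pipeline; the strengths RandAugment(2, 10) and Mixup alpha of 0.2 are taken to match the paper's reported setup, while the image size and helper names are assumptions for the example.

```python
# Minimal sketch (not the authors' code) of the "standard data augmentation"
# the note relies on: Inception-style crop, horizontal flip, RandAugment,
# and Mixup. RandAugment(2, 10) and alpha=0.2 are assumed to match the
# paper's reported setup; image size and helper names are illustrative.
import torch
import torch.nn.functional as F
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                 # Inception-style crop
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=10),   # modest RandAugment
    transforms.ToTensor(),
])

def mixup(images, labels, alpha=0.2, num_classes=1000):
    """Blend a batch with a shuffled copy of itself (Mixup)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    onehot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * onehot + (1 - lam) * onehot[perm]
    return mixed_images, mixed_labels
```

The point of the sketch is how little machinery is involved: two stochastic image ops plus label smoothing via Mixup, with no dropout or stochastic depth anywhere in the recipe.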

Related research

- Do Transformer Modifications Transfer Across Implementations and Applications? (02/23/2021)
  The research community has proposed copious modifications to the Transfo...

- How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers (06/18/2021)
  Vision Transformers (ViT) have been shown to attain highly competitive p...

- Grounding inductive biases in natural images: invariance stems from variations in data (06/09/2021)
  To perform well on unseen and potentially out-of-distribution samples, i...

- Pyramid Adversarial Training Improves ViT Performance (11/30/2021)
  Aggressive data augmentation is a key component of the strong generaliza...

- Time Matters in Using Data Augmentation for Vision-based Deep Reinforcement Learning (02/17/2021)
  Data augmentation technique from computer vision has been widely conside...

- MULLER: Multilayer Laplacian Resizer for Vision (04/06/2023)
  Image resizing operation is a fundamental preprocessing module in modern...

- Online Hyper-parameter Learning for Auto-Augmentation Strategy (05/17/2019)
  Data augmentation is critical to the success of modern deep learning tec...
