Better plain ViT baselines for ImageNet-1k

by Lucas Beyer, et al.

It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k-scale data. Surprisingly, we find this is not the case, and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% top-1 accuracy.
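The abstract's claim is that standard data augmentation alone suffices. As an illustration only (not the paper's exact recipe, whose hyper-parameters are not given here), a toy pad-then-random-crop plus random horizontal flip in pure Python, the two operations most commonly meant by "standard" augmentation; the `crop` padding size is a hypothetical parameter:

```python
import random

def augment(img, crop=4):
    """Toy standard augmentation: zero-pad by `crop` pixels on every side,
    take a random crop back to the original size, then flip horizontally
    with probability 0.5. `img` is a list of rows of pixel values."""
    h, w = len(img), len(img[0])
    # zero-pad the image on all four sides
    padded = (
        [[0] * (w + 2 * crop) for _ in range(crop)]
        + [[0] * crop + list(row) + [0] * crop for row in img]
        + [[0] * (w + 2 * crop) for _ in range(crop)]
    )
    # choose a random h x w window inside the padded image
    top = random.randint(0, 2 * crop)
    left = random.randint(0, 2 * crop)
    out = [r[left:left + w] for r in padded[top:top + h]]
    # random horizontal flip
    if random.random() < 0.5:
        out = [list(reversed(r)) for r in out]
    return out
```

The augmented image keeps the input's height and width, so it can be fed to the same model unchanged.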


