) models. The addition of results when pre-training only on ImageNet-1k was an afterthought, mostly to ablate the effect of data scale. Nevertheless, ImageNet-1k remains a key testbed in computer vision research, and it is highly beneficial to have as simple and effective a baseline as possible.
Thus, coupled with the release of the big_vision codebase used to develop ViT, MLP-Mixer, ViT-G, LiT, and a variety of other research projects, we now provide a new baseline that stays true to the original ViT's simplicity while reaching results competitive with similar approaches [15, 17] and with concurrent work, which also strives for simplification.
2 Experimental setup
We focus entirely on the ImageNet-1k dataset (ILSVRC-2012) for both (pre)training and evaluation. We stick to the original ViT model architecture due to its widespread acceptance [15, 2, 5, 1, 9], simplicity, and scalability, and revisit only a few very minor details, none of which are novel. We choose to focus on the smaller ViT-S/16 variant, as we believe it provides a good tradeoff between iteration velocity on commonly available hardware and final accuracy. However, when more compute and data are available, we highly recommend iterating with ViT-B/32 or ViT-B/16 instead [12, 19], and note that increasing patch size is almost equivalent to reducing image resolution.
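The near-equivalence of patch size and resolution comes down to sequence length: a ViT processes one token per patch, so doubling the patch size and halving the resolution yield the same token count. A small sketch (the `num_tokens` helper is illustrative, not from the codebase; values are standard ViT settings):

```python
# Token count for a ViT: a patch-size increase and a resolution decrease
# change the sequence length in the same way.
def num_tokens(resolution: int, patch_size: int) -> int:
    side = resolution // patch_size  # patches per image side
    return side * side

print(num_tokens(224, 16))  # ViT-S/16 at 224px: 14x14 = 196 tokens
print(num_tokens(224, 32))  # /32 patches at 224px: 7x7 = 49 tokens
print(num_tokens(112, 16))  # /16 patches at 112px: also 49 tokens
```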
All experiments use “inception crop”  at 224px² resolution, random horizontal flips, RandAugment , and Mixup augmentations. We train on the first 99% of the training data, and keep 1% for minival to encourage the community to stop selecting design choices on the validation (de-facto test) set. The full setup is shown in Appendix A.
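Of the augmentations above, Mixup is the simplest to state precisely: with some probability a batch is blended, pixel- and label-wise, with a permuted copy of itself. A minimal numpy sketch, assuming a batch of images and one-hot labels (the function name, `alpha`, and the probability gating are illustrative; the actual training code is JAX-based):

```python
import numpy as np

def mixup(images, labels, alpha=0.2, p=0.2, rng=None):
    """Illustrative Mixup sketch: with probability p, blend each example
    with a randomly chosen partner, using the same coefficient for labels.

    images: (B, H, W, C) float array; labels: (B, num_classes) one-hot.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    if rng.random() >= p:               # most batches pass through unchanged
        return images, labels
    lam = rng.beta(alpha, alpha)        # mixing coefficient in (0, 1)
    perm = rng.permutation(len(images)) # pair each example with another
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_images, mixed_labels
```

Because the labels are blended with the same coefficient as the pixels, each mixed label row still sums to one, so the standard softmax cross-entropy loss applies unchanged.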
The results for our improved setup are shown in Figure 1, along with a few important related baselines. It is clear that a simple, standard ViT trained this way can match both the seminal ResNet50-at-90-epochs baseline and more modern ResNet and ViT training setups. Furthermore, on a small TPUv3-8 node, the 90-epoch run takes only 6h30, and one can reach 80% accuracy in less than a day when training for 300 epochs.
The main differences from [4, 12] are a batch size of 1024 instead of 4096, the use of global average pooling (GAP) instead of a class token [2, 11], fixed 2D sin-cos position embeddings, and the introduction of a small amount of RandAugment and Mixup (level 10 and probability 0.2 respectively, which is less than). These small changes lead to significantly better performance than that originally reported in .
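The fixed 2D sin-cos position embeddings can be sketched in a few lines of numpy, following the common recipe of applying sinusoidal frequency bands independently to the x and y patch coordinates and concatenating the four resulting feature groups (a sketch only; the training code has its own JAX implementation):

```python
import numpy as np

def posemb_sincos_2d(h, w, dim, temperature=10000.0):
    """Fixed 2D sin-cos position embeddings for an h*w grid of patches."""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos for x and y)"
    y, x = np.mgrid[:h, :w]                        # patch coordinates
    omega = np.arange(dim // 4) / (dim // 4 - 1)
    omega = 1.0 / temperature**omega               # frequency bands
    y = y.reshape(-1, 1) * omega                   # (h*w, dim/4) each
    x = x.reshape(-1, 1) * omega
    return np.concatenate(
        [np.sin(x), np.cos(x), np.sin(y), np.cos(y)], axis=1)

pe = posemb_sincos_2d(14, 14, 384)  # ViT-S/16 at 224px: 14x14 patches
print(pe.shape)  # (196, 384)
```

Since these embeddings are a fixed function of the grid position, they add no learned parameters and are simply summed onto the patch embeddings.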
Notably absent from this baseline are further architectural changes, regularizers such as dropout or stochastic depth , advanced optimization schemes such as SAM , extra augmentations such as CutMix , repeated augmentations , or blurring, “tricks” such as high-resolution fine-tuning or checkpoint averaging, as well as supervision from a strong teacher via knowledge distillation.
Table 1 shows an ablation of the various minor changes we propose. It exemplifies how a collection of almost trivial changes can accumulate to an important overall improvement. The only change which makes no significant difference in classification accuracy is whether the classification head is a single linear layer, or an MLP with one hidden layer as in the original Transformer formulation.
| Ablated change | 90ep | 150ep | 300ep |
|---|---|---|---|
| Posemb: sincos2d → learned | 75.0 | 78.0 | 79.6 |
| Batch-size: 1024 → 4096 | 74.7 | 77.3 | 78.6 |
| Global Avgpool → [cls] token | 75.0 | 76.9 | 78.2 |
| Head: MLP → linear | 76.7 | 78.6 | 79.8 |
| Original + RandAug + MixUp | 71.6 | 74.8 | 76.1 |

| Setup | ImageNet val | ReaL | v2 |
|---|---|---|---|
| Our improvements (90ep) | 76.5 | 83.1 | 64.2 |
| Our improvements (150ep) | 78.5 | 84.5 | 66.4 |
| Our improvements (300ep) | 80.0 | 85.4 | 68.3 |
It is always worth striving for simplicity.
Acknowledgements. We thank Daniel Suo and Naman Agarwal for nudging for 90 epochs and feedback on the report, as well as the Google Brain team for a supportive research environment.
-  Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, and Denny Zhou. A simple single-scale vision transformer for object localization and instance segmentation. CoRR, abs/2112.09747, 2021.
-  Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In International Conference on Computer Vision (ICCV), 2021.
-  Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical data augmentation with no separate search. CoRR, abs/1909.13719, 2019.
-  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
-  Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. CoRR, abs/2111.06377, 2021.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. CoRR, abs/1603.09382, 2016.
-  Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big Transfer (BiT): General visual representation learning. In European Conference on Computer Vision (ECCV), 2020.
-  Yanghao Li, Hanzi Mao, Ross B. Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. CoRR, abs/2203.16527, 2022.
-  Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. CoRR, abs/2203.02714, 2022.
-  Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? CoRR, abs/2108.08810, 2021.
-  Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? data, augmentation, and regularization in vision transformers. CoRR, abs/2106.10270, 2021.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34, 2021.
-  Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning (ICML), 2021.
-  Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: revenge of the ViT. CoRR, abs/2204.07118, 2022.
-  Ross Wightman, Hugo Touvron, and Hervé Jégou. ResNet strikes back: An improved training procedure in timm. CoRR, abs/2110.00476, 2021.
-  Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. CutMix: Regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV), 2019.
-  Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
-  Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-shot transfer with locked-image text tuning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
-  Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR), 2018.