When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

by   Xiangning Chen, et al.

Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.


LaCViT: A Label-aware Contrastive Training Framework for Vision Transformers

Vision Transformers have been incredibly effective when tackling compute...

Long-Short Temporal Contrastive Learning of Video Transformers

Video transformers have recently emerged as a competitive alternative to...

Exploring the Limits of Weakly Supervised Pretraining

State-of-the-art visual perception models for a wide range of tasks rely...

Study of positional encoding approaches for Audio Spectrogram Transformers

Transformers have revolutionized the world of deep learning, specially i...

ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction

GNNs and chemical fingerprints are the predominant approaches to represe...

Teaching Arithmetic to Small Transformers

Large language models like GPT-4 exhibit emergent capabilities across ge...

SNT: Sharpness-Minimizing Network Transformation for Fast Compression-friendly Pretraining

Model compression has become the de-facto approach for optimizing the ef...

Please sign up or login with your details

Forgot password? Click here to reset