An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

by   Alexey Dosovitskiy, et al.

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.



page 8

page 17

page 21


Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions

With the achievements of Transformer in the field of natural language pr...

Scaling Vision with Sparse Mixture of Experts

Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated exce...

MLP-Mixer: An all-MLP Architecture for Vision

Convolutional Neural Networks (CNNs) are the go-to model for computer vi...

DBIA: Data-free Backdoor Injection Attack against Transformer Networks

Recently, transformer architecture has demonstrated its significance in ...

Understanding Robustness of Transformers for Image Classification

Deep Convolutional Neural Networks (CNNs) have long been the architectur...

Neural Networks as Explicit Word-Based Rules

Filters of convolutional networks used in computer vision are often visu...

CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction

Vision transformer (ViT) has achieved competitive accuracy on a variety ...

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.