An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

10/22/2020
by   Alexey Dosovitskiy, et al.

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
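To make "sequences of image patches" concrete, the following is a minimal sketch of how an image might be cut into non-overlapping 16x16 patches and flattened into a sequence of vectors. The patch size comes from the title; the 224x224 RGB input, the function name `image_to_patches`, and the use of plain nested lists are illustrative assumptions, not details taken from the abstract. A real implementation would more likely use an array library.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns one flat list per patch,
    ordered left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size x C block into a single vector.
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# Assumed input: a dummy 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = image_to_patches(image, patch_size=16)

num_patches = len(patches)   # (224 // 16) ** 2 patches in the sequence
patch_dim = len(patches[0])  # 16 * 16 * 3 values per flattened patch
```

Under these assumptions the image becomes a sequence of 196 vectors of length 768, which plays the role that a sequence of word tokens plays in a language Transformer.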

03/02/2022

Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions

With the achievements of Transformer in the field of natural language pr...
06/10/2021

Scaling Vision with Sparse Mixture of Experts

Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated exce...
05/04/2021

MLP-Mixer: An all-MLP Architecture for Vision

Convolutional Neural Networks (CNNs) are the go-to model for computer vi...
11/22/2021

DBIA: Data-free Backdoor Injection Attack against Transformer Networks

Recently, transformer architecture has demonstrated its significance in ...
03/26/2021

Understanding Robustness of Transformers for Image Classification

Deep Convolutional Neural Networks (CNNs) have long been the architectur...
07/10/2019

Neural Networks as Explicit Word-Based Rules

Filters of convolutional networks used in computer vision are often visu...
03/09/2022

CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction

Vision transformer (ViT) has achieved competitive accuracy on a variety ...
