Patches Are All You Need?

01/24/2022
by Asher Trockman, et al.

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
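The abstract's description maps almost directly onto code: a patch embedding implemented as a strided convolution, followed by repeated blocks that mix spatially with a depthwise convolution and across channels with a pointwise (1x1) convolution, all at constant size and resolution. The PyTorch sketch below is a minimal illustration of that recipe, not the authors' reference implementation (which lives at the linked repository); the choice of GELU activations, BatchNorm, the residual placement, and the default hyperparameters are assumptions made here for concreteness.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Adds a skip connection around an arbitrary module."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    # A sketch of the architecture the abstract describes; all
    # hyperparameter defaults here are illustrative assumptions.
    return nn.Sequential(
        # Patch embedding: a strided convolution that turns each
        # patch_size x patch_size region into one dim-channel "pixel".
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Spatial mixing: a depthwise convolution (groups=dim),
            # wrapped in a residual connection; resolution is preserved.
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Channel mixing: a pointwise (1x1) convolution.
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        # Global average pooling and a linear classifier head.
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )

# Example: a small, hypothetical configuration.
model = ConvMixer(dim=256, depth=8, kernel_size=9, patch_size=7)
```

Note how the abstract's two claims fall out of the structure: "separates the mixing of spatial and channel dimensions" corresponds to the depthwise-versus-pointwise split, and "maintains equal size and resolution throughout" follows from the absence of any stride or pooling between the patch embedding and the final pooling layer.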

