Exploring whether attention is necessary for vision transformers
The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.
Introduced by [dosovitskiy2020image], the vision transformer architecture applies a series of transformer blocks to a sequence of image patches. Each block consists of a multi-head attention layer [vaswani2017attention] followed by a feed-forward layer (i.e. a linear layer, or a single-layer MLP) applied along the feature dimension. The general-purpose nature of this architecture, coupled with its strong performance on image classification benchmarks, has prompted significant interest from the vision community. However, it is still not clear exactly why the vision transformer is effective.
The most-cited reason for the transformer’s success on vision tasks is the design of its attention layer, which gives the model a global receptive field. This layer may be seen as a data-dependent linear layer, and when applied on image patches it resembles (but is not exactly equivalent to) a convolution. Indeed, a significant amount of recent work has gone into improving the efficiency and efficacy of the attention layer.
In this short report, we conduct an experiment that hopes to shed a little light on why the vision transformer works so well in the first place. Specifically, we remove attention from the vision transformer, replacing it with a feed-forward layer applied over the patch dimension. After this change, the model is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion (Figure 1).
In experiments on ImageNet (Table 1), we show that quite strong performance is attainable even without attention. Notably, a ViT-base-sized model gives 74.9% top-1 accuracy without any hyperparameter tuning (i.e. using the same hyperparameters as its ViT counterpart). These results suggest that the strong performance of vision transformers may be attributable less to the attention mechanism and more to other factors, such as the inductive bias produced by the patch embedding and the carefully-curated set of training augmentations.
The primary purpose of this report is to explore the limits of simple architectures. We do not aim to break the ImageNet benchmarks; on that front, methods such as neural architecture search (e.g. EfficientNet [tan2019efficientnet]) will inevitably perform best. Nonetheless, we hope that the community finds these results interesting, and that these results prompt more researchers to investigate why our current models are as effective as they are.
The context for this report is that over the past few months, there has been an explosion of research into variants of the vision transformer architecture: DeiT [touvron2020deit] adds distillation, DeepViT [zhou2021deepvit] mixes the attention heads, CaiT [touvron2021cait] separates the attention layers into two stages, Token-to-Token ViT [yuan2021tokens] aggregates neighboring tokens throughout the network, CrossViT [chen2021crossvit] processes patches at two scales, PiT [heo2021rethinking] adds pooling layers, LeViT [graham2021levit] uses convolutional embeddings and modified attention/normalization layers, CvT [wu2021cvt] uses depthwise convolutions in the attention layer, and Swin/Twins [liu2021swin, chu2021twins] combine global and local attention, just to name a few.
These works improve upon the vision transformer architecture, each showing strong performance on ImageNet. However, it is not clear how the different parts of ViT or its many variants contribute to the final performance of each of these models. This report details an experiment that investigates one aspect of this issue, namely how important the attention layers are to ViT’s success.
We remove the attention layer from the ViT model, replacing it with a simple feed-forward layer over the patch dimension. We use the same structure as the standard feed-forward network over the feature dimension, which is to say that we project the patch dimension into a higher-dimensional space, apply a nonlinearity, and project it back to the original space. Figure 2 gives PyTorch code for a single block of the feed-forward-only transformer.
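For concreteness, a minimal sketch of such a block is given below. It follows the pre-norm residual structure of ViT; the class and argument names are illustrative rather than taken from the report's actual code.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Linear -> GELU -> Linear, applied over the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class FFOnlyBlock(nn.Module):
    """One feed-forward-only block: mix over patches, then over features."""
    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.ff_patches = FeedForward(num_patches, num_patches * expansion)
        self.norm2 = nn.LayerNorm(dim)
        self.ff_features = FeedForward(dim, dim * expansion)

    def forward(self, x):  # x: (batch, patches, dim)
        # Mix information across patches: transpose so the patch axis is last,
        # apply the feed-forward layer, and transpose back.
        x = x + self.ff_patches(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Mix information across features, as in a standard transformer block.
        x = x + self.ff_features(self.norm2(x))
        return x
```

The first residual branch plays the role that attention plays in ViT; the second is identical to the standard transformer feed-forward sub-layer.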
We should note that, like the vision transformer and its many variants, this feed-forward-only network bears a strong resemblance to a convolutional network. In fact, the feed-forward layer over the patch dimension can be viewed as an unusual type of convolution with a full receptive field and a single channel. Since the feed-forward layer over the feature dimension can be seen as a 1x1 convolution, it would be technically accurate to say that the entire network is a sort of convolutional network in disguise. That being said, it is structurally more similar to a transformer than to a traditionally-designed convolutional network (e.g. ResNet/VGG).
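The 1x1-convolution view of the feature-dimension feed-forward layer can be checked numerically: a linear layer over the feature axis and a kernel-size-1 convolution with the same weights produce identical outputs. The shapes below are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, P, D = 2, 8, 16  # batch, patches, features (illustrative sizes)

# A feed-forward (linear) layer applied over the feature dimension...
linear = nn.Linear(D, D, bias=False)

# ...matches a 1x1 convolution carrying the same weights.
conv = nn.Conv1d(D, D, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))  # (D, D, 1)

x = torch.randn(B, P, D)
out_linear = linear(x)                                # (B, P, D)
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)    # channels-first for conv

assert torch.allclose(out_linear, out_conv, atol=1e-5)
```

The same reasoning, with the patch axis in place of the channel axis, gives the "full receptive field, single channel" reading of the patch-mixing layer.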
We train three models, corresponding to the ViT/DeiT tiny, base, and large networks, on ImageNet [imagenet_dataset] using the setup from DeiT [touvron2020deit]. The tiny and base networks have patch size 16, while the large network has patch size 32 due to computational constraints. Training and evaluation is performed at resolution 224px. Notably, we use exactly the same hyperparameters as DeiT for all models, which means that our performance could likely be improved with hyperparameter tuning.
Table 1 shows the performance of our simple feed-forward network on ImageNet. Most notably, the feed-forward-only version of ViT/DeiT-base achieves surprisingly strong performance (74.9% top-1 accuracy), comparable to a number of older convolutional networks (e.g. VGG-16, ResNet-34). Such a comparison is not exactly fair because the feed-forward model uses stronger training augmentations, but it is nevertheless quite a strong result in an absolute sense.
Performance deteriorates for the large models, both with and without attention, yielding 71.2% and 71.4% top-1 accuracy respectively. As detailed in [dosovitskiy2020image], pretraining with a larger dataset appears necessary for such enormous models.
Naturally, since we tried training a model with only feed-forward layers, we also tried training a model with only attention layers. In this model, we simply replaced the feed-forward layer over the feature dimension with an attention layer over the feature dimension. We only experimented with a tiny-sized model (approximately 4.0M parameters), but it performed spectacularly poorly (28.2% top-1 accuracy at 100 epochs, at which point we ended the run).
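The report does not spell out the exact construction of this attention-only variant; one plausible reading is to apply attention over the transposed sequence, treating features as tokens. The sketch below is under that assumption, with illustrative names, and should not be taken as the report's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """One block: attention over patches, then attention over features."""
    def __init__(self, num_patches, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn_patches = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Feature-dimension attention: after a transpose, each of the `dim`
        # feature channels acts as a token of length `num_patches`.
        self.norm2 = nn.LayerNorm(num_patches)
        self.attn_features = nn.MultiheadAttention(num_patches, heads,
                                                   batch_first=True)

    def forward(self, x):  # x: (batch, patches, dim)
        h = self.norm1(x)
        x = x + self.attn_patches(h, h, h, need_weights=False)[0]
        h = self.norm2(x.transpose(1, 2))  # (batch, dim, patches)
        x = x + self.attn_features(h, h, h,
                                   need_weights=False)[0].transpose(1, 2)
        return x
```

Whatever the precise construction, the key point from the experiment stands: removing the feed-forward sub-layer hurts far more than removing attention.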
The above experiments demonstrate that it is possible to train reasonably strong transformer-style image classifiers without attention layers. Furthermore, attention layers without feed-forward layers do not appear to yield similarly strong performance. These results indicate that the strong performance of ViT may be attributable more to its patch embeddings and training procedure than to the design of the attention layer. The patch embeddings in particular provide a strong inductive bias that is likely one of, if not the, main drivers of the model’s strong performance.
From a practical perspective, the feed-forward-only model has one notable advantage over the vision transformer, which is that its complexity is linear with respect to the sequence length, as opposed to quadratic. This is the case due to the intermediate projection dimension within the feed-forward layer applied over patches, whose size need not depend on the sequence length. Usually the intermediate dimension is chosen to be a multiple of the number of input features (i.e. the number of patches), in which case the model is indeed quadratic, but this does not need to be the case.
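This distinction can be made concrete with a quick parameter count for the patch-mixing MLP (a sketch; the function name and exact bias accounting are illustrative):

```python
def patch_ff_params(n, h):
    """Parameters of the patch-mixing MLP n -> h -> n, including biases."""
    return (n * h + h) + (h * n + n)

# Usual choice: hidden size proportional to n -> quadratic growth in n.
quad = [patch_ff_params(n, 4 * n) for n in (49, 196)]

# Fixed hidden size -> growth linear in the sequence length n.
lin = [patch_ff_params(n, 512) for n in (49, 196)]

# Quadrupling n (49 -> 196) roughly 16x's the first count but only 4x's
# the second, matching the quadratic-vs-linear claim.
print(quad[1] / quad[0], lin[1] / lin[0])
```

The same counting applies to compute cost, since the layer's FLOPs are proportional to its parameter count times the batch and feature sizes.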
Apart from its worse performance, one major downside of the feed-forward-only model is that it only functions on fixed-length sequences (due to the feed-forward layer over patches). This is not a big issue for image classification, where images are cropped to a standard size, but limits the applicability of the architecture to other tasks.
Feed-forward-only models shed light on vision transformers and attention mechanisms in general. In future work, it would be interesting to investigate the extent to which these conclusions apply outside of the image domain, for example in NLP or audio.
This short report demonstrates that transformer-style networks without attention layers make for surprisingly strong image classifiers. Future work in this direction could attempt to better understand the contributions of other pieces of the transformer architecture (e.g. the normalization layer or initialization scheme). More broadly, we hope that this short report encourages further investigation into why our current models perform as well as they do.