Stand-Alone Self-Attention in Vision Models

06/13/2019
by Prajit Ramachandran, et al.

Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to a ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox.
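To make the core idea concrete, below is a minimal PyTorch sketch of a local self-attention layer that could stand in for a spatial convolution, in the spirit of the abstract. This is an illustration, not the authors' implementation: it uses a single attention head and adds a learned relative-position embedding directly to the keys, whereas the paper factorizes relative embeddings into row and column offsets and uses multiple heads. The class name `LocalSelfAttention2d` and the hyperparameter choices are placeholders for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSelfAttention2d(nn.Module):
    """Single-head local self-attention over a k x k neighborhood.

    Illustrative sketch of a stand-alone attention layer used in place
    of a spatial convolution; hyperparameters are placeholders.
    """

    def __init__(self, in_channels, out_channels, kernel_size=7):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel keeps spatial size"
        self.k = kernel_size
        self.out_channels = out_channels
        # 1x1 convolutions produce queries, keys, and values per pixel.
        self.query = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.key = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.value = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        # Learned embedding per neighborhood offset (simplified: the paper
        # factorizes this into separate row and column embeddings).
        self.rel_pos = nn.Parameter(
            torch.randn(out_channels, kernel_size * kernel_size))

    def forward(self, x):
        b, _, h, w = x.shape
        c, kk, pad = self.out_channels, self.k * self.k, self.k // 2
        q = self.query(x)                                     # (b, c, h, w)
        # Gather the k x k neighborhood of keys/values at every pixel.
        k = F.unfold(self.key(x), self.k, padding=pad)        # (b, c*kk, h*w)
        v = F.unfold(self.value(x), self.k, padding=pad)      # (b, c*kk, h*w)
        k = k.view(b, c, kk, h * w)
        v = v.view(b, c, kk, h * w)
        # Inject relative-position information into the keys.
        k = k + self.rel_pos.view(1, c, kk, 1)
        # Attention logits: dot product of each query with its local keys.
        logits = (q.view(b, c, 1, h * w) * k).sum(dim=1)      # (b, kk, h*w)
        attn = logits.softmax(dim=1)
        # Weighted sum of values over the local window.
        out = (attn.unsqueeze(1) * v).sum(dim=2)              # (b, c, h*w)
        return out.view(b, c, h, w)


# Example: drop-in replacement for a 3x3 convolution in a residual block.
layer = LocalSelfAttention2d(in_channels=64, out_channels=64, kernel_size=7)
y = layer(torch.randn(2, 64, 32, 32))  # -> shape (2, 64, 32, 32)
```

One design point the sketch preserves: the receptive field is local, like a convolution's, but the weights applied to each neighbor are computed from content (query-key dot products) rather than being fixed learned filters, which is what makes the layer a stand-alone attention primitive.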

research · 04/22/2019 · Attention Augmented Convolutional Networks
Convolutional networks have been the paradigm of choice in many computer...

research · 03/17/2020 · Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Convolution exploits locality for efficiency at a cost of missing long r...

research · 03/23/2021 · Scaling Local Self-Attention For Parameter Efficient Visual Backbones
Self-attention has the promise of improving computer vision systems due ...

research · 11/08/2019 · On the Relationship between Self-Attention and Convolutional Layers
Recent trends of incorporating attention mechanisms in vision have led r...

research · 12/03/2019 · Multiscale Self Attentive Convolutions for Vision and Language Modeling
Self attention mechanisms have become a key building block in many state...

research · 01/31/2023 · Fairness-aware Vision Transformer via Debiased Self-Attention
Vision Transformer (ViT) has recently gained significant interest in sol...

research · 05/04/2022 · Sequencer: Deep LSTM for Image Classification
In recent computer vision research, the advent of the Vision Transformer...
