SimA: Simple Softmax-free Attention for Vision Transformers

Recently, vision transformers have become very popular. However, deploying them in many applications is computationally expensive, partly due to the Softmax layer in the attention block. We introduce a simple but effective Softmax-free attention block, SimA, which normalizes the query and key matrices with a simple $\ell_1$-norm instead of using a Softmax layer. The attention block in SimA then reduces to a multiplication of three matrices, so SimA can dynamically change the order of the computation at test time to achieve computation that is linear in either the number of tokens or the number of channels. We empirically show that SimA applied to three SOTA transformer variants, DeiT, XCiT, and CvT, achieves accuracy on par with the SOTA models, without any need for a Softmax layer. Interestingly, changing SimA from multi-head to single-head has only a small effect on accuracy, which simplifies the attention block further. The code is available at $\href{https://github.com/UCDvision/sima}{\text{https://github.com/UCDvision/sima}}$.
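To make the linear-complexity claim concrete, below is a minimal sketch of the idea described in the abstract, assuming a PyTorch-style implementation; the function name, tensor shapes, and the choice of normalization axis are illustrative assumptions, not the repository's actual API. With $\ell_1$-normalized $Q$ and $K$, the attention output is just $Q K^\top V$, and associativity lets us pick whichever multiplication order is cheaper.

```python
import torch

def sima_attention(q, k, v, eps=1e-6):
    """Softmax-free attention sketch (hypothetical helper, not the repo's API).

    q, k, v: tensors of shape (batch, tokens, channels).
    """
    B, N, D = q.shape
    # l1-normalize Q and K (here, along the token dimension; an illustrative
    # choice standing in for the paper's l1-norm of the query/key matrices).
    q = q / (q.abs().sum(dim=1, keepdim=True) + eps)
    k = k / (k.abs().sum(dim=1, keepdim=True) + eps)
    if N <= D:
        # (Q K^T) V: cost O(N^2 D), i.e. linear in the number of channels.
        attn = q @ k.transpose(-2, -1)   # (B, N, N)
        out = attn @ v                   # (B, N, D)
    else:
        # Q (K^T V): cost O(N D^2), i.e. linear in the number of tokens.
        kv = k.transpose(-2, -1) @ v     # (B, D, D)
        out = q @ kv                     # (B, N, D)
    return out
```

Because there is no Softmax between $Q K^\top$ and $V$, both orderings produce the same result, so the cheaper one can be selected dynamically at test time depending on whether tokens or channels dominate.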


