SOFT: Softmax-free Transformer with Linear Complexity

10/22/2021
by Jiachen Lu, et al.

Vision transformers (ViTs) have pushed the state of the art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the use of self-attention modules results in quadratic complexity in both computation and memory. Various attempts at approximating self-attention with linear complexity have been made in natural language processing, but an in-depth analysis in this work shows that they are either theoretically flawed or empirically ineffective for visual recognition. We further identify that their limitations are rooted in keeping the softmax self-attention during approximation: conventional self-attention is computed by normalizing the scaled dot-product between token feature vectors, and retaining this softmax operation challenges any subsequent linearization effort. Based on this insight, a softmax-free transformer (SOFT) is proposed for the first time. To remove the softmax in self-attention, a Gaussian kernel function replaces the dot-product similarity without further normalization. This enables the full self-attention matrix to be approximated via a low-rank matrix decomposition. The robustness of the approximation is achieved by computing its Moore-Penrose inverse with a Newton-Raphson method. Extensive experiments on ImageNet show that SOFT significantly improves the computational efficiency of existing ViT variants. Crucially, with linear complexity, much longer token sequences are permitted in SOFT, resulting in a superior trade-off between accuracy and complexity.
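
The abstract outlines the core recipe: replace the softmax-normalized dot-product with a Gaussian kernel similarity, approximate the resulting n x n attention matrix with a low-rank (Nyström-style) decomposition built from a small set of landmark tokens, and compute the required Moore-Penrose inverse with a Newton-Raphson iteration. The NumPy sketch below illustrates that recipe only; the landmark selection (simple strided subsampling), the kernel scale, the omission of learned query/value projections, and all function names are assumptions made for illustration, not the paper's exact design.

```python
import numpy as np

def gaussian_kernel(a, b, scale):
    # Pairwise Gaussian similarity: exp(-||a_i - b_j||^2 / (2 * scale)).
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * scale))

def pinv_newton(B, iters=20):
    # Newton-Raphson (Newton-Schulz) iteration for the Moore-Penrose inverse of B.
    # The initialization guarantees convergence for this iteration.
    V = B.T / (np.linalg.norm(B, 1) * np.linalg.norm(B, np.inf))
    for _ in range(iters):
        V = 2.0 * V - V @ B @ V
    return V

def soft_attention(Q, X_v, num_landmarks=16):
    # Softmax-free attention sketch: the full n x n kernel matrix S = K(Q, Q)
    # is approximated by A @ pinv(B) @ A.T, where the landmarks Qm are a
    # strided subsample of the tokens (an illustrative choice, not the paper's pooling).
    n, d = Q.shape
    idx = np.linspace(0, n - 1, num_landmarks).astype(int)
    Qm = Q[idx]
    scale = np.sqrt(d)
    A = gaussian_kernel(Q, Qm, scale)   # (n, m)
    B = gaussian_kernel(Qm, Qm, scale)  # (m, m)
    B_pinv = pinv_newton(B)
    # Associate right-to-left so every product is O(n * m), never O(n^2).
    return A @ (B_pinv @ (A.T @ X_v))

# Toy usage with random token features standing in for projected queries/values.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64))
out = soft_attention(tokens, tokens)
print(out.shape)  # (256, 64)
```

Because every product in the final expression involves only an n x m or m x m matrix, the cost grows linearly in the number of tokens n for a fixed number of landmarks m, which is what permits the longer token sequences mentioned above.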

Related research

07/05/2022 · Softmax-free Linear Transformers
Vision transformers (ViTs) have pushed the state-of-the-art for various ...

03/09/2021 · Beyond Nyströmformer – Approximation of self-attention by Spectral Shifting
Transformer is a powerful tool for many natural language tasks which is ...

01/28/2022 · O-ViT: Orthogonal Vision Transformer
Inspired by the tremendous success of the self-attention mechanism in na...

05/31/2021 · Choose a Transformer: Fourier or Galerkin
In this paper, we apply the self-attention from the state-of-the-art Tra...

05/19/2020 · Normalized Attention Without Probability Cage
Attention architectures are widely used; they recently gained renewed po...

10/22/2021 · Sinkformers: Transformers with Doubly Stochastic Attention
Attention based models such as Transformers involve pairwise interaction...

11/29/2022 · Lightweight Structure-Aware Attention for Visual Understanding
Vision Transformers (ViTs) have become a dominant paradigm for visual re...
