Sinkformers: Transformers with Doubly Stochastic Attention

10/22/2021
by Michael E. Sander, et al.

Attention-based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer. We show that the row-wise stochastic attention matrices in classical Transformers get close to doubly stochastic matrices as the number of epochs increases, which justifies using Sinkhorn normalization as an informative prior. On the theoretical side, we show that, unlike the SoftMax operation, this normalization makes it possible to interpret the iterations of self-attention modules as a discretized gradient flow for the Wasserstein metric. We also show that, in the infinite-sample limit and when both attention matrices and depth are rescaled, Sinkformers perform a heat diffusion. On the experimental side, we show that Sinkformers improve model accuracy in vision and natural language processing tasks. In particular, on 3D shape classification, Sinkformers yield a significant improvement.
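To make the core idea concrete, here is a minimal NumPy sketch (not the authors' implementation, and the function name and iteration count are illustrative): starting from the exponentiated attention scores, alternating row and column normalizations drive the matrix toward a doubly stochastic one, whereas stopping after the first row normalization alone gives the usual SoftMax attention.

import numpy as np

def sinkhorn_attention(scores, n_iter=5):
    """Sinkhorn-normalize exp(scores) so rows and columns both sum
    (approximately) to one. The first row normalization in the loop,
    taken alone, is exactly standard SoftMax attention."""
    K = np.exp(scores - scores.max())          # stabilized Gibbs kernel
    for _ in range(n_iter):
        K = K / K.sum(axis=1, keepdims=True)   # make rows stochastic
        K = K / K.sum(axis=0, keepdims=True)   # make columns stochastic
    return K

# Toy usage: 4 tokens with 8-dimensional queries and keys.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
A = sinkhorn_attention(q @ k.T / np.sqrt(8))
print(A.sum(axis=1))   # row sums, close to 1
print(A.sum(axis=0))   # column sums, close to 1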

Related research

03/16/2021  Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers
Transformers have transformed the field of natural language processing. ...

07/05/2022  Softmax-free Linear Transformers
Vision transformers (ViTs) have pushed the state-of-the-art for various ...

10/22/2021  SOFT: Softmax-free Transformer with Linear Complexity
Vision transformers (ViTs) have pushed the state-of-the-art for various ...

09/15/2023  Replacing softmax with ReLU in Vision Transformers
Previous research observed accuracy degradation when replacing the atten...

05/31/2021  Choose a Transformer: Fourier or Galerkin
In this paper, we apply the self-attention from the state-of-the-art Tra...

05/19/2020  Normalized Attention Without Probability Cage
Attention architectures are widely used; they recently gained renewed po...

05/31/2020  Doubly-Stochastic Normalization of the Gaussian Kernel is Robust to Heteroskedastic Noise
A fundamental step in many data-analysis techniques is the construction ...
