Sparse Attention with Linear Units

04/14/2021
by Biao Zhang, et al.

Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization, combined with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
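As a rough sketch of the core idea, the snippet below implements attention with ReLU in place of softmax, as described in the abstract, in plain NumPy. The tensor shapes, the scaled dot-product scoring, and the placement of a parameter-free layer normalization on the attention output are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of Rectified Linear Attention (ReLA): softmax is replaced by
# ReLU, so attention weights can be exactly zero, and the output is stabilized
# with layer normalization. Details here are assumptions for illustration.

import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer normalization over the last dimension (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rela_attention(Q, K, V):
    """ReLU-based attention.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    Returns the normalized attention output and the (sparse) weight matrix.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # scaled dot-product scores
    weights = np.maximum(scores, 0.0)    # ReLU instead of softmax -> exact zeros
    out = weights @ V                    # unnormalized weighted sum of values
    return layer_norm(out), weights      # normalization keeps activations stable

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((4, 8))
    K = rng.standard_normal((6, 8))
    V = rng.standard_normal((6, 8))
    out, w = rela_attention(Q, K, V)
    print("sparsity rate:", (w == 0).mean())                        # fraction of zeroed weights
    print("'switched off' queries:", int((w.sum(axis=-1) == 0).sum()))  # rows with no attention
```

Because ReLU can zero out every score for a given query, a whole row of weights may be zero; this is the 'switching off' behaviour mentioned above, which is impossible under softmax, whose outputs always sum to one.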


Related research

Adaptively Sparse Transformers (08/30/2019)
Attention mechanisms have become ubiquitous in NLP. Recent architectures...

Attention networks for image-to-text (12/11/2017)
The paper approaches the problem of image-to-text with attention-based e...

Predicting Attention Sparsity in Transformers (09/24/2021)
A bottleneck in transformer architectures is their quadratic complexity ...

On Controllable Sparse Alternatives to Softmax (10/29/2018)
Converting an n-dimensional vector to a probability distribution over n ...

Understanding Multi-Head Attention in Abstractive Summarization (11/10/2019)
Attention mechanisms in deep learning architectures have often been used...

Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism (08/16/2021)
Softmax is widely used in neural networks for multiclass classification,...

Linear Transformers Are Secretly Fast Weight Memory Systems (02/22/2021)
We show the formal equivalence of linearised self-attention mechanisms a...
