Sparse Sinkhorn Attention

02/26/2020
by Yi Tay, et al.

We propose Sparse Sinkhorn Attention, a new efficient and sparse method for learning to attend. Our method is based on differentiable sorting of internal representations. Concretely, we introduce a meta sorting network that learns to generate latent permutations over sequences. Given sorted sequences, we are then able to compute quasi-global attention with only local windows, improving the memory efficiency of the attention module. To this end, we propose new algorithmic innovations such as Causal Sinkhorn Balancing and SortCut, a dynamic sequence truncation method for tailoring Sinkhorn Attention for encoding and/or decoding purposes. Via extensive experiments on algorithmic seq2seq sorting, language modeling, pixel-wise image generation, document classification and natural language inference, we demonstrate that our memory efficient Sinkhorn Attention method is competitive with vanilla attention and consistently outperforms recently proposed efficient Transformer models such as Sparse Transformers.
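The core mechanism can be illustrated with a short, self-contained sketch. The snippet below is a minimal PyTorch approximation of the idea described above, not the authors' reference implementation: a small meta sorting network scores sequence blocks, iterative Sinkhorn normalization relaxes those scores into an approximately doubly-stochastic (soft permutation) matrix, the permutation re-sorts key/value blocks, and attention is then computed only within local block windows. Class and argument names such as SparseSinkhornAttentionSketch, block_size, and sinkhorn_iters are illustrative assumptions, not the paper's API.

import torch
import torch.nn as nn


def sinkhorn_normalize(log_scores: torch.Tensor, n_iters: int = 8) -> torch.Tensor:
    """Alternately normalize rows and columns in log space to approximate
    a doubly-stochastic (soft permutation) matrix."""
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-1, keepdim=True)  # rows
        log_scores = log_scores - torch.logsumexp(log_scores, dim=-2, keepdim=True)  # columns
    return log_scores.exp()


class SparseSinkhornAttentionSketch(nn.Module):
    """Illustrative single-head block-sparse attention with Sinkhorn block sorting."""

    def __init__(self, dim: int, block_size: int, sinkhorn_iters: int = 8):
        super().__init__()
        self.block_size = block_size
        self.sinkhorn_iters = sinkhorn_iters
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        # Meta sorting network: maps pooled block summaries to block-to-block scores.
        self.sort_net = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        assert n % self.block_size == 0, "sequence length must be divisible by block size"
        nb = n // self.block_size

        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Summarize each block (mean pooling) and score block pairs.
        blocks = x.view(b, nb, self.block_size, d).mean(dim=2)         # (b, nb, d)
        scores = self.sort_net(blocks) @ blocks.transpose(-1, -2)      # (b, nb, nb)
        perm = sinkhorn_normalize(scores, self.sinkhorn_iters)         # relaxed permutation

        # Re-sort key/value blocks with the soft permutation.
        k_blocks = k.view(b, nb, self.block_size, d)
        v_blocks = v.view(b, nb, self.block_size, d)
        k_sorted = torch.einsum('bij,bjld->bild', perm, k_blocks)
        v_sorted = torch.einsum('bij,bjld->bild', perm, v_blocks)

        # Local attention: each query block attends to its own block plus the
        # block routed to it by the permutation (a quasi-global receptive field
        # at the memory cost of local windows).
        q_blocks = q.view(b, nb, self.block_size, d)
        kv_k = torch.cat([k_blocks, k_sorted], dim=2)
        kv_v = torch.cat([v_blocks, v_sorted], dim=2)

        attn = torch.softmax(q_blocks @ kv_k.transpose(-1, -2) / d ** 0.5, dim=-1)
        out = attn @ kv_v
        return out.reshape(b, n, d)


# Illustrative usage:
#   layer = SparseSinkhornAttentionSketch(dim=64, block_size=16)
#   out = layer(torch.randn(2, 128, 64))   # (batch, seq_len, dim)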

Related research

03/12/2020 | Efficient Content-Based Sparse Attention with Routing Transformers
Self-attention has recently been adopted for a wide range of sequence mo...

05/28/2021 | Linear-Time Self Attention with Codeword Histogram for Efficient Recommendation
Self-attention has become increasingly popular in a variety of sequence ...

06/05/2020 | GMAT: Global Memory Augmentation for Transformers
Transformer-based models have become ubiquitous in natural language proc...

04/23/2019 | Generating Long Sequences with Sparse Transformers
Transformers are powerful sequence models, but require time and memory t...

06/23/2021 | Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
State-of-the-art models in natural language processing rely on separate ...

10/22/2018 | Learning sparse transformations through backpropagation
Many transformations in deep learning architectures are sparsely connect...

07/18/2019 | Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time
A key requirement in sequence to sequence processing is the modeling of ...