You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

11/18/2021
by   Zhanpeng Zeng, et al.
0

Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear. We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at https://github.com/mlpen/YOSO

READ FULL TEXT
research
02/07/2021

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

Transformers have emerged as a powerful tool for a broad range of natura...
research
10/06/2021

PoNet: Pooling Network for Efficient Token Mixing in Long Sequences

Transformer-based models have achieved great success in various NLP, vis...
research
05/24/2023

Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

The transformer model is known to be computationally demanding, and proh...
research
05/28/2021

Linear-Time Self Attention with Codeword Histogram for Efficient Recommendation

Self-attention has become increasingly popular in a variety of sequence ...
research
11/17/2019

MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning

In sequence to sequence learning, the self-attention mechanism proves to...
research
10/19/2021

Inductive Biases and Variable Creation in Self-Attention Mechanisms

Self-attention, an architectural motif designed to model long-range inte...
research
10/27/2022

Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost

To overcome the quadratic cost of self-attention, recent works have prop...

Please sign up or login with your details

Forgot password? Click here to reset