Random Feature Attention

03/03/2021
by Hao Peng, et al.

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.
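The core idea described above, replacing the exp(q·k) terms of softmax attention with an inner product of random feature maps so that attention over a sequence can be computed in linear time and constant memory per query, can be illustrated with a short sketch. The NumPy code below is a minimal illustration written for this summary, not the authors' released implementation: the function names (random_features, rfa_attention), the unit-normalization of queries and keys, and the toy shapes are assumptions based on the abstract's description, and the optional gating mechanism for recency bias is omitted.

```python
import numpy as np

def random_features(x, W):
    """Trigonometric random features approximating the Gaussian kernel.
    Inputs are assumed L2-normalized, so the exp(||x||^2 / 2) factors of the
    softmax kernel become constants that cancel in the attention normalizer."""
    proj = x @ W.T                                   # (seq, D)
    D = W.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(D)

def rfa_attention(q, k, v, W, eps=1e-6):
    """Linear time/space approximation of softmax(q k^T) v:
    phi(q) @ (phi(k)^T v)  /  phi(q) @ (sum over phi(k))."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    phi_q = random_features(q, W)                    # (tgt, 2D)
    phi_k = random_features(k, W)                    # (src, 2D)
    S = phi_k.T @ v                                  # (2D, d_v), summed over source positions
    z = phi_k.sum(axis=0)                            # (2D,)
    return (phi_q @ S) / (phi_q @ z + eps)[:, None]  # (tgt, d_v)

# Toy usage: shapes and sizes here are illustrative only.
rng = np.random.default_rng(0)
d, D, n = 64, 256, 128
W = rng.normal(size=(D, d))                          # random projections w_i ~ N(0, I)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = rfa_attention(q, k, v, W)
```

Because the summary statistics S and z are simple sums over source positions, they can be updated incrementally during decoding, which is where the abstract's claim of faster decoding and lower memory footprint on long sequences comes from.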

