Rethinking Attention with Performers

09/30/2020
by Krzysztof Choromanski, et al.

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel prediction through text models to protein sequence modeling. We demonstrate results competitive with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
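The key idea behind FAVOR+ is a positive random-feature map that approximates the softmax kernel, so the attention matrix never has to be materialized and attention can be computed as a product of low-rank factors. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function names, the eps stabilizer, the num_features default, and the i.i.d. Gaussian projection (the paper additionally orthogonalizes the projection rows) are illustrative assumptions, and only non-causal attention is shown.

import numpy as np

def positive_random_features(x, projection, eps=1e-6):
    # Positive random-feature map for the softmax kernel:
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), a strictly positive
    # estimator of the exponentiated dot-product kernel.
    m = projection.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ projection.T - sq_norm) / np.sqrt(m) + eps

def favor_plus_attention(q, k, v, num_features=256, seed=0):
    # Linear-complexity approximation of softmax attention for q, k, v
    # of shape (seq_len, d). An i.i.d. Gaussian projection is used here;
    # orthogonal random features would further reduce estimator variance.
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((num_features, d))
    # Pre-scaling by d**-0.25 reproduces the usual q k^T / sqrt(d) logits.
    q_prime = positive_random_features(q * d ** -0.25, projection)  # (L, m)
    k_prime = positive_random_features(k * d ** -0.25, projection)  # (L, m)
    # Associativity: Q'(K'^T V) costs O(L m d) instead of the O(L^2 d)
    # needed to form the full L x L attention matrix.
    kv = k_prime.T @ v                           # (m, d)
    normalizer = q_prime @ k_prime.sum(axis=0)   # (L,) softmax denominators
    return (q_prime @ kv) / normalizer[:, None]

Because matrix multiplication is associative, k_prime.T @ v is formed first, which is what turns the quadratic dependence on sequence length into a linear one.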


