Rethinking Attention with Performers

by Krzysztof Choromanski, et al.

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
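To make the abstract concrete, below is a minimal NumPy sketch of bidirectional (non-causal) FAVOR+-style attention: positive random features that approximate the softmax kernel, block-orthogonal random projections, and matrix products reordered so that cost is linear in sequence length. This is a sketch under stated assumptions, not the authors' reference implementation; the function names (`orthogonal_gaussian`, `positive_softmax_features`, `performer_attention`) and the feature count `m` are illustrative choices.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """m random projection rows, orthogonal within each d-row block and rescaled
    so their lengths match those of i.i.d. Gaussian vectors."""
    blocks = []
    while sum(b.shape[0] for b in blocks) < m:
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # square orthogonal matrix
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:m]
    row_norms = np.sqrt(rng.chisquare(df=d, size=m))        # norms of d-dim Gaussian vectors
    return w * row_norms[:, None]

def positive_softmax_features(x, w):
    """Positive random features phi(x) with phi(q) @ phi(k) ~= exp(q @ k)."""
    m = w.shape[0]
    return np.exp(x @ w.T - np.sum(x ** 2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def performer_attention(Q, K, V, m=256, seed=0):
    """Approximates softmax(Q K^T / sqrt(d)) V in O(L * m * d) time and memory."""
    L, d = Q.shape
    w = orthogonal_gaussian(m, d, np.random.default_rng(seed))
    scale = d ** -0.25                        # folds the 1/sqrt(d) of softmax attention into Q and K
    q_prime = positive_softmax_features(Q * scale, w)        # (L, m)
    k_prime = positive_softmax_features(K * scale, w)        # (L, m)
    kv = k_prime.T @ V                        # (m, d_v): the L x L attention matrix is never formed
    out = q_prime @ kv                        # (L, d_v)
    normalizer = q_prime @ k_prime.sum(axis=0)                # row sums of the implicit attention matrix
    return out / normalizer[:, None]

# Sanity check against exact softmax attention on a short sequence.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    L, d = 64, 16
    Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    print("max abs error:", np.abs(attn @ V - performer_attention(Q, K, V, m=1024)).max())
```

For causal (unidirectional) attention the same feature maps apply, but the `k_prime.T @ V` and normalizer terms become running prefix sums over positions rather than single matrix products.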





Related Research

Chefs' Random Tables: Non-Trigonometric Random Features

We introduce chefs' random tables (CRTs), a new class of non-trigonometr...

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

Transformer models have achieved state-of-the-art results across a diver...

Hybrid Random Features

We propose a new class of random feature methods for linearizing softmax...

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures...

Time-aware Large Kernel Convolutions

To date, most state-of-the-art sequence modelling architectures use atte...

Transformer with Fourier Integral Attentions

Multi-head attention empowers the recent success of transformers, the st...

Your Transformer May Not be as Powerful as You Expect

Relative Positional Encoding (RPE), which encodes the relative distance ...

Code Repositories


Simple NumPy implementation of the FAVOR+ attention mechanism.



TensorFlow implementation of a linear attention architecture.



PyTorch implementation of the Performer from the paper "Rethinking Attention with Performers".



Implementation of a Transformer-based architecture in PyTorch.



Lyrics and Vocal Melody Generation, conditioned on Accompaniment
