Rethinking Attention with Performers

09/30/2020
by Krzysztof Choromanski, et al.

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
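The mechanism behind this linear complexity can be sketched briefly: FAVOR+ maps queries and keys through a positive random-feature map phi so that phi(q)^T phi(k) approximates exp(q^T k / sqrt(d)), and then reassociates the matrix products so that the L x L attention matrix is never materialized. The following minimal NumPy sketch illustrates the idea under those assumptions; the function name favor_plus_attention, the num_features parameter and the plain Gaussian projections are illustrative only (the paper additionally orthogonalizes the projection rows and adds numerical stabilization), so it should be read as an outline rather than the reference implementation.

import numpy as np

def favor_plus_attention(Q, K, V, num_features=256, rng=None):
    # Illustrative sketch of FAVOR+-style linear attention (not the official code).
    # Q, K: (L, d) queries/keys; V: (L, d_v) values.
    rng = np.random.default_rng() if rng is None else rng
    d = Q.shape[-1]
    m = num_features

    # Rescale so that exp(q.k / sqrt(d)) is the kernel being approximated.
    Q = Q / d ** 0.25
    K = K / d ** 0.25

    # Plain Gaussian projections; the paper orthogonalizes blocks of rows
    # (e.g. via a QR decomposition) to reduce the estimator's variance.
    W = rng.standard_normal((m, d))

    def phi(X):
        # Positive random features: exp(Wx - ||x||^2 / 2) / sqrt(m),
        # a non-negative, unbiased estimator of the softmax kernel exp(x.y).
        return np.exp(X @ W.T - 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

    Q_prime, K_prime = phi(Q), phi(K)           # (L, m)
    KV = K_prime.T @ V                          # (m, d_v): cost linear in L
    numerator = Q_prime @ KV                    # (L, d_v)
    normalizer = Q_prime @ K_prime.sum(axis=0)  # (L,) row sums of the implicit attention matrix
    return numerator / normalizer[:, None]

For Q, K of shape (L, d) and V of shape (L, d_v), the cost is O(L * m * (d + d_v)) rather than O(L^2 * d), which is where the linear scaling in sequence length comes from. This sketch covers bidirectional attention only; the paper handles the causal (unidirectional) case with a prefix-sum variant of the same reordering.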

Related Research

05/30/2022

Chefs' Random Tables: Non-Trigonometric Random Features

We introduce chefs' random tables (CRTs), a new class of non-trigonometr...
06/05/2020

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

Transformer models have achieved state-of-the-art results across a diver...
10/08/2021

Hybrid Random Features

We propose a new class of random feature methods for linearizing softmax...
08/30/2019

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures...
02/08/2020

Time-aware Large Kernel Convolutions

To date, most state-of-the-art sequence modelling architectures use atte...
06/01/2022

Transformer with Fourier Integral Attentions

Multi-head attention empowers the recent success of transformers, the st...
05/26/2022

Your Transformer May Not be as Powerful as You Expect

Relative Positional Encoding (RPE), which encodes the relative distance ...

Code Repositories

performer

Simple NumPy implementation of the FAVOR+ attention mechanism, https://teddykoker.com/2020/11/performers/

performer

TensorFlow implementation of a linear attention architecture

Performer-Pytorch

PyTorch implementation of Performer from the paper "Rethinking Attention with Performers".

Transformer-Architectures-From-Scratch

Implementation of transformer-based architectures in PyTorch.

thesis

Lyrics and Vocal Melody Generation, conditioned on Accompaniment

