Self-attention Does Not Need O(n^2) Memory

12/10/2021
by Markus N. Rabe, et al.

We present a very simple algorithm for attention that requires O(1) memory with respect to sequence length and an extension to self-attention that requires O(log n) memory. This is in contrast with the frequently stated belief that self-attention requires O(n^2) memory. While the time complexity is still O(n^2), device memory rather than compute capability is often the limiting factor on modern accelerators. Thus, reducing the memory requirements of attention allows processing of longer sequences than might otherwise be feasible. We provide a practical implementation for accelerators that requires O(√(n)) memory, is numerically stable, and is within a few percent of the runtime of the standard implementation of attention. We also demonstrate how to differentiate the function while remaining memory-efficient. For sequence length 16384, the memory overhead of self-attention is reduced by 59X for inference and by 32X for differentiation.
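
The core idea is that the softmax in attention can be evaluated lazily: keys and values are processed in fixed-size chunks while a running maximum and running sum keep the partial results numerically stable, so the full n-by-n score matrix is never materialized. Below is a minimal JAX sketch of that chunking trick, not the paper's reference implementation: the name chunked_attention and the chunk size are arbitrary choices, only the key/value axis is chunked here (the paper also chunks queries to reach its O(√n) accelerator bound), and the memory-efficient backward pass is not shown.

import jax
import jax.numpy as jnp

def chunked_attention(q, k, v, chunk_size=128):
    """Illustrative chunked attention. q: [n_q, d]; k, v: [n_kv, d] -> [n_q, d].

    Keys/values are consumed chunk by chunk, so only n_q x chunk_size
    attention scores exist at any one time instead of n_q x n_kv.
    """
    n_kv, d = k.shape
    scale = 1.0 / jnp.sqrt(d)

    def body(carry, kv_chunk):
        acc, row_sum, row_max = carry            # running numerator, denominator, max
        k_c, v_c = kv_chunk                      # [chunk_size, d] each
        s = (q @ k_c.T) * scale                  # scores for this chunk: [n_q, chunk_size]
        new_max = jnp.maximum(row_max, jnp.max(s, axis=-1, keepdims=True))
        correction = jnp.exp(row_max - new_max)  # rescale earlier contributions to the new max
        p = jnp.exp(s - new_max)
        acc = acc * correction + p @ v_c
        row_sum = row_sum * correction + jnp.sum(p, axis=-1, keepdims=True)
        return (acc, row_sum, new_max), None

    # Assumes chunk_size divides n_kv evenly; pad the sequence otherwise.
    k_chunks = k.reshape(n_kv // chunk_size, chunk_size, d)
    v_chunks = v.reshape(n_kv // chunk_size, chunk_size, d)
    init = (jnp.zeros_like(q),
            jnp.zeros((q.shape[0], 1), q.dtype),
            jnp.full((q.shape[0], 1), -jnp.inf, q.dtype))
    (acc, row_sum, _), _ = jax.lax.scan(body, init, (k_chunks, v_chunks))
    return acc / row_sum

# Sanity check: agrees with the standard O(n^2)-memory implementation up to float error.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(k1, (1024, 64))
k = jax.random.normal(k2, (1024, 64))
v = jax.random.normal(k3, (1024, 64))
out = chunked_attention(q, k, v)
ref = jax.nn.softmax((q @ k.T) / jnp.sqrt(64.0)) @ v  # quadratic-memory baseline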

Related research

12/27/2019 · Is Attention All What You Need? – An Empirical Investigation on Convolution-Based Active Memory and Self-Attention
The key to a Transformer model is the self-attention mechanism, which al...

03/02/2022 · DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
Since their introduction the Transformer architectures emerged as the dom...

12/21/2020 · Sub-Linear Memory: How to Make Performers SLiM
The Transformer architecture has revolutionized deep learning on sequent...

03/12/2020 · Efficient Content-Based Sparse Attention with Routing Transformers
Self-attention has recently been adopted for a wide range of sequence mo...

03/22/2020 · SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection
While the self-attention mechanism has been widely used in a wide variet...

05/31/2023 · Recasting Self-Attention with Holographic Reduced Representations
In recent years, self-attention has become the dominant paradigm for seq...

06/22/2020 · Limits to Depth Efficiencies of Self-Attention
Self-attention architectures, which are rapidly pushing the frontier in ...
