Sub-Linear Memory: How to Make Performers SLiM

12/21/2020
by Valerii Likhosherstov, et al.

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring O(L^2) serial time and memory as functions of the input length L. Recent works have proposed various linear self-attention mechanisms, scaling only as O(L) for serial computation. We perform a thorough analysis of Performers, a recent class of Transformers with linear self-attention, in terms of overall computational complexity. We observe a remarkable computational flexibility: forward and backward propagation can be performed with no approximations using sublinear memory as a function of L (in addition to negligible storage for the input sequence), at the cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only O(1) memory during training and still requires O(L) time. This time-memory tradeoff can be used for training or, due to complete backward compatibility, for fine-tuning on a low-memory device, e.g. a smartphone or an earlier-generation GPU, thus contributing towards decentralized and democratized deep learning.
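
The sketch below illustrates the core idea behind this tradeoff, not the authors' implementation: causal linear attention of the kind used in Performers can be evaluated left to right over chunks, carrying only a small running state between chunks, so activation memory is governed by the chunk size rather than by L. The feature map `phi` is a simple positive placeholder (a stand-in for the FAVOR+ random-feature map), and `chunk_size` is a hypothetical tuning knob.

```python
# Minimal NumPy sketch of chunked causal linear attention with a constant-size
# carried state. Assumptions: phi is an illustrative positive feature map, not
# the actual FAVOR+ mechanism; chunk_size is an arbitrary example value.
import numpy as np


def phi(x):
    """Illustrative positive feature map (placeholder for FAVOR+ features)."""
    return np.exp(x - x.max(axis=-1, keepdims=True))


def causal_linear_attention_chunked(Q, K, V, chunk_size=64):
    """Causal linear attention computed chunk by chunk.

    Q, K: (L, m) queries/keys before the feature map (m feature dimensions)
    V:    (L, d) values
    Only the running sums S (m x d) and z (m,) are carried between chunks, so
    peak activation memory is O(chunk_size * m * d), independent of L.
    """
    L, m = Q.shape
    d = V.shape[1]
    S = np.zeros((m, d))        # running sum of outer(phi(k_j), v_j) over j <= i
    z = np.zeros(m)             # running sum of phi(k_j) over j <= i
    out = np.empty((L, d))
    for start in range(0, L, chunk_size):
        end = min(start + chunk_size, L)
        q, k, v = phi(Q[start:end]), phi(K[start:end]), V[start:end]
        # Per-token outer products and their causal prefix sums within the chunk,
        # offset by the state carried in from all previous chunks.
        kv = np.einsum('cm,cd->cmd', k, v)           # (c, m, d)
        S_local = np.cumsum(kv, axis=0) + S
        z_local = np.cumsum(k, axis=0) + z
        num = np.einsum('cm,cmd->cd', q, S_local)    # numerator: q_i . S_i
        den = np.einsum('cm,cm->c', q, z_local)[:, None] + 1e-6
        out[start:end] = num / den
        # Carry the updated running sums into the next chunk.
        S = S_local[-1]
        z = z_local[-1]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, m, d = 512, 16, 32
    Q, K, V = rng.normal(size=(L, m)), rng.normal(size=(L, m)), rng.normal(size=(L, d))
    print(causal_linear_attention_chunked(Q, K, V).shape)  # (512, 32)
```

Roughly speaking, the sublinear-memory backward pass described in the abstract follows the same pattern: instead of storing per-chunk activations, they are recomputed from the carried state when gradients are needed, which is where the extra time cost comes from.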
