FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

05/27/2022
by Tri Dao, et al.

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware – accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy).
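
The tiling idea described in the abstract (computing attention over key/value blocks while carrying running softmax statistics, so the full score matrix is never materialized) can be sketched in a few lines of NumPy. The sketch below only illustrates the blockwise online-softmax recurrence that makes exact tiled attention possible; it is not the authors' fused CUDA kernel, and the function name tiled_attention, the block size, and the sanity check are choices made for this example.

# Illustrative sketch only: blockwise exact attention with an online softmax.
# The N x N score matrix is never formed; each key/value tile is processed once
# while running max / running sum statistics keep the softmax numerically exact.
import numpy as np

def tiled_attention(Q, K, V, block_size=128):
    """Exact softmax attention computed one K/V tile at a time."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)             # running weighted sum of values
    row_max = np.full(N, -np.inf)      # running max of scores (for stability)
    row_sum = np.zeros(N)              # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                      # scores for this tile only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)      # rescale earlier partial results
        P = np.exp(S - new_max[:, None])            # unnormalized probabilities

        out = out * correction[:, None] + P @ Vb
        row_sum = row_sum * correction + P.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]                   # normalize once at the end

# Sanity check against standard (materialized) attention.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
    S = Q @ K.T / np.sqrt(64)
    ref = np.exp(S - S.max(axis=1, keepdims=True))
    ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
    assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)

In FlashAttention itself this recurrence runs inside a fused GPU kernel, so each tile of K and V is read from HBM into SRAM only once and the output is written back once, which is where the claimed reduction in HBM accesses comes from.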

Related research

06/29/2022
SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences
The attention mechanisms of transformers effectively extract pertinent i...

07/17/2023
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Scaling Transformers to longer sequence lengths has been a major problem...

04/23/2019
Generating Long Sequences with Sparse Transformers
Transformers are powerful sequence models, but require time and memory t...

02/05/2023
KDEformer: Accelerating Transformers via Kernel Density Estimation
Dot-product attention mechanism plays a crucial role in modern deep arch...

07/09/2020
Fast Transformers with Clustered Attention
Transformers have been proven a successful model for a variety of tasks ...

03/17/2021
Value-aware Approximate Attention
Following the success of dot-product attention in Transformers, numerous...

11/07/2019
Blockwise Self-Attention for Long Document Understanding
We present BlockBERT, a lightweight and efficient BERT model that is des...
