EL-Attention: Memory Efficient Lossless Attention for Generation

by Yu Yan, et al.

Transformer models with multi-head attention require caching intermediate results for efficient inference in generation tasks. However, the cache brings new memory-related costs and prevents the use of larger batch sizes for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids heavy operations for building multi-head keys and values, and does not require a cache. EL-attention constructs an ensemble of attention results by expanding the query while keeping the key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show that EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
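The claim that expanding the query while sharing the key and value gives the same result as multi-head attention can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes single-query decoding, bias-free projections, and no output projection, and all names (`multi_head_attention`, `shared_kv_attention`, `Wq`, `Wk`, `Wv`) are hypothetical. It relies only on two identities: (q Wq)(X Wk)^T = (q Wq Wk^T) X^T, and softmax(.)(X Wv) = (softmax(.) X) Wv.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_head_attention(q, X, Wq, Wk, Wv):
    # q: (d_model,) current query state; X: (n, d_model) source states.
    # Wq, Wk, Wv: (heads, d_model, d_head) per-head projections (no biases).
    d_head = Wq.shape[-1]
    outs = []
    for i in range(Wq.shape[0]):
        qi = q @ Wq[i]                           # (d_head,)
        Ki = X @ Wk[i]                           # (n, d_head) per-head keys
        Vi = X @ Wv[i]                           # (n, d_head) per-head values
        w = softmax(Ki @ qi / np.sqrt(d_head))   # (n,) attention weights
        outs.append(w @ Vi)                      # (d_head,)
    return np.concatenate(outs)

def shared_kv_attention(q, X, Wq, Wk, Wv):
    # Same inputs, but X itself serves as the shared key and value.
    d_head = Wq.shape[-1]
    outs = []
    for i in range(Wq.shape[0]):
        q_exp = (q @ Wq[i]) @ Wk[i].T            # (d_model,) expanded query
        w = softmax(X @ q_exp / np.sqrt(d_head)) # (n,) same weights as above
        ctx = w @ X                              # (d_model,) context over shared values
        outs.append(ctx @ Wv[i])                 # (d_head,) project after attending
    return np.concatenate(outs)

# Small random example (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
heads, d_model, d_head, n = 4, 16, 4, 10
q = rng.standard_normal(d_model)
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv = (rng.standard_normal((heads, d_model, d_head)) for _ in range(3))

out_mha = multi_head_attention(q, X, Wq, Wk, Wv)
out_shared = shared_kv_attention(q, X, Wq, Wk, Wv)
max_diff = float(np.abs(out_mha - out_shared).max())
```

In this sketch the second function never materializes per-head keys or values of shape (n, d_head), so only X would need to be kept across decoding steps. The two outputs should agree up to floating-point rounding.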




Fast Transformer Decoding: One Write-Head is All You Need

Multi-head attention layers, as used in the Transformer neural sequence ...

Sharing Attention Weights for Fast Transformer

Recently, the Transformer machine translation system has shown strong re...

PairConnect: A Compute-Efficient MLP Alternative to Attention

Transformer models have demonstrated superior performance in natural lan...

Boosting Mobile CNN Inference through Semantic Memory

Human brains are known to be capable of speeding up visual recognition o...

FastSeq: Make Sequence Generation Faster

Transformer-based models have made tremendous impacts in natural languag...

SqueezeNeRF: Further factorized FastNeRF for memory-efficient inference

Neural Radiance Fields (NeRF) has emerged as the state-of-the-art method...

Compositional Attention: Disentangling Search and Retrieval

Multi-head, key-value attention is the backbone of the widely successful...