
Value-aware Approximate Attention

by Ankit Gupta et al.

Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. However, all approximations thus far have ignored the contribution of the value vectors to the quality of the approximation. In this work, we argue that research efforts should be directed towards approximating the true output of the attention sub-layer, which includes the value vectors. We propose a value-aware objective, and show theoretically and empirically that an optimal approximation of a value-aware objective substantially outperforms an optimal approximation that ignores values, in the context of language modeling. Moreover, we show that the choice of kernel function for computing attention similarity can substantially affect the quality of sparse approximations, where kernel functions that are less skewed are more affected by the value vectors.
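To make the distinction concrete, here is a minimal NumPy sketch contrasting the two objectives on a toy top-k sparse approximation. The top-k scheme and the weight-times-value-norm selection rule below are illustrative stand-ins, not the paper's exact method: a value-oblivious objective measures error on the attention matrix A itself, while a value-aware objective measures error on the actual sub-layer output AV.

```python
import numpy as np

def attention_matrix(Q, K):
    # Standard dot-product attention weights: softmax(QK^T / sqrt(d))
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    return A / A.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 8, 4, 2
Q, K, V = rng.normal(size=(3, n, d))

A = attention_matrix(Q, K)
out = A @ V  # the true output of the attention sub-layer

# Value-oblivious sparsification: keep the k largest weights per query row.
thresh = np.sort(A, axis=-1)[:, [-k]]            # k-th largest weight per row
A_topk = np.where(A >= thresh, A, 0.0)
A_topk = A_topk / A_topk.sum(axis=-1, keepdims=True)

# Two ways to score the same approximation:
err_matrix = np.linalg.norm(A - A_topk)          # value-oblivious objective
err_output = np.linalg.norm(out - A_topk @ V)    # value-aware objective

# A value-aware selection rule (illustrative): rank entries by
# attention weight times the norm of the corresponding value vector,
# so keys with large values are preferred even at lower weights.
weighted = A * np.linalg.norm(V, axis=-1)
thresh_va = np.sort(weighted, axis=-1)[:, [-k]]
A_va = np.where(weighted >= thresh_va, A, 0.0)
A_va = A_va / A_va.sum(axis=-1, keepdims=True)
err_va = np.linalg.norm(out - A_va @ V)
```

The point of the sketch is that `err_matrix` and `err_output` rank approximations differently: a sparsification that is optimal for the attention matrix need not be optimal for AV, which is what downstream layers actually consume.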



