
Value-aware Approximate Attention

03/17/2021
by Ankit Gupta, et al.

Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. However, all approximations thus far have ignored the contribution of the value vectors to the quality of approximation. In this work, we argue that research efforts should be directed towards approximating the true output of the attention sub-layer, which includes the value vectors. We propose a value-aware objective, and show theoretically and empirically that an optimal approximation of a value-aware objective substantially outperforms an optimal approximation that ignores values, in the context of language modeling. Moreover, we show that the choice of kernel function for computing attention similarity can substantially affect the quality of sparse approximations, where kernel functions that are less skewed are more affected by the value vectors.
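
To make the distinction concrete, here is a minimal NumPy sketch (not taken from the paper; the function names and the top-k selection rules are illustrative assumptions). A value-agnostic sparse approximation keeps the positions with the largest attention weights a_j, whereas a value-aware one keeps the positions with the largest contributions ||a_j v_j|| to the true output sum_j a_j v_j, so a large weight attached to a near-zero value vector is not wasted budget.

```python
import numpy as np

def softmax_weights(q, K):
    """Softmax attention weights for a single query q (shape d,) over keys K (shape n, d)."""
    scores = K @ q / np.sqrt(q.shape[-1])
    a = np.exp(scores - scores.max())
    return a / a.sum()

def attention_output(q, K, V):
    """Exact dot-product attention output: sum_j a_j * v_j."""
    return softmax_weights(q, K) @ V

def topk_value_agnostic(q, K, V, k):
    """Keep the k largest attention weights, ignoring the values
    (the objective that prior sparse approximations implicitly target)."""
    a = softmax_weights(q, K)
    idx = np.argsort(a)[-k:]
    return a[idx] @ V[idx]

def topk_value_aware(q, K, V, k):
    """Hypothetical value-aware selection: keep the k positions whose
    contribution norm ||a_j * v_j|| to the true output is largest."""
    a = softmax_weights(q, K)
    contrib = a * np.linalg.norm(V, axis=-1)
    idx = np.argsort(contrib)[-k:]
    return a[idx] @ V[idx]
```

Comparing `np.linalg.norm(attention_output(q, K, V) - topk_value_aware(q, K, V, k))` against the value-agnostic variant on inputs where the value norms vary widely illustrates the gap the paper studies; the exact objective and guarantees are given in the full text.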

Related research

06/13/2021 · Memory-efficient Transformers via Top-k Attention
Following the success of dot-product attention in Transformers, numerous...

09/24/2021 · Predicting Attention Sparsity in Transformers
A bottleneck in transformer architectures is their quadratic complexity ...

02/21/2022 · Transformer Quality in Linear Time
We revisit the design choices in Transformers, and propose methods to ad...

05/27/2022 · FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Transformers are slow and memory-hungry on long sequences, since the tim...

07/09/2020 · Fast Transformers with Clustered Attention
Transformers have been proven a successful model for a variety of tasks ...

06/05/2020 · GMAT: Global Memory Augmentation for Transformers
Transformer-based models have become ubiquitous in natural language proc...

06/25/2022 · A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel
Empirical neural tangent kernels (eNTKs) can provide a good understandin...