Value-aware Approximate Attention

03/17/2021
by Ankit Gupta, et al.

Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. However, all approximations thus far have ignored the contribution of the value vectors to the quality of approximation. In this work, we argue that research efforts should be directed towards approximating the true output of the attention sub-layer, which includes the value vectors. We propose a value-aware objective, and show theoretically and empirically that an optimal approximation of a value-aware objective substantially outperforms an optimal approximation that ignores values, in the context of language modeling. Moreover, we show that the choice of kernel function used to compute attention similarity can substantially affect the quality of sparse approximations, with less-skewed kernel functions being more affected by the value vectors.
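To make the distinction concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) contrasting a value-oblivious top-k sparse approximation, which keeps the keys with the largest attention weights a_j, with a value-aware one that ranks keys by a_j multiplied by the norm of the corresponding value vector. The function names and the toy setup are assumptions for illustration; the point follows the abstract: when some low-probability keys carry large value vectors, selecting by attention weight alone yields a larger error in the attention output.

```python
import numpy as np

def attention_output(scores, V):
    """Exact attention output softmax(scores) @ V for a single query."""
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ V, a

def topk_approx(a, V, k, value_aware):
    """Keep k attention entries, renormalize, return the approximate output.

    value_aware=False ranks keys by the attention weight a_j alone;
    value_aware=True ranks by a_j * ||v_j||, preferring keys whose value
    vectors contribute more to the output.
    """
    ranking = a * np.linalg.norm(V, axis=1) if value_aware else a
    keep = np.argsort(ranking)[-k:]
    a_sparse = np.zeros_like(a)
    a_sparse[keep] = a[keep]
    a_sparse /= a_sparse.sum()
    return a_sparse @ V

rng = np.random.default_rng(0)
n, d, k = 64, 16, 8
scores = rng.normal(size=n)
# Give some low-scoring keys large value vectors so the two criteria diverge.
V = rng.normal(size=(n, d)) * rng.uniform(0.1, 5.0, size=(n, 1))

out, a = attention_output(scores, V)
for flag in (False, True):
    approx = topk_approx(a, V, k, value_aware=flag)
    print(f"value_aware={flag}: output error = {np.linalg.norm(out - approx):.4f}")
```

In this toy setting the value-aware selection typically reduces the output error, because the objective it (greedily) optimizes is the error of the attention sub-layer's output rather than the error of the attention weights themselves.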


Related research

Memory-efficient Transformers via Top-k Attention (06/13/2021)
Following the success of dot-product attention in Transformers, numerous...

Predicting Attention Sparsity in Transformers (09/24/2021)
A bottleneck in transformer architectures is their quadratic complexity ...

Transformer Quality in Linear Time (02/21/2022)
We revisit the design choices in Transformers, and propose methods to ad...

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (05/27/2022)
Transformers are slow and memory-hungry on long sequences, since the tim...

FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features (02/01/2023)
The problem of efficient approximation of a linear operator induced by t...

Fast Transformers with Clustered Attention (07/09/2020)
Transformers have been proven a successful model for a variety of tasks ...

Kernel-independent adaptive construction of ℋ^2-matrix approximations (06/02/2020)
A method for the kernel-independent construction of ℋ^2-matrix approxima...
