Memory-efficient Transformers via Top-k Attention

06/13/2021
by Ankit Gupta, et al.

Following the success of dot-product attention in Transformers, numerous approximations have recently been proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to use them directly with popular pre-trained language models trained using vanilla attention without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks and, for each query, compute the top-k scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA, (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of the top-k approximation for multi-head attention layers on the Long Range Arena Benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups, including training from scratch, fine-tuning, and zero-shot inference.
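The chunked top-k idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch implementation for a single attention head, assuming inputs of shape (sequence, head_dim); the function name and the chunk_size/k defaults are illustrative assumptions, not the authors' released code.

```python
import torch

def topk_attention(Q, K, V, k, chunk_size=1024):
    """Sketch of chunked top-k attention: for each chunk of queries, keep only
    the k largest query-key scores and attend over those keys alone."""
    d = Q.shape[-1]
    outputs = []
    for q in torch.split(Q, chunk_size, dim=0):         # process queries in chunks
        scores = (q @ K.transpose(-1, -2)) / d ** 0.5   # (chunk, num_keys) scaled dot products
        top_scores, top_idx = scores.topk(k, dim=-1)    # keep the k largest scores per query
        weights = torch.softmax(top_scores, dim=-1)     # softmax over the surviving scores only
        top_values = V[top_idx]                         # gather matching values: (chunk, k, head_dim)
        outputs.append(torch.einsum("qk,qkd->qd", weights, top_values))
    return torch.cat(outputs, dim=0)

# Example: 4096 tokens, head dimension 64, keeping the top 128 keys per query.
Q, K, V = (torch.randn(4096, 64) for _ in range(3))
out = topk_attention(Q, K, V, k=128)                    # shape (4096, 64)
```

Because only the k selected scores and their values are materialized per query chunk, peak memory scales with chunk_size * k rather than with the square of the sequence length.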


Related research

- Fast Transformers with Clustered Attention (07/09/2020): Transformers have been proven a successful model for a variety of tasks ...
- Value-aware Approximate Attention (03/17/2021): Following the success of dot-product attention in Transformers, numerous...
- SMYRF: Efficient Attention using Asymmetric Clustering (10/11/2020): We propose a novel type of balanced clustering algorithm to approximate ...
- On-demand compute reduction with stochastic wav2vec 2.0 (04/25/2022): Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture ...
- LV-BERT: Exploiting Layer Variety for BERT (06/22/2021): Modern pre-trained language models are mostly built upon backbones stack...
- CNN with large memory layers (01/27/2021): This work is centred around the recently proposed product key memory str...
