Linearizing Transformer with Key-Value Memory Bank

by Yizhe Zhang et al.

Transformers have brought great success to a wide range of natural language processing tasks. Nevertheless, the computational overhead of the vanilla transformer scales quadratically with sequence length. Many efforts have been made to develop more efficient transformer variants. One line of work (e.g., Linformer) projects the input sequence into a low-rank space, achieving linear time complexity. However, Linformer is not well suited to text generation tasks, as the sequence length must be pre-specified. We propose MemSizer, an approach that also projects the source sequence into a lower-dimensional representation but can accept inputs of dynamic length, taking a different perspective on the attention mechanism. MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation, which yields constant memory complexity and reduced computation at inference. We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer and other linear variants on language modeling and machine translation tasks, revealing a viable direction toward further inference efficiency improvements.
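To illustrate the recurrent-style generation the abstract refers to, here is a minimal sketch of generic linear attention in its recurrent form (in the style of Katharopoulos et al.'s linear transformers, not MemSizer's exact key-value memory-bank formulation). The key idea is that a kernelized attention can be rewritten so the model carries a fixed-size state across decoding steps, giving constant memory at inference. The feature map `elu_feature` and the numerical epsilon are illustrative choices, not details from the paper.

```python
import numpy as np

def elu_feature(x):
    # Feature map phi(x) = elu(x) + 1, a common positive feature
    # map used in linear attention (an assumption for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_linear_attention(queries, keys, values):
    """Causal linear attention computed recurrently.

    Instead of attending over all past tokens (quadratic in length),
    we keep a fixed-size memory S (d_k x d_v) and a normalizer z (d_k),
    both updated once per step -- constant memory during generation.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_k, d_v))  # accumulated key-value memory
    z = np.zeros(d_k)         # accumulated normalizer
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = elu_feature(k)
        S += np.outer(phi_k, v)   # fold the new key-value pair into memory
        z += phi_k
        phi_q = elu_feature(q)
        # Equivalent to softmax-free attention over all tokens seen so far.
        outputs.append((phi_q @ S) / (phi_q @ z + 1e-6))
    return np.array(outputs)
```

Because `S` and `z` have shapes that depend only on the hidden dimensions, each decoding step costs the same regardless of how many tokens have been generated, which is the property that makes this family attractive for autoregressive inference.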



