Memformer: The Memory-Augmented Transformer

10/14/2020
by Qingyang Wu, et al.

Transformer models have achieved remarkable success in various NLP tasks. However, these models have efficiency issues on long sequences, as the complexity of their self-attention module scales quadratically with the sequence length. To remedy this limitation, we present Memformer, a novel language model that utilizes a single unified memory to encode and retrieve past information. It includes a new optimization scheme, Memory Replay Back-Propagation, which promotes long-range back-propagation through time with a significantly reduced memory requirement. Memformer achieves 𝒪(n) time complexity and 𝒪(1) space complexity in processing long sequences, meaning that the model can handle sequences of theoretically unbounded length during inference. Our model is also compatible with other self-supervised tasks to further improve the performance on language modeling. Experimental results show that Memformer outperforms previous long-range sequence models on WikiText-103, including Transformer-XL and the Compressive Transformer.
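The abstract only states the complexity claims at a high level: the sequence is processed segment by segment, each segment attends to a fixed number of external memory slots, and the slots are updated after every step, so compute grows linearly in sequence length while the memory footprint stays constant. The sketch below illustrates that read/write pattern in PyTorch. It is a minimal illustration under assumed names (MemoryAugmentedBlock, n_mem_slots, and the specific read/write cross-attention layout are my own), not the authors' implementation, and it omits the Memory Replay Back-Propagation training scheme entirely.

    import torch
    import torch.nn as nn

    class MemoryAugmentedBlock(nn.Module):
        """Toy encoder block that reads from and writes to a fixed set of memory slots."""

        def __init__(self, d_model=64, n_heads=4, n_mem_slots=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mem_read = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mem_write = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
            self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
            # Learned initial content for the fixed number of memory slots.
            self.init_memory = nn.Parameter(torch.randn(n_mem_slots, d_model))

        def forward(self, x, memory):
            # Self-attention is restricted to the current segment, so its cost is
            # quadratic in the segment length, not in the full sequence length.
            x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
            # Read: segment tokens retrieve past context from the memory slots.
            x = self.norm2(x + self.mem_read(x, memory, memory, need_weights=False)[0])
            x = self.norm3(x + self.ff(x))
            # Write: memory slots attend to the segment to encode new information;
            # the slot count never grows, giving constant space during inference.
            memory = memory + self.mem_write(memory, x, x, need_weights=False)[0]
            return x, memory

    if __name__ == "__main__":
        block = MemoryAugmentedBlock()
        batch, seg_len = 2, 16
        memory = block.init_memory.unsqueeze(0).expand(batch, -1, -1)
        for _ in range(5):  # stream an arbitrarily long input, one segment at a time
            segment = torch.randn(batch, seg_len, 64)
            out, memory = block(segment, memory)
        print(out.shape, memory.shape)  # (2, 16, 64) and (2, 8, 64)

Because only the current segment and the fixed-size memory are kept between steps, the loop above can run over any number of segments without the state growing, which is the sense in which space stays 𝒪(1) while time stays 𝒪(n).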

11/13/2019

Compressive Transformers for Long-Range Sequence Modelling

We present the Compressive Transformer, an attentive sequence model whic...
10/10/2021

DCT: Dynamic Compressive Transformer for Modeling Unbounded Sequence

In this paper, we propose Dynamic Compressive Transformer (DCT), a trans...
09/13/2020

Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

Transformer has become ubiquitous in the deep learning field. One of the...
07/07/2020

Do Transformers Need Deep Long-Range Memory?

Deep attention models have advanced the modelling of sequential data acr...
04/12/2021

Updater-Extractor Architecture for Inductive World State Representations

Developing NLP models traditionally involves two stages - training and a...
03/23/2022

Linearizing Transformer with Key-Value Memory Bank

Transformer has brought great success to a wide range of natural languag...
02/25/2019

Star-Transformer

Although the fully-connected attention-based model Transformer has achie...