are extremely successful in a wide range of natural language processing and other tasks. Due to the self-attention mechanism transformer layer can be trained to update a vector representation of every element with information aggregated over the whole sequence. As a result, rich contextual representation for every token is generated at the end of encoding. However, a combination of local and global information in the same vector has its limitations. Distributed storage of global features results in "blurring" and makes it harder to access them. Another well-known deficiency of Transformers is poor scaling of attention span that hurts its applications to long sequences.
In our work, we propose and study MemTransformer (Memory Transformer), a straightforward and very simple modification of the Transformer that has the potential to solve the above-mentioned problems. We augment the Transformer baseline by adding [mem] tokens at the beginning of the input sequence and train the model to see if it is able to use them as universal memory storage. To assess the capacity of proposed memory implementation, we additionally studied the MemBottleneck model that removes attention between sequence elements, thus making memory the only channel to access global information about the sequence.
Our work lies at the intersection of two research directions Memory-augmented neural networks (MANNs) and Transformers.
The history of memory augmentation in neural networks is pretty long. Classic Long-Short Term Memory (LSTM)  can be seen as a simple yet powerful form of fine-grained memory augmentation with a single memory value per LSTM cell and memory control logic implemented by internal learnable gates. Thus, in LSTMs, computations are heavily intertwined with memory. In contrast to that, memory-augmented neural networks incorporate external-memory, which decouples memory capacity from the number of model parameters. Neural Turing Machines (NTMs)  and Memory Networks  are among the best-knows MANNs that provide powerful random access operations over external memory. Memory Networks [23, 21] are trained to iteratively reason by combining sequence representation with embeddings in long-term memory with the help of attention. NTMs, and their successors Differentiable Neural Computer (DNC)  and Sparse DNC 
are recurrent neural networks equipped with a content-addressable memory, similar to Memory Networks, but with the additional capability to write to memory over time. The memory is accessed by a controller network, typically an LSTM. The full model is differentiable and can be trained via back-propagation through time (BPTT). There is also a line of work to equip neural networks (typically, LSTMs) with data structures like stacks, lists, or queues[13, 8]. MANN architectures with a more advanced addressing mechanisms such as address-content separation and multi-step addressing were proposed in [10, 9, 18].
Family of Transformer models have been recently applied to many deep learning tasks and proved to be extremely successful for the language modeling tasks. The core element of Transformers is self-attention that allows updating representation for every element with information aggregated over the whole sequence. Self-attention scales as with a sequence length, and as a result, it is severely limited in application to long sequences.
Several recent approaches try to solve this problem by adding some kinds of memory elements to their architecture. Transformer-XL  adds segment-level recurrence with state reuse, which can be seen as a sort of memory. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next segment. Compressive Transformer  extends the ideas of Transformer-XL by incorporating the second level of the memory into the architecture. Memory on the second level stores information from the short-term memory of the first level in compressed form. Memory Layers  replace a feed-forward layer with a product key memory layer, that can increase model capacity for a negligible computational cost.
Among the most recent transformers with global representations are Star-Transformer , Longformer , and Extended Transformer Construction (ETC) . All these architectures reduce full self-attention to some local or patterned attention and combine it with a sparse global attention bottleneck. For example, Longformer uses selected tokens such as [CLS] or tokens for question marks to accumulate and redistribute global information to all other elements of the sequence. Our MemTransformer and MemBottleneck Transformer models can be seen as more general limit cases for this class of models. Both have unspecific [mem] tokens that can store global or copy of local information. MemTransformer has full self-attention over the memory+input sequence. In contrast, MemBottleneck has full both-way attention between the input sequence and memory but no attention between sequence tokens.
2 Memory Transformer
2.1 Background: Transformer architecture
The core of the original Transformer architecture () is a scaled dot-product attention:
here , , and are queries, keys, and values which are the same in the case of encoder with self-attention, and is a scaling factor.
The Transformer uses multi-head attention with h heads:
here , , , and are matrices of learnable transformations of the input vectors.
Then the output of the multi-head attention sublayer with a residual connection became normalized. Then there is a normalized feed-forward sublayer with a residual connection:
The process of calculating self-attention can be seen as a two-step processing flow (see fig. 1a):
Self-attention. Calculate attention between all elements of the sequence.
Update. For every element of the sequence aggregate weighted representations of all other elements and perform further element-wise transformations.
2.2 Simple MemTransformer
The first proposed model is a simple extension of a baseline Transformer we call MemTransformer. It minimally changes the original model architecture.
The idea is to add special [mem]ory tokens to the standard input (see fig. 1b) then process them in a standard way. So, the input vectors became the concatenation of the memory token vectors and the original input token vectors :
This modification can be applied independently to encoder and/or decoder. The rest of the Transformer stays the same with the multi-head attention layer processing the extended input.
2.3 MemBottleneck Transformer
In the MemTransformer input and [mem] tokens are updated inside the same traditional self-attend and update processing flow. In this case, representations of the input sequence elements potentially might be updated “as usual” without attending to the content of the memory. Here, global information can propagate in a “peer to peer” manner. To block this distributed information flow and separate storage and processing of global and local representations, we add a memory bottleneck step to the traditional Transformer layer. The resulting MemBottleneck Transformer has two-staged processing flow (see fig. 1c).
1. Memory update.
First, calculate attention between every memory token and full sequence of memory and input (see Step 1 on the fig. 1c):
here is an output of and . Then update memory token representations with information aggregated from the sequence and the memory itself with a residual connection, and perform further element-wise transformations (see Step 2 on the fig. 1c):
2. Sequence update.
Calculate attention between sequence and memory (Step 3 on the fig. 1c):
Then update sequence token representations with information aggregated from memory only and perform further element-wise transformations (Step 4 on the fig. 1c):
In other words, the memory “attends” to itself and a sequence, and the sequence “attends” only to the memory. This should force the model to accumulate and re-distribute global information through memory. Computations for MemTransformer scales linearly with the size of the sequence if the size of the memory is constant when for the traditional transformer it scales as .
3 Results and discussion
We use a standard machine translation benchmark WMT-14 DE-EN 
in our experiments. One epoch of training covered a full train set of 4.5M pairs.
Two model sizes were studied, small with = 4 and base with = 6 layers in the encoder. The decoder has the same number of layers as the encoder. The small setup had parameters , and the base setup had . For all experiments batch size was 64. Dataset was tokenized with TFDS subword text encoder111https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/SubwordTextEncoder and dictionary of 32K per language. Training times on a single NVIDIA GTX 1080 Ti GPU were about 3.5 hours per epoch for small and 14.5 hours per epoch for base models.
Our main focus was to explore variations of memory augmented Transformer, so the majority of experiments were done on small models with just 4 layers in the encoder and 4 layers in the decoder to reduce the amount of necessary computation. Every modification was run 3 times to evaluate the reproducibility of training. All values reported in the paper are averaged over these 3 runs if the opposite is not stated.
3.1 Performance metrics
The main hypothesis of our study says that adding memory to multilayered encoder-decoder architectures should result in better performance for sequence processing tasks such as machine translation. Indeed, learning curves presented in Figure 2 show that simple MemTransformer with 5, 10, or 20 [mem] tokens has faster convergence compared to the Transformer baseline of the same size. This supports our intuition that self-attention could be trained to utilize representations of extra memory elements that are not related to the input sequence to improve the quality of encoding.
The MemTransformer results suggest that if memory extends but not intervene in the Transformer sequence processing, then it is beneficial. But to what extent processing of relations between elements of the sequence can be abstracted to memory? Experiments with MemBottleneck Transformer (see fig. 2a.) shows that it is possible, but performance suffers. This can be due to the more complex architecture of the MemBottleneck that has twice more layers in the encoder part (see fig. 1c.). So, it is more difficult and longer to train compared to baseline. On the other hand, degraded performance can also be attributed to the insufficient throughput of the memory bottleneck. So, there might be a trade-off between the size of the bottleneck and the complexity of learning a deeper network. From the experiments, we see that MemBottleneck 10 learns faster and has lower loss compared to MemBottleneck 20, which points to the complexity of training but not the bottleneck width as a major factor limiting performance.
An increase in the memory size can have opposite effects for the performance of MemTransformer model as well. Closer inspection of the learning curves for small models (inset of fig. 2a.) shows that up to 10-12 epochs performance correlates with memory size, but at the end of the training, the average loss is the lowest for the model with 5 [mem] tokens. For the base models with 6 layers (see fig. 2b), a bigger memory size gives an advantage over the first 10 epochs of training. We see that 4-layer models with large memory learn faster but still end up with the same scores as small memory versions. We can speculate that for the model, it is easier to learn with larger memory, but it is hard to utilize extra memory tokens.
To verify the training results, we calculated BLEU scores on the validation set to estimate the quality of translation produced by the models. BLEU scores reported in Table1 are in good agreement with the learning curves shown in Figure 2. After 20 epochs of training, small MemTransformer models have similar scores and clearly outperform the Transformer baseline. Base 6-layer MemTransformer with 10 memory tokens improves the baseline, and doubling the memory up to 20 tokens results in an even higher score of 25.58. This is a modest but solid performance compared to recent results in the literature, which mainly vary between 20 and 30 of the BLEU score (see, for example ).
Storing of representations in dedicated memory should be useful not only during encoding but also during decoding. MemTransformer model can be easily run with memory augmented decoder. As the simplest implementation of such memory, we just prepended target sequence with [mem] tokens. Note that we left the Transformer decoder as is, so the model was forced to learn the full target sequence prediction, starting with [mem] tokens first, and then continuing with the tokens for translation. Forward masking was applied to the full input sequence. Indeed, this modification increased difficulty of the learning task, and MemTransformer Enc(0) Dec(10) with 10 [mem] tokens in decoder performed a bit worse than the baseline (see the Table 1). But adding memory to decoder allowed MemTransformer Enc(10) Dec(10) model slightly improve the baseline scores.
|Small models (4 layers per encoder/decoder)||Base models (6 layers per encoder/decoder)|
|(20 epochs)||(10 epochs)|
|MemBottleneck Transformer 10||11.20||-|
|MemBottleneck Transformer 20||10.41||-|
|MemTransformer Enc(0) Dec(10)||18.81||-|
|MemTransformer Enc(10) Dec(10)||19.07||-|
Memory models have better scores after training, but do they require memory for inference? If the performance of trained MemTransformer will stay the same for the inference without [mem] tokens, then memory was only needed to improve training and not used for the processing of an input sequence. Results of memory lesion experiments presented in Figure 3 demonstrate that removing [mem] tokens from MemTransformer input leads to a dramatic drop in BLUE score, from 25.07 to 11.75 for the MemTransformer 10 and from 25.58 to 3.87 for MemTransformer 20 (both models have 6-layers in the encoder). This is an indicator that the presence of [mem] tokens is critical for MemTransformer during inference.
Another important question is related to the universality of the learned memory controller. Is it able to utilize memory of arbitrary size, or can work only with the memory capacity it was trained? Memory lesions data (see fig. 3) suggest that MemTransformer learns a solution that is partially robust to the variations in the size of the memory. BLEU score of MemTransformer 10 with 5 [mem] tokens shows it is still able to translate with acceptable quality. On the other hand, if we add 20 more [mem] tokens to the same model, it will have scores that are lower even compared to the case when the model is evaluated without memory at all. Interestingly, the model trained with a larger memory size of 20 has weaker generalization abilities. It is evident from the more steep decline of performance with the deviation of memory size from the one used during training.
3.2 Attention patterns in memory
To gain a better understanding of the inner workings of memory augmented transformers and following previous studies [16, 4], we visually explored attention patterns. Kovaleva et al.  introduced five categories of self-attention and suggested that only "heterogeneous" patterns that spread attention across all input tokens might extract non-trivial information about the linguistic structure. Numerous observations of attention maps across baseline Transformer and MemTransformer models allow us to conclude that rhe overall structure and distribution of pattern types in the sequence to sequence part of the attention mechanism are similar. Thus, for further analysis, we skip sequence to sequence attention and focus on memory to sequence, memory to memory, and sequence to memory attention patterns. All attention maps for selected models are presented in the Appendix.
Memory to sequence attention makes it possible to selectively update vectors stored in [mem] token positions with representations of input sequence elements. Such an update is a form of soft write to memory operation. Indeed, we found many patterns consistent with writing from sequence to memory in all MemTransformer models. One of them is shown in Figure 4a. Write to memory type of attention is more frequent in the first layers and almost absent in the deeper part of the encoder.
Memory to memory attention allows recombining vectors of [mem] tokens. We found a few common patterns related to in-memory processing. The most common arrangement of activity is diagonal. The diagonal can be blurred (or "soft"), making local fusion of the neighboring memory representations possible. Examples of this operation can be seen in the left top corner of the figures 4a., 4b. and 4c. If diagonal attention is sharp (see 4c. in the middle), then corresponding memory vectors are added to themselves, so their content is amplified and became more error-tolerant. This can be seen as a store operation. Another possible operation is a block copy (examples can be found in the Appendix). It is usually manifested as a vertical block of attention. In this case, a number of consequent [mem] vectors are updated with the same values aggregated from some another sequence of [mem] vectors. A copy operation can also be performed in a [mem] to [mem] manner with preserving a source order as in figure 4b. or with reversing the order (see Appendix for examples).
Sequence to memory attention implements read from memory operation and can be found in the first layers of the encoder, but it is more pronounced in the middle layers of the decoder. A typical example of the memory "reading" is presented in Figure 4d. Note, that during decoding token representation is updated by reading from a block of subsequent [mem] tokens.
The overall pipeline of memory processing is similar for the different runs and sizes of MemTransformer. It consists of writing some information from the input sequence to the memory in the first layers of the encoder, then followed by memory processing in the intermediate layers and amplification in the output layers of the encoder. During decoding, information is read from memory. Here, the highest "reading" activity is commonly observed in the intermediate decoder layers. Interestingly, memory-related attention patterns usually have a block structure. For example, patterns form the particular MemTransformer 10 presented in Figure 4 suggest that the model had learned to split memory into three blocks. Commonly, the same memory operation is applied to all [mem]s of the same block by one particular head. During memory processing, the model can operate in a blockwise manner, as in Figure 4b, where the "block(1-3)" is copying to the "block(4-6)". We speculate that the block structure of memory processing might reduce error rate because the memory representation is "averaged" over the [mem]s of the block during reading (see fig. 4d). Experiments with MemBottleneck architecture show that the model might be able to learn how to copy representations of the input sequence into the memory of the fixed size and use only this memory during decoding.
We proposed and studied two memory augmented architectures MemTransformer and MemBottleneck Transformer. Qualitative analysis of attention patterns produced by the transformer heads trained to solve machine translation task suggests that both models successfully discovered basic operations for memory control. Attention maps show evidence for the presence of memory read/write as well as some in-memory processing operations such as copying and summation.
A comparison of machine translation quality shows that adding general-purpose memory in MemTransformer improves performance over the baseline. Moreover, the speed of training and final quality positively correlates with the memory size. On the other hand, MemBottleneck Transformer, with all self-attention restricted to the memory only, has significantly lower scores after training.
Memory lesion tests demonstrate that the performance of the pre-trained MemTransformer model critically depends on the presence of memory. Still, the memory controller learned by the model degrades only gradually when memory size is changed during inference. This indicates that the controller has some robustness and ability for generalization.
Our memory augmentation approach is universal and can extend almost any "encoder-decoder with attention" framework. We see as a particularly promising combination of MemTransformer with the recurrency mechanisms [5, 20] to process long sequences. It can be also applied to the tasks that depend on the multi-hop reasoning or planning. In this cases memory should help to store and process representations for the intermediate stages of the solution.
-  J. Ainslie, S. Ontanon, C. Alberti, P. Pham, A. Ravula, and S. Sanghai. Etc: Encoding long and structured data in transformers, 2020.
-  I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer, 2020.
-  O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. s. Tamchyna. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics.
-  K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What Does BERT Look at? An Analysis of BERT’s Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy, 2019. Association for Computational Linguistics.
-  Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context, 2019.
-  A. Graves, G. Wayne, and I. Danihelka. Neural turing machines, 2014.
-  A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, Oct. 2016.
-  E. Grefenstette, K. M. Hermann, M. Suleyman, and P. Blunsom. Learning to transduce with unbounded memory, 2015.
-  C. Gulcehre, S. Chandar, and Y. Bengio. Memory augmented neural networks with wormhole connections. arXiv preprint arXiv:1701.08718, 2017.
-  C. Gulcehre, S. Chandar, K. Cho, and Y. Bengio. Dynamic neural turing machine with soft and hard addressing schemes. arXiv preprint arXiv:1607.00036, 2016.
-  Q. Guo, X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang. Star-transformer, 2019.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
-  A. Joulin and T. Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets, 2015.
-  J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu. Parallel machine translation with disentangled context transformer. arXiv preprint arXiv:2001.05136, 2020.
-  G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms. arXiv:2004.10102 [cs], 2020.
-  O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the Dark Secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4356–4365, Hong Kong, China, 2019. Association for Computational Linguistics.
-  G. Lample, A. Sablayrolles, M. Ranzato, L. Denoyer, and H. Jégou. Large memory layers with product keys, 2019.
-  Y. Meng and A. Rumshisky. Context-aware neural model for temporal information extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 527–536, 2018.
-  J. W. Rae, J. J. Hunt, T. Harley, I. Danihelka, A. Senior, G. Wayne, A. Graves, and T. P. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes, 2016.
-  J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling, 2019.
-  S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks, 2015.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All you Need. In Advances in neural information processing systems, pages 5998–6008, 2017.
-  J. Weston, S. Chopra, and A. Bordes. Memory networks, 2014.
Appendix A Attention maps for memory augmented transformers
In this section we present attention maps for two representative cases of MemTransformer and MemBottleneck transformer models. For both models we use the same input sequence.
Input sequence: Langes Kurzzeitgedächtnis ist eine Technik, die zur Verbesserung der Entwicklung von künstlicher Intelligenz wesentlich beigetragen hat.222https://de.wikipedia.org/wiki/Long_short-term_memory
Predicted translation MemTransformer 10: Long, short-term memory is a technique that has contributed significantly to improving the development of artificial intelligence.
Long, short-term memory is a technique that has contributed significantly to improving the development of artificial intelligence.
Predicted translation MemBottleneck 20: The short time memory is a technique that has helped to improve the development of artificial intelligence in a lot of sense.
Reference: Long-term short-term memory is a technique that has contributed significantly to improving the development of artificial intelligence.333https://translate.google.com
A short guide to help with interpretation of attention maps is shown on the Figure 5.
a.1 MemTransformer attention maps
Visualisation of attention maps for MemTransformer 10 (see Section 2.2) with a memory size of 10 is presented on the Figure 6 for 6 layers of encoder and on the Figure 7 for 6 layers of decoder. Every transformer layer has 8 attention heads. The model was trained for 10 epochs on WMT-14 DE-EN  dataset.
a.2 MemBottleneck Transformer attention maps
Attention patterns generated by MemBottleneck Transformer architecture (see Section 2.3) strongly suggest that the model learned to copy a given sequence into a memory, process it and use only this representations of input for decoding. The main idea of MemBottleneck is a restriction of global information exchange to memory. Therefore, an update for representations of the input sequence elements can access representations of other elements only by writing into and then reading them from memory. To do that, MemBottleneck uses two different transformer sub-layers each with its’ own set of attention heads (see fig. 1c).
Encoder attention maps (see fig. 8) suggest that, as expected, representations for the input elements are copied into memory in layers 1 and 2. Surprisingly, after that they are not properly updated anymore and the decoder mostly attends to the content of memory (see fig. 9). This impressive outcome shows that transformer can be trained to read and process all the information about the input sequence in memory only.