Accessing Higher-level Representations in Sequential Transformers with Feedback Memory

02/21/2020 ∙ by Angela Fan, et al. ∙ 5

Transformers are feedforward networks that can process input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input - the representation at a given layer can only access representations from lower layers, rather than the higher level representations already built in previous time steps. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, neural machine translation, summarization, and reinforcement learning that the increased representation capacity can improve over Transformer baselines.



There are no comments yet.


page 12

Code Repositories


Implementation of Feedback Transformer in Pytorch

view repo


My final project submission for the Meta Learning course at BITS Goa (conducted by TCS Research)

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In recent years, the Transformer architecture (Vaswani et al., 2017)

has brought large improvements on a wide range of Natural Language Processing tasks such as sentence representations 

(Devlin et al., 2019), language modeling (Dai et al., 2019; Rae et al., 2020), and summarization (Edunov et al., 2019). Unlike more traditional recurrent architectures such as RNNs and LSTMs, the Transformer architecture processes a sequence in parallel in an order-invariant way. Additional techniques such as position embeddings (Sukhbaatar et al., 2015; Shaw et al., 2018) and attention masking are required to capture input order information.

The feedforward nature of Transformers makes it parallelizable and efficient to run on modern hardware, but it restricts the Transformer from taking full advantage of the input’s sequential property. In particular, the current hidden representation of a Transformer only accesses the past representation of lower levels, even though higher level representations of the past can already be computed. At generation time, the Transformer still generates one token at a time, so could access these representations for better performance, but they are not present at training time due to parallelization. During training, past higher level representations could be exploited to enrich the future lower level representations, enabling shallower models to have the same representation power.

Figure 1: The Feedback Transformer Architecture

merges past hidden representations from all layers into a single vector and stores it in memory. At each timestep, the hidden representations access all past representations at any depth through this memory.

Further, Transformers lack recurrence, so they struggle to maintain and update an internal state for a long time. In fact, an output from a Transformer can only go through a fixed number of computations regardless of the input length. Such a disadvantage has impact in long-context tasks, or tasks that require careful tracking of a world state. This is an inherent limitation of any feedforward model. While RNNs can maintain an internal state for an unbounded time while accumulating more computations upon it, the size of their internal state is limited by the number of layers.

We explore a modified architecture, the Feedback Transformer, that makes all previous hidden representations accessible to the computation of a representation at any depth — the model can feed back to itself any previous computation. This feedback nature allows this architecture to perform recursive computation, building stronger representations iteratively upon previous states. To achieve this, we modify the self-attention mechanism so it attends to higher level representations rather than lower ones.

As shown in Figure 1, the Feedback Transformer merges the hidden states from all layers into a single vector for every time step and stores them in a memory. Instead of self-attention, all subsequent layers attend to this memory, which means all previously computed representations can be accessed by all future layers mediated by the memory. This allows Feedback Transformers to recursively apply their own computation and maintain an internal state for unlimited time, which is something Transforms cannot achieve. Although RNNs can also maintain an internal state, the amount of information that Feedback Transformers can maintain is not limited by the number of layers.

Since our model lacks parallelism in the sequence length, it can be slow to train. We propose several ways of shortening the training time such as increasing the memory size over time. Besides, once the model is trained, it has the same speed as a Transformer during generation, which is important in many applications. Also, the training is not slower if the task requires step-by-step computation such as online reinforcement learning. There is also an added benefit that our model can reduce memory footprint during generation, as the memory size does not grow with the number of layers.

We validate our architecture on various benchmarks in language modeling, translation, summarization, and reinforcement learning. We show improvements upon Transformer baselines, particularly in cases with limited model depth. Having smaller models has a variety of benefits, such as faster decoding speed and smaller memory footprint.

2 Related work

Our work shares similarities with recurrent networks augmented with external shared memories (Graves et al., 2014; Joulin & Mikolov, 2015; Sukhbaatar et al., 2015). For example, the stack augmented RNN of Joulin & Mikolov (2015)

adds an external memory to a recurrent network to keep long term dependencies. Closer to our work, the Neural Turing Machine of 

Graves et al. (2014) models an unconstrained memory that resembles the self-attention layer of a Transformer. Further improvements to recurrent networks, such as the Gated Feedback RNN (Chung et al., 2015), are based on better controlling signal from different layers and extended to feedback through multiple pathways (Jin et al., 2017). These works are built on recurrent networks with additional components to store long term dependencies.

Other works have studied modifications to the Transformer architecture by enriching its structure with components inspired by recurrent networks. For example, Wang et al. (2019) proposes to add a local recurrent sublayer to the Transformer layer to remove the need of position embeddings in the multi-head self-attention layers. Universal Transformer (Dehghani et al., 2018) shares the parameters between the layers of a Transformer, leading a recurrent network in depth. Hao et al. (2019) and Chen et al. (2018) augment Transformers with a second, recurrent encoder. As opposed to our work, these prior investigations do not change the computational path in a Transformer to reduce the discrepancy between the training and inference time. Closer to our work, Merity (2019) proposes to add a self-attention layer on top of the past outputs from an LSTM cell. However, this approach keeps the recurrent and the self-attention mechanisms decoupled, as opposed to ours which makes the attention mechanism recurrent. In particular, the LSTM layer of Merity (2019) model still has a bottleneck corresponding to the dimension of the hidden layer.

3 Method

We propose a modification to the Transformer architecture to better adapt it to sequential modeling

the core of tasks such as machine translation or reinforcement learning. The modification aims to provide capacity to build more nuanced representations of each timestep .

3.1 Transformer Architecture

We briefly describe the Transformer architecture proposed in Vaswani et al. (2017). The core of a Transformer is a stack of identical layers. Each layer is composed of a multi-head self-attention sublayer (Attn) followed by a feedforward sublayer (FF), and each sublayer is followed by an add-norm operation that combines a skip-connection (He et al., 2016) and layer normalization (Lei Ba et al., 2016).

The -th layer of a Transformer processes an input sequence of vectors into a sequence of vectors of the same length. First, the self-attention sublayer computes a representation for each time step by taking its related input vector along with its past context, :

Within the self-attention sublayer, is used to form query vectors while its context is used to compute key and value vectors, forming a memory of the past information. The past memory is ordered by the distance in time with the help of position embeddings, which are added to the key vectors to denote the distance to the query. Then the feedforward sublayer processes each vector independently, i.e., . The Transformer layer transforms its input sequence into an output sequence :


Since the computations of do not depend on each other, it is possible to compute them in parallel. In practice, a block of steps is computed in parallel during training, where

can be viewed as the backpropagation through time (BPTT) length. This parallelization makes the training of Transformers more efficient on parallelizable hardware such as GPUs. In addition, in order to operate on sequences of unbounded length, Transformers require modifications such as caching hidden representations from previous blocks 

(Dai et al., 2019) and relative position embeddings.

3.2 Feedback Transformer

Layer by layer, Transformers build more abstract and high level representations for the entire input sequence. At each layer, the representations for the input sequence are treated in parallel, even for sequential problems where past representations could have already been computed. As a consequence, a standard Transformer does not leverage high level representations from the past to compute the current representation, even though it has access to them.

We propose to change the Transformer architecture to use the most abstract representations from the past directly as inputs for the current timestep. This means that the model does not form its representation in parallel, but sequentially token by token. More precisely, we replace the context inputs to attention modules with memory vectors that are computed over the past, i.e.,

where a memory vector is computed by summing the representations of each layer at the -th time step:


where are learnable scalar parameters. Note these scalars are the only new parameters introduced by our change, with all else the same as the standard Transformer. Here corresponds to token embeddings. The weighting of different layers by a softmax output gives the model more flexibility as it can average them or select one of them.

Figure 2: Summary of Main Difference between Feedback Transformer and Transformer. indicates the timestep and indicates the layer.

This modification of the self-attention input adapts the computation of the Transformer from parallel to sequential, summarized in Figure 2. Indeed, it gives the ability to formulate the representation based on past representations from any layer , while in a standard Transformer this is only true for . This change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity would allow much shallower models to capture the same level of abstraction as a deeper architecture. This has several practical advantages, as more shallow models have reduced memory footprint and increased decoding speed. Memory usage can be further reduced at generation time since the model only needs to store one vector per time step, rather than keeping previous states from all the layers.111However, this memory reduction cannot be combined with the computation saving trick where keys and values are stored in memory rather than the hidden representations.

An alternative view of such an architecture modification is providing the capacity for recursive computation — outputs from a sublayer can feed back to the same sublayer through the memory. The model can then maintain an internal state for unbounded time. This is a clear advantage over Transformers, in which a submodule never looks at its own output. While an RNN can also repeat its computation on its internal state, its internal state has a limited capacity determined by the number of layers and their hidden dimension. In contrast, the internal state of a Feedback Transformer is its whole memory, which can grow with the input length. This should allow the model to keep track of a large number of things within its internal state.

Our modification is not without a drawback. Note that we now need to compute the last state

of the previous time step before starting the computation of the next step. This means that, similar to recurrent neural networks, we cannot compute multiple time steps in parallel. While this reduction in parallelism will slow down computation, it will not affect the performance during generation where one needs to compute one step at a time anyway. The same is true for online reinforcement learning where the input must be processed sequentially even during training. Next, we introduce two ways of reducing training time.

Warming up BPTT length.

The model processes data in batches of tokens, where is the BPTT length and is the batch size. The reduction of parallelism during training only occurs in the dimension of , that is we cannot parallelize the training of a sequence. While we still can parallelize in the dimension of , we cannot simply reduce and increase because a long BPTT length is crucial for learning long term dependencies.

We thus propose a warm-up mechanism for the BPTT length during training to accelerate the training of our model. More precisely, we start with a small and double it after every updates. We also reduce accordingly so that the number of tokens in a batch is constant. This means that the value of can be small early in the training and we benefit from more parallelism thanks to large values. This should not affect the performance of the model — at the beginning of the training, most of the information flows from relatively close-by elements. This trick requires stable training with large batchsize, making the use of pre-normalization very important for our model (Child et al., 2019).

Initializing with Transformers.

Another way to speed up the training process is to initialize the model with an already trained Transformer model. It is possible to do this because a Feedback Transformer shares all of its parameters with a Transformer model, with the exception of few parameters . Although it is not a smooth transition as the two models work in very different manners, it shortens the training time as many of the parameters have been learned, including the embeddings.

4 Experiments

Figure 3: (left) Neural Machine Translation on WMT14 En-De. We report BLEU on the test set for varying decoder depths. Words per second is measured on GPU. (right) Document Summarization on CNN-Dailymail. We report ROUGE-L on the test set for varying decoder depths, and their decoding speed in words-per-second (wps).

Task / Model Accuracy (%)
Blocked Random Walk Train 10k Train 1M
Transformer 88.1 97.1
Feedback Transformer 100.0 100.0
Copy Char Seq
Transformer* 53.0 3.0
Universal Transformer* 91.0 35.0
Transformer 59.1 6.2
Feedback Transformer 76.2 23.6
Reverse Char Seq
Transformer* 13.0 6.0
Universal Transformer* 96.0 46.0
Transformer 50.2 5.9
Feedback Transformer 74.8 29.2
Table 1: Comparison on various constructed toy tasks. We follow Dehghani et al. (2018) for the Copy and Reverse tasks and denote their results with *.

We test our model on several different types of tasks that involve processing sequential inputs. First, we analyze the recurrent properties of our model by evaluating on various toy tasks designed to illustrate the need for recurrence. We then evaluate our architecture on neural machine translation and document summarization, where we show that the increased capacity of our model allows for more shallow models to perform strongly — this drastically increases generation speed at inference time. Further, we evaluate on three challenging language modeling benchmarks. Finally, we apply our model to two reinforcement learning tasks that require past memory for optimal performance. More details of the experiments such as hyperparameter values can be found in Appendix 


4.1 Toy Tasks

Blocked Random Walk.

We created a simple toy task to demonstrate the inherent weakness of feedforward models compared to recurrent models. In this task, a model has to keep track of an agent placed in a small grid with few blocks. The agent randomly picks one of the four move actions, but the blocks can prevent some of the moves. The model takes as an input the chosen actions, and needs to predict the agent’s location.

This task is challenging because the effect of an action depends on the current location, thus the model has to constantly keep track of the agent’s location. Even though a Transformer can access all previous actions via its attention, it cannot maintain an internal state for a long time because it cannot access all of its internal computations. In fact, an internal state can be updated only times when propagating through a -layer Transformer because each computation goes to a higher layer. Therefore, a Transformer cannot keep track of the agent’s location within its internal state for long time.

The result in Table 1 confirms this. A Transformer struggles at this task and reaches accuracy when trained 10k sequences of length 100. More training data helps, but it never solves the task completely. In contrast, the Feedback Transformer solves it successfully with accuracy. See Appendix B for more details about this task.

Copy and Reverse.

We subsequently experiment on two algorithmic tasks, copy and reverse. Following Kaiser & Sutskever (2015) and Dehghani et al. (2018), we train models on sequences of length 40 consisting of integers 0 through 9, and test on sequences of length 400. Models must either copy the entire sequence or reverse it, which requires memory over the length of the sequence and the ability to track position. Further, this task requires generalization capability as the train and test settings consist of different lengths. Results are shown in Table 1. We display the results of Universal Transformer (Dehghani et al., 2018) for comparison, though note that the model size, training time, and data size may differ. However, a reimplementation of the Transformer baseline provides similar results. Compared to the standard Transformer, our Feedback Transformer architecture has large improvements in accuracy.

4.2 Sequence to Sequence Tasks

We experiment on two sequence to sequence tasks. We use Feedback Transformers only on the decoder side because the encoder inputs are available at once and processing them in parallel is faster even during generation. We experiment with translation and summarization tasks and demonstrate that the Feedback Transformer decoder maintains competitive performance even with reduced model depth. We implement these tasks in fairseq-py (Ott et al., 2019).

Neural Machine Translation.

We evaluate the performance of the Feedback Transformer on the WMT14 En-De machine translation benchmark of 4.5 million pairs. We follow the setting of Vaswani et al. (2017) and train on the WMT16 training data using newstest2013 as the validation set and newstest2014 as the test dataset. We learn 32K joint byte pair encodings (Sennrich et al., 2016). For generation, we use beam size 5, tuning a length penalty on the validation set. We average the last 10 checkpoints and apply compound splitting, following Vaswani et al. (2017). Model quality is evaluated using tokenized BLEU.

In Figure 3 (left), we display results where more and more layers are removed from a Feedback Transformer decoder compared to a standard Transformers on WMT14 En-De. As the decoder becomes increasingly small and shallow, from six layers to one layer, the gap in performance between the Feedback Transformer and the standard Transformer widens. While the 1-layer Transformer model can only reach 27.2, the Feedback Transformer has 27.9 BLEU.

In translation, the ability to maintain performance with shallow decoders is very important, as model depth has a huge effect on decoding speed. For practical applications of translation models, the latency of generating translations is an important constraint. Reducing to just 1-layer from 6 improves decoding speed by 3.7x, while only losing 1.5 BLEU with the Feedback architecture. We report decoding speed in tokens per second on one GPU. Similar results on IWSLT De-En can be found in Appendix A.


We evaluate on the CNN-Dailymail multi-sentence summarization benchmark of 280K news articles paired with summaries (Hermann et al., 2015). We model the first 400 words of the article (See et al., 2017). We evaluate using ROUGE (Lin, 2004). For generation, we use 3-gram blocking and tune length (Fan et al., 2017).

Figure 3 (right) displays the comparative performance of the Feedback Transformer as the decoder layers are reduced. For all model depths, the Feedback architecture maintains a consistent improvement in ROUGE compared to the standard Transformer. Compared to sentence-level tasks such as machine translation, where models only need to generate individual sentences, this summarization benchmark requires multi-sentence generation. As the model must write summaries around 40 to 60 words, the increased modeling capacity of the Feedback architecture is beneficial.

Figure 4: The performance on (left) char-PTB and (right) Wikitext-103 as a function of the model depth. The number of parameters is kept constant by increasing the width.

4.3 Language Modeling

We test our model on both word-level and character-level language modeling tasks that require modeling very long context. As the tasks require processing of unbounded sequences, we use the caching mechanism (Dai et al., 2019) and relative position embeddings.


We started with char-PTB, which is the more challenging character-level version of Penn Treebank. It contains about 5M tokens in the training set. First, we investigate the effect of depth on model performance while the number of parameters is fixed. To maintain the number of parameters constant while reducing the depth, we increased FF size and the attention head dimension proportionally.

As shown in Figure 4 (left), Transformer performance degrades as the model becomes shallower. In contrast, the Feedback architecture maintains performance despite decreased depth, achieving the best result with only 2 layers.

We further finetune this 2-layer model with 10x smaller learning rate for another 1k updates and compare with existing work (excluding results with dynamic evaluation) as well as our 6-layer Transformer baseline in Table 2. We achieve competitive results and note that Melis et al. (2020) use several techniques to improve their results such weight averaging, choosing a softmax temperature based on validation, and black-box hyperparameter optimization.


We also test our model on the larger-scale character-level language modeling benchmark Enwik8 (Mahoney, 2011), containing 100M unprocessed bytes from Wikipedia. We train a relatively small 12-layer model. Since the task requires very long context, we use adaptive attention span (Sukhbaatar et al., 2019a) with 8k tokens maximum. As shown in Table 3, our proposed Feedback Transformer model achieved a new SOTA performance of bit-per-byte despite its small size.

params dev test
recurrent networks
Quasi-RNN (Bradbury et al., 2016) 13.8M - 1.187
AWD-LSTM (Merity et al., 2017) 13.8M - 1.175
TrellisNet (Bai et al., 2018) 13.4M - 1.158
LSTM (Melis et al., 2020) 24M 1.163 1.143
Mogrifier (Melis et al., 2020) 24M 1.149 1.131
Transformer networks
Transformer 10.7M 1.256 1.227
Feedback Transformer 10.7M 1.181 1.160
Table 2: Results on char-PTB dataset. We report bit per character (bpc) on the dev and test sets.

Model Params Dev Test
Recurrent networks
LSTM (Melis et al., 2020) 48M 1.18 1.20
Mogrifier (Melis et al., 2020) 48M 1.14 1.15
SHA-LSTM (Merity, 2019) 54M 1.10 1.07
Transformer networks
Trans-XL (Dai et al., 2019) 277M - 0.99
Sparse Trans. (Child et al., 2019) 95M - 0.99
AdaSpan (Sukhbaatar et al., 2019a) 181M 1.00 0.98
All-Attn. (Sukhbaatar et al., 2019b) 114M - 0.98
C-Transformer (Rae et al., 2020) 277M - 0.97
Feedback Transformer 77M 0.98 0.96
Table 3: Results on Enwik8 dataset. We report bit-per-byte (bpb) on the dev and test sets, as well as the number of parameters.


We evaluate on the language modeling benchmark Wikitext-103 (Merity et al., 2017). In contrast with other word-level LM tasks, Wikitext-103 challenges models to leverage long context, as the sentences are ordered to reflect the Wikipedia article they originate from. Being able to effectively access previous information makes the task much easier, as words written earlier in the article are more likely to be repeated.

First, we investigate the effect of depth on performance in the same way as the char-PTB experiment. The results are shown in Figure 4 (right) and demonstrate that for a fixed number of parameters, the Feedback architecture can have substantially reduced depth compared to the Transformer.

Next, we train small and large versions of our model and compare against other baselines in Table 4. Our small model outperformed some of the Transformer baselines with more than 100M parameters despite having only 40M parameters. Our large model matched the performance of the TransformerXL with almost half the parameters. Being able to maintain strong performance while reducing size is important for applications where memory is a concern. See Appendix A for more ablation results.

Model Params Dev Test
QRNN (Merity et al., 2018) 151M 32.0 33.0
Trans-XL Base (Dai et al., 2019) 151M 23.1 24.0
DEQ-Trans (Bai et al., 2019) 110M - 23.2
All-Atten (Sukhbaatar et al., 2019b) 133M 19.7 20.6
Trans-XL Large (Dai et al., 2019) 257M 17.7 18.3
Trans+LayerDrop (Fan et al., 2019) 423M - 17.7
Compress Trans (Rae et al., 2019) 257M 16.0 17.1
Feedback Transformer 139M 17.5 18.2
Small networks
Transformer 44M 24.1 25.2
Feedback Transformer 44M 21.4 22.4
Table 4: Results on WikiText-103. We compare with the state of the art on word level language modeling. We report perplexity (ppl) for the dev and test sets as well as the number of parameters.

4.4 Reinforcement learning

We apply the Feedback architecture to two reinforcement learning tasks that require memory to optimally solve because the agents have limited vision. As the model is trained online using A2C, the input must be processed sequentially even during training time. This makes the training of Feedback Transformers as fast as Transformers.

Figure 5: (left) Depiction of the Maze Navigation task and (center) its averaged cumulative reward during training. (right) Averaged cumulative reward on the Water Maze task.

Maze Navigation.

The goal is to navigate a procedurally generated random maze, depicted in Figure 5 (left). In each episode, we generate random mazes using Kruskal’s algorithm (also dead ends are eliminated by randomly removing blocks). We randomly place 8 target objects with different colors. The agent is then given a randomly selected color as a target. If the agent manages to reach the correct target, it gets a reward of and a new target is sampled. An episode ends after 200 steps. The observation includes the area around the agent as well as the target color. For optimal performance, the agent must remember the maze layout and target locations in its memory.

We train -layer Transformer models with a hidden size of and heads, setting the BPTT length to and the batch size to . The reward discount rate is . The attention span is 200 so the agent can put an entire episode in its memory. As displayed in Figure 5 (center), the Feedback Transformer converges to reach higher average reward. Results are shown averaged over 10 trials.

Water Maze.

We modify the Morris Water Maze task (Morris, 1981) to make it more challenging. The maze is defined by a goal position and a mapping that gives to each cell an integer ID — these remain fixed within an episode but change between episodes. The agent receives as an observation the cell IDs of its current location and the target cell. When the agent finds the target, it receives +1 reward and is randomly teleported. During the same episode, if the agent reaches a previously seen cell, it needs to remember how it reached the target from there to go back to the target. The grid size is . To help exploration, the agent can see if the goal is within a area around it. An episode ends after 200 steps.

We train for M steps (M episodes). We use -layer Transformer models with hidden size of and head. The attention span is so that the agent can put an entire episode in its memory. Results are shown averaged over 10 trials (the reward is reported averaged over the last 500 episodes for each trial). As shown in Figure 5 (right), the Feedback Transformer converges to higher average reward.

5 Discussion

We explore alternative constructions of forming the Feedback Memory mechanism in our architecture through an ablation study, analyze the training time, and discuss how to further reduce memory footprint during generation.

5.1 Alternative Memory Composition

We compare different ways of forming the memory on the char-PTB task. The Feedback architecture uses all layers when creating the memory vector as described in Eq. 2. As shown in Table 5, this outperforms the standard Transformer where the memory is the previous layer.

Additionally, we explore one instance of the Feedback architecture where the memory vector is only from the last layer. This performs almost as well as averaging all layers, indicating the importance of higher level representations.

Subsequently, we investigate a more RNN-like modification. In multi-layer RNNs, a layer only has recurrent connections to the same layer, not to higher layers. Similarly, we tried constraining our model so that a layer can only attend to a weighted sum of the same layer and lower layers. The resulting model remains recurrent as information can propagate through the same layer for infinitely many steps. As shown in Table 5, such modification does not perform well.

Memory composition recurrent dev bpc
All layers yes 1.197
Previous layer no 1.285 ()
Last layer yes 1.202 ()
Same + all lower layers yes 1.267 ()
Table 5: Comparison of different memory composition strategies on char-PTB dataset.

Model PPL Training days
Transformer Baseline 21.2 1.2
Feedback Transformer 19.7 17
Increasing BPTT Length 19.9 10
Initializing from Transformer 19.8 5.5
Table 6: Training Time. The number of days required for training on 32 V100 GPUs. We report perplexity on the validation set.

5.2 Training Speed of Feedback Transformer

Feedback Transformers require sequential computation, which can be significantly slower compared to the parallel computation of Transformers. There are two exceptions to this where data comes sequentially, thus prohibiting temporal parallelization. The first one is generation, which is the main usage of models on tasks such as translation and summarization. The second one is online reinforcement learning such as our maze navigation task where both Transformer and Feedback Transformer process fps.

However, when the temporal parallelization is possible, we compare several different ways of shortening the training time. In Table 6, we show the training time of 12-layer models with 140M parameters on the WikiText-103 task. Increasing the BPTT length during training as described in Section 3 reduces training time from 17 days to 10 days without much loss in performance. Further initializing from a pre-trained Transformer reduces the training time to 5.5 days, again without loss in performance.

5.3 Further Memory Reduction during Generation

The Feedback Transformer architecture allows much shallower models to achieve stronger performance compared to Transformers. For applications where latency is less important and storage speed is paramount, further memory reduction can be achieved at generation time. Since the Feedback architecture only needs to store one vector per timestep, previous states from all of the layers can be discarded, unlike in the standard Transformer. This decreases computational efficiency as some computations must be re-calculated, but allows memory usage to be reduced from to at generation time, where is the number of layers and is the context size.

5.4 Advantages in Longer-Context Tasks

Feedback architectures have an increasing advantage the longer the necessary modeling context. The machine translation benchmarks require only sentence-level computation, but we see larger advantages to our model in multi-sentence summarization and long-context language modeling benchmarks like Wikitext-103, particularly for small model sizes. This is likely due to the stronger ability of the recursive computation mechanism to build abstract representations even with shallow models.

6 Conclusion

We proposed a simple modification to the Transformer architecture that better utilizes the sequential input data structure. While the modification makes training slower, it reduces the memory consumption during inference time without additional computational cost. The increased representation power and recursive computation of the Feedback Transformer allows shallow and smaller models to have much stronger performance compared to a standard Transformer of the same size. This property benefits practical tasks where decoding speed is important, such as machine translation and document summarization. Experiments on a diverse set of tasks show that it also improves performance.


Appendix A Additional Results

a.1 IWSLT De-En

We additionally evaluate the Feedback Transformer on IWSLT De-En, a small machine translation dataset. We train a small Transformer model with 6 layers. For generation, we use beam size 5 without checkpoint averaging. Model quality is evaluated using tokenized BLEU. Results are shown in Figure 6 and show that for shallower models, the Feedback Transformer has better performance than the standard Transformer.

Figure 6: Results on the IWSLT De-En dataset.

a.2 Ablation Studies on Language Models

Here we study how different techniques affect the model performance on WikiText-103. The results shown in Table 7 indicate:

  • Pre-normalization combined with higher learning rates helps the performance, particularly for the standard Transformer.

  • Increasing the context size with adaptive span further improves the performance for both models.

  • The technique of increasing the BPTT length during training for efficiency does not affect the final performance.

  • The gap between two model is consistent along those variations.

Model Pre-norm + Adapt. Increase dev
higher LR span BPTT ppl
Transformer no no no 22.9
Transformer no no yes 22.9
Transformer yes no yes 21.0
Transformer yes yes no 20.6
Feedback no no no 19.7
Feedback no no yes 19.9
Feedback yes no yes 19.6
Feedback yes yes yes 19.0
Table 7: Ablation on WikiText-103 of various modeing choices. Results are shown without finetuning.

Appendix B Additional Implementation Details

b.1 Blocked Random Walk Details

We provide additional details for the blocked random walk toy task we explore. The maze layout is shown below:

D brownE F
G H brownI

The agent always placed at cell E initially. At every time step, one of the four direction is randomly selected and the agent moves one step in that direction. The brown cells represent blocks where agent cannot move (except for the initial step). Also, if the agent moves outside the maze it simply transported to the other side of the maze as if the maze repeats itself in all direction (e.g. moving right from F will bring the agent to D). An episode ends after 100 steps, and the maze resets back to its original state.

The input to the model is a sequence of actions taken by the agent, and a special symbol if there was a reset. The output is a sequence of location symbols corresponding to the agent’s location after each action. We generate two datasets with 10k and 1M episodes for training episodes, and 100k for testing.

We use the same setup as our language modeling experiments, except now the model predicts separate output tokens rather than a next token. We concatenate all the episodes and feed them to the model as a single sequence. The training is done with the negative-log-likelihood loss. See Table LABEL:tab:hyper for the hyperparamers used in the experiment. The attention span is set to 100, so that the models can attend to all the information they needs to solve the task.

The learning curve is show in Figure 7. We can see the baseline Transformer is struggling to learn this task, while the Feedback Transformer solves it only after 3k updates even with 10k training episodes. Note that the task is designed to be very easy for recurrent models, so a simple RNN is likely to solve it as well.

Figure 7: The learning on the blocked random walk task. The Transformer struggles learn to solve this task completely, while the Feedback Transformer solves it with 0% error just after 3k updates.

b.2 Maze Navigation Details

All agents where trained using A2C with RMSprop with a learning rate of

, entropy cost of , RMSProp epsilon regularisation parameter of , batch size of , and BPTT . LSTM model is a -layer LSTM with hidden size of .

b.3 Water Maze Details

The water maze task we designed is depicted visually in Figure 8.

All agents where trained using A2C with RMSprop with entropy cost of , RMSProp epsilon regularisation parameter of , batch size of , and BPTT . Feedback Transformer and Transformer baseline were trained with a learning rate of . LSTM model is a -layer LSTM with hidden size of . For LSTM model we used a learning rate of .

Figure 8: Depiction of the Water Maze task.

b.4 Machine Translation and Summarization

We detail the hyperparameters in Table 8. Summarization experiments are done with the Transformer base architecture size and WMT En-De experiments are done with the Transformer big architecture size. As IWSLT De-En is a smaller dataset, we use a smaller model. For all sequence to sequence experiments, only the decoder is modified to have the Feedback Transformer architecture.

Hyperparameter Summarization WMT En-De IWSLT De-En
Encoder Layers 6 6 6
Decoder Layers 6 6 6
FFN Size 2048 4096 1024
Attention Heads 8 16 4
Dropout 0.3 0.3 0.3
Hidden Size 512 1024 512
Learning Rate 0.0005 0.001 0.0005
Table 8: Hyperparamers for sequence to sequence experiments.

b.5 Language modeling

Hyperparameter Random char-PTB Enwik8 WikiText-103 WikiText-103
Walk small large
Layers 4 6 12 4 8
Hidden size () 256 384 512 512 1024
FF size
Head count () 4 4 8 8 8
Head dim
Attention span 100 512 8192* 512 512, 2048*
Dropout rate 0.2 0.5 0.5 0.1 0.3
Embed. dropout - - - 0.1 0.2
BPTT len () 64 128 128 256 256
Batch size () 512 2048 1024 512 512
Learning rate 0.0003 0.0015 0.0015 0.0007 0.0007
Gradient clip 1.0 1.0 0.1 0.1 0.1
LR warm-up steps 1k 1k 8k 8k 8k
Parameters 3.2M 10.7M 77M 44M 139M
Table 9: Hyperparamers for language modeling experiments. Here * indicates the adaptive span.

In the language modeling experiments, we added several improvements on top of the original Transformer (Vaswani et al., 2017) to better adapt to unbounded sequences:

  • Hidden representation caching (Dai et al., 2019): Since the input to the model is an unbounded sequence and the model needs to process it in small blocks, hidden representations from previous blocks are kept in cache so that any token in the current block will the same context length regardless of its position in the block.

  • Relative position embedding (Shaw et al., 2018): Relative position embeddings allow each token in a block to be processed in the same way regardless of its absolute position in the block. We found that adding shared embeddings to key vectors at every layer to be effective.

  • Adaptive attention span (Sukhbaatar et al., 2019a) Language modeling requires a model to have a very long attention span, which is computationally expensive. The adaptive span mechanism allows each attention head to learn different attention spans for efficiency.

  • Pre-normalization (Child et al., 2019): We observed that pre-normalization makes training more stable for Transformers, which allowed us to use larger batch sizes for better parallelization.

Dropouts are applied to attention and ReLU activations. In

WikiText-103 models, additional dropouts are added to the embedding layer output and the last sublayer output.

In Table 9, we present the hyperparameter values used for our experiments. We use the same hyperparameters for both Transformers and Feedback Transformers, and optimize them with Adam. The final performances are obtained by finetuning the models with a 10x smaller learning rate.

Details on the char-PTB experiments

We trained the models for 15k updates (or earlier if the validation loss stops decreasing), and funetined them for 1k steps. We varied the depth of the models while keeping the number of parameters constant. This is achieved by changing the FF size and the head dimension inverse proportionally to the depth.

Details on the enwik8 experiments

We used an adaptive span limited to 8192 tokens with a loss of 0.0000005. The training is done for 100k updates and another 10k steps is used for finetuning. The warming up BPTT length is used for speeding up the training, where the BPTT length is decreased to 64 for the first half of the training.

Details for Training on WikiText-103

We employed the adaptive input (Baevski & Auli, 2019) and the adaptive softmax (Grave et al., 2017) techniques for reducing the number of parameters within word embeddings. The models are trained for 200k steps and the finetuned for additional 10k steps. The BPTT length is increased from 32 to 256 during training, doubling after every 50k updates. The corresponding compute time change is shown in Figure 9.

Figure 9: Compute time (ms) for processing a single batch with increasing BPTT length.

While most of the models have a fixed attention span of 512, the best performance is achieved by extending the attention span to 2048 with adaptive span loss 0.00001.

After training our models, we noticed that our tokenization method differed from others by omitting end-of-line (EOL) symbols. Since our dictionary already contained the EOL token, we were able finetune our trained models on the data with EOL tokens, rather than training them from scratch. This change alone brought about 1ppl improvement.