The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization

by   Róbert Csordás, et al.

Despite successes across a broad range of applications, Transformers have had limited success in systematic generalization. The situation is especially frustrating in the case of algorithmic tasks, where they often fail to find intuitive solutions that route relevant information to the right node/operation at the right time in the grid represented by Transformer columns. To facilitate the learning of useful control flow, we propose two modifications to the Transformer architecture, copy gate and geometric attention. Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depth. NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing. Our code is public.







1 Introduction

Neural networks (NNs) may easily learn certain training sets, but they typically do not generalize to systematically different test sets. Examples of systematic generalization (Fodor et al., 1988) include generalization to sequences longer than those seen during training (productivity) and to novel algorithmic combinations of previously learned rules (systematicity). Despite recent efforts (Bahdanau et al., 2019; Korrel et al., 2019; Lake, 2019; Li et al., 2019; Russin et al., 2019; Csordás et al., 2021), systematic generalization generally remains unsolved (Fodor and McLaughlin, 1990; Lake and Baroni, 2018; Liska et al., 2018; Greff et al., 2020; Hupkes et al., 2020). On some datasets, the best performing models are neuro-symbolic hybrids (Chen et al., 2020; Liu et al., 2020) using task-specific symbolic functions. However, their applicability to other datasets remains limited (Furrer et al., 2020; Shaw et al., 2020). A big question is: which type of architectural inductive bias encourages the training process to select "good" solutions which generalize systematically?

The popular Transformers (Vaswani et al., 2017) also often fail to generalize on algorithmic tasks (e.g. Liska et al. (2018); Dubois et al. (2020); Chaabouni et al. (2021); Csordás et al. (2021); Ontañón et al. (2021)), even on tasks with intuitive solutions that can be simply expressed in terms of Transformer attention patterns. Given an input sequence of length $N$ and a Transformer encoder of depth $T$, solving an algorithmic task is often all about routing the relevant information to the right node/operation at the right time in the $T$-by-$N$ grid represented by Transformer columns. Effectively, the task is to learn to draw an adaptive control flow on the canvas of Transformer columns. In fact, recent work by Weiss et al. (2021) introduced a programming language called RASP, which is specifically designed to express solutions to sequence processing problems, and which has a direct equivalent to the operations in Transformer encoders. However, this work also shows that Transformers learn solutions expressed in RASP only through intermediate supervision of attention patterns, and in some cases, even such supervision fails (Weiss et al., 2021). Generally speaking, Transformers fail to find easily interpretable and/or symbolic solutions to algorithmic tasks. We conversely hypothesize that attention-based NNs that are able to find intuitive solutions (achieving interpretable attention patterns) could improve systematic generalization.

Here we point out that regular Transformers lack some basic ingredients for learning such “intuitive” solutions to algorithmic problems. As a remedy, we propose simple architectural modifications to help them learn data routing. As a first step towards validating our model, we focus on the popular length generalization task of compositional table lookup (CTL; Liska et al. (2018); Hupkes et al. (2019); Dubois et al. (2020)), as well as two more complex tasks: a simple arithmetic task and a variant of ListOps (Nangia and Bowman, 2018) designed to test the compositional generalization ability of NNs. Our novel Neural Data Router (NDR) achieves 100% generalization accuracy (never reported before; Dubois et al. (2020)) on the CTL task, and obtains nearly perfect accuracy on both the proposed simple arithmetic and ListOps tasks. We show that the attention and gating patterns of NDR tend to be interpretable as plausible control flows.

2 Improving Transformers for Learning Adaptive Control Flow

We argue that the following components are needed to build Transformers capable of learning adaptive control flow. First, composing known operations in an arbitrary order requires that all operations are available at every computational step. This can be easily achieved by sharing the weights of the layers, as is done in Universal Transformers (Dehghani et al., 2019). Second, the network should be sufficiently deep, at least as deep as the deepest data dependency in the computational graph (e.g., in the case of a parse tree, this is the depth of the tree). Otherwise, multiple operations would be fused into a single layer and hinder natural and elegant compositions. Third, inputs in some columns should be kept unchanged until it is their turn to be processed. The regular Transformer lacks a mechanism for skipping the whole transformation step by simply copying the input to the next step/layer. We propose a special gating function, copy gate, to implement such a mechanism (Sec. 2.1). Finally, many algorithmic tasks require combining several local computations in the right order. This typically implies that attention should not focus on all possible matches at a given time but only on the closest match. We propose and investigate a new type of attention with a corresponding inductive bias called geometric attention (Sec. 2.2). Using both the geometric attention and copy gate, our model implements a “neural data routing mechanism”, which can adaptively serialize the input problem. We refer to the resulting new Transformer as Neural Data Router (NDR).

2.1 Copy Gate: Learning to Skip Operations (Vertical Flow)

Each layer of the regular Transformer consists of one self-attention and one feedforward block. The input to each of these blocks is directly connected to the corresponding output via a residual connection (Srivastava et al., 2015; He et al., 2016). However, such a connection does not allow for skipping the transformation of the entire layer and simply passing the unchanged input to the next layer. Here we propose to add an explicit gate, which we call copy gate, to facilitate such a behavior.

We consider a $T$-layer Transformer encoder and an input sequence of length $N$. Since each layer corresponds to one computational step, we often refer to a layer as a step $t$. We denote the Transformer state of column $i$ in layer $t$ as $\mathbf{h}^{(i,t)} \in \mathbb{R}^{d_\text{model}}$, where $d_\text{model}$ is the state size, and $\mathbf{H}^{(t)} = [\mathbf{h}^{(1,t)}, \dots, \mathbf{h}^{(N,t)}]$ denotes the states of all columns in layer $t$. In the copy gate-augmented Transformer, each column $i$ in layer $t$ processes the input similarly to regular Transformers:

$\mathbf{a}^{(i,t)} = \mathrm{LayerNorm}\big(\mathrm{MultiHeadAttention}(\mathbf{h}^{(i,t-1)}, \mathbf{H}^{(t-1)}) + \mathbf{h}^{(i,t-1)}\big)$   (1)

$\mathbf{u}^{(i,t)} = \mathrm{LayerNorm}\big(\mathrm{FFN}^{\text{data}}(\mathbf{a}^{(i,t)})\big)$   (2)

but the output is gated (using $\mathbf{g}^{(i,t)} \in \mathbb{R}^{d_\text{model}}$) as:

$\mathbf{g}^{(i,t)} = \sigma\big(\mathrm{FFN}^{\text{gate}}(\mathbf{a}^{(i,t)})\big)$   (3)

$\mathbf{h}^{(i,t)} = \mathbf{g}^{(i,t)} \odot \mathbf{u}^{(i,t)} + (1 - \mathbf{g}^{(i,t)}) \odot \mathbf{h}^{(i,t-1)}$   (4)

We use the basic two-layer feedforward block (Vaswani et al., 2017) for both $\mathrm{FFN}^{\text{data}}$ and $\mathrm{FFN}^{\text{gate}}$, which transforms an input $\mathbf{x} \in \mathbb{R}^{d_\text{model}}$ to:

$\mathrm{FFN}(\mathbf{x}) = \mathbf{W}_2 \max(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1, \mathbf{0}) + \mathbf{b}_2$   (5)

but with separate parameters and different dimensionalities: $\mathbf{W}_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ for $\mathrm{FFN}^{\text{data}}$, while $\mathbf{W}_1 \in \mathbb{R}^{d_\text{gate} \times d_\text{model}}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_\text{model} \times d_\text{gate}}$ for $\mathrm{FFN}^{\text{gate}}$, with biases $\mathbf{b}_1$ and $\mathbf{b}_2$ of matching sizes.

When the gate is closed, i.e., $\mathbf{g}^{(i,t)} = \mathbf{0}$ in Eq. 4, the entire transformation is skipped and the input is copied over to the next layer ($\mathbf{h}^{(i,t)} = \mathbf{h}^{(i,t-1)}$). Crucially, we parameterize the gate (Eq. 3) as a function of the output of the self-attention (Eq. 1), such that the decision to copy or transform the input for each column depends on the states of all columns. This is a crucial difference compared to previously proposed gatings in Transformers, which are solely motivated by training stability (Parisotto et al., 2020) or by a common practice from convolution-based models (Chaabouni et al., 2021). None of the previously proposed approaches can implement the behavior of our copy gate (see Sec. 6 on related work).

The bias of the gate is initialized to a negative value (cf. Hochreiter and Schmidhuber, 1997) so that the gates are closed at the beginning of training. This ensures that initially no update happens, which creates a better gradient flow between layers. It also encourages the model to skip layers unless they make an important contribution in the corresponding step.
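To make the gating concrete, here is a minimal NumPy sketch of the per-column update of Eqs. 2-4. The function and parameter names are ours, and LayerNorm (present in Eq. 2) as well as the attention that produces the input are omitted for brevity:

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    # Two-layer feedforward block (Eq. 5): W2 @ relu(W1 @ x + b1) + b2
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def copy_gate_update(h_prev, attn_out, data_params, gate_params):
    """One column update of Eqs. 2-4 (LayerNorm omitted).
    h_prev:   previous-layer state h^(i,t-1), shape (d_model,)
    attn_out: self-attention output a^(i,t), shape (d_model,)
    data_params/gate_params: (w1, b1, w2, b2) tuples for the two FFNs.
    """
    u = ffn(attn_out, *data_params)           # candidate update (Eq. 2)
    g = sigmoid(ffn(attn_out, *gate_params))  # copy gate (Eq. 3)
    return g * u + (1.0 - g) * h_prev         # gated mix (Eq. 4)
```

With a strongly negative gate bias the output is (numerically) the unchanged input, which is exactly the "copy" behavior described above; with a strongly positive bias the layer behaves like a regular transformation.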

2.2 Geometric Attention: Learning to Attend to the Closest Match (Horizontal Flow)

We propose geometric attention, designed to attend to the closest matching element. As in regular self-attention, given an input sequence $[\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}]$ with $\mathbf{x}^{(i)} \in \mathbb{R}^{d_\text{model}}$, each input is projected to key $\mathbf{k}^{(i)}$, value $\mathbf{v}^{(i)}$, and query $\mathbf{q}^{(i)}$ vectors, and the dot product is computed for each key/query combination. In our geometric attention, the dot product is followed by a sigmoid function to obtain a score between 0 and 1:

$P_{i,j} = \sigma\left(\frac{\mathbf{q}^{(i)\top} \mathbf{k}^{(j)}}{\sqrt{d_\text{head}}}\right)$   (6)

which will be treated as a probability of the key at (source) position $j$ matching the query at (target) position $i$. These probabilities are finally converted to the attention scores as follows:

$A_{i,j} = P_{i,j} \prod_{k \in \mathbb{S}_{i,j}} (1 - P_{i,k})$   (7)

where $\mathbb{S}_{i,j}$ denotes the set of all (source) indices which are closer to $i$ than $j$ is, and when two indices have the same distance to $i$, we consider the one which is to the right of $i$ (i.e., greater than $i$) to be closer, i.e.,

$\mathbb{S}_{i,j} = \{k \neq i, j : |i - k| < |i - j|\} \cup \{k \neq i, j : |i - k| = |i - j| \wedge k > i\}$   (8)

In addition, we explicitly zero out the diagonal by setting $A_{i,i} = 0$ for all $i$. The ordering of source indices is illustrated in Figure 1/Right. The resulting scores $A_{i,j}$ are the attention scores used to compute the weighted averages of the value vectors.
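The closest-match ordering and the product over closer sources can be spelled out directly. The following is a naive reference sketch in NumPy (our illustration, not the paper's implementation); `logits` stands for the pre-sigmoid query-key scores:

```python
import numpy as np

def geometric_attention_scores(logits):
    """Naive computation of geometric attention (Eqs. 6-8).
    logits: (N, N) raw query-key scores; row i = target, column j = source.
    Returns (N, N) scores A where A[i, j] = P[i, j] * prod over (1 - P[i, k])
    for all sources k strictly closer to i than j (ties broken to the right).
    """
    n = logits.shape[0]
    p = 1.0 / (1.0 + np.exp(-logits))  # Eq. 6: sigmoid match probabilities
    a = np.zeros_like(p)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal is explicitly zeroed
            score = p[i, j]
            for k in range(n):
                if k == i or k == j:
                    continue
                closer = abs(i - k) < abs(i - j) or (
                    abs(i - k) == abs(i - j) and k > i)  # tie: right side wins
                if closer:
                    score *= 1.0 - p[i, k]
            a[i, j] = score
    return a
```

When two sources match a query, the closer one receives almost all of the attention mass, since the more distant one is multiplied by $(1 - P)$ of the closer match.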

By using the term $(1 - P_{i,k})$ for every closer source $k$ in Eq. 7, a match downscales any other, more distant matches. Two recent works (Brooks et al., 2021; Banino et al., 2021) use such a parameterized geometric distribution in the form of Eq. 7 (see Sec. 6 on related work).

The resulting attention function has a complexity of $O(N^2)$, similar to the regular self-attention used in Transformers (Vaswani et al., 2017). Eq. 7 can be implemented in a numerically stable way in log space. The products can then be calculated using cumulative sums, subtracting the elements for the correct indices in each position.
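The log-space trick can be sketched as follows (our sketch; the per-target ordering is built explicitly here, whereas a production implementation would vectorize the cumulative sums):

```python
import numpy as np

def geometric_attention_log_space(logits):
    """Log-space computation of Eqs. 6-8 using running sums, O(N^2).
    Equivalent to the naive double product, but numerically stable:
    products of (1 - P) become sums of log(1 - P)."""
    n = logits.shape[0]
    # log(sigmoid(x)) and log(1 - sigmoid(x)), both computed stably
    log_p = -np.logaddexp(0.0, -logits)
    log_1mp = -np.logaddexp(0.0, logits)
    a = np.zeros_like(logits)
    for i in range(n):
        # sources ordered from closest to farthest; right side wins ties
        order = sorted((k for k in range(n) if k != i),
                       key=lambda k: (abs(i - k), k < i))
        csum = 0.0  # running sum of log(1 - P) over closer sources
        for k in order:
            a[i, k] = np.exp(log_p[i, k] + csum)
            csum += log_1mp[i, k]
    return a
```

The exclusive running sum plays the role of the product over $\mathbb{S}_{i,j}$: each source accumulates the log "no match so far" mass of all strictly closer sources before being scored.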

Figure 1: Left: an ideal sequence of computations in a Transformer for an example CTL task. Right: the order of source positions for each target, counting from the closest, used in geometric attention.

Directional encoding.

In practice, we augment Eq. 6 with an additional directional encoding. In fact, the only positional information available in the geometric attention presented above is the ordering used to define the product in Eqs. 7-8. In practice, we found it crucial to augment the score computation of Eq. 6 with additional directional information, encoded as a scalar $D_{i,j} \in \mathbb{R}$ for each target/source position pair $(i, j)$:

$D_{i,j} = \mathbb{1}_{j < i}\, \mathbf{h}^{(j)\top} \mathbf{d}_\text{LR} + \mathbb{1}_{j \geq i}\, \mathbf{h}^{(j)\top} \mathbf{d}_\text{RL}$   (9)

where $\mathbf{h}^{(j)}$ denotes the input/state at position $j$ and $\mathbf{d}_\text{LR}, \mathbf{d}_\text{RL} \in \mathbb{R}^{d_\text{model}}$ are trainable parameters. This directional information is integrated into the score computation of Eq. 6 as follows (akin to how Dai et al. (2019) introduce the relative positional encoding (Schmidhuber, 1992) as an extra term in the computation of attention scores):

$P_{i,j} = \sigma\Big(\alpha\, \frac{(\mathbf{W}_q \mathbf{h}^{(i)} + \mathbf{b}_q)^\top \mathbf{W}_k \mathbf{h}^{(j)}}{\sqrt{d_\text{head}}} + \beta\, D_{i,j} + \gamma\Big)$   (10)

where the matrix $\mathbf{W}_q \in \mathbb{R}^{d_\text{head} \times d_\text{model}}$ maps the states to queries, $\mathbf{b}_q \in \mathbb{R}^{d_\text{head}}$ is a bias for queries, $\mathbf{W}_k \in \mathbb{R}^{d_\text{head} \times d_\text{model}}$ maps states to keys (we note that $d_\text{head}$ is typically the size of the key, query and value vectors for each head, $d_\text{head} = d_\text{model} / n_\text{heads}$), and $\alpha$, $\beta$ and $\gamma$ are learned scaling coefficients and bias. Using this additional directional information, each query (position $i$) can potentially learn to restrict its attention to either the left or right side.
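One plausible reading of the directional term (our reconstruction; the paper's exact parameterization may differ) is a per-pair scalar that depends only on the source state and on which side of the target the source lies. A small NumPy sketch, with hypothetical parameter names `w_lr`/`w_rl`:

```python
import numpy as np

def directional_scores(h, w_lr, w_rl):
    """Directional term D for every (target i, source j) pair.
    h: (N, d) states; w_lr, w_rl: (d,) trainable projection vectors
    (names are ours). D[i, j] uses the 'left' projection when the
    source j lies to the left of the target i, else the 'right' one.
    The result is added to the attention logits before the sigmoid."""
    n = h.shape[0]
    left = h @ w_lr    # score used when j < i (source on the left)
    right = h @ w_rl   # score used when j >= i (source on the right)
    j_idx = np.arange(n)
    # cond[i, j] = (j < i); broadcast left/right over the rows
    d = np.where(j_idx[None, :] < j_idx[:, None], left[None, :], right[None, :])
    return d  # shape (N, N)
```

Because `left` and `right` come from separate trainable vectors, a query can learn, through the scaling of this term, to effectively mask out one side of the sequence.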

3 Experiments

We evaluate the proposed methods on three tasks: the compositional table lookup (Liska et al., 2018; Hupkes et al., 2019), a custom variant of ListOps (Nangia and Bowman, 2018), and a simple arithmetic task which we propose. In all cases, the task is designed to test the compositional generalization ability of NNs: the model has to learn to apply operations seen during training in a longer/deeper compositional way (productivity).

3.1 Compositional Table Lookup


The compositional table lookup task (Liska et al., 2018; Hupkes et al., 2019; Dubois et al., 2020) is constructed from a set of symbols and unary functions defined over these symbols. Each example in the task is defined by one input symbol and a list of functions to be applied sequentially, i.e., the first function is applied to the input symbol and the resulting output becomes the input to the second function, and so forth. There are eight possible symbols. Each symbol is traditionally represented by a 3-bit bitstring (Liska et al., 2018); in practice, however, it is simply processed as one token (Dubois et al., 2020). The functions are bijective and randomly generated. Each function is represented by a letter. An example input is ‘101 d a b’, which corresponds to the expression $b(a(d(101)))$; the model has to predict the correct output symbol. We note that there exists a sequence-to-sequence variant of this task (Dubois et al., 2020) where the model has to predict all intermediate steps (and is thus trained with intermediate supervision). We directly predict the final output. An ideal model should be able to solve this task independently of the presentation order, that is, it should not matter whether the task is encoded as ‘101 d a b’ or ‘b a d 101’. We thus study both forward (former) and backward (latter) variants of the task. To evaluate systematic generalization, the train/valid/test sets reflect different numbers of compositions: samples with 1-5/6-8/9-10 operations, respectively. No previous work has reported perfect accuracy on this task with a NN. We refer the readers to Sec. 6 for further details on previous work.
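The setup above can be made concrete with a hypothetical sample generator (our sketch; the official data generation may differ in details such as function naming and sampling):

```python
import random

def make_ctl_sample(n_ops, n_symbols=8, n_functions=8, seed=0):
    """Generate one compositional-table-lookup example (illustrative sketch).
    Symbols are 3-bit strings; each function is a random bijection over the
    symbols (a shuffled lookup table), named by a lowercase letter."""
    rng = random.Random(seed)
    symbols = [format(i, "03b") for i in range(n_symbols)]
    functions = {}
    for name in "abcdefghijklmnopqrstuvwxyz"[:n_functions]:
        table = symbols[:]
        rng.shuffle(table)                       # random bijection
        functions[name] = dict(zip(symbols, table))
    x = rng.choice(symbols)
    ops = [rng.choice(list(functions)) for _ in range(n_ops)]
    out = x
    for f in ops:                                # apply functions left to right
        out = functions[f][out]
    forward = " ".join([x] + ops)                # e.g. '101 d a b'
    # the backward presentation would reverse the token order: 'b a d 101'
    return forward, out
```

Length generalization then simply means training on samples with few `n_ops` and testing with larger values.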


We consider four different baselines: an LSTM (Hochreiter and Schmidhuber, 1997), a DNC (Graves et al., 2016; Csordás and Schmidhuber, 2019), Universal Transformers (Vaswani et al., 2017; Dehghani et al., 2019), and their relative position variants (Csordás et al., 2021). For Transformers, the prediction is based on the last column in the final layer. Results are shown in Table 1. The LSTM and DNC perform well on the forward variant, achieving perfect generalization for longer sequences, but fail on the backward variant. In contrast, basic Transformers fail in both cases.

By introducing the copy gate (Sec. 2.1), the relative Transformer can solve the forward task, but not the backward one. Our analysis showed that the network learns to attend to the last operation based on the relative position information. Since the result is read from the last column, this position changes with the sequence length. The model thus fails to generalize to such arbitrary offsets. To address this issue, we introduce a simple mechanism to let the model choose between absolute and relative positional encodings at each position (see Appendix A). The resulting model effectively manages to use the absolute position for the prediction and perform well in both directions. However, such a combination of absolute/relative positional encoding might be an overly specific bias. A more generic solution, geometric attention (Sec. 2.2), also achieved perfect generalization and was found easier to train. We present the corresponding visualization of our model in Sec. 4.

Model                         IID Forward   IID Backward   Longer Forward   Longer Backward
LSTM                          1.00 ± 0.00   0.59 ± 0.03    1.00 ± 0.00      0.22 ± 0.03
DNC                           1.00 ± 0.00   0.57 ± 0.06    1.00 ± 0.00      0.18 ± 0.02
Transformer                   1.00 ± 0.00   0.82 ± 0.39    0.13 ± 0.01      0.12 ± 0.01
 + rel                        1.00 ± 0.00   1.00 ± 0.00    0.23 ± 0.05      0.13 ± 0.01
 + rel + gate                 1.00 ± 0.00   1.00 ± 0.00    0.99 ± 0.01      0.19 ± 0.04
 + abs/rel + gate             1.00 ± 0.00   1.00 ± 0.00    0.98 ± 0.02      0.98 ± 0.03
 + geom. att.                 0.96 ± 0.04   0.93 ± 0.06    0.16 ± 0.02      0.15 ± 0.02
 + geom. att. + gate (NDR)    1.00 ± 0.00   1.00 ± 0.00    1.00 ± 0.00      1.00 ± 0.00
Table 1: Accuracy on the compositional table lookup dataset (mean ± standard deviation).

3.2 Simple Arithmetic

In order to validate the success of the proposed model on a task that involves more complex data flows and operations, we propose the simple arithmetic task.


The task is to execute an arithmetic expression consisting of nested modulo-10 additions and multiplications. This requires the model to process tree-structured data flows, which is presumably more difficult than the sequential processing required for the CTL task. Each operation is surrounded by brackets, such that the boundaries of operations are easy to determine. For example, ‘((4*7)+2)’ should evaluate to ‘0’ (30 modulo 10). The expressions are generated randomly. The tree depth is up to 5 for the training set, 6 for the validation set, and 7-8 for the test set. The depth is measured on the tree of operations, ignoring the leaves; the example above thus has a depth of 2. The sequence length is limited to at most 50 tokens.
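The modulo-10 semantics can be pinned down with a tiny evaluator (a minimal recursive-descent sketch under our assumptions about the surface form, i.e., single-digit leaves and fully parenthesized binary operations):

```python
def eval_mod10(expr):
    """Evaluate a fully parenthesized mod-10 expression such as '((4*7)+2)'.
    Leaves are single digits; operators are '+' and '*', both taken modulo 10."""
    pos = 0

    def parse():
        nonlocal pos
        if expr[pos] != '(':          # leaf: a single digit
            d = int(expr[pos])
            pos += 1
            return d
        pos += 1                      # consume '('
        left = parse()
        op = expr[pos]                # '+' or '*'
        pos += 1
        right = parse()
        pos += 1                      # consume ')'
        res = left + right if op == '+' else left * right
        return res % 10               # every intermediate result stays in 0..9

    return parse()
```

Since every intermediate result is reduced modulo 10, the output alphabet is the same as the leaf alphabet, which keeps the task purely about compositional depth rather than number representation.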


Table 2 shows the results. Even though all considered models perform well on the IID validation data, none except the NDR, which achieves a near-perfect accuracy of 98%, performs well on the generalization test set. We also note that the NDR learns very quickly: while all other models require about 200 K steps to converge, the NDR achieves near-perfect accuracy after 50 K steps of training.

Model                         IID (1..5), 200 K   Test (7..8), 200 K   Test (7..8), 50 K
LSTM                          0.99 ± 0.00         0.74 ± 0.02          0.72 ± 0.01
Transformer                   0.98 ± 0.01         0.47 ± 0.01          0.29 ± 0.01
 + rel                        1.00 ± 0.00         0.77 ± 0.04          0.40 ± 0.05
 + abs/rel + gate             1.00 ± 0.01         0.80 ± 0.16          0.73 ± 0.15
 + geom. att. + gate (NDR)    1.00 ± 0.00         0.98 ± 0.01          0.98 ± 0.01
Table 2: Performance of different models on the simple arithmetic dataset (mean ± standard deviation). All models are trained for 200 K iterations, except the NDR, for which we stop training at 100 K. We also report the performance of all models after 50 K iterations, showing that the NDR converges significantly faster than the others.

3.3 ListOps

We also evaluate our model on a variant of the ListOps task (Nangia and Bowman, 2018) which is a popular task commonly used to evaluate parsing abilities of neural networks (Havrylov et al., 2019; Shen et al., 2019; Tay et al., 2021; Irie et al., 2021).


The task consists of executing nested list operations written in prefix notation. All operations have a list of arguments that can be either a digit (from 0 to 9) or, recursively, another operation with its own list of arguments. The operations are min, max, median and sum. The sum is modulo 10, and the median is followed by the floor function, such that the output of any function/operation lies between 0 and 9. For example: [MED 4 8 5 [MAX 8 4 9 ] ] should return 6. There are two well-known variants of ListOps: the original by Nangia and Bowman (2018) and the "Long Range Arena" variant by Tay et al. (2021), which differ from each other in the maximum number of arguments per function and the maximum number of tokens in a sequence. In both variants, there is no strict control of the depth of data samples: there is simply a certain pre-defined probability that each position/element in the list is expanded into another list (which may increase the tree depth). This is not suitable for evaluating systematic generalization in terms of compositionality (over the problem depth). We propose instead to generate clean train, valid, and test splits with disjoint depths. In our settings, the training set contains samples with depths of up to 5, the validation set only contains samples of depth 6, and the test set contains samples of depths 7 and 8. Importantly, we make sure that a depth-$n$ sample effectively requires computation until depth $n$ (otherwise min, max, and med operations could potentially find the output without checking/executing all of their arguments). By dissociating the splits by depth, we can clearly identify models which fail to generalize compositionally. Apart from the depth specifications, all train/valid/test sets share the same settings as follows: the maximum sequence length is 50 (tokens), the probability of recursively sampling another function inside a list is 30% at each position, and the maximum number of arguments for a function is 5.
The train set consists of 1M, the valid and test sets of 1K sequences.
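The semantics of our ListOps variant can be made precise with a small evaluator (our sketch; token conventions follow the example above, with the operator fused to its opening bracket and a separate closing token):

```python
def eval_listops(tokens):
    """Evaluate a tokenized ListOps expression in prefix notation, e.g.
    '[MED 4 8 5 [MAX 8 4 9 ] ]'.split().
    Sum is modulo 10 and the median is floored, so every result is in 0..9."""
    def med(xs):
        s = sorted(xs)
        n = len(s)
        m = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
        return int(m)  # floor keeps the result in 0..9

    ops = {'[MIN': min, '[MAX': max, '[MED': med,
           '[SM': lambda xs: sum(xs) % 10}
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in ops:
            args = []
            while tokens[pos] != ']':   # recurse over the argument list
                args.append(parse())
            pos += 1                    # consume ']'
            return ops[tok](args)
        return int(tok)                 # digit leaf

    return parse()
```

Note how the example from the text resolves: the inner [MAX 8 4 9 ] gives 9, so the outer call computes the floored median of [4, 8, 5, 9], which is 6.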


Table 3 shows the results. Like on compositional table lookup and simple arithmetic tasks, the baseline LSTM and Transformers do not generalize well on the test set consisting of deeper problems, while they achieve near-perfect accuracy on IID data. In contrast, our model achieves near-perfect generalization.

Model                         IID (1..5)    Test (7..8)
LSTM                          0.99 ± 0.00   0.71 ± 0.03
Transformer                   0.98 ± 0.00   0.74 ± 0.03
 + rel                        0.98 ± 0.01   0.79 ± 0.04
 + abs/rel + gate             1.00 ± 0.01   0.90 ± 0.06
 + geom. att. + gate (NDR)    1.00 ± 0.00   0.99 ± 0.01
Table 3: Performance of different models on the balanced ListOps dataset (mean ± standard deviation). All models are trained for 200 K iterations, except the +gate variants, which converge after 100 K steps. The numbers in parentheses indicate the problem depths (1-5 for the IID and 7-8 for the test set).

4 Analysis

In this section, we provide some visualizations of attention and gating patterns of the NDR and the corresponding analyses. For more visualizations, we refer the readers to Appendix C.

Compositional Table Lookup.

Figure 2 shows the gating and attention patterns of the NDR model for an example of the backward presentation task. As shown in Fig. 2/Bottom, the gates of different columns open sequentially one after another when the input is available for them. Fig. 2/Top shows the corresponding attention maps. Each column attends to the neighbouring one, waiting for its computation to be finished. The behavior of the last column is different: it always attends to the second position of the sequence, which corresponds to the last operation to be performed.

Figure 2: Example visualization of NDR. For other models, see Appendix C. Top: Attention map for different steps. The x/y-axis corresponds to source/target positions, respectively. Each position focuses on the column to the right, except the last one where the result is read from, which focuses on the last operation. The focus becomes clear only once the result is available. Bottom: gate activations for different steps/layers. The gates remain closed until the data dependencies are satisfied.
Figure 3: Example visualization of NDR on ListOps. The top row shows head 13 in different steps, which controls which arguments are used in which step. The bottom row shows different heads in different key steps. Please refer to Sec. 4 for the step-by-step description. More visualizations are provided in the appendix: Fig. 9 shows the max of attention over all heads for all steps, Fig. 10 shows all steps of head 13, and Fig. 11 shows the corresponding gates.


We can also identify how the NDR processes the data in ListOps. Different attention heads play different roles. We highlight the core observations in Figure 3. The input for this example is: [SM [MED [MIN 1 7 4 [MAX 2 4 0 8 9 ] ] 7 ] 5 [MED 8 5 8 ] 0 7 ]. First of all, we find that there is a head (head 13 in Figure 3, first row) which seems to be responsible for connecting operators and their arguments: the operands/arguments of an operation attend to the operator. In step 0 ($t=0$ in the figure), we can recognize that the operations at the deepest level, namely MAX and the second MED, have all their arguments ready (as shown by the vertical lines on the columns corresponding to MAX and MED). The model indeed identifies that these two operations are ready to be executed and that they can be processed in parallel (these arguments-to-operation attention patterns remain for a few steps). We note that at this stage, the last argument of MIN is not ready yet ([MIN 1 7 4 [MAX 2 4 0 8 9 ] ]); only the arguments which are already ready (1 7 4) attend to the operator (see the column of MIN). In step 1 ($t=1$, 2nd row), head 5 copies the expected result of MAX, 9, to the column of the operator (this only requires one step, as 9 is always the result of MAX whenever it is one of the arguments of MAX). Similarly, in step 2, head 7 (2nd row) seems to copy the result of the second MED, 8, to the operator column. In step 3 ($t=3$, 1st row), we recognize that the result of MAX is marked as an argument for MIN in head 13, which is responsible for communication between operators and their arguments. This is shown by the new attention which appears at $t=3$ in head 13 from the source position MAX to the target position MIN (a pattern which is not visible at $t=0$). In head 3 (2nd row), the expected result of MIN, which is 1, is copied to the operator, similarly to the patterns we observed above for MAX and MED. In head 13 (1st row), all arguments for the first MED are now also recognized (the result of MIN, which is 1, and 7). Finally (2nd row), two heads, head 3 and head 5, seem to copy/gather the two inputs needed to compute the corresponding median, 1 and 7, and store them in the column of the operator MED. A complete visualization of further steps can be found in Appendix C.2. We noticed that some of the heads do not seem to play a key role; we focused on interpreting those which seem to participate in the main computation. For ListOps, we also partially find the attention patterns described above in the baseline Transformer with relative positional encoding, at least on some inspected examples, which also explains its rather high accuracy.

5 Discussion

Learning adaptive serialization.

The NDR architecture can be understood as performing adaptive serialization of the problem. A key requirement for reusable computation is decomposing the problem into reusable building blocks, typically applied in sequential steps. The granularity of the decomposition determines the degree of reusability: fusing operations in a single step makes the processing faster (fewer steps), but also more specialized. Learning the most granular solutions is thus preferable for generalization. At the same time, not all processing should happen serially: branches of the computational graph that do not have common data dependencies can be processed independently in parallel, which we empirically observe in our NDR in the ListOps example (Sec. 4). This enables the architecture to get away with a number of computational steps reflecting the depth of the computational graph rather than the length of the input.

Bottom up approach for improving model architectures.

Transformers have seen tremendous successes across various application domains (Devlin et al., 2019; Brown et al., 2020; Dosovitskiy et al., 2021). Impressive results have been reported when they are scaled up with a large amount of data (Brown et al., 2020). On the other hand, simple tasks like those highlighted in the present work demonstrate that the Transformer architecture still struggles with basic reasoning. Particularly in algorithmic tasks, a sub-optimal choice of architecture or optimization method often makes the model fall back to simple memorization. We argue that it is crucial to look at isolated problems which test specific generalization capabilities. This calls for a bottom-up approach: building on toy tasks that focus on individual aspects of generalization and using them to improve models.

6 Related Work

Gating inside Transformers.

Several prior works have proposed to use some sort of gating within Transformer architectures (Parisotto et al., 2020; Chaabouni et al., 2021). Our proposed copy gate is different from those, as it satisfies two important properties. First, our copy gate allows the model to skip the entire Transformer layer (i.e., both the self-attention and the feedforward blocks) when the gate is closed. Second, the gate function is conditioned on the attention output, such that the decision to open or close depends on information from all columns. While multiple gating variants have been proposed by Parisotto et al. (2020) to stabilize Transformers for reinforcement learning, none of them can produce this behavior. Empirically, we also tried out a few other gating variants which do not satisfy the two properties above; we found that they do not improve over regular Transformers in our preliminary experiments on compositional table lookup. Recent work by Chaabouni et al. (2021) also makes use of "gating" in Transformers through a gated linear unit (GLU) activation function commonly used in convolutional NNs (Dauphin et al., 2017). Transformer models with such an activation function were reported to outperform RNN baselines on a systematic generalization task (Dessì and Baroni, 2019). Unlike our copy gate or Parisotto et al. (2020)'s gating, such a gating activation does not have the "residual" term (i.e., a closed gate zeros out the input), which would allow the model to skip a transformation. In a more general context, the benefits of the GLU activation in Transformers vary across tasks (Irie et al., 2019; Shazeer, 2020). In language modeling, no improvement is typically obtained by using the standard highway gate instead of the residual connection in Transformers (Irie, 2020), while it yields improvements when combined with convolutional layers (Kim and Rush, 2016).

Parameterized geometric distributions.

Two recent works (Brooks et al., 2021; Banino et al., 2021) have used a form of parameterized geometric distribution (PGD; in the form of Eq. 7). Brooks et al. (2021) have used such a distribution to parameterize the movement of a pointer on a sequence of instructions. Banino et al. (2021) have used it to implement adaptive computation time (Schmidhuber, 2012; Graves, 2016). We use the PGD to obtain a generic attention mechanism as a replacement of the standard self-attention used in Transformers (Vaswani et al., 2017).

Compositional table lookup.

CTL was proposed as a task for evaluating the compositional ability of neural networks (Liska et al., 2018). Previous works evaluated RNNs, RNNs with attention, and Transformer architectures on this task with limited success (Hupkes et al., 2019; Dubois et al., 2020). Dubois et al. (2020) have proposed a special attention mechanism to augment the recurrent architecture. While they obtained good performance for the forward presentation order, the proposed model failed in the backward one. In contrast, two of our approaches (Sec. 3.1) achieve 100% generalization accuracy on this task for both orders.

Positional encodings.

Many previous works have focused on improving positional encoding (Schmidhuber, 1992; Vaswani et al., 2017) for self-attention. Most notably, the relative positional encoding (Schmidhuber, 1992; Shaw et al., 2018; Dai et al., 2019) was found useful for improving systematic generalization of Transformers (Csordás et al., 2021). Here we also present two new approaches related to positional encoding. One is the gated combination of absolute and relative positional encoding (Sec. 3.1; details in Appendix A). We show that absolute positional encoding can complement relative positional encoding. The former enables the model to always attend to a specific position, as is needed for the CTL task in the last step, while the gating allows it to use relative positional encoding for other positions/steps. Second, we introduce directional encoding to augment geometric attention. Unlike positional encoding which can overfit to a range of positions seen during training, the direction information is found to be robust and to be a crucial augmentation of the geometric attention.

7 Conclusion

We proposed a new view on the internal operations of Transformers as a dynamic dataflow architecture between Transformer columns. This overcomes two shortcomings of traditional Transformers: the problem of routing and retaining data in an unaltered fashion, which we solve by an additional copy gate, and the problem of learning length-independent attention patterns, which we solve by geometric attention. Our new model, the Neural Data Router (NDR), generalizes to compositions longer than those seen during training on the popular compositional table lookup task in both forward and backward directions. NDR also achieves near-perfect performance on simple arithmetic and ListOps tasks in settings that test systematic generalization in terms of computational depth. In general, the gates and the attention maps collectively make the architecture more interpretable than the baselines.


We thank Imanol Schlag and Sjoerd van Steenkiste for helpful discussions and suggestions on an earlier version of the manuscript. This research was partially funded by ERC Advanced grant no: 742870, project AlgoRNN, and by Swiss National Science Foundation grant no: 200021_192356, project NEUSYM. We are thankful for hardware donations from NVIDIA & IBM. The resources used for the project were partially provided by Swiss National Supercomputing Centre (CSCS) project s1023.


  • D. Bahdanau, H. de Vries, T. J. O’Donnell, S. Murty, P. Beaudoin, Y. Bengio, and A. Courville (2019) CLOSURE: assessing systematic generalization of CLEVR models. In ViGIL workshop, NeurIPS, Vancouver, Canada. Cited by: §1.
  • A. Banino, J. Balaguer, and C. Blundell (2021) PonderNet: learning to ponder. Preprint arXiv:2107.05407. Cited by: §2.2, §6.
  • E. A. Brooks, J. Rajendran, R. L. Lewis, and S. Singh (2021) Reinforcement learning of implicit and explicit control flow instructions. In Proc. Int. Conf. on Machine Learning (ICML), Virtual only, pp. 1082–1091. Cited by: §2.2, §6.
  • T. B. Brown et al. (2020) Language models are few-shot learners. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Virtual only. Cited by: §5.
  • R. Chaabouni, R. Dessì, and E. Kharitonov (2021) Can transformers jump around right in natural language? Assessing performance transfer from scan. Preprint arXiv:2107.01366. Cited by: §1, §2.1, §6.
  • X. Chen, C. Liang, A. W. Yu, D. Song, and D. Zhou (2020) Compositional generalization via neural-symbolic stack machines. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Virtual only. Cited by: §1.
  • R. Csordás, K. Irie, and J. Schmidhuber (2021) The devil is in the detail: simple tricks improve systematic generalization of Transformers. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic. Cited by: §1, §1, §3.1, §6.
  • R. Csordás and J. Schmidhuber (2019) Improving differentiable neural computers through memory masking, de-allocation, and link distribution sharpness control. In Int. Conf. on Learning Representations (ICLR), New Orleans, LA, USA. Cited by: §B.2, §3.1.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proc. Association for Computational Linguistics (ACL), Florence, Italy, pp. 2978–2988. Cited by: Appendix A, §2.2, §6.
  • Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In Proc. Int. Conf. on Machine Learning (ICML), Sydney, Australia, pp. 933–941. Cited by: §6.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal Transformers. In Int. Conf. on Learning Representations (ICLR), New Orleans, LA, USA. Cited by: §2, §3.1.
  • R. Dessì and M. Baroni (2019) CNNs found to jump around more skillfully than RNNs: compositional generalization in seq2seq convolutional networks. In Proc. Association for Computational Linguistics (ACL), Florence, Italy, pp. 3919–3923. Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional Transformers for language understanding. In Proc. North American Chapter of the Association for Computational Linguistics on Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, pp. 4171–4186. Cited by: §5.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In Int. Conf. on Learning Representations (ICLR), Virtual only. Cited by: §5.
  • Y. Dubois, G. Dagan, D. Hupkes, and E. Bruni (2020) Location attention for extrapolation to longer sequences. In Proc. Association for Computational Linguistics (ACL), Virtual only, pp. 403–413. Cited by: §1, §1, §3.1, §6.
  • J. A. Fodor, Z. W. Pylyshyn, et al. (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1-2), pp. 3–71. Cited by: §1.
  • J. Fodor and B. P. McLaughlin (1990) Connectionism and the problem of systematicity: why Smolensky’s solution doesn’t work. Cognition 35 (2), pp. 183–204. Cited by: §1.
  • D. Furrer, M. van Zee, N. Scales, and N. Schärli (2020) Compositional generalization in semantic parsing: pre-training vs. specialized architectures. Preprint arXiv:2007.08970. Cited by: §1.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. P. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. Cited by: §3.1.
  • A. Graves (2016) Adaptive computation time for recurrent neural networks. In Int. Conf. on Learning Representations (ICLR), Workshop Track, Vancouver, Canada. Cited by: §6.
  • K. Greff, S. van Steenkiste, and J. Schmidhuber (2020) On the binding problem in artificial neural networks. Preprint arXiv:2012.05208. Cited by: §1.
  • S. J. Hanson (1990) A stochastic version of the delta rule. Physica D: Nonlinear Phenomena 42 (1-3), pp. 265–272. Cited by: §B.2.
  • S. Havrylov, G. Kruszewski, and A. Joulin (2019) Cooperative learning of disjoint syntax and semantics. In Proc. North American Chapter of the Association for Computational Linguistics on Human Language Technologies (NAACL-HLT), Minneapolis, USA, pp. 1118–1128. Cited by: §3.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778. Cited by: §2.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation, pp. 1735–1780. Cited by: §2.1, §3.1.
  • D. Hupkes, V. Dankers, M. Mul, and E. Bruni (2020) Compositionality decomposed: how do neural networks generalise?. Journal of Artificial Intelligence Research, pp. 757–795. Cited by: §1.
  • D. Hupkes, A. Singh, K. Korrel, G. Kruszewski, and E. Bruni (2019) Learning compositionally through attentive guidance. In Proc. Int. Conf. on Computational Linguistics and Intelligent Text Processing, La Rochelle, France. Cited by: §1, §3.1, §3, §6.
  • K. Irie, I. Schlag, R. Csordás, and J. Schmidhuber (2021) Going beyond linear Transformers with recurrent fast weight programmers. Preprint arXiv:2106.06295. Cited by: §3.3.
  • K. Irie, A. Zeyer, R. Schlüter, and H. Ney (2019) Language modeling with deep Transformers. In Proc. Interspeech, Graz, Austria, pp. 3905–3909. Cited by: §6.
  • K. Irie (2020) Advancing neural language modeling in automatic speech recognition. Ph.D. Thesis, Computer Science Department, RWTH Aachen University, Aachen, Germany. Cited by: §6.
  • Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush (2016) Character-aware neural language models. In Proc. AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA. Cited by: §6.
  • K. Korrel, D. Hupkes, V. Dankers, and E. Bruni (2019) Transcoding compositionally: using attention to find more generalizable solutions. In Proc. BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, ACL, Florence, Italy, pp. 1–11. Cited by: §1.
  • B. M. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In Proc. Int. Conf. on Machine Learning (ICML), Stockholm, Sweden, pp. 2873–2882. Cited by: §1.
  • B. M. Lake (2019) Compositional generalization through meta sequence-to-sequence learning. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, pp. 9788–9798. Cited by: §1.
  • Y. Li, L. Zhao, J. Wang, and J. Hestness (2019) Compositional generalization for primitive substitutions. In Proc. Conf. on Empirical Methods in Natural Language Processing and Int. Joint Conf. on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4292–4301. Cited by: §1.
  • A. Liska, G. Kruszewski, and M. Baroni (2018) Memorize or generalize? Searching for a compositional RNN in a haystack. In AEGAP Workshop ICML, Stockholm, Sweden. Cited by: §1, §1, §1, §3.1, §3, §6.
  • Q. Liu, S. An, J. Lou, B. Chen, Z. Lin, Y. Gao, B. Zhou, N. Zheng, and D. Zhang (2020) Compositional generalization by learning analytical expressions. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Virtual only. Cited by: §1.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In Int. Conf. on Learning Representations (ICLR), New Orleans, LA, USA. Cited by: §B.2.
  • N. Nangia and S. R. Bowman (2018) ListOps: A diagnostic dataset for latent tree learning. In Proc. North American Chapter of the Association for Computational Linguistics on Human Language Technologies (NAACL-HLT), New Orleans, USA, pp. 92–99. Cited by: §1, §3.3, §3.3, §3.
  • S. Ontañón, J. Ainslie, V. Cvicek, and Z. Fisher (2021) Making Transformers solve compositional tasks. Preprint arXiv:2108.04378. Cited by: §1.
  • E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, Ç. Gülçehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, M. Botvinick, N. Heess, and R. Hadsell (2020) Stabilizing Transformers for reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), Vol. 119, Virtual only, pp. 7487–7498. Cited by: §2.1, §6.
  • J. Russin, J. Jo, R. C. O’Reilly, and Y. Bengio (2019) Compositional generalization in a deep seq2seq model by separating syntax and semantics. Preprint arXiv:1904.09708. Cited by: §1.
  • J. Schmidhuber (1992) Learning complex, extended sequences using the principle of history compression. Neural Computation 4 (2), pp. 234–242. Cited by: §2.2, §6.
  • J. Schmidhuber (2012) Self-delimiting neural networks. Technical Report IDSIA-08-12, arXiv:1210.0118v1, The Swiss AI Lab IDSIA. Cited by: §6.
  • P. Shaw, M. Chang, P. Pasupat, and K. Toutanova (2020) Compositional generalization and natural language variation: can a semantic parsing approach handle both?. Preprint arXiv:2010.12725. Cited by: §1.
  • P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. In Proc. North American Chapter of the Association for Computational Linguistics on Human Language Technologies (NAACL-HLT), New Orleans, Louisiana, USA, pp. 464–468. Cited by: §6.
  • N. Shazeer (2020) GLU variants improve transformer. Preprint arXiv:2002.05202. Cited by: §6.
  • Y. Shen, S. Tan, S. A. Hosseini, Z. Lin, A. Sordoni, and A. C. Courville (2019) Ordered memory. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Vancouver, Canada, pp. 5038–5049. Cited by: §3.3.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §B.2.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. In Proc. Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, pp. 2368–2376. Cited by: §2.1.
  • Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler (2021) Long Range Arena: a benchmark for efficient transformers. In Int. Conf. on Learning Representations (ICLR), Virtual only. Cited by: §3.3, §3.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, pp. 5998–6008. Cited by: Appendix A, §1, §2.1, §2.2, §3.1, §6, §6.
  • G. Weiss, Y. Goldberg, and E. Yahav (2021) Thinking like Transformers. In Proc. Int. Conf. on Machine Learning (ICML), Virtual only, pp. 11080–11090. Cited by: §1.

Appendix A Details of Attention with Combined Absolute/Relative Positional Encoding

The use of copy gates enables Transformers to generalize to longer lengths in the forward presentation order of the CTL task (Sec. 3.1), but that alone is not enough to make the model generalize in the backward-order variant of the task. Examining the attention maps reveals that the model uses position-based attention to read out the result instead of content-based attention. In the backward presentation order, the last column of the Transformer should focus on the second column, whose relative position changes dynamically with the length of the sequence. We solve this issue by giving the attention head the option to choose between absolute and relative positional encodings.
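The copy-gate mechanism referenced above can be sketched as follows (function and parameter names are illustrative; in the actual model, the candidate update comes from the attention and feedforward path, and the gate is a learned function of it): a closed gate keeps the column state unchanged across a step, which is what lets a column "wait" until its operands are ready.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def copy_gate_update(h_prev, candidate, Wg, bg):
    """Gated column update: gate near 0 copies the previous state through
    unchanged, gate near 1 overwrites it with the candidate value."""
    g = sigmoid(candidate @ Wg + bg)  # gate computed from the candidate update
    return g * candidate + (1.0 - g) * h_prev

d = 4
h_prev = np.ones(d)        # previous column state
candidate = np.zeros(d)    # proposed update from attention/FFN (illustrative)
Wg = np.zeros((d, d))
closed = copy_gate_update(h_prev, candidate, Wg, bg=-10.0)  # gate ~ 0: copy
```

A strongly negative gate bias (as in `bg=-10.0` here) keeps gates closed by default, so columns only update when the network actively opens them.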

In what follows, we describe the operation within a single layer/step. This allows us to omit the layer/step index for better readability, and thus denote the state of column/position $i$ as $\mathbf{h}_i$ instead of $\mathbf{h}_i^{(t)}$. We use the relative positional embedding variant of self-attention by Dai et al. (2019). Our attention matrix $\hat{A}$ with the gated absolute/relative positional encodings can be decomposed as follows:

$$\hat{A}_{i,j} = \underbrace{\mathbf{h}_i^\top \mathbf{W}_q^\top \mathbf{W}_{k,E}\,\mathbf{h}_j}_{(a)} + \underbrace{\mathbf{h}_i^\top \mathbf{W}_q^\top \mathbf{W}_{k,P}\,\tilde{\mathbf{p}}_{i,j}}_{(b)} + \underbrace{\mathbf{u}^\top \mathbf{W}_{k,E}\,\mathbf{h}_j}_{(c)} + \underbrace{\mathbf{v}^\top \mathbf{W}_{k,P}\,\tilde{\mathbf{p}}_{i,j}}_{(d)}, \qquad \underbrace{\tilde{\mathbf{p}}_{i,j} = g_i\,\mathbf{p}_j + (1-g_i)\,\mathbf{p}_{i-j}}_{(e)}$$

where the matrix $\mathbf{W}_q$ maps the states to queries, $\mathbf{W}_{k,E}$ maps states to keys, while $\mathbf{W}_{k,P}$ maps positional embeddings to keys. $d_{\text{head}}$ is the size of the key, query and value vectors for each head, set as $d_{\text{head}} = d_{\text{model}}/n_{\text{heads}}$. $\mathbf{u}, \mathbf{v} \in \mathbb{R}^{d_{\text{head}}}$ are learned vectors. $\mathbf{p}_j$ is the standard sinusoidal embedding for position $j$ (Vaswani et al., 2017). Softmax is applied to the second dimension of $\hat{A}$ to obtain the final attention scores, $A$. Component (a) corresponds to content-based addressing, (b, e) to content-based positional addressing, (c) represents a global content bias, while (d, e) represent a global position bias.

We introduce term (e), the gated positional embedding $\tilde{\mathbf{p}}_{i,j} = g_i\,\mathbf{p}_j + (1-g_i)\,\mathbf{p}_{i-j}$, which can switch between absolute and relative positional encodings using the scalar gate $g_i$ (Eq. 11; parameterized by $\mathbf{w}_g$ and $b_g$), which is a function of the state at the target position $i$.
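The decomposition above can be sketched numerically (a simplified, single-head NumPy illustration; all weight names, scales, and the exact gate parameterization are assumptions for this sketch, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and per-head key size (illustrative)

def sinusoidal(pos, d):
    # standard sinusoidal embedding; also valid for negative relative positions
    i = np.arange(d // 2)
    ang = pos / (10000.0 ** (2.0 * i / d))
    return np.concatenate([np.sin(ang), np.cos(ang)])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

H = 0.5 * rng.standard_normal((n, d))                  # column states h_i
Wq, Wke, Wkp = (0.3 * rng.standard_normal((d, d)) for _ in range(3))
u, v = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
Wg, bg = 0.3 * rng.standard_normal(d), 0.0             # gate parameters

A = np.zeros((n, n))
for i in range(n):
    g = sigmoid(H[i] @ Wg + bg)                        # scalar gate per target i
    q = Wq @ H[i]
    for j in range(n):
        # term (e): interpolate absolute p_j and relative p_{i-j}
        p = g * sinusoidal(j, d) + (1.0 - g) * sinusoidal(i - j, d)
        A[i, j] = (q @ (Wke @ H[j])                    # (a) content addressing
                   + q @ (Wkp @ p)                     # (b,e) content-positional
                   + u @ (Wke @ H[j])                  # (c) global content bias
                   + v @ (Wkp @ p))                    # (d,e) global position bias

A = A / np.sqrt(d)
A = A - A.max(axis=1, keepdims=True)                   # stable softmax over sources
att = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
```

With the gate saturated at 1, every target position attends by absolute position (as the last CTL step requires); with the gate at 0, the scores reduce to the standard relative-position decomposition.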

Appendix B Implementation Details

A PyTorch implementation of our models together with the experimental setup is publicly available. The performance of all models is reported as mean and standard deviation over 5 different seeds.

B.1 Dataset Details

Compositional table lookup.

Our implementation uses 8 symbols as input arguments and 9 randomly sampled bijective functions, denoted by lowercase letters of the English alphabet. All functions are included in the train set in combination with all possible input symbols. The rest of the training set consists of random combinations of functions applied to a random symbol as an argument, up to length 5. The total size of the train set is 53,704 samples. The samples are roughly balanced such that there is a similar number of samples for each depth. There are two validation sets: an IID set, which matches the distribution of the train set, and a depth validation set, which includes samples of lengths 6, 7 and 8. The test set consists of sequences of lengths 9 and 10.
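The construction can be sketched as follows (a simplified, hypothetical generator; the real dataset additionally balances depths and separates splits by length):

```python
import random

random.seed(0)
SYMBOLS = list(range(8))          # 8 input argument symbols
FUNC_NAMES = "abcdefghi"          # 9 function names (lowercase letters)
# each function is a random bijection (permutation) over the symbols
FUNCS = {f: random.sample(SYMBOLS, len(SYMBOLS)) for f in FUNC_NAMES}

def apply_chain(arg, funcs):
    """Evaluate a composition such as c(b(a(x))): apply the listed
    function names to the argument in order."""
    val = arg
    for f in funcs:
        val = FUNCS[f][val]
    return val

def sample(depth):
    # a composition of `depth` random functions applied to a random symbol
    funcs = [random.choice(FUNC_NAMES) for _ in range(depth)]
    arg = random.choice(SYMBOLS)
    return arg, funcs, apply_chain(arg, funcs)
```

Because each function is a permutation, every composition is itself a bijection over the 8 symbols, which is what makes the task solvable step by step in either presentation order.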

Simple arithmetic.

The dataset is constructed by sampling random digits (0–9) and the operations + (add) and ∗ (multiply). The operations are performed modulo 10. Parentheses surround the arguments of each operation. The depth of the resulting tree is computed, and rejection sampling is used to ensure that the same number of samples from each depth is present in the given split. The maximum length of samples is 50 tokens; sub-operations are sampled with probability 0.2. 100K samples are used for training, 1K each for the test and validation sets. The train set consists of 0–5 operations, the validation set of 6, and the test set of 7 operations.
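A compact sketch of this sampling procedure (the helper name is hypothetical; the rejection sampling over depth and the 50-token cap are omitted):

```python
import random

def sample_expr(max_depth, p_sub=0.2):
    """Return (expression, value): digits 0-9 combined with '+' and '*'
    modulo 10, with parentheses around every operation; sub-expressions
    recurse with probability p_sub."""
    if max_depth == 0 or random.random() >= p_sub:
        d = random.randint(0, 9)
        return str(d), d
    op = random.choice("+*")
    left_e, left_v = sample_expr(max_depth - 1, p_sub)
    right_e, right_v = sample_expr(max_depth - 1, p_sub)
    val = (left_v + right_v) % 10 if op == "+" else (left_v * right_v) % 10
    return f"({left_e}{op}{right_e})", val
```

Since reduction modulo 10 commutes with both addition and multiplication, applying the modulo at every node gives the same result as evaluating the whole expression first and reducing once at the end.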


ListOps.

Random digits are sampled from the range 0–9. Operations are sampled from the following set: sum modulo 10 (SM), min (MIN), max (MAX), and median followed by the floor function (MED). The maximum number of arguments for each operation is 5. A sub-operation is sampled with probability 0.3. 1M samples are used for training, 1K each for the test and validation sets. The train set consists of 0–5 operations, the validation set of 6, and the test set of 7 operations.

For each sample, we calculate a number which we call the dependency depth. To understand it, note that the MIN and MAX operations select only one of their operands, and MED selects one or two; in SM, all operands are needed to perform the operation. If we construct a parse tree, prune away the branches that were not selected by any operation, and measure the depth of the resulting tree, we obtain the dependency depth. This ensures that the deeper parts of the tree contribute to the result, preventing shallow heuristics such as ignoring all branches of the tree that are too deep while still producing the correct result with high probability. We also ensure that the number of samples is the same for all possible dependency depths in each split.

B.2 Model Details

We use the AdamW optimizer (Loshchilov and Hutter, 2019) for all of our models. Standard hyperparameters are listed in Tab. 4, 5 and 6. Additionally, dropout (Hanson, 1990; Srivastava et al., 2014) of 0.1 is applied to the content-based query and the position-query components for most models with gating, except for non-gated Transformers on ListOps, where this value is 0.05. In the case of geometric attention, since the channels of the positional encoding do not have any redundancy, dropout is applied only to the content-based query.

In the case of Transformers with the copy gate but without geometric attention, we use instead of in Eq. 2.

The hyperparameters of the gateless Transformers differ significantly from those of the gated ones. This is because the gateless models were very hard to train to good performance even on the IID set, requiring extensive hyperparameter optimization. One might argue that fewer layers make them less competitive on longer sequences; however, we were unable to train them to perform well even on IID data with comparable model sizes.

All Transformer variants have a begin (B) and end (E) token included on sequence boundaries. RNNs (LSTM and DNC) have no such tokens. All Transformers are encoders only, and the results are read from the last column (corresponding to the end token).

The DNC has 21 memory cells, 4 read heads, and an LSTM controller. It contains recently introduced improvements (Csordás and Schmidhuber, 2019).

We use gradient clipping with magnitude 5 (for CTL) or 1 (for simple arithmetic and ListOps) for all of our models.

Hyperparameters were obtained by a Bayesian hyperparameter search using Weights & Biases over the systematically different (OOD) validation set for the + abs/rel + gate models and were reused for all other gated models. For the non-gated models, we used the + rel variant for tuning. It was not possible to tune the baselines using only the OOD validation set because their performance on it was too poor. We thus used a mixture of the IID and OOD validation sets to tune the hyperparameters of the baselines.

We train all models for a fixed number of iterations and measure their validation performance every 1000 iterations. For each model, we select the best checkpoint according to the validation performance, and report its test accuracy.

| Model | d_model | d_FF | n_heads | n_layers | batch s. | learning rate | wd. | do. | n_iter |
| LSTM | 200 | - | - | 1 | 256 | | - | 0.5 | 20K |
| DNC | 200 | - | - | 1 | 256 | | - | 0.5 | 20K |
| Transformer | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.1 | 30K |
|  + rel | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.1 | 30K |
|  + rel + gate | 256 | 1024 | 1 | 14 | 512 | | 0.01 | 0.5 | 30K |
|  + abs/rel + gate | 256 | 1024 | 1 | 14 | 512 | | 0.01 | 0.5 | 30K |
|  + geom. att. | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.1 | 30K |
|  + geom. att. + gate | 256 | 512 | 1 | 14 | 512 | | 0.01 | 0.5 | 30K |
Table 4: Hyperparameters used for different models on the compositional table lookup task. We denote the model size as d_model, the feedforward size as d_FF, weight decay as "wd.", dropout as "do.". Each model is trained for n_iter iterations.
| Model | d_model | d_FF | n_heads | n_layers | batch s. | learning rate | wd. | do. | n_iter |
| LSTM | 200 | - | - | 2 | 256 | | - | 0.5 | 200K |
| Transformer | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.5 | 200K |
|  + rel | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.5 | 200K |
|  + abs/rel + gate | 256 | 1024 | 4 | 15 | 512 | | 0.01 | 0.5 | 100K |
|  + geom. att. + gate | 256 | 1024 | 4 | 15 | 512 | | 0.01 | 0.5 | 100K |
Table 5: Hyperparameters used for different models on the simple arithmetic task. We denote the model size as d_model, the feedforward size as d_FF, weight decay as "wd.", dropout as "do.". Each model is trained for n_iter iterations.
| Model | d_model | d_FF | n_heads | n_layers | batch s. | learning rate | wd. | do. | n_iter |
| LSTM | 512 | - | - | 4 | 512 | | 0.08 | 0.1 | 200k |
| Transformer | 256 | 1024 | 16 | 6 | 512 | | 0.05 | 0.015 | 200k |
|  + rel | 256 | 1024 | 16 | 6 | 512 | | 0.05 | 0.015 | 200k |
|  + abs/rel + gate | 512 | 2048 | 16 | 20 | 512 | | 0.09 | 0.1 | 100k |
|  + geom. att. + gate | 512 | 1024 | 16 | 20 | 512 | | 0.09 | 0.1 | 100k |
Table 6: Hyperparameters used for different models on the ListOps task. We denote the model size as d_model, the feedforward size as d_FF, weight decay as "wd.", dropout as "do.". Each model is trained for n_iter iterations.

Appendix C Additional Analysis

C.1 Compositional Table Lookup

In the main text, we only had space to show the gate and attention activity of the NDR for a few timesteps. Here we show the corresponding visualizations of all steps in Figures 7 and 8, as well as the attention map of the baseline Transformer with relative positional encoding in Figure 4. We also show the Transformer + abs/rel + gate variant in Fig. 5 and Fig. 6. Please refer directly to the captions of the figures for the corresponding analysis. In general, the visualizations for our NDR and the abs/rel + gate variant are easily interpretable, unlike those of the baseline Transformer model.

Figure 4: Attention map for every computational step for a baseline Transformer with relative positional encoding on CTL. The attention pattern gets blurry very quickly, and the model does not generalize to longer sequences.
Figure 5: Attention map for every computational step for a Transformer with gating and relative/absolute positional encoding (presented in Figure 2) on CTL. The attention pattern is relatively stable over time, and it gets blurrier only after the given column is processed and updated. The gate sequence for the same input can be seen in Figure 6.
Figure 6: Gates for every computational step for a Transformer with gating and relative/absolute positional encoding on CTL. The gates are closed until all arguments of the given operation become available. The attention maps for the same input can be seen in Figure 5.
Figure 7: Attention map for every computational step of the NDR on CTL. The network correctly and clearly focuses on the last element of the sequence, and the last sharp read happens in step 10, corresponding to the 10 function calls in the example. The gate sequence for the same input can be seen in Figure 8.
Figure 8: Gates for every computational step of the NDR on CTL. The gates remain closed until all arguments of the given operations become available. The attention maps for the same input can be seen in Figure 7.

C.2 ListOps

Figures 9 and 11 show the attention and gate patterns of our NDR architecture on an example from the ListOps dataset. We highlighted notable attention patterns in Sec. 4.

Different heads seem to specialize in different functions. As already mentioned in Sec. 4, head 13 of the NDR architecture, shown in Figure 10, seems to specialize in selecting which arguments belong to which operator.

The gating patterns are also very interesting. In the early stages, the deepest parts of the input are updated: [MAX 2 4 0 8 9] and [MED 8 5 8], which are independent branches of the parse tree and can be processed in parallel. In the following steps, the update patterns spread up the parse tree, updating the operations whose arguments have become available. In this task, the output is read from the first column, which is written at a very late stage.

Figure 9: Attention maps for every computational step of the NDR on ListOps. The network has 16 heads; the max of them is shown. The input has only depth 4, which explains the early stopping of the computation, roughly after 8-9 steps, after which the attention barely changes. The corresponding gate maps for the same input can be seen in Figure 11.
Figure 10: Attention maps for head 13 of the NDR in every computational step on ListOps. This head shows the operands for each operation. Following it, we observe the hierarchy and the order in which the operations are performed.
Figure 11: Gates for every computational step of the NDR on ListOps. Gates open for the deepest operations in the tree, processing proceeds upwards in the computational tree. The input has only depth 4, which explains the early stopping of the computation, roughly after 8-9 steps. The attention maps for the same input can be seen in Figure 9.