ndr
The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization".
Despite successes across a broad range of applications, Transformers have limited success in systematic generalization. The situation is especially frustrating in the case of algorithmic tasks, where they often fail to find intuitive solutions that route relevant information to the right node/operation at the right time in the grid represented by Transformer columns. To facilitate the learning of useful control flow, we propose two modifications to the Transformer architecture: a copy gate and geometric attention. Our novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depth. NDR's attention and gating patterns tend to be interpretable as an intuitive form of neural routing. Our code is public.
Neural networks (NNs) may easily learn certain training sets, but typically they do not generalize on systematically different test sets. Examples of systematic generalization (Fodor et al., 1988) include generalization to sequences longer than those seen during training—productivity, and algorithmic combinations of previously learned rules—systematicity. Despite recent efforts (Bahdanau et al., 2019; Korrel et al., 2019; Lake, 2019; Li et al., 2019; Russin et al., 2019; Csordás et al., 2021), systematic generalization generally remains unsolved (Fodor and McLaughlin, 1990; Lake and Baroni, 2018; Liska et al., 2018; Greff et al., 2020; Hupkes et al., 2020). On some datasets, the best performing models are neuro-symbolic hybrids (Chen et al., 2020; Liu et al., 2020) using task-specific symbolic functions. However, their applicability to other datasets remains limited (Furrer et al., 2020; Shaw et al., 2020). A big question is: which type of architectural inductive bias encourages the training process to select “good” solutions which generalize systematically?
The popular Transformers (Vaswani et al., 2017) also often fail to generalize on algorithmic tasks (e.g. Liska et al. (2018); Dubois et al. (2020); Chaabouni et al. (2021); Csordás et al. (2021); Ontañón et al. (2021)), even on tasks with intuitive solutions that can be simply expressed in terms of Transformer attention patterns. Given an input sequence of length $N$ and a Transformer encoder of depth $T$, solving an algorithmic task is often all about routing the relevant information to the right node/operation at the right time in the $T$-by-$N$ grid represented by Transformer columns. Effectively the task is to learn to draw an adaptive control flow on the canvas of Transformer columns. In fact, recent work by Weiss et al. (2021) introduced a programming language called RASP, which is specifically designed to express solutions to sequence processing problems, and which has a direct equivalent to the operations in Transformer encoders. However, this work also shows that Transformers learn solutions expressed in RASP only through intermediate supervision of attention patterns. In some cases, even such supervision fails (Weiss et al., 2021). Generally speaking, Transformers fail to find easily interpretable and/or symbolic solutions to algorithmic tasks. We conversely hypothesize that attention-based NNs that are able to find intuitive solutions (achieving interpretable attention patterns) could improve systematic generalization.
Here we point out that regular Transformers lack some basic ingredients for learning such “intuitive” solutions to algorithmic problems. As a remedy, we propose simple architectural modifications to help them learn data routing. As a first step towards validating our model, we focus on the popular length generalization task of compositional table lookup (CTL; Liska et al. (2018); Hupkes et al. (2019); Dubois et al. (2020)), as well as two more complex tasks: a simple arithmetic task and a variant of ListOps (Nangia and Bowman, 2018) designed to test the compositional generalization ability of NNs. Our novel Neural Data Router (NDR) achieves 100% generalization accuracy (never reported before; Dubois et al. (2020)) on the CTL task, and obtains nearly perfect accuracy on both the proposed simple arithmetic and ListOps tasks. We show that the attention and gating patterns of NDR tend to be interpretable as plausible control flows.
We argue that the following components are needed to build Transformers capable of learning adaptive control flow. First, composing known operations in an arbitrary order requires that all operations are available at every computational step. This can be easily achieved by sharing the weights of the layers, as is done in Universal Transformers (Dehghani et al., 2019). Second, the network should be sufficiently deep, at least as deep as the deepest data dependency in the computational graph (e.g., in the case of a parse tree, this is the depth of the tree). Otherwise, multiple operations would be fused into a single layer and hinder natural and elegant compositions. Third, inputs in some columns should be kept unchanged until it is their turn to be processed. The regular Transformer lacks a mechanism for skipping the whole transformation step by simply copying the input to the next step/layer. We propose a special gating function, copy gate, to implement such a mechanism (Sec. 2.1). Finally, many algorithmic tasks require combining several local computations in the right order. This typically implies that attention should not focus on all possible matches at a given time but only on the closest match. We propose and investigate a new type of attention with a corresponding inductive bias called geometric attention (Sec. 2.2). Using both the geometric attention and copy gate, our model implements a “neural data routing mechanism”, which can adaptively serialize the input problem. We refer to the resulting new Transformer as Neural Data Router (NDR).
Each layer of the regular Transformer consists of one self-attention and one feedforward block. The input to each of these blocks is directly connected to the corresponding output via a residual connection (Srivastava et al., 2015; He et al., 2016). However, such a connection does not allow for skipping the transformation of the entire layer and simply passing the unchanged input to the next layer. Here we propose to add an explicit gate, which we call copy gate, to facilitate such a behavior.
We consider a $T$-layer Transformer encoder and an input sequence of length $N$. Since each layer corresponds to one computational step, we often refer to a layer as a step $t$. We denote the Transformer state of column $i$ in layer $t$ as $h^{(i,t)} \in \mathbb{R}^{d}$, where $d$ is the state size, and $H_t = [h^{(1,t)}, \dots, h^{(N,t)}]$ denotes the states of all columns in layer $t$. In the copy gate-augmented Transformer, each column $i$ in layer $(t+1)$ processes the input $H_t$ similarly to regular Transformers:
$$a^{(i,t+1)} = \mathrm{LayerNorm}\big(\mathrm{MultiHeadAttention}(h^{(i,t)}, H_t, H_t) + h^{(i,t)}\big) \tag{1}$$
$$u^{(i,t+1)} = \mathrm{LayerNorm}\big(\mathrm{FFN}^{\text{data}}(a^{(i,t+1)})\big) \tag{2}$$
but the output is gated (using $g^{(i,t+1)} \in \mathbb{R}^{d}$) as:
$$g^{(i,t+1)} = \sigma\big(\mathrm{FFN}^{\text{gate}}(a^{(i,t+1)})\big) \tag{3}$$
$$h^{(i,t+1)} = g^{(i,t+1)} \odot u^{(i,t+1)} + \big(1 - g^{(i,t+1)}\big) \odot h^{(i,t)} \tag{4}$$
We use the basic two-layer feedforward block (Vaswani et al., 2017) for both $\mathrm{FFN}^{\text{data}}$ and $\mathrm{FFN}^{\text{gate}}$, which transforms input $x$ to:
$$\mathrm{FFN}(x) = W_2 \max(W_1 x + b_1, 0) + b_2 \tag{5}$$
but with separate parameters and different dimensionalities: for $\mathrm{FFN}^{\text{data}}$, $W_1^{\text{data}} \in \mathbb{R}^{d_{\text{FF}} \times d}$, $W_2^{\text{data}} \in \mathbb{R}^{d \times d_{\text{FF}}}$, while for $\mathrm{FFN}^{\text{gate}}$, $W_1^{\text{gate}}, W_2^{\text{gate}} \in \mathbb{R}^{d \times d}$, with biases $b_1^{\text{data}} \in \mathbb{R}^{d_{\text{FF}}}$ and $b_2^{\text{data}}, b_1^{\text{gate}}, b_2^{\text{gate}} \in \mathbb{R}^{d}$.
When the gate is closed, i.e. $g^{(i,t+1)} = 0$ in Eq. 4, the entire transformation is skipped and the input is copied over to the next layer, $h^{(i,t+1)} = h^{(i,t)}$. Crucially, we parameterize the gate (Eq. 3) as a function of the output of the self-attention (Eq. 1), such that the decision to copy or transform the input for each column depends on the states of all columns. This is a crucial difference compared to previously proposed gatings in Transformers, which are solely motivated by training stability (Parisotto et al., 2020) or by a common practice from convolution-based models (Chaabouni et al., 2021). None of the previously proposed approaches can implement the behavior of our copy gate (see Sec. 6 on related work).
The bias of the gate $b_2^{\text{gate}}$ is initialized to $-3$ (Hochreiter and Schmidhuber, 1997). This ensures that no update happens initially, creating a better gradient flow between layers. It also encourages the model to skip layers unless they have an important contribution in the corresponding step.
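As an illustration of the gated update in Eq. 4, the following minimal Python sketch interpolates elementwise between a candidate update and the previous state of one column. It is a sketch under stated assumptions, not the released implementation: the function names, the toy vectors, and the bias value of -3.0 used for `initial_gate` are chosen for illustration, and the candidate update and gate pre-activations are passed in directly instead of being computed by attention and feedforward blocks.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def copy_gate_update(h_prev, u, gate_logits):
    """Elementwise gate: g * u + (1 - g) * h_prev, with g = sigmoid(gate_logits)."""
    out = []
    for h, cand, z in zip(h_prev, u, gate_logits):
        g = sigmoid(z)
        out.append(g * cand + (1.0 - g) * h)
    return out


h_prev = [0.5, -1.0, 2.0]   # state of one column at step t (toy values)
u = [9.0, 9.0, 9.0]         # candidate update for the same column (toy values)

closed = copy_gate_update(h_prev, u, [-30.0] * 3)  # gate close to 0: state is copied
opened = copy_gate_update(h_prev, u, [30.0] * 3)   # gate close to 1: state is replaced

# Assumed illustrative bias: with a pre-activation of -3 the gate is mostly closed.
initial_gate = sigmoid(-3.0)
```

With a strongly negative pre-activation the column keeps its previous state, which is the copying behavior described above; a strongly positive one lets the transformed value through.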
We propose geometric attention designed to attend to the closest matching element. Like in regular self-attention, given an input sequence $[x^{(1)}, \dots, x^{(N)}]$ with $x^{(i)} \in \mathbb{R}^{d_{\text{in}}}$, each input is projected to key $k^{(i)}$, value $v^{(i)}$, and query $q^{(i)}$ vectors, and the dot product is computed for each key/query combination. In our geometric attention, the dot product is followed by a sigmoid function to obtain a score between 0 and 1:
$$P_{i,j} = \sigma\big(k^{(j)\top} q^{(i)}\big) \tag{6}$$
which will be treated as a probability of the key at (source) position $j$ matching the query at (target) position $i$. These probabilities are finally converted to the attention scores $A_{i,j}$ as follows:
$$A_{i,j} = P_{i,j} \prod_{k \in S_{i,j}} \big(1 - P_{i,k}\big) \tag{7}$$
where $S_{i,j}$ denotes the set of all (source) indices which are closer to $i$ than $j$ is to $i$, and when two indices have the same distance to $i$, we consider the one which is to the right of $i$ (i.e., greater than $i$) to be closer, i.e.,
$$S_{i,j} = \begin{cases} \{k \in \{1, \dots, N\} \setminus \{i, j\} : |i - k| < |i - j|\}, & \text{if } i < j \\ \{k \in \{1, \dots, N\} \setminus \{i, j\} : |i - k| \le |i - j|\}, & \text{if } j < i \end{cases} \tag{8}$$
In addition, we explicitly zero out the diagonal by setting $P_{i,i} = 0$ for all $i \in \{1, \dots, N\}$. The ordering of source indices is illustrated in Figure 1/Right. The resulting scores $A_{i,j}$ are the attention scores used to compute the weighted averages of the value vectors.
Because of the terms $(1 - P_{i,k})$ in Eq. 7, a match downscales all other, more distant matches. Two recent works (Brooks et al., 2021; Banino et al., 2021) use such a parameterized geometric distribution in the form of Eq. 7 (see Sec. 6 on related work).
The resulting attention function has a complexity of $O(N^2)$, similar to the regular self-attention used in Transformers (Vaswani et al., 2017). Eq. 7 can be implemented in a numerically stable way in log space. The products can then be calculated using cumulative sums, subtracting the elements for the correct indices in each position.
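To make Eqs. 7-8 concrete, the following minimal Python sketch converts a given matrix of match probabilities into attention scores. It is an illustration under stated assumptions rather than the released implementation: it uses 0-based indices, plain nested lists, and a direct running product instead of the log-space cumulative sums mentioned above; the function names `closeness_order` and `geometric_attention` are hypothetical.

```python
def closeness_order(i, n):
    """Source indices from closest to farthest from target i; the right side wins ties."""
    order = []
    for dist in range(1, n):
        if i + dist < n:
            order.append(i + dist)
        if i - dist >= 0:
            order.append(i - dist)
    return order


def geometric_attention(P):
    """A[i][j] = P[i][j] times the product of (1 - P[i][k]) over all k closer to i than j."""
    n = len(P)
    A = [[0.0] * n for _ in range(n)]  # the diagonal stays zero
    for i in range(n):
        not_matched = 1.0  # probability that no closer source has matched so far
        for j in closeness_order(i, n):
            A[i][j] = P[i][j] * not_matched
            not_matched *= 1.0 - P[i][j]
    return A


# Toy match probabilities: only the row of target position 2 is non-zero.
P = [[0.0] * 5 for _ in range(5)]
P[2] = [0.9, 0.0, 0.0, 0.5, 1.0]
A = geometric_attention(P)
print(closeness_order(2, 5))  # [3, 1, 4, 0]
print(A[2])                   # [0.0, 0.0, 0.0, 0.5, 0.5]
```

In the toy row, the certain match at position 4 absorbs all probability mass left over by the closer, uncertain match at position 3, so the more distant match at position 0 receives no attention despite its high match probability.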
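The only positional information available in the geometric attention presented above is the ordering used to define the product in Eqs. 7-8. In practice, we found it crucial to augment the score computation of Eq. 6 with additional directional information, encoded as a scalar $D_{i,j} \in \mathbb{R}$ for each target/source position pair $(i, j)$:
$$D_{i,j} = \begin{cases} W_{LR} h^{(i)} + b_{LR}, & \text{if } i \le j \\ W_{RL} h^{(i)} + b_{RL}, & \text{if } i > j \end{cases} \tag{9}$$
where $h^{(i)} \in \mathbb{R}^{d}$ denotes the input/state at position $i$ and $W_{LR}, W_{RL} \in \mathbb{R}^{1 \times d}$, $b_{LR}, b_{RL} \in \mathbb{R}$ are trainable parameters. This directional information is integrated into the score computation of Eq. 6 as follows (akin to how Dai et al. (2019) introduce the relative positional encoding (Schmidhuber, 1992) as an extra term in the computation of attention scores):
$$P_{i,j} = \sigma\Big(\alpha \big(W_q h^{(i)} + b_q\big)^{\top} W_{k,E} h^{(j)} + \beta D_{i,j} + \gamma\Big) \tag{10}$$
where the matrix $W_q \in \mathbb{R}^{d_{\text{head}} \times d}$ maps the states to queries, $b_q \in \mathbb{R}^{d_{\text{head}}}$ is a bias for queries, $W_{k,E} \in \mathbb{R}^{d_{\text{head}} \times d}$ maps states to keys (we note that $d_{\text{head}}$ is typically the size of the key, query and value vectors for each head, $d_{\text{head}} = d / n_{\text{heads}}$), and $\alpha, \beta, \gamma \in \mathbb{R}$ are learned scaling coefficients and bias, initialized to $\alpha = 1/\sqrt{d_{\text{head}}}$, $\beta = 1$, $\gamma = 0$. Using this additional directional information, each query (position $i$) can potentially learn to restrict its attention to either the left or right side.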
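The effect of the directional term can be sketched in a few lines of Python. This is a hedged illustration, not the released code: it assumes that the direction scalar is a linear function of the target state with one parameter set per side, that it is added to the scaled query/key dot product before the sigmoid, and that all parameter names in the `params` dictionary are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directional_score(q_i, k_j, h_i, i, j, params):
    """Match probability of source j for target i with a direction term in the logit."""
    if i <= j:  # source lies to the right of the target
        d = dot(params["w_lr"], h_i) + params["b_lr"]
    else:       # source lies to the left of the target
        d = dot(params["w_rl"], h_i) + params["b_rl"]
    logit = params["alpha"] * dot(q_i, k_j) + params["beta"] * d + params["gamma"]
    return sigmoid(logit)


# Toy parameters (assumed values): a large negative bias suppresses right-side sources.
params = {
    "w_lr": [0.0, 0.0], "b_lr": -20.0,
    "w_rl": [0.0, 0.0], "b_rl": 0.0,
    "alpha": 1.0, "beta": 1.0, "gamma": 0.0,
}
h = [1.0, -1.0]
q = [0.0, 0.0]
k = [0.3, 0.7]
right = directional_score(q, k, h, i=2, j=3, params=params)
left = directional_score(q, k, h, i=2, j=1, params=params)
```

Here the content term is zero, so the only difference between `left` and `right` comes from the direction term: the query effectively looks only to its left.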
We evaluate the proposed methods on three tasks: the compositional table lookup (Liska et al., 2018; Hupkes et al., 2019), a custom variant of ListOps (Nangia and Bowman, 2018), and a simple arithmetic task which we propose. In all cases, the task is designed to test the compositional generalization ability of NNs: the model has to learn to apply operations seen during training in a longer/deeper compositional way (productivity).
The compositional table lookup task (Liska et al., 2018; Hupkes et al., 2019; Dubois et al., 2020) is constructed based on a set of symbols and unary functions defined over these symbols. Each example in the task is defined by one input symbol and a list of functions to be applied sequentially, i.e., the first function is applied to the input symbol and the resulting output becomes the input to the second function, and so forth. There are eight possible symbols. Each symbol is traditionally represented by a 3-bit bitstring (Liska et al., 2018). However, in practice, they are simply processed as one token (Dubois et al., 2020). The functions are bijective and randomly generated. Each function is represented by a letter. An example input is ‘101 d a b’, which corresponds to the expression $b(a(d(101)))$; the model has to predict the correct output symbol. We note that there exists a sequence-to-sequence variant of this task (Dubois et al., 2020) where the model has to predict all intermediate steps (thus trained with intermediate supervision). We directly predict the final output. An ideal model should be able to solve this task independently of the presentation order, that is, it should not matter whether the task is encoded as ‘101 d a b’ or ‘b a d 101’. We thus study both forward (former) and backward (latter) variants of the task. To evaluate systematic generalization, the train/valid/test sets reflect different numbers of compositions: samples with 1-5/6-8/9-10 operations, respectively. No previous work has reported perfect accuracy on this task through a NN. We refer the readers to Sec. 6 for further details on the previous work.
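The task definition can be illustrated with a small Python sketch that builds random bijective lookup tables and computes the target for both presentation orders. The helper names (`make_functions`, `evaluate`), the seed, and the use of four functions are assumptions made for illustration; they do not reproduce the dataset generator used in the experiments.

```python
import random

SYMBOLS = [format(v, "03b") for v in range(8)]  # '000' ... '111'


def make_functions(names, seed=0):
    """One random bijection over the eight symbols per function name."""
    rng = random.Random(seed)
    tables = {}
    for name in names:
        outputs = SYMBOLS[:]
        rng.shuffle(outputs)
        tables[name] = dict(zip(SYMBOLS, outputs))
    return tables


def evaluate(tokens, tables):
    """Target for a forward ('101 d a b') or backward ('b a d 101') example."""
    if tokens[0] in SYMBOLS:  # forward: symbol first, functions in application order
        symbol, functions = tokens[0], tokens[1:]
    else:                     # backward: symbol last, functions in reverse order
        symbol, functions = tokens[-1], tokens[:-1][::-1]
    for name in functions:
        symbol = tables[name][symbol]
    return symbol


tables = make_functions("abcd")
forward = evaluate("101 d a b".split(), tables)
backward = evaluate("b a d 101".split(), tables)
```

Both calls apply `d` first, then `a`, then `b`, so they yield the same output symbol; only the order in which the model sees the tokens differs.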
We consider four different baselines: an LSTM (Hochreiter and Schmidhuber, 1997), DNC (Graves et al., 2016; Csordás and Schmidhuber, 2019), Universal Transformers (Vaswani et al., 2017; Dehghani et al., 2019), and its relative position variants (Csordás et al., 2021). For Transformers, the prediction is based on the last column in the final layer. Results are shown in Table 1. The LSTM and DNC perform well in the forward variant, achieving perfect generalization for longer sequences, but fail on the backward variant. In contrast, basic Transformers fail in both cases.
By introducing the copy gate (Sec. 2.1), the relative Transformer can solve the forward task, but not the backward one. Our analysis showed that the network learns to attend to the last operation based on the relative position information. Since the result is read from the last column, this position changes with the sequence length. The model thus fails to generalize to such arbitrary offsets. To address this issue, we introduce a simple mechanism to let the model choose between absolute and relative positional encodings at each position (see Appendix A). The resulting model effectively manages to use the absolute position for the prediction and perform well in both directions. However, such a combination of absolute/relative positional encoding might be an overly specific bias. A more generic solution, geometric attention (Sec. 2.2), also achieved perfect generalization and was found easier to train. We present the corresponding visualization of our model in Sec. 4.
Table 1: Accuracy on the compositional table lookup task (mean ± standard deviation).

Model | IID, Forward | IID, Backward | Longer, Forward | Longer, Backward |
---|---|---|---|---|
LSTM | 1.00 ± 0.00 | 0.59 ± 0.03 | 1.00 ± 0.00 | 0.22 ± 0.03 |
DNC | 1.00 ± 0.00 | 0.57 ± 0.06 | 1.00 ± 0.00 | 0.18 ± 0.02 |
Transformer | 1.00 ± 0.00 | 0.82 ± 0.39 | 0.13 ± 0.01 | 0.12 ± 0.01 |
+ rel | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.23 ± 0.05 | 0.13 ± 0.01 |
+ rel + gate | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.01 | 0.19 ± 0.04 |
+ abs/rel + gate | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.98 ± 0.02 | 0.98 ± 0.03 |
+ geom. att. | 0.96 ± 0.04 | 0.93 ± 0.06 | 0.16 ± 0.02 | 0.15 ± 0.02 |
+ geom. att. + gate (NDR) | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00 |
In order to validate the success of the proposed model on a task that involves more complex data flows and operations, we propose the simple arithmetic task.
The task is to execute an arithmetic expression consisting of nested modulo 10 additions and multiplications. This requires the model to process tree-structured data flows, which is presumably more difficult than the sequential processing required for the CTL task. Each operation is surrounded by brackets, such that the boundaries of operations are easy to determine. For example ‘((4*7)+2)’ should evaluate to ‘0’ (30 modulo 10). The expressions are generated randomly. The tree depth is up to 5 for the training set, 6 for the validation set, and 7-8 for the test set. The depth is measured as the number of operations, ignoring the leaves, so the example above has a depth of 2. The sequence length is limited to at most 50 tokens.
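The target computation and the depth measure can be sketched with a short recursive evaluator. This is a minimal illustration, assuming single-digit leaves, fully bracketed binary operations, and a depth defined as the maximum nesting of operations (leaves have depth 0); the function names are hypothetical.

```python
def parse(expr, pos=0):
    """Return (value mod 10, operation depth, next position) for the term at pos."""
    ch = expr[pos]
    if ch.isdigit():
        return int(ch), 0, pos + 1
    if ch != "(":
        raise ValueError(f"unexpected character {ch!r} at position {pos}")
    left, left_depth, pos = parse(expr, pos + 1)
    op = expr[pos]
    right, right_depth, pos = parse(expr, pos + 1)
    if op not in "+*" or expr[pos] != ")":
        raise ValueError("malformed expression")
    value = (left + right) % 10 if op == "+" else (left * right) % 10
    return value, 1 + max(left_depth, right_depth), pos + 1


def evaluate(expr):
    """Return (result, depth) of a complete expression."""
    value, depth, end = parse(expr)
    if end != len(expr):
        raise ValueError("trailing characters after expression")
    return value, depth


print(evaluate("((4*7)+2)"))  # (0, 2)
```

The two branches of an operation are independent of each other, so the number of sequential steps needed is governed by the depth returned here rather than by the length of the expression.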
Table 2 shows the results. Even though all considered models perform well on the IID validation data, only the NDR performs well on the generalization test set, where it achieves a near-perfect accuracy of 98%. We also note that the NDR learns very quickly: while all other models require about 200 K steps to converge, the NDR achieves near-perfect accuracy after 50 K steps of training.
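Table 2: Accuracy on the simple arithmetic task (mean ± standard deviation) after the given number of training steps.

Model | IID (1..5), 200 K | Test (7..8), 200 K | Test (7..8), 50 K |
---|---|---|---|
LSTM | 0.99 ± 0.00 | 0.74 ± 0.02 | 0.72 ± 0.01 |
Transformer | 0.98 ± 0.01 | 0.47 ± 0.01 | 0.29 ± 0.01 |
+ rel | 1.00 ± 0.00 | 0.77 ± 0.04 | 0.40 ± 0.05 |
+ abs/rel + gate | 1.00 ± 0.01 | 0.80 ± 0.16 | 0.73 ± 0.15 |
+ geom. att. + gate (NDR) | 1.00 ± 0.00 | 0.98 ± 0.01 | 0.98 ± 0.01 |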
We also evaluate our model on a variant of the ListOps task (Nangia and Bowman, 2018) which is a popular task commonly used to evaluate parsing abilities of neural networks (Havrylov et al., 2019; Shen et al., 2019; Tay et al., 2021; Irie et al., 2021).
The task consists of executing nested list operations written in prefix notation. All operations have a list of arguments that can be either a digit (from 0 to 9) or recursively another operation with its own list of arguments. The operations are min, max, median and sum. The sum is modulo 10, and the median is followed by the floor function such that the output of any function/operation lies between 0 and 9. For example: [MED 4 8 5 [MAX 8 4 9 ] ] should return 6. There are two well-known variants of ListOps: the original by Nangia and Bowman (2018) and the “Long Range Arena” variant by Tay et al. (2021), which differ from each other by the maximum number of arguments in each function and the maximum number of tokens in a sequence. In both variants, there is no strict control of the depth of data samples: there is simply a certain pre-defined probability that each position/element in the list is expanded into another list (which may increase the tree depth). This is not suitable for evaluating systematic generalization in terms of compositionality (over the problem depth). We propose instead to generate clean train, valid, and test splits with disjoint depths. In our settings, the training set contains samples with depths of up to 5, the validation set only contains samples of depth 6, and the test set contains samples of depths 7 and 8. Importantly, we make sure that a depth-$K$ sample effectively requires computation until depth-$K$ (otherwise min, max, and med operations could potentially find the output without checking/executing all of their arguments). By dissociating the splits by the depth, we can clearly identify models which fail to generalize compositionally. Apart from the depth specifications, all train/valid/test sets share the same settings as follows: the maximum sequence length is 50 (tokens), the probability of recursively sampling another function inside a list is 30% at each position, and the maximum number of arguments for a function is 5. The train set consists of 1M sequences, the valid and test sets of 1K sequences each.
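The semantics of the four operations can be made explicit with a small stack-based evaluator. The sketch below is an illustration only; it assumes whitespace-separated tokens in which the operator is fused with the opening bracket (e.g. `[MED`), and the names `OPS` and `evaluate` are hypothetical.

```python
import math
import statistics

OPS = {
    "MIN": min,
    "MAX": max,
    "MED": lambda args: math.floor(statistics.median(args)),  # median, then floor
    "SM": lambda args: sum(args) % 10,                         # sum modulo 10
}


def evaluate(tokens):
    """Evaluate a ListOps expression given as a list of tokens."""
    stack = []
    for tok in tokens:
        if tok == "]":
            args = []
            while not isinstance(stack[-1], str):  # pop arguments back to the operator
                args.append(stack.pop())
            op = stack.pop()                       # e.g. "[MED"
            stack.append(OPS[op[1:]](args))
        elif tok.startswith("["):
            stack.append(tok)
        else:
            stack.append(int(tok))
    (result,) = stack
    return result


print(evaluate("[MED 4 8 5 [MAX 8 4 9 ] ]".split()))  # 6
```

For the example above, MAX returns 9, the median of 4, 8, 5 and 9 is 6.5, and the floor gives 6. Note that this evaluator always executes every argument; it does not check whether a sample of a given depth actually requires computation down to that depth.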
Table 3 shows the results. Like on compositional table lookup and simple arithmetic tasks, the baseline LSTM and Transformers do not generalize well on the test set consisting of deeper problems, while they achieve near-perfect accuracy on IID data. In contrast, our model achieves near-perfect generalization.
Table 3: Accuracy on the ListOps task (mean ± standard deviation).

Model | IID (1..5) | Test (7..8) |
---|---|---|
LSTM | 0.99 ± 0.00 | 0.71 ± 0.03 |
Transformer | 0.98 ± 0.00 | 0.74 ± 0.03 |
+ rel | 0.98 ± 0.01 | 0.79 ± 0.04 |
+ abs/rel + gate | 1.00 ± 0.01 | 0.90 ± 0.06 |
+ geom. att. + gate (NDR) | 1.00 ± 0.00 | 0.99 ± 0.01 |
In this section, we provide some visualizations of attention and gating patterns of the NDR and the corresponding analyses. For more visualizations, we refer the readers to Appendix C.
Figure 2 shows the gating and attention patterns of the NDR model for an example of the backward presentation task. As shown in Fig. 2/Bottom, the gates of different columns open sequentially one after another when the input is available for them. Fig. 2/Top shows the corresponding attention maps. Each column attends to the neighbouring one, waiting for its computation to be finished. The behavior of the last column is different: it always attends to the second position of the sequence, which corresponds to the last operation to be performed.
We can also identify how the NDR processes the data in ListOps. Different attention heads play different roles. We highlight the core observations in Figure 3. The input for this example is: [SM [MED [MIN 1 7 4 [MAX 2 4 0 8 9 ] ] 7 ] 5 [MED 8 5 8 ] 0 7 ]. First of all, we find that there is a head (head 13 in Figure 3, first row) which seems to be responsible for connecting operators and their arguments: the operands/arguments of an operation attend to the operator. In step 0, we can recognize that the operations at the deepest level, namely MAX and the second MED, have all the arguments ready (as is shown by vertical lines on the columns corresponding to MAX and MED). The model indeed identifies that these two operations are ready to be executed and that they can be processed in parallel (these arguments-to-operation attention patterns remain for a few steps). We note that at this stage, the last argument of MIN is not ready yet ([MIN 1 7 4 [MAX 2 4 0 8 9 ] ]). We can see that only the arguments which are already available (1 7 4) attend to the operator (see the column of MIN). In step 1 (2nd row), we can see that head 5 copies the expected result of MAX, 9, to the column of the operator (we note that this only requires one step as 9 is always the result of MAX when it is one of the arguments of MAX). Similarly in step 2, head 7 (2nd row) seems to copy the result of the second MED, 8, to the operator column. In step 3 (1st row), we recognize that the result of MAX is marked as an argument for MIN in head 13, which is responsible for communication between operators and their arguments. This is shown by the new attention which appears at this step in head 13 from the source position MAX to the target position MIN (a pattern which is not visible earlier). In head 3 (2nd row), the expected result of MIN, which is 1, is copied to the operator, similarly to the patterns we observed above for MAX and MED. In head 13 (1st row), all arguments for the first MED are now also recognized (the result of MIN, which is 1, and 7). Finally (2nd row), two heads, head 3 and head 5, seem to copy/gather the two inputs needed to compute the corresponding median, 1 and 7, and store them in the column of the operator MED. A complete visualization of further steps can be found in Appendix C.2. We noticed that some of the heads do not seem to play a key role, and focused on interpreting those which seem to participate in the main computation. For ListOps, we also partially find the attention patterns described above in the baseline Transformer with relative positional encoding, at least on some inspected examples, which also explains its rather high accuracy.
The NDR architecture can be understood as performing adaptive serialization of the problem. A key requirement for reusable computation is decomposing the problem into reusable building blocks, typically applied in sequential steps. The granularity of the decomposition determines the degree of reusability: fusing operations in a single step makes the processing faster (fewer steps), but also more specialized. Learning the most granular solutions is thus preferable for generalization. At the same time, not all processing should happen serially: branches of the computational graph that do not have common data dependencies can be processed independently in parallel, which we empirically observe in our NDR in the ListOps example (Sec. 4). This enables the architecture to get away with a number of computational steps reflecting the depth of the computational graph rather than the length of the input.
Transformers have seen tremendous successes across various application domains (Devlin et al., 2019; Brown and others, 2020; Dosovitskiy et al., 2021). Impressive results have been reported when they are scaled up with a large amount of data (Brown and others, 2020). On the other hand, simple tasks like those highlighted in the present work demonstrate that the Transformer architecture still struggles with basic reasoning. Particularly in algorithmic tasks, it is often the case that a sub-optimal choice of architecture/optimization method makes the model fall back to simple memorization. We argue that it is crucial to look at isolated problems which test specific generalization capability. This calls for a bottom-up approach: building on toy tasks that focus on individual aspects of generalization and using them for improving models.
Several prior works have proposed to use some sort of gating within Transformer architectures (Parisotto et al., 2020; Chaabouni et al., 2021). Our proposed copy gate is different from those as it satisfies two important properties. First, our copy gate allows the model to skip the entire Transformer layer (i.e., both the self-attention and the feedforward blocks) when the gate is closed. Second, the gate function is conditioned on the attention output such that the decision of opening or closing depends on information from all columns. While multiple gating variants have been proposed by Parisotto et al. (2020) to stabilize Transformers for reinforcement learning, none of them can produce this behavior. Empirically, we also tried out a few other gating variants which do not satisfy the two properties above; we found them not to improve over regular Transformers in our preliminary experiments on compositional table lookup. Recent work by Chaabouni et al. (2021) also makes use of “gating” in Transformers through a gated linear unit (GLU) activation function commonly used in convolutional NNs (Dauphin et al., 2017). Transformer models with such an activation function were reported to outperform RNN baselines on a systematic generalization task (Dessì and Baroni, 2019). Unlike our copy gate or Parisotto et al. (2020)’s gating, such a gating activation does not have the “residual” term that allows the model to skip a transformation (i.e., a closed gate zeros out the input instead of copying it). In a more general context, benefits of the GLU activation in Transformers vary across tasks (Irie et al., 2019; Shazeer, 2020). In language modeling, no improvement is typically obtained by using the standard highway gate instead of the residual connection in Transformers (Irie, 2020), while it yields improvements when combined with convolutional layers (Kim and Rush, 2016).
Two recent works (Brooks et al., 2021; Banino et al., 2021) have used a form of parameterized geometric distribution (PGD; in the form of Eq. 7). Brooks et al. (2021) have used such a distribution to parameterize the movement of a pointer on a sequence of instructions. Banino et al. (2021) have used it to implement adaptive computation time (Schmidhuber, 2012; Graves, 2016). We use the PGD to obtain a generic attention mechanism as a replacement of the standard self-attention used in Transformers (Vaswani et al., 2017).
CTL was proposed as a task for evaluating the compositional ability of neural networks (Liska et al., 2018). Previous works evaluated RNNs, RNNs with attention, and Transformer architectures on this task with limited success (Hupkes et al., 2019; Dubois et al., 2020). Dubois et al. (2020) have proposed a special attention mechanism to augment the recurrent architecture. While they obtained good performance for the forward presentation order, the proposed model failed in the backward one. In contrast, two of our approaches (Sec. 3.1) achieve 100% generalization accuracy on this task for both orders.
Many previous works have focused on improving positional encoding (Schmidhuber, 1992; Vaswani et al., 2017) for self-attention. Most notably, the relative positional encoding (Schmidhuber, 1992; Shaw et al., 2018; Dai et al., 2019) was found useful for improving systematic generalization of Transformers (Csordás et al., 2021). Here we also present two new approaches related to positional encoding. One is the gated combination of absolute and relative positional encoding (Sec. 3.1; details in Appendix A). We show that absolute positional encoding can complement relative positional encoding. The former enables the model to always attend to a specific position, as is needed for the CTL task in the last step, while the gating allows it to use relative positional encoding for other positions/steps. Second, we introduce directional encoding to augment geometric attention. Unlike positional encoding which can overfit to a range of positions seen during training, the direction information is found to be robust and to be a crucial augmentation of the geometric attention.
We proposed a new view on the internal operations of Transformers as a dynamic dataflow architecture between Transformer columns. This overcomes two shortcomings of traditional Transformers: the problem of routing and retaining data in an unaltered fashion, which we solve by an additional copy gate, and the problem of learning length-independent attention patterns, which we solve by geometric attention. Our new model, the Neural Data Router (NDR), generalizes to compositions longer than those seen during training on the popular compositional table lookup task in both forward and backward directions. NDR also achieves near-perfect performance on simple arithmetic and ListOps tasks in settings that test systematic generalization in terms of computational depth. In general, the gates and the attention maps collectively make the architecture more interpretable than the baselines.
We thank Imanol Schlag and Sjoerd van Steenkiste for helpful discussions and suggestions on an earlier version of the manuscript. This research was partially funded by ERC Advanced grant no: 742870, project AlgoRNN, and by Swiss National Science Foundation grant no: 200021_192356, project NEUSYM. We are thankful for hardware donations from NVIDIA & IBM. The resources used for the project were partially provided by Swiss National Supercomputing Centre (CSCS) project s1023.
The use of copy gates enables Transformers to generalize to longer lengths in the forward presentation order of the CTL task (Sec. 3.1), but that alone was not enough to make the model generalize in the backward order variant of the task. Examining the attention maps reveals that the model uses position-based attention to read out the result instead of content-based attention. In the backward presentation order, the last column of the Transformer should focus on the second column, whose relative position changes dynamically with the length of the sequence. We solve this issue by adding an option to choose between absolute and relative positional encodings to the attention head.
In what follows, we describe the operation within a single layer/step. This allows us to omit the layer/step-index $t$ for better readability, and thus denote the state of column/position $i$ as $h_i$ instead of $h^{(i,t)}$. We use the relative positional embedding variant of self-attention by Dai et al. (2019). Our attention matrix with the gated absolute/relative positional encodings can be decomposed as follows:

$$r_i = \sigma(W_{ar} h_i + b_{ar}) \tag{11}$$

$$\hat{A}_{i,j} = \underbrace{h_i^\top W_q^\top W_{k,E} h_j}_{(a)} + \underbrace{h_i^\top W_q^\top W_{k,P}}_{(b)} \underbrace{\left(p_{i-j} r_i + p_j (1 - r_i)\right)}_{(e)} + \underbrace{u^\top W_{k,E} h_j}_{(c)} + \underbrace{v^\top W_{k,P}}_{(d)} \underbrace{\left(p_{i-j} r_i + p_j (1 - r_i)\right)}_{(e)} \tag{12}$$

where the matrix $W_q$ maps the states to queries, $W_{k,E}$ maps states to keys, while $W_{k,P}$ maps positional embeddings to keys. $d_{\text{head}}$ is the size of the key, query and value vectors for each head, set as $d_{\text{head}} = d_{\text{model}} / n_{\text{heads}}$. $u$ and $v$ are learned vectors. $p_i$ is the standard sinusoidal embedding for position $i$ (Vaswani et al., 2017). Softmax is applied to the second dimension of $\hat{A}$ to obtain the final attention scores, $A$. Component (a) corresponds to content-based addressing, (b, e) to content-based positional addressing, (c) represents a global content bias, while (d, e) represent a global position bias.

We introduce term (e) for the positional embedding, which can switch between absolute and relative positional encodings using the scalar gate $r_i$ (Eq. 11; parameterized by $W_{ar}$ and $b_{ar}$), which is a function of the state at target position $i$.
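To make the role of the gate concrete, the following is a minimal pure-Python sketch of how a scalar gate computed from the state at the target position could interpolate between a relative embedding (for the offset i - j) and an absolute embedding (for position j). The function names, the interleaved sine/cosine layout, and the convention that a gate value of 1 selects the relative embedding are assumptions made for illustration; they are not taken from the released code.

```python
import math


def sinusoidal(pos, d):
    """Sinusoidal embedding of an integer position (interleaved sin/cos layout)."""
    emb = []
    for k in range(0, d, 2):
        freq = 1.0 / (10000 ** (k / d))
        emb.append(math.sin(pos * freq))
        emb.append(math.cos(pos * freq))
    return emb[:d]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_position(i, j, h_i, w_ar, b_ar, d):
    """Mix relative (i - j) and absolute (j) embeddings with a scalar gate.

    h_i is the state at target position i; w_ar and b_ar are the (hypothetical)
    gate parameters. Returns the gate value and the mixed embedding.
    """
    r_i = sigmoid(sum(w * h for w, h in zip(w_ar, h_i)) + b_ar)
    p_rel = sinusoidal(i - j, d)   # relative: depends only on the offset
    p_abs = sinusoidal(j, d)       # absolute: depends only on the source position
    mixed = [r_i * a + (1.0 - r_i) * b for a, b in zip(p_rel, p_abs)]
    return r_i, mixed


# A strongly positive bias drives the gate towards 1 (purely relative),
# a strongly negative bias towards 0 (purely absolute).
gate, emb = gated_position(i=3, j=1, h_i=[0.0, 0.0], w_ar=[1.0, 1.0], b_ar=50.0, d=4)
```

Because the gate depends on the target state, each column can decide independently whether to address other columns by offset or by fixed position, which is what the backward CTL variant requires for the read-out column.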
A PyTorch implementation of our models together with the experimental setup is available at https://github.com/robertcsordas/ndr. The performance of all models is reported as the mean and standard deviation over 5 different seeds.
Our implementation uses 8 symbols as input arguments and 9 randomly sampled bijective functions denoted by lower case letters of the English alphabet. All functions are included in the train set in combination with all possible input symbols. The rest of the training set consists of random combinations of functions applied to a random symbol as an argument, up to length 5. The total size of the train set is 53,704 samples. The samples are roughly balanced such that there are similar numbers of samples for each depth. There are two validation sets: an IID set, which matches the distribution of the train set, and a depth validation set, which includes samples of lengths 6, 7 and 8. The test set consists of sequences of lengths 9 and 10.
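The construction described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `make_ctl_tables` and `sample_ctl` are hypothetical, each bijection is represented as a random permutation of the symbol indices, and the token layout (forward or backward presentation order) and the balancing across lengths are not covered.

```python
import random


def make_ctl_tables(n_symbols=8, n_functions=9, seed=0):
    """Create randomly sampled bijective lookup tables, named 'a', 'b', ..."""
    rng = random.Random(seed)
    names = "abcdefghijklmnopqrstuvwxyz"[:n_functions]
    tables = {}
    for name in names:
        perm = list(range(n_symbols))
        rng.shuffle(perm)  # a random permutation of the symbols is a bijection
        tables[name] = perm
    return tables


def sample_ctl(tables, length, rng):
    """Sample `length` functions and a start symbol; return them with the result."""
    n_symbols = len(next(iter(tables.values())))
    funcs = [rng.choice(sorted(tables)) for _ in range(length)]
    arg = rng.randrange(n_symbols)
    value = arg
    for f in funcs:  # functions are listed in the order they are applied
        value = tables[f][value]
    return funcs, arg, value


tables = make_ctl_tables()
rng = random.Random(1)
train_like = [sample_ctl(tables, rng.randint(1, 5), rng) for _ in range(3)]
test_like = [sample_ctl(tables, rng.choice([9, 10]), rng) for _ in range(3)]
```

Sampling longer compositions for the test split than for training, as in the last two lines, is what makes the evaluation a test of generalization across computational depth.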
The dataset is constructed by sampling random digits (0-9) and the operations + (add) and * (multiply). The operations are performed modulo 10. Parentheses surround the arguments of the operations. The depth of the resulting tree is computed, and rejection sampling is used to ensure that each split contains the same number of samples from each depth. The maximum length of samples is 50 tokens, and sub-operations are sampled with probability 0.2. 100K samples are used for training, and 1K each for the test and validation sets. The train set consists of 0-5 operations, the validation set of 6 and the test set of 7 operations.
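A minimal sketch of this procedure is given below, under several assumptions that the text does not state: operations are taken to be binary, expressions are stored as nested tuples rather than token strings, and the depth of a bare digit is counted as 0. The names `sample_expr`, `evaluate`, `depth` and `balanced_sample` are hypothetical, and the 50-token length limit is not enforced.

```python
import random


def sample_expr(rng, p_sub=0.2, max_depth=5, level=1):
    """Sample one operation; each argument is a digit or, with prob. p_sub, a sub-operation."""
    def argument():
        if level < max_depth and rng.random() < p_sub:
            return sample_expr(rng, p_sub, max_depth, level + 1)
        return rng.randrange(10)
    return (rng.choice("+*"), argument(), argument())


def evaluate(expr):
    """Evaluate an expression with all operations performed modulo 10."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = evaluate(left), evaluate(right)
    return (a + b) % 10 if op == "+" else (a * b) % 10


def depth(expr):
    """Depth of the expression tree; a bare digit has depth 0."""
    if isinstance(expr, int):
        return 0
    _, left, right = expr
    return 1 + max(depth(left), depth(right))


def balanced_sample(rng, per_depth, depths):
    """Rejection sampling: keep drawing until every depth has per_depth samples."""
    buckets = {d: [] for d in depths}
    while any(len(b) < per_depth for b in buckets.values()):
        expr = sample_expr(rng)
        d = depth(expr)
        if d in buckets and len(buckets[d]) < per_depth:
            buckets[d].append(expr)  # accepted; everything else is rejected
    return buckets


example = ("+", ("*", 3, 4), 9)  # ((3*4)+9) mod 10
result = evaluate(example)
```

Without the rejection step, shallow expressions would dominate the data, since with a sub-operation probability of 0.2 most sampled arguments are plain digits.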
Random digits are sampled from the range 0-9. Operations are sampled from the set sum-modulo (SM), which is a sum modulo 10, min (MIN), max (MAX) and median followed by the floor function (MED). The maximum number of arguments for each operation is 5. A sub-operation is sampled with probability 0.3. 1M samples are used for training, and 1K each for the test and validation sets. The train set consists of 0-5 operations, the validation set of 6 and the test set of 7 operations.
For each sample, we calculate a number which we call dependency depth. To understand it, note that MIN and MAX operations only select one of their operands, while MED selects 1 or 2. In SM, all operands are needed to perform the operation. If we construct a parse tree, prune away the branches which were not selected by any operation, and measure the depth of the resulting tree, we obtain the dependency depth. This ensures that the deeper parts of the tree contribute to the result calculation, preventing shallow heuristics such as ignoring all branches of the tree that are too deep and still getting the correct result with a high chance. We also ensure that the number of samples is the same for all possible dependency depths in each split.
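The dependency depth can be illustrated with a short Python sketch. It assumes a nested-tuple tree representation, counts a bare digit as depth 0, and breaks ties between equal operands by picking the first one; these conventions and the function names are illustrative assumptions, since the text does not specify them.

```python
def evaluate(node):
    """Evaluate a ListOps tree given as an int or a tuple (op, [arguments])."""
    if isinstance(node, int):
        return node
    op, args = node
    vals = [evaluate(a) for a in args]
    if op == "SM":
        return sum(vals) % 10
    if op == "MIN":
        return min(vals)
    if op == "MAX":
        return max(vals)
    if op == "MED":  # median followed by the floor function
        s, n = sorted(vals), len(vals)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) // 2
    raise ValueError(f"unknown operation: {op}")


def selected(op, vals):
    """Indices of the operands that the operation actually depends on."""
    n = len(vals)
    if op == "SM":
        return list(range(n))            # all operands are needed
    if op == "MIN":
        return [vals.index(min(vals))]   # first minimal operand (tie-break assumption)
    if op == "MAX":
        return [vals.index(max(vals))]
    order = sorted(range(n), key=lambda k: vals[k])
    return [order[n // 2]] if n % 2 else [order[n // 2 - 1], order[n // 2]]


def dependency_depth(node):
    """Depth of the parse tree after pruning branches that no operation selects."""
    if isinstance(node, int):
        return 0
    op, args = node
    vals = [evaluate(a) for a in args]
    return 1 + max(dependency_depth(args[k]) for k in selected(op, vals))


# The nested MIN/SM branch is deep but never selected by the outer MAX,
# so it does not count towards the dependency depth.
shallow = ("MAX", [2, ("MIN", [("SM", [1, 2]), 9]), 8])
deep = ("SM", [1, ("MAX", [("MED", [8, 5, 8]), 3])])
```

In the first example the plain tree depth is 3 while the dependency depth is 1, which is exactly the kind of sample a shallow heuristic could solve and that balancing by dependency depth is meant to control for.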
We use the AdamW optimizer (Loshchilov and Hutter, 2019) for all of our models. Standard hyperparameters are listed in Tab. 4, 5 and 6. Additionally, models with gating use dropout (Hanson, 1990; Srivastava et al., 2014) applied to the content-based query and the position-query components. The dropout rate is 0.1 for most models, except for non-gated Transformers on ListOps, where it is 0.05. In the case of geometric attention, since the channels of the positional encoding do not have any redundancy, dropout is applied just to the content-query. In the case of Transformers with the copy gate but without geometric attention, we use tanh instead of LayerNorm in Eq. 2.
The hyperparameters of the gateless Transformers differ significantly from the gated ones. This is because they were very hard to train to achieve good performance even on the IID set, requiring extensive hyperparameter optimization. One might argue that fewer layers make them less competitive on longer sequences. However, we were unable to train them to perform well even on IID data with comparable sizes.
All Transformer variants have a begin (B) and end (E) token included on sequence boundaries. RNNs (LSTM and DNC) have no such tokens. All Transformers are encoders only, and the results are read from the last column (corresponding to the end token).
The DNC has 21 memory cells, 4 read heads, and an LSTM controller. It contains recently introduced improvements (Csordás and Schmidhuber, 2019).
We use gradient clipping with magnitude 5 (for CTL) or 1 (for simple arithmetic and ListOps) for all of our models.
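As an illustration, clipping by the global L2 norm of all gradients can be sketched in pure Python as follows. That the clipping is norm-based (rather than element-wise) is an assumption, and `clip_by_global_norm` is a hypothetical helper; its scaling rule follows the behaviour of PyTorch's `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total + eps))  # never scale gradients up
    return [[g * scale for g in vec] for vec in grads], total


# Threshold 1 (simple arithmetic, ListOps) versus threshold 5 (CTL).
clipped_small, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
clipped_large, _ = clip_by_global_norm([[0.3], [0.4]], max_norm=5.0)
```

Gradients whose norm is already below the threshold pass through unchanged, so the threshold only affects the occasional large update.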
Hyperparameters were obtained by a Bayesian hyperparameter search using Weights & Biases (https://wandb.ai/) over the systematically different (OOD) validation set for the +abs/rel + gate models and were reused for all other gated models. For the non-gated models, we used the +rel variant for tuning. It was not possible to tune the baselines using only the OOD validation set because their performance on that set was too poor. We thus used a mixture of IID and OOD validation sets to tune the hyperparameters for the baselines.
We train all models for a fixed number of iterations and measure their validation performance every 1000 iterations. For each model, we select the best checkpoint according to the validation performance, and report its test accuracy.
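This selection rule can be written down in a few lines. The sketch below is illustrative: the record layout and the accuracy values are made up for the example, and ties are resolved in favour of the earliest checkpoint, which the text does not specify.

```python
def select_checkpoint(history):
    """Pick the record with the best validation accuracy.

    history: list of (iteration, validation_accuracy, test_accuracy) tuples,
    one per evaluation (every 1000 iterations in the setup described above).
    """
    # max() returns the first maximal element, so ties go to the earliest checkpoint.
    return max(history, key=lambda record: record[1])


# Made-up numbers, for illustration only.
history = [
    (1000, 0.50, 0.40),
    (2000, 0.90, 0.80),
    (3000, 0.90, 0.70),
    (4000, 0.70, 0.95),
]
iteration, val_acc, test_acc = select_checkpoint(history)
```

Note that the reported test accuracy is the one belonging to the selected checkpoint, not the best test accuracy seen during training; in the example the checkpoint at iteration 4000 has a higher test accuracy but is not chosen.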
Table 4: Hyperparameters for the compositional table lookup (CTL) task.
Model | $d_{\text{model}}$ | $d_{\text{FF}}$ | $n_{\text{heads}}$ | $n_{\text{layers}}$ | batch s. | learning rate | wd. | do. | $n_{\text{iters}}$
---|---|---|---|---|---|---|---|---|---
LSTM | 200 | - | - | 1 | 256 | | - | 0.5 | 20K
DNC | 200 | - | - | 1 | 256 | | - | 0.5 | 20K
Transformer | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.1 | 30K
+ rel | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.1 | 30K
+ rel + gate | 256 | 1024 | 1 | 14 | 512 | | 0.01 | 0.5 | 30K
+ abs/rel + gate | 256 | 1024 | 1 | 14 | 512 | | 0.01 | 0.5 | 30K
+ geom. att. | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.1 | 30K
+ geom. att. + gate | 256 | 512 | 1 | 14 | 512 | | 0.01 | 0.5 | 30K
Table 5: Hyperparameters for the simple arithmetic task.
Model | $d_{\text{model}}$ | $d_{\text{FF}}$ | $n_{\text{heads}}$ | $n_{\text{layers}}$ | batch s. | learning rate | wd. | do. | $n_{\text{iters}}$
---|---|---|---|---|---|---|---|---|---
LSTM | 200 | - | - | 2 | 256 | | - | 0.5 | 200K
Transformer | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.5 | 200K
+ rel | 128 | 256 | 4 | 11 | 512 | | 0.0025 | 0.5 | 200K
+ abs/rel + gate | 256 | 1024 | 4 | 15 | 512 | | 0.01 | 0.5 | 100K
+ geom. att. + gate | 256 | 1024 | 4 | 15 | 512 | | 0.01 | 0.5 | 100K
Table 6: Hyperparameters for the ListOps task.
Model | $d_{\text{model}}$ | $d_{\text{FF}}$ | $n_{\text{heads}}$ | $n_{\text{layers}}$ | batch s. | learning rate | wd. | do. | $n_{\text{iters}}$
---|---|---|---|---|---|---|---|---|---
LSTM | 512 | - | - | 4 | 512 | | 0.08 | 0.1 | 200K
Transformer | 256 | 1024 | 16 | 6 | 512 | | 0.05 | 0.015 | 200K
+ rel | 256 | 1024 | 16 | 6 | 512 | | 0.05 | 0.015 | 200K
+ abs/rel + gate | 512 | 2048 | 16 | 20 | 512 | | 0.09 | 0.1 | 100K
+ geom. att. + gate | 512 | 1024 | 16 | 20 | 512 | | 0.09 | 0.1 | 100K
In the main text, we only had space to show the gate and attention activity of the NDR for a few timesteps. Here we show the corresponding visualization of all steps in Figures 7 and 8, as well as the attention map for the baseline Transformer with relative positional encoding in Figure 4. We also show the Transformer + abs/rel + gate variant in Figures 5 and 6. Please refer directly to the captions of the figures for the corresponding analysis. In general, the visualizations for our NDR and the abs/rel + gate variant are easily interpretable, unlike that of the baseline Transformer model.
Figures 9 and 11 show the attention and gate patterns of our NDR architecture on an example from the ListOps dataset. We highlighted notable attention patterns in Sec. 4.
Different heads seem to specialize in different functions. As already mentioned in Sec. 4, head 13 of the NDR architecture, shown in Figure 10, seems to specialize in selecting which arguments belong to which operator.
The gating patterns are also very interesting. In the early stages, the deepest parts of the input are updated: [MAX 2 4 0 8 9] and [MED 8 5 8], which are independent branches of the parse tree that can be processed in parallel. In the following steps, the update patterns spread up in the parse tree, updating the operations that have their arguments available. In this task, the input is read from the first column, which is written in a very late stage.