1 Introduction
Neural networks have become the preferred model for pattern recognition and prediction in perceptual tasks and natural language processing [16, 6] thanks to their flexibility and their ability to learn complex solutions. Recently, researchers have turned their attention towards imbuing neural networks with the capability to perform algorithmic reasoning, thereby allowing them to go beyond pattern recognition and logically solve more complex problems [12, 15, 13, 17]. These approaches are often inspired by concepts in conventional computer systems (e.g., pointers [33], external memory [26, 12]).
Unlike perceptual tasks, where the model is only expected to perform well on the specific distribution from which the training set is drawn, in algorithmic reasoning the goal is to learn a robust solution that performs the task regardless of the input distribution. This ability to generalize to arbitrary input distributions, as opposed to unseen instances from a fixed data distribution, distinguishes strong generalization from ordinary generalization. To date, neural networks still have difficulty learning algorithmic tasks with strong generalization [32, 11, 15].
In this work, we study this problem by learning to imitate the composable subroutines that form the basis of common algorithms, namely selection sort, merge sort, Dijkstra’s algorithm for shortest paths, and Prim’s algorithm to find a minimum spanning tree. We choose to focus on subroutine imitation as: (1) it is a natural mechanism that is reminiscent of how human developers decompose problems (e.g., developers implement very different subroutines for merge sort vs. selection sort), (2) it supports introspection to understand how the network may fail to strongly generalize, and (3) it allows for providing additional supervision to the neural network if necessary (inputs, outputs, and intermediate state).
By testing a powerful sequence-to-sequence transformer model [30] within this context, we show that while it is able to learn subroutines for a given data distribution, it fails to strongly generalize as the test distribution deviates from the training distribution. Further analysis of this failure case reveals that transformers have difficulty separating “what” to compute from “where” to compute, manifesting in attention weights whose entropy increases over long sequence lengths. This, in turn, results in misprediction and compounding errors.
Our solution to this problem is a conditional masking mechanism that has the transformer predict both a value and a pointer. The pointer is used to update the mask for subsequent computation; this corresponds to a new form of decoder. We call the resulting architecture a Neural Execution Engine (NEE), and show that NEEs achieve near-perfect generalization over a significantly larger range of test values than existing models. We also find that a NEE that is trained on one subroutine (e.g., comparison) can be used in a variety of other algorithms (e.g., Dijkstra, Prim) as-is without retraining.
Another essential component of algorithmic reasoning is representing and manipulating numbers [34]. To achieve strong generalization, the employed number system must work over large ranges and generalize outside of its training domain (as it is intractable to train the network on all integers). In this work, we leverage binary numbers, as binary is a hierarchical representation whose range expands exponentially with the length of the bit string (e.g., 8-bit binary strings represent twice as many values as 7-bit binary strings), thus making it possible to train and test on significantly larger number ranges compared to prior work [11, 15]. We demonstrate that the binary embeddings trained on downstream tasks (e.g., addition, multiplication) lead to well-structured and interpretable representations with natural interpolation capabilities.
2 Background
2.1 Transformers and Graph Attention Networks
Transformers are a family of models that represent the current state of the art in sequence learning [30, 6, 23]. Given input token sequences and output token sequences, a transformer learns a mapping from the former to the latter. First, the tokens are individually embedded. The main module of the transformer architecture is the self-attention layer, which allows each element of the sequence to concentrate on a subset of the other elements. (Footnote 1: We do not use positional encodings, which we found to hurt performance, and we use single-headed attention.)
Self-attention layers are followed by a pointwise feed-forward neural network layer, forming a self-attention block. These blocks are composed to form the encoder and decoder of the transformer, with the outputs of the encoder being used as keys and values for the decoder. More details can be found in [30].
An important component for our purposes is the self-attention mask. This is used to prevent certain positions from propagating information to other positions. A mask is a binary vector where the value 1 indicates that the input should be considered, and 0 indicates that the input should be ignored. The vector is broadcast to zero out attention weights of ignored input numbers. Typically this is used for decoding, to ensure that the model can only condition on past outputs during sequential generation. Graph Attention Networks [31] are transformers where the encoder mask reflects the structure of a given graph. In our case, we will consider masking in the encoder as an explicit way for the model to condition on the part of the sequence that it needs at a given point in its computation, creating a dynamic graph. We find that this focuses the attention of the transformer and is a critical component for achieving strong generalization.
2.2 Numerical Subroutines for Common Algorithms
We draw examples from different algorithmic categories to frame our exploration into the capability of neural networks to perform generalizable algorithmic reasoning. Figure 1 shows the pseudocode and subroutines for several commonly studied algorithms; specifically, selection sort, merge sort, Dijkstra’s algorithm for shortest paths, and Prim’s algorithm to find a minimum spanning tree. These algorithms contain a broad set of subroutines that we can classify into three categories:
Comparison subroutines are those involving a comparison of two or more numbers.
Arithmetic subroutines involve transforming numbers through arithmetic operations (we focus on addition in Figure 1, but explore multiplication later).
Pointer manipulation requires using numerical values (pointers) to manipulate other data values in memory. One example is shown for merge sort, which requires merging two sorted lists. This could be trivially done by executing another sort on the concatenated list; however, the aim is to take advantage of the fact that the two lists are already sorted. This involves maintaining pointers into each list and advancing them only when the number they point to is selected.
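The pointer manipulation described for merge sort can be sketched in a few lines of conventional code. This is only an illustration of the classical subroutine that the network is asked to imitate; the function name `merge_sorted` is hypothetical and does not come from the paper.

```python
def merge_sorted(left, right):
    """Merge two already-sorted lists by advancing one pointer per list."""
    i, j = 0, 0          # pointers into `left` and `right`
    merged = []
    while i < len(left) and j < len(right):
        # Only the pointer whose element was selected is advanced.
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


if __name__ == "__main__":
    print(merge_sorted([3, 8, 21], [1, 9, 10, 40]))
```

Each element is visited once, which is the advantage over re-sorting the concatenated list.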
2.3 Number Representations
Beyond subroutines, numerics are also critically important in teaching neural networks to learn algorithms. Neural networks generally use either categorical, one-hot, or integer number representations. Prior work has found that scalar numbers have difficulty representing large ranges [29] and that binary is a useful representation that generalizes well [15, 25]. We explore embeddings of binary numbers as a form of representation learning, analogous to word embeddings in language models [18], and show that they learn useful structure for algorithmic tasks.
3 Neural Execution Engines
A neural execution engine (NEE) is a transformer-based network that takes as input binary numbers and an encoding mask, and outputs either data values, a pointer, or both. Here, we consider input and output data values to be fixed-width binary vectors, or a sequence of such vectors, and the output pointer to be a one-hot vector of the length of the input sequence. The pointer is used to modify the mask the next time the NEE is invoked. A NEE is essentially a graph attention network [31] that can modify its own graph, resulting in a new decoding mechanism.
The NEE architecture is shown in Figure 2. It is a modification of the transformer architecture. Rather than directly mapping one sequence to another, a NEE takes an input sequence and mask indicating which elements are relevant for computation. It encodes these using a masked transformer encoder. The decoder takes in a single input, the zero vector, and runs the transformer decoder to output a binary vector corresponding to the output value. The last layer of attention in the last decoder block is used as a pointer to the next region of interest for computation. This and the original mask vector are fed into a final temporal convolution block that outputs a new mask. This is then applied in a recurrent fashion until there are no input elements left to process (according to the mask). In the remainder of this section, we go into more detail on the specific elements of the NEE.
Conditional Masking
The encoder of a NEE takes as input both a set of values and a mask, which is used to force the encoder to ignore certain inputs. We use the output pointer of the decoder to modify the mask for a subsequent call of the encoder. In this way, the inputs represent a memory state, and the mask represents a set of pointers into the memory. A NEE effectively learns where to focus its attention for performing computation.
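To make the interaction between mask and pointer concrete, the following is a minimal sketch of masked attention followed by a pointer-driven mask update. It assumes that a mask value of 1 keeps a position and 0 drops it, and it uses hand-picked scores rather than learned ones; the helper names `masked_attention`, `argmax`, and `update_mask` are hypothetical.

```python
import math


def masked_attention(scores, mask):
    """Softmax over `scores`, restricted to positions where mask == 1."""
    kept = [s for s, m in zip(scores, mask) if m == 1]
    if not kept:
        raise ValueError("mask removes every position")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m == 1 else 0.0
            for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def update_mask(mask, pointer):
    """Return a copy of `mask` with the pointed-to position masked out."""
    new_mask = list(mask)
    new_mask[pointer] = 0
    return new_mask


if __name__ == "__main__":
    scores = [2.0, 0.5, 1.0, 3.0]   # illustrative attention logits
    mask = [1, 1, 1, 0]             # position 3 is already ignored
    weights = masked_attention(scores, mask)
    pointer = argmax(weights)
    print(weights, pointer, update_mask(mask, pointer))
```

In this toy setting the masked position receives exactly zero weight, so it cannot influence the next step regardless of its score.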
Learning an attention mask update is a challenging problem in general, as the mask updates themselves also need to strongly generalize. In many algorithms, including the ones considered here, the mask tends to change within a local neighbourhood around the point of interest (the element pointed to by the NEE). For example, in iterative algorithms, the network needs to attend to the next element to be processed which is often close to the last element that was processed.
We therefore use a small temporal (1D) convolutional neural network (CNN). The CNN accepts as input the current mask vector and the one-hot encoded output pointer from the decoder, and it outputs the next mask vector. The update concatenates the mask and the pointer and combines a 1D convolutional layer, a pointwise feed-forward layer, a feature-wise normalization layer, and a sigmoid function. At inference, we simply choose the argmax of the pointer output head to produce the one-hot pointer.
The intuition behind this design choice is that through convolution, we enforce a position ordering on the input by exchanging information among neighbourhoods. The convnet is shift invariant and therefore amenable to generalizing over long sequences. We also experimented with a transformer encoder-decoder using an explicit positional encoding; however, we found that this often fails due to the difficulty in dealing with unseen positions.
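The shift-invariance argument can be illustrated with a hand-set filter. The sketch below is not the learned mask-update network: it applies a single fixed kernel of width 3 (the appendix lists a mask filter size of 3) that moves a one-hot pointer one position to the right, and shows that the same kernel behaves identically on a longer sequence. The names `conv1d_same` and `SHIFT_RIGHT` are hypothetical.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length."""
    half = len(kernel) // 2
    padded = [0] * half + list(signal) + [0] * half
    return [
        sum(k * padded[i + t] for t, k in enumerate(kernel))
        for i in range(len(signal))
    ]


# Hand-set kernel: output[i] = input[i - 1], i.e. shift right by one.
SHIFT_RIGHT = [1, 0, 0]

if __name__ == "__main__":
    short_pointer = [0, 1, 0, 0, 0]
    long_pointer = [0] * 6 + [1] + [0] * 5   # length 12, unseen "at training"
    print(conv1d_same(short_pointer, SHIFT_RIGHT))
    print(conv1d_same(long_pointer, SHIFT_RIGHT))
```

Because the kernel only looks at a local neighbourhood, nothing in it depends on the absolute position or on the sequence length.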
Bitwise Embeddings
As input to a NEE, we embed binary vectors using a linear projection. This is equivalent to defining a learnable vector for each bit position, and then summing these vectors elementwise, modulated by the value of their corresponding bit. That is, given an embedding vector $v_i$ for each bit $i$, for an $n$-bit input vector $x$, we would compute $\hat{x} = \sum_{i} x_i v_i$. For example, $\mathrm{emb}(1001) = v_0 + v_3$.
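The equivalence between a linear projection of a binary vector and a sum of per-bit vectors can be sketched as follows. The per-bit vectors here are random placeholders standing in for learned parameters, and the names `make_bit_vectors` and `embed` are hypothetical.

```python
import random


def make_bit_vectors(num_bits, dim, seed=0):
    """One placeholder vector per bit position (learned in the real model)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_bits)]


def embed(bits, bit_vectors):
    """Sum the vectors of the bit positions that are set to 1."""
    dim = len(bit_vectors[0])
    out = [0.0] * dim
    for bit, vec in zip(bits, bit_vectors):
        if bit:
            out = [o + v for o, v in zip(out, vec)]
    return out


if __name__ == "__main__":
    vectors = make_bit_vectors(num_bits=4, dim=3)
    # For the bit string 1001, only the first and last vectors contribute.
    print(embed([1, 0, 0, 1], vectors))
```

A number that never appeared in training still receives an embedding, as long as each of its bits was set in some training example.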
Two important tokens for our purposes are the start and end tokens. These are commonly used in natural language data to denote the start and end of a sequence. We use the start token as input to the decoder, and the end token to denote the end of an input sequence. This allows us to train a NEE to learn to emit the end token when it has completed an algorithm.
Additionally, we also use these symbols to define both 0 and $\infty$. These concepts are important for many algorithms, particularly for initialization. For addition, we require that $0 + x = x$ and $\infty + x = \infty$. As a more concrete example, in shortest path, the distance from the source node to the other nodes in the graph can be denoted by $\infty$ since they are unexplored. We set the start token to 0 and train the model to learn an embedding vector for the end token such that it behaves as $\infty$. That is, the model will learn that $\infty > x$ for all $x \neq \infty$ and that $\infty + x = \infty$.
4 Current Limitations of Sequence to Sequence Generalization
Learning Selection Sort
We first study how well a state-of-the-art transformer-based sequence to sequence model (Section 2.1) learns selection sort. Selection sort involves iteratively finding the minimum number in a list, removing that number from the list, and adding it to the end of the sorted list. We model selection sort using sequence to sequence learning [28] with input examples of unsorted sequences (length 8) of 8-bit binary numbers and output examples of the correctly sorted sequence. To imitate a sorting subroutine, we provide supervision on intermediate states: at each stage of the algorithm the transformer receives the unsorted input list, the partially sorted output list, and the target number.
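The intermediate-state supervision described above can be sketched as a trace generator. This is an illustration of the idea, not the authors' data pipeline; it assumes that the "unsorted input list" is the full original list at every step, and the name `selection_sort_trace` is hypothetical.

```python
def selection_sort_trace(values):
    """Return one (input list, partially sorted output, target) triple per step."""
    remaining = list(values)
    sorted_prefix = []
    trace = []
    while remaining:
        target = min(remaining)  # the number the model should emit next
        trace.append((list(values), list(sorted_prefix), target))
        remaining.remove(target)
        sorted_prefix.append(target)
    return trace


if __name__ == "__main__":
    for step in selection_sort_trace([5, 2, 9]):
        print(step)
```

Each triple is one supervised example, so a list of length 8 yields eight training steps.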
The numbers used as inputs and outputs to a vanilla transformer are one-hot encoded. Later, we will explore performance when the numbers are binary encoded (Section 2.3). (Footnote 2: We also experimented with one-hot 256-dimensional outputs for other approaches used in the paper, with similar results. See the supplementary material.) The decoder uses a greedy decoding strategy.
We find that uniformly random numbers are easier to sort than distributions with more similar numbers (e.g., 1 vs. 2, 53 vs. 54). We therefore include both kinds of distributions in the training set (95% random numbers, 5% numbers with small differences). (Footnote 3: Throughout this work, the preponderance of errors are regenerated numbers that are off by small differences.)
The performance of this vanilla transformer, evaluated as achieving an exact content and positional match to the correct output example, is shown in Figure 4 (where the test distribution consists of 60% random numbers and 40% numbers with small differences). The transformer is able to learn to sort test sequences at the training length (8 numbers), but performance rapidly degrades as the input shifts to longer sequences; by 100 integers, accuracy is under 10%.
One of the main issues we found is that the transformer has difficulty distinguishing close numbers. We make a number of small architectural modifications in order to boost its accuracy. We describe these modifications and provide ablations in Appendix A.2. As Figure 4 shows, given these modifications, sequence-to-sequence transformers are capable of learning this algorithm on sequences of length 8 with almost perfect accuracy. However, the model fails to generalize to sequences longer than those seen at training time, and performance drops sharply as the sequence length increases.
Attention Fidelity
To understand why performance degrades as the test sequences get longer, we plot the attention matrix of the last layer in the decoder (Figure 4a). During decoding, the transformer accurately attends to the first few numbers in the sequence (distinct dots in the chart) but the attention distribution becomes “fuzzy” as the number of decoding steps increases beyond 8 numbers, often resulting in the same number being repeatedly predicted.
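One simple way to see why attention can become "fuzzy" on longer inputs is the following toy calculation. It is an illustration under a strong assumption, not the paper's analysis: one target position has a logit that exceeds all other logits by a fixed gap, and the other positions are tied. The function names are hypothetical.

```python
import math


def max_attention_weight(gap, length):
    """Softmax weight of the target when its logit leads the others by `gap`."""
    return math.exp(gap) / (math.exp(gap) + (length - 1))


def attention_entropy(gap, length):
    """Entropy (in nats) of that softmax distribution; requires length >= 2."""
    p = max_attention_weight(gap, length)
    q = (1.0 - p) / (length - 1)  # weight of each non-target position
    return -p * math.log(p) - (length - 1) * q * math.log(q)


if __name__ == "__main__":
    for n in (8, 25, 50, 100):
        print(n, round(max_attention_weight(4.0, n), 3),
              round(attention_entropy(4.0, n), 3))
```

With a fixed logit gap, the target's weight shrinks and the entropy grows as the sequence gets longer, so sharp attention on long inputs requires either larger logits or fewer candidate positions.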
Since the transformer had difficulty clearly attending to values beyond the training sequence length, we separate the supervision of where the computation needs to occur from what the computation is. Where the computation needs to occur is governed by the transformer mask. To avoid overly soft attention scores, we aim to restrict the locations in the unsorted sequence where the transformer could possibly attend in every iteration. This is accomplished by producing a conditional mask which learns to ignore the data elements that have already been appended to the sorted_list and feeding that mask back into the transformer (shown on the bottom-left side of Figure 1). Put another way, we have encoded the current algorithmic state (the sorted vs. unsorted list elements) in the attention mask rather than the current decoder output.
This modification separates the control (which elements should be considered) from the computation itself (find the minimum value of the list). This allows the transformer to learn output logits of much larger magnitude, resulting in sharper attention, as shown in Figure 4b. Our experimental results consequently demonstrate strong generalization, sorting sequences of up to length 100 without error, as shown in Figure 4. Next, we evaluate this mechanism on a variety of other algorithms.
5 Experiments
5.1 Executing Subroutines
Selection Sort
Selection sort (described in Sec. 4) is translated to the NEE architecture in Fig. 1. The NEE learns to find the minimum of the list, and learns to iteratively update the mask by setting the mask value of the location of the minimum to 0. We show the results for selection sort in Fig. 4 and Table 5.1: the NEE is able to strongly generalize to inputs of length 100 with near-perfect accuracy.
| Accuracy (%) by input size | 25 | 50 | 75 | 100 |
|---|---|---|---|---|
| Selection sort | 100.00 | 100.00 | 100.00 | 100.00 |
| Merge sort | 100.00 | 100.00 | 100.00 | 100.00 |
| Shortest path | 100.00 | 100.00 | 100.00 | 100.00 |
| Minimum spanning tree | 100.00 | 100.00 | 100.00 | 100.00 |
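The control flow that the NEE imitates for selection sort can be written out as a mask-driven loop. In this sketch the learned components are replaced by exact operations (a true minimum and a hard mask update), and a mask value of 1 is assumed to mean "still to be processed"; the name `mask_driven_selection_sort` is hypothetical.

```python
def mask_driven_selection_sort(values):
    """Sort by repeatedly selecting the unmasked minimum and masking it out."""
    mask = [1] * len(values)  # 1 = still unsorted, 0 = already emitted
    output = []
    while any(mask):
        # "Pointer": position of the smallest value among unmasked positions.
        pointer = min(
            (i for i, m in enumerate(mask) if m == 1),
            key=lambda i: values[i],
        )
        output.append(values[pointer])
        mask[pointer] = 0  # the mask, not the decoder output, carries the state
    return output


if __name__ == "__main__":
    print(mask_driven_selection_sort([7, 3, 9, 3, 1]))
```

The input list is never modified; all algorithmic state lives in the mask, which mirrors the separation of control and computation described in Section 4.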
Merge Sort
The code for one implementation of merge sort is shown in Fig. 1. It is broadly broken up into two subroutines, data decomposition (merge_sort) and an action (merge). Every call to merge_sort divides the list in half until there is one element left, which by definition is already sorted. Then, merge unrolls the recursive tree, combining every 2 elements (then every 4, 8, etc.) until the list is fully sorted. Recursive algorithms like merge sort generally consist of these two steps (the “recursive case” and the “base case”).
We focus on the merge function, as it involves challenging pointer manipulation. For two sorted sequences that we would like to merge, we concatenate them and delimit them using the end token. Each sequence has a pointer denoting the current number being considered, represented by setting that element to 1 in the mask and all other elements in that sequence to 0 (e.g., for two length-2 sequences, exactly one element of each sequence is unmasked at any time). The smaller of the two currently considered numbers is chosen as the next number in the merged sequence. The pointer for the chosen sequence is advanced by masking out the current element in the sequence and unmasking the next, and the subroutine repeats.
More concretely, the NEE in Fig. 1 implements this computation. Every timestep, the model outputs the smallest number from the unmasked numbers and the two positions to be considered next. When the pointers both point to the end token, the subroutine returns. Table 5.1 demonstrates that the NEE is able to strongly generalize on merge sort over long sequences (up to length 100) while trained only on shorter sequences.
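The merge step can be sketched with explicit masks over a concatenated memory. This is a non-learned illustration under several assumptions: the end token is modelled as infinity, a mask value of 1 marks the element each pointer currently considers, and the names `END` and `masked_merge` are hypothetical.

```python
END = float("inf")  # assumption: the end token compares larger than any number


def masked_merge(left, right):
    """Merge two sorted lists using one unmasked position per sequence."""
    memory = list(left) + [END] + list(right) + [END]
    mask = [0] * len(memory)
    mask[0] = 1                # pointer into the first sequence
    mask[len(left) + 1] = 1    # pointer into the second sequence
    merged = []
    while True:
        active = [i for i, m in enumerate(mask) if m == 1]
        pointer = min(active, key=lambda i: memory[i])
        if memory[pointer] == END:
            return merged      # both pointers have reached an end token
        merged.append(memory[pointer])
        mask[pointer] = 0      # mask out the chosen element
        mask[pointer + 1] = 1  # unmask its successor


if __name__ == "__main__":
    print(masked_merge([2, 5], [1, 7]))
```

At every step the comparison sees exactly two candidates, independent of how long the two sequences are.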
Composable Subroutines: Shortest Path
While both merge sort and selection sort demonstrated that a NEE can compose the same subroutine repeatedly to sort a list with perfect accuracy, programs often compose multiple different subroutines to perform more complex operations. In this section, we study whether multiple NEEs can be composed to execute a more complicated algorithm.
To that end, we study a graph algorithm, Dijkstra’s algorithm to find shortest paths, shown in Fig. 1. The algorithm consists of four major steps:
(1) Initialization: set the distance from the source node to the other nodes to infinity, then append them into a queue structure for processing; (2) Compute newly found paths from the source node to all neighbours of the selected node; (3) Update path lengths if they are smaller than the stored lengths; (4) Select the node with the smallest distance to the source node and remove it from the queue. The algorithm repeats steps (2)–(4) as long as there are elements in the queue.
Computing Dijkstra’s algorithm requires the NEEs to learn the three corresponding subroutines (Fig. 1). Finding the minimum between the possible_paths and shortest_path as well as the minimum current shortest_path can be accomplished through the NEE trained to accomplish the same goal for sorting. The new challenge is to learn a numerical subroutine, addition. This process is described in detail in Section 5.2.
We compose pretrained NEEs to perform Dijkstra’s algorithm (Fig. 1). The NEEs themselves strongly generalize on their respective subroutines, therefore they also strongly generalize when composed to execute Dijkstra’s algorithm. This persists across a wide range of graph sizes. A step-by-step view is shown in the Appendix. The examples are Erdős-Rényi random graphs. We train on graphs with up to 8 nodes and test on graphs of up to 100 nodes, with 100 graphs evaluated at each size. Weights are randomly assigned within the allowed 8-bit number range. We evaluate the prediction accuracy on the final output (the shortest path from all nodes to the source node) and achieve 100% test accuracy with graph sizes up to 100 nodes (Table 5.1).
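The composition of subroutines in Dijkstra's algorithm can be sketched by writing the algorithm purely in terms of three small functions. Here each function is an exact stand-in for what would be a trained NEE; the names `add`, `minimum`, `argmin_unmasked`, and `dijkstra`, and the adjacency-matrix encoding with `None` for missing edges, are assumptions made for this illustration.

```python
INF = float("inf")


def add(a, b):
    """Stand-in for the addition subroutine."""
    return a + b


def minimum(a, b):
    """Stand-in for the elementwise comparison subroutine."""
    return a if a <= b else b


def argmin_unmasked(values, mask):
    """Stand-in for the find-minimum subroutine over unmasked positions."""
    return min((i for i, m in enumerate(mask) if m == 1),
               key=lambda i: values[i])


def dijkstra(weights, source):
    """Shortest distances from `source`; weights[u][v] is None if no edge."""
    n = len(weights)
    dist = [INF] * n
    dist[source] = 0
    in_queue = [1] * n                      # mask over unprocessed nodes
    while any(in_queue):
        node = argmin_unmasked(dist, in_queue)
        in_queue[node] = 0
        for nb in range(n):
            if weights[node][nb] is not None:
                candidate = add(dist[node], weights[node][nb])
                dist[nb] = minimum(dist[nb], candidate)
    return dist


if __name__ == "__main__":
    N = None
    graph = [
        [N, 4, 1, N],
        [4, N, 2, 5],
        [1, 2, N, N],
        [N, 5, N, N],
    ]
    print(dijkstra(graph, 0))
```

If each of the three stand-ins is exact on its inputs, the composed algorithm is exact as well, which is the argument made above for composing strongly generalizing NEEs.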
Composable Subroutines: Minimum Spanning Tree
As recent work has evaluated generalization on Prim’s algorithm [32], we include it in our evaluation. This algorithm is shown in Fig. 1. We compose pretrained NEEs to compute the solution, training on graphs of 8 nodes and testing on graphs of up to 100 nodes. The graphs are Erdős-Rényi random graphs. We evaluate the prediction accuracy on the whole set, which means the prediction is correct if and only if the whole set predicted is a minimum spanning tree. Table 5.1 shows that we achieve strong generalization on graphs of up to 100 nodes, whereas [32] sees accuracy drop substantially at this scale. We also test on other graph types (including those from [32]) and perform well. Details are provided in the appendix.
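For reference, the classical form of Prim's algorithm that the composed NEEs imitate can be sketched as below. This is a plain, non-learned implementation for a connected undirected graph; the name `prim_mst` and the adjacency-matrix encoding with `None` for missing edges are assumptions made for this illustration.

```python
def prim_mst(weights):
    """Return MST edges (u, v, w) of a connected graph given as a matrix."""
    n = len(weights)
    in_tree = [0] * n
    in_tree[0] = 1                 # grow the tree from node 0
    edges = []
    while not all(in_tree):
        best = None                # cheapest edge leaving the current tree
        for u in range(n):
            if not in_tree[u]:
                continue
            for v in range(n):
                w = weights[u][v]
                if in_tree[v] or w is None:
                    continue
                if best is None or w < best[0]:
                    best = (w, u, v)
        if best is None:
            raise ValueError("graph is not connected")
        edges.append((best[1], best[2], best[0]))
        in_tree[best[2]] = 1
    return edges


if __name__ == "__main__":
    N = None
    graph = [
        [N, 4, 1, N],
        [4, N, 2, 5],
        [1, 2, N, N],
        [N, 5, N, N],
    ]
    print(prim_mst(graph))
```

The inner search for the cheapest crossing edge is a find-minimum over a masked set, which is why the comparison subroutine trained for sorting is a natural fit here.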
5.2 Number representations
Learning Arithmetic
A core component of many algorithms is simple addition. While neural networks internally perform addition, our goal here is to see if NEEs can learn an internal number system using binary representations. This would allow them to gracefully handle missing data and could serve as a starting point towards more complex numerical reasoning. To gauge the relative difficulty of this versus other arithmetic tasks, we also train a model for multiplication.
The results are shown in Table 2. Training on the entire 8-bit number range (256 numbers) and testing on unseen pairs of numbers, the NEE achieves 100% accuracy. In addition to testing on unseen pairs, we test performance on completely unseen numbers by holding out random numbers during training. These results are also shown in Table 2, and the NEE demonstrates high performance even while training on 25% of the number range (64 numbers). This is a promising result as it suggests that we may be able to extend the framework to much larger bit vectors, where observing every number in training is intractable. For multiplication, we train on 12-bit numbers and also observe 100% accuracy (additional details are provided in the Appendix).
| Training numbers | 256 | 224 | 192 | 128 | 89 | 76 | 64 |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 99.00 | 96.53 |
To understand the number system that the NEE has learned, we visualize the structure of the learned embeddings using a 3-dimensional PCA projection, and compare the embeddings learned from sorting, multiplication, and addition, shown in Figure 5 (a), (b), and (c) respectively. For the addition visualization, we show the embeddings with 65% of the numbers held out during training. In Figure 5 (a) and (b), each node is colored based on the number it represents; in Figure 5 (c), held-out numbers are marked red. We find that a highly structured number system has been learned for each task. The multiplication and addition embeddings consist of multiple lines that exhibit human-interpretable patterns (shown with arrows in Figure 5 (b) and (c)). The sorting embeddings exhibit many small clusters, and the numbers placed in a "Z" curve increase by 1 (shown with arrows in Figure 5 (a)). On held-out numbers for the addition task, the NEE places the embeddings of the unseen numbers in their logical position, allowing for accurate interpolation. More detailed visualizations are provided in the Appendix.
6 Related Work
Learning subroutines
Inspired by computing architectures, there have been a number of proposals for neural networks that attempt to learn complex algorithms purely from weak supervision, i.e., input/output pairs [12, 13, 15, 17, 33]. Theoretically, these are able to represent any computable function, though practically they have trouble with sequence lengths longer than those seen in training and do not strongly generalize. Unlike these networks, which are typically trained on scalar data values in limited ranges or focus purely on pointer arithmetic, we train on significantly larger (8-bit) number ranges, and demonstrate strong generalization in a wide variety of algorithmic tasks.
Recent work on neural execution [32] explicitly models intermediate execution states (strong supervision) in order to learn graph algorithms. They also find that the entropy of attention weights plays a significant role in generalization, and address the problem by using max aggregation and entropy penalties [32]. Despite this solution, a drop in performance is observed over larger graphs, including with Prim’s algorithm. On the other hand, in this work, we demonstrate strong generalization on Prim’s algorithm on much larger graphs than those used in training (Section 5). NEE has the added benefit that it does not require additional heuristics to learn a low-entropy mask; it naturally arises from conditional masking.
Work in neural program synthesis [19, 22, 7, 3, 8]—which uses neural networks with the goal of generating and finding a “correct” program such that it will generalize beyond the training distribution—has also employed strong supervision in the form of execution traces [24, 25, 4]. For instance, [4] uses execution traces with tail recursion (where the subroutines call themselves), albeit for program synthesis, and shows that this leads to improved generalization.
The computer architecture community has also explored using neural networks to execute approximate portions of algorithms, as there could be execution speed and efficiency advantages [10]. Increasing the size of our learned subroutines could allow neural networks and learned algorithms to replace general purpose CPUs.
Learning arithmetic
Several works have used neural networks to learn number systems for performing arithmetic, though generally on small number ranges [5]. For example, [21] directly embeds integers from a limited range as vectors and trains these, along with matrices representing relationships between objects. [27] expands on this idea, modeling objects as matrices so that relationships can equivalently be treated as objects, allowing the system to learn higher-order relationships. [29] explores the (poor) generalization capability of neural networks on scalar-valued inputs outside of their training range, and develops new architectures that are better suited to scalar arithmetic, improving extrapolation.
Several papers have used neural networks to learn binary arithmetic with some success [14, 15]. [11] develops a custom architecture that is tested on performing arithmetic, but trains on symbols in a limited range and does not demonstrate strong generalization. Also, recent work has shown that graph neural networks are capable of learning from 64-bit binary memory states provided execution traces of assembly code, and observes that this representation numerically generalizes better than one-hot or categorical representations [25]. Going beyond this, we directly explore computation with binary numbers, and the resultant structure of the learned representations.
7 Conclusion
We propose neural execution engines (NEEs), which leverage a learned mask to imitate the functionality of larger algorithms. We demonstrate that while state-of-the-art sequence models (transformers) fail to strongly generalize on tasks like sorting, imitating the smaller subroutines that compose to form a larger algorithm allows NEEs to strongly generalize across a variety of tasks and number ranges. There are many natural extensions within and outside of algorithmic reasoning. For example, one could use reinforcement learning to replace imitation learning, and learn to increase the efficiency of known algorithms, or link the generation of NEE-like models to source code. Growing the sizes of the subroutines that a NEE learns could allow neural networks to supplant general-purpose machines for execution efficiency, since general-purpose machines require individual sequentially encoded instructions [34]. Additionally, the concept of strong generalization allows us to reduce the size of training datasets, as a network trained on shorter sequences or small graphs is able to extrapolate to much longer sequences or larger graphs, thereby increasing training efficiency. We also see the link between learned attention masks and strong generalization as an interesting direction for other areas, like natural language processing.
8 Broader Impact of this Work
This work is an incremental step in a much broader initiative towards neural networks that can perform algorithmic reasoning. Neural networks are currently very powerful tools for perceptual reasoning, and being able to combine this with algorithmic reasoning in a single unified system could form the foundation for the next generation of AI systems. True strong generalization has a number of advantages: strongly generalizing systems are inherently more reliable. They would not be subject to issues of data imbalance, adversarial examples, or domain shift. This could be especially useful in many important domains like medicine. Strong generalization can also reduce the size of datasets required to learn tasks, thereby also providing environmental savings by reducing the carbon footprint of running large-scale workloads. However, strongly generalizing systems could be more susceptible to inheriting the biases of the algorithms on which they are based. If the underlying algorithm is based on incorrect assumptions, or limited information, then a strongly generalizing system will simply reflect this, rather than correct it.
Appendix A
A.1 Hyperparameters
8-bit binary numbers are used in all tasks except the multiplication task, where 12-bit binary numbers are used. For sorting, we found a lower-dimensional bitwise embedding to be sufficient. For more difficult tasks like addition and multiplication, we found it necessary to increase the dimension. We used no positional encoding for the sorting tasks, and single-headed attention for all tasks. The remaining NEE hyperparameters, aside from the changes described next, were set to their defaults.
| Hyperparameter | Value |
|---|---|
| Number of encoder (decoder) layers | 6 |
| Number of layers in the feed-forward network | 2 |
| Number of hidden units in the feed-forward network | 128 |
| Mask filter size | 3 |
| Mask number of filters | 16 |
| Ratio of residual connection | 1.5 |
| Dropout rate | 0.1 |
| Optimizer | Adam |
| Warmup steps | 4000 |
| Learning rate | |
A.2 Sorting ablations
In Figure 4, we show the performance of a modified transformer model in a sequence-to-sequence setup (Section 4). In this section, we elaborate on the modifications we made to the transformer, and how those modifications affect the generalization performance. We illustrate this by evaluating the results of selection sort.
We study 3 different data distributions. The first is where we train on uniformly random sequences of tokens. The second is a mixed setting, where some of the examples are drawn uniformly and the rest are drawn from a more difficult distribution, where the numbers are closer in value. The third is the most difficult setting, where all sequences have numbers that are close to each other in value.
We ablate specific architectural changes in these settings. The original and modified encoder are represented visually in Figure 6. Specifically, the architectural choices we test are as follows, and the ones applied in the modified transformer (in that they provide a net benefit) are checked:

Scaling up the strength of the residual connections by a factor of 1.5. (✓)

Using an MLP-based attention module [1]. (✓)

Symmetrizing the MLP-based attention by flipping the order of the inputs and averaging the resulting logit values. (✓)

Using the standard scaled dot product attention.

Using a binary encoding of input values. (✓)

Using a one-hot encoding of the input values.

Using a binary encoding as the input, but without any linear embedding.

Sharing the bitwise embedding projection between the query, key, and value in the attention mechanism. (✓)
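The symmetrization step in the list above can be illustrated with a small sketch. It assumes an additive scorer of the form v · tanh(W_q q + W_k k) in the style of [1]; the function names, the parameter layout, and the random toy weights are hypothetical and are not taken from the NEE implementation.

```python
import math
import random


def make_params(dim, hidden, seed=0):
    """Random toy weights for an additive attention scorer (illustrative only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_q": mat(hidden, dim),
        "W_k": mat(hidden, dim),
        "v": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
    }


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def mlp_score(params, q, k):
    """Additive (MLP-based) attention logit: v . tanh(W_q q + W_k k)."""
    hq = matvec(params["W_q"], q)
    hk = matvec(params["W_k"], k)
    return sum(v * math.tanh(a + b) for v, a, b in zip(params["v"], hq, hk))


def symmetric_score(params, q, k):
    """Flip the input order and average the two resulting logits."""
    return 0.5 * (mlp_score(params, q, k) + mlp_score(params, k, q))
```

With this construction the logit for the pair (q, k) equals the logit for (k, q), whereas the unsymmetrized scorer generally assigns the two orders different values.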
The test accuracy, measured as getting all values and their positions correct on sequences of length 8 in the mixed setting, is shown in Table 4. The architectural changes improve performance on these sequences to near-perfect accuracy.
Models  Accuracy @ seq_len = 8 

all_modifications  99.00% 
all_modifications_wo_sym  98.33% 
all_modifications_excp_res  95.89% 
all_modifications_excp_shared_proj  98.44% 
all_modifications_excp_dot_att  97.56% 
all_modifications_one_hot_emb  89.56% 
all_modifications_binary_emb  84.78% 
vanilla  96.67% 
Vanilla_binary_emb  77.11% 
Vanilla_one_hot_emb  93.11% 
In Figure 7, we show the strong generalization performance of the different architectures. While some changes are able to improve performance in this regime, the performance ultimately drops steeply as the length of the test sequence increases. This is consistent across all test scenarios and suggests that standard modifications on the transformer architecture are unlikely to prevent attention weights from losing sharpness with longer sequences (Fig. 4).

Below we list some random and hard examples together with the corresponding outputs (which contain some errors) from the vanilla transformer with one-hot encoded input numbers (each number has an independent embedding), an encoding commonly used in natural language models.
The symbol e represents the end token. The model makes more mistakes on the hard examples.
Random examples:
100 62 114 66 241 1 63 237 e
181 52 71 254 246 145 118 28 e
Output from Vanilla_one_hot_emb:
1 62 63 66 100 114 237 53 e
28 52 71 118 145 181 246 254 e
Hard examples:
132 126 131 129 127 130 128 125 e
238 239 241 240 243 237 242 244 e
Output from Vanilla_one_hot_emb:
125 126 127 128 129 130 132 e e
237 238 240 244 243 e 237 242 e
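The all-values-and-positions criterion can be checked directly on the four examples above. The sketch below is only an illustration of that criterion: the helper names are hypothetical, and truncating a prediction at its first end token is an assumption about how outputs are read, not a detail confirmed by the text.

```python
END = "e"  # end token, as in the examples above


def parse(line):
    """Split a whitespace-separated line and keep the numbers before the first end token."""
    tokens = line.split()
    if END in tokens:
        tokens = tokens[:tokens.index(END)]
    return [int(t) for t in tokens]


def exact_match(inputs, outputs):
    """Fraction of outputs that equal the sorted input in every position."""
    hits = sum(parse(out) == sorted(parse(inp)) for inp, out in zip(inputs, outputs))
    return hits / len(inputs)


inputs = [
    "100 62 114 66 241 1 63 237 e",
    "181 52 71 254 246 145 118 28 e",
    "132 126 131 129 127 130 128 125 e",
    "238 239 241 240 243 237 242 244 e",
]
outputs = [
    "1 62 63 66 100 114 237 53 e",
    "28 52 71 118 145 181 246 254 e",
    "125 126 127 128 129 130 132 e e",
    "237 238 240 244 243 e 237 242 e",
]

print(exact_match(inputs[:2], outputs[:2]))  # random examples: 0.5
print(exact_match(inputs[2:], outputs[2:]))  # hard examples: 0.0
```

Only the second random example is sorted correctly in every position; both hard examples fail, which matches the observation that the hard examples produce more mistakes.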
A.3 Graph algorithms tested on different graph types
Prior work [32] has shown that performance on graph algorithms may depend on the type of graph. For comparison, we further explore NEE performance on graph algorithms (Dijkstra and Prim) under two scenarios: (1) training NEEs with traces from selection sort (and addition), and (2) training NEEs with traces from the corresponding graph algorithms, using Erdős–Rényi random graphs as training graphs. For both scenarios, we use 20000 training sequences/graphs of size 8 and 2000 validation sequences/graphs, and test on 100 graphs of various sizes drawn from the following types:

Erdős–Rényi random graphs [9]: each pair of nodes has probability to form an edge, we use uniformly sampled from .

D-regular random graphs: every node is connected to other nodes ( needs to be even and ).

Barabási–Albert random graphs [2]: A graph of nodes is grown by attaching new nodes each with edges that are preferentially attached to existing nodes with high degree. We choose .
We assign random weights to the graphs such that they do not overflow the current number system (integers 0-255). Based on the finding in Section A.2 that numbers close in value are hard to distinguish, in the training data () are hard examples (weights very close in value) when training shortest paths (minimum spanning tree). All the training graphs are Erdős–Rényi random graphs, while in the test graphs each graph type contributes 25 graph samples.
Accuracy / Sizes  25  50  75  100

Shortest path ()  100.00  100.00  100.00  100.00 
Minimum spanning tree ()  100.00  100.00  100.00  100.00 
Shortest path ()  100.00  100.00  100.00  99.91 
Minimum spanning tree ()  100.00  99.00  93.00  92.00 
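As a rough illustration of how a weighted Erdős–Rényi test graph could be produced, consider the sketch below. Everything in it is an assumption made for the example: the function names are hypothetical, weights start at 1, and the overflow guard (capping each weight so that a path of n - 1 edges still fits in 0-255) is just one simple way to respect the number system, not necessarily the procedure used in these experiments.

```python
import random


def safe_max_weight(n, limit=255):
    """Largest edge weight such that a simple path of n - 1 edges stays <= limit."""
    if n < 2:
        raise ValueError("need at least two nodes")
    return limit // (n - 1)


def erdos_renyi_weighted(n, p, max_weight, seed=0):
    """Return {(i, j): weight} for i < j.

    Each unordered pair of nodes becomes an edge independently with
    probability p; weights are integers drawn uniformly from 1..max_weight.
    """
    rng = random.Random(seed)
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                edges[(i, j)] = rng.randint(1, max_weight)
    return edges


# Example: an 8-node graph with edge probability 0.5 (both values chosen arbitrarily).
graph = erdos_renyi_weighted(8, 0.5, safe_max_weight(8), seed=1)
```

The cap shrinks quickly with graph size (for 100 nodes it allows only weights 1 and 2), so a less conservative bound may be preferable in practice.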
A.4 Detailed visualization of learned number embeddings
In Figure 10 we show more detailed visualizations of the learned bitwise embeddings. These are 3-dimensional PCA projections of the full embedding matrix, capturing approximately of the total variance. The main takeaway is that the network is able to learn a coherent number system with a great deal of structure, and that this structure varies depending on the specific task of the network. This is reminiscent of [21], where linear embeddings learned the correct structure to solve a simple modular arithmetic task. The network also learns to embed infinity outside of this structure. Future work will investigate the embeddings that result from an NEE that performs multiple or more complex tasks.


References
 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, 2015.
 [2] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 [3] Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. International Conference on Learning Representations, 2018.
 [4] Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. International Conference on Learning Representations, 2017.
 [5] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. International Conference on Learning Representations, 2019.
 [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics, 2019.
 [7] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdelrahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. International Conference on Machine Learning, 2017.
 [8] Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, and Denny Zhou. Neural logic machines. International Conference on Learning Representations, 2019.
 [9] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
 [10] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural acceleration for general-purpose approximate programs. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
 [11] Karlis Freivalds, Emīls Ozoliņš, and Agris Šostaks. Neural shuffle-exchange networks - sequence processing in O(n log n) time. In Advances in Neural Information Processing Systems, 2019.
 [12] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
 [13] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.
 [14] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198, 2015.
 [15] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. International Conference on Learning Representations, 2016.
 [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [17] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. International Conference on Learning Representations, 2016.
 [18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [19] Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. International Conference on Learning Representations, 2016.
 [20] Mark EJ Newman and Duncan J Watts. Renormalization group analysis of the small-world network model. Physics Letters A, 263(4-6):341–346, 1999.
 [21] Alberto Paccanaro and Geoffrey E. Hinton. Learning distributed representations of concepts using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13(2):232–244, 2001.
 [22] Emilio Parisotto, Abdelrahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. International Conference on Learning Representations, 2017.
 [23] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
 [24] Scott Reed and Nando De Freitas. Neural programmer-interpreters. International Conference on Learning Representations, 2016.
 [25] Zhan Shi, Kevin Swersky, Daniel Tarlow, Parthasarathy Ranganathan, and Milad Hashemi. Learning execution through neural code fusion. International Conference on Learning Representations, 2020.
 [26] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015.
 [27] Ilya Sutskever and Geoffrey E Hinton. Using matrices to model symbolic relationship. In Advances in neural information processing systems, pages 1593–1600, 2009.
 [28] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [29] Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic logic units. In Advances in Neural Information Processing Systems, 2018.
 [30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural information processing systems, 2017.
 [31] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. International Conference on Learning Representations, 2018.
 [32] Petar Veličković, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural execution of graph algorithms. International Conference on Learning Representations, 2020.
 [33] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
 [34] John Von Neumann. First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4):27–75, 1993.