Many applications in artificial intelligence require the ability to learn and perform tasks that have algorithmic structure, such as learning a sequence of precise actions together with conditional decisions and branching. Therefore, an important research problem is learning to induce algorithms from input-output examples. A particular focus of research attention has been on neural algorithm induction, that is, neural architectures that represent algorithms by including unbounded intermediate state and mechanisms for learning control flow. The Neural Turing Machine (ntm) and its successor the Differentiable Neural Computer (dnc) augment a recurrent network with an external memory using differentiable read/write mechanisms, and are able to learn simple algorithmic tasks such as array copying and sorting. Neural Random-Access Machines (NRAM) (nram) learn algorithms by generating a fuzzy circuit comprising pre-defined modules together with modules for dereferencing and storing values in a fixed memory.
Existing architectures do not fully leverage a key fact that algorithmic tasks can be solved by flexibly combining the results of smaller, reusable procedures, which we call modules. A modular architecture aims to represent in a learning system these aspects of programming languages intended for human developers: (i) procedural abstractions that perform computations to produce outputs given some inputs and that can be reused in the algorithm multiple times; (ii) control flow constructs such as branching and loops to compose and combine intermediate results. Although there has been a recent movement toward modular neural architectures for supervised tasks (andreas2016neural; modularnetworks; routingnetworks; houdini; crl), the design space of modular networks has not been fully explored for algorithm induction.
Towards this end, we present a new architecture Main (short for Modular Algorithm Induction Network) for neural algorithm induction. Our architecture consists of a neural controller that interacts with a variable-length read/write tape where inputs, outputs, and intermediate values are stored. Each module is a small computational procedure that reads from and writes to a small fixed number of tape cells (a fixed set of modules is specified in advance). At each time step, the controller selects a module to use together with the tape locations of the module’s input arguments and the write location of the module output. This architecture is trained end-to-end using reinforcement learning.
A comparison of architectural design choices in Main with those in recent neural program induction approaches NTM (ntm), LSA (lsa), NRAM (nram), NGPU (ngpu), CRL (crl), and DNC (dnc) is presented in Table 1. Unlike previous architectures, Main allows for learning to compose modules together with the corresponding argument values for the chosen modules using a general domain-agnostic mechanism for module and argument choices. It uses a generic linear tape layout together with a parallel history tape of landmark symbols that indicate the most recently read and written cells. Finally, the architecture allows for random access of the memory cells and uses a memoryless controller with a length-invariant self-attention based encoding of input tape contents.
|- General module selection||✗||✓||✓|
|- General argument selection||✓||✗||✓|
|Encoder type||Attn||FF||FF||CNN||RNN||Attn||CNN + Attn|
There are two key design choices in Main that we found crucial for good performance. The first is in the representation of the previous history of the computation. Like previous work (dnc), we find that providing history to the model is an important source of information. We introduce a simple but effective discrete representation: a parallel history tape of landmark symbols that indicate the most recently read and written cells. This helps the model to learn common patterns of control flow, such as map and reduce operations over the tape. The second choice is in the encoder architecture that presents the abstracted tape view to the controller. We find that using an attention-based encoder performs much better than a recurrent network based encoding within the controller, which has been employed by previous architectures for neural algorithm induction.
We evaluate our architecture Main on five algorithmic tasks including array copying, reversing, increment, filter-even, and multi-digit add. Main can learn these tasks in an end-to-end manner using input-output examples and is also able to generalize to longer input lengths. We observe that both parallel tape history and length-invariant input tape encoding based on self-attention are necessary for the architecture to learn these tasks. Moreover, for tasks such as filter-even and multi-digit add that require the controller to perform some computations, we observe that abstracting the tape contents to only certain landmark positions can help it learn the corresponding algorithms.
2 Related Work
Learning to Compose and Modular Networks: Compositional Recursive Learner (CRL) (crl) is a framework for learning algorithmic procedures for composing representation transformations, where both the transformations and their compositions are learned simultaneously from sparse supervision. The controller that learns to compose transformations is trained using reinforcement learning, whereas the transformations themselves are trained using supervised learning. The CRL architecture supports two forms of transformations: reducers and translators. Because of these specialized transformations, CRL imposes a restriction on the arguments of the transformations: they are always either the full tape or three consecutive input tokens (which produce 1 resulting token). In contrast, our architecture allows the controller to select arbitrary locations on the input tape as module arguments and also the write locations for their outputs.
There has also been some recent related work on learning Modular Networks (modularnetworks) and Routing Networks (routingnetworks). In routing networks, a router learns to select a sequence of function blocks to compose given an input, and is trained using reinforcement learning. modularnetworks
use a probabilistic model to represent the module choice as a latent variable and use Expectation Maximization (EM) to learn the module parameters and the decomposition choice in an end-to-end manner by maximizing a variational lower bound of the likelihood. Both of these approaches learn modular composition of functions with the aim of reusability and better generalization across multiple tasks. Our architecture, in contrast, is designed for learning algorithms where the modules are pre-specified but the task of the controller agent is to learn to compose those modules to achieve desired input-output behavior.
Neural Program Induction and Synthesis: Several neural architectures have been proposed recently to learn algorithmic tasks. Neural Turing Machine (NTM) (ntm) extends an LSTM controller with an external memory together with differentiable read-write mechanism. Differentiable Neural Computer (DNC) (dnc), a successor to NTM, added capabilities of freeing up unused memory for processing long sequences and a temporal link matrix for better tracking the write order. lsa propose an architecture with an RNN controller that learns to navigate a structured input grid by selecting move actions and outputs tokens on an output grid. Unlike our architecture, which separates control from compute, these architectures require the controller to learn both the control structure as well as the desired computations.
Neural RAM (NRAM) (nram) uses a neural controller that learns to produce a fuzzy circuit consisting of a fixed number of modules. The controller learns to wire the modules with inputs coming from a fixed set of registers or intermediate outputs, and also learns to write output back to registers and memory tape. There are three key differences between Neural RAM and our architecture. First, NRAM uses a fixed sequence of 14 modules (with an option to repeat the whole sequence multiple times). In contrast, our controller learns to select an appropriate module at each time step. Second, since the registers and memory tape in the NRAM architecture store distributions over integers, the modules also need to be differentiable to compute outputs over such inputs. Our architecture, on the other hand, does not require the modules to be differentiable and also writes discrete values on the input tape. Finally, our architecture uses a parallel history tape of recent read and write positions for accessing the input tape, unlike pointer dereferencing in NRAM.
We summarize the characteristics of these architectures, comparing them to our work in Table 1. To understand the rows of this table, “Modules” means whether computation is performed by composable modules. “General module selection” means the modules can be selected in any order at any point in the computation (NRAM enforces a fixed ordering). “General argument selection” means module arguments can come from anywhere in memory without restriction (CRL enforces inputs be locally adjacent). “Read/write history” means that history of previous read and write locations is provided as context to the controller. “Random access” means the architecture can look up arbitrary memory contents and aggregate information across memory regions, rather than be restricted by relative head movements (LSA). “Memoryless controller” means the architecture does not require the controller to have hidden memory over time, i.e. the controller is not recurrent, and only uses external memory for context. Finally, the “Encoder type” indicates how memory is consumed by the controller: feed-forward (FF), convolutional (CNN), recurrent (RNN), or attention-based (Attn).
There is also a growing interest in using neural models to generate symbolic programs as output. RobustFill (robustfill) trains an attention based encoder-decoder model that generates a program given a set of input-output examples. DeepCoder (deepcoder) learns a probability distribution over a set of functions in a DSL to guide an enumerative search. egnps also use an encoder-decoder based approach to synthesize Karel programs (karelsynthesis), but in addition also execute the partially decoded program to guide the decoder. In contrast, our architecture uses a more general mechanism to read and write on the variable length input tape, and uses reinforcement learning to learn to compose the desired modules for each individual task.
We describe Main, an architecture which can express arbitrary modular Turing machines. Our architecture has three parts: the memory which stores the state of the computation, the modules which update the tape, and the controller which chooses a module to execute at each stage of the computation, as illustrated in Figure 3. Once a module is selected out of the module set (4 modules depicted), and the read/write locations on the tape are chosen, the module sees only the read inputs, and modifies the tape only at the write position. Each module is a user provided function (which could in principle also be learned). In our experiments we picked modules of two arguments and one output, but our architecture is agnostic to the number of read/write heads as shown in Figure 1.
The memory is a finite tape of discrete tokens/symbols. Formally, the memory passes through a sequence of states, one per computation step. Each state is itself an array of tokens, indexed by tape position. At the start of the computation, the memory is initialized with the program input and the empty space necessary to perform intermediate computation and write the output. The memory length is set dynamically based on the size of the given input, so every neural architecture component that depends on the memory length needs to be length invariant.
The memory contains landmark tokens, which are task-specific tokens, e.g. ‘$’, ‘+’, ‘&’, that provide positional information. For example, these tokens indicate where the input starts, where the output should be written, etc. In our experiments we found that the controller would often overwrite the landmark tokens during training. As a workaround, we provide the positions of landmark tokens as immutable metadata. These are the lambdas inside shown above. Additionally, we place Start-Of-Tape and End-Of-Tape landmarks at the first and last tape positions.
The initial tape contains the input and additional space for scratch work. At the end, the output of the computation is read from designated regions of the tape, called target locations, which are initialized with the empty token ‘.’. Note that the architecture is free to overwrite any position during the course of computation, including important input tokens. When scoring the final tape, we only look at the target positions, designated by the ‘.’ token. The initial tape configurations for the Copy and Multi-digit Add tasks are shown in Figure 2. In addition to the input tape, the tape configuration also consists of landmark positions as well as the read/write head history.
The modules are functions that read a narrow portion of the memory, and compute new values that are stored in memory. Each module is a function of a fixed number of arguments and outputs a vector of fixed size, i.e. the number of read and write heads corresponds to the number of inputs and outputs of the modules. During the computation, each input and output of the modules will be a single cell on the memory tape. Some example modules are the maximum module, which returns the maximum of two inputs, and the sum module, which returns the sum of two inputs modulo the base.
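As a minimal sketch of the two example modules named above, the following Python functions take two tape cells and return one. The token representation (Python integers for digits, strings for symbols), the helper `is_numeric`, and the choice to return the empty token on non-numeric input are assumptions made for illustration, not the paper's specification.

```python
from typing import Union

Token = Union[int, str]   # assumption: digits are ints, symbols are strings
EMPTY: Token = "."        # the empty token used on the tape


def is_numeric(tok: Token) -> bool:
    """True for digit tokens, False for landmark or empty symbols."""
    return isinstance(tok, int) and not isinstance(tok, bool)


def max_module(a: Token, b: Token, base: int = 10) -> Token:
    """Maximum of two digit cells; `base` is unused but keeps one signature."""
    if is_numeric(a) and is_numeric(b):
        return max(a, b)
    return EMPTY  # assumption: fallback for non-numeric arguments


def sum_module(a: Token, b: Token, base: int = 10) -> Token:
    """Sum of two digit cells modulo the base."""
    if is_numeric(a) and is_numeric(b):
        return (a + b) % base
    return EMPTY  # assumption: fallback for non-numeric arguments
```

Because the modules are plain functions over discrete tokens, nothing in this sketch needs to be differentiable.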
The controller is a policy over actions that specifies a module-tape interaction. The controller cannot directly modify memory. Instead it selects a module and the locations on the tape the module will read from and write to. Specifically, the controller defines a distribution over head locations and a module choice, conditioned on a context. There is one random variable over tape positions for each head (its support is the set of tape indices): some of these variables select read-head locations, and the others select write-head locations. A further random variable ranges over the modules. Finally, the context contains the current state of memory and information about the computation history. We will define this in more detail shortly.
The controller defines an end-to-end computation as follows. At each step of the computation, we sample from the controller, conditioned on the current state of the memory, to choose the next module and the locations of the read and write heads. This produces a module choice, locations for the read heads, and locations for the write heads. Because the controller is free to choose any tape location for each read and write head independently, the memory is random access. The tape is updated by calling the chosen module with the tape contents under the read heads as input, and then writing its output to the positions specified by the write heads.
The selected write indices are updated with the module’s output, and other tape elements are unchanged.
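The tape update described above can be sketched in a few lines of Python. The function name `apply_action` and the convention that a module returns a tuple with one value per write head are assumptions for illustration; the sketch only shows that a module sees the cells under the read heads and changes the cells under the write heads.

```python
from typing import Callable, List, Sequence, Tuple, Union

Token = Union[int, str]
Module = Callable[..., Tuple[Token, ...]]  # assumption: one output per write head


def apply_action(
    tape: Sequence[Token],
    module: Module,
    read_pos: Sequence[int],
    write_pos: Sequence[int],
) -> List[Token]:
    """Return the next tape state after one module-tape interaction."""
    inputs = [tape[i] for i in read_pos]   # module sees only cells under read heads
    outputs = module(*inputs)
    new_tape = list(tape)                  # all other cells are copied unchanged
    for pos, value in zip(write_pos, outputs):
        new_tape[pos] = value              # overwrite cells under the write heads
    return new_tape


def identity(a: Token, b: Token) -> Tuple[Token, ...]:
    """Hypothetical two-input identity module that ignores its second input."""
    return (a,)
```

For example, with the tape `["$", 3, 1, 4, "&", ".", ".", "."]`, calling `apply_action(tape, identity, (1, 1), (5,))` copies the first digit into the first empty cell and leaves every other position as it was.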
The context is the input to the controller, which represents both the current tape contents and an action history that represents information about the previous computation. The action history consists of two parts:
Fixed-size part: Contains the module choice and head locations chosen at the previous time step.
Tape values underneath the heads at the previous time step.
Variable-size part: One-hot encoding of the read and write head locations at the previous step of the computation.
The tape values underneath the heads are technically redundant information, because the controller can use the tape contents combined with the head locations from the variable-sized part to look up the corresponding tape values. However, we found that the controller had a hard time doing this in our experiments, and providing the head values as auxiliary input to the controller proved helpful.
Formally, the context is a tuple of two encodings: a fixed-size encoding, which does not depend on the tape length, and a variable-size encoding, which does. First, the fixed-size encoding describes the current tape values underneath the tape heads chosen at the previous time step, together with the previous module choice. Second, the variable-size encoding is a matrix concatenation that provides complete information about the tape.
We can think of the variable-size encoding as a stack of binary channels, each a row vector whose length equals the tape length. The one-hot function produces a vector of that length which is filled with 0s, except for a 1 at the given position. Each tape token is represented as a one-hot vector over the token vocabulary. Their horizontal concatenation produces a channel for each token (indicating its presence or absence at each position). The landmark and head channels indicate the positions of landmarks and heads respectively (the landmark positions are fixed throughout the episode and not indexed by the time step). Note that at the first step no actions have been taken, so the previous-action encodings are all 0s.
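A minimal sketch of this variable-size encoding, using plain Python lists, is given below. The function names, the argument layout, and the example vocabulary are assumptions; the channel order (token channels on top, landmark channels in the middle, previous-head channels at the bottom) follows the description used in the ablations.

```python
from typing import List, Optional, Sequence, Union

Token = Union[int, str]


def one_hot(position: int, length: int) -> List[int]:
    """Vector of the given length with a single 1 at `position`."""
    return [1 if i == position else 0 for i in range(length)]


def encode_tape(
    tape: Sequence[Token],
    vocab: Sequence[Token],
    landmark_positions: Sequence[int],
    prev_heads: Optional[Sequence[int]],
    num_heads: int,
) -> List[List[int]]:
    """Stack binary channels, each a row whose length equals the tape length."""
    n = len(tape)
    # One channel per vocabulary token: 1 where that token is present.
    token_rows = [[1 if tok == v else 0 for tok in tape] for v in vocab]
    # One channel per landmark, fixed for the whole episode.
    landmark_rows = [one_hot(p, n) for p in landmark_positions]
    # One channel per head; all zeros before any action has been taken.
    if prev_heads is None:
        head_rows = [[0] * n for _ in range(num_heads)]
    else:
        head_rows = [one_hot(p, n) for p in prev_heads]
    return token_rows + landmark_rows + head_rows
```

The number of rows depends only on the vocabulary, the landmarks, and the heads, while the number of columns grows with the tape, which is why the downstream encoder has to be length invariant.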
Now we describe the controller architecture. Given the context, the controller begins with a sequence encoder for the variable-length encoding. We pass this encoding through two 1D convolutional layers (along the tape-position dimension) with filter width 3 and stride 1. This is the only aspect of our encoder that provides awareness of the local ordering of cells on the tape.
After that, we explore two different seq2fixed encoders:
RNN encoder: BiLSTM produces fixed length embedding (concat of embedding for each direction) which is fed into the controller.
Attention encoder: A context-independent query set is learned (fixed during evaluation). Queries are fed into attention over the tape, and the resulting weighted sum over tape values is a fixed length embedding that is fed into the controller.
Next, we pass the resulting fixed-size embedding of the variable-size encoding, along with the fixed-size encoding, into a feed-forward network which outputs attention queries for the read/write head actions and logits for the module selection action. Logits for the read/write head actions are produced with dot-product attention over the tape positions (an independent attention head for each read/write head).
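To illustrate why the attention encoder yields a fixed-length embedding for any tape length, here is a small pure-Python sketch of pooling per-position features with a set of learned queries. The function names, the unscaled dot product, and the use of the same vectors as keys and values are assumptions; in the architecture described above the per-position features would come from the convolutional layers.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_logits(query: Sequence[float], features: Sequence[Sequence[float]]) -> List[float]:
    """Dot-product score of one query against every tape position."""
    return [sum(q * f for q, f in zip(query, feat)) for feat in features]


def attention_pool(
    queries: Sequence[Sequence[float]],
    features: Sequence[Sequence[float]],
) -> List[float]:
    """Concatenate one attention-weighted feature sum per query."""
    dim = len(features[0])
    pooled: List[float] = []
    for query in queries:
        weights = softmax(head_logits(query, features))
        for j in range(dim):
            pooled.append(sum(w * feat[j] for w, feat in zip(weights, features)))
    return pooled
```

The output length is the number of queries times the feature dimension, independent of how many tape positions are attended over. The same `head_logits` scoring also gives one logit per tape position, which is the shape needed for choosing a read or write head location.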
Learning the Controller.
We frame our setup in the reinforcement learning paradigm, where the controller is the agent, and other components (memory and modules) are part of the environment. From the perspective of algorithm induction, only the program input and target outputs are external, while everything in Main is part of the black box that performs the computation. The controller is trained end-to-end on all its actions with Impala (impala), a distributed variant of REINFORCE. Simultaneously learning to halt with RL, while learning what computation to perform, proved to be unstable in our experiments. We removed the additional complication of learning when to halt by providing a halting oracle. The oracle is given the correct output, and immediately ends the episode if the answer is in memory. The oracle is used in evaluation as well.
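The role of the halting oracle can be sketched as an episode loop that checks the target positions before every action. Everything named here (`oracle_done`, `run_episode`, the `policy` callback, and the `max_steps` budget) is hypothetical; the source does not specify a step limit or a reward, so the sketch returns only the final tape, a success flag, and the number of actions taken.

```python
from typing import Callable, List, Sequence, Tuple, Union

Token = Union[int, str]
Action = Tuple[Callable[..., Tuple[Token, ...]], Sequence[int], Sequence[int]]


def oracle_done(tape: Sequence[Token], targets: Sequence[int],
                expected: Sequence[Token]) -> bool:
    """Halting oracle: true once the target cells hold the correct output."""
    return [tape[p] for p in targets] == list(expected)


def run_episode(
    tape: Sequence[Token],
    policy: Callable[[Sequence[Token], int], Action],
    targets: Sequence[int],
    expected: Sequence[Token],
    max_steps: int,  # assumption: a step budget, not given in the source
) -> Tuple[List[Token], bool, int]:
    """Run the controller until the oracle halts or the budget runs out."""
    state = list(tape)
    for step in range(max_steps):
        if oracle_done(state, targets, expected):
            return state, True, step
        module, read_pos, write_pos = policy(state, step)
        outputs = module(*[state[i] for i in read_pos])
        for pos, value in zip(write_pos, outputs):
            state[pos] = value
    return state, oracle_done(state, targets, expected), max_steps
```

In training, `policy` would sample from the learned controller; a scripted policy is enough to exercise the loop.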
We now present an empirical evaluation of our architecture Main in order to establish that (i) it can learn to perform algorithmic tasks, (ii) attention is important for length generalization, and (iii) separating control flow and data flow through limited view helps in learning.
Tasks: We consider five algorithmic tasks with six pre-defined modules. All but the Multi-Digit Addition task are given the same module pool. We found that Multi-Digit Addition was more difficult to learn, and to reduce the action space we cut down the module pool to only the ones needed to perform the computation. For simplicity, we make all the modules have the same number of inputs and outputs. Specifically, in our experiments there are two read heads and one write head. Modules which naturally read fewer than two inputs ignore their additional inputs. We consider the following six modules: Identity, Increment, Max, Sum, SumInc, and Reset. The semantics of the modules are presented in Appendix A.1. We consider the following tasks.
Copy: Given an array of base 10 digits in the memory tape and a pointer to the destination, the task is to copy all elements from the array to the destination (e.g. see Figure 2(a)). We provide landmarks for the start of the input and the start of the output location. The output has the same length as the input array, so the total memory tape size is the two array lengths plus 1 separation token. Modules: Reset, Identity, Increment, Max, Sum
Reverse: Same tape size as Copy task, with the goal of writing the input digits in reverse order.
Increment: Same tape size as Copy task, with the goal of writing the result of adding 1 to each digit of the input array (modulo base) to the destination.
Filter Even: Given an array of base 16 digits in the tape, the goal is to output a sequence containing only the even-valued digits in the same order. Modules: Reset, Identity, Increment, Max, Sum
Multi-Digit Addition: Given two arrays of base 10 digits separated by ‘+’, the goal is to output the sum of the integers (denoted by input arrays) as a sequence of digits. Modules: Sum, SumInc
4.1 Experimental Setup
Figure 4: The average success rate and the variance of Main for different algorithmic tasks for different input generalization lengths of 10, 20, and 100.
We train the controller with Impala, a distributed asynchronous algorithm. We used 50 data collection workers to sample episodes from the most recent policy. There is one training worker which queues up episodes sent by the data collectors into training batches. We train until 30M timesteps across all episodes. We use a curriculum over task difficulty, which essentially corresponds to the input length. For each episode, an input-output pair is sampled from the task generator given a difficulty level, and the difficulty level is sampled uniformly from a range whose upper end is the maximum curriculum setting. We vary this maximum from 2 to 10 with a linear schedule starting at 1M and ending at 18M. During training, each data collector generates new task inputs on the fly.
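The curriculum can be sketched as a linear schedule for the maximum difficulty plus a uniform draw below it. This is a sketch under stated assumptions: the schedule is taken to be indexed by training timesteps, intermediate values are rounded down, and the lower end of the sampling range is left as a required argument `lowest` because the source does not give it.

```python
import random


def max_difficulty(
    timestep: int,
    start_step: int = 1_000_000,   # schedule start (1M), assumed to be timesteps
    end_step: int = 18_000_000,    # schedule end (18M), assumed to be timesteps
    d_min: int = 2,
    d_max: int = 10,
) -> int:
    """Maximum curriculum setting, rising linearly from d_min to d_max."""
    if timestep <= start_step:
        return d_min
    if timestep >= end_step:
        return d_max
    frac = (timestep - start_step) / (end_step - start_step)
    return d_min + int(frac * (d_max - d_min))  # assumption: round down


def sample_difficulty(timestep: int, rng: random.Random, lowest: int) -> int:
    """Uniform draw between `lowest` (hypothetical) and the current maximum."""
    return rng.randint(lowest, max_difficulty(timestep))
```

Each data collector would call `sample_difficulty` once per episode and hand the result to the task generator to draw a fresh input-output pair.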
For evaluation, we precompute datasets of 100 test inputs for generalization with larger input lengths 10, 20, and 100 sampled from the same data generator. We run evaluation repeatedly and concurrently with training, and take the highest observed success rate as the final metric. For a given input, we compute success as whether the controller produced exactly the desired output. We take the average success across the 100 evaluation inputs as the success rate. Because we run each experiment 10 times, we can estimate the variance of success rate due to random weight initialization, randomly sampling from the policy, and stochastic effects of asynchronous training. We report average success rate (across the 10 trials) and the empirical standard deviation.
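The evaluation metrics described above reduce to a few lines of Python. The function names are hypothetical, and using the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption, since the text only says "empirical standard deviation".

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def success_rate(outputs: Sequence[Sequence], targets: Sequence[Sequence]) -> float:
    """Fraction of evaluation inputs whose output matches the target exactly."""
    return mean([1.0 if list(o) == list(t) else 0.0 for o, t in zip(outputs, targets)])


def final_metric(rates_over_training: Sequence[float]) -> float:
    """Highest success rate observed across repeated evaluations of one run."""
    return max(rates_over_training)


def summarize_trials(trial_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of the final metric across trials."""
    return mean(trial_rates), stdev(trial_rates)  # assumption: sample std


def perfect_runs(trial_rates: Sequence[float]) -> int:
    """Number of trials that reached 100% success on the evaluation set."""
    return sum(1 for r in trial_rates if r == 1.0)
```

`perfect_runs` corresponds to the per-task counts out of 10 reported in the ablation table, while `summarize_trials` corresponds to the averages and spreads plotted per generalization length.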
|Ablation||Copy||Reverse||Increment||Filter Even||Multi-Digit Add|
|No Tape Values||3||6||1||0||1|
|No Action History||0||0||0||0||0|
|No Action History Tape Values||7||8||7||5||0|
4.2 Results and Ablations
Table 2 presents the experiment results of evaluating Main on the five algorithmic tasks with different ablation choices for the generalization input length of 100. For each task, we report the number of runs out of 10 that achieved 100% success on the test set of 100 evaluation inputs (each of length 100). Additionally, the average of success rates together with their variance for four of the tasks for different input generalization lengths of 10, 20, and 100 is shown in Figure 4.
As shown in Table 2, Main can learn to generalize perfectly for inputs of length 100 when trained on inputs with length up to 10 only. For the copy, reverse, increment, and filter-even tasks, Main generalizes in the majority of the 10 runs. The multi-digit addition task is particularly challenging in our architectural setting. For learning this task, the controller first needs to learn to select the appropriate individual digits to be added at each timestep. In addition, it also needs to learn to use the digits and module choices selected in the previous timestep to decide whether to add a carry or not while computing the addition. Remarkably, Main was able to learn one such controller.
Now we evaluate whether the special architectural features of Main are necessary for good performance. First, we evaluate whether the controller needs to observe the values on the tape, by considering an ablation (labeled “No Tape Values” in the table) which removes the top third of the variable-size encoding (i.e. removes the tape values). It may then seem like the controller cannot do anything, but it still has access to the action history metadata and the values under the previously placed read/write heads. Notably, we found that without tape values, the controller performance goes down, except in Multi-Digit Addition, where this ablation is the only setting in which the controller could generalize to length 100. Since this task requires the controller to compute whether a carry bit should be used when adding the intermediate result for two digits, hiding the tape contents and only providing the head values constrains the space of possible argument choices for modules and helps the controller to learn the desired computation.
Next, we evaluate the usefulness of action history, by considering an ablation (labeled “No Action History” in the table) which removes the fixed-size context and the bottom third of the variable-size encoding, so that the controller does not have information about previous actions. As expected, these architectures do not generalize: none of the 10 runs on any task reaches 100% success (Table 2). The reason is that without the action history, the controller does not have a mechanism to remember which part of the computation it is currently at, and therefore cannot learn iterative computations, which are required by all of our tasks.
Recall that the fixed-size context contains the tape values underneath the previous read-write locations, even though the controller could infer this from other context. Next, we evaluate whether this is helpful by removing it in the ablation labeled “No Action History Tape Values”. On most tasks, the performance is comparable to the full model, except on filter-even, where this ablation is slightly better.
Finally, we evaluate whether our attention-based encoder could be replaced with a simpler recurrent controller, as has been used in previous work (crl). For shorter inputs of length , the recurrent encoder achieves perfect average evaluation accuracy for the simpler tasks such as copy, reverse, and increment (Figure 4). However, when evaluated on longer input lengths, its performance degrades significantly. For filter-even and multi-digit addition tasks, the generalization accuracy of the recurrent encoder goes to almost for length . On the other hand, the attention-based tape encoder always results in significantly higher generalization accuracies across all the tasks.
We presented a new neural architecture Main that learns algorithmic tasks from input-output examples. Main uses a neural controller that interacts with a variable-length input tape to learn to compose modules. At each time step, the controller chooses which module to use together with the corresponding tape locations for module arguments and for writing the module output back to the tape. This architecture is trained end-to-end using reinforcement learning, and we show that it can successfully learn several algorithms that generalize perfectly to inputs of much longer length (100) than the ones used for training (up to 10).
Appendix A
A.1 Semantics of Modules
The semantics of the modules we consider are as follows.
if isnumeric() else ‘.’.
if isnumeric() and isnumeric() else ‘0’.
if isnumeric() and isnumeric() else ‘0’.