Teaching machines to learn programs is a challenging task. Numerous models have been proposed for learning programs, e.g. Neural Turing Machine (Graves et al., 2014), Differentiable Neural Computer (Graves et al., 2016), Neural GPU (Kaiser & Sutskever, 2015), Neural Programmer (Neelakantan et al., 2015), Neural Random Access Machine (Kurach et al., 2015) and Neural Programmer-Interpreter (Reed & de Freitas, 2016). These models are usually equipped with some form of memory component with differentiable access. Most of these models are trained on program input-output pairs, and the neural network effectively learns to become the particular target program, mimicking a particular Turing machine.
In this paper we focus on the Neural Programmer-Interpreter (NPI) of Reed & de Freitas (2016) and its recursive variant from Cai et al. (2017) (referred to in this paper as RNPI). NPI has three components: a core controller that is typically implemented by a recurrent neural network, a program memory that stores embeddings of learned programs, and domain-specific encoders that enable a single NPI to operate in diverse environments. Instead of learning any particular program, the core module learns to interpret arbitrary programs represented as program embeddings, mimicking a universal Turing machine. This integration of the core (interpreter) and a learned program memory (programmer) gives NPI better flexibility and composability by allowing the model to learn new programs by combining subprograms. Despite these merits, the NPI model has some theoretical and practical limitations that hinder its application to real-world problems.
One hypothetical theoretical property of the NPI model that makes it appealing for multi-task, transfer, and life-long learning settings is its universality, i.e. the capability to represent and interpret any program. As the NPI relies solely on the core to interpret programs, universality requires a fixed core to interpret potentially many programs. A universal fixed core is critical for learning and re-using learned programs in a continual manner, because a core with changing weights may fail to interpret old learned programs after learning new ones. Although the original NPI paper shows empirically that a single shared core can interpret 21 programs to solve five tasks, and that a trained NPI with a fixed core can learn a new simple program MAX, it is unclear whether a universal NPI exists or how universal it could be. Specifically, given the infinite set of all possible programs, the subset of programs that can be interpreted by a fixed core is not explicitly defined. Even if a universal NPI exists, it may still be intractable to provide a provable guarantee of universality with the verification method proposed in Cai et al. (2017), because there may be infinitely many programs to verify.
Practically, as proposed in Reed & de Freitas (2016), the training of an NPI model relies on a very strong form of supervision, i.e. example execution traces of programs. This form of training data is typically more costly to obtain than input-output examples. Training with weaker forms of supervision is desirable to unlock NPI’s full potential.
In this paper, we propose to overcome these limitations of NPI by incorporating combinator abstraction into the NPI model and augmenting the original NPI architecture with the components and mechanisms necessary to support this abstraction. We refer to this new architecture as the Combinatory Neural Programmer-Interpreter (CNPI). As an important abstraction technique in functional programming, combinators, a.k.a. higher-order functions, are used to express common programming patterns shared across different programs. We find that combinator abstraction can dramatically reduce the number and complexity of programs (i.e. combinators) that need to be interpreted by the core, while still allowing the CNPI to represent and interpret arbitrarily complex programs through the collaboration of the core with the other components. We propose a small set of four combinators to capture the four most pervasive programming patterns. Due to the finiteness and simplicity of this combinator set and the offloading of some of the burden of interpretation from the core, we are able to construct a CNPI with a fixed core that can represent and interpret an infinite set of programs that is adequate for solving most algorithmic tasks. This CNPI is universal with respect to the set of all combinatorizable programs. Moreover, we show empirically that besides supervised training on execution traces, it is possible to train the CNPI by policy gradient reinforcement learning with appropriately designed curricula.
2 Overview of combinator abstraction
2.1 Review of NPI with its limitations
We first briefly review the NPI model of Reed & de Freitas (2016), then analyze its limitations to motivate our combinator abstraction. The NPI model has three learnable components: a task-agnostic core controller, a program memory, and domain-specific encoders that allow the NPI to operate in diverse environments. The core controller is a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997) that acts as a router between programs. At each time step, the core can decide either to select another program to call with certain arguments, or to end the current program. When the program returns, control is returned to the caller by popping the caller’s LSTM hidden units and program embedding off a program call stack and resuming execution in this context.
NPI’s inference procedure is as follows (see Section 3.1 and Algorithm 1 in Reed & de Freitas (2016) for more detail). At time step $t$, an encoder takes in the environment observation $e_t$ and arguments $a_t$ and generates a state $s_t$. The core LSTM takes in the state $s_t$, a program embedding $p_t$ and the previous hidden state $h_{t-1}$ to update its hidden state $h_t$. From the top LSTM hidden state several decoders generate the following outputs: the return probability $r_t$, the next program’s key embedding $k_t$, and the arguments to the next program $a_{t+1}$. The next program’s ID is obtained by comparing the key embedding $k_t$ to each row of key memory $M^{key}$. Then the program embedding is retrieved from program memory $M^{prog}$ holding $N$ programs as:

$$i^* = \arg\max_{i=1..N} \left(M^{key}_{i,:}\right)^{T} k_t, \qquad p_{t+1} = M^{prog}_{i^*,:}$$
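The key-based lookup described above can be sketched in a few lines of plain Python. This is only an illustration, assuming dot-product matching between the predicted key and each row of the key memory; the function name `retrieve_program` and the toy memories are hypothetical and not taken from the original implementation.

```python
def retrieve_program(key_memory, program_memory, key):
    """Return (ID, embedding) of the program whose stored key best matches `key`."""
    # Score every stored key by its dot product with the predicted key.
    scores = [sum(m * k for m, k in zip(row, key)) for row in key_memory]
    # The best-matching row index serves as the program ID.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, program_memory[best]


# Toy memories holding three programs (illustrative values only).
key_memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
program_memory = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

best, embedding = retrieve_program(key_memory, program_memory, [0.1, 0.9])
print(best, embedding)
```

Note that the lookup ranges over every row of the key memory, so the number of candidates grows with the number of learned programs.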
The above-described NPI architecture has two limitations. First, as shown in the above equation, at each time step the core must select the next program to call out of all $N$ currently learned programs in the program memory. As $N$ grows large, e.g. to hundreds or thousands, interpreting programs correctly becomes a more and more difficult task for a single core. What makes things worse is that the core has to learn to interpret new programs without forgetting old ones. Second, programs with different names and functionalities commonly share some underlying programming patterns. We take two programs used in Reed & de Freitas (2016) and Cai et al. (2017) as examples (see Figure 1): the ADD1 program in grade school addition, and the BSTEP program that performs one pass of bubble sort. We use their recursive forms described in Cai et al. (2017). The two programs share a very common looping pattern. However, the core needs to learn each of these programs separately without taking any advantage of their shared patterns. The total number of programs that need to be learned by the core is thus potentially infinite. We argue that these two limitations make it very challenging, if not impossible, to construct a universal NPI.
2.2 Our approach using combinator abstraction
To overcome the limitations of the NPI, we propose to incorporate combinator abstraction into the NPI architecture. In functional programming, combinators are a special kind of higher-order function that serve as powerful abstraction mechanisms, increasing the expressive power of programming languages. We adapt the concept to neural programming and give it the central role in improving the universality of NPI.
Conceptually, a combinator is a “program template” with blanks as formal arguments that are callable as subprograms. An actual program can be formed by wrapping a combinator with another program called an applier, which invokes the combinator and passes the actual arguments to be called when executing the combinator. Put another way, an applier applies a combinator to a set of actual programs as callable arguments. Note that the callable arguments themselves can also be wrapped programs (i.e. appliers), and programs of increasing complexity can thus be built up. As in the original NPI, the interpretation of a combinator is conditioned on the output of a lightweight domain-specific encoder, which we call a detector. The detector is also provided on the fly by the applier. Figure 1 illustrates the usage of combinator abstraction in the NPI architecture.
In the CNPI architecture, combinators are the only type of program that needs to be interpreted by the core. By prohibiting a combinator from calling programs other than those passed to it as arguments, the selection range for the next program to call at each time step is reduced from a growing $N$, the number of learned programs, to a constant $K$, which is the maximum number of arguments for combinators (9 in our proposed model). Meanwhile, compared to the infinity of all possible programs, the number of useful combinators is finite and typically small. In practice, we construct a small set of four combinators to express the four most pervasive programming patterns. Therefore, the core only needs to interpret a small number of simple programs. We will show that a quite small core suffices for this job, and that through the collaboration of this core and the other components a universal CNPI can be constructed.
3 Combinatory NPI model
3.1 Combinators and combinatory programs
We propose a set of four combinators to express the four most pervasive programming patterns for algorithmic tasks: sequential, conditional, linear recursion and tree recursion (i.e. multi-recursion). The pseudo-code for these combinators is shown in Figure 2. Each combinator has four callable arguments and one detector argument. One of the callable arguments is a default argument that refers to the combinator itself and is used for recursive calls. For linrec and treerec, we give more readable aliases to the other three callable arguments to hint at their typical roles. The detector argument detects some condition (e.g. a pointer P2 reaching the end of the array) in the environment and provides signals for the combinator to condition its execution. Its output is a binary signal that indicates whether the condition is satisfied. For seq a default blind detector is passed, which always outputs the same value. Although not directly callable, detectors can also be viewed as programs “running in the background” as perception modules. In this paper, the condition to detect is often used to name a detector, and we append a ‘?’ to detector names to differentiate them from callable programs. Like primitive actions (ACTs), we could also define primitive detectors (DETs) for specific tasks. Note that this combinator set is by no means unique or minimal. The combinators take their current forms mainly for ease of use and learning.
The four combinators are classified into two categories. seq, cond and linrec are basic combinators, which only call their callable arguments during execution. treerec is an advanced combinator. Besides callable arguments, an advanced combinator can also call built-in programs, such as _push_sentinel and _mapself in treerec. These built-in programs are used to facilitate multiple recursive calls to the combinator itself in treerec. Basically, the states necessary for each recursive call are prepared and pushed onto a stack. The built-in combinator _mapself shares a similar structure with linrec. It loads the states one by one from the stack and makes the recursive call with each state until a sentinel is met (the sentinel is pushed onto the stack beforehand by _push_sentinel, which is a built-in ACT). More details on built-in programs and treerec are given in Appendix A, and examples of using them can be found in Appendix B.
We now describe how to compose combinatory programs using combinators, taking the BSTEP program (i.e. one pass of bubble sort) as an example. The normal and combinatory versions of the program are shown in Figure 3 (a) and (b) respectively. Recall that an applier applies a combinator to a set of actual programs (ACTs or other predefined appliers) to form a new actual program. Composing a combinatory program amounts to defining appliers iteratively. As shown in Figure 3 (c), during the execution of a combinatory program, combinators and appliers call each other to form an alternating call sequence until reaching an ACT. Combinators, appliers and detectors are all highly constrained programs, and thus are all easily interpretable and learnable. Nevertheless, they can collaborate to build arbitrarily complex programs.
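As a rough, non-neural illustration of the three basic combinators and of an applier, the sketch below writes them as ordinary Python higher-order functions. The control flow is inferred from the task descriptions in Table 3 rather than taken from the pseudo-code of Figure 2, so it should be read as an assumption; the toy environment, the ACT implementations and the detector `p2_at_end` are hypothetical stand-ins.

```python
def seq(a1, a2, a3):
    """Sequential pattern: call the three arguments in order (no detector needed)."""
    def run(env):
        a1(env)
        a2(env)
        a3(env)
    return run


def cond(detector, a1, a2, a3):
    """Conditional pattern: a3 if the detector fires, otherwise a1 then a2."""
    def run(env):
        if detector(env):
            a3(env)
        else:
            a1(env)
            a2(env)
    return run


def linrec(detector, a1, a2, a3):
    """Linear recursion: like cond, but the combinator calls itself after a1 and a2."""
    def run(env):
        if detector(env):
            a3(env)
        else:
            a1(env)
            a2(env)
            run(env)  # recursive call through the default self argument
    return run


# Hypothetical ACTs and a detector over a toy environment.
def OUT_2(env):
    env["out"].append(env["A"][env["P2"]])


def P2_RIGHT(env):
    env["P2"] += 1


def OUT_OK(env):
    env["out"].append("OK")


def p2_at_end(env):
    return env["P2"] >= len(env["A"])


# An applier binds a combinator to actual arguments; this one mirrors the Copy task.
COPY = linrec(p2_at_end, OUT_2, P2_RIGHT, OUT_OK)

env = {"A": [3, 1, 2], "P2": 0, "out": []}
COPY(env)
print(env["out"])
```

Because `COPY` is itself callable, it could be passed as an argument to another combinator, which is how appliers defined from the bottom up yield more complex programs.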
3.2 CNPI architecture and algorithm
Having introduced combinators and how to use them to compose combinatory programs, we now describe how these programs are interpreted by the CNPI and the necessary augmentations to the original NPI architecture to enable the interpretation. The complete inference procedure for CNPI is given in Algorithm 1. An example execution of the BSTEP program is illustrated in Figure 4.
Appliers are effectively one-line programs that apply a program $c$, which could be either a combinator or an ACT, to a set of arguments. To interpret an applier we just need to identify the program to be called and its arguments, prepare the environment for the invocation, and make the invocation. For ease of interpretation, we propose to store the key embeddings of $c$, its detector $d$ and its callable arguments $a_1$, $a_2$ and $a_3$ directly in the applier’s program embedding $p$:

$$p = [\,k_c;\ k_d;\ k_{a_1};\ k_{a_2};\ k_{a_3}\,] \qquad (1)$$

where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. (For the seq combinator and ACTs, which do not need detector arguments, the blind detector’s key embedding is stored. For ACTs with arguments, the arguments’ values are stored in place of the key embeddings.) We use a fixed parser to extract the key embeddings from the applier’s embedding. Then the combinator or ACT ID is computed by comparing the key $k_c$ with each row of the program key memory and finding the best match. The callable arguments’ IDs are computed similarly. In the CNPI architecture, the models for detectors are stored in a detector memory, which has the same key-value structure as the program memory. The detector argument ID is computed by comparing $k_d$ to each row of the detector key memory. Note that the core LSTM does not participate in the interpretation of appliers. As the format for storing these key embeddings is predefined, the fixed parser can parse any applier’s embedding.
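The applier format lends itself to a very small parser. The sketch below assumes that all keys have the same size, that three callable arguments are stored after the program and detector keys, and that matching is done by dot product; the function names and the toy key memories are hypothetical.

```python
def make_applier_embedding(program_key, detector_key, argument_keys):
    """Concatenate key embeddings in a fixed, predefined order."""
    embedding = list(program_key) + list(detector_key)
    for key in argument_keys:
        embedding.extend(key)
    return embedding


def parse_applier_embedding(embedding, key_size):
    """Fixed (non-learned) parser: cut the embedding into equal-sized keys."""
    keys = [embedding[i:i + key_size] for i in range(0, len(embedding), key_size)]
    return keys[0], keys[1], keys[2:]


def best_match(key_memory, key):
    """ID of the memory row with the largest dot product with `key`."""
    scores = [sum(m * k for m, k in zip(row, key)) for row in key_memory]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy key memories with 2-dimensional keys (illustrative values only).
program_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
detector_keys = [[1.0, 1.0], [-1.0, 1.0]]

# An applier that applies program 2 with detector 1 to programs 0, 1 and 3.
embedding = make_applier_embedding(
    program_keys[2], detector_keys[1],
    [program_keys[0], program_keys[1], program_keys[3]],
)

program_key, detector_key, argument_keys = parse_applier_embedding(embedding, key_size=2)
program_id = best_match(program_keys, program_key)
detector_id = best_match(detector_keys, detector_key)
argument_ids = [best_match(program_keys, k) for k in argument_keys]
print(program_id, detector_id, argument_ids)
```

Nothing in this path is learned except the keys themselves, which is consistent with the remark that the core LSTM takes no part in interpreting appliers.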
We use a dynamically constructed data structure called a frame to pass arguments to a combinator. Each frame is a table of $K$ bindings that associate formal callable arguments with their corresponding actual IDs, where $K$ is the number of callable arguments for combinators. When calling a combinator, a new frame is created. The IDs of the combinator’s callable arguments (including the combinator’s own ID, as it corresponds to the self-referring argument) are filled into the frame. (If the combinator is treerec, the IDs of built-in programs also need to be appended to the end of the frame in a predefined order. In this case the frame’s size is $K$ plus the total number of built-in programs.) In practice we do not use a key-value structure for frames. Instead the frame only stores values, i.e. the arguments’ IDs in a fixed order.
The interpretation of combinators is in general similar to the inference procedure in Algorithm 1 in Reed & de Freitas (2016). Here we highlight several key differences. In the initialization stage, besides retrieving the combinator embedding from the program memory, the detector model is also loaded from the detector memory. Then, instead of using the combinator embedding as input to the LSTM at every time step, we use it to initialize the LSTM’s state, i.e. each layer’s hiddens and cells. We find empirically that this parameterization has better efficiency and accuracy for our combinators; see Section 5.1.

The second difference is that we binarize the output of the detector with an indicator function to get a binary condition before feeding it to the LSTM. This operation effectively decouples the detector from the core LSTM. It enables us to verify the core’s behavior separately, without considering any specific detector, given that the correct condition is provided. This is difficult to achieve in the original NPI architecture, where the core is trained jointly with the encoders.
The third and most important difference is in how the next subprogram to call is computed. We use a decoder to compute a score vector $u_t$ that assigns a score to each formal callable argument. The argument with the maximum score is selected and its actual program ID is retrieved from the frame $F$. This ID is used in turn to retrieve the program embedding from the program memory $M^{prog}$ when the next program is executed:

$$j^* = \arg\max_{j=1..K} (u_t)_j, \qquad i^* = F_{j^*}, \qquad p_{t+1} = M^{prog}_{i^*,:}$$

where $K$ is the maximum number of callable arguments for combinators. We consider this indirection of subprogram embedding retrieval, together with the dynamic binding of formal arguments to actual programs in the frames, to be the key to the superior universality and learnability of CNPI.
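The effect of this indirection can be shown with a small sketch: the core only ever picks a slot among the formal arguments, and the frame decides which actual program that slot denotes. The slot order (self first), the program IDs and the use of names in place of embeddings are assumptions made for illustration.

```python
def select_next_program(scores, frame, program_memory):
    """Pick the highest-scoring formal argument, then resolve it through the frame."""
    slot = max(range(len(scores)), key=scores.__getitem__)
    program_id = frame[slot]
    return program_id, program_memory[program_id]


# Names stand in for program embeddings; IDs are hypothetical.
program_memory = {0: "linrec", 1: "OUT_2", 2: "P2_RIGHT", 3: "OUT_OK", 4: "P1_LEFT", 5: "NOP"}

# Scores over the formal callable arguments, as the core might emit at one step.
scores = [0.1, 2.0, 0.3, -1.0]

# Two frames binding the same formal slots to different actual programs.
copy_frame = [0, 1, 2, 3]     # linrec(...; OUT_2, P2_RIGHT, OUT_OK)
reset1_frame = [0, 4, 5, 5]   # linrec(...; P1_LEFT, NOP, NOP)

copy_choice = select_next_program(scores, copy_frame, program_memory)
reset1_choice = select_next_program(scores, reset1_frame, program_memory)
print(copy_choice, reset1_choice)
```

The same core output selects different actual programs under different frames, so adding new programs to the memory does not enlarge the set of choices the core has to learn.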
When calling the subprogram, the same detector ID and frame are re-used, which is equivalent to passing the combinator’s arguments to all of its subprograms. This facilitates recursion, as these arguments are needed by linrec and treerec when calling themselves (see Figure 2). For other subprogram calls to appliers or ACTs, these arguments are safely ignored.
CNPI has four components: the core, the program (combinator and applier) memory, the detector memory, and the parser, of which the first three are learnable. The combinators are trained jointly with the core. Detectors and appliers are trained separately.
Supervised learning (SL) of CNPI uses execution traces of combinatory programs. A single element of an execution trace consists of a step-input/step-output pair, which takes one of two forms: one for combinator execution and one for applier execution. The correct condition at each time step is used as the output target for detectors, and the step outputs, including the formal callable argument ID to be called at that time step, provide targets for the core. Detectors and the core are trained on the elements of the trace, using stochastic gradient ascent to maximize the likelihood of their corresponding targets. The likelihoods are maximized with respect to the parameters of the detector model and the collective parameters of the core and the combinator embedding, over the whole sequence of trace elements. The probability of calling a subprogram is computed by applying a softmax to the scores produced by the decoder. In SL the applier embeddings do not need to be trained; they are just generated from elements of the trace according to equation (1).
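To make the core's SL target concrete, the sketch below computes the log-likelihood of a sequence of target argument IDs under a softmax over per-step scores. It covers only the subprogram-selection term; the function names are hypothetical and the scores are toy values, not outputs of a trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def trace_log_likelihood(score_sequence, target_ids):
    """Sum of log-probabilities of the target argument ID at every time step."""
    return sum(
        math.log(softmax(scores)[target])
        for scores, target in zip(score_sequence, target_ids)
    )


# Three time steps with uninformative scores over four formal arguments.
uniform_scores = [[0.0, 0.0, 0.0, 0.0]] * 3
uniform_ll = trace_log_likelihood(uniform_scores, [0, 1, 2])

# Scores that favour the targets give a higher log-likelihood.
peaked_scores = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
peaked_ll = trace_log_likelihood(peaked_scores, [0, 1, 2])
print(uniform_ll, peaked_ll)
```

Gradient ascent on this quantity with respect to the parameters that produce the scores is what "maximizing the likelihood of the targets" amounts to for this term.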
CNPI can also be trained by policy gradient reinforcement learning (RL). (In this paper we only train CNPI by RL on tasks that can be solved by the three basic combinators.) No execution trace is given, and the core tries to complete the task by making program calls following the predicted call probabilities and feeding forward the LSTM. An episode ends if the task is completed or the number of steps reaches MAX_NSTEP. In our experiments, MAX_NSTEP is set according to the number of callable arguments for combinators and the complexity of the problem to be solved. A reward is given when an episode ends.

At each time step, a condition is sampled from a Bernoulli distribution defined by the output of the detector. The next program to be called is identified through the frame from a formal callable argument ID, which is sampled from the softmax distribution over the decoder’s scores. The core is trained using stochastic gradient ascent on a mixed objective with two parts: an RL objective of maximizing expected reward, plus an SL objective of maximizing the likelihood of the correct flag of program return.
The RL objective is derived from the REINFORCE algorithm (Williams, 1992). Note that the SL objective only takes effect on episodes where a positive reward is received on task completion. This combination of RL and SL objectives to optimize a policy is also used in Oh et al. (2017) to learn parameterized skills. The detector is also trained to maximize expected reward using REINFORCE.
Once the detector and the core have been learned, applier embeddings can also be learned using RL. After the program is called with its detector and callable arguments, a reward is given according to whether the task has been completed. The parameters of the applier embedding are then updated to maximize the expected reward, where the identifiers of the program, the detector and the callable arguments are sampled respectively from the distributions derived from the corresponding keys stored in the applier’s embedding.
Note that in both SL and RL, detectors are trained separately from the core. This decoupling facilitates the sharing of detectors across programs and the verification of the behavior of the core.
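For readers unfamiliar with REINFORCE, the following sketch shows a single update for a softmax policy over the formal callable arguments. It is a simplification under stated assumptions: the scores are treated as free parameters instead of LSTM outputs, the reward is a plain scalar, and the additional SL term on the return flag is omitted. The function name `reinforce_step` is hypothetical; the learning rate of 0.1 matches the value reported for the RL experiments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(scores, action, reward, learning_rate=0.1):
    """One gradient-ascent step on reward * log pi(action) for a softmax policy."""
    probs = softmax(scores)
    # d log pi(action) / d score_j = 1[j == action] - pi(j)
    grad_log_prob = [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]
    return [s + learning_rate * reward * g for s, g in zip(scores, grad_log_prob)]


scores = [0.0, 0.0, 0.0, 0.0]

# A rewarded episode makes the sampled action more likely.
rewarded = reinforce_step(scores, action=2, reward=1.0)
# An unrewarded episode leaves the policy unchanged.
unrewarded = reinforce_step(scores, action=2, reward=0.0)
print(softmax(rewarded)[2], softmax(unrewarded)[2])
```

With a reward that is nonzero only on task completion, every update comes from a successful episode, which gives some intuition for why curricula and sampling strategies matter in the experiments.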
Training CNPI with SL to solve algorithmic tasks consists of three steps. First, train and verify the core jointly with the combinators with synthetic abstract traces, i.e. sequences of elements corresponding to the correct invocation of formal callable arguments given conditions (a total of 11 traces for the four combinators). After having been verified for correct behavior, the core and the combinator embeddings are fixed. This step is done only once before solving any specific task. Second, for a new task, identify the conditions needed to solve the task, train and verify detectors to detect these conditions, and then add them to the detector memory. Finally, iteratively define appliers from the bottom up by adding them to the program memory with program embeddings set according to equation (1) given elements of the traces, and call the topmost applier to solve the task. We state the universality of CNPI with the following theorem and proposition:
Theorem 1. If 1) the core along with the program embeddings of the set of four combinators and the built-in combinator _mapself are trained and verified before being fixed, and 2) the detectors for a new task are trained and verified, then CNPI can 1) interpret the combinatory programs of the new task correctly with perfect generalization (i.e. with any input complexity) by adding appliers to the program memory, and 2) maintain correct interpretation of already learned programs.

Proposition 1. Any recursive program is combinatorizable, i.e., can be converted to a combinatory equivalent.
Theorem 1 states that CNPI is universal with respect to the set of all combinatorizable programs and that appliers can be continually added to the program memory to solve new tasks. Proposition 1 shows that this set of programs is adequate for solving most algorithmic tasks, considering that most, if not all, algorithmic tasks have a recursive solution. We prove Theorem 1 in Appendix C. For Proposition 1, instead of giving a formal proof, we propose in Appendix B a concrete algorithm for combinatorizing any program set expressing a recursive algorithm.
We argue that universality is a property harder to achieve than the generalization property discussed in Cai et al. (2017), which provides provable guarantees of perfect generalization for several programs. However, the authors did not consider the problem of universality with a fixed core. In fact, although RNPI can be trained on a particular task and verified for perfect generalization, after training on a new task causing changes to the parameters of the core, the property of perfect generalization on old tasks may not hold any more. In contrast, CNPI provides both generalization and universality. Table 1 qualitatively compares CNPI with NPI and RNPI.
|Model||Provable perfect generalization||Provable universality||# verifications of programs / combinators||# verifications of encoders / detectors||# trainings of programs / combinators||# trainings of encoders / detectors|
|NPI||-||-||-||-||per task||per task|
|RNPI||✓||-||per task||per task||per task||per task|
|CNPI||✓||✓||once||per condition||once||per condition|
Due to the decomposition of programs into combinators and appliers, and the decoupling of detectors from the core, we can verify the perfect generalization of a particular program using far fewer test inputs than RNPI. For example, to verify the perfect generalization of bubble sort with RNPI we need 2078 test inputs for 6 subprograms, while with CNPI we need only 123 for 4 detectors.
While the previous section analyzes the universality of CNPI, this section shows results on the empirical evaluation of its learnability via both SL and RL experiments. We mainly report results on learning the core and the combinators, assuming that detectors for the tasks have been trained. Learning a detector in our CNPI architecture is a standard binary classification problem, which can be trivially solved by training a classifier.
To evaluate how CNPI improves learnability over the original NPI architecture, in some experiments we use the RNPI model as a baseline. It has the same architecture as NPI and allows recursive calls. For a CNPI with $K$ callable arguments (denoted as CNPI-$K$), we construct a counterpart RNPI with $K \times M$ existing actual programs (either composite programs or ACTs) in the program memory as base programs (denoted as RNPI-$K$x$M$). These base programs are divided into $M$ sets corresponding to different tasks (e.g. grade school addition and bubble sort). Then new programs are learned over each set by calling the corresponding base programs as subprograms. Note that some of these new programs may share the same patterns (e.g. ADD1 and BSTEP; we call them isomorphic programs), but in RNPI they are treated as different programs and the core needs to learn all of them. For fair comparison, the counterpart RNPI uses the same detector as the CNPI.
For all experiments, we used a one-layer LSTM for the core. We trained the CNPI using plain SGD with batch size 1, and learning rate of 0.5 and 0.1 for SL and RL experiments respectively. For the SL experiments, the learning rate was decayed by a factor of 0.1 if prediction accuracy did not improve for 10 epochs.
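The learning-rate schedule for the SL experiments can be sketched as a simple plateau rule. The sketch assumes that "did not improve" means no new best accuracy and that the patience counter restarts after each decay; both are interpretations, and the function name `decay_on_plateau` is hypothetical.

```python
def decay_on_plateau(accuracies, learning_rate=0.5, factor=0.1, patience=10):
    """Return the learning rate in effect after each epoch's accuracy is observed."""
    best = float("-inf")
    stale_epochs = 0
    history = []
    for accuracy in accuracies:
        if accuracy > best:
            best = accuracy
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                learning_rate *= factor  # decay by a factor of 0.1
                stale_epochs = 0         # assumed: restart the patience window
        history.append(learning_rate)
    return history


# One improving epoch followed by ten epochs without improvement.
history = decay_on_plateau([0.5] + [0.5] * 10)
print(history[0], history[9], history[10])
```

The starting value of 0.5 matches the learning rate reported for SL; for the RL experiments the reported rate is 0.1 and no decay is described.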
5.1 Supervised learning results
We found that, as expected, a CNPI can be trained to learn the set of four combinators using synthetic abstract traces without any difficulty. From Section 4 we know that this CNPI is able to learn all combinatorizable programs (including the four in Appendix B) with perfect generalization. To further stress the learning capacity of the core, we enlarge the small set of four combinators to a full set of all possible combinators under the following two constraints: 1) branching can only happen at the beginning of the execution; 2) a call to the combinator itself, i.e. a recursive call, can only be made at the end of the execution (i.e. only tail recursion is allowed). With four callable arguments, this full set has 57 combinators (including the three basic combinators).
Cores with different numbers of LSTM cells were trained to learn this full set of combinators. We compare two methods of feeding the combinator embedding to the core LSTM: using the embedding as input to the LSTM at every time step (Emb-as-Input), as is done in Reed & de Freitas (2016), and using it as the initial state (i.e. hiddens and cells) of the LSTM (Emb-as-State0). For both methods the combinator embedding size is set equal to the LSTM size. Note that the Emb-as-State0 model has fewer parameters than Emb-as-Input with the same number of cells. We also trained a sequence-to-sequence model from Sutskever et al. (2014), where an encoder LSTM takes in the text code representation of the combinator (a simplified version of the pseudo-code in Figure 2) and the last state of the encoder is used as the combinator embedding to initialize the core LSTM’s state. This seq2seq model can be seen as a miniature of an instruction-to-action architecture. We see in Figure 5 that the Emb-as-State0 model achieves better prediction accuracy than the Emb-as-Input model of the same size. In particular, the Emb-as-State0 model with only 5 cells can learn all 57 combinators with 100% accuracy. This LSTM is much smaller than the one used in Reed & de Freitas (2016), which has two layers of size 256. The seq2seq model can also achieve 100% accuracy with 7 cells. In subsequent experiments we used a core LSTM of size 16 unless mentioned otherwise.
We compare the abilities of CNPI and RNPI to learn new combinators/programs with a fixed core. The models were first trained on a combinator/program set to reach 100% accuracy, then trained on a new set with the core fixed. Finally the models were tested on the old set to see whether it is still remembered. For the CNPI-4 model the old and new combinators were generated by a random even split of the full combinator set. For the RNPI-4x2 model with two sets of base programs, we constructed a full set of all possible composite programs for each set of base programs, as with combinators. Then old and new programs were randomly sampled from the two sets respectively. Note that the old and new programs generated this way contain a certain proportion of isomorphic programs, and this proportion grows with the percentage of random sampling. In the RNPI experiment, the program key embeddings need to be learned jointly with the program embeddings, otherwise the model would not be able to learn the new programs. As shown in Table 2, although both models can be trained on the new set with high accuracy, when tested on the old set RNPI shows catastrophic forgetting, which becomes more severe as there are more isomorphic programs between the old and new sets. In contrast, CNPI remembers old combinators perfectly.
5.2 Reinforcement learning results
We find that curriculum learning is necessary for training CNPI with RL. Table 3 shows the curriculum we used for training CNPI on the sorting task. For each subtask, the programs to be learned (including detectors) are bolded and colored. The curriculum has two stages. In the first stage, the combinators are trained with simple auxiliary tasks, using ACTs and DETs as arguments. The learned combinators are then used as prerequisites for solving the actual tasks. The tasks for each combinator are designed to ensure that the task will be completed if and only if the combinator is correctly executed as defined in Figure 2. In the second stage, detectors and appliers can be learned in two forms: we can either define a sketch for solving the task (similar to the policy sketches in Andreas et al. (2017)) with some learnable arguments (as in the Conditional swap and Output max tasks), or learn an applier to solve the task using already learned arguments (as in the Sort task). Note that for brevity we define some appliers for resetting pointers directly, without any learning, at the end of Stage 1. In fact, they could also be learned in the same way as the appliers in Stage 2.
|Subtask||Description||Program to learn|
|Stage 1: Learning the core and combinators|
|Swap and output easy||Output A[P1]||seq0(; OUT_1, NOP, NOP)|
|Swap and output||Output A[P1], then swap A[P1] and A[P2], finally output A[P2]||seq(; OUT_1, SWAP_12, OUT_2)|
|Conditional output easy||Output A[P2] if P2 is in array, otherwise output ‘OK’||cond0(A[P2]END?; OUT_2, NOP, OUT_OK)|
|Conditional output||Output and clear A[P2] if P2 is in array, otherwise output ‘OK’||cond(A[P2]END?; OUT_2, CLEAR_2, OUT_OK)|
|Copy easy||Output the first element if array is not empty, otherwise output ‘OK’||linrec0(A[P2]END?; OUT_2, NOP, OUT_OK)|
|Copy||Output elements sequentially till the end of array, then output ‘OK’||linrec(A[P2]END?; OUT_2, P2_RIGHT, OUT_OK)|
|Reset pointers||Reset P1 and P2 to the appropriate beginning position||RESET_1: linrec(A[P1]END?; P1_LEFT, NOP, NOP); RESET_2: linrec(A[P2]END?; P2_LEFT, NOP, P2_RIGHT); RESET: seq(; RESET_1, RESET_2, NOP)|
|Stage 2: Learning detectors and appliers|
|Conditional swap||Conditionally swap two elements||COMPSWAP: cond(A[P1]A[P2]?; SWAP_12, NOP, NOP)|
|Output max||Find and output the max element in the array then clear it||MAX: linrec(A[P2]END?; STEP, P2_RIGHT, OUTCLEAR_1)|
|Sort||Sort the array by repeatedly outputting the current max element||SORT: linrec(A[P3]END?; MAX, RESET, NOP)|
Although they are quite simple programs, we find that the three basic combinators are still difficult to learn with plain policy gradient RL, even with the curriculum. To facilitate learning, we use the adaptive sampling technique proposed in Reed & de Freitas (2016). Example traces are fetched with frequency proportional to the model’s current prediction error. We set the sampling frequency using a softmax over a moving average of prediction error over the last 10 episodes, with temperature 1. In addition, for each combinator’s auxiliary task we design an easy version of the task, which corresponds to a partial completion of the true task (see Table 3). A curriculum can then be formed for each combinator by either mixing the easy and true tasks (mixed), or completing the easy task first before going on to the true one (gradual). For each combination of sampling method and curriculum we ran 100 experiments with a maximum of 5000 episodes per experiment. Table 5 shows the success rate at which all three auxiliary tasks are completed, along with the success rates for completing the task for each combinator. As shown in Table 5, both adaptive sampling and the curriculum help training considerably. A relatively high success rate of 91% can be obtained with adaptive sampling and the gradual curriculum. Table 5 also shows that of the three combinators seq is the easiest to learn while linrec is the hardest.
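The adaptive sampling rule described above (a softmax with temperature 1 over a moving average of prediction error over the last 10 episodes) can be sketched as follows. The class name, the per-task bookkeeping and the default error of 1.0 for tasks with no history are assumptions made for illustration, not details taken from the paper.

```python
import math
import random
from collections import deque


class AdaptiveTaskSampler:
    """Sample tasks with probability given by a softmax over recent average error."""

    def __init__(self, tasks, window=10, temperature=1.0, seed=0):
        # Keep only the most recent `window` errors per task.
        self.errors = {task: deque(maxlen=window) for task in tasks}
        self.temperature = temperature
        self.rng = random.Random(seed)

    def record(self, task, error):
        self.errors[task].append(float(error))

    def probabilities(self):
        # Assumed default: an unseen task counts as maximally wrong (error 1.0).
        averages = {t: (sum(e) / len(e) if e else 1.0) for t, e in self.errors.items()}
        top = max(averages.values())
        weights = {t: math.exp((a - top) / self.temperature) for t, a in averages.items()}
        total = sum(weights.values())
        return {t: w / total for t, w in weights.items()}

    def sample(self):
        probs = self.probabilities()
        tasks = list(probs)
        return self.rng.choices(tasks, weights=[probs[t] for t in tasks], k=1)[0]


sampler = AdaptiveTaskSampler(["seq", "cond", "linrec"])
for _ in range(10):
    sampler.record("seq", 0.0)     # always solved
    sampler.record("cond", 0.5)    # solved half of the time
    sampler.record("linrec", 1.0)  # never solved
probs = sampler.probabilities()
next_task = sampler.sample()
print(probs, next_task)
```

Tasks on which the model currently fails are drawn more often, but the softmax keeps every task's probability above zero, so tasks already solved continue to be revisited.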
|Sampling method||No curriculum||Mixed curriculum||Gradual curriculum|
|Uniform||7 (11/10/7)||17 (33/28/17)||49 (94/49/49)|
|Adaptive||31 (74/41/31)||78 (87/80/78)||91 (92/91/91)|
|Stage||CNPI -4||RNPI -4x2||RNPI -4x3|
We compare the success rate of training CNPI (CNPI-4) with its counterpart RNPI models RNPI-4x. For each combinator’s auxiliary task, we constructed different versions of the task by providing different ACTs and DETs. For example, for the Copy task we used different pointers (e.g. P1) to move in different directions (e.g. to the left), and output different symbols when finished (e.g. ‘DONE’) to generate tasks. Then we trained RNPI to learn actual programs in parallel to solve these tasks. The RNPI-4x models were trained with adaptive sampling and the gradual curriculum for 10000 episodes, and the success rates over 100 trials are shown in Table 5. Due to the enlarged candidate set of the next program to call from 4 to 4 and the increased number of programs to be learned from 3 to , it is much more difficult to train RNPI with RL. RNPI-4x2 finishes the first stage of the curriculum (completing the easy tasks) with a success rate of 25% but fails completely on the final stage (completing the true tasks). RNPI-4x3 cannot even finish the easy stage.
We trained the A[P1]A[P2]? detector in the Conditional swap task with the RL objective. The input to the detector is the one-hot encoding of the two elements. Then the STEP applier was learned in the context of the Output max task by maximizing the expected reward of completing the task; see equation (8). The embedding of STEP was learned successfully in 79% of the 100 experiments we ran. In each successful trial one of the two appliers, seq(; COMPSWAP, P1_LEFT, NOP) and cond(A[P1]A[P2]?; NOP, NOP, MOVE_12), was learned; these are equivalent to finding the max element by a pass of bubble sort and of selection sort, respectively. MOVE_12 is a primitive applier we defined to move P1 forward until reaching P2. Finally, the Sort applier was learned to complete the Sort task, with a success rate of 62% over 100 experiments. Both bubble sort and selection sort were learned by calling the two learned STEP appliers respectively.
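To make the distinction between the two learned strategies concrete, the following plain-Python analogue sorts an array by repeatedly extracting its maximum, where the maximum is found either by one bubble-sort pass or by a selection-sort scan. It is only an illustrative sketch: the function names are hypothetical, list indices stand in for the pointers, and removing the element with `pop` stands in for clearing it; none of this reproduces the actual CNPI execution traces.

```python
def max_by_bubble_pass(arr):
    """Move the maximum to the last slot by compare-and-swap on neighbours."""
    for i in range(len(arr) - 1):
        if arr[i] > arr[i + 1]:
            arr[i], arr[i + 1] = arr[i + 1], arr[i]
    return len(arr) - 1  # index of the maximum after the pass


def max_by_selection_scan(arr):
    """Track the index of the largest element seen so far, without swapping."""
    best = 0
    for i in range(1, len(arr)):
        if arr[i] > arr[best]:
            best = i
    return best


def sort_by_repeated_max(values, find_max):
    """Repeatedly output the current maximum and remove it from the array."""
    arr = list(values)  # work on a copy so the caller's list is untouched
    out = []
    while arr:
        out.append(arr.pop(find_max(arr)))
    return out  # elements in descending order


data = [3, 1, 4, 1, 5, 9, 2, 6]
via_bubble = sort_by_repeated_max(data, max_by_bubble_pass)
via_selection = sort_by_repeated_max(data, max_by_selection_scan)
```

Both variants share the same outer loop and differ only in the routine that locates the maximum, which mirrors how a single Sort applier can yield either sorting algorithm depending on which STEP applier it calls.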
The problem of improving the universality and learnability of NPI is addressed for the first time by incorporating combinator abstraction from functional programming. Analysis and experimental results have shown that CNPI is universal with respect to all combinatorizable programs and can be trained with both strong and weak supervision. We believe that the proposed approach is quite general and has potential applications beyond solving algorithmic tasks. One scenario is training agents by RL to follow instructions and generalize (e.g., Oh et al. (2017), Andreas et al. (2017), Denil et al. (2017)). Natural language contains “higher-order” words such as “then” and “twice”, which play a critical role but whose interpretation may cause trouble for vanilla sequence-to-sequence models (Lake & Baroni, 2017). By representing these words as combinators and equipping the agent with CNPI-like components, it would be possible to construct agents that display more complex and structured behavior and that generalize better. We leave this for future work.
We thank Mingli Yuan for valuable discussion and feedback.
- Andreas et al. (2017) Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning (ICML), 2017.
- Cai et al. (2017) Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. In International Conference on Learning Representations (ICLR), 2017.
- Denil et al. (2017) Misha Denil, Sergio Gómez Colmenarejo, Serkan Cabi, David Saxton, and Nando de Freitas. Programmable agents. arXiv preprint arXiv:1706.06383, 2017.
- Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
- Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Kaiser & Sutskever (2015) Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
- Kurach et al. (2015) Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. arXiv preprint arXiv:1511.06392, 2015.
- Lake & Baroni (2017) Brenden M Lake and Marco Baroni. Still not systematic after all these years: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350, 2017.
- Neelakantan et al. (2015) Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.
- Oh et al. (2017) Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning (ICML), 2017.
- Reed & de Freitas (2016) Scott Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on Learning Representations (ICLR), 2016.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Appendix A Built-in programs to support tree recursion.
We use some built-in facilities, including a state stack, a combinator, four ACTs and a detector, to support tree recursion. They are listed in Table 6. The pseudo-code for the built-in combinator _map_self is shown in Figure 6. The built-in ACT _load_state needs to be overloaded for each specific task because the state needed for a recursive call may differ from task to task. See the examples in Appendix B for the usage of _load_state.
Appendix B Combinatory programs for algorithmic tasks
Below we show the combinatory programs alongside the corresponding normal programs for three algorithmic tasks: bubble sort, quick sort and traverse in topological sort. Bubble sort has two nested levels of linear recursion, both of which are expressed by linrec. Quick sort uses bi-recursion and traverse in topological sort uses multi-recursion. Both are expressed by treerec together with SAVE_STATE and _load_state.
Figure: Normal (a) and combinatory (b) programs for bubble sort.
Figure: Normal (a) and combinatory (b) programs for quick sort.
Figure: Normal (a) and combinatory (b) programs for traverse in topological sort.
A general algorithm for converting any program set expressing a recursive algorithm to a combinatory one is given in Algorithm 2. For a program, it first removes any multiple recursive calls by using _push_state and _mapself, then removes any loops by replacing them with tail recursion. Finally, an iterative maximum matching procedure is used to convert the program to a set of appliers. We put forward the proposition that any recursive program can be combinatorized in this way. Note that non-recursive programs (e.g. the stack-based iterative program for topological sort used in Cai et al. (2017)) may still be combinatorized by Algorithm 2, but the process is less straightforward.
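The first two rewriting steps can be illustrated on ordinary Python functions. The sketch below is an analogy under stated assumptions and is not Algorithm 2 itself: the list `states` and the loop over it are hypothetical stand-ins for a state stack and a map-self step, and the function names are invented for illustration. It shows quick sort with its two recursive calls replaced by saved states that the same routine is mapped over, and a bubble pass whose loop is replaced by tail recursion.

```python
def partition(arr, lo, hi):
    """Lomuto partition of arr[lo..hi]; returns the final pivot index."""
    pivot = arr[hi]
    i = lo
    for j in range(lo, hi):
        if arr[j] < pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    arr[i], arr[hi] = arr[hi], arr[i]
    return i


def quicksort_direct(arr, lo, hi):
    """Ordinary form: two explicit recursive calls (bi-recursion)."""
    if lo < hi:
        p = partition(arr, lo, hi)
        quicksort_direct(arr, lo, p - 1)
        quicksort_direct(arr, p + 1, hi)


def quicksort_stacked(arr, lo, hi):
    """Same algorithm with the two calls replaced by saved states plus a map."""
    if lo >= hi:
        return
    p = partition(arr, lo, hi)
    states = []                    # hypothetical stand-in for a state stack
    states.append((lo, p - 1))     # save the state of the left sub-problem
    states.append((p + 1, hi))     # save the state of the right sub-problem
    for sub_lo, sub_hi in states:  # apply the routine itself to each state
        quicksort_stacked(arr, sub_lo, sub_hi)


def bubble_pass_loop(arr):
    """Loop form of one bubble pass."""
    for i in range(len(arr) - 1):
        if arr[i] > arr[i + 1]:
            arr[i], arr[i + 1] = arr[i + 1], arr[i]


def bubble_pass_tail(arr, i=0):
    """Tail-recursive form of the same pass: the loop body, then a self-call."""
    if i >= len(arr) - 1:
        return
    if arr[i] > arr[i + 1]:
        arr[i], arr[i + 1] = arr[i + 1], arr[i]
    bubble_pass_tail(arr, i + 1)


data = [5, 2, 9, 1, 5, 6, 0]
direct, stacked = list(data), list(data)
quicksort_direct(direct, 0, len(direct) - 1)
quicksort_stacked(stacked, 0, len(stacked) - 1)
looped, tailed = list(data), list(data)
bubble_pass_loop(looped)
bubble_pass_tail(tailed)
```

After these rewrites each routine contains at most one self-call in a fixed position, which is the shape a linear-recursion pattern needs before the remaining matching step can be applied.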
Appendix C Proof of Theorem 1
Before proving Theorem 1, we first give a formal definition of combinatory programs and a lemma on the interpretation of appliers.
1. An ACT is a combinatory program.
2. A program with an applier as entrance is a combinatory program if all of the applier’s callable arguments are combinatory programs.
3. Only that which can be generated by clauses 1-2 in finitely many steps is a combinatory program.
Proof. Because all key embeddings in program key memory and detector key memory have unit norm, the dot product of any two keys equals their cosine similarity. Suppose a key embedding on the right-hand side of equation (1), which will be output by Split in line 22 of Algorithm 1, is set to the stored key of the intended program. Its dot product with that stored key is then 1, while its dot product with any other, distinct stored key is strictly less than 1. According to lines 23-25 of Algorithm 1, the ID of the correct program (the combinator, detector or callable argument) will therefore be selected, guaranteeing the correct interpretation of the applier. ∎
Note that the unit norm constraint for key embeddings is convenient to satisfy in practice.
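As an illustration of why unit-norm keys make retrieval unambiguous, the sketch below normalizes a small set of keys and looks each one up by dot product. It is a toy example under assumptions: the key values and the helper names `normalize`, `dot` and `lookup` are hypothetical, and selecting the ID by the largest dot product is assumed here as the retrieval rule rather than taken from Algorithm 1.

```python
import math


def normalize(vec):
    """Rescale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def lookup(query, keys):
    """Return the index of the stored key with the largest dot product."""
    scores = [dot(query, k) for k in keys]
    return max(range(len(keys)), key=scores.__getitem__)


# Hypothetical key memory: three distinct raw embeddings, normalized once.
raw_keys = [[1.0, 2.0, 0.5], [-0.3, 0.8, 1.5], [2.0, -1.0, 0.1]]
keys = [normalize(k) for k in raw_keys]

# A query equal to stored key i scores 1 against key i and less elsewhere.
selected = [lookup(keys[i], keys) for i in range(len(keys))]
self_scores = [dot(k, k) for k in keys]
cross_scores = [dot(keys[i], keys[j])
                for i in range(len(keys))
                for j in range(len(keys)) if i != j]
```

Normalizing once when a key is written to memory is enough to keep the constraint satisfied, since the keys are not modified afterwards in this sketch.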
Following the above recursive definition of combinatory programs and the procedure of iteratively adding appliers to the program memory from the bottom up, we give an induction proof of Theorem 1. The distinguishing feature of CNPI that enables this proof is the dynamic binding of formal detectors and callable arguments to actual programs, which makes verification of the combinators’ execution (by the core) and verification of their invocation (by appliers) independent of each other. In contrast, it is impossible to conduct such a proof with NPI and RNPI, which lack this feature.
Proof. Base case: It is obvious that programs composed of a single ACT (including built-in ACT) can be interpreted correctly with perfect generalization (abbreviated as perfectly interpretable).
Induction step: Assuming that the programs referenced by the callable arguments of an applier are all perfectly interpretable, we prove that the program with this applier as entrance is perfectly interpretable. First, by Lemma 1, when the applier is interpreted, the right combinator will be invoked with the right detector and callable argument IDs. Second, because the combinators and the detectors have been verified, the programs referenced by the applier’s callable arguments are guaranteed to be called at the right time. Finally, when these programs are called, they can be perfectly interpreted. Putting it all together, the program can be interpreted correctly. Moreover, as the argument calls that support recursion are also guaranteed to be made at the right time (in linrec and _mapself), the program can be interpreted correctly with any input complexity, i.e. with perfect generalization.
When adding new detectors/appliers to detector/program memory, the weights of the core, the key embeddings and the program embeddings of combinators and existing appliers are all held fixed. Thus, the correct interpretation of learned programs composed of these existing appliers can be proved in exactly the same way, i.e., CNPI maintains correct interpretation of already learned programs. ∎