The ability to autonomously synthesize programs from data is one of the main goals in developing intelligent systems: given specific requirements, such as input/output pairs or more formal specifications, the model must learn a program that meets those demands. Several approaches to this task are based on the neural controller-interface framework, in which a neural network interacts with an external structured environment using primitives such as read/write heads or other atomic functions pmlr-v48-zaremba16; reed2016neural; DBLP:journals/corr/GravesWD14; graves2016hybrid; pierrot2019learning; pierrot2020learning. Neural Turing Machines DBLP:journals/corr/GravesWD14 and Differentiable Neural Computers graves2016hybrid can learn simple procedures by training on input/output pairs. These models use external memory and are differentiable, which means they can be trained end-to-end by gradient descent. However, they are essentially black boxes, since it is non-trivial to inspect them to understand the algorithm they discovered. More importantly, supervision is provided at the level of full program outputs only, making training strongly susceptible to overfitting, as shown for the priority sort task of DBLP:journals/corr/GravesWD14.
Neural Programmer-Interpreters (NPI) reed2016neural
learn highly compositional programs by exploiting execution traces. An execution trace is an ordered list of commands that, if executed, solves a given task. Execution traces provide finer-grained supervision during training, encouraging convergence towards more accurate programs. However, NPI assumes the availability of a large dataset of execution traces for the given problem, substantially limiting its applicability in real-world scenarios, where an oracle capable of providing such a dataset may be unavailable. A recent model, named AlphaNPI, shows how to train NPI without execution traces. AlphaNPI can solve both discrete and continuous control problems (pierrot2019learning; pierrot2020learning), and it works by combining NPI with Monte Carlo Tree Search (MCTS), which is used to discover the execution traces. This approach showed outstanding results, both in terms of generalization and interpretability of the final models. However, the spectrum of tasks it can learn is still somewhat limited, since it generates programs in an ad-hoc language that lacks the high-level constructs of a formal programming language, such as loops, variables or function arguments.
This work presents some promising results towards generalized neural interpreters and discusses additional future challenges that still need to be addressed. We also provide a brief discussion comparing Inductive Logic Programming (ILP) and neural-based models. The main contributions are the following:
We propose a new version of the AlphaNPI model, which brings us closer to learning real-world programs by learning functions that also accept arguments.
We propose an Approximate recursive Monte Carlo Tree Search (A-MCTS) procedure to improve convergence.
We develop a refined training strategy, featuring re-training over failed tasks, that improves robustness and generalization.
Problem Statement and Original Setting
We consider the problem of learning a complex algorithm, such as Quicksort, by having an agent interact with an environment by choosing actions. The initial actions are called atomic actions. In the Quicksort case, the environment is a list of integers, and the atomic actions are element swaps and one-step pointer moves. The agent learns to leverage and combine these atomic actions to produce higher-level programs. These learnt programs are then added to the collection of available actions. Therefore, the agent can benefit from both atomic and more complex actions to produce advanced algorithmic behaviour. Each action is assigned a level. Level 0 actions are the atomic actions, which do not need to be learned. Discovered programs have positive levels. Each program can only call lower-level atomic actions or lower-level learned programs. The original work also supported recursion, by letting programs call themselves; however, in this work, we do not experiment with recursive programs. In addition, each program and atomic action has associated pre-conditions, in the form of boolean constraints conditioned on the environment state. Pre-conditions ensure we consistently generate programs that are correct and feasible. For instance, the pre-condition of the Quicksort program is that all pointers must be at the beginning of the list. Pre-conditions are not learned, but are provided beforehand. The objective is to learn a library of programs organized into a hierarchy. In our case, each task corresponds to learning a single program (e.g., the partition function of Quicksort). The reward function returns 1 if the program is correct, and 0 otherwise. The goal of the agent is to maximize the expected reward over all the tasks. This makes the goal both a multi-task reinforcement learning problem and a discrete search problem. The search space is sparse, since there exist many possible programs but very few viable ones.
It also makes the reward function sparse since we get a positive reward and thus learn if and only if we obtain the correct program.
We change the original setting by making actions and programs accept arguments. Each action/program can take an ordered list of at most three arguments. In our environment, these arguments are the pointers used to manipulate the list. For example, the quicksort program accepts two arguments: the pointer placed at the beginning of the list and the pointer placed at the end of the list. An action/program can also accept no arguments, such as the stop action, which terminates the execution of the simulation. As for the pre-conditions, the maximum number of arguments that a program can accept is given a priori. We also extend the pre-conditions to consider the arguments when deciding whether a program is feasible. For example, the swap pointer action cannot be called with the same pointer passed twice as argument.
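To make the argument-aware pre-conditions concrete, here is a minimal Python sketch. The function names, the dictionary-based environment, and the feasibility check are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: an argument-aware pre-condition is a predicate over
# (environment state, argument tuple). Names here are illustrative only.

def swap_precondition(env, args):
    """swap takes two pointers; calling it with the same pointer twice is invalid."""
    p1, p2 = args
    return p1 != p2

def is_feasible(precondition, env, args, max_args):
    """A call is feasible if it respects the arity bound and its pre-condition."""
    return len(args) <= max_args and precondition(env, args)

env = {"list": [3, 1, 2], "pointers": [0, 0, 2]}
print(is_feasible(swap_precondition, env, (0, 1), max_args=3))  # True
print(is_feasible(swap_precondition, env, (0, 0), max_args=3))  # False
```

Feasibility filters of this kind prune the search space before any tree expansion takes place.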
We start from the architecture developed by pierrot2019learning and adapt it to the novel setting by introducing an additional module, the arguments network described below. Thanks to this new module, we learn compact programs that can adapt their behaviour by accepting arguments, instead of learning a separate program for each possible value of the arguments. For example, we can learn a single move action to shift pointers that can operate on multiple pointers at the same time, as in move(pointer_1, pointer_3). With the original method, we would have had to learn two separate actions, move_pointer_1 and move_pointer_3.
The architecture is shown in Figure 1 and consists of six modules: an encoder ($f_{enc}$), a programs matrix ($M_{prog}$), an LSTM core ($f_{lstm}$), a program network ($f_{prog}$), an arguments network ($f_{args}$) and a value network ($f_{value}$). The encoder takes as input an environment observation $e_t$ and encodes it as a set of features $s_t$. The program matrix takes as input an index $i$ and returns the corresponding program embedding $p_i$. The LSTM core executes programs while conditioning on $p_i$, the feature set $s_t$ and its internal state $h_{t-1}$. The program and arguments networks take the LSTM output $h_t$ and return probabilities over the action space, $\pi^P$, and argument space, $\pi^A$, respectively. Lastly, the value network takes instead the LSTM output to estimate the value function $v_t$. The equations of the architecture are the following:

$$s_t = f_{enc}(e_t), \qquad p_i = M_{prog}[i], \qquad h_t = f_{lstm}(s_t, p_i, h_{t-1})$$
$$\pi^P = f_{prog}(h_t), \qquad \pi^A = f_{args}(h_t), \qquad v_t = f_{value}(h_t)$$
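The two policy heads and the value head can be sketched in a few lines of numpy. Shapes and weight names (W_prog, W_args, w_value) are illustrative assumptions, not the paper's exact parameterization, and the LSTM output is replaced by a random vector.

```python
import numpy as np

rng = np.random.default_rng(0)
H, P, A = 8, 5, 4                   # hidden size, #programs, #argument choices

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

h_t = rng.normal(size=H)            # stand-in for the LSTM output
W_prog = rng.normal(size=(P, H))    # program head weights
W_args = rng.normal(size=(A, H))    # arguments head weights
w_value = rng.normal(size=H)        # value head weights

pi_prog = softmax(W_prog @ h_t)     # distribution over programs
pi_args = softmax(W_args @ h_t)     # distribution over arguments
v_t = np.tanh(w_value @ h_t)        # scalar value estimate

print(pi_prog.round(3), pi_args.round(3), round(float(v_t), 3))
```

Both heads read the same recurrent state, so the argument choice can depend on the program context encoded by the LSTM.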
Approximate Monte Carlo Tree Search
Monte Carlo Tree Search coulom2006backupmonte; Kocsis06banditbased is an algorithm to efficiently search large combinatorial spaces represented as trees. It has been successfully used to solve many tasks, such as mastering complex games and protein folding silver2017mastering; AlphaFold2021. The original method consists of four phases: selection, expansion, simulation and backup. Many different flavours of MCTS are available (for instance, DBLP:journals/corr/abs-2103-04931; pmlr-v119-grill20a; xiao2018memory). In this work, we build upon the recursive MCTS developed by pierrot2019learning. We propose an Approximate MCTS (A-MCTS) that modifies the expansion phase so that it no longer expands all the available nodes. This change is needed to account for larger search spaces that would be prohibitive to explore thoroughly. In our setting, we have discrete arguments, and the original MCTS requires creating a node for each available program/argument pair. If we increase the number of available arguments, the number of program/argument combinations grows significantly.
In our setting, each node of the tree represents the environment's state at time t, and each transition represents a program call. Figure 2 shows a description of the A-MCTS algorithm. Given the policies $\pi^P$ and $\pi^A$, we add Dirichlet noise to foster exploration, sample $k$ program/argument pairs ($k$ is a hyperparameter of the model), and add them as the possible future available actions. With fewer nodes, the simulation phase becomes shorter. By exploring fewer nodes, we immediately reward good solutions. Thanks to the random perturbations, we will eventually explore all the available configurations if good solutions are sparse.
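The approximate expansion step can be sketched as follows. The noise weight eps, the Dirichlet concentration alpha, and the joint-sampling rule are illustrative assumptions; the paper only specifies that noisy policies are sampled for k pairs.

```python
import numpy as np

def approximate_expand(pi_prog, pi_args, k, eps=0.25, alpha=0.3, rng=None):
    """Sample k program/argument pairs from the noise-perturbed policies
    instead of creating a child node for every combination."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy_p = (1 - eps) * pi_prog + eps * rng.dirichlet([alpha] * len(pi_prog))
    noisy_a = (1 - eps) * pi_args + eps * rng.dirichlet([alpha] * len(pi_args))
    # Joint distribution over program/argument pairs, then sample k of them.
    joint = np.outer(noisy_p, noisy_a).ravel()
    joint /= joint.sum()
    idx = rng.choice(len(joint), size=k, replace=False, p=joint)
    return [(int(p), int(a)) for p, a in (divmod(i, len(pi_args)) for i in idx)]

pairs = approximate_expand(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.5]), k=3)
print(pairs)  # three distinct (program, argument) pairs
```

With 3 programs and 2 arguments the exact expansion would create all 6 children; the sketch creates only k of them, which is where the savings reported in Figure 3 come from.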
During a training iteration, the agent selects a program to learn. It runs a number of episodes exploring the search space with A-MCTS. The data gathered during the episodes are aggregated to construct the tree policy vectors for both programs and arguments, $\pi^P_{mcts}$ and $\pi^A_{mcts}$. After each successful episode, we collect the observations, the hidden states, the task indexes, the rewards and the tree policies into an execution trace. More formally, an execution trace for a given program is the tuple of these recorded quantities. An execution trace gives the exact sequence of actions that, if applied to the input environment, produces the correct output. The final discovered execution traces are stored in a replay buffer. Then, a mini-batch is sampled from the replay buffer, and the agent is trained on this data to minimize the adjusted loss function:

$$\mathcal{L} = \mathcal{L}_{prog} + \mathcal{L}_{args} + \mathcal{L}_{value}$$

We want to push the network to reproduce the execution traces found by jointly minimizing the cross-entropy between the policies discovered by MCTS and the policies generated by the network. $\mathcal{L}_{prog}$ minimizes the cross-entropy between the program policies, $\pi^P_{mcts}$ and $\pi^P$. $\mathcal{L}_{args}$ minimizes the same cross-entropy but between the two argument policies, $\pi^A_{mcts}$ and $\pi^A$. Instead, $\mathcal{L}_{value}$ pushes the network to generate the correct value function $v_t$.
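The three loss terms can be sketched numerically as below. The equal weighting of the terms and the squared-error form of the value term are assumptions for illustration.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target (MCTS) policy and a network policy."""
    return -np.sum(target * np.log(pred + eps))

def adjusted_loss(pi_prog_mcts, pi_prog, pi_args_mcts, pi_args, r, v):
    l_prog = cross_entropy(pi_prog_mcts, pi_prog)    # program policies
    l_args = cross_entropy(pi_args_mcts, pi_args)    # argument policies
    l_value = (r - v) ** 2                           # value regression
    return l_prog + l_args + l_value

loss = adjusted_loss(
    np.array([0.8, 0.2]), np.array([0.7, 0.3]),
    np.array([0.5, 0.5]), np.array([0.6, 0.4]),
    r=1.0, v=0.9,
)
print(round(float(loss), 4))
```

Matching the tree policies exactly reduces each cross-entropy term to the entropy of the target, its minimum; a perfect value estimate zeroes the last term.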
We also employ curriculum learning pmlr-v70-andreas17a to learn programs by following the hierarchy, so that lower-level programs are learned first. Moreover, curriculum learning chooses the next program to learn by looking at the success rate of the individual programs: programs that fail often are picked more often for learning.
Given the environment state, the agent emits a program policy $\pi^P$ and an argument policy $\pi^A$. Each policy $\pi$ is such that $\pi \in [0,1]^M$ and $\sum_{i=1}^{M} \pi_i = 1$, where M is the total number of available programs or arguments. The best program/argument pair is given by:

$$(p^*, a^*) = \arg\max_{p,\, a} \; \pi^P(p)\, \pi^A(a)$$
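Assuming the program and argument choices are scored independently, the selection reduces to an argmax over the outer product of the two policies, as in this numpy sketch:

```python
import numpy as np

def best_pair(pi_prog, pi_args):
    """Return the (program, argument) indices maximizing the joint probability."""
    joint = np.outer(pi_prog, pi_args)
    return tuple(int(i) for i in np.unravel_index(np.argmax(joint), joint.shape))

print(best_pair(np.array([0.1, 0.6, 0.3]), np.array([0.2, 0.8])))  # (1, 1)
```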
Re-training over Failed Environments
We noticed that a small fraction of training environments are substantially harder to solve and are thus mostly neglected when optimizing average performance during training. The result is that sub-optimal programs are learned, and the performance degradation becomes apparent when programs are evaluated on the longer lists used for testing. To tackle this problem and improve robustness and generalization, we implemented a function to re-train over previously failed environments. Namely, given a task to learn, if we get a 0 reward, we store the environment used into a dedicated buffer. Then, when we try to learn the task again, we run A-MCTS by sampling, with a small probability, an environment from this buffer that the model failed to solve in previous iterations. Otherwise, we sample the next state from a normal distribution over the environments conditioned on the task.
The buffer also implements a form of curriculum learning: given a task, when we use previously failed states, we sample them according to their success rate. If a state fails more often than others, it will be sampled more frequently.
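A minimal sketch of this failure buffer follows. The class name, the per-task dictionary, and the failure-count weighting are illustrative assumptions consistent with the sampling rule described above.

```python
import random

class FailureBuffer:
    """Store environments on which a task failed; re-sample them with
    probability proportional to how often they failed."""

    def __init__(self):
        self.failures = {}   # task -> {env_key: fail_count}

    def record_failure(self, task, env_key):
        per_task = self.failures.setdefault(task, {})
        per_task[env_key] = per_task.get(env_key, 0) + 1

    def sample(self, task):
        per_task = self.failures.get(task)
        if not per_task:
            return None      # no recorded failures: fall back to normal sampling
        envs, counts = zip(*per_task.items())
        return random.choices(envs, weights=counts, k=1)[0]

buf = FailureBuffer()
buf.record_failure("partition", "env_a")
buf.record_failure("partition", "env_a")
buf.record_failure("partition", "env_b")
print(buf.sample("partition"))  # 'env_a' is twice as likely as 'env_b'
```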
We tested our approach on a sorting task, a familiar setting used by many others reed2016neural; pierrot2019learning; DBLP:journals/corr/GravesWD14. We focused on learning the hierarchy of programs needed to perform the Quicksort algorithm. Unlike the Bubblesort algorithm used in previous works pierrot2019learning, Quicksort cannot be easily written without functions that accept arguments. Therefore, we chose it as the target for our experiments. Quicksort is a divide-and-conquer algorithm 10.1093/comjnl/5.1.10. It works by selecting a "pivot" value that partitions the list into sub-lists, which it then sorts. The algorithm has an average time complexity of $O(n \log n)$, and it also presents good space complexity, requiring $O(\log n)$ extra bits to sort $n$ elements. Reaching these performances requires some refinements to the algorithm, especially in the method used to select the pivot. The code for the experiments is freely available at github.com/geektoni/learning_programs_with_arguments.
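The target hierarchy mirrors an iterative Quicksort driven by a stack of pointer pairs. This plain-Python sketch (Lomuto-style partition) illustrates the roles that partition and quicksort_update play; it is not the agent's actual learned trace.

```python
def partition(lst, lo, hi):
    """Order a sub-array around the pivot lst[hi] and return the pivot index."""
    pivot, i = lst[hi], lo
    for j in range(lo, hi):
        if lst[j] < pivot:
            lst[i], lst[j] = lst[j], lst[i]
            i += 1
    lst[i], lst[hi] = lst[hi], lst[i]
    return i

def quicksort(lst):
    stack = [(0, len(lst) - 1)]      # pairs of (lo, hi) pointers
    while stack:
        lo, hi = stack.pop()         # quicksort_update: pop a pair, partition,
        if lo < hi:                  # then push the next pointer pairs
            p = partition(lst, lo, hi)
            stack.extend([(lo, p - 1), (p + 1, hi)])
    return lst

print(quicksort([5, 2, 7, 1, 9, 3]))  # [1, 2, 3, 5, 7, 9]
```

The explicit stack is exactly why the environment described below exposes a stack primitive to the agent.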
We consider an environment composed of three main components: a list of integers, three pointers that can reference the list's elements, and a stack that can save and retrieve the pointers' current positions. The programs can accept at most three arguments. Each argument represents a reference to one of the three pointers available in the environment. We also added an empty value to signal arguments that the given function has to ignore. Additionally, as in regular programming languages, the argument order counts (e.g., function(a,b,c) has a different meaning from function(b,c,a)).
We trained the architecture with lists of 2 to 7 elements. The list elements were integers constrained between 0 and 10. We trained four different models. The first baseline uses the vanilla architecture from pierrot2019learning with the original MCTS, and the second baseline uses the same architecture with our A-MCTS instead. The remaining two models employ our custom architecture presented in the previous sections: the former uses the standard MCTS, and the latter uses A-MCTS instead.
The higher-level programs that compose our hierarchy and that will be learned are: partition_update, partition, quicksort_update and quicksort. They correspond to specific pieces of the original Quicksort algorithm. The complete lists and descriptions of learnable programs and atomic actions can be found in the Appendix. The two baseline models learn functions that do not accept arguments. Therefore, we provided them with an augmented set of atomic actions, built by taking the Cartesian product between the atomic actions and the possible arguments they can accept.
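Building such an augmented action set is a one-liner with itertools; the action and pointer names below are illustrative, not the full set used in the experiments.

```python
from itertools import product

# Cartesian product of argument-free atomic actions and the pointers they
# may act on, yielding one fixed action per combination for the baselines.
actions = ["ptr_left", "ptr_right"]
pointers = ["ptr_1", "ptr_2", "ptr_3"]
augmented = [f"{a}({p})" for a, p in product(actions, pointers)]
print(augmented)
# ['ptr_left(ptr_1)', 'ptr_left(ptr_2)', 'ptr_left(ptr_3)',
#  'ptr_right(ptr_1)', 'ptr_right(ptr_2)', 'ptr_right(ptr_3)']
```

This is the blow-up the argument-enabled architecture avoids: two parameterized actions replace six fixed ones, and the gap widens as arities grow.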
While training, we also recorded the total number of nodes expanded by the MCTS and A-MCTS procedures after each training step. The idea is to check whether we can converge to a solution with an approximate exploration of the search space that expands fewer nodes.
As a further evaluation study, we validated the trained models on lists of up to 60 elements to investigate their generalization capabilities. Table 1 presents the validation results for all the learnable programs.
From the results, the model combining our argument-enabled architecture with the exact MCTS is the only one that learns the Quicksort algorithm correctly, whereas the other models fail. It also shows some generalization capacity, since it can sort lists of lengths different from those used for training. We can also see how the various models struggled when learning the quicksort_update function.
Interestingly, the models trained with the approximate procedure A-MCTS show similar convergence rates to those trained with MCTS. They also show good generalization performances, albeit not reaching the level of the original MCTS. Figure 3 reports the total nodes expanded by both the approximate and exact procedures. For specific programs, such as partition_update or partition, A-MCTS converges to a solution by exploring fewer nodes than its exact counterpart. However, the behaviour of A-MCTS seems less stable than that of MCTS, and further investigations are required to exploit its exploration efficiency without losing robustness.
Comparison with ILP
Currently, many common program synthesis problems can be solved either by ILP or by neural-interpreter methods. We argue that neural-based solutions require less formalization and fewer specifications to learn an algorithm. However, they are more data-intensive than a standard ILP procedure. As we move closer to real-world programs, we believe neural-based models are easier to use, since they rely on looser specifications. Moreover, we think the use of a differentiable model can speed up convergence when used as a prior for the discrete search method: in our case, the differentiable model implicitly learns a representation of the target program given the traces, thus leading to faster trace discovery.
This work presents some initial results on improving program synthesis by generating more human-like programs starting from input/output examples. We propose an improved AlphaNPI architecture to learn programs that can accept arguments to diversify their behaviour. The support for arguments enables our revised architecture to produce richer programs and learn more complex algorithms with respect to the original work. It also generates code that is more similar to code written in current high-level programming languages. Additionally, we present an Approximate Monte Carlo Tree Search method that enables us to converge by exploring only a fraction of the search space. We benchmarked our method by learning a well-known sorting algorithm, Quicksort, showing how our approach manages to learn it effectively while AlphaNPI fails to converge. We also show how the learned program generalizes when sorting lists of increasing length. This work is a first attempt at enriching neural program synthesis algorithms with additional features of high-level programming languages. Many open issues are still to be addressed. First, while the approximate MCTS procedure effectively reduces the search space and can potentially contribute to scaling up the complexity of learnable programs, further investigations are needed to improve its robustness. Second, in this work only atomic actions can accept arguments, whereas higher-level programs accept only the "empty" argument. This happens because all methods operate in the same environment, and there is no proper program scope. Additional studies are required to add this notion of program scope to make better use of the new arguments.
| Program | Description | Level |
|---|---|---|
| quicksort | Sort the list. | 5 |
| quicksort_update | Execute partition call and populate the stack with the next pointers. | 4 |
| partition | Order a sub-array and return the pivot for the next operation. | 2 |
| partition_update | Swap the elements pointed by pointer 3 (pivot) and pointer 1 if the pointed value is less than the pivot value. | 1 |
| save_ptr_1 | Save pointer 1 position into the temporary registry. | 0 |
| load_ptr_1 | Load the temporary registry value into the pointer 1 position. | 0 |
| push | Push pointers values inside the stack. | 0 |
| pop | Pop and restore pointers values from the stack. | 0 |
| swap | Swap elements pointed by pointer 1 and by pointer 2. | 0 |
| swap_pivot | Swap elements pointed by pointer 1 and by pointer 3 (pivot). | 0 |
| ptr_i_left | Move pointer i (where i ∈ {1, 2, 3}) one step left. | 0 |
| ptr_i_right | Move pointer i (where i ∈ {1, 2, 3}) one step right. | 0 |
| Program | Description | Arguments | Level |
|---|---|---|---|
| quicksort | Sort the list. | 0 | 5 |
| quicksort_update | Execute partition call and populate the stack with the next pointers. | 0 | 4 |
| partition | Order a sub-array and return the pivot for the next operation. | 0 | 2 |
| partition_update | Swap the elements pointed by pointer 3 (pivot) and pointer 1 if the pointed value is less than the pivot value. | 0 | 1 |
| save_ptr | Save a given pointer position into the temporary registry. | 1 | 0 |
| load_ptr | Load the temporary registry value into the given pointer position. | 1 | 0 |
| push | Push pointers values inside the stack. | 0 | 0 |
| pop | Pop and restore pointers values from the stack. | 0 | 0 |
| swap | Given two pointers, swap the corresponding elements. | 2 | 0 |
| ptr_left | Move one, two or all three pointers one step left. | 3 | 0 |
| ptr_right | Move one, two or all three pointers one step right. | 3 | 0 |
| Program | Pre-condition |
|---|---|
| quicksort | Pointers 1 and 3 are at the extreme left of the list. Pointer 2 is at the extreme right of the list. The stack is empty. |
| quicksort_update | The stack is not empty and the temporary registry is empty. |
| partition | The temporary registry must contain pointer 1's position. |
| partition_update | Pointer 1's position must be lower than pointer 3's position. Pointer 3's position must be lower than pointer 2's position. The temporary registry must not be empty. |
| load_ptr_1 | The temporary registry must not be empty. |
| push | A custom boolean condition must hold. |
| pop | The stack must not be empty. |
| swap | Pointer 1's position must be different from pointer 2's position. |
| swap_pivot | Pointer 1's position must be different from pointer 3's position. |
| ptr_i_left | Pointer i (where i ∈ {1, 2, 3}) is not at the extreme left of the list. |
| ptr_i_right | Pointer i (where i ∈ {1, 2, 3}) is not at the extreme right of the list. |