1 Introduction
Program synthesis is one of the central problems in Artificial Intelligence, studied from its early days (Manna & Waldinger, 1971; Waldinger & Lee, 1969), and has seen a lot of recent interest in the machine learning and programming languages communities (Alur et al., 2013; Gulwani et al., 2012, 2015; Muggleton, 1991; Lin et al., 2014; Solar-Lezama, 2013). The recent neural approaches can be broadly classified into two categories: program induction and program synthesis. Both approaches share the objective of learning program semantics but do so in different ways. Program induction aims to embed the semantics of a particular algorithm into a differentiable model trained end-to-end, whereas the goal of program synthesis is for a model to learn the semantics of a domain-specific language (DSL) and produce programs defined by corresponding specifications. Both problems necessitate large datasets, either of I/O pairs in the case of program induction, or of programs with corresponding I/O pairs in the case of program synthesis. However, since large datasets for program induction and synthesis tasks do not exist, these approaches train models on large synthetically generated datasets. Presumably, if a model can accurately predict arbitrary program outputs (for induction) or programs in the DSL (for synthesis), then it has likely learnt the correct algorithm or DSL semantics.
Although this approach has led to some impressive synthesis results in many domains, synthetically generating datasets that cover all DSL programs and the corresponding input space can be problematic, especially for more complex DSLs like Karel the Robot (Pattis, 1981), which includes complex control-flow primitives (while loops and if conditionals) and operators. Likewise, for induction tasks, the sampling procedure for program specifications may lead to undesirable biases in the training distribution that inhibit strong generalization.
In this paper, we consider two problem settings. The first is the Karel domain and the recently proposed Karel synthesis model (Bunel et al., 2018). We identify many distributions of input examples and DSL programs for which the Karel synthesis model performs poorly. The second problem setting is a program induction problem in which a model is trained to execute and predict the output of simple arithmetic expressions, which we denote the Calculator domain; for this domain, we considered common synthetic data generation strategies, including one from tensor2tensor (Vaswani et al., 2018), an open-source deep learning library. Upon analysis, we find evidence of undesirable artifacts resulting from certain biases in the generation algorithm.
Our results indicate that models trained with common methodologies for synthesizing datasets fail to learn the full semantics of the DSL, even when they perform well on a test set, and suggest the need for a more principled way to generate synthetic datasets. For some program and input distributions, the state-of-the-art neural synthesis models perform quite poorly, often achieving less than
generalization accuracy. In this paper, we develop a new methodology for creating training distributions over programs in the DSL to mitigate some of these issues. Moreover, unlike previous work, which has not considered distributions over the input space, we show that input distributions also play a significant role in determining synthesizer performance. Our methodology involves defining the distribution over DSL programs and the input space using a set of random variables that encode many of the valuable features describing the data, e.g., in the Karel domain, the amount of control flow nesting in programs or the number of markers present in the inputs.
Our methodology allows us to identify several specialized distributions over the input space and Karel programs on which the current state-of-the-art synthesis models (Bunel et al., 2018) perform poorly when trained on traditional program and input distributions and tested on our new distributions. From this, we design new training distributions by ensuring greater uniformity over the random variables in our methodology. By retraining the same architecture on these new training data distributions, we observe a greater ability to generalize, with significant improvements when evaluated on the aforementioned test sets. We observe similar improvements in the Calculator domain.
This paper makes the following key contributions:

We propose a new methodology to generate different desirable distributions over the space of datasets for program induction and synthesis tasks.

We instantiate the methodology for the Karel and Calculator domains and show that existing models generalize worse when evaluated on test datasets generated by our technique.

We then retrain models in both domains and demonstrate that models achieve greater overall generalization performance when trained on datasets generated with our methodology.
2 Related Work
2.1 Training Models with Synthetic Data
In certain domains like computer vision and robotics, collecting high-quality real-world training data incurs significant cost, and so many researchers have investigated the use of large amounts of synthetic data. For example, Christiano et al. (2016); Peng et al. (2017); Pinto et al. (2017); Bousmalis et al. (2017) aim to learn robotics policies that compensate for differences between the real world and the simulation. Within computer vision, Shrivastava et al. (2016) demonstrate learning from entirely synthetic images for gaze and pose estimation.
2.2 Neural Program Induction and Synthesis
Program induction methods learn differentiable modules such as stack (Joulin & Mikolov, 2015), RAM (Kurach et al., 2016), GPU (Kaiser & Sutskever, 2015), and read-write external memory (Graves et al., 2014) to represent algorithms. Other methods attempt to learn differentiable control flow operations (Gaunt et al., 2016; Neelakantan et al., 2015). These approaches reconstruct outputs given inputs, inferring the underlying algorithms. Other methods instead learn neural modules from program traces rather than from I/O examples (Reed & de Freitas, 2015; Xiao et al., 2018; Cai et al., 2017).
Devlin et al. (2017); Parisotto et al. (2017) use neural program synthesis techniques for learning string editing programs in RobustFill. Similarly, Balog et al. (2016) learn array programs in DeepCoder, and Bhupatiraju et al. (2017) learn to compose API calls. Bunel et al. (2018) apply neural program synthesis to the Karel domain we consider in our paper; we use their architecture and dataset. Many of these approaches report high test performance, but good performance on synthetically specified programs may not indicate the model’s ability to generalize to arbitrary user-desired programs. In fact, our results show that under certain distributions, these models perform quite poorly. To verify that these models appropriately generalize to reasonably complex arbitrary programs, the test set should sufficiently represent the universe of these programs and their specifications. Likewise, to avoid biasing the result, the test set should also draw uniformly from these programs and specifications. For example, for RobustFill, the data generation methodology only sampled programs uniformly from the string DSL, but did not take into account the distribution over input strings, such as their length, the frequencies of occurrence of regular expressions and their nesting, common words and constants, etc.
3 Data Generation Methodology
Currently, automated data generation focuses in large part on a constructive process, whose parameters can be tuned. We propose a complementary approach in which we perform a subsequent filtering step on this process to ensure that the resulting distribution has certain properties.
We define a salient random variable as one whose distribution in the final dataset is of interest. In the case of program synthesis, we consider two kinds of salient variables: variables denoting important features of a program in the given DSL, such as its length and degree of nesting; and variables denoting features for the input space.
In many cases, we can modify our sampling procedure to ensure a desirable distribution of a particular salient variable. However, for some salient variables, it is infeasible to tune the parameters of a given sampling procedure in order to obtain a desired distribution for that salient variable. For example, if we sample programs directly from a context-free grammar, it is difficult to control the distribution of various salient variables such as program length, degree of nesting, etc. This is a notable problem in both the Karel and Calculator domains.
Furthermore, within the context of program synthesis specifically, there is often an additional challenge: not all inputs are valid for all programs. For example, in the Karel domain, an input for a given program would be invalid if the program attempts to perform illegal actions for that input (such as move into walls or pickMarker in a cell containing no markers). The requirement that the program/input pairs suit each other itself acts as an unpredictable filter that makes it difficult to ensure uniformity of salient variables by tuning generation parameters.
As such, we propose a methodology for randomly sampling a dataset while ensuring that a given salient variable, which takes values in a finite, discrete set and which we treat as a random variable, has a uniform distribution throughout the dataset. To do this, we first sample an example from an original distribution. We then add the example to the dataset with an acceptance probability that depends on the example's salient-variable value and on the probabilities that the original distribution induces over the salient variable; these probabilities are calculated empirically from counts over past samples drawn from the original distribution. We repeat the above until the dataset is of the desired size. For full pseudocode see Section B.1 in the appendix.
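The filtering step described above can be sketched as a simple rejection sampler. This is a minimal illustration under stated assumptions, not the pseudocode of Section B.1: the function name, the warm-up phase, and in particular the acceptance rule `(p_min + eps) / (p_value + eps)` are hypothetical choices made here so that rare salient values are accepted more often than common ones.

```python
import random
from collections import Counter


def homogenized_sample(sample_original, salient, target_size,
                       eps=0.0, warmup=1000, rng=None):
    """Rejection-sample a dataset whose salient variable is closer to uniform.

    sample_original: zero-argument callable drawing from the original distribution.
    salient: maps an example to a hashable value from a finite set.
    eps: hypothetical smoothing term; larger values accept more often.
    """
    rng = rng or random.Random()
    counts = Counter()
    # Warm-up draws give an initial empirical estimate of the induced distribution.
    for _ in range(warmup):
        counts[salient(sample_original())] += 1
    total = warmup

    dataset = []
    while len(dataset) < target_size:
        example = sample_original()
        value = salient(example)
        counts[value] += 1
        total += 1
        # Empirical probabilities from all samples drawn so far.
        p_value = counts[value] / total
        p_min = min(counts.values()) / total
        # Assumed acceptance rule: always <= 1, and equal to 1 for the rarest value.
        accept = (p_min + eps) / (p_value + eps)
        if rng.random() < accept:
            dataset.append(example)
    return dataset
```

With `eps=0.0` each salient value is kept at roughly the rate of the rarest one, which is slow when some value is very rare; a larger `eps` pushes every acceptance probability toward 1.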
We use a hyperparameter to trade off the runtime of the above procedure against the uniformity of the salient variable in the resulting dataset. In Section B.2 (in the Appendix), we provide a probabilistic bound on the uniformity of the salient variable in the resulting distribution for a particular setting of this hyperparameter. Also, for a particular setting, we can bound the number of calls to the original sampler needed to draw a single sample (proof in Section B.3, with empirical experiments in Sections B.4 and B.5). Increasing the hyperparameter increases the algorithm’s speed, at the cost of allowing the distribution of the salient variable in the dataset to diverge further from uniform.
4 Application to Karel: Experiments with New Test Distributions
Karel is an educational programming language (Pattis, 1981) where the programmer writes imperative programs with conditionals and loops to produce a sequence of actions for an agent (a robot named Karel) which lives in a rectangular grid world. For a detailed description of the particular instantiation of the Karel language and input grid specification that we consider, see Section A. The program synthesis task that we consider is as follows: given a set of $n$ pairs of input and output grids $\{(I_1, O_1), \ldots, (I_n, O_n)\}$, find a Karel program $\pi$ such that executing $\pi$ on $I_1$ results in $O_1$, executing it on $I_2$ results in $O_2$, and so on. In the paper, $n = 5$ unless otherwise specified. An example Karel synthesis task with the I/O examples and the corresponding Karel program to be synthesized is shown in Figure 1.
In this section, we employ the Karel instantiation of our abstract data generation methodology in Section 4.1 to generate different test datasets. By imposing a more uniform distribution over the salient random variables when generating the I/O specifications and target programs which make up the test set, we observe much lower accuracies of the previous Karel synthesis models (Bunel et al., 2018) compared to the original test set.
4.1 Salient Variables in Karel
We devised the following salient random variables to describe the input space in Karel:

Grid size: Dimensions of the grid in which Karel can act.

Marker ratio: Fraction of cells with at least one marker.

Wall ratio: Fraction of cells which contain a wall.

Marker count: Number of markers that are present in a cell containing markers.

Number of grids: Number of I/O grids shown to the model to specify the desired program.
For the program space in the Karel DSL, we consider the following random variables:

Program size: Size of the program in terms of number of tokens.

Control flow ratio: Number of control flow structures appearing in the program.

Nested control flow: The amount of control flow nesting in programs (e.g. while inside if).
4.2 Changing the I/O Distribution
We reproduced the encoder-decoder model of Bunel et al. (2018) and trained it using the provided synthetic training set with the teacher-forcing maximum likelihood objective. On the existing test set, our model achieves 73.52% generalization accuracy, slightly higher than the 71.91% accuracy reported in Bunel et al. (2018). Generalization accuracy denotes how often the model’s output is correct on both the 5 I/O examples shown to the model and the remaining held-out sixth I/O example.
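The metric can be made concrete with a small sketch. The function name, the task tuple layout, and the `run_program` callable are hypothetical stand-ins for an actual Karel interpreter and evaluation harness.

```python
def generalization_accuracy(tasks, run_program):
    """Fraction of tasks whose predicted program is right on every I/O pair.

    tasks: iterable of (predicted_program, shown_pairs, held_out_pair), where
        each pair is (input, expected_output).
    run_program: hypothetical interpreter, called as run_program(program, input).
    """
    correct = 0
    total = 0
    for program, shown_pairs, held_out_pair in tasks:
        total += 1
        # The program must match the shown examples AND the held-out one.
        pairs = list(shown_pairs) + [held_out_pair]
        if all(run_program(program, inp) == out for inp, out in pairs):
            correct += 1
    return correct / total if total else 0.0
```

A program that merely fits the shown examples but fails on the held-out pair counts as incorrect, which is what separates this metric from consistency with the specification alone.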
To test how the model may be sensitive to changes in the I/O examples used to specify the program, we created new test sets by sampling new input grids and running them on each of the programs in the existing test set to obtain new I/O pairs. By keeping the programs themselves the same, we avoid inadvertent changes in the inherent difficulty of the task (the complexity of the programs to be synthesized).
Salient random variables with uniform distribution.
We first generated grids such that they would follow a distribution that is as uniform as possible in the salient features in Section 4.1. We used the following procedure to sample each grid: 1) sample the grid size (height and width); 2) sample the marker ratio and the wall ratio; 3) for each cell in the grid, sample a wall indicator and a marker indicator according to those ratios; 4) if the cell has no wall and its marker indicator is set, sample a marker count for it, otherwise set its marker count to zero; 5) place walls and markers in the grid according to these indicators and counts; 6) place Karel at a random location (not containing a wall) and with a random orientation. After generating 5 input grids for a given program, we ensure that the program does not crash on any of them and also check whether the 5 input grids exhibit complete branch coverage (i.e., each branch is taken by at least one of the 5 inputs). If either of these conditions is not satisfied, we discard all 5 grids and start over.
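The six sampling steps can be sketched as follows. The concrete ranges used here (grid sides between 2 and 16, ratios uniform on [0, 1], marker counts uniform on 1 to 9) are illustrative assumptions, not values taken from the text, and the crash and branch-coverage checks are omitted because they require a Karel interpreter.

```python
import random

ORIENTATIONS = ("north", "east", "south", "west")


def sample_grid(rng, min_size=2, max_size=16, max_markers=9):
    """Sample one input grid; all numeric ranges are hypothetical defaults."""
    while True:
        # 1) grid size
        height = rng.randint(min_size, max_size)
        width = rng.randint(min_size, max_size)
        # 2) per-grid marker and wall ratios
        marker_ratio = rng.random()
        wall_ratio = rng.random()
        walls = [[False] * width for _ in range(height)]
        markers = [[0] * width for _ in range(height)]
        for r in range(height):
            for c in range(width):
                # 3) per-cell wall and marker indicators
                is_wall = rng.random() < wall_ratio
                has_marker = rng.random() < marker_ratio
                # 4) and 5) markers only go into cells without a wall
                if is_wall:
                    walls[r][c] = True
                elif has_marker:
                    markers[r][c] = rng.randint(1, max_markers)
        free = [(r, c) for r in range(height) for c in range(width)
                if not walls[r][c]]
        if free:  # resample the whole grid if every cell is a wall
            break
    # 6) place Karel on a wall-free cell with a random orientation
    row, col = rng.choice(free)
    return {"height": height, "width": width, "walls": walls,
            "markers": markers, "karel": (row, col, rng.choice(ORIENTATIONS))}
```

Sampling the two ratios once per grid, rather than fixing them globally, is what spreads the resulting grids across sparse and dense configurations.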
On this dataset, the model trained on existing data achieved generalization accuracy of only 27.9%, a drop of roughly 45pp from the existing test set’s generalization accuracy of 73.52%.
Salient random variables with narrow distributions.
We further investigated the performance drop noted above by synthesizing “narrower” datasets that captured different parts of the joint probability space over the salient input random variables. For each narrow dataset, we selected a wall ratio and a marker ratio (both within a fixed range) as well as a marker-count distribution, which would be the same for all I/O grids. Then, we follow the procedure below for each grid: 1) sample the grid size (height and width); 2) randomly choose the cells that contain walls and the cells that contain markers, in numbers determined by the selected ratios; 3) sample a marker count for all cells chosen to contain markers; 4) place Karel at a random location (not containing a wall) and with a random orientation. In our experiments, we primarily used 3 different distributions for the marker count: one truncated at 9, a second distribution, and a third which, when sampled, has a value equal to 10 minus a sample from Geom(0.5), truncated at 1.
4.3 Changing the Program Distribution
We will now examine how the existing model can surprisingly fail to perform well at synthesizing certain programs that are different from those in the existing validation and test sets.
Performance on complex DSL constructs.
We examined whether or not the model could succeed in synthesizing programs which require nesting of conditional constructs. This was of interest since these programs were relatively rare in the training dataset. We generated an evaluation dataset composed solely of programs that contained while inside while statements, and another dataset in which all programs had while inside if statements. (Footnote 1: To avoid any negative effects from changes in the I/O distribution, we attempted to ensure that the salient input variables matched those of the provided training and test sets.) We found that the model fared very poorly on these datasets, achieving only 0.64% and 2.23% accuracy respectively.
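Table 2: Generalization accuracy on action-only test sets, by program length.

Model type (program length) | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8
Baseline                    | 16.00% | 30.00% | 44.24% | 52.88% | 56.56% | 66.94% | 67.16% | 73.06%
Action-Only Augmented       | 20.00% | 41.60% | 52.24% | 61.72% | 63.04% | 72.20% | 72.74% | 78.12%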
Programs only containing actions.
Intuitively, much of the difficulty in the Karel program synthesis task should come from inferring the control flow statements, i.e. if, ifElse, and while. Synthesizing a Karel program that only contains actions is intrinsically a much more straightforward task, which a relatively simple search algorithm (such as A*) can perform well.
We performed an experiment using test datasets generated by enumerating action-only programs of various lengths. As there are five actions (move, turnLeft, turnRight, putMarker, pickMarker), there exist $5^n$ textually unique action-only programs of length $n$. We sampled up to 500 unique programs of each length from 1 through 8. For each program, we generated 10 specifications, each containing 5 I/O pairs. We sampled each I/O pair from the set of all input grids in the existing training data (of which there are 6.7 million), so as to match its distribution as closely as possible.
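A minimal sketch of the enumeration step is given below. The function name and the base-5 decoding trick are illustrative choices made here; only the five action names and the cap of 500 programs per length come from the text.

```python
import itertools
import random

ACTIONS = ("move", "turnLeft", "turnRight", "putMarker", "pickMarker")


def sample_action_only_programs(length, limit=500, rng=None):
    """Return up to `limit` textually unique action-only programs of `length`."""
    rng = rng or random.Random()
    total = len(ACTIONS) ** length  # 5**length distinct token sequences
    if total <= limit:
        # Few enough programs to enumerate all of them.
        return [list(p) for p in itertools.product(ACTIONS, repeat=length)]
    programs = []
    # Draw distinct indices, then decode each index as a base-5 number so
    # that distinct indices always yield distinct programs.
    for index in rng.sample(range(total), limit):
        tokens = []
        for _ in range(length):
            index, digit = divmod(index, len(ACTIONS))
            tokens.append(ACTIONS[digit])
        programs.append(tokens)
    return programs
```

For lengths 1, 2, and 3 this returns all 5, 25, and 125 programs respectively; from length 4 onward (625 programs or more) the cap of 500 applies.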
Table 2 shows the results. Remarkably, even though the underlying programs have relatively low complexity, the model’s accuracy is lower on every one of these action-only test sets than on the existing provided test set. The generalization accuracy grows as the program length increases, even though longer programs should be harder to synthesize.
Among the existing action-only programs in the test set, the model’s generalization accuracy on that subset is 99.24%. Given the surprising nature of this result, we investigated the difference between the action-only programs we generated and those in the existing test set. We found that in the existing training and test sets, all programs contain at least two actions, and also contain at least one move action somewhere in the program. These and any undiscovered differences in the distribution of programs seem to have caused the gap in performance.
5 Application to Karel: Changing Training Distributions
In Section 4, we saw that the existing model performs much more poorly on certain test datasets that we constructed, compared to its performance on the existing test set as reported in Section 4.2. In light of the framework in Section 3, various imbalances of the salient random variables in the existing training data could have caused these gaps in performance. Then, a natural solution is to train using datasets constructed to avoid undesirable skews in the salient random variables, which should hopefully perform better across a variety of distributions.
5.1 Training Datasets with Uniform I/O
We generated a training dataset by taking the programs of the existing training set and synthesizing I/O pairs using the procedure described in Section 4.2. Furthermore, to make the “number of grids” salient variable uniform, we modify the training procedure by uniformly sampling a number between 1 and 5 for each minibatch, and using that many I/O grids to specify the program to the model. We trained a model on this data and then evaluated it on the same set of narrow distribution evaluation datasets as mentioned in Section 4.2. Table 1 and Figure 2 compare how this new model performs relative to the baseline model. The model trained on uniform I/O distributions maintains much higher generalization accuracy on the test sets of Section 4.2 than the baseline model. Note that the uniform I/O distribution is not simply a union of the tested distributions and is intended to cover all possible input specifications.
5.2 Real-world Benchmarks
We evaluated both the baseline model and the uniform model described in the previous section on a set of 36 real-world Karel programming problems. This dataset was compiled from the Hour of Code Initiative and Stanford University’s introductory computer science course, CS106A, with the problems being hand-designed as educational exercises for students. We found that the baseline model got 4 correct (11.1%) while the uniform model got 7 correct (19.4%) when both models were trained with 5 shown I/O pairs, i.e., without making the “number of grids” salient variable uniform. This further demonstrates the uniform model’s increased ability to generalize to out-of-distribution datasets, including those which are of interest to humans. Both models’ accuracies are still low compared to the performance on the synthetically generated test set.
After further analysis, we believe the models face two challenges on the real-world test set: (1) many of the real-world problems require long programs to solve compared to the synthetic test set, and (2) the specifications in the real-world examples always contain fewer than 5 I/O pairs, and often only a single one. However, the original training methodology for the model assumes that it is provided with a diverse set of 5 I/O pairs. When we modified the training procedure to vary the number of shown I/O examples as in Section 5.1, the baseline model got 12 correct (33.3%) and the uniform model got 11 correct (30.6%). This shows that the homogenization of the number of I/O pairs was effective. Overall, the real-world dataset’s I/O distribution was similar to the MSR training dataset in terms of the salient random variables we homogenized, so it is unsurprising that the baseline model was able to outperform the uniform model, consistent with the results on the MSR test set in Figure 2.
5.3 Augmenting Dataset with Action-only Programs
We observed in Section 4.3 that the model fails to do well on either action-only programs or programs with many control-flow statements. In the case of action-only programs, we found that the training data had been pruned to only include programs with at least two actions and at least one move, and in the case of programs with complex control flow, we found a similar sparsity in the training set.
As discussed in Section 3, the principled way to counteract this sparsity is to introduce uniformity into a set of salient variables. This methodology allows us to counteract both naturally sparse data (such as complex control flow) and spurious data preprocessing (such as requiring programs to have at least two actions).
In our case, we introduce uniformity into the length of action-only programs by synthesizing 20,000 programs of each length from 1 to 20, uniformly selecting tokens from the five action choices and generating I/O with the other salient random variables as close to the original training set as possible; we append these new programs to the original dataset to train a new model. Table 2 shows a clear improvement when homogenizing this salient random variable. The new model achieved 71.8% accuracy on the original test set, which is very close to the 73.52% accuracy of the baseline model.
5.4 Training Datasets with Narrowly Distributed I/O
As done for the uniform training dataset, we generated “narrow” training datasets by keeping the same programs as in the existing training data and replacing the I/O pairs with the process from Section 4.2.
We trained a variety of models and evaluated them on 12 datasets of different I/O feature distributions. Figure 3 summarizes the results of evaluating each model on each dataset by noting the performance of the model on the narrow dataset of the same type and the outcome on every other narrow dataset. For the models trained on the low variance datasets, we observed that they all consistently achieved between 60 and 70 percent accuracy on their own training distribution; however, the uniform model was able to achieve similar performance (between 57 and 80 percent accuracy) as shown in Table 1 and Figure 2. As such, we hypothesize that models trained on wide, uniform distributions can still perform comparably to models trained on narrow subdistributions, even when tested on the same subdistributions.
6 Application to Calculator
The Calculator task is given as follows: given an expression such as "5+4*(2+3)", compute the result modulo 10; in this case, 5. Calculator is a program induction task rather than a program synthesis task like Karel; nevertheless, creating data for the Calculator problem involves sampling from a context-free grammar. Additionally, Calculator is not as intricate a domain as Karel and thus we can more completely control the environment of data generation with less fear of lurking variables.
6.1 Calculator Environment
Calculator model.
Similar to the work by Zaremba & Sutskever (2014), we implement an LSTM that parses calculator expressions on a character level. We perform a 10class classification problem using a dense network on the final hidden state of the LSTM. The prediction is correct if it exactly matches the evaluation result of the expression, modulo 10.
Distributions of Calculator tasks.
We propose 4 distributions for calculator tasks: direct CFG sampling (DCFG), tensor2tensor sampling (T2T), “runs” CFG sampling (RCFG), and balanced sampling (BAL).
Two of our distributions represent reasonable ways in which a researcher might choose to sample data. The first is DCFG, which involves returning a digit with some probability $p$, or else recursively sampling two productions and combining them with a $+$, $-$, or $*$, each with probability $1/3$. This corresponds to a direct, weighted sampling of the CFG for the calculator grammar.
The second is T2T, which is used by the tensor2tensor library to sample arithmetic expressions in one of its examples (https://github.com/tensorflow/tensor2tensor/blob/8bd81e8fe9dafd4eb1dfa519255bcbe3e33c7ffa/tensor2tensor/data_generators/algorithmic_math.py). It involves sampling a depth $d$, then ensuring that the resulting AST has depth $d$ by forcing a random side of the operation production to be sampled to depth $d-1$ and the other side to be sampled to a random depth smaller than $d$.
The other two distributions represent potentially difficult or nonstandard problems that might appear in practical environments. RCFG is similar to DCFG but involves increasing the frequency of “runs” of the associative operations $+$ and $*$ by picking 2, 3, or 4 subexpressions and then combining them with the given symbol. BAL (balanced sampling) involves selecting a depth $d$ and then creating an AST that is a balanced binary tree of that depth. Importantly, regardless of sampling technique, redundant parentheses are removed. This increases the difficulty somewhat, as the order of operations needs to be established.
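Three of the four samplers, together with parenthesis removal and labeling, can be sketched as below. This is an illustrative sketch only: the AST encoding, the digit probability of 0.5, and the `max_depth` cap on DCFG (added here to guarantee termination) are assumptions, and RCFG is omitted for brevity.

```python
import random

OPS = "+-*"
PRECEDENCE = {"+": 1, "-": 1, "*": 2}
# An AST is either an int digit or a tuple (op, left, right).


def sample_dcfg(rng, p_digit=0.5, max_depth=6):
    """Direct CFG sampling; the depth cap is an assumption for termination."""
    if max_depth == 0 or rng.random() < p_digit:
        return rng.randint(0, 9)
    return (rng.choice(OPS),
            sample_dcfg(rng, p_digit, max_depth - 1),
            sample_dcfg(rng, p_digit, max_depth - 1))


def sample_t2t(rng, depth):
    """One side reaches depth - 1, the other a random depth below `depth`."""
    if depth == 0:
        return rng.randint(0, 9)
    deep = sample_t2t(rng, depth - 1)
    other = sample_t2t(rng, rng.randrange(depth))
    left, right = (deep, other) if rng.random() < 0.5 else (other, deep)
    return (rng.choice(OPS), left, right)


def sample_balanced(rng, depth):
    """Balanced binary tree with every leaf at the same depth."""
    if depth == 0:
        return rng.randint(0, 9)
    return (rng.choice(OPS),
            sample_balanced(rng, depth - 1),
            sample_balanced(rng, depth - 1))


def render(node, parent_op=None, is_right=False):
    """Print an AST, keeping only parentheses that change the value."""
    if isinstance(node, int):
        return str(node)
    op, left, right = node
    text = render(left, op, False) + op + render(right, op, True)
    if parent_op is None:
        return text
    lower = PRECEDENCE[op] < PRECEDENCE[parent_op]
    same = PRECEDENCE[op] == PRECEDENCE[parent_op]
    # a-(b+c) and a-(b-c) need parentheses; a+(b-c) and a*(b*c) do not.
    needed = lower or (same and is_right and parent_op == "-")
    return "(" + text + ")" if needed else text


def label(node):
    """Value of the expression modulo 10 (the 10-class target)."""
    if isinstance(node, int):
        return node % 10
    op, left, right = node
    a, b = label(left), label(right)
    return (a + b if op == "+" else a - b if op == "-" else a * b) % 10
```

For instance, `render(("+", 5, ("*", 4, ("+", 2, 3))))` gives the string from the task description, `5+4*(2+3)`, whose label is 5.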
6.2 Salient Variables and Methodology
We use the following salient variables: length (rounded to the nearest even number), number of operations, number of pairs of parentheses, mean parenthesized depth, and maximum parenthesized depth. Parenthesized depth is defined for each digit and refers to the number of nested parentheses it is in. For example, in (1+2)*(3-4)+5, the 1, 2, 3, and 4 are at depth 1 while the 5 is at depth 0.
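These five salient variables can be computed in a single pass over the expression string, as in the sketch below. The function name and dictionary keys are hypothetical, and rounding odd lengths up to the next even number is an assumption, since the text does not say which way ties are broken.

```python
def salient_variables(expression):
    """Compute the five Calculator salient variables for one expression."""
    depth = 0
    digit_depths = []
    operations = 0
    paren_pairs = 0
    for ch in expression:
        if ch == "(":
            depth += 1
            paren_pairs += 1
        elif ch == ")":
            depth -= 1
        elif ch.isdigit():
            # Parenthesized depth of a digit: number of enclosing parentheses.
            digit_depths.append(depth)
        elif ch in "+-*":
            operations += 1
    length = len(expression)
    return {
        # Assumption: odd lengths are rounded up to the next even number.
        "length": length + (length % 2),
        "operations": operations,
        "paren_pairs": paren_pairs,
        "mean_depth": sum(digit_depths) / len(digit_depths) if digit_depths else 0.0,
        "max_depth": max(digit_depths, default=0),
    }
```

For the expression `(1+2)*(3-4)+5`, this yields 4 operations, 2 pairs of parentheses, a maximum depth of 1, and a mean depth of 0.8 (four digits at depth 1 and one at depth 0).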
We constructed 12 distributions in total, corresponding to 2 task distributions, T2T and DCFG, which represent the “natural” sampling techniques a researcher might employ, and 6 homogenization strategies: one unhomogenized and five homogenized, one for each salient variable. We then evaluated each model on a fresh evaluation set sampled from a mixture of the four unhomogenized distributions (T2T, DCFG, RCFG, BAL).
6.3 Results
The original performances and improvements created by homogenizing different random variables can be found in Table 3. On average, homogenizing the DCFG and T2T distributions caused the accuracy to increase by 5.00pp and 2.84pp, respectively.
We note that the Calculator domain is much simpler than Karel when considering both input complexity (grid worlds versus arithmetic expressions) and output complexity (a DSL program versus a single digit). Furthermore, the naive sampling approaches and their versions with one homogenized random variable differ less from each other in Calculator than what we observed in Karel (see Table 1 for the dramatic effect in Karel). We hypothesize that it is this difference in complexity that explains the smaller (but still consistent) effect of homogenizing salient random variables in Calculator as compared to Karel.
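Table 3: Accuracy on the original (unhomogenized) distributions and the change from homogenizing each salient variable.

Distribution | Original | Length  | Max Depth | Mean Depth | #Operations | #Parens
T2T          | 83.83%   | +4.35pp | +4.24pp   | +2.14pp    | +1.19pp     | +2.32pp
DCFG         | 78.25%   | +3.84pp | +5.92pp   | +4.02pp    | +6.72pp     | +4.51pp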
7 Conclusion
We demonstrate that existing sampling methods for randomly generating input-output examples have unintended and overlooked distribution flaws in both the Calculator and the Karel domains. These flaws prevent models trained on these distributions from generalizing to other test distributions, even if those distributions are very simple. To resolve these problems, we propose a robust strategy for controlling and evaluating the bias of synthetic data distributions over programs and specifications by defining certain random variables that capture desired features of the program and input spaces, such as the number of parentheses in a calculator expression, and specifically manipulating their distributions. Equipped with our method, deep networks exhibit an increase in cross-distribution test accuracy, at the expense of a minor decrease in on-distribution test accuracy. We believe this methodology will lead to more rigorous evaluation of synthesis techniques and, moreover, aid them in learning better models that generalize well.
Equipped with a set of hand-designed salient random variables, we demonstrate the effectiveness of homogenizing synthetic datasets over this set. One of the core limitations of our approach is that the salient random variables are engineered by hand. This requires the scientist to have insights about the structure of the training examples they are randomly generating. Therefore, a promising extension of this algorithm is to automatically select which salient random variables to use and automatically compute these variables, potentially via the use of a general unsupervised learning algorithm.
We evaluate our method on two domains: the Karel DSL, and a calculator expression parser. There is a natural question of whether the methods developed in this paper will improve out-of-distribution generalization on applications other than program synthesis which use synthetic data; for example, a convolutional neural network that receives renderings of a virtual environment, for the robotics and vision domains mentioned in Section 2.1. Providing a thorough evaluation of our proposed homogenization algorithm on alternative domains is a promising area for future work.
Acknowledgements
This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915, Berkeley Deep Drive, and DARPA D3M under Grant No. FA8750-17-2-0091. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation and DARPA.
References
 Alur et al. (2013) Rajeev Alur, Rastislav Bodík, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, pp. 1–8, 2013.
 Balog et al. (2016) Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.
 Bhupatiraju et al. (2017) Surya Bhupatiraju, Rishabh Singh, Abdelrahman Mohamed, and Pushmeet Kohli. Deep API programmer: Learning to program with apis. CoRR, abs/1704.04327, 2017. URL http://arxiv.org/abs/1704.04327.
 Bousmalis et al. (2017) Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, Sergey Levine, and Vincent Vanhoucke. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. CoRR, abs/1709.07857, 2017. URL http://arxiv.org/abs/1709.07857.

 Bunel et al. (2018) Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In ICLR, 2018.
 Cai et al. (2017) Jonathon Cai, Richard Shin, and Dawn Song. Making neural programming architectures generalize via recursion. CoRR, abs/1704.06611, 2017. URL http://arxiv.org/abs/1704.06611.
 Christiano et al. (2016) Paul F. Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, and Wojciech Zaremba. Transfer from simulation to real world through learning deep inverse dynamics model. CoRR, abs/1610.03518, 2016. URL http://arxiv.org/abs/1610.03518.
 Devlin et al. (2017) Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdelrahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy I/O. In ICML, pp. 990–998, 2017.
 Gaunt et al. (2016) Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. Terpret: A probabilistic programming language for program induction. CoRR, abs/1608.04428, 2016. URL http://arxiv.org/abs/1608.04428.
 Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 Gulwani et al. (2012) Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples. Commun. ACM, 55(8):97–105, 2012. doi: 10.1145/2240236.2240260. URL http://doi.acm.org/10.1145/2240236.2240260.
 Gulwani et al. (2015) Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H. Muggleton, Ute Schmid, and Benjamin G. Zorn. Inductive programming meets the real world. Commun. ACM, 58(11):90–99, 2015. doi: 10.1145/2736282. URL http://doi.acm.org/10.1145/2736282.
 Joulin & Mikolov (2015) Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, pp. 190–198, 2015.
 Kaiser & Sutskever (2015) Lukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. CoRR, abs/1511.08228, 2015.
 Kurach et al. (2016) Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. ICLR, 2016.
 Lin et al. (2014) Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B. Tenenbaum, and Stephen Muggleton. Bias reformulation for one-shot function induction. In ECAI 2014 - 21st European Conference on Artificial Intelligence, 18-22 August 2014, Prague, Czech Republic - Including Prestigious Applications of Intelligent Systems (PAIS 2014), pp. 525–530, 2014.
 Manna & Waldinger (1971) Zohar Manna and Richard J. Waldinger. Toward automatic program synthesis. Commun. ACM, 14(3):151–165, 1971.

 Muggleton (1991) Stephen Muggleton. Inductive logic programming. New generation computing, 8(4):295–318, 1991.
 Neelakantan et al. (2015) Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. CoRR, abs/1511.04834, 2015. URL http://arxiv.org/abs/1511.04834.
 Parisotto et al. (2017) Emilio Parisotto, Abdelrahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. ICLR, 2017.
 Pattis (1981) Richard E Pattis. Karel the robot: a gentle introduction to the art of programming. John Wiley & Sons, Inc., 1981.
 Peng et al. (2017) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537, 2017. URL http://arxiv.org/abs/1710.06537.
 Pinto et al. (2017) Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. CoRR, abs/1710.06542, 2017. URL http://arxiv.org/abs/1710.06542.
 Reed & de Freitas (2015) Scott E. Reed and Nando de Freitas. Neural programmer-interpreters. CoRR, abs/1511.06279, 2015. URL http://arxiv.org/abs/1511.06279.
 Shrivastava et al. (2016) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. CoRR, abs/1612.07828, 2016. URL http://arxiv.org/abs/1612.07828.
 Solar-Lezama (2013) Armando Solar-Lezama. Program sketching. STTT, 15(5-6):475–495, 2013. doi: 10.1007/s10009-012-0249-7. URL https://doi.org/10.1007/s10009-012-0249-7.
 Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.
 Waldinger & Lee (1969) Richard J. Waldinger and Richard C. T. Lee. PROW: A step toward automatic program writing. In Proceedings of the 1st International Joint Conference on Artificial Intelligence, Washington, DC, USA, May 7-9, 1969, pp. 241–252, 1969.
 Xiao et al. (2018) Da Xiao, Jo-Yu Liao, and Xingyuan Yuan. Improving the universality and learnability of neural programmer-interpreters with combinator abstraction. CoRR, abs/1802.02696, 2018. URL http://arxiv.org/abs/1802.02696.
 Zaremba & Sutskever (2014) Wojciech Zaremba and Ilya Sutskever. Learning to execute. CoRR, abs/1410.4615, 2014. URL http://arxiv.org/abs/1410.4615.
Appendix A The Karel Domain

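Figure 4: (a) Grammar of the Karel DSL. (b) Declarative specification of the space of valid input worlds, with fields GridWidth, GridHeight, Markers, Walls, KarelLoc, and Orientation.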
A declarative specification of the space of valid input worlds to Karel programs is shown in Figure 4(b). As with Bunel et al. (2018), we assume a bound on the input grid size to be . Each cell in a grid can either be empty, contain an obstacle (i.e. a wall, specified by the list Walls), or contain markers (defined using the list Markers). The agent starts at some cell denoted by KarelLoc in the grid (which may contain markers but no obstacle) with a particular orientation direction denoted by Orientation.
The grammar for the Karel DSL we consider in this work is shown in Figure 4(a). The DSL allows the Karel agent to perform a move action to move one step in the grid in the direction of the orientation, actions turnLeft and turnRight to change its orientation direction, and actions pickMarker and putMarker to manipulate markers. The language contains if, ifElse, while constructs with conditionals {front,left,right}IsClear, markersPresent, and their negations. The repeat construct allows for a fixed number of repetitions. Note that the language does not contain any variables or auxiliary functions.
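The world state and primitive actions described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work: the class name `KarelWorld`, the coordinate convention (y increasing to the north), and the choice to treat blocked moves and empty-cell picks as no-ops are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Orientations as (dx, dy) unit vectors, listed counter-clockwise, so that
# turning left is +1 and turning right is -1 in this list (an assumed encoding).
DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # east, north, west, south


@dataclass
class KarelWorld:
    width: int
    height: int
    karel_loc: tuple                               # (x, y) cell of the agent
    orientation: int                               # index into DIRECTIONS
    walls: set = field(default_factory=set)        # cells holding an obstacle
    markers: dict = field(default_factory=dict)    # cell -> marker count

    def _is_clear(self, direction_index):
        dx, dy = DIRECTIONS[direction_index % 4]
        x, y = self.karel_loc[0] + dx, self.karel_loc[1] + dy
        in_bounds = 0 <= x < self.width and 0 <= y < self.height
        return in_bounds and (x, y) not in self.walls

    # Conditionals of the DSL
    def front_is_clear(self): return self._is_clear(self.orientation)
    def left_is_clear(self): return self._is_clear(self.orientation + 1)
    def right_is_clear(self): return self._is_clear(self.orientation - 1)
    def markers_present(self): return self.markers.get(self.karel_loc, 0) > 0

    # Actions of the DSL
    def move(self):
        # Assumption: moving into a wall or off the grid leaves the state unchanged.
        if self.front_is_clear():
            dx, dy = DIRECTIONS[self.orientation]
            self.karel_loc = (self.karel_loc[0] + dx, self.karel_loc[1] + dy)

    def turn_left(self): self.orientation = (self.orientation + 1) % 4
    def turn_right(self): self.orientation = (self.orientation - 1) % 4

    def put_marker(self):
        self.markers[self.karel_loc] = self.markers.get(self.karel_loc, 0) + 1

    def pick_marker(self):
        # Assumption: picking from a cell without markers is a no-op.
        if self.markers_present():
            self.markers[self.karel_loc] -= 1


# The DSL program `while (frontIsClear) { move }` written against this sketch:
world = KarelWorld(width=4, height=1, karel_loc=(0, 0), orientation=0, walls={(3, 0)})
while world.front_is_clear():
    world.move()
```

In this example the agent walks east until the wall at cell (3, 0) blocks it, which shows how a `while` construct with a `frontIsClear` conditional depends on the input world.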
Appendix B Salient Variable Homogenization Algorithm
The following is a more formal description of the algorithm described in Section 3, together with proofs of correctness and an investigation of the tolerance parameter's practical effect.
B.1 Full Pseudocode
Let be some space that is sampled by some original distribution . Let be the space of a salient variable which is calculated by ; it is a finite set. Let be our tolerance.
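As a rough illustration of the setup above, the following Python sketch implements a count-based rejection sampler. It is a sketch under stated assumptions rather than the exact procedure of this paper: the function name `sampled_homogenize`, its argument names, and in particular the acceptance rule (keep a sample with probability equal to the smallest empirical frequency plus the tolerance, divided by the sample's own empirical frequency plus the tolerance) are assumptions chosen to match the description of counts over past samples and a tolerance parameter.

```python
import random
from collections import Counter


def sampled_homogenize(sample, salient, values, tolerance, rng=random):
    """Yield samples whose salient variable is closer to uniform.

    sample:    zero-argument callable drawing one item from the original distribution
    salient:   function mapping an item to its salient-variable value
    values:    the finite set of possible salient-variable values
    tolerance: positive smoothing constant; larger values reject fewer samples
    """
    counts = Counter()  # occurrences of each salient value among past draws
    total = 0
    while True:
        x = sample()
        s = salient(x)
        counts[s] += 1
        total += 1
        # Empirical frequencies estimated from all draws so far.
        freq = counts[s] / total
        min_freq = min(counts[v] for v in values) / total
        # Assumed acceptance rule: rarer salient values are kept more often.
        accept_prob = (min_freq + tolerance) / (freq + tolerance)
        if rng.random() < accept_prob:
            yield x
```

For instance, wrapping a sampler that returns the values 1, 2, and 3 with probabilities 0.7, 0.2, and 0.1 should, under this assumed rule and a small tolerance, yield the three values at roughly equal rates.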
B.2 Proof of Correctness
Let the initial sampling distribution be and the resulting distribution be . We use the notation to refer to the probability that given that is sampled from the distribution . refers to the count for the salient variable value among past samples from , as defined in SampledHomogenize.
Theorem.
The Salient Variable Homogenization algorithm produces samples from a distribution close to uniform. Formally, after samples are drawn from distribution , we have that the resulting homogenized distribution satisfies
with probability at least , where and has been set to .
Proof.
We can simplify the probability as
(for ; see Probability Simplification Lemma below).
Let . We have that by assumption and substituting in . By the Count Bounding Lemma we have that for all with probability at least . The following computations assume for all and thus are valid with probability .
We have . We also have
and thus .
Combining the previous two ranges via a division, we have
We have that and since for small . Thus, we have that and therefore we have
we thus have that , completing our proof. ∎
Lemma (Count Bounding).
We have that if samples have been drawn from that
where
Proof.
We can model as a sum of independent Bernoulli trials with probability . Using the Chernoff bound, we have that
and
Thus, by the union bound, we have that
For we have that and . Thus, we can restate the previous inequality as
or in other words
Thus we can bound the RHS as
where the first inequality is since and the second is since .
We can again apply the union bound to get that
∎
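For reference, the standard multiplicative Chernoff bounds for a sum of independent Bernoulli trials are restated below. The symbols $C$, $n$, $p$, $\mu$, and $\delta$ are notation chosen here for illustration, and the exact variant and constants used in the proof above may differ.

```latex
% C is a sum of n independent Bernoulli(p) trials, with mean \mu = np, and 0 < \delta < 1.
\Pr\left[C \ge (1+\delta)\mu\right] \le \exp\!\left(-\frac{\delta^{2}\mu}{3}\right),
\qquad
\Pr\left[C \le (1-\delta)\mu\right] \le \exp\!\left(-\frac{\delta^{2}\mu}{2}\right),
% Combining the two tails with the union bound:
\Pr\left[\,\lvert C - \mu \rvert \ge \delta\mu\,\right] \le 2\exp\!\left(-\frac{\delta^{2}\mu}{3}\right).
```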
Lemma (Probability Simplification).
Proof.
Since we can simplify
we have that
∎
B.3 Efficiency Analysis
We show that the number of samples from the original distribution required to produce a sample from the homogenized distribution is in expectation.
In the Salient Variable Homogenization algorithm, we have the probability of not rejecting a given sample as . We know that
Since each sample is independent, we can model this as a geometric distribution, and thus we have that the expected number of tries
. We thus have that in expectation, we need to sample samples from the original distribution to produce one homogenized sample.
B.4 Empirical Effect of Varying the Tolerance
The solid line is an upper bound derived in Section B.3. As seen in the measured samples for different values of the tolerance, the upper bound closely tracks the maximum height of the samples, leaving very little slack on the DCFG dataset and only slightly more on the T2T dataset; it is thus a close bound. As the tolerance increases, the bound approaches its limit of 1, indicating that no samples are rejected by the algorithm.
| Dataset | Original | | | | |
|---|---|---|---|---|---|
| T2T | 83.83% | +2.84pp | +1.31pp | +1.33pp | +2.45pp |
| DCFG | 78.25% | +5.00pp | +4.50pp | +3.54pp | +3.42pp |
Increasing the tolerance parameter has the theoretical effect of causing the homogenized distribution to deviate more from uniform in its salient random variables, as shown in B.3. In practice, we find that performance gains tend to decrease as the tolerance increases, although the effect was less pronounced on the T2T dataset, potentially because the unhomogenized T2T distribution is closer to uniform, so that homogenization has only a limited effect at larger tolerance values.
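The trade-off discussed in Sections B.3 and B.4 can be illustrated numerically. The sketch below assumes the salient-variable probabilities are known exactly and assumes the acceptance rule "keep value s with probability (p_min + tolerance) / (p_s + tolerance)"; both the rule and the function names `expected_draws` and `homogenized_probs` are assumptions made for illustration, not formulas taken from this paper. The expected number of draws uses the mean of a geometric distribution, as in Section B.3.

```python
def expected_draws(probs, tolerance):
    """Expected number of original draws per accepted sample.

    probs: probability of each salient-variable value under the original
           distribution (assumed known here; they should sum to 1)
    """
    p_min = min(probs)
    # Overall acceptance probability under the assumed rule.
    accept = sum(p * (p_min + tolerance) / (p + tolerance) for p in probs)
    # Mean of a geometric distribution with success probability `accept`.
    return 1.0 / accept


def homogenized_probs(probs, tolerance):
    """Distribution of the salient variable among accepted samples."""
    p_min = min(probs)
    kept = [p * (p_min + tolerance) / (p + tolerance) for p in probs]
    total = sum(kept)
    return [k / total for k in kept]


if __name__ == "__main__":
    original = [0.7, 0.2, 0.1]  # hypothetical skewed distribution
    for tol in (0.001, 0.01, 0.1, 1.0):
        print(tol, expected_draws(original, tol), homogenized_probs(original, tol))
```

Under these assumptions, a larger tolerance drives the expected number of draws toward 1 (almost nothing is rejected) while the accepted distribution drifts back toward the original, skewed one.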
B.5 Empirical Evidence for Increase in Uniformity
Empirically, the Salient Variable Homogenization algorithm led to increases in uniformity in the variable being homogenized. We measure uniformity by KL divergence between the distribution being measured and the uniform distribution. A table of relative improvements is given in Table 5.
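Table 5: Relative improvement in uniformity, measured by KL divergence from the uniform distribution, for each salient variable after homogenization.

| Dataset | Length | Max Depth | Mean Depth | #Operations | #Parens |
|---|---|---|---|---|---|
| DCFG | 42.98% | 30.77% | 27.05% | 43.95% | 23.63% |
| T2T | 46.68% | 30.45% | 13.99% | 38.82% | 36.91% |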
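The uniformity measure can be sketched as follows. This is a minimal illustration, assuming the divergence is taken from the empirical distribution to the uniform distribution, KL(P || U), and that a relative improvement is the fractional reduction in that divergence; the direction of the divergence, the interpretation of the table entries, and the function names are assumptions.

```python
import math
from collections import Counter


def kl_from_uniform(samples, support):
    """KL(P || U) in nats, where P is the empirical distribution of `samples`
    and U is uniform over `support` (all samples are assumed to lie in it)."""
    counts = Counter(samples)
    n = len(samples)
    k = len(support)
    kl = 0.0
    for value in support:
        p = counts[value] / n
        if p > 0:  # terms with p = 0 contribute nothing, by convention
            kl += p * math.log(p * k)
    return kl


def relative_improvement(kl_before, kl_after):
    """Fractional reduction in KL divergence (assumed meaning of the table entries)."""
    return (kl_before - kl_after) / kl_before
```

A perfectly uniform sample gives a divergence of zero, so a positive relative improvement means the homogenized data is closer to uniform than the original data.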