Representing Partial Programs with Blended Abstract Semantics

12/23/2020 ∙ by Maxwell Nye, et al. ∙ MIT 3

Synthesizing programs from examples requires searching over a vast, combinatorial space of possible programs. In this search process, a key challenge is representing the behavior of a partially written program before it can be executed, to judge if it is on the right track and predict where to search next. We introduce a general technique for representing partially written programs in a program synthesis engine. We take inspiration from the technique of abstract interpretation, in which an approximate execution model is used to determine if an unfinished program will eventually satisfy a goal specification. Here we learn an approximate execution model implemented as a modular neural network. By constructing compositional program representations that implicitly encode the interpretation semantics of the underlying programming language, we can represent partial programs using a flexible combination of concrete execution state and learned neural representations, using the learned approximate semantics when concrete semantics are not known (in unfinished parts of the program). We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs, such as loops and higher-order functions, and can be used to synthesize programs more accurately for a given search budget than pure neural approaches in several domains.




1 Introduction

Inductive program synthesis – the problem of inferring programs from examples – offers the promise of building machine learning systems which are interpretable, generalize quickly, and allow us to solve structured tasks such as planning and interacting with computer systems. In recent years, neurally-guided program synthesis, which uses deep learning to guide search over the space of possible programs, has emerged as a promising approach (Balog et al., 2016; Devlin et al., 2017). In this framework, partially-constructed programs are judged to determine if they are on the right track and to predict where to search next (see Figure 1). A key challenge in neural program synthesis is representing the behavior of partially written programs, in order to make these judgments. In this work, we present a novel method for representing the semantic content of partially written code, which can be used to guide search to solve program synthesis tasks.

Consider a tower building domain where a hand drops blocks, Tetris-style, onto a vertical 2D scene. For example, imagine that buildColumn(n) stacks n vertically-oriented blocks at the current cursor location, and moveHand(n) moves the cursor n spaces to the right. Given an image of a scene (Figure 1), our task is to write a program which builds a tower matching the image. To do this, a model can perform search in the space of programs, iteratively adding code until the program is complete. While attempting to synthesize a program, imagine arriving at a partially-constructed program s (short for sketch), where HOLE signifies unfinished code:

Note that this partial program can’t reach the goal state, because the target image has columns of height 2, but this program can only build columns of height 1. For an algorithm to determine whether it should expand this partial program or explore another part of the search space, it needs to answer the following question: is this partial program on the right track to satisfy the goal? Answering this question requires an effective representation of partial programs.

Figure 1: Schematic overview of the search procedure and representational scheme. We characterize program synthesis as a goal-conditioned search through the space of partial programs (left), and propose a novel representational scheme (blended abstract semantics) to facilitate this search process. Left: a particular trajectory through the space of partial programs, where the goal is to find a program satisfying the target image. Right: three encoding schemes for partial programs, which can each be used as the basis of a code-writing search policy and code-assessing value function.

Neural program synthesis techniques differ in how they represent programs. Some represent programs by their syntax (Devlin et al., 2017; Allamanis et al., 2018), encoding the structure of the programs using sequence and graph neural networks. Recently, approaches which instead represent partial programs via their semantic state have been shown to be particularly effective. In these execution-guided neural synthesis approaches (Chen et al., 2018; Ellis et al., 2019; Zohar and Wolf, 2018), partial programs are executed and represented with their return values. (To see why this is helpful, consider two syntactically distinct expressions that evaluate to the same value; a syntax-based model needs to represent them separately, whereas a model using a semantic representation will represent both by that shared value.) However, execution is not always possible for a partial program. In our running example, before the HOLE is filled with an integer value, we cannot meaningfully execute the partially-written loop in the partial program. This is a common problem for languages containing higher-order functions and control flow, where execution of partially written code is often ill-defined. (See Peleg et al. (2020) for a discussion in the context of bottom-up synthesis.) Thus, a key question is: How might we represent the semantics of unfinished code?

A classic method for representing program state, known as abstract interpretation (Cousot and Cousot, 1977), can be used to reason about the set of states that a partial program could reach, given the possible instantiations of the unfinished parts of the program. Using abstract interpretation, an approximate execution model can determine if an unfinished program will eventually satisfy a goal specification. For example, in our tower-building domain, an abstract interpreter could be designed to track, for every horizontal location, the minimum tower height which all continuations are guaranteed to exceed. However, this technique is often low-precision; hand-designed abstract execution models greatly overapproximate the set of possible execution states, and contain no notion of what code is likely to be written.

We hypothesize that, by mimicking the compositional structure of abstract interpretation, learned components can be used to effectively represent ambiguous program state. In this work, we make two contributions. First, we introduce neural abstract semantics, in which a compositional, approximate execution model is used to represent partially written code. Second, we introduce blended abstract semantics, which aims to represent the state of unfinished programs as faithfully as possible by concretely executing program components whenever possible, and otherwise approximating program state with a learned abstract execution model.

Consider again the partial program from our running example and the blended abstract semantics encoding in Figure 1. The sub-expression buildColumn(1) is fully concrete, and can thus be concretely executed to render an image. On the other hand, for functions whose arguments are not fully defined, such as moveHand, we instead employ abstract neural modules to represent the execution state. For this example, blended neural execution makes it easy to recognize that this is not a suitable partial program, because no integer argument to moveHand, which controls the spacing between the columns, would make the resulting state match the goal.

This combination of learned execution and concrete execution allows robust representation of partial programs, which can be used for downstream synthesis tasks. We show that our model can effectively learn to represent partial program states for languages where previous execution-guided synthesis techniques are not applicable. In summary,

  • We introduce blended abstract semantics, a novel method for representing the semantic state of partially written programs inspired by abstract interpretation.

  • We describe how to integrate our program representations into existing approaches for learning search policies and search heuristics.

  • We validate our new approach with program synthesis experiments in three domains: tower construction, list processing, and string editing. We show that our approach outperforms neural synthesis baselines, solving at least 5% more programs in each domain.

2 Related work

Synthesizing programs from examples is a classic AI problem (Backus et al., 1957) which has seen advances from the Programming Languages community (Gulwani et al., 2017; Gottschlich et al., 2018; Solar-Lezama, 2008).

Neurally-guided search Recently, much progress has been made using neural methods to aid search. Enumerative approaches (Balog et al., 2016; Shi et al., 2020) use neural methods to guide an enumerative synthesizer, and can be quite suitable for small-scale domains, but can scale poorly to larger programs and domains. Translation-based techniques (Devlin et al., 2017) treat program synthesis as a sequence-to-sequence problem, and employ state-of-the-art neural sequence modeling techniques, such as recurrent neural networks (RNNs) with attention. Hybrid approaches which use sketches (Murali et al., 2017; Nye et al., 2019; Dong and Lapata, 2018) trade off computation between translation and enumeration components. These techniques can exhibit better generalization than translation-based approaches and more precise predictions than enumerative approaches (Nye et al., 2019). To combine neural learning and search, our approach follows the framework laid out in Ellis et al. (2019), where neural networks are used to guide a search over the space of possible partial programs defining a Markov decision process (MDP).

Program representation Prior work has studied neural representation of programs. Odena and Sutton (2019) propose property signatures to represent input-output examples, and use property signatures to guide an enumerative search. Graph neural networks have also been used to encode the syntax of programs (Allamanis et al., 2018; Brockschmidt et al., 2018; Dinella et al., 2019) for bug fixing, variable naming, and synthesis. This work has mostly focused on performing small edits to programs from real datasets. Our objective is to synthesize entire programs from specifications.

Execution-guided synthesis Recent work has introduced the notion of “execution-guided neural program synthesis” (Ellis et al., 2019; Chen et al., 2018; Zohar and Wolf, 2018). In this framework, the neural representations used for search are conditioned on the executed program state instead of the program syntax. These techniques have been shown to solve difficult search problems outside the scope of enumerative or syntax-based neural synthesis alone. However, such execution-guided approaches have several limitations. We aim to generalize execution-guided synthesis, so that it can be applicable to a wider range of domains, search techniques, and programming language constructs.

Abstract Interpretation Our work is directly inspired by abstract interpretation-based synthesis (Singh and Solar-Lezama, 2011; Wang et al., 2017; Hu et al., 2020). These approaches use abstract interpretation (Cousot and Cousot, 1977) to determine if a candidate partial program is realizable under the given specification, thereby pruning the search space of programs. We see our approach as a learning-based extension to this line of work.

Neural Modules We employ neural module networks (Andreas et al., 2016; Johnson et al., 2017) to implement blended abstract semantics, which aims to provide a learned execution scheme inspired by abstract interpretation. This approach is also related to other tree-structured encoders (Socher et al., 2011; Dyer et al., 2016).

3 Blended Abstract Semantics

Consider the problem of synthesizing arithmetic expressions from input–output pairs. Suppose we have a context-free grammar for expressions built from the variable x, integer constants, and the operators + and *, together with a specification consisting of input–output pairs. Suppose further that we have a candidate program (2 * x) + 1. To check that this program is consistent with the specification, we can evaluate it on the inputs in the specification according to the concrete semantics of the language: to evaluate (2 * x) + 1 on an example, we observe that the expression (2 * x) evaluates to an integer determined by the example's value of x, and the expression 1 evaluates to the integer 1; thus the whole expression evaluates to their sum, which matches the example's output, as desired. Repeating this process with each remaining example likewise returns its specified output.

Formally, let a context denote an assignment of values to variables. The concrete value of an expression in a context is given by the concrete semantics of the language. (Domains with lambdas have slightly more complicated semantics; see Appendix A for details.)

The goal of synthesis is to find an expression whose value under concrete semantics matches the specified output for every input in the specification.
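The concrete semantics and the consistency check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of expressions, the function names `evaluate` and `satisfies`, and the input–output pairs in `example_spec` are hypothetical choices made for this sketch, not taken from the paper.

```python
from typing import Dict, List, Tuple, Union

# Assumed encoding: an expression is an int constant, a variable name
# such as "x", or a tuple (op, left, right) with op in {"+", "*"}.
Expr = Union[int, str, Tuple[str, "Expr", "Expr"]]
Spec = List[Tuple[Dict[str, int], int]]


def evaluate(expr: Expr, context: Dict[str, int]) -> int:
    """Concrete value of a complete (hole-free) expression in a context."""
    if isinstance(expr, int):
        return expr                      # constants evaluate to themselves
    if isinstance(expr, str):
        return context[expr]             # variables are looked up in the context
    op, left, right = expr
    a, b = evaluate(left, context), evaluate(right, context)
    if op == "+":
        return a + b
    if op == "*":
        return a * b
    raise ValueError(f"unknown operator: {op}")


def satisfies(expr: Expr, spec: Spec) -> bool:
    """True when the expression reproduces every output in the specification."""
    return all(evaluate(expr, context) == output for context, output in spec)


candidate = ("+", ("*", 2, "x"), 1)              # (2 * x) + 1
example_spec = [({"x": 3}, 7), ({"x": 5}, 11)]   # hypothetical input-output pairs

print(satisfies(candidate, example_spec))
```

The evaluator is only defined for complete expressions; partial programs containing a HOLE are exactly the case it cannot handle.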

Iterative construction of partial programs

Where did the expression (2 * x) + 1 come from? Neurally-guided synthesis techniques generally employ search. In this work, we use top-down search: starting with the top-level (incomplete) expression HOLE, we consider all possible expansions of the HOLE under the grammar, and select the one we believe is most likely to succeed (Figure 1 left). Concrete semantics cannot be used for this selection, because expressions such as x + HOLE are incomplete and cannot be executed. Thus, we need a different mechanism to guide search. The more effectively we can filter the set of incomplete candidate programs, the faster our synthesis algorithm.

Conventional abstract interpretation solves this problem by defining an alternative semantics for which even incomplete expressions can be evaluated. Consider the candidate expression HOLE * 2. No matter how the HOLE is filled, the expression returns an even number, so it cannot be consistent with the specification above. In many problems, we can define a space of “abstract values” (like even integer) and abstract semantics so that the abstract value of a partial program can be determined. This allows us to rule out partial programs on the basis of the abstraction alone (Wang et al., 2017). However, constructing appropriate abstractions is difficult and requires domain-specific engineering; an ideal procedure would automatically discover an effective space of abstract interpretations.
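The parity argument above can be made concrete with a small hand-written abstract interpreter. The sketch below is an illustration only: the three-valued parity domain, the tuple encoding of expressions, and the names `abstract_eval` and `may_satisfy` are assumptions introduced here, and the specification is a hypothetical one with odd outputs.

```python
EVEN, ODD, ANY = "even", "odd", "any"   # abstract values; ANY means "unknown parity"
HOLE = "HOLE"


def parity_of(n: int) -> str:
    return EVEN if n % 2 == 0 else ODD


def abstract_eval(expr, context):
    """Parity of an expression that may contain HOLEs."""
    if expr == HOLE:
        return ANY                          # unwritten code could return anything
    if isinstance(expr, int):
        return parity_of(expr)
    if isinstance(expr, str):
        return parity_of(context[expr])
    op, left, right = expr
    a, b = abstract_eval(left, context), abstract_eval(right, context)
    if op == "*":
        if EVEN in (a, b):
            return EVEN                     # even * anything is even
        return ODD if (a, b) == (ODD, ODD) else ANY
    if op == "+":
        if ANY in (a, b):
            return ANY
        return EVEN if a == b else ODD
    raise ValueError(f"unknown operator: {op}")


def may_satisfy(expr, spec) -> bool:
    """False only when the abstraction proves the partial program cannot work."""
    for context, output in spec:
        value = abstract_eval(expr, context)
        if value != ANY and value != parity_of(output):
            return False
    return True


spec = [({"x": 3}, 7), ({"x": 5}, 11)]        # hypothetical specification, odd outputs
print(may_satisfy(("*", HOLE, 2), spec))      # HOLE * 2 is always even
print(may_satisfy(("+", "x", HOLE), spec))    # x + HOLE cannot be ruled out
```

The interpreter is sound but coarse: it never rejects a viable candidate, yet it answers "unknown" for most partial programs, which is the low-precision behavior that motivates a learned alternative.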

Neural abstract semantics

Figure 2: (a) Example applications of the Embed function. (b) Neural abstract module for +. (c) Neural placeholder module encoding a HOLE with a given context. (d) Neural abstract semantic encoding of the partial program 1 + HOLE with a given context.

As a first step, we implement the abstract interpretation procedure with a neural network. This is a natural choice: neural networks excel at representation learning, and the goal of abstract interpretation is to encode an informative representation of the set of values that could be returned by a partial program. For the program 1 + HOLE, we can encode the expression 1 to a learned representation (Figure 2a, top), likewise encode HOLE (Figure 2c), and finally employ a learned abstract implementation of the + operation (Figure 2b).

For concrete leaf nodes, such as constants or variables bound to constants, neural semantics are given using a state embedding function, Embed, which maps any concrete state in the programming language into a vector representation. If the input to Embed is already vector-valued, Embed performs the identity operation. Neural placeholders provide a method for computing a vector representation of unwritten code, denoted by the HOLE token. To compute the representation for HOLE, we define a neural embedding function which takes a context and outputs a vector. For each built-in function (including higher-order functions), the neural abstract semantics of the function are given by a separate neural module (a learned vector-valued function as in Andreas et al. (2016)) with the same arity as the function. Therefore, computing the neural semantics of a function call means applying the corresponding neural module to the representations of its arguments, which returns a vector. Since the neural semantics mirrors the concrete semantics, its implementation does not require changes to the underlying programming language. Formally, neural semantics involve a slightly larger set of cases than concrete semantics, since HOLE tokens and vector-valued inputs must also be handled.

This encoding is only one way to define a neural semantics, adopting a relatively simple and generic representation for all program components. For a discussion of its limitations and other, more sophisticated representations that could be explored in future work, see Appendix C.
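A minimal sketch of the recursive structure of neural abstract semantics follows. It is not the paper's implementation: the modules here are untrained single tanh layers with random weights, the embedding size `D`, the integer featurization inside `embed`, and the assumption that the context binds only the variable x are all hypothetical choices made to keep the example self-contained and runnable with the standard library.

```python
import math
import random

D = 4                      # embedding size (arbitrary for this sketch)
HOLE = "HOLE"
rng = random.Random(0)     # fixed seed so the sketch is deterministic


def make_layer(n_in):
    """Random, untrained weights standing in for a learned module (n_in -> D)."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(D)]


def apply_layer(weights, vec):
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]


EMBED_W = make_layer(3)                                  # state embedding
HOLE_W = make_layer(D)                                   # neural placeholder
MODULES = {"+": make_layer(2 * D), "*": make_layer(2 * D)}   # one module per function


def embed(state):
    """Maps a concrete integer to a vector; vectors pass through unchanged."""
    if isinstance(state, list):
        return state
    return apply_layer(EMBED_W, [state / 10.0, float(state % 2), 1.0])


def neural_eval(expr, context):
    """Vector representation of a (possibly partial) expression."""
    if expr == HOLE:
        # Assumption: the context binds a single variable named "x".
        return apply_layer(HOLE_W, embed(context["x"]))
    if isinstance(expr, int):
        return embed(expr)
    if isinstance(expr, str):
        return embed(context[expr])
    op, left, right = expr
    args = neural_eval(left, context) + neural_eval(right, context)  # concatenate
    return apply_layer(MODULES[op], args)


vector = neural_eval(("+", 1, HOLE), {"x": 3})    # encodes 1 + HOLE
print(len(vector))
```

In a trained system the weights of `embed`, the placeholder, and each function module would be learned end-to-end; only the recursive composition shown here mirrors the interpreter of the underlying language.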

Blended abstract semantics

Notice that for an expression such as (2 * x) + HOLE, the concrete value of the sub-expression (2 * x) is known, since it contains no holes. The neural semantics above don’t make use of this knowledge. To improve upon this, we extend neural semantics and introduce blended semantics, which alternates between neural and concrete interpretation as appropriate for a given expression:

  • If the expression is a constant or a variable, use the concrete semantics.

  • If the expression is a HOLE, use the neural semantics.

  • If the expression is a function call, recursively evaluate the expressions that are the arguments to the function. If all arguments evaluate to concrete values, execute the function concretely. If any argument evaluates to a vector representation, transform all concrete values to vectors using Embed and apply the neural semantics of the function.

Formally, we can write:

Because blended abstract semantics replaces concrete sub-components with their concrete values, we expect blended semantics to result in more robust representations, especially for long or complex programs where large portions can be concretely executed.
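The three rules above translate directly into a recursive evaluator that returns a concrete value when it can and a vector otherwise. As before, this is a hedged sketch rather than the authors' code: the neural modules are untrained random layers, and `D`, `embed`, `make_layer`, `apply_layer`, the tuple encoding, and the single-variable context are assumptions made for illustration.

```python
import math
import random

D = 4
HOLE = "HOLE"
rng = random.Random(0)


def make_layer(n_in):
    """Random, untrained weights standing in for a learned module (n_in -> D)."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(D)]


def apply_layer(weights, vec):
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]


EMBED_W, HOLE_W = make_layer(3), make_layer(D)
NEURAL = {"+": make_layer(2 * D), "*": make_layer(2 * D)}
CONCRETE = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}


def embed(state):
    """Concrete integers become vectors; vectors pass through unchanged."""
    if isinstance(state, list):
        return state
    return apply_layer(EMBED_W, [state / 10.0, float(state % 2), 1.0])


def blended_eval(expr, context):
    """Returns an int for hole-free sub-expressions, otherwise a vector."""
    if expr == HOLE:
        return apply_layer(HOLE_W, embed(context["x"]))   # neural semantics
    if isinstance(expr, int):
        return expr                                       # concrete semantics
    if isinstance(expr, str):
        return context[expr]                              # concrete semantics
    op, *args = expr
    values = [blended_eval(arg, context) for arg in args]
    if all(isinstance(v, int) for v in values):
        return CONCRETE[op](*values)                      # execute concretely
    vectors = [embed(v) for v in values]                  # lift concrete values
    return apply_layer(NEURAL[op], [t for vec in vectors for t in vec])


print(blended_eval(("*", 2, "x"), {"x": 3}))                        # fully concrete
print(len(blended_eval(("+", ("*", 2, "x"), HOLE), {"x": 3})))     # contains a HOLE
```

For (2 * x) + HOLE, the sub-expression (2 * x) is executed to an integer first and only then embedded, so the neural module for + sees the exact concrete value rather than an approximation of it.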

4 Program synthesis with Blended Abstract Semantics

To perform synthesis, we experiment with methods to guide search introduced in Ellis et al. (2019). (We review them here, but see Ellis et al. (2019) for more details.) In this work, the search over partial programs is formulated as an MDP, in which each state is a pair consisting of a partial program and a specification, and actions are expansions of HOLEs under the rules of the grammar. We assume a reward of 1 for programs which satisfy the specification. In this framework, we learn to search by training a policy that proposes expansions to the partial program, and optionally a value function that predicts the probability that the specification is solvable via any expansion of the partial program.

Let the specification consist of a set of input-output pairs. For each pair, we compute the blended abstract semantic representation of the partial program on that pair's input. The representation of a state is obtained by applying a learnable weight matrix and averaging across all input-output pairs of the specification. Given this state representation, the policy and value function are computed from it using a multi-layer perceptron (MLP). Note that the value function's output is restricted to a fixed range; this allows for a probabilistic interpretation (see below).

End-to-end training

We train our policy using imitation learning. Starting from the empty partial program HOLE, we generate a sequence of partial programs by sampling a sequence of expansions from the grammar (often a PCFG is used to sample these expansions; see Appendix B for details), ending with a completed program. We obtain specifications by sampling a set of inputs and obtaining outputs by running the completed program on them using concrete semantics. Thus, from a sequence of expansions, we can collect a set of triplets as training data. This process is repeated to generate the training set. We can then perform supervised training, maximizing the log likelihood of the sampled expansions under the policy.

We train the value function by sampling rollouts of partial programs from a fully-trained policy, minimizing the error in a Monte-Carlo estimate of the expected reward (i.e., the probability of success under the policy).

For our error function, we use logistic loss rather than the more common mean squared error (MSE).


We explore a variety of code-writing search algorithms. Using only a policy, we can employ sample-based search and best-first search (where the log probability of generating a partial program under the policy is used as the scoring function). With the addition of a learned value function, we can perform A*-based search with the value function as a heuristic (see Ellis et al. (2019) for details).
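Best-first search over partial programs can be sketched with a priority queue ordered by the negative log probability of the expansions taken so far. The sketch below makes several assumptions that are not from the paper: the toy grammar in `EXPANSIONS`, the leftmost-hole expansion order, the uniform placeholder `policy` (a trained policy would condition on the blended representation of the partial program and the specification), and the hypothetical specification used at the end.

```python
import heapq
import itertools
import math

HOLE = "HOLE"
EXPANSIONS = ["x", 1, 2, ("+", HOLE, HOLE), ("*", HOLE, HOLE)]   # toy grammar


def policy(partial, spec):
    """Placeholder policy: uniform over expansions (a real one is learned)."""
    p = 1.0 / len(EXPANSIONS)
    return [(expansion, p) for expansion in EXPANSIONS]


def has_hole(expr):
    if expr == HOLE:
        return True
    return isinstance(expr, tuple) and any(has_hole(arg) for arg in expr[1:])


def fill_first_hole(expr, replacement):
    """Replaces the leftmost HOLE; returns (new_expr, whether a hole was filled)."""
    if expr == HOLE:
        return replacement, True
    if isinstance(expr, tuple):
        op, left, right = expr
        new_left, done = fill_first_hole(left, replacement)
        if done:
            return (op, new_left, right), True
        new_right, done = fill_first_hole(right, replacement)
        return (op, left, new_right), done
    return expr, False


def evaluate(expr, context):
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return context[expr]
    op, left, right = expr
    a, b = evaluate(left, context), evaluate(right, context)
    return a + b if op == "+" else a * b


def best_first_search(spec, budget=10000):
    """Pops the most probable partial program first; returns a solution or None."""
    tie = itertools.count()            # tie-breaker so the heap never compares programs
    frontier = [(0.0, next(tie), HOLE)]
    for _step in range(budget):
        if not frontier:
            break
        cost, _order, partial = heapq.heappop(frontier)
        if not has_hole(partial):
            if all(evaluate(partial, ctx) == out for ctx, out in spec):
                return partial
            continue
        for expansion, prob in policy(partial, spec):
            child, _filled = fill_first_hole(partial, expansion)
            heapq.heappush(frontier, (cost - math.log(prob), next(tie), child))
    return None


spec = [({"x": 3}, 7), ({"x": 5}, 11)]     # hypothetical specification
solution = best_first_search(spec)
print(solution)
```

An A*-style variant would add a heuristic term derived from the value function to each priority; with the uniform policy used here the search degenerates to enumeration by program size, which is why a learned policy matters for any realistic budget.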

5 Experiments

Figure 3: Example tower-building constructions.

We evaluate our model in two domains containing language constructs not handled by concrete execution-guided synthesis approaches: a tower-building domain with looping constructs, and a list-processing domain with higher order functions. We additionally test on a string-editing domain for which execution-guided synthesis is possible, but requires language modification; there, we examine how our approach fares without these modifications.

5.1 Looping constructs: Tower construction

We begin by investigating how our model performs in generative programming domains with higher-level control flow such as loops. Looping programs are an essential part of sophisticated programming languages, and aren’t naturally handled by previous execution-guided synthesis approaches. Our experiments in the tower-building domain employ a domain-specific language (DSL) similar to the language depicted in the introduction to construct towers in a 2D world, adapted from Ellis et al. (2020). This domain is inspired by classic AI tasks (Winston, 1972). It is also related to important problems in the AI literature, such as generalized planning, where plans can be represented as programs, and often require looping constructs (Jiménez et al., 2019; Srivastava et al., 2015).

As above, the goal is to construct a program which successfully renders to a target image (examples in Figure 3). Language details can be found in Appendix B. We compared against two ablation models (see Figure 1): (1) Neural abstract semantics (defined in Section 3), which does not apply concrete execution to concrete subtrees, and (2) RNN encoding, which encodes partial programs using a gated recurrent unit (GRU). We also consider an additional encoder-decoder baseline. This baseline uses a convolutional neural network (CNN) to encode the spec image, and then employs an LSTM decoder to decode the tokens of the target program, while attending over the image representation via Spatial Transformer Network attention (Jaderberg et al., 2015). This baseline is inspired by the architecture used for the Karel domain in Bunel et al. (2018), with the addition of spatial attention. We train all models on 480,000 programs sampled from the DSL. More details can be found in Appendix B.

To evaluate our model, we constructed a test set of tower-building problems involving combinations of tower-building motifs seen during training. Tower objects seen during training were composed in previously unseen ways, by stacking towers on top of each other, or placing them side-by-side. We evaluate our models by performing best-first search from the learned policy. We also test using a value function, performing A* search with the negative log likelihood of the policy as the prior cost and the value function as the heuristic estimate of future cost.

Figure 4: Left: Overall synthesis results in the tower-building domain. We plot the percentage of test problems solved as a function of the number of partial programs considered per synthesis problem. Right: Comparing the value function to a hand-coded abstract interpretation. Blended abstract semantics outperforms baselines in synthesis tasks, and obtains higher classification precision than hand-coded abstract interpretation.

Synthesis results

Figure 4 left shows our overall synthesis results in the tower-building domain, measuring the percentage of test problems solved as a function of the number of search nodes (partial programs) considered per problem. The sequence encoding performs poorly and is unable to solve a majority of test problems. The neural abstract semantics model achieves better performance, solving about half of the test problems within the allotted search budget. Blended execution outperforms each baseline. We also observe that adding a value function as a search heuristic further increases performance of our blended model, which is consistent with the findings in Ellis et al. (2019).

Comparison to abstract interpretation

How does the learned value function compare to hand-coded abstract interpretation? During synthesis, we can use abstract interpretation to prune the search space by rejecting candidate partial programs for which the desired output state is not within the abstract state achieved by executing the partial program (Singh and Solar-Lezama, 2011; Wang et al., 2017; Hu et al., 2020). Used in this way, classic abstract interpretation is conservative; it can be thought of as a classifier with perfect recall, but poor precision, only rejecting the partial programs it knows for sure to be unsuitable. Can our value function also detect these clearly bad partial programs, but ascribe low value to less obviously bad candidates? To test this, we conditioned the model on tasks from our test corpus, and sampled 15 search trajectories from our blended semantics policy for each task. For each partial program encountered during search, we computed the model’s value judgment, and recorded whether each rollout was successful. Treating rollout success as a noisy label of partial program quality, and using the value function as a classifier, we plot precision vs. recall of the value judgments as we vary the classification threshold. Figure 4 right shows our results for this experiment. As the classification threshold is varied, our learned value maintains comparable recall compared to the hand-coded abstraction, while achieving better precision. For high classification thresholds, our model achieves performance comparable to the hand-coded abstract interpretation, and additional precision is gained by lowering the classification threshold. The RNN value performs worse on this test, achieving lower precision and recall.
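The precision-recall analysis treats the value function as a binary classifier whose decision depends on a threshold. A minimal sketch of that computation is given below; the function name, the convention that a partial program is accepted when its value is at least the threshold, and the numbers in `values` and `labels` are illustrative assumptions, not data from the experiments.

```python
def precision_recall(values, labels, threshold):
    """Precision and recall when a partial program is accepted if value >= threshold.

    values: value-function outputs, one per partial program.
    labels: True when the corresponding rollout succeeded (a noisy quality label).
    """
    predicted = [v >= threshold for v in values]
    true_pos = sum(p and y for p, y in zip(predicted, labels))
    false_pos = sum(p and not y for p, y in zip(predicted, labels))
    false_neg = sum((not p) and y for p, y in zip(predicted, labels))
    # Convention for empty denominators: report 1.0 rather than dividing by zero.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return precision, recall


# Made-up value judgments and rollout outcomes, for illustration only.
values = [0.9, 0.8, 0.4, 0.2, 0.1]
labels = [True, True, False, True, False]

for threshold in (0.15, 0.5):
    print(threshold, precision_recall(values, labels, threshold))
```

Sweeping the threshold over the observed values traces out the curve; a pruning rule that only rejects provably bad candidates corresponds to a single operating point with recall 1.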

5.2 Higher-order functions: functional list processing

Figure 5: Example programs from the list processing (left) and string editing (right) domains.
Figure 6: Synthesis results for list processing. Models were trained and tested on programs each containing 2-3 higher order functions with lambdas of depth 3. Blended semantics achieves the highest performance.
Figure 7: Synthesis results for string editing. Blended semantics outperforms all baselines, except for the execution-guided REPL model, which relies on domain-specific language modifications.

In our second experimental domain, we seek to answer two questions: How well does our model perform on input-output synthesis? How effectively can it synthesize programs containing higher-order functions? Although previous work (Zohar and Wolf, 2018) has successfully applied execution-guided approaches to list processing (using the DeepCoder language), the use of higher-order functions was severely limited: only a small, predefined set of “lambdas” (such as (*2), is_even, (>0)) were used as arguments for higher-order functions. For example, synthesizing a program which “filters all elements divisible by 3 from a list” is not possible with this DSL. However, in real programming languages, higher-order functions must be able to accept a combinatorially large set of possible lambda functions as input. This presents a challenge for execution-guided synthesis approaches such as Zohar and Wolf (2018), for which the assumption of a small set of lambda functions is key. To this end, we modified the DeepCoder DSL to allow a richer set of possible programs. We replaced the predefined set of lambda functions with a grammar allowing grammar elements to be combined combinatorially (examples in Figure 5 left). The modified grammar is given in Appendix B. All models were trained on 500,000 training programs.


Figure 6 shows the results of synthesis using best-first search from a policy on test problems sampled from the same distribution as the training problems. Our blended model finds the highest overall number of correct programs, achieving 5-10% higher accuracy given the same search budget compared to the neural semantics and RNN encoding schemes. We additionally implemented the DeepCoder model (Balog et al., 2016), which conditions the search only on input-output examples and not partially constructed programs. This model achieves considerably lower accuracy for a given search budget. The blended model also yielded superior results on numerous variations of these tasks (increasing number of higher-order functions, varying integer ranges, performing sample-based search, etc.). In the sample-based search condition, we also compare against a RobustFill baseline, which is outperformed by blended semantics. See Figure 8 in Appendix B for details.

5.3 IO programming: String Editing

In our final experiment, we examine how our model performs on domains for which execution-guided synthesis is possible, but requires extensive changes to the underlying DSL.

For example, in the RobustFill DSL, a function getSubStr(i,j) slices a string from index i to index j. This function is not executable until both i and j are known. In order to perform execution-guided synthesis, Ellis et al. (2019) needed to replace getSubStr with two separate functions: getSubStrStart_i and getSubStrEnd_j, where each half can be executed in the read-eval-print loop (REPL). This process must be performed manually for every language construct which takes multiple arguments.

Here we seek to answer the question: can our model be used to successfully synthesize programs using the language as-is? To this end, we implement the code-writing policy using the DSL presented in Devlin et al. (2017) without modification (example programs in Figure 5 right). Because the REPL system in Ellis et al. (2019) uses a domain-specific, hand-designed partial program executor in-the-loop, we do not expect that our approach, which must learn to approximate the semantics, would outperform the REPL system, but we hope that it could achieve much of the gains. We additionally compare against another relevant baseline: RobustFill (Devlin et al., 2017). In contrast to the original paper, we train the RobustFill model using the same “unmodified” version of the DSL as our model, whose syntax has not been modified to aid with prediction. We train all models on 2 million programs sampled from the DSL. At test time, we used a sample-based search procedure, because the branching factor is prohibitively large for the best-first search procedures explored above. While the blended encoding does not achieve the accuracy of the execution-guided REPL system, it outperforms the other baselines, including the RobustFill model, neural abstract semantics, and the RNN baseline.

6 Conclusion

We introduced blended abstract semantics, a method for representing partially written code based on concrete execution and learned approximate semantics. We demonstrated how our approach, which combines abstract interpretation with representation learning, can be trained end-to-end as the basis for search policies and search heuristics. In program synthesis tasks, models equipped with blended abstract semantics outperformed neural baselines in several domains. Immediate future directions include exploring the use of blended abstract semantics for other synthesis tasks, including programming from language instruction and bug fixing. More generally, we hope that approaches which integrate learning and symbolic methods can be used to build systems which can write code more effectively and robustly.


Acknowledgments

The authors gratefully acknowledge Kevin Ellis and Eric Lu for productive conversations. We additionally thank anonymous reviewers for helpful comments. M. Nye is supported by an NSF Graduate Fellowship and an MIT BCS Hilibrand Graduate Fellowship.


References

  • M. Allamanis, M. Brockschmidt, and M. Khademi (2018) Learning to represent programs with graphs. In International Conference on Learning Representations, Cited by: §1, §2.
  • R. Alur, D. Fisman, R. Singh, and A. Solar-Lezama (2016) Sygus-comp 2016: results and analysis. arXiv preprint arXiv:1611.07627. Cited by: §B.3.
  • J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 39–48. Cited by: §2, §3.
  • J. W. Backus, R. J. Beeber, S. Best, R. Goldberg, L. M. Haibt, H. L. Herrick, R. A. Nelson, D. Sayre, P. B. Sheridan, H. Stern, et al. (1957) The fortran automatic coding system. In Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability, pp. 188–198. Cited by: §2.
  • M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow (2016) Deepcoder: learning to write programs. arXiv preprint arXiv:1611.01989. Cited by: §B.2, §B.2, §1, §2, §5.2.
  • M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov (2018) Generative code modeling with graphs. arXiv preprint arXiv:1805.08490. Cited by: §2.
  • R. Bunel, M. Hausknecht, J. Devlin, R. Singh, and P. Kohli (2018) Leveraging grammar and reinforcement learning for neural program synthesis. arXiv preprint arXiv:1805.04276. Cited by: §B.1, §5.1.
  • X. Chen, C. Liu, and D. Song (2018) Execution-guided neural program synthesis. ICLR. Cited by: §1, §2.
  • P. Cousot and R. Cousot (1977) Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 238–252. Cited by: Appendix C, §1, §2.
  • J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed, and P. Kohli (2017) Robustfill: neural program learning under noisy i/o. arXiv preprint arXiv:1703.07469. Cited by: §B.2, §B.3, §1, §1, §2, §5.3.
  • E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang (2019) Hoppity: learning graph transformations to detect and fix bugs in programs. In International Conference on Learning Representations, Cited by: §2.
  • L. Dong and M. Lapata (2018) Coarse-to-fine decoding for neural semantic parsing. arXiv preprint arXiv:1805.04793. Cited by: §2.
  • C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith (2016) Recurrent neural network grammars. arXiv preprint arXiv:1602.07776. Cited by: §2.
  • K. Ellis, M. Nye, Y. Pu, F. Sosa, J. Tenenbaum, and A. Solar-Lezama (2019) Write, execute, assess: program synthesis with a repl. In Advances in Neural Information Processing Systems, pp. 9165–9174. Cited by: §B.3, §1, §2, §2, §4, §4, §5.1, §5.3, §5.3, footnote 3.
  • K. Ellis, C. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Morales, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum (2020) DreamCoder: growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. arXiv preprint arXiv:2006.08381. Cited by: §B.1, §B.1, §B.1, §5.1.
  • J. Gottschlich, A. Solar-Lezama, N. Tatbul, M. Carbin, M. Rinard, R. Barzilay, S. Amarasinghe, J. B. Tenenbaum, and T. Mattson (2018) The three pillars of machine programming. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 69–80. Cited by: §2.
  • S. Gulwani, O. Polozov, and R. Singh (2017) Program synthesis. Foundations and Trends® in Programming Languages Series, Now Publishers. ISBN 9781680832921. Cited by: §2.
  • Q. Hu, J. Cyphert, L. D’Antoni, and T. Reps (2020) Exact and approximate methods for proving unrealizability of syntax-guided synthesis problems. arXiv preprint arXiv:2004.00878. Cited by: Appendix C, §2, §5.1.
  • M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §B.1, §5.1.
  • S. Jiménez, J. Segovia-Aguas, and A. Jonsson (2019) A review of generalized planning. arxiv. Cited by: §5.1.
  • J. Johnson, B. Hariharan, L. Van Der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2989–2998. Cited by: §2.
  • V. Murali, L. Qi, S. Chaudhuri, and C. Jermaine (2017) Neural sketch learning for conditional program generation. arXiv preprint arXiv:1703.05698. Cited by: §2.
  • M. Nye, L. Hewitt, J. Tenenbaum, and A. Solar-Lezama (2019) Learning to infer program sketches. arXiv preprint arXiv:1902.06349. Cited by: §2.
  • A. Odena and C. Sutton (2019) Learning to represent programs with property signatures. In International Conference on Learning Representations, Cited by: §2.
  • H. Peleg, R. Gabay, S. Itzhaky, and E. Yahav (2020) Programming with a read-eval-synth loop. Proc. ACM Program. Lang., OOPSLA. Cited by: footnote 1.
  • S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. ICLR. Cited by: Appendix B.
  • K. Shi, D. Bieber, and R. Singh (2020) TF-coder: program synthesis for tensor manipulations. arXiv preprint arXiv:2003.09040. Cited by: §2.
  • R. Singh and A. Solar-Lezama (2011) Synthesizing data structure manipulations from storyboards. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pp. 289–299. Cited by: Appendix C, §2, §5.1.
  • R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning (2011) Parsing natural scenes and natural language with recursive neural networks. In ICML, Cited by: §2.
  • A. Solar-Lezama (2008) Program synthesis by sketching. Ph.D. Thesis, University of California, Berkeley. Cited by: §2.
  • S. Srivastava, S. Zilberstein, A. Gupta, P. Abbeel, and S. Russell (2015) Tractability of planning with loops. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §5.1.
  • X. Wang, I. Dillig, and R. Singh (2017) Program synthesis using abstraction refinement. Proceedings of the ACM on Programming Languages 2 (POPL), pp. 1–30. Cited by: Appendix C, §2, §3, §5.1.
  • P. Winston (1972) The mit robot. Machine Intelligence. Cited by: §5.1.
  • A. Zohar and L. Wolf (2018) Automatic program synthesis of long programs with a learned garbage collector. In Advances in Neural Information Processing Systems, pp. 2098–2107. Cited by: §1, §2, §5.2.

Appendix A Semantics

Here we fully define the semantics for concrete, neural, and blended semantics, covering details that were omitted in the main paper, namely, the semantics of lambda expressions.

Concrete semantics

In this work, we consider domains where the underlying programming language is functional. Let $\lambda x.\, e$ be a lambda expression of one argument, and let $[x \mapsto v]$ be an assignment of the variable $x$ to the value $v$. Lambda application is defined as follows:

That is, we evaluate the function body $e$, replacing all instances of the function variable $x$ with the value $v$. For example, , where denotes the execution semantics of the built-in function . The concrete semantics is defined:

Neural semantics

For built-in functions , we use , a neural module function of vector inputs, where is the arity of . We note that our treatment of functions discussed here applies to both first-order and higher-order functions.

In our domains, lambda expressions are only used as arguments to higher-order functions. Therefore, since we will never apply a lambda expression directly under neural semantics, we only require vector representation of lambdas. However, we still require a mechanism to represent arbitrary lambda expressions built combinatorially from primitive functions. This representation is constructed in a modular fashion by encoding the body of the lambda expression.

Note that in representing a lambda expression, we used the context with the assignment . This is used to account for the fact that, at the time of the lambda function’s definition (as an argument to a higher-order function), its argument is still unknown under neural execution. Likewise, to represent a variable whose value has been assigned to null, we return a vector representing the variable.

The definition for blended semantics proceeds in an analogous fashion, with concrete subtrees executed concretely.
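The blending rule described in this appendix can be illustrated with a small interpreter. The following is only a sketch under stated assumptions, not the implementation used in the paper: the tuple-based expression encoding, the names `blended`, `encode`, `toy_vector`, and `toy_module`, and the three built-ins are all hypothetical, and the "neural" parts are deterministic stand-ins for learned modules. It shows the control flow only: a subtree with no holes is executed concretely, a lambda containing a hole is represented by encoding its body with its argument bound to null, and a function applied to at least one vector-valued argument falls back to a module over vectors.

```python
import random
import zlib

DIM = 4  # toy embedding size (assumption for this sketch)

# Concrete semantics of a few hypothetical built-ins.
BUILTINS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "map": lambda f, xs: [f(x) for x in xs],
}

def toy_vector(key):
    """Deterministic stand-in for a learned embedding of `key`."""
    rng = random.Random(zlib.crc32(key.encode()))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def toy_module(name, inputs):
    """Stand-in for a learned module: per-function bias plus summed inputs, then ReLU."""
    bias = toy_vector("module:" + name)
    return [max(0.0, b + sum(v[i] for v in inputs)) for i, b in enumerate(bias)]

def has_hole(expr):
    if expr[0] == "hole":
        return True
    if expr[0] == "call":
        return any(has_hole(a) for a in expr[2])
    if expr[0] == "lam":
        return has_hole(expr[2])
    return False

def blended(expr, env):
    """Return ("concrete", value) or ("vector", list_of_floats)."""
    kind = expr[0]
    if kind == "const":
        return ("concrete", expr[1])
    if kind == "var":
        value = env.get(expr[1])
        if value is None:  # variable bound to null: use a vector for the variable
            return ("vector", toy_vector("var:" + expr[1]))
        return ("concrete", value)
    if kind == "hole":  # context is ignored here for brevity
        return ("vector", toy_vector("hole"))
    if kind == "lam":
        _, param, body = expr
        if not has_hole(body):  # fully written lambda: an ordinary closure
            return ("concrete", lambda v: blended(body, {**env, param: v})[1])
        return ("vector", encode(expr, env))
    if kind == "call":
        _, name, args = expr
        results = [blended(a, env) for a in args]
        if all(tag == "concrete" for tag, _ in results):
            return ("concrete", BUILTINS[name](*[v for _, v in results]))
        return ("vector", toy_module(name, [encode(a, env) for a in args]))
    raise ValueError(kind)

def encode(expr, env):
    """Vector representation of `expr`; lambdas are encoded through their body."""
    if expr[0] == "lam":
        _, param, body = expr
        return encode(body, {**env, param: None})
    tag, value = blended(expr, env)
    return value if tag == "vector" else toy_vector("value:" + repr(value))
```

For instance, `("call", "add", [("call", "mul", [("const", 2), ("const", 3)]), ("hole",)])` first executes the `mul` subtree concretely to 6, embeds that value, and only then applies the stand-in module for `add` to the two vectors.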

Appendix B Experimental Details

All models are trained with the AMSGrad (Reddi et al., 2018) variant of the Adam optimizer with a learning rate of 0.001. All RNNs are 1-layer bidirectional GRUs, where the final hidden state is used as the output representation. Unless otherwise stated, holes are encoded by applying the Embed function to the context, and then applying a type-specific neural module to the resulting vector. For the tower and string-editing domains, which use continuation-passing style, holes are filled in left-to-right. For the list processing domain, holes are filled in right-to-left.

B.1 Towers

We employ the tower-building domain and DSL introduced in Ellis et al. (2020), which consists of the basic commands: PlaceHorizontalBlock, PlaceVerticalBlock, MoveHand, ReverseHand, Embed, Loop and integers n from 1 to 8. (The higher-order Embed function takes an expression as input, executes it, and then returns the hand to its initial location.)

Formally, the syntax of the tower-building domain is given as follows:

P = [S]
S = embed(P) | moveHand(n) | reverseHand() | loop(n, P) | placeHorizontalBlock() | placeVerticalBlock()
n = 1..8

A program P is executed sequentially one statement (S) at a time, with each statement modifying the state of the tower construction. In a tower construction, a state is defined as follows:

state: (handLoc, handOrientation, history)
history: [(loc, blockType)]
loc: (x:int, y:int)

We assume the initial tower state is: s = (0, 1, [])

The semantics of each statement S applied to a state s are as follows:

placeHorizontalBlock = placeBlock(3, 1)

placeVerticalBlock = placeBlock(1, 3)

placeBlock(w:int, h:int) : s:state ->
  (s.handLoc, s.handOrientation, s.history +
  [((s.handLoc.x, topY(s.history, s.handLoc.x, s.handLoc.x+w)), w, h)])

topY(history, xlow, xhigh) : max( y |  y = y’+h
  where ((x’,y’), w,h) in history
  and [x’, x’+w] intersect [xlow, xhigh])

moveHand(dist:int): s:state ->
  (s.handLoc + s.handOrientation*dist, s.handOrientation, s.history)

reverseHand(): s:state ->
  (s.handLoc, s.handOrientation*-1, s.history)

loop(n:int, fn): s:state -> fn^n(s)

embed(fn): s:state ->
  let s’ = fn(s) in (s.handLoc, s.handOrientation, s’.history)
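The transition rules above can be sketched as executable Python. This is a minimal sketch under several assumptions that are not stated in the text: the hand location is treated as a single x coordinate (matching the scalar in the initial state), `topY` returns 0 when no placed block overlaps the queried span, history entries are stored as `((x, y), w, h)` as in the definition of `topY`, the vertical block is `placeBlock(1, 3)`, and the helper `seq` (sequential composition of statements) is introduced only for illustration.

```python
from typing import Callable, NamedTuple, Tuple

Block = Tuple[Tuple[int, int], int, int]  # ((x, y), width, height)

class State(NamedTuple):
    hand_x: int        # horizontal hand location (assumed scalar)
    orientation: int   # +1 or -1
    history: Tuple[Block, ...]

Statement = Callable[[State], State]

def top_y(history, xlow, xhigh):
    """Top of the tallest block whose closed span [x, x+w] meets [xlow, xhigh]."""
    tops = [y + h for (x, y), w, h in history if x <= xhigh and xlow <= x + w]
    return max(tops, default=0)  # empty ground is at height 0 (assumption)

def place_block(w, h):
    def run(s):
        y = top_y(s.history, s.hand_x, s.hand_x + w)
        return State(s.hand_x, s.orientation, s.history + (((s.hand_x, y), w, h),))
    return run

place_horizontal_block = place_block(3, 1)
place_vertical_block = place_block(1, 3)

def move_hand(dist):
    return lambda s: State(s.hand_x + s.orientation * dist, s.orientation, s.history)

def reverse_hand():
    return lambda s: State(s.hand_x, -s.orientation, s.history)

def seq(*stmts):
    """Run statements one after another (hypothetical helper for a program P = [S])."""
    def run(s):
        for stmt in stmts:
            s = stmt(s)
        return s
    return run

def loop(n, fn):
    return seq(*([fn] * n))  # fn applied n times

def embed(fn):
    def run(s):
        after = fn(s)
        # Keep the blocks placed by fn, but restore the hand position and orientation.
        return State(s.hand_x, s.orientation, after.history)
    return run

INITIAL = State(0, 1, ())
```

As a usage example, `seq(loop(2, seq(place_horizontal_block, move_hand(4))), embed(seq(reverse_hand(), move_hand(2), place_vertical_block)))(INITIAL)` places two horizontal blocks on the ground, stacks one vertical block on the second, and leaves the hand where the loop ended. Because the overlap test follows the closed intervals written above, blocks that only touch at an edge count as overlapping; strict inequalities would change that.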

In order to test the compatibility of our approach with library-learning techniques, we additionally use library functions learned by the DreamCoder system by combining the above functions. Following Ellis et al. (2020), the grammar is implemented in continuation-passing style. Our training data consisted of tower programs randomly sampled from a PCFG generative model (Ellis et al., 2020).

We trained policy networks on 480,000 programs. We trained value functions on 240,000 rollouts from the policy. We perform search for up to 300 seconds per problem.

For the blended semantics and neural semantics models, all neural modules consist of a single linear layer (input dimension and output dimension ) followed by ReLU activation. Tower images are embedded with a simple CNN-ReLU-MaxPool architecture, as in Ellis et al. (2020).

For the tower-building domain, we also include a baseline inspired by Bunel et al. (2018). This model uses a CNN image encoder to encode the spec tower image. The image representation is passed through an MLP layer and initializes an LSTM decoder, which sequentially decodes the program tokens. As an additional modification to the original model from Bunel et al. (2018), at each decoding step we perform spatial attention over the image encoding via a Spatial Transformer Network (Jaderberg et al., 2015). Our CNN image encoder consists of 4 convolutional layers (Conv + ReLU) with kernel size 3, and a single max-pool of size 2 after the first convolution. Our LSTM decoder has hidden size 512 and embedding size 128.

Hand-coded abstract interpretation

We implemented an abstract domain that tracks (a) the range of possible locations of the “hand” and (b) for each horizontal location, the minimum height that the partially constructed tower must reach. This representation allows us to eliminate invalid partial programs because, once a block is dropped, it cannot be removed by any subsequent command.

B.2 List processing

Data for this domain was generated by modifying the DeepCoder dataset (Balog et al., 2016). Specifically, DeepCoder training programs of size 2 (containing 2-3 higher-order functions, such as map f (filter g input) or zipwith f (map g input) (map h input)) were modified by changing the lambdas in the program (f, g, and h in the above examples) from a small set of constant lambdas such as (*2) to depth-3 lambdas sampled from our modified grammar (see below), for example (λx. max(x+2, x/2)). For each program, 5 example input lists were sampled, each with length 10 and values in the range [-64, 64]. The program was then executed to yield the corresponding outputs. Programs with output or intermediate values outside of the range [-64, 64] were discarded. Programs producing the identity function or constant functions were also discarded. We trained and tested only on functions of type [int] → [int]. At test time when running search, we similarly reject programs with intermediate values outside of the desired integer range.
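The filtering step described above can be sketched as follows. This is a hedged illustration rather than the data pipeline used for the experiments: the names `checked`, `sample_inputs`, `make_examples`, and `halve` are hypothetical, programs are assumed to be Python callables that route every intermediate integer through `checked`, and discarding a program on division by zero is an additional assumption.

```python
import random

LOW, HIGH = -64, 64  # allowed integer range from the text

class OutOfRange(Exception):
    """Raised when an output or intermediate value leaves [LOW, HIGH]."""

def checked(value):
    """Pass a value through unchanged, or raise if it is out of range."""
    if not LOW <= value <= HIGH:
        raise OutOfRange(value)
    return value

def sample_inputs(rng, n_examples=5, length=10):
    """Sample 5 input lists of length 10 with values in [LOW, HIGH]."""
    return [[rng.randint(LOW, HIGH) for _ in range(length)] for _ in range(n_examples)]

def make_examples(program, inputs):
    """Return (input, output) pairs, or None if the program should be discarded."""
    try:
        outputs = [program(xs) for xs in inputs]
    except (OutOfRange, ZeroDivisionError):  # ZeroDivisionError handling is an assumption
        return None
    if all(out == xs for xs, out in zip(inputs, outputs)):
        return None  # behaves like the identity function on these inputs
    if all(out == outputs[0] for out in outputs):
        return None  # behaves like a constant function on these inputs
    return list(zip(inputs, outputs))

def halve(xs):
    """Hypothetical example program: map (λx. x/2) input, with floor division assumed."""
    return [checked(x // 2) for x in xs]
```

A typical call would be `make_examples(halve, sample_inputs(random.Random(0)))`. The identity and constant checks here are behavioral, based only on the sampled examples, which is one plausible reading of the text.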

All policy networks were trained on 500,000 programs. We perform search for 180 seconds per problem.

All neural modules consist of a single linear layer (input dimension and output dimension ) followed by ReLU activation. Integers are encoded digit-wise via a GRU. Lists are encoded via a GRU encoding over the representations of the integers they contain.

Unbound variables within a lambda function are embedded via a learned representation parameterized by the variable name (one vector representing x and one representing y). When encoding holes within lambda functions, we ignore context, and instead embed holes only as a function of the hole type.

The DeepCoder baseline is based on Balog et al. (2016) but uses the same input-output example encoding architecture as the other methods implemented here. Our implementation uses an MLP with one 64-dimensional hidden layer and a tanh activation to produce the production rule probabilities from the input-output encodings.

In the sample-based search condition in Figure 8, we additionally compare against a RobustFill model. Our RobustFill implementation is equivalent to the “Attention-A” model from Devlin et al. (2017). We use GRUs of hidden size 64 and embedding size 64, in keeping with the models above.

Modified lambda grammar

Below is the grammar used for lambda functions:
L → (λx,y. S) | (λx. S)
S → I | B
I → I+I | I*I | I/I | min(I,I) | max(I,I) | A
B → I > I | or(B,B) | and(B,B) | I%I==0
A → x | y | N
N → -2 | -1 | 0 | 1 | 2
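Sampling depth-bounded lambdas from this grammar could look like the sketch below. It rests on assumptions the text does not specify: depth is counted as the number of nested grammar nodes in the body (an atom has depth 1), productions are chosen uniformly, the variable y is only available in two-argument lambdas, and all function names are hypothetical.

```python
import random

INT_OPS = ["+", "*", "/", "min", "max"]
CONSTANTS = [-2, -1, 0, 1, 2]

def sample_atom(variables, rng):
    """A: a bound variable or a small integer constant."""
    return rng.choice(variables + CONSTANTS)

def sample_int(depth, variables, rng):
    """I: an atom, or a binary integer operation over two sub-expressions."""
    if depth <= 1 or rng.random() < 1 / (len(INT_OPS) + 1):
        return sample_atom(variables, rng)
    op = rng.choice(INT_OPS)
    return (op, sample_int(depth - 1, variables, rng), sample_int(depth - 1, variables, rng))

def sample_bool(depth, variables, rng):
    """B: a comparison or divisibility test, or (if depth allows) or/and of two B's."""
    ops = [">", "%==0"] + (["or", "and"] if depth >= 3 else [])
    op = rng.choice(ops)
    child = sample_bool if op in ("or", "and") else sample_int
    return (op, child(depth - 1, variables, rng), child(depth - 1, variables, rng))

def sample_lambda(rng, depth=3):
    """L: a one- or two-argument lambda whose body is an I or a B."""
    variables = rng.choice([["x"], ["x", "y"]])
    sampler = sample_int if rng.random() < 0.5 else sample_bool
    return (variables, sampler(depth, variables, rng))

def depth_of(node):
    if not isinstance(node, tuple):
        return 1
    return 1 + max(depth_of(node[1]), depth_of(node[2]))

def render(node):
    """Pretty-print a body; infix operations are fully parenthesized."""
    if not isinstance(node, tuple):
        return str(node)
    op, left, right = node
    left, right = render(left), render(right)
    if op in ("min", "max", "or", "and"):
        return f"{op}({left},{right})"
    if op == "%==0":
        return f"({left}%{right}==0)"
    return f"({left}{op}{right})"

def render_lambda(lam):
    variables, body = lam
    return f"(λ{','.join(variables)}. {render(body)})"
```

For example, `render_lambda((["x"], ("max", ("+", "x", 2), ("/", "x", 2))))` gives `(λx. max((x+2),(x/2)))`, the fully parenthesized form of the example lambda mentioned in the data description.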

Variations on training and testing conditions

Many variations on the training and testing conditions achieve similar results to those shown in the main paper (i.e., blended semantics consistently achieves the highest performance). Several of these variations are shown in Figure 8.

B.3 String Editing

For the string editing tasks, we use the DSL from Devlin et al. (2017). We train on randomly sampled programs, sampling I/O pairs and propagating constraints from programs to inputs to ensure that input strings are relevant for the target program (see Devlin et al. (2017)). We condition on 4 I/O examples for each program. We used string editing problems from the SyGuS (Alur et al., 2016) program synthesis competition as our test corpus. We trained all models on 2 million training programs. At test time, we sample programs from the model for a maximum timeout of 30 seconds.

Input and output strings are encoded by embedding each character via a 20-dimensional character embedding and concatenating the resulting vectors to form a representation for each string. Representations of “expressions” (as defined in the RobustFill DSL) are concatenated together using an “append” module. Following Ellis et al. (2019), neural modules consist of a single dense block with 5 layers and a growth rate of 128 (input dimension and output dimension ).

Figure 8: Variations on the list processing task. Top Left: using integer values in the range [-32, 32] instead of [-64, 64]. Top Right: using sample-based search instead of best-first search, including a comparison to RobustFill. Bottom Left: extending the training and testing data to allow for [int] → int functions. Bottom Right: the original model tested on deeper DeepCoder programs with 3-6 higher-order functions. Notably, in all test conditions, the blended semantics consistently outperforms the RNN and neural semantics.

Appendix C Limitations of our neural semantics

Our approach to neural semantics differs conceptually in several ways from abstract interpretation (Cousot and Cousot, 1977), and the way it has previously been used to constrain symbolic search for program synthesis (Singh and Solar-Lezama, 2011; Wang et al., 2017; Hu et al., 2020). Two main differences stand out as potential limitations of our approach, and avenues to be explored further in future work. Firstly, our method treats loops generically, identically to other higher-order functions. This means neural modules must compute a representation of a looping program using a fixed, feed-forward computation, without an iterative fixpoint computation. It is possible that a more rigorous treatment of loops could lead to improved accuracy, though it would introduce additional complexity to the method. We believe that this may be an interesting direction for future work. Secondly, our handling of lambda functions also differs from abstract interpretation. As discussed in Appendix A, when using a lambda as an argument to a higher-order function, our method represents the lambda expression as a vector, without encoding the values of its arguments. A method which can treat lambdas more generically would also be an interesting direction for future work. We thank an anonymous reviewer for highlighting these two points.