When was the last time you typed out a large body of code all at once, and had it work on the first try? If you are like most programmers, this hasn’t happened much since “Hello, World.” Writing a large body of code is a process of trial and error that alternates between trying out a piece of code, executing it to see if you’re still on the right track, and trying out something different if the current execution looks buggy. Crucial to this human work-flow is the ability to execute the partially-written code, and the ability to assess the resulting execution to see if one should continue with the current approach. Thus, if we wish to build machines that automatically write large, complex programs, designing systems with the ability to effectively transition between states of writing, executing, and assessing the code may prove crucial.
In this work, we present a model that integrates components which write, execute, and assess code to perform a stochastic search over the semantic space of possible programs. We do this by equipping our model with one of the oldest and most basic tools available to a programmer: an interpreter, or read-eval-print-loop (REPL), which immediately executes partially written programs, exposing their semantics. The REPL addresses a fundamental challenge of program synthesis: tiny changes in syntax can lead to huge changes in semantics. The mapping between syntax and semantics is a difficult relation for a neural network to learn, but comes for free given a REPL. By conditioning the search solely on the execution states rather than the program syntax, the search is performed entirely in the semantic space. By allowing the search to branch when it is uncertain, discard branches when they appear less promising, and attempt more promising alternative candidates, the search process emulates the natural iterative coding process used by a human programmer.
In the spirit of systems such as AlphaGo alphaGo , we train a pair of models (a policy that proposes new pieces of code to write, and a value function that evaluates the long-term prospects of the code written so far) and deploy both at test time in a symbolic tree search. Specifically, we combine the policy, value, and REPL with a Sequential Monte Carlo (SMC) search strategy at inference time. We sample next actions using our learned policy, execute the partial programs with the REPL, and re-weight the candidates by the value of the resulting partial program state. This algorithm allows us to naturally incorporate writing, executing, and assessing partial programs into our search strategy, while managing a large space of alternative program candidates.
Integrating learning and search to tackle the problem of program synthesis is an old idea experiencing a recent resurgence ganin2018synthesizing ; schmidhuber2004optimal ; ellis2018learning ; balog2016deepcoder ; nye2019learning ; TikZ ; devlin2017robustfill ; kalyan2018neural ; tian2019learning ; pu2018selecting . Our work builds on recent ideas termed ‘execution-guided neural program synthesis,’ independently proposed by chen2018execution and zohar2018automatic , where a neural network writes a program conditioned on intermediate execution states. We extend these ideas along two dimensions. First, we cast these different kinds of execution guidance in terms of interaction with a REPL, and use reinforcement learning techniques to train an agent both to interact with a REPL and to assess when it is on the right track. Prior execution-guided neural synthesizers do not learn to assess the execution state, which is a prerequisite for sophisticated search algorithms like those we explore in this work. Second, we investigate several ways of interleaving the policy and value networks during search, finding that an SMC sampler provides an effective foundation for an agent that writes, executes, and assesses its code.
We validate our framework on two different domains (see Figure 1): inferring 2D and 3D graphics programs (in the style of computer aided design, or CAD) and synthesizing text-editing programs (in the style of FlashFill gulwani2011automating ). In both cases we show that a code-writing agent equipped with a REPL and a value function to guide the search achieves faster, more reliable program synthesis.
Explicitly, our contributions are:
- We cast the problem of program synthesis as a search problem and solve it using Sequential Monte Carlo (SMC) by employing a pair of learned functions: a value and a policy.
- We equip the model with a REPL, bridging the gap between the syntax and semantics of a program.
- We empirically validate our technique in two different program synthesis domains, text-editing programs and 2D/3D graphics programs, and out-perform alternative approaches.
2 An Illustrative Example
To make our framework concrete, consider the task of synthesizing a constructive solid geometry (CSG) representation of a simple 2D scene (see Figure 2). CSG is a shape-modeling language that allows the user to create complex renders by combining simple primitive shapes via boolean operators. The CSG program in our example consists of two boolean combinations, union $\oplus$ and subtraction $\ominus$, and two primitives, circles $C^{r}_{x,y}$ and rectangles $R^{w,h,\theta}_{x,y}$, specified by position $x,y$, radius $r$, width and height $w,h$, and rotation $\theta$. This CSG language can be formalized as the context-free grammar below:
$$P \;\to\; P \oplus P \;\mid\; P \ominus P \;\mid\; C^{r}_{x,y} \;\mid\; R^{w,h,\theta}_{x,y}$$
The synthesis task is to find a CSG program that renders to $spec$. Our policy constructs this program one piece at a time, conditioned on the set of expressions currently in scope. Starting with an empty set of programs in scope, $pp = \{\}$, the policy proposes an action $a$ that extends it. This proposal process is iterated to incrementally extend $pp$ to contain longer and more complex programs. In this CSG example, the action $a$ is either adding a primitive shape, such as a circle $C^{r}_{x,y}$, or applying a boolean combinator, such as $p_1 \oplus p_2$, where the action also specifies its two arguments $p_1$ and $p_2$.
To help the policy make good proposals, we augment it with a REPL, which takes a set of programs $pp$ in scope and executes each of them. In our CSG example, the REPL renders the set of programs $pp$ to a set of images. The policy then takes in the REPL state (a set of images), along with the specification $spec$, to predict the next action $a$. This way, the input to the policy lies entirely in the semantic space, akin to how one would use a REPL to iteratively construct a working code snippet. Figure 2 demonstrates a potential roll-out through a CSG problem using only the policy.
However, code is brittle, and if the policy predicts an incorrect action, the entire program synthesis fails. To combat this brittleness, we use Sequential Monte Carlo (SMC) to search over the space of candidate programs. Crucial to our SMC algorithm is a learned value function $v$ which, given a REPL state, assesses the likelihood of success on this particular search branch. By employing $v$, the search can be judicious about which search branch to prioritize in exploring, and can withdraw from branches deemed unpromising. Figure 3 demonstrates a fraction of the search space leading up to the successful program and how the value function helps to prune out unpromising search candidates.
3 Our Approach
Our program synthesis approach integrates components that write, execute, and assess code to perform a stochastic search over the semantic space of programs. Crucial to this approach are three main components: First, the definition of the search space; Second, the training of the code-writing policy and the code-assessing value function; Third, the Sequential Monte Carlo algorithm that leverages the policy and the value to conduct the search by maintaining a population of candidate search branches.
3.1 The Semantic Search Space of Programs
The space of possible programs is typically defined by a context-free grammar (CFG), which specifies the set of syntactically valid programs. However, when writing code, one often constructs programs in a piece-wise fashion. Thus, it is natural to express the search space of programs as a Markov decision process (MDP) over the set of partially constructed programs.
State The state is a tuple $s = (pp, spec)$ where $pp$ is a set of partially-constructed program trees (intuitively, ‘variables in scope’), and $spec$ is the goal specification. Thus, our MDP is goal conditioned. The start state is $(\{\}, spec)$.
Action The action $a$ is a production rule from the CFG (a line of code typed into the REPL).
Transitions The transition, $T$, takes the set of partial programs $pp$ and applies the action $a$ to either:
1. instantiate a new sub-tree if $a$ is a terminal production: $T(pp, a) = pp \cup \{a\}$
2. combine multiple sub-trees if $a$ is a non-terminal: $T(pp, a) = (pp \cup \{a(t_1, \ldots, t_k)\}) \setminus \{t_1, \ldots, t_k\}$
Note that in the case of a non-terminal, the children are removed, or ‘garbage-collected’ zohar2018automatic .
Reward The reward is $1$ if there is a program $p \in pp$ that satisfies the spec, and $0$ otherwise.
Note that the state of our MDP is defined jointly in the syntactic space, $pp$, and the semantic space, $spec$. To bridge this gap, we use a REPL, which evaluates the set of partial programs into a semantic or “executed” representation. Let $pp$ be a set of programs, and let $\llbracket p \rrbracket$ denote the execution of a program $p$; then we can write the REPL state as $\llbracket pp \rrbracket = \{\llbracket p \rrbracket : p \in pp\}$.
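The MDP described above can be sketched on a toy domain. The following is a minimal illustration, not the authors' implementation: it assumes a hypothetical 2D language with axis-aligned rectangles plus union and difference, represents programs as nested tuples, and renders each program to a set of filled pixels. The names `execute`, `transition`, `repl`, and `reward` are chosen here for illustration.

```python
from typing import FrozenSet, Tuple

# Hypothetical toy CSG language (an assumption for illustration):
#   ("rect", x0, y0, x1, y1) | ("union", p, q) | ("diff", p, q)
Canvas = FrozenSet[Tuple[int, int]]


def execute(p: tuple) -> Canvas:
    """Render one program to the set of pixels it fills."""
    if p[0] == "rect":
        _, x0, y0, x1, y1 = p
        return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))
    left, right = execute(p[1]), execute(p[2])
    return left | right if p[0] == "union" else left - right


def transition(pp: frozenset, action: tuple) -> frozenset:
    """Apply an action to the set of programs in scope."""
    if action[0] == "rect":
        # Terminal production: a new sub-tree enters scope.
        return pp | {action}
    # Non-terminal: combine two sub-trees already in scope and
    # remove ("garbage-collect") the children.
    _, p, q = action
    assert p in pp and q in pp, "arguments must be in scope"
    return (pp - {p, q}) | {action}


def repl(pp: frozenset) -> frozenset:
    """REPL state: the executed form of every program in scope."""
    return frozenset(execute(p) for p in pp)


def reward(pp: frozenset, spec: Canvas) -> int:
    """1 if some program in scope renders exactly to the spec, else 0."""
    return 1 if any(execute(p) == spec for p in pp) else 0


# Usage: carve a 2x2 notch out of a 3x3 square.
big, notch = ("rect", 0, 0, 3, 3), ("rect", 1, 1, 3, 3)
spec = execute(("diff", big, notch))
pp = transition(transition(frozenset(), big), notch)  # two primitives in scope
reward_before = reward(pp, spec)
pp = transition(pp, ("diff", big, notch))             # combine; children leave scope
reward_after = reward(pp, spec)
```

A policy in this setting would only ever see `repl(pp)` and `spec`, never the tuples themselves, which is what keeps its input in the semantic space.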
3.2 Training the Code-Writing Policy and the Code-Assessing Value
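Given the pair of evaluated program states and spec ($\llbracket pp \rrbracket$, $spec$), the policy $\pi$ outputs a distribution over actions, written $\pi(a \mid \llbracket pp \rrbracket, spec)$, and the value function $v$ predicts the expected reward starting from state ($\llbracket pp \rrbracket$, $spec$). In our MDP, expected total reward conveniently coincides with the probability of a rollout satisfying $spec$ starting with the partial program $pp$:
$$v(\llbracket pp \rrbracket, spec) = \mathbb{E}_{\pi}\left[R \mid \llbracket pp \rrbracket, spec\right] = P\left(\text{rollout with } \pi \text{ satisfies } spec \mid \llbracket pp \rrbracket, spec\right)$$
Thus the value function simply performs binary classification, predicting the probability that a state will lead to a successful program.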
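Pretraining $\pi$. Because we assume the existence of a CFG and a REPL, we can generate an infinite stream of training data by sampling random programs from the CFG, executing them to obtain a spec, and then recovering the ground-truth action sequence. Specifically, we draw samples from a distribution over synthetic training data, $\mathcal{D}$, consisting of triples of the spec, the sequence of actions, and the set of partially constructed programs at each step: $(spec, \{a_t\}_{t \le T}, \{pp_t\}_{t \le T}) \sim \mathcal{D}$. During pretraining, we maximize the log likelihood of these action sequences under the policy:
$$\mathcal{L}^{\text{pretrain}}(\pi) = \mathbb{E}_{\mathcal{D}}\left[\sum_{t \le T} \log \pi(a_t \mid \llbracket pp_t \rrbracket, spec)\right]$$
Training $\pi$ and $v$. We fine-tune the policy and train the value function by sampling the policy’s roll-outs against $spec \sim \mathcal{D}$ in the style of REINFORCE. Specifically, given $spec \sim \mathcal{D}$, the policy’s rollout consists of a sequence of actions $\{a_t\}_{t \le T}$, a sequence of partial programs $\{pp_t\}_{t \le T}$, and a reward $R = \mathbb{1}\left[\llbracket pp_T \rrbracket \text{ satisfies } spec\right]$. Given the specific rollout, we train $v$ and $\pi$ to maximize:
$$\mathcal{L}^{\text{RL}}(v, \pi) = R \sum_{t \le T} \log v(\llbracket pp_t \rrbracket, spec) + (1 - R) \sum_{t \le T} \log\left(1 - v(\llbracket pp_t \rrbracket, spec)\right) + R \sum_{t \le T} \log \pi(a_t \mid \llbracket pp_t \rrbracket, spec)$$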
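To make the two training signals concrete, the sketch below computes per-rollout objectives from plain numbers. It assumes, consistent with the description of the value function as a binary classifier and of REINFORCE-style fine-tuning, that the value term is a binary cross-entropy against the rollout's reward and that the policy term is the reward-weighted log likelihood of the sampled actions. The function names and the exact form of the combined objective are assumptions for illustration, and a real implementation would operate on network outputs with automatic differentiation.

```python
import math
from typing import Sequence


def pretrain_objective(action_probs: Sequence[float]) -> float:
    """Log likelihood of the ground-truth actions under the policy.

    action_probs[t] is the probability the policy assigns to the
    ground-truth action at step t.
    """
    return sum(math.log(p) for p in action_probs)


def rl_objective(action_probs: Sequence[float],
                 values: Sequence[float],
                 reward: int) -> float:
    """Objective for one sampled rollout with binary reward (0 or 1).

    values[t] is the value network's predicted success probability
    at step t of the rollout.
    """
    # Value term: binary classification of every visited state
    # against the final outcome of the rollout.
    if reward == 1:
        value_term = sum(math.log(v) for v in values)
    else:
        value_term = sum(math.log(1.0 - v) for v in values)
    # Policy term: only successful rollouts reinforce their actions.
    policy_term = reward * sum(math.log(p) for p in action_probs)
    return value_term + policy_term


# Usage: a two-step rollout, scored as a success and as a failure.
success = rl_objective([0.5, 0.5], [0.5, 0.5], reward=1)
failure = rl_objective([0.5, 0.5], [0.5, 0.5], reward=0)
```

With a 0/1 reward, a failed rollout contributes nothing to the policy term under this form, so it only teaches the value function that the visited states were unpromising.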
3.3 An SMC Inference Algorithm That Interleaves Writing, Executing, and Assessing Code
At test time we interleave code writing, i.e. drawing actions from the policy, and code assessing, i.e. querying the value function; execution is interleaved as well, since the REPL always runs before either network is queried. Sequential Monte Carlo methods doucet2001introduction , of which particle filters are the most famous example, are a flexible framework for designing samplers that infer a sequence of latent variables $\{x_t\}_{t \le T}$ conditioned on a paired sequence of observed variables $\{y_t\}_{t \le T}$. Following ellis2018learning we construct an SMC sampler for our setting by identifying the policy rollout over time as the sequence of latent variables (i.e. $x_t = pp_t$), and identifying the spec as the observed variable at every time step (i.e. $y_t = spec$), which are connected to the policy and value networks by defining
$$P(x_{t+1} \mid x_t) = \sum_{a \,:\, T(x_t, a) = x_{t+1}} \pi(a \mid \llbracket x_t \rrbracket, spec), \qquad P(y_t \mid x_t) \propto v(\llbracket x_t \rrbracket, y_t)$$
and, like a particle filter, we approximately sample from $P(\{pp_t\}_{t \le T} \mid spec)$ by maintaining a population of $K$ particles, each particle a state in the MDP, and evolving this population forward in time by sampling from $\pi$, importance reweighting by $v$, and then resampling.
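The sample, reweight, resample loop can be sketched in a few lines. The code below is a minimal illustration under stated assumptions rather than the sampler used in the experiments: the policy and value are ordinary Python callables standing in for the networks, the toy task (spelling a target string one character at a time) is hypothetical, and details such as how finished programs are handled are omitted.

```python
import random
from typing import Callable, Optional, Sequence


def smc(start,
        actions: Sequence,
        policy: Callable,      # state -> list of action probabilities
        value: Callable,       # state -> estimated success probability
        transition: Callable,  # (state, action) -> next state
        is_goal: Callable,     # state -> bool
        steps: int,
        num_particles: int,
        rng: random.Random) -> Optional[object]:
    particles = [start] * num_particles
    for _ in range(steps):
        # Write: each particle samples one action from the policy.
        proposed = []
        for state in particles:
            action = rng.choices(actions, weights=policy(state), k=1)[0]
            proposed.append(transition(state, action))
        # Execute and check: stop as soon as any particle solves the task.
        for state in proposed:
            if is_goal(state):
                return state
        # Assess: weight each particle by the value function, then
        # resample so that promising branches receive more particles.
        weights = [value(state) for state in proposed]
        particles = rng.choices(proposed, weights=weights, k=num_particles)
    return None


# Usage on a toy task: the "spec" is a target string and the value
# function strongly prefers states that are prefixes of it.
target = "abba"
found = smc(
    start="",
    actions=["a", "b"],
    policy=lambda s: [0.5, 0.5],
    value=lambda s: 1.0 if target.startswith(s) else 1e-6,
    transition=lambda s, a: s + a,
    is_goal=lambda s: s == target,
    steps=len(target),
    num_particles=50,
    rng=random.Random(0),
)
```

Because every step operates on the whole population at once, the policy and value calls in a real system can be batched, which is the throughput advantage discussed in the text.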
SMC techniques are not the only reasonable approach: one could perform a beam search, seeking to maximize $\log \pi(\{a_t\}_{t<T} \mid spec) + \log v(\llbracket pp_T \rrbracket, spec)$; or A* search, by interpreting $-\log \pi(\{a_t\}_{t<T} \mid spec)$ as cost-so-far and $-\log v(\llbracket pp_T \rrbracket, spec)$ as heuristic cost-to-go; or, as popularized by AlphaGo, one could use MCTS jointly with the learned value and policy networks. SMC confers two main benefits: (1) it is a stochastic search procedure, immediately yielding a simple any-time algorithm where we repeatedly run the sampler and keep the best program found so far; and (2) the sampling/resampling steps are easily batched on a GPU, giving high throughput unattainable with serial algorithms like A*.
4 Experiments
To assess the relative importance of the policy, value function, and REPL, we study a spectrum of models and test-time inference strategies in both of our domains. For each model and inference strategy, we are interested in how efficiently it can search the space of programs, i.e. the best program found as a function of time spent searching. We trained a pair of models: our REPL model, which conditions on intermediate execution states (architectures in Figure 4), and a ‘no REPL’ baseline, which decodes a program in one shot using only the spec and syntax. This baseline is inspired by the prior work CSGNet sharma2018csgnet and RobustFill devlin2017robustfill for CAD and string editing, respectively. We compare our SMC sampler with a simple beam search using both policy and value, as well as a pair of inference strategies employing only the policy: simple policy rollouts, and a beam search decoding to find the highest likelihood program under $\pi$. The policy-only beam search method is inspired by chen2018execution and zohar2018automatic , which performed execution-guided synthesis in Karel and list processing domains, respectively.
4.1 Inverse CAD
Modern mechanical parts are created using Computer Aided Design (CAD), a family of programmatic shape-modeling techniques. Here we consider two varieties of inverse CAD: inferring programs generating 3D shapes, and programs generating 2D graphics, which can also be seen as a kind of high-level physical object understanding in terms of parts and relations between parts.
We use CSG as our CAD modeling language, where the specification is an image, i.e. a pixel/voxel array for 2D/3D, and the goal is to write a program that renders to the target image. These programs build shapes by algebraically combining primitive drawing commands via addition and subtraction, including circles and rotated rectangles (for 2D) as well as spheres, cubes, and cylinders (for 3D). Coordinates are quantized in both cases, although the quantization could in principle be made arbitrarily fine. Our REPL renders each partial program to a distinct canvas, which the policy and value networks take as input.
Experimental evaluation We train our models on randomly generated scenes with up to 13 objects. As a hard test of out-of-sample generalization we test on randomly generated scenes with up to 30 or 20 objects for 2D and 3D, respectively. Figure 5 measures the quality of the best program found so far as a function of time, where we measure the quality of a program by the intersection-over-union (IoU) with the spec. Incorporating the value function proves important for both beam search and sampling methods such as SMC. Given a large enough time budget the ‘no REPL’ baseline is competitive with our ablated alternatives: inference time is dominated by CNN evaluations, which occur at every step with a REPL, but only once without it. Qualitatively, an integrated policy, value network, and REPL yield programs closely matching the spec (Figure 6). Together these components allow us to infer very long programs, despite a branching factor of 1.3 million per line of code: the largest programs we successfully infer (by “successfully infer” we mean IoU $\geq$ 0.9) go up to 19 lines of code/102 tokens for 3D and 22 lines/107 tokens for 2D, whereas the best-performing ablations fail to scale beyond 3 lines/19 tokens for 3D and 19 lines/87 tokens for 2D.
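For readers unfamiliar with the metric, intersection-over-union can be computed as in the sketch below. It assumes renders are represented as sets of filled pixel or voxel coordinates, which is a choice made here for brevity; the convention for two empty canvases and the inclusive comparison against the 0.9 threshold are likewise assumptions.

```python
from typing import Iterable, Tuple

SUCCESS_THRESHOLD = 0.9  # threshold quoted in the text; inclusive comparison is assumed


def iou(rendered: Iterable[Tuple[int, ...]], spec: Iterable[Tuple[int, ...]]) -> float:
    """Intersection-over-union of two sets of filled coordinates."""
    a, b = set(rendered), set(spec)
    union = a | b
    if not union:
        # Two empty canvases are treated as a perfect match (an assumption).
        return 1.0
    return len(a & b) / len(union)


def is_success(rendered, spec) -> bool:
    return iou(rendered, spec) >= SUCCESS_THRESHOLD


# Usage: two 2-pixel shapes that share one pixel.
score = iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)})
```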
4.2 String Editing Programs
Learning programs that transform text is a classic program synthesis task lau2001programming made famous by the FlashFill system, which ships in Microsoft Excel gulwani2011automating . We apply our framework to string editing programs using the RobustFill programming language devlin2017robustfill , which was designed for neural program synthesizers.
REPL Our formulation suggests modifications to the RobustFill language so that partial programs can be evaluated into a semantically coherent state (i.e. they execute and output something meaningful). Along with edits to the original language, we designed and implemented a REPL for this domain, which, in addition to the original inputs and outputs, includes three additional features as part of the intermediate program state: The committed string maintains, for each example input, the output of the expressions synthesized so far. The scratch string maintains, for each example input, the partial results of the expression currently being synthesized until it is complete and ready to be added to the committed string. Finally, the binary-valued mask features indicate, for each character position, the possible locations on which transformations can occur.
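The committed and scratch strings can be pictured with a small data structure. This is a hypothetical sketch of the per-example state only: the class name, the two helper functions, and the prefix check are illustrative assumptions, the mask features are omitted, and none of this reflects the actual modified RobustFill language.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExampleState:
    """REPL state for a single input/output example (illustrative)."""
    inp: str             # example input
    out: str             # desired output
    committed: str = ""  # output of the expressions synthesized so far
    scratch: str = ""    # partial result of the expression in progress


def set_scratch(state: ExampleState, text: str) -> ExampleState:
    """Record the partial result of the expression being built."""
    return replace(state, scratch=text)


def commit(state: ExampleState) -> ExampleState:
    """Append the finished expression's result to the committed string."""
    return replace(state, committed=state.committed + state.scratch, scratch="")


def on_track(state: ExampleState) -> bool:
    """A committed string can only lead to the target if it is a prefix of it."""
    return state.out.startswith(state.committed)


# Usage: start producing "Watson, James (Dr)" from "Dr James Watson".
state = ExampleState(inp="Dr James Watson", out="Watson, James (Dr)")
state = commit(set_scratch(state, state.inp.split()[-1]))  # last word of the input
```

In a setup like this the policy and value networks would read all four strings for every example, so they see what the partial program has produced rather than its syntax.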
Experimental Evaluation We trained our model and a reimplementation of RobustFill on string editing programs randomly sampled from the CFG. We originally tested on string editing programs from nye2019learning (comprising training tasks from ellis2018learning and the test corpus from alur2016sygus ), but found performance was near ceiling for our model. We designed a more difficult dataset of 87 string editing problems from 34 templates comprising address, date/time, name, and movie review transformations. This dataset required synthesis of long and complex programs, making it harder for pure neural approaches such as RobustFill.
The performance of our model and baselines is plotted in Figure 7, and examples of best performing programs are shown in Figure 8. The value-guided SMC sampler leads to the highest overall number of correct programs, requiring less time and fewer nodes expanded compared to other inference techniques. We also observe that beam search attains higher overall performance with the value function than beam search without value. Our model demonstrates strong out-of-sample generalization: although it was trained on programs whose maximum length was 30 actions and average length approximately 8 actions, at test time we regularly recovered programs with 40 actions or more, corresponding to a description length greater than 350 bits.
| | Input | Output | Input | Output |
| Spec example | 6/12/2003 | date: 12 mo: 6 year: 2003 | Dr Mary Jane Lennon | Lennon, Mary Jane (Dr) |
| Spec example | 3/16/1997 | date: 16 mo: 3 year: 1997 | Mrs Isaac McCormick | McCormick, Isaac (Mrs) |
| Held out test instance | 12/8/2019 | date: 8 mo: 12 year: 2019 | Dr James Watson | Watson, James (Dr) |
| SMC (Ours) | | date: 8 mo: 12 year: 2019 | | Watson, James (Dr) |
| Rollout | | date: 8 mo: 1282019 | | Watson, James |
| Beam w/value | | date: 8 mo: 12 year:2019 | | Watson, JamesDr |
| Beam | | date: 8 mo: 12 year: | | Watson, James ( |
| RobustFill | | date:12/8/2019 | | Dr James Watson |
Within the program synthesis community, both text processing and graphics program synthesis have received considerable attention gulwani2011automating . We are motivated by works such as InverseCSG Du:2018:IAC:3272127.3275006 , CSGNet sharma2018csgnet , and RobustFill devlin2017robustfill ; our goal, however, is not to solve a specific synthesis problem in isolation, but rather to push toward more general frameworks that demonstrate robustness across domains.
We draw inspiration from recent neural “execution-guided synthesis” approaches zohar2018automatic ; chen2018execution which leverage partial evaluations of programs, similar to our use of a REPL. We build on this line of work by explicitly formalizing the task as an MDP, which exposes a range of techniques from the RL and planning literatures. Our addition of a learned value network yields marked improvements over methods that do not leverage such learned value networks. Prior work simmons2018program combines tree search with Q-learning to synthesize small assembly programs, but does not scale to large programs with extremely high branching factors, as we do (e.g., the action-programs we synthesize for text editing or the 1.3 million branching factor per line of code in our 3D inverse CAD).
We have presented a framework for learning to write code combining two ideas: allowing the agent to explore a tree of possible solutions, and executing and assessing code as it gets written. This has largely been inspired by previous work on execution-guided program synthesis, value-guided tree search, and general behavior observed in how people write code.
An immediate future direction is to investigate programs with control flow like conditionals and loops. Forth-style stack-based languages rather1993evolution offer promising REPL-like representations of these control flow operators. But more broadly, we are optimistic that many tools used by human programmers, like debuggers and profilers, can be reinterpreted and repurposed as modules of a program synthesis system. By integrating these tools into program synthesis systems, we believe we can design systems that write code more robustly and rapidly, as people do.
We gratefully acknowledge many extended and productive conversations with Tao Du, Wojciech Matusik, and Siemens research. In addition to these conversations, Tao Du assisted by providing 3D ray tracing code, which we used when rerendering 3D programs. Work was supported by Siemens research, AFOSR award FA9550-16-1-0012 and the MIT-IBM Watson AI Lab. K. E. and M. N. are additionally supported by NSF graduate fellowships.
- (1) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
- (2) Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, SM Eslami, and Oriol Vinyals. Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118, 2018.
- (3) Jürgen Schmidhuber. Optimal ordered problem solver. Machine Learning, 54(3):211–254, 2004.
- (4) Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Learning libraries of subroutines for neurally–guided bayesian program induction. In Advances in Neural Information Processing Systems, pages 7805–7815, 2018.
- (5) Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.
- (6) Maxwell Nye, Luke Hewitt, Joshua Tenenbaum, and Armando Solar-Lezama. Learning to infer program sketches. arXiv preprint arXiv:1902.06349, 2019.
- (7) Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems, pages 6059–6068, 2018.
- (8) Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. Robustfill: Neural program learning under noisy i/o. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 990–998. JMLR. org, 2017.
- (9) Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani. Neural-guided deductive search for real-time program synthesis from examples. arXiv preprint arXiv:1804.01186, 2018.
- (10) Yonglong Tian, Andrew Luo, Xingyuan Sun, Kevin Ellis, William T Freeman, Joshua B Tenenbaum, and Jiajun Wu. Learning to infer and execute 3d shape programs. arXiv preprint arXiv:1901.02875, 2019.
- (11) Yewen Pu, Zachery Miranda, Armando Solar-Lezama, and Leslie Kaelbling. Selecting representative examples for program synthesis. In International Conference on Machine Learning, pages 4158–4167, 2018.
- (12) Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. 2018.
- (13) Amit Zohar and Lior Wolf. Automatic program synthesis of long programs with a learned garbage collector. In Advances in Neural Information Processing Systems, pages 2094–2103, 2018.
- (14) Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In ACM Sigplan Notices, volume 46, pages 317–330. ACM, 2011.
- (15) Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer, 2001.
- (16) Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. Csgnet: Neural shape parser for constructive solid geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5515–5523, 2018.
- (17) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700, 2015.
- (18) Tessa Lau. Programming by demonstration: a machine learning approach. PhD thesis, 2001.
- (19) Rajeev Alur, Dana Fisman, Rishabh Singh, and Armando Solar-Lezama. Sygus-comp 2016: Results and analysis. arXiv preprint arXiv:1611.07627, 2016.
- (20) Tao Du, Jeevana Priya Inala, Yewen Pu, Andrew Spielberg, Adriana Schulz, Daniela Rus, Armando Solar-Lezama, and Wojciech Matusik. Inversecsg: Automatic conversion of 3d models to csg trees. ACM Trans. Graph., 37(6):213:1–213:16, December 2018.
- (21) Riley Simmons-Edler, Anders Miltner, and Sebastian Seung. Program synthesis through reinforcement learning guided tree search. arXiv preprint arXiv:1806.02932, 2018.
- (22) Elizabeth D Rather, Donald R Colburn, and Charles H Moore. The evolution of forth. In ACM SIGPLAN Notices, volume 28, pages 177–199. ACM, 1993.
- (23) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in neural information processing systems, pages 3391–3401, 2017.
Appendix A Appendix
A.1 Graphics Programming Language
The programs are generated by the CFG below:
In principle the 2D language admits arbitrary quadrilaterals. When generating synthetic training data we constrain the quadrilaterals to take the form of rectangles rotated by 45° increments, although in principle one could permit arbitrary rotations by simply training a higher capacity network on more examples.
A.2 String Editing Programming Language
Our modified string editing language, based on devlin2017robustfill , is defined as follows:
A.3 Training details
String Editing: We performed supervised pretraining for 24000 iterations with a batch size of 4000. We then performed REINFORCE for 12000 epochs with a batch size of 2000. Training took approximately two days with one P100 GPU. We use the Adam optimizer with default settings.
Our RobustFill baseline was a re-implementation of the “Attn-A” model from devlin2017robustfill . We implemented the “DP-beam” feature, wherein during test-time beam search, partial programs which lead to an output string which is not a prefix of the desired output string are removed from the beam. We trained for 50000 iterations with a batch size of 32. Training also took approximately two days with one P100 GPU.
Inverse CAD: We performed supervised pretraining with a batch size of 32, training on a random stream of CSG programs with up to 13 shapes, for approximately three days with one P100 GPU. We use the Adam optimizer with default settings. Over three days the 3D model saw approximately 1.7 million examples and the 2D model saw approximately 5 million examples. We fine-tuned the policy using REINFORCE and trained the value network for approximately 5 days on one P100 GPU. For each gradient step during this process we sampled random programs and performed policy rollouts against them. During reinforcement learning the 3D model saw approximately 0.25 million examples and the 2D model saw approximately 9 million examples.
For both domains, we performed no hyperparameter search. We expect that with some tuning, results could be marginally improved, but our goal is to design a general approach which is not sensitive to fine architectural details.
A.4 Data and test-time details
For both domains, we used a 2-minute timeout for testing for each problem, and repeatedly doubled the beam/number of particles until timeout is reached.
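The doubling schedule described above amounts to a simple any-time loop. The sketch below is illustrative only: `run_search` stands for one run of beam search or SMC at a given beam width or particle count, `score` for a quality measure such as IoU, and both names are assumptions. Note that in this simple form a run that starts just before the deadline is allowed to finish.

```python
import time
from typing import Callable, Optional


def anytime_search(run_search: Callable[[int], Optional[object]],
                   score: Callable[[object], float],
                   timeout_s: float,
                   initial_size: int = 1) -> Optional[object]:
    """Rerun the search with a doubling budget until the timeout expires."""
    deadline = time.monotonic() + timeout_s
    best, best_score = None, float("-inf")
    size = initial_size
    while time.monotonic() < deadline:
        candidate = run_search(size)  # one search with `size` beams or particles
        if candidate is not None:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        size *= 2                     # double the population for the next run
    return best
```

For the setting in the text, `timeout_s` would be 120 seconds per problem.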
We originally tested on string editing programs from nye2019learning (comprising training tasks from ellis2018learning and the test corpus from alur2016sygus ), but found that performance was near ceiling for our model (Figure 10). Thus, we designed our own dataset, as described in the main text. Generation code for this dataset can be found in our supplement, in the file generate_test_robust.py.
We generate a scene with up to $n$ objects by sampling a number between 1 and $n$, and then sampling a random CSG tree with that many objects. We then remove any subtrees that do not affect the final render (e.g., subtracting pixels from empty space). Our held-out test set is created by sampling 30 random scenes with up to 20 objects for 3D and 30 objects for 2D. Running python driver.py demo --maxShapes 30 using the attached supplemental source code will generate example random scenes. Figure 9 illustrates ten random 3-D/2-D scenes and contrasts different model outputs.
A.5 Architecture details
A.5.1 String Editing
For this domain, our neural architecture involves encoding each example state separately and pooling into a hidden state, which is used to decode the next action. To encode each example, we learn an embedding vector of size 20 for each character and apply it to each position in the input string, output string, committed string, and scratch string. For each character position, we concatenate these embedding vectors, additionally concatenating the values of the masks for that spatial position. We then perform a 1-d convolution with kernel size 5 across the character positions. Following zohar2018automatic , we concatenate the vectors for all the character positions, and pass this through a dense block with 10 layers and a growth rate of 128 to produce a hidden vector for a single example. We perform an average pooling on the hidden vector for each example. We then concatenate the resulting vector with a 32-dim embedding of the previous action and apply a linear layer, which results in the final state embedding, from which we decode the next action. Our value network is identical, except the final layer instead decodes a value.
A.5.2 Inverse CAD
The policy is a CNN followed by a pointer network which decodes into the next line of code. A pointer network vinyals2015pointer is an RNN that uses a differentiable attention mechanism to emit pointers, or indices, into a set of objects. Here the pointers index into the set of partial programs in the current state, which is necessary for the union and difference operators. Because the CNN takes as input the current REPL state – which may have a variable number of objects in scope – we encode each object with a separate CNN and sum their activations, i.e. a ‘Deep Set’ encoding zaheer2017deep . The value function is an additional ‘head’ to the pooled CNN activations.
Concretely the neural architecture has a spec encoder, which is a CNN inputting a single image, as well as a canvas encoder, which is a CNN inputting a single canvas in the REPL state, alongside the spec, as a two-channel image. The canvas encoder output activations are summed and concatenated with the spec encoder output activations to give an embedding of the state:
for , weight matrices.
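The pooling step described above (encode each canvas together with the spec, sum the results, then concatenate with the spec encoding) can be illustrated without any neural network. In the sketch below the two encoders are stand-in functions supplied by the caller, the learned weight matrices are omitted, and all names are assumptions for illustration.

```python
from typing import Callable, Iterable, List


def encode_state(canvases: Iterable,
                 spec,
                 canvas_encoder: Callable[[object, object], List[float]],
                 spec_encoder: Callable[[object], List[float]],
                 canvas_dim: int) -> List[float]:
    """Deep-Set-style state embedding: sum over canvases, concatenate the spec."""
    pooled = [0.0] * canvas_dim
    for canvas in canvases:
        # Each canvas is encoded alongside the spec, as with a two-channel image.
        for i, x in enumerate(canvas_encoder(canvas, spec)):
            pooled[i] += x
    # Summation makes the result independent of canvas order and count.
    return pooled + spec_encoder(spec)


# Usage with toy encoders over sets of filled pixels.
def toy_canvas_encoder(canvas, spec):
    return [float(len(canvas)), float(len(canvas & spec))]


def toy_spec_encoder(spec):
    return [float(len(spec))]


spec = frozenset({(0, 0), (0, 1), (1, 1)})
c1, c2 = frozenset({(0, 0)}), frozenset({(1, 1), (2, 2)})
embedding = encode_state([c1, c2], spec, toy_canvas_encoder, toy_spec_encoder, canvas_dim=2)
```

An empty set of canvases yields a zero vector for the pooled part, so the start state of the MDP needs no special casing.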
For the policy we pass the state encoding to a pointer network, implemented using a GRU with 512 hidden units and one layer, which predicts the next line of code. To attend to canvases, we use the output of the canvas encoder as the ‘key’ for the attention mechanism.
For the value function we pass the state encoding to an MLP with 512 hidden units and a hidden ReLU activation, and finally apply a negated ‘soft plus’ activation to the output to ensure that the logits output by the value network are nonpositive.
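The effect of the negated soft plus can be checked numerically. The sketch below assumes the nonpositive output is read as a log-probability, so that exponentiating it gives a success probability between 0 and 1; that reading, and the function names, are assumptions made here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def value_log_prob(raw_output: float) -> float:
    """Negated soft plus: maps any real number to a nonpositive value."""
    return -softplus(raw_output)


# Usage: a raw network output of 0 maps to log(1/2).
log_v = value_log_prob(0.0)
v = math.exp(log_v)
```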
2-D CNN architecture: The 2D spec encoder and canvas encoder both take as input images, passed through 4 layers of 3x3 convolution, with ReLU activations after each layer and 2x2 max pooling after the first two layers. The first 3 layers have 32 hidden channels and the output layer has 16 output channels.
3-D CNN architecture: The 3D spec encoder and canvas encoder both take as input voxel arrays, passed through 3 layers of 3x3 convolution, with ReLU activations after each layer and 4x4 max pooling after the first layer. The first 2 layers have 32 hidden channels and the output layer has 16 output channels.
No REPL baseline: Our “No REPL” baselines use the same CNN architecture to take as input the spec, and then use the same pointer network architecture to decode into the program, with the sole difference that, rather than attending over the CNN encodings of the objects in scope (which are hidden from this baseline), the pointer network attends over the hidden states produced at the time when each previously constructed object was brought into scope.