Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis

07/16/2020 ∙ by Kavi Gupta, et al. ∙ berkeley college DTU 9

The use of deep learning techniques has achieved significant progress for program synthesis from input-output examples. However, when the program semantics become more complex, it still remains a challenge to synthesize programs consistent with the specification. In this work, we propose SED, a neural program generation framework that incorporates synthesis, execution, and debugging stages. Instead of purely relying on the neural program synthesizer to generate the final program, SED first produces initial programs using the neural program synthesizer component, then utilizes a neural program debugger to iteratively repair the generated programs. The integration of the debugger component enables SED to modify the programs based on the execution results and specification, which resembles the coding process of human programmers. On Karel, a challenging input-output program synthesis benchmark, SED reduces the error rate of the neural program synthesizer itself by at least 7.5%, and outperforms the standard beam search for decoding.




1 Introduction

Program synthesis is a fundamental problem that has attracted a lot of attention from both the artificial intelligence and programming languages communities, with the goal of generating a program that satisfies the given specification manna1971toward; gulwani2012spreadsheet. One of the most popular forms of specification is to provide a few examples of program inputs and the desired outputs, which has been studied in various domains, including string manipulation gulwani2012spreadsheet; Devlin2017RobustFillNP and graphics applications ellis2018learning; Ellis2019WriteEAExtendExecution.

There has been an emerging line of work studying deep learning techniques for program synthesis from input-output examples Devlin2017RobustFillNP; vijayakumar2018neural; bunel2018leveraging; chen2018execution. Some recent work demonstrates that a large performance gain can be achieved by properly leveraging the execution results to guide the program synthesis process shin2018improving; sun2018neural; Zohar2018AutomaticPSExtendExecution; chen2018execution; Ellis2019WriteEAExtendExecution. In particular, chen2018execution; Zohar2018AutomaticPSExtendExecution; Ellis2019WriteEAExtendExecution propose to make predictions based on the intermediate results obtained by executing the partially generated programs. This scheme is well-suited for sequential programming tasks Zohar2018AutomaticPSExtendExecution; Ellis2019WriteEAExtendExecution; however, for programs that contain loops, the execution results of partial programs are not always informative chen2018execution. Furthermore, existing work on program synthesis generates the entire program from scratch without further editing, even if the predicted program is already inconsistent with the input-output specification and thus incorrect. In contrast, after observing wrong program outputs, human programmers would go through the written code and attempt to fix the program fragments that cause the issue.

Inspired by the trial-and-error human coding procedure, we propose SED, which augments existing neural program synthesizers with a neural debugger component to repair the generated programs. Given the input-output specification, SED first synthesizes candidate programs with a neural program generation model. Next, SED executes the predicted programs and checks whether any of them satisfies the input-output examples. If none of them does, SED proceeds to the debugging stage, where it selects the most promising program from the candidates, then generates editing operations with a neural network component acting as the debugger. The neural program debugger iteratively modifies the program until it passes the specification or reaches the maximal number of editing iterations. Besides the syntactic information of candidate programs as token sequences, our debugger also leverages the semantic information provided by their execution traces, which helps it fix semantic errors.

We evaluate SED on the Karel benchmark bunel2018leveraging; devlin2017neural, where the programs satisfying input-output examples could include control flow constructs such as conditionals and loops. With different choices of the neural program synthesis model, SED consistently improves the performance of the synthesizer itself by a large margin of at least 7.5%. Meanwhile, when the synthesizer performs greedy decoding and only provides a single program for editing, SED outperforms the standard beam search applied to the synthesizer, which further demonstrates the effectiveness of our SED framework.

2 Problem Setup

In this section, we present the setup of the input-output program synthesis problem, which is the main focus of this work. We will also introduce the program repair problem handled by our neural debugger component.

Program synthesis from input-output examples. In a standard input-output program synthesis problem Devlin2017RobustFillNP; bunel2018leveraging, the synthesizer is provided with a set of input-output pairs, which serves as the specification of the desired program semantics. Given this specification, the goal of the program synthesizer is to generate a program in a domain-specific language (DSL) that produces the same output as the ground truth program on any valid input. In practice, besides the specification examples, usually another set of “held out” input-output examples is generated to verify the equivalence between the generated and ground truth programs, which could be imprecise due to the typically small test set.

Program repair with input-output specification. In our program repair setting, besides the same input-output specification given for the synthesis problem above, a buggy program is also provided, which is inconsistent with the specification. The goal is to perform editing operations, so that the modified program becomes correct.

Intuitively, program repair is no more difficult than the program synthesis problem, as the editing process can completely remove the provided program and synthesize a new program. However, it is usually beneficial to utilize the provided buggy program, especially when it is close to a correct program Shin2018TowardsSP.

Karel domain. The Karel programming language goes back to the 1980s as an introductory language for Stanford CS students karel, and some recent work has proposed deep learning approaches with this domain as a test bed bunel2018leveraging; shin2018improving; chen2018execution; Shin2018TowardsSP; Shin2019SyntheticDF. A program in this language describes the movements of a robot inside a 2D grid world, and we present a sample program in Figure 1. In the grids, the arrows represent the robot, the grey cells are obstacles that cannot be manipulated, and the dots are markers. Besides an action set for the robot, the Karel language also includes control flow constructs, i.e., if, repeat and while. The full grammar specification is discussed in Appendix B.

3 SED: Synthesize, Execute and Debug

Figure 1: A sample debugging process of SED. Given the input-output examples, the synthesizer provides a wrong program that misses the repeat-loop in the ground truth. Our debugger then performs a series of edits, which results in a correct program. Note that the INSERT operation does not advance the pointer in the input program, so several edits are applied to the move token.

In this section, we demonstrate SED, which learns to debug for neural program synthesis. In the synthesis phase, SED uses a neural program synthesizer to produce candidate programs. When the programs do not satisfy the specification according to the execution results, a debugging phase is required before providing the final program for evaluation. Figure 1 provides an example of how SED works. In the following, we first present the neural network architecture, then describe the training and inference procedures.

3.1 Synthesizer

Our SED framework is largely agnostic to the choice of the synthesizer, as long as it achieves non-trivial prediction performance, so that it is beneficial to leverage its predicted programs for the debugger component. In particular, SED is compatible with existing neural program synthesis models that largely employ encoder-decoder architectures bunel2018leveraging; chen2018execution; Devlin2017RobustFillNP. A common model architecture for input-output program synthesis includes an encoder to embed the input-output pairs, which could be an LSTM for string manipulation tasks Devlin2017RobustFillNP, or a convolutional neural network for our Karel task bunel2018leveraging; chen2018execution; shin2018improving. Then, an LSTM decoder generates the program based on the input embedding.

3.2 Debugger

Figure 2: The neural debugger model in SED. The encoder consists of three parts: (1) IOEmbed for I/O embedding; (2) TraceEmbed that convolves the traces with their corresponding I/O pairs; and (3) ProgramEncoder that jointly embeds each program token with its corresponding execution steps in the trace. EditDecoder is used for generating edits. We outline our proposed TraceEmbed component in red dots, which is the key architectural difference compared to Shin2018TowardsSP.

We present the debugger architecture in Figure 2. We follow previous work for the Karel domain to use a convolutional neural network for I/O embedding, a bi-directional LSTM to encode the program for debugging, and an LSTM to sequentially generate the edit operation for each input program token Shin2018TowardsSP. The debugger supports 4 types of edit operations: KEEP copies the current program token to the output; DELETE removes the current program token; INSERT[t] adds a program token t; and REPLACE[t] replaces the current program token with t. Therefore, the total number of edit operations is 2|V| + 2, where |V| is the Karel vocabulary size. For KEEP, REPLACE and DELETE, the LSTM moves on to process the next program token after the current edit operation, while for INSERT, the next edit operation is still based on the current program token, as shown in Figure 1.
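To make the pointer semantics of the four edit operations concrete, the following is a minimal Python sketch of how an edit sequence could be applied to a token sequence. The function names, the tuple encoding of edits, and the example token spellings are illustrative assumptions rather than the authors' implementation; the handling of unconsumed tokens at the end is also an assumption.

```python
from typing import List, Optional, Tuple

# An edit is (operation, token); the token is None for KEEP and DELETE.
# This encoding is a hypothetical choice made for the sketch.
Edit = Tuple[str, Optional[str]]


def num_edit_ops(vocab_size: int) -> int:
    """KEEP and DELETE take no argument; INSERT and REPLACE take one token each."""
    return 2 + 2 * vocab_size


def apply_edits(program: List[str], edits: List[Edit]) -> List[str]:
    """Apply an edit sequence to a program given as a list of tokens."""
    output: List[str] = []
    pointer = 0  # index of the current input program token
    for op, token in edits:
        if op == "INSERT":
            # INSERT emits a token but does not advance the input pointer.
            output.append(token)
            continue
        if pointer >= len(program):
            raise ValueError(f"{op} issued past the end of the input program")
        if op == "KEEP":
            output.append(program[pointer])
        elif op == "REPLACE":
            output.append(token)
        elif op != "DELETE":
            raise ValueError(f"unknown edit operation: {op}")
        pointer += 1  # KEEP, REPLACE and DELETE move to the next token
    # Assumption of this sketch: tokens left unconsumed are copied unchanged.
    output.extend(program[pointer:])
    return output


# Illustrative example: a missing repeat-loop is inserted around `move`.
# The token spellings are placeholders, not the exact Karel vocabulary.
buggy = ["DEF", "run", "m(", "move", "m)"]
edits = [
    ("KEEP", None), ("KEEP", None), ("KEEP", None),
    ("INSERT", "REPEAT"), ("INSERT", "R=3"), ("INSERT", "r("),
    ("KEEP", None),          # the `move` token receives several edits
    ("INSERT", "r)"),
    ("KEEP", None),
]
repaired = apply_edits(buggy, edits)
```

Because INSERT leaves the pointer in place, the three inserted loop tokens and the following KEEP are all decided while the decoder is positioned at the same `move` token.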

The input program serves as a cue to the desired syntactic structure; however, it may not be sufficient to reveal the semantic errors. Motivated by the breakpoint support in Integrated Development Environments (IDEs) for debugging, we propose an execution trace embedding technique and incorporate it into the original debugger architecture, as highlighted in Figure 2. Specifically, we first execute the input program on each input grid , and obtain the execution trace , where . For each state , we use a convolutional neural network for embedding:



means the concatenation of vectors

and .

To better represent the correspondence between each program token and the execution states it affects, we construct a bipartite graph , where is the set , and is the set of program token indices. We set iff the program token was either executed to produce ; or initiates a loop, e.g., repeat, and the body of that loop produces when executed. For each program token , let be the weight of each edge connecting and its associated execution state, we compute a vector representation of its related execution states below:


Finally, the program token representation fed into the edit decoder is , where is the original program token embedding computed by the bi-directional program encoder.

3.3 Training

We design a two-stage training process for the debugger, as discussed below.

Stage 1: Pre-training with synthetic program mutation. We observe that if we directly train the debugger with the predicted programs of the synthesizer, the training hardly makes progress. One main reason is that a well-trained synthesizer makes wrong predictions for only a limited fraction of the training samples, which results in a small training set for the debugger model. Moreover, although the synthesizer could produce a program close to one that satisfies the input-output specification, it is mostly distinct from the annotated ground truth program, as indicated in our evaluation. Therefore, we build a synthetic program repair dataset in the same way as Shin2018TowardsSP to pre-train the debugger. Specifically, for each sample in the original Karel dataset, we randomly apply several mutations to the ground truth program to generate an alternative program. Note that the mutations operate on the AST, thus the edit distance between the token sequences of the mutated and ground truth programs may be larger than the number of mutations, as shown in Figure 4. We defer more details on mutations to Appendix D. We generate an edit sequence that transforms the mutated program into the ground truth program, then train the debugger with the standard cross-entropy loss using this edit sequence as supervision.
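The text does not specify how the supervising edit sequence is computed. As one possible illustration, the sketch below derives a KEEP/DELETE/INSERT/REPLACE sequence from a token-level alignment produced by Python's standard `difflib.SequenceMatcher`. This alignment method, the function names, and the token spellings are assumptions of the sketch and may differ from what the authors used.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Edit = Tuple[str, Optional[str]]  # hypothetical (operation, token) encoding


def edit_sequence(source: List[str], target: List[str]) -> List[Edit]:
    """Derive edits that turn `source` (mutated) into `target` (ground truth)."""
    edits: List[Edit] = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            edits.extend(("KEEP", None) for _ in range(i1, i2))
        elif tag == "delete":
            edits.extend(("DELETE", None) for _ in range(i1, i2))
        elif tag == "insert":
            edits.extend(("INSERT", tok) for tok in target[j1:j2])
        else:  # "replace": pair tokens up, then insert or delete the remainder
            paired = min(i2 - i1, j2 - j1)
            edits.extend(("REPLACE", tok) for tok in target[j1:j1 + paired])
            edits.extend(("INSERT", tok) for tok in target[j1 + paired:j2])
            edits.extend(("DELETE", None) for _ in range(i2 - i1 - paired))
    return edits


def apply_edits(program: List[str], edits: List[Edit]) -> List[str]:
    """Replay the edits; INSERT does not advance the input pointer."""
    output: List[str] = []
    pointer = 0
    for op, token in edits:
        if op == "INSERT":
            output.append(token)
            continue
        if op == "KEEP":
            output.append(program[pointer])
        elif op == "REPLACE":
            output.append(token)
        pointer += 1  # KEEP, REPLACE and DELETE consume one input token
    return output


# Placeholder token spellings, for illustration only.
ground_truth = ["DEF", "run", "m(", "REPEAT", "R=3", "r(", "move", "r)", "m)"]
mutated = ["DEF", "run", "m(", "move", "turnLeft", "m)"]
supervision = edit_sequence(mutated, ground_truth)
```

Replaying `supervision` on `mutated` with `apply_edits` reproduces `ground_truth`, which is a convenient sanity check when building such a dataset.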

Stage 2: Fine-tuning with the neural program synthesizer. After pre-training, we fine-tune the model with the incorrect programs produced by the neural program synthesizer. Specifically, we run the decoding process using the synthesizer model on the Karel training set, then use those wrong predictions to train the debugger.

3.4 Inference Procedure

1:function ()
2:      “Fringe” of the search space: programs yet to be expanded
3:      Already expanded programs
4:     for  do
7:         if  then
8:              return Success
9:         end if
11:     end for
12:     return Probable failure, unless the program was found on the final step
13:end function
Algorithm 1 Best first search

During inference, we achieve the best results using a best first search, as described in Algorithm 1. In the algorithm, the synthesizer model produces a list of candidate programs for the given input-output examples, and the debugger model produces a list of candidate programs given an input program. An execution function runs a program on the examples and returns a value in [0, 1] representing the proportion of input-output examples that are satisfied.

Within our SED framework, we view program synthesis from specification as a search on an infinite tree, where every node except the root is annotated with a program, and its children are the programs that the debugger produces from it. Our goal is to search for a program that satisfies all the given input-output examples. While passing these examples does not ensure that the generated program is semantically correct, as they do not include the held-out test cases, it is a necessary condition, and we find it sufficient as the termination condition of our search process.

We design two search algorithms for SED. Our first algorithm is a greedy search, which iteratively selects the program from the beam output of the previous edit iteration that passes the greatest number of input-output examples (and has not yet been further edited by the debugger), and returns the edited program when it passes all input-output examples, or when it reaches the maximal number of edit operations allowed. See Algorithm 2 in Appendix C for more details.

A more effective scheme employs a best-first search. Compared to the greedy search, this search algorithm keeps track of all the programs encountered, as shown in line 10 of Algorithm 1, so that it can fall back to the debugger output from earlier edit iterations rather than get stuck, when none of the programs from the current edit iteration is promising.
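The best-first scheme described above can be sketched in Python as follows. This is a hedged illustration based only on the prose description: the callables `synthesize`, `debug` and `run`, the use of a heap for the fringe, and the toy domain at the end are all assumptions introduced here, not the authors' code.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional, Sequence, Set, Tuple

Example = Tuple[object, object]  # (input, expected output)


def pass_rate(program, examples: Sequence[Example], run: Callable) -> float:
    """Proportion of input-output examples that the program satisfies."""
    passed = sum(1 for inp, out in examples if run(program, inp) == out)
    return passed / len(examples)


def best_first_search(
    examples: Sequence[Example],
    synthesize: Callable[[Sequence[Example]], Iterable[Hashable]],
    debug: Callable[[Hashable, Sequence[Example]], Iterable[Hashable]],
    run: Callable,
    max_steps: int,
) -> Optional[Hashable]:
    fringe: List[Tuple[float, int, Hashable]] = []  # programs yet to be expanded
    seen: Set[Hashable] = set()                     # every program encountered
    counter = 0                                     # tie-breaker for the heap

    def visit(program) -> bool:
        """Record a program; return True if it passes all examples."""
        nonlocal counter
        if program in seen:
            return False
        seen.add(program)
        score = pass_rate(program, examples, run)
        if score == 1.0:
            return True
        heapq.heappush(fringe, (-score, counter, program))  # max-heap by score
        counter += 1
        return False

    for candidate in synthesize(examples):
        if visit(candidate):
            return candidate
    for _ in range(max_steps):
        if not fringe:
            break
        # Expand the most promising program seen so far, from any iteration.
        _, _, program = heapq.heappop(fringe)
        for candidate in debug(program, examples):
            if visit(candidate):
                return candidate
    return None


# Toy domain (purely illustrative): a "program" (a, b) computes a * x + b.
def toy_run(program, x):
    return program[0] * x + program[1]


def toy_debug(program, examples):
    a, b = program
    return [(a + 1, b), (a, b + 1), (a - 1, b), (a, b - 1)]


toy_examples = [(0, 1), (1, 3), (2, 5)]  # generated by 2 * x + 1
found = best_first_search(
    toy_examples, lambda ex: [(0, 0), (1, 1)], toy_debug, toy_run, max_steps=10
)
```

Keeping the whole fringe in a priority queue is what lets the search return to a program from an earlier edit iteration when the latest debugger outputs all score poorly.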

4 Evaluation

In this section, we demonstrate the effectiveness of SED for Karel program synthesis and repair. We first discuss the evaluation setup, then present the results.

4.1 Evaluation Setup

The Karel benchmark devlin2017neural; bunel2018leveraging is one of the largest publicly available input-output program synthesis datasets, and includes 1,116,854 samples for training, 2,500 examples in the validation set, and 2,500 test examples. Each sample is provided with a ground truth program, 5 input-output pairs as the specification, and an additional one as the held-out test example. We follow prior work bunel2018leveraging; shin2018improving; chen2018execution to evaluate the following metrics: (1) Generalization. The predicted program is said to generalize if it passes all 6 input-output pairs during testing. This is the primary metric we consider. (2) Exact match. The predicted program is an exact match if it is the same as the ground truth.
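The two metrics can be computed as in the short Python sketch below, which reports them as error percentages in the style of Table 1. The sample layout (a dict with `examples` and `gold` keys), the `run` callable, and the toy domain are hypothetical and only serve the illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def generalizes(program, examples: Sequence[Tuple[object, object]], run: Callable) -> bool:
    """True if the program passes every pair (specification plus held-out)."""
    return all(run(program, inp) == out for inp, out in examples)


def error_rates(predictions: List, samples: List[Dict], run: Callable) -> Tuple[float, float]:
    """Return (generalization error %, exact match error %)."""
    total = len(samples)
    gen_failures = sum(
        1 for pred, s in zip(predictions, samples)
        if not generalizes(pred, s["examples"], run)
    )
    exact_failures = sum(
        1 for pred, s in zip(predictions, samples) if pred != s["gold"]
    )
    return 100.0 * gen_failures / total, 100.0 * exact_failures / total


# Toy domain (illustrative): a program is a tuple of increments applied to x.
def toy_run(program, x):
    return x + sum(program)


gold = (1, 2)
sample = {"gold": gold, "examples": [(x, x + 3) for x in range(6)]}  # 6 pairs
samples = [sample] * 4
predictions = [(1, 2), (3,), (4,), (1, 2)]
gen_err, exact_err = error_rates(predictions, samples, toy_run)
```

In the toy data, the prediction `(3,)` behaves identically to the gold program without matching it token for token, so it counts as generalizing but not as an exact match. This is the same gap between the two metrics that the evaluation reports.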

Program repair. In addition to the Karel program synthesis task introduced above, we also evaluate our debugger component on the mutation benchmark in Shin2018TowardsSP. Specifically, to construct the test set, for each sample in the original Karel test set, we first obtain 5 programs by randomly applying 1 to 5 mutations starting from the ground truth program, then we generate 5 test samples for program repair, where each sample includes one of these mutated programs as the program to repair, together with the same input-output pairs and ground truth program as the original Karel test set.

4.2 Synthesizer Details

We consider two choices of synthesizers. The first synthesizer is LGRL bunel2018leveraging, which employs a standard encoder-decoder architecture as discussed in Section 3.1. During inference, we apply a beam search. We also evaluate a variant that performs greedy decoding, i.e., a beam size of 1, denoted as LGRL-GD. The second synthesizer is the execution-guided neural program synthesis model proposed in chen2018execution, denoted as EGNPS. The model architecture of EGNPS is similar to LGRL, but it leverages the intermediate results obtained by executing partial programs to guide the subsequent synthesis process. We present the performance of these synthesizers in the first row (“Synthesizer only”) of Table 1.

4.3 Debugger Details

We compare our debugger architecture, which incorporates the trace embedding component, to the baseline in Shin2018TowardsSP, and we refer to ours and the baseline as TraceEmbed and No TraceEmbed respectively. For the program synthesis task, all models are pre-trained on the training set of the synthetic mutation benchmark with 1-3 mutations. For the fine-tuning results, LGRL and LGRL-GD are fine-tuned with their synthesized programs, as discussed in Section 3.3. For EGNPS, we evaluate the debugger fine-tuned with LGRL, because EGNPS decoding executes all partial programs generated in the beam at each step, which imposes a high computational cost when evaluating the model on the training set.

4.4 Results

Mutation benchmark for program repair.

Figure 3 shows the results on the mutation benchmark. For each debugger architecture, we train one model with programs generated using 1-3 mutations, and another one with 1-5 mutations. Our most important observation is that the debugger demonstrates a good out-of-distribution generalization performance. Specifically, when evaluating on 4-5 mutations, although the performance of models trained only on 1-3 mutations are worse than models trained on 1-5 mutations, they are already able to repair around programs with 5 mutations, which is desirable when adapting the model for program synthesis. On the other hand, each model achieves better test performance when trained on a similar distribution. For example, models trained on 1-3 mutations achieve better performance when evaluating on 1-3 mutations than those trained on 1-5 mutations.

Meanwhile, for each number of mutations, the lowest error is achieved by the model with our TraceEmbed component, demonstrating that leveraging execution traces is helpful. However, such models tend to overfit more to the training data distribution, potentially due to the larger model sizes.

Figure 3: Results on the mutation benchmark, where the x-axis indicates the number of mutations used to generate the programs for repair in the test set. In the legend, “3” refers to models trained on 1-3 mutations, and “5” refers to models trained on 1-5 mutations.

The Karel program synthesis benchmark.

Synthesizer+Debugger       | LGRL-GD         | LGRL            | EGNPS
Synthesizer only           | 39.00% (65.72%) | 22.00% (63.40%) | 19.40% (56.24%)
No TraceEmbed+No Finetune  | 18.72% (63.12%) | 14.92% (62.80%) | 11.68% (54.16%)
No TraceEmbed+Finetune     | 16.20% (60.92%) | 14.32% (62.48%) | 11.52% (53.68%)
TraceEmbed+No Finetune     | 18.80% (63.76%) | 14.60% (62.88%) | 11.48% (54.12%)
TraceEmbed+Finetune        | 16.32% (61.28%) | 14.28% (62.68%) | 11.36% (53.52%)
Table 1: Results on the test set for Karel program synthesis, where we present the generalization error with exact match error in parentheses for each synthesizer / debugger combination.

Table 1 presents our main results for program synthesis, where the debugger runs 100 edit steps. Firstly, SED consistently boosts the performance of the neural program synthesizer it employs, reducing the generalization error by at least 7.5%. In particular, with LGRL-GD as the synthesizer, SED significantly outperforms LGRL without the debugger, which shows that the iterative debugging performed by SED is more effective than the standard beam search. Meanwhile, with EGNPS as the synthesizer, even though the synthesizer already leverages the execution information to guide the synthesis, SED still provides additional performance gain, which confirms the benefits of incorporating the debugging stage for program synthesis.

Figure 4: Left: The distribution of edit distances for the mutation benchmark by number of mutations. Middle and right: The joint distributions of the edit distances between the initial program predicted by the LGRL synthesizer (init), the gold program, and the program predicted by SED that passes all IO cases (pred). Dashed lines correspond to .

To understand how SED repairs the synthesizer predictions, Figure 4 shows the distribution of the edit distances between the initial and ground truth programs in the pre-training dataset (leftmost), and several distributions of edit distances among the ground truth programs, the programs predicted by the synthesizer, and the programs repaired by SED that are semantically correct (middle and right). The debugger is a TraceEmbed model without fine-tuning, and it performs 100 editing steps. Firstly, we observe in the middle graph that SED tends to repair the initial program towards a correct one that is closer to the prediction than the ground truth, and we also provide an example in Figure 1. The rightmost graph further shows that the repaired programs are generally closer to the initial programs than to the gold ones, which is also the reason why the improvement in exact match achieved by SED is much smaller than the improvement in the generalization metric. Comparing these distributions to the leftmost graph, we note that without fine-tuning, SED is already able to repair initial programs that not only include semantic errors that might not correspond to the mutations it is trained on, but also have larger edit distances to the ground truth than the training samples, which again demonstrates the generalizability of SED.

Figure 5: Comparison of different architectures and training process for program synthesis. TE refers to TraceEmbed, and F refers to fine-tuning on the data generated by the same synthesizer. Note the logarithmic scale of the axis.
Figure 6: Comparison of best first and greedy search strategies. All models use TraceEmbed+Finetune as defined in Table 1.

Next, we discuss the effects of our trace embedding component and fine-tuning, and we further present the results with different numbers of edit steps in Figure 5. We observe that fine-tuning improves the results across the board, and has a particularly pronounced effect for LGRL-based models, where the data source for fine-tuning comes from the same synthesizer. Meanwhile, the debugger accompanied with the trace embedding mostly achieves better performance, especially when fine-tuning is not performed. This is potentially because the trace embedding component introduces additional model parameters, thus the model could suffer more from overfitting due to the small training set for fine-tuning.

Figure 6 compares the best first and greedy search strategies. We see that best first search always outperforms greedy search, often being able to achieve a similar performance in half the number of edit steps. This effect is more pronounced for LGRL and EGNPS synthesizers, as they provide more than one program to start with, which best first search can more effectively exploit.

5 Related Work

Program synthesis from input-output examples. Program synthesis from input-output specification is a long-standing challenge with many applications Lieberman2001YourWI; gulwani2012spreadsheet; MuggletonLPT14, and recent years have witnessed significant progress achieved by deep learning approaches balog2016deepcoder; parisotto2016neuro; Devlin2017RobustFillNP; bunel2018leveraging. Different domain-specific languages (DSLs) have been investigated, such as AlgoLISP polosukhin2018neural; Zohar2018AutomaticPSExtendExecution for array manipulation, FlashFill parisotto2016neuro; Devlin2017RobustFillNP; vijayakumar2018neural for string transformation, and Karel bunel2018leveraging; shin2018improving; chen2018execution studied in this work. While most existing work only uses the execution results to post-select among a set of candidate programs predicted by a synthesizer, some recent work leverages more fine-grained semantic information such as the intermediate execution states to improve the synthesizer performance sun2018neural; Zohar2018AutomaticPSExtendExecution; chen2018execution; Ellis2019WriteEAExtendExecution. In our evaluation, we demonstrate that SED further provides performance gain by leveraging execution results to repair the synthesized programs.

Program repair. There has been a line of work studying neural networks for program repair Gupta2017DeepFixFC; Wang2018DynamicNP; Shin2018TowardsSP; vasic2019neural; dinella2020hoppity. While most of these works focus on syntactic error correction, Wang2018DynamicNP predicts the semantic error types for programming submissions, using execution traces to learn the program embedding. Shin2018TowardsSP studies program repair on Karel where the wrong programs are generated with synthetic mutation, and we use its model as a baseline for our debugger component. Meanwhile, iterative repair is used as part of the decompilation pipeline in Fu2019CodaAE. In this work, our SED framework incorporates the program repair scheme for input-output program synthesis, where the relationship between the program and the specification is typically complex.

6 Conclusion

Program synthesis and program repair have typically been considered as largely different domains. In this work, we present the SED framework, which incorporates a debugging process for program synthesis, guided with execution results. The iterative repair process of SED outperforms the beam search when the synthesizer employs the greedy decoding, and it significantly boosts the performance of the synthesizer alone, even if the synthesizer already employs a search process or incorporates the execution information. Additionally, we found that even though there is a program aliasing problem for supervised training, our two-stage training scheme alleviates this problem, and achieves strong generalization performance. Our SED framework could potentially be extended to a broad range of specification-guided program synthesis applications, and we consider it as future work.


This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915, Berkeley DeepDrive, and DARPA D3M under Grant No. FA8750-17-2-0091. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Appendix A Hyperparameters

A.1 LGRL Training

The LGRL model was trained with a learning rate of 1, that decayed by 50% every

steps on 50 epochs of the Karel training dataset, using minibatch SGD with batch size 128 and gradient clipping with magnitude 1. It was run in greedy decoding mode to form the LGRL-GD synthesizer, and run using beam search with

beams to form the LGRL synthesizer.

A.2 EGNPS Model

The EGNPS model was trained for 10 epochs on the Karel dataset, with a learning rate of , and the batch size is 16. See chen2018execution for more details. During the inference time, it was run in the search mode with a beam size of 64.

A.3 Debugger Training

The debugger was trained with a learning rate of 1, that decayed by 50% every steps on 50 epochs of the Karel training dataset using random mutations, sampled with probability proportional to the number of mutations. Minibatch SGD was used with a batch size of 64, and gradient clipping with magnitude 1. The models were finetuned on examples from the training dataset that were incorrect, also for 50 epochs, with a learning rate of .

A.4 TraceEmbed Architecture

The TraceEmbed unit is a residual convolutional network. The input, output, and trace grids are stacked along the channel axis, thus preserving locality in space while allowing features and which grid to be fully connected. The network is composed of an initial convolution that takes the

channels of the input grids to 64 channels, then three ResNet blocks, each of which consists of two layers of batch normalization, ReLU, and convolution, followed by a residual connection. All convolutions are

with a padding of 1. The last layer is a fully connected layer that flattens the entire grid into an embedding of size

(the same size as a program token embedding).

Appendix B More Descriptions of the Karel Domain

Figure 7 presents the grammar specification of the Karel DSL. Specifically, the DSL describes the movements of a robot inside a grid whose size ranges from 2x2 to 18x18, where each cell contains between 0 and 10 markers. These movements are described with move, turnLeft and turnRight, and the interactions with the markers are pickMarker and putMarker. The language contains control flow constructs such as conditionals and while and repeat loops, with conditions on whether the front, left, or right of the robot is clear and whether markers are present, as well as their negations. Each cell of the grid is represented as a 15-dimensional vector corresponding to the features described in Table 2.

Figure 7: Grammar for the Karel task.
Robot facing North
Robot facing East
Robot facing South
Robot facing West
Grid boundary
1 marker
2 markers
3 markers
4 markers
5 markers
6 markers
7 markers
8 markers
9 markers
10 markers
Table 2: Representation of each cell in the Karel state.
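As an illustration of the per-cell representation, the sketch below builds a 15-dimensional binary vector per cell with one entry per row of Table 2. The ordering of the features follows the order of the table, and the function and argument names are assumptions of this sketch; the actual dataset encoding may order or name the channels differently.

```python
from typing import List, Optional

# Feature order assumed to follow Table 2:
# 0-3: robot facing North/East/South/West, 4: grid boundary, 5-14: 1..10 markers.
DIRECTIONS = ("north", "east", "south", "west")
NUM_FEATURES = 15


def encode_cell(robot_facing: Optional[str], is_boundary: bool, markers: int) -> List[int]:
    """Encode one grid cell as a 15-dimensional binary feature vector."""
    if not 0 <= markers <= 10:
        raise ValueError("a cell holds between 0 and 10 markers")
    features = [0] * NUM_FEATURES
    if robot_facing is not None:
        features[DIRECTIONS.index(robot_facing)] = 1
    if is_boundary:
        features[4] = 1
    if markers > 0:
        features[4 + markers] = 1  # 1 marker -> index 5, ..., 10 markers -> index 14
    return features


def encode_grid(cells: List[List[dict]]) -> List[List[List[int]]]:
    """Encode a grid given as rows of dicts with the keys used by encode_cell."""
    return [[encode_cell(**cell) for cell in row] for row in cells]


empty = {"robot_facing": None, "is_boundary": False, "markers": 0}
grid = [
    [{"robot_facing": "east", "is_boundary": False, "markers": 2}, empty],
    [{"robot_facing": None, "is_boundary": True, "markers": 0}, empty],
]
encoded = encode_grid(grid)  # shape: 2 rows x 2 columns x 15 features
```

A cell with no robot, no boundary, and no markers maps to the all-zero vector under this encoding.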

Appendix C Full Greedy Algorithm

The full greedy algorithm is in Algorithm 2.

1:function ()
3:      Already expanded programs
4:     if  then
5:         return Success
6:     end if
7:     for  do
10:         if  then
11:              return Success
12:         end if
13:     end for
14:     return Failure
15:end function
Algorithm 2 Greedy search algorithm
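For comparison with best first search, the greedy strategy described in Section 3.4 can be sketched in Python as below. It follows the prose description only: the callables, the return convention (the program on success, `None` on failure), and the toy domain are assumptions of the sketch rather than a transcription of Algorithm 2.

```python
from typing import Callable, Hashable, Iterable, Optional, Sequence, Set, Tuple

Example = Tuple[object, object]  # (input, expected output)


def pass_rate(program, examples: Sequence[Example], run: Callable) -> float:
    """Proportion of input-output examples that the program satisfies."""
    passed = sum(1 for inp, out in examples if run(program, inp) == out)
    return passed / len(examples)


def greedy_search(
    examples: Sequence[Example],
    synthesize: Callable[[Sequence[Example]], Iterable[Hashable]],
    debug: Callable[[Hashable, Sequence[Example]], Iterable[Hashable]],
    run: Callable,
    max_steps: int,
) -> Optional[Hashable]:
    expanded: Set[Hashable] = set()  # programs already edited by the debugger
    candidates = list(synthesize(examples))
    for step in range(max_steps + 1):
        # Only the most recent beam is considered, minus already-edited programs.
        fresh = [p for p in candidates if p not in expanded]
        if not fresh:
            return None
        best = max(fresh, key=lambda p: pass_rate(p, examples, run))
        if pass_rate(best, examples, run) == 1.0:
            return best
        if step == max_steps:
            break  # edit budget exhausted
        expanded.add(best)
        candidates = list(debug(best, examples))
    return None


# Toy domain (purely illustrative): a "program" (a, b) computes a * x + b.
def toy_run(program, x):
    return program[0] * x + program[1]


def toy_debug(program, examples):
    a, b = program
    return [(a + 1, b), (a, b + 1), (a - 1, b), (a, b - 1)]


toy_examples = [(0, 1), (1, 3), (2, 5)]  # generated by 2 * x + 1
result = greedy_search(toy_examples, lambda ex: [(0, 0)], toy_debug, toy_run, max_steps=3)
```

Unlike best first search, this loop discards every candidate outside the latest debugger output, so it cannot fall back to a promising program from an earlier iteration.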

Appendix D Mutations

There are six types of mutations that we consider, identical to the ones used in Shin2018TowardsSP. Three mutations, insert, delete, and replace, each take a node and either insert an action next to it, delete it, or replace it with some action. The wrap mutation wraps a series of nodes in a control construct specified by a control type and a control value, which is a conditional for if/while and a number for repeat. The unwrap mutation takes a node whose root is a control construct and replaces it with its body. The replaceControl mutation takes a node whose root is a control construct and replaces its control value or number of repetitions with an appropriately typed control value. Each mutation maintains the syntactic validity of the program.
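The six mutations can be illustrated on a toy AST, as in the Python sketch below. The AST encoding (a list whose items are action strings or `(control_type, control_value, body)` tuples), the function signatures, and the token spellings are all hypothetical; the sketch only mirrors the behaviour described above and is not the benchmark's implementation.

```python
from typing import List, Tuple, Union

# A body is a list of nodes; a node is an action name or a control construct.
Node = Union[str, Tuple[str, object, list]]


def insert(body: List[Node], index: int, action: str) -> List[Node]:
    return body[:index] + [action] + body[index:]


def delete(body: List[Node], index: int) -> List[Node]:
    return body[:index] + body[index + 1:]


def replace(body: List[Node], index: int, action: str) -> List[Node]:
    return body[:index] + [action] + body[index + 1:]


def wrap(body: List[Node], start: int, end: int, control_type: str, control_value) -> List[Node]:
    """Wrap body[start:end] in a control construct such as ("repeat", 3, [...])."""
    return body[:start] + [(control_type, control_value, body[start:end])] + body[end:]


def unwrap(body: List[Node], index: int) -> List[Node]:
    """Replace the control construct at `index` with its body."""
    _, _, inner = body[index]
    return body[:index] + list(inner) + body[index + 1:]


def replace_control(body: List[Node], index: int, new_value) -> List[Node]:
    control_type, _, inner = body[index]
    return body[:index] + [(control_type, new_value, inner)] + body[index + 1:]


def to_tokens(body: List[Node]) -> List[str]:
    """Flatten the AST into a token sequence (placeholder token spellings)."""
    tokens: List[str] = []
    for node in body:
        if isinstance(node, str):
            tokens.append(node)
        else:
            control_type, control_value, inner = node
            tokens += [control_type, str(control_value), "("]
            tokens += to_tokens(inner)
            tokens.append(")")
    return tokens


original = ["move", "turnLeft"]
wrapped = wrap(original, 0, 1, "repeat", 3)
token_growth = len(to_tokens(wrapped)) - len(to_tokens(original))
```

With this flattening, a single wrap mutation adds four tokens, which illustrates why the token-level edit distance can exceed the number of AST mutations.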