SPoC: Search-based Pseudocode to Code

06/12/2019 ∙ by Sumith Kulal, et al. ∙ Stanford University

We consider the task of mapping pseudocode to long programs that are functionally correct. Given test cases as a mechanism to validate programs, we search over the space of possible translations of the pseudocode to find a program that passes the validation. However, without proper credit assignment to localize the sources of program failures, it is difficult to guide search toward more promising programs. We propose to perform credit assignment based on signals from compilation errors, which constitute 88.7% of program failures. Concretely, we treat the translation of each pseudocode line as a discrete portion of the program, and whenever a synthesized program fails to compile, an error localization method tries to identify the portion of the program responsible for the failure. We then focus search over alternative translations of the pseudocode for those portions. For evaluation, we collected the SPoC dataset (Search-based Pseudocode to Code) containing 18,356 programs with human-authored pseudocode and test cases. Under a budget of 100 program compilations, performing search improves the synthesis success rate over using the top-one translation of the pseudocode from 25.6% to 44.7%.




1 Introduction

We consider the task of mapping natural language descriptions to functionally correct computer programs that are long enough to have significant intermediate state (e.g., 10–20 lines) and perform non-trivial computations. Previous work on executable semantic parsing mainly focuses on translating short text descriptions to a one-line program zelle96geoquery; vadas2005programming; zettlemoyer07relaxed; zhong2017seq2sql; lin2018nl2bash; dong2018coarse, and while recent work explored generating longer programs from text descriptions ling2016latent; yin2017syntactic; rabinovich2017abstract; hayati2018retrieval; iyer2018mapping; hashimoto2018edit, these programs are mostly evaluated on syntactic metrics (e.g., exact match and BLEU score) rather than functional correctness. In contrast, program synthesis in the programming languages community emphasizes producing programs with the correct semantics, typically captured by a set of input-output test cases that the program must compute correctly gulwani2011automating; feser2015synthesizing. However, input-output pairs usually give little information about the intermediate states of the program, making it difficult to synthesize long programs.

Synthesizing a general class of programs of significant length and internal complexity is extremely challenging without some description of the steps of computation. To that end, we propose a framework for synthesizing programs from natural language pseudocode and test cases. The test cases provide the semantic specification, while the pseudocode provides guidance for the intermediate computations the program should perform.

To synthesize a functionally correct program, instead of relying on the top-one translation of the pseudocode, we search over the space of possible translations to find one that passes the test cases. In this work, we treat the translation of each pseudocode line as a discrete portion of the program. As illustrated in Figure 1, each pseudocode line translates to a code line with approximately one or two atomic statements, and a program can be synthesized by choosing a candidate translation for each pseudocode line.

However, common search methods for machine translation, such as beam search over the possible sequences of code tokens zavershynskyi2018naps, only use the sparse reward of whether the program succeeds. Without a proper credit assignment to pinpoint the causes of program failures, it is difficult to guide search toward more promising programs. Since empirically 88.7% of failures during search are due to compilation errors, we propose to perform credit assignment based on signals extracted from compilation results. At a high level, when a program fails to compile, we use an error localization method to identify which portion of the program is responsible for the failure, and then focus the search on alternative translations of the pseudocode for that portion.

We propose two error localization methods. The first method uses a multiclass classifier to pick one of the code lines as the offending line, which is then down-weighted in subsequent search iterations. In contrast to previous error correction models gupta2017deepfix, our model also uses the error message and pseudocode for prediction. This is crucial when the compilation error can be fixed in multiple ways, only some of which are consistent with the pseudocode. The second method, prefix-based pruning, uses additional compilations to find a code prefix that causes the error. Unlike the classification model, the identified code prefix is guaranteed to be erroneous and can be blacklisted entirely.

For evaluation, we collected and released a new dataset, SPoC (Search-based Pseudocode to Code), containing 18,356 C++ programs (14.7 lines on average). The dataset can be downloaded at https://cs.stanford.edu/~sumith/spoc/. In contrast to other language-to-code datasets ling2016latent; oda2015learning; iyer2018mapping, all programs contain multiple test cases for validation. And in contrast to the closely related NAPS dataset zavershynskyi2018naps, which also contains test cases but only 6% human-authored pseudocode, all programs in SPoC are associated with human-authored pseudocode of a consistent annotation granularity. Section 3 details the comparison between SPoC and related datasets.

Using the top-one translation of the pseudocode yields a success rate of 24.6% on the test set. Under a limited budget of 100 synthesis trials (i.e., 100 code compilations and executions), our best method achieves a success rate of 44.7%. The multiclass error localization model reduces the number of synthesis trials needed in 15.5% of the programs, with a median absolute reduction of 26 trials and a median relative reduction of 42%. On the other hand, prefix-based pruning slightly increases the number of compilations for easier problems, but is more effective on harder programs, making it outperform the multiclass classifier under larger budgets.

| i | Pseudocode $x_i$ | Code $y_i$ |
|---|---|---|
| 1 | in function main | int main() { |
| 2 |   let n be integer | int n; |
| 3 |   read n | cin >> n; |
| 4 |   let A be vector of integers | vector<int> A; |
| 5 |   set size of A = n | A.resize(n); |
| 6 |   read n elements into A | for(int i = 0; i < A.size(); i++) cin >> A[i]; |
| 7 |   for all elements in A | for(int i = 0; i < A.size(); i++) { |
| 8 |     set min_i to i | int min_i = i; |
| 9 |     for j = i + 1 to size of A exclusive | for(int j = i+1; j < A.size(); j++) { |
| 10 |       set min_i to j if A[min_i] > A[j] | if(A[min_i] > A[j]) { min_i = j; } |
| 11 |     swap A[i], A[min_i] | swap(A[i], A[min_i]); |
| 12 |   print all elements of A | for(int i=0; i<A.size(); i++) cout<<A[i]<<" "; |

Public test case 1 (out of 5): input "5" followed by "3 2 4 1 5"; expected output "1 2 3 4 5".
Hidden test case 1 (out of 8): input "8" followed by "9 2 4 5 6 2 7 1"; expected output "1 2 2 4 5 6 7 9".

Figure 1: Given $L$ pseudocode lines $x_{1:L}$ (with indentation levels $\ell_{1:L}$) and public test cases, our task is to synthesize a program with code lines $y_{1:L}$. The program is evaluated against both public and hidden test cases.

2 Problem statement

Figure 1 illustrates the setup of our synthesis task. The system is given (a) a sequence $x$ of $L$ pseudocode lines $x_1, x_2, \dots, x_L$, where each $x_i$ is a string with indentation level $\ell_i$; and (b) $k$ public test cases in the form of input-output string pairs $(T^{\mathrm{in}}_1, T^{\mathrm{out}}_1), \dots, (T^{\mathrm{in}}_k, T^{\mathrm{out}}_k)$. The task is to synthesize a program $y$ consisting of $L$ code lines $y_1, y_2, \dots, y_L$. The program is accepted if it successfully compiles and passes all public test cases (i.e., the compiled binary prints the string $T^{\mathrm{out}}_i$ after reading the input $T^{\mathrm{in}}_i$) as well as $k'$ additional hidden test cases $(\tilde{T}^{\mathrm{in}}_1, \tilde{T}^{\mathrm{out}}_1), \dots, (\tilde{T}^{\mathrm{in}}_{k'}, \tilde{T}^{\mathrm{out}}_{k'})$.

At training time, the system has access to a training dataset where each example contains pseudocode $x$, a gold program $y$, and both public and hidden test cases.

At test time, the system has access to pseudocode $x$, public test cases (not hidden ones), and a computation budget. For a fair comparison under different computing environments, we use the number of synthesis trials as the budget, where in each trial, the system can issue a single call to the compiler and execute the compiled program on public test cases. The system must output a single final program, which will be validated on both public and hidden test cases.
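The acceptance criterion can be sketched as follows. This is a minimal illustration rather than the evaluation harness used for SPoC: the names `passes_all` and `is_accepted` are hypothetical, the compiled binary is abstracted as a callable that maps an input string to the printed output (or `None` on a crash), and comparing outputs token by token is an assumption about how whitespace is treated.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[str, str]  # (input string, expected output string)
Runner = Callable[[str], Optional[str]]  # stdout of the binary, or None on a crash


def passes_all(run: Runner, tests: List[TestCase]) -> bool:
    """Return True if `run` prints the expected output for every test input."""
    for test_in, test_out in tests:
        actual = run(test_in)
        # Assumption: outputs are compared token by token, ignoring whitespace layout.
        if actual is None or actual.split() != test_out.split():
            return False
    return True


def is_accepted(run: Runner, public: List[TestCase], hidden: List[TestCase]) -> bool:
    """A program is accepted only if it passes both public and hidden test cases."""
    return passes_all(run, public) and passes_all(run, hidden)


def sort_program(stdin: str) -> str:
    """Python stand-in for the compiled selection-sort program of Figure 1."""
    tokens = stdin.split()
    n = int(tokens[0])
    return " ".join(str(v) for v in sorted(int(t) for t in tokens[1:n + 1]))


public = [("5\n3 2 4 1 5", "1 2 3 4 5")]
hidden = [("8\n9 2 4 5 6 2 7 1", "1 2 2 4 5 6 7 9")]
print(is_accepted(sort_program, public, hidden))
```

During search only `passes_all(run, public)` would be available to the system; the hidden cases enter only in the final validation.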

3 Dataset

Recall that our goal is to synthesize programs of significant length and complexity. To this end, we argue that it is important to have both a description of the intermediate computation and a functional specification. Table 1 shows that most existing datasets ling2016latent; oda2015learning; iyer2018mapping have varying levels of description but lack mechanisms to validate the correctness of programs. This inevitably leads previous work to resort to proxy metrics, such as exact match accuracy, BLEU score, and tree node F1 score, which only measure syntactic similarity rather than functional correctness ling2016latent; yin2017syntactic; rabinovich2017abstract; hayati2018retrieval; iyer2018mapping; hashimoto2018edit.

One notable exception and the inspiration for our work is the NAPS dataset zavershynskyi2018naps, which contains both a description (pseudocode) and a functional specification (test cases) of competitive programming problems. However, most of their pseudocode is generated by heuristic rule-based templates, and therefore has low information content compared to human-authored pseudocode. Furthermore, the dataset suffers from inconsistent granularity of text description, as the artificial pseudocode is fine-grained (e.g., “increase var0 by 1”) whereas the human-written pseudocode tends to be abstract (e.g., “compute factorial”) because the annotators were encouraged to provide high-level descriptions. This discrepancy is reflected in the ratio of the length of pseudocode to that of code, which is 1:1.82 in their synthetic dataset, and 1:3.26 in their human-authored dataset.

As no existing dataset contains both high-quality human-authored descriptions with a consistent level of granularity and a mechanism to validate functional correctness, we created a new dataset called SPoC (Search-based Pseudocode to Code), which consists of programs, pseudocode, and test cases. The programs are non-trivial solutions to competitive programming problems, and each program is paired with public and hidden test cases. We collected natural language pseudocode for each code line from curated crowdworkers, which, by design, ensures a consistent granularity of description.

| | MTG | HS | Django | CONCODE [a] | NAPS [b] | SPoC |
|---|---|---|---|---|---|---|
| Source | ling2016latent | ling2016latent | oda2015learning; ling2016latent | iyer2018mapping | zavershynskyi2018naps | |
| Programming language | Java | Python | Python | Java | UAST | C++ |
| Number of programs (total) | 13,297 | 665 | 18,805 | 2,184,310 | 17,477 | 18,356 |
| Lines per program (average) | 30.4 | 7.7 | 1 | 4.4 | 21.7 | 14.7 |
| Type of natural language input | card text | card text | comment | documentation | pseudocode | pseudocode |
| Additional input | card metadata | card metadata | - | class context | - | - |
| Granularity of text description | program (class) | program (class) | line | program (method) | varies | line |
| Fraction of human-annotated text | 100% | 100% | 100% | 100% | 6% | 100% |
| Number of annotators (total) | n/a | n/a | 1 | n/a | n/a | 59 |
| Test cases | no | no | no | no | yes | yes |
| Number of test cases (average) | - | - | - | - | 7.5 | 38.6 |

[a] We counted the number of programs in the released dataset. Since the programs are provided as a sequence of tokens, the number of lines per program is approximated based on the number of ;, {, and }.
[b] We excluded partial programs (smaller pieces of full programs) in the dataset when counting.

Table 1: Datasets for natural language to code. In contrast to other datasets, our SPoC dataset contains human-authored pseudocode with a consistent granularity of description and test cases.

3.1 Data collection

Programs and test cases.

Similar to the NAPS dataset zavershynskyi2018naps, we scraped competitive programming problems and their test cases from codeforces.com. Each problem has multiple programs submitted by participants as solutions to the problem. We collected accepted C++ programs from problems marked as the easiest level based on their metric. Based on our pilot study, we filtered out programs with constructs that are difficult to consistently annotate with pseudocode (i.e., programs with #define macros, classes, structs, templates, switch statements, and mallocs).


We decompose each program into code lines. To obtain slightly higher-level descriptions for common constructs, we group any block with only one statement with the preceding control statement (e.g., the one-line for loop “for (int i = 0; i < n; i++) cin >> x[i];” allows a high-level description “read n values into x”).


We recruited 59 crowdworkers on Amazon Mechanical Turk to write pseudocode for each line of code. To our surprise, we were able to identify crowdworkers (rather than curated specialists) capable of annotating C++ code through a qualification round, in which we manually inspected their initial annotations.


Our dataset contains 18,356 programs submitted for 677 programming problems. Each problem has roughly 27 programs, which are likely to have similar semantics yet different code syntax. Excluding closing braces and the common “int main()” line, each program contains an average of 14.7 lines (with a minimum of 1 and a maximum of 457 lines of code). The average length of code lines is 9.08 tokens, while the average length of pseudocode lines is 7.86 tokens.

Training and test sets.

To evaluate the generalization on unseen problems and annotation styles, we created two test sets. We generated the first test set TestP by splitting based on problems: we held out 158 problems (23% out of 677 problems), which is equivalent to 1,820 programs (10.1% of all programs). The second test set TestW is split by workers: we held out 7 workers (12% out of 59 workers), which is equivalent to 1,752 programs (9.7% of all programs, with 186 programs overlapping with TestP). We used the remaining data for training and development (90:10 split).
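The two held-out sets can be illustrated with a small sketch. The record layout (a dictionary with `problem` and `worker` keys, one annotator per program) and the function name `split_by_problem_and_worker` are assumptions made for illustration; the released dataset may organize this information differently.

```python
from typing import Dict, List, Set, Tuple

Program = Dict[str, str]  # hypothetical record: {"id": ..., "problem": ..., "worker": ...}


def split_by_problem_and_worker(
    programs: List[Program],
    test_problems: Set[str],
    test_workers: Set[str],
) -> Tuple[List[Program], List[Program], List[Program]]:
    """Return (TestP, TestW, remaining data for training and development)."""
    # TestP: every program that solves a held-out problem.
    test_p = [p for p in programs if p["problem"] in test_problems]
    # TestW: every program annotated by a held-out worker (may overlap with TestP).
    test_w = [p for p in programs if p["worker"] in test_workers]
    # The rest shares neither a problem nor an annotator with the test sets.
    rest = [
        p for p in programs
        if p["problem"] not in test_problems and p["worker"] not in test_workers
    ]
    return test_p, test_w, rest


toy = [
    {"id": "a", "problem": "P1", "worker": "W1"},
    {"id": "b", "problem": "P1", "worker": "W2"},
    {"id": "c", "problem": "P2", "worker": "W2"},
    {"id": "d", "problem": "P2", "worker": "W3"},
]
test_p, test_w, rest = split_by_problem_and_worker(toy, {"P1"}, {"W2"})
print([p["id"] for p in test_p], [p["id"] for p in test_w], [p["id"] for p in rest])
```

In the toy data, program `b` lands in both test sets, mirroring the overlap between TestP and TestW reported above.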

4 Base approach

As illustrated in Figure 2, our base approach to synthesizing a program from pseudocode and public test cases involves two steps. First, a translation model encodes each pseudocode line $x_i$ and generates $M$ candidate code lines $c_{i1}, \dots, c_{iM}$ to be used as the $i$th code line. Then, we search over the possible combinations of candidate translations until we find a program that successfully compiles and passes all public test cases.


To generate candidate code lines, we use a standard seq2seq translation model with an LSTM encoder and decoder klein2017opennmt, attention-based copying mechanism luong2015translation; vinyals2015pointer, and coverage vector tu2016modeling. After encoding the pseudocode line $x_i$, we apply beam search with beam size $M$ to produce a ranked list of candidate translations $C_i = (c_{i1}, \dots, c_{iM})$, where each code line $c_{ij}$ is a sequence of string tokens. (The beam size $M$ is held fixed in our experiments.) The model also assigns a probability $p_{ij} = p(c_{ij} \mid x_i)$ for each candidate $c_{ij}$. The translation model is trained on pairs $(x_i, y_i)$ from the training data using the standard log-likelihood objective.

Best-first search.

We now describe a basic approach for searching over the space of possible programs. Given the candidate lists $C_1, \dots, C_L$, we can synthesize a program $y$ by picking a candidate $c_{ij[i]}$ from each $C_i$ (where $j[i] \in \{1, \dots, M\}$) and then concatenating them into a program. In our search algorithm, we iterate through programs in the descending order of probability $p(y) = \prod_{i=1}^{L} p_{ij[i]}$. To do so, we maintain a heap of the combinations $y = (c_{1j[1]}, \dots, c_{Lj[L]})$ indexed by $p(y)$. The heap initially contains the program $(c_{11}, \dots, c_{L1})$, which is the top-one translation of the pseudocode. In each iteration, we pop a program $(c_{1j[1]}, \dots, c_{Lj[L]})$ from the heap and test it. If the program fails (either from a compilation error, a runtime error, or a mismatch between the actual and expected test outputs), we push modified programs $(c_{1j[1]}, \dots, c_{i(j[i]+1)}, \dots, c_{Lj[L]})$ for all $i \in \{1, \dots, L\}$ that have not been explored to the heap. We continue searching until we either find a program that passes all public test cases or fully utilize the computation budget.
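A minimal sketch of this best-first search is given below. It is an illustration under stated assumptions, not the authors' implementation: candidates are referred to by 0-based indices, the compile-and-test step is abstracted as a hypothetical callable `passes`, each call to it is counted as one trial, and programs are ranked by summed negative log-probabilities (equivalent to ranking by the product of probabilities, but less prone to underflow for long programs).

```python
import heapq
import math
from typing import Callable, Optional, Sequence, Tuple

Combo = Tuple[int, ...]  # combo[i] is the index of the candidate chosen for line i


def best_first_search(
    probs: Sequence[Sequence[float]],
    passes: Callable[[Combo], bool],
    budget: int,
) -> Tuple[Optional[Combo], int]:
    """Explore candidate combinations in descending order of probability.

    probs[i][j] is the (positive) probability of candidate j for line i, sorted
    in descending order of j. Returns (accepted combo or None, trials used).
    """
    def neg_log_prob(combo: Combo) -> float:
        return -sum(math.log(probs[i][j]) for i, j in enumerate(combo))

    start = tuple(0 for _ in probs)  # top-one translation of every line
    heap = [(neg_log_prob(start), start)]
    seen = {start}
    trials = 0
    while heap and trials < budget:
        _, combo = heapq.heappop(heap)
        trials += 1  # one compiler call plus execution on public tests
        if passes(combo):
            return combo, trials
        # On failure, push every neighbor that moves one line to its next candidate.
        for i in range(len(combo)):
            if combo[i] + 1 < len(probs[i]):
                nxt = combo[:i] + (combo[i] + 1,) + combo[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (neg_log_prob(nxt), nxt))
    return None, trials


toy_probs = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
print(best_first_search(toy_probs, lambda c: c == (0, 1, 0), budget=100))
```

In the toy example the search tests (0, 0, 0), then (1, 0, 0), then (0, 1, 0): the second trial is spent on a program that differs from the correct one in an unrelated line, since the pass/fail signal alone does not indicate which line to change.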

Figure 2: Illustration of best-first search and error localization model. In this example, one combination of candidates satisfies the test cases. Best-first search iterates in the order of decreasing probabilities and succeeds in four compiler calls. The error localization method down-weights an offending candidate, leading to an earlier success.

5 Error localization

So far, we have been treating program compilation and execution as a black box that only tells whether a program passes its test cases. This sparse signal makes the search process less effective. For instance, best-first search will keep using an erroneous candidate if its probability is high.

To speed up search, we unpack the black box and extract more detailed search signals. In this work, we focus on compilation errors, which constitute 88.7% of the failure cases in best-first search. When a program fails to compile, the compiler will report error messages along with the line numbers where the errors occur. Unfortunately, the reported line numbers do not always correspond to the actual location of the mistake (e.g., the error “‘n’ was not declared in this scope” can occur long after the line where n should be declared according to the pseudocode). Empirically, the reported line number does not match the actual incorrect line 21.7% of the time.

Therefore, we treat the compilation error message as a noisy signal, and propose to use an error localization method to infer the actual portion of the code that causes the error. As illustrated in Figure 2, the error localization method has access to the pseudocode $x$, the synthesized code $y$, and the first error message $(i_{\mathrm{err}}, m_{\mathrm{err}})$ from the compiler, where $i_{\mathrm{err}}$ is a line number and $m_{\mathrm{err}}$ is a message string. It can then either detect the offending code lines or abstain. Depending on the method, we either down-weight or blacklist the translation candidates in the offending code lines.

We now introduce two error localization methods: multiclass classification, which uses a neural model to predict a single offending line; and prefix-based pruning, which uses additional calls to the compiler for detecting an erroneous code prefix.

Multiclass classification.

We train a neural multiclass classifier to predict the offending line $i^*$ among the $L$ lines. Our model is similar to the error correction model in gupta2017deepfix. For each line $i$, we embed the tokens of $x_i$, $y_i$, and $m_{\mathrm{err}}$, and then use three separate LSTMs to encode the sequences. We concatenate the final LSTM hidden states with the positional encoding vaswani2017attention of the line offset $\Delta i = i_{\mathrm{err}} - i$, and then apply a feedforward network to produce the line embedding of line $i$. The $L$ line embeddings are then passed through another LSTM, and the hidden state of each cell $i$ is passed through a feedforward network to compute the logit for line $i$. We return the line $i^*$ with the highest probability (softmax over logits) if that probability exceeds a threshold $\beta$ and abstain otherwise. The threshold $\beta$ is held fixed in the experiments.

Given $i^*$, we down-weight the current translation candidate of line $i^*$ so that it is used less often in subsequent search iterations. Concretely, we multiply the probability $p_{i^* j[i^*]}$ of the current candidate $c_{i^* j[i^*]}$ in line $i^*$ by a constant factor $\alpha < 1$. As this affects the heap indices, we rebuild the heap from scratch (which takes negligible time) and continue the search, skipping any program that has already been explored before the heap rebuild.
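The decision rule and the down-weighting step can be sketched as follows. The neural classifier itself is not reproduced; the sketch starts from per-line logits. The function names and the numeric values of the threshold and the factor in the example are placeholders chosen for illustration, not the settings used in the experiments, and the heap rebuild is omitted.

```python
import math
from typing import List, Optional, Sequence


def predict_offending_line(logits: Sequence[float], beta: float) -> Optional[int]:
    """Return the most probable line if its softmax probability exceeds beta, else abstain."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    best = max(range(len(logits)), key=lambda i: exps[i])
    return best if exps[best] / total > beta else None


def down_weight(
    probs: Sequence[Sequence[float]],
    combo: Sequence[int],
    line: int,
    alpha: float,
) -> List[List[float]]:
    """Return a copy of probs with the current candidate of `line` scaled by alpha."""
    updated = [list(row) for row in probs]
    updated[line][combo[line]] *= alpha
    return updated


toy_probs = [[0.6, 0.4], [0.7, 0.3]]
line = predict_offending_line([0.0, 5.0], beta=0.9)  # placeholder threshold
if line is not None:
    toy_probs = down_weight(toy_probs, combo=(0, 0), line=line, alpha=0.5)  # placeholder factor
print(line, toy_probs)
```

After down-weighting, the scaled candidate may no longer be the most probable one in its line, which is one reason the search state has to be recomputed rather than patched in place.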

To construct a dataset for training the model, we consider each program $y = y_{1:L}$ in the synthesis training dataset, substitute a single line $y_{i^*}$ with a candidate $c_{i^* j} \in C_{i^*}$ generated from pseudocode line $x_{i^*}$, and then collect any modified program $y'$ that produces a compilation error with an error message $(i_{\mathrm{err}}, m_{\mathrm{err}})$. The model is trained to maximize the log-likelihood of the offending lines $i^*$.
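The construction of training examples can be sketched as below. The compiler is replaced by a hypothetical callable `compile_error` that returns the first error as a (line number, message) pair or `None`; the toy stand-in used in the example only detects an undeclared variable `n`. Skipping candidates identical to the gold line is an assumption of this sketch.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Error = Tuple[int, str]  # (reported line number, message string)


def build_localization_examples(
    gold_lines: List[str],
    candidates: List[List[str]],
    compile_error: Callable[[List[str]], Optional[Error]],
) -> List[Dict[str, object]]:
    """Substitute one gold line at a time and keep the programs that fail to compile."""
    examples = []
    for i, line_candidates in enumerate(candidates):
        for cand in line_candidates:
            if cand == gold_lines[i]:
                continue  # assumption: the gold line itself is not a useful substitution
            modified = gold_lines[:i] + [cand] + gold_lines[i + 1:]
            error = compile_error(modified)
            if error is not None:
                # The substituted line i is the label, whatever line the compiler reports.
                examples.append({"code": modified, "error": error, "offending_line": i})
    return examples


def toy_compile_error(lines: List[str]) -> Optional[Error]:
    """Toy stand-in for a compiler: flags the first use of `n` before `int n`."""
    declared = False
    for idx, line in enumerate(lines):
        if line.startswith("int n"):
            declared = True
        elif re.search(r"\bn\b", line) and not declared:
            return idx, "'n' was not declared in this scope"
    return None


gold = ["int n;", "cin >> n;"]
cands = [["int n;", "int m;"], ["cin >> n;", "cin >> m;"]]
examples = build_localization_examples(gold, cands, toy_compile_error)
print(examples)
```

The single collected example reports the error on the second line while the substituted (offending) line is the first, which is the kind of mismatch the classifier is trained to resolve.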

Prefix-based pruning.

The multiclass classification method does not guarantee that the predicted line is actually an offending line. Furthermore, a candidate code line might be offending in some contexts but not others (e.g., a variable re-declaration is no longer offending if the previous declaration no longer exists). To address these issues, we propose an alternative that uses additional compiler calls to find an offending prefix of the program. Concretely, when a compilation error occurs, we use the compiler to find the minimum $i'$ such that the prefix $(c_{1j[1]}, \dots, c_{i'j[i']})$, plus closing braces to complete the program, fails to compile. Since programs containing that prefix will also always fail (with very rare exceptions), we can safely blacklist the prefix from future search iterations.

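Each additional compiler call is counted as one trial toward the synthesis budget. To save the budget, we only test $i' = i_{\mathrm{err}} - \Delta i$, where the offset $\Delta i$ between the reported error line $i_{\mathrm{err}}$ and the tested line ranges over the three most frequent offsets. If we fail to find an offending prefix, we simply abstain.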
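Prefix-based pruning can be sketched as follows. Several details are assumptions made for illustration: the compiler is a hypothetical callable `compiles`, braces are balanced by naive counting (ignoring braces inside strings or comments), line indices are 0-based, and the default offsets `(0, 1, 2)` are placeholders rather than the offsets used in the experiments.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def complete_with_braces(lines: Sequence[str]) -> List[str]:
    """Append closing braces so that a code prefix forms a complete program."""
    depth = sum(line.count("{") - line.count("}") for line in lines)
    return list(lines) + ["}"] * max(depth, 0)


def find_offending_prefix(
    lines: Sequence[str],
    err_line: int,
    compiles: Callable[[List[str]], bool],
    offsets: Sequence[int] = (0, 1, 2),  # placeholder offsets
) -> Tuple[Optional[Tuple[str, ...]], int]:
    """Return (shortest failing prefix among the tested ones or None, trials spent)."""
    trials = 0
    shortest = None
    for offset in offsets:
        end = err_line - offset  # index of the last line kept in the prefix
        if end < 0:
            continue
        trials += 1  # each extra compiler call counts toward the budget
        if not compiles(complete_with_braces(lines[:end + 1])):
            if shortest is None or end < shortest:
                shortest = end
    if shortest is None:
        return None, trials  # abstain
    return tuple(lines[:shortest + 1]), trials


def is_blacklisted(lines: Sequence[str], blacklist: Iterable[Tuple[str, ...]]) -> bool:
    """True if the program starts with any blacklisted prefix."""
    return any(tuple(lines[:len(prefix)]) == prefix for prefix in blacklist)


def toy_compiles(program: List[str]) -> bool:
    """Toy stand-in for a compiler: every line must end with ';', '{' or '}'."""
    return all(line.rstrip().endswith((";", "{", "}")) for line in program)


code = ["int main() {", "int n", "cin >> n;", "cout << n;"]
prefix, used = find_offending_prefix(code, err_line=2, compiles=toy_compiles)
print(prefix, used)
```

In the toy example the error is reported one line after the missing semicolon; the three extra compilations narrow the failure down to the two-line prefix, which can then be used to skip every later program that begins with it.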

6 Experiments

Our main evaluation metric is success rate at $B$: the fraction of test examples where the system generates an accepted program under the budget of $B$ trials. For error localization methods, we also consider the reduction in the number of trials used compared to normal best-first search.
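The metric can be computed with a few lines of code. The sketch below assumes that, for each test example, we have recorded the trial at which an accepted program was first produced (or `None` if none was found); the function name `success_rate_at` and this bookkeeping are illustrative choices.

```python
from typing import Optional, Sequence


def success_rate_at(trials_to_success: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of examples with an accepted program found within `budget` trials."""
    if not trials_to_success:
        raise ValueError("need at least one test example")
    solved = sum(1 for t in trials_to_success if t is not None and t <= budget)
    return solved / len(trials_to_success)


# Four toy examples: solved at trial 1, never solved, solved at trial 40, solved at trial 120.
toy_trials = [1, None, 40, 120]
print(success_rate_at(toy_trials, budget=1), success_rate_at(toy_trials, budget=100))
```

With a budget of 1 the metric reduces to the success rate of the top-one translation, since the first trial of best-first search always tests the top-one candidates.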

Translation accuracy.

When evaluating the translation model, surface-form metrics such as exact sequence match and BLEU scores fail to account for functional correctness of the code. For instance, a prediction “if (b)” is functionally equivalent to the gold code “if (b == true)” when b is a boolean. Hence, we instead evaluate the functional correctness of the translation. To check if a predicted code line $c_{ij} \in C_i$ is functionally correct, we replace the code line $y_i$ in the gold program with $c_{ij}$ and then verify whether the program still passes both public and hidden test cases.