Ain't Nobody Got Time For Coding: Structure-Aware Program Synthesis From Natural Language

10/23/2018 ∙ by Jakub Bednarek, et al. ∙ Poznan University of Technology 0

Program synthesis from natural language (NL) is practical for humans and, once technically feasible, would significantly facilitate software development and revolutionize end-user programming. We present SAPS, an end-to-end neural network capable of mapping relatively complex, multi-sentence NL specifications to snippets of executable code. The proposed architecture relies exclusively on neural components, and is built upon a tree2tree autoencoder trained on abstract syntax trees, combined with a pretrained word embedding and a bi-directional multi-layer LSTM for NL processing. The decoder features a doubly-recurrent LSTM with a novel signal propagation scheme and soft attention mechanism. When applied to a large dataset of problems proposed in a previous study, SAPS performs on par with or better than the method proposed there, producing correct programs in over 90 it does not involve any non-neural components to post-process the resulting programs, and uses a fixed-dimensional latent representation as the only link between the NL analyzer and source code generator.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Program synthesis consists in automatic or semi-automatic (e.g., interactive) generation of programs (or other executable structures) from specifications. This task can be posed in several ways. It is most common to assume that specification has the form of input-output pairs (tests), in which case synthesis resembles learning from examples. A popular approach to this class of tasks is genetic programming.

Synthesis from examples is limited by its inductive nature: even if a program passes all provided tests, little can be said about generalization for other inputs. This is one of the rationales for synthesis from formal specifications, which are typically expressed as a pair of logical clauses (a contract): a precondition that constraints the set of acceptable inputs, and a postcondition that defines the properties one requires from program output. Programs so synthesized are correct by construction, but the task is NP-hard, so producing programs longer than a few dozen of instructions becomes computationally challenging. Synthesis from formal specifications is also difficult for programmers, because writing a specification of a nontrivial program can be actually harder than its implementation.

From a practical perspective, the most intuitive and convenient way of specifying programs is natural language (NL). This way of formulating synthesis tasks has been rarely studied in the past, when it was beyond the capabilities of available methods. The recent progress in NLP and deep neural networks made it more realistic, as confirmed by a few studies we review in Section 5. Here, we propose Structure-Aware Program Synthesis (SAPS 111Pun intended.), an end-to-end approach to program synthesis from natural language. SAPS receives a short NL description of requested functionality and produces a snippet of code in response. To that aim, we combine generic word and sentence embeddings with tree2tree autoencoders trained on abstract syntax trees (ASTs). The unique feature of our approach is that the entire NL specification is folded into a single point in a fixed-dimensional latent space, which is then mapped by decoder onto the AST of a code snippet. This modular architecture facilitates usage of pretrained components, e.g. GloVe word embeddings (Pennington et al., 2014). We also enhance the original tree2tree architecture and augment it with global soft attention mechanism. Last but not least, the approach engages a form of dynamic batching for efficient learning from tree structures.

2 SAPS architecture

SAPS accepts a short NL specification as input and produces in response an AST of a snipped of code (program). It comprises of three main components (Fig. 1): (i) a word embedding for preprocessing of the specification, (ii) a mapping from the embedded sequence of specification tokens to a latent representation , and (iii) a decoder that unfolds into an AST. The decoder comes from a tree2tree autoencoder trained via autoassociation on ASTs (the vertical sequence of blocks in the figure). The modules are largely independent, the only requirement being that (ii) can be trained only once (i) and (iii) are known. Therefore, we describe them in that order in the sections that follow.

Figure 1: Overall architecture of SAPS.

2.1 Word embedding

Inspection of the NL specifications proposed in Polosukhin & Skidanov (2018), which we use in the experimental part, revealed that they largely use standard English vocabulary, with only occasional occurrence of terms typical for programming. Given that, we rely on a pretrained GloVe embedding, more specifically the Common Crawl, which has been trained on generic NL on 42B tokens by Pennington et al. (2014), has vocabulary size of 1.9M tokens, and embeds the words in a 300-dimensional space. We chose this particular embedding, because the vocabularies of the other ones shared by Pennington et al. (2014) did not cover all terms occurring in the considered dataset.

Given an input NL query phrase of tokens, this module produces a sequence of

300-dimensional vectors

. In order to be able to train the network with mini-batches of data, we pad each sequence in the mini-batch with a special out-of-vocabulary value,


2.2 Latent-to-AST mapping

We learn a latent, fixed-dimensionality embedding space of code snippets by autoassociative training of a tree-to-tree autoencoder, , where , , is the space of programs’ ASTs, and is the latent space. Once trained, the encoder is scrapped, and we use as the latent-to-AST mapper.

Our autoencoder architecture, which we originally introduced in (Anonymous, 1999), is based on Tai et al. (2015) and Alvarez-Melis & Jaakkola (2017). However it significantly diverges from those works in using only the latent vector for passing the information between encoder and decoder.

Encoder. A single training example is , where is an AST of a snipped of code. Given , starts by encoding the label of every node in

using one-hot encoding

; for the experiments presented in Section 4, there are 72 node labels, so . Then, following Mikolov et al. (2013), we cast s, for each node in independently, to a lower-dimensional space, using a learnable embedding matrix : . The result is a tree of node label embeddings , isomorhpic to .

The encoder network folds the tree bottom-up, starting from the tree leaves, aggregating the hidden states , and ending up at the root node. Because the arity of tree nodes may vary, we apply a recurrent TreeLSTM network, i.e. a variant of LSTM, to merge the information from children nodes (descendants). A TreeLSTM takes into account not only a single preceding node, but all descendant nodes.

The hidden states are initialized with zeroes. For each node, we first compute the sum of the hidden states of its descendants (children) :


For leaf nodes, . Then we compute the values of input, forget and output gates:



is the sigmoid function, and

and and and are, respectively, learnable weights and biases. Note that the values of and depend on the aggregated children’s states (), whereas is computed child-wise, yielding a separate output of forgetting gate for each child of the current node.

We then update the state of the memory cell of the current node:


where stands for elementwise multiplication, and where we assume that the initial states of s of nonexisting children of leaves are zero. Next, we compute the hidden state of the current node:


which is then recursively aggregated by the above procedure. Once this process reaches the root node, the obtained state becomes the latent vector, i.e. . Additionally, we have applied layer normalization (Ba et al., 2016) in the same manner as to traditional LSTM.

Figure 2: The modified DRNN cell consists of two LSTM blocks, each comprising embedding, input projection layer to adapt the input size to LSTM size, layer normalization LSTM cell, and residual skip connection between the projection and the output layers. DRNN cell uses the latent vector to enrich an prediction hidden state by the lost information during hierarchical processing.
Figure 3: Propagation of the hidden states and latent vector during decoding. The dashed line illustrates the traversal of the horizontal state, the solid line propagation of the vertical state. The latent vector is used to initialize both states, but has also impact on subsequent decoding steps via an attention mechanism.

Decoder. Our decoder is a doubly recurrent neural network (DRNN) proposed in Alvarez-Melis & Jaakkola (2017), with modified scheme of state propagation and novel attention mechanism applied to the latent vector (see Fig. 2). DRNN comprises two LSTM cells, which separately capture the vertical and horizontal traversal of the resulting AST tree. The state of the former () is passed vertically downwards from parents to children, and of the latter () horizontally left-to-right to consecutive siblings. Importantly, in the original DRNN Alvarez-Melis & Jaakkola (2017), the state of the horizontal LSTM was reset before processing the children of each node. In contrast, we always pass it between all nodes the same tree depth – even between those that are not siblings – as well as between the last node at current depth and the first node at the next depth level. Consequently, the horizontal LSTM works in breadth-first order, while the vertical one in depth-first mode, as shown in Fig. 3. Preliminary experiments with Python ASTs we conducted in (Anonymous, 1999) have shown that this brings significant improvements. We hypothesize that ASTs capture only program syntax, while pieces of code can be semantically related even if they do not reside in a common part of an AST. For instance, a reference to a previously used variable may require ‘climbing up’ the tree and then descending down to unrelated (non-sibling) subtrees.

We start by initializing both states of the vertical and the horizontal LSTM cells with . Alvarez-Melis & Jaakkola (2017) suggest to intialize only vertical hidden state with latent vector, however, our modification concerning the transfer of horizontal states in breadth-first order implies the same initialization approach also to the horizontal state. The DRNN merges both states of LSTMs in a prediction state:


where and are learnable parameters.

is then mapped to the probability distribution of node’s label


where (see the description of ), and and are learnable parameters applied to is-parent flag () and is-last-child flag (). In training, these are used for teacher forcing, i.e. replaced with ground truth data according to the shape of the input tree ; in inference mode, the flags are calculated as:


where and are learnable parameters.

In training, we minimize the sum of label loss and topological loss:


where is the categorical cross-entropy loss between the ground truth label and the label produced by the DRNN, and is the binary crossentropy, applied separately to the is-parent flag and is-last-child flag.

Because successive iterations of the DRNN may arbitralily modify its prediction state , the information from the encoder may fade away with time, making it harder to reproduce the input tree . A range of works, most relevantly Chen et al. (2018), have shown that this loss of information can be supplemented with an attention mechanism. However, the specific attention architecture proposed there assumes that the window of attention scans the corresponding nodes of the input tree , which implies that has to be available also during decoding. As we cannot assume this in our setting, where the decoder receives only the latent vector as input (Fig. 1), we come up with an alternative attention mechanism that relies only on . We define a (soft) attention window as


and then use it to update the prediction state as follows (compare to the attention-less update formula (8)):


where , , and are learnable parameters. We expect this mechanism to learn focusing on the parts of that carry the information that is relevant for reproduction of tree structure, particularly at the deeper levels.

2.3 Sentence-to-latent mapping

To map the sequences of word embeddings of the NL specification (300-dimensional vectors ) to the latent space (Fig. 1), we employ a multilayer bidirectional LSTM (Schuster & Paliwal, 1997; Graves et al., 2005). The dimensionality of LSTM’s state and output is the same as that of . The output of the LSTM is passed through a layer, in order to match the range of values produced by the tree encoder – which also uses (Eq. 7) – and fed as to the decoder.

Internally, this recurrent network is a 4-layer LSTM: there are four forward cells stacked upon each other, i.e. each consecutive cell receives the output of the previous cell as input, and analogously a stack of four backward cells. The final state of the topmost forward cell (reached after the forward pass over the input sequence) and the final state of the topmost backward cell (reached after the backward pass over the input sequence) are concatenated to form the final output.

3 Program Synthesis Tasks

We assess the performance of SAPS on the set of NL-based program synthesis tasks proposed in (Polosukhin & Skidanov, 2018). The domain-specific language (DSL) used in this suite is Lisp-based AlgoLisp, defined by the grammar in Fig. 4, reproduced from (Polosukhin & Skidanov, 2018). A single program is composed as a nested list of instructions and operands of three types: string, Boolean and function. The language features a number of standard functions (cf. the last-but-one production in the grammar). The motivation for choosing this dataset was the large number of examples and brevity of NL specifications (in contrast to the programming contest data we used in (Anonymous, 1999)). Moreover, the simplicity of syntax causes the AlgoLisp programs to be essentially equivalent to their AST trees. See (Polosukhin & Skidanov, 2018) for detailed description of the dataset.

program       ::= symbol
symbol        ::= constant | argument | function_call | function | lambda
constant      ::= number | string | True | False
function_call ::= (function_name arguments)
function      ::= function_name
arguments     ::= symbol | arguments , symbol
function_name ::= Reduce | Filter | Hap | Head | + | - | ...
lambda        ::= lambda function_call
Figure 4: The grammar of the AlgoLisp DSL from (Polosukhin & Skidanov, 2018).

The dataset comprises 99506 examples, each being a pair of NL specification and the corresponding AlgoLisp program. An example can be seen in Table 1. To handle the variables that occur in programs (like in the above example), we use the same approach as Polosukhin & Skidanov (2018), i.e., each occurrence of a new variable in the program allocates a new placeholder, and the placeholders extent the vocabulary of tokens used in the NL embedding. Placeholders are common for all examples, i.e. the first occurrences of variables in all programs are mapped to the same, first placeholder. The descriptions in NL does not specify the variables, which are needed in source code, therefore there were no need to resorting to such placeholders on the encoder side.

You are given an array . Find the smallest element in , which is strictly greater than the minimum element in (reduce (filter a (partial0 (reduce a inf) <)) inf min)
Table 1: An example AlgoLisp program (right) and its NL specification (left).

For conformance with Polosukhin & Skidanov (2018), we split the data into training set of 79214 examples, validation set (9352 examples) and test set (10940 examples). The lengths of average NL specification and associated code are, respectively, 38 and 24 tokens. The shortest and longest description contain 8 and 164 tokens respectively (2 and 217 for code snippets).

It may be worth noting that, in contrast to the relatively large number of examples available, the vocabularies of terms are rather humble: there are only unique terms/tokens in the NL specifications, and only unique terms in the AlgoLisp programs. The NL vocabulary used in specifications is strongly related to programming, containing words rarely used in everyday language, lie prime, inclusive, incremented etc.

After closer examination, we identified potential inconsistency in the dataset: for a about 10% of tasks (in the test set), some tests are not passed even by the ground truth program (we verified that using the AlgoLisp interpreted published in (Polosukhin & Skidanov, 2018)). To address this issue, we decided to use two performance measures: one using the unmodified test and validation sets, in order to maintain comparability to results presented in (Polosukhin & Skidanov, 2018), and another one using a filtered test set, containing only the tasks that were free from the above problem. Note however that this issue does not affect the training of SAPS, as it does not involve tests.

4 Experiment

The method has been implemented in TensorFlow

Abadi et al. (2016). We train the autoencoder using Adam optimizer (Kingma & Ba, 2014) with initial learning rate set to initial value of

, and then reduced exponentially at rate 0.975 per every two epochs. To lessen the risk of overfitting, we applied L2 regularization to each of trainable parameter, except for biases and layer normalization. When applied to the validation set, the autoencoder reproduces successfully 72.98% of AST trees in the best configuration.

Given the trained decoder and GloVe embedding (Section 2.1), we train the sentence-to-tree and sentence-to-vector mapping using the same configuration and methods as those used for training the autoencoder.

4.1 Results

We examine network’s ability to fully reconstruct trees, the number of generated programs that pass all tests (and at least 50% of tests), and confront SAPS with other methods of synthesis of programs from NL. In addition, we conducted an analysis of the sensitivity of the network to changes in input data (natural language). None of the considered trained configurations of SAPS produced any syntactically incorrect programs (average error 0.0%). The generated programs by the best obtained network for examples from validation set passed in average 94.35% of unit tests, while from the test set 93.32%.

Dev Acc
Test Acc
Attentional Seq2Seq 54.4% 54.1%
Seq2Tree 61.2% 61.0%
SAPS (NLP2Tree - 256) 85.29% 83.03%
Seq2Tree + Search
(only for comparison) 86.1% 85.8%
Table 2: Direct comparison with results achieved by Polosukhin & Skidanov (2018). Accuracy is measured as the percentage of programs passing all tests.
Model Validation Test Model Validation Test
tree2tree - 256 72.98% 68.42% SAPS - 256 92.68% 91.18%
tree2tree - 128 67.06% 60.18% SAPS - 128 82.59% 60.18%
tree2tree - 64 50.24% 39.36% SAPS - 64 58.65% 52.22%
NLP2Vec - 256 14.91% 14.39%
Table 3: Percentages of synthesized programs that are syntactically identical to the target ones (same AST tree).
Model All tests Only passable tests
Validation Test Validation Test
Accuracy 50-Accuracy Accuracy 50-Accuracy Accuracy 50-Accuracy Accuracy 50-Accuracy
SAPS - 256 85.29% 88.34% 83.03% 86.58% 93.24% 94.27% 91.97% 93.22%
SAPS - 128 76.62% 80.92% 70.91% 75.89% 83.90% 86.41% 78.56% 81.54%
SAPS - 64 55.08% 60.35% 49.80% 56.31% 60.19% 64.31% 55.17% 60.56%
Table 4: Percentage of programs passing all tests, using all examples provided in the AlgoLisp database (Polosukhin & Skidanov, 2018) (All), and using only the tests that are consistent with the target program (Only passable). See the main text for detailed explanation.

Due to the inconsistencies in AlgoLips dataset described in Section 4, we divided the results into two groups. First, we compare our best configuration, SAPS with the latent layer of size 256, directly to those achieved by (Polosukhin & Skidanov, 2018), using their evaluation code. The results are presented in Table 2. Though the best test-set accuracy reported by Polosukhin & Skidanov (2018) is 85.8%, achieving it requires a sophisticated search mechanism and use of additional tests. On the other hand, our approach is entirely neural, so it seems fair to compare it to analog neural-only architectures from that cited study. The best accuracy obtained by Polosukhin & Skidanov (2018) without search is 61.0% on the test set. It should be noted that, despite involving no explicit search mechanism, our fully neural algorithm is only slightly worse than the best search-based approach from (Polosukhin & Skidanov, 2018). Moreover, SAPS outperforms the general attention mechanism used in Seq2Tree by roughly 24%.

Secondly, we evaluated the performance of different configurations of our model. In Table 3, we present the accuracy of several variants of SAPS, measured as the percentage of perfectly synthesized programs (i.e., such that the generated AST trees is identical to the target one), separately for the validation (developmental) set and test set. We report two performance metrics: the accuracy of autoencoder (tree2tree), i.e. the percentage of perfectly recreated AST trees in that training phase, and the overall accuracy of SAPS. We also experimented with different sizes of the latent layer: , , and . As expected, the configurations involving largest hidden states perform the best. Also, each complete SAPS architecture is better at synthesizing programs from NL specifications than the corresponding constituent autoencoder is at reproducing the AST trees. We hypothesize that this is due to relatively simplicity of the TreeLSTM encoder used here, which features single, feedforward LSTM, contrary to multilayer, bidirectional LSTM used in the NL encoder.

In the configurations discussed above, the decoder (taken from the trained autoencoder) learns alongside with the NL-to-latent mapping. For comparison, we consider also a variant dubbed NLP2Vec. In this setting, we used the latent vectors from pretrained autoencoder (tree2tree) model as the target (desired output) for the NL-to-latent mapping, and trained the network to minimize the distance between the predicted vector and the one computed by tree2tree (while the decoder remained fixed). As expected, the obtained results were inferior to others (outperforming, however, the approach based only on input-output pairs from (Polosukhin & Skidanov, 2018), which achieved accuracy of 13.3% in the best case (with search mechanism)).

In Table 4, we compared the results obtained on the original, unmodified version of AlgoLisp dataset with the results obtained on tasks containing only valid, passable tests (see the note on this issue in Section ). Expectedly, networks obtained higher scores when evaluated only on the tasks that are free from the abovementioned issue, and those scores are also higher than the percentages of programs that were syntactially identical to the target (Table 3). The latter fact should not be surprising, as a syntactically perfect program has to pass all tests.

To validate the obtained results, we evaluated the robustness of SAPS by applying a range of modifications to the NL specification (Table 5). We checked how does the generated source code depend on different levels of input complexity (simplification). We provided a set of NL descriptions and sorted them from the longest to the shortest. To our surprise, the network works well even if the input is very laconic. We evaluated also the capabilities of generalization in response to various replacements in input sequences, like for example replacing operations (multiplication minimum) and applying different ranges of arrays. We also tested the robustness to complex replacements consisting of multiple simpler ones.

Specification Synthesized program
you are given numbers a and b, your task is to find a + b (+, a, b)
you given numbers a b, your is find a + b (+, a, b)
given a numbers b, find a + b (+, a, b)
given a numbers b, a + b (+, a, b)
a b, a + b (+, (+, a, b), c)
Simple replacement
you are given numbers a and b, your task is to find a multiplied by b (*, a, b)
you are given numbers a and b, your task is to find minimum a and b (min, a, b)
Complex replacement
given a number a and an array of numbers b, find the length of the longest subsequence of range from 0 to a inclusive that is a prefix of b (reduce, (range, 0, (+, a, 1)), 0, (lambda2, (if, (==, arg2, (if, (<, arg1, (len, b)), (deref, b, arg1), 0)), (+, arg1, 1), arg1)))
given a number a and an array of numbers b, find the length of the longest subsequence of range from 1 to a exclusive that is a prefix of b (reduce, (range, 1, a), 0, (lambda2, (if, (==, arg2, (if, (<, arg1, (len, b)), (deref, b, arg1), 0)), (+, arg1, 1), arg1)))
given an array of numbers a, find median of values in a after only keeping first half (deref, (sort, (slice, a, 0, (/, (len, a), 2))), (/, (len, (slice, a, 0, (/, (len, a), 2))), 2))
given an array of numbers a, find mean of values in a after only keeping second half (/, (reduce, (slice, a, (/, (len, a), 2), (len, a)), 0, +), (len, (slice, a, (/, (len, a), 2), (len, a))))
Table 5: The effects of various types of modifications of the NL specification. The first specification in each group is an original task from the validation set, and those that follow are its modified variants (with most modifications marked in bold). In Simplification, we drop and swap words from the specification until the output program is not correct anymore. Simple replacement: we aim at modifying a single keyword in the program. Complex replacement: change of entire subexpression in the output program. Makeover: Complex modification of specification. Except for the specification a b, a + b, all synthesized programs are consistent with specification.

5 Related work

With the recent advent of powerful hardware, the problem of program synthesis received increasing attention. Apart from the fundamental research directions mentioned in the Introduction, a few main approaches can be identified.

Code synthesis from examples
Recently a handful of methods utilizing input-output examples for program synthesis were introduced. Balog et al. (2016) used a neural network as a guide for a search algorithm. The network predicts the most probable code tokens, which greatly reduces the number of solutions that need to be evaluated during search. Krawiec et al. (2017) use genetic programming to find a set of programs that match the formal specification, which ensures the correctness of generated programs for all possible inputs from specification. Devlin et al. (2017) create programs using noisy input-output pairs.

Differentiable compilers
Another approach is to, rather than synthesizing human readable code, embed the whole proces implicitly in the neural substrate. These methods often utilize neural networks and could be seen as differentiable compilers. Graves et al. (2014)

implement a Turing machine using neural network, endowing it with differentiable memory. A similar approach was proposed by

Sukhbaatar et al. (2015), where a memory was also used.

Source code and AST trees from natural language description
With the recent advances in NL understanding, a number of approaches utilizing NL descriptions for code synthesis were introduced. Rabinovich et al. (2017); Yin & Neubig (2017) generate code in the form of abstract syntax trees instead of plain source code. They use recursive neural decoders, specialized for the abstract syntax tree generation. An interesting approach is presented by Ling et al. (2016), where the authors use a set of neural modules to generate source code from multimodal inputs (text and quantitative features), based on a trading card game.

Last but not least, the main reference approach for this work is the neuro-guided approach proposed by Polosukhin & Skidanov (2018). The authors used a neural network to guide a Tree-Beam search algorithm and so prioritize the conducted search process. In the approach proposed there, a sequential encoder folds a short description in natural language, followed by a tree decoder that generates a set of most possible nodes in the AST. For each node, the proposed labels are evaluated by the search algorithm on a set of input-output tests. The proces is repeated recursively for consecutive nodes of the AST tree. The authors used an attention mechanism (Bahdanau et al., 2014) on the entire input sequence – contrary to our approach, where attention concerns only the latent vector.

6 Discussion

SAPS manages to achieve state-of-the-art test-set accuracy, on par with that of Polosukhin & Skidanov (2018)

, and does so with a bare neural model, without any additional postprocessing or other forms of guidance. This remains in stark contrast to that study, where network was queried repeatedly in a Tree-Beam search heuristics to produce the target program step by step, testing the candidate programs on provided tests (see Fig. 2 in

(Polosukhin & Skidanov, 2018)), and in contrast to Balog et al. (2016), where a network was used to prioritize search conducted by an external algorithm. SAPS’s architecture can provide a similar level of quality with purely neural mechanisms. This seems interesting in itself; in particular, it is worth realizing that the fixed-dimensionality latent layer implements an embedding for a large number of programs, many of them implementing sensible semantics - those present in our training, validation and testing set, but possibly also other programs (as suggested by the effects of manipulations shown in Table 5).

We attribute the high performance of SAPS mainly to the quite sophisticated design of our decoder, and in particular its susceptibility to learning. This claim is supported by the comparison of NLP2Tree to tree2tree (Table 3), where the latter used a fixed decoder, trained only in the autoassociative mode (Section 2.2). The gap between the performances of these architectures clearly indicates that it was essential to allow the decoder to adapt in the end-to-end learning spanning from NL to AST trees. This gives rise to an interesting question for follow-up studies: how much of autoassociative learning is required for SAPS to perform well? In an extreme scenario, it would be interesting to examine the performance of SAPS without any autoassociative pretraining at all.

Our networks produce exactly the required program in over 90% of cases (see Table 3). Even when the synthesized program is not identical to the target one, there is still a chance for it to be correct. For instance, for the test set, SAPS-256 produces a perfect copy of the target program in 91.18% of cases (Table 4), while 91.97% programs pass all tests (Table 3). Therefore, 0.79% of programs (roughly one in 11 of imperfectly reproduced programs) pass all tests despite being syntactically different from the target program. It seems likely that in those cases the network came up with an alternative implementation of the target concept expressed by the NL specification. One should keep in mind, however, that passing the tests does not prove semantic equivalence. This could be achieved with formal verification, which we leave out for future work.

In relation to that, let us also note that a substantial fraction of programs that do not pass all tests, pass at least 50% of them (see the 50-Accuracy in Table 4). Because the a priori probability of passing a test by a program is miniscule, this suggests that even the imperfect programs feature useful code snippets that make them respond correctly to some tests. All these promising values of performance indicators make SAPS’s applications in real-world setting quite realistic.

It would be naive to expect SAPS to scale well for even longer specifications and/or target programs (the median length of the former was 36 words, and of the latter 20 tokens). That would arguably require augmenting the method with additional modules, for instance an attention mechanism for interpretation of the NL specification, as used by Polosukhin & Skidanov (2018). That is particularly true for tasks that involve complex, multi-sentence NL specifications.

7 Conclusion

We have shown that end-to-end synthesis of nontrivial programs from natural language is possible with purely neural means, and using exclusively gradient descent as the learning mechanism. Given that (i) capturing the semantics of the NL is in general difficult, (ii) there are deep differences between NL and AST trees on both syntactic and semantic level, and (iii) the correspondence between the parts of the former and the latter is far from trivial, we find this result yet another promising demonstration of the power dwelling in the neural paradigm. In future works, we plan to perform ablation studies on SAPS to determine which components are critical for its performance, and focus on the compositionality of both specifications and program code.
Acknowledgment. We thank the authors of (Polosukhin & Skidanov, 2018) for publishing their code and data.