Program Synthesis and Semantic Parsing with Learned Code Idioms

06/26/2019 ∙ by Richard Shin, et al. ∙ Microsoft ∙ UC Berkeley

Program synthesis of general-purpose source code from natural language specifications is challenging due to the need to reason about high-level patterns in the target program and low-level implementation details at the same time. In this work, we present PATOIS, a system that allows a neural program synthesizer to explicitly interleave high-level and low-level reasoning at every generation step. It accomplishes this by automatically mining common code idioms from a given corpus, incorporating them into the underlying language for neural synthesis, and training a tree-based neural synthesizer to use these idioms during code generation. We evaluate PATOIS on two complex semantic parsing datasets and show that using learned code idioms improves the synthesizer's accuracy.




1 Introduction

Program synthesis is the task of translating an incomplete specification (natural language, input-output examples, or a combination of the two) into the most likely program that satisfies this specification in a given language (Gulwani et al., 2017). In the last decade, it has advanced dramatically thanks to novel neural and neuro-symbolic techniques (Devlin et al., 2017b; Balog et al., 2017; Kalyan et al., 2018), first mass-market applications (Polozov and Gulwani, 2015), and massive datasets (Devlin et al., 2017a; Yin et al., 2018a; Yu et al., 2018). Table 1 shows a few examples of typical tasks of program synthesis from natural language. Most successful applications apply program synthesis to manually crafted domain-specific languages (DSLs) such as FlashFill and Karel, or to restricted general-purpose languages such as SQL and Lisp. However, scaling program synthesis to real-life programs in a general-purpose language with complex control flow remains an open challenge.

We conjecture that one of the main current challenges of synthesizing a program is insufficient separation between high-level and low-level reasoning. In a typical program generation process, be it a neural model or a symbolic search, the program is generated in terms of its syntax tokens, which represent low-level implementation details of the latent high-level patterns in the program. In contrast, humans switch between high-level reasoning (“a binary search over an array”) and low-level implementation (“while l < r: m = (l+r)/2 …”) repeatedly when writing a single function. Reasoning over multiple abstraction levels at once complicates the generation task for a model.

Dataset | Natural Language Specification | Program
Hearthstone (Ling et al., 2016) | Mana Wyrm (1, 3, 1, Minion, Mage, Common): Whenever you cast a spell, gain +1 Attack. |
    # ...
    def create_minion(self, player):
        return Minion(1, 3, effects=[Effect(SpellCast(), ActionTag(Give(ChangeAttack(1)), SelfSelector()))])
Spider (Yu et al., 2018) | For each stadium, how many concerts are there? Schema: stadium = {stadium_id, name, ...}, ... |
    SELECT COUNT(*)
    FROM concert AS T1 JOIN stadium AS T2
        ON T1.stadium_id = T2.stadium_id
    GROUP BY T1.stadium_id
Table 1: Representative program synthesis tasks from real-world semantic parsing datasets.

This conjecture is supported by two key observations. First, recent work (Dong and Lapata, 2018; Murali et al., 2018) has achieved great results by splitting the synthesis process into sketch generation and sketch completion. The first stage generates a high-level sketch of the target program, and the second stage fills in missing details in the sketch. Such separation improves the accuracy of synthesis as compared to an equivalent end-to-end generation. However, it allows only one stage of high-level reasoning at the root level of the program, whereas (a) real-life programs involve common patterns at all syntactic levels, and (b) programmers often interleave high-level and low-level reasoning during implementation.

Second, many successful applications of inductive program synthesis such as FlashFill (Gulwani, 2011) rely on a manually designed DSL to make the underlying search process scalable. Such DSLs include high-level operators that implement common subroutines in a given domain. Thus, they (i) compress the search space, ensuring that every syntactically valid DSL program expresses some useful task, and (ii) enable logical reasoning over the domain-specific operator semantics, making the search efficient. However, DSL design is laborious and requires domain expertise. Recently, Ellis et al. (2018) showed that such DSLs are learnable in the classic domains of inductive program synthesis; in this work, we target general-purpose code generation, where DSL design is difficult even for experts.

In this work, we present a system, called Patois, that equips a program synthesizer with automatically learned high-level code idioms (common program fragments) and trains it to use these idioms during program generation. While syntactic by definition, code idioms often represent useful semantic concepts. Moreover, they compress and abstract the programs by explicitly representing common patterns with unique tokens, thus simplifying the generative process for the synthesis model.

Patois has three main components, illustrated in Figure 1. First, it employs nonparametric Bayesian inference to mine the code idioms that frequently occur in a given corpus. Second, it marks the occurrences of these idioms in the training dataset as new named operators in an extended grammar. Finally, it trains a neural generative model to optionally emit these named idioms instead of the original code fragments, which allows it to learn idiom usage conditioned on a task specification. During generation, the model has the ability to emit an entire idiom in a single step instead of the multiple steps of program tree nodes comprising the idiom's definition. As a result, Patois interleaves high-level idioms with low-level tokens at all levels of program synthesis, generalizing beyond fixed top-level sketch generation.

We evaluate Patois on two challenging semantic parsing datasets: Hearthstone (Ling et al., 2016), a dataset of small domain-specific Python programs, and Spider (Yu et al., 2018), a large dataset of SQL queries over various databases. We find that equipping the synthesizer with learned idioms improves its accuracy in generating programs that satisfy the task description.

[Figure 1, bottom — AST fragment of the example idiom: If(Compare(Subscript(⟨hole⟩, Index(⟨hole⟩)), Eq, Num(⟨hole⟩)), body: Assign(⟨hole⟩, BinOp(⟨hole⟩, Add, Num(1))), …)]

Figure 1: Top: An overview of Patois. A miner ① extracts common idioms from the programs in a given dataset. All the idiom occurrences in the dataset programs are ② marked as optional alternative grammar operators. The dataset with marked occurrences is used to ③ train a neural generative model. At inference time, the model ④ generates programs with named idioms, which are inlined before program execution. Note that idioms may have named subexpressions, may repeat, and may occur at any program level. For clarity, we typeset idioms using function-like syntax in this paper, although they are actually represented as AST fragments with no syntax.
Bottom: AST fragment representation of the idiom in Python. Here sans-serif nodes are fixed non-terminals, monospaced nodes are fixed terminals, and boxed (⟨hole⟩) nodes are named arguments.

2 Background

Program Synthesis

We consider the following formulation of the program synthesis problem. Assume an underlying programming language $\mathcal{L}$ of programs. Each program $P \in \mathcal{L}$ can be represented either as a sequence of its tokens or, equivalently, as an abstract syntax tree (AST) $T$, parsed according to the context-free grammar (CFG) $G$ of the language $\mathcal{L}$. Given a specification $\varphi$, the goal of a program synthesis model is to generate a program

$$P^* = \operatorname*{arg\,max}_{P \in \mathcal{L}} \Pr(P \mid \varphi),$$

the most likely program given the specification. We also assume a training set $\mathcal{D} = \{\langle \varphi_j, P_j \rangle\}$, sampled from an unknown true distribution, from which we wish to estimate the conditional probability $\Pr(P \mid \varphi)$.

In this work, we consider general-purpose programming languages with a known context-free grammar $G$, such as Python and SQL. Each specification is represented as a natural language task description, i.e., a sequence of words (although the Patois synthesizer can be conditioned on any other type of incomplete spec). In principle, we do not impose any restrictions on the generative model apart from it being able to emit syntactically valid programs. However, as we detail in Section 4, the Patois framework is most easily implemented on top of structural generative models such as sequence-to-tree models (Yin and Neubig, 2017) and graph neural networks (Li et al., 2016; Brockschmidt et al., 2019).

Code Idioms

Following Allamanis and Sutton (2014), we define code idioms as fragments of valid ASTs in the CFG $G$, i.e., trees of nonterminals and terminals from $G$ that may occur as subtrees of valid parse trees of programs in $\mathcal{L}$. The grammar $G$ extended with a set of idiom fragments forms a tree substitution grammar (TSG). We also associate a non-unique label with each nonterminal leaf in every idiom, and require that every instantiation of an idiom have its identically labeled nonterminals instantiated to identical subtrees. This enables the role of idioms as subroutines, where labels act as "named arguments" in the "body" of an idiom. See Figure 1 for an example.
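The constraint above — identically labeled nonterminal leaves must bind to identical subtrees — can be illustrated with a small sketch. This is not the paper's implementation; the `Node` representation and `match` helper are hypothetical stand-ins for an AST fragment matcher:

```python
# Sketch: code idioms as AST fragments whose labeled holes ("named
# arguments") must bind to identical subtrees when the idiom is matched.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    kind: str                   # grammar symbol of this AST node
    children: tuple = ()
    hole: Optional[str] = None  # label if this leaf is a named argument

def match(idiom, tree, bindings=None):
    """Match an idiom fragment rooted at `tree`. Identically labeled
    holes must bind to identical subtrees; returns bindings or None."""
    if bindings is None:
        bindings = {}
    if idiom.hole is not None:                  # named-argument leaf
        bound = bindings.get(idiom.hole)
        if bound is not None and bound != tree:
            return None                         # same label, different subtree
        bindings[idiom.hole] = tree
        return bindings
    if idiom.kind != tree.kind or len(idiom.children) != len(tree.children):
        return None
    for ic, tc in zip(idiom.children, tree.children):
        if match(ic, tc, bindings) is None:
            return None
    return bindings

# Idiom with a repeated label "x": both operands must be the same subtree.
idiom = Node("BinOp", (Node("expr", hole="x"), Node("Add"), Node("expr", hole="x")))
same  = Node("BinOp", (Node("Name", (Node("a"),)), Node("Add"), Node("Name", (Node("a"),))))
diff  = Node("BinOp", (Node("Name", (Node("a"),)), Node("Add"), Node("Name", (Node("b"),))))
```

Here `match(idiom, same)` succeeds because both holes labeled "x" bind to the same subtree, while `match(idiom, diff)` fails.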

3 Mining Code Idioms

The first step of Patois is obtaining a set of frequent and useful AST fragments as code idioms. The trade-off between frequency and usefulness is crucial: it is trivial to mine commonly occurring short patterns, but they are often meaningless (Aggarwal and Han, 2014). Instead, we employ and extend the methodology of Allamanis et al. (2018) and frame idiom mining as a nonparametric Bayesian inference problem.

We represent idiom mining as inference over probabilistic tree substitution grammars (pTSGs). A pTSG is a probabilistic context-free grammar extended with production rules that expand to a whole AST fragment instead of a single level of symbols (Cohn et al., 2010; Post and Gildea, 2009). The grammar $G$ of our original language induces a pTSG $G_0$ with no fragment rules and with choice probabilities estimated from the corpus $\mathcal{D}$. To construct a pTSG corresponding to the desired idiom-extended language, we define a distribution over pTSGs as follows.

We first choose a Pitman-Yor process (Teh and Jordan, 2010) as a prior distribution over pTSGs. It is a nonparametric process that has proven effective for mining code idioms in prior work thanks to its modeling of production choices as a Zipfian distribution (in other words, it implements the desired "rich get richer" effect, which encourages a smaller number of larger and more common idioms).

Figure 2: MCMC sampling for an AST (figure from Allamanis and Sutton (2014)). Dots show the inferred nodes where the AST is split into fragments.

Formally, it is a "stick-breaking" process (Sethuraman, 1994) that defines the distribution over the set of idioms rooted at a nonterminal symbol $N$ as

$$G_N = \sum_{k=1}^{\infty} \pi_k\, \delta_{T_k}, \qquad \pi_k = u_k \prod_{j<k} (1 - u_j), \qquad u_k \sim \mathrm{Beta}(1 - d,\ \alpha + k d),$$

where $\delta$ is the delta function, $T_k$ are AST fragments rooted at $N$, and $d$, $\alpha$ are hyperparameters. See Allamanis et al. (2018) for details.
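The stick-breaking construction above can be sketched numerically. This toy snippet is not the paper's miner; it only draws the first few fragment probabilities $\pi_k$ to show the heavy-tailed "rich get richer" pattern:

```python
# Toy sketch of Pitman-Yor stick breaking: draw the first n fragment
# probabilities pi_k with discount d and strength alpha.
import random

def pitman_yor_sticks(d, alpha, n, seed=0):
    rng = random.Random(seed)
    pis, remaining = [], 1.0
    for k in range(1, n + 1):
        u = rng.betavariate(1 - d, alpha + k * d)  # u_k ~ Beta(1-d, alpha+k*d)
        pis.append(remaining * u)                  # pi_k = u_k * prod_{j<k}(1-u_j)
        remaining *= 1.0 - u
    return pis

weights = pitman_yor_sticks(d=0.5, alpha=1.0, n=20)
```

The weights are positive and sum to less than 1 (the remaining mass belongs to the infinitely many undrawn fragments); larger discounts $d$ produce heavier tails.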

Patois uses this prior to compute a posterior distribution $\Pr(G \mid T_1, \ldots, T_n)$ using Bayes' rule, where $T_1, \ldots, T_n$ are the concrete ASTs in the training set $\mathcal{D}$. As this calculation is computationally intractable, we approximate it using type-based MCMC (Liang et al., 2010). At each iteration $t$ of the MCMC process, Patois generates a pTSG $G_t$ whose distribution approaches the true posterior as $t \to \infty$. It works by sampling splitting points for each AST in the corpus $\mathcal{D}$, which by construction define the set of fragments constituting $G_t$ (see Figure 2). The split probabilities of this Gibbs sampling are set in a way that incentivizes merging adjacent tree fragments that often co-occur in $\mathcal{D}$. The final idioms are then extracted from the pTSG obtained at the last MCMC iteration.
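The sampling of splitting points can be caricatured as follows. This is a heavily simplified sketch, not the actual type-based sampler: each tree edge carries a boolean (split vs. merge), and `merge_score` is a hypothetical stand-in for the posterior odds derived from the pTSG:

```python
# Simplified sketch of one Gibbs sweep over AST "split points": True
# splits the tree into separate fragments at that edge, False merges
# the fragments on either side into one larger idiom candidate.
import random

def gibbs_sweep(split, merge_score, seed=0):
    """split: {edge_id: bool}; merge_score: {edge_id: float in [0, 1]},
    higher when the fragments around the edge often co-occur."""
    rng = random.Random(seed)
    for e in split:
        split[e] = rng.random() >= merge_score[e]  # split unless merged
    return split

state = gibbs_sweep({"e1": True, "e2": True}, {"e1": 1.0, "e2": 0.0})
```

With an extreme merge score of 1.0 the edge is always merged (yielding a larger fragment), and with 0.0 it is always split, mirroring the incentive to merge frequently co-occurring fragments.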

While the Pitman-Yor process helps avoid overfitting the idioms to $\mathcal{D}$, not all sampled idioms are useful for synthesis. Thus, we rank and filter the idioms before using them in training. In this work, we reuse the two ranking functions defined by Allamanis et al. (2018), based on idiom coverage and cross-entropy gain, and also filter out any terminal idioms (those that do not contain any named arguments).

4 Using Idioms in Program Synthesis

After mining, Patois obtains a set of common idioms $\mathcal{I}$. We wish for a program synthesis model to emit whole idioms as atomic actions during generation, instead of the individual AST nodes that comprise each idiom's tree fragment. However, we still want to expose the model to the definition of each idiom and let it learn the idiom's best usage from the definition as well as from the context.

Implementing these desiderata involves two key challenges. First, since idioms are represented as AST fragments without concrete syntax, Patois works best when the synthesis model is structural, i.e., it generates the program AST rather than its surface syntax. Prior work (Yin and Neubig, 2017; Yin et al., 2018b; Brockschmidt et al., 2019) also showed that tree- and graph-based code generation models outperform recurrent sequence-to-sequence models thanks to better exposure to AST structure, and thus we adopt a similar architecture in this work.

Second, exposing the model to idiom usage patterns is not straightforward. One approach could be to extend the grammar with a new named operator for each idiom, replace every occurrence of an idiom with its operator in the data, and train the synthesizer on the rewritten dataset. However, idiom occurrences often overlap, and any deterministic rewriting strategy would arbitrarily discard some occurrences from the corpus, thus limiting the model's exposure to idiom usage. In our experiments, we found that greedy rewriting discarded a substantial fraction of the potential idiom occurrences in the dataset. Therefore, a successful training strategy must preserve all occurrences and instead let the model learn a rewriting strategy that optimizes end-to-end synthesis accuracy.

To this end, we present a novel training setup for code generation that encourages the model to choose the most useful subset of idioms and the best representation of each program in terms of the idioms. It works by (1) marking occurrences of the idioms in the training set; (2) at training time, encouraging the model to emit either the whole idiom or its body for every potential idiom occurrence in the AST; and (3) at inference time, replacing the model's state after emitting an idiom $I$ with the state the model would have if it had emitted $I$'s body step by step.
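The marking step above can be sketched as follows. This is illustrative, not the paper's pipeline: `match(idiom, node)` is an assumed helper that returns a bindings dict when the idiom's fragment matches rooted at `node`, and `None` otherwise:

```python
# Sketch: for every node of a training AST (in generation order), record
# which mined idioms match rooted there, so the training objective can
# credit either the original action or any matching idiom at that step.
def mark_occurrences(nodes_in_order, idioms, match):
    marks = []
    for node in nodes_in_order:
        marks.append([i for i, idiom in enumerate(idioms)
                      if match(idiom, node) is not None])
    return marks
```

Crucially, overlapping occurrences are all retained in `marks` rather than resolved by a fixed rewriting strategy, leaving the choice to the model.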

4.1 Model Architecture

The neural synthesis model of Patois combines a spec encoder and a tree-based AST decoder, following the formulation of Yin and Neubig (2017); Yin et al. (2018b). The encoder embeds the natural language specification into word representations $\psi$. The decoder uses an LSTM to model the sequential generation process of an AST in depth-first order, wherein each timestep of the LSTM corresponds to an action $a_t$ — either (a) expanding a production from the grammar, (b) expanding an idiom, or (c) generating a terminal token. Thus the objective, the probability of generating an AST $T$ given the embedded spec $\psi$, is

$$\Pr(T \mid \psi) = \prod_{t} \Pr(a_t \mid a_{<t}, \psi), \tag{1}$$

where $a_t$ is the action taken at timestep $t$, and $a_{<t}$ is the partial AST generated before $t$. The probability is computed from the decoder's hidden state $h_t$ differently depending on the kind of action $a_t$.

Production Actions

For actions corresponding to expanding production rules from the original CFG $G$, we compute the probability by encoding the current partial AST structure similarly to Yin and Neubig (2017). Specifically, we compute the new hidden state as

$$h_t = \mathrm{LSTM}\big([a_{t-1} ; c_t ; h_{p_t} ; n_t],\ h_{t-1}\big),$$

where $a_{t-1}$ is the embedding of the previous action, $c_t$ is the result of soft attention applied to the spec embeddings $\psi$ as per Bahdanau et al. (2015), $p_t$ is the timestep corresponding to expanding the parent AST node of the current node, and $n_t$ is the embedding of the current node type. The hidden state $h_t$ is then used to compute probabilities of the syntactically appropriate production rules $R$:

$$\Pr(a_t = R \mid a_{<t}, \psi) = \mathrm{softmax}_R\big(W_2 \tanh(W_1 h_t)\big), \tag{2}$$

where $W_2 \tanh(W_1 \cdot)$ is a 2-layer DNN with a $\tanh$ non-linearity.
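The production-rule probability above — a 2-layer tanh network followed by a softmax restricted to syntactically valid rules — can be sketched numerically. The toy weights below are illustrative, not learned parameters:

```python
# Minimal numeric sketch: score all rules with a 2-layer tanh network
# over the decoder state h_t, then softmax only over the valid rules.
import math

def production_probs(h, W1, b1, W2, b2, valid_rules):
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(W1, b1)]
    scores = [sum(w * x for w, x in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    exps = {r: math.exp(scores[r]) for r in valid_rules}  # mask invalid rules
    z = sum(exps.values())
    return {r: v / z for r, v in exps.items()}

probs = production_probs(h=[1.0, -0.5],
                         W1=[[0.4, 0.1], [0.2, -0.3]], b1=[0.0, 0.0],
                         W2=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], b2=[0.0, 0.0, 0.0],
                         valid_rules=[0, 2])
```

Restricting the softmax to `valid_rules` is what guarantees that only syntactically appropriate expansions receive probability mass.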

Terminal Actions

For actions that generate terminal tokens, we compute the probability by combining a small vocabulary of tokens commonly observed in the training data with a copying mechanism (Ling et al., 2016; See et al., 2017) over the input to handle UNK tokens. Specifically, we learn two scoring functions $s_{\mathrm{gen}}$ and $s_{\mathrm{copy}}$, such that $s_{\mathrm{gen}}(h_t)$ produces a score for each vocabulary token


Figure 3: Decoding the AST for sorted(my_list, reverse=True); figure adapted from Yin and Neubig (2017). Suppose an idiom sorted(⟨hole⟩, reverse=True) is mined and added as an operator to the grammar. At training time, Patois adjusts the cross-entropy objective at the corresponding timestep to additionally allow the idiom as a valid production, with no change to further decoding. At inference time, if the decoder emits the idiom action, Patois unrolls it on the fly by teacher-forcing the shaded portion of the AST generation.

and $s_{\mathrm{copy}}(h_t, \psi_i)$ computes a score for copying the $i$-th token from the input. The scores are then normalized jointly, with the entries corresponding to the same constant merged, as in Yin and Neubig (2017); Brockschmidt et al. (2019).
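The joint normalization can be sketched as follows. The scores are toy values; the point is that generation and copy entries share one softmax, and mass for entries naming the same surface token is merged:

```python
# Sketch: normalize vocabulary-generation scores and per-position copy
# scores together, merging probability mass for duplicate tokens.
import math

def terminal_probs(gen_scores, input_tokens, copy_scores):
    entries = list(gen_scores.items()) + list(zip(input_tokens, copy_scores))
    z = sum(math.exp(s) for _, s in entries)
    probs = {}
    for tok, s in entries:              # merge duplicate surface tokens
        probs[tok] = probs.get(tok, 0.0) + math.exp(s) / z
    return probs

p = terminal_probs({"a": 0.0}, ["a", "b"], [0.0, 0.0])
```

Here the token "a" can be produced both from the vocabulary and by copying, so its two entries are merged into one probability (2/3 vs. 1/3 for "b").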

4.2 Training to Emit Idioms

As discussed earlier, training the model to emit idioms presents computational and learning challenges. Ideally, we would like to extend Eq. 1 to maximize

$$\Pr(T \mid \psi) = \sum_{\tau \in \mathcal{T}(T)} \prod_t \Pr(a_t^{\tau} \mid a_{<t}^{\tau}, \psi), \tag{3}$$

where $\mathcal{T}(T)$ is the set of different action traces that may produce the output AST $T$. The traces differ only in their possible choices of idiom actions that emit some tree fragments of $T$ in a single step. However, computing Eq. 3 is intractable because idiom occurrences overlap and cause a combinatorial explosion in the number of traces. Instead, we apply Jensen's inequality and maximize a lower bound:

$$\log \Pr(T \mid \psi) \geq \sum_t \log \Big( \Pr(a_t^* \mid a_{<t}^*, \psi) + \sum_{I \in \mathcal{I}_t} \Pr(a_I \mid a_{<t}^*, \psi) \Big), \tag{4}$$

where $a_t^*$ is the action taken at timestep $t$ in the original generation trace that emits the AST without using idioms, and $\mathcal{I}_t$ is the set of idioms that occur (match as tree fragments) rooted at the next node to be expanded at timestep $t$. Figure 3 illustrates this process on an example.

Intuitively, at each timestep during training we encourage the model to emit either the original AST action for this timestep or any applicable idiom that matches the AST at this step, with no penalty for either choice. However, to avoid the combinatorial explosion, we only teacher-force the original generation trace (not the idiom bodies), thus optimizing the lower bound in Eq. 4. At inference time, whenever the model emits an idiom action $a_I$, we teacher-force the body of $I$ to emulate tree fragment expansion and continue decoding from the resulting state.
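The per-timestep objective described above — credit the original action plus every matching idiom — can be sketched with toy probabilities. The "orig" key and the list-based inputs are hypothetical conveniences, not the paper's data structures:

```python
# Sketch of the idiom-aware objective: at each timestep the target
# probability is Pr(original action) + sum of Pr(idiom action) over all
# idioms matching at that node; the log-likelihood sums the logs.
import math

def idiom_aware_log_likelihood(step_probs, matching_idioms):
    """step_probs: per timestep, {action: probability};
    matching_idioms: per timestep, the idiom actions matching there."""
    total = 0.0
    for probs, idioms in zip(step_probs, matching_idioms):
        p = probs["orig"] + sum(probs[i] for i in idioms)
        total += math.log(p)
    return total

ll = idiom_aware_log_likelihood([{"orig": 0.5, "I1": 0.25}], [["I1"]])
```

With one timestep where the original action has probability 0.5 and a matching idiom 0.25, the objective is log(0.75), strictly better than the idiom-free log(0.5): matching idioms never penalize the model.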

5 Evaluation


We evaluate Patois on two semantic parsing datasets: Hearthstone (Ling et al., 2016) and Spider (Yu et al., 2018).

Hearthstone is a dataset of 665 card descriptions from the trading card game of the same name, along with the implementations of their effects in Python using the game APIs. The descriptions act as NL specs and are on average 39.1 words long.

Spider is a dataset of 10,181 questions describing 5,693 unique SQL queries over 200 databases with multiple tables each. Each question pertains to a particular database, whose schema is given to the synthesizer. Database schemas do not overlap between the train and test splits, thus challenging the model to generalize across different domains. The questions are on average 13 words long and databases have on average 27.6 columns and 8.8 foreign keys.


We mine the idioms using the training split of each dataset. Thus, Patois cannot indirectly overfit to the test set by learning its idioms, but it also cannot generalize beyond the idioms that occur in the training set. We run type-based MCMC (Section 3) for 10 iterations. After ranking (with either of the two ranking functions) and filtering, we use the top $K$ ranked idioms to train the generative model. We ran ablation experiments with $K \in \{10, 20, 40, 80\}$.

As described in Section 4, for all our experiments we used a tree-based decoder with a pointer mechanism as the synthesizer, which we implemented in PyTorch (Paszke et al., 2017). For the Hearthstone dataset, we use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to implement the description encoder, similarly to Yin and Neubig (2017). The word embeddings and hidden LSTM states have dimension 256. The models are trained using the Adadelta optimizer (Zeiler, 2012) for up to 2,600 steps with a batch size of 10.

For the Spider dataset, word embeddings have dimension 300, and hidden LSTM states have dimension 256. The models are trained using the Adam optimizer (Kingma and Ba, 2015) for up to 40,000 steps with a batch size of 10. The learning rate warms up linearly to 2.5×10⁻⁴ during the first 2,000 steps, and then decays polynomially with the step number. Each model configuration is trained on a single NVIDIA P40 or NVIDIA GeForce GTX 1080 Ti GPU.

The Spider tasks additionally include the database schema as an input in the description. We follow a recent approach of embedding the schema using relation-aware self-attention within the encoder Shin (2019). Specifically, we initialize a representation for each column, table, and word in the question, and then update these representations using 4 layers of relation-aware self-attention Shaw et al. (2018a) using a graph that describes the relations between columns and tables in the schema. See Section A in the appendix for more details about the encoder.

5.1 Experimental Results

In each configuration, we compare the performance of equivalent trained models on the same dataset with and without idiom-based training of Patois. For fairness, we show the performance of the same decoder implementation described in Section 4 as a baseline rather than the state-of-the-art results achieved by different approaches from the literature. Following prior work, we evaluate program generation as a semantic parsing task, and measure (i) exact match accuracy and BLEU scores for Hearthstone and (ii) exact match accuracy of program sketches for Spider.

Model                       Exact match   Sentence BLEU   Corpus BLEU
Baseline decoder            0.197         0.767           0.763
Patois (ranking 1), K=10    0.151         0.781           0.785
                    K=20    0.091         0.745           0.745
                    K=40    0.167         0.765           0.764
                    K=80    0.197         0.780           0.774
Patois (ranking 2), K=10    0.151         0.780           0.783
                    K=20    0.167         0.787           0.782
                    K=40    0.182         0.773           0.770
                    K=80    0.151         0.771           0.768
Table 2: Ablation tests on the Hearthstone dev set.

Model                       Exact match
Baseline decoder            0.395
Patois (ranking 1), K=10    0.394
                    K=20    0.379
                    K=40    0.395
                    K=80    0.407
Patois (ranking 2), K=10    0.368
                    K=20    0.382
                    K=40    0.387
                    K=80    0.416
Table 3: Ablation tests on the Spider dev set.

    def __init__(self):
        super().__init__(⟨hole⟩, ⟨hole⟩, CHARACTER_CLASS.⟨hole⟩,
                         CARD_RARITY.⟨hole⟩, ⟨hole⟩)

    ⟨hole⟩ = copy.copy(⟨hole⟩)

    class ⟨hole⟩(⟨hole⟩):
        def __init__(self):

    SELECT COUNT(⟨hole⟩), ⟨hole⟩ WHERE ⟨hole⟩ INTERSECT ⟨hole⟩ EXCEPT ⟨hole⟩

    WHERE ⟨hole⟩ = $terminal

Figure 4: Five examples of commonly used idioms from the Hearthstone and Spider datasets; ⟨hole⟩ marks a named argument.
Model      Exact match   Sentence BLEU   Corpus BLEU
Baseline   0.152         0.743           0.723
Patois     0.197         0.780           0.766
Table 4: Test set results on Hearthstone (using the best configurations on the dev set).

Tables 2 and 3 show our ablation analysis of different configurations of Patois on the Hearthstone and Spider dev sets, respectively. Table 4 shows the test set results of the best model configuration for Hearthstone (the test instances for the Spider dataset are unreleased). As the results show, small numbers of idioms do not significantly change the exact match accuracy but improve the BLEU score, and K = 80 gives a significant improvement in both the exact match accuracy and BLEU scores. The improvement is even more pronounced on the test set, with a 4.5-point improvement in exact match accuracy and more than 4 BLEU points, which shows that idioms mined on the training set generalize well to the whole data distribution. As mentioned above, we compare only to the same baseline architecture for fairness, but Patois could also easily be implemented on top of the structural CNN decoder of Sun et al. (2019), the current state of the art on the Hearthstone dataset.

Figure 4 shows some examples of idioms that were frequently used by the model. On Hearthstone, the most popular idioms involve common syntactic elements (class and function definitions) and domain-specific APIs commonly used in card implementations (CARD_RARITY enumerations or copy.copy calls). On Spider, they capture the most common combinations of SQL syntax, such as a SELECT query with a single COUNT column and optional INTERSECT or EXCEPT clauses. Notably, popular idioms are also often big: for instance, the first idiom in Figure 4 expands to a tree fragment with more than 20 nodes. Emitting it in a single step vastly simplifies the decoding process.

6 Related Work

Program synthesis & Semantic parsing

Program synthesis from natural language and input-output examples has a long history in the Programming Languages (PL) and Machine Learning (ML) communities (see Gulwani et al. (2017) for a survey). When the input specification is limited to natural language, the resulting problem can be considered semantic parsing (Liang, 2016). There has been a lot of recent interest in applying recurrent sequence-based and tree-based neural networks to semantic parsing (Yin and Neubig, 2017; Li et al., 2016; Dong and Lapata, 2016; Jia and Liang, 2016; Yin et al., 2018b). These approaches commonly use insights from the PL literature, such as grammar-based constraints to reduce the search space, non-deterministic training oracles to enable multiple executable interpretations of intent, and supervision from program execution. They typically either supervise the training on one or more golden programs, or use reinforcement learning to supervise the training from a neural program execution result (Neelakantan et al., 2017). Our Patois approach is applicable to any underlying neural semantic parsing model, as long as it is supervised by a corpus of golden programs. It is, however, most easily applicable to tree-based and graph-based models, which directly emit the AST of the target program. In this work we have evaluated Patois as applied on top of the sequence-to-tree decoder of Yin and Neubig (2017), and extended it with a novel training regime that teaches the decoder to emit idiom operators in place of the idiomatic code fragments.

Sketch generation

Two recent works (Dong and Lapata, 2018; Murali et al., 2018) learn abstractions of the target program to compress and abstract the reasoning process of a neural synthesizer. Both of them split the generation process into sketch generation and sketch completion, wherein the first stage emits a partial tree/sequence (a sketch of the program) and the second stage fills in the holes in this sketch. While sketch generation is typically implemented with a neural model, sketch completion can be either a different neural model or a combinatorial search. In contrast to Patois, both works define the grammar of sketches manually by a deterministic program abstraction procedure and only allow a single top-level sketch for each program. In Patois, we learn the abstractions (code idioms) automatically from a corpus and allow them to appear anywhere in the program, as is common in real-life programming.

Learning abstractions

Recently, Ellis et al. (2018) developed an Explore, Compress & Compile (EC) framework for automatically learning DSLs for program synthesis from I/O examples (such as the DSLs used by FlashFill (Gulwani, 2011) and DeepCoder (Balog et al., 2017)). The workflow of EC is similar to Patois, with three stages: (a) learn new DSL subroutines from a corpus of tasks, (b) train a recognition model that maps a task specification to a distribution over DSL operators as in DeepCoder (Balog et al., 2017), and (c) use these operators in a program synthesizer. Patois differs from EC in three aspects: (i) we assume a natural language specification instead of examples, (ii) to handle NL specifications, our synthesizer is a neural semantic parser instead of enumerative search, and (iii) most importantly, we discover idioms that compress general-purpose languages instead of extending DSLs. Unlike for inductive synthesis DSLs such as FlashFill, the existence of useful DSL abstractions for general-purpose languages is not obvious, and our work is the first to demonstrate them.

As described previously, our code idiom mining is an extension of the procedure developed by Allamanis et al. (Allamanis and Sutton, 2014; Allamanis et al., 2018). While they were the first to use the tree substitution grammar formalism and Bayesian inference to find non-trivial common idioms in a corpus of code, their problem formalization does not involve any application of the learned idioms beyond their explanatory power.

7 Conclusion

Semantic parsing, or neural program synthesis from natural language, has made tremendous progress over the past years, but state-of-the-art models still struggle with program generation at multiple levels of abstraction. In this work, we present a framework that allows incorporating learned coding patterns from a corpus into the vocabulary of a neural synthesizer, thus enabling it to emit high-level or low-level program constructs interchangeably at each generation step. Our current instantiation, Patois, uses Bayesian inference to mine common code idioms, and employs a novel nondeterministic training regime to teach a tree-based generative model to optionally emit whole idiom fragments. Such dataset abstraction using idioms improves the performance of neural program synthesis.

Patois is only the first step toward learned abstractions in program synthesis. While code idioms often correlate with latent semantic concepts and our training regime allows the model to learn which idioms to use and in which context, our current method does not mine them with the intent to directly optimize their usefulness for generation. In future work, we want to alleviate this by jointly learning the mining and synthesis models, thus optimizing the idioms’ usefulness for synthesis by construction. We also want to incorporate program semantics into the idiom definition, such as data flow patterns or natural language phrases from task specs. Our goal is to push neural program synthesis toward human-like programming by letting it reason about semantic program abstractions.


References

  • Aggarwal and Han (2014) C. C. Aggarwal and J. Han. Frequent pattern mining. Springer, 2014.
  • Allamanis and Sutton (2014) M. Allamanis and C. Sutton. Mining idioms from source code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), pages 472–483. ACM, 2014.
  • Allamanis et al. (2018) M. Allamanis, E. T. Barr, C. Bird, P. Devanbu, M. Marron, and C. Sutton. Mining semantic loop idioms. IEEE Transactions on Software Engineering, 2018.
  • Bahdanau et al. (2015) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
  • Balog et al. (2017) M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow. DeepCoder: Learning to write programs. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
  • Brockschmidt et al. (2019) M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov. Generative code modeling with graphs. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.
  • Cohn et al. (2010) T. Cohn, P. Blunsom, and S. Goldwater. Inducing tree-substitution grammars. Journal of Machine Learning Research, 11(Nov):3053–3096, 2010.
  • Devlin et al. (2017a) J. Devlin, R. Bunel, R. Singh, M. Hausknecht, and P. Kohli. Neural program meta-induction. In Advances in Neural Information Processing Systems (NIPS), pages 2080–2088, 2017a.
  • Devlin et al. (2017b) J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A.-r. Mohamed, and P. Kohli. RobustFill: Neural program learning under noisy I/O. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017b.
  • Dong and Lapata (2016) L. Dong and M. Lapata. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
  • Dong and Lapata (2018) L. Dong and M. Lapata. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
  • Ellis et al. (2018) K. Ellis, L. Morales, M. Sablé-Meyer, A. Solar-Lezama, and J. Tenenbaum. Learning libraries of subroutines for neurally-guided Bayesian program induction. In Advances in Neural Information Processing Systems, pages 7816–7826, 2018.
  • Gulwani (2011) S. Gulwani. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th ACM Symposium on Principles of Programming Languages (POPL), volume 46, pages 317–330, 2011.
  • Gulwani et al. (2017) S. Gulwani, O. Polozov, and R. Singh. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119, 2017.
  • Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Jia and Liang (2016) R. Jia and P. Liang. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 12–22, 2016.
  • Kalyan et al. (2018) A. Kalyan, A. Mohta, O. Polozov, D. Batra, P. Jain, and S. Gulwani. Neural-guided deductive search for real-time program synthesis from examples. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
  • Li et al. (2016) Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.
  • Liang (2016) P. Liang. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59(9):68–76, 2016.
  • Liang et al. (2010) P. Liang, M. I. Jordan, and D. Klein. Type-based MCMC. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 573–581. Association for Computational Linguistics, 2010.
  • Ling et al. (2016) W. Ling, P. Blunsom, E. Grefenstette, K. M. Hermann, T. Kočiskỳ, F. Wang, and A. Senior. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 599–609, 2016.
  • Murali et al. (2018) V. Murali, L. Qi, S. Chaudhuri, and C. Jermaine. Neural sketch learning for conditional program generation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
  • Neelakantan et al. (2017) A. Neelakantan, Q. V. Le, M. Abadi, A. McCallum, and D. Amodei. Learning a natural language interface with neural programmer. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
  • Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
  • Polozov and Gulwani (2015) O. Polozov and S. Gulwani. FlashMeta: A framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), pages 107–126, 2015.
  • Post and Gildea (2009) M. Post and D. Gildea. Bayesian learning of a tree substitution grammar. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 45–48. Association for Computational Linguistics, 2009.
  • See et al. (2017) A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), volume 1, pages 1073–1083, 2017.
  • Sethuraman (1994) J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.
  • Shaw et al. (2018a) P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018a.
  • Shaw et al. (2018b) P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468. Association for Computational Linguistics, 2018b. doi: 10.18653/v1/N18-2074.
  • Shin (2019) R. Shin. Encoding database schemas with relation-aware self-attention for text-to-SQL parsers. arXiv preprint arXiv:1906.11790, 2019.
  • Sun et al. (2019) Z. Sun, Q. Zhu, L. Mou, Y. Xiong, G. Li, and L. Zhang. A grammar-based structural CNN decoder for code generation. In AAAI, 2019.
  • Teh and Jordan (2010) Y. W. Teh and M. I. Jordan. Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics, 1:158–207, 2010.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
  • Yin and Neubig (2017) P. Yin and G. Neubig. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada, July 2017.
  • Yin et al. (2018a) P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig. Learning to mine aligned code and natural language pairs from StackOverflow. In International Conference on Mining Software Repositories (MSR), pages 476–486. ACM, 2018a.
  • Yin et al. (2018b) P. Yin, C. Zhou, J. He, and G. Neubig. StructVAE: Tree-structured latent variable models for semi-supervised semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018b.
  • Yu et al. (2018) T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP, 2018.
  • Zeiler (2012) M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Appendix A Encoder for Spider dataset

Type of x | Type of y | Edge label | Description
Column | Column | Same-Table | x and y belong to the same table.
Column | Column | Foreign-Key-Col-F | x is a foreign key for y.
Column | Column | Foreign-Key-Col-R | y is a foreign key for x.
Column | Table | Primary-Key-F | x is the primary key of y.
Column | Table | Belongs-To-F | x is a column of y (but not the primary key).
Table | Column | Primary-Key-R | y is the primary key of x.
Table | Column | Belongs-To-R | y is a column of x (but not the primary key).
Table | Table | Foreign-Key-Tab-F | Table x has a foreign key column in y.
Table | Table | Foreign-Key-Tab-R | Same as above, but x and y are reversed.
Table | Table | Foreign-Key-Tab-B | x and y have foreign keys in both directions.
Table 5: Description of edge types present in the directed graph created to represent the schema. An edge exists from node x to node y if the pair fulfills one of the descriptions listed in the table, with the corresponding label; otherwise, no edge exists from x to y.

In the Spider dataset, each entry contains a question along with a database schema, containing tables and columns. We will use the following notation:

  • c_i for each column in the schema. Each column contains words c_{i,1}, …, c_{i,|c_i|}.

  • t_i for each table in the schema. Each table contains words t_{i,1}, …, t_{i,|t_i|}.

  • q for the input question. The question contains words q_1, …, q_{|q|}.

A.1 Encoding the Schema as a Graph

We begin by representing the database schema using a directed graph G = (V, E), where each node and edge has a label. We represent each table and column in the schema as a node in this graph, labeled with the words in its name; for columns, we prepend the type of the column to the label. For each pair of nodes x and y in the graph, Table 5 describes when there exists an edge from x to y and the label it should have.
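To make the construction concrete, the edge rules of Table 5 can be sketched in plain Python. The schema representation (column-to-table ownership, primary keys, and foreign keys as dictionaries) and the function name are our own illustrative assumptions, not part of the Patois implementation:

```python
def build_schema_graph(col_table, primary_key, foreign_key):
    """Sketch of the directed labeled schema graph of Table 5.

    col_table:   maps column id -> id of the table that owns it.
    primary_key: maps table id -> id of its primary-key column.
    foreign_key: maps column id -> id of the column it references.
    Returns a dict from (source node, target node) to edge label;
    nodes are ("col", i) or ("tab", j)."""
    edges = {}

    # Column-Column edges.
    for x in col_table:
        for y in col_table:
            if x != y and col_table[x] == col_table[y]:
                edges[("col", x), ("col", y)] = "Same-Table"
    for x, y in foreign_key.items():
        edges[("col", x), ("col", y)] = "Foreign-Key-Col-F"
        edges[("col", y), ("col", x)] = "Foreign-Key-Col-R"

    # Column-Table and Table-Column edges.
    for x, t in col_table.items():
        if primary_key.get(t) == x:
            edges[("col", x), ("tab", t)] = "Primary-Key-F"
            edges[("tab", t), ("col", x)] = "Primary-Key-R"
        else:
            edges[("col", x), ("tab", t)] = "Belongs-To-F"
            edges[("tab", t), ("col", x)] = "Belongs-To-R"

    # Table-Table edges induced by foreign keys.
    tab_fk = {(col_table[x], col_table[y]) for x, y in foreign_key.items()}
    for a, b in tab_fk:
        if (b, a) in tab_fk:
            edges[("tab", a), ("tab", b)] = "Foreign-Key-Tab-B"
        else:
            edges[("tab", a), ("tab", b)] = "Foreign-Key-Tab-F"
            edges[("tab", b), ("tab", a)] = "Foreign-Key-Tab-R"
    return edges
```

For instance, a two-table schema (students with an id primary key and a name column, plus an enrollment table whose column references the id) yields Same-Table, Primary-Key, Belongs-To, and Foreign-Key edges exactly as in Table 5.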

A.2 Initial Encoding of the Input

We now obtain an initial representation for each of the nodes in the graph, as well as for the words in the input question. Formally, we perform the following:

  c_i^init = BiLSTM_col(c_{i,1}, …, c_{i,|c_i|})
  t_i^init = BiLSTM_tab(t_{i,1}, …, t_{i,|t_i|})
  (q_1^init, …, q_{|q|}^init) = BiLSTM_q(q_1, …, q_{|q|})

where each of the BiLSTM functions first looks up word embeddings for each of the input tokens. The LSTMs do not share any parameters with each other.
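As a self-contained illustration of the bidirectional encoding step, here is a minimal NumPy sketch of a BiLSTM that concatenates forward and backward hidden states per token. The cell follows the standard formulation (Hochreiter and Schmidhuber, 1997), but the function names and the packing of the four gate weights into one matrix are our own assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, Wx, Wh, b):
    """One LSTM step; the four gates [i, f, o, g] are packed row-wise
    into Wx (4d x input_dim), Wh (4d x d), and b (4d)."""
    z = Wx @ x + Wh @ h + b
    d = h.shape[0]
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g
    return o * np.tanh(c), c

def bilstm(embeddings, params_f, params_b):
    """Runs an LSTM over the embedded tokens in both directions and
    concatenates the per-step hidden states. The final forward/backward
    states could serve as a single summary vector for a column or table
    name; params_f and params_b are (Wx, Wh, b) tuples."""
    d = params_f[2].shape[0] // 4
    outs_f, outs_b = [], []
    h, c = np.zeros(d), np.zeros(d)
    for x in embeddings:                      # forward pass
        h, c = lstm_cell(x, h, c, *params_f)
        outs_f.append(h)
    h, c = np.zeros(d), np.zeros(d)
    for x in reversed(embeddings):            # backward pass
        h, c = lstm_cell(x, h, c, *params_b)
        outs_b.append(h)
    outs_b.reverse()
    return [np.concatenate([f, b]) for f, b in zip(outs_f, outs_b)]
```

Each output vector has dimension 2d, one forward and one backward half, matching the usual BiLSTM convention.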

A.3 Relation-Aware Self-Attention

At this point, we have representations c_i^init, t_i^init, and q_i^init. Now, we would like to imbue these representations with the information in the schema graph. We use a form of self-attention [Vaswani et al., 2017] that is relation-aware [Shaw et al., 2018b] to achieve this goal.

In one step of relation-aware self-attention, we begin with an input of n elements x_1, …, x_n (where each x_i ∈ R^d) and transform each x_i into y_i ∈ R^d. We follow the formulation described in Shaw et al. [2018b]:

  e_ij = (x_i W^Q)(x_j W^K + r_ij^K)^T / sqrt(d)
  α_ij = softmax_j(e_ij)
  y_i = Σ_j α_ij (x_j W^V + r_ij^V)

The r_ij terms encode the relationship between the two elements x_i and x_j in the input. We explain how we obtain r_ij in the next part.
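The formulation of Shaw et al. [2018b] can be sketched as a single-head NumPy layer. The function name and the dense relation tensors R_K and R_V (holding r_ij^K and r_ij^V for every pair) are illustrative assumptions; a real implementation would be multi-headed and batched:

```python
import numpy as np

def relation_aware_attention(X, R_K, R_V, Wq, Wk, Wv):
    """Single-head relation-aware self-attention (Shaw et al., 2018).

    X:        (n, d) input elements x_1..x_n.
    R_K, R_V: (n, n, d) relation embeddings r_ij^K and r_ij^V.
    Wq/Wk/Wv: (d, d) projection matrices."""
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # e_ij = q_i . (k_j + r_ij^K) / sqrt(d)
    scores = np.einsum("id,ijd->ij", Q, K[None, :, :] + R_K) / np.sqrt(d)
    # Row-wise softmax over j, with the usual max-subtraction for stability.
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    # y_i = sum_j alpha_ij (v_j + r_ij^V)
    Y = np.einsum("ij,ijd->id", A, V[None, :, :] + R_V)
    return Y, A
```

Setting R_K and R_V to zero recovers standard scaled dot-product self-attention, which makes the relation terms easy to test in isolation.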

For the application within the Spider encoder, we first construct the input x of |c| + |t| + |q| elements using c_i^init, t_i^init, and q_i^init:

  x = (c_1^init, …, c_{|c|}^init, t_1^init, …, t_{|t|}^init, q_1^init, …, q_{|q|}^init)

We then apply a stack of 4 relation-aware self-attention layers, where the output of one layer becomes the input of the next. We set the output dimension equal to the input dimension to facilitate this stacking. The weights of the encoder layers are not tied; each layer has its own set of weights.

We define a discrete set of possible relation types, and map each type to an embedding to obtain r_ij^K and r_ij^V. We need a value of r_ij for every pair of elements in x. If x_i and x_j both correspond to nodes in G (i.e. each is either a column or table) with an edge from x_i to x_j, then we use the label on that edge (possibilities listed in Table 5).

However, this is not sufficient to obtain r_ij for every pair of x_i and x_j. In the graph we created for the schema, we have no nodes corresponding to the question words; not every pair of nodes in the graph has an edge between them (the graph is not complete); and we have no self-edges (for when i = j). As such, we add more types beyond what is defined in Table 5:

  • x_i ∈ question, x_j ∈ question: Question-Dist-d, where d = clip(j − i, D); clip(a, D) = max(−D, min(D, a)). We use D = 2.

  • If i = j, then Column-Identity or Table-Identity.

  • x_i ∈ question, x_j ∈ column or table; or x_i ∈ column or table, x_j ∈ question:
    Question-Column, Question-Table, Column-Question or Table-Question depending on the types of x_i and x_j.

  • Otherwise, one of Column-Column, Column-Table, Table-Column, or Table-Table.

In the end, we add 15 types beyond the 10 in Table 5, for a total of 25 types.
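The case analysis above can be sketched as a small lookup function; the function signature and the string naming of the relation types are our own assumptions:

```python
def relation_type(kind_i, kind_j, i, j, edge_label, D=2):
    """Maps a pair of encoder inputs to one of the 25 relation types.

    kind_i/kind_j: "question", "column", or "table".
    i, j:          positions of the two elements in the input x.
    edge_label:    the Table 5 label if the schema graph has an
                   edge from element i to element j, else None."""
    if kind_i == "question" and kind_j == "question":
        d = max(-D, min(D, j - i))  # clipped token distance
        return f"Question-Dist-{d}"
    if i == j:
        return f"{kind_i.capitalize()}-Identity"
    if edge_label is not None:      # schema-graph edge from Table 5
        return edge_label
    # Fallback types for unconnected pairs, including question-schema pairs.
    return f"{kind_i.capitalize()}-{kind_j.capitalize()}"
```

Counting the cases confirms the totals in the text: 2D + 1 = 5 Question-Dist types, 2 identity types, 4 question-schema types, and 4 unconnected schema-pair types, plus the 10 edge labels of Table 5, for 25 in all.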

After processing through the stack of encoder layers, we obtain

  (c_1^final, …, c_{|c|}^final, t_1^final, …, t_{|t|}^final, q_1^final, …, q_{|q|}^final)

We use c_i^final, t_i^final, and q_i^final in the decoder.