Data Recombination for Neural Semantic Parsing

06/11/2016 ∙ by Robin Jia, et al. ∙ Stanford University

Modeling crisp logical regularities is crucial in semantic parsing, making it difficult for neural models with no task-specific prior knowledge to achieve good results. In this paper, we introduce data recombination, a novel framework for injecting such prior knowledge into a model. From the training data, we induce a high-precision synchronous context-free grammar, which captures important conditional independence properties commonly found in semantic parsing. We then train a sequence-to-sequence recurrent network (RNN) model with a novel attention-based copying mechanism on datapoints sampled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the accuracy of our RNN model on three semantic parsing datasets, leading to new state-of-the-art performance on the standard GeoQuery dataset for models with comparable supervision.




1 Introduction

Semantic parsing, the precise translation of natural language utterances into logical forms, has many applications, including question answering [Zelle and Mooney1996, Zettlemoyer and Collins2005, Zettlemoyer and Collins2007, Liang et al.2011, Berant et al.2013], instruction following [Artzi and Zettlemoyer2013b], and regular expression generation [Kushman and Barzilay2013]. Modern semantic parsers [Artzi and Zettlemoyer2013a, Berant et al.2013] are complex pieces of software, requiring hand-crafted features, lexicons, and grammars.

Figure 1: An overview of our system. Given a dataset, we induce a high-precision synchronous context-free grammar. We then sample from this grammar to generate new “recombinant” examples, which we use to train a sequence-to-sequence RNN.

Meanwhile, recurrent neural networks (RNNs) have made swift inroads into many structured prediction tasks in NLP, including machine translation [Sutskever et al.2014, Bahdanau et al.2014] and syntactic parsing [Vinyals et al.2015b, Dyer et al.2015]. Because RNNs make very few domain-specific assumptions, they have the potential to succeed at a wide variety of tasks with minimal feature engineering. However, this flexibility also puts RNNs at a disadvantage compared to standard semantic parsers, which can generalize naturally by leveraging their built-in awareness of logical compositionality.

In this paper, we introduce data recombination, a generic framework for declaratively injecting prior knowledge into a domain-general structured prediction model. In data recombination, prior knowledge about a task is used to build a high-precision generative model that expands the empirical distribution by allowing fragments of different examples to be combined in particular ways. Samples from this generative model are then used to train a domain-general model. In the case of semantic parsing, we construct a generative model by inducing a synchronous context-free grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general model is a sequence-to-sequence RNN with a novel attention-based copying mechanism. Data recombination boosts the accuracy of our RNN model on three semantic parsing datasets. On the Geo dataset, data recombination improves test accuracy by percentage points over our baseline RNN, leading to new state-of-the-art results for models that do not use a seed lexicon for predicates.

2 Problem statement


Geo
x: “what is the population of iowa ?”
y: _answer ( NV , ( _population ( NV , V1 ) , _const ( V0 , _stateid ( iowa ) ) ) )

ATIS
x: “can you list all flights from chicago to milwaukee”
y: ( _lambda $0 e ( _and ( _flight $0 ) ( _from $0 chicago : _ci ) ( _to $0 milwaukee : _ci ) ) )

Overnight
x: “when is the weekly standup”
y: ( call listValue ( call getProperty meeting.weekly_standup ( string start_time ) ) )

Figure 2: One example from each of our domains. We tokenize logical forms as shown, thereby casting semantic parsing as a sequence-to-sequence task.

We cast semantic parsing as a sequence-to-sequence task. The input utterance $x$ is a sequence of words $x_1, \ldots, x_m \in \mathcal{V}^{(\text{in})}$, the input vocabulary; similarly, the output logical form $y$ is a sequence of tokens $y_1, \ldots, y_n \in \mathcal{V}^{(\text{out})}$, the output vocabulary. A linear sequence of tokens might appear to lose the hierarchical structure of a logical form, but there is precedent for this choice: Vinyals et al. (2015b) showed that an RNN can reliably predict tree-structured outputs in a linear fashion.

We evaluate our system on three existing semantic parsing datasets. Figure 2 shows sample input-output pairs from each of these datasets.

  • GeoQuery (Geo) contains natural language questions about US geography paired with corresponding Prolog database queries. We use the standard split of 600 training examples and 280 test examples introduced by Zettlemoyer and Collins (2005). We preprocess the logical forms to De Bruijn index notation to standardize variable naming.

  • ATIS (ATIS) contains natural language queries for a flights database paired with corresponding database queries written in lambda calculus. We train on examples and evaluate on the test examples used by Zettlemoyer and Collins (2007).

  • Overnight (Overnight) contains logical forms paired with natural language paraphrases across eight varied subdomains. Wang et al. (2015) constructed the dataset by generating all possible logical forms up to some depth threshold, then getting multiple natural language paraphrases for each logical form from workers on Amazon Mechanical Turk. We evaluate on the same train/test splits as Wang et al. (2015).

In this paper, we only explore learning from logical forms. In the last few years, there has been an emergence of semantic parsers learned from denotations [Clarke et al.2010, Liang et al.2011, Berant et al.2013, Artzi and Zettlemoyer2013b]. While our system cannot directly learn from denotations, it could be used to rerank candidate derivations generated by one of these other systems.

3 Sequence-to-sequence RNN Model

Our sequence-to-sequence RNN model is based on existing attention-based neural machine translation models [Bahdanau et al.2014, Luong et al.2015a], but also includes a novel attention-based copying mechanism. Similar copying mechanisms have been explored in parallel by Gu et al. (2016) and Gulcehre et al. (2016).

3.1 Basic Model


The encoder converts the input sequence $x_1, \ldots, x_m$ into a sequence of context-sensitive embeddings $b_1, \ldots, b_m$ using a bidirectional RNN [Bahdanau et al.2014]. First, a word embedding function $\phi^{(\text{in})}$ maps each word $x_i$ to a fixed-dimensional vector. These vectors are fed as input to two RNNs: a forward RNN and a backward RNN. The forward RNN starts with an initial hidden state $h_0^{F}$, and generates a sequence of hidden states $h_1^{F}, \ldots, h_m^{F}$ by repeatedly applying the recurrence

$$h_i^{F} = \mathrm{LSTM}\big(\phi^{(\text{in})}(x_i),\, h_{i-1}^{F}\big).$$

The recurrence takes the form of an LSTM [Hochreiter and Schmidhuber1997]. The backward RNN similarly generates hidden states $h_m^{B}, \ldots, h_1^{B}$ by processing the input sequence in reverse order. Finally, for each input position $i$, we define the context-sensitive embedding $b_i$ to be the concatenation of $h_i^{F}$ and $h_i^{B}$.
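The bidirectional encoding described above can be sketched in a few lines of Python. This is only an illustration of the control flow: the `step` argument is a hypothetical stand-in for the LSTM recurrence (here replaced by a toy running sum), and all names are assumptions rather than the authors' implementation.

```python
def bidirectional_encode(inputs, step, h0_forward, h0_backward):
    """Run a recurrence left-to-right and right-to-left, then concatenate.

    inputs: list of input vectors (lists of floats), one per word
    step:   hypothetical recurrence (input_vector, hidden) -> new hidden
    """
    forward, h = [], h0_forward
    for v in inputs:
        h = step(v, h)
        forward.append(h)

    backward, h = [], h0_backward
    for v in reversed(inputs):
        h = step(v, h)
        backward.append(h)
    backward.reverse()  # align backward states with input positions

    # The embedding for position i concatenates both directions.
    return [f + b for f, b in zip(forward, backward)]


# Toy recurrence (NOT an LSTM): a running sum of one-dimensional inputs.
def toy_step(v, h):
    return [h[0] + v[0]]


embeddings = bidirectional_encode([[1.0], [2.0], [3.0]], toy_step, [0.0], [0.0])
```

With the toy recurrence, the first component of each embedding summarizes the words to the left of a position and the second summarizes the words to its right, which is the property the real encoder relies on.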


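The decoder is an attention-based model [Bahdanau et al.2014, Luong et al.2015a] that generates the output sequence $y_1, \ldots, y_n$ one token at a time. At each time step $j$, it writes $y_j$ based on the current hidden state $s_j$, then updates the hidden state to $s_{j+1}$ based on $s_j$ and $y_j$. Formally, the decoder is defined by the following equations, where $h_m^{F}$ and $h_1^{B}$ are the final forward and backward encoder states and $b_i$ is the encoder embedding of input position $i$:

$$
\begin{aligned}
s_1 &= \tanh\big(W^{(s)} [h_m^{F}, h_1^{B}]\big), \\
e_{ji} &= s_j^{\top} W^{(a)} b_i, \\
\alpha_{ji} &= \frac{\exp(e_{ji})}{\sum_{i'=1}^{m} \exp(e_{ji'})}, \\
c_j &= \sum_{i=1}^{m} \alpha_{ji} b_i, \\
P(y_j = w \mid x, y_{1:j-1}) &\propto \exp\big(U_w [s_j, c_j]\big), \\
s_{j+1} &= \mathrm{LSTM}\big([\phi^{(\text{out})}(y_j), c_j],\, s_j\big).
\end{aligned}
$$

When not specified, $i$ ranges over $\{1, \ldots, m\}$ and $j$ ranges over $\{1, \ldots, n\}$. Intuitively, the $\alpha_{ji}$'s define a probability distribution over the input words, describing what words in the input the decoder is focusing on at time $j$. They are computed from the unnormalized attention scores $e_{ji}$. The matrices $W^{(s)}$, $W^{(a)}$, and $U$, as well as the embedding function $\phi^{(\text{out})}$, are parameters of the model.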
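To make the attention step concrete, here is a minimal pure-Python sketch. It assumes a bilinear score between the decoder state and each encoder embedding, normalizes the scores with a softmax, and averages the embeddings into a context vector. The function names and the toy two-dimensional vectors are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(s, W, b):
    """Unnormalized attention score s^T W b for vectors s, b and matrix W."""
    Wb = [sum(W[r][c] * b[c] for c in range(len(b))) for r in range(len(W))]
    return sum(s[r] * Wb[r] for r in range(len(s)))


def attend(s, W, embeddings):
    """Return (attention weights, context vector) for decoder state s."""
    scores = [bilinear_score(s, W, b) for b in embeddings]
    alphas = softmax(scores)
    dim = len(embeddings[0])
    context = [sum(a * b[d] for a, b in zip(alphas, embeddings)) for d in range(dim)]
    return alphas, context


# Toy example: the identity matrix makes the score a plain dot product.
s = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
alphas, context = attend(s, W, embeddings)
```

In this toy setting the first input position receives the most attention because its embedding is most aligned with the decoder state.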

3.2 Attention-based Copying

In the basic model of the previous section, the next output word is chosen via a simple softmax over all words in the output vocabulary. However, this model has difficulty generalizing to the long tail of entity names commonly found in semantic parsing datasets. Conveniently, entity names in the input often correspond directly to tokens in the output (e.g., “iowa” becomes iowa in Figure 2). (Footnote 1: On Geo and ATIS, we make a point not to rely on orthography for non-entities such as “state” to _state, since this leverages information not available to previous models [Zettlemoyer and Collins2005] and is much less language-independent.)

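To capture this intuition, we introduce a new attention-based copying mechanism. At each time step $j$, the decoder generates one of two types of actions. As before, it can write any word in the output vocabulary. In addition, it can copy any input word $x_i$ directly to the output, where the probability with which we copy $x_i$ is determined by the attention score on $x_i$. Formally, we define a latent action $a_j$ that is either $\mathrm{Write}[w]$ for some $w \in \mathcal{V}^{(\text{out})}$ or $\mathrm{Copy}[i]$ for some $i \in \{1, \ldots, m\}$. We then have

$$
\begin{aligned}
P(a_j = \mathrm{Write}[w] \mid x, y_{1:j-1}) &\propto \exp\big(U_w [s_j, c_j]\big), \\
P(a_j = \mathrm{Copy}[i] \mid x, y_{1:j-1}) &\propto \exp(e_{ji}).
\end{aligned}
$$

The decoder chooses $a_j$ with a softmax over all these possible actions; $y_j$ is then a deterministic function of $a_j$ and $x$. During training, we maximize the log-likelihood of $y$, marginalizing out $a$.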
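The following sketch illustrates the joint softmax over write and copy actions and the marginalization over the latent action. The scores are made-up toy values, and the function names are assumptions for illustration only.

```python
import math


def action_distribution(write_scores, copy_scores):
    """Softmax over the union of write and copy actions.

    write_scores: dict mapping output-vocabulary word -> unnormalized score
    copy_scores:  list of unnormalized attention scores, one per input position
    """
    all_scores = list(write_scores.values()) + list(copy_scores)
    top = max(all_scores)
    z = sum(math.exp(s - top) for s in all_scores)
    p_write = {w: math.exp(s - top) / z for w, s in write_scores.items()}
    p_copy = [math.exp(s - top) / z for s in copy_scores]
    return p_write, p_copy


def token_probability(token, input_words, p_write, p_copy):
    """Marginalize the latent action: sum every action that emits `token`."""
    prob = p_write.get(token, 0.0)
    prob += sum(p for word, p in zip(input_words, p_copy) if word == token)
    return prob


input_words = ["population", "of", "iowa"]
write_scores = {"_population": 1.0, "_stateid": 0.0}  # toy output vocabulary
copy_scores = [0.0, 0.0, 2.0]                         # attention favors "iowa"
p_write, p_copy = action_distribution(write_scores, copy_scores)
p_iowa = token_probability("iowa", input_words, p_write, p_copy)
```

Here "iowa" is not in the toy output vocabulary, so all of its probability mass comes from the copy action at the third input position.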

Attention-based copying can be seen as a combination of a standard softmax output layer of an attention-based model [Bahdanau et al.2014] and a Pointer Network [Vinyals et al.2015a]; in a Pointer Network, the only way to generate output is to copy a symbol from the input.

4 Data Recombination


Examples

(“what states border texas ?”,
answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas)))))

(“what is the highest mountain in ohio ?”,
answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio))))))

Rules created by AbsEntities

Root → ⟨“what states border StateId ?”,
answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(StateId))))⟩

StateId → ⟨“texas”, texas⟩

Root → ⟨“what is the highest mountain in StateId ?”,
answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(StateId)))))⟩

StateId → ⟨“ohio”, ohio⟩

Rules created by AbsWholePhrases

Root → ⟨“what states border State ?”, answer(NV, (state(V0), next_to(V0, NV), State))⟩

State → ⟨“states border texas”, state(V0), next_to(V0, NV), const(V0, stateid(texas))⟩

Root → ⟨“what is the highest mountain in State ?”,
answer(NV, highest(V0, (mountain(V0), loc(V0, NV), State)))⟩

Rules created by Concat-2

Root → ⟨Sent1 </s> Sent2, Sent1 </s> Sent2⟩

Sent → ⟨“what states border texas ?”,
answer(NV, (state(V0), next_to(V0, NV), const(V0, stateid(texas))))⟩

Sent → ⟨“what is the highest mountain in ohio ?”,
answer(NV, highest(V0, (mountain(V0), loc(V0, NV), const(V0, stateid(ohio)))))⟩

Figure 3: Various grammar induction strategies illustrated on Geo. Each strategy converts the rules of an input grammar into rules of an output grammar. This figure shows the base case where the input grammar has rules for each pair in the training dataset.

4.1 Motivation

The main contribution of this paper is a novel data recombination framework that injects important prior knowledge into our oblivious sequence-to-sequence RNN. In this framework, we induce a high-precision generative model from the training data, then sample from it to generate new training examples. The process of inducing this generative model can leverage any available prior knowledge, which is transmitted through the generated examples to the RNN model. A key advantage of our two-stage approach is that it allows us to declare desired properties of the task which might be hard to capture in the model architecture.

Our approach generalizes data augmentation, which is commonly employed to inject prior knowledge into a model. Data augmentation techniques focus on modeling invariances: transformations like translating an image or adding noise that alter the inputs $x$, but do not change the output $y$. These techniques have proven effective in areas like computer vision [Krizhevsky et al.2012] and speech recognition [Jaitly and Hinton2013].

In semantic parsing, however, we would like to capture more than just invariance properties. Consider an example with the utterance “what states border texas ?”. Given this example, it should be easy to generalize to questions where “texas” is replaced by the name of any other state: simply replace the mention of Texas in the logical form with the name of the new state. Underlying this phenomenon is a strong conditional independence principle: the meaning of the rest of the sentence is independent of the name of the state in question. Standard data augmentation is not sufficient to model such phenomena: instead of holding $y$ fixed, we would like to apply simultaneous transformations to $x$ and $y$ such that the new $x$ still maps to the new $y$. Data recombination addresses this need.

4.2 General Setting

In the general setting of data recombination, we start with a training set $\mathcal{D}$ of $(x, y)$ pairs, which defines the empirical distribution $\hat{p}(x, y)$. We then fit a generative model $\tilde{p}(x, y)$ to $\hat{p}$ which generalizes beyond the support of $\hat{p}$, for example by splicing together fragments of different examples. We refer to examples in the support of $\tilde{p}$ as recombinant examples. Finally, to train our actual model $p_\theta(y \mid x)$, we maximize the expected value of $\log p_\theta(y \mid x)$, where $(x, y)$ is drawn from $\tilde{p}$.

4.3 SCFGs for Semantic Parsing

For semantic parsing, we induce a synchronous context-free grammar (SCFG) to serve as the backbone of our generative model $\tilde{p}$. An SCFG consists of a set of production rules $X \to \langle \alpha, \beta \rangle$, where $X$ is a category (non-terminal), and $\alpha$ and $\beta$ are sequences of terminal and non-terminal symbols. Any non-terminal symbols in $\alpha$ must be aligned to the same non-terminal symbol in $\beta$, and vice versa. Therefore, an SCFG defines a set of joint derivations of aligned pairs of strings. In our case, we use an SCFG to represent joint derivations of utterances $x$ and logical forms $y$ (which for us is just a sequence of tokens). After we induce an SCFG $G$ from the training set $\mathcal{D}$, the corresponding generative model $\tilde{p}(x, y)$ is the distribution over pairs $(x, y)$ defined by sampling from $G$, where we choose production rules to apply uniformly at random.
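Forward sampling from such a grammar is simple to sketch. The representation below is an assumption made for illustration: a grammar is a dict from category name to a list of rules, terminals are strings, and a non-terminal is a tuple `(category, index)` that appears on both sides of a rule so that both sides share one sampled sub-derivation. The toy rules mirror the AbsEntities rules of Figure 3.

```python
import random


def sample(grammar, category="Root", rng=random):
    """Sample an (utterance tokens, logical-form tokens) pair from an SCFG."""
    alpha, beta = rng.choice(grammar[category])  # rule chosen uniformly at random
    subderivations = {}
    for symbol in alpha:
        if isinstance(symbol, tuple) and symbol not in subderivations:
            subderivations[symbol] = sample(grammar, symbol[0], rng)

    def expand(side, which):
        out = []
        for symbol in side:
            if isinstance(symbol, tuple):
                out.extend(subderivations[symbol][which])
            else:
                out.append(symbol)
        return out

    return expand(alpha, 0), expand(beta, 1)


STATE = ("StateId", 0)


def rule(utterance, logical_form):
    """Whitespace-tokenize; the placeholder 'STATE' becomes the non-terminal."""
    def to_symbols(text):
        return [STATE if tok == "STATE" else tok for tok in text.split()]
    return to_symbols(utterance), to_symbols(logical_form)


GRAMMAR = {
    "Root": [
        rule("what states border STATE ?",
             "answer ( NV , ( state ( V0 ) , next_to ( V0 , NV ) , "
             "const ( V0 , stateid ( STATE ) ) ) )"),
        rule("what is the highest mountain in STATE ?",
             "answer ( NV , highest ( V0 , ( mountain ( V0 ) , loc ( V0 , NV ) , "
             "const ( V0 , stateid ( STATE ) ) ) ) )"),
    ],
    "StateId": [(["texas"], ["texas"]), (["ohio"], ["ohio"])],
}

x, y = sample(GRAMMAR, "Root", random.Random(0))
```

Because the utterance and logical form are expanded from the same sub-derivation, a sampled pair such as “what states border ohio ?” always carries the matching entity in its logical form.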

It is instructive to compare our SCFG-based data recombination with Wasp [Wong and Mooney2006, Wong and Mooney2007], which uses an SCFG as the actual semantic parsing model. The grammar induced by Wasp must have good coverage in order to generalize to new inputs at test time. Wasp also requires the implementation of an efficient algorithm for computing the conditional probability $p(y \mid x)$. In contrast, our SCFG is only used to convey prior knowledge about conditional independence structure, so it only needs to have high precision; our RNN model is responsible for boosting recall over the entire input space. We also only need to forward sample from the SCFG, which is considerably easier to implement than conditional inference.

Below, we examine various strategies for inducing a grammar $G$ from a dataset $\mathcal{D}$. We first encode $\mathcal{D}$ as an initial grammar with rules Root $\to \langle x, y \rangle$ for each $(x, y) \in \mathcal{D}$. Next, we will define each grammar induction strategy as a mapping from an input grammar $G_{\text{in}}$ to a new grammar $G_{\text{out}}$. This formulation allows us to compose grammar induction strategies (Section 4.3.4).

4.3.1 Abstracting Entities

Our first grammar induction strategy, AbsEntities, simply abstracts entities with their types. We assume that each entity $e$ (e.g., texas) has a corresponding type $e.t$ (e.g., state), which we infer based on the presence of certain predicates in the logical form (e.g. stateid). For each grammar rule $X \to \langle \alpha, \beta \rangle$ in $G_{\text{in}}$, where $\alpha$ contains a token (e.g., “texas”) that string matches an entity (e.g., texas) in $\beta$, we add two rules to $G_{\text{out}}$: (i) a rule where both occurrences are replaced with the type of the entity (e.g., state), and (ii) a new rule that maps the type to the entity (e.g., StateId $\to \langle$“texas”, texas$\rangle$; we reserve the category name State for the next section). Thus, $G_{\text{out}}$ generates recombinant examples that fuse most of one example with an entity found in a second example. A concrete example from the Geo domain is given in Figure 3.
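A minimal sketch of an AbsEntities-style transformation is shown below. It makes several simplifying assumptions that are not from the paper: entities are single tokens, the entity type is read from a hypothetical lookup table keyed by the predicate that wraps the entity, and logical forms are whitespace-tokenized as in Figure 2.

```python
# Hypothetical typing table: predicate that wraps an entity -> category name.
TYPE_OF_PREDICATE = {"_stateid": "StateId"}


def abs_entities(root_rules):
    """Abstract string-matched entities in (x_tokens, y_tokens) root rules."""
    out = {"Root": []}
    for x, y in root_rules:
        for k in range(2, len(y)):
            category = TYPE_OF_PREDICATE.get(y[k - 2])
            # Look for the pattern: predicate ( entity, with the entity in x.
            if category and y[k - 1] == "(" and y[k] in x:
                entity = y[k]
                nonterminal = (category, 0)
                # (i) Replace the entity on both sides with its type.
                out["Root"].append((
                    [nonterminal if tok == entity else tok for tok in x],
                    [nonterminal if tok == entity else tok for tok in y],
                ))
                # (ii) Add a rule mapping the type back to the entity.
                entity_rule = ([entity], [entity])
                if entity_rule not in out.setdefault(category, []):
                    out[category].append(entity_rule)
    return out


examples = [
    ("what states border texas ?".split(),
     "_answer ( NV , ( _state ( V0 ) , _next_to ( V0 , NV ) , "
     "_const ( V0 , _stateid ( texas ) ) ) )".split()),
    ("what is the highest mountain in ohio ?".split(),
     "_answer ( NV , _highest ( V0 , ( _mountain ( V0 ) , _loc ( V0 , NV ) , "
     "_const ( V0 , _stateid ( ohio ) ) ) ) )".split()),
]
grammar = abs_entities(examples)
```

Sampling from the resulting grammar can pair the first utterance template with “ohio”, which is exactly the kind of recombinant example the strategy is meant to produce.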

4.3.2 Abstracting Whole Phrases

Our second grammar induction strategy, AbsWholePhrases, abstracts both entities and whole phrases with their types. For each grammar rule $X \to \langle \alpha, \beta \rangle$ in $G_{\text{in}}$, we add up to two rules to $G_{\text{out}}$. First, if $\alpha$ contains tokens that string match to an entity in $\beta$, we replace both occurrences with the type of the entity, similarly to rule (i) from AbsEntities. Second, if we can infer that the entire expression $\beta$ evaluates to a set of a particular type (e.g. state) we create a rule that maps the type to $\langle \alpha, \beta \rangle$. In practice, we also use some simple rules to strip question identifiers from $\alpha$, so that the resulting examples are more natural. Again, refer to Figure 3 for a concrete example.

This strategy works because of a more general conditional independence property: the meaning of any semantically coherent phrase is conditionally independent of the rest of the sentence, the cornerstone of compositional semantics. Note that this assumption is not always correct in general: for example, phenomena like anaphora that involve long-range context dependence violate this assumption. However, this property holds in most existing semantic parsing datasets.

4.3.3 Concatenation

The final grammar induction strategy is a surprisingly simple approach we tried that turns out to work. For any $k \geq 2$, we define the Concat-$k$ strategy, which creates two types of rules. First, we create a single rule that has Root going to a sequence of $k$ Sent’s. Then, for each root-level rule Root $\to \langle \alpha, \beta \rangle$ in $G_{\text{in}}$, we add the rule Sent $\to \langle \alpha, \beta \rangle$ to $G_{\text{out}}$. See Figure 3 for an example.
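The concatenation strategy is short enough to sketch directly. The grammar representation (a dict of rule lists, with non-terminals written as `(category, index)` tuples) and the function name are assumptions for illustration; the `</s>` separator follows Figure 3.

```python
def concat_k(root_rules, k, sep="</s>"):
    """Build a Concat-k grammar from root-level rules (alpha, beta)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    side = []
    for index in range(1, k + 1):
        if index > 1:
            side.append(sep)
        side.append(("Sent", index))  # distinct index per sentence slot
    # Root expands to k separated Sent slots on both sides; every original
    # root-level rule becomes a Sent rule.
    return {"Root": [(list(side), list(side))], "Sent": list(root_rules)}


rules = [
    (["what", "states", "border", "texas", "?"], ["lf_1"]),
    (["what", "is", "the", "highest", "mountain", "in", "ohio", "?"], ["lf_2"]),
]
concat_2 = concat_k(rules, 2)
```

Giving each Sent slot its own index lets a sampler expand the slots independently, so two different training examples can be concatenated in one recombinant example.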

Unlike AbsEntities and AbsWholePhrases, concatenation is very general, and can be applied to any sequence transduction problem. Of course, it also does not introduce additional information about compositionality or independence properties present in semantic parsing. However, it does generate harder examples for the attention-based RNN, since the model must learn to attend to the correct parts of the now-longer input sequence. Related work has shown that training a model on more difficult examples can improve generalization, the most canonical case being dropout [Hinton et al.2012, Wager et al.2013].

4.3.4 Composition

We note that grammar induction strategies can be composed, yielding more complex grammars. Given any two grammar induction strategies $f_1$ and $f_2$, the composition $f_1 \circ f_2$ is the grammar induction strategy that takes in $G_{\text{in}}$ and returns $f_1(f_2(G_{\text{in}}))$. For the strategies we have defined, we can perform this operation symbolically on the grammar rules, without having to sample from the intermediate grammar $f_2(G_{\text{in}})$.
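If each strategy is modeled as a function from grammar to grammar, composition is ordinary function composition. The helper below is a hypothetical illustration; the two toy strategies simply record the order in which they were applied.

```python
from functools import reduce


def compose(*strategies):
    """Compose grammar-induction strategies; the rightmost is applied first."""
    def composed(grammar):
        return reduce(lambda g, f: f(g), reversed(strategies), grammar)
    return composed


# Toy stand-ins for real strategies: each appends its name to a list.
def strategy_a(grammar):
    return grammar + ["a"]


def strategy_b(grammar):
    return grammar + ["b"]


applied = compose(strategy_a, strategy_b)([])
```

The result shows that `strategy_b` runs before `strategy_a`, matching the usual reading of a composition of two functions.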

5 Experiments

We evaluate our system on three domains: Geo, ATIS, and Overnight. For ATIS, we report logical form exact match accuracy. For Geo and Overnight, we determine correctness based on denotation match, as in Liang et al. (2011) and Wang et al. (2015), respectively.

5.1 Choice of Grammar Induction Strategy

We note that not all grammar induction strategies make sense for all domains. In particular, we only apply AbsWholePhrases to Geo and Overnight. We do not apply AbsWholePhrases to ATIS, as the dataset has little nesting structure.

5.2 Implementation Details

function Train(dataset D, number of epochs T, number of examples to sample n)
     Induce grammar G from D
     Initialize RNN parameters θ randomly
     for each iteration t = 1, …, T do
          Compute current learning rate η_t
          Initialize current dataset D_t to D
          for i = 1, …, n do
               Sample new example (x′, y′) from G
               Add (x′, y′) to D_t
          end for
          Shuffle D_t
          for each example (x, y) in D_t do
               θ ← θ + η_t ∇ log p_θ(y | x)
          end for
     end for
end function
Figure 4: The training procedure with data recombination. We first induce an SCFG, then sample new recombinant examples from it at each epoch.

We tokenize logical forms in a domain-specific manner, based on the syntax of the formal language being used. On Geo and ATIS, we disallow copying of predicate names to ensure a fair comparison to previous work, as string matching between input words and predicate names is not commonly used. We prevent copying by prepending underscores to predicate tokens; see Figure 2 for examples.

On ATIS alone, when doing attention-based copying and data recombination, we leverage an external lexicon that maps natural language phrases (e.g., “kennedy airport”) to entities (e.g., jfk:ap). When we copy a word that is part of a phrase in the lexicon, we write the entity associated with that lexicon entry. When performing data recombination, we identify entity alignments based on matching phrases and entities from the lexicon.

We run all experiments with hidden units and -dimensional word vectors. We initialize all parameters uniformly at random within the interval . We maximize the log-likelihood of the correct logical form using stochastic gradient descent. We train the model for a total of epochs with an initial learning rate of , and halve the learning rate every epochs, starting after epoch . We replace word vectors for words that occur only once in the training set with a universal <unk> word vector. Our model is implemented in Theano [Bergstra et al.2010].

When performing data recombination, we sample a new round of recombinant examples from our grammar at each epoch. We add these examples to the original training dataset, randomly shuffle all examples, and train the model for the epoch. Figure 4 gives pseudocode for this training procedure. One important hyperparameter is how many examples to sample at each epoch: we found that a good rule of thumb is to sample as many recombinant examples as there are examples in the training dataset, so that half of the examples the model sees at each epoch are recombinant.
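The per-epoch data construction can be sketched as follows. The function and argument names are assumptions; `sample_recombinant` stands in for whatever sampler draws a new example from the induced grammar.

```python
import random


def epoch_dataset(train, sample_recombinant, num_samples, rng):
    """Original examples plus freshly sampled recombinant ones, shuffled."""
    data = list(train)
    data.extend(sample_recombinant(rng) for _ in range(num_samples))
    rng.shuffle(data)
    return data


train = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]


def toy_sampler(rng):
    # Hypothetical sampler: a real one would draw from the induced grammar.
    return ("x_recombinant", "y_recombinant")


# Rule of thumb from the text: sample as many examples as the training set has.
data = epoch_dataset(train, toy_sampler, len(train), random.Random(0))
```

Calling this once per epoch, rather than once before training, means the model sees a different set of recombinant examples in every pass over the data.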

At test time, we use beam search with beam size . We automatically balance missing right parentheses by adding them at the end. On Geo and Overnight, we then pick the highest-scoring logical form that does not yield an executor error when the corresponding denotation is computed. On ATIS, we just pick the top prediction on the beam.
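The two post-processing steps described above, closing unmatched parentheses and skipping candidates that fail to execute, can be sketched as below. The `execute` argument is a hypothetical executor that raises an exception on invalid logical forms; the toy executor here is only a placeholder.

```python
def balance_parentheses(tokens):
    """Append the right parentheses needed to close every unmatched '('."""
    depth = 0
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")" and depth > 0:
            depth -= 1
    return list(tokens) + [")"] * depth


def first_executable(beam, execute):
    """Return the best-ranked candidate that executes without an error.

    beam is a list of token lists ordered best-first.
    """
    for tokens in beam:
        candidate = balance_parentheses(tokens)
        try:
            execute(candidate)
        except Exception:
            continue
        return candidate
    return None


def toy_execute(tokens):
    # Placeholder executor: treats any form containing "bad" as an error.
    if "bad" in tokens:
        raise ValueError("executor error")
    return True


beam = [["(", "bad"], ["(", "ok", "(", "x"]]
chosen = first_executable(beam, toy_execute)
```

The top candidate is rejected by the toy executor, so the second candidate is returned after its two missing right parentheses are appended.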

5.3 Impact of the Copying Mechanism

Geo ATIS Overnight
No Copying
With Copying
Table 1: Test accuracy on Geo, ATIS, and Overnight, both with and without copying. On Overnight, we average across all eight domains.

First, we measure the contribution of the attention-based copying mechanism to the model’s overall performance. On each task, we train and evaluate two models: one with the copying mechanism, and one without. Training is done without data recombination. The results are shown in Table 1.

On Geo and ATIS, the copying mechanism helps significantly: it improves test accuracy by percentage points on Geo and points on ATIS. However, on Overnight, adding the copying mechanism actually makes our model perform slightly worse. This result is somewhat expected, as the Overnight dataset contains a very small number of distinct entities. It is also notable that both systems surpass the previous best system on Overnight by a wide margin.

We choose to use the copying mechanism in all subsequent experiments, as it has a large advantage in realistic settings where there are many distinct entities in the world. The concurrent work of Gu et al. (2016) and Gulcehre et al. (2016), both of whom propose similar copying mechanisms, provides additional evidence for the utility of copying on a wide range of NLP tasks.

5.4 Main Results

Previous Work
Liang et al. (2011) (Footnote 2: The method of Liang et al. (2011) is not comparable to ours, as they used a seed lexicon mapping words to predicates. We explicitly avoid using such prior knowledge in our system.)
Our Model
No Recombination
AE + C2
AWP + AE + C2
AE + C3
Table 2: Test accuracy using different data recombination strategies on Geo and ATIS. AE is AbsEntities, AWP is AbsWholePhrases, C2 is Concat-2, and C3 is Concat-3.
Basketball Blocks Calendar Housing Publications Recipes Restaurants Social Avg.
Previous Work
Our Model
No Recombination
AWP + AE + C2
Table 3: Test accuracy using different data recombination strategies on the Overnight tasks.

For our main results, we train our model with a variety of data recombination strategies on all three datasets. These results are summarized in Tables 2 and 3. We compare our system to the baseline of not using any data recombination, as well as to state-of-the-art systems on all three datasets.

We find that data recombination consistently improves accuracy across the three domains we evaluated on, and that the strongest results come from composing multiple strategies. Combining AbsWholePhrases, AbsEntities, and Concat-2 yields a percentage point improvement over the baseline without data recombination on Geo, and an average of percentage points on Overnight. In fact, on Geo, we achieve test accuracy of , which surpasses the previous state-of-the-art, excluding Liang et al. (2011), which used a seed lexicon for predicates. On ATIS, we experiment with concatenating more than 2 examples, to make up for the fact that we cannot apply AbsWholePhrases, which generates longer examples. We obtain a test accuracy of with AbsEntities composed with Concat-3, which beats the baseline by percentage points and is competitive with the state-of-the-art.

Data recombination without copying.

For completeness, we also investigated the effects of data recombination on the model without attention-based copying. We found that recombination helped significantly on Geo and ATIS, but hurt the model slightly on Overnight. On Geo, the best data recombination strategy yielded test accuracy of , for a gain of percentage points over the baseline with no copying and no recombination; on ATIS, data recombination gives test accuracies as high as , a point gain over the same baseline. However, no data recombination strategy improved average test accuracy on Overnight; the best one resulted in a percentage point decrease in test accuracy. We hypothesize that data recombination helps less on Overnight in general because the space of possible logical forms is very limited, making it more like a large multiclass classification task. Therefore, it is less important for the model to learn good compositional representations that generalize to new logical forms at test time.

5.5 Effect of Longer Examples

Depth-2 (same length)
x: “rel:12 of rel:17 of ent:14”
y: ( _rel:12 ( _rel:17 _ent:14 ) )

Depth-4 (longer)
x: “rel:23 of rel:36 of rel:38 of rel:10 of ent:05”
y: ( _rel:23 ( _rel:36 ( _rel:38 ( _rel:10 _ent:05 ) ) ) )

Figure 5: A sample of our artificial data.
Figure 6: The results of our artificial data experiments. We see that the model learns more from longer examples than from same-length examples.

Interestingly, strategies like AbsWholePhrases and Concat-2 help the model even though the resulting recombinant examples are generally not in the support of the test distribution. In particular, these recombinant examples are on average longer than those in the actual dataset, which makes them harder for the attention-based model. Indeed, for every domain, our best accuracy numbers involved some form of concatenation, and often involved AbsWholePhrases as well. In comparison, applying AbsEntities alone, which generates examples of the same length as those in the original dataset, was generally less effective.

We conducted additional experiments on artificial data to investigate the importance of adding longer, harder examples. We experimented with adding new examples via data recombination, as well as adding new independent examples (e.g. to simulate the acquisition of more training data). We constructed a simple world containing a set of entities and a set of binary relations. For any $n$, we can generate a set of depth-$n$ examples, which involve the composition of $n$ relations applied to a single entity. Example data points are shown in Figure 5. We train our model on various datasets, then test it on a set of randomly chosen depth-2 examples. The model always has access to a small seed training set of depth-2 examples. We then add one of four types of examples to the training set:

  • Same length, independent: New randomly chosen depth-2 examples. (Footnote 3: Technically, these are not completely independent, as we sample these new examples without replacement. The same applies to the longer “independent” examples.)

  • Longer, independent: Randomly chosen depth-4 examples.

  • Same length, recombinant: Depth-2 examples sampled from the grammar induced by applying AbsEntities to the seed dataset.

  • Longer, recombinant: Depth-4 examples sampled from the grammar induced by applying AbsWholePhrases followed by AbsEntities to the seed dataset.
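A generator for artificial examples in the format of Figure 5 might look like the sketch below. The numbers of relations and entities are placeholder assumptions, as are the function names; only the surface format of the utterance and logical form is taken from Figure 5.

```python
import random


def build_example(relations, entity):
    """Compose relations around an entity, following the Figure 5 format."""
    x = " of ".join(list(relations) + [entity])
    y = "_" + entity
    for rel in reversed(relations):
        y = f"( _{rel} {y} )"
    return x, y


def random_example(depth, rng, num_relations=40, num_entities=20):
    """Draw a depth-`depth` example; the world sizes are assumed values."""
    relations = [f"rel:{rng.randrange(num_relations):02d}" for _ in range(depth)]
    entity = f"ent:{rng.randrange(num_entities):02d}"
    return build_example(relations, entity)


example = build_example(["rel:12", "rel:17"], "ent:14")
longer = random_example(4, random.Random(0))
```

The first call reproduces the depth-2 example of Figure 5, and the second draws a random depth-4 example of the longer kind.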

To maintain consistency between the independent and recombinant experiments, we fix the recombinant examples across all epochs, instead of resampling at every epoch. In Figure 6, we plot accuracy on the test set versus the number of additional examples added of each of these four types. As expected, independent examples are more helpful than the recombinant ones, but both help the model improve considerably. In addition, we see that even though the test dataset only has short examples, adding longer examples helps the model more than adding shorter ones, in both the independent and recombinant cases. These results underscore the importance of training on longer, harder examples.

6 Discussion

In this paper, we have presented a novel framework we term data recombination, in which we generate new training examples from a high-precision generative model induced from the original training dataset. We have demonstrated its effectiveness in improving the accuracy of a sequence-to-sequence RNN model on three semantic parsing datasets, using a synchronous context-free grammar as our generative model.

There has been growing interest in applying neural networks to semantic parsing and related tasks. Dong and Lapata (2016) concurrently developed an attention-based RNN model for semantic parsing, although they did not use data recombination. Grefenstette et al. (2014) proposed a non-recurrent neural model for semantic parsing, though they did not run experiments. Mei et al. (2016) use an RNN model to perform a related task of instruction following.

Our proposed attention-based copying mechanism bears a strong resemblance to two models that were developed independently by other groups. Gu et al. (2016) apply a very similar copying mechanism to text summarization and single-turn dialogue generation. Gulcehre et al. (2016) propose a model that decides at each step whether to write from a “shortlist” vocabulary or copy from the input, and report improvements on machine translation and text summarization. Another piece of related work is Luong et al. (2015b), who train a neural machine translation system to copy rare words, relying on an external system to generate alignments.

Prior work has explored using paraphrasing for data augmentation on NLP tasks. Zhang et al. (2015) augment their data by swapping out words for synonyms from WordNet. Wang and Yang (2015) use a similar strategy, but identify similar words and phrases based on cosine distance between vector space embeddings. Unlike our data recombination strategies, these techniques only change inputs $x$, while keeping the labels $y$ fixed. Additionally, these paraphrasing-based transformations can be described in terms of grammar induction, so they can be incorporated into our framework.

In data recombination, data generated by a high-precision generative model is used to train a second, domain-general model. Generative oversampling [Liu et al.2007] learns a generative model in a multiclass classification setting, then uses it to generate additional examples from rare classes in order to combat label imbalance. Uptraining [Petrov et al.2010] uses data labeled by an accurate but slow model to train a computationally cheaper second model. Vinyals et al. (2015b) generate a large dataset of constituency parse trees by taking sentences that multiple existing systems parse in the same way, and train a neural model on this dataset.

Some of our induced grammars generate examples that are not in the test distribution, but nonetheless aid in generalization. Related work has also explored the idea of training on altered or out-of-domain data, often interpreting it as a form of regularization. Dropout training has been shown to be a form of adaptive regularization [Hinton et al.2012, Wager et al.2013]. Guu et al. (2015) showed that encouraging a knowledge base completion model to handle longer path queries acts as a form of structural regularization.

Language is a blend of crisp regularities and soft relationships. Our work takes RNNs, which excel at modeling soft phenomena, and uses a highly structured tool, synchronous context-free grammars, to infuse them with an understanding of crisp structure. We believe this paradigm for simultaneously modeling the soft and hard aspects of language should have broader applicability beyond semantic parsing.


This work was supported by the NSF Graduate Research Fellowship under Grant No. DGE-114747, and the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF-15-1-0462.


All code, data, and experiments for this paper are available on the CodaLab platform at


  • [Artzi and Zettlemoyer2013a] Y. Artzi and L. Zettlemoyer. 2013a. UW SPF: The University of Washington semantic parsing framework. arXiv preprint arXiv:1311.3011.
  • [Artzi and Zettlemoyer2013b] Y. Artzi and L. Zettlemoyer. 2013b. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL), 1:49–62.
  • [Bahdanau et al.2014] D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • [Berant et al.2013] J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Bergstra et al.2010] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference.
  • [Clarke et al.2010] J. Clarke, D. Goldwasser, M. Chang, and D. Roth. 2010. Driving semantic parsing from the world’s response. In Computational Natural Language Learning (CoNLL), pages 18–27.
  • [Dong and Lapata2016] L. Dong and M. Lapata. 2016. Language to logical form with neural attention. In Association for Computational Linguistics (ACL).
  • [Dyer et al.2015] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Association for Computational Linguistics (ACL).
  • [Grefenstette et al.2014] E. Grefenstette, P. Blunsom, N. de Freitas, and K. M. Hermann. 2014. A deep architecture for semantic parsing. In ACL Workshop on Semantic Parsing, pages 22–27.
  • [Gu et al.2016] J. Gu, Z. Lu, H. Li, and V. O. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics (ACL).
  • [Gulcehre et al.2016] C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio. 2016. Pointing the unknown words. In Association for Computational Linguistics (ACL).
  • [Guu et al.2015] K. Guu, J. Miller, and P. Liang. 2015. Traversing knowledge graphs in vector space. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Hinton et al.2012] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
  • [Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • [Jaitly and Hinton2013] N. Jaitly and G. E. Hinton. 2013. Vocal tract length perturbation (VTLP) improves speech recognition. In International Conference on Machine Learning (ICML).
  • [Krizhevsky et al.2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.
  • [Kushman and Barzilay2013] N. Kushman and R. Barzilay. 2013. Using semantic unification to generate regular expressions from natural language. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), pages 826–836.
  • [Kwiatkowski et al.2010] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Empirical Methods in Natural Language Processing (EMNLP), pages 1223–1233.
  • [Kwiatkowski et al.2011] T. Kwiatkowski, L. Zettlemoyer, S. Goldwater, and M. Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Empirical Methods in Natural Language Processing (EMNLP), pages 1512–1523.
  • [Liang et al.2011] P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599.
  • [Liu et al.2007] A. Liu, J. Ghosh, and C. Martin. 2007. Generative oversampling for mining imbalanced datasets. In International Conference on Data Mining (DMIN).
  • [Luong et al.2015a] M. Luong, H. Pham, and C. D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421.
  • [Luong et al.2015b] M. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Association for Computational Linguistics (ACL), pages 11–19.
  • [Mei et al.2016] H. Mei, M. Bansal, and M. R. Walter. 2016. Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In Association for the Advancement of Artificial Intelligence (AAAI).
  • [Petrov et al.2010] S. Petrov, P. Chang, M. Ringgaard, and H. Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Poon2013] H. Poon. 2013. Grounded unsupervised semantic parsing. In Association for Computational Linguistics (ACL).
  • [Sutskever et al.2014] I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.
  • [Vinyals et al.2015a] O. Vinyals, M. Fortunato, and N. Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2674–2682.
  • [Vinyals et al.2015b] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. 2015b. Grammar as a foreign language. In Advances in Neural Information Processing Systems (NIPS), pages 2755–2763.
  • [Wager et al.2013] S. Wager, S. I. Wang, and P. Liang. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems (NIPS).
  • [Wang and Yang2015] W. Y. Wang and D. Yang. 2015. That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Wang et al.2015] Y. Wang, J. Berant, and P. Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).
  • [Wong and Mooney2006] Y. W. Wong and R. J. Mooney. 2006. Learning for semantic parsing with statistical machine translation. In North American Association for Computational Linguistics (NAACL), pages 439–446.
  • [Wong and Mooney2007] Y. W. Wong and R. J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Association for Computational Linguistics (ACL), pages 960–967.
  • [Zelle and Mooney1996] M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055.
  • [Zettlemoyer and Collins2005] L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666.
  • [Zettlemoyer and Collins2007] L. S. Zettlemoyer and M. Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL), pages 678–687.
  • [Zhang et al.2015] X. Zhang, J. Zhao, and Y. LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (NIPS).
  • [Zhao and Huang2015] K. Zhao and L. Huang. 2015. Type-driven incremental semantic parsing with polymorphism. In North American Association for Computational Linguistics (NAACL).