Cross-domain Semantic Parsing via Paraphrasing

04/20/2017 ∙ by Yu Su, et al. ∙ The Regents of the University of California

Existing studies on semantic parsing mainly focus on the in-domain setting. We formulate cross-domain semantic parsing as a domain adaptation problem: train a semantic parser on some source domains and then adapt it to the target domain. Due to the diversity of logical forms in different domains, this problem presents unique and intriguing challenges. By converting logical forms into canonical utterances in natural language, we reduce semantic parsing to paraphrasing, and develop an attentive sequence-to-sequence paraphrase model that is general and flexible to adapt to different domains. We discover two problems, small micro variance and large macro variance, of pre-trained word embeddings that hinder their direct use in neural networks, and propose standardization techniques as a remedy. On the popular Overnight dataset, which contains eight domains, we show that both cross-domain training and standardized pre-trained word embeddings can bring significant improvement.




1 Introduction

Figure 1: Cross-domain semantic parsing via paraphrasing framework. In a deterministic way, logical forms are first converted into canonical utterances in natural language. A paraphrase model then learns from the source domains and adapts to the target domain. External language resources can be incorporated in a consistent way across domains.

Semantic parsing, which maps natural language utterances into computer-understandable logical forms, has drawn substantial attention recently as a promising direction for developing natural language interfaces to computers. Semantic parsing has been applied in many domains, including querying data/knowledge bases (Woods, 1973; Zelle and Mooney, 1996; Berant et al., 2013), controlling IoT devices (Campagna et al., 2017), and communicating with robots (Chen and Mooney, 2011; Tellex et al., 2011; Artzi and Zettlemoyer, 2013; Bisk et al., 2016).

Despite the wide applications, studies on semantic parsing have mainly focused on the in-domain setting, where both training and testing data are drawn from the same domain. How to build semantic parsers that can learn across domains remains an under-addressed problem. In this work, we study cross-domain semantic parsing. We model it as a domain adaptation problem (Daumé III and Marcu, 2006), where we are given some source domains and a target domain, and the core task is to adapt a semantic parser trained on the source domains to the target domain (Figure 1). The benefits are two-fold: (1) by training on the source domains, the cost of collecting training data for the target domain can be reduced, and (2) the data of source domains may provide information complementary to the data collected for the target domain, leading to better performance on the target domain.

This is a very challenging task. Traditional domain adaptation (Daumé III and Marcu, 2006; Blitzer et al., 2006) only concerns natural languages, while semantic parsing concerns both natural and formal languages. Different domains often involve different predicates. In Figure 1, from the source Basketball domain a semantic parser can learn the semantic mapping from natural language to predicates like team and season, but in the target Social domain it needs to handle predicates like employer instead. Worse still, even for the same predicate, it is legitimate to use arbitrarily different predicate symbols: symbols like hired_by or even predicate1 could equally be used for the employer predicate, which is reminiscent of the symbol grounding problem (Harnad, 1990). Therefore, directly transferring the mapping from natural language to predicate symbols learned from the source domains to the target domain may bring little benefit.

Inspired by the recent success of paraphrasing based semantic parsing (Berant and Liang, 2014; Wang et al., 2015), we propose to use natural language as an intermediate representation for cross-domain semantic parsing. As shown in Figure 1, logical forms are converted into canonical utterances in natural language, and semantic parsing is reduced to paraphrasing. It is the knowledge of paraphrasing, at lexical, syntactic, and semantic levels, that will be transferred across domains.

Still, adapting a paraphrase model to a new domain is a challenging and under-addressed problem. To give some idea of the difficulty, for each of the eight domains in the popular Overnight (Wang et al., 2015) dataset, 30% to 55% of the words never occur in any of the other domains, a problem similar to the one observed in domain adaptation for machine translation (Daumé III and Jagarlamudi, 2011). The paraphrase model can therefore acquire little knowledge from the source domains about a substantial portion of the target domain. We introduce pre-trained word embeddings such as word2vec (Mikolov et al., 2013) to combat the vocabulary variety across domains. Based on recent studies on neural network initialization, we conduct a statistical analysis of pre-trained word embeddings and discover two problems that may hinder their direct use in neural networks: small micro variance, which hurts optimization, and large macro variance, which hurts generalization. We propose to standardize pre-trained word embeddings, and show the advantages of doing so both analytically and experimentally.

On the Overnight dataset, we show that cross-domain training under the proposed framework can significantly improve model performance. We also show that, compared with directly using pre-trained word embeddings or normalization as in previous work, the proposed standardization technique can lead to about 10% absolute improvement in accuracy.

2 Cross-domain Semantic Parsing

2.1 Problem Definition

Unless otherwise stated, we use $u$ to denote an input utterance, $c$ a canonical utterance, and $z$ a logical form. We denote by $\mathcal{U}$ the set of all possible utterances. For a domain, let $\mathcal{Z}$ be its set of logical forms; a semantic parser is a mapping $f: \mathcal{U} \rightarrow \mathcal{Z}$ that maps every input utterance to a logical form (a null logical form can be included in $\mathcal{Z}$ to reject out-of-domain utterances).

In cross-domain semantic parsing, we assume there is a set of $K$ source domains $\{\mathcal{Z}_i\}_{i=1}^{K}$, each with a set of training examples $\{(u_j^i, z_j^i)\}_{j=1}^{N_i}$. It is in principle advantageous to model the source domains separately (Daumé III and Marcu, 2006), which retains the possibility of separating domain-general information from domain-specific information, and only transferring the former to the target domain. For simplicity, here we merge the source domains into a single domain $\mathcal{Z}_s$ with training data $\{(u_i, z_i)\}_{i=1}^{N_s}$. The task is to learn a semantic parser $f: \mathcal{U} \rightarrow \mathcal{Z}_t$ for a target domain $\mathcal{Z}_t$, for which we have a set of training examples $\{(u_i, z_i)\}_{i=1}^{N_t}$. Some characteristics can be summarized as follows:

  • $\mathcal{Z}_t$ and $\mathcal{Z}_s$ can be totally disjoint.

  • The input utterance distributions of the source and the target domains can be independent and differ remarkably.

  • Typically $N_t \ll N_s$.

In the most general and challenging case, $\mathcal{Z}_t$ and $\mathcal{Z}_s$ can be defined using different formal languages. Because of the lack of relevant datasets, here we restrict ourselves to the case where $\mathcal{Z}_t$ and $\mathcal{Z}_s$ are defined using the same formal language, e.g., λ-DCS (Liang, 2013) as in the Overnight dataset.

2.2 Framework

Our framework follows the research line of semantic parsing via paraphrasing (Berant and Liang, 2014; Wang et al., 2015). While previous work focuses on the in-domain setting, we discuss its applicability and advantages in the cross-domain setting, and develop techniques to address the emerging challenges in the new setting.

  • We assume a one-to-one mapping $g: \mathcal{Z} \rightarrow \mathcal{C}$ from the set of logical forms $\mathcal{Z}$ to the set of canonical utterances $\mathcal{C}$. In other words, every logical form is converted into a unique canonical utterance deterministically (Figure 1). Previous work (Wang et al., 2015) has demonstrated how to design such a mapping, where a domain-general grammar and a domain-specific lexicon are constructed to automatically convert every logical form to a canonical utterance. In this work, we assume the mapping is given (in the experiments we use the canonical utterances provided with the Overnight dataset), and focus on the subsequent paraphrasing and domain adaptation problems.

    This design choice is worth some discussion. The grammar, or at least the lexicon for mapping predicates to natural language, needs to be provided by domain administrators. This does bring an additional cost, but we believe it is reasonable and even necessary for three reasons: (1) Domain administrators know the predicate semantics best, so it falls to them to reveal those semantics by grounding the predicates to natural language (the symbol grounding problem (Harnad, 1990)). (2) Otherwise, predicate semantics can only be learned from supervised training data of each domain, which makes data collection significantly more costly. (3) Canonical utterances are understandable by average users, and thus can also be used for training data collection via crowdsourcing (Wang et al., 2015; Su et al., 2016), which can amortize the cost.

    Take comparatives as an example. In logical forms, comparatives can be legitimately defined using arbitrarily different predicates in different domains, e.g., <, smallerInSize, or even predicates with an ambiguous surface form, like lt. When converting a logical form to a canonical utterance, however, domain administrators have to choose common natural language expressions like “less than” and “smaller”, providing a shared ground for cross-domain semantic parsing.

  • In the previous work based on paraphrasing (Berant and Liang, 2014; Wang et al., 2015), semantic parsers are implemented as log-linear models with hand-engineered domain-specific features (including paraphrase features). Considering the recent success of representation learning for domain adaptation (Glorot et al., 2011; Chen et al., 2012), we propose a paraphrase model based on the sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014), which can be trained end to end without feature engineering. We show that it outperforms the previous log-linear models by a large margin in the in-domain setting, and can easily adapt to new domains.

  • An advantage of reducing semantic parsing to paraphrasing is that external language resources become easier to incorporate. Observing the vocabulary variety across domains, we introduce pre-trained word embeddings to facilitate domain adaptation. For the example in Figure 1, the paraphrase model may have learned the mapping from “play for” to “whose team is” in a source domain. By acquiring word similarities (“play”-“work” and “team”-“employer”) from pre-trained word embeddings, it can establish the mapping from “work for” to “whose employer is” in the target domain, even without in-domain training data. We analyze statistical characteristics of the pre-trained word embeddings, and propose standardization techniques to remedy some undesired characteristics that may bring a negative effect to neural models.

  • We will use the following protocol: (1) train a paraphrase model using the data of the source domain, (2) use the learned parameters to initialize a model in the target domain, and (3) fine-tune the model using the training data of the target domain.
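The three-step protocol above can be illustrated with a toy model. The following is a minimal sketch, assuming a logistic-regression classifier as a stand-in for the paraphrase model; the `train` helper, the synthetic data, and all hyper-parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def train(X, y, w_init, lr=0.1, epochs=200):
    """Gradient descent on the logistic loss, starting from w_init."""
    w = w_init.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)     # average-gradient step
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])

# Hypothetical data: a large source set and a small, related target set.
X_src = rng.normal(size=(2000, 3))
y_src = (X_src @ true_w > 0).astype(float)
X_tgt = rng.normal(size=(20, 3))
y_tgt = (X_tgt @ (true_w + 0.1) > 0).astype(float)

# (1) Train on the source domain from a fresh initialization.
w_src = train(X_src, y_src, w_init=np.zeros(3))
# (2) Initialize the target model with the source parameters, then
# (3) fine-tune briefly on the small target training set.
w_tgt = train(X_tgt, y_tgt, w_init=w_src, epochs=20)

src_accuracy = np.mean((X_src @ w_src > 0) == (y_src > 0.5))
```

The only point of the sketch is the hand-off in step (2): the target model starts from `w_src` rather than from scratch, so the small target set only has to adjust parameters that are already informative.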

2.3 Prior Work

While most studies on semantic parsing so far have focused on the in-domain setting, there are a number of studies of particular relevance to this work. In the recent efforts of scaling semantic parsing to large knowledge bases like Freebase (Bollacker et al., 2008), researchers have explored several ways to infer the semantics of knowledge base relations unseen in training, which are often based on at least one (often both) of the following assumptions: (1) Distant supervision. Freebase entities can be linked to external text corpora, and serve as anchors for seeking semantics of Freebase relations from text. For example, Cai and Yates (2013), among others (Berant et al., 2013; Xu et al., 2016), use sentences from Wikipedia that contain any entity pair of a Freebase relation as the support set of the relation. (2) Self-explaining predicate symbols. Most Freebase relations are described using a carefully chosen symbol (surface form), e.g., place_of_birth, which provides strong cues for their semantics. For example, Yih et al. (2015) directly compute the similarity of the input utterance and the surface form of Freebase relations via a convolutional neural network. Kwiatkowski et al. (2013) also extract lexical features from the input utterance and the surface form of entities and relations. They have actually evaluated their model on Freebase sub-domains not covered in training, and have shown impressive results. However, in the more general setting of cross-domain semantic parsing, we may have neither of these luxuries. Distant supervision may not be available (e.g., IoT devices involving no entities but actions), and predicate symbols may not provide enough cues (e.g., predicate1). In this case, seeking additional inputs from domain administrators is probably necessary.

In parallel with this work, Herzig and Berant (2017) have explored another direction of semantic parsing with multiple domains, where they use all the domains to train a single semantic parser, and attach a domain-specific encoding to the training data of each domain to help the semantic parser differentiate between domains. We pursue a different direction: we train a semantic parser on some source domains and adapt it to the target domain. Another difference is that their work directly maps utterances to logical forms, while ours is based on paraphrasing.

Cross-domain semantic parsing can be seen as a way to reduce the cost of training data collection, which resonates with the recent trend in semantic parsing. Berant et al. (2013) propose to learn from utterance-denotation pairs instead of utterance-logical form pairs, while Wang et al. (2015) and Su et al. (2016) manage to employ crowd workers with no linguistic expertise for data collection. Jia and Liang (2016) propose an interesting form of data augmentation. They learn a grammar from existing training data, and generate new examples from the grammar by recombining segments from different examples.

We use natural language as an intermediate representation to transfer knowledge across domains, and assume the mapping from the intermediate representation (canonical utterance) to logical form can be done deterministically. Several other intermediate representations have also been used, such as combinatory categorial grammar (Kwiatkowski et al., 2013; Reddy et al., 2014), dependency tree (Reddy et al., 2016, 2017), and semantic role structure (Goldwasser and Roth, 2013). But their main aim is to better represent input utterances with a richer structure. A separate ontology matching step is needed to map the intermediate representation to logical form, which requires domain-dependent training.

A number of other related studies have also used paraphrasing. For example, Fader et al. (2013) leverage question paraphrases for question answering, while Narayan et al. (2016) generate paraphrases as a way of data augmentation.

Cross-domain semantic parsing can greatly benefit from the rich literature of domain adaptation and transfer learning (Daumé III and Marcu, 2006; Blitzer et al., 2006; Pan and Yang, 2010; Glorot et al., 2011). For example, Chelba and Acero (2004) use parameters trained in the source domain as a prior to regularize parameters in the target domain. The feature augmentation technique from Daumé III (2009) can be very helpful when there are multiple source domains. We expect many of these ideas to be applied in the future.

3 Paraphrase Model

In this section we propose a paraphrase model based on the Seq2Seq model (Sutskever et al., 2014) with soft attention. Similar models have been used in semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016) but for directly mapping utterances to logical forms. We demonstrate that it can also be used as a paraphrase model for semantic parsing. Several other neural models have been proposed for paraphrasing (Socher et al., 2011; Hu et al., 2014; Yin and Schütze, 2015), but it is not the focus of this work to compare all the alternatives.

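For an input utterance $u = (u_1, u_2, \ldots, u_m)$ and an output canonical utterance $c = (c_1, c_2, \ldots, c_n)$, the model estimates the conditional probability $p(c|u) = \prod_{j=1}^{n} p(c_j | u, c_{1:j-1})$. The tokens are first converted into vectors via a word embedding layer $\phi$. The initialization of the word embedding layer is critical for domain adaptation, which we will further discuss in Section 4.

The encoder, which is implemented as a bi-directional recurrent neural network (RNN), first encodes $u$ into a sequence of state vectors $(h_1, h_2, \ldots, h_m)$. The state vectors of the forward RNN and the backward RNN are respectively computed as:

$$\overrightarrow{h}_i = \mathrm{GRU}_{fw}(\phi(u_i), \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \mathrm{GRU}_{bw}(\phi(u_i), \overleftarrow{h}_{i+1}),$$

where the gated recurrent unit (GRU) as defined in (Cho et al., 2014) is used as the recurrence. We then concatenate the forward and backward state vectors, $h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$.

We use an attentive RNN as the decoder, which will generate the output tokens one at a time. We denote the state vectors of the decoder RNN as $(d_1, d_2, \ldots, d_n)$. The attention takes a form similar to (Vinyals et al., 2015). For the decoding step $j$, the decoder is defined as follows:

$$\begin{aligned}
e_{ji} &= v^{\top} \tanh(W_1 h_i + W_2 d_j), \\
\alpha_{ji} &= \frac{\exp(e_{ji})}{\sum_{i'=1}^{m} \exp(e_{ji'})}, \\
h'_j &= \sum_{i=1}^{m} \alpha_{ji} h_i, \\
d_{j+1} &= \mathrm{GRU}([\phi(c_j), h'_j], d_j), \\
p(c_j | u, c_{1:j-1}) &\propto \exp(U [d_j, h'_j]),
\end{aligned}$$

where $W_1$, $W_2$, $v$, and $U$ are model parameters. The decoder first calculates normalized attention weights $\alpha_{ji}$ over encoder states, and gets a summary state $h'_j$. The summary state is then used to calculate the next decoder state $d_{j+1}$ and the output probability distribution $p(c_j | u, c_{1:j-1})$.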
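The attention step described above (normalized weights over encoder states, followed by a weighted summary state) can be sketched in NumPy. This is a minimal sketch assuming additive attention in the style of Vinyals et al. (2015); the function name `attention_step`, the parameter names `W1`, `W2`, `v`, and all dimensions are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def attention_step(H, d, W1, W2, v):
    """One attention step over encoder states.

    H:  (m, k)   encoder state vectors, one row per input token
    d:  (k_d,)   current decoder state
    W1: (a, k), W2: (a, k_d), v: (a,)  attention parameters
    Returns the attention weights (m,) and the summary state (k,).
    """
    scores = np.tanh(H @ W1.T + d @ W2.T) @ v   # unnormalized scores, shape (m,)
    scores = scores - scores.max()               # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over encoder positions
    summary = alpha @ H                          # weighted sum of encoder states
    return alpha, summary

# Illustrative sizes: 5 input tokens, 8-dim encoder states, 6-dim decoder state.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
d = rng.normal(size=6)
W1 = rng.normal(size=(7, 8))
W2 = rng.normal(size=(7, 6))
v = rng.normal(size=7)

alpha, summary = attention_step(H, d, W1, W2, v)
```

In a full decoder the summary state would be fed, together with the embedding of the previous output token, into the recurrent cell that produces the next decoder state.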


  • Given a set of training examples $\{(u_i, c_i)\}_{i=1}^{N}$, we minimize the cross-entropy loss $-\frac{1}{N} \sum_{i=1}^{N} \log p(c_i | u_i)$, which maximizes the log probability of the correct canonical utterances. We apply dropout (Hinton et al., 2012) on both the input and the output of the GRU cells to prevent overfitting.

  • Given a domain $\mathcal{Z}$ with canonical utterances $\mathcal{C}$, there are two ways to use a trained model. One is to use it to generate the most likely output utterance $c'$ given an input utterance $u$ (Sutskever et al., 2014),

    $$c' = \arg\max_{\hat{c}} p(\hat{c} | u).$$

    In this case $c'$ can be any utterance permissible by the output vocabulary, and may not necessarily be a legitimate canonical utterance in $\mathcal{C}$. This is more suitable for large domains with a lot of logical forms, like Freebase. An alternative way is to use the model to rank the legitimate canonical utterances (Kannan et al., 2016):

    $$c' = \arg\max_{\hat{c} \in \mathcal{C}} p(\hat{c} | u),$$

    which is more suitable for small domains having a limited number of logical forms, like the ones in the Overnight dataset. We will adopt the second strategy. It is still very challenging; random guessing leads to almost no success. It is also possible to first find a smaller set of candidates to rank via beam search (Berant et al., 2013; Wang et al., 2015).
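The ranking strategy can be sketched as follows. This is a minimal illustration, assuming the trained model is exposed as a scoring function that returns log p(c | u); `rank_candidates`, the word-overlap scorer `toy_log_prob`, and the candidate utterances are hypothetical stand-ins, not the paper's model or data.

```python
import math

def rank_candidates(utterance, candidates, log_prob):
    """Return the candidate canonical utterances sorted by score, best first."""
    return sorted(candidates, key=lambda c: log_prob(utterance, c), reverse=True)

def toy_log_prob(u, c):
    """Hypothetical stand-in scorer: log of a smoothed word-overlap ratio."""
    u_words, c_words = set(u.split()), set(c.split())
    overlap = len(u_words & c_words)
    return math.log((overlap + 1) / (len(c_words) + 1))

# Hypothetical canonical utterances of a small domain.
candidates = [
    "person whose employer is google",
    "person whose birthplace is google",
]
ranked = rank_candidates("who has google as employer", candidates, toy_log_prob)
best = ranked[0]
```

Because every candidate is a legitimate canonical utterance, the top-ranked one always maps back to a valid logical form, which free generation does not guarantee.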

4 Pre-trained Word Embedding for Domain Adaptation

Pre-trained word embeddings like word2vec have a great potential to combat the vocabulary variety across domains. For example, we can use pre-trained word2vec vectors to initialize the word embedding layer of the source domain, with the hope that the other parameters in the model will co-adapt with the word vectors during training in the source domain, and generalize better to the out-of-vocabulary words (but covered by word2vec) in the target domain. However, deep neural networks are very sensitive to initialization (Erhan et al., 2010), and a statistical analysis of the pre-trained word2vec vectors reveals some characteristics that may not be desired for initializing deep neural networks. In this section we present the analysis and propose a standardization technique to remedy the undesired characteristics.

Initialization L2 norm Micro Variance Cosine Sim.
word2vec + ES
word2vec + FS
word2vec + EN
Table 1: Statistics of the word embedding matrix with different initialization strategies. Random: random sampling from a distribution with unit variance. word2vec: raw word2vec vectors. ES: per-example standardization. FS: per-feature standardization. EN: per-example normalization. Cosine similarity is computed on a randomly selected (but fixed) set of 1M word pairs.

  • Our analysis will be based on the 300-dimensional word2vec vectors trained on the 100B-word Google News corpus. It contains 3 million words, leading to a 3M-by-300 word embedding matrix. The “rule of thumb” to randomly initialize word embeddings in neural networks is to sample from a uniform or Gaussian distribution with unit variance, which works well for a wide range of neural network models in general. We therefore use it as a reference to compare different word embedding initialization strategies. Given a word embedding matrix, we compute the L2 norm of each row and report the mean and the standard deviation. Similarly, we also report the variance of each row (denoted as micro variance), which indicates how far the numbers in the row spread out, and pair-wise cosine similarity, which indicates the word similarity captured by word2vec.

    The statistics of the word embedding matrix with different initialization strategies are shown in Table 1. Compared with random initialization, two characteristics of the word2vec vectors stand out: (1) Small micro variance. Both the L2 norm and the micro variance of the word2vec vectors are much smaller. (2) Large macro variance. The variance of different word2vec vectors, reflected by the standard deviation of the L2 norm, is much larger (e.g., the maximum and the minimum L2 norm are 21.1 and 0.015, respectively). Small micro variance can make the variance of neuron activations start off too small, implying a poor starting point in the parameter space: under some conditions, including using Xavier initialization (also introduced in that paper and now widely used) for weights, Glorot and Bengio (2010) have shown that the activation variances in a feedforward neural network will be roughly the same as the input variances (the word embeddings here) at the beginning of training. On the other hand, because of the magnitude difference, large macro variance may make it hard for a model to generalize to words unseen in training.

  • Based on the above analysis, we propose to do unit variance standardization (standardization for short) on pre-trained word embeddings. There are two possible ways: per-example standardization, which standardizes each row of the embedding matrix to unit variance by simply dividing by the standard deviation of the row, and per-feature standardization, which standardizes each column instead. We do not make the rows or columns zero mean. Per-example standardization combines the strengths of random initialization and pre-trained word embeddings: it fixes the small micro variance problem as well as the large macro variance problem of pre-trained word embeddings, while still preserving cosine similarity, i.e., word similarity. Per-feature standardization does not preserve cosine similarity, nor does it fix the large macro variance problem. However, it enjoys the benefit of global statistics, in contrast to the local statistics of individual word vectors used in per-example standardization. Therefore, in problems where the testing and training vocabularies are similar, per-feature standardization may be more advantageous. Both standardizations lose vector magnitude information. Levy et al. (2015) have suggested per-example normalization of pre-trained word embeddings (it can also be found in the implementation of GloVe (Pennington et al., 2014)) for lexical tasks like word similarity and analogy, which do not involve deep neural networks. Making the word vectors unit length alleviates the large macro variance problem, but the small micro variance problem remains (Table 1).

  • This is indeed a pretty simple trick, and per-feature standardization (with zero mean) is also a standard data preprocessing method. However, it is not self-evident that this kind of standardization should be applied to pre-trained word embeddings before using them in deep neural networks, especially with the obvious downside of rendering the word embedding algorithm’s loss function sub-optimal.

    We expect this to be less of an issue for large-scale problems with a large vocabulary and abundant training examples. For example, Vinyals et al. (2015) have found that directly using the word2vec vectors for initialization can bring a consistent, though small, improvement in neural constituency parsing. However, for smaller-scale problems (e.g., an application domain of semantic parsing can have a vocabulary size of only a few hundred), this issue becomes more critical. Initialized with the raw pre-trained vectors, a model may quickly fall into a poor local optimum and may not have enough signal to escape. Because of the large macro variance problem, standardization can be critical for domain adaptation, which needs to generalize to many words unseen in training.

    The proposed standardization technique is similar in spirit to batch normalization (Ioffe and Szegedy, 2015). We notice two computational differences: ours is applied on the inputs while batch normalization is applied on internal neuron activations, and ours standardizes the whole word embedding matrix beforehand while batch normalization standardizes each mini-batch on the fly. In terms of motivation, the proposed technique aims to remedy some undesired characteristics of pre-trained word embeddings, and batch normalization aims to reduce the internal covariate shift. It is of interest to study the combination of the two in future work.
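The statistics and the three rescaling strategies discussed in this section can be sketched in NumPy. This is a minimal sketch under stated assumptions: the matrix `E` is a small synthetic stand-in for the word2vec matrix (rows with small, widely varying scales), and the function names are illustrative rather than taken from the released code.

```python
import numpy as np

def per_example_standardize(E):
    """ES: divide each row by its own standard deviation (no mean shift)."""
    return E / E.std(axis=1, keepdims=True)

def per_feature_standardize(E):
    """FS: divide each column by its own standard deviation (no mean shift)."""
    return E / E.std(axis=0, keepdims=True)

def per_example_normalize(E):
    """EN: scale each row to unit L2 norm."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def stats(E):
    """Mean and std of row L2 norms, and the mean row variance (micro variance)."""
    norms = np.linalg.norm(E, axis=1)
    return norms.mean(), norms.std(), E.var(axis=1).mean()

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical stand-in for pre-trained vectors: 1000 words, 300 dimensions,
# each row multiplied by a different scale to mimic large macro variance.
rng = np.random.default_rng(0)
scales = rng.uniform(0.01, 1.0, size=(1000, 1))
E = rng.normal(size=(1000, 300)) * scales

ES = per_example_standardize(E)
FS = per_feature_standardize(E)
EN = per_example_normalize(E)

raw_stats, es_stats = stats(E), stats(ES)
```

Since per-example standardization multiplies each row by a positive scalar, `cosine(E[i], E[j])` and `cosine(ES[i], ES[j])` agree for any pair of rows, which is the sense in which word similarity is preserved. Per-feature standardization rescales coordinates unevenly and carries no such guarantee.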

Metric Calendar Blocks Housing Restaurants Publications Recipes Social Basketball
# of examples (N) 837 1995 941 1657 801 1080 4419 1952
# of logical forms (|Z|) 196 469 231 339 149 124 624 252
vocab. size (|V|) 228 227 318 342 203 256 533 360
% other domains 71.1 61.7 60.7 55.8 65.6 71.9 46.0 45.6
% word2vec 91.2 91.6 88.4 88.6 91.1 93.8 86.9 86.9
% other domains + word2vec 93.9 93.8 90.9 90.4 95.6 97.3 89.3 89.4
Table 2: Statistics of the domains in the Overnight dataset. Pre-trained word2vec embedding covers most of the words in each domain, paving a way for domain adaptation.
Method Calendar Blocks Housing Restaurants Publications Recipes Social Basketball Avg.
Previous Methods
Wang et al. (2015) 74.4 41.9 54.0 75.9 59.0 70.8 48.2 46.3 58.8
Xiao et al. (2016) 75.0 55.6 61.9 80.1 75.8 – 80.0 80.5 72.7
Jia and Liang (2016) 78.0 58.1 71.4 76.2 76.4 79.6 81.4 85.2 75.8
Herzig and Berant (2017) 82.1 62.7 78.3 82.2 80.7 82.9 81.7 86.2 79.6
Our Methods
Random + I 75.6 60.2 67.2 77.7 77.6 80.1 80.7 86.5 75.7
Random + X 79.2 54.9 74.1 76.2 78.5 82.4 82.5 86.7 76.9
word2vec + I 67.9 59.4 52.4 75.0 64.0 73.2 77.0 87.5 69.5
word2vec + X 78.0 54.4 63.0 81.3 74.5 83.3 81.5 83.1 74.9
word2vec + EN + I 63.1 56.1 60.3 75.3 65.2 69.0 76.4 81.8 68.4
word2vec + EN + X 78.0 52.6 63.5 74.7 65.2 80.6 79.9 80.8 71.2
word2vec + FS + I 78.6 62.2 67.7 78.6 75.8 85.7 81.3 86.7 77.1
word2vec + FS + X 82.7 59.4 75.1 80.4 78.9 85.2 81.8 87.2 78.9
word2vec + ES + I 79.8 60.2 71.4 81.6 78.9 84.7 82.9 86.2 78.2
word2vec + ES + X 82.1 62.2 78.8 83.7 80.1 86.1 83.1 88.2 80.6
Table 3: Main experiment results. We combine the proposed paraphrase model with different word embedding initializations. I: in-domain, X: cross-domain, EN: per-example normalization, FS: per-feature standardization, ES: per-example standardization.

5 Evaluation

5.1 Data Analysis

The Overnight dataset (Wang et al., 2015) contains 8 different domains. Each domain is based on a separate knowledge base, with logical forms written in λ-DCS (Liang, 2013). Logical forms are converted into canonical utterances via a simple grammar, and the input utterances are collected by asking crowd workers to paraphrase the canonical utterances. Different domains are designed to stress different types of linguistic phenomena. For example, the Calendar domain requires a semantic parser to handle temporal language like “meetings that start after 10 am”, while the Blocks domain features spatial language like “which block is above block 1”.

Vocabularies vary remarkably across domains (Table 2). For each domain, only 45% to 70% of the words are covered by any of the other 7 domains. A model has to learn the out-of-vocabulary words from scratch using in-domain training data. The pre-trained word2vec embedding covers most of the words of each domain, and thus can connect the domains to facilitate domain adaptation. Words that are still missing are mainly stop words and typos, e.g., “ealiest”.
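The coverage figures of Table 2 are set-overlap percentages, which can be sketched as below. This is a minimal illustration; the `coverage` helper and the toy vocabularies are hypothetical and are not drawn from the Overnight dataset.

```python
def coverage(target_vocab, *other_vocabs):
    """Percentage of target-domain words found in any of the other vocabularies."""
    covered = set().union(*other_vocabs)
    return 100.0 * len(target_vocab & covered) / len(target_vocab)

# Hypothetical toy vocabularies ("ealiest" mimics an uncovered typo).
social = {"person", "employer", "work", "for", "ealiest"}
basketball = {"player", "team", "play", "for"}
word2vec_vocab = {"person", "employer", "work", "player", "team", "play"}

pct_other = coverage(social, basketball)                  # "% other domains"
pct_both = coverage(social, basketball, word2vec_vocab)   # "% other domains + word2vec"
```

In this toy setting only one of the five target words appears in the other domain, while adding the pre-trained vocabulary covers four of the five; the typo stays uncovered.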

5.2 Experiment Setup

We compare our model with all the previous methods evaluated on the Overnight dataset. Wang et al. (2015) use a log-linear model with a rich set of features, including paraphrase features derived from PPDB (Ganitkevitch et al., 2013), to rank logical forms. Xiao et al. (2016) use a multi-layer perceptron to encode the unigrams and bigrams of the input utterance, and then use an RNN to predict the derivation sequence of a logical form under a grammar. Similar to ours, Jia and Liang (2016) also use a Seq2Seq model with a bi-directional RNN encoder and an attentive decoder, but it is used to predict linearized logical forms. They also propose a data augmentation technique, which further improves the average accuracy to 77.5%. However, it is orthogonal to this work and can be incorporated in any model including ours, so it is not included in the comparison.

The above methods are all based on the in-domain setting, where a separate parser is trained for each domain. In parallel with this work, Herzig and Berant (2017) have explored another direction of cross-domain training: they use all of the domains to train a single parser, with a special domain encoding to help differentiate between domains. We instead model it as a domain adaptation problem, where training on the source domains and training on the target domain are separate. Their model is the same as that of Jia and Liang (2016). It is the current best-performing method on the Overnight dataset.

We use the standard 80%/20% split of training and testing, and randomly hold out 20% of training for validation. In cross-domain experiments, for each target domain, all the other domains are combined as the source domain. Hyper-parameters are selected based on the validation set. The state size of both the encoder and the decoder is set to 100, and the word embedding size is set to 300. The input and output dropout rates of the GRU cells are 0.7 and 0.5, respectively, and the mini-batch size is 512. We use Adam with the default parameters suggested in the paper for optimization. We use gradient clipping with a cap for global norm at 5.0 to alleviate the exploding gradients problem of recurrent neural networks. Early stopping based on the validation set is used to decide when to stop training. The selected model is retrained using the whole training set (training + validation). The evaluation metric is accuracy, i.e., the proportion of testing examples for which the top prediction yields the correct denotation. Our model is implemented in TensorFlow (Abadi et al., 2016), and the code can be found at
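Gradient clipping by global norm, as used in the setup above with a cap of 5.0, can be sketched in NumPy. This is a minimal sketch of the general technique, not the authors' code; the helper name `clip_by_global_norm` and the toy gradients are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients jointly so that their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original global norm.
    """
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))  # no-op when already small
    return [g * scale for g in grads], global_norm

# Toy gradients for two parameter tensors: global norm = sqrt(4*9 + 4*16) = 10.
grads = [np.full((2, 2), 3.0), np.full(4, 4.0)]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

All tensors are scaled by the same factor, so the direction of the overall update is preserved and only its magnitude is capped.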

5.3 Experiment Results

5.3.1 Comparison with Previous Methods

The main experiment results are shown in Table 3. Our base model (Random + I) achieves an accuracy comparable to the previous best in-domain model (Jia and Liang, 2016). With our main novelties, cross-domain training and word embedding standardization, our full model is able to outperform the previous best model, and achieve the best accuracy on 6 out of the 8 domains. Next we examine the novelties separately.

5.3.2 Word Embedding Initialization

The in-domain results clearly show the sensitivity of model performance to word embedding initialization. When the raw word2vec vectors are used directly, or with per-example normalization, the performance is significantly worse than with random initialization (by 6.2% and 7.3%, respectively). Based on the previous analysis, however, one should not be too surprised. The small micro variance problem hurts optimization. In sharp contrast, both of the proposed standardization techniques lead to better in-domain performance than random initialization (by 1.4% and 2.5%, respectively), setting a new best in-domain accuracy (78.2%) on Overnight. The results show that the pre-trained word2vec vectors can indeed provide useful information, but only when they are properly standardized.
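To make the three initialization variants compared here concrete, the sketch below shows one plausible reading of them on a toy embedding matrix (one row per word). The definitions are assumptions based on the names: per-example normalization (EN) scales each word vector to unit L2 norm, per-example standardization (ES) gives each word vector zero mean and unit variance, and per-feature standardization (FS) does the same for each embedding dimension across the vocabulary. The function names and the toy values are hypothetical.

```python
import math

def _standardize(values):
    """Shift to zero mean and scale to unit (population) variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against constant inputs
    return [(v - mean) / std for v in values]

def per_example_normalization(emb):
    """EN (assumed): scale each word vector to unit L2 norm."""
    out = []
    for row in emb:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out

def per_example_standardization(emb):
    """ES (assumed): standardize each word vector across its dimensions."""
    return [_standardize(row) for row in emb]

def per_feature_standardization(emb):
    """FS (assumed): standardize each dimension across the vocabulary."""
    cols = [_standardize(col) for col in zip(*emb)]
    return [list(row) for row in zip(*cols)]

# Toy 3-word, 3-dimensional embedding matrix with small values (made up).
emb = [[0.1, 0.2, 0.3],
       [0.4, 0.0, -0.4],
       [-0.2, 0.2, 0.6]]
en = per_example_normalization(emb)
es = per_example_standardization(emb)
fs = per_feature_standardization(emb)
```

Under this reading, normalization only fixes the length of each vector, whereas both standardization variants also rescale the spread of the individual values, which is the property tied to the micro variance problem.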

5.3.3 Cross-domain Training

A consistent improvement from cross-domain training is observed across all word embedding initialization strategies. Even for raw word2vec embeddings or per-example normalization, cross-domain training helps the model escape the poor initialization, though the results remain inferior to those of the alternative initializations. The best results are again obtained with standardization, with per-example standardization bringing a slightly larger improvement than per-feature standardization. We observe that the improvement from cross-domain training is correlated with the abundance of the in-domain training data of the target domain. To further examine this observation, we use the ratio between the number of examples and the vocabulary size to indicate the data abundance of a domain (the higher, the more abundant), and compute the Pearson correlation coefficient between data abundance and the accuracy improvement from cross-domain training (X − I). The results in Table 4 show a consistent, moderate to strong negative correlation between the two variables. In other words, cross-domain training is more beneficial when in-domain training data is less abundant, which is reasonable because in that case the model can learn more from the source domain data that is missing in the training data of the target domain.
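The correlation analysis described above can be sketched as follows. The per-domain numbers are made up purely for illustration and are not the Overnight statistics; the helper name `pearson` is likewise an assumption of this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical domains: (number of training examples, vocabulary size).
domain_stats = [(1000, 200), (2000, 250), (3000, 300), (4000, 320)]
# Data abundance = examples per vocabulary item.
abundance = [n_examples / vocab for n_examples, vocab in domain_stats]
# Hypothetical accuracy gain of cross-domain over in-domain training, in points.
gain = [3.0, 2.1, 1.2, 0.9]

r = pearson(abundance, gain)  # negative here: larger gain where data is scarcer
```

A coefficient close to -1 would indicate that the gain shrinks almost linearly as data abundance grows.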

Word Embedding Initialization    Correlation
Random                           -0.698
word2vec                         -0.730
word2vec + EN                    -0.461
word2vec + FS                    -0.770
word2vec + ES                    -0.514
Table 4: Correlation between in-domain data abundance and improvement from cross-domain training. The gain of cross-domain training is more significant when in-domain training data is less abundant.
Figure 2: Results with downsampled in-domain training data. The experiment with each downsampling rate is repeated 3 times and the average results are reported. For simplicity, we only report the average accuracy over all domains. Pre-trained word embeddings with per-example standardization are used in both settings.

5.3.4 Using Downsampled Training Data

Compared with the vocabulary size and the number of logical forms, the in-domain training data in the Overnight dataset is indeed abundant. In cross-domain semantic parsing, we are more interested in the scenario where there is insufficient training data for the target domain. To emulate this scenario, we downsample the in-domain training data of each target domain, but still use all training data from the source domain. The results are shown in Figure 2. The gain of cross-domain training is most significant when in-domain training data is scarce. As we collect more in-domain training data, the gain becomes smaller, which is expected. These results reinforce those from Table 4. It is worth noting that the effect of downsampling varies across domains. For domains with quite abundant training data like Social, the model can achieve an accuracy almost as good as when using all the data with only 30% of the in-domain training data.
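A minimal sketch of the downsampling protocol, under the assumption that examples are drawn uniformly at random without replacement and that each repetition uses a different random seed. The function names, the seeding scheme, and the toy data are hypothetical and only illustrate the procedure.

```python
import random

def downsample(examples, rate, seed=0):
    """Keep a random fraction `rate` of the target-domain training examples."""
    examples = list(examples)
    rng = random.Random(seed)
    k = max(1, round(len(examples) * rate))
    return rng.sample(examples, k)

def downsampled_runs(target_train, rate, repeats=3):
    """Draw one independent subsample per repetition; results would be averaged over runs."""
    return [downsample(target_train, rate, seed=run) for run in range(repeats)]

# Toy target-domain training set of 100 example ids (made up).
target_train = list(range(100))
runs = downsampled_runs(target_train, rate=0.3)
sizes = [len(run) for run in runs]
```

The source-domain training data is left untouched in this setup; only the target-domain training data is reduced.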

6 Discussion

Scalability, including vertical scalability, i.e., how to scale up to handle more complex inputs and logical constructs, and horizontal scalability, i.e., how to scale out to handle more domains, is one of the most critical challenges semantic parsing is facing today. In this work, we took an early step towards horizontal scalability, and proposed a paraphrasing-based framework for cross-domain semantic parsing. With a sequence-to-sequence paraphrase model, we showed that cross-domain training of semantic parsing can be quite effective under a domain adaptation setting. We also studied how to properly standardize pre-trained word embeddings in neural networks, especially for domain adaptation.

This work opens up a number of future directions. As discussed in Section 2.3, many conventional domain adaptation and representation learning ideas can find application in cross-domain semantic parsing. In addition to pre-trained word embeddings, other language resources like paraphrase corpora (Ganitkevitch et al., 2013) can be incorporated into the paraphrase model to further facilitate domain adaptation. In this work we require a full mapping from logical forms to canonical utterances, which could be costly for large domains. It is of practical interest to study the case where only a lexicon for mapping schema items to natural language is available. We have restricted ourselves to the case where domains are defined using the same formal language, and we look forward to evaluating the framework on domains of different formal languages when such datasets with canonical utterances become available.


Acknowledgments

The authors would like to thank the anonymous reviewers for their thoughtful comments. This research was sponsored in part by the Army Research Laboratory under cooperative agreements W911NF09-2-0053 and NSF IIS 1528175. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.


  • Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 [cs.DC].
  • Artzi and Zettlemoyer (2013) Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.
  • Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Berant and Liang (2014) Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Bisk et al. (2016) Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016. Natural language communication with robots. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
  • Blitzer et al. (2006) John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International conference on Management of data.
  • Cai and Yates (2013) Qingqing Cai and Alexander Yates. 2013. Semantic parsing Freebase: Towards open-domain semantic parsing. In Second Joint Conference on Lexical and Computational Semantics (*SEM).
  • Campagna et al. (2017) Giovanni Campagna, Rakesh Ramesh, Silei Xu, Michael Fischer, and Monica S Lam. 2017. Almond: The architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant. In Proceedings of the International Conference on World Wide Web, pages 341–350. International World Wide Web Conferences Steering Committee.
  • Chelba and Acero (2004) Ciprian Chelba and Alex Acero. 2004. Adaptation of maximum entropy capitalizer: Little data can help a lot. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Chen and Mooney (2011) David L Chen and Raymond J Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Chen et al. (2012) Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. 2012. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the International Conference on Machine Learning.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078 [cs.CL].
  • Daumé III (2009) Hal Daumé III. 2009. Frustratingly easy domain adaptation. arXiv:0907.1815 [cs.LG].
  • Daumé III and Jagarlamudi (2011) Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Daumé III and Marcu (2006) Hal Daumé III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126.
  • Dong and Lapata (2016) Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Erhan et al. (2010) Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660.
  • Fader et al. (2013) Anthony Fader, Luke S Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Ganitkevitch et al. (2013) Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the International Conference on Machine Learning.
  • Goldwasser and Roth (2013) Dan Goldwasser and Dan Roth. 2013. Leveraging domain-independent information in semantic parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Harnad (1990) Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.
  • Herzig and Berant (2017) Jonathan Herzig and Jonathan Berant. 2017. Neural semantic parsing over multiple knowledge-bases. arXiv:1702.01569 [cs.CL].
  • Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 [cs.NE].
  • Hu et al. (2014) Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of the Annual Conference on Neural Information Processing Systems.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pages 448–456.
  • Jia and Liang (2016) Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Kannan et al. (2016) Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, László Lukács, Marina Ganea, Peter Young, et al. 2016. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Kwiatkowski et al. (2013) Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.
  • Liang (2013) Percy Liang. 2013. Lambda dependency-based compositional semantics. arXiv:1309.4408 [cs.AI].
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Annual Conference on Neural Information Processing Systems.
  • Narayan et al. (2016) Shashi Narayan, Siva Reddy, and Shay B Cohen. 2016. Paraphrase generation from latent-variable PCFGs for semantic parsing. arXiv:1601.06068 [cs.CL].
  • Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Reddy et al. (2014) Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.
  • Reddy et al. (2016) Siva Reddy, Oscar Täckström, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics, 4:127–140.
  • Reddy et al. (2017) Siva Reddy, Oscar Täckström, Slav Petrov, Mark Steedman, and Mirella Lapata. 2017. Universal semantic parsing. arXiv:1702.03196 [cs.CL].
  • Socher et al. (2011) Richard Socher, Eric H Huang, Jeffrey Pennington, Andrew Y Ng, and Christopher D Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the Annual Conference on Neural Information Processing Systems.
  • Su et al. (2016) Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gür, Zenghui Yan, and Xifeng Yan. 2016. On generating characteristic-rich question sets for QA evaluation. In Proceedings of Conference on Empirical Methods in Natural Language Processing.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems.
  • Tellex et al. (2011) Stefanie A Tellex, Thomas Fleming Kollar, Steven R Dickerson, Matthew R Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Vinyals et al. (2015) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Proceedings of the Annual Conference on Neural Information Processing Systems.
  • Wang et al. (2015) Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Woods (1973) William A Woods. 1973. Progress in natural language understanding: an application to lunar geology. In Proceedings of the American Federation of Information Processing Societies Conference.
  • Xiao et al. (2016) Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for semantic parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Xu et al. (2016) Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on Freebase via relation extraction and textual evidence. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Yih et al. (2015) Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Yin and Schütze (2015) Wenpeng Yin and Hinrich Schütze. 2015. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
  • Zelle and Mooney (1996) John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the AAAI Conference on Artificial Intelligence.