Unified Semantic Parsing with Weak Supervision

06/12/2019 ∙ by Priyanka Agrawal, et al. ∙ ibm 0

Semantic parsing over multiple knowledge bases enables a parser to exploit structural similarities of programs across the multiple domains. However, the fundamental challenge lies in obtaining high-quality annotations of (utterance, program) pairs across various domains needed for training such models. To overcome this, we propose a novel framework to build a unified multi-domain enabled semantic parser trained only with weak supervision (denotations). Weakly supervised training is particularly arduous as the program search space grows exponentially in a multi-domain setting. To solve this, we incorporate a multi-policy distillation mechanism in which we first train domain-specific semantic parsers (teachers) using weak supervision in the absence of the ground truth programs, followed by training a single unified parser (student) from the domain-specific policies obtained from these teachers. The resultant semantic parser is not only compact but also generalizes better and generates more accurate programs. Furthermore, it does not require the user to provide a domain label while querying. On the standard Overnight dataset (containing multiple domains), we demonstrate that the proposed model improves performance by 20% in terms of denotation accuracy in comparison to baseline techniques.




1 Introduction

Semantic parsing is the task of converting natural language utterances into machine executable programs such as SQL or lambda logical forms Liang (2013). This has been a classical area of research in natural language processing (NLP) with earlier works primarily utilizing rule based approaches Woods (1973) or grammar based approaches Lafferty et al. (2001); Kwiatkowski et al. (2011); Zettlemoyer and Collins (2005, 2007). Recently, there has been a surge in neural encoder-decoder techniques which are trained with input utterances and corresponding annotated output programs Dong and Lapata (2016); Jia and Liang (2016). However, the performance of these strongly supervised methods is restricted by the size and the diversity of training data, i.e. natural language utterances and their corresponding annotated logical forms. This has motivated the work on applying weak supervision based approaches Clarke et al. (2010); Liang et al. (2017); Neelakantan et al. (2016); Chen et al. (2018), which use denotations, i.e. the final answers obtained upon executing a program on the knowledge base, and use REINFORCE Williams (1992); Norouzi et al. (2016) to guide the network to learn its semantic parsing policy (see Figure 3(a)). Another line of work Goldman et al. (2018); Cheng and Lapata (2018) is aimed towards improving the efficiency of weakly supervised parsers by applying a two-stage approach of first learning to generate program templates followed by exact program generation. It is important to note that this entire body of work on weakly supervised semantic parsing has been restricted to building a parser over a single domain only (i.e. a single dataset).

Moving beyond single-domain to multiple domains, Herzig and Berant (2017) proposed semantic parsing networks trained by combining the datasets corresponding to multiple domains into a single pool. Consider the example in Figure 1 illustrating utterances from two domains, recipes and publications, of the Overnight dataset. The utterances contain the linguistic variations "most" and "maximum number", which correspond to the shared program token argmax. This work shows that leveraging such structural similarities in language by combining these different domains leads to improved performance. However, as with many single-domain techniques, this work also requires strong supervision in the form of program annotations corresponding to the utterances. Obtaining such high quality annotations across multiple domains is challenging, thereby making it expensive to scale to newer domains.

To overcome these limitations, in this work, we focus on the problem of developing a semantic parser for multiple domains in the weak supervision setting using denotations. Note that this combined multiple-domain task entails a larger set of answers and a more complex search space than the individual domain tasks. As a result, the existing multi-domain semantic parsing models Herzig and Berant (2017) fail when trained under the weak supervision setting. See Section 6 for a detailed analysis.

To address this challenge, we propose a multi-policy distillation framework for multi-domain semantic parsing. This framework splits the training into the following two stages: 1) Learn a domain expert (teacher) policy using weak supervision for each domain. This allows the individual models to focus on learning the semantic parsing policy for the corresponding single domain; 2) Train a unified compressed semantic parser (student) using distillation from these expert policies. This enables the unified student to gain supervision from the above trained expert policies and thus learn the shared semantic parsing policy for all the domains. This two-stage framework is inspired by policy distillation Rusu et al. (2016), which transfers the policy of a reinforcement learning (RL) agent to train a student network that is more compact and efficient. In our case, weakly supervised domain teachers serve as RL agents. For inference, only the compressed student model is used, which takes as input the user utterance from any domain and outputs the corresponding parse program. It is important to note that the domain identifier input is not required by our model. The generated program is then executed over the corresponding KB to retrieve denotations that are provided as responses to the user.

To the best of our knowledge, we are the first to propose a unified multiple-domain parsing framework which does not assume the availability of ground truth programs. Additionally, it allows inference to be multi-domain enabled and does not require the user to provide domain identifiers corresponding to the input utterance. In summary, we make the following contributions:

  • We build a unified neural framework to train a single semantic parser for multiple domains in the absence of ground truth parse programs. (Section 3)

  • We show the effectiveness of multi-policy distillation in learning a semantic parser using independent weakly supervised experts for each domain. (Section 4)

  • We perform an extensive experimental study in multiple domains to understand the efficacy of the proposed system against multiple baselines. We also study the effect of the availability of a small labeled corpus in the distillation setup. (Section 5)

2 Related Work

Figure 2: Illustration of the proposed work in the space of key related work in the area of semantic parsing, knowledge distillation and policy learning

This work is related to three different areas: semantic parsing, policy learning and knowledge distillation. Figure 2 illustrates the placement of our proposed framework of unified semantic parsing in the space of the key related works in each of these three areas. Semantic parsing has been an extensively studied problem, with the first study dating back to Woods (1973). Much of the work has been towards exploiting annotated programs for natural language utterances to build single-domain semantic parsers using various methods. Zettlemoyer and Collins (2007) and Kwiatkowski et al. (2011) propose to learn probabilistic combinatory categorial grammars, while Kate et al. (2005) learn transformations from the syntactic parse tree of a natural language utterance to a formal parse tree. Andreas et al. (2013) model the task of semantic parsing as machine translation. Recently, Dong and Lapata (2016) introduced the use of neural sequence-to-sequence models for the task of semantic parsing. Due to the cost of obtaining annotated programs, there has been an increasing interest in weak supervision based methods Clarke et al. (2010); Liang et al. (2017); Neelakantan et al. (2016); Chen et al. (2018); Goldman et al. (2018), which use denotations, i.e. the final answers obtained on executing a program on the knowledge base, for training.

The problem of semantic parsing has been primarily studied in a single-domain setting employing supervised and weakly supervised techniques. However, the task of building a semantic parser in the multi-domain setting is relatively new. Herzig and Berant (2017) propose semantic parsing models using supervised learning in a multi-domain setup, which is the closest to our work. However, none of the existing works examine the problem of multi-domain semantic parsing in a weak supervision setting.

Knowledge distillation was first presented by Hinton et al. and has been popularly used for model compression of convolutional neural networks in computer vision tasks Yu et al. (2017); Li et al. (2017). Kim and Rush (2016); Chen et al. (2017) applied knowledge distillation to recurrent neural networks for the task of machine translation and showed improved performance with a much more compressed student network. Policy distillation, on which our method builds, was first introduced by Rusu et al. (2016); it is built on the principle of knowledge distillation and applied to reinforcement learning agents. Variants of the policy distillation framework have also been proposed Teh et al. (2017). To the best of our knowledge, our work is the first to apply policy distillation in a sequence-to-sequence learning task. We anticipate that the framework described in this paper can be applied to learn unified models for other tasks as well.

3 Proposed Framework

(a) Domain specific expert policy
(b) Learning a unified student by distilling domain policies from experts
Figure 3: Proposed architecture diagram of unified semantic parsing framework. Figure 3(a) demonstrates the training of the experts using weak supervision on the denotation corresponding to the input utterance. Once we train all the domain experts for the K domains, we use the probability distributions of the parses generated by these experts to train the student, thereby distilling the domain policies learnt by the teachers to the student, as shown in Figure 3(b).


In this section, we first present a high level overview of the framework for the proposed unified semantic parsing using multi-policy distillation and then describe the models employed for each component of the framework.

We focus on the setting of $K$ domains, each with an underlying knowledge base $\mathcal{B}^1, \dots, \mathcal{B}^K$. We have a training set of utterances $X^k = \{x^k_i\}$ and the corresponding final denotations $Y^k = \{y^k_i\}$ for each domain $k \in \{1, \dots, K\}$. Unlike existing works Herzig and Berant (2017), we do not assume availability of ground truth programs corresponding to the utterances in the training data. Our goal is to learn a unified semantic parsing model which takes as input a user utterance $x^k_i$ from any domain $k$ and produces the corresponding program $z^k_i$ which, when executed on the corresponding knowledge base $\mathcal{B}^k$, should return the denotation $y^k_i$. In this setup, we only rely on the weak supervision from the final denotations for training this model. Moreover, the domain identifier $k$ is not needed by this unified model.

We use a multi-policy distillation framework for the task of learning a unified semantic parser. Figure 3 summarizes the proposed architecture. We first train parsing models (teachers) for each domain using weak supervision to learn domain-specific teacher policies. We use REINFORCE for training, similar to prior work on Neural Symbolic Machines Liang et al. (2017), described briefly in Section 4.1. Next, we distill the learnt teacher policies to train a unified semantic parser enabled over multiple domains (described in Section 4.2). Note that: (1) our teachers are trained with weak supervision from denotations instead of actual parses and hence are weaker than fully supervised semantic parsers; (2) state-of-the-art sequence distillation works Kim and Rush (2016); Chen et al. (2017) have focused on a single teacher-student setting.

3.1 Model

In this section, we describe the architecture of the semantic parsing model used for both the teacher and the student networks. We use a standard sequence-to-sequence model Sutskever et al. (2014) with attention, similar to Dong and Lapata (2016), for this task. Each parsing model (the domain-specific teachers $\mathcal{E}^1, \dots, \mathcal{E}^K$ and the unified student $\mathcal{S}$) is composed of an $L$-layer encoder LSTM Hochreiter and Schmidhuber (1997) for encoding the input utterances and an $L$-layer attention-based decoder LSTM Bahdanau et al. (2014) for producing the program sequences. Note that in this section, we omit the domain id superscript $k$.

Given a user utterance $x$, the aim of the semantic parsing model is to generate an output program $z$ which should ultimately result in the true denotations $y$. The user utterance $x = \{x_1, \dots, x_n\}$ is input to the encoder, which maps each word in the input sequence to the embedding $e = \{e_1, \dots, e_n\}$ and uses this embedding to update its respective hidden states $h = \{h_1, \dots, h_n\}$ using $h_t = \mathrm{LSTM}(e_t, h_{t-1}; \theta_{enc})$, where $\theta_{enc}$ are the parameters of the encoder LSTM. The last hidden state $h_n$ is input to the decoder's first state. The decoder updates its hidden state $s_t$ using $s_t = \mathrm{LSTM}(c_{t-1}, s_{t-1}; \theta_{dec})$, where $c_{t-1}$ is the embedding of the output program token $z_{t-1}$ at the previous step and $\theta_{dec}$ are the decoder LSTM parameters. The output program $\{z_1, \dots, z_T\}$ is generated token-wise by applying a softmax over the vocabulary weights derived by transforming the corresponding hidden state $s_t$.

Further, we employ beam search during decoding, which generates a set of parses $B$ for every utterance. At each decoding step $t$, a beam $B_t$ containing partial parses of length $t$ is maintained. The next step beam $B_{t+1}$ consists of the $|B|$ highest scoring expansions of the programs in the beam $B_t$.
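The beam update described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `step_log_probs` is a hypothetical callback standing in for the decoder's softmax output, `toy_decoder` uses made-up probabilities, and the end-of-sequence token `</s>` is an assumed convention that the text does not specify.

```python
import math

def beam_search(step_log_probs, beam_width, max_len, eos="</s>"):
    """Return up to `beam_width` (tokens, log_prob) pairs, best first.

    `step_log_probs(prefix)` is a hypothetical stand-in for the decoder:
    it maps a tuple of tokens to a dict {next_token: log_prob}.
    """
    beam = [((), 0.0)]  # initial beam: the empty program with log-prob 0
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))  # finished parses carry over
                continue
            for token, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + lp))
        # next beam: the highest scoring expansions of the current beam
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
        if all(p and p[-1] == eos for p, _ in beam):
            break
    return beam

def toy_decoder(prefix):
    # Toy next-token distributions; the numbers are invented for illustration.
    if len(prefix) == 0:
        return {"argmax": math.log(0.6), "filter": math.log(0.4)}
    if len(prefix) == 1:
        return {"en.recipe": math.log(0.7), "</s>": math.log(0.3)}
    return {"</s>": 0.0}

top = beam_search(toy_decoder, beam_width=2, max_len=4)
```

With a beam width of 1 this reduces to greedy decoding; larger widths keep lower-probability prefixes alive, which matters under weak supervision because a rewarded program may not be the most likely one early in training.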

4 Training

In this section we describe the training mechanism employed for the proposed multi-domain policy distillation framework for semantic parsing. The training process in our proposed framework has the following two components (Figure 3): (i) weakly supervised training of the domain-specific semantic parsing experts $\mathcal{E}^1, \dots, \mathcal{E}^K$, and (ii) distilling the multiple domain policies into the unified student $\mathcal{S}$. We next describe each of these two components.

4.1 Domain-specific Semantic Parsing Policy

As described in the previous section, an individual domain-specific semantic parsing model generates the program $z$, which is executed on the knowledge base $\mathcal{B}$ to return the denotation $\hat{y}$. For brevity, we omit the domain identifier $k$ and the instance id $i$ in this section. In our setting, since labeled programs are not available for training, we use weak supervision from final denotations $y$, similar to Liang et al. (2017), for each domain expert. As the execution of a parse program is a non-differentiable operation on the KB, we use REINFORCE Williams (1992); Norouzi et al. (2016) for training, which maximizes the expected reward. The reward $R(x, z)$ for a prediction $z$ on an input $x$ is defined as the match score between the true denotations $y$ for utterance $x$ and the denotations obtained by executing the predicted program $z$. The overall objective to maximize the expected reward is as follows:

$$J(\theta) = \sum_{x} \mathbb{E}_{z \sim p_\theta(z \mid x)} \left[ R(x, z) \right] \approx \sum_{x} \sum_{z \in B} p_\theta(z \mid x) \, R(x, z)$$

where $\theta$ are the policy parameters; $B$ is the output beam containing the top scoring programs (described in Section 3.1) and $p_\theta(z \mid x)$ is the likelihood of parse $z$.

To reduce the variance in gradient estimation we use the baseline

$$b(x) = \frac{1}{|B|} \sum_{z \in B} R(x, z)$$

i.e. the average reward for the beam corresponding to the input instance $x$. See the Weak-Independent column of Table 2 for the performance achieved for individual domains with this training objective.
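As a rough numerical illustration of the beam-approximated objective and the average-reward baseline, the following sketch computes the expected reward for one utterance and the per-program weights that would multiply the gradient of each program's log-likelihood. The function name and the toy numbers are assumptions made for illustration; whether the original implementation renormalizes probabilities over the beam is not stated in the text, and this sketch does not.

```python
import math

def beam_objective_and_weights(log_probs, rewards):
    """Beam approximation of the expected reward for a single utterance.

    log_probs[j] is the log-likelihood of the j-th program in the beam and
    rewards[j] is its reward. Returns (expected_reward, weights), where the
    policy gradient is approximated by
        sum_j weights[j] * grad(log_probs[j]).
    """
    probs = [math.exp(lp) for lp in log_probs]
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    baseline = sum(rewards) / len(rewards)  # average reward over the beam
    weights = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return expected_reward, weights

# Toy beam of three programs; only the first one returns the correct answer.
toy_log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
toy_rewards = [1.0, 0.0, 0.0]
objective, weights = beam_objective_and_weights(toy_log_probs, toy_rewards)
```

Programs with above-average reward receive positive weight and the rest receive negative weight, so a beam in which every program has zero reward contributes no learning signal at all. This is the reward sparsity problem discussed in this section.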

Note that the primary challenge with this weakly supervised training is the sparsity of the reward signal: given the large search space, only a few predictions have a non-zero reward. This can be seen in the Weak-Combined column of Table 2: when the entire set of domains is pooled into one, the numbers drop severely due to the exponential increase in the search space.

4.2 Unified Model for multiple domains

For the unified semantic parser, we use the same sequence-to-sequence model described in Section 3.1. The hyper-parameter settings differ from those of the domain-specific models, as detailed in Section 5.3. We use the multi-task policy distillation method of Rusu et al. (2016) to train this unified parser for multiple domains. The individual domain experts $\mathcal{E}^1, \dots, \mathcal{E}^K$ are trained independently as described in Section 4.1. This distillation framework enables transfer of knowledge from the experts to a single student model $\mathcal{S}$ that operates as a multi-domain parser, even in the absence of any domain indicator with the input utterance during the test phase. Each expert $\mathcal{E}^k$ provides a transformed training dataset $D^k = \{(x^k_i, p_{\theta^k}(z \mid x^k_i))\}$ to the student $\mathcal{S}$, where $p_{\theta^k}(z \mid x^k_i)$ is the expert's probability distribution over the entire program space w.r.t. the input utterance $x^k_i$. Concretely, if $T$ is the decoding sequence length and $V$ is the vocabulary combined across domains, then $p_{\theta^k}(z_t = v \mid x^k_i)$ denotes the expert $\mathcal{E}^k$'s probability that output token $z_t$ equals vocabulary token $v$, for all time steps $t \in \{1, \dots, T\}$ and all $v \in V$.

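The student takes the probability outputs from the experts as the ground truth and is trained in a supervised manner to minimize the cross-entropy loss w.r.t. the teachers' probability distributions:

$$\mathcal{L}(\theta_{\mathcal{S}}) = - \sum_{k=1}^{K} \sum_{i} \sum_{t=1}^{T} \sum_{v \in V} p_{\theta^k}(z_t = v \mid x^k_i) \, \log p_{\theta_{\mathcal{S}}}(z_t = v \mid x^k_i)$$

where $\theta^k$ are the policy parameters of expert $\mathcal{E}^k$ and $\theta_{\mathcal{S}}$ are the student model parameters; $T$ is the decoding sequence length and $V$ the vocabulary combined across domains; similarly, $p_{\theta_{\mathcal{S}}}(z_t = v \mid x^k_i)$ is the probability assigned to output token $z_t$ by the student $\mathcal{S}$. This training objective enables the unified parser to learn domain-specific parsing strategies from individual domains as well as leverage structural variations across domains. Therefore, the combined multi-domain policy is refined and compressed during the distillation process, making it more effective in parsing for each of the domains.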
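The cross-entropy between teacher and student token distributions can be illustrated with a small plain-Python sketch for a single utterance. The nested-list layout (`probs[t][v]` for time step `t` and vocabulary index `v`), the function name and the `eps` smoothing term are assumptions chosen for illustration, not details taken from the paper.

```python
import math

def distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy of the student w.r.t. the teacher for one utterance.

    teacher_probs[t][v] and student_probs[t][v] are the probabilities that
    the output token at step t equals vocabulary entry v. `eps` guards
    against log(0) and is an implementation assumption.
    """
    loss = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        for pt, ps in zip(p_teacher, p_student):
            loss -= pt * math.log(ps + eps)
    return loss

# Two decoding steps over a toy vocabulary of two tokens.
teacher = [[0.9, 0.1], [0.2, 0.8]]
uniform_student = [[0.5, 0.5], [0.5, 0.5]]

loss_uniform = distillation_loss(teacher, uniform_student)
loss_matched = distillation_loss(teacher, teacher)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, in which case it equals the teacher's entropy. Summing this quantity over all utterances of all domains, each paired with its own domain expert, gives the multi-teacher objective.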

5 Experimental Setup

In this section, we provide details on the data and model used for the experimental analysis (code and data are available at https://github.com/pagrawal-ml/Unified-Semantic-Parsing). We further elaborate on the baselines used.

| Domain | Original: utterance vocab | Original: program vocab | Original: avg. program length | Normalized: utterance vocab | Normalized: program vocab | Normalized: avg. program length |
|---|---|---|---|---|---|---|
| Basketball | 340 | 65 | 48.3 | 332 | 58 | 20.5 |
| Blocks | 213 | 48 | 47.4 | 212 | 41 | 9.7 |
| Calendar | 206 | 54 | 43.7 | 191 | 46 | 8.8 |
| Housing | 302 | 58 | 42.7 | 293 | 48 | 8.5 |
| Publications | 190 | 44 | 46.2 | 187 | 38 | 8.5 |
| Recipes | 247 | 49 | 42.6 | 241 | 40 | 7.8 |
| Restaurants | 315 | 62 | 41.2 | 310 | 48 | 8.2 |
| Average | 259 | 54.3 | 44.6 | 252.3 | 45.6 | 10.3 |

Table 1: Training data statistics for the original and normalized datasets. For each domain, we compare the number of unique tokens (Vocab) in input utterances and corresponding programs, and the average program length.

5.1 Data

We use the Overnight semantic parsing dataset Wang et al. (2015), which contains multiple domains. Each domain has utterances (questions) and corresponding parses in DCS form that are executable on the domain-specific knowledge base. Every domain is designed to focus on a specific linguistic phenomenon, for example, calendar on temporal knowledge and blocks on spatial queries. In this work, we use seven domains from the dataset, as listed in Table 1.

We would like to highlight that we do not use the parses available in the dataset during the training of our unified semantic parser. Our weakly supervised setup uses denotations to navigate the program search space and learn the parsing policy. This search space is a function of decoder (program) length and vocabulary size. Originally, the parses have 45 tokens on average, with a combined vocabulary of 182 distinct tokens across the domains. To reduce the decoder search space, we normalize the data to have shortened parses with an average length of 11 tokens and a combined vocabulary size of 147. We reduce the sequence length by using a set of template normalization functions and reduce the vocabulary size by masking named entities for each domain.

An example of a normalization function is the following: an entity type in the query, say recipe, is programmed by first creating a single-valued list with the entity type, i.e. (en.recipe), and then extracting that property: (call SW.getProperty ( call SW.singleton en.recipe ) ( string ! type )), resulting in 14 tokens. We replace this complex phrasing by directly substituting the entity type under consideration, i.e. (en.recipe) (1 token).

Next, we show an example for a complete utterance: what recipes posting date is at least the same as rice pudding. Its original parse is:

(call SW.listValue (call SW.filter (call SW.getProperty (call SW.singleton en.recipe) (string ! type)) (call SW.ensureNumericProperty (string posting_date)) (string >=) (call SW.ensureNumericEntity (call SW.getProperty en.recipe.rice_pudding (string posting_date)))))

Our normalized query is what recipes posting date is at least the same as e0, where the entity rice pudding is substituted by the entity identifier e0. The normalized parse is as follows:

SW.filter en.recipe SW.ensureNumericProperty posting_date >= (SW.ensureNumericEntity SW.getProperty e0 posting_date)

It is important to note that this normalization function is reversible. During the test phase, we apply the reverse function to convert the normalized parses back to their original forms for computing the denotations. Table 1 shows the domain-wise statistics of the original and normalized data. The normalization script is also applicable to template reduction for any DCS form.
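The entity-masking half of the normalization, and its reversal at test time, can be sketched as follows. This is a simplified illustration rather than the authors' script: the function names, the `(surface form, KB identifier)` input format and the whitespace tokenization are assumptions, and the template-shortening functions are not shown. The intermediate program used here (already shortened but not yet masked) is a hypothetical form constructed from the example above.

```python
def mask_entities(utterance, program, entities):
    """Replace each (surface form, KB id) pair with a placeholder e0, e1, ...

    Returns the masked utterance, the masked program and the mapping needed
    to undo the masking.
    """
    mapping = {}
    for i, (surface, kb_id) in enumerate(entities):
        placeholder = f"e{i}"
        utterance = utterance.replace(surface, placeholder)
        program = program.replace(kb_id, placeholder)
        mapping[placeholder] = (surface, kb_id)
    return utterance, program, mapping

def unmask_program(program, mapping):
    """Reverse step: put KB identifiers back in place of the placeholders."""
    tokens = program.split()
    return " ".join(mapping[t][1] if t in mapping else t for t in tokens)

utterance = "what recipes posting date is at least the same as rice pudding"
shortened_program = (
    "SW.filter en.recipe SW.ensureNumericProperty posting_date >= "
    "(SW.ensureNumericEntity SW.getProperty en.recipe.rice_pudding posting_date)"
)
masked_utt, masked_prog, mapping = mask_entities(
    utterance, shortened_program, [("rice pudding", "en.recipe.rice_pudding")]
)
restored = unmask_program(masked_prog, mapping)
```

Because the mapping is stored per utterance, the placeholder vocabulary (e0, e1, ...) is shared across domains while the original identifiers remain recoverable for execution against the knowledge base.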

We report hard denotation accuracy i.e. the proportion of questions for which the top prediction and ground truth programs yield the matching answer sets as the evaluation metric. For computing the rewards during training, we use soft denotation accuracy i.e. F1 score between predicted and ground truth answer sets.
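The two metrics can be written down compactly. The sketch below assumes answers are represented as Python sets; the function names and the handling of empty answer sets are assumptions, since the text does not specify them.

```python
def soft_denotation_score(predicted, gold):
    """F1 between the predicted and gold answer sets (training reward)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        # Assumed convention: two empty sets match, otherwise no credit.
        return 1.0 if predicted == gold else 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def hard_denotation_accuracy(predicted_sets, gold_sets):
    """Fraction of questions whose predicted answer set equals the gold set."""
    matches = sum(set(p) == set(g) for p, g in zip(predicted_sets, gold_sets))
    return matches / len(gold_sets)

partial = soft_denotation_score({"a", "b"}, {"a", "c"})
accuracy = hard_denotation_accuracy([{"a", "b"}, {"x"}], [{"a", "c"}, {"x"}])
```

The soft score gives partial credit to programs that retrieve some of the correct answers, which makes the training reward less sparse than exact matching, while the hard score is the stricter quantity reported at test time.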

Table 2 shows the accuracy with strongly supervised training (Supervised). The average denotation accuracy (with beam width 1) is 70.6%, which is comparable to the state-of-the-art denotation accuracy of 75.6% (with beam width 5) reported by Jia and Liang (2016). This additionally suggests that the data normalization process does not alter the task complexity.

5.2 Baselines

In the absence of any work on multi-domain parser trained without ground truth programs, we compare the performance of the proposed unified framework against the following baselines:

  1. Independent Domain Experts (Weak-Independent): These are the set of weakly supervised semantic parsers, trained independently for each domain using REINFORCE algorithm as described in Section 4.1. Note that these are the teachers in our multi-policy distillation framework.

  2. Combined Weakly Supervised Semantic Parser (Weak-Combined): As per the recommendation in Herzig and Berant (2017), we pool the datasets of all the domains into one and train a single semantic parser with weak supervision.

  3. Independent Policy Distillation (Distill-Independent): We also experiment with independent policy distillation for each domain. The setup is similar to the one described in Section 4.2 used to learn student parsing models, one for each individual domain. Each student model uses the respective expert model as the only teacher.

Following the above naming convention, we term our proposed framework as Distill-Combined. For the sake of completeness, we also compute the skyline Supervised i.e. the sequence-to-sequence model described in Section 3.1 trained with ground truth parses.

5.3 Model Setting

We use the original train-test split provided in the dataset. We further split the training set of each domain into training (80%) and validation (20%) sets. We tune each hyperparameter over a range of values and choose the configuration with the highest validation accuracy for each model. For each experiment we select from: beam width = {1, 5, 10, 20}, number of layers = {1, 2, 3, 4}, RNN size for both encoder and decoder = {100, 200, 300}. For faster computation, we use string match accuracy as a proxy for the denotation reward. In our experiments, we found that the combined model performs better with the number of layers set to 2 and the RNN size set to 300, while the individual models' accuracies did not increase with an increase in model capacity. This is intuitive, as the combined model requires more capacity to learn multiple domains. Encoder and decoder maximum sequence lengths were set to 50 and 35 respectively. For all the models, the RMSprop optimizer Hinton et al. was used with the learning rate set to 0.001.
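One way to read the selection procedure is as an exhaustive search over the listed values, sketched below. The text does not say whether the hyperparameters were tuned jointly or one at a time, so the joint grid is an assumption; `validation_accuracy` is a hypothetical callback that would train a model and return its validation accuracy, and `toy_accuracy` is a made-up stand-in so that the example runs.

```python
from itertools import product

BEAM_WIDTHS = [1, 5, 10, 20]
NUM_LAYERS = [1, 2, 3, 4]
RNN_SIZES = [100, 200, 300]

def select_config(validation_accuracy):
    """Return the (beam_width, num_layers, rnn_size) with the highest score."""
    grid = list(product(BEAM_WIDTHS, NUM_LAYERS, RNN_SIZES))
    return max(grid, key=lambda cfg: validation_accuracy(*cfg))

def toy_accuracy(beam_width, num_layers, rnn_size):
    # Invented scoring function that peaks at 2 layers and RNN size 300.
    return -abs(num_layers - 2) + rnn_size / 1000 + beam_width / 1000

grid_size = len(list(product(BEAM_WIDTHS, NUM_LAYERS, RNN_SIZES)))
best = select_config(toy_accuracy)
```

The full grid has 48 configurations per model, which is one reason a cheap proxy such as string match accuracy is attractive during tuning.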

6 Results and Discussion

Table 2 summarizes our main experimental results. It shows that our proposed framework Distill-Combined clearly outperforms the three baselines Weak-Independent, Weak-Combined and Distill-Independent described in Section 5.2.

| Domain | Weak-Independent | Weak-Combined | Distill-Independent | Distill-Combined | Supervised |
|---|---|---|---|---|---|
| Basketball | 33.8 | 0.5 | 33.8 | 36.3 | 81.0 |
| Blocks | 27.6 | 0.8 | 36.8 | 37.1 | 52.8 |
| Calendar | 25.0 | 0.6 | 12.5 | 17.3 | 72.0 |
| Housing | 33.3 | 2.1 | 42.3 | 49.2 | 66.1 |
| Publications | 42.2 | 6.2 | 45.9 | 48.4 | 68.3 |
| Recipes | 45.8 | 2.3 | 61.5 | 66.2 | 80.5 |
| Restaurants | 41.3 | 2.1 | 40.9 | 45.2 | 73.5 |
| Average | 35.5 | 2.1 | 39.1 | 42.8 | 70.6 |

Table 2: Test denotation accuracy for each domain comparing our proposed method Distill-Combined with the three baselines. We also report the skyline Supervised.

Effect of Policy Distillation: Distill-Independent are individual domain models trained through distillation of the individual weakly supervised domain expert policies (Weak-Independent). We observe that policy distillation of individual expert policies results in an average relative increase of about 10% in accuracy (from 35.5 to 39.1 in Table 2), with an increase of about 33% in the case of the blocks domain (from 27.6 to 36.8), which shows the effectiveness of the distillation method employed in our framework. Note that for the calendar domain, Weak-Independent is unable to learn the parsing policy, probably due to the complexity of temporal utterances. Therefore, further distillation of the inaccurate policy leads to a drop in performance. A more systematic analysis of the failure cases is an interesting future direction.

Performance of Unified Semantic Parsing framework: The results show that the proposed unified semantic parser using multi-policy distillation (Distill-Combined, as described in Section 3) has, on average, the highest performance in predicting programs under the weak supervision setup. The Distill-Combined approach increases performance by about 20% on average (from 35.5 to 42.8 in Table 2) in comparison to the individual domain-specific teachers (Weak-Independent). We note the maximum increase in the case of the Housing domain, with an increase of about 47% in denotation accuracy (from 33.3 to 49.2).

Effectiveness of Multi-Policy Distillation: Finally, we evaluate the effectiveness of the overall multi-policy distillation process in comparison to training a combined model with data merged from all the domains (Weak-Combined) in the weak supervision setup. We observe that, due to the weak signal strength and the enlarged search space from multiple domains, the Weak-Combined model performs poorly across domains, which further reinforces the need for the distillation process. As discussed earlier, the Supervised model is trained using strong supervision from ground-truth parses and hence is considered a skyline rather than a comparable baseline for our proposed model.
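The relative improvements discussed in this section can be recomputed directly from the Table 2 accuracies. The short script below does this for Distill-Combined against Weak-Independent; the variable and function names are illustrative, and the numbers are copied from Table 2.

```python
# Test denotation accuracies copied from Table 2.
weak_independent = {
    "Basketball": 33.8, "Blocks": 27.6, "Calendar": 25.0, "Housing": 33.3,
    "Publications": 42.2, "Recipes": 45.8, "Restaurants": 41.3,
}
distill_combined = {
    "Basketball": 36.3, "Blocks": 37.1, "Calendar": 17.3, "Housing": 49.2,
    "Publications": 48.4, "Recipes": 66.2, "Restaurants": 45.2,
}

def relative_gain(new, old):
    """Relative change in percent when moving from `old` to `new`."""
    return 100.0 * (new - old) / old

gains = {d: relative_gain(distill_combined[d], weak_independent[d])
         for d in weak_independent}
best_domain = max(gains, key=gains.get)
# Relative gain between the reported column averages (35.5 and 42.8).
average_gain = relative_gain(42.8, 35.5)
```

On these numbers the gain between the column averages is a little over 20%, Housing shows the largest per-domain gain, and Calendar is the only domain where the unified student is below its weakly supervised teacher.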

Figure 4: Effect of the fraction of training data on different models

6.1 Effect of Small Parallel Corpus

We show that our model can greatly benefit from the availability of a limited amount of parallel data for which semantic parses are available. Figure 4 plots the performance of the Weak-Independent and Distill-Independent models for the recipes domain when initialized with a pre-trained Supervised model trained on 10% and 30% of the parallel training data. As can be seen, adding 10% parallel data brings an improvement of about 5 points, while increasing the parallel corpus size to only 30% yields an improvement of about 11 points. This large boost in performance is encouraging, given that a small parallel corpus is available in most real-world scenarios.

7 Conclusions and Future Work

In this work, we addressed the challenge of training a semantic parser for multiple domains without strong supervision, i.e. in the absence of ground truth programs corresponding to input utterances. We proposed a novel unified neural framework using a multi-policy distillation mechanism with two stages of training through weak supervision from denotations, i.e. the final answers corresponding to utterances. The resultant multi-domain semantic parser is compact and more precise, as demonstrated on the Overnight dataset. We believe that the proposed framework has wide applicability to any sequence-to-sequence model.

We show that a small parallel corpus with annotated programs boosts the performance. We plan to explore whether further fine-tuning of the distilled model using denotation-based training can lead to improvements in the unified parser. We also plan to investigate the possibility of augmenting the parallel corpus by bootstrapping from shared templates across domains. This would further make it feasible to perform transfer learning on a new domain. An interesting direction would be to enable domain experts to identify and actively request program annotations given the knowledge shared by other domains. We would also like to explore whether guiding the decoder through syntactic and domain-specific constraints helps in reducing the search space for the weakly supervised unified parser.


We thank Ghulam Ahmed Ansari and Miguel Ballesteros, our colleagues at IBM for discussions and suggestions which helped in shaping this paper.