Unsupervised Dual Paraphrasing for Two-stage Semantic Parsing

Ruisheng Cao et al.
Shanghai Jiao Tong University

One daunting problem for semantic parsing is the scarcity of annotation. Aiming to reduce nontrivial human labor, we propose a two-stage semantic parsing framework, where the first stage utilizes an unsupervised paraphrase model to convert an unlabeled natural language utterance into the canonical utterance. The downstream naive semantic parser accepts the intermediate output and returns the target logical form. Furthermore, the entire training process is split into two phases: pre-training and cycle learning. Three tailored self-supervised tasks are introduced throughout training to activate the unsupervised paraphrase model. Experimental results on benchmarks Overnight and GeoGranno demonstrate that our framework is effective and compatible with supervised training.



1 Introduction

Semantic parsing is the task of converting natural language utterances into structured meaning representations, typically logical forms (Zelle and Mooney, 1996; Wong and Mooney, 2007; Zettlemoyer and Collins, 2007; Lu et al., 2008). One prominent approach to build a semantic parser from scratch follows this procedure (Wang et al., 2015):

  1. (canonical utterance, logical form) pairs are automatically generated according to a domain-general grammar and a domain-specific lexicon.

  2. Researchers use crowdsourcing to paraphrase those canonical utterances into natural language utterances (the upper part of Figure 1).

  3. A semantic parser is built upon collected (natural language utterance, logical form) pairs.

Canonical utterances are pseudo-language utterances automatically generated from grammar rules; they are understandable to people but do not sound natural. Though effective, this paraphrasing paradigm suffers from two drawbacks: (1) dependence on nontrivial human labor and (2) low utilization of canonical utterances.

Figure 1: Two-stage semantic parsing framework, which is composed of an unsupervised paraphrase model and a naive neural semantic parser.

Annotators may struggle to understand the exact meanings of canonical utterances. Some canonical utterances even incur ambiguity, which increases the difficulty of annotation. Furthermore, Wang et al. (2015) and Herzig and Berant (2019) only exploit canonical utterances during data collection. Once the semantic parsing dataset is constructed, the canonical utterances are thrown away, which leads to insufficient utilization. While Berant and Liang (2014) and Su and Yan (2017) have reported the effectiveness of leveraging them as intermediate outputs, they experiment in a completely supervised way, where human annotation is indispensable.

In this work, inspired by unsupervised neural machine translation (Lample et al., 2017; Artetxe et al., 2017), we propose a two-stage semantic parsing framework. The first stage uses a paraphrase model to convert natural language utterances into corresponding canonical utterances; this paraphrase model is trained in an unsupervised way. Then a naive neural semantic parser (we use the word "naive" only to differentiate it from a traditional semantic parser: this module accepts canonical utterances instead of natural language utterances) is built upon auto-generated (canonical utterance, logical form) pairs using traditional supervised training. These two models are concatenated into a pipeline (Figure 1).

Paraphrasing aims to perform semantic normalization and reduce the diversity of expression, bridging the gap between natural language and logical forms. The naive neural semantic parser learns the inner mappings between canonical utterances and logical forms, as well as the structural constraints.
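As a concrete toy illustration of how the two stages compose, the sketch below stubs out both models; the lexicon entry, grammar rule, and function names are illustrative stand-ins of ours, not the paper's implementation.

```python
def paraphrase_model(nl_utterance: str) -> str:
    """Stage 1 stub: normalize a natural language utterance into canonical form."""
    lexicon = {"3 or more": "at least 3"}  # toy phrase alignment
    canonical = nl_utterance
    for phrase, norm in lexicon.items():
        canonical = canonical.replace(phrase, norm)
    return canonical

def naive_parser(canonical: str) -> str:
    """Stage 2 stub: map a canonical utterance to a logical form."""
    grammar = {  # toy auto-generated (canonical utterance, logical form) pair
        "player whose number of steals is at least 3":
            "filter(player, steals >= 3)",
    }
    return grammar.get(canonical, "UNPARSABLE")

def two_stage_parse(nl_utterance: str) -> str:
    """NL utterance -> canonical utterance -> logical form."""
    return naive_parser(paraphrase_model(nl_utterance))
```

The key property the sketch shows is that the second stage never sees raw natural language: it only has to handle the restricted canonical sub-language.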

The unsupervised paraphrase model consists of one shared encoder and two separate decoders for natural language and canonical utterances. In the pre-training phase, we design three types of noise (Section 3.1) tailored for a sentence-level denoising auto-encoder task (Vincent et al., 2008) to warm up the paraphrase model without any parallel data. This task aims to reconstruct the raw input utterance from its corrupted version. After obtaining a good initialization point, we further incorporate back-translation (Sennrich et al., 2015) and dual reinforcement learning (Section 2.2.2) tasks during the cycle learning phase. In this phase, one encoder-decoder model acts as the environment that provides pseudo-samples and reward signals for the other.

We conduct extensive experiments on the benchmarks Overnight and GeoGranno, in both unsupervised and semi-supervised settings. The results show that our method obtains significant improvements over various baselines in unsupervised settings. With full labeled data, we achieve new state-of-the-art performances on Overnight and GeoGranno, not considering additional data sources.

The main contributions of this work can be summarized as follows:

  • A two-stage semantic parser framework is proposed, which casts parsing into paraphrasing. No supervision is provided in the first stage between input natural language utterances and intermediate output canonical utterances.

  • In unsupervised settings, experimental results on the datasets Overnight and GeoGranno demonstrate the superiority of our model over various baselines, including the supervised method of Wang et al. (2015) on Overnight.

  • The framework is also compatible with traditional supervised training and achieves new state-of-the-art performances on the datasets Overnight and GeoGranno with full labeled data.

2 Our Approach

2.1 Problem Definition

For the rest of our discussion, we use x to denote a natural language utterance, z a canonical utterance, and y a logical form. X, Z, and Y represent the sets of all possible natural language utterances, canonical utterances, and logical forms respectively. The underlying mapping function f: Z -> Y is dominated by grammar rules.

We can train a naive neural semantic parser using an attention-based (Luong et al., 2015) Seq2Seq model (Sutskever et al., 2014). The labeled (z, y) samples can be automatically generated by recursively applying grammar rules. This parser can be pre-trained and saved for later usage.

As for the paraphrase model (see Figure 1), it consists of one shared encoder E and two independent decoders: D_x for natural language utterances and D_z for canonical utterances; we write, e.g., D_z ∘ E for the composition of the encoder with a decoder. Detailed model implementations are omitted here since they are not the main focus (see Appendix A.1).

Given an input utterance x, the paraphrase model D_z ∘ E converts it into a possible canonical utterance z̃; then z̃ is passed into the pre-trained naive parser to obtain the predicted logical form ỹ. The reverse paraphrase model, D_x ∘ E, is only used as an auxiliary tool during training.

2.2 Unsupervised training procedures

To train an unsupervised paraphrase model with no parallel data between X and Z, we split the entire training procedure into two phases: pre-training and cycle learning. D_x ∘ E and D_z ∘ E are first pre-trained as denoising auto-encoders (DAE). This initialization phase plays a significant part in accelerating convergence, given the ill-posed nature of paraphrasing tasks. Next, in the cycle learning phase, we employ both back-translation (BT) and dual reinforcement learning (DRL) strategies for self-training and exploration.

2.2.1 Pre-training phase

In this phase, we initialize the paraphrase model via the denoising auto-encoder task. All auxiliary models involved in calculating rewards (see Section 3.2) are also pre-trained.

Denoising auto-encoder
Figure 2: Denoising auto-encoders for the natural language utterance x and the canonical utterance z.

Given a natural language utterance x, we forward it through a noisy channel N_x (see Section 3.1) and obtain its corrupted version x̃. Model D_x ∘ E then tries to reconstruct the original input x from x̃ (see Figure 2). Symmetrically, model D_z ∘ E tries to reconstruct the original canonical utterance z from its corrupted input z̃. The training objective can be formulated as

    L_DAE = E_{x ∈ X}[ -log P(x | N_x(x); θ_x) ] + E_{z ∈ Z}[ -log P(z | N_z(z); θ_z) ]    (Eq. 1)

where θ_x and θ_z are the parameters of the two reconstruction paths.

2.2.2 Cycle learning phase

The training framework so far is just a noisy-copying model. To improve upon it, we adopt two schemes in the cycle learning phase, back-translation (BT) and dual reinforcement learning (DRL); see Figure 3.

Figure 3: Cycle learning tasks: back-translation and dual reinforcement learning.

Back-translation

In this task, the shared encoder aims to map input utterances of different types into the same latent space, and the decoders need to decompose this representation into an utterance of the other type. More concretely, given a natural language utterance x, we use paraphrase model D_z ∘ E in evaluation mode with greedy decoding to convert x into a canonical utterance z̃, thereby obtaining a pseudo training sample (z̃, x) for paraphrase model D_x ∘ E. Similarly, a pair (x̃, z) can be synthesized from model D_x ∘ E given a canonical utterance z. Next, we train the paraphrase model on these pseudo-parallel samples and update parameters by minimizing

    L_BT = E_{x ∈ X}[ -log P(x | z̃; θ_x) ] + E_{z ∈ Z}[ -log P(z | x̃; θ_z) ]    (Eq. 2)

The updated model will generate better paraphrases during the iterative process.
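One round of pseudo-sample generation can be sketched as follows; the model interfaces are hypothetical callables (in the real system they are the greedy-decoded encoder-decoder paths, run without gradients).

```python
def back_translation_step(nl_batch, cu_batch, nl2cu, cu2nl):
    """Each direction of the paraphrase model labels data for the other.

    nl2cu / cu2nl: callables str -> str, run greedily in eval mode.
    Returns pseudo-parallel (source, target) pairs for each direction.
    """
    # (z~, x): train the canonical->NL direction on back-translated z~
    pairs_for_cu2nl = [(nl2cu(x), x) for x in nl_batch]
    # (x~, z): train the NL->canonical direction on back-translated x~
    pairs_for_nl2cu = [(cu2nl(z), z) for z in cu_batch]
    return pairs_for_cu2nl, pairs_for_nl2cu
```

Note that each generated utterance appears only on the source side, so the decoder is always trained toward a genuine (uncorrupted) target.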

Dual reinforcement learning

Back-translation focuses on exploiting what has already been learned by the dual model, which may lead to a local optimum. To encourage more exploration during cycle learning, we introduce the dual reinforcement learning strategy and optimize the system through policy gradient (Sutton et al., 2000).

Starting from a natural language utterance x, we sample one canonical utterance z̃ through D_z ∘ E. We then evaluate the quality of z̃ from different aspects (see Section 3.2) and obtain a reward R(z̃). Similarly, we calculate a reward R(x̃) for a sampled natural language utterance x̃. To cope with high variance in the reward signals, we increase the sample size to k per input and re-define the reward signals via a baseline b to stabilize learning (taking z̃ as an example):

    R̂(z̃_i) = R(z̃_i) - b,  where  b = (1/k) Σ_{j=1}^{k} R(z̃_j)

We investigated different baseline choices (such as a running mean, the cumulative mean of history, and the reward of the greedy decoding prediction); the system performs best when we use the average reward within the k samples per input, especially with a larger sample size. The training objective is the negative sum of expected rewards:

    L_DRL = - E_{x ∈ X} E_{z̃ ~ P(z|x)}[ R̂(z̃) ] - E_{z ∈ Z} E_{x̃ ~ P(x|z)}[ R̂(x̃) ]    (Eq. 3)

The gradient is calculated with the REINFORCE (Williams, 1992) algorithm:

    ∇L_DRL ≈ - (1/k) Σ_{i=1}^{k} [ R̂(z̃_i) ∇log P(z̃_i | x) + R̂(x̃_i) ∇log P(x̃_i | z) ]

The complete loss function in the cycle learning phase is the sum of the cross-entropy loss and the policy gradient loss: L_cycle = L_BT + L_DRL. The entire training procedure is summarized in Algorithm 1.
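The baseline-adjusted policy gradient surrogate can be sketched as below; the function names are ours, and the mean-of-k-samples baseline follows the choice described above.

```python
def baseline_adjusted(rewards):
    """Center rewards with the mean over the k samples of one input."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE surrogate: -(1/k) * sum_i (R_i - b) * log P(sample_i).

    With log_probs attached to a computation graph, differentiating this
    value w.r.t. the model parameters yields the REINFORCE gradient.
    """
    adv = baseline_adjusted(rewards)
    k = len(rewards)
    return -sum(a * lp for a, lp in zip(adv, log_probs)) / k
```

Because the baseline is the per-input sample mean, samples rewarded above average are reinforced while below-average samples are suppressed, which is what stabilizes learning.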

Algorithm 1: Training procedure

Input: Unlabeled datasets X and Z; labeled (z, y) pairs synthesized from grammar; number of iterations T
Output: Paraphrase model D_z ∘ E

Pre-training phase
1: Pre-train all auxiliary models: language models LM_x and LM_z, the naive neural semantic parser, and the utterance discriminator
2: Pre-train paraphrase models D_x ∘ E and D_z ∘ E via the objective in Eq. 1

Cycle learning phase
3: for i = 1 to T do
4:     Sample a natural language utterance x and a canonical utterance z
       Back-translation
5:     Generate z̃ via model D_z ∘ E; generate x̃ via model D_x ∘ E
6:     Use (z̃, x) and (x̃, z) as pseudo samples; calculate loss L_BT based on Eq. 2
       Dual reinforcement learning
7:     Sample z̃ via model D_z ∘ E; compute its total reward based on Eq. 4
8:     Sample x̃ via model D_x ∘ E; compute its total reward based on Eq. 5
9:     Given the rewards, calculate loss L_DRL based on Eq. 3
       Update model parameters
10:    Calculate the total loss L_cycle = L_BT + L_DRL
11:    Update model parameters, obtaining new models D_x ∘ E and D_z ∘ E
12: end for

3 Training details

In this section, we elaborate on different types of noise used in our experiment and the reward design in dual reinforcement learning.

3.1 Noisy channel

We introduce three types of noise to deliberately corrupt the input utterance in the DAE task.

Importance-aware word dropping

Traditional word dropping (Lample et al., 2017) discards each word in the input utterance with equal probability. During reconstruction, the decoder needs to recover those words from the context. We further inject a bias towards dropping frequent words (such as function words) in the corpus rather than rare words (such as content words); see Table 1 for an illustration.

Input what team does kobe bryant play for
Ordinary drop what does kobe bryant for
Our drop team kobe bryant play for
Table 1: Importance-aware word dropping example.

Each word w in the natural language utterance x is independently dropped with a probability that grows with its corpus frequency count(w) and is capped at a maximum dropout rate p_max. We apply the same word dropping to canonical utterances.
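One plausible instantiation of importance-aware dropping is sketched below; the exact probability formula (here: proportional to corpus frequency, capped at p_max) and all names are our assumptions for illustration.

```python
import random
from collections import Counter

def drop_probability(word, counts, p_max=0.3):
    """More frequent words get a higher drop probability, up to p_max.

    counts: a Counter of corpus word frequencies (assumed interface).
    """
    most_common_count = counts.most_common(1)[0][1]
    return p_max * counts[word] / most_common_count

def importance_aware_drop(utterance, counts, p_max=0.3, rng=random):
    """Independently drop each word of the utterance."""
    kept = [w for w in utterance.split()
            if rng.random() >= drop_probability(w, counts, p_max)]
    return " ".join(kept)
```

Under this scheme a frequent function word like "what" is dropped near the cap, while a rare content word like "kobe" is almost always kept, matching the behavior in Table 1.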

Mixed-source addition

For any given raw input, it is either a natural language utterance or a canonical utterance. This observation discourages the shared encoder from learning a common representation space. Thus, we propose to insert extra words from the other source into the input utterance. For the noisy channel N_x, which corrupts a natural language utterance x, we first select one candidate canonical utterance z′; next, a few words are randomly sampled from z′ and inserted into arbitrary positions in x (see Table 2 for an example).

To pick a candidate z′ with higher relevance, we use a heuristic method: several canonical utterances are randomly sampled as candidates, and we choose the z′ with the minimum Word Mover's Distance to x (WMD, Kusner et al., 2015). The additive operation is exactly symmetric for the noisy channel N_z.

Input how many players are there
Selected number of team
Output how many number players are there
Table 2: Mixed-source addition example.
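The candidate-selection heuristic can be sketched as follows; for self-containment we substitute a simple Jaccard token distance for the Word Mover's Distance used above, and the function names are ours.

```python
def token_distance(a: str, b: str) -> float:
    """Stand-in for WMD: 1 - Jaccard overlap of the token sets."""
    sa, sb = set(a.split()), set(b.split())
    return 1.0 - len(sa & sb) / max(len(sa | sb), 1)

def pick_candidate(x: str, sampled_candidates):
    """Among randomly sampled canonical utterances, keep the closest to x."""
    return min(sampled_candidates, key=lambda z: token_distance(x, z))
```

Any sentence-level distance with the same "smaller is closer" convention (including WMD over word embeddings) can be dropped in for `token_distance`.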
Bigram shuffling

We also use word shuffling (Lample et al., 2017) in the noisy channels; it has proven useful in preventing the encoder from relying too much on word order. Instead of shuffling individual words, we first split the input utterance into n-grams and shuffle at the n-gram level (bigrams in our experiment). To account for the words inserted from the other source, we shuffle the entire utterance after the addition operation (see Table 3 for an example).

Input what is kobe bryants team
1-gram shuffling what is kobe team bryants
2-gram shuffling what is team kobe bryants
Table 3: Bigram shuffling example
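The n-gram level shuffling can be sketched as below (names are ours): the utterance is split into consecutive n-gram chunks and the chunks are permuted, so local word order inside each chunk survives.

```python
import random

def ngram_shuffle(utterance, n=2, rng=random):
    """Split into consecutive n-grams and permute the chunks."""
    words = utterance.split()
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    rng.shuffle(chunks)
    return " ".join(w for chunk in chunks for w in chunk)
```

With n=2 this reproduces the bigram behavior of Table 3: "kobe bryants" stays contiguous no matter how the chunks are rearranged.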

3.2 Reward design

In order to provide more informative reward signals and promote the performance in the DRL task, we introduce various rewards from different aspects.


Fluency

The fluency of an utterance is evaluated by a length-normalized language model. We use individual language models (LM_x and LM_z) for each type of utterance. For a sampled natural language utterance x̃, the fluency reward is the length-normalized log-probability

    R_flu(x̃) = (1/|x̃|) log LM_x(x̃)

For canonical utterances, we also include an additional reward from the downstream naive semantic parser to indicate whether the sampled canonical utterance z̃ is well-formed as its input.


Style

Natural language utterances are diverse, casual, and flexible, whereas canonical utterances are generally rigid, regular, and restricted to the specific form induced by grammar rules. To distinguish their characteristics, we incorporate another reward signal that determines the style of the sampled utterance, implemented with a CNN discriminator (Kim, 2014). The discriminator is a pre-trained sentence classifier that evaluates the probability of the input utterance being a canonical utterance.


Relevance

A relevance reward is included to measure how much content is preserved after paraphrasing. We follow common practice and take the log-likelihood from the dual model. Other options include the cosine similarity of sentence vectors or the BLEU score (Papineni et al., 2002) between the raw input and the reconstructed utterance; nevertheless, we found the log-likelihood to perform better in our experiments.

The total reward for a sampled canonical utterance z̃ and a sampled natural language utterance x̃ combines the components above:

    R(z̃) = R_flu(z̃) + R_sty(z̃) + R_rel(z̃)    (Eq. 4)
    R(x̃) = R_flu(x̃) + R_sty(x̃) + R_rel(x̃)    (Eq. 5)
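A sketch of combining the three reward components for one sample; the plain sum and the component interfaces (LM log-probability, discriminator probability, dual-model log-likelihood as floats) are our assumptions.

```python
def total_reward(sample_tokens, lm_logprob, style_prob, dual_loglik):
    """Fluency (length-normalized LM log-prob) + style (discriminator
    probability of the target style) + relevance (dual-model
    log-likelihood of reconstructing the input)."""
    fluency = lm_logprob / max(len(sample_tokens), 1)
    return fluency + style_prob + dual_loglik
```

In practice the three components could also be weighted; the sketch keeps an unweighted sum for simplicity.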


4 Experiment

In this section, we evaluate our system on the benchmarks Overnight and GeoGranno in both unsupervised and semi-supervised settings. Our implementations are publicly available at https://github.com/rhythmcao/unsup-two-stage-semantic-parsing.


Overnight

It contains natural language paraphrases paired with logical forms over several domains. We follow the traditional train/valid split to choose the best model during training. Canonical utterances are generated with the tool Sempre (https://github.com/percyliang/sempre) paired with target logical forms (Wang et al., 2015). Due to the limited number of grammar rules and their coarse-grained nature, there is only one canonical utterance for each logical form, whereas each canonical utterance has multiple natural language paraphrases on average. For example, to describe the concept of "larger", natural language utterances use many synonyms, such as "more than", "higher", and "at least", while in canonical utterances the expression is restricted by the grammar.


GeoGranno

Due to the language mismatch problem (Herzig and Berant, 2019), annotators are prone to reuse the same phrases or words while paraphrasing. GeoGranno is therefore created via detection instead of paraphrasing: natural language utterances are first collected from query logs, and crowd workers are required to select the correct canonical utterance from a candidate list (provided by an incrementally trained scoring function) for each input. We follow exactly the same train/valid/test split as the original paper (Herzig and Berant, 2019).

4.1 Experiment setup

Throughout the experiments, unless otherwise specified, word vectors are initialized with Glove6B (Pennington et al., 2014) and allowed to fine-tune; out-of-vocabulary words are replaced with a special unknown token. The batch size, the sample size k in the DRL task, and the beam size during evaluation are fixed across experiments. We use the Adam optimizer (Kingma and Ba, 2014) for all experiments. All auxiliary models are pre-trained and fixed for later usage. We report the denotation-level accuracy of logical forms in the different settings.

Supervised settings

This is the traditional scenario, where labeled (x, y) pairs are used to train a one-stage parser directly, and (x, z) and (z, y) pairs are respectively used to train the two parts of a two-stage parser.

Unsupervised settings

We split all methods into two categories: one-stage and two-stage. In the one-stage category, the Embed semantic parser is trained merely on (z, y) pairs but evaluated on natural language utterances; contextual embeddings ELMo (Peters et al., 2018) and Bert-base-uncased (Devlin et al., 2018) are also used to replace the original embedding layer. The WmdSamples method labels each input with the most similar logical form (one-stage) or canonical utterance (two-stage) based on WMD (Kusner et al., 2015) and treats these faked samples in a supervised way. MultiTaskDae utilizes another decoder for natural language utterances in the one-stage parser to perform the same DAE task discussed before. The two-stage CompleteModel can share the encoder or not (-SharedEncoder) and include the cycle learning tasks or not (-CycleLearning). The downstream parser for the two-stage system is Embed + Glove6B and is fixed after pre-training.

Semi-supervised settings

To further validate our framework, based on the complete model in unsupervised settings, we also conduct semi-supervised experiments by gradually adding part of labeled paraphrases with supervised training into the training process (both pre-training and cycle learning phase).

4.2 Results and analysis

| Method | Bas | Blo | Cal | Hou | Pub | Rec | Res | Soc | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Supervised (previous) | | | | | | | | | |
| SPO (Wang et al., 2015) | 46.3 | 41.9 | 74.4 | 54.0 | 59.0 | 70.8 | 75.9 | 48.2 | 58.8 |
| DSP-C (Xiao et al., 2016) | 80.5 | 55.6 | 75.0 | 61.9 | 75.8 | 80.1 | 80.0 | - | 72.7 |
| NoRecomb (Jia and Liang, 2016) | 85.2 | 58.1 | 78.0 | 71.4 | 76.4 | 79.6 | 76.2 | 81.4 | 75.8 |
| CrossDomain (Su and Yan, 2017) | 86.2 | 60.2 | 79.8 | 71.4 | 78.9 | 84.7 | 81.6 | 82.9 | 78.2 |
| Seq2Action (Chen et al., 2018) | 88.2 | 61.4 | 81.5 | 74.1 | 80.7 | 82.9 | 80.7 | 82.1 | 79.0 |
| Dual (Cao et al., 2019) | 87.5 | 63.7 | 79.8 | 73.0 | 81.4 | 81.5 | 81.6 | 83.0 | 78.9 |
| Supervised (ours) | | | | | | | | | |
| One-stage | 85.2 | 61.9 | 73.2 | 72.0 | 76.4 | 80.1 | 78.6 | 80.8 | 76.0 |
| Two-stage | 84.9 | 61.2 | 78.6 | 67.2 | 78.3 | 80.6 | 78.9 | 81.3 | 76.4 |
| Unsupervised (one-stage) | | | | | | | | | |
| Embed + Glove6B | 22.3 | 23.6 | 9.5 | 26.5 | 18.0 | 24.5 | 24.7 | 8.4 | 19.7 |
| + ELMo | 36.8 | 21.1 | 20.2 | 21.2 | 23.6 | 36.1 | 37.7 | 12.8 | 26.2 |
| + Bert | 40.4 | 31.6 | 23.2 | 35.5 | 37.9 | 30.1 | 44.0 | 19.2 | 32.7 |
| WmdSamples | 34.5 | 33.8 | 29.2 | 37.6 | 36.7 | 41.7 | 56.6 | 37.0 | 38.4 |
| MultiTaskDae | 44.0 | 25.8 | 16.1 | 34.4 | 29.2 | 46.3 | 43.7 | 15.5 | 31.9 |
| Unsupervised (two-stage) | | | | | | | | | |
| WmdSamples | 31.9 | 29.0 | 36.1 | 47.9 | 34.2 | 41.0 | 53.8 | 35.8 | 38.7 |
| CompleteModel | 64.7 | 53.4 | 58.3 | 59.3 | 60.3 | 68.1 | 73.2 | 48.4 | 60.7 |
| - CycleLearning | 32.5 | 43.1 | 36.9 | 48.2 | 53.4 | 49.1 | 58.7 | 36.9 | 44.9 |
| - SharedEncoder | 63.4 | 46.4 | 58.9 | 61.9 | 56.5 | 65.3 | 64.8 | 42.9 | 57.5 |
| Semi-supervised | | | | | | | | | |
| Dual (Cao et al., 2019) + labeled data | 83.6 | 62.2 | 72.6 | 61.9 | 71.4 | 75.0 | 76.5 | 80.4 | 73.0 |
| CompleteModel + labeled data | 83.6 | 57.4 | 66.1 | 63.0 | 60.3 | 68.1 | 75.3 | 73.1 | 68.4 |
| + labeled data | 84.4 | 59.4 | 79.2 | 57.1 | 65.2 | 79.2 | 77.4 | 76.9 | 72.4 |
| + labeled data | 85.4 | 64.9 | 77.4 | 69.3 | 67.1 | 78.2 | 79.2 | 78.3 | 75.0 |
| + labeled data | 85.9 | 64.4 | 81.5 | 66.1 | 74.5 | 82.4 | 79.8 | 81.6 | 77.0 |
| + labeled data | 87.2 | 65.7 | 80.4 | 75.7 | 80.1 | 86.1 | 82.8 | 82.7 | 80.1 |

Table 4: Denotation-level accuracy of logical forms on the dataset Overnight. For previous supervised methods, a superscript * indicates that cross-domain or extra data sources are not taken into account.

| Method | GeoGranno |
|---|---|
| Supervised (previous) | |
| CopyNet + ELMo (Herzig and Berant, 2019) | 72.0 |
| Supervised (ours) | |
| One-stage | 71.9 |
| Two-stage | 71.6 |
| Unsupervised (one-stage) | |
| Embed + Glove6B | 36.7 |
| + ELMo | 38.9 |
| + Bert | 40.7 |
| WmdSamples | 32.0 |
| MultiTaskDae | 38.1 |
| Unsupervised (two-stage) | |
| WmdSamples | 35.3 |
| CompleteModel | 63.7 |
| - CycleLearning | 44.6 |
| - SharedEncoder | 59.0 |
| Semi-supervised | |
| CompleteModel + labeled data | 69.4 |
| + labeled data | 71.6 |
| + labeled data | 74.5 |

Table 5: Denotation-level accuracy of logical forms on the dataset GeoGranno.

As Tables 4 and 5 demonstrate, in unsupervised settings: (1) the two-stage semantic parser is superior to the one-stage one, since it bridges the vast discrepancy between natural language utterances and logical forms by utilizing canonical utterances; even in supervised experiments, this pipeline remains competitive. (2) Not surprisingly, model performance is sensitive to word embedding initialization. On Overnight, directly using raw Glove6B word vectors gives the worst performance among all baselines; benefiting from the pre-trained embeddings ELMo or Bert, the accuracy is dramatically improved. (3) When we share the encoder module in a one-stage parser for multi-tasking (MultiTaskDae), the performance is not remarkably improved, and is even slightly lower than Embed + Bert. We hypothesize that a semantic parser utilizes the input utterance in a way different from a denoising auto-encoder, thus focusing on different zones in representation space; in a paraphrase model, by contrast, the input and output utterances are exactly symmetric, so sharing the encoder is more suitable for attaining excellent performance, on both Overnight and GeoGranno. Furthermore, the effectiveness of the DAE pre-training task can be explained in part by the proximity of natural language and canonical utterances. (4) The WmdSamples method is easy to implement but has poor generalization and an obvious upper bound, while our system can self-train through cycle learning, promoting performance well beyond its initialization on Overnight and outperforming the traditional supervised method of Wang et al. (2015).

As for semi-supervised results: (1) when only a small fraction of labeled data is added, the performance is dramatically improved on both Overnight and GeoGranno. (2) With partial annotation, our system is already competitive with the neural network model trained on all data in a supervised way. (3) Compared with the previous result reported by Cao et al. (2019) on the dataset Overnight with full parallel data, our system surpasses it by a large margin and achieves new state-of-the-art performance on both datasets when using all labeled data, not considering results using additional data sources or cross-domain benefits.

From the experimental results and Figure 4, we can safely summarize that (1) our proposed method resolves the daunting cold-start problem of training a semantic parser without any parallel data, and (2) it is also compatible with traditional supervised training and easily scales up to handle more labeled data.

Figure 4: Semi-supervised results of different ratios of labeled data on Overnight. Baselines are one-stage and two-stage models with merely supervised training.

4.3 Ablation study

In this section, we analyze the influence of each noise type in the DAE task and different combinations of schemes in the cycle learning phase on dataset Overnight.

4.3.1 Noisy channels in the pre-training DAE

| # noise types (among drop, addition, shuffling) | Acc |
|---|---|
| none | 26.9 |
| one | 33.7 |
| two | 43.0 |
| all | 44.9 |
Table 6: Ablation study of different noisy channels.

According to the results in Table 6: (1) interestingly, even without any noise, in which case the denoising auto-encoder degenerates into a simple copying model, the paraphrase model still succeeds in making some useful predictions. This observation may be attributed to the shared encoder for the two types of utterances. (2) As we gradually complicate the DAE task by increasing the number of noise types, the generalization capability continues to improve. (3) Generally speaking, importance-aware dropping and mixed-source addition are more useful than bigram shuffling in this task.

4.3.2 Strategies in the cycle learning

Table 7: Ablation study of schemes in cycle learning

The most striking observation from Table 7 is that the performance decreases when we add the DAE task into the cycle learning phase on top of BT + DRL. A possible explanation is that the model has already reached its bottleneck on the DAE task after pre-training, so the task contributes nothing further to cycle learning. Another likely factor stems from the contradictory goals of the different tasks: continuing to add the DAE regularization term may hinder the exploratory trials of the DRL task. By decoupling the three types of rewards in DRL, we discover that the style and relevance rewards are more informative than the fluency reward.

4.4 Case study

Input: who has gotten 3 or more steals
Baseline: player whose number of steals ( over a season ) is at most 3
Ours: player whose number of steals ( over a season ) is at least 3
(a) domain: Basketball
Input: show me all attendees of the weekly standup meeting
Baseline: meeting whose attendee is attendee of weekly standup
Ours: person that is attendee of weekly standup and that is attendee of weekly standup
(b) domain: Calendar
Input: what is the largest state bordering _state_
Baseline: state that has the largest area
Ours: state that borders _state_ and that has the largest area
Input: which state has the highest population density ?
Baseline: population of state that has the largest density
Ours: state that has the largest density
(c) domain: GeoGranno
Table 8: Case study. The input is a natural language utterance, and the intermediate output is a canonical utterance. Entities in dataset GeoGranno are replaced with their types, e.g., "_state_".

In Table 8, we compare intermediate canonical utterances generated by our unsupervised paraphrase model with those created by the baseline WmdSamples. In domain Basketball, our system succeeds in paraphrasing the constraint into "at least 3", an alias of "3 or more". This finding supports the assumption that our model can learn fine-grained semantics such as phrase alignments. In domain GeoGranno, our model rectifies the errors of the baseline system, where the constraint "borders _state_" is missing and the subject "state" is stealthily replaced with "population". As for domain Calendar, the baseline system fails to identify the query object and returns "meeting" instead of "person". Although our model correctly understands the purpose, it does some unnecessary work: the requirement "attendee of weekly standup" is repeated. This may be caused by the uncontrolled process during cycle learning, in which we encourage the model to take risky steps toward better solutions.

5 Related Work

Annotation for Semantic Parsing

Semantic parsing is always data-hungry, yet its annotation is not user-friendly. Many researchers have attempted to relieve the burden of human annotation, e.g., training from weak supervision (Krishnamurthy and Mitchell, 2012; Berant et al., 2013; Liang et al., 2017; Goldman et al., 2018), semi-supervised learning (Yin et al., 2018; Guo et al., 2018; Cao et al., 2019; Zhu et al., 2014), on-line learning (Iyer et al., 2017; Lawrence and Riezler, 2018), and relying on multi-lingual (Zou and Lu, 2018) or cross-domain datasets (Herzig and Berant, 2017; Zhao et al., 2019). In this work, we avoid heavy annotation by utilizing canonical utterances as intermediate results and constructing an unsupervised paraphrase model.

Unsupervised Learning for Seq2Seq Models

Seq2Seq (Sutskever et al., 2014; Zhu and Yu, 2017) models have been successfully applied in unsupervised tasks such as neural machine translation (NMT) (Lample et al., 2017; Artetxe et al., 2017), text simplification (Zhao et al., 2020), spoken language understanding (Zhu et al., 2018) and text style transfer (Luo et al., 2019). As Lample et al. (2018) pointed out, unsupervised NMT relies heavily on pre-trained cross-lingual word embeddings for initialization; moreover, it mainly focuses on learning phrase alignments or word mappings. In this work, by contrast, we dive into sentence-level semantics and adopt the dual structure of an unsupervised paraphrase model to improve semantic parsing.

6 Conclusion

In this work, aiming to reduce annotation, we propose a two-stage semantic parsing framework. The first stage utilizes the dual structure of an unsupervised paraphrase model to rewrite the input natural language utterance into a canonical utterance. Three self-supervised tasks, namely denoising auto-encoder, back-translation, and dual reinforcement learning, are introduced to iteratively improve the model through the pre-training and cycle learning phases. Experimental results show that our framework is effective and compatible with supervised training.


Acknowledgments

We thank the anonymous reviewers for their thoughtful comments. This work has been supported by the National Key Research and Development Program of China (Grant No. 2017YFB1002102) and Shanghai Jiao Tong University Scientific and Technological Innovation Funds (YG2020YQ01).


References

  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2017) Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041. Cited by: §1, §5.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544. Cited by: §5.
  • J. Berant and P. Liang (2014) Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1415–1425. External Links: Link, Document Cited by: §1.
  • R. Cao, S. Zhu, C. Liu, J. Li, and K. Yu (2019) Semantic parsing with dual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 51–64. External Links: Link, Document Cited by: §4.2, Table 4, §5.
  • B. Chen, L. Sun, and X. Han (2018) Sequence-to-action: end-to-end semantic graph generation for semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 766–777. Cited by: Table 4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.1.
  • O. Goldman, V. Latcinnik, E. Nave, A. Globerson, and J. Berant (2018) Weakly supervised semantic parsing with abstract examples. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1809–1819. Cited by: §5.
  • D. Guo, Y. Sun, D. Tang, N. Duan, J. Yin, H. Chi, J. Cao, P. Chen, and M. Zhou (2018) Question generation from sql queries improves neural semantic parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1597–1607. Cited by: §5.
  • J. Herzig and J. Berant (2017) Neural semantic parsing over multiple knowledge-bases. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 623–628. Cited by: §5.
  • J. Herzig and J. Berant (2019) Don’t paraphrase, detect! rapid and effective data collection for semantic parsing. arXiv preprint arXiv:1908.09940. Cited by: §1, §4, Table 5.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §A.1.
  • S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer (2017) Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 963–973. Cited by: §5.
  • R. Jia and P. Liang (2016) Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 12–22. External Links: Link, Document Cited by: §A.1, Table 4.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §A.1, §3.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • J. Krishnamurthy and T. M. Mitchell (2012) Weakly supervised training of semantic parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 754–765. Cited by: §5.
  • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger (2015) From word embeddings to document distances. In International conference on machine learning, pp. 957–966. Cited by: §3.1, §4.1.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2017) Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043. Cited by: §A.1, §1, §3.1, §3.1, §5.
  • G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 5039–5049. External Links: Link, Document Cited by: §5.
  • C. Lawrence and S. Riezler (2018) Improving a neural semantic parser by counterfactual learning from human bandit feedback. arXiv preprint arXiv:1805.01252. Cited by: §5.
  • C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao (2017) Neural symbolic machines: learning semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23–33. Cited by: §5.
  • W. Lu, H. T. Ng, W. S. Lee, and L. S. Zettlemoyer (2008) A generative model for parsing natural language to meaning representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 783–792. Cited by: §1.
  • F. Luo, P. Li, J. Zhou, P. Yang, B. Chang, Z. Sui, and X. Sun (2019) A dual reinforcement learning framework for unsupervised text style transfer. arXiv preprint arXiv:1905.10060. Cited by: §5.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §A.1, §2.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.2.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §4.1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §4.1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §1.
  • Y. Su and X. Yan (2017) Cross-domain semantic parsing via paraphrasing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1235–1246. Cited by: §1, Table 4.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.1, §5.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.2.2.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §1.
  • Y. Wang, J. Berant, and P. Liang (2015) Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1332–1342. Cited by: 2nd item, §1, §1, §4, §4.2, Table 4.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2.2.2.
  • Y. W. Wong and R. Mooney (2007) Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 960–967. Cited by: §1.
  • C. Xiao, M. Dymetman, and C. Gardent (2016) Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1341–1350. External Links: Link, Document Cited by: Table 4.
  • P. Yin, C. Zhou, J. He, and G. Neubig (2018) StructVAE: tree-structured latent variable models for semi-supervised semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 754–765. Cited by: §5.
  • J. M. Zelle and R. J. Mooney (1996) Learning to parse database queries using inductive logic programming. In Proceedings of the national conference on artificial intelligence, pp. 1050–1055. Cited by: §1.
  • L. Zettlemoyer and M. Collins (2007) Online learning of relaxed ccg grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Cited by: §1.
  • Y. Zhao, L. Chen, Z. Chen, and K. Yu (2020) Semi-supervised text simplification with back-translation and asymmetric denoising autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §5.
  • Z. Zhao, S. Zhu, and K. Yu (2019) Data augmentation with atomic templates for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3637–3643. External Links: Link, Document Cited by: §5.
  • S. Zhu, L. Chen, K. Sun, D. Zheng, and K. Yu (2014) Semantic parser enhancement for dialogue domain extension with little data. In 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 336–341. Cited by: §5.
  • S. Zhu, O. Lan, and K. Yu (2018) Robust spoken language understanding with unsupervised asr-error adaptation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6179–6183. Cited by: §5.
  • S. Zhu and K. Yu (2017) Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5675–5679. Cited by: §5.
  • Y. Zou and W. Lu (2018) Learning cross-lingual distributed logical representations for semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 673–679. External Links: Link, Document Cited by: §5.

Appendix A Appendices

a.1 Model Implementations

In this section, we give a full discussion of all models used in our two-stage semantic parsing framework.

Unsupervised paraphrase model

We use a traditional attention-based Seq2Seq model (Luong et al., 2015). Different from previous work, we remove the transition function of hidden states between the encoder and the decoder: the initial hidden states of the decoders are set to zero vectors. Take the paraphrase model as an example:

(1) a shared encoder encodes the input utterance into a sequence of contextual representations h through a bi-directional single-layer LSTM (Hochreiter and Schmidhuber, 1997) network, after an embedding function maps the input tokens to vectors;

(2) on the decoder side, a traditional LSTM language model at the bottom is used to model dependencies in the target utterance, with a separate embedding function on the target side;

(3) the output state at each time-step is then fused with the encoded contexts h, via trainable model parameters, to obtain the features for the final softmax classifier.
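As a rough illustration of step (3), the following pure-Python sketch fuses a decoder state with the encoded contexts h through dot-product attention before the softmax classifier. The names, toy dimensions, and the exact fusion (concatenation followed by a linear layer) are illustrative assumptions, not the paper's exact parameterization.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_fusion(s_t, H, W, b):
    """Fuse decoder state s_t with encoder contexts H (one vector per
    source position); hypothetical concat-then-linear fusion."""
    scores = [dot(h, s_t) for h in H]           # alignment score per position
    alpha = softmax(scores)                     # attention weights
    c_t = [sum(a * h[i] for a, h in zip(alpha, H))
           for i in range(len(s_t))]            # context vector
    feats = s_t + c_t                           # concatenation [s_t; c_t]
    logits = [dot(row, feats) + bi for row, bi in zip(W, b)]
    return softmax(logits)                      # distribution over vocabulary

# toy shapes: hidden size 2, 3 source positions, vocabulary size 4
s_t = [0.5, -0.2]
H = [[0.1, 0.3], [0.4, -0.1], [-0.2, 0.2]]
W = [[0.1] * 4, [0.2] * 4, [-0.1] * 4, [0.0] * 4]
b = [0.0, 0.0, 0.0, 0.0]
p = attention_fusion(s_t, H, W, b)
```

In a real implementation the same fusion would run at every decoding time-step, with the weight matrix and bias shared across steps.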

In both the pre-training and cycle learning phases, the unsupervised paraphrase model is trained for a fixed number of epochs. To select the best model during unsupervised training, inspired by Lample et al. (2017), we use a surrogate criterion, since we have no access to labeled data even at validation time. For a natural language utterance, we pass it into the paraphrase model and obtain a canonical utterance via greedy decoding; this canonical utterance is then forwarded into the dual paraphrase model. By measuring the BLEU score between the raw input and the reconstructed utterance, we obtain one metric. In the reverse path, we obtain another metric by calculating the overall accuracy between a raw canonical utterance and its reconstruction obtained through the naive semantic parser. The overall metric for model selection combines these two, with a scaling hyper-parameter as the weight.
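The surrogate criterion can be sketched as follows. The smoothed sentence-level BLEU, the additive combination, and the weight `lam` are illustrative assumptions; the paper's exact formula and hyper-parameter value are not reproduced here.

```python
import math
from collections import Counter

def sentence_bleu(ref, hyp, max_n=4):
    """Simple smoothed sentence-level BLEU between two token lists."""
    if not hyp:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        total = sum(hyp_ngrams.values())
        if total == 0:
            continue  # hypothesis shorter than n
        clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        precisions.append((clipped + 1e-9) / total)  # tiny smoothing
    if not precisions:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)

def surrogate_score(nl_bleu, cu_accuracy, lam=0.5):
    """Combine reconstruction BLEU on the natural language path with
    reconstruction accuracy on the canonical path; lam is hypothetical."""
    return nl_bleu + lam * cu_accuracy

# toy example: a perfect NL reconstruction and 80% canonical accuracy
x = "show me flights from boston".split()
x_rec = "show me flights from boston".split()
score = surrogate_score(sentence_bleu(x, x_rec), cu_accuracy=0.8, lam=0.5)
```

Checkpoint selection would then keep the model whose `score` on the unlabeled validation set is highest.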

Auxiliary models

The naive semantic parser is another Seq2Seq model with exactly the same architecture as the paraphrase model. We do not incorporate a copy mechanism because it has been shown to be unhelpful on the Overnight dataset (Jia and Liang, 2016). The language models are all single-layer unidirectional LSTM networks. As for the style discriminator, we use a CNN-based sentence classifier (Kim, 2014) with rectified linear units and multiple filter window sizes, each with its own number of feature maps. All the auxiliary models are trained for a bounded number of epochs.
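A minimal sketch of such a CNN sentence classifier in the style of Kim (2014) is shown below: filters slide over word embeddings, pass through ReLU, are max-pooled over time, and feed a linear output layer. The window sizes, toy dimensions, and single sigmoid output are assumptions for illustration; the paper's actual filter sizes and feature-map counts are not reproduced here.

```python
import math

def relu(x):
    return max(0.0, x)

def conv_max_pool(embs, window, filters):
    """Apply each (weights, bias) filter over every window of the given
    size, ReLU, then max-over-time pooling; one feature per filter."""
    pooled = []
    for w_filt, b_filt in filters:
        best = 0.0
        for i in range(len(embs) - window + 1):
            window_vec = [v for emb in embs[i:i + window] for v in emb]
            act = relu(sum(a * b for a, b in zip(w_filt, window_vec)) + b_filt)
            best = max(best, act)
        pooled.append(best)
    return pooled

def classify(tokens, emb_table, filter_sets, w_out, b_out):
    embs = [emb_table[t] for t in tokens]
    feats = []
    for window, filters in filter_sets:
        feats.extend(conv_max_pool(embs, window, filters))
    logit = sum(a * b for a, b in zip(w_out, feats)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # probability of "canonical style"

# toy setup: 2-dim embeddings, windows of sizes 2 and 3, one filter each
emb_table = {"show": [0.1, 0.2], "me": [0.0, 0.1], "flights": [0.3, -0.1]}
filter_sets = [
    (2, [([0.1, 0.0, 0.2, 0.1], 0.0)]),  # window 2 -> weight length 2*2
    (3, [([0.1] * 6, 0.05)]),            # window 3 -> weight length 3*2
]
p = classify(["show", "me", "flights"], emb_table, filter_sets,
             w_out=[0.5, -0.3], b_out=0.0)
```

In the framework, the discriminator's probability would serve as a style reward for the dual reinforcement learning task.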

For all models discussed above, the embedding dimension, hidden size, and dropout rate between layers are shared hyper-parameters. All parameters except the embedding layers are initialized by uniform sampling within a fixed interval.