Transfer Learning for Neural Semantic Parsing

by Xing Fan, et al.

The goal of semantic parsing is to map natural language to a machine interpretable meaning representation language (MRL). One of the constraints that limits full exploration of deep learning technologies for semantic parsing is the lack of sufficient annotated training data. In this paper, we propose using sequence-to-sequence models in a multi-task setup for semantic parsing, with a focus on transfer learning. We explore three multi-task architectures for sequence-to-sequence modeling and compare their performance with an independently trained model. Our experiments show that the multi-task setup aids transfer learning from an auxiliary task with large labeled data to a target task with smaller labeled data. We see absolute accuracy gains ranging from 1.0% to 4.4% from semantic auxiliary tasks.





1 Introduction

Conversational agents, such as Alexa, Siri and Cortana, solve complex tasks by interacting and mediating between the end-user and multiple backend software applications and services. Natural language is a simple interface used for communication between these agents. However, to make natural language machine-readable we need to map it to a representation that describes the semantics of the task expressed in the language. Semantic parsing is the process of mapping a natural-language sentence into a formal machine-readable representation of its meaning. This poses a challenge in a multi-tenant system that has to interact with multiple backend knowledge sources, each with its own semantic formalism and custom schema for accessing information, where each formalism has varying amounts of annotated training data.

Recent works have proven sequence-to-sequence to be an effective model architecture for semantic parsing Jia and Liang (2016); Dong and Lapata (2016). However, because of the limited amount of annotated data, the advantage of neural networks in capturing complex data representations using deep structure Johnson et al. (2016) has not been fully explored. Acquiring data is expensive and sometimes infeasible for task-oriented systems, the main reasons being multiple formalisms (e.g., SPARQL for WikiData Vrandečić and Krötzsch (2014), MQL for Freebase Flanagan (2008)) and multiple tasks (question answering, navigation interactions, transactional interactions). We propose to exploit these multiple representations in a multi-task framework so we can minimize the need for large labeled corpora across these formalisms. By suitably modifying the learning process, we capture the common structures that are implicit across these formalisms and the tasks they are targeted for.

In this work, we focus on a sequence-to-sequence based transfer learning for semantic parsing. In order to tackle the challenge of multiple formalisms, we apply three multi-task frameworks with different levels of parameter sharing. Our hypothesis is that the encoder-decoder paradigm learns a canonicalized representation across all tasks. Over a strong single-task sequence-to-sequence baseline, our proposed approach shows accuracy improvements across the target formalism. In addition, we show that even when the auxiliary task is syntactic parsing we can achieve good gains in semantic parsing that are comparable to the published state-of-the-art.

2 Related Work

There is a large body of work on semantic parsing. These approaches fall into three broad categories: completely supervised learning based on fully annotated logical forms associated with each sentence Zelle and Mooney (1996); Zettlemoyer and Collins (2012); using question-answer pairs and conversation logs as supervision Artzi and Zettlemoyer (2011); Liang et al. (2011); Berant et al. (2013); and distant supervision Cai and Yates (2013); Reddy et al. (2014). All these approaches make assumptions about the task, features and the target semantic formalism.

On the other hand, neural network based approaches, in particular the use of recurrent neural networks (RNNs) in the encoder-decoder paradigm Sutskever et al. (2014), have made fast progress toward state-of-the-art performance on various NLP tasks Vinyals et al. (2015); Dyer et al. (2015); Bahdanau et al. (2014). A key advantage of RNNs in the encoder-decoder paradigm is that very few assumptions are made about the domain, language and semantic formalism. This implies they can generalize faster with little feature engineering.

Full semantic graphs can be expensive to annotate, and efforts to date have been fragmented across different formalisms, leading to a limited amount of annotated data in any single formalism. Using neural networks to train semantic parsers on limited data is quite challenging. Multi-task learning aims at improving the generalization performance of a task using related tasks Caruana (1998); Ando and Zhang (2005); Smith and Smith (2004). This opens the opportunity to utilize large amounts of data for a related task to improve the performance across all tasks. There has been recent work in NLP demonstrating improved performance for machine translation Dong et al. (2015) and syntactic parsing Luong et al. (2015).

In this work, we attempt to merge various strands of research, using sequence-to-sequence modeling for semantic parsing with a focus on improving semantic formalisms that have small amounts of training data through a multi-task model architecture. The closest work is Jonathan2017multiKB. Similar to our work, the authors use a neural semantic parsing model in a multi-task framework to jointly learn over multiple knowledge bases. Our work differs in that we focus on transfer learning, where we have access to a large labeled resource in one task and want another semantic formalism with limited training data to benefit from the multi-task setup. Furthermore, we also demonstrate that we can improve semantic parsing tasks by using large data sources from an auxiliary task such as syntactic parsing, thereby opening up the opportunity to leverage much larger datasets. Finally, we carefully compare multiple multi-task architectures in our setup and show that increased sharing of both the encoder and decoder, along with shared attention, results in the best performance.

3 Problem Formulation

3.1 Sequence-to-Sequence Formulation

Figure 1: An example of how the decoder output is generated.

Our semantic parser extends the basic encoder-decoder approach in jia2016data. Given a sequence of inputs x = x_1, ..., x_m, the sequence-to-sequence model will generate an output sequence y = y_1, ..., y_n. We encode the input tokens into a sequence of embeddings:

    h_i = f_enc(phi_in(x_i), h_{i-1})    (1)

First, an input embedding layer phi_in maps each word x_i to a fixed-dimensional vector, which is then fed as input to the network f_enc to obtain the hidden state representation h_i. The embedding layer could contain one single word embedding lookup table or a combination of word and gazetteer embeddings, where we concatenate the output from each table. For the encoder and decoder, we use stacked Gated Recurrent Units (GRU) Cho et al. (2014). (In order to speed up training, we use a right-to-left GRU instead of a bidirectional GRU.) The hidden states are then converted to one fixed-length context vector per output index, c_j = q_j(h_1, ..., h_m), where q_j summarizes all input hidden states to form the context for a given output index j. (In a vanilla decoder, each c_j = h_m, i.e., the hidden representation from the last state of the encoder is used as context for every output time step j.)

The decoder then uses these fixed-length vectors to create the target sequence through the following model. At each time step j in the output sequence, a state d_j is calculated as

    d_j = f_dec(phi_out(y_{j-1}), d_{j-1}, c_j)    (2)

where phi_out maps any output symbol to a fixed-dimensional vector. Finally, we compute the probability of the output symbol y_j given the history y_{<j} using Equation 3:

    p(y_j | y_{<j}, x) ∝ exp(W [d_j ; c_j])    (3)

where the matrix W projects the concatenation of d_j and c_j, denoted [d_j ; c_j], to the final output space. The matrix W is part of the trainable model parameters. We use an attention mechanism Bahdanau et al. (2014) to summarize the context vector c_j:

    c_j = Σ_{i=1..m} α_{ji} h_i    (4)

where j is the step index for the decoder output and α_{ji} is the attention weight, calculated using a softmax:

    α_{ji} = exp(e_{ji}) / Σ_{k=1..m} exp(e_{jk})    (5)

where e_{ji} is the relevance score of each encoder hidden state h_i, modeled as:

    e_{ji} = g(h_i, d_{j-1})    (6)

In this paper, the function g is defined as follows:

    g(h_i, d_{j-1}) = v^T tanh(W_1 h_i + W_2 d_{j-1})    (7)

where v, W_1 and W_2 are trainable parameters.
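As a concrete illustration, the attention computation described above can be sketched in numpy. The dimensions, random initialization and helper names below are illustrative choices, not the authors' implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def attention_context(H, d_prev, W1, W2, v):
    """One attention step.

    H:      (m, h) encoder hidden states h_1..h_m
    d_prev: (h,)   previous decoder state d_{j-1}
    Returns the context vector c_j and attention weights alpha_j.
    """
    # relevance scores: e_{ji} = v^T tanh(W1 h_i + W2 d_{j-1})
    scores = np.tanh(H @ W1.T + d_prev @ W2.T) @ v   # shape (m,)
    alpha = softmax(scores)                          # attention weights
    c = alpha @ H                                    # weighted sum of h_i
    return c, alpha

# toy dimensions: m=4 input tokens, hidden size 6, attention size 5
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))
d_prev = rng.normal(size=6)
W1 = rng.normal(size=(5, 6))
W2 = rng.normal(size=(5, 6))
v = rng.normal(size=5)
c, alpha = attention_context(H, d_prev, W1, W2, v)
```

The weights form a distribution over the input positions, and the context vector lives in the encoder's hidden space.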

In order to deal with the large vocabularies in the output layer introduced by the long tail of entities in typical semantic parsing tasks, we use a copy mechanism Jia and Liang (2016). At each time step, the decoder chooses either to copy a token from the encoder's input stream or to write a token from the decoder's fixed output vocabulary. We define two actions:

  1. Write[w] for some w in V_out, where V_out is the output vocabulary of the decoder.

  2. Copy[i] for some i in 1, ..., m, which copies one symbol from the m input tokens.

We formulate a single softmax to select the action a_j to take, rewriting Equation 3 as follows:

    p(a_j = Write[w] | y_{<j}, x) ∝ exp(W [d_j ; c_j])
    p(a_j = Copy[i] | y_{<j}, x) ∝ exp(e_{ji})    (8)

The decoder is now a softmax over the actions a_j; Figure 1 shows how the decoder's output at the third time step is generated. At each time step, the decoder will make a decision to copy a particular token from the input stream or to write a token from the fixed output label pool.
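A minimal sketch of this action-level softmax follows. The toy dimensions and zeroed weights are hypothetical; in the real model the write logits come from the trained projection of [d_j ; c_j] and the copy logits reuse the attention relevance scores:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(d_j, c_j, copy_scores, W, vocab, input_tokens):
    """One decoder step with a copy mechanism: a single softmax ranges over
    Write[w] actions (projection of [d_j; c_j]) and Copy[i] actions
    (unnormalized attention scores e_{ji}, one per input token)."""
    write_logits = W @ np.concatenate([d_j, c_j])         # one logit per vocab word
    logits = np.concatenate([write_logits, copy_scores])  # plus one Copy per token
    p = softmax(logits)
    actions = [("Write", w) for w in vocab] + [("Copy", t) for t in input_tokens]
    return actions[int(np.argmax(p))], p

# toy example: attention strongly prefers copying the first input token
vocab = ["Answer", "PlayMusic"]
W = np.zeros((len(vocab), 4))                 # write scores all zero here
best, p = decode_step(np.zeros(2), np.zeros(2),
                      np.array([5.0, -1.0]), W, vocab, ["play", "madonna"])
# best == ("Copy", "play")
```

Because the copy score for "play" dominates, the decoder copies it rather than writing a vocabulary token, which is exactly how long-tail entities escape the fixed output vocabulary.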

(a) one-to-many: A multi-task architecture where only the encoder is shared across the two tasks.
(b) one-to-one: A multi-task architecture where both the encoder and decoder along with the attention layer are shared across the two tasks.
(c) one-to-shareMany: A multi-task architecture where both the encoder and decoder along with the attention layer are shared across the two tasks, but the final softmax output layer is trained differently, one for each task.
Figure 2: Three multi-task architectures.

3.2 Multi-task Setup

We focus on training scenarios where multiple training sources are available. Each source can be considered a domain or a task, consisting of pairs of utterances and annotated logical forms. There are no constraints on the logical forms having the same formalism across the domains. Also, the tasks can be different, e.g., we can mix semantic parsing and syntactic parsing tasks. We also assume that given an utterance, we already know its associated source in both training and testing.

In this work, we explore and compare three multi-task sequence-to-sequence model architectures: one-to-many, one-to-one and one-to-shareMany.

3.2.1 One-to-Many Architecture

This is the simplest extension of sequence-to-sequence models to the multi-task case. The encoder is shared across all the tasks, but the decoder and attention parameters are not shared. The shared encoder captures the English language sequence, whereas each decoder is trained to predict its own formalism. This architecture is shown in Figure 2(a). For each minibatch, we uniformly sample among all training sources, choosing one source to select data exclusively from. Therefore, at each model parameter update, we only update the encoder, attention module and the decoder for the selected source, while the parameters for the other decoder and attention modules remain the same.
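The per-minibatch source selection can be sketched as follows. Here `encoder_step` and `decoder_steps` are hypothetical stand-ins for the real parameter updates:

```python
import random

def train_one_to_many(sources, encoder_step, decoder_steps, n_updates, seed=0):
    """Sketch of the one-to-many update schedule: each minibatch is drawn
    from one uniformly sampled source, so the shared encoder is always
    updated while only that source's decoder/attention is updated."""
    rng = random.Random(seed)
    picks = {name: 0 for name in sources}
    for _ in range(n_updates):
        name = rng.choice(sorted(sources))   # uniform over training sources
        batch = sources[name]                # data exclusively from that source
        encoder_step(batch)                  # shared parameters
        decoder_steps[name](batch)           # only the chosen task's decoder
        picks[name] += 1
    return picks

# toy usage: log which modules get updated
log = []
picks = train_one_to_many(
    {"AlexaMRL": ["batchA"], "EviMRL": ["batchB"]},
    encoder_step=lambda b: log.append(("enc", b[0])),
    decoder_steps={"AlexaMRL": lambda b: log.append(("decA", b[0])),
                   "EviMRL": lambda b: log.append(("decB", b[0]))},
    n_updates=100)
```

The encoder receives one update per minibatch, while each decoder is touched only when its source is sampled.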

3.2.2 One-to-One Architecture

Figure 2(b) shows the one-to-one architecture. Here we have a single sequence-to-sequence model across all the tasks, i.e., the embedding, encoder, attention, decoder and the final output layers are shared across all the tasks. In this architecture, the number of parameters is independent of the number of tasks. Since there is no explicit representation of the domain/task that is being decoded, the input is augmented with an artificial token at the start to identify the task, the same way as in johnson2016google.
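The task token can be added with a one-line preprocessing step. The `<2task>` string format mirrors the convention used in johnson2016google for multilingual translation; the exact token string here is an illustrative choice:

```python
def add_task_token(tokens, task):
    """one-to-one setup: prepend an artificial token identifying the task,
    so a single fully shared model can tell the formalisms apart."""
    return ["<2" + task + ">"] + list(tokens)

print(add_task_token("play madonna".split(), "AlexaMRL"))
# ['<2AlexaMRL>', 'play', 'madonna']
```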

3.2.3 One-to-ShareMany Architecture

We show the model architecture for one-to-shareMany in Figure 2(c). The model modifies the one-to-many model by encouraging further sharing of the decoder weights. Compared with the one-to-one model, one-to-shareMany differs in the following aspects:

  • Each task has its own output layer. Our hypothesis is that by separating the tasks in the final layer we can still get the benefit of sharing the parameters, while fine-tuning for specific tasks in the output, resulting in better accuracy on each individual task.

  • The one-to-one architecture requires a concatenation of the output labels from all training sources, and during training every minibatch needs to be forwarded and projected to this large softmax layer. For one-to-shareMany, each minibatch only needs to be fed to the softmax associated with the chosen source. Therefore, one-to-shareMany is faster to train, especially when the output label set is large.

  • The one-to-one architecture is susceptible to data imbalance across the multiple tasks, and typically requires upsampling or downsampling the data. One-to-shareMany, by contrast, can simply alternate minibatches amongst the sources using uniform selection.

    From the perspective of neural network optimization, mixing the small training data with a large data set from the auxiliary task can also be seen as adding noise to the training process, and is hence helpful for generalization and for avoiding overfitting. With the auxiliary tasks, we are able to train large models that can handle complex tasks without worrying about overfitting.
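The distinguishing feature of one-to-shareMany, the per-task output layers over a shared decoder state, can be sketched as follows. The toy numpy weights stand in for trained parameters:

```python
import numpy as np

class ShareManyHeads:
    """one-to-shareMany output side: the decoder state d_j is shared across
    tasks, but each task projects it through its own softmax layer over its
    own label vocabulary."""
    def __init__(self, hidden_size, vocab_sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = {task: rng.normal(size=(v, hidden_size))
                      for task, v in vocab_sizes.items()}

    def log_probs(self, task, d_j):
        logits = self.heads[task] @ d_j   # only the chosen task's head is used
        m = logits.max()
        return logits - np.log(np.exp(logits - m).sum()) - m  # log-softmax

heads = ShareManyHeads(hidden_size=8,
                       vocab_sizes={"AlexaMRL": 5, "EviMRL": 12})
lp = heads.log_probs("AlexaMRL", np.zeros(8))
```

Each minibatch only forwards through the selected task's head, which is why training avoids the single large concatenated softmax of the one-to-one setup.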

4 Experiments

4.1 Data Setup

Figure 3: Example utterances for the multiple semantic formalisms

We mainly consider two Alexa dependency-based semantic formalisms in use. The first is the Alexa meaning representation language (AlexaMRL), a lightweight formalism used for providing built-in functionality for developers to develop their own skills. The other formalism we consider is the one used by Evi, a question-answering system used in Alexa. Evi uses a proprietary formalism for semantic understanding; we will call this the Evi meaning representation language (EviMRL). Both these formalisms aim to represent natural language. While EviMRL is aligned with an internal schema specific to the knowledge base (KB), AlexaMRL is aligned with an RDF-based open-source ontology Guha et al. (2016). Figure 3 shows two example utterances and their parses in both EviMRL and AlexaMRL formalisms.

Our main task training set consists of utterances drawn from a fraction of our production data, annotated using AlexaMRL. For the EviMRL task, we have a substantially larger set of annotated utterances for training. We use held-out sets of utterances for AlexaMRL and EviMRL testing. To show the effectiveness of our proposed method, we also use the ATIS corpora, with its standard training and test utterances Zettlemoyer and Collins (2007), as the small task for our transfer learning framework. We also include an auxiliary task such as syntactic parsing in order to demonstrate the flexibility of the multi-task paradigm: we use WSJ training data for syntactic constituency parsing as the large task, similar to the corpus in vinyals2015grammar.

We use Tensorflow Abadi et al. (2016) in all our experiments, with extensions for the copy mechanism. Unless stated otherwise, we train all models for 10 epochs, with a fixed learning rate of 0.5 for the first 6 epochs, halved subsequently for every epoch. The minibatch size used is 128. The encoder and decoder use a 3-layer GRU with 512 hidden units. We apply dropout with probability 0.2 during training. All models are initialized with pre-trained 300-dimensional GloVe embeddings Pennington et al. (2014). We also use 300-dimensional label embeddings for the output labels, randomly initialized and learned during training. The input sequence is reversed before sending it to the encoder Vinyals et al. (2015). We use greedy search during decoding. The output label vocabularies for EviMRL and AlexaMRL differ in size; the multi-task setup uses a larger output vocabulary than the AlexaMRL independent task. We post-process the output of the decoder by balancing the brackets and determinizing the units of production to avoid duplicates.
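The bracket-balancing step can be sketched as follows. This is a simplified version under stated assumptions: the paper does not spell out its exact post-processing rules (and the determinization step is omitted here):

```python
def balance_brackets(tokens):
    """Post-process decoder output: drop unmatched ')' and append missing
    ')' so the decoded logical form is well-formed."""
    out, depth = [], 0
    for tok in tokens:
        if tok == ")":
            if depth == 0:
                continue              # unmatched close bracket: drop it
            depth -= 1
        elif tok == "(":
            depth += 1
        out.append(tok)
    return out + [")"] * depth        # close any still-open brackets

print(balance_brackets("( PlayMusic ( Artist madonna".split()))
# ['(', 'PlayMusic', '(', 'Artist', 'madonna', ')', ')']
```

Since greedy decoding offers no structural guarantees, a cheap deterministic repair like this keeps malformed outputs from being scored as outright failures.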

4.2 AlexaMRL Transfer Learning Experiments

Figure 4: Accuracy for AlexaMRL.

We first study the effectiveness of the multi-task architecture in a transfer learning setup. Here we consider EviMRL as the large source auxiliary task and AlexaMRL as the target task we want to transfer to. We consider various data sizes for the target task – 10k, 50k, 100k and 200k – obtained by downsampling. For each target data size, we compare a single-task setup, trained on the target task only, with the various multi-task setups from Section 3.2: independent, one-to-one, one-to-many, and one-to-shareMany. Figure 4 summarizes the results. The x-axis lists the four model architectures, and the y-axis is the accuracy. The positive number above the mark for one-to-one, one-to-many and one-to-shareMany represents the absolute accuracy gain compared with the independent model. For the independent model, we reduce the hidden layer size from 512 to 256 to optimize performance.

In all cases, the multi-task architectures provide accuracy improvements over the independent architecture. By jointly training across the two tasks, the model is able to leverage the richer syntactic/semantic structure of the larger task (EviMRL), resulting in an improved encoding of the input utterance that is then fed to the decoder resulting in improved accuracy over the smaller task (AlexaMRL).

We take this sharing further in the one-to-one and one-to-shareMany architectures by introducing shared decoder parameters, which forces the model to learn a common canonical representation for solving the semantic parsing task. Doing so, we see further gains across all data sizes in Figure 4. As the training data size for the target task increases, we tend to see relatively less gain from model sharing: the absolute gain from one-to-one and one-to-shareMany is largest for the 10k training case, and this gain shrinks as we move to 200k training data.

When we have a small amount of training data, one-to-shareMany provides better accuracy than one-to-one. For instance, we see absolute gains from one-to-one to one-to-shareMany for the 10k and 50k cases, while no gain is observed for the 100k and 200k training cases. This confirms the hypothesis that for small amounts of data, having a dedicated output layer helps guide the training.

Transfer learning works best when the source data is large, allowing the smaller task to leverage the rich representation of the larger task. However, as the training data size increases, the accuracy gains from the shared architectures become smaller: the largest absolute gain is observed in the smallest (10k) setting, and the improvements are almost halved as the target data grows toward 200k.

In Table 1, we summarize the number of parameters in each of the four model architectures and their step time, measured as the training time for one update on a minibatch of 128 on an Nvidia Tesla K80 GPU. As expected, we see comparable training times for one-to-many and one-to-shareMany, but a 10% step-time increase for one-to-one. We also see that one-to-one and one-to-shareMany have a similar number of parameters, about 15% smaller than one-to-many due to the sharing of weights. The one-to-shareMany architecture thus achieves increased sharing while still maintaining reasonable training speed per step.

Model architecture param. size step time
independent 15 million 0.51
one-to-many 33 million 0.66
one-to-one 28 million 0.71
one-to-shareMany 28 million 0.65
Table 1: Parameter size and training time comparison for independent and multi-task models

We also test the accuracy of EviMRL within the transfer learning framework. To our surprise, the EviMRL task also benefits from the AlexaMRL task: we observe an absolute increase in accuracy over the single-task sequence-to-sequence EviMRL baseline. This observation reinforces the hypothesis that combining data from different semantic formalisms helps the generalization of the model by capturing common sub-structures involved in solving semantic parsing tasks across multiple formalisms.

4.3 Transfer Learning Experiments on ATIS

Here, we apply the described transfer learning setups to the ATIS semantic parsing task Zettlemoyer and Collins (2007). We use a single GRU layer of 128 hidden states to train the independent model. During transfer learning, we increase the model size to two hidden layers, each with 512 hidden states. We adjust the minibatch size to 20, and the dropout rate to 0.2 for the independent model and 0.7 for the multi-task model. We post-process the model output, balancing the brackets and removing duplicates in the output. The initial learning rate was tuned to 0.8 using the dev set. Here, we only report accuracy numbers for the independent and one-to-shareMany frameworks. Correctness is based on denotation match at the utterance level. We summarize all results in Table 2.

System Test accuracy
Previous work
atis2007 84.6
kwiatkowski2011lexical 82.8
poon2013grounded 83.5
zhao2014type 84.2
jia2016data 83.3
dong2016language 84.2
Our work
Independent model 77.2
+ WSJ constituency parsing 79.7
+ EviMRL semantic parsing 84.2
Table 2: Accuracy on ATIS

Our independent model has an accuracy of 77.2% (Table 2), which is comparable to the baseline reported in jia2016data before their data recombination. To start, we first consider using a related but complementary task – syntactic constituency parsing – to help improve the semantic parsing task. By adding WSJ constituency parsing as an auxiliary task for ATIS, we see an improvement to 79.7% accuracy over the independent task baseline (Table 2). This demonstrates that the multi-task architecture is quite general and is not constrained to using semantic parsing as the auxiliary task. This is important, as it opens up the possibility of using significantly larger training data from tasks where acquiring labels is relatively easy.

We then add the EviMRL data to the multi-task setup as a third task, and we see a further improvement to 84.2% (Table 2), which is comparable to the published state of the art Zettlemoyer and Collins (2007) and matches the neural network setup in dong2016language.

5 Conclusion

We presented sequence-to-sequence architectures for transfer learning applied to semantic parsing. We explored multiple architectures for multi-task decoding and found that increased parameter sharing results in improved performance, especially when the target task has limited amounts of training data. We observed a 1.0-4.4% absolute accuracy improvement on our internal test set with 10k-200k training data. On ATIS, we observed a 7% absolute accuracy gain (from 77.2% to 84.2%).

The results demonstrate the capabilities of sequence-to-sequence modeling to capture a canonicalized representation between tasks, particularly when the architecture uses shared parameters across all its components. Furthermore, by utilizing an auxiliary task like syntactic parsing, we can improve the performance on the target semantic parsing task, showing that the sequence-to-sequence architecture effectively leverages the common structures of syntax and semantics. In future work, we want to use this architecture to build models in an incremental manner where the number of sub-tasks continually grows. We also want to explore auxiliary tasks across multiple languages so we can train multilingual semantic parsers simultaneously, and use transfer learning to combat labeled data sparsity.