Unsupervised Data Generated for GeoQuery and SAIL Datasets
We present a novel semi-supervised approach for sequence transduction and apply it to semantic parsing. The unsupervised component is based on a generative model in which latent sentences generate the unpaired logical forms. We apply this method to a number of semantic parsing tasks focusing on domains with limited access to labelled training data and extend those datasets with synthetically generated logical forms.
Neural approaches, in particular attention-based sequence-to-sequence models, have shown great promise and obtained state-of-the-art performance for sequence transduction tasks including machine translation [Bahdanau et al. 2015], syntactic constituency parsing [Vinyals et al. 2015], and semantic role labelling [Zhou and Xu 2015]. A key requirement for effectively training such models is an abundance of supervised data.
In this paper we focus on learning mappings from input sequences $x$ to output sequences $y$ in domains where the latter are easily obtained, but annotation in the form of $(x, y)$ pairs is sparse or expensive to produce, and propose a novel architecture that accommodates semi-supervised training on sequence transduction tasks. To this end, we augment the transduction objective ($x \mapsto y$) with an autoencoding objective where the input sequence is treated as a latent variable ($y \mapsto x \mapsto y$), enabling training from both labelled pairs and unpaired output sequences. This setting is common when we encode natural language into a logical form governed by some grammar or database.
While such an autoencoder could in principle be constructed by stacking two sequence transducers, modelling the latent variable as a series of discrete symbols drawn from multinomial distributions creates serious computational challenges, as it requires marginalising over the space of latent sequences $\Sigma_x^*$. To avoid this intractable marginalisation, we introduce a novel differentiable alternative for draws from a softmax which can be used with the reparametrisation trick of Kingma and Welling [2014]. Rather than drawing a discrete symbol in $\Sigma_x$ from a softmax, we draw a distribution over symbols from a logistic-normal distribution at each time step. These serve as continuous relaxations of discrete samples, providing a differentiable estimator of the expected reconstruction log likelihood.
Table 1: Examples from the datasets.
|Dataset||Example|
|Geo||what are the high points of states surrounding mississippi|
|NLmaps||Where are kindergartens in Hamburg?|
|SAIL||turn right at the bench into the yellow tiled hall|
|SAIL (action sequence)||FORWARD - FORWARD - RIGHT - STOP|
We demonstrate the effectiveness of our proposed model on three semantic parsing tasks: the GeoQuery benchmark [Zelle and Mooney 1996, Wong and Mooney 2006], the SAIL maze navigation task [MacMahon et al. 2006] and the Natural Language Querying corpus [Haas and Riezler 2016] on OpenStreetMap. As part of our evaluation, we introduce simple mechanisms for generating large amounts of unsupervised training data for two of these tasks.
In most settings, the semi-supervised model outperforms the supervised model, both when trained on additional generated data and when trained on subsets of the existing data.
Our sequential autoencoder is shown in Figure 1. At a high level, it can be seen as two sequence-to-sequence models with attention [Bahdanau et al. 2015] chained together. More precisely, the model consists of four LSTMs [Hochreiter and Schmidhuber 1997], hence the name Seq4. The first, a bidirectional LSTM, encodes the sequence $y$; next, an LSTM with stochastic output, described below, draws a sequence of distributions $\tilde{x}$ over words in vocabulary $\Sigma_x$. The third LSTM encodes these distributions for the last one to attend over and reconstruct $y$ as $\hat{y}$. We now give the details of these parts.
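The first LSTM of the encoder half of the model reads the sequence $y$, represented as a sequence of one-hot vectors over the vocabulary $\Sigma_y$, using a bidirectional RNN into a sequence of vectors $h^y_{1:L_y}$, where $L_y$ is the sequence length of $y$,

$$h^y_t = \left( f^{\rightarrow}_y(y_t, h^{y,\rightarrow}_{t-1});\ f^{\leftarrow}_y(y_t, h^{y,\leftarrow}_{t+1}) \right) \tag{1}$$

where $f^{\rightarrow}_y$ and $f^{\leftarrow}_y$ are non-linear functions applied at each time step to the current token $y_t$ and their recurrent states $h^{y,\rightarrow}_{t-1}$ and $h^{y,\leftarrow}_{t+1}$, respectively.
Both the forward and backward functions project the one-hot vector into a dense vector via an embedding matrix, which serves as input to an LSTM.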
Subsequently, we wish to predict $x$. Predicting a discrete sequence of symbols through draws from multinomial distributions over a vocabulary is not an option, as we would not be able to backpropagate through this discrete choice. Marginalising over the possible latent strings, or estimating the gradient through naïve Monte Carlo methods, would be a prohibitively high-variance process, because the number of strings is exponential in the maximum length (which we would have to specify manually), with the vocabulary size as base. To allow backpropagation, we instead predict a sequence of distributions $\tilde{x}$ over the symbols of $\Sigma_x$ with an RNN attending over the encoder outputs $h^y = h^y_{1:L_y}$, which will later serve to reconstruct $y$:

$$\tilde{x} = q(x \mid y) = \prod_{t=1}^{L_x} q(\tilde{x}_t \mid \{\tilde{x}_1, \dots, \tilde{x}_{t-1}\}, h^y) \tag{2}$$

where $q(x \mid y)$ models the mapping $y \mapsto x$. We define $q(\tilde{x}_t \mid \{\tilde{x}_1, \dots, \tilde{x}_{t-1}\}, h^y)$ in the following way:
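Let the vector $\tilde{x}_t$ be a distribution over the vocabulary $\Sigma_x$ drawn from a logistic-normal distribution (that is, the exponentiated and normalised normal distribution, obtained by taking a softmax of a normal draw), the parameters of which, $\mu_t, \log(\sigma^2)_t \in \mathbb{R}^{|\Sigma_x|}$, are predicted by an LSTM attending over the outputs of the encoder (Equation 2), where $|\Sigma_x|$ is the size of the vocabulary $\Sigma_x$. The use of a logistic-normal distribution serves to regularise the model in the semi-supervised learning regime, which is described at the end of this section. Formally, this process, depicted in Figure 2, is as follows:

$$h^{\tilde{x}}_t = f_{\tilde{x}}(\tilde{x}_{t-1}, h^{\tilde{x}}_{t-1}, h^y) \tag{3}$$
$$\mu_t, \log(\sigma^2_t) = l(h^{\tilde{x}}_t) \tag{4}$$
$$\epsilon \sim \mathcal{N}(0, I) \tag{5}$$
$$\gamma_t = \mu_t + \sigma_t \epsilon \tag{6}$$
$$\tilde{x}_t = \mathrm{softmax}(\gamma_t) \tag{7}$$

where the function $f_{\tilde{x}}$ is an LSTM and $l$ is a linear transformation to $\mathbb{R}^{2|\Sigma_x|}$. We use the reparametrisation trick of Kingma and Welling [2014] to draw from the logistic normal, allowing us to backpropagate through the sampling process.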
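The sampling step described above can be illustrated with a minimal NumPy sketch. It assumes the mean and log-variance for one time step have already been predicted (here they are hand-set, hypothetical values), and it only shows the forward draw: noise from a standard normal, a shift and scale, then a softmax. The function name and the vocabulary size are illustrative, not taken from the paper.

```python
import numpy as np

def sample_logistic_normal(mu, log_var, rng):
    """Draw one distribution over the vocabulary via the reparametrisation trick."""
    eps = rng.standard_normal(mu.shape)        # eps ~ N(0, I), the only source of randomness
    gamma = mu + np.exp(0.5 * log_var) * eps   # gamma ~ N(mu, diag(sigma^2))
    gamma = gamma - gamma.max()                # shift for a numerically stable softmax
    probs = np.exp(gamma)
    return probs / probs.sum()                 # softmax(gamma): a point on the simplex

rng = np.random.default_rng(0)
vocab_size = 5                                 # hypothetical vocabulary size
mu = np.zeros(vocab_size)                      # hypothetical predicted mean
log_var = np.full(vocab_size, -2.0)            # hypothetical predicted log-variance
x_tilde = sample_logistic_normal(mu, log_var, rng)
```

Because the randomness enters only through `eps`, the output is a deterministic, differentiable function of `mu` and `log_var`, which is what allows gradients to flow through the draw in an automatic-differentiation framework.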
Moving on to the decoder part of our model, in the third LSTM we embed $\tilde{x}$ (multiplying the distribution over words by an embedding matrix averages the word embeddings of the entire vocabulary, weighted by their probabilities) and encode it:

$$h^{x}_t = \left( f^{\rightarrow}_x(\tilde{x}_t, h^{x,\rightarrow}_{t-1});\ f^{\leftarrow}_x(\tilde{x}_t, h^{x,\leftarrow}_{t+1}) \right) \tag{8}$$

When $x$ is observed, during supervised training and also when making predictions, instead of the distribution $\tilde{x}$ we feed the one-hot encoded $x$ to this part of the model.
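The embedding step can be made concrete with a small NumPy sketch. It assumes a randomly initialised embedding matrix with one row per vocabulary word (sizes are hypothetical) and shows that the same matrix product handles both cases: a distribution yields a probability-weighted average of word embeddings, while a one-hot vector selects a single row.

```python
import numpy as np

vocab_size, embed_dim = 4, 3                       # hypothetical sizes
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim))   # embedding matrix, one row per word

def embed(dist_or_one_hot, embedding_matrix):
    """Embed a distribution over words (or a one-hot vector) with one matrix product."""
    return dist_or_one_hot @ embedding_matrix

x_tilde_t = np.array([0.7, 0.1, 0.1, 0.1])   # unsupervised case: a drawn distribution
soft = embed(x_tilde_t, E)                   # weighted average of all word embeddings

one_hot = np.eye(vocab_size)[2]              # supervised case: observed word with index 2
hard = embed(one_hot, E)                     # exactly row 2 of E
```

Using one code path for both inputs means the later layers never need to know whether the latent sequence was observed or sampled.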
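In the final LSTM, we decode into $\hat{y}$:

$$p(\hat{y} \mid \tilde{x}) = \prod_{t=1}^{L_y} p(\hat{y}_t \mid \{\hat{y}_1, \dots, \hat{y}_{t-1}\}, h^{x}) \tag{9}$$

Equation 9 is implemented as an LSTM attending over the encoded latent sequence $h^{x}$, producing a sequence of symbols $\hat{y}$ based on recurrent states $h^{\hat{y}}$, aiming to reproduce the input $y$:

$$h^{\hat{y}}_t = f_{\hat{y}}(\hat{y}_{t-1}, h^{\hat{y}}_{t-1}, h^{x}) \tag{10}$$
$$\hat{y}_t \sim \mathrm{softmax}(l'(h^{\hat{y}}_t)) \tag{11}$$

where $f_{\hat{y}}$ is the non-linear function, and the actual probabilities are given by a softmax function after a linear transformation $l'$ of $h^{\hat{y}}$. At training time, rather than $\hat{y}_{t-1}$ we feed the ground truth $y_{t-1}$.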
The complete model described in this section gives a reconstruction function $y \mapsto \hat{y}$. We define a loss on this reconstruction which accommodates the unsupervised case, where $x$ is not observed in the training data, and the supervised case, where $(x, y)$ pairs are available. Together, these allow us to train the Seq4 model in a semi-supervised setting, which experiments will show provides some benefits over a purely supervised training regime.
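When $x$ isn't observed, the loss we minimise during training is the reconstruction loss on $y$, expressed as the negative log-likelihood $\mathrm{NLL}(\hat{y}, y)$ of the true labels $y$ relative to the predictions $\hat{y}$. To this, we add as a regularising term the KL divergence $\mathrm{KL}[q(\gamma \mid y) \,\|\, p(\gamma)]$, which effectively penalises the mean and variance of $q(\gamma \mid y)$ for diverging from those of a prior $p(\gamma)$, which we model as a diagonal Gaussian $\mathcal{N}(0, I)$. This has the effect of smoothing the logistic-normal distribution from which we draw the distributions over symbols of $x$, guarding against overfitting of the latent distributions over $x$ to symbols seen in the supervised case discussed below. The unsupervised loss is therefore formalised as

$$\mathcal{L}^{\mathrm{unsup}} = \mathrm{NLL}(\hat{y}, y) + \alpha\, \mathrm{KL}[q(\gamma \mid y) \,\|\, p(\gamma)]$$

where the regularising factor $\alpha$ is tuned on validation data, and

$$\mathrm{KL}[q(\gamma \mid y) \,\|\, p(\gamma)] = \sum_{i=1}^{L_x} \mathrm{KL}[q(\gamma_i \mid y) \,\|\, p(\gamma)]$$

We use a closed form of these individual KL divergences, described by Kingma and Welling [2014].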
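For reference, the closed form for the KL divergence between a diagonal Gaussian and a standard normal prior is the standard expression $-\tfrac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$. The sketch below applies it per time step and adds the weighted sum to a reconstruction term. The function names, the array layout (one row per latent time step) and the weight name `alpha` are assumptions made for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)], summed over the last axis."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def unsupervised_loss(nll_y, mu, log_var, alpha):
    """Reconstruction NLL plus the KL terms of all latent time steps, weighted by alpha.

    mu and log_var are assumed to have shape (latent_length, vocab_size).
    """
    return nll_y + alpha * kl_to_standard_normal(mu, log_var).sum()

# A posterior that equals the prior incurs no penalty.
zero_penalty = kl_to_standard_normal(np.zeros(4), np.zeros(4))
```

The penalty grows as the predicted means move away from zero or the variances move away from one, which is the smoothing effect the text attributes to this term.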
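When $x$ is observed, we additionally minimise the prediction loss on $x$, expressed as the negative log-likelihood $\mathrm{NLL}(\tilde{x}, x)$ of the true labels $x$ relative to the predictions $\tilde{x}$, and do not impose the KL loss. The supervised loss is thus

$$\mathcal{L}^{\mathrm{sup}} = \mathrm{NLL}(\tilde{x}, x) + \mathrm{NLL}(\hat{y}, y)$$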
In both the supervised and unsupervised case, because of the continuous relaxation on generating $\tilde{x}$ and the reparametrisation trick, the gradient of the losses with regard to the model parameters is well defined throughout Seq4.
We train with a weighted combination of the supervised and unsupervised losses described above. Once trained, we simply use the decoder segment of the model to predict $y$ from sequences of symbols $x$ represented as one-hot vectors. When the decoder is trained without the encoder in a fully supervised manner, it serves as our supervised sequence-to-sequence baseline model under the name S2S.
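As a rough illustration of how the two training signals could be combined, the sketch below computes a negative log-likelihood from per-step predicted distributions and mixes a supervised and an unsupervised loss value with a single weight. The paper does not state the form of the weighting, so `combined_loss` and its `weight` parameter are assumptions, as are all function names.

```python
import numpy as np

def nll(pred_probs, target_ids):
    """Negative log-likelihood of target symbols under per-step predicted distributions.

    pred_probs has shape (length, vocab_size); target_ids holds one index per step.
    """
    steps = np.arange(len(target_ids))
    return -np.sum(np.log(pred_probs[steps, target_ids]))

def supervised_loss(x_probs, x_ids, y_probs, y_ids):
    """Prediction loss on the latent sequence plus reconstruction loss on the output."""
    return nll(x_probs, x_ids) + nll(y_probs, y_ids)

def combined_loss(sup, unsup, weight):
    """Hypothetical weighting scheme: supervised loss plus a scaled unsupervised loss."""
    return sup + weight * unsup
```

In practice the two losses would be computed on different batches (labelled pairs and unpaired outputs), and the weight would be treated as a hyperparameter.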
We apply our model to three tasks outlined in this section. Moreover, we explain how we generated additional unsupervised training data for two of these tasks. Examples from all datasets are in Table 1.
The first task we consider is the prediction of a query on the Geo corpus, which is a frequently used benchmark for semantic parsing. The corpus contains 880 questions about US geography together with executable queries representing those questions. We follow the approach established by Zettlemoyer and Collins [2005] and split the corpus into 600 training and 280 test cases. Following common practice, we augment the dataset by referring to the database during training and test time. In particular, we use the database to identify and anonymise variables (cities, states, countries and rivers) following the method described in Dong and Lapata [2016].
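A toy sketch of the anonymisation idea follows. It is not the procedure of the cited work: the lexicon below is a hypothetical stand-in for a database lookup, only single-token entities are handled, and each entity type is assumed to occur at most once per sentence. Placeholder names follow the `_STATE_` style seen in the paper's examples; `_CITY_` is assumed by analogy.

```python
# Hypothetical lexicon standing in for a lookup against the geography database.
ENTITY_TYPES = {"texas": "_STATE_", "mississippi": "_STATE_", "austin": "_CITY_"}

def anonymise(tokens):
    """Replace known entity tokens with typed placeholders and remember the originals."""
    mapping, out = {}, []
    for tok in tokens:
        placeholder = ENTITY_TYPES.get(tok)
        if placeholder is None:
            out.append(tok)
        else:
            mapping[placeholder] = tok   # assumes at most one entity per type
            out.append(placeholder)
    return out, mapping

def deanonymise(tokens, mapping):
    """Restore the original entities in a predicted query."""
    return [mapping.get(tok, tok) for tok in tokens]

question, entities = anonymise("what is the capital of texas".split())
```

The same mapping is applied to the question and to the query, so the model only has to learn to copy a placeholder rather than memorise every entity name.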
Most prior work on the Geo corpus relies on standard semantic parsing methods together with custom heuristics or pipelines for this corpus. The recent paper by Dong and Lapata [2016] is of note, as it uses a sequence-to-sequence model for training which is the unidirectional equivalent to S2S, and also to the decoder part of our Seq4 network.
The second task we tackle with our model is the NLmaps dataset by Haas and Riezler [2016]. The dataset contains 1,500 training and 880 testing instances of natural language questions with corresponding machine-readable queries over the geographical OpenStreetMap database. The dataset contains natural language questions in both English and German, but we focus only on single-language semantic parsing, similar to the first task in Haas and Riezler [2016]. We use the data as it is, with the only pre-processing step being the tokenization of both natural language and query form (we removed quotes, added spaces around (), and separated the question mark from the last word in each question).
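The three pre-processing steps are simple enough to sketch with regular expressions. The version below is an approximation under stated assumptions: it treats both single and double quote characters as "quotes" (which would also strip apostrophes inside words), and it only detaches a question mark at the very end of the string.

```python
import re

def tokenize(text):
    """Approximate the pre-processing described above and split on whitespace."""
    text = text.replace('"', "").replace("'", "")   # remove quotes (both kinds, by assumption)
    text = re.sub(r"([()])", r" \1 ", text)         # add spaces around parentheses
    text = re.sub(r"\?\s*$", " ?", text)            # detach a trailing question mark
    return text.split()

tokens = tokenize("Where are kindergartens in Hamburg?")
```

The same function can be applied to the query side, where the parenthesis rule turns nested function calls into a flat token sequence.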
The SAIL corpus and task were developed to train agents to follow free-form navigational route instructions in a maze environment [MacMahon et al. 2006, Chen and Mooney 2011]. It consists of a small number of mazes containing features such as objects, wall and floor types. These mazes come together with a large number of human instructions paired with the required actions (there are four actions: LEFT, RIGHT, GO, STOP) to reach the goal state described in those instructions.
We use the sentence-aligned version of the SAIL route instruction dataset containing 3,236 sentences [Chen and Mooney 2011]. Following previous work, we accept an action sequence as correct if and only if the final position and orientation exactly match those of the gold data. We do not perform any pre-processing on this dataset.
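The correctness criterion compares end states rather than action strings, so two different action sequences can both count as correct. The sketch below illustrates this on an idealised open grid: walls are ignored, the coordinate and heading conventions are assumptions, and both `GO` and `FORWARD` are accepted as the move action because the source uses both spellings.

```python
# Headings in clockwise order as (dx, dy); this convention is an assumption.
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]

def execute(actions, x, y, heading):
    """Run an action sequence from a start pose on an open grid (no walls modelled)."""
    for action in actions:
        if action == "LEFT":
            heading = (heading - 1) % 4
        elif action == "RIGHT":
            heading = (heading + 1) % 4
        elif action in ("GO", "FORWARD"):
            dx, dy = HEADINGS[heading]
            x, y = x + dx, y + dy
        elif action == "STOP":
            break
        else:
            raise ValueError(f"unknown action: {action}")
    return x, y, heading

def is_correct(predicted, gold, start):
    """Correct if and only if final position and orientation match the gold end state."""
    return execute(predicted, *start) == execute(gold, *start)

same_end_state = is_correct(["LEFT", "LEFT", "STOP"], ["RIGHT", "RIGHT", "STOP"], (0, 0, 0))
```

A faithful evaluator would additionally consult the maze layout so that moves into walls are handled as in the original environment.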
As argued earlier, we focus on tasks where aligned data is sparse and expensive to obtain, while unsupervised, monomodal data should be cheap to get. Although that is a reasonable assumption for real-world data, the datasets considered have no such component; the approach taken here is therefore to generate random database queries or maze paths, i.e. the machine-readable side of the data, and train a semi-supervised model. The alternative, not explored here, would be to generate natural language questions or instructions instead, but that is more difficult to achieve without human intervention. For this reason, we generate the machine-readable side of the data for the GeoQuery and SAIL tasks (our randomly generated unsupervised datasets can be downloaded from http://deepmind.com/publications).
For GeoQuery, we fit a 3-gram Kneser-Ney [Chen and Goodman 1999] model to the queries in the training set and sample about 7 million queries from it. We ensure that the sampled queries are different from the training queries, but do not enforce validity. This intentionally simplistic approach serves to demonstrate the applicability of our model.
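The generation recipe can be sketched with a much simpler language model. The code below fits an unsmoothed maximum-likelihood trigram model (not Kneser-Ney, which is swapped out here for brevity), samples token sequences from it, and rejects any sample that already appears in the training set. The toy queries are written in the style of the paper's examples; function names and the `max_tries` cap are illustrative.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"

def fit_trigrams(queries):
    """Count next-token frequencies for every two-token context."""
    counts = defaultdict(Counter)
    for query in queries:
        tokens = [BOS, BOS] + query.split() + [EOS]
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            counts[(a, b)][c] += 1
    return counts

def sample_query(counts, rng, max_len=50):
    """Sample one query token by token until the end marker or max_len."""
    context, out = (BOS, BOS), []
    while len(out) < max_len:
        options = counts[context]
        token = rng.choices(list(options), weights=list(options.values()))[0]
        if token == EOS:
            break
        out.append(token)
        context = (context[1], token)
    return " ".join(out)

def sample_new_queries(train_queries, n, seed=0, max_tries=10000):
    """Collect n sampled queries that differ from every training query."""
    rng = random.Random(seed)
    counts = fit_trigrams(train_queries)
    seen, result = set(train_queries), []
    for _ in range(max_tries):
        if len(result) == n:
            break
        query = sample_query(counts, rng)
        if query not in seen:        # validity of the query is not checked
            result.append(query)
    return result

train = [
    "answer smallest city loc_2 state stateid _STATE_",
    "answer city loc_2 state next_to_2 stateid _STATE_",
]
generated = sample_new_queries(train, n=5)
```

Because the two toy queries share the context `loc_2 state`, the model can recombine their halves into queries that never occurred in training, which is the effect the sampling step relies on.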
The SAIL dataset has only three mazes. We added a fourth one and over 150k random paths, including duplicates. The new maze is larger than the existing ones, and seeks to approximately replicate the key statistics of the other three mazes (maximum corridor length, distribution of objects, etc.). Paths within that maze are created by randomly sampling start and end positions.
We evaluate our model on the three tasks in multiple settings. First, we establish a supervised baseline to compare the S2S model with prior work. Next, we train our Seq4 model in a semi-supervised setting on the entire dataset with the additional monomodal training data described in the previous section.
Finally, we perform an “ablation” study where we discard some of the training data and compare S2S to Seq4. S2S is trained solely on the reduced data in a supervised manner, while Seq4 is once again trained semi-supervised on the same reduced data plus the machine readable part of the discarded data (Seq4-) or on the extra generated data (Seq4+).
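The data handling behind this ablation can be sketched as follows. The function keeps a fraction of the labelled pairs and returns only the machine-readable side of the discarded pairs, which corresponds to the input of the Seq4- setting. The shuffling, the rounding rule and all names are assumptions for illustration.

```python
import random

def ablation_split(pairs, keep_fraction, seed=0):
    """Split (text, logical_form) pairs into a labelled subset and unlabelled logical forms."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_keep = int(round(keep_fraction * len(shuffled)))
    labelled = shuffled[:n_keep]                      # used with supervision
    unlabelled = [y for _, y in shuffled[n_keep:]]    # natural language side thrown away
    return labelled, unlabelled

pairs = [(f"question {i}", f"query_{i}") for i in range(20)]   # placeholder data
labelled, unlabelled = ablation_split(pairs, keep_fraction=0.25)
```

The supervised baseline would be trained on `labelled` only, while the semi-supervised model would additionally see `unlabelled` (or generated data in the Seq4+ setting).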
We train the model using standard gradient descent methods. As none of the datasets used here contain development sets, we tune hyperparameters by cross-validating on the training data. In the case of the SAIL corpus we train on three folds (in each fold, two mazes for training and validation and one for testing) and report weighted results across the folds following prior work [Mei et al. 2016].
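A small sketch of the fold construction and of the size-weighted aggregation is given below. The maze identifiers are hypothetical, and the weighting is assumed to be by number of test instances, which makes the weighted average equal to the pooled accuracy over all folds.

```python
def maze_folds(mazes):
    """One fold per maze: hold that maze out for testing, use the rest for training/validation."""
    return [([m for m in mazes if m != held_out], held_out) for held_out in mazes]

def weighted_accuracy(fold_results):
    """Aggregate (n_correct, n_test) pairs, weighting each fold by its test-set size."""
    total_correct = sum(correct for correct, _ in fold_results)
    total = sum(n for _, n in fold_results)
    return total_correct / total

folds = maze_folds(["A", "B", "C"])                  # hypothetical maze names
overall = weighted_accuracy([(50, 100), (30, 50)])   # made-up counts for illustration
```

Weighting by test size prevents a small held-out maze from having the same influence on the reported figure as a large one.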
|Jia and Liang [2016] (used hand-crafted grammars to generate additional supervised training data)||89.3|
The evaluation metric for GeoQuery is the accuracy of exactly predicting the machine-readable query. As results in Table 2 show, our supervised S2S baseline model performs slightly better than the comparable model by Dong and Lapata [2016]. The semi-supervised Seq4 model with the additional generated queries improves on it further.
The ablation study in Table 3 demonstrates a widening gap between the supervised and semi-supervised models as the amount of labelled training data gets smaller. This suggests that our model can leverage unlabelled data even when only a small amount of labelled data is available.
We report results for the NLmaps corpus in Table 4, comparing the supervised S2S model to the results reported by Haas and Riezler [2016]. While their model used a semantic parsing pipeline including alignment, stemming, language modelling and CFG inference, the strong performance of the S2S model demonstrates the strength of fairly vanilla attention-based sequence-to-sequence models. It should be pointed out that the previous work reports the number of correct answers when queries were executed against the dataset, while we evaluate on the strict accuracy of the generated queries. While we expect these numbers to be nearly equivalent, our evaluation is strictly harder as it does not allow for reordering of query arguments and similar relaxations.
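The strict metric amounts to exact string (or token-sequence) equality between predicted and gold queries. The sketch below shows why it is harder than execution-based scoring: a prediction with reordered arguments counts as wrong even if it might return the same answer. The toy queries are hypothetical and not taken from either corpus.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that are identical to their reference query."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

gold = ["query ( a , b )", "query ( c )"]
pred = ["query ( b , a )", "query ( c )"]   # first prediction only reorders arguments
accuracy = exact_match_accuracy(pred, gold)
```

An execution-based metric could credit the first prediction if both orderings retrieve the same result, whereas exact match does not.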
We investigate the Seq4 model only via the ablation study in Table 5 and find little gain through the semi-supervised objective. Our attempt at cheaply generating unsupervised data for this task was not successful, likely due to the complexity of the underlying database.
The experiments for the SAIL task differ slightly from the other two tasks in that the language input does not suffice for choosing an action. While a simple instruction such as ‘turn left’ can easily be translated into the action sequence LEFT-STOP, more complex instructions such as ‘Walk forward until you see a lamp’ require knowledge of the agent’s position in the maze.
To accomplish this we modify the model as follows. First, when encoding action sequences, we concatenate each action with a representation of the maze at the given position, representing the maze-state akin to Mei et al. [2016] with a bag-of-features vector. Second, when decoding action sequences, the RNN outputs an action which is used to update the agent’s position and the representation of that new position is fed into the RNN as its next input.
We cross-validate over the three mazes in the dataset and report overall results weighted by test size (cf. Mei et al. [2016]). Both our supervised and semi-supervised model perform worse than the state-of-the-art (see Table 6), but the latter enjoys a comfortable margin over the former. As the S2S model broadly reimplements the work of Mei et al. [2016], we put the discrepancy in performance down to the particular design choices that we did not follow in order to keep the model here as general as possible and comparable across tasks.
The ablation studies (Table 7) show little gain for the semi-supervised approach when only using data from the original training set, but substantial improvement with the additional unsupervised data.
|Input from unsupervised data ($y$)||Generated latent representation ($x$)|
|answer smallest city loc_2 state stateid _STATE_||what is the smallest city in the state of _STATE_ /S|
|answer city loc_2 state next_to_2 stateid _STATE_||what are the cities in states which border _STATE_ /S|
|answer mountain loc_2 countryid _COUNTRY_||what is the lakes in _COUNTRY_ /S|
|answer state next_to_2 state all||which states longer states show peak states to /S|
The prediction accuracies of our supervised baseline S2S model are mixed with respect to prior results on their respective tasks. For GeoQuery, S2S performs significantly better than the most similar model from the literature [Dong and Lapata 2016], mostly due to the fact that $y$ and $x$ are encoded with bidirectional LSTMs. With a unidirectional LSTM we get similar results to theirs.
On the SAIL corpus, S2S performs worse than the state of the art. As the models are broadly equivalent, we attribute this difference to a number of task-specific choices and optimisations made in Mei et al. [2016] (in particular, we don’t use beam search and ensembling), which we did not reimplement for the sake of using a common model across all three tasks.
For NLmaps, S2S performs much better than the state-of-the-art, exceeding the previous best result by 11% despite a very simple tokenization method and a lack of any form of entity anonymisation.
In the case of both GeoQuery and the SAIL task, we found the semi-supervised model to convincingly outperform the fully supervised model. The effect was particularly notable in the case of the SAIL corpus, where accuracy improved by a comfortable margin (see Table 6). It is worth remembering that the supervised training regime consists of three folds of tuning on two maps with subsequent testing on the third map, which carries a risk of overfitting to the training maps. The introduction of the fourth unsupervised map clearly mitigates this effect. Table 8 shows some examples of unsupervised logical forms being transformed into natural language, which demonstrate how the model can learn to sensibly ground unsupervised data.
The experiments with additional unsupervised data prove the feasibility of our approach and clearly demonstrate the usefulness of the Seq4 model for the general class of sequence-to-sequence tasks where supervised data is hard to come by. To analyse the model further, we also look at the performance of both S2S and Seq4 when reducing the amount of supervised training data available to the model. We compare three settings: the supervised S2S model with reduced training data, Seq4- which uses the removed training data in an unsupervised fashion (throwing away the natural language) and Seq4+ which uses the randomly generated unsupervised data described in Section 3. The S2S model behaves as expected on all three tasks, its performance dropping with the size of the training data. The performance of Seq4- and Seq4+ requires more analysis.
In the case of GeoQuery, having unlabelled data from the true distribution (Seq4-) is a good thing when there is enough of it, as clearly seen when only 5% of the original dataset is used for supervised training and the remaining 95% is used for unsupervised training. The gap shrinks as the amount of supervised data is increased, which is as expected. On the other hand, using a large amount of extra, generated data from an approximating distribution (Seq4+) does not help as much initially when compared with the unsupervised data from the true distribution. However, as the size of the unsupervised dataset in Seq4- becomes the bottleneck this gap closes and eventually the model trained on the extra data achieves higher accuracy.
For the SAIL task the semi-supervised models do better than the supervised results throughout, with the model trained on randomly generated additional data consistently outperforming the model trained only on the original data. This gives further credence to the risk of overfitting to the training mazes already mentioned above.
Finally, in the case of the NLmaps corpus, the semi-supervised approach does not appear to help much at any point during the ablation. These indistinguishable results are likely due to the task’s complexity: at the lower percentages the ablation experiments have too little supervised data to sufficiently ground the latent space and make use of the unsupervised data, while at the higher percentages they have too little unsupervised data to meaningfully improve the model.
The tasks in this paper all broadly belong to the domain of semantic parsing, which describes the process of mapping natural language to a formal representation of its meaning. This is extended in the SAIL navigation task, where the formal representation is a function of both the language instruction and a given environment.
Semantic parsing is a well-studied problem with numerous approaches including inductive logic programming [Zelle and Mooney 1996], string-to-tree [Galley et al. 2004] and string-to-graph [Jones et al. 2012] transducers, grammar induction [Kwiatkowski et al. 2011, Artzi and Zettlemoyer 2013, Reddy et al. 2014] or machine translation [Wong and Mooney 2006, Andreas et al. 2013].
While a large body of relevant literature focuses on defining the grammar of the logical forms [Zettlemoyer and Collins 2005], other models learn purely from aligned pairs of text and logical form [Berant and Liang 2014], or from more weakly supervised signals such as question-answer pairs together with a database [Liang et al. 2011]. Recent work by Jia and Liang [2016] induces a synchronous context-free grammar and generates additional training examples, which is one way to address data scarcity issues. The semi-supervised setup proposed here offers an alternative solution to this issue.
Very recently there has been some related work on discrete autoencoders for natural language processing [Suster et al. 2016, Marcheggiani and Titov 2016, i.a.]. This work presents a first approach to using effectively discretised sequential information as the latent representation without resorting to draconian assumptions [Ammar et al. 2014] to make marginalisation tractable. While our model is not exactly marginalisable either, the continuous relaxation makes training far more tractable. A related idea was recently presented in Gulcehre et al. [2015], who use monolingual data to improve machine translation by fusing a sequence-to-sequence model and a language model.
We described a method for augmenting a supervised sequence transduction objective with an autoencoding objective, thereby enabling semi-supervised training where previously a scarcity of aligned data might have held back model performance. Across multiple semantic parsing tasks we demonstrated the effectiveness of this approach, improving model performance by training on randomly generated unsupervised data in addition to the original data.
Going forward it would be interesting to further analyse the effects of sampling from a logistic-normal distribution as opposed to a softmax in order to better understand how this impacts the distribution in the latent space. While we focused on tasks with little supervised data and additional unsupervised data in $y$, it would be straightforward to reverse the model to train it with additional labelled data in $x$, i.e. on the natural language side. A natural extension would also be a formulation where semi-supervised training was performed in both $x$ and $y$. For instance, machine translation lends itself to such a formulation where for many language pairs parallel data may be scarce while there is an abundance of monolingual data.
Learning Compact Lexicons for CCG Semantic Parsing. In Proceedings of EMNLP, October.
End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks. In Proceedings of ACL.