A tensorflow implementation of a neural sequence-to-sequence parser for converting natural language queries to logical form.
Semantic parsing aims at mapping natural language to machine interpretable meaning representations. Traditional approaches rely on high-quality lexicons, manually-built templates, and linguistic features which are either domain- or representation-specific. In this paper we present a general method based on an attention-enhanced encoder-decoder model. We encode input utterances into vector representations, and generate their logical forms by conditioning the output sequences or trees on the encoding vectors. Experimental results on four datasets show that our approach performs competitively without using hand-engineered features and is easy to adapt across domains and meaning representations.
Semantic parsing is the task of translating text to a formal meaning representation such as logical forms or structured queries. There has recently been a surge of interest in developing machine learning methods for semantic parsing (see the references in Section 2), due in part to the existence of corpora containing utterances annotated with formal meaning representations. Figure 1 shows an example of a question (left-hand side) and its annotated logical form (right-hand side), taken from Jobs [Tang and Mooney2001], a well-known semantic parsing benchmark. In order to predict the correct logical form for a given utterance, most previous systems rely on predefined templates and manually designed features, which often render the parsing model domain- or representation-specific. In this work, we aim to use a simple yet effective method to bridge the gap between natural language and logical form with minimal domain knowledge.
Encoder-decoder architectures based on recurrent neural networks have been successfully applied to a variety of NLP tasks ranging from syntactic parsing [Vinyals et al.2015a], to machine translation [Kalchbrenner and Blunsom2013, Cho et al.2014, Sutskever et al.2014], and image description generation [Karpathy and Fei-Fei2015, Vinyals et al.2015b]. As shown in Figure 1, we adapt the general encoder-decoder paradigm to the semantic parsing task. Our model learns from natural language descriptions paired with meaning representations; it encodes sentences and decodes logical forms using recurrent neural networks with long short-term memory (LSTM) units. We present two model variants: the first treats semantic parsing as a vanilla sequence transduction task, whereas the second is equipped with a hierarchical tree decoder which explicitly captures the compositional structure of logical forms. We also introduce an attention mechanism [Bahdanau et al.2015, Luong et al.2015b] allowing the model to learn soft alignments between natural language and logical forms, and present an argument identification step to handle rare mentions of entities and numbers.
Evaluation results demonstrate that compared to previous methods our model achieves similar or better performance across datasets and meaning representations, despite using no hand-engineered domain- or representation-specific features.
Our work synthesizes two strands of research, namely semantic parsing and the encoder-decoder architecture with neural networks.
The problem of learning semantic parsers has received significant attention, dating back to Woods [1973]. Many approaches learn from sentences paired with logical forms following various modeling strategies. Examples include the use of parsing models [Miller et al.1996, Ge and Mooney2005, Lu et al.2008, Zhao and Huang2015], inductive logic programming [Zelle and Mooney1996, Tang and Mooney2000, Thompson and Mooney2003], probabilistic automata [He and Young2006], string/tree-to-tree transformation rules [Kate et al.2005], classifiers based on string kernels [Kate and Mooney2006], machine translation [Wong and Mooney2006, Wong and Mooney2007, Andreas et al.2013], and combinatory categorial grammar induction techniques [Zettlemoyer and Collins2005, Zettlemoyer and Collins2007, Kwiatkowski et al.2010, Kwiatkowski et al.2011]. Other work learns semantic parsers without relying on logical-form annotations, e.g., from sentences paired with conversational logs [Artzi and Zettlemoyer2011], system demonstrations [Chen and Mooney2011, Goldwasser and Roth2011, Artzi and Zettlemoyer2013], question-answer pairs [Clarke et al.2010, Liang et al.2013], and distant supervision [Krishnamurthy and Mitchell2012, Cai and Yates2013, Reddy et al.2014].
Our model learns from natural language descriptions paired with meaning representations. Most previous systems rely on high-quality lexicons, manually-built templates, and features which are either domain- or representation-specific. We instead present a general method that can be easily adapted to different domains and meaning representations. We adopt the general encoder-decoder framework based on neural networks which has been recently repurposed for various NLP tasks such as syntactic parsing [Vinyals et al.2015a], machine translation [Kalchbrenner and Blunsom2013, Cho et al.2014, Sutskever et al.2014], image description generation [Karpathy and Fei-Fei2015, Vinyals et al.2015b], question answering [Hermann et al.2015], and summarization [Rush et al.2015].
Mei et al. [2016] use a sequence-to-sequence model to map navigational instructions to actions. Our model works on more well-defined meaning representations (such as Prolog and lambda calculus) and is conceptually simpler; it does not employ bidirectionality or multi-level alignments. Grefenstette et al. [2014] propose a different architecture for semantic parsing based on the combination of two neural network models. The first model learns shared representations from pairs of questions and their translations into knowledge base queries, whereas the second model generates the queries conditioned on the learned representations. However, they do not report empirical evaluation results.
Our aim is to learn a model which maps natural language input $q = x_1 \cdots x_{|q|}$ to a logical form representation of its meaning $a = y_1 \cdots y_{|a|}$. The conditional probability $p(a|q)$ is decomposed as:

$$p(a \mid q) = \prod_{t=1}^{|a|} p(y_t \mid y_{<t}, q) \qquad (1)$$

where $y_{<t} = y_1 \cdots y_{t-1}$.
Our method consists of an encoder which encodes natural language input q into a vector representation and a decoder which learns to generate the output tokens $y_1, \ldots, y_{|a|}$ conditioned on the encoding vector. In the following we describe two models varying in the way in which $p(a|q)$ is computed.
This model regards both input and output as sequences. As shown in Figure 2, the encoder and decoder are two different L-layer recurrent neural networks with long short-term memory (LSTM) units which recursively process tokens one by one. The first $|q|$ time steps belong to the encoder, while the following $|a|$ time steps belong to the decoder. Let $h^l_t \in \mathbb{R}^n$ denote the hidden vector at time step $t$ and layer $l$. $h^l_t$ is then computed by:

$$h^l_t = \mathrm{LSTM}\big(h^l_{t-1}, h^{l-1}_t\big) \qquad (2)$$

where LSTM refers to the LSTM function being used. In our experiments we follow the architecture described in Zaremba et al. [2015], however other types of gated activation functions are possible (e.g., Cho et al. [2014]). For the encoder, $h^0_t = W_q e(x_t)$ is the word vector of the current input token, with $W_q \in \mathbb{R}^{n \times |V_q|}$ being a parameter matrix and $e(x_t)$ the index (a one-hot vector) of the corresponding token. For the decoder, $h^0_t = W_a e(y_{t-|q|-1})$ is the word vector of the previous predicted word, where $t > |q|$ and $W_a \in \mathbb{R}^{n \times |V_a|}$. Notice that the encoder and decoder have different LSTM parameters.
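For illustration, the following minimal Python sketch shows one time step of such an L-layer LSTM stack; the per-layer cell functions are hypothetical placeholders rather than the released implementation.

```python
def stacked_lstm_step(hidden_states, cell_states, x, lstm_cells):
    """One time step of an L-layer stack: h_t^l = LSTM(h_{t-1}^l, h_t^{l-1}).

    hidden_states, cell_states -- per-layer vectors h_{t-1}^l and c_{t-1}^l
    x          -- input vector for this time step (the word vector h_t^0)
    lstm_cells -- hypothetical per-layer functions (h_prev, c_prev, input) -> (h, c)
    """
    layer_input = x  # plays the role of h_t^0
    for l, cell in enumerate(lstm_cells):
        # layer l conditions on its own previous state and on the layer below
        hidden_states[l], cell_states[l] = cell(hidden_states[l], cell_states[l], layer_input)
        layer_input = hidden_states[l]
    return hidden_states, cell_states
```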
Once the tokens of the input sequence $x_1, \ldots, x_{|q|}$ are encoded into vectors, they are used to initialize the hidden states of the first time step in the decoder. Next, the hidden vector of the topmost LSTM $h^L_t$ in the decoder is used to predict the $t$-th output token as:

$$p(y_t \mid y_{<t}, q) = \mathrm{softmax}\big(W_o h^L_t\big)^{\top} e(y_t) \qquad (3)$$

where $W_o \in \mathbb{R}^{|V_a| \times n}$ is a parameter matrix, and $e(y_t)$ a one-hot vector for computing $y_t$'s probability from the predicted distribution.
We augment every sequence with a “start-of-sequence” <s> and “end-of-sequence” </s> token. The generation process terminates once </s> is predicted. The conditional probability of generating the whole sequence is then obtained using Equation (1).
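As an illustration of this generation process, a minimal greedy-decoding sketch is given below; `lstm_step`, the parameter matrices, and the vocabulary handling are simplified, hypothetical placeholders rather than the actual released code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_decode(init_state, lstm_step, W_a, W_o, vocab, max_len=100):
    """Generate output tokens one by one until </s> is predicted.

    init_state -- decoder state initialized from the encoder's final hidden states
    lstm_step  -- hypothetical function: (state, input_vector) -> (state, top_hidden)
    W_a        -- decoder word-embedding matrix, one column per output token
    W_o        -- output projection matrix used in the softmax classifier
    vocab      -- list of output tokens, assumed to contain '<s>' and '</s>'
    """
    state, prev, output = init_state, '<s>', []
    for _ in range(max_len):
        x = W_a[:, vocab.index(prev)]          # word vector of the previous prediction
        state, top_hidden = lstm_step(state, x)
        probs = softmax(W_o @ top_hidden)      # p(y_t | y_<t, q), Equation (3)
        prev = vocab[int(np.argmax(probs))]    # greedy choice
        if prev == '</s>':
            break
        output.append(prev)
    return output
```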
The Seq2Seq model has a potential drawback in that it ignores the hierarchical structure of logical forms. As a result, it needs to memorize various pieces of auxiliary information (e.g., bracket pairs) to generate well-formed output. In the following we present a hierarchical tree decoder which is more faithful to the compositional nature of meaning representations. A schematic description of the model is shown in Figure 3.
The present model shares the same encoder with the sequence-to-sequence model described in Section 3.1 (essentially it learns to encode input q as vectors). However, its decoder is fundamentally different as it generates logical forms in a top-down manner. In order to represent tree structure, we define a “nonterminal” <n> token which indicates subtrees. As shown in Figure 3, we preprocess the logical form “lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci))” to a tree by replacing tokens between pairs of brackets with nonterminals. Special tokens <s> and <(> denote the beginning of a sequence and a nonterminal sequence, respectively (omitted from Figure 3 due to lack of space). Token </s> represents the end of sequence.
After encoding input q, the hierarchical tree decoder uses recurrent neural networks to generate tokens at depth 1 of the subtree corresponding to parts of logical form a. If the predicted token is <n>, we decode its subsequence by conditioning on the nonterminal's hidden vector. This process terminates when no more nonterminals are emitted. In other words, a sequence decoder is used to hierarchically generate the tree structure.
In contrast to the sequence decoder described in Section 3.1, the current hidden state does not only depend on its previous time step. In order to better utilize the parent nonterminal’s information, we introduce a parent-feeding connection where the hidden vector of the parent nonterminal is concatenated with the inputs and fed into LSTM.
As an example, Figure 4 shows the decoding tree corresponding to the logical form “A B (C)”, where A, B, and C are predicted tokens, and $t_1, t_2, \ldots$ denote different time steps. The span “(C)” corresponds to a subtree. Decoding in this example has two steps: once input q has been encoded, we first generate the depth-1 sequence “A B <n> </s>” until token </s> is predicted; next, we generate the subsequence “C </s>” by conditioning on the nonterminal <n>'s hidden vector. The probability $p(a|q)$ is the product of these two sequence decoding steps:

$$p(a \mid q) = p(\text{A B <n> </s>} \mid q)\; p(\text{C </s>} \mid \text{<n>}, q) \qquad (4)$$

where Equation (3) is used for the prediction of each output token.
As shown in Equation (3), the hidden vectors of the input sequence are not directly used in the decoding process. However, it makes intuitive sense to consider relevant information from the input to better predict the current token. Following this idea, various techniques have been proposed to integrate encoder-side information (in the form of a context vector) at each time step of the decoder [Bahdanau et al.2015, Luong et al.2015b, Xu et al.2015].
As shown in Figure 5, in order to find relevant encoder-side context for the current hidden state $h^L_t$ of the decoder, we compute its attention score with the $k$-th hidden state in the encoder as:

$$s^t_k = \frac{\exp\{h^L_k \cdot h^L_t\}}{\sum_{j=1}^{|q|} \exp\{h^L_j \cdot h^L_t\}} \qquad (5)$$

where $h^L_1, \ldots, h^L_{|q|}$ are the top-layer hidden vectors of the encoder. Then, the context vector is the weighted sum of the hidden vectors in the encoder:

$$c^t = \sum_{k=1}^{|q|} s^t_k\, h^L_k \qquad (6)$$

In lieu of Equation (3), we further use this context vector, which acts as a summary of the encoder, to compute the probability of generating $y_t$ as:

$$h^{att}_t = \tanh\big(W_1 h^L_t + W_2 c^t\big) \qquad (7)$$

$$p(y_t \mid y_{<t}, q) = \mathrm{softmax}\big(W_o h^{att}_t\big)^{\top} e(y_t) \qquad (8)$$

where $W_1, W_2 \in \mathbb{R}^{n \times n}$ and $W_o \in \mathbb{R}^{|V_a| \times n}$ are three parameter matrices, and $e(y_t)$ is a one-hot vector used to obtain $y_t$'s probability.
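A small numpy sketch of this attention computation (Equations (5)–(8)) is shown below; shapes follow the notation above, and the code is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(enc_hiddens, h_dec, W1, W2, W_o):
    """One attention-enhanced prediction step.

    enc_hiddens -- (|q|, n) matrix of top-layer encoder hidden vectors
    h_dec       -- (n,) current top-layer decoder hidden vector h_t^L
    W1, W2      -- (n, n) parameter matrices
    W_o         -- (|V_a|, n) output parameter matrix
    """
    scores = softmax(enc_hiddens @ h_dec)         # attention scores s^t, Eq. (5)
    context = scores @ enc_hiddens                # context vector c^t, Eq. (6)
    h_att = np.tanh(W1 @ h_dec + W2 @ context)    # Eq. (7)
    return softmax(W_o @ h_att), scores           # output distribution, Eq. (8)
```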
Our goal is to maximize the likelihood of the generated logical forms given natural language utterances as input. The objective function is:

$$\text{minimize} \; -\sum_{(q,a) \in D} \log p(a \mid q) \qquad (9)$$

where $D$ is the set of all natural language–logical form training pairs, and $p(a|q)$ is computed as shown in Equation (1).
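A minimal sketch of this objective is given below; `model_token_probs` is a hypothetical helper standing in for the decoder's per-token probabilities, not part of the released code.

```python
import numpy as np

def sequence_nll(token_probs):
    """Negative log-likelihood of one logical form, given the per-token
    probabilities p(y_t | y_<t, q) of its gold tokens (cf. Equation (1))."""
    return -float(np.sum(np.log(token_probs)))

def training_objective(pairs, model_token_probs):
    """Objective of Equation (9) over a set D of (utterance, logical form) pairs.

    model_token_probs -- hypothetical function returning the decoder's
                         probabilities for the gold tokens of `a` given `q`
    """
    return sum(sequence_nll(model_token_probs(q, a)) for q, a in pairs)
```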
The RMSProp algorithm [Tieleman and Hinton2012] is employed to solve this non-convex optimization problem. Moreover, dropout is used for regularizing the model [Zaremba et al.2015]. Specifically, dropout operators are used between different LSTM layers and for the hidden layers before the softmax classifiers. This technique can substantially reduce overfitting, especially on datasets of small size.
At test time, we predict the logical form for an input utterance q by:

$$\hat{a} = \arg\max_{a'} \; p(a' \mid q) \qquad (10)$$

where $a'$ represents a candidate output. However, it is impractical to iterate over all possible results to obtain the optimal prediction. According to Equation (1), we decompose the probability $p(a|q)$ so that we can use greedy search (or beam search) to generate tokens one by one.
Algorithm 1 describes the decoding process for Seq2Tree. The time complexity of both decoders is O(|a|), where |a| is the length of the output. The extra computation of Seq2Tree compared with Seq2Seq is the maintenance of the nonterminal queue, which is negligible because most of the time is spent on matrix operations. We implement the hierarchical tree decoder in batch mode, so that it can fully utilize GPUs. Specifically, as shown in Algorithm 1, at each step we pop multiple nonterminals from the queue and decode them in one batch.
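Since Algorithm 1 itself is not reproduced in this excerpt, the following simplified, sequential (non-batched) Python sketch conveys the nonterminal-queue idea; `decode_sequence` is a hypothetical helper that greedily decodes one token sequence and returns its hidden vectors.

```python
from collections import deque

def seq2tree_decode(encoder_state, decode_sequence, max_expansions=500):
    """Top-down decoding with a queue of nonterminals (in the spirit of Algorithm 1).

    encoder_state   -- vector encoding of the input utterance q
    decode_sequence -- hypothetical helper: given a conditioning hidden vector,
                       greedily decodes one token sequence ending in '</s>' and
                       returns (tokens, hidden_vectors)
    """
    tokens, hiddens = decode_sequence(encoder_state)
    tree = {'tokens': tokens, 'children': {}}
    queue = deque((tree, i, h) for i, (tok, h) in enumerate(zip(tokens, hiddens))
                  if tok == '<n>')
    expansions = 0
    while queue and expansions < max_expansions:
        node, idx, parent_hidden = queue.popleft()
        # expand the nonterminal by decoding a sequence conditioned on its hidden vector
        child_tokens, child_hiddens = decode_sequence(parent_hidden)
        child = {'tokens': child_tokens, 'children': {}}
        node['children'][idx] = child
        expansions += 1
        # any nonterminals produced in the child sequence are queued for later expansion
        for j, (tok, h) in enumerate(zip(child_tokens, child_hiddens)):
            if tok == '<n>':
                queue.append((child, j, h))
    return tree
```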
The majority of semantic parsing datasets have been developed with question-answering in mind. In the typical application setting, natural language questions are mapped into logical forms and executed on a knowledge base to obtain an answer. Due to the nature of the question-answering task, many natural language utterances contain entities or numbers that are often parsed as arguments in the logical form. Some of them are unavoidably rare or do not appear in the training set at all (this is especially true for small-scale datasets). Conventional sequence encoders simply replace rare words with a special unknown word symbol [Luong et al.2015a, Jean et al.2015], which would be detrimental for semantic parsing.
We have developed a simple procedure for argument identification. Specifically, we identify entities and numbers in input questions and replace them with their type names and unique IDs. For instance, we pre-process the training example “jobs with a salary of 40000” and its logical form “job(ANS), salary_greater_than(ANS, 40000, year)” as “jobs with a salary of num” and “job(ANS), salary_greater_than(ANS, num, year)”. We use the pre-processed examples as training data. At inference time, we also mask entities and numbers with their types and IDs. Once we obtain the decoding result, a post-processing step maps all type markers back to their corresponding logical constants.
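The sketch below illustrates the masking and post-processing idea for numeric arguments only; entity identification (cities, dates, and so on) would additionally require type lexicons, and the marker names are illustrative.

```python
import re

# A minimal sketch of argument identification for numbers; numbered markers
# (num0, num1, ...) are one way to keep the IDs unique within a sentence.
NUM_PATTERN = re.compile(r'\b\d+\b')

def mask_arguments(question):
    """Replace literal numbers with typed placeholders and remember the mapping."""
    mapping = {}
    def repl(match):
        marker = 'num{}'.format(len(mapping))
        mapping[marker] = match.group(0)
        return marker
    return NUM_PATTERN.sub(repl, question), mapping

def unmask_arguments(logical_form, mapping):
    """Post-processing: restore markers in the decoded logical form."""
    for marker, value in mapping.items():
        logical_form = logical_form.replace(marker, value)
    return logical_form

# Example usage (hypothetical):
masked, table = mask_arguments("jobs with a salary of 40000")
# masked == "jobs with a salary of num0", table == {"num0": "40000"}
```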
We compare our method against multiple previous systems on four datasets. We describe these datasets below, and present our experimental settings and results. Finally, we conduct model analysis in order to understand what the model learns. The code is available at https://github.com/donglixp/lang2logic.
Our model was trained on the following datasets, covering different domains and using different meaning representations. Examples for each domain are shown in Table 1.
This benchmark dataset contains queries to a database of job listings. Specifically, questions are paired with Prolog-style queries. We used the same training-test split as Zettlemoyer and Collins [2005]. Values for the variables company, degree, language, platform, location, job area, and number are identified.
This is a standard semantic parsing benchmark which contains queries to a database of U.S. geography. Geo is split into a training set and a test set following Zettlemoyer and Collins [2005]. We used the same meaning representation based on lambda calculus as Kwiatkowski et al. [2011]. Values for the variables city, state, country, river, and number are identified.
This dataset contains queries to a flight booking system, with a standard split into training, development, and test sets. Sentences are paired with lambda-calculus expressions. Values for the variables date, time, city, aircraft code, airport, airline, and number are identified.
Quirk et al. [2015] created this dataset by extracting a large number of if-this-then-that recipes from the Ifttt website (http://www.ifttt.com). Recipes are simple programs with exactly one trigger and one action which users specify on the site. Whenever the conditions of the trigger are satisfied, the action is performed. Actions typically revolve around home security (e.g., “turn on my lights when I arrive home”), automation (e.g., “text me if the door opens”), well-being (e.g., “remind me to drink water if I’ve been at a bar for more than two hours”), and so on. Triggers and actions are selected from a wide range of channels representing various types of services, devices (e.g., Android), and knowledge sources (such as ESPN or Gmail), and the dataset covers a large number of trigger and action functions drawn from these channels. We used Quirk et al.’s original split into training, development, and test examples. The Ifttt programs are represented as abstract syntax trees and are paired with natural language descriptions provided by users (see Table 1). Here, numbers and URLs are identified.
Natural language sentences were lowercased; misspellings were corrected using a dictionary based on the Wikipedia list of common misspellings. Words were stemmed using NLTK [Bird et al.2009]. For Ifttt, we filtered tokens, channels, and functions which appeared less than five times in the training set. For the other datasets, we filtered input words which did not occur at least two times in the training set, but kept all tokens in the logical forms. Plain string matching was employed to identify arguments as described in Section 3.6. More sophisticated approaches could be used, but we leave this to future work.
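A rough sketch of this preprocessing pipeline (lowercasing, stemming, and rare-word filtering) is shown below; mapping filtered words to an `<unk>` symbol is an assumption, and misspelling correction is omitted.

```python
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(sentences, min_count=2):
    """Lowercase, stem, and replace rare input words with <unk> (a sketch).

    min_count -- words appearing fewer than this many times are filtered;
                 the threshold of 2 mirrors the setting described for
                 Jobs, Geo, and Atis above.
    """
    tokenized = [[stemmer.stem(w) for w in s.lower().split()] for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    return [[w if counts[w] >= min_count else '<unk>' for w in toks]
            for toks in tokenized]
```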
Model hyper-parameters were cross-validated on the training set for Jobs and Geo. We used the standard development sets for Atis and Ifttt. The parameters were updated with the mini-batch RMSProp algorithm, and gradients were clipped to alleviate the exploding gradient problem [Pascanu et al.2013]. Parameters were randomly initialized from a uniform distribution. A two-layer LSTM was used for Ifttt, while a one-layer LSTM was employed for the other domains. The dropout rate and the dimensions of hidden vectors and word embeddings were selected by cross-validation, and early stopping was used to determine the number of epochs. Input sentences were reversed before being fed into the encoder [Sutskever et al.2014]. We used greedy search to generate logical forms during inference. Notice that two decoders with shared word embeddings were used to predict triggers and actions for Ifttt, and two softmax classifiers were used to classify channels and functions.
Previous systems on Jobs (accuracy, %):

| Method | Accuracy |
| COCKTAIL [Tang and Mooney2001] | 79.4 |
| PRECISE [Popescu et al.2003] | 88.0 |
| ZC05 [Zettlemoyer and Collins2005] | 79.3 |
| DCS+L [Liang et al.2013] | 90.7 |
| TISP [Zhao and Huang2015] | 85.0 |

Previous systems on Geo (accuracy, %):

| Method | Accuracy |
| SCISSOR [Ge and Mooney2005] | 72.3 |
| KRISP [Kate and Mooney2006] | 71.7 |
| WASP [Wong and Mooney2006] | 74.8 |
| λ-WASP [Wong and Mooney2007] | 86.6 |
| LNLZ08 [Lu et al.2008] | 81.8 |
| ZC05 [Zettlemoyer and Collins2005] | 79.3 |
| ZC07 [Zettlemoyer and Collins2007] | 86.1 |
| UBL [Kwiatkowski et al.2010] | 87.9 |
| FUBL [Kwiatkowski et al.2011] | 88.6 |
| KCAZ13 [Kwiatkowski et al.2013] | 89.0 |
| DCS+L [Liang et al.2013] | 87.9 |
| TISP [Zhao and Huang2015] | 88.9 |

Previous systems on Atis (accuracy, %):

| Method | Accuracy |
| ZC07 [Zettlemoyer and Collins2007] | 84.6 |
| UBL [Kwiatkowski et al.2010] | 71.4 |
| FUBL [Kwiatkowski et al.2011] | 82.8 |
| TISP [Zhao and Huang2015] | 84.2 |
We first discuss the performance of our model on Jobs, Geo, and Atis, and then examine our results on Ifttt. Tables 2–4 present comparisons against a variety of systems previously described in the literature. We report results with the full models (Seq2Seq, Seq2Tree) and two ablation variants, i.e., without an attention mechanism (−attention) and without argument identification (−argument). We report accuracy, defined as the proportion of input sentences that are correctly parsed to their gold standard logical forms. Notice that DCS+L, KCAZ13 and GUSP output answers directly, so accuracy in this setting is defined as the percentage of correct answers.
Overall, Seq2Tree is superior to Seq2Seq. This is to be expected since Seq2Tree explicitly models compositional structure. On the Jobs and Geo datasets, which contain logical forms with nested structures, Seq2Tree outperforms Seq2Seq by 2.9% and 2.5%, respectively. Seq2Tree achieves better accuracy over Seq2Seq on Atis too; however, the difference is smaller, since Atis is a simpler domain without complex nested structures. We find that adding attention substantially improves performance on all three datasets. This underlines the importance of utilizing soft alignments between inputs and outputs. We further analyze what the attention layer learns in Figure 6. Moreover, our results show that argument identification is critical for small-scale datasets. For example, many city names appear only a handful of times in the Geo training set, so it is difficult to learn reliable parameters for these words. In relation to previous work, the proposed models achieve comparable or better performance. Importantly, we use the same framework (Seq2Seq or Seq2Tree) across datasets and meaning representations (Prolog-style logical forms in Jobs and lambda calculus in the other two datasets) without modification. Despite this relatively simple approach, we observe that Seq2Tree ranks second on Jobs, and is tied for first place with ZC07 on Atis.
We illustrate examples of alignments produced by Seq2Seq in Figures 6a and 6b. Alignments produced by Seq2Tree are shown in Figures 6c and 6d. Matrices of attention scores are computed using Equation (5) and are represented in grayscale. Aligned input words and logical form predicates are enclosed in (same color) rectangles whose overlapping areas contain the attention scores. Also notice that attention scores are computed by LSTM hidden vectors which encode context information rather than just the words in their current positions. The examples demonstrate that the attention mechanism can successfully model the correspondence between sentences and logical forms, capturing reordering (Figure 6b), many-to-many (Figure 6a), and many-to-one alignments (Figures 6c and 6d).
For Ifttt, we follow the same evaluation protocol introduced by Quirk et al. [2015]. The dataset is extremely noisy and measuring accuracy is problematic since predicted abstract syntax trees (ASTs) almost never exactly match the gold standard. Quirk et al. view an AST as a set of productions and compute balanced F1 instead, which we also adopt. The first column in Table 5 shows the percentage of channels selected correctly for both triggers and actions. The second column measures accuracy for both channels and functions. The last column shows balanced F1 against the gold tree over all productions in the proposed derivation. We compare our model against posclass, the method introduced in Quirk et al., and several of their baselines. posclass is reminiscent of KRISP [Kate and Mooney2006]: it learns distributions over productions given input sentences represented as a bag of linguistic features. The retrieval baseline finds the closest description in the training data based on character string-edit-distance and returns the recipe for that training program. The phrasal method uses phrase-based machine translation to generate the recipe, whereas sync extracts synchronous grammar rules from the data, essentially recreating WASP [Wong and Mooney2006]. Finally, they use a binary classifier to predict whether a production should be present in the derivation tree corresponding to the description.
Quirk et al. [2015] report results on the full test data and on smaller subsets after noise filtering, e.g., when non-English and unintelligible descriptions are removed (Tables 5a and 5b). They also ran their system on a high-quality subset of description-program pairs where the gold standard program was independently reproduced by at least three humans (Table 5c). Across all subsets our models outperform posclass and related baselines. Again we observe that Seq2Tree consistently outperforms Seq2Seq, albeit with a small margin. Compared to the previous datasets, the attention mechanism and our argument identification method yield less of an improvement. This may be due to the size of Ifttt and the way it was created: user-curated descriptions are often of low quality, and thus align very loosely to their corresponding ASTs.
Finally, we inspected the output of our model in order to identify the most common causes of errors which we summarize below.
The attention model used in our experiments does not take the alignment history into consideration. So some question words, especially in longer questions, may be ignored in the decoding process. This is a common problem for encoder-decoder models and can be addressed by explicitly modelling the decoding coverage of the source words [Tu et al.2016, Cohn et al.2016]. Keeping track of the attention history would help adjust future attention and guide the decoder towards untranslated source words.
Some mentions are incorrectly identified as arguments. For example, the word “may” is sometimes identified as a month when it is simply a modal verb. Moreover, some argument mentions are ambiguous; for instance, 6 o’clock can be used to express either 6 am or 6 pm. We could disambiguate arguments based on contextual information. The execution results of logical forms could also help prune unreasonable arguments.
Because the data size of Jobs, Geo, and Atis is relatively small, some question words are rare in the training set, which makes it hard to estimate reliable parameters for them. One solution would be to learn word embeddings on unannotated text data, and then use these as pretrained vectors for question words.
In this paper we presented an encoder-decoder neural network model for mapping natural language descriptions to their meaning representations. We encode natural language utterances into vectors and generate their corresponding logical forms as sequences or trees using recurrent neural networks with long short-term memory units. Experimental results show that enhancing the model with a hierarchical tree decoder and an attention mechanism improves performance across the board. Extensive comparisons with previous methods show that our approach performs competitively, without recourse to domain- or representation-specific features. Directions for future work are many and varied. For example, it would be interesting to learn a model from question-answer pairs without access to target logical forms. Beyond semantic parsing, we would also like to apply our Seq2Tree model to related structured prediction tasks such as constituency parsing.
We would like to thank Luke Zettlemoyer and Tom Kwiatkowski for sharing the ATIS dataset. The support of the European Research Council under award number 681760 “Translating Multiple Modalities into Text” is gratefully acknowledged.
Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1(1):49–62.