The Web as a Knowledge-base for Answering Complex Questions

03/18/2018 ∙ by Alon Talmor, et al. ∙ Tel Aviv University

Answering complex questions is a time-consuming activity for humans that requires reasoning and integration of information. Recent work on reading comprehension made headway in answering simple questions, but tackling complex questions is still an ongoing research challenge. Conversely, semantic parsers have been successful at handling compositionality, but only when the information resides in a target knowledge-base. In this paper, we present a novel framework for answering broad and complex questions, assuming answering simple questions is possible using a search engine and a reading comprehension model. We propose to decompose complex questions into a sequence of simple questions, and compute the final answer from the sequence of answers. To illustrate the viability of our approach, we create a new dataset of complex questions, ComplexWebQuestions, and present a model that decomposes questions and interacts with the web to compute an answer. We empirically demonstrate that question decomposition improves performance from 20.8 precision@1 to 27.5 precision@1 on this new dataset.


1 Introduction

Humans often want to answer complex questions that require reasoning over multiple pieces of evidence, e.g., “From what country is the winner of the Australian Open women’s singles 2008?”. Answering such questions in broad domains can be quite onerous for humans, because it requires searching and integrating information from multiple sources.

Recently, interest in question answering (QA) has surged in the context of reading comprehension (RC), where an answer is sought for a question given one or more documents Hermann et al. (2015); Joshi et al. (2017); Rajpurkar et al. (2016). Neural models trained over large datasets led to great progress in RC, nearing human-level performance Wang et al. (2017). However, analysis of models revealed Jia and Liang (2017); Chen et al. (2016) that they mostly excel at matching questions to local contexts, but struggle with questions that require reasoning. Moreover, RC assumes documents with the information relevant for the answer are available – but when questions are complex, even retrieving the documents can be difficult.

Conversely, work on QA through semantic parsing has focused primarily on compositionality: questions are translated to compositional programs that encode a sequence of actions for finding the answer in a knowledge-base (KB) Zelle and Mooney (1996); Zettlemoyer and Collins (2005); Artzi and Zettlemoyer (2013); Krishnamurthy and Mitchell (2012); Kwiatkowski et al. (2013); Liang et al. (2011). However, this reliance on a manually-curated KB has limited the coverage and applicability of semantic parsers.

Figure 1: Given a complex question q, we decompose it into a sequence of simple questions q_1, …, q_n, use a search engine and a QA model to answer the simple questions, and compute the final answer from the sequence of answers.

In this paper we present a framework for QA that is broad, i.e., it does not assume information is in a KB or in retrieved documents, and compositional, i.e., to compute an answer we must perform some computation or reasoning. Our thesis is that answering simple questions can be achieved by combining a search engine with an RC model. Thus, answering complex questions can be addressed by decomposing the question into a sequence of simple questions, and computing the answer from the corresponding answers. Figure 1 illustrates this idea. Our model decomposes the question in the figure into a sequence of simple questions, each of which is submitted to a search engine, and an answer is extracted from the search results. Once all answers are gathered, a final answer can be computed using symbolic operations such as union and intersection.

To evaluate our framework we need a dataset of complex questions that calls for reasoning over multiple pieces of information. Because an adequate dataset is missing, we created ComplexWebQuestions, a new dataset for complex questions that builds on WebQuestionsSP, a dataset that includes pairs of simple questions and their corresponding SPARQL query. We take SPARQL queries from WebQuestionsSP and automatically create more complex queries that include phenomena such as function composition, conjunctions, superlatives and comparatives. Then, we use Amazon Mechanical Turk (AMT) to generate natural language questions, and obtain a dataset of 34,689 question-answer pairs (and also SPARQL queries that our model ignores). Data analysis shows that examples are diverse and that AMT workers perform substantial paraphrasing of the original machine-generated question.

We propose a model for answering complex questions through question decomposition. Our model uses a sequence-to-sequence architecture Sutskever et al. (2014) to map utterances to short programs that indicate how to decompose the question and compose the retrieved answers. To obtain supervision for our model, we perform a noisy alignment from machine-generated questions to natural language questions and automatically generate noisy supervision for training (we defer training from question-answer pairs to future work).

We evaluate our model on ComplexWebQuestions and find that question decomposition substantially improves precision@1 from 20.8 to 27.5. We find that humans are able to reach 63.0 precision@1 under a limited time budget, leaving ample room for improvement in future work.

To summarize, our main contributions are:

  1. A framework for answering complex questions through question decomposition.

  2. A sequence-to-sequence model for question decomposition that substantially improves performance.

  3. A dataset of 34,689 examples of complex and broad questions, along with answers, web snippets, and SPARQL queries.

Our dataset, ComplexWebQuestions, can be downloaded from http://nlp.cs.tau.ac.il/compwebq and our codebase can be downloaded from https://github.com/alontalmor/WebAsKB.

2 Problem Formulation

Our goal is to learn a model that, given a question q and a black-box QA model for answering simple questions, SimpQA, produces a computation tree (defined below) that decomposes the question and computes the answer. The model is trained from a set of question-computation tree pairs or question-answer pairs.

Figure 2: A computation tree for “What city is the birthplace of the author of ‘Without end’, and hosted Euro 2012?”. The leaves are strings, and inner nodes are functions (red) applied to their children to produce answers (blue).

A computation tree is a tree where leaves are labeled with strings, and inner nodes are labeled with functions. The arguments of a function are its children sub-trees. To compute an answer, or denotation, from a tree, we recursively apply the function at the root to its children. More formally, given a tree rooted at a node labeled with the function f and having children t_1, …, t_k, the denotation of the tree is f applied to the denotations of its children. Denotations are computed recursively, and the denotation of a string at a leaf is the string itself. This is closely related to “semantic functions” in semantic parsing Berant and Liang (2015), except that we do not interact with a KB, but rather compute directly over the breadth of the web through a search engine.
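To make the recursion concrete, the evaluation can be sketched as follows in Python (a minimal sketch; the Node class and FUNCTIONS registry are our own illustrative names, not part of the paper):

```python
# Minimal sketch of computation-tree evaluation (illustrative, not the paper's code).
# FUNCTIONS maps an inner-node label to a callable; a leaf denotes itself.

FUNCTIONS = {}  # e.g., later populated with the SimpQA, Comp, Conj functions of Section 3


class Node:
    def __init__(self, label, children=()):
        self.label = label                 # function name at inner nodes, string at leaves
        self.children = list(children)


def denotation(node):
    """Recursively compute the denotation of a computation tree."""
    if not node.children:                  # leaf: the denotation of a string is the string itself
        return node.label
    args = [denotation(child) for child in node.children]
    return FUNCTIONS[node.label](*args)
```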

Figure 2 provides an example computation tree for our running example. Notice that words at the leaves are not necessarily in the original question, e.g., “city” is paraphrased to “cities”. More broadly, our framework allows paraphrasing questions in any way that is helpful for the function SimpQA. Paraphrasing for better interaction with a QA model has recently been suggested by Buck et al. (2017) and Nogueira and Cho (2016).

We defined SimpQA as the function for answering simple questions, but in this work it comprises two components. First, the question is submitted to a search engine that retrieves a list of web snippets. Next, an RC model extracts the answer from the snippets. While it is possible to train the RC model jointly with question decomposition, in this work we pre-train it separately, and later treat it as a black box.
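Viewed as a pipeline, SimpQA can be sketched as below, where the search and RC components are placeholders standing in for the search engine snippets and the pre-trained RC model:

```python
def simp_qa(question, search, rc_model, top_k=10):
    """Answer a simple question by retrieving web snippets and extracting answers.

    `search` and `rc_model` are placeholder callables for a web search API and a
    pre-trained reading-comprehension model (treated here as a black box).
    Returns a dict mapping candidate answer strings to confidence scores.
    """
    snippets = search(question)[:top_k]        # retrieve a list of web snippets
    candidates = rc_model(question, snippets)  # {answer string: score}
    return dict(sorted(candidates.items(), key=lambda kv: -kv[1]))
```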

The expressivity of our QA model is determined by the functions used, which we turn to next.

3 Formal Language

Functions in our formal language take arguments and return values that can be strings (when decomposing or re-phrasing the question), sets of strings, or sets of numbers. Our set of functions, sketched in code after this list, includes:

  1. SimpQA: a model for answering simple questions, which takes a string argument and returns a set of strings or numbers as the answer.

  2. Comp: This function takes a string containing one unique variable VAR, and a set of answers. E.g., in Figure 2 the first argument is “birthplace of VAR”, and the second argument is {Ken Follett, Adam Zagajewski}. The function replaces the variable with each answer string representation and returns the union of the resulting answer sets. Formally, Comp(q, A) = ∪_{a∈A} SimpQA(q[VAR→a]), where q[VAR→a] denotes the string produced when replacing VAR in q with a. This is similar to function composition in CCG Steedman (2000), or a join operation in λ-DCS Liang (2013), where the string is a function applied to previously-computed values.

  3. Conj: takes two sets and returns their intersection. Other set operations can be defined analogously. As syntactic sugar, we allow Conj() to take strings as input, which means that we run SimpQA to obtain a set and then perform the intersection. The root node in Figure 2 illustrates an application of Conj.

  4. Add: takes two singleton sets of numbers and returns a set with their sum. Similar mathematical functions can be defined analogously. While we support mathematical operations, they were not required in our dataset.
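The following is a minimal sketch of Comp and Conj on top of a black-box simp_qa, assuming simp_qa returns an iterable of answer strings (e.g., the keys of the scored dictionary sketched earlier); it is illustrative rather than the paper's implementation:

```python
def comp(template, answers, simp_qa):
    """Replace VAR in `template` with each answer and return the union of the results."""
    result = set()
    for answer in answers:
        result |= set(simp_qa(template.replace("VAR", answer)))
    return result


def conj(left, right, simp_qa):
    """Intersect two answer sets; string arguments are first answered with simp_qa."""
    left_set = set(simp_qa(left)) if isinstance(left, str) else set(left)
    right_set = set(simp_qa(right)) if isinstance(right, str) else set(right)
    return left_set & right_set
```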

Other logical operations

In semantic parsing superlative and comparative questions like “What is the highest European mountain?” or “What European mountains are higher than Mont Blanc?” are answered by joining the set of European mountains with their elevation. While we could add such functions to the formal language, answering such questions from the web is cumbersome: we would have to extract a list of entities and a numerical value for each. Instead, we handle such constructions using SimpQA directly, assuming they are mentioned verbatim on some web document.

Similarly, negation questions (“What countries are not in the OECD?”) are difficult to handle when working against a search engine only, as this is an open world setup and we do not hold a closed set of countries over which we can perform set subtraction.

In future work, we plan to interface with tables Pasupat and Liang (2015) and KBs Zhong et al. (2017). This will allow us to perform set operations over well-defined sets, and handle in a compositional manner superlatives and comparatives.

4 Dataset

Evaluating our framework requires a dataset of broad and complex questions that examine the importance of question decomposition. While many QA datasets have been developed recently Yang et al. (2015); Rajpurkar et al. (2016); Hewlett et al. (2016); Nguyen et al. (2016); Onishi et al. (2016); Hill et al. (2015); Welbl et al. (2017), they lack a focus on the importance of question decomposition.

Most RC datasets contain simple questions that can be answered from a short input document. Recently, TriviaQA Joshi et al. (2017) introduced a larger proportion of complex questions, but still most do not require reasoning. Moreover, the focus of TriviaQA is on answer extraction from documents that are given. We, conversely, highlight question decomposition for finding the relevant documents. Put differently, RC is complementary to question decomposition and can be used as part of the implementation of SimpQA. In Section 6 we demonstrate that question decomposition is useful for two different RC approaches.

4.1 Dataset collection

Figure 3: Overview of data collection process.

To generate complex questions we use the dataset WebQuestionsSP Yih et al. (2016), which contains 4,737 questions paired with SPARQL queries for Freebase Bollacker et al. (2008). Questions are broad but simple. Thus, we sample question-query pairs, automatically create more complex SPARQL queries, automatically generate questions that are understandable to AMT workers, and then have workers paraphrase those into natural language (similar to Wang et al. (2015)). We compute answers by executing the complex SPARQL queries against Freebase, and obtain broad and complex questions. Figure 3 provides an overview of this procedure (a detailed example appears in Figure 6 in the supplementary material), which we elaborate on next.

Compositionality | Example (natural language)
Conjunction | “What films star Taylor Lautner and have costume designs by Nina Proctor?”
Superlative | “Which school that Sir Ernest Rutherford attended has the latest founding date?”
Comparative | “Which of the countries bordering Mexico have an army size of less than 1050?”
Composition | “Where is the end of the river that originates in Shannon Pot?”
Table 1: The four compositionality types used to generate complex SPARQL queries from seed queries, with an example NL question for each type.

Generating SPARQL queries

Given a SPARQL query q, we create four types of more complex queries: conjunctions, superlatives, comparatives, and compositions. Table 1 lists the compositionality types with an example for each; further details are given in the supplementary material. For conjunctions, superlatives, and comparatives, we identify queries in WebQuestionsSP whose denotation is a set A, and generate a new query whose denotation is a strict subset of A. For conjunctions this is done by traversing the KB and looking for SPARQL triplets that can be added to q and yield a valid subset. For comparatives and superlatives we find a numerical property common to all answers in A, and add a triplet and a restrictor to q accordingly. For compositions, we find an entity e in q, replace e with a variable, and add to q a triplet such that the denotation of that triplet is {e}.
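As an illustration of the conjunction rule (the query skeleton, predicate names, and entity ids below are hypothetical placeholders, not entries from the dataset), adding a conjunct amounts to appending one triplet that further restricts the answer variable:

```python
# Hypothetical illustration: turn a seed SPARQL query into a conjunction query by
# adding one triplet that restricts the answer variable ?x to a strict subset.

SEED_QUERY = """SELECT ?x WHERE {
  ?x ns:film.film.starring ?c .
  ?c ns:film.performance.actor ns:m.taylor_lautner .
}"""  # hypothetical query and entity id


def add_conjunction(seed_query, predicate, obj):
    """Insert an extra triplet `?x predicate obj .` before the closing brace."""
    body, _, _ = seed_query.rpartition("}")
    return body + f"  ?x {predicate} {obj} .\n}}"


complex_query = add_conjunction(
    SEED_QUERY,
    "ns:film.film.costume_design_by",  # hypothetical predicate
    "ns:m.nina_proctor",               # hypothetical entity id
)
```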

Machine-generated (MG) questions

To have AMT workers paraphrase SPARQL queries into natural language, we need to present them in an understandable form. Therefore, we automatically generate a question they can paraphrase. When we generate new SPARQL queries, new predicates are added to the query (Table 1). We manually annotated 687 templates mapping KB predicates to text for different compositionality types (with 462 unique KB predicates), and use those templates to modify the original WebQuestionsSP question according to the meaning of the generated SPARQL query. E.g., the template for ns:book.author.works_written obj is “the author who wrote OBJ”. For brevity, we provide the details in the supplementary material.

Question Rephrasing

We used AMT workers to paraphrase MG questions into natural language (NL). Each question was paraphrased by one AMT worker and validated by 1-2 other workers. To encourage diversity, workers received a bonus if the edit distance between the paraphrase and the MG question was high. A total of 200 workers were involved, and 34,689 examples were produced at an average cost of $0.11 per question. Table 1 gives an example for each compositionality type.

A drawback of our method for generating data is that because queries are generated automatically the question distribution is artificial from a semantic perspective. Still, developing models that are capable of reasoning is an important direction for natural language understanding and ComplexWebQuestions provides an opportunity to develop and evaluate such models.

To summarize, each of our examples contains a question, an answer, a SPARQL query (that our models ignore), and all web snippets harvested by our model when attempting to answer the question. This renders ComplexWebQuestions useful for both the RC and semantic parsing communities.

4.2 Dataset analysis

ComplexWebQuestions builds on WebQuestions Berant et al. (2013). Questions in WebQuestions are usually about properties of entities (“What is the capital of France?”), often with some filter for the semantic type of the answer (“Which director”, “What city”). WebQuestions also contains questions that refer to events with multiple entities (“Who did Brad Pitt play in Troy?”). ComplexWebQuestions contains all these semantic phenomena, but we add four compositionality types by generating composition questions (45% of the time), conjunctions (45%), superlatives (5%) and comparatives (5%).

Paraphrasing

To generate rich paraphrases, we gave a bonus to workers who substantially modified the MG questions. To check whether this worked, we measured surface similarity between MG and NL questions using normalized edit distance and the DICE coefficient, and found that NL questions differ from MG questions and that the similarity distribution has wide support (Figure 4). We also found that AMT workers tend to shorten the MG question (MG avg. length: 16, NL avg. length: 13.18), and use a richer vocabulary (MG # unique tokens: 9,489, NL # unique tokens: 14,282).
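For reference, the two surface-similarity measures can be computed as in this sketch (standard word-level definitions; the exact tokenization used for the analysis is an assumption):

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]


def normalized_edit_distance(mg, nl):
    a, b = mg.lower().split(), nl.lower().split()
    return edit_distance(a, b) / max(len(a), len(b), 1)


def dice_coefficient(mg, nl):
    a, b = set(mg.lower().split()), set(nl.lower().split())
    return 2 * len(a & b) / (len(a) + len(b))
```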

Figure 4: MG and NL questions similarity with normalized edit-distance, and the DICE coefficient (bars are stacked).
Figure 5: Heat map for similarity matrix between a MG and NL question. The red line indicates a known MG split point. The blue line is the approximated NL split point.

We created a heuristic for approximating the amount of word re-ordering performed by AMT workers. For every question, we constructed a matrix A, where A_{ij} is the similarity between token i in the MG question and token j in the NL question. The similarity is 1 if the lemmas match, the cosine similarity according to GloVe embeddings Pennington et al. (2014) when it is above a threshold, and 0 otherwise. The matrix A allows us to estimate whether parts of the MG question were re-ordered when paraphrased to NL (details in the supplementary material). We find that word re-ordering occurred in 44.7% of the conjunction questions and 13.2% of the composition questions, illustrating that substantial changes to the MG question have been made. Figure 5 illustrates the matrix for a pair of questions with re-ordering.
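A sketch of the alignment-matrix construction (the lemmatizer, embedding lookup, and threshold value below are placeholders for the details given in the supplementary material):

```python
import numpy as np


def similarity_matrix(mg_tokens, nl_tokens, lemma, embed, threshold=0.5):
    """A[i, j] = 1 if lemmas match, else the cosine similarity of the GloVe vectors
    when it exceeds `threshold`, else 0. `lemma` and `embed` are placeholder callables."""
    A = np.zeros((len(mg_tokens), len(nl_tokens)))
    for i, mg in enumerate(mg_tokens):
        for j, nl in enumerate(nl_tokens):
            if lemma(mg) == lemma(nl):
                A[i, j] = 1.0
            else:
                u, v = embed(mg), embed(nl)
                cos = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
                A[i, j] = cos if cos > threshold else 0.0
    return A
```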

Last, we find that in WebQuestions almost all questions start with a wh-word, but in ComplexWebQuestions 22% of the questions start with another word, again showing substantial paraphrasing from the original questions.

Qualitative analysis

We randomly sampled 100 examples from the development set and manually identified prevalent phenomena in the data. We present these types in Table 2 along with their frequency. In 18% of the examples a conjunct in the MG question becomes a modifier of a wh-word in the NL question (Wh-modifier). In 22% substantial word re-ordering of the MG questions occurred, and in 42% a minor word re-ordering occurred (“number of building floors is 50” paraphrased as “has 50 floors”). AMT workers used a synonym in 54% of the examples, they omitted words in 27% of the examples and they added new lexical material in 29%.

Type | MG question | NL question | %
Wh-modifier | “what movies does leo howard play in and that is 113.0 minutes long?” | “Which Leo Howard movie lasts 113 minutes?” | 18%
Major reordering | “Where did the actor that played in the film Hancock 2 go to high school?” | “What high school did the actor go to who was in the movie Hancock 2?” | 22%
Minor reordering | “what to do and see in vienna austria and the number of building floors is 50?” | “What building in Vienna, Austria has 50 floors?” | 42%
Synonym | “where does the body of water under Kineshma Bridge start” | “Where does the body of water under Kineshma Bridge originate?” | 54%
Skip word | “what movies did miley cyrus play in and involves organization Cirkus?” | “What movie featured Miley Cyrus and involved Cirkus?” | 27%
Add word | “what to do if you have one day in bangkok and the place is an amusement park that opened earliest?” | “Which amusement park, that happens to be the one that opened earliest, should you visit if you have only one day to spend in Bangkok?” | 29%
Table 2: Examples and frequency of prevalent phenomena in the NL questions for a manually analyzed subset (see text).

To obtain intuition for operations that will be useful in our model, we analyzed the 100 examples for the types of operations that should be applied to the NL question during question decomposition. We found that splitting the NL question is insufficient, and that in 53% of the cases a word in the NL question needs to be copied to multiple questions after decomposition (row 3 in Table 3). Moreover, words that did not appear in the MG question need to be added in 39% of the cases, and words need to be deleted in 28% of the examples.

5 Model and Learning

We would like to develop a model that translates questions into arbitrary computation trees with arbitrary text at the tree leaves. However, this requires training from denotations using methods such as maximum marginal likelihood or reinforcement learning Guu et al. (2017) that are difficult to optimize. Moreover, such approaches involve issuing large numbers of queries to a search engine at training time, incurring high costs and slowing down training.

Instead, we develop a simple approach in this paper. We consider a subset of all possible computation trees that allows us to automatically generate noisy full supervision. In what follows, we describe the subset of computation trees considered and their representation, a method for automatically generating noisy supervision, and a pointer network model for decoding.

Program | Question | Split
SimpQA | “What building in Vienna, Austria has 50 floors” | -
Comp | “Where is the birthplace of the writer of Standup Shakespeare” | “Where is the birthplace of VAR” ; “the writer of Standup Shakespeare”
Conj | “What film featured Taylor Swift and was directed by Deborah Aquila” | “What film featured Taylor Swift” ; “film and was directed by Deborah Aquila”
Table 3: Examples of the types of computation trees that can be decoded by our model.

Representation

We represent computation trees as sequences of tokens, and consider trees with at most one compositional operation. We denote the sequence of question tokens by x, and the decoded sequence by z. We consider the following token sequences (see Table 3):

  1. SimpQA: The function SimpQA is applied to the question without paraphrasing. In prefix notation this is the tree SimpQA(x).

  2. Comp i j: This sequence of tokens corresponds to the computation tree Comp(x_{1:i-1} ∘ VAR ∘ x_{j+1:|x|}, SimpQA(x_{i:j})), where ∘ is the concatenation operator. This is used for questions where a substring x_{i:j} is answered by SimpQA and the answers replace a variable before computing a final answer.

  3. Conj i j: This sequence of tokens corresponds to the computation tree Conj(SimpQA(x_{1:i-1}), SimpQA(x_j ∘ x_{i:|x|})). The idea is that a conjunction can be answered by splitting the question at a single point i, where the token x_j is copied to the second part as well (“film” in Table 3). If nothing needs to be copied, then j is empty.
This representation supports one compositional operation, and a single copying operation is allowed without any re-phrasing. In future work, we plan to develop a more general representation, which will require training from denotations.
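A decoded representation can be executed as in the following sketch, assuming the simp_qa, comp, and conj functions of Section 3 and 0-based token indices (the index convention is our assumption):

```python
def execute(z, x, simp_qa, comp, conj):
    """Execute a decoded representation z over question tokens x.

    z is ("SimpQA",), ("Comp", i, j) or ("Conj", i, j); x is a list of tokens.
    """
    op, *indices = z
    if op == "SimpQA":                       # no decomposition
        return simp_qa(" ".join(x))
    i, j = indices
    if op == "Comp":                         # answer x[i:j+1] first, then substitute for VAR
        inner = " ".join(x[i:j + 1])
        template = " ".join(x[:i] + ["VAR"] + x[j + 1:])
        return comp(template, simp_qa(inner), simp_qa)
    if op == "Conj":                         # split at i; copy token x[j] to the second part
        first = " ".join(x[:i])
        second = " ".join(([x[j]] if j is not None else []) + x[i:])
        return conj(first, second, simp_qa)
    raise ValueError(f"unknown operation {op}")
```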

Supervision

Training from denotations is difficult because it involves querying a search engine frequently, which is expensive. Therefore, we take advantage of the original SPARQL queries and MG questions to generate noisy programs for composition and conjunction questions. Note that these noisy programs are used only as supervision to avoid the costly process of manual annotation; the model itself does not assume SPARQL queries in any way.

We generate noisy programs from SPARQL queries in the following manner: First, we automatically identify composition and conjunction questions. Because we generated the MG question, we can exactly identify the split points in the MG question (the boundaries of the inner substring in composition questions, and the conjunct boundary in conjunction questions). Then, we use a rule-based algorithm that takes the alignment matrix A (Section 4.2) and approximates the split points in the NL question and the index of the token to copy in conjunction questions. The red line in Figure 5 corresponds to the known split point in the MG question, and the blue one is the approximated split point in the NL question. The details of this rule-based algorithm are in the supplementary material.

Thus, we obtain noisy supervision for all composition and conjunction questions and can train a model that translates a question x into a representation z, where the split indices i and j are integers pointing into x.

Pointer network

The representation z points to indices in the input, and thus pointer networks Vinyals et al. (2015) are a sensible choice. Because we also need to decode the tokens Comp and Conj, we use “augmented pointer networks” Zhong et al. (2017): for every question x, an augmented question is created by appending the tokens “Comp Conj” to x. This allows us to decode the representation with one pointer network that at each decoding step points to one token in the augmented question. We encode the question with a one-layer GRU Cho et al. (2014), and decode with a one-layer GRU with attention, as in Jia and Liang (2016). The only difference is that we decode tokens from the augmented question rather than from a fixed vocabulary.
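The augmentation step can be sketched as follows (a simplification: the GRU encoder/decoder itself is omitted, and the target construction reflects our reading of the supervision described above):

```python
def augment(question_tokens):
    """Append the special tokens so every output symbol is an index into the input."""
    return question_tokens + ["Comp", "Conj"]


def pointer_targets(question_tokens, program):
    """Map a program like ("Conj", i, j) to pointer indices in the augmented question."""
    op, i, j = program
    op_index = len(question_tokens) + (0 if op == "Comp" else 1)  # positions of the appended tokens
    return [op_index, i, j]
```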

We train the model with token-level cross-entropy loss, i.e., we maximize the log-likelihood of the gold output sequence z given the question. Parameters include the GRU encoder and decoder, and embeddings for unknown tokens (those not in the pre-trained GloVe embeddings Pennington et al. (2014)).

The trained model decodes Comp and Conj representations, but sometimes answering with SimpQA without decomposition is better. To handle such cases we do the following: we assume that we always have access to a score for every answer, provided by the final invocation of SimpQA (in Conj questions this score is the maximum of the scores given by SimpQA for the two conjuncts), and use the following rule to decide whether to use the decoded representation or SimpQA over the entire question. Given the answer scores produced by the decoded representation and the answer scores produced by SimpQA over the entire question, we return the single answer that has the highest score. The intuition is that the confidence provided by the scores of SimpQA is correlated with answer correctness. In future work we will train directly from denotations and will handle all logical functions in a uniform manner.
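A sketch of this selection rule, assuming both strategies return dictionaries mapping answers to SimpQA confidence scores:

```python
def select_answer(decomposed_scores, simpqa_scores):
    """Return the single answer with the highest confidence score across the two
    strategies (question decomposition vs. sending the full question to SimpQA)."""
    merged = {}
    for scores in (decomposed_scores, simpqa_scores):
        for answer, score in scores.items():
            merged[answer] = max(score, merged.get(answer, float("-inf")))
    return max(merged, key=merged.get)
```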

6 Experiments

In this section, we aim to examine whether question decomposition can empirically improve performance of QA models over complex questions.

Experimental setup

We used 80% of the examples in ComplexWebQuestions for training, 10% for development, and 10% for test, training the pointer network on 24,708 composition and conjunction examples. We trained with Adagrad Duchi et al. (2010) combined with L2 regularization and dropout, and initialized word embeddings with pre-trained GloVe vectors, learning embeddings for words missing from GloVe.

Simple QA model

As our SimpQA function, we use the web-based QA model of Talmor et al. (2017). This model sends the question to Google’s search engine and extracts a distribution over answers from the top retrieved web snippets using manually-engineered features. We re-train the model on our data with one new feature: for every question and candidate answer mention in a snippet, we run RaSoR, an RC model by Lee et al. (2016), and add the output logit score as a feature. We found that combining the web-facing model of Talmor et al. (2017) with RaSoR resulted in improved performance.

Evaluation

For evaluation, we measure precision@1 (p@1), i.e., whether the highest scoring answer returned string-matches one of the correct answers (while answers are sets, 70% of the questions have a single answer, and the average size of the answer set is 2.3).
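Concretely, precision@1 over a set of examples can be computed as in this sketch (the string normalization is an assumption):

```python
def precision_at_1(predictions, gold_answer_sets):
    """predictions: top-1 answer string per question; gold_answer_sets: set of
    acceptable answer strings per question. Matching is by normalized string equality."""
    def norm(s):
        return s.strip().lower()

    hits = sum(
        norm(pred) in {norm(g) for g in gold}
        for pred, gold in zip(predictions, gold_answer_sets)
    )
    return 100.0 * hits / len(predictions)
```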

We evaluate the following models and oracles:

  1. SimpQA: running SimpQA on the entire question, i.e., without decomposition.

  2. SplitQA: our main model, which answers complex questions by decomposition.

  3. SplitQAOracle: an oracle model that chooses in hindsight whether to perform question decomposition or use SimpQA, based on what performs better.

  4. RCQA: identical to SimpQA, except that we replace the RC model from Talmor et al. (2017) with the RC model DocQA Clark and Gardner (2017), whose performance is comparable to state-of-the-art on TriviaQA.

  5. SplitRCQA: identical to SplitQA, except that we replace the RC model from Talmor et al. (2017) with DocQA.

  6. GoogleBox: we sample random development set questions and check whether Google returns a box that contains one of the correct answers.

  7. Human: we sample random development set questions and manually answer the questions with Google’s search engine, including all available information. We limit the time allowed for answering to 4 minutes.

System Dev. Test
SimpQA 20.4 20.8
SplitQA 29.0 27.5
SplitQAOracle 34.0 33.7
RCQA 18.7 18.6
SplitRCQA 21.5 22.0
GoogleBox 2.5 -
Human 63.0 -

Table 4: precision@1 results on the development set and test set for ComplexWebQuestions.

Table 4 presents the results on the development and test sets. SimpQA, which does not decompose questions, obtained 20.8 p@1, while performing question decomposition substantially improves performance to 27.5 p@1. An upper bound with perfect knowledge of when to decompose is given by SplitQAOracle at 33.7 p@1.

RCQA obtained lower performance than SimpQA, as it was trained on data from a different distribution. More importantly, SplitRCQA outperforms RCQA by 3.4 points, illustrating that this RC model also benefits from question decomposition, despite the fact that it was not created with question decomposition in mind. This shows the importance of question decomposition for retrieving documents from which an RC model can extract answers. GoogleBox finds a correct answer in 2.5% of the cases, showing that complex questions are challenging for search engines.

To conclude, we demonstrated that question decomposition substantially improves performance on answering complex questions using two independent RC models.

Analysis

We estimate human performance (Human) at 63.0 p@1. We find that answering complex questions takes roughly 1.3 minutes on average. For questions we were unable to answer, we found that in 27% the answer was correct but exact string match with the gold answers failed; in 23.1% the time required to compute the answer was beyond our capabilities; for 15.4% we could not find an answer on the web; 11.5% were of ambiguous nature; 11.5% involved paraphrasing errors of AMT workers; and an additional 11.5% did not contain a correct gold answer.

SplitQA decides whether to decompose a question based on the confidence of SimpQA. In 61% of the questions the model chooses to decompose the question, and for the rest it sends the question as-is to the search engine. When at least one of the strategies (decomposition vs. no decomposition) yields a correct answer, our model chooses the right one in 86% of the cases. Moreover, in 71% of these answerable questions, only one strategy yields a correct answer.

We evaluate the ability of the pointer network to mimic our labeling heuristic on the development set. We find that the model outputs the exact correct output sequence 60.9% of the time, and allowing errors of one word to the left and right (this often does not change the final output) accuracy is at 77.1%. Token-level accuracy is 83.0% and allowing one-word errors 89.7%. This shows that SplitQA learned to identify decomposition points in the questions. We also observed that often SplitQA produced decomposition points that are better than the heuristic, e.g., for “What is the place of birth for the lyricist of Roman Holiday”, SplitQA produced “the lyricist of Roman Holiday”, but the heuristic produced “the place of birth for the lyricist of Roman Holiday”. Additional examples of SplitQA question decompositions are provided in Table 5.

Question | Split-1 | Split-2
“Find the actress who played Hailey Rogers, what label is she signed to” | “the actress who played Hailey Rogers” | “Find VAR, what label is she signed to”
“What are the colors of the sports team whose arena stadium is the AT&T Stadium” | “the sports team whose arena stadium is the AT&T Stadium” | “What are the colors of VAR”
“What amusement park is located in Madrid Spain and includes the stunt fall ride” | “What amusement park is located in Madrid Spain and” | “park includes the stunt fall ride”
“Which university whose mascot is The Trojan did Derek Fisher attend” | “Which university whose mascot is The Trojan did” | “university Derek Fisher attend”
Table 5: Examples of question decompositions from SplitQA.

ComplexQuestions

To further examine the ability of web-based QA models, we run an experiment against ComplexQuestions Bao et al. (2016), a small dataset of question-answer pairs designed for semantic parsing against Freebase.

System | Dev. F1 | Test F1
SimpQA | 40.7 | 38.6
SplitQARule | 43.1 | 39.7
SplitQARule++ | 46.9 | -
CompQ | - | 40.9
Table 6: F1 results on ComplexQuestions.

We ran SimpQA on this dataset (Table 6) and obtained 38.6 F1 (the official metric), slightly lower than CompQ, the best system, which operates directly against Freebase. (By adding the output logit from RaSoR, we improved test F1 from 32.6, as reported by Talmor et al. (2017), to 38.6.) By analyzing the training data, we found that we can decompose Comp questions with a rule that splits the question when the words “when” or “during” appear, e.g., “Who was vice president when JFK was president?” (the dataset is too small to train our decomposition model). We decomposed questions with this rule and obtained 39.7 F1 (SplitQARule). Analyzing the development set errors, we found that occasionally SplitQARule returns a correct answer that fails to string-match with the gold answer. By manually fixing these cases, our development set F1 reaches 46.9 (SplitQARule++). Note that CompQ does not suffer from any string matching issues, as it operates directly against the Freebase KB and thus is guaranteed to output the answer in the correct form. This short experiment shows that a web-based QA model can rival a semantic parser that works against a KB, and that simple question decomposition is beneficial and leads to results comparable to the state-of-the-art.

7 Related work

This work is related to a body of work in semantic parsing and RC, in particular to datasets that focus on complex questions such as TriviaQA Joshi et al. (2017), WikiHop Welbl et al. (2017) and RACE Lai et al. (2017). Our distinction is in proposing a framework for complex QA that focuses on question decomposition.

Our work is related to Chen et al. (2017) and Watanabe et al. (2017), who combined retrieval and answer extraction over a large set of documents. We work against the entire web, and propose question decomposition for finding information.

This work is also closely related to Dunn et al. (2017) and Buck et al. (2017): we start with questions directly and do not assume documents are given. Buck et al. (2017) also learn to phrase questions given a black-box QA model, but while they focus on paraphrasing, we address decomposition. Using a black-box QA model is challenging because one cannot assume differentiability, and reproducibility is difficult because black boxes change over time. Nevertheless, we argue that such QA setups provide a holistic view of the problem of QA and can shed light on important research directions going forward.

Another important related research direction is Iyyer et al. (2016), who answered complex questions by decomposing them. However, they used crowdsourcing to obtain direct supervision for the gold decomposition, while we do not assume such supervision. Moreover, they work against web tables, while we interact with a search engine over the entire web.

8 Conclusion

In this paper we propose a new framework for answering complex questions that is based on question decomposition and interaction with the web. We develop a model under this framework and demonstrate it improves complex QA performance on two datasets and using two RC models. We also release a new dataset, ComplexWebQuestions, including questions, SPARQL programs, answers, and web snippets harvested by our model. We believe this dataset will serve the QA and semantic parsing communities, drive research on compositionality, and push the community to work on holistic solutions for QA.

In future work, we plan to train our model directly from weak supervision, i.e., denotations, and to extract information not only from the web, but also from structured information sources such as web tables and KBs.

Acknowledgements

We thank Jonathan Herzig, Ni Lao, and the anonymous reviewers for their constructive feedback. This work was supported by the Samsung runway project and the Israel Science Foundation, grant 942/16.

References

  • Artzi and Zettlemoyer (2013) Y. Artzi and L. Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics (TACL) 1:49–62.
  • Bao et al. (2016) J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao. 2016.

    Constraint-based question answering with knowledge graph.

    In International Conference on Computational Linguistics (COLING).
  • Berant et al. (2013) J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).
  • Berant and Liang (2015) J. Berant and P. Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics (TACL) 3:545–558.
  • Bollacker et al. (2008) K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD). pages 1247–1250.
  • Buck et al. (2017) C. Buck, J. Bulian, M. Ciaramita, A. Gesmundo, N. Houlsby, W. Gajewski, and W. Wang. 2017. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830 .
  • Chen et al. (2016) D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN / Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).
  • Chen et al. (2017) D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051 .
  • Cho et al. (2014) K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 .
  • Clark and Gardner (2017) C. Clark and M. Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723 .
  • Duchi et al. (2010) J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT).
  • Dunn et al. (2017) M. Dunn, L. Sagun, M. Higgins, U. Guney, V. Cirik, and K. Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint.
  • Guu et al. (2017) K. Guu, P. Pasupat, E. Z. Liu, and P. Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Association for Computational Linguistics (ACL).
  • Hermann et al. (2015) K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).
  • Hewlett et al. (2016) D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. 2016. Wikireading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL).
  • Hill et al. (2015) F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. In International Conference on Learning Representations (ICLR).
  • Iyyer et al. (2016) M. Iyyer, W. Yih, and M. Chang. 2016. Answering complicated question intents expressed in decomposed question sequences. CoRR.
  • Jia and Liang (2016) R. Jia and P. Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).
  • Jia and Liang (2017) R. Jia and P. Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).
  • Joshi et al. (2017) M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL).
  • Krishnamurthy and Mitchell (2012) J. Krishnamurthy and T. Mitchell. 2012. Weakly supervised training of semantic parsers. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL). pages 754–765.
  • Kwiatkowski et al. (2013) T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP).
  • Lai et al. (2017) G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683 .
  • Lee et al. (2016) K. Lee, M. Lewis, and L. Zettlemoyer. 2016. Global neural CCG parsing with optimality guarantees. In Empirical Methods in Natural Language Processing (EMNLP).
  • Liang (2013) P. Liang. 2013. Lambda dependency-based compositional semantics. arXiv preprint arXiv:1309.4408 .
  • Liang et al. (2011) P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL). pages 590–599.
  • Nguyen et al. (2016) T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop on Cognitive Computing at NIPS.
  • Nogueira and Cho (2016) R. Nogueira and K. Cho. 2016. End-to-end goal-driven web navigation. In Advances in Neural Information Processing Systems (NIPS).
  • Onishi et al. (2016) T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Empirical Methods in Natural Language Processing (EMNLP).
  • Pasupat and Liang (2015) P. Pasupat and P. Liang. 2015. Compositional semantic parsing on semi-structured tables. In Association for Computational Linguistics (ACL).
  • Pennington et al. (2014) J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).
  • Rajpurkar et al. (2016) P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
  • Steedman (2000) M. Steedman. 2000. The Syntactic Process. MIT Press.
  • Sutskever et al. (2014) I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS). pages 3104–3112.
  • Talmor et al. (2017) A. Talmor, M. Geva, and J. Berant. 2017. Evaluating semantic parsing against a simple web-based question answering model. In *SEM.
  • Vinyals et al. (2015) O. Vinyals, M. Fortunato, and N. Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems (NIPS). pages 2674–2682.
  • Wang et al. (2017) W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Association for Computational Linguistics (ACL).
  • Wang et al. (2015) Y. Wang, J. Berant, and P. Liang. 2015. Building a semantic parser overnight. In Association for Computational Linguistics (ACL).
  • Watanabe et al. (2017) Y. Watanabe, B. Dhingra, and R. Salakhutdinov. 2017. Question answering from unstructured text by retrieval and comprehension. arXiv preprint arXiv:1703.08885 .
  • Welbl et al. (2017) J. Welbl, P. Stenetorp, and S. Riedel. 2017. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481 .
  • Yang et al. (2015) Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP). pages 2013–2018.
  • Yih et al. (2016) W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Association for Computational Linguistics (ACL).
  • Zelle and Mooney (1996) M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI). pages 1050–1055.
  • Zettlemoyer and Collins (2005) L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI). pages 658–666.
  • Zhong et al. (2017) V. Zhong, C. Xiong, and R. Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 .

Supplementary Material

Dataset

Generating SPARQL queries

Given a SPARQL query q, we create four types of more complex queries: conjunctions, superlatives, comparatives, and compositions. For conjunctions, superlatives, and comparatives, we identify SPARQL queries in WebQuestionsSP whose denotation is a set A, and generate a new query whose denotation is a strict subset of A. We also discard questions that contain the answer within the new machine-generated question.

For conjunctions this is done by traversing the KB and looking for SPARQL triplets that can be added to q and yield a valid strict subset of A.

For comparatives and superlatives this is done by finding a numerical property common to all answers in A, and adding a clause to q accordingly.

For compositions, we find an entity e in q, replace e with a variable, and add to q a clause such that the denotation of the clause is {e}. We also discard ambiguous questions that yield more than one answer for the entity e.

Compositionality | Example (natural language)
Conjunction | “What films star Taylor Lautner and have costume designs by Nina Proctor?”
Superlative | “Which school that Sir Ernest Rutherford attended has the latest founding date?”
Comparative | “Which of the countries bordering Mexico have an army size of less than 1050?”
Composition | “Where is the end of the river that originates in Shannon Pot?”
Table 7: The four compositionality types used to generate complex SPARQL queries from seed queries, with an example NL question for each type.

Table 7 summarizes the compositionality types with an example NL question for each.

Machine-generated (MG) questions

To have AMT workers paraphrase SPARQL queries into natural language, we need to present them in an understandable form. Therefore, we automatically generate a question they can paraphrase. When we generate SPARQL queries, new predicates are added to the query (Table 7). We manually annotate 503 templates mapping predicates to text for different compositionality types (with 377 unique KB predicates). We annotate the templates in the context of several machine-generated questions to ensure that the resulting templates are in understandable language.

We use those templates to modify the original WebQuestionsSP question according to the meaning of the generated SPARQL query. E.g., the template for ns:book.author.works_written is “the author who wrote OBJ”. Table 8 shows several examples of such templates. “OBJ” is replaced by the Freebase name of the object at hand. Freebase represents events that involve multiple arguments using a special node called a CVT, which represents the event and is connected by edges to all event arguments. Therefore, some of our templates include two predicates that go through a CVT node; these are denoted in Table 8 with ’+’.

To fuse the templates with the original WebQuestionsSP natural language questions, templates contain lexical material that glues them back to the question conditioned on the compositionality type. For example, in Conj questions we use the coordinating phrase “and is”, so that “the author who wrote OBJ” will produce “Who was born in London and is the author who wrote OBJ”.

Figure 6: Overview of data collection process. Blue text denotes different stages of the term addition, green represents the obj value, and red the intermediate text to connect the new term and seed question
Freebase Predicate | Template
ns:book.author.works_written | “the author who wrote obj”
ns:aviation.airport.airlines + ns:aviation.airline_airport_presence.airline | “the airport with the obj airline”
ns:award.competitor.competitions_won | “the winner of obj”
ns:film.actor.film + ns:film.performance.film | “the actor that played in the film obj”
Table 8: Template examples.

First word distribution

Figure 7: First word in question distribution

We find that in WebQuestions almost all questions start with a wh-word, but in ComplexWebQuestions 22% of the questions start with another word, again showing substantial paraphrasing from the original questions. Figure 7 shows the distribution of first words in questions.

Generating noisy supervision

We created a heuristic for approximating the amount of global word re-ordering performed by AMT workers and for creating noisy supervision. For every question, we constructed a matrix A, where A_{ij} is the similarity between token i in the MG question and token j in the NL question. The similarity is 1 if the lemmas match, the cosine similarity according to GloVe embeddings when it is above a threshold, and 0 otherwise. This allows us to compute an approximate word alignment between the MG and NL question tokens and assess whether word re-ordering occurred.

For an NL Conj question and an MG question with a known split point index, the algorithm first computes the best point to split the NL question assuming there is no re-ordering: it iterates over all candidate NL split points and returns the one that maximizes the total alignment between the first MG part and the first NL part, and between the second MG part and the second NL part. It then computes a second candidate split point assuming the two parts were re-ordered in the NL question, i.e., maximizing the alignment between the first MG part and the second NL part, and between the second MG part and the first NL part. The final split point, and whether re-ordering occurred, is determined by comparing the two alignment scores and taking the higher one.

In Comp questions, two split points are returned, representing the beginning and end of the phrase that is sent to the QA model. Therefore, given the two known split points in the MG question, we return the pair of NL indices that maximizes the analogous alignment score.

Figure 8 illustrates finding the split point for a Conj question in the re-ordered case. The red line in Figure 8 corresponds to the known split point in the MG question, and the blue one is the estimated split point in the NL question.

Figure 8: Heat map for similarity matrix between an MG and NL question. The red line indicates a known MG split point. The blue line is the approximated NL split point. Below is a graph of each candidate split point score.
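A sketch of the Conj split-point heuristic; the scoring function below, which sums similarity-matrix entries inside the aligned segments, is our assumption about the details rather than the exact formula used:

```python
import numpy as np


def best_conj_split(A, mg_split):
    """A: MG-by-NL similarity matrix; mg_split: known split index in the MG question.
    Returns (nl_split, reordered) under an assumed segment-alignment score."""
    _, n_nl = A.shape

    def score(t, reordered):
        if not reordered:   # MG part 1 aligns to NL[:t], MG part 2 to NL[t:]
            return A[:mg_split, :t].sum() + A[mg_split:, t:].sum()
        return A[:mg_split, t:].sum() + A[mg_split:, :t].sum()   # NL parts swapped

    candidates = [(score(t, r), t, r) for t in range(1, n_nl) for r in (False, True)]
    best_score, nl_split, reordered = max(candidates)
    return nl_split, reordered
```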