Answering questions posed in natural language is a fundamental AI task, with a large number of impressive QA systems built over the years. Today’s Internet search engines, for instance, can successfully retrieve factoid style answers to many natural language queries by efficiently searching the Web. Information Retrieval (IR) systems work under the assumption that answers to many questions of interest are often explicitly stated somewhere [Kwok et al.2001], and all one needs, in principle, is access to a sufficiently large corpus. Similarly, statistical correlation based methods, such as those using Pointwise Mutual Information or PMI [Church and Hanks1989], work under the assumption that many questions can be answered by looking for words that tend to co-occur with the question words in a large corpus.
While both of these approaches help identify correct answers, they are not suitable for questions requiring reasoning, such as chaining together multiple facts in order to arrive at a conclusion. Arguably, such reasoning is a cornerstone of human intelligence, and is a key ability evaluated by standardized science exams given to students. For example, consider a question from the NY Regents 4th Grade Science Test:
In New York State, the longest period of daylight occurs during which month? (A) June (B) March (C) December (D) September
We would like a QA system that, even if the answer is not explicitly stated in a document, can combine basic scientific and geographic facts to answer the question, e.g., New York is in the north hemisphere; the longest day occurs during the summer solstice; and the summer solstice in the north hemisphere occurs in June (hence the answer is June). Figure 1 illustrates how our system approaches this, with the highlighted support graph representing its line of reasoning.
Further, we would like the system to be robust under simple perturbations, such as changing New York to New Zealand (in the southern hemisphere) or changing an incorrect answer option to an irrelevant word such as “last” that happens to have high co-occurrence with the question text.
To this end, we propose a structured reasoning system, called TableILP, that operates over a semi-structured knowledge base derived from text and answers questions by chaining multiple pieces of information and combining parallel evidence.111 A preliminary version of our ILP model was used in the ensemble solver of aristo2016:combining aristo2016:combining. We build upon this earlier ILP formulation, providing further details and incorporating additional syntactic and semantic constraints that improve the score by 17.7%. The knowledge base consists of tables, each of which is a collection of instances of an -ary relation defined over natural language phrases. E.g., as illustrated in Figure 1, a simple table with schema (country, hemisphere) might contain the instance (United States, Northern) while a ternary table with schema (hemisphere, orbital event, month) might contain (North, Summer Solstice, June). TableILP treats lexical constituents of the question , as well as cells of potentially relevant tables , as nodes in a large graph , and attempts to find a subgraph of that “best” supports an answer option. The notion of best support is captured via a number of structural and semantic constraints and preferences, which are conveniently expressed in the Integer Linear Programming (ILP) formalism. We then use an off-the-shelf ILP optimization engine called SCIP [Achterberg2009] to determine the best supported answer for .
Following a recently proposed AI challenge [Clark2015], we evaluate TableILP on unseen elementary-school science questions from standardized tests. Specifically, we consider a challenge set [Clark et al.2016] consisting of all non-diagram multiple choice questions from 6 years of NY Regents 4th grade science exams. In contrast to a state-of-the-art structured inference method [Khot et al.2015] for this task, which used Markov Logic Networks (MLNs) [Richardson and Domingos2006]
, TableILP achieves a significantly (+14% absolute) higher test score. This suggests that a combination of a rich and fine-grained constraint language, namely ILP, even with a publicly available solver is more effective in practice than various MLN formulations of the task. Further, while the scalability of the MLN formulations was limited to very few (typically one or two) selected science rules at a time, our approach easily scales to hundreds of relevant scientific facts. It also complements the kind of questions amenable to IR and PMI techniques, as is evidenced by the fact that a combination (trained using simple Logistic Regression[Clark et al.2016]) of TableILP with IR and PMI results in a significant (+10% absolute) boost in the score compared to IR alone.
Our ablation study suggests that combining facts from multiple tables or multiple rows within a table plays an important role in TableILP’s performance. We also show that TableILP benefits from the table structure, by comparing it with an IR system using the same knowledge (the table rows) but expressed as simple sentences; TableILP scores significantly (+10%) higher. Finally, we demonstrate that our approach is robust to a simple perturbation of incorrect answer options: while the simple perturbation results in a relative drop of 20% and 33% in the performance of IR and PMI methods, respectively, it affects TableILP’s performance by only 12%.
2 Related Work
aristo2016:combining aristo2016:combining proposed an ensemble approach for the science QA task, demonstrating the effectiveness of a combination of information retrieval, statistical association, rule-based reasoning, and an ILP solver operating on semi-structured knowledge. Our ILP system extends their model with additional constraints and preferences (e.g., semantic relation matching), substantially improving QA performance.
A number of systems have been developed for answering factoid questions with short answers (e.g., “What is the capital of France?”) using document collections or databases (e.g., Freebase [Bollacker et al.2008], NELL [Carlson et al.2010]), for example [Brill et al.2002, Fader et al.2014, Ferrucci et al.2010, Ko et al.2007, Yih et al.2014, Yao and Van Durme2014, Zou et al.2014]. However, many science questions have answers that are not explicitly stated in text, and instead require combining information together. Conversely, while there are AI systems for formal scientific reasoning (e.g., [Gunning et al.2010, Novak1977]), they require questions to be posed in logic or restricted English. Our goal here is a system that operates between these two extremes, able to combine information while still operating with natural language.
The task of Recognizing Textual Entailment (RTE) [Dagan et al.2010, Dagan et al.2013] is also closely related, as QA can be cast as entailment (Does corpus entail question+answer? [Bentivogli et al.2008]). However, RTE has primarily focused on the task of linguistic equivalence, and has not addressed questions where some form of scientific reasoning is required. Recent work on Natural Logic [Angeli and Manning2014, MacCartney2009] has extended RTE to account for the logical structure within language. Our work can be seen as going one step further, to add a layer of structured reasoning on top of this; in fact, we use an RTE engine as a basic subroutine for comparing individual table cells in our ILP formulation.
ILP based discrete optimization has been successful in several NLP tasks [Roth and Yih2004, Chang et al.2010, Berant et al.2010, Srikumar and Roth2011, Goldwasser and Roth2011]. While our ILP formulation also operates on natural language text, our focus is on the use of a specific semi-structured table representation for QA. cohen2000:joins cohen2000:joins studied tables with natural language text requiring soft matching, with a focus on efficiently computing the top few candidates given a database query. In contrast, our system, given a natural language question, (implicitly) seeks to generate a query that produces the most supported answer.
3 QA as Subgraph Optimization
We begin with our knowledge representation formalism, followed by our treatment of QA as an optimal subgraph selection problem over such knowledge, and then briefly describe our ILP model for subgraph selection.
3.1 Semi-Structured Knowledge as Tables
We use semi-structured knowledge represented in the form of -ary predicates over natural language text [Clark et al.2016]. Formally, a -column table in the knowledge base is a predicate over strings, where each string is a (typically short) natural language phrase. The column headers capture the table schema, akin to a relational database. Each row in the table corresponds to an instance of this predicate. For example, a simple country-hemisphere table represents the binary predicate with instances such as (Australia, Southern) and (Canada, Northern). Since table content is specified in natural language, the same entity is often represented differently in different tables, posing an additional inference challenge.
Although techniques for constructing this knowledge base are outside the scope of this paper, we briefly mention them. Tables were constructed using a mixture of manual and semi-automatic techniques. First, the table schemas were manually defined based on the syllabus, study guides, and training questions. Tables were then populated both manually and semi-automatically using IKE [Dalvi et al.2016], a table-building tool that performs interactive, bootstrapped relation extraction over a corpus of science text. In addition, to augment these tables with the broad knowledge present in study guides that doesn’t always fit the manually defined table schemas, we ran an Open IE [Banko et al.2007] pattern-based subject-verb-object (SVO) extractor from clark2014:akbc clark2014:akbc over several science texts to populate three-column Open IE tables. Methods for further automating table construction are under development.
3.2 QA as a Search for Desirable Support Graphs
We treat question answering as the task of pairing the question with an answer such that this pair has the best support in the knowledge base, measured in terms of the strength of a “support graph” defined as follows.
Given a multiple choice question and tables , we can define a labeled undirected graph over nodes and edges as follows. We first split into lexical constituents (e.g., non-stopword tokens, or chunks) and answer options . For each table , we consider its cells as well as column headers . The nodes of are then . For presentation purposes, we will equate a graph node with the lexical entity it represents (such as a table cell or a question constituent). The undirected edges of are excluding edges both whose endpoints are within a single table.
Informally, an edge denotes (soft) equality between a question or answer node and a table node, or between two table nodes. To account for lexical variability (e.g., that tool and instrument are essentially equivalent) and generalization (e.g., that a dog is an animal), we replace string equality with a phrase-level entailment or similarity function that labels each edge with an associated score . We use entailment scores (directional) from to and from to , and similarity scores (symmetric) between two nodes in .222In our evaluations, for entailment is a simple WordNet-based [Miller1995] function that computes the best word-to-word alignment between phrases, scores these alignments using WordNet’s hypernym and synonym relations normalized using relevant word-sense frequency, and returns the weighted sum of the scores. for similarity is the maximum of the entailment score in both directions. Alternative definitions for these functions may also be used. In the special case of column headers across two tables, the score is (manually) set to either 0 or 1, indicating whether this corresponds to a meaningful join.
Intuitively, we would like the support graph for an answer option to be connected, and to include nodes from the question, the answer option, and at least one table. Since each table row represents a coherent piece of information but cells within a row do not have any edges in (the same holds also for cells and the corresponding column headers), we use the notion of an augmented subgraph to capture the underlying table structure. Let be a subgraph of . The augmented subgraph is formed by adding to edges such that and are in and they correspond to either the same row (possibly the header row) of a table in or to a cell and the corresponding column header.
A support graph for a question , tables , and an answer option is a subgraph of with the following basic properties:
for all ;
if then there exists a corresponding involving the same columns; and
the augmented subgraph is connected.
A support graph thus connects the question constituents to a unique answer option through table cells and (optionally) table headers corresponding to the aligned cells. A given question and tables give rise to a large number of possible support graphs, and the role of the inference process will be to choose the “best” one under a notion of desirable support graphs developed next. We do this through a number of additional structural and semantic properties; the more properties the support graph satisfies, the more desirable it is.
3.3 ILP Formulation
We model the above support graph search for QA as an ILP optimization problem, i.e., as maximizing a linear objective function over a finite set of variables, subject to a set of linear inequality constraints. A summary of the model is given below. 333Details of the ILP model may be found in the appendix. We note that the ILP objective and constraints aren’t tied to the particular domain of evaluation; they represent general properties that capture what constitutes a well supported answer for a given question.
|header of the -th column of -th table|
|cell in row and column of -th table|
|row of -th table|
|column of -th table|
|-th lexical constituent of the question|
|-th answer option|
Table 1 summarizes the notation for various elements of the problem, such as for cell of table . All core variables in the ILP model are binary, i.e., have domain . For each element, the model has a unary variable capturing whether this element is part of the support graph , i.e., it is “active”. For instance, row is active if at least one cell in row of table is in . The model also has pairwise “alignment” variables, capturing edges of . The alignment variable for an edge in is associated with the corresponding weight , and captures whether is included in . To improve efficiency, we create a pairwise variable for only if is larger than a certain threshold. These unary and pairwise variables are then used to define various types of constraints and preferences, as discussed next.
To make the definitions clearer, we introduce all basic variables used in our optimization in Table 2, and will use them later to define constraints explicitly. We use the notation to refer to a unary variable parameterized by a single element of the optimization, and to refer to a pairwise variable parameterized by a pair of elements. Unary variables represent the presence of a specific element as a node in the support graph . For example if and only if the table is active in . Similarly, if and only if the corresponding edge is present in , which we alternatively refer to as an alignment between cell of table and the -th constituent of the question.
|Basic Pairwise Activity Variables|
|cell to cell|
|cell to question constituent|
|header to question constituent|
|cell to answer option|
|header to answer option|
|column to answer option|
|table to answer option|
|column to column relation|
|High-level Unary Variables|
|active column header|
|active question constituent|
|active answer option|
As previously mentioned, in practice we do not create all possible pairwise variables. Instead we choose the pairs alignment score exceeds a pre-set threshold. For example, we create only if .444An exhaustive list of the minimum alignment thresholds for creating pairwise variables is in Table 10 in the appendix.
The objective function is a weighted linear sum over all variables instantiated for a given question answering problem.555The complete list of weights for unary and pairwise variables is included in Table 9 in the appendix. A small set of auxiliary variables is defined for linearizing complicated constraints.
Constraints are a significant part of our model, used for imposing the desired behavior on the support graph. Due to lack of space, we discuss only a representative subset here.666The complete list of the constraints is explained in Table 13 in the appendix.
Some constraints relate variables to each other. For example, unary variables are defined through constraints that relate them to the corresponding pairwise variables. For instance, for active row variable , we ensure that it is 1 if and only if at least one cell in row is active:
where is collection of pairwise variables with one end in row of table .
In the remainder of this section, we outline some of the important characteristics we expect in our model, and provide details of a few illustrative constraints.
3.3.1 Basic Lookup
Consider the following question:
Which characteristic helps a fox find food? (A) sense of smell (B) thick fur (C) long tail (D) pointed teeth
In order to answer such lookup-style questions, we generally seek a row with the highest aggregate alignment to question constituents. We achieve this by incorporating the question-table alignment variables with the alignment scores, , as coefficients and the active question constituents variable with a constant coefficient in the objective function. Since any additional question-table edge with a positive entailment score (even to irrelevant tables) in the support graph would result in an increase in the score, we disallow tables with alignments only to the question (or only to a choice) and add a small penalty for every table used in order to reduce noise in the support graph. We also limit the maximum number of alignments of a question constituent and table cells, in order to prevent one constituent or cell from having a large influence on the objective function and thereby the solution:
where is the set of all pairwise variables with one end in the question constituent .
3.3.2 Parallel Evidence
For certain questions, evidence needs to be combined from multiple rows of a table. For example,
Sleet, rain, snow, and hail are forms of (A) erosion (B) evaporation (C) groundwater (D) precipitation
To answer this question, we need to combine evidence from multiple table entries from the weather terms table, (term, type), namely (sleet, precipitation), (rain, precipitation), (snow, precipitation), and (hail, precipitation). To achieve this, we allow multiple active rows in the support graph. Similar to the basic constraints, we limit the maximum number of active rows per table and add a penalty for every active row to ensure only relevant rows are considered for reasoning :
To encourage only coherent parallel evidence within a single table, we limit our support graph to always use the same columns across multiple rows within a table, i.e., every active row has the active cells corresponding to the same set of columns.
3.3.3 Evidence Chaining
Questions requiring chaining of evidence from multiple tables, such as the example in Figure 1, are typically the most challenging in this domain. Chaining can be viewed as performing a join between two tables. We introduce alignments between cells across columns in pairs of tables to allow for chaining of evidence. To help minimize potential noise introduced by chaining irrelevant facts, we add a penalty for every inter-table alignment and also rely on the 0/1 weights of header-to-header edges to ensure only semantically meaningful table joins are considered.
3.3.4 Semantic Relation Matching
Our constraints so far have only looked at the content of the table cells, or the structure of the support graph, without explicitly considering the semantics of the table schema. By using alignments between the question and column headers (i.e., type information), we exploit the table schema to prefer alignments to columns relevant to the “topic” of the question. In particular, for questions of the form “which X ”, we prefer answers that directly entail X or are connected to cells that entail X. However, this is not sufficient for questions such as:
What is one way to change water from a liquid to a solid? (A) decrease the temperature (B) increase the temperature (C) decrease the mass (D) increase the mass
Even if we select the correct table, say that describes the initial and final states for a phase change event, both choice (A) and choice (B) would have the exact same score in the presence of table rows (increase temperature, solid, liquid) and (decrease temperature, liquid, solid). The table, however, does have the initial vs. final state structure. To capture this semantic structure, we annotate pairs of columns within certain tables with the semantic relationship present between them. In this example, we would annotate the phase change table with the relations: changeFrom, changeTo, and fromTo.
Given such semantic relations for table schemas, we can now impose a preference towards question-table alignments that respect these relations. We associate each semantic relation with a set of linguistic patterns describing how it might be expressed in natural language. TableILP then uses these patterns to spot possible mentions of the relations in the question . We then add the soft constraint that for every pair of active columns in a table (with an annotated semantic relation) aligned to a pair of question constituents, there should be a valid expression of that relation in between those constituents. In our example, we would match the relation fromTo(liquid, solid) in the table to “liquid to a solid” in the question via the pattern “X to a Y” associated with fromTo(X,Y), and thereby prefer aligning with the correct row (decrease temperature, liquid, solid).
We compare our approach to three existing methods, demonstrating that it outperforms the best previous structured approach [Khot et al.2015] and produces a statistically significant improvement when used in combination with IR-based methods [Clark et al.2016]. For evaluations, we use a 2-core 2.5 GHz Amazon EC2 linux machine with 16 GB RAM.
Question Set. We use the same question set as aristo2016:combining aristo2016:combining, which consists of all non-diagram multiple-choice questions from 12 years of the NY Regents 4th Grade Science exams.777These are the only publicly available state-level science exams. http://www.nysedregents.org/Grade4/Science/home.html The set is split into 108 development questions and 129 hidden test questions based on the year they appeared in (6 years each). All numbers reported below are for the hidden test set, except for question perturbation experiments which relied on the 108 development questions.
Test scores are reported as percentages. For each question, a solver gets a score of if it chooses the correct answer and if it reports a
-way tie that includes the correct answer. On the 129 test questions, a score difference of 9% (or 7%) is statistically significant at the 95% (or 90%, resp.) confidence interval based on the binomial exact test[Howell2012].
Corpora. We work with three knowledge corpora:
Web Corpus: This corpus contains tokens (280 GB of plain text) extracted from Web pages. It was collected by Charles Clarke at the University of Waterloo, and has been used previously by turney2013distributional turney2013distributional and aristo2016:combining aristo2016:combining. We use it here to compute statistical co-occurrence values for the PMI solver.
Sentence Corpus [Clark et al.2016]: This includes sentences from the Web corpus above, as well as around 80,000 sentences from various domain-targeted sources for elementary science: a Regents study guide, CK12 textbooks (www.ck12.org), and web sentences with similar content as the course material.
Table Corpus (cf. Section 3.1): This includes 65 tables totaling around 5,000 rows, designed based on the development set and study guides, as well as 4 Open IE-style [Banko et al.2007] automatically generated tables totaling around 2,600 rows.888 Table Corpus and the ILP model are available at allenai.org.
TableILP (our approach). Given a question , we select the top 7 tables from the Table Corpus using the the standard TF-IDF score of with tables treated as bag-of-words documents. For each selected table, we choose the 20 rows that overlap with the most. This filtering improves efficiency and reduces noise. We then generate an ILP and solve it using the open source SCIP engine [Achterberg2009], returning the active answer option from the optimal solution. To check for ties, we disable , re-solve the ILP, and compare the score of the second-best answer, if any, with that of .
MLN Solver (structured inference baseline). We consider the current state-of-the-art structured reasoning method developed for this specific task by aristo2015:mln aristo2015:mln. We compare against their best performing system, namely Praline, which uses Markov Logic Networks [Richardson and Domingos2006] to (a) align lexical elements of the question with probabilistic first-order science rules and (b) to control inference. We use the entire set of 47,000 science rules from their original work, which were also derived from same domain-targeted sources as the ones used in our Sentence Corpus.
IR Solver (information retrieval baseline). We use the IR baseline by aristo2016:combining aristo2016:combining, which selects the answer option that has the best matching sentence in a corpus. Specifically, for each answer option , the IR solver sends as a query to a search engine (we use Lucene) on the Sentence Corpus, and returns the search engine’s score for the top retrieved sentence , where must have at least one non-stopword overlap with , and at least one with . The option with the highest Lucene score is returned as the answer.
PMI Solver (statistical co-occurrence baseline). We use the PMI-based approach by aristo2016:combining aristo2016:combining, which selects the answer option that most frequently co-occurs with the question words in a corpus. Specifically, it extracts unigrams, bigrams, trigrams, and skip-bigrams from the question and each answer option. For a pair of -grams, their pointwise mutual information (PMI) [Church and Hanks1989] in the corpus is defined as where is the co-occurrence frequency of and (within some window) in the corpus. The solver returns the answer option that has the largest average PMI in the Web Corpus, calculated over all pairs of question -grams and answer option -grams.
We first compare the accuracy of our approach against the previous structured (MLN-based) reasoning solver. We also compare against IR(tables), an IR solver using table rows expressed as sentences, thus embodying an unstructured approach operating on the same knowledge as TableILP.
|Solver||Test Score (%)|
As Table 3 shows, among the two structured inference approaches, TableILP outperforms the MLN baseline by 14%. The preliminary ILP system reported by aristo2016:combining aristo2016:combining achieves only a score of 43.8% on this question set. Further, given the same semi-structured knowledge (i.e., the Table Corpus), TableILP is substantially (+10%) better at exploiting the structure than the IR(tables) baseline, which, as mentioned above, uses the same data expressed as sentences.
4.2.1 Complementary Strengths
While their overall score is similar, TableILP and IR-based methods clearly approach QA very differently. To assess whether TableILP adds any new capabilities, we considered the 50 (out of 129) questions incorrectly answered by PMI solver (ignoring tied scores). On these unseen but arguably more difficult questions, TableILP answered 27 questions correctly, achieving a score of 54% compared to the random chance of 25% for 4-way multiple-choice questions. Results with IR solver were similar: TableILP scored 24.75 on the 52 questions incorrectly answered by IR (i.e., 47.6% accuracy).
|Solver||Test Score (%)|
|TableILP + IR||66.1|
|TableILP + PMI||67.6|
|TableILP + IR+ PMI||69.0|
This analysis highlights the complementary strengths of these solvers. Following aristo2016:combining aristo2016:combining, we create an ensemble of TableILP, IR, and PMI solvers, combining their answer predictions using a simple Logistic Regression model trained on the development set. This model uses 4 features derived from each solver’s score for each answer option, and 11 features derived from TableILP’s support graphs. 999Details of the 11 features may be found in the Appendix B. Table 4 shows the results, with the final combination at 69% representing a significant improvement over individual solvers.
4.2.2 ILP Solution Properties
Table 5 summarizes various ILP and support graph statistics for TableILP, averaged across all test questions.
The optimization model has around 50 high-level constraints, which result, on average, in around 4000 inequalities over 1000 variables. Model creation, which includes computing pairwise entailment scores using WordNet, takes 1.9 seconds on average per question, and the resulting ILP is solved by the SCIP engine in 2.1 seconds (total for all four options), using around 1,300 LP iterations for each option.101010Commercial ILP solvers (e.g., CPLEX, Gurobi) are much faster than the open-source SCIP solver we used for evaluations. Thus, TableILP takes only 4 seconds to answer a question using multiple rows across multiple tables (typically 140 rows in total), as compared to 17 seconds needed by the MLN solver for reasoning with four rules (one per answer option).
|[t] ILP complexity||#variables||1043.8|
|[t] Knowledge use||#rows||2.3|
|[t] Timing stats||model creation||1.9 sec|
|solving the ILP||2.1 sec|
While the final support graph on this question set relies mostly on a single table to answer the question, it generally combines information from more than two rows (2.3 on average) for reasoning. This suggests parallel evidence is more frequently used on this dataset than evidence chaining.
4.3 Ablation Study
To quantify the importance of various components of our system, we performed several ablation experiments, summarized in Table 6 and described next.
|Solver||Test Score (%)|
|No Multiple Row Inference||51.0|
|No Relation Matching||55.6|
|[t]||No Open IE Tables||52.3|
|No Lexical Entailment||50.5|
No Multiple Row Inference: We modify the ILP constraints to limit inference to a single row (and hence a single table), thereby disallowing parallel evidence and evidence chaining (Section 3.3). This drops the performance by 10.5%, highlighting the importance of being able to combine evidence from multiple rows (which would correspond to multiple sentences in a corpus) from one or more tables.
No Relation matching: To assess the importance of considering the semantics of the table, we remove the requirement of matching the semantic relation present between columns of a table with its lexicalization in the question (Section 3.3). The 6% drop indicates TableILP relies strongly on the table semantics to ensure creating meaningful inferential chains.
No Open IE tables: To evaluate the impact of relatively unstructured knowledge from a large corpus, we removed the tables containing Open IE extractions (Section 3.2). The 9% drop in the score shows that this knowledge is important and TableILP is able to exploit it even though it has a very simple triple structure. This opens up the possibility of extending our approach to triples extracted from larger knowledge bases.
No Lexical Entailment: Finally, we test the effect of changing the alignment metric (Section 3.2) from WordNet based scores to a simple asymmetric word-overlap measured as . Relying on just word-matching results in an 11% drop, which is consistent with our knowledge often being defined in terms of generalities.
4.4 Question Perturbation
One desirable property of QA systems is robustness to simple variations of a question, especially when a variation would make the question arguably easier for humans.
To assess this, we consider a simple, automated way to perturb each 4-way multiple-choice question: (1) query Microsoft’s Bing search engine (www.bing.com) with the question text and obtain the text snippet of the top 2,000 hits; (2) create a list of strings by chunking and tokenizing the results; (3) remove stop words and special characters, as well as any words (or their lemma) appearing in the question; (4) sort the remaining strings based on their frequency; and (5) replace the three incorrect answer options in the question with the most frequently occurring strings, thereby generating a new question. For instance:
In New York State, the longest period of daylight occurs during which month? (A) eastern (B) June (C) history (D) years
As in this example, the perturbations (italicized) are often not even of the correct “type”, typically making them much easier for humans. They, however, still remain difficult for solvers.
|Original||% Drop with Perturbation|
|Solver||Score (%)||[t] absolute||relative|
For each of the 108 development questions, we generate 10 new perturbed questions, using the 30 most frequently occurring words in step (5) above. While this approach can introduce new answer options that should be considered correct as well, only 3% of the questions in a random sample exhibited this behavior. Table 7 shows the performance of various solvers on the resulting 1,080 perturbed questions. As one might expect, the PMI approach suffers the most at a 33% relative drop. TableILP’s score drops as well (since answer type matching isn’t perfect), but only by 12%, attesting to its higher resilience to simple question variation.
Answering real science questions is a challenging task because they are posed in natural language, require extensive domain knowledge, and often require combining multiple facts together.
We presented TableILP, a system that can answer such questions, using a semi-structured knowledge base. We treat QA as a subgraph selection problem and then
formulate this as an ILP optimization.
Most importantly, this formulation allows multiple, semi-formally expressed facts to be combined to answer questions, a capability outside the scope of IR-based QA systems.
In our experiments, this approach significantly outperforms both the previous best attempt at structured reasoning for this task,
and an IR engine provided with the same knowledge. It also significantly boosts performance when combined with unstructured methods (IR and PMI).
These results suggest that the approach is both viable and promising for natural language question answering.
D.K. is in part supported by AI2 and Google. The authors would like to thank Christos Christodoulopoulos, Sujay Jauhar, Sam Skjonsberg, and the Aristo Team at AI2 for invaluable discussions and insights.
- [Achterberg2009] T. Achterberg. SCIP: solving constraint integer programs. Math. Prog. Computation, 1(1):1–41, 2009.
- [Angeli and Manning2014] G. Angeli and C. D. Manning. NaturalLI: Natural logic inference for common sense reasoning. In EMNLP, 2014.
- [Banko et al.2007] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI, 2007.
- [Bentivogli et al.2008] L. Bentivogli, P. Clark, I. Dagan, and D. Giampiccolo. The sixth pascal recognizing textual entailment challenge. In TAC, 2008.
- [Berant et al.2010] J. Berant, I. Dagan, and J. Goldberger. Global learning of focused entailment graphs. In ACL, pg. 1220–1229, 2010.
- [Bollacker et al.2008] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.
- [Brill et al.2002] E. Brill, S. Dumais, and M. Banko. An analysis of the AskMSR question-answering system. In Proceedings of EMNLP, pg. 257–264, 2002.
- [Carlson et al.2010] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
- [Chang et al.2010] M.-W. Chang, D. Goldwasser, D. Roth, and V. Srikumar. Discriminative learning over constrained latent representations. In 2010 NAACL, pg. 429–437, 2010.
- [Church and Hanks1989] K. W. Church and P. Hanks. Word association norms, mutual information and lexicography. In 27th ACL, pg. 76–83, 1989.
- [Clark et al.2014] P. Clark, N. Balasubramanian, S. Bhakthavatsalam, K. Humphreys, J. Kinkead, A. Sabharwal, and O. Tafjord. Automatic construction of inference-supporting knowledge bases. In 4th AKBC Workshop, Montreal, Canada, Dec 2014.
- [Clark et al.2016] P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. Turney, and D. Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In 30th AAAI, 2016.
- [Clark2015] P. Clark. Elementary school science and math tests as a driver for AI: take the Aristo challenge! In 29th AAAI/IAAI, pg. 4019–4021, Austin, TX, 2015.
- [Cohen2000] W. W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288–321, 2000.
- [Dagan et al.2010] I. Dagan, B. Dolan, B. Magnini, and D. Roth. Recognizing textual entailment: Rational, evaluation and approaches–erratum. Natural Language Engineering, 16(01):105–105, 2010.
- [Dagan et al.2013] I. Dagan, D. Roth, M. Sammons, and F. M. Zanzotto. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220, 2013.
- [Dalvi et al.2016] B. Dalvi, S. Bhakthavatsalam, and P. Clark. IKE - an interactive tool for knowledge extraction. In 5th AKBC Workshop, 2016.
- [Fader et al.2014] A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of SIGKDD, pg. 1156–1165, 2014.
- [Ferrucci et al.2010] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.
- [Goldwasser and Roth2011] D. Goldwasser and D. Roth. Learning from natural instructions. In IJCAI, 2011.
- [Gunning et al.2010] D. Gunning, V. Chaudhri, P. Clark, K. Barker, J. Chaw, and M. Greaves. Project Halo update - progress toward digital Aristotle. AI Magazine, 31(3), 2010.
- [Howell2012] D. Howell. Statistical methods for psychology. Cengage Learning, 2012.
- [Khot et al.2015] T. Khot, N. Balasubramanian, E. Gribkoff, A. Sabharwal, P. Clark, and O. Etzioni. Exploring Markov logic networks for question answering. In 2015 EMNLP, Lisbon, Portugal, Sep 2015.
- [Ko et al.2007] J. Ko, E. Nyberg, and L. Si. A probabilistic graphical model for joint answer ranking in question answering. In Proceedings of SIGIR, pg. 343–350, 2007.
- [Kwok et al.2001] C. C. T. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the web. In WWW, 2001.
- [MacCartney2009] B. MacCartney. Natural Language Inference. PhD thesis, 2009.
- [Miller1995] G. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
- [Novak1977] G. Novak. Representations of knowledge in a program for solving physics problems. In IJCAI-77, 1977.
- [Richardson and Domingos2006] M. Richardson and P. Domingos. Markov logic networks. Machine learning, 62(1–2):107–136, 2006.
- [Roth and Yih2004] D. Roth and W. Yih. A linear programming formulation for global inference in natural language tasks. In CoNLL, pg. 1–8. ACL, 2004.
- [Srikumar and Roth2011] V. Srikumar and D. Roth. A joint model for extended semantic role labeling. In EMNLP, Edinburgh, Scotland, 2011.
P. D. Turney.
Distributional semantics beyond words: Supervised learning of analogy and paraphrase.TACL, 1:353–366, 2013.
- [Yao and Van Durme2014] X. Yao and B. Van Durme. Information extraction over structured data: Question answering with Freebase. In 52nd ACL, 2014.
- [Yih et al.2014] W. Yih, X. He, and C. Meek. Semantic parsing for single-relation question answering. In ACL (2), pg. 643–648, 2014.
- [Zou et al.2014] L. Zou, R. Huang, H. Wang, J. X. Yu, W. He, and D. Zhao. Natural language question answering over RDF: a graph data driven approach. In SIGMOD, pg. 313–324, 2014.
Appendix A Appendix: ILP Model for TableILP
As it is widely known an ILP can be written as the following:
Variables: We start with a brief overview of the basic variables and how they are combined into high level variables.
Table 8 summarizes our notation to refer to various elements of the problem, such as for cell of table , as defined in Section 3. We define variables over each element by overloading or
notation which refer to a binary variable on elements or their pair, respectively. Table2 contains the complete list of basic variables in the model, all of which are binary. The pairwise variables are defined between pairs of elements; e.g., takes value 1 if and only if the corresponding edge is present in the support graph. Similarly, if a node corresponding to an element of the problem is present in the support graph, we will refer to that element as being active.
|[t] Reference||[t] Description|
|[t]||index over tables|
|index over table rows|
|index over table columns|
|index over lexical constituents of question|
|index over answer options|
|[t]||a unary variable|
|a pairwise variable|
In practice we do not create pairwise variables for all possible pairs of elements; instead we create pairwise variables for edges that have an entailment score exceeding a threshold. For example we create the pairwise variables only if . An exhaustive list of the minimum alignment thresholds for creating pairwise variables is in Table 10.
Table 2 also includes some high level unary variables, which help conveniently impose structural constraints on the support graph we seek. An example is the active row variable which should take value 1 if and only if at least a cell in row of table .
Any of the binary variables defined in our problem are included in the final weighted linear objective function. The weights of the variables in the objective function (i.e. the vectorw in Equation 1) are set according to Table 9. In addition to the current set of variables, we introduce auxiliary variables for certain constraints. Defining auxiliary variables is a common trick for linearizing more intricate constraints at the cost of having more variables.
Constraints: Constraints are significant part of our model in imposing the desirable behaviors for the support graph (cf. Section 3.1).
The complete list of the constraints is explained in Table 13. While groups of constraints are defined for different purposes, it is hard to partition them into disjoint sets of constraints. Here we give examples of some important constraint groups.
Active variable constraints: An important group of constraints relate variables to each other. The unary variables are defined through constraints that relate them to the basic pairwise variables. For example, active row variable should be active if and only if any cell in row is active. (constraint 15, Table 13).
Correctness Constraints: A simple, but important set of constraints force the basic correctness principles on the final answer. For example should contain exactly one answer option which is expressed by constraint 27, Table 13. Another example is that, should contain at least a certain number of constituents in the question, which is modeled by constraint 30, Table 13.
Sparsity Constraints: Another group of constraint induce simplicity (sparsity) in the output. For example should use at most a certain number of knowledge base tables (constraint 28, Table 13), since letting the inference use any table could lead to unreasonably long, and likely error-prone, answer chains.
Appendix B Appendix: Features in Solver Combination
To combine the predictions from all the solvers, we learn a Logistic Regression model [Clark et al.2016] that returns a probability for an answer option,
that returns a probability for an answer option,, being correct based on the following features.
Solver-independent features: Given the solver scores for all the answer options , we generate the following set of features for the answer option , for each of the solvers:
Normalized score =
Softmax score =
Best Option, set to if this is the top-scoring option =
TableILP-specific features: Given the proof graph returned for an option, we generate the following 11 features apart from the solver-independent features:
Average alignment score for question constituents
Minimum alignment score for question constituents
Number of active question constituents
Fraction of active question constituents
Average alignment scores for question choice
Sum of alignment scores for question choice
Number of active table cells
Average alignment scores across all the edges
Minimum alignment scores across all the edges
Log of number of variables in the ILP
Log of number of constraints in the ILP
|Collection of basic variables connected to header column of table :||
|Collection of basic variables connected to cell of table :||
|Collection of basic variables connected to column of table||
|Collection of basic variables connected to row of table :||
|Collection of non-choice basic variables connected to row of table :||
|Collection of non-question basic variables connected to row of table :||
|Collection of basic variables connected to table :||
|Collection of non-choice basic variables connected to table :||
|Collection of basic variables connected to question constituent :||
|Collection of basic variables connected to option||
|Collection of basic variables in column of table connected to option :||
|If any cell in row of table is active, the row should be active.||
|If the row of table is active, at least one cell in that row must be active as well.||
|Column header should be active if any of the basic variables with one end in this column header are active.||
|If the header of column variable is active, at least one basic variable with one end in the end in the header||
|Column is active if at least one of the basic variables with one end in this column are active.||
|If the column is active, at least one of the basic variables with one end in this column should be active.||
|If a basic variable with one end in table is active, the table variable is active.||
|If the table is active, at least one of the basic variables with one end in the table should be active.||
|If any of the basic variables with one end in option are on, the option should be active as well.||
|If the question option is active, there is at least one active basic element connected to it||
|If any of the basic variables with one end in the constituent , the constituent must be active.||
|If the constituent is active, at least one basic variable connected to it must be active.||
|Choose only a single option.||
|There is an upper-bound on the number of active tables; this is to limit the solver and reduce the chance of using spurious tables.||
|The number of active rows in each table is upper-bounded.||
|The number of active constituents in each question is lower-bounded. Clearly We need to use the question definition in order to answer a question.||
|A cell is active if and only if the sum of coefficients of all external alignment to it is at least a minimum specified value||
|A title is active if and only if the sum of coefficients of all external alignment to it is at least a minimum specified value||
|If a column is active, at least one of its cells must be active as well.||
|At most a certain number of columns can be active for a single option||
|If a column is active for a choice, the table is active too.||
|If a table is active for a choice, there must exist an active column for choice.||
|If a table is active for a choice, there must be some non-choice alignment.||
|Answer should be present in at most a certain number of tables||
|If a cell in a column, or its header is aligned with a question option, the column is active for question option as well.||
|If a column is active for an option, there must exist an alignment to header or cell in the column.||
|At most a certain number of columns may be active for question option in a table.||
|If a column is active for a choice, the table is active for an option as well.||
|If the table is active for an option, at least one column is active for a choice||
|Create an auxiliary variable with objective weight 1.5 and activate it, if there a “which” term in the question.||
|Create an auxiliary variable with objective weight 1.5. Add a boost if at least one of the table cells/title aligning to the choice happens to have a good alignment () with the “which” terms, i.e. WhichTermSpan constituents after “which”.||
|A question constituent may not align to more than a certain number of cells||
|Disallow aligning a cell to two question constituents if they are too far apart; in other words add the following constraint if the two constituents and are more than qConsCoalignMaxDist apart from each other:||
|For any two two question constraints that are not more than qConsCoalignMaxDist apart create an auxiliary binary variable and set its weight in the objective function to be , where and are the indices of the two question constituents. With this we boost objective score if a cell aligns to two question constituents that are within a few words of each other||
|If a relation match is active, both the columns for the relation must be active||
|If a column is active, a relation match connecting to the column must be active||
|If a relation match is active, the column cannot align to the question in an invalid position||
|If a row is active, at least a certain number of its cells must be active||
|If row is active, it must have non-choice alignments.||
|If row is active, it must have non-question alignments||
|If two rows of a table are active, the corresponding active cell variables across the two rows must match; in other words, the two rows must have identical activity signature||
|If two rows are active, then at least one active column in which they differ (in tokenized form) must also be active; otherwise the two rows would be identical in the proof graph.||
|If a table is active and another table is also active, at least one inter-table active variable must be active;||