Answering Science Exam Questions Using Query Rewriting with Background Knowledge

09/15/2018 ∙ by Ryan Musa, et al. ∙ IBM ∙ University of Illinois at Urbana-Champaign

Open-domain question answering (QA) is an important problem in AI and NLP that is emerging as a bellwether for progress on the generalizability of AI methods and techniques. Much of the progress in open-domain QA systems has been realized through advances in information retrieval methods and corpus construction. In this paper, we focus on the recently introduced ARC Challenge dataset, which contains 2,590 multiple choice questions authored for grade-school science exams. These questions are selected to be the most challenging for current QA systems, and current state of the art performance is only slightly better than random chance. We present a system that rewrites a given question into queries that are used to retrieve supporting text from a large corpus of science-related text. Our rewriter is able to incorporate background knowledge from ConceptNet and -- in tandem with a generic textual entailment system trained on SciTail that identifies support in the retrieved results -- outperforms several strong baselines on the end-to-end QA task despite only being trained to identify essential terms in the original source question. We use a generalizable decision methodology over the retrieved evidence and answer candidates to select the best answer. By combining query rewriting, background knowledge, and textual entailment our system is able to outperform several strong baselines on the ARC dataset.




1 Introduction

The recently released AI2 Reasoning Challenge (ARC) and accompanying ARC Corpus [Clark et al.2018] is an ambitious test for AI systems that perform open-domain question answering (QA). This dataset consists of multiple choice questions authored for grade-school science exams, partitioned into an Easy set and a Challenge set of 2,590 questions. The Challenge set comprises questions that cannot be answered correctly by either a Pointwise Mutual Information (PMI-based) solver or an Information Retrieval (IR-based) solver. Clark et al. (2018) also note that the simple information retrieval (IR) methodology (Elasticsearch) that they use is a key weakness of current systems, and posit that 95% of the questions can be answered using the ARC corpus.

ARC has proved a difficult dataset to perform well on, particularly the Challenge set: existing systems like KG [Zhang et al.2018] achieve 31.70% accuracy on the test partition. Older models such as DecompAttn [Parikh et al.2016] and BiDAF [Seo et al.2017] that have shown good performance on other datasets – e.g. SQuAD [Rajpurkar et al.2016] – perform only 1-2% above random chance. The seeming intractability of the ARC Challenge dataset has only very recently shown signs of yielding to new techniques released online in early September 2018. We report full numbers for all these systems in addition to our top-performing system in Table 1.

Model                                    ARC-Challenge Test   ARC-Easy Test
ET-RR [Ni et al.2018]                    36.36                -
BiLSTM Max-Out [Mihaylov et al.2018]     33.87                -
NCRF++/match-LSTM                        33.20                52.22
KG [Zhang et al.2018]                    31.70                -
DGEM [Khot, Sabharwal, and Clark2018]    27.11                58.97
TableILP [Khashabi et al.2016]           26.97                36.15
BiDAF [Seo et al.2017]                   26.54                50.11
DecompAtt [Parikh et al.2016]            24.34                58.27

Table 1: Comparison of our system with state-of-the-art systems for the ARC dataset. Numbers taken from the ARC Leaderboard as of Sept. 5, 2018 [Clark et al.2018]. A dash marks a score that is not reported.

An important avenue of attack on ARC was identified by Boratko et al. (2018), who examined the knowledge and reasoning requirements for answering questions in the ARC dataset. The authors note that “simple reformulations to the query can greatly increase the quality of the retrieved sentences”. They quantitatively measure the effectiveness of such an approach by demonstrating a 42% increase in score on ARC-Easy using a pre-trained version of the DrQA model [Chen et al.2017].

Recently, the top performing systems for ARC use natural language inference (NLI) models to answer the questions [Zhang et al.2018]. NLI models have improved state of the art performance on a number of important NLP tasks [Yin, Roth, and Schütze2018, Parikh et al.2016, Chen et al.2018] and have gained recent popularity due to the release of large datasets [Bowman et al.2015, Khot, Sabharwal, and Clark2018, Williams, Nangia, and Bowman2018]. The NLI task – also sometimes known as recognizing textual entailment – is to determine whether a given natural language hypothesis can be inferred from a natural language premise. The NLI problem is often cast as a classification problem: given a hypothesis and premise, classify the relationship between the sentences as either entailment, contradiction, or neutral.

While finalizing this paper for submission, two new systems were posted to the ARC Leaderboard: the ET-RR system of Ni et al. (2018) reaches 43.29% dev/36.36% test accuracy on the Challenge set, while Mihaylov et al. (2018) report 33.87% accuracy on the test partition for their BiLSTM Max-Out system. As in this paper, Ni et al. (2018) pursue the approach suggested by Boratko et al. (2018) in learning how to transform a natural-language question into a query for which an IR system can return a higher-quality selection of results. Both of these systems use entailment models similar to our match-LSTM [Wang and Jiang2016a] model, but also incorporate additional co-attention between questions, candidate answers, and the retrieved evidence. Though their reported accuracies exceed our own, we provide a detailed ablation study in order to better understand the nuances of the ARC dataset (such as a sensitivity to over-tuning).

Contributions: In order to overcome some of the limitations of existing retrieval-based systems on ARC and other similar corpora, we present an approach that uses the original question to produce a set of reformulations. These reformulations are then used to retrieve additional supporting text, which can then be used to arrive at the correct answer. We couple this with a generic textual entailment model and a robust decision rule to achieve good performance on the ARC dataset. We discuss important lessons learned in the construction of this system and key issues for moving forward on the ARC dataset.

2 Related Work

Teaching machines how to read, reason, and answer questions over natural language text is a long-standing area of research; doing this well has been a very important mission of both the NLP and AI communities. The Watson project [Ferrucci et al.2010] – also known as DeepQA – is perhaps the most famous example of question answering to date. That project involved largely factoid-based questions, and much of its success can be attributed to the quality of the corpus and the NLP tools employed for question understanding. In this section, we look at the most relevant prior work in improving open-domain question answering.

2.1 Datasets

A number of datasets have been proposed for reading comprehension and question answering. Hirschman et al. (1999) manually created a dataset of 3rd and 6th grade reading comprehension questions with short answers. The techniques that were explored for this dataset included pattern matching, rules, and logistic regression. MCTest [Richardson, Burges, and Renshaw2013] is a crowdsourced dataset comprising 660 elementary-level children’s fictional stories, which are the source of questions and multiple choice answers. Questions and answers were constructed with a restricted vocabulary that a 7 year-old could understand. Half of the questions required the answer to be derived from two sentences, with the motivation being to encourage research in multi-hop (one-hop) reasoning. Recent techniques such as those presented by Wang et al. (2015) and Yin et al. (2016) have performed well on this dataset.

Currently, SQuAD [Rajpurkar et al.2016] is one of the most popular datasets for reading comprehension: it uses Wikipedia passages as its source, and question-answer pairs are created using crowdsourcing. While it is stated that SQuAD requires logical reasoning, the complexity of reasoning required is far lower than that required by the AI2 standardized tests dataset [Clark and Etzioni2016, Kembhavi et al.2017]. Some approaches have already attained human-level performance on the first version of SQuAD. More recently, an extended version of SQuAD was released that includes over 50,000 additional questions where the answer cannot be found in source passages [Rajpurkar, Jia, and Liang2018]. While unanswerable questions in SQuAD 2.0 add a significant challenge, the answerable questions are the same (and have the same reasoning complexity) as the questions in the first version of SQuAD. NewsQA [Trischler et al.2016] is another dataset that was created using crowdsourcing; it utilizes passages from news articles to create questions.

Most of the datasets mentioned above are primarily closed world/domain: the answer exists in a given snippet of text that is provided to the system along with the question. On the other hand, in the open domain setting, the question-answer datasets are constructed to encompass the whole pipeline for question answering, starting with the retrieval of relevant documents. SearchQA [Dunn et al.2017] is an effort to create such a dataset; it contains 140K question-answer (QA) pairs. While the motivation was to create an open domain dataset, SearchQA provides text that contains ‘evidence’ (a set of annotated search results) and hence falls short of being a complete open domain QA dataset. TriviaQA [Joshi et al.2017] is another reading comprehension dataset that contains 650K QA pairs with evidence.

Datasets created from standardized science tests are particularly important because they include questions that require complex reasoning techniques to solve. A survey of the knowledge base requirements for answering questions from early science questions was performed by Clark, Harrison, and Balasubramanian (2013). The authors concluded that advanced inference methods were necessary for many of the questions, as they could not be answered by simple fact based retrieval. Partially resulting from that analysis, a number of science-question focused datasets have been released over the past few years. The AI2 Science Questions dataset was introduced by Clark (2015) along with the Aristo Framework, on which we build. This dataset contains over 1,000 multiple choice questions from state and federal science questions for elementary and middle school students. The SciQ Dataset [Welbl, Liu, and Gardner2017] contains 13,679 crowdsourced multiple choice science questions. To construct this dataset, workers were shown a passage and asked to construct a question along with correct and incorrect answer options. The dataset contains both the source passage and the question and answer options.

2.2 Query Expansion & Reformulation

Query expansion and reformulation – particularly in the area of information retrieval (IR) – is well studied [Azad and Deepak2017]. The primary motivation for query expansion and reformulation in IR is that a query may be too short, ambiguous, or ill-formed to retrieve results that are relevant enough to satisfy the information needs of users. In such scenarios, query expansion and reformulation have played a crucial role by generating queries with (possibly) new terms and weights to retrieve relevant results from the IR engine. While there is a long history of research on query expansion [Maron and Kuhns1960], Rocchio’s relevance feedback gave it a new beginning [Rocchio1971]. Query expansion has since been applied to many applications, such as Web Personalization, Information Filtering, and Multimedia IR. In this work, we focus on query expansion as applied to question answering systems.

2.3 Retrieval

Retrieving relevant documents/passages is one of the primary components of open domain question answering systems [Wang et al.2018]. Errors in this initial module are propagated down the line, and have a significant impact on the ultimate accuracy of QA systems. For example, the latest sentence corpus released by AI2 (i.e. the ARC corpus) is estimated by Clark et al. (2018) to contain the answers to 95% of the questions in the ARC dataset. However, even state of the art systems that are not completely IR-based (but use neural or structured representations) perform only slightly above chance on the Challenge set. This is at least partially due to early errors in passage retrieval. Recent work by Buck et al. (2018) and Wang et al. (2018) has identified improving the retrieval modules as the key component in improving state of the art QA systems.

2.4 Contrasts with Our System

Note that both our model and task are fundamentally more difficult and more general than the approaches listed above. SearchQA is slightly closer to the ARC Challenge task as SearchQA was an effort to create a dataset that straddles the line between open and closed domain QA. It contains 140K question-answer (QA) pairs along with text that contains “evidence” (a set of annotated search results), and thus falls short of being a completely open domain QA dataset. Each question in the SearchQA dataset is guaranteed to have sufficient support amongst the searchable corpus. In the ARC dataset and corpus, we have no such guarantee.

ARC Challenge represents a unique obstacle in the open domain QA world as the questions are specifically selected to not be answerable using basic techniques paired with a high quality corpus. Our approach combines current best practice in terms of a focus on retrieving highly salient evidence and judging this evidence using a general NLI model. While extremely recent systems for ARC have taken the same approach [Ni et al.2018, Mihaylov et al.2018], our extensive analysis of both our rewriter and our decision rules sheds new light on this unique dataset.

3 System Overview

Our overall approach encompasses three discrete elements: the Rewriter, which transforms a question into a set of queries; the Retriever, which uses those queries to obtain relevant passages from a text corpus; and the Resolver, which uses the question and the retrieved passages to select the final answer(s).

More concretely, a question together with its set of candidate answers is passed into the Rewriter module. This module uses a Term Selector which (optionally) incorporates background knowledge in the form of ConceptNet or other embeddings to generate a set of rewritten queries. In our system, as with most other systems for ARC Challenge [Clark et al.2018], we generate one query per candidate answer for each question: every query uses the same set of terms, with one of the answers appended to the end. This set of queries is then passed to a Retriever, which issues the search over a knowledge base and retrieves a set of relevant passages per query; together these form the set of passages that is passed to the Resolver. The Resolver contains two components: the Entailment Model and the Decision Function. We use match-LSTM [Wang and Jiang2016a] trained on SciTail [Khot, Sabharwal, and Clark2018] for our entailment model, and for each passage passed in we compute the probability that each answer is entailed from the given passage and question. This information is passed to the Decision Function, which selects a non-empty set of answers that is returned.
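The query construction described above (one query per candidate answer, each sharing the same selected terms) can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `build_queries` and the example terms are hypothetical.

```python
def build_queries(selected_terms, answers):
    """Return one query string per candidate answer.

    Every query shares the same selected question terms and differs
    only in the candidate answer appended at the end.
    """
    base = " ".join(selected_terms)
    return [f"{base} {answer}".strip() for answer in answers]


# Hypothetical example: terms kept by the Rewriter plus two answer options.
queries = build_queries(
    ["plants", "energy", "sunlight"],
    ["photosynthesis", "respiration"],
)
print(queries)
```

Because each query is tied to exactly one candidate answer, every retrieved passage can later be traced back to the answer whose query produced it.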

Figure 1: Seq2Seq Query Rewrite Model. A sequence of terms from the original query is translated into a sequence of 0s and 1s which serves as a mask used to select the most salient terms.



[Figure 2 diagram; labeled components: Term Selector, ConceptNet Embeddings, Entailment Model, Decision Function, ARC Corpus]

Figure 2: Our overall system architecture. A question with a set of answers is passed into the Rewriter module. This module is composed of a Term Selector which (optionally) incorporates background knowledge to generate a set of rewritten queries. This set of queries is then passed to a Retriever which issues the search over a knowledge base to retrieve a set of relevant passages which are passed to the Resolver. The Resolver contains two components, the Entailment Model and the Decision Function, and selects a non-empty set of answer options.

3.1 Rewriter Module

For the Rewriter module, we investigate and evaluate two different approaches to rewrite queries by retaining only their most salient terms: a sequence to sequence model similar to that of Sutskever et al. (2014), and models based on the recent work by Yang and Zhang (2018) on Neural Sequence Labeling.

Seq2Seq Model for Selecting Relevant Terms.

We first consider a simple sequence to sequence model, shown in Figure 1, that translates a sequence of terms in an input query into a sequence of 0s and 1s of the same length. The input terms are passed to an encoder layer through an embedding layer initialized with pre-trained embeddings (e.g., GloVe [Pennington, Socher, and Manning2014]). The outputs of the encoder layer are decoded, using an attention mechanism [Bahdanau, Cho, and Bengio2014], into the resulting sequence of 0s and 1s that is used as a mask to select the most salient terms in the input query. Both the encoder and decoder layers are implemented with a single hidden bidirectional GRU layer.
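To make the role of the 0/1 output sequence concrete, the sketch below applies such a mask to the tokens of a question. The mask here is written by hand for illustration; in the system it would be predicted by the model. The name `apply_mask` is hypothetical.

```python
def apply_mask(tokens, mask):
    """Keep only the tokens whose mask entry is 1."""
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    return [token for token, keep in zip(tokens, mask) if keep == 1]


# Hand-written mask for illustration only.
tokens = ["which", "gas", "do", "plants", "absorb", "from", "the", "air"]
mask = [0, 1, 0, 1, 1, 0, 0, 1]
print(apply_mask(tokens, mask))
```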

NCRF++ Approach to Selecting Relevant Terms.

Our second approach to identifying salient terms comprises four models implemented with the NCRF++ sequence-labeling toolkit of Yang and Zhang (2018). The basic NCRF++ model uses a bi-directional LSTM with a single hidden layer where the input at each token is its 300-dimensional pre-trained GloVe embedding [Pennington, Socher, and Manning2014]. Additional models incorporate background knowledge in the form of graph embeddings derived using the ConceptNet knowledge base [Speer, Chin, and Havasi2017] using three algorithms: TransH [Wang et al.2014], CompleX [Trouillon et al.2016], and the PPMI embeddings released with ConceptNet. Entities are linked with the text by matching their surface form with phrases of up to three words. For each token in the question, we concatenate its word embedding with a 10-dimensional vector indicating whether the token is part of the surface form of a ConceptNet entity. We then append either the 300-dimensional vector corresponding to the embedding of that entity in ConceptNet, or a single randomly initialized UNK vector when a token is not linked to an entity. The final prediction is performed left-to-right using a CRF layer that takes into account the preceding label. We train the models for 50 iterations using SGD with a learning rate of 0.015 and learning rate decay of 0.05.
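The per-token input described above can be illustrated with a small sketch. The dimensions (300 + 10 + 300) come from the text; how the 10-dimensional indicator is encoded is not stated, so the all-ones/all-zeros flag below is an assumption, as are the names `token_features` and `UNK_ENTITY` and the initialization range of the UNK vector.

```python
import random

WORD_DIM, FLAG_DIM, ENTITY_DIM = 300, 10, 300

# A single shared, randomly initialized vector for unlinked tokens.
# The uniform range is an assumption made for this sketch.
_rng = random.Random(0)
UNK_ENTITY = [_rng.uniform(-0.1, 0.1) for _ in range(ENTITY_DIM)]


def token_features(word_vec, entity_vec=None):
    """Concatenate word embedding, entity-link flag, and entity embedding."""
    linked = entity_vec is not None
    # Assumed encoding: ones if the token is part of a linked entity, else zeros.
    flag = [1.0 if linked else 0.0] * FLAG_DIM
    entity_part = list(entity_vec) if linked else UNK_ENTITY
    return list(word_vec) + flag + entity_part


word = [0.5] * WORD_DIM
entity = [0.25] * ENTITY_DIM
print(len(token_features(word, entity)))  # total input dimension per token
```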

Method                                 Acc    Pr.    Re.    F1
ET Classifier [Khashabi et al.2017]    0.75   0.91   0.71   0.80
ET Net [Ni et al.2018]                 -      0.74   0.90   0.81
Seq2Seq - 6B.50d                       0.75   0.52   0.23   0.32
Seq2Seq - 6B.100d                      0.76   0.54   0.46   0.50
Seq2Seq - 840B.300d                    0.77   0.57   0.42   0.49
NCRF++                                 0.88   0.73   0.80   0.77
CompleX                                0.88   0.74   0.80   0.77
TransH                                 0.88   0.75   0.77   0.76
PPMI                                   0.87   0.77   0.72   0.75

Figure 3: Essential terms classification performance; (Acc)uracy, (Pr)ecision, (Re)call, and (F1) score for the ET Classifier of Khashabi et al. (2017) reflect the 70/9/21 train/dev/test split reported in their paper. As ET Net does, our methods were evaluated using a random 80/10/10 train/dev/test split performed after filtering out questions that appear in the ARC dev/test partitions. A dash marks a score that is not reported.

Training and Evaluation of Rewriter Models.

Before integrating the rewriter module into our overall system (Figure 2), the two rewriter models (Seq2Seq and NCRF++) are first trained and tested on the Essential Terms dataset introduced by Khashabi et al. (2017). This dataset consists of 2,223 crowd-sourced questions. Each word in a question is annotated with a numerical rating on the scale 1–5 that indicates the importance of the word. Before training our rewriter models we removed those questions in the Essential Terms dataset that were present in the ARC Challenge and ARC-Easy Dev/Test sets.

Figure 3 presents the results of our models evaluated on the Essential Terms dataset along with those of two state-of-the-art systems: the ET Classifier of Khashabi et al. (2017) and ET Net [Ni et al.2018]. ET Classifier trains an SVM using over 120 features based on the dependency parse, semantic features of the sentences, cluster representations of the words, and properties of question words. There are some slight differences in evaluation methodology: while both the ET Classifier and the ET Net classifier were optimized for precision, we optimize for accuracy. Also, while the ET Classifier was evaluated using a 70/9/21 train/dev/test split, we follow Ni et al. (2018) in using an 80/10/10 split (after removing questions from the Essential Terms dataset that appear in the ARC dev/test partitions).

The key insights from this experimental evaluation are as follows:

  • NCRF++ significantly outperforms the seq2seq model with respect to all evaluation metrics (see results with GloVe 840B.300d).

  • NCRF++ is competitive with ET Net and ET Classifier (without the heavy feature engineering of the latter system). It has significantly better accuracy and recall than ET Classifier, although its F1-score is 3 points lower. When used with CompleX graph embeddings [Trouillon et al.2016], it has the same precision as ET Net, but its F1-score is 4 points lower.

  • Finally, while the results in Figure 3 do not seem to support the need for using ConceptNet embeddings, we will see in the next section that, on ARC Challenge Dev, incorporating outside knowledge significantly increases the quality of passages that are available for downstream reasoning.

4 Retriever Module

Retrieving and providing high quality passages to the Resolver module is an important step in ensuring the accuracy of the system. In our system, a set of queries is sent to the Retriever, which then passes these queries along with a number of passages to the Resolver module. We use Elasticsearch [Gormley and Tong2015], a state-of-the-art text indexing system, on the ARC Corpus that is provided as part of the ARC Dataset. Clark et al. (2018) claim that this 14M-sentence corpus covers 95% of the questions in the ARC Challenge. Boratko et al. (2018) observe that the ARC corpus contains many relevant facts that are useful to solve the annotated questions from the ARC training set. An important direction for future work is augmenting the corpus with other search indices and sources of knowledge from which to retrieve the passages.

(a) AI2 Rule w/ original hypothesis
(b) AI2 Rule w/ split hypothesis
Figure 4: Performance of our models on the Dev partition of the Challenge set using (a) the original hypothesis and (b) the split hypothesis as we vary the number of results k retained by AI2 rule, i.e. overall Elasticsearch score.

5 Resolver Module

Given the retrieved passages, one still needs to select a particular answer out of the answer set. In our system we divide this process into two steps: the entailment module and the decision rule. In previous systems both of these components were wrapped into one. Separating them allows us to study each of them individually and make more informed design choices.

Entailment Modules.

While reading comprehension models like BiDAF [Seo et al.2017] have been adapted to the multiple-choice QA task by selecting a span in the passage obtained by concatenating several IR results into a larger passage, recent high-scoring systems on the ARC Leaderboard have relied on textual entailment models. In the approach pioneered by Khot, Sabharwal, and Clark (2018), a multiple choice question is converted into an entailment problem wherein each IR result is a premise. The question is turned into a fill-in-the-blank statement using a set of handcrafted heuristics (e.g. replacing wh-words). For each candidate answer, a hypothesis is generated by inserting the answer into the blank, and the probability according to the model that the premise entails this hypothesis becomes the answer’s score.
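As a rough illustration of turning a question and a candidate answer into a hypothesis, the sketch below replaces the first wh-word with the answer. The actual handcrafted heuristics are not given in the text, so the word list, the fallback of appending the answer, and the name `make_hypothesis` are all assumptions.

```python
import re

# Assumed, deliberately small list of wh-words.
WH_PATTERN = re.compile(r"\b(what|which|who|where|when)\b", re.IGNORECASE)


def make_hypothesis(question, answer):
    """Fill the answer into the question to form a declarative hypothesis."""
    statement = question.strip().rstrip("?").strip()
    if WH_PATTERN.search(statement):
        # Replace only the first wh-word; a function is used as the
        # replacement so backslashes in the answer are taken literally.
        filled = WH_PATTERN.sub(lambda _match: answer, statement, count=1)
    else:
        # No wh-word found: treat the question as a sentence stem.
        filled = f"{statement} {answer}"
    return filled + "."


print(make_hypothesis("Plants absorb what from the air?", "carbon dioxide"))
```

A single substitution like this produces awkward hypotheses for many question forms, which is one reason real systems use a larger set of rules.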

We use match-LSTM [Wang and Jiang2016a, Wang and Jiang2016b] trained on SciTail [Khot, Sabharwal, and Clark2018] as our textual entailment model. We chose match-LSTM because: (1) multiple reading comprehension techniques have used match-LSTM as an important module in their overall architecture [Wang and Jiang2016a, Wang and Jiang2016b]; and (2) match-LSTM models trained on SciTail achieve an accuracy of 84% on test (88% on dev), outperforming other recent entailment models such as DeIsTe [Yin, Roth, and Schütze2018] and DGEM [Khot, Sabharwal, and Clark2018].

Match-LSTM consists of an attention layer and a matching layer. We are given a premise $P = (t^p_1, \ldots, t^p_K)$ and a hypothesis $H = (t^h_1, \ldots, t^h_N)$, where $t^p_i$ and $t^h_j$ are embedding vectors of the corresponding words in the premise and hypothesis. A contextual representation of the premise and hypothesis is generated by encoding their embedding vectors using bi-directional LSTMs. Let $p_i$ and $h_j$ be the contextual representations of the $i$-th word in the premise and the $j$-th word in the hypothesis computed by the BiLSTMs. Then, an attention mechanism is used to determine the attention-weighted representation $a_j$ of the $j$-th word in the hypothesis as follows: $a_j = \sum_{i=1}^{K} \alpha_{ij} p_i$, where $\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{K} \exp(e_{kj})$ and where $e_{ij}$ is the alignment score between $p_i$ and $h_j$. The matcher layer is an LSTM whose input at step $j$ is $m_j = [a_j \oplus h_j]$ ($\oplus$ is the concatenation operator). The max-pooling result over the hidden states of the matcher is used for classification.
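The attention step can be illustrated numerically with a small pure-Python sketch. It computes, for one hypothesis word, softmax weights over the premise words and the resulting weighted sum, then concatenates that with the hypothesis vector to form the matcher input. The dot-product score is an assumption made for this sketch (the scoring function is not recoverable from the text), and the names `attend` and `matcher_input` are hypothetical.

```python
import math


def attend(premise_vecs, hyp_vec):
    """Return (attention weights, attention-weighted premise vector)."""
    # Assumed scoring function: dot product between premise and hypothesis vectors.
    scores = [sum(p * h for p, h in zip(p_i, hyp_vec)) for p_i in premise_vecs]
    # Numerically stable softmax over the premise positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(premise_vecs[0])
    context = [
        sum(w * p_i[d] for w, p_i in zip(weights, premise_vecs))
        for d in range(dim)
    ]
    return weights, context


def matcher_input(hyp_vec, context):
    """Concatenate the attended premise vector with the hypothesis vector."""
    return list(context) + list(hyp_vec)


premise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hypothesis_word = [1.0, 0.0]
weights, context = attend(premise, hypothesis_word)
print(weights, matcher_input(hypothesis_word, context))
```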

Decision Rules.

In the initial study of the ARC Dataset, Clark et al. (2018) convert many existing question answering and entailment systems to work with the particular format of the ARC dataset. One of the choices that they made during this conversion was how the output of the entailment systems, which consists of a probability that a given hypothesis is entailed from a premise, is aggregated to arrive at a final answer selection. The rule used, which we call the AI2 Rule for comparison, is to take the top-8 overall passages, across all queries for a given question, by Elasticsearch score. Each of these queries has a specific candidate answer associated with it, because every query consists of the selected question terms with one answer appended. For each of these top-8 passages the entailment score of the associated answer is recorded, and the top entailment score is used to select an answer.
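A minimal sketch of the AI2 Rule as described here: pool the retrieved passages across all queries, keep the k with the highest retrieval score, and return the answer(s) with the highest entailment probability among them. The tuple layout `(answer, retrieval_score, entailment_prob)` and the name `ai2_rule` are assumptions for illustration, and the numbers are made up.

```python
def ai2_rule(hits, k=8):
    """Select answer(s) by max entailment among the top-k hits overall.

    Each hit is (answer, retrieval_score, entailment_prob).
    """
    top = sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]
    best = max(hit[2] for hit in top)
    return sorted({hit[0] for hit in top if hit[2] == best})


# Made-up scores: answer C has strong entailment but a low retrieval score.
hits = [("A", 12.0, 0.30), ("B", 11.0, 0.80), ("C", 3.0, 0.95)]
print(ai2_rule(hits, k=2))  # C's passage is pruned before entailment is compared
print(ai2_rule(hits, k=3))
```

The example shows how an answer can be discarded purely because its passages rank low in retrieval, regardless of how strongly they entail it.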

In our system we decided to make this decision rule not part of the particular entailment system but rather a wholly separate module. The entailment system is responsible for measuring the entailment for each answer option for each of the retrieved passages and passing this information to the decision rule. One can compute a number of statistics and filters over this information and then arrive at one or more answer selections for the overall system.

We experimented with multiple filters: considering the evidence of the top-k answers by Elasticsearch score overall, and the top-k passages by Elasticsearch score per question. (Note that combining Elasticsearch scores across passages is not typically considered a good idea; from the Elastic Search Best Practices FAQ: “… the only purpose of the relevance score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.”) For each of the passages we considered statistics over the entailment probability vector, including Shannon Entropy and KL Divergence, for reweighting the importance of these pieces of evidence [Cohen1995]. Finally, one must consider how to aggregate the (possibly disparate) pieces of evidence. The most obvious choice is to select the answer with the maximum entailment probability over all pieces of evidence. However, one could take the sum or average of the entailment probabilities, or view each passage as a voting rule [Brandt et al.2016] over the correct (most entailed) answer. We will compare and contrast several of these options in the next section, but generally we feel the question of evidence aggregation for decision making is an under-explored area in the QA literature.
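The filtering and aggregation options discussed above can be sketched as follows: keep the top-k passages per candidate answer by retrieval score, then combine their entailment probabilities by max, sum, or mean. A Shannon entropy helper is included as one possible evidence statistic. The tuple layout `(answer, retrieval_score, entailment_prob)`, all function names, and all numbers are assumptions for illustration.

```python
import math
from collections import defaultdict


def top_k_per_answer(hits, k):
    """Group hits by answer and keep the k best by retrieval score."""
    by_answer = defaultdict(list)
    for answer, score, prob in hits:
        by_answer[answer].append((score, prob))
    return {
        answer: [prob for _, prob in sorted(pairs, reverse=True)[:k]]
        for answer, pairs in by_answer.items()
    }


def aggregate(probs_by_answer, how="max"):
    """Return the answer(s) with the best aggregated entailment score."""
    reducers = {
        "max": max,
        "sum": sum,
        "mean": lambda probs: sum(probs) / len(probs),
    }
    scores = {a: reducers[how](ps) for a, ps in probs_by_answer.items()}
    best = max(scores.values())
    return sorted(a for a, s in scores.items() if s == best)


def shannon_entropy(probs):
    """Entropy (in bits) of a positive vector after normalizing it to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


hits = [
    ("A", 12.0, 0.30), ("A", 9.0, 0.60),
    ("B", 11.0, 0.80),
    ("C", 3.0, 0.95), ("C", 2.0, 0.10),
]
kept = top_k_per_answer(hits, k=2)
for how in ("max", "sum", "mean"):
    print(how, aggregate(kept, how))
```

Different reducers can disagree on the same evidence, which is why the choice of aggregation deserves to be evaluated separately from the entailment model.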

(a) MaxEntail Rule w/ original hypothesis.
(b) MaxEntail Rule w/ split hypothesis.
Figure 5: Performance of our models on the Dev partition of the Challenge set using (a) the original hypothesis and (b) the split hypothesis constructed by splitting a multi-sentence question as we vary the number of results k retained by the MaxEntail rule, i.e. Elasticsearch score per candidate answer.

6 Empirical Evaluation

We considered many discrete design choices for setting up the evaluation of our system on the ARC Challenge Dev set. Our results on the Dev set for our models and two different decision rules are summarized in Figure 4 and Figure 5. The final results for the Test set are provided in Figure 6.

Dev Set

We first consider the important question of how many passages to investigate per query: we can compare and contrast Figure 4 (AI2 Rule) and Figure 5 (Max Entailment of top-k per answer), which vary the number of passages that are considered. The most obvious difference is that max entailment of top-k is strictly a better rule overall for both the original and split hypothesis. In addition to improving the overall score, keeping the top-k results per answer yields a smoother curve that is more amenable to calibration on the dev partition.

Comparing sub-figures (a) and (b) in Figures 4 and 5, we find more empirical support for our decision to investigate splitting the hypothesis. The length of the questions in the Challenge versus Easy set average v. words, respectively; for the answers, the length is words versus respectively. One possible cause for poor performance on ARC Challenge is that entailment models are easily confused by very long, story based questions. Working off the annotations of Boratko et al. (2018), many of the questions of type “Question Logic” are of this nature. To address this, we “split” multi-sentence questions by (a) forming the hypothesis from only the final sentence and (b) pre-pending the earlier sentences to each premise. Comparing across the figures we see that, in general, the modified hypothesis splitting leads to a small improvement in scores.
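The splitting step can be sketched as below: the final sentence of a multi-sentence question becomes the basis of the hypothesis, and the earlier sentences are prepended to the retrieved passage to form the premise. The regular-expression sentence splitter is a simplifying assumption (a real system would likely use a proper sentence tokenizer), and `split_question` is a hypothetical name.

```python
import re


def split_question(question, passage):
    """Return (final sentence of the question, context sentences + passage)."""
    # Naive sentence split on whitespace that follows ., ? or !
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", question.strip()) if s]
    context, final = sentences[:-1], sentences[-1]
    premise = " ".join(context + [passage])
    return final, premise


question = "A student heats ice in a pan. The ice melts. What change occurred?"
passage = "Melting is a change of state."
hypothesis_source, premise = split_question(question, passage)
print(hypothesis_source)
print(premise)
```

For a single-sentence question the context is empty, so the premise is just the retrieved passage and the behavior matches the unsplit setting.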

We also see the effect of including the ConceptNet embeddings on the performance of the downstream reasoning task; this is particularly evident in Figure 5. All of the rewritten queries are superior to using the original question. Additionally, in both Figure 5 (a) and (b), the CompleX and PPMI embeddings perform better than the base rewriter. This is a strong indication that using the background knowledge in specific ways can aid downstream reasoning tasks; this is contrary to the results of Mihaylov et al. (2018).

                        ARC-Challenge-Dev            ARC-Challenge-Test
Model                   AI2     Top-2   Top-30       AI2     Top-2   Top-30
Orig. Question          27.25   31.52   29.68        25.12   31.61   31.43
Orig. Question-Split    24.49   27.84   30.18        25.95   29.80   30.13
Seq2Seq                 23.18   31.12   30.68        26.93   29.98   30.13
NCRF++                  27.59   31.35   32.02        26.26   33.18   33.20
NCRF++-Split            28.42   30.00   36.37        26.94   31.58   30.56
TransH                  28.01   32.02   32.53        25.86   31.57   32.68
TransH-Split            27.59   29.34   33.86        25.77   30.06   31.58
PPMI                    27.42   30.69   30.68        27.40   31.88   32.92
PPMI-Split              29.51   31.02   35.37        26.27   29.76   31.28
CompleX                 29.59   32.86   34.20        26.44   30.20   31.66
CompleX-Split           30.26   28.51   35.20        26.74   29.93   31.54

Figure 6: Results of our models on the dev/test partitions of the Challenge set when responding with the maximally entailed answer(s) based on a set of filtered Elasticsearch results. Answers are selected based on: (a) the AI2 rule over the top individual results based on overall Elasticsearch score; (b) the MaxEntail rule over the Top-2 results by Elasticsearch score per candidate answer (typically 8 results total); or (c) the MaxEntail rule retaining the Top-30 results per answer (per Figure 4(b)).

Test Set

After tuning on the Dev set, we moved to the Test set in order to test the following decision rules for all of our systems: the AI2 Rule, Top-2 per query, and Top-30 per query. We selected Top-2 as it is the closest analog to the AI2 rule, and Top-30 because there is a clear and long peak from our initial testing on the Dev set. The results of our run on the Test set can be found in Figure 6. There are a number of interesting results to glean from this table.

First, for the Dev set, the split treatments nearly uniformly dominate the non-split treatments, while for the Test set this is almost completely reversed. Perhaps more surprisingly, the more complex ConceptNet embeddings are almost uniformly better on the Dev set, while on the Test set they are nearly uniformly worse. We did not suffer quite the loss in accuracy from Dev to Test as seen by Ni et al. (2018), who fell from 43.29% to 36.36%; however, our numbers also fell between Dev and Test (Figure 6).

7 Discussion and Conclusions

In this paper, we present a system that answers science questions by extracting the terms that are most relevant for retrieving supporting evidence from a large, noisy corpus. By combining query rewriting, background knowledge, and textual entailment our system is able to outperform several strong baselines on the ARC dataset. Our rewriter is able to incorporate background knowledge from ConceptNet and – in tandem with a generic entailment model trained on SciTail – achieves high performance on the end-to-end QA task despite only being trained to identify essential terms in the original source question.

There are a number of key takeaways from our work. First, researchers should be aware of the impact that Elasticsearch (or a similar tool) can have on the performance of their models. Answer candidates should not be discarded based on the relevance score of their top result; while (correct) answers are likely critical to retrieving relevant results, the original AI2 rule is too aggressive in pruning candidates. Using an entailment model that is capable of leveraging background knowledge in a more principled way would likely help dramatically in filtering unproductive search results. Second, our results corroborate those of Ni et al. (2018) and show that results are extremely sensitive to tuning on the Dev partition of the Challenge set (299 questions). Though we are unable to speculate on whether this is an artifact of the dataset or a more fundamental concern in multiple-choice QA, it will likely remain a barrier to significant, reproducible improvements on the ARC dataset in the future.