KG^2: Learning to Reason Science Exam Questions with Contextual Knowledge Graph Embeddings

05/31/2018 ∙ by Yuyu Zhang, et al.

The AI2 Reasoning Challenge (ARC), a new benchmark dataset for question answering (QA), has been recently released. ARC contains only natural science questions authored for human exams, which are hard to answer and require advanced logical reasoning. On the ARC Challenge Set, existing state-of-the-art QA systems fail to significantly outperform the random baseline, reflecting the difficult nature of this task. In this paper, we propose a novel framework for answering science exam questions, which mimics the human solving process in an open-book exam. To address the reasoning challenge, we construct contextual knowledge graphs for the question itself and for the supporting sentences, respectively. Our model learns to reason with neural embeddings of both knowledge graphs. Experiments on the ARC Challenge Set show that our model outperforms the previous state-of-the-art QA systems.




1 Introduction

Question answering (QA) has been a long-standing challenge in the field of artificial intelligence. Numerous research works have pushed forward techniques for building QA systems. Many existing approaches achieve high performance on benchmark datasets. However, most of the questions in those datasets only require surface-level reasoning, and do not reveal the full-scale complexity and challenge of the question answering problem. Recently, the AI2 Reasoning Challenge (ARC) has been proposed (Clark et al., 2018), which is designed to pose a challenge to the QA community. On the ARC Challenge Set, several state-of-the-art QA systems, including leading neural models from the well-known SQuAD and SNLI tasks, only perform slightly better than the random baseline. This striking observation demonstrates that QA is still far from being solved.

Why is it so difficult to answer the questions in the ARC Challenge Set? 1) ARC consists of natural science questions, namely questions authored for human exams. All of these questions are drawn from real exams; 2) In order to encourage progress on hard questions, a Challenge Set has been partitioned from ARC. More specifically, if a question can be correctly answered by neither an information retrieval (IR) method nor a word co-occurrence method, it is sorted into the Challenge Set; otherwise it goes into the Easy Set. To illustrate the difference, consider one example from each set:

  • ARC Easy Set:

    Which property of air does a barometer measure? (A) speed (B) pressure (C) humidity (D) temperature. Correct answer: (B) pressure.

    This question is correctly answered by both the IR and word co-occurrence methods. (Note that even if it were correctly answered by only one of them, ARC would exclude it from the Challenge Set.) The IR method finds sentences relevant to the correct answer in the reference corpus, e.g., “Air pressure will be measured with a barometer”. Due to the substantial word overlap, the question can be easily solved. Similarly, the word co-occurrence method finds that “barometer” and “pressure” co-occur frequently in the corpus, leading to the correct answer.

  • ARC Challenge Set:

    Which property of a mineral can be determined just by looking at it? (A) luster (B) mass (C) weight (D) hardness. Correct answer: (A) luster.

    Neither the IR method nor the word co-occurrence method can correctly answer this question. There are no sentences in the corpus similar to “A material’s luster can be determined by looking at it”. Also, “mineral” often co-occurs with distractor options (e.g., mass, hardness), which confuses the word co-occurrence method.

From the examples above, we see that surface-level reasoning methods are not able to solve questions in the Challenge Set, even when the required knowledge is already covered in the reference corpus. The ARC Corpus, a large science-related text corpus collected from the Web and released together with ARC, mentions knowledge relevant to about 95% of the ARC Challenge questions (Clark et al., 2018). However, the IR method with the ARC Corpus, as listed in Table 1, achieves a test score of only 20.26, which underperforms the random baseline. Collecting more sentences into the corpus would not solve the challenge. In fact, we tried using the entire Web as the reference corpus through the Google Search API, selecting the answer option with the largest number of hits. This only slightly improves the score to 21.58.

To tackle the ARC Challenge, we believe that there is no shortcut around advanced logical reasoning and deeper text comprehension. These questions target students aged 8 through 13, and should be relatively easy for humans to solve. An adult with basic reasoning capability, even if she has forgotten the knowledge learned in grade school, can still ace most of these questions in an open-book exam by searching for relevant supporting texts and reasoning over them.

Inspired by the human problem-solving process, we propose a neural reasoning engine named KG^2 for answering science exam questions: read the question, generate a hypothesis by combining the question stem and an answer option, find supporting sentences in the corpus, and verify the hypothesis. For effective and efficient reasoning, we represent both the hypothesis and the supporting sentences as knowledge graphs. For example, in the supporting graph, “luster” is linked to “brightness”, and “brightness” connects to “look”, which is consistent with the hypothesis graph. Such reasoning patterns on graphs can be learned by our differentiable neural engine. Experiments on the ARC Challenge Set show that our model achieves a score that surpasses the previous state-of-the-art results.

In summary, the contributions of this work are: 1) We propose a novel differentiable neural programming framework for reasoning about science exam questions; 2) Our method sets the new state of the art on the ARC Challenge Set; 3) We decompose the remaining difficulties in solving the ARC Challenge, helping the community to engage with the dataset and make progress on this challenging task.

2 Related Work

Science QA: For elementary science QA, simple IR-based methods have been proposed for science exams (Clark et al., 2016). Markov Logic Networks (Richardson and Domingos, 2006) have been used to reason over a small set of logical rules (Khot et al., 2015). Jansen et al. (2016) analyzed knowledge and inference requirements for science exam questions.

The work most related to ours is DGEM (Khot et al., 2018), a neural entailment model which also employs Open IE to generate a hypothesis graph. Our key contributions over DGEM are: 1) DGEM is designed for single-sentence entailment, while we aggregate multiple supporting sentences for reasoning; 2) DGEM has no structured representation of supporting facts, while our model learns to reason over the paired hypothesis and supporting graphs together.

Graph Embedding: We employ graph embedding techniques for reasoning over knowledge graphs. Graph embedding has provided representational flexibility for neural models in many NLP tasks, such as dialog systems (He et al., 2017), question answering (Zhang et al., 2017), link prediction (Bordes et al., 2013) and triple classification (Feng et al., 2016). In our paper, we extend this technique to mimic the reasoning process on a graph ranking problem.

3 Task

The ARC Challenge Set consists of science exam questions $\{(q_i, \{a_{i1}, \dots, a_{iK}\}, y_i)\}_{i=1}^{N}$, where $q_i$ is the question stem, $a_{ik}$ is the $k$-th answer option corresponding to $q_i$ (typically 4-way multiple choice), and $y_i$ is the label of the correct answer. Both $q_i$ and $a_{ik}$ are in text format. Among the multiple choices, only one is the correct answer and the others are distractors. Given the question stem and options, the goal is to find the correct answer. ARC is accompanied by the ARC Corpus, which provides 14M science-related sentences from the Web with knowledge relevant to ARC. The use of the ARC Corpus is optional for the ARC Challenge.

4 Approach

4.1 Generating Hypothesis

A hypothesis is a statement that combines a question stem $q$ and an answer option $a$, which helps us understand what is being asked and what target is to be verified. For example, consider the question stem “Which of these occurs due to the rotation of Earth?” and one of the answer options “day and night”. The hypothesis to be generated from them should be: “Day and night occurs due to the rotation of Earth”.

To automatically generate a hypothesis, we first identify the wh-word (e.g., which, what, where, etc.) in the question stem, and replace it with the answer option. If no wh-word is found, we simply append the answer option to the question stem. We create several rules to handle special cases and make the hypothesis more natural. For example, “Which of these” and “Which of the following” should be replaced as a whole when they appear in the question stem. We successfully generate hypotheses for most questions. A few corner cases still require advanced rewording, but their number should be negligible.
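The rule-based rewriting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `generate_hypothesis`, the pattern list `WH_PATTERNS`, and the choice to capitalize and end the result with a period are all assumptions, since the paper does not enumerate its rules.

```python
import re

# Multi-word wh-phrases are tried before single wh-words so that
# "Which of these" is replaced as a whole (the special case named in the text).
# The exact rule list is an assumption; the paper does not enumerate its rules.
WH_PATTERNS = [
    r"which of the following",
    r"which of these",
    r"\b(?:which|what|where|when|who|whom|whose|why|how)\b",
]


def generate_hypothesis(stem: str, option: str) -> str:
    """Combine a question stem and an answer option into a declarative statement."""
    stem = stem.strip()
    for pattern in WH_PATTERNS:
        match = re.search(pattern, stem, flags=re.IGNORECASE)
        if match:
            # Replace only the first wh-expression with the answer option.
            text = stem[:match.start()] + option + stem[match.end():]
            break
    else:
        # No wh-word found: append the option after the stem.
        text = stem + " " + option
    # Turn the question into a statement: drop the question mark, end with a period.
    text = text.rstrip(" ?.") + "."
    return text[0].upper() + text[1:]


print(generate_hypothesis("Which of these occurs due to the rotation of Earth?",
                          "day and night"))
# prints: Day and night occurs due to the rotation of Earth.
```

A stem such as “Which property of air does a barometer measure?” is not handled naturally by plain substitution, which is the kind of corner case that needs additional rewording rules.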

4.2 Searching Potential Supports

To verify a hypothesis, we look for supports in the reference corpus. Although the corpus is typically gigantic, we only need to focus on the tiny part of it that is relevant to the question we are solving. Therefore, we use the generated hypothesis as a query to search the entire corpus. The top retrieved sentences are treated as potential supports for the hypothesis. In order to efficiently search the corpus, we build a local search engine on top of ElasticSearch (Gormley and Tong, 2015). Since the corpus sentences are not as clean as the questions, we filter out noisy sentences that contain negation words (e.g., not, except, etc.), contain unexpected characters, or are simply too long, and then pick the top 20 sentences for verifying the hypothesis.
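The filtering step applied to the retrieved sentences can be sketched as below. The retrieval itself is left out; the input is assumed to be a list of sentences already ranked by the search engine. The names `is_clean` and `select_supports`, the length cutoff `MAX_TOKENS`, the character whitelist, and the negation words other than “not” and “except” are assumptions made for illustration.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "except"}  # "not"/"except" from the text; rest assumed
MAX_TOKENS = 40   # assumed length cutoff; the paper gives no number
TOP_K = 20        # number of supports kept, as stated in the text
# Assumed notion of "expected" characters: letters, digits, common punctuation.
ALLOWED = re.compile(r"^[A-Za-z0-9 ,.;:'()%/-]+$")


def is_clean(sentence: str) -> bool:
    """Return True if the sentence passes all three noise filters."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if any(tok in NEGATION_WORDS for tok in tokens):
        return False                      # contains a negation word
    if len(sentence.split()) > MAX_TOKENS:
        return False                      # too long
    return bool(ALLOWED.match(sentence))  # no unexpected characters


def select_supports(ranked_sentences, top_k=TOP_K):
    """Keep the first `top_k` clean sentences, preserving the retrieval order."""
    clean = [s for s in ranked_sentences if is_clean(s)]
    return clean[:top_k]
```

Filtering before truncation, as done here, keeps 20 supports whenever enough clean sentences were retrieved; whether the original system filters before or after truncation is not stated in the text.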

4.3 Constructing Knowledge Graphs

Many questions in the ARC Challenge Set require advanced reasoning on multiple supporting sentences. To aggregate knowledge across sentences, we employ Open IE (Banko et al., 2007; Christensen et al., 2011; Pal et al., 2016) v4 to extract relation triples from each sentence, and collect them to construct a contextual knowledge graph.

More specifically, each relation triple is represented as $(s, p, o_1, \dots, o_n)$, where $s$ is the subject, $p$ is the predicate, and $o_i$ is the $i$-th object. We construct the graph by adding nodes $s$, $p$ and $o_i$, and adding labeled directed edges that connect $s$ with $p$ and $p$ with each $o_i$. If an adverbial of time or location is extracted by Open IE, we add an edge with label time or loc to the knowledge graph. Words in each graph node are lemmatized. Similarly, we construct another knowledge graph for the corresponding hypothesis, which is paired with the supporting knowledge graph. Refer to Appendix A for examples of our generated graphs.
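A minimal sketch of this construction is given below, assuming the triples have already been extracted. Several details are assumptions made for illustration: the edge labels `"subj"` and `"obj"`, the edge directions, the tuple layout `(subject, predicate, objects, extras)`, the toy `lemmatize` lookup table standing in for a real lemmatizer, and the choice to merge nodes with identical lemmatized text so that knowledge from different sentences is aggregated.

```python
from collections import defaultdict


def lemmatize(phrase: str) -> str:
    """Placeholder lemmatizer (assumption): a tiny lookup table, not a real one."""
    table = {"comes": "come", "contains": "contain", "is": "be"}
    return " ".join(table.get(w, w) for w in phrase.lower().split())


def build_graph(triples):
    """Build a labeled directed graph from (subject, predicate, objects, extras) tuples.

    `extras` maps "time" or "loc" to an adverbial phrase. The edge labels
    "subj" and "obj" are hypothetical names chosen for this sketch.
    """
    nodes = set()
    edges = defaultdict(set)  # (source, target) -> set of edge labels
    for subj, pred, objs, extras in triples:
        s, p = lemmatize(subj), lemmatize(pred)
        nodes.update([s, p])
        edges[(s, p)].add("subj")
        for obj in objs:
            o = lemmatize(obj)
            nodes.add(o)
            edges[(p, o)].add("obj")
        for label, phrase in extras.items():  # label is "time" or "loc"
            adverbial = lemmatize(phrase)
            nodes.add(adverbial)
            edges[(p, adverbial)].add(label)
    return nodes, dict(edges)


# Supporting facts taken from the example in Appendix A.
supports = [
    ("fruit", "contains", ["seed"], {}),
    ("fruit", "is part of", ["tree"], {}),
    ("oak", "is kind of", ["tree"], {}),
]
nodes, edges = build_graph(supports)
```

Because “tree” and “fruit” each appear in two triples, the three facts end up in one connected graph, which is what allows reasoning across sentences.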

4.4 Learning with Graph Embeddings

Given a question $q$ and a candidate choice $a$, we construct the corresponding hypothesis graph $G_h$ and supporting graph $G_s$ by aggregating the relation triples mentioned in Section 4.3. Thus, choosing the right answer for question $q$ becomes a graph ranking problem. A good graph scoring function $f(G_h, G_s)$ should assign the highest score to the correct hypothesis-supporting graph pair. Without loss of generality, we use a point-wise ranking objective, where $f$ becomes a binary classifier.

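To implement the graph scoring function, we adapt the recent advances in graph embedding (Dai et al., 2016; Gilmer et al., 2017) to our problem. Specifically, let $G = (V, E)$ be a knowledge graph, and $\mathcal{P} \subseteq V$ be the set of predicate nodes. We associate each node $v \in V$ with an embedding vector $\mu_v$ that captures the local information, which is computed recursively using the equation:

$$\mu_v^{(t)} = g\Big(x_v,\ \big\{\big(\mu_u^{(t-1)}, e_{uv}\big)\big\}_{u \in \mathcal{N}(v)}\Big) \tag{1}$$

Here $x_v$ encodes the text feature of node $v$ generated by an LSTM that is jointly trained with the supervision, $\mathcal{N}(v)$ is the set of neighbors of $v$, and $e_{uv}$ is the type of the edge between $u$ and $v$. The edge type can be time, loc, etc. We use a two-layer neural network for the function $g$. Eq. (1) iterates for $T$ steps, and we use $\mu_v^{(T)}$ as the node embedding representation. Finally, the scoring function is defined as:

$$f(G_h, G_s) = \sigma\Big(\max_{u \in \mathcal{P}_h,\ v \in \mathcal{P}_s} \big\langle \mu_u^{(T)}, \mu_v^{(T)} \big\rangle - b\Big) \tag{2}$$

where $\mathcal{P}_h$ and $\mathcal{P}_s$ are the predicate node sets of the hypothesis graph $G_h$ and the supporting graph $G_s$, $\sigma$ is the sigmoid function, and the shift $b$ is used to center the matching score at zero. Eq. (2) performs a max inner product search over all pairs of predicate node embeddings. This mimics the procedure of reasoning on the most relevant hypothesis and corresponding supporting evidence, since each embedding vector already captures the information within its $T$-hop neighborhood.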
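The two stages described above, iterative neighborhood aggregation followed by a max inner product over predicate nodes, can be sketched in plain Python. This is only an illustration under strong assumptions: the learned two-layer network is replaced by a fixed `tanh` of a weighted sum, each edge type has a single scalar weight in `edge_weight`, neighbors are taken in both edge directions, and the node features are given by hand rather than produced by an LSTM. The names `propagate`, `score` and `shift` are hypothetical.

```python
import math


def propagate(features, edges, edge_weight, steps):
    """Run `steps` rounds of neighborhood aggregation.

    features:    {node: list of floats}, the text feature of each node
    edges:       list of (source, target, edge_type)
    edge_weight: {edge_type: float}, a scalar stand-in for learned parameters
    """
    mu = {v: list(x) for v, x in features.items()}
    for _ in range(steps):
        new_mu = {}
        for v, x in features.items():
            agg = list(x)
            for a, b, etype in edges:
                if b == v:
                    neighbor = a
                elif a == v:
                    neighbor = b
                else:
                    continue
                w = edge_weight[etype]
                agg = [s + w * m for s, m in zip(agg, mu[neighbor])]
            # tanh is a stand-in for the learned update function.
            new_mu[v] = [math.tanh(s) for s in agg]
        mu = new_mu
    return mu


def score(mu_h, preds_h, mu_s, preds_s, shift=0.0):
    """Sigmoid of the best inner product over all predicate-node pairs, minus a shift."""
    best = max(
        sum(a * b for a, b in zip(mu_h[u], mu_s[v]))
        for u in preds_h
        for v in preds_s
    )
    return 1.0 / (1.0 + math.exp(-(best - shift)))
```

After `steps` rounds, each node embedding depends on nodes up to `steps` hops away, so comparing only predicate nodes still takes their subjects and objects into account.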

5 Experiments

We compare our method against several recently published baseline models, including state-of-the-art neural models from the well-known SQuAD and SNLI tasks.

5.1 Setup

We use the ARC Challenge Set (Clark et al., 2018) for all experiments. This dataset consists of 2,590 questions drawn from a variety of human exams. We use the original train / development / test split. The test set, which contains 1,172 questions, is held out for model evaluation. For each question, a QA system receives one point if it selects the correct answer, and $1/k$ points if it reports a $k$-way tie (i.e., chooses multiple answers) that includes the correct answer. The ARC Corpus can be optionally used for all models.
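The scoring rule can be written down directly. The sketch below assumes each prediction is given as a set of selected option labels; the function names `question_score` and `challenge_score` are hypothetical.

```python
def question_score(selected, correct):
    """One point for a single correct answer, 1/k for a k-way tie
    that includes the correct answer, and zero otherwise."""
    selected = set(selected)
    if correct not in selected:
        return 0.0
    return 1.0 / len(selected)


def challenge_score(predictions, labels):
    """Average per-question score, reported as a percentage."""
    total = sum(question_score(p, y) for p, y in zip(predictions, labels))
    return 100.0 * total / len(labels)


print(question_score({"A", "B", "C", "D"}, "B"))  # prints 0.25
```

Selecting every option of a 4-way question therefore always earns 0.25 points, which is why a system that hedges across all options scores the same as uniform random guessing in expectation.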

5.2 Baselines

Guess-all / Random: This naive baseline just selects all answer options, getting a score of $1/k$ for each question with $k$ answer options. Random selection will also converge to this score after enough trials.

IR-ARC: IR-based method sends question stem plus each option as a query to a search engine. For IR-ARC, the search engine is built on top of the ARC Corpus, and the search score is determined by the ElasticSearch score of the top retrieved sentence. The option with the highest search score is finally selected.

IR-Google: This is similar to IR-ARC, but uses the Google Search API to retrieve documents from the entire Web, instead of just searching the ARC Corpus. IR-Google uses the number of hits as the search score.

TableILP: This method (Khashabi et al., 2016) performs table-based reasoning, which is formulated as an Integer Linear Program (ILP).

TupleInference: This model (Khot et al., 2017) searches for the graph that best connects the terms in the question with an answer choice via the knowledge extracted by Open IE.

DecompAttn: This is a neural entailment model (Parikh et al., 2016) adapted to multiple-choice QA by assigning an entailment score to the pair of a hypothesis and a single supporting sentence (Clark et al., 2018). The answer option with the highest score is selected. DecompAttn is a top performer on SNLI (Bowman et al., 2015).

DGEM-OpenIE: DGEM (Khot et al., 2018) is also a neural model for sentence-level entailment, but uses Open IE to create structured representation of the hypothesis. On the SciTail task (Khot et al., 2018), DGEM is a top performer. In Clark et al. (2018), there is another version of DGEM, which uses a proprietary parser together with Open IE and achieves 27.11 test score. For fair comparison, we only list publicly available models in Table 1.

BiDAF: This model (Seo et al., 2016) is for span prediction QA, and has been adapted to multiple-choice QA (Clark et al., 2018). BiDAF is a top performer on SQuAD (Rajpurkar et al., 2016).

5.3 Results and Analysis

Method                Test Score
IR-ARC                20.26
IR-Google             21.58
TupleInference        23.83
DecompAttn            24.34
Guess-all / Random    25.02
DGEM-OpenIE           26.41
BiDAF                 26.54
TableILP              26.97
KG^2 (ours)           31.70

Table 1: Test performance of different QA systems on the ARC Challenge Set. The ARC Corpus is used in DecompAttn, DGEM, BiDAF and KG^2.

Table 1 summarizes the test scores of all baseline models and our method. Strikingly, none of the baseline methods performs significantly better than the random baseline once the 95% confidence interval is taken into account. Our method achieves 31.70, a relative improvement of 17.5% over the previous state of the art (26.97, TableILP).

Figure 1: Distribution of various difficulties in solving the ARC Challenge Set.

Nevertheless, we are still far from “passing” the exam. To dissect the difficulties, we randomly sample 100 questions for investigation and report the results in Figure 1. More than half of the questions lack support: even a human could not solve them by referring only to the supporting sentences. This may be caused by the limited coverage of the corpus, and by retrieval bias: sentences with low word overlap are ranked low, yet they may explain a concept that is indispensable for reasoning. External knowledge sources may help on these questions. 12% of the questions lose key information in the graph because Open IE fails. Sentence parsing may help here since it preserves more text. 21% of the questions require very complex reasoning, and only 15% of the questions are “learnable” within the current framework. This gives an estimated upper bound of 36.25, reached when we correctly answer all the learnable questions and randomly guess the others. Improving the learning algorithm should bring our current result closer to this upper bound.
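The upper bound follows from a short expected-value calculation, sketched below. It assumes every question has four options, so that a random guess is correct with probability 0.25; the function name `estimated_upper_bound` is hypothetical.

```python
def estimated_upper_bound(learnable_fraction, num_options=4):
    """Expected test score if all learnable questions are answered correctly
    and the remaining questions are guessed uniformly at random."""
    chance = 1.0 / num_options
    expected = learnable_fraction + (1.0 - learnable_fraction) * chance
    return 100.0 * expected


# 15% learnable, 85% guessed at 25% accuracy: 15 + 21.25 = 36.25
print(round(estimated_upper_bound(0.15), 2))  # prints 36.25
```

The same function shows how the bound would move if better retrieval or extraction made more questions learnable: each additional percentage point of learnable questions adds 0.75 points to the bound.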

6 Conclusion and Future Work

We present a neural reasoning engine for answering science exam questions, which learns to reason over contextual knowledge graphs. Experimental results show that our method outperforms existing QA systems on the ARC Challenge Set. In the future, we will explore how to exploit external knowledge sources, and try to improve the quality of Open IE extraction through sentence parsing.


Appendix A Examples of Knowledge Graphs

To illustrate how we construct knowledge graphs from hypothesis and supporting sentences, here we present some examples.

We first show a relatively simple example in Figure 2, which contains a pair of hypothesis and supporting graphs. The hypothesis is “seed of oak comes from fruit”, as shown in Figure 2(a). Note that the verb “comes” is lemmatized and becomes “come” in the graph. The supporting knowledge graph is plotted in Figure 2(b), where we obtain knowledge including “fruit contains seed”, “fruit is part of tree”, and “oak is kind of tree”. With the supporting knowledge graph, we should be able to infer that the hypothesis is true.

(a) Knowledge graph for hypothesis
(b) Knowledge graph for supports
Figure 2: Example of knowledge graphs for paired hypothesis and supports.

Note that the knowledge graphs can be very complicated when the question stem has multiple sentences, or when Open IE extracts rich information from the supporting sentences. We show another example in Figure 3, which has a heavier supporting graph than the previous example. This is actually common in the ARC Challenge Set. In this example, the hypothesis is “day and night occurs due to rotation of earth”, as plotted in Figure 3(a). Looking at the supporting graph in Figure 3(b), we can find key information for this question, such as “day and night occurs because earth rotates”, “day and night causes earth rotation on its axis”, “day and night is caused by earth’s rotation”, etc. With the supporting knowledge graph, we should have the necessary information to verify the hypothesis.

(a) Knowledge graph for hypothesis
(b) Knowledge graph for supports
Figure 3: Another example of knowledge graphs for paired hypothesis and supports.