Question answering (QA) has been a long-standing challenge in artificial intelligence. Numerous research works have pushed forward techniques for building QA systems, and many existing approaches achieve high performance on benchmark datasets. However, most questions in those datasets require only surface-level reasoning, and do not reveal the full complexity and challenge of the question answering problem. Recently, the AI2 Reasoning Challenge (ARC) has been proposed (Clark et al., 2018), designed to pose a challenge to the QA community. On the ARC Challenge Set, several state-of-the-art QA systems, including leading neural models from the well-known SQuAD and SNLI tasks, perform only slightly better than the random baseline. This striking observation demonstrates that QA is still far from solved.
Why is it so difficult to answer the questions in the ARC Challenge Set? 1) ARC consists of natural science questions, namely questions authored for human exams; all of them are drawn from real exams. 2) To encourage progress on hard questions, a Challenge Set has been partitioned from ARC: if a question cannot be correctly answered by either an information retrieval (IR) method or a word co-occurrence method, it is sorted into the Challenge Set; otherwise, into the Easy Set. To illustrate the difference, consider the following two examples, one from each set, where the bold answers correspond to the correct choices:
ARC Easy Set:
Which property of air does a barometer measure? (A) speed (B) pressure (C) humidity (D) temperature
This question is correctly answered by both the IR and word co-occurrence methods. (Note that even if it were correctly answered by only one of them, ARC would exclude it from the Challenge Set.) The IR method finds sentences relevant to the correct answer in the reference corpus, e.g., “Air pressure will be measured with a barometer”. Due to the substantial word overlap, the question can be easily solved. Similarly, the word co-occurrence method finds that “barometer” and “pressure” co-occur frequently in the corpus, leading to the correct answer.
ARC Challenge Set:
Which property of a mineral can be determined just by looking at it? (A) luster (B) mass (C) weight (D) hardness
Neither the IR method nor the word co-occurrence method can correctly answer this question. There are no sentences in the corpus similar to “A material’s luster can be determined by looking at it”. Also, “mineral” often co-occurs with distractor options (e.g., mass, hardness), which confuses the word co-occurrence method.
From the examples above, we see that surface-level reasoning methods cannot solve questions in the Challenge Set, even when the required knowledge is already covered by the reference corpus. The ARC Corpus, a large science-related text corpus collected from the Web and released together with ARC, mentions knowledge relevant to about 95% of the ARC Challenge questions (Clark et al., 2018). However, the IR method with the ARC Corpus, as listed in Table 1, achieves only a 20.26 test score, underperforming the random baseline. Collecting more sentences into the corpus would not solve the challenge: we tried using the entire Web as the reference corpus via the Google Search API, selecting the answer option with the greatest number of hits, which only slightly improves the score to 21.58.
To tackle the ARC Challenge, we believe there is no shortcut around advanced logical reasoning and deeper text comprehension. These questions target students aged 8 through 13, and should be relatively easy for humans to solve. An adult with basic reasoning capability, even if she has forgotten the knowledge learned in grade school, can still ace most of these questions in an open-book exam by searching for relevant supporting texts and reasoning over them.
Inspired by the human problem-solving process, we propose a neural reasoning engine for answering science exam questions: read the question, generate a hypothesis by combining the question stem and an answer option, find supporting sentences in the corpus, and verify the hypothesis. For effective and efficient reasoning, we represent both the hypothesis and the supporting sentences as knowledge graphs. For example, in the supporting graph, “luster” is linked to “brightness”, and “brightness” connects to “look”, which is consistent with the hypothesis graph. Such reasoning patterns on graphs can be learned by our differentiable neural engine. Experiments on the ARC Challenge Set show that our model surpasses the previous state-of-the-art results.
In summary, the contributions of this work are: 1) We propose a novel differentiable neural programming framework for reasoning about science exam questions; 2) Our method sets the new state of the art on the ARC Challenge Set; 3) We decompose the remaining difficulties towards solving the ARC Challenge, helping the community engage with the dataset and make progress on this challenging task.
2 Related Work
Science QA: For elementary science QA, simple IR-based methods have been proposed for science exams (Clark et al., 2016). Markov Logic Networks (Richardson and Domingos, 2006) have been used to reason over a small set of logical rules (Khot et al., 2015). Jansen et al. (2016) analyzed knowledge and inference requirements for science exam questions.
The work most related to ours is DGEM (Khot et al., 2018), a neural entailment model which also employs Open IE to generate the hypothesis graph. Our key contributions over DGEM are: 1) DGEM is designed for single-sentence entailment, while we aggregate multiple supporting sentences for reasoning; 2) DGEM has no structured representation of supporting facts, while our model learns to reason over the paired hypothesis and supporting graphs together.
Graph Embedding: We employ graph embedding techniques for reasoning over knowledge graphs. Graph embedding has provided representational flexibility for neural models in many NLP tasks, such as dialog systems (He et al., 2017), question answering (Zhang et al., 2017), link prediction (Bordes et al., 2013), and triple classification (Feng et al., 2016). In this paper, we extend this technique to mimic the reasoning process as a graph ranking problem.
3 Task Definition

The ARC Challenge Set consists of science exam questions $\{(q, \{a_i\}_{i=1}^{k}, y)\}$, where $q$ is the question stem, $a_i$ is the $i$-th answer option corresponding to $q$ (typically 4-way multiple choice), and $y$ is the label of the correct answer. Both $q$ and $a_i$ are in text format. Among the multiple choices, only one is the correct answer and the others are distractors. Given the question stem and options, the goal is to find the correct answer. Accompanying ARC, the ARC Corpus is also provided, containing 14M science-related sentences from the Web with knowledge relevant to ARC. Use of the ARC Corpus is optional for the ARC Challenge.
4.1 Generating Hypothesis
A hypothesis is a statement that combines a question stem and an answer option, which helps us understand what is being asked and what needs to be verified. For example, consider the question stem “Which of these occurs due to the rotation of Earth?” and the answer option “day and night”. The hypothesis generated from them should be: “Day and night occurs due to the rotation of Earth”.
To automatically generate a hypothesis, we first identify the wh-word (e.g., which, what, where, etc.) in the question stem and replace it with the answer option. If no wh-word is found, we simply append the answer option to the question stem. We create several rules to handle special cases and make hypotheses more natural. For example, “Which of these” and “Which of the following” should be replaced as a whole when they appear in the question stem. We successfully generate hypotheses for most questions; the few remaining corner cases that require advanced rewording should be negligible.
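The rule-based generation above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the phrase lists, the naive substring matching, and the punctuation handling are all assumptions.

```python
# Hypothetical hypothesis generator: replace the wh-word (or a whole
# wh-phrase) with the answer option; otherwise append the option.
WH_PHRASES = ["which of these", "which of the following"]  # replaced as a whole
WH_WORDS = ["which", "what", "where", "when", "who", "why", "how"]

def generate_hypothesis(stem: str, option: str) -> str:
    lowered = stem.lower()
    # Multi-word wh-phrases take priority over single wh-words.
    # Naive substring search is a simplification (it would also match
    # "which" inside "sandwich"); a real system would match tokens.
    for phrase in WH_PHRASES + WH_WORDS:
        idx = lowered.find(phrase)
        if idx != -1:
            hyp = stem[:idx] + option + stem[idx + len(phrase):]
            return hyp.rstrip(" ?").strip() + "."
    # No wh-word found: append the option behind the question stem.
    return stem.rstrip(" ?").strip() + " " + option + "."
```

For the running example, `generate_hypothesis("Which of these occurs due to the rotation of Earth?", "day and night")` yields the statement "day and night occurs due to the rotation of Earth." (modulo capitalization).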
4.2 Searching Potential Supports
To verify a hypothesis, we look for supports in the reference corpus. Although the corpus is typically gigantic, we only need to focus on the tiny part of it that is relevant to the question being solved. Therefore, we use the generated hypothesis as a query to search the entire corpus. The top retrieved sentences are treated as potential supports for the hypothesis. To search the corpus efficiently, we build a local search engine on top of ElasticSearch (Gormley and Tong, 2015). Since the corpus sentences are not as clean as the questions, we filter out noisy sentences that contain negation words (e.g., not, except, etc.), contain unexpected characters, or are simply too long, and then keep the top 20 sentences for verifying the hypothesis.
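The filtering step can be sketched as below. The concrete negation list, the length cutoff, and the "unexpected character" check are assumptions standing in for thresholds the paper does not specify.

```python
# Illustrative noise filter for retrieved sentences.
NEGATIONS = {"not", "except", "never", "none", "neither", "nor"}
MAX_WORDS = 40  # assumed "simply too long" cutoff

def is_clean(sentence: str) -> bool:
    words = sentence.lower().split()
    if len(words) > MAX_WORDS:
        return False
    if any(w.strip(".,;:") in NEGATIONS for w in words):
        return False
    # Reject unexpected (non-printable-ASCII) characters.
    return all(32 <= ord(c) < 127 for c in sentence)

def top_supports(retrieved: list, k: int = 20) -> list:
    """Keep the top-k clean sentences, preserving retrieval order."""
    return [s for s in retrieved if is_clean(s)][:k]
```

Applied to ElasticSearch hits, this keeps up to 20 clean sentences in their original relevance order.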
4.3 Constructing Knowledge Graphs
Many questions in the ARC Challenge Set require advanced reasoning over multiple supporting sentences. To aggregate knowledge across sentences, we employ Open IE (Banko et al., 2007; Christensen et al., 2011; Pal et al., 2016) v4 (https://github.com/allenai/openie-standalone) to extract relation triples from each sentence, and collect them to construct a contextual knowledge graph.
More specifically, each relation triple is represented as $(s, p, o_1, \ldots, o_n)$, where $s$ is the subject, $p$ is the predicate, and $o_i$ is the $i$-th object. We construct the graph by adding nodes $s$, $p$, and $o_i$, and adding directed edges from $s$ to $p$ and from $p$ to each $o_i$. If an adverbial of time or location is extracted by Open IE, we add an edge with label time or loc to the knowledge graph. Words in each graph node are lemmatized. Similarly, we construct another knowledge graph for the corresponding hypothesis, which is paired with the supporting knowledge graph. Refer to Appendix A for examples of our generated graphs.
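A minimal sketch of this graph construction is shown below. The `Triple` record is an assumed simplification of Open IE v4 output, and `lemma` stands in for a real lemmatizer.

```python
# Build a labeled directed graph from Open IE-style relation triples.
from collections import namedtuple

# extras: dict mapping "time" / "loc" to adverbial text (possibly empty)
Triple = namedtuple("Triple", ["subj", "pred", "objs", "extras"])

def lemma(text: str) -> str:
    return text.lower()  # stand-in for a real lemmatizer

def build_graph(triples):
    nodes, edges = set(), set()
    for t in triples:
        s, p = lemma(t.subj), lemma(t.pred)
        nodes.update([s, p])
        edges.add((s, p, "subj"))          # subject -> predicate
        for o in t.objs:
            o = lemma(o)
            nodes.add(o)
            edges.add((p, o, "obj"))       # predicate -> object
        for label, text in t.extras.items():
            x = lemma(text)
            nodes.add(x)
            edges.add((p, x, label))       # "time" or "loc" edge
    return nodes, edges
```

For the triple ("Fruit", "contains", ["seed"]), this yields nodes {fruit, contains, seed} and the two labeled edges connecting them, matching the Appendix A examples in spirit.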
4.4 Learning with Graph Embeddings
Given a question $q$ and a candidate choice $a_i$, we construct the corresponding hypothesis graph $G_h^{(i)}$ and supporting graph $G_s^{(i)}$ by aggregating the relation triples described in Section 4.3. Choosing the right answer for question $q$ thus becomes a graph ranking problem: a good graph scoring function $f(G_h, G_s)$ should assign the highest score to the correct hypothesis-supporting graph pair. Without loss of generality, we use a point-wise ranking objective, where $f$ becomes a binary classifier.
To implement the graph scoring function, we adapt recent advances in graph embedding (Dai et al., 2016; Gilmer et al., 2017) to our problem. Specifically, let $G = (V, E)$ be a knowledge graph, and $P \subseteq V$ be the set of predicate nodes. We associate each node $v \in V$ with an embedding vector $\mu_v$ that captures the local information, which is computed recursively using the equation:

$$\mu_v^{(t+1)} = h\Big(x_v, \textstyle\sum_{u \in \mathcal{N}(v)} \big[\mu_u^{(t)}; e_{uv}\big]\Big), \qquad \mu_v^{(0)} = \mathbf{0}. \tag{1}$$

Here $x_v$ encodes the text feature of node $v$, generated by an LSTM that is jointly trained with the supervision, and $e_{uv}$ embeds the edge type, which can be time, loc, etc. We use a two-layer neural network for the function $h$. Eq. (1) iterates for $T$ steps, and we use $\mu_v^{(T)}$ as the node embedding representation. Finally, the scoring function is defined as:

$$f(G_h, G_s) = \sigma\Big(\max_{p \in P_h,\, p' \in P_s} \big\langle \mu_p^{(T)}, \mu_{p'}^{(T)} \big\rangle - b\Big), \tag{2}$$

where $\sigma$ is the sigmoid function and the shift $b$ is used to center the matching score at zero. Eq. (2) performs maximum inner product search between all pairs of predicate node embeddings. This mimics the procedure of reasoning over the most relevant hypothesis and corresponding supporting evidence, since each embedding vector already captures the information within its $T$-hop neighborhood.
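The maximum-inner-product scoring step above can be sketched numerically as follows; the embedding matrices and the shift value are illustrative assumptions, with the node embeddings assumed to be precomputed.

```python
# Score a (hypothesis, supporting) pair: take the maximum inner product
# over all pairs of predicate node embeddings, shift by b, and squash
# with a sigmoid so the output behaves as a binary-classifier probability.
import numpy as np

def score(H, S, b=0.0):
    """H: (n_h, d) hypothesis predicate embeddings;
    S: (n_s, d) supporting predicate embeddings."""
    inner = H @ S.T                      # all pairwise inner products
    return 1.0 / (1.0 + np.exp(-(inner.max() - b)))
```

At answer time, the option whose graph pair receives the highest score would be selected.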
5 Experiments

We compare our method against several recently published baseline models, including state-of-the-art neural models from the well-known SQuAD and SNLI tasks.
We use the ARC Challenge Set (Clark et al., 2018) for all experiments. This dataset consists of 2,590 questions drawn from a variety of human exams. We use the original train / development / test split; the test set, held out for model evaluation, contains 1,172 questions. For each question, a QA system receives one point if it selects the correct answer, and $1/k$ points if it reports a $k$-way tie (i.e., chooses multiple answers) that includes the correct answer. The ARC Corpus can be optionally used by all models.
Guess-all / Random: This naive baseline simply selects all answer options, receiving a score of $1/k$ for each question with $k$ answer options. Random selection also converges to this score after enough trials.
IR-ARC: This IR-based method sends the question stem concatenated with each answer option as a query to a search engine. For IR-ARC, the search engine is built on top of the ARC Corpus, and the search score is the ElasticSearch score of the top retrieved sentence. The option with the highest search score is selected.
IR-Google: This is similar to IR-ARC, but uses the Google Search API (https://developers.google.com/custom-search) to retrieve documents from the entire Web instead of searching only the ARC Corpus. IR-Google uses the number of hits as the search score.
TableILP: This method (Khashabi et al., 2016) performs table-based reasoning, formulated as an Integer Linear Program (ILP).
TupleInference: This model (Khot et al., 2017) searches for a graph that best connects the terms in the question with an answer choice via the knowledge extracted by Open IE.
DecompAttn: A neural entailment model (Parikh et al., 2016) adapted to multiple-choice QA by assigning an entailment score to each pair of hypothesis and single supporting sentence (Clark et al., 2018). The answer option with the highest score is selected. DecompAttn is a top performer on SNLI (Bowman et al., 2015).
DGEM-OpenIE: DGEM (Khot et al., 2018) is also a neural model for sentence-level entailment, but uses Open IE to create a structured representation of the hypothesis. DGEM is a top performer on the SciTail task (Khot et al., 2018). In Clark et al. (2018), another version of DGEM uses a proprietary parser together with Open IE and achieves a 27.11 test score. For fair comparison, we only list publicly available models in Table 1.
5.3 Results and Analysis
Table 1 summarizes the test scores of all baseline models and our method. It is striking that none of the baseline methods perform significantly better than the Guess-all / Random baseline (25.02) at the 95% confidence level. Our method achieves 31.70, which substantially improves the previous state of the art by 17.5%.
Nevertheless, we are still far from “passing” the exam. To dissect the difficulties, we randomly sample 100 questions for investigation and report the results in Figure 1. More than half of the questions lack support: even a human could not solve them by referring only to the retrieved supporting sentences. This may be caused by the limited coverage of the corpus, and by retrieval bias: sentences with low word overlap may explain a concept indispensable for reasoning, yet fail to be retrieved. External knowledge sources may help on these questions. 12% of the questions lose key information in the graph due to Open IE failures; sentence parsing may help here since it preserves more of the text. 21% of the questions require very complex reasoning, and only 15% of the questions are “learnable” given the current framework. This yields an estimated upper bound of 36.25, obtained by correctly answering all learnable questions and randomly guessing on the others. Improving the learning algorithm should bring our current result closer to this upper bound.
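The upper-bound estimate follows from simple arithmetic: full credit on the 15% learnable questions plus an expected 1/4 on the remaining 85% (assuming 4-way questions).

```python
# Sanity check of the estimated upper bound: answer all learnable
# questions correctly, guess uniformly (1/4) on the rest.
learnable, random_acc = 0.15, 0.25
upper_bound = 100 * (learnable * 1.0 + (1 - learnable) * random_acc)
assert abs(upper_bound - 36.25) < 1e-9
```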
6 Conclusion and Future Work
We present a neural reasoning engine for answering science exam questions, which learns to reason over contextual knowledge graphs. Experimental results show that our method outperforms existing QA systems on the ARC Challenge Set. In the future, we will explore how to exploit external knowledge sources, and try to improve the quality of Open IE via sentence parsing.
- Banko et al. (2007) Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI, volume 7, pages 2670–2676, 2007.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.
- Bowman et al. (2015) Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
- Christensen et al. (2011) Janara Christensen, Stephen Soderland, Oren Etzioni, et al. An analysis of open information extraction based on semantic role labeling. In Proceedings of the sixth international conference on Knowledge capture, pages 113–120. ACM, 2011.
- Clark et al. (2016) Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter D Turney, and Daniel Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, pages 2580–2586, 2016.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Dai et al. (2016) Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
- Feng et al. (2016) Jun Feng, Minlie Huang, Yang Yang, et al. Gake: Graph aware knowledge embedding. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 641–651, 2016.
- Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
- Gormley and Tong (2015) Clinton Gormley and Zachary Tong. Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O'Reilly Media, Inc., 2015.
- He et al. (2017) He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. arXiv preprint arXiv:1704.07130, 2017.
- Jansen et al. (2016) Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark. What’s in an explanation? characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2956–2965, 2016.
- Khashabi et al. (2016) Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Peter Clark, Oren Etzioni, and Dan Roth. Question answering via integer programming over semi-structured knowledge. arXiv preprint arXiv:1604.06076, 2016.
- Khot et al. (2015) Tushar Khot, Niranjan Balasubramanian, Eric Gribkoff, Ashish Sabharwal, Peter Clark, and Oren Etzioni. Markov logic networks for natural language question answering. arXiv preprint arXiv:1507.03045, 2015.
- Khot et al. (2017) Tushar Khot, Ashish Sabharwal, and Peter Clark. Answering complex questions using open information extraction. arXiv preprint arXiv:1704.05572, 2017.
- Khot et al. (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. Scitail: A textual entailment dataset from science question answering. In Proceedings of AAAI, 2018.
- Pal et al. (2016) Harinder Pal et al. Demonyms and compound relational nouns in nominal open ie. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction, pages 35–39, 2016.
- Parikh et al. (2016) Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
- Richardson and Domingos (2006) Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.
- Seo et al. (2016) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603, 2016.
- Zhang et al. (2017) Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander J Smola, and Le Song. Variational reasoning for question answering with knowledge graph. arXiv preprint arXiv:1709.04071, 2017.
Appendix A Examples of Knowledge Graphs
To illustrate how we construct knowledge graphs from hypothesis and supporting sentences, here we present some examples.
We first show a relatively simple example in Figure 2. We see a pair of hypothesis and supporting graphs. The hypothesis is “seed of oak comes from fruit”, as shown in Figure 1(a). Note that the verb “comes” is lemmatized and becomes “come” in the graph. The supporting knowledge graph is plotted in Figure 1(b), where we obtain knowledge including “fruit contains seed”, “fruit is part of tree”, and “oak is kind of tree”. With the supporting knowledge graph, we should be able to infer that the hypothesis is true.
Note that the knowledge graphs can be very complicated when the question stem has multiple sentences, or when there is rich information in the supporting sentences extracted by Open IE. We show another example in Figure 3, which has a denser supporting graph than the previous example; this is actually common in the ARC Challenge Set. In this example, the hypothesis is “day and night occurs due to rotation of earth”, as plotted in Figure 2(a). Looking at the supporting graph in Figure 2(b), we can find key information for this question, such as “day and night occurs because earth rotates”, “day and night causes earth rotation on its axis”, “day and night is caused by earth’s rotation”, etc. With the supporting knowledge graph, we should have the necessary information to verify the hypothesis.