Unification-based Reconstruction of Explanations for Science Questions

03/31/2020 ∙ by Marco Valentino, et al.

The paper presents a framework to reconstruct explanations for multiple-choice science questions through explanation-centred corpora. Building upon the notion of unification in science, the framework ranks explanatory facts with respect to a question and candidate answer by leveraging a combination of two scores: (a) a Relevance Score (RS) that represents the extent to which a given fact is specific to the question; (b) a Unification Score (US) that takes into account the explanatory power of a fact, determined according to its frequency in explanations for similar questions. An extensive evaluation of the framework is performed on the Worldtree corpus, adopting IR weighting schemes for its implementation. The following findings are presented: (1) the proposed approach achieves competitive results when compared to state-of-the-art Transformers, while remaining scalable to large explanatory knowledge bases; (2) the combined model significantly outperforms IR baselines (+7.8/8.4 MAP), confirming the complementary aspects of relevance and unification score; (3) the constructed explanations can support downstream models for answer prediction, improving the accuracy of BERT for multiple-choice QA on both ARC easy (+6.92%) and challenge (+15.69%) questions.

1 Introduction

Answering multiple-choice science questions has become an established benchmark for testing the natural language understanding and inference capabilities of Question Answering (QA) systems. Several datasets and corpora have been released with the aim of advancing research in automated scientific reasoning [17, 4, 23]. In parallel with other NLP research areas, a crucial requirement that has emerged in recent years is explainability [24, 2, 28]. To boost automatic methods of inference, it is necessary to measure not only the performance of QA systems on answer prediction, but also their ability to provide explanations for the underlying reasoning process. The need for explainability and a quantitative methodology for its evaluation have directed research efforts towards the creation of explanation-centred corpora [14, 11]. At the same time, shared tasks on explanation reconstruction have been launched to accelerate the creation of explainable inference models [13].

Given a multiple-choice science question, the task of explanation reconstruction consists in retrieving the gold explanation that supports the answer by selecting a set of facts from a Knowledge Base (KB). The structure of explanations in the scientific domain affects the complexity of the reconstruction task. Explanations for science questions are typically composed of two main parts: a question-specific part, containing knowledge about concrete concepts in the question, and a core scientific part, including general scientific statements and laws. On the one hand, question-specific sentences share key terms with the question and can therefore be retrieved by IR techniques based on lexical overlaps. On the other hand, assessing the relevance of core scientific statements is a complex task, as they tend to be less grounded in the question. Consider the following question and answer pair from [14]:

  • q: What is an example of a force producing heat? a: Two sticks getting warm when rubbed together.

An explanation that justifies a might be composed using the following sentences from a KB: (f1) a stick is a kind of object; (f2) to rub together means to move against; (f3) friction is a kind of force; (f4) friction occurs when two object's surfaces move against each other; (f5) friction causes the temperature of an object to increase. The explanation contains a set of sentences that are directly connected with q and a (f1, f2 and f3), along with a set of abstract facts that require additional knowledge to be inferred (f4 and f5). Performing multi-step inference through lexical overlaps to retrieve abstract facts is challenging due to semantic drift, i.e. the tendency to compose spurious inference chains leading to wrong conclusions [15, 8].

However, explanations in science operate through unification [9, 19, 20]. Scientific laws generate abstract explanatory patterns that are instantiated to show how apparently disconnected phenomena belong to a common regularity. An example of unification is Newton's law of universal gravitation, which unifies the motion of planets and of falling bodies on Earth by showing that all bodies with mass obey the same law of nature. The more phenomena a pattern unifies, the greater its explanatory power. As a consequence, highly explanatory patterns tend to be frequently reused in the process of explaining similar phenomena. Coming back to our example, the statement f4 ("friction occurs when two object's surfaces move against each other") tends to explain a larger set of questions than f1 ("a stick is a kind of object"), since the former constitutes an abstract explanatory pattern that applies to a broader range of situations (friction occurs for all objects' surfaces).

In this paper, we build upon the notion of explanatory unification and describe a framework to reconstruct explanations for science questions using explanation-centred corpora. In line with the shared task [13], we frame explanation reconstruction as a ranking problem and propose a model that adopts a combination of two different scores: (1) a Relevance Score (RS) that represents the extent to which a given fact is specific to the question; (2) a Unification Score (US) that takes into account the explanatory power of a fact, determined according to its frequency in explanations for similar questions. We perform an extensive evaluation on the Worldtree corpus [14], a subset of the ARC dataset [4] that includes explanations for science questions, adopting IR weighting schemes for the implementation. The following findings are presented:

  1. The proposed approach achieves competitive results when compared to state-of-the-art Transformers [7, 34], while remaining scalable to large explanatory knowledge bases;

  2. The combined model significantly outperforms IR baselines (+7.8/8.4 MAP), confirming the complementary aspects of relevance and unification score. Crucially, the unification score gives a decisive contribution to the ranking of complex scientific facts;

  3. The constructed explanations can support downstream models for answer prediction, significantly improving the accuracy of BERT [7] for multiple-choice QA on both ARC easy (+6.92%) and challenge (+15.69%) questions.

2 Explanation Reconstruction as a Ranking Problem

A multiple-choice science question is defined as a tuple (q, A), where q represents the textual description of the problem and A = {a_1, ..., a_n} the set of possible candidate answers. Given a problem formulation q and a candidate answer a ∈ A, the task of explanation reconstruction consists in selecting a set of facts E = {f_1, ..., f_m} from a knowledge base that support and justify a. In this paper, we adopt a methodology for explanation reconstruction that relies on the existence of an explanation-centred corpus for science questions. A corpus of explanations is generally composed of two distinct knowledge sources:

(a) A primary knowledge base, Facts KB (F_kb), defined as a collection of sentences encoding the knowledge necessary to answer and explain science questions. This background knowledge typically spans from general common-sense and taxonomic facts (e.g. an animal is a kind of living thing) to scientific statements and laws (e.g. gravity causes objects that have mass to be pulled down). A fundamental and desirable characteristic of F_kb is reusability, i.e. each of its facts can potentially be reused to compose explanations for multiple questions;

(b) A secondary knowledge base, Explanation KB (E_kb), consisting of a set of tuples (q_i, a_i, E_i), each of them connecting a question q_i to its correct answer a_i and the corresponding explanation E_i. An explanation E_i is therefore a composition of facts belonging to F_kb. To generalise over unseen questions, this knowledge source must be representative for the task, i.e. it has to store a set of representative examples whose solutions can support the reconstruction of explanations for unknown questions.

In this setting, the explanation reconstruction task for a given, unseen science question can be modelled as a ranking problem [13]. Specifically, given the question formulation q and a candidate answer a, the algorithm to solve the task is divided into three macro steps (see the sketch below): (1) computing an explanatory score e(f_i, q, a) for each fact f_i ∈ F_kb with respect to q and a; (2) producing a set of facts ordered by decreasing explanatory score; (3) selecting the top k elements of this ordered set and interpreting them as an explanation for (q, a). We operate under the framework described above and define an approach for computing the explanatory score e(f_i, q, a) by leveraging both F_kb and E_kb.
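
The three macro steps translate directly into a short ranking routine. The following is a minimal sketch of this procedure, assuming an explanatory_score function (e.g. the combined score of Section 2.1) is supplied by the caller; all names here are illustrative placeholders rather than the released implementation.

```python
from typing import Callable, List, Tuple

def reconstruct_explanation(
    question: str,
    candidate_answer: str,
    facts_kb: List[str],
    explanatory_score: Callable[[str, str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Rank every fact in the Facts KB by its explanatory score for the
    (question, candidate answer) pair and keep the top-k facts as the explanation."""
    # Step 1: score each fact with respect to question and candidate answer.
    scored = [(fact, explanatory_score(fact, question, candidate_answer)) for fact in facts_kb]
    # Step 2: produce the set of facts ordered by decreasing explanatory score.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Step 3: interpret the top-k facts as the reconstructed explanation.
    return scored[:top_k]
```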

2.1 Modeling Explanatory Relevance

We present an approach for modeling e(f, q, a) that leverages a combination of two scoring functions, one defined in terms of lexical relevance (relevance score) and the other based on similar examples in the explanatory knowledge base (unification score). The proposed approach is guided by the following research hypotheses: (RH1) explanations for science questions are generally composed of a set of question-specific facts connected to a set of more abstract statements; (RH2) question-specific facts tend to share key terms with the question and can therefore be effectively ranked by IR techniques based on lexical overlaps; (RH3) core scientific statements tend to be more abstract and therefore more difficult to rank by means of shared terms, since the number of overlaps with question and candidate answer is generally low. However, due to explanatory unification, abstract scientific facts tend to be frequently reused across similar questions. We hypothesise that the explanatory power of a fact f for a given question q and candidate answer a is proportional to the number of times f appears in explanations for similar questions. To summarise, our hypotheses state that the explanatory score of a fact depends on two main factors: (a) how specific a fact is with respect to a given question and candidate answer; (b) how often the same fact is reused to construct explanations for similar questions in E_kb. Moreover, we hypothesise that these two components are highly complementary for explanation reconstruction, capturing two general aspects of explanations in the scientific domain. In order to formalise the hypotheses described above, we model the explanatory scoring function as a combination of two distinct components:

e(f, q, a) = rs(f, q, a) + us(f, q, a)    (1)

Here, rs(f, q, a) represents an Information Retrieval relevance score (RS) assigned to f with respect to q and a, while us(f, q, a) represents the Unification Score (US) of f, computed over E_kb as follows:

us(f, q, a) = Σ_{(q_i, a_i, E_i) ∈ kNN(q, a)} sim((q, a), (q_i, a_i)) · 1[f ∈ E_i]    (2)

Here, kNN(q, a) is the set of k-nearest neighbours of (q, a) belonging to E_kb, retrieved according to a similarity function sim, and 1[f ∈ E_i] indicates whether f appears in the explanation E_i. With the formulation of Equation 2 we aim to capture two main aspects related to our research hypotheses: (a) the more a fact is reused for explanations in kNN(q, a), the higher its explanatory power and therefore its Unification Score (US); (b) the Unification Score (US) of a fact is proportional to the similarity between the examples in kNN(q, a) that use f in their explanations and the unseen pair (q, a) we want to explain.
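
As a concrete illustration, the sketch below computes the Unification Score by summing the similarities of the k nearest training examples whose gold explanations contain the fact, and adds it to a caller-supplied relevance score as in Equation 1. The function names and the structure assumed for the Explanation KB (a list of (question, answer, explanation facts) tuples) are illustrative assumptions, not the authors' exact implementation.

```python
from typing import Callable, List, Set, Tuple

Example = Tuple[str, str, Set[str]]  # (training question, correct answer, gold explanation facts)

def unification_score(
    fact: str,
    question: str,
    answer: str,
    explanation_kb: List[Example],
    similarity: Callable[[str, str], float],
    k: int = 100,
) -> float:
    """Equation 2 (sketch): sum the similarities between the unseen (question, answer)
    pair and its k nearest neighbours in the Explanation KB that reuse the fact."""
    query = f"{question} {answer}"
    neighbours = sorted(
        explanation_kb,
        key=lambda ex: similarity(query, f"{ex[0]} {ex[1]}"),
        reverse=True,
    )[:k]
    return sum(
        similarity(query, f"{q_i} {a_i}") for q_i, a_i, expl in neighbours if fact in expl
    )

def explanatory_score(
    fact: str,
    question: str,
    answer: str,
    relevance: Callable[[str, str, str], float],
    explanation_kb: List[Example],
    similarity: Callable[[str, str], float],
) -> float:
    """Equation 1 (sketch): additive combination of Relevance Score and Unification Score."""
    return relevance(fact, question, answer) + unification_score(
        fact, question, answer, explanation_kb, similarity
    )
```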

Figure 1: Overview of the proposed framework for explanation reconstruction. Both Relevance Score (RS) and Unification Score (US) are implemented adopting IR weighting schemes.

Figure 1 shows a schematic representation of the framework. Both the Relevance Score (RS) and the Unification Score (US) components receive as input the question q along with a candidate answer a. However, while the RS component operates over F_kb, the US component retrieves explanations from similar cases stored in E_kb. The scores computed independently by each module are then combined as described in Equation 1 to obtain the final explanation ranking.

3 Empirical Evaluation

Worldtree Corpus.

We carried out an empirical evaluation on the Worldtree corpus [14], a subset of the ARC dataset [4] that includes explanations for science questions. The corpus provides an explanatory knowledge base (F_kb and E_kb) where each explanation in E_kb is represented as a set of lexically connected sentences describing how to arrive at the correct answer. The science questions in the Worldtree corpus are split into training-set, dev-set, and test-set. The gold explanations in the training-set are used to form the Explanation KB (E_kb), while the gold explanations in dev and test set are used for evaluation purposes only.

The corpus groups the sentences belonging to each explanation into three explanatory roles: grounding, central and lexical glue. Consider the example in Figure 1. To support q and a, the system has to retrieve the scientific facts describing how friction occurs and produces heat across objects. The corpus classifies these facts (f4 and f5) as central. Grounding explanations like "a stick is a kind of object" (f1) link question and answer to the central explanations. Lexical glues such as "to rub; to rub together means to move against" (f2) are used to fill lexical gaps between sentences. Additionally, the corpus divides the facts belonging to F_kb into three inference categories: retrieval type, inference-supporting type, and complex inference type. Taxonomic knowledge and properties such as "a stick is a kind of object" (f1) and "friction is a kind of force" (f3) are classified as retrieval type. Facts describing actions, affordances, and requirements such as "friction occurs when two object's surfaces move against each other" (f4) are grouped under the inference-supporting type. Knowledge about causality, descriptions of processes and if-then conditions such as "friction causes the temperature of an object to increase" (f5) is classified as complex inference type.

Implementation Details.

Relevance and unification scores are both implemented adopting TF-IDF/BM25 vectors and the cosine similarity function. Specifically, the RS model uses TF-IDF/BM25 to compute the relevance function rs for each fact in F_kb (Equation 1), while the US model adopts TF-IDF/BM25 to assign similarity scores sim to the questions in E_kb (Equation 2). For reproducibility, the code is available online: https://github.com/ai-systems/unification_reconstruction_explanations
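
A minimal sketch of this setup with scikit-learn TF-IDF vectors and cosine similarity is shown below; the BM25 variant would swap in a BM25 weighting (e.g. via the rank_bm25 package). The toy corpora and function names are illustrative placeholders, not the released code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the Worldtree knowledge sources (illustrative only).
facts_kb = [
    "a stick is a kind of object",
    "friction is a kind of force",
    "friction causes the temperature of an object to increase",
]
training_questions = [
    "which force produces energy as heat? friction",
    "a student drops a ball. which force causes the ball to fall to the ground? gravity",
]

vectorizer = TfidfVectorizer().fit(facts_kb + training_questions)
fact_vectors = vectorizer.transform(facts_kb)
question_vectors = vectorizer.transform(training_questions)

def relevance_scores(question_and_answer: str):
    """rs in Equation 1: cosine similarity between the query and every fact in F_kb."""
    query = vectorizer.transform([question_and_answer])
    return cosine_similarity(query, fact_vectors)[0]

def question_similarities(question_and_answer: str):
    """sim in Equation 2: cosine similarity between the query and the questions in E_kb."""
    query = vectorizer.transform([question_and_answer])
    return cosine_similarity(query, question_vectors)[0]
```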

3.1 Explanation Reconstruction

Model | Approach | MAP (Test) | MAP (Dev)
Das et al. [5] | BERT re-ranking with inference chains | 56.3 | 58.5
Chia et al. [3] | BERT re-ranking with gold IR scores | 47.7 | 50.9
Chia et al. [3] | Iterated BM25 | 45.8 | 49.7
RS BM25 | BM25 Relevance Score | 43.0 | 46.1
Banerjee [1] | BERT iterative re-ranking | 41.3 | 42.3
D'Souza et al. | Feature-rich SVM ranking + rules | 39.4 | 44.4
RS TF-IDF | TF-IDF Relevance Score | 39.4 | 42.8
D'Souza et al. | Feature-rich SVM ranking | 34.1 | 37.1
RS + US (Best) | Joint Relevance and Unification Score | 50.8 | 54.5
US (Best) | Unification Score | 22.9 | 21.9
Table 1: Results on test and dev set and comparison with state-of-the-art approaches.

In line with the shared task [13], the performance of the models is evaluated by computing the Mean Average Precision (MAP) of the explanation ranking produced for a given question and its correct answer against the gold explanation in the corpus. Table 1 reports the scores achieved by our implementations (RS, US, and RS + US) compared to state-of-the-art approaches in the literature. The state-of-the-art approaches are grouped into four categories: BERT-based models, Feature-based models, Information Retrieval with re-ranking, and One-step Information Retrieval.
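
For reference, the sketch below shows one standard way to compute MAP for this task: per question, the average precision of the ranked fact list against the gold explanation set, then the mean over all questions. It is a generic implementation for illustration, not the official shared-task scorer.

```python
from typing import Iterable, Sequence, Set

def average_precision(ranked_facts: Sequence[str], gold_explanation: Set[str]) -> float:
    """Average precision of a ranked list of facts against the gold explanation set."""
    hits, precision_sum = 0, 0.0
    for rank, fact in enumerate(ranked_facts, start=1):
        if fact in gold_explanation:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_explanation) if gold_explanation else 0.0

def mean_average_precision(rankings: Iterable[Sequence[str]],
                           gold_explanations: Iterable[Set[str]]) -> float:
    """MAP: mean of the per-question average precision scores."""
    scores = [average_precision(r, g) for r, g in zip(rankings, gold_explanations)]
    return sum(scores) / len(scores)
```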

BERT-based models.

This class of approaches employs the Worldtree training-set as labelled data for a BERT language model [7]. The best-performing system [5] achieves a MAP score of 56.3/58.5 with a multi-step retrieval strategy. In the first step, it returns the top K sentences ranked by a TF-IDF model. In the second step, BERT is used to re-rank the paths composed of all the facts in F_kb that are within 1 hop of the first retrieved set. Similarly, other approaches adopt BERT with a multi-step strategy to re-rank each fact individually [1, 3]. Although BERT-based models achieve state-of-the-art results in explanation reconstruction, these approaches tend to be computationally expensive, requiring a pre-filtering step to limit the space of candidate facts. Consequently, these systems do not scale with the size of the corpus, and their applicability to the downstream answer prediction task is limited. Comparatively, the RS + US implementation adopts IR weighting schemes, making it suitable for large corpora and for downstream answer prediction modules (as shown in Section 3.3). Along with improved scalability, the proposed approach achieves competitive results (50.8/54.5 MAP). Although we observe lower performance when compared to the best-performing approach (-5.5/-4 MAP), the joint RS + US model outperforms two BERT-based models [3, 1] on both test and dev set by 3.1/3.6 and 9.5/12.2 MAP respectively.

Feature-based models.

D'Souza et al. propose an approach based on a learning-to-rank paradigm. The model extracts a set of features based on overlap and coherence metrics between questions and explanation sentences. These features are then given as input to an SVM ranking module. While this approach scales to the whole corpus without requiring any pre-filtering step, it is significantly outperformed by the RS + US model on both test and dev set, by 16.7/17.4 MAP respectively.

Information retrieval with re-ranking.

Chia et al. [3] describe a multi-step, iterative re-ranking model based on BM25. The first step consists in retrieving the explanation sentence that is most similar to the question according to BM25 vectors. In the second step, the BM25 vector of the question is updated by aggregating it with the retrieved explanation sentence vector through a max operation. These two steps are repeated K times. Although this approach uses scalable IR techniques, it relies on a multi-step retrieval strategy. Moreover, the RS + US model outperforms this approach on both test and dev set by 5.0/4.8 MAP respectively.

One-step information retrieval.

We compare the RS + US model with two IR baselines. The baselines adopt TF-IDF and BM25 to compute the Relevance Score (RS) only, i.e. the term us(f, q, a) in Equation 1 is set to 0 for each fact f. In line with previous IR literature [29], BM25 leads to better performance than TF-IDF. While these approaches share similar characteristics with our model, the combined RS + US model outperforms both RS BM25 and RS TF-IDF on test and dev-set by 7.8/8.4 and 11.4/11.7 MAP respectively. Moreover, the joint RS + US model improves the performance of the US model alone by 27.9/32.6 MAP. These results outline the complementary aspects of relevance and unification score. We provide a detailed analysis by performing an ablation study on the dev-set (Section 3.2).

3.2 Ablation Study

(a) Explanatory roles.
Model | All | Central | Grounding | Lexical Glue
RS TF-IDF | 42.8 | 43.4 | 25.4 | 8.2
RS BM25 | 46.1 | 46.6 | 23.3 | 10.7
US TF-IDF | 21.6 | 16.9 | 22.0 | 13.4
US BM25 | 21.9 | 18.1 | 16.7 | 15.0
RS TF-IDF + US TF-IDF | 48.5 | 46.4 | 32.7 | 11.7
RS TF-IDF + US BM25 | 50.7 | 48.6 | 30.42 | 13.4
RS BM25 + US TF-IDF | 51.9 | 48.2 | 31.7 | 14.8
RS BM25 + US BM25 | 54.5 | 51.7 | 27.3 | 16.7

(b) Number of lexical overlaps.
Model | 1+ Overlaps | 1 Overlap | 0 Overlaps
RS TF-IDF | 57.2 | 33.6 | 7.1
RS BM25 | 62.2 | 37.1 | 7.1
US TF-IDF | 17.37 | 18.0 | 12.5
US BM25 | 18.1 | 18.1 | 13.1
RS TF-IDF + US TF-IDF | 60.2 | 38.4 | 9.0
RS TF-IDF + US BM25 | 62.5 | 39.5 | 9.6
RS BM25 + US TF-IDF | 61.3 | 40.6 | 11.0
RS BM25 + US BM25 | 64.8 | 41.9 | 11.2

(c) Inference types.
Model | Retrieval | Inference-supporting | Complex Inference
RS TF-IDF | 33.5 | 34.7 | 21.8
RS BM25 | 36.0 | 36.1 | 24.8
US TF-IDF | 17.6 | 12.8 | 19.5
US BM25 | 16.8 | 13.2 | 20.9
RS TF-IDF + US TF-IDF | 38.3 | 33.2 | 30.2
RS TF-IDF + US BM25 | 40.0 | 35.6 | 33.3
RS BM25 + US TF-IDF | 40.5 | 33.6 | 33.4
RS BM25 + US BM25 | 40.6 | 38.3 | 36.8

(d) Inverse question similarity function.
Model | MAP
US BM25 | 21.9
RS BM25 | 46.1
RS BM25 + US BM25 | 54.5
US Inverse BM25 | 12.9
RS BM25 + US Inverse BM25 | 20.9

Table 2: Comparison (MAP) on the dev-set over different explanation types.

We present an ablation study with the aim of understanding the contribution of each sub-component to the overall performance of the joint RS + US model (see Table 1). To this end, a detailed evaluation on the development set of the Worldtree corpus is carried out, analysing the performance in reconstructing explanations of different types and complexity. We compare the joint model (RS + US) with each individual sub-component (RS and US alone). In addition, a set of qualitative examples is analysed to provide additional insights into the complementary aspects captured by relevance and unification score.

Explanation type.

Given a question q and its correct answer a, we classify each fact f belonging to the gold explanation according to its explanatory role (central, grounding, lexical glue) and inference type (retrieval, inference-supporting and complex inference). In addition, three new categories are derived from the number of overlaps between f and the concatenation of q with a, computed by considering nouns, verbs, adjectives and adverbs (1+ overlaps, 1 overlap, 0 overlaps). Table 2 reports the MAP score for each of the described categories. Overall, the best results are obtained by the BM25 implementation of the joint model (RS BM25 + US BM25) with a MAP score of 54.5. Specifically, RS BM25 + US BM25 achieves a significant improvement over both the RS BM25 (+8.4 MAP) and US BM25 (+32.6 MAP) baselines. Regarding the explanatory roles (Table 2a), the joint TF-IDF implementation shows the best performance in the reconstruction of grounding explanations (32.7 MAP). On the other hand, a significant improvement over the RS baseline is obtained by RS BM25 + US BM25 on both lexical glues and central explanation sentences (+6 and +5.6 MAP over RS BM25). Regarding the lexical overlap categories (Table 2b), we observe a steady improvement for all the combined RS + US models over the respective RS baselines. Notably, the US models achieve the best performance on the 0 overlaps category, which includes the most challenging facts for the RS models. The improved ability to rank abstract explanatory facts contributes to better performance for the joint models (RS + US) in the reconstruction of explanations that share few terms with question and answer (1 overlap and 0 overlaps categories). This characteristic leads to an improvement of 4.8 and 4.1 MAP for the RS BM25 + US BM25 model over the RS BM25 baseline. Similar results are achieved on the inference type categories (Table 2c). Crucially, the largest improvement is observed for the ranking of complex inference sentences, where RS BM25 + US BM25 outperforms RS BM25 by 12 MAP, confirming the decisive contribution of the Unification Score (US) to the ranking of complex scientific facts.
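
The lexical-overlap bins can be derived with a simple content-word intersection; the sketch below uses NLTK part-of-speech tags to restrict the comparison to nouns, verbs, adjectives and adverbs. The exact tokenisation, tagging and bin boundaries (here interpreted as zero, exactly one, and more than one shared content word) are assumptions for illustration rather than the authors' exact procedure.

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

def content_terms(text: str) -> set:
    """Lower-cased content words (nouns, verbs, adjectives, adverbs) in the text."""
    tokens = nltk.word_tokenize(text.lower())
    return {tok for tok, tag in nltk.pos_tag(tokens) if tag.startswith(CONTENT_TAG_PREFIXES)}

def overlap_category(fact: str, question: str, answer: str) -> str:
    """Assign a gold explanation fact to an overlap bin w.r.t. question and answer."""
    overlaps = len(content_terms(fact) & content_terms(f"{question} {answer}"))
    if overlaps == 0:
        return "0 overlaps"
    return "1 overlap" if overlaps == 1 else "1+ overlaps"
```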

Explanation length and precision.

Science questions in the Worldtree corpus require an average of six facts in their explanations [11]. Long explanations typically include sentences that share few terms with question and answer, increasing the complexity of the reconstruction task. Therefore, to test the robustness of the models, we measure the performance in reconstructing explanations of increasing length. Figure 2a shows the change in MAP score for the RS + US, RS and US models (BM25 implementation). The drop in performance for the RS baseline with increasing explanation length reflects the complexity of the task. This drop occurs because the RS model is not able to assign high scores to abstract explanatory facts. Conversely, the US model exhibits steadier performance, with a trend that is the inverse of the RS model's. Short explanations, indeed, tend to include only question-specific facts that are not frequently reused in E_kb. On the other hand, the longer the explanation, the higher the number of abstract explanatory facts. The decrease in MAP score observed for the RS model is then compensated by the Unification Score (US), since abstract scientific facts tend to be reused frequently across similar questions. Therefore, the US model contributes to more stable performance for the joint RS + US model, resulting in a larger improvement over the RS baseline with increasing explanation length.

Figure 2: Contribution of each component to the robustness of the joint model: (a) MAP vs explanation length; (b) Precision@K on the dev-set. RS + US (blue solid), RS (green dotted), US (red dashed).

Figure 2b illustrates the Precision@K. As shown in the graph, the drop in precision for the US model exhibits the smallest gradient. Similarly to what was observed for explanations of increasing length, the US score contributes to the robustness of the RS + US model, making it able to reconstruct more precise explanations. As discussed in Section 3.3, this feature has a positive impact on answer prediction.

Impact of question similarity.

To understand the impact of similar questions on the reconstruction, we evaluate the joint model with an inverse question similarity function, i.e. substituting 1/sim for sim in Equation 2. The results (Table 2d) highlight a significant drop in performance for both the US model (-9 MAP) and the combined RS + US model (-33.6 MAP). This decline not only confirms the importance of using sensible question similarity models for the Unification Score (US), but also supports the hypothesis that explanatory patterns are predominantly reused across similar questions.

Qualitative analysis.

To provide additional insights into the complementary aspects of unification and relevance score, we present a set of qualitative examples from the dev-set. Table 3 illustrates the rankings assigned by the RS and RS + US models to scientific sentences of increasing complexity. In the first example, the sentence "gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet" shares key terms with question and candidate answer and is therefore relatively easy to rank for the RS model (#36). Nevertheless, the RS + US model is able to improve the ranking by 34 positions (#2), as the gravitational law represents a scientific pattern with high explanatory unification, frequently reused across similar questions. The impact of the Unification Score is more evident when considering abstract explanatory facts. Coming back to our original example (i.e. "What is an example of a force producing heat?"), the fact "friction causes the temperature of an object to increase" has no significant overlaps with question and answer. Thus, the RS model ranks the gold explanation sentence in a low position (#1472). However, the Unification Score (US) is able to capture the explanatory power of the fact from similar questions in E_kb, pushing the RS + US ranking up to position #21 (+1451).

Example 1
  • Question: If you bounce a rubber ball on the floor it goes up and then comes down. What causes the ball to come down? Answer: gravity
  • Explanation sentence: gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet
  • Most similar questions in E_kb: (1) A ball is tossed up in the air and it comes back down. The ball comes back down because of - gravity; (2) A student drops a ball. Which force causes the ball to fall to the ground? - gravity
  • Rank: RS #36; RS + US #2 (+34)

Example 2
  • Question: Which animals would most likely be helped by flood in a coastal area? Answer: alligators
  • Explanation sentence: as water increases in an environment, the population of aquatic animals will increase
  • Most similar questions in E_kb: (1) Where would animals and plants be most affected by a flood? - low areas; (2) Which change would most likely increase the number of salamanders? - flood
  • Rank: RS #198; RS + US #57 (+141)

Example 3
  • Question: What is an example of a force producing heat? Answer: two sticks getting warm when rubbed together
  • Explanation sentence: friction causes the temperature of an object to increase
  • Most similar questions in E_kb: (1) Rubbing sandpaper on a piece of wood produces what two types of energy? - sound and heat; (2) Which force produces energy as heat? - friction
  • Rank: RS #1472; RS + US #21 (+1451)

Table 3: Impact of the RS + US model on the ranking of scientific sentences of increasing complexity.

Limitations.

Two major factors limit the performance of the joint RS + US model: (a) the gold explanation of a question does not include any abstract facts, i.e. only question-specific facts are needed; (b) a subset of the examples in E_kb that share abstract explanatory patterns with q is not covered by the question similarity function. In case (a), the Unification Score (US) might introduce noise due to the lack of representative examples in E_kb that support the reconstruction of the gold explanation for q; in this case, the Relevance Score (RS) is sufficient. Case (b) is related to the specific implementation of the similarity function. TF-IDF and BM25 produce similarity scores based on shared terms. In some cases, however, a subset of questions adopting the same explanatory pattern as q can be retrieved only by means of semantic abstraction (e.g. hypernyms, synonyms), a characteristic not captured by the current implementation. This observation suggests the exploration of abstract question representations as a focus for future work.

3.3 Answer Prediction

Answer Prediction Model | Easy | Challenge | Overall
BERT (no explanation) | 48.54 | 26.28 | 41.78
BERT + RS BM25 (K = 3) | 53.20 | 40.97 | 49.39
BERT + RS BM25 (K = 5) | 54.36 | 38.14 | 49.31
BERT + RS BM25 (K = 10) | 32.71 | 29.63 | 31.75
BERT + RS BM25 + US BM25 (K = 3) | 55.46 | 41.97 | 51.62
BERT + RS BM25 + US BM25 (K = 5) | 54.48 | 39.43 | 50.12
BERT + RS BM25 + US BM25 (K = 10) | 48.66 | 37.37 | 45.14
Table 4: Accuracy (%) of BERT on answer prediction (test-set) with and without explanations.

To understand whether explanations can support answer prediction, we compare the performance of BERT [7] without explanations against the performance of BERT operating on the top K explanation sentences retrieved by the RS and RS + US models (BM25 implementation). BERT without explanations is provided with question and candidate answer only. On the other hand, BERT with explanations receives the following input: the question q, a candidate answer a_i and the explanation e_i computed for (q, a_i). In this setting, the model is fine-tuned for binary classification to predict a probability score s_i for each candidate answer a_i in A:

s_i = P(correct | q, a_i, e_i)    (3)

The binary classifier operates on the final hidden state corresponding to the [CLS] token. To answer the question q, the module selects the candidate answer a_j such that j = argmax_i s_i. Table 4 reports the accuracy with and without explanations on the Worldtree test-set for both ARC easy and challenge questions [4]. Notably, a significant improvement in accuracy can be observed when BERT is provided with the explanations retrieved by the reconstruction modules (+9.78% accuracy with the RS BM25 + US BM25 model). The improvement is consistent on the easy split (+6.92%) and particularly significant for challenge questions (+15.69%). Overall, we observe a correlation between more precise explanations and accuracy in answer prediction, with BERT + RS BM25 being outperformed by BERT + RS BM25 + US BM25 for each value of K. In fact, the decrease in accuracy occurring with increasing values of K is consistent with the drop in precision for both models observed in Figure 2b. Moreover, the steadier results in answer prediction obtained with the RS + US model suggest a positive contribution from abstract explanatory facts. Additional investigation of this aspect will be a focus for future work.
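
As an illustration of this setup, the sketch below scores each candidate answer with the probability of the positive class produced by a BERT sequence classifier and returns the argmax. In practice the classifier would first be fine-tuned on the Worldtree training set; the checkpoint, input formatting and truncation length used here are assumptions rather than the authors' exact configuration.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# In the paper's setting this classifier is fine-tuned for binary classification;
# here it is loaded untrained purely to illustrate the inference procedure.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def predict_answer(question: str, candidates: list, explanations: list) -> str:
    """Select the candidate answer whose (question, answer + explanation) pair
    receives the highest probability for the 'correct' class (Equation 3)."""
    scores = []
    with torch.no_grad():
        for answer, explanation in zip(candidates, explanations):
            inputs = tokenizer(question, f"{answer} {explanation}",
                               return_tensors="pt", truncation=True, max_length=512)
            probs = torch.softmax(model(**inputs).logits, dim=-1)
            scores.append(probs[0, 1].item())  # probability of the positive class
    return candidates[scores.index(max(scores))]
```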

4 Related Work

Explanations for Science Questions.

Reconstructing explanations for science questions can be reduced to a multi-hop inference problem, where multiple pieces of evidence have to be aggregated to arrive at the final answer [16, 18, 12]. Aggregation methods based on lexical overlaps suffer from semantic drift [15, 8], i.e. the tendency to compose spurious inference chains leading to wrong conclusions. One way to contain semantic drift is to identify common explanatory patterns through explanation-centred corpora [13, 14]. Transformer-based architectures [5, 3] represent the state of the art for explanation reconstruction in this setting. However, these models require high computational resources that prevent their applicability to large explanatory knowledge bases. On the other hand, approaches based on IR weighting schemes are readily scalable. The approach described in this paper preserves the benefits of IR weighting schemes while obtaining performance comparable to Transformer-based architectures. Thanks to this feature, the framework can be applied in combination with downstream answer prediction models. Our findings are in line with previous work in different Question Answering settings [27, 37], which highlights the positive impact of explanations on the final answer prediction task.

Scientific Explanation and AI.

The field of Artificial Intelligence has historically been inspired by models of explanation in Philosophy of Science [32]. The deductive-nomological model proposed by Hempel [10] constitutes the philosophical foundation for explainable models based on logical deduction, such as Expert Systems [22, 35] and Explanation-based Learning [25]. Similarly, the inherent relation between explanation and causality [36, 30] has inspired computational models of causal inference [26]. The view of explanation as unification [9, 19, 20] is closely related to Case-based Reasoning [21, 31, 6]. In this context, analogical reasoning plays a key role in the process of reusing abstract patterns for explaining new phenomena [33]. Similarly to our approach, Case-based Reasoning applies this insight to construct explanations for novel problems by retrieving, reusing and adapting known cases whose solutions and explanations are stored in a knowledge base.

5 Conclusion

This paper proposed a framework for explanation reconstruction built upon the notion of explanatory unification in scientific explanation. An extensive evaluation on the Worldtree corpus led to the following findings: (1) the approach is competitive with state-of-the-art Transformers while being inherently scalable; (2) the joint model significantly outperforms IR baselines, confirming the complementary aspects of relevance and unification score; (3) explanations can support answer prediction, improving the accuracy of BERT by up to 10% overall. As suggested by the performed analysis, future work will address current limitations by exploring the role of different semantic representations and similarity measures for the questions. Additional work will focus on extending the framework with multi-hop inference methods in order to refine the precision and robustness of the discussed implementation and to further investigate the impact of explanatory inference on multiple-choice science questions.

References

  • [1] P. Banerjee (2019) ASU at textgraphs 2019 shared task: explanation regeneration using language models and iterative re-ranking. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 78–84. Cited by: §3.1.
  • [2] O. Biran and C. Cotton (2017) Explanation and justification in machine learning: a survey. In IJCAI-17 Workshop on Explainable AI (XAI), Vol. 8. Cited by: §1.
  • [3] Y. K. Chia, S. Witteveen, and M. Andrews (2019) Red dragon ai at textgraphs 2019 shared task: language model assisted explanation generation. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 85–89. Cited by: §3.1, §4.
  • [4] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: §1, §1, §3, §3.3.
  • [5] R. Das, A. Godbole, M. Zaheer, S. Dhuliawala, and A. McCallum (2019) Chains-of-reasoning at textgraphs 2019 shared task: reasoning over chains of facts for explainable multi-hop inference. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 101–117. Cited by: §3.1, §4.
  • [6] R. L. De Mantaras, D. McSherry, D. Bridge, D. Leake, B. Smyth, S. Craw, B. Faltings, M. L. Maher, M. T. Cox, K. Forbus, et al. (2005) Retrieval, reuse, revision and retention in case-based reasoning. The Knowledge Engineering Review 20 (3), pp. 215–240. Cited by: §4.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: item 1, item 3, §3.1, §3.3.
  • [8] D. Fried, P. Jansen, G. Hahn-Powell, M. Surdeanu, and P. Clark (2015) Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics 3, pp. 197–210. Cited by: §1, §4.
  • [9] M. Friedman (1974) Explanation and scientific understanding. The Journal of Philosophy 71 (1), pp. 5–19. Cited by: §1, §4.
  • [10] C. G. Hempel (1965) Aspects of scientific explanation. Cited by: §4.
  • [11] P. Jansen, N. Balasubramanian, M. Surdeanu, and P. Clark (2016) What’s in an explanation? characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2956–2965. Cited by: §1, §3.2.
  • [12] P. Jansen, R. Sharp, M. Surdeanu, and P. Clark (2017) Framing qa as building and ranking intersentence answer justifications. Computational Linguistics 43 (2), pp. 407–449. Cited by: §4.
  • [13] P. Jansen and D. Ustalov (2019) TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 63–77. Cited by: §1, §1, §2, §3.1, §4.
  • [14] P. Jansen, E. Wainwright, S. Marmorstein, and C. Morrison (2018) WorldTree: a corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §1, §1, §3, §4.
  • [15] D. Khashabi, E. S. Azer, T. Khot, A. Sabharwal, and D. Roth (2019) On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522. Cited by: §1, §4.
  • [16] D. Khashabi, T. Khot, A. Sabharwal, and D. Roth (2018) Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.
  • [17] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal (2019) QASC: a dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473. Cited by: §1.
  • [18] T. Khot, A. Sabharwal, and P. Clark (2017) Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 311–316. Cited by: §4.
  • [19] P. Kitcher (1981) Explanatory unification. Philosophy of science 48 (4), pp. 507–531. Cited by: §1, §4.
  • [20] P. Kitcher (1989) Explanatory unification and the causal structure of the world. Cited by: §1, §4.
  • [21] J. Kolodner (2014) Case-based reasoning. Morgan Kaufmann. Cited by: §4.
  • [22] C. Lacave and F. J. Diez (2004) A review of explanation methods for heuristic expert systems. The Knowledge Engineering Review 19 (2), pp. 133–146. Cited by: §4.
  • [23] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391. Cited by: §1.
  • [24] T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38. Cited by: §1.
  • [25] T. M. Mitchell, R. M. Keller, and S. T. Kedar-Cabelli (1986) Explanation-based generalization: a unifying view. Machine learning 1 (1), pp. 47–80. Cited by: §4.
  • [26] J. Pearl (2009) Causality. Cambridge university press. Cited by: §4.
  • [27] N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932–4942. Cited by: §4.
  • [28] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §1.
  • [29] S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389. Cited by: §3.1.
  • [30] W. C. Salmon (1984) Scientific explanation and the causal structure of the world. Princeton University Press. Cited by: §4.
  • [31] F. Sørmo, J. Cassens, and A. Aamodt (2005) Explanation in case-based reasoning–perspectives and goals. Artificial Intelligence Review 24 (2), pp. 109–143. Cited by: §4.
  • [32] P. Thagard and A. Litt (2008) Models of scientific explanation. The Cambridge handbook of computational psychology, pp. 549–564. Cited by: §4.
  • [33] P. Thagard (1992) Analogy, explanation, and education. Journal of research in science teaching 29 (6), pp. 537–544. Cited by: §4.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: item 1.
  • [35] M. R. Wick and W. B. Thompson (1992) Reconstructive expert system explanation. Artificial Intelligence 54 (1-2), pp. 33–70. Cited by: §4.
  • [36] J. Woodward (2005) Making things happen: a theory of causal explanation. Oxford university press. Cited by: §4.
  • [37] V. Yadav, S. Bethard, and M. Surdeanu (2019) Quick and (not so) dirty: unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2578–2589. Cited by: §4.