Hybrid Autoregressive Solver for Scalable Abductive Natural Language Inference

07/25/2021 ∙ by Marco Valentino, et al. ∙ The University of Manchester

Regenerating natural language explanations for science questions is a challenging task for evaluating complex multi-hop and abductive inference capabilities. In this setting, Transformers trained on human-annotated explanations achieve state-of-the-art performance when adopted as cross-encoder architectures. However, while much attention has been devoted to the quality of the constructed explanations, the problem of performing abductive inference at scale is still under-studied. As intrinsically not scalable, the cross-encoder architectural paradigm is not suitable for efficient multi-hop inference on massive facts banks. To maximise both accuracy and inference time, we propose a hybrid abductive solver that autoregressively combines a dense bi-encoder with a sparse model of explanatory power, computed leveraging explicit patterns in the explanations. Our experiments demonstrate that the proposed framework can achieve performance comparable with the state-of-the-art cross-encoder while being ≈ 50 times faster and scalable to corpora of millions of facts. Moreover, we study the impact of the hybridisation on semantic drift and science question answering without additional training, showing that it boosts the quality of the explanations and contributes to improved downstream inference performance.


1 Introduction

Explanation regeneration is the task of retrieving and combining two or more facts from an external knowledge source to reconstruct the explanation supporting a certain natural language hypothesis Xie et al. (2020); Jansen et al. (2018). As such, this task represents a crucial intermediate step for the development and evaluation of explainable Natural Language Inference (NLI) models Wiegreffe and Marasović (2021); Thayaparan et al. (2020a). In particular, explanation regeneration on science questions has been identified as a suitable benchmark for complex multi-hop and abductive inference capabilities Xie et al. (2020); Jansen and Ustalov (2019); Jansen et al. (2018). Scientific explanations, in fact, require the articulation and integration of commonsense and scientific knowledge for the construction of long explanatory reasoning chains Clark et al. (2018); Jansen et al. (2016). Moreover, since the structure of scientific explanations cannot be derived from the decomposition of the questions, the task requires the encoding of complex abstractive mechanisms for the identification of relevant explanatory knowledge Valentino et al. (2021); Zhou et al. (2021); Thayaparan et al. (2020b).

To tackle these challenges, most of the existing approaches for explanation regeneration leverage the power of the cross-attention mechanism in Transformers Devlin et al. (2019); Vaswani et al. (2017), training sequence classification models on the task of composing relevant explanatory chains supervised via human-annotated explanations Cartuyvels et al. (2020); Das et al. (2019); Chia et al. (2019); Banerjee (2019). While Transformers achieve state-of-the-art performance, the adoption of cross-encoder architectures makes abductive inference intrinsically inefficient and not scalable to massive corpora. The problem of constructing scientific explanations efficiently, in fact, is still under-studied, with the state-of-the-art Transformer Cartuyvels et al. (2020) requiring several seconds per question even on relatively small facts banks (Section 4.2).

Figure 1: We propose a hybrid architecture, scalable to massive facts banks, that performs abductive inference autoregressively. At each time-step t, we perform inference via Maximum Inner Product Search (MIPS) (1), combining dense and sparse encoders to compute the relevance and explanatory power of candidate facts (2), and expand the explanation (3). The explanatory power of a given fact is estimated by analysing explanations for similar hypotheses (Explanations Corpus), while the relevance of a fact at time-step t depends on the partial explanation constructed at time-step t−1.

In this work, we are interested in developing new mechanisms to enable abductive inference at scale, maximising both accuracy and inference time in explanation regeneration. To this end, we investigate the construction of abductive solvers through scalable bi-encoder architectures Reimers and Gurevych (2019) to perform efficient multi-hop inference via Maximum Inner Product Search (MIPS) Johnson et al. (2019). Given the complexity of abductive inference in the scientific domain, however, the adoption of bi-encoders alone is expected to lead to a drastic drop in performance since the cross-attention mechanism in Transformers cannot be leveraged to learn meaningful compositions of explanatory chains. To tackle the lack of cross-attention, we hypothesise that the orchestration of latent and explicit patterns emerging in natural language explanations can enable the design of the abstractive mechanisms necessary for accurate regeneration while preserving the scalability intrinsic in bi-encoders.

To validate this hypothesis, we present SCAR (Scalable Autoregressive Inference), a hybrid abductive solver that autoregressively combines a Transformer-based bi-encoder with a sparse model that leverages explicit patterns in corpora of scientific explanations. Specifically, SCAR integrates sparse and dense encoders to define a joint model of relevance and explanatory power and perform abductive inference in an iterative fashion, conditioning the probability of retrieving a fact at time-step t on the partial explanation constructed at time-step t−1 (Fig. 1). We performed an extensive evaluation on the WorldTree corpus Xie et al. (2020); Jansen and Ustalov (2019); Jansen et al. (2018), presenting the following conclusions:

  1. The proposed abductive solver achieves performance comparable with the state-of-the-art cross-encoder while being ≈50 times faster.

  2. We study the impact of the hybridisation on semantic drift, showing that SCAR is more robust in the construction of challenging explanations requiring long reasoning chains.

  3. We investigate the applicability of SCAR on science question answering without additional training, demonstrating improved accuracy and robustness when performing abductive inference iteratively.

  4. We perform a scalability analysis by gradually expanding the adopted facts bank, showing that SCAR can scale to corpora containing millions of facts.

To the best of our knowledge, we are the first to propose a hybrid autoregressive solver for explanation regeneration in the scientific domain, and demonstrate its efficacy for abductive natural language inference at scale.

2 Explanation Regeneration

Given a scientific hypothesis h (e.g., “Two sticks getting warm when rubbed together is an example of a force producing heat”), the task of explanation regeneration consists in reconstructing the explanation supporting h, composing a sequence of atomic facts retrieved from external knowledge sources.

Explanation regeneration can be framed as a multi-hop abductive inference problem, where the goal is to construct the best explanation for a given natural language statement adopting multiple inference steps. To learn to perform abductive inference, a recent line of research relies on explanation-centred corpora, which are typically composed of two distinct knowledge sources Jansen and Ustalov (2019):

  1. A fact bank of individual commonsense and scientific sentences including the knowledge necessary to construct explanations for scientific hypotheses.

  2. A corpus of explanations consisting of true hypotheses and natural language explanations composed of sentences from the fact bank (both resources are sketched in the example below).
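To make the setting concrete, the following minimal sketch shows one possible in-memory representation of the two resources; the identifiers and sentences are purely illustrative and do not correspond to actual WorldTree entries.

```python
# Hypothetical in-memory layout of the two knowledge sources (illustrative only).

# 1. Facts bank: atomic commonsense and scientific sentences, indexed by id.
facts_bank = {
    "f1": "friction causes the temperature of an object to increase",
    "f2": "a stick is a kind of object",
    "f3": "to rub two objects together means to move them against each other",
}

# 2. Explanations corpus: true hypotheses paired with the ids of the facts
#    composing their gold explanations.
explanations_corpus = {
    "two sticks getting warm when rubbed together is an example of a force producing heat":
        {"f1", "f2", "f3"},
}
```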

3 Autoregressive Inference

To model the multi-hop nature of scientific explanations, we propose a hybrid abductive solver that performs inference autoregressively (Fig. 1). Specifically, we model the probability of composing an explanation sequence E = (f_1, …, f_T) for a certain hypothesis h using the following formulation:

P(f_1, …, f_T | h) = ∏_{t=1}^{T} P(f_t | f_{t−1}, …, f_1, h)    (1)

where T is the maximum number of inference steps to construct an explanation and f_t represents the fact retrieved at time-step t from the facts bank. We implement the model recursively by updating the hypothesis h_t at each time-step t, concatenating it with the partial explanation constructed at time-step t−1:

h_t = h_{t−1} ⊕ f_{t−1},  with h_1 = h    (2)

where ⊕ represents the string concatenation function, so that P(f_t | f_{t−1}, …, f_1, h) = P(f_t | h_t). The probability P(f_t | h_t) is then approximated via an explanatory scoring function es(·) that jointly models relevance and explanatory power. Since similar scientific hypotheses require similar explanations Valentino et al. (2021), we define explanatory power as a measure to capture explicit patterns in the explanations, quantifying the extent to which a given fact explains similar hypotheses in the corpus (additional details in Section 3.1). The explanatory scoring function is then defined as follows:

P(f_i | h_t) ≈ es(f_i, h_t) = pw(f_i, h) · r(f_i, h_t)    (3)

where pw(f_i, h) represents the explanatory power of f_i, while r(f_i, h_t) represents the relevance of f_i at time-step t. The probability of selecting a certain fact f_i, therefore, depends jointly on its explanatory power and relevance with respect to the partially constructed explanation. To leverage the complementary aspects of dense and sparse representations, we implement r(·) through a hybrid model. Specifically, given a sparse encoder σ(·) and a dense encoder δ(·), we compute r(·) as follows:

r(f_i, h_t) = cos(σ(h_t), σ(f_i)) + cos(δ(h_t), δ(f_i))    (4)

with cos(·, ·) representing the cosine similarity between the embeddings. In our experiments, we adopt BM25 Robertson et al. (2009) as the sparse representation, while Sentence-BERT Reimers and Gurevych (2019) is adopted to train the dense encoder (see Section 3.2 for additional details).
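The formulation above can be summarised in the following procedural sketch. The helper functions sparse_sim, dense_encode, and explanatory_power are hypothetical stand-ins for the BM25 model, the Sentence-BERT bi-encoder (Section 3.2), and the explanatory power model (Section 3.1); the multiplicative combination and the exhaustive loop over the facts bank are simplifications for illustration, whereas the actual system retrieves candidates via MIPS.

```python
import numpy as np

def regenerate_explanation(h, facts, sparse_sim, dense_encode, explanatory_power, T=4):
    """Illustrative sketch of the autoregressive solver (Equations 1-4).

    sparse_sim(a, b), dense_encode(text), and explanatory_power(h, f) are assumed
    stand-ins for BM25, the Sentence-BERT bi-encoder, and the explanatory power model."""
    pw = {f: explanatory_power(h, f) for f in facts}   # fixed across time-steps (Section 3.1)
    explanation, h_t = [], h
    for _ in range(T):
        h_vec = dense_encode(h_t)
        scores = {}
        for f in facts:
            if f in explanation:
                continue
            f_vec = dense_encode(f)
            cos = float(np.dot(h_vec, f_vec) /
                        (np.linalg.norm(h_vec) * np.linalg.norm(f_vec)))
            relevance = sparse_sim(h_t, f) + cos       # Equation 4: sparse + dense similarity
            scores[f] = relevance * pw[f]              # Equation 3: joint explanatory score
        best = max(scores, key=scores.get)
        explanation.append(best)
        h_t = h_t + " " + best                         # Equation 2: update the hypothesis
    return explanation
```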

3.1 Explanatory Power

Recent work has shown that explicit explanatory patterns emerge in corpora of natural language explanations Valentino et al. (2021) – i.e., facts describing scientific laws with high explanatory power (i.e., laws that explain a large set of phenomena, such as gravity or friction) are frequently reused to explain similar hypotheses. Therefore, these patterns can be leveraged to define a computational model of explanatory power for estimating the relevance of abstract scientific facts, providing a framework for efficient abductive inference in the scientific domain.

Given an unseen hypothesis h, a sentence encoder g(·), and a corpus of scientific explanations, the explanatory power pw(f_i, h) of a generic fact f_i can be estimated by analysing explanations for similar hypotheses in the corpus:

kNN(h) = top-k hypotheses h_j in the explanations corpus ranked by sim(g(h), g(h_j))    (5)
pw(f_i, h) = Σ_{h_j ∈ kNN(h)} sim(g(h), g(h_j)) · 1[f_i ∈ E_j]    (6)

where kNN(h) represents the list of k hypotheses retrieved according to the similarity between the embeddings g(h) and g(h_j), sim(·, ·) denotes this similarity, and 1[f_i ∈ E_j] is the indicator function verifying whether f_i is part of the explanation E_j for the hypothesis h_j. Specifically, the more a fact is reused for explaining hypotheses that are similar to h in the corpus, the higher its explanatory power.

In this work, we hypothesise that this model can be integrated within a hybrid abductive solver based on dense and sparse encoders (Sec. 3), improving inference performance while preserving scalability. In our experiments, we adopt BM25 Robertson et al. (2009) to compute the explanatory power efficiently.
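As an illustration, a sparse instantiation of Equations 5-6 can be sketched with the rank_bm25 package; the tokenisation, the value of k, and the similarity-weighted count are assumptions made for the sketch rather than the exact configuration used in the experiments.

```python
from rank_bm25 import BM25Okapi

def explanatory_power(hypothesis, fact_id, train_hypotheses, train_explanations, k=100):
    """Sketch of Equations 5-6: a fact scores highly if it is reused in the explanations
    of the k training hypotheses most similar to `hypothesis`.

    `train_explanations[j]` is assumed to be the set of fact ids in the gold
    explanation of `train_hypotheses[j]`."""
    tokenised = [h.lower().split() for h in train_hypotheses]
    bm25 = BM25Okapi(tokenised)
    sims = bm25.get_scores(hypothesis.lower().split())
    knn = sims.argsort()[::-1][:k]                       # Equation 5: k most similar hypotheses
    return sum(sims[j] for j in knn                      # Equation 6: similarity-weighted count
               if fact_id in train_explanations[j])
```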

3.2 Dense Encoder

To learn a dense representation for efficient multi-hop inference, we fine-tune a Sentence-BERT model using a bi-encoder architecture Reimers and Gurevych (2019). The bi-encoder adopts a siamese network to learn a joint embedding space for hypotheses and explanatory facts in the corpus. Following Sentence-BERT, we obtain fixed-sized sentence embeddings by adding a mean-pooling operation to the output vectors of BERT Devlin et al. (2019). We employ a single BERT model with shared parameters to learn a sentence encoder for both facts and hypotheses.

Training.

The bi-encoder is fine-tuned on inference chains extracted from the gold explanations in the WorldTree corpus Jansen and Ustalov (2019). To train the model autoregressively, we first serialize the gold explanations, sorting the facts in decreasing order of BM25 similarity with the hypothesis. We adopt BM25 for the serialization since the facts with lower lexical similarity are typically the ones requiring multi-hop inference. Subsequently, given a training hypothesis h and an explanation sequence (f_1, …, f_T), we derive a positive example tuple (h_t, f_t) for each fact f_t, using the updated hypothesis h_t as the query. To make the model robust to distracting information, we construct a set of negative examples for each tuple by retrieving the facts most similar to h_t that are not part of the gold explanation (using BM25). We found that the best results are obtained when considering a fixed number of negative examples for each positive tuple. Adopting this methodology, we construct a training set of inference chains and use it, together with the siamese network, to fine-tune the encoder via a contrastive loss Hadsell et al. (2006), which has been demonstrated to be effective for learning robust dense representations.
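A possible sketch of this fine-tuning step with the sentence-transformers library is shown below; the toy inference chain, batch size, number of epochs, and warm-up steps are illustrative placeholders rather than the actual training configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Toy example of a serialised inference chain: (updated hypothesis h_t, gold fact f_t,
# BM25-retrieved negatives not in the gold explanation). Purely illustrative.
chains = [
    ("two sticks getting warm when rubbed together is an example of a force producing heat",
     "friction causes the temperature of an object to increase",
     ["melting is a kind of phase change"]),
]

examples = []
for h_t, gold_fact, negatives in chains:
    examples.append(InputExample(texts=[h_t, gold_fact], label=1.0))               # positive pair
    examples.extend(InputExample(texts=[h_t, n], label=0.0) for n in negatives)    # negatives

# bert-base-uncased wrapped with mean pooling, as in Sentence-BERT
model = SentenceTransformer("bert-base-uncased")
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.ContrastiveLoss(model)    # contrastive loss (Hadsell et al., 2006)
model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=100)
```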

Inference.

At the cost of sacrificing the performance gain resulting from the cross-attention, the bi-encoder allows for efficient multi-hop inference through Maximum Inner Product Search (MIPS). To enable scalability, we construct an index of dense embeddings for the whole facts bank. To this end, we adopt the approximated inner product search index (IndexIVFFlat) available in FAISS Johnson et al. (2019). At inference time, we derive the vector of the test hypothesis h_t at each time-step t using the trained encoder, and employ approximated MIPS to retrieve the similarity scores for the top candidate facts in the corpus. Subsequently, we combine the similarity scores with the sparse model to compute relevance and explanatory power (Equation 3) and select the fact to expand the explanation at time-step t. After T steps, we rank the remaining facts in the corpus considering the hypothesis and the partially constructed explanation.
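The following sketch illustrates how such an approximated index can be built with FAISS; the embedding matrix is replaced by random vectors, and the number of clusters (nlist) and probed clusters (nprobe) are illustrative values rather than the settings used in the experiments.

```python
import faiss
import numpy as np

# Stand-in for the (n_facts, d) matrix of bi-encoder embeddings of the facts bank.
rng = np.random.default_rng(0)
fact_embeddings = rng.standard_normal((9000, 768)).astype("float32")
faiss.normalize_L2(fact_embeddings)              # inner product == cosine after normalisation

d = fact_embeddings.shape[1]
quantizer = faiss.IndexFlatIP(d)                 # coarse quantiser
index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
index.train(fact_embeddings)                     # learn the IVF partition
index.add(fact_embeddings)
index.nprobe = 16                                # clusters visited per query (illustrative)

# At each time-step, the updated hypothesis h_t is encoded and queried against the index.
query = rng.standard_normal((1, 768)).astype("float32")   # stand-in for the h_t embedding
faiss.normalize_L2(query)
scores, fact_ids = index.search(query, 100)      # approximate MIPS over the top-100 facts
```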

Model Approach Description MAP
Dense Models (Cross-encoders)
Cartuyvels et al. (2020) Autoregressive BERT cross-encoder 57.07
Das et al. (2019) BERT path-ranking + single fact ensemble 56.25
Das et al. (2019) BERT single fact 55.74
Das et al. (2019) BERT path-ranking 53.13
Chia et al. (2019) BERT re-ranking with gold IR scores 49.45
Banerjee (2019) BERT iterative re-ranking 41.30
Sparse Models
Valentino et al. (2021) Unification-based Reconstruction 50.83
Chia et al. (2019) Iterative BM25 45.76
BM25 Robertson et al. (2009) BM25 Relevance Score 43.01
TF-IDF TF-IDF Relevance Score 39.42
Hybrid Model (Bi-encoders)
SCAR Scalable Autoregressive Inference 56.22
Table 1: Results on the test-set and comparison with previous approaches. SCAR significantly outperforms all the sparse models and obtains comparable results with state-of-the-art cross-encoders.

4 Empirical Evaluation

We perform an extensive evaluation on the WorldTree corpus Xie et al. (2020); Jansen et al. (2018), adopting the dataset released for the shared task on multi-hop explanation regeneration (https://github.com/umanlp/tg2019task) Jansen and Ustalov (2019), where a diverse set of sparse and dense models have been evaluated.

WorldTree is a subset of the ARC corpus Clark et al. (2018) that consists of multiple-choice science questions annotated with natural language explanations supporting the correct answers. The explanations in WorldTree contain an average of six facts and as many as 16, requiring challenging multi-hop inference to be regenerated. The WorldTree corpus provides a held-out test-set consisting of 1,240 science questions with masked explanations, on which we run the main experiment and the comparison with published approaches. On the other hand, we adopt the explanations in the training-set for training the dense encoder (Section 3.2) and computing the explanatory power for unseen hypotheses at inference time (Section 3.1). To run our experiments, we first transform each question and correct-answer pair into a hypothesis following the methodology described in Demszky et al. (2018).

We adopt bert-base-uncased Devlin et al. (2019) as a dense encoder to perform a fair comparison with existing cross-encoders employing the same model. The best results on explanation regeneration are obtained when running SCAR for 4 inference steps (additional details in Section 4.3). In line with the shared task, the performance of the system is evaluated through the Mean Average Precision (MAP) of the produced ranking of facts with respect to the gold explanations in the corpus.
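For reference, the sketch below shows a minimal implementation of the MAP metric over the produced fact rankings; it is purely illustrative, and the reported results follow the shared-task evaluation protocol.

```python
def average_precision(ranked_fact_ids, gold_explanation):
    """Average precision of a ranked facts list against a gold explanation (a set of fact ids)."""
    hits, total = 0, 0.0
    for rank, fact_id in enumerate(ranked_fact_ids, start=1):
        if fact_id in gold_explanation:
            hits += 1
            total += hits / rank
    return total / len(gold_explanation) if gold_explanation else 0.0

def mean_average_precision(rankings, gold_explanations):
    """MAP over the test hypotheses; rankings[i] ranks the whole facts bank for hypothesis i."""
    scores = [average_precision(r, g) for r, g in zip(rankings, gold_explanations)]
    return sum(scores) / len(scores)
```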

4.1 Explanation Regeneration

Table 1 reports the results achieved by our best model on the explanation regeneration task together with a comparison with previously published approaches. Specifically, we compare our hybrid framework based on bi-encoders with a variety of sparse and dense retrieval models.

Overall, we found that SCAR significantly outperforms all the considered sparse models (+5.39 MAP compared to Valentino et al. (2021)), obtaining, at the same time, comparable results with the state-of-the-art cross-encoder (-0.85 MAP compared to Cartuyvels et al. (2020)). The following paragraphs provide a detailed comparison with previous work.

Dense Models.

As illustrated in Table 1, all the considered dense models employ BERT Devlin et al. (2019) as a cross-encoder architecture for abductive inference. The state-of-the-art model proposed by Cartuyvels et al. (2020) adopts an autoregressive formulation similar to SCAR. However, the use of cross-encoders makes the model computationally expensive and intrinsically not scalable. Due to the complexity of cross-encoders, in fact, the authors have to employ a pre-filtering step based on TF-IDF and apply the model for re-ranking a small set of candidate facts at each iteration. In contrast, we found that the use of a hybrid model allows SCAR to achieve comparable performance (-0.85 MAP) while being ≈50 times faster without any pre-filtering step (Section 4.2). The second-best dense approach employs an ensemble of two BERT models Das et al. (2019). A first BERT model is trained to predict the relevance of each fact individually given a certain hypothesis. A second BERT model is adopted to re-rank a set of two-hop inference chains constructed via TF-IDF. The use of two BERT models in parallel, however, makes the approach computationally demanding. We observe that SCAR can achieve similar performance with the use of a single BERT bi-encoder, outperforming each individual sub-component in the ensemble with a drastic improvement in efficiency (SCAR is 96.8 times and 167.4 times faster respectively, see Section 4.2). The remaining dense models Chia et al. (2019); Banerjee (2019) adopt BERT-based cross-encoders to re-rank the list of candidate facts retrieved using sparse Information Retrieval (IR) techniques. As illustrated in Table 1, SCAR outperforms these approaches by a large margin (+6.77 and +14.92 MAP respectively).

Sparse Models.

We compare SCAR with sparse models presented for the explanation regeneration task. We observe that SCAR significantly outperforms the Unification-based Reconstruction model proposed by Valentino et al. (2021) (+5.39 MAP), which employs a similar model of explanatory power in combination with BM25, but without dense representations and autoregressive inference. These results confirm the contribution of the hybrid model together with the importance of modelling abductive inference in an iterative fashion. In addition, we compare SCAR with the model proposed by Chia et al. (2019), which adopts BM25 vectors to retrieve facts iteratively. We found that SCAR dramatically improves over this model by 10.46 MAP points. Finally, we measure the performance of standalone sparse baselines as a sanity check, showing that SCAR can significantly outperform BM25 and TF-IDF on the abductive inference task (+13.21 and +16.8 MAP respectively), while preserving similar scalability (Section 4.6).

Model MAP Time (s/q)
Cross-encoders
Autoregressive BERT 57.07 9.6
BERT single fact 55.74 18.4
BERT path-ranking 53.13 31.8
Proposed Model
SCAR 56.22 (98.5%) 0.19 (50.5× faster)
Table 2: Detailed comparison with BERT cross-encoders on the test set in terms of Mean Average Precision (MAP) and inference time (seconds per question).

4.2 Inference Time

We performed additional experiments to evaluate the efficiency of SCAR and contrast it with state-of-the-art cross-encoders. To this end, we run SCAR on a single 16GB Nvidia Tesla P100 GPU and compare the inference time with that of dense models executed on the same infrastructure Cartuyvels et al. (2020). Table 2 reports MAP and execution time in terms of seconds per question. As evident from the table, we found that SCAR is 50.5 times faster than the state-of-the-art cross-encoder Cartuyvels et al. (2020), while achieving 98.5% of its performance. Moreover, when compared to the individual BERT models proposed by Das et al. (2019), SCAR is able to achieve a better MAP score (+0.48 and +3.09), further increasing the gap in terms of efficiency (96.8 and 167.4 times faster, respectively).

Figure 2: (a) Impact of increasing the number of similar hypotheses k adopted to estimate the explanatory power (Equation 5). (b) Performance on gold explanations with an increasing number of facts. The models start with comparable MAP scores on explanations containing a single fact and end up with a wider gap in performance, confirming that the proposed hybrid model is more robust to semantic drift.
Model t MAP Time (s/q)
Bi-encoder 1 41.98 0.04
2 42.17 0.08
3 39.97 0.12
4 38.34 0.16
5 37.24 0.19
6 36.64 0.24
BM25 1 45.99 0.02
2 47.77 0.04
3 48.35 0.05
4 48.06 0.07
5 47.97 0.09
6 47.66 0.11
Bi-encoder + BM25 1 51.53 0.05
2 54.52 0.08
3 55.65 0.14
4 56.07 0.18
5 56.24 0.22
6 55.87 0.27
SCAR (Exp. power) 1 57.10 0.06
2 59.20 0.10
3 59.73 0.15
4 60.28 0.19
5 59.79 0.24
6 59.36 0.29
Table 3: Ablation study on the dev-set, where t represents the maximum number of iterations adopted to regenerate the explanations, and Time (s/q) is the inference time in seconds per question.

4.3 Ablation Studies

In order to understand how the different components of our approach complement each other, we carried out distinct ablation studies. The studies are performed on the dev-set since the explanations on the test-set are masked.

Table 3 presents the results on explanation regeneration for different ablations of SCAR adopting an increasing number of iterations for abductive inference. The results show how the performance improves as we combine sparse and dense models, with a decisive contribution coming from each individual sub-component. Specifically, considering the best results obtained in each case, we observe that SCAR achieves an improvement of 18.11 MAP over the dense component alone (Bi-encoder) and of 11.93 MAP over the sparse model (BM25). Moreover, the ablation demonstrates the fundamental role of the explanatory power model in achieving the final performance, which leads to an improvement of 4.04 MAP over the hybrid Bi-encoder + BM25 relevance model (Equation 3).

Overall, we notice that performing abductive inference iteratively is beneficial to the performance across the different components. We observe that the improvement is more prominent when comparing t = 1 (only using the hypothesis) with t = 2 (using the hypothesis together with the first retrieved fact), highlighting the central role of the first retrieved fact in supporting the complete regeneration process. Except for the Bi-encoder, the experiments demonstrate a slight improvement when adding more iterations to the process, with the best results for SCAR obtained using a total of 4 inference steps.

We notice that the best performing component in terms of inference time is BM25. The integration with the dense model, in fact, slightly increases the inference time, yet leads to a decisive improvement in MAP. Even with the overhead caused by the Bi-encoder, however, SCAR can still perform abductive inference in less than half a second per question, a feature that demonstrates the scalability of the approach with respect to the number of iterations.

Finally, we evaluate the impact of the explanatory power model when considering a larger set of training hypotheses (Figure 2a). To this end, we compare the performance across different configurations with increasing values of k in Equation 5. The results demonstrate the positive impact of the explanatory power model on abductive inference, with a rapid initial increase in MAP that quickly plateaus. After this point, we observe that considering additional hypotheses has little impact on the model’s performance.

4.4 Semantic Drift

Recent work has shown that the regeneration of natural language explanations is particularly challenging for multi-hop inference models as it can lead to a phenomenon known as semantic drift – i.e., the composition of spurious inference chains caused by the tendency to drift away from the original context in the hypothesis Khashabi et al. (2019); Xie et al. (2020); Jansen and Ustalov (2019); Thayaparan et al. (2020a). In general, the larger the size of the explanation, the higher the probability of semantic drift. Therefore, it is particularly important to evaluate and compare the robustness of multi-hop inference models on hypotheses requiring long explanations. To this end, we present a study of semantic drift, comparing the performance of different ablations of SCAR on hypotheses with a varying number of facts in the gold explanations. The results of the study are reported in Figure 2b.

Overall, we observe a degradation in performance for all the considered models that becomes more prominent as the explanations increase in size. Such a degradation is likely due to semantic drift. However, the results suggest that SCAR exhibits more stable performance on long explanations (6 or more facts) when compared to its individual sub-components. In particular, the results plotted in Figure 2b clearly show that, while all the models start with comparable MAP scores on explanations containing a single fact, the gap in performance gradually increases with the size of the explanations, with SCAR obtaining a substantial improvement in MAP over BM25 + Bi-encoder on explanations containing more than 10 facts.

These results confirm the hypotheses that dense and sparse models possess complementary features for explanation regeneration and that the proposed hybridisation has a decisive impact on improving abductive inference for scientific hypotheses in the most challenging setting.

4.5 Question Answering

Model t = 1 t = 2 t = 3 t = 4
Random 25.00 25.00 25.00 25.00
BM25 48.23 39.82 35.84 33.18
Bi-encoder 54.42 52.21 50.88 50.00
Bi-encoder + BM25 59.29 52.21 47.79 44.69
SCAR 60.62 60.62 61.06 57.96
Table 4: Accuracy in question answering using the models as abductive inference solvers without additional training.

Since the construction of spurious inference chains can lead to wrong answer predictions, semantic drift often affects the downstream capability of answering the question. Therefore, we additionally evaluate the performance of SCAR on the multiple-choice question answering task, employing the model as an abductive inference solver without additional training. Specifically, given a multiple-choice science question, we employ SCAR to construct an explanation for each candidate answer, and derive the score of each candidate answer by summing up the explanatory scores of the facts in its explanation (Equation 3). Subsequently, we consider the answer with the highest-scoring explanation as the correct one. Table 4 shows the results achieved adopting different numbers of iterations for abductive inference.
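A sketch of this answer-selection procedure is shown below; regenerate and to_hypothesis are hypothetical stand-ins for the SCAR solver and for the question-to-hypothesis conversion of Demszky et al. (2018).

```python
def answer_question(question, candidate_answers, regenerate, to_hypothesis):
    """Select the candidate answer whose regenerated explanation scores highest.

    `regenerate(hypothesis)` is assumed to return a list of (fact, explanatory_score)
    pairs for the constructed explanation (Equation 3); `to_hypothesis(question, answer)`
    turns a question/answer pair into a declarative hypothesis."""
    best_answer, best_score = None, float("-inf")
    for answer in candidate_answers:
        hypothesis = to_hypothesis(question, answer)
        explanation = regenerate(hypothesis)
        score = sum(s for _, s in explanation)   # sum of explanatory scores over the explanation
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```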

Similarly to the results on explanation regeneration, this experiment confirms the interplay between dense and sparse models in improving the performance and robustness on downstream question answering. Specifically, we observe that, while the performance of the different ablations decreases rapidly with an increasing number of inference steps, the performance of SCAR is more stable, reaching a peak at t = 3. These results confirm the robustness of SCAR in multi-hop inference together with its resilience to semantic drift.

4.6 Scalability

Finally, we measure the scalability of SCAR on large facts banks containing millions of sentences. To perform this analysis, we gradually expand the set of facts in the WorldTree corpus by randomly extracting sentences from GenericsKB (https://allenai.org/data/genericskb) Bhakthavatsalam et al. (2020), a curated facts bank of commonsense and scientific knowledge. To evaluate scalability, we compare the inference time of SCAR with that of standalone BM25. The results of this experiment, reported in Figure 3, demonstrate that SCAR scales similarly to BM25. Even considering the overhead caused by the Bi-encoder model, in fact, SCAR is still able to perform abductive inference in less than 1 second per question on corpora containing 1 million facts.

Figure 3: Graph showing the scalability of SCAR to corpora containing a million facts. We observe a trend similar to BM25, with the inference time of SCAR remaining below one second per question.

5 Related Work

Multi-hop inference is the task of combining two or more pieces of evidence to solve a particular reasoning problem. This task is often used to evaluate explainable inference models since the constructed chains of reasoning can be interpreted as an explanation for the final predictions Wiegreffe and Marasović (2021); Thayaparan et al. (2020a).

Given the importance of multi-hop inference for explainability, there is a recent focus on datasets and resources providing annotated explanations to support and evaluate the construction of valid inference chains Xie et al. (2020); Jhamtani and Clark (2020); Khot et al. (2020); Ferreira and Freitas (2020); Khashabi et al. (2018); Yang et al. (2018); Mihaylov et al. (2018); Jansen et al. (2018); Welbl et al. (2018). Most of the existing datasets for multi-hop inference, however, contain explanations composed of up to two sentences or paragraphs, limiting the possibility to assess the robustness of the systems on long reasoning chains. Consequently, most of the existing multi-hop models are evaluated on two-hops inference tasks Li et al. (2021); Tu et al. (2020); Yadav et al. (2020); Fang et al. (2020); Asai et al. (2019); Banerjee et al. (2019); Yadav et al. (2019); Thayaparan et al. (2019).

On the other hand, explanation regeneration on science questions is designed to evaluate the construction of long explanatory chains (with explanations containing up to 16 facts), in a setting where the structure of the inference cannot be derived from a direct decomposition of the questions Xie et al. (2020); Jansen and Ustalov (2019); Jansen et al. (2018). To deal with the difficulty of the task, state-of-the-art models leverage the cross-attention mechanism in Transformers Vaswani et al. (2017), learning to compose relevant explanatory chains supervised via human-annotated explanations Cartuyvels et al. (2020); Das et al. (2019); Chia et al. (2019); Banerjee (2019).

The autoregressive formulation proposed in this paper is similar to the one introduced by Cartuyvels et al. (2020), which, however, performs iterative inference through a cross-encoder architecture based on BERT Devlin et al. (2019). Differently from that work, we present a hybrid architecture based on bi-encoders Reimers and Gurevych (2019) with the aim of maximising both accuracy and inference time in explanation regeneration. Our model of explanatory power is based on the work done by Valentino et al. (2021), which shows how to leverage explicit patterns in corpora of scientific explanations for multi-hop abductive inference. In this paper, we build upon previous work by demonstrating that models based on explicit patterns can be effectively combined with dense representations to achieve nearly state-of-the-art performance while preserving scalability.

Our framework is related to recent work on dense retrieval for knowledge-intensive NLP tasks, which focuses on the design of scalable architectures with Maximum Inner Product Search (MIPS) based on Transformers Xiong et al. (2021); Zhao et al. (2021); Lin et al. (2021); Karpukhin et al. (2020); Lewis et al. (2020); Dhingra et al. (2019). Our multi-hop inference model is similar to Lin et al. (2021) and Xiong et al. (2021) which adopt bi-encoders for open-ended commonsense reasoning and open-domain question answering. However, to the best of our knowledge, we are the first to integrate bi-encoders in a hybrid architecture for scalable abductive inference in the scientific domain.

6 Conclusion

In this work, we focused on the problem of performing abductive natural language inference at scale, aiming at maximising both accuracy and inference time in scientific explanation regeneration.

We proposed SCAR, a hybrid architecture, scalable to massive facts banks, that performs abductive inference autoregressively, conditioning the probability of retrieving a fact at time-step t on the partial explanation constructed at time-step t−1. SCAR combines a Transformer-based bi-encoder with a sparse model that leverages explicit patterns in corpora of scientific explanations, defining a joint model of relevance and explanatory power for abductive inference.

An extensive evaluation demonstrated that the proposed abductive solver achieves performance comparable with that of state-of-the-art cross-encoders while being ≈50 times faster. Moreover, we investigated the impact of the hybridisation on semantic drift and science question answering, showing that SCAR is more robust in the construction of challenging explanations requiring long reasoning chains.

This work demonstrated the effectiveness of hybrid models in enabling explainable natural language inference at scale, opening the way for future research at the intersection of latent and explicit models for explanation generation. As future work, we plan to investigate the integration of relevance and explanatory power in an end-to-end differentiable architecture, and to explore the applicability of the framework to additional commonsense and scientific reasoning tasks.

References

  • A. Asai, K. Hashimoto, H. Hajishirzi, R. Socher, and C. Xiong (2019) Learning to retrieve reasoning paths over wikipedia graph for question answering. In International Conference on Learning Representations, Cited by: §5.
  • P. Banerjee, K. K. Pal, A. Mitra, and C. Baral (2019) Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6120–6129. Cited by: §5.
  • P. Banerjee (2019) ASU at TextGraphs 2019 shared task: explanation regeneration using language models and iterative re-ranking. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 78–84. Cited by: §1, Table 1, §4.1, §5.
  • S. Bhakthavatsalam, C. Anastasiades, and P. Clark (2020) Genericskb: a knowledge base of generic statements. arXiv preprint arXiv:2005.00660. Cited by: §4.6.
  • R. Cartuyvels, G. Spinks, and M. Moens (2020) Autoregressive reasoning over chains of facts with transformers. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6916–6930. External Links: Link, Document Cited by: §1, Table 1, §4.1, §4.1, §4.2, §5, §5.
  • Y. K. Chia, S. Witteveen, and M. Andrews (2019) Red dragon ai at textgraphs 2019 shared task: language model assisted explanation generation. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 85–89. Cited by: §1, Table 1, §4.1, §4.1, §5.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: §1, §4.
  • R. Das, A. Godbole, M. Zaheer, S. Dhuliawala, and A. McCallum (2019) Chains-of-reasoning at textgraphs 2019 shared task: reasoning over chains of facts for explainable multi-hop inference. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 101–117. Cited by: §1, Table 1, §4.1, §4.2, §5.
  • D. Demszky, K. Guu, and P. Liang (2018) Transforming question answering datasets into natural language inference datasets. arXiv preprint arXiv:1809.02922. Cited by: §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §3.2, §4.1, §4, §5.
  • B. Dhingra, M. Zaheer, V. Balachandran, G. Neubig, R. Salakhutdinov, and W. W. Cohen (2019) Differentiable reasoning over a virtual knowledge base. In International Conference on Learning Representations, Cited by: §5.
  • Y. Fang, S. Sun, Z. Gan, R. Pillai, S. Wang, and J. Liu (2020) Hierarchical graph network for multi-hop question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8823–8838. Cited by: §5.
  • D. Ferreira and A. Freitas (2020) Natural language premise selection: finding supporting statements for mathematical text. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 2175–2182. Cited by: §5.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. External Links: Document Cited by: §3.2.
  • P. Jansen, N. Balasubramanian, M. Surdeanu, and P. Clark (2016) What’s in an explanation? characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2956–2965. Cited by: §1.
  • P. Jansen and D. Ustalov (2019) TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 63–77. Cited by: §1, §1, §2, §3.2, §4.4, §4, §5.
  • P. Jansen, E. Wainwright, S. Marmorstein, and C. Morrison (2018) WorldTree: a corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §1, §1, §4, §5, §5.
  • H. Jhamtani and P. Clark (2020) Learning to explain: datasets and models for identifying valid reasoning chains in multihop question-answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 137–150. Cited by: §5.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with gpus. IEEE Transactions on Big Data (), pp. 1–1. External Links: Document Cited by: §1, §3.2.
  • V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6769–6781. External Links: Link, Document Cited by: §5.
  • D. Khashabi, E. S. Azer, T. Khot, A. Sabharwal, and D. Roth (2019) On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522. Cited by: §4.4.
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018) Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262. Cited by: §5.
  • T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal (2020) QASC: a dataset for question answering via sentence composition. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8082–8090. External Links: Link, Document Cited by: §5.
  • P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 9459–9474. External Links: Link Cited by: §5.
  • S. Li, X. Li, L. Shang, X. Jiang, Q. Liu, C. Sun, Z. Ji, and B. Liu (2021) HopRetriever: retrieve hops over wikipedia to answer complex questions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 13279–13287. Cited by: §5.
  • B. Y. Lin, H. Sun, B. Dhingra, M. Zaheer, X. Ren, and W. Cohen (2021) Differentiable open-ended commonsense reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 4611–4625. External Links: Link, Document Cited by: §5.
  • T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391. Cited by: §5.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982–3992. External Links: Link, Document Cited by: §1, §3.2, §3, §5.
  • S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389. Cited by: §3.1, Table 1, §3.
  • M. Thayaparan, M. Valentino, and A. Freitas (2020a) A survey on explainability in machine reading comprehension. arXiv preprint arXiv:2010.00389. Cited by: §1, §4.4, §5.
  • M. Thayaparan, M. Valentino, and A. Freitas (2020b) Explanationlp: abductive reasoning for explainable science question answering. arXiv preprint arXiv:2010.13128. Cited by: §1.
  • M. Thayaparan, M. Valentino, V. Schlegel, and A. Freitas (2019) Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 42–51. Cited by: §5.
  • M. Tu, K. Huang, G. Wang, J. Huang, X. He, and B. Zhou (2020) Select, answer and explain: interpretable multi-hop reading comprehension over multiple documents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9073–9080. Cited by: §5.
  • M. Valentino, M. Thayaparan, and A. Freitas (2021) Unification-based reconstruction of multi-hop explanations for science questions. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 200–211. External Links: Link Cited by: §1, §3.1, Table 1, §3, §4.1, §4.1, §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §5.
  • J. Welbl, P. Stenetorp, and S. Riedel (2018) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302. Cited by: §5.
  • S. Wiegreffe and A. Marasović (2021) Teach me to explain: a review of datasets for explainable nlp. arXiv preprint arXiv:2102.12060. Cited by: §1, §5.
  • Z. Xie, S. Thiem, J. Martin, E. Wainwright, S. Marmorstein, and P. Jansen (2020) WorldTree v2: a corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 5456–5473 (English). External Links: Link, ISBN 979-10-95546-34-4 Cited by: §1, §1, §4.4, §4, §5, §5.
  • W. Xiong, X. Li, S. Iyer, J. Du, P. Lewis, W. Y. Wang, Y. Mehdad, S. Yih, S. Riedel, D. Kiela, and B. Oguz (2021) Answering complex open-domain questions with multi-hop dense retrieval. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • V. Yadav, S. Bethard, and M. Surdeanu (2019) Quick and (not so) dirty: unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2578–2589. Cited by: §5.
  • V. Yadav, S. Bethard, and M. Surdeanu (2020) Unsupervised alignment-based iterative evidence retrieval for multi-hop question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4514–4525. Cited by: §5.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380. Cited by: §5.
  • C. Zhao, C. Xiong, J. Boyd-Graber, and H. Daumé III (2021) Multi-step reasoning over unstructured text with beam dense retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 4635–4641. External Links: Link, Document Cited by: §5.
  • Z. Zhou, M. Valentino, D. Landers, and A. Freitas (2021) Encoding explanatory knowledge for zero-shot science question answering. arXiv preprint arXiv:2105.05737. Cited by: §1.