Robustifying Multi-hop QA through Pseudo-Evidentiality Training

07/07/2021 ∙ by Kyungjae Lee, et al. ∙ Seoul National University

This paper studies the bias problem of multi-hop question answering models: answering correctly without correct reasoning. One way to robustify these models is to supervise them not only to answer correctly, but also to do so with the right reasoning chains. An existing direction is to annotate reasoning chains to train models, which requires expensive additional annotations. In contrast, we propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidences, without such annotations. Instead, we compare counterfactual changes in answer confidence with and without evidence sentences, to generate "pseudo-evidentiality" annotations. We validate our proposed model on an original set and a challenge set in HotpotQA, showing that our method is accurate and robust in multi-hop reasoning.




1 Introduction

Multi-hop Question Answering (QA) is the task of answering complex questions by connecting information from several texts. Since the information is spread over multiple facts, this task requires capturing multiple relevant facts (which we refer to as evidences) and inferring an answer based on all these evidences.

However, previous works min2019compositional; chen2019understanding; trivedi2020multihop observe “disconnected reasoning” in some correct answers. It happens when models exploit specific types of artifacts (e.g., entity type) as reasoning shortcuts to guess the correct answer. For example, assume that a given question is: “which country got independence when World War II ended?” and a passage is: “Korea got independence in 1945”. Although the passage is insufficient without the information “World War II ended in 1945”, QA models predict “Korea” simply because its answer type is a country (i.e., by using a shortcut).

Figure 1: Overview of our proposed supervision: using Answerability and Evidentiality

To address the problem of reasoning shortcuts, we propose to supervise “evidentiality”: deciding whether a model answer is supported by correct evidences (see Figure 1). This is related to the problem that most early reader models for QA failed to predict whether questions are unanswerable. The lack of answerability training led models to provide a wrong answer with high confidence when they should have answered “unanswerable”. Similarly, we aim to train models to also recognize whether their answer is “unsupported” by evidences. In our work, along with answerability, we train the QA model to identify the existence of evidences by using passages of two types: (1) an evidence-positive set and (2) an evidence-negative set. While the former has both answer and evidence, the latter does not have evidence supporting the answer, such that we can detect models taking shortcuts.

Our first research question is: how do we acquire evidence-positive and evidence-negative examples for training without annotations? For the evidence-positive set, the closest existing approach niu2020self is to consider attention scores, which can be regarded as pseudo-annotation for the evidence-positive set. In other words, sentences with high attention scores, often used as an “interpretation” of whether a sentence is causal for the model prediction, can be selected to build the evidence-positive set. However, follow-up works serrano2019attention; jain2019attention argued that attention is limited as an explanation, because causality cannot be measured without observing model behavior in a counterfactual case: the same passage without that sentence. In addition, sentence causality should be aggregated to measure the group causality of multiple evidences for multi-hop reasoning. To annotate group causality as “pseudo-evidentiality”, we propose an Interpreter module, which removes and aggregates evidences into a group, to compare predictions in observational and counterfactual cases.

As a second research question, we ask how to learn from the evidence-positive and evidence-negative sets. To this end, we identify two objectives: (O1) the QA model should not be overconfident on the evidence-negative set, while (O2) it should be confident on the evidence-positive set. A naive approach to pursue the former is to lower the model confidence on the evidence-negative set via regularization. However, such regularization can violate (O2), due to the correlation between the confidence distributions for the evidence-positive and evidence-negative sets. Our solution is to regularize selectively, by purposely training a biased model that violates (O1) and decorrelating the target model from the biased model.

For experiments, we demonstrate the impact of our approach on the HotpotQA dataset. Our empirical results show that our model can improve QA performance through pseudo-evidentiality, outperforming other baselines. In addition, our proposed approach can be combined orthogonally with another SOTA model for additional performance gains.

2 Related Work

Since multi-hop reasoning tasks such as HotpotQA were released, many approaches for the task have been proposed. These approaches can be categorized by the strategies used, such as graph-based networks qiu2019dynamically; fang2019hierarchical, external knowledge retrieval asai2019learning, and supporting fact selection nie2019revealing; groeneveld2020simple.

Our focus is to identify and alleviate reasoning shortcuts in multi-hop QA, without evidence annotations. Models taking shortcuts have been widely observed in various tasks, such as object detection singh2020don, NLI tu2020empirical, and also our target task of multi-hop QA min2019compositional; chen2019understanding; trivedi2020multihop, where models learn simple heuristic rules, answering correctly but without proper reasoning.

To mitigate the effect of shortcuts, adversarial examples jiang2019avoiding can be generated, or alternatively, models can be robustified trivedi2020multihop with additional supervision for paragraph-level “sufficiency”, i.e., identifying whether a pair of two paragraphs is sufficient for right reasoning or not, which reduces shortcuts on a single paragraph. While the binary classification for paragraph-sufficiency is relatively easy (96.7 F1 in trivedi2020multihop), our target of capturing finer-grained sentence-evidentiality is more challenging. Existing QA models nie2019revealing; groeneveld2020simple treat this as a supervised task, based on sentence-level human annotation. In contrast, ours requires no annotation and focuses on avoiding reasoning shortcuts using evidentiality, which was not the purpose of evidence selection in the existing models.

3 Proposed Approach

In this section, to prevent reasoning shortcuts, we introduce a new approach for data acquisition and learning. We describe the task (Section 3.1) and address the two research questions of generating labels for supervision (Section 3.2) and learning from them (Section 3.3), respectively.

3.1 Task Description

Our task definition follows the distractor setting (HotpotQA offers a distractor and a full-wiki setting) of the HotpotQA dataset yang2018hotpotqa, which consists of 112k questions requiring the understanding of corresponding passages to answer correctly. Each question has a candidate set of 10 paragraphs (of which two are positive paragraphs and eight are negative), where the supporting facts for reasoning are scattered across the two positive paragraphs. Then, given a question, the objective of this task is to aggregate relevant facts from the candidate set and estimate a consecutive answer span. For task evaluation, the estimated answer span is compared with the ground-truth answer span in terms of word-level F1 score.
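To make the evaluation metric concrete, here is a minimal sketch of a word-level F1 between a predicted span and a ground-truth span. It assumes plain whitespace tokenization and lowercasing; the official HotpotQA evaluation script may apply further answer normalization, and the function name `word_f1` is ours, not the paper's.

```python
from collections import Counter


def word_f1(prediction: str, ground_truth: str) -> float:
    """Word-level F1 between a predicted and a ground-truth answer span.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared word at most as often as it
    # appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A partially overlapping prediction receives partial credit, which is why a shortcut guess that happens to hit the answer span still scores well under this metric.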

3.2 Generating Examples for Training Answerability and Evidentiality

Answerability for Multi-hop Reasoning

For answerability training in single-hop QA, datasets such as SQuAD 2.0 rajpurkar2018know provide labels of answerability, so that models can be trained not to be overconfident on unanswerable text.

Similarly, we build triples of question, answer, and passage, to be labeled for answerability. The HotpotQA dataset pairs each question with 10 paragraphs, where evidences can be scattered across two paragraphs. Based on this characteristic, concatenating the two positive paragraphs is guaranteed to be answerable/evidential, and concatenating two negative paragraphs (with neither evidence nor answer) is guaranteed to be unanswerable. We define the set of answerable triples as the answer-positive set, and the set of unanswerable triples as the answer-negative set. From these labels, we train a transformer-based model to classify answerability (the details are discussed in the next section).
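The construction of answerability examples can be sketched as follows. This is an illustration under assumptions, not the authors' code: paragraphs are plain strings, the function name is hypothetical, and using every pair of negative paragraphs (rather than a sample) is our choice.

```python
from itertools import combinations


def build_answerability_passages(positive_paragraphs, negative_paragraphs):
    """Return (answer_positive, answer_negative) lists of concatenated passages.

    Assumption: each paragraph is a string; a passage is a concatenation of
    two paragraphs.
    """
    # The two positive paragraphs together hold the answer and all evidences.
    answer_positive = [" ".join(positive_paragraphs)]
    # Any two negative paragraphs hold neither evidence nor answer.
    answer_negative = [
        " ".join(pair) for pair in combinations(negative_paragraphs, 2)
    ]
    return answer_positive, answer_negative
```

With two positive and eight negative paragraphs, this yields one answerable passage and 28 unanswerable ones per question.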

However, answerability cannot supervise whether the given passage has all of these relevant evidences for reasoning. This causes a lack of generalization ability, especially on examples with an answer but no evidence.

Evidentiality for Multi-hop Reasoning

While learning answerability, we aim to capture the existence of reasoning chains in the given passage. To supervise the existence of evidences, we construct two sets of examples, an evidence-positive set and an evidence-negative set, as shown in Figure 1.

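Specifically, consider the ground-truth evidences needed to infer the answer to a question, and the sentence that contains the answer. Given a question and its answer, the expected evidentiality label of a passage indicates whether the evidences for answering are sufficient in the passage (Eq. (1)): a passage is evidential if it contains all the ground-truth evidences, and non-evidential if it contains the answer sentence but not all the evidences.

We define the set of evidential passages as the evidence-positive set, and the set of non-evidential passages as the evidence-negative set.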
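The labeling rule can be written as a small function. This is a sketch under assumptions: a passage is represented as a collection of sentence strings, and returning `None` for passages that fall in neither set is our convention, not something the text specifies.

```python
from typing import Optional, Sequence


def evidentiality_label(
    passage: Sequence[str],
    gold_evidences: Sequence[str],
    answer_sentence: str,
) -> Optional[bool]:
    """True for evidence-positive, False for evidence-negative, None otherwise."""
    passage_set = set(passage)
    # Evidence-positive: every ground-truth evidence sentence is present.
    if set(gold_evidences) <= passage_set:
        return True
    # Evidence-negative: the answer sentence is present, but some evidence is missing.
    if answer_sentence in passage_set:
        return False
    # Neither answer nor full evidence: outside both sets (our convention).
    return None
```

For the introductory example, a passage holding only "Korea got independence in 1945" is evidence-negative, because the fact about when World War II ended is missing.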

Since we do not use human annotations, we aim to generate “pseudo-evidentiality” annotations. First, for the evidence-negative set, we modify the answer sentence and unanswerable passages, and generate examples of the following three types:

  • 1) Answer Sentence Only: we remove all sentences in the answerable passage except the answer sentence, such that the input passage contains a correct answer but no other evidences.

  • 2) Answer Sentence + Irrelevant Facts: we use irrelevant facts with answers as context, by concatenating the answer sentence and an unanswerable passage.

  • 3) Partial Evidence + Irrelevant Facts: we use partially relevant and irrelevant facts as context, by concatenating a passage that holds only part of the evidences with an irrelevant passage.

These evidence-negative examples do not have all relevant evidences; thus, if a model predicts the correct answer on such examples, it means that the model has learned reasoning shortcuts.
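The three types above can be sketched as follows. The representation (lists of sentence strings) and the function name are hypothetical, and for type 3 we assume the partial evidence is the single positive paragraph that contains the answer sentence, which the text does not state explicitly.

```python
def build_evidence_negatives(answer_sentence, positive_paragraph, negative_paragraph):
    """Build the three kinds of evidence-negative passages for one question.

    Assumptions: `positive_paragraph` is the positive paragraph containing the
    answer sentence, and `negative_paragraph` is any unanswerable paragraph;
    both are lists of sentences.
    """
    return {
        # 1) Answer sentence only: correct answer, no other evidence.
        "answer_only": [answer_sentence],
        # 2) Answer sentence + irrelevant facts.
        "answer_plus_irrelevant": [answer_sentence] + list(negative_paragraph),
        # 3) Partial evidence + irrelevant facts.
        "partial_plus_irrelevant": list(positive_paragraph) + list(negative_paragraph),
    }
```

Each resulting passage contains the answer sentence but lacks at least one evidence, so a confident correct prediction on it signals a shortcut.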

Second, building an evidence-positive set is more challenging, because it is difficult to capture multiple relevant facts with neither annotations nor supervision. Our distinction is obtaining the above annotation from the model itself, by interpreting the internal mechanism of the model. On a trained model, we aim to find the sentences that are influential in predicting the correct answer, among the sentences in an answerable passage. Then, we consider them as a pseudo evidence-positive set. Since such pseudo labels rely on a trained model, which is not perfect, 100% recall of the ground-truth evidences in Eq. (1) is not guaranteed, though we observe 87% empirical recall (Table 1).

Section 1 discusses how interpretation, such as attention scores niu2020self, can be pseudo-evidentiality. For QA tasks, an existing approach perez2019finding uses answer confidence for finding pseudo-evidences, as we discuss below:

(A) Accumulative interpreter: to consider multiple sentences as evidences, the existing approach perez2019finding iteratively inserts a sentence into the evidence set, choosing at each iteration the sentence that yields the highest answer probability (Eq. (2)). The set starts with the sentence containing the answer, which is the minimal context for our task. This method can consider multiple sentences as evidence by inserting them iteratively into a set, but cannot consider the effect of erasing sentences from the reasoning chain.

(B) Our proposed Interpreter: to enhance the interpretability, we consider both erasing and inserting each sentence, in contrast to the accumulative interpreter, which considers only the latter. Intuitively, erasing an evidence would change the prediction significantly if that evidence is causally salient. We compute this saliency (Eq. (3)) as the drop in answer confidence between the full passage and the passage without the sentence. We hypothesize that breaking the reasoning chain by erasing a sentence should significantly decrease the answer confidence; in other words, a sentence with a larger drop is more salient. Our final saliency combines the two saliency scores of Eq. (2) and Eq. (3), where constant values can be omitted when taking the maximum. At each iteration, the sentence that maximizes the final saliency is selected, as done in Eq. (2). This promotes selections that increase confidence on important sentences and decrease confidence on unimportant sentences. We stop the iterations when a stopping condition is met or the iteration limit is reached; the final sentences in the set then form the pseudo evidence-positive set. To reduce the search space, we set this limit empirically (footnote 1: based on the observation that 99% of questions in HotpotQA require fewer than 6 evidence sentences for reasoning).
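A minimal sketch of the insert-and-erase selection loop may help. Several details here are assumptions rather than the paper's exact procedure: `confidence` is a hypothetical callable returning the model's answer probability for a list of sentences, the erased passage drops the whole selected group, the loop stops when the score no longer improves, and the cap of 5 sentences is chosen to match the footnote.

```python
def select_pseudo_evidences(sentences, answer_sentence, confidence, max_evidences=5):
    """Greedily pick sentences whose insertion raises, and whose erasure
    lowers, the answer confidence (hypothetical sketch)."""

    def score(kept):
        # Insertion term: confidence on the kept group alone.
        # Erasure term: confidence on the passage with the group removed.
        erased = [s for s in sentences if s not in kept]
        return confidence(kept) - confidence(erased)

    selected = [answer_sentence]  # minimal context: the answer sentence
    current = score(selected)
    while len(selected) < max_evidences:
        candidates = [s for s in sentences if s not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda s: score(selected + [s]))
        best_score = score(selected + [best])
        if best_score <= current:  # assumed stopping rule: no further gain
            break
        selected.append(best)
        current = best_score
    # Return the group in passage order.
    return [s for s in sentences if s in selected]


def toy_confidence(sents):
    """Toy stand-in for a QA model: confident only with evidence and answer."""
    present = set(sents)
    if {"e1", "ans"} <= present:
        return 0.9
    if "ans" in present:
        return 0.4
    return 0.05
```

On the toy passage `["e1", "d1", "ans", "d2"]`, the loop adds `e1` to the answer sentence and then stops, since neither distractor changes the score.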

Figure 2: Learning of our proposed approach: (a) Training QA model for evidentiality, extracted by Interpreter. (b) Our QA predictor for learning decorrelated features on biased examples.

Briefly, we obtain the labels of answerability and evidentiality, as follows:

  • Answer-positive and negative set: the former has both answer and evidences, and the latter has neither.

  • Evidence-positive and negative set: the former is expected to have all the evidences, and the latter has an answer with no evidence.

3.3 Learning Answerability & Evidentiality

In this section, our goal is to learn the above labels of answerability and evidentiality.

Supervising Answers and Answerability (Base)

As optimizing the QA model is not our focus, we adopt the existing model in min2019compositional. As the architecture of the QA model, we use a powerful transformer-based model, RoBERTa liu2019roberta, where the input is [CLS] question [SEP] passage [EOS]. The model outputs the probabilities of the start and end positions of the answer span over the input sequence. Each is computed by a fully connected layer with trainable parameters, applied to the encoder hidden states, whose shape is given by the size of the input sequence and the output dimension of the encoder.

For answerability, they build a classifier on the hidden state of the [CLS] token, which represents both the question and the passage. As the HotpotQA dataset covers both yes-or-no and span-extraction questions, we follow the convention of asai2019learning and support both as a multi-class classification problem that predicts four probabilities (Eq. (6)): those of the answer type being span, yes, no, and no answer, respectively, computed by a layer with trainable parameters. For training the answer span and its class, the loss function of each example is the sum of three cross-entropy losses: one for the starting position of the answer, one for its ending position, and one for the index of the actual answer class.
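As a numerical illustration of this loss, the sketch below sums three cross-entropy terms over softmax-normalized scores. The logits stand in for the outputs of the fully connected layers; the function names and the ordering of the four answer types are assumptions made for the example.

```python
import math

ANSWER_TYPES = ("span", "yes", "no", "none")  # assumed class order


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability of the target index."""
    return -math.log(probs[target_index])


def answer_loss(start_logits, end_logits, class_logits, start, end, answer_type):
    """Sum of cross-entropy losses for start position, end position, and class."""
    return (
        cross_entropy(softmax(start_logits), start)
        + cross_entropy(softmax(end_logits), end)
        + cross_entropy(softmax(class_logits), ANSWER_TYPES.index(answer_type))
    )
```

With uninformative (all-equal) scores over four positions and four classes, each term equals log 4, which gives a reference point for an untrained model.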

Supervising Evidentiality

As overviewed in Section 1, the Base model is reported to take a shortcut, i.e., a direct path between answer and question, neglecting the implicit intermediate paths (evidences). Specifically, we present two objectives for unbiased models:

  • (O1): The QA model should not be overconfident on passages with no evidences (i.e., on the evidence-negative set).

  • (O2): The QA model should be confident on passages with both answer and evidences (i.e., on the evidence-positive set).

For (O1), as a naive approach, one may consider a regularization term to avoid overconfidence on the evidence-negative set. An overconfident answer distribution diverges from the uniform distribution, such that the Kullback–Leibler (KL) divergence between the answer probabilities and the uniform distribution is high when the model is overconfident. The regularization term of Eq. (8) penalizes this divergence, which forces the answer probabilities on the evidence-negative set to be closer to the uniform distribution.
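The behavior of such a regularizer can be checked with a few lines of code. The sketch assumes the divergence is taken from the answer distribution to the uniform distribution, KL(p || U); the direction used in the paper's Eq. (8) is not recoverable from the text.

```python
import math


def kl_to_uniform(probs):
    """KL(p || U) for a discrete distribution p over len(probs) outcomes.

    Equals log(n) minus the entropy of p: zero for a uniform p and
    log(n) for a one-hot (maximally overconfident) p.
    """
    n = len(probs)
    # Terms with p = 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(p * math.log(p * n) for p in probs if p > 0)
```

Minimizing this quantity on evidence-negative passages flattens the answer distribution there, which is exactly the naive approach whose side effect the following paragraphs discuss.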

However, one reported risk utama2020mind; grand2019adversarial is that suppressing data with biases has the side effect of lowering confidence on unbiased data (especially in-distribution data). Similarly, in our case, regularizing to keep the confidence low on the evidence-negative set can also lower the confidence on the evidence-positive set, due to their correlation. In other words, pursuing (O1) violates (O2), which we observe later in Figure 3. Our next goal is thus to decorrelate the two distributions, on the evidence-positive and evidence-negative sets, to satisfy both (O1) and (O2).

Figure 2(b) shows how we feed the hidden states into two predictors. One predictor learns the target distribution, and the other is purposely trained to be overconfident on the evidence-negative set, yielding a biased answer distribution. We regularize the target distribution to diverge from this biased distribution.

Formally, the biased answer distributions (for the start and end positions) are computed by a second pair of fully connected layers with their own trainable parameters. We optimize these layers to predict the answer on the evidence-negative set, which makes them biased (taking shortcuts), and regularize the target predictor by maximizing the KL divergence between its distribution and the fixed biased distribution. The resulting regularization term of each example (Eq. (10)) is weighted by a hyper-parameter, and this loss is optimized only on the evidence-negative set.
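The decorrelation term can be sketched as a negated, weighted KL divergence, so that minimizing the loss maximizes the divergence. The direction of the divergence, the smoothing constant `eps`, and reading the reported best value of 0.01 as this weight are all assumptions of the sketch.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def decorrelation_regularizer(
    target_start, target_end, biased_start, biased_end, weight=0.01
):
    """Negative weighted divergence between biased and target distributions.

    The biased distributions are treated as fixed; only the target predictor
    would receive gradients from this term in a real training loop.
    """
    divergence = kl_divergence(biased_start, target_start) + kl_divergence(
        biased_end, target_end
    )
    return -weight * divergence
```

The term is zero when the target predictor copies the biased one and becomes more negative as the two move apart, so the target is rewarded for not reproducing shortcut predictions.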

Lastly, to pursue (O2), we train the answer loss on the evidence-positive set, as done on the answerability examples. However, in the initial steps of training, our Interpreter is not reliable, since the QA model is not trained enough yet. We thus train without the evidence-positive set for the first epochs, then extract it with the Interpreter and continue to train on all sets, as shown in Figure 2(a). In the final loss function (Eq. (11)), we apply different losses depending on the set, and the loss on the evidence-positive set is gated by a delayed step function (1 when the epoch is greater than the delay, 0 otherwise).
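The gating can be sketched as below. The set names, the function signatures, and the exact combination of terms on the evidence-negative set (biased-predictor loss plus regularizer) are our reading of the description, not a transcription of Eq. (11).

```python
def delayed_step(epoch, delay):
    """1 when the epoch is greater than the delay, 0 otherwise."""
    return 1 if epoch > delay else 0


def example_loss(
    set_name, epoch, delay, answer_loss, biased_answer_loss=0.0, regularizer=0.0
):
    """Select the loss for one example according to the set it belongs to."""
    if set_name in ("answer_positive", "answer_negative"):
        return answer_loss
    if set_name == "evidence_positive":
        # Pseudo evidences are only trusted once the QA model has trained
        # for more than `delay` epochs.
        return delayed_step(epoch, delay) * answer_loss
    if set_name == "evidence_negative":
        # Assumed: train the biased predictor and regularize the target one.
        return biased_answer_loss + regularizer
    raise ValueError("unknown set: %s" % set_name)
```

With a delay of 3, evidence-positive examples contribute nothing during epochs 1 to 3 and their full answer loss from epoch 4 onward.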

3.4 Passage Selection at Inference Time

Our multi-hop QA task requires finding answerable passages, with both answer and evidence, among the candidate passages. While we can access the ground truth of answerability in the training set, we need to identify the answerability of each candidate passage at inference time. For this, we consider two directions: (1) Paragraph Pair Selection, which is specific to HotpotQA, and (2) a Supervised Evidence Selector trained on pseudo-labels.

For (1), we consider the data characteristic mentioned in Section 3.1: we know that one pair of paragraphs is answerable/evidential (when both paragraphs are positive). Thus, the goal is to identify the answerable pair of paragraphs among all possible pairs (denoted as paired-paragraph). We let the model select the pair with the highest estimated answerability in Eq. (6), and predict answers on the paired passage, which is likely to be evidential.
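Paragraph pair selection amounts to an arg-max over all pairs, as the sketch below shows. Here `answerability` is a hypothetical callable standing in for the classifier of Eq. (6), for instance one minus the predicted probability of "no answer", and paragraphs are plain strings.

```python
from itertools import combinations


def select_paragraph_pair(paragraphs, answerability):
    """Return the pair of paragraphs whose concatenation scores highest.

    `answerability` maps a concatenated passage to a score (assumed interface).
    """
    pairs = list(combinations(paragraphs, 2))
    return max(pairs, key=lambda pair: answerability(" ".join(pair)))


def toy_answerability(passage):
    """Toy scorer: only the pair holding both positive paragraphs is answerable."""
    words = passage.split()
    return 1.0 if "p2" in words and "p7" in words else 0.1
```

For 10 candidate paragraphs there are 45 pairs to score, so exhaustive enumeration stays cheap.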

For (2), some pipelined approaches nie2019revealing; groeneveld2020simple design an evidence selector, extracting the top k sentences from all candidate paragraphs. While they supervise the model using ground-truth evidences, we assume there is no such annotation, and thus train on the pseudo-labels from our Interpreter. We denote this setting as selected-evidences. For the evidence selector, we follow the extraction method in beltagy2020longformer, where a special token [S] is added at the ending position of each sentence, and its output embedding from BERT serves as the embedding of that sentence. Then, a binary classifier, a fully connected layer over the sentence embedding, is trained on the pseudo-labels. During training, the classifier identifies whether each sentence is evidence-positive (1) or negative (0). At inference time, we first select the top 5 sentences (footnote 2: Table 1 shows the precision and recall of the top 5 sentences) from the paragraph candidates, and then feed the selected evidences into the QA model for testing.
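The inference-time selection step can be illustrated as follows. The scores stand in for the outputs of the binary classifier; keeping the selected sentences in their original order is our assumption about how the passage is reassembled.

```python
def select_top_k_sentences(sentences, scores, k=5):
    """Keep the k highest-scoring sentences, in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore passage order for the QA model
    return [sentences[i] for i in kept]
```

For example, with seven candidate sentences the two lowest-scoring ones are dropped and the remaining five are passed to the QA model.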

The passage settings discussed above are also used for evaluation. To show the robustness of our model, we construct a challenge test set by excluding easy examples (i.e., those on which it is easy to take shortcuts). To detect such easy examples, we build a set of single-paragraph passages, none of which is evidential in HotpotQA, as the dataset avoids having all evidences in a single paragraph to discourage single-hop reasoning. If the QA model predicts the correct answer on an (unevidential) single paragraph, we remove that example from HotpotQA, and define the remaining set as the challenge set.
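The filtering that defines the challenge set can be sketched in a few lines. The mapping `single_paragraph_f1`, from example id to the F1 obtained by the single-paragraph baseline, is a hypothetical input; how those scores are produced is outside the sketch.

```python
def build_challenge_set(example_ids, single_paragraph_f1):
    """Keep only examples on which the single-paragraph model scores zero F1."""
    return [ex for ex in example_ids if single_paragraph_f1[ex] == 0.0]
```

By construction, the single-paragraph baseline obtains zero F1 on the returned examples, so any score above zero there must come from a different kind of reasoning.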


                               # of sent   Prec    Recall
GT evidences                   2.38        100.    100.
Answerable                     6.45        36.94   100.
Pseudo evidences (Train set)   3.64        61.13   86.64
Pseudo evidences (Dev set)     5.00        46.12   90.35


Table 1: The precision and recall of pseudo evidences from Interpreter, compared to the ground truth (GT).

4 Experiment

In this section, we formulate our research questions to guide our experiments and describe evaluation results corresponding to each question.


Model                            Input at Inference     QA F1 (Original Set)   QA F1 (Challenge Set)

without external knowledge
B-I: Single-paragraph QA         Single-paragraph       68.65                  0.0
B-II: Single-paragraph QA        Paired-paragraph       62.01                  30.07
O-I: Our model                   Single-paragraph       32.61                  19.81
O-II: Our model                  Paired-paragraph       68.08                  41.69
O-III: Our model (full)          Selected-evidences     70.21                  44.57

with external knowledge
C-I: asai2019learning            Retrieved-evidences    73.30                  48.54
C-II: asai2019learning + Ours    Retrieved-evidences    73.95                  50.15


Table 2: The comparison of the proposed models on the original set and challenge set.
Research Questions

To evaluate the effectiveness of our method, we address the following research questions:

  • RQ1: How effective is our proposed method for a multi-hop QA task?

  • RQ2: Does our Interpreter effectively extract pseudo-evidentiality annotations for training?

  • RQ3: Does our method avoid reasoning shortcuts in unseen data?


Our implementation settings for the QA model follow RoBERTa (Base version with 12 layers) liu2019roberta. We use the Adam optimizer with a learning rate of 0.00005 and a batch size of 8 on an RTX Titan. We extract the evidence-positive set after 3 epochs (i.e., the delay in Eq. (11) is 3) and retrain for 3 epochs. For the hyper-parameter of the regularization term, we search over candidate values and found the best value to be 0.01, based on a 5% hold-out set sampled from the training set.


We report the standard F1 score for HotpotQA to evaluate overall QA accuracy. For evidence selection, we also report F1 score, Precision, and Recall to evaluate sentence-level evidence retrieval accuracy.

4.1 RQ1: QA Effectiveness

Evaluation Set
  • Original Set: We evaluate our proposed approach on the multi-hop reasoning dataset HotpotQA yang2018hotpotqa. HotpotQA contains 112K examples of multi-hop questions and answers. For evaluation, we use the HotpotQA dev set (distractor setting) with 7405 examples.

  • Challenge Set: To validate robustness, we construct a challenge set on which the single-paragraph QA model gets zero F1, while the same model achieves 67 F1 on the original set. That is, we exclude instances with F1 > 0, where the QA model predicts an answer without the right reasoning. The exclusion ensures that the baseline obtains zero F1 on the challenge set. The number of surviving examples in our challenge set is 1653 (21.5% of dev set).


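Model                                                         QA F1 (Original)   QA F1 (Challenge)
Our model (full)                                              70.21              44.57
(A) remove evidence-positive set                              68.51              40.78
(B) remove evidence-positive & evidence-negative sets         66.42              40.75
(C) replace regularizer of Eq. (10) with that of Eq. (8)      69.64              42.54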


Table 3: The ablation study on our full model.


Model (Evidence Selection)                                        F1      Precision   Recall
Retrieval-based AIR yadav2020unsupervised                         66.16   63.06       69.57
Accumulative-based interpreter on our QA model                    54.05   53.56       62.38
(a) Interpreter on Single-paragraph QA                            56.76   57.50       63.71
(b) Interpreter on our QA model w/ the regularizer of Eq. (8)     70.30   62.04       87.10
(c) Interpreter on our QA model (full)                            69.35   61.09       86.59


Table 4: The comparison of the proposed models for evidence selection
Figure 3: Confidence analysis: confidence scores of three models in ascending order, on the evidence-positive and evidence-negative sets (distinguished by light and dark color). (a) Base model trained on single-paragraphs. (b) Our model with the regularizer of Eq. (8). (c) Our full model with the regularizer of Eq. (10). (d) Comparison of the three models on the same set.
Baselines, Our models, and Competitors

As a baseline, we follow the previous QA model min2019compositional trained on single-paragraphs. We test our model on the single-paragraph, paired-paragraph, and selected-evidences settings discussed in Section 3.4. As a strong competitor, among released models for HotpotQA, we implement a state-of-the-art model asai2019learning (footnote 4: the highest-performing model on the HotpotQA leaderboard with a public code release), which uses external knowledge and a graph-based retriever.

Main Results

This section includes the results of our model for multi-hop reasoning. As shown in Table 2, our full model outperforms baselines on both original and challenge set.

We can further observe that i) when tested on single-paragraphs, where models are forced to take shortcuts, our model (O-I) is worse than the baseline (B-I), which indicates that B-I learned the shortcuts. In contrast, O-II outperforms B-II on paired-paragraphs, where at least one passage candidate has all the evidences.

ii) When tested on evidences selected by our method (O-III), we can improve F1 scores on both original set and challenge set. This noise filtering effect of evidence selection, by eliminating irrelevant sentences, was consistently observed in a supervised setting nie2019revealing; groeneveld2020simple; beltagy2020longformer, which we could reproduce without annotation.

iii) Combining our method with SOTA (C-I) asai2019learning leads to accuracy gains on both sets. C-I has the distinction of using external knowledge of reasoning paths, and thus outperforms models without such advantages, but our method contributes complementary gains.

Ablation Study

As shown in Table 3, we conduct an ablation study of O-III in Table 2. In (A), we remove the evidence-positive set, extracted by the Interpreter, at training time. On the QA model without it, the performance decreased significantly, suggesting the importance of the evidence-positive set. In (B), we remove the evidentiality labels of both the evidence-positive and evidence-negative sets, and observe that the performance drop is larger than in the other variants. Through (A) and (B), we show that training with our evidentiality labels can increase QA performance. In (C), we replace the regularizer of Eq. (10) with that of Eq. (8), removing the layer that learns biased features. With the replaced regularization, the performance also decreased, suggesting that training the biased layer is effective for a multi-hop QA task.

4.2 RQ2: Evaluation of Pseudo-Evidentiality Annotation

In this section, we evaluate the effectiveness of our Interpreter, which generates evidences on the training set without supervision. We compare the pseudo evidences with human annotations at the sentence level. For evaluation, we measure sentence-level F1 score, Precision, and Recall, following the evidence selection evaluation in yang2018hotpotqa.
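For one example, the sentence-level metrics can be computed as in the sketch below. It assumes sentences are identified by hashable ids and says nothing about how per-example scores are averaged over the dataset.

```python
def evidence_prf(predicted, gold):
    """Sentence-level precision, recall, and F1 for one example."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Selecting five sentences when two are gold evidences caps precision at 0.4 even with perfect recall, which illustrates the precision and recall trade-off discussed for Table 1 and Table 4.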

As a baseline, we implement the retrieval-based model AIR yadav2020unsupervised, which is an unsupervised method like ours. As shown in Table 4, our Interpreter on our QA model outperforms the retrieval-based method in terms of F1 and Recall, while the baseline (AIR) achieves the highest precision (63.06%). We argue that recall, which aims at identifying all evidences, is more critical for multi-hop reasoning and for our goal of avoiding disconnected reasoning, as long as precision remains higher than the precision of the answerable passage (36.94%) in Table 1.

As variants of our method, we test our Interpreter on various models. First, when comparing (a) and (c), our full model (c) outperforms the baseline (a) on all metrics. The baseline (a), trained on single-paragraphs, is biased, and thus the evidences generated by the biased model are less accurate. Second, the variant (b), trained with the regularizer of Eq. (8), outperforms (c), our full model. In Eq. (8), the loss term does not train a layer for biased features, unlike in Eq. (10). This shows that learning the biased layer results in performance degradation for evidence selection, despite the performance gain in QA.

4.3 RQ3: Generalization

In this section, to show that our model avoids reasoning shortcuts on unseen data, we analyze the confidence distributions of models on the evidence-positive and evidence-negative sets. On the dev set, we treat the ground-truth evidences as the evidence-positive set, and the single sentence containing the answer as the evidence-negative set (each has 7K pairs). On these sets, Figure 3 shows the confidence of three models: (a), (b), and (c), as described in Section 4.2. We sort the confidence scores in ascending order, where the y-axis indicates the confidence and the x-axis refers to the sorted index. Thus, the colored area indicates the dominance of the confidence distribution. Ideally, for a debiased model, the area on the evidence-positive set should be large, while that on the evidence-negative set should be small.

Desirably, in Figure 3(a), the area under the curve for the evidence-negative set should decrease to pursue (O1), moving along the blue arrow, while that of the evidence-positive set should increase for (O2), as the red arrow shows. In Figure 3(b), our model with the regularizer of Eq. (8) follows the blue arrow, with a smaller area under the curve for the evidence-negative set, while keeping that of the evidence-positive set comparable to Figure 3(a). For comparison, Figure 3(d) overlays the curves of all three models. In Figure 3(c), our full model follows the directions of both the blue and red arrows, which indicates that it satisfies both (O1) and (O2).

5 Conclusion

In this paper, we propose a new approach to train multi-hop QA models not to take reasoning shortcuts, i.e., guessing right answers without sufficient evidences. We do not require annotations; instead, we generate pseudo-evidentiality and regularize the QA model against being overconfident when evidences are insufficient. Our experimental results show that our method outperforms baselines on HotpotQA and effectively distinguishes between the evidence-positive and evidence-negative sets.


This research was supported by IITP grant funded by the Korea government (MSIT) (No.2017-0-01779, XAI) and ITRC support program funded by the Korea government (MSIT) (IITP-2021-2020-0-01789).