Rapid progress has been made in the field of question answering (QA), thanks to the release of many large-scale, high-quality QA datasets. Early datasets such as SQuAD (Rajpurkar et al., 2016, 2018), NewsQA (Trischler et al., 2017), and TriviaQA (Joshi et al., 2017) mainly consist of single-hop questions, where an answer with supporting justification can be found within one passage or a short segment of text. These benchmarks focus on evaluating QA models’ ability to perform pattern matching between a passage and a question. Recently, multi-hop QA datasets, such as QAngaroo Wikihop (Welbl et al., 2018) and HotpotQA (Yang et al., 2018), have gained increasing attention. They require models to retrieve multiple pieces of supporting evidence from different documents and to reason over the evidence collected to answer a question. The standard evaluation metrics of QA datasets include exact match (EM) and F1 scores averaged over the test set. However, it is unclear to what extent the multi-hop QA models truly master the ability of multi-hop reasoning.
In this work, we propose an additional evaluation scheme to test whether multi-hop QA systems know how to answer the single-hop sub-questions of a multi-hop question. Our motivation is that if a system can correctly answer a multi-hop question, it should also be able to answer the corresponding single-hop sub-questions that form the complete reasoning path, just as humans naturally can. Figure 1 presents an illustrative example. A successful QA model needs to be able to answer the two sub-questions “Which movie starring Arnold Schwarzenegger as a former New York Police detective” and “What year did Guns N Roses perform a promo for End of Days” if it truly understands the multi-hop question “What year did Guns N Roses perform a promo for a movie starring Arnold Schwarzenegger as a former New York Police detective?”.
We focus on the HotpotQA (Yang et al., 2018) dataset under the distractor setting, in which multi-hop questions are asked over several Wikipedia paragraphs. We create the evaluation set by automatically generating the sub-questions and then extracting their answers. The candidate answers to the sub-questions are then manually verified, which results in 1,000 human-verified sub-question evaluation examples. We show that all three top-performing models which we experiment with fail to answer a large portion of sub-questions (49.8% to 60.4%), although their corresponding multi-hop questions are correctly answered. This observation indicates that the models learn to answer the multi-hop questions without truly understanding the underlying reasoning path.
To motivate the development of new multi-hop reasoning models, we propose an initial architecture by treating the sub-questions explicitly. Our model consists of four components, namely paragraph selector, question type classifier, multi-hop QA model, and single-hop QA model. Instead of performing end-to-end training, we choose to train and evaluate each component individually. The availability of intermediate results also makes our model more interpretable. We show that with automatically generated sub-questions and their answers used for training, our model outperforms other top models by a large margin on the sub-question evaluation (footnote 1: we will release the sub-question evaluation dataset and the code upon publication). Overall, we believe that explicit reasoning should play an important role in multi-hop question answering. Our work takes a step forward towards building a more explainable multi-hop QA system.
2 Construction of Evaluation Examples
In this section, we introduce our semi-automatic approach to generate two sub-questions and their corresponding answers for multi-hop questions in the HotpotQA dataset. As shown in Figure 2, the evaluation examples are generated in three steps. First, we decompose each source question into several sub-strings by predicting the breaking points and post-process them to generate two sub-questions. Then, the answers for the sub-questions are extracted from the paragraphs. Finally, the candidate evaluation examples generated are sent for human verification. We first introduce the HotpotQA dataset and then elaborate on each step of the construction pipeline.
HotpotQA contains 113k crowd-sourced multi-hop QA pairs on Wikipedia articles. We focus on bridge-type questions under the distractor setting. During the construction of such an example in HotpotQA, two related paragraphs $p_1$ and $p_2$ from different Wikipedia articles titled $e_1$ and $e_2$ are presented to crowd-workers. The two paragraphs are related since the text content in one paragraph contains the title entity of the other paragraph. This shared title entity is referred to as the bridge entity. Using Figure 1 as an example, the second paragraph about Oh My God contains the title entity of the first paragraph, End of Days (underlined). Thus, End of Days is referred to as the bridge entity. The crowd-workers are encouraged to ask a multi-hop question that makes use of information from both paragraphs and to annotate the supporting sentences which help to determine the answer. Then, eight other related distracting paragraphs are retrieved from Wikipedia and mixed with the two gold paragraphs. The ten paragraphs serve as the source of answers for the question. Given an example $(q, a, p_1, p_2)$ from HotpotQA, our objective is to generate an evaluation example as follows:
$$(q, a, q_{sub1}, a_{sub1}, q_{sub2}, a_{sub2}, p_1, p_2)$$
where $q_{sub1}$ and $q_{sub2}$ are the two sub-questions, and $a_{sub1}$ and $a_{sub2}$ are their corresponding answers.
2.2 Sub-Question Generation
Given a multi-hop question, the first step is to decompose it into sub-questions. We use the model introduced in DecompRC (Min et al., 2019) to generate the sub-questions. Instead of generating a target sequence word by word, we adopt a copying and editing approach. The multi-hop question is first converted into BERT word embeddings (Devlin et al., 2019), and then sent to a fully connected neural network to predict the splitting points. It is trained on 400 annotated examples. The predicted text spans are post-processed to form the two sub-questions, following a set of handcrafted rules.
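The copy-and-edit idea can be illustrated with a minimal Python sketch. It assumes the splitting points are already given as token indices; the function names, the `[BLANK]` placeholder, and the single rewriting rule (replacing a leading article with "Which") are hypothetical stand-ins for the handcrafted rules, which are not enumerated here.

```python
BLANK = "[BLANK]"  # hypothetical placeholder for the not-yet-known bridge entity


def decompose(question, start, end):
    """Split a multi-hop question at token indices [start, end).

    The copied span becomes the first sub-question; the remainder, with the
    span replaced by a blank, becomes the second sub-question.
    """
    tokens = question.rstrip("?").split()
    span = tokens[start:end]
    # Illustrative rule only: swap a leading article for the wh-word "Which".
    if span and span[0].lower() in {"a", "an", "the"}:
        span = span[1:]
    sub_q1 = "Which " + " ".join(span) + "?"
    sub_q2 = " ".join(tokens[:start] + [BLANK] + tokens[end:]) + "?"
    return sub_q1, sub_q2


def fill_blank(sub_q2, bridge_entity):
    """Insert the extracted bridge entity into the second sub-question."""
    return sub_q2.replace(BLANK, bridge_entity)


question = ("What year did Guns N Roses perform a promo for a movie starring "
            "Arnold Schwarzenegger as a former New York Police detective?")
# The token indices 10 and 22 are chosen by hand for this example.
sub_q1, sub_q2 = decompose(question, start=10, end=22)
```

Here `sub_q2` still contains the blank; it can only be completed once the intermediate answer is known, for example with `fill_blank(sub_q2, "End of Days")`.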
2.3 Intermediate Answer Extraction
One particular characteristic of bridge-type questions from the HotpotQA dataset is that the two gold paragraphs are linked by a bridge entity. Since the crowd-workers are required to form a multi-hop question which makes use of information from both paragraphs, there is a high probability that the bridge entity is the answer for the first sub-question. For the example shown in Figure 2, Shirley Temple in gold paragraph 2 is the bridge entity. It is also the intermediate answer for the multi-hop question, i.e., the answer for the first sub-question.
Three different situations are considered in order to extract the bridge entity. First, if the title entity $e_1$ of paragraph $p_1$ occurs in the other paragraph $p_2$, while the title entity $e_2$ of $p_2$ does not occur in $p_1$, then $e_1$ is recognized as the bridge entity. Second, if neither $e_1$ nor $e_2$ is contained in the other paragraph, then the title entity with more overlapping text with the other paragraph is chosen as the bridge entity (since sometimes the alias of the Wikipedia title is used in the paragraph). Lastly, if both $e_1$ and $e_2$ appear in the other paragraph, then the title entity which appears in neither the question nor the answer is chosen as the bridge entity, since an entity mentioned in the multi-hop question or included in the final answer is unlikely to be the bridge entity. The bridge entity is set to be unidentified if neither or both of the title entities satisfy at least one of the requirements. As illustrated in Figure 2, once the bridge entity is retrieved, the blank in the second sub-question will be updated. The answer to the second sub-question should be the same as that of the multi-hop question.
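The three situations can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact extraction code: the function and helper names are hypothetical, occurrence is tested by case-insensitive substring matching, "overlapping text" is approximated by the count of shared whitespace tokens, and the third situation is read as "appears in neither the question nor the answer".

```python
def _contains(text, phrase):
    """Case-insensitive substring test (assumed matching criterion)."""
    return phrase.lower() in text.lower()


def _overlap(title, paragraph):
    """Number of title tokens that also occur in the paragraph (assumed measure)."""
    para_tokens = set(paragraph.lower().split())
    return sum(tok in para_tokens for tok in title.lower().split())


def extract_bridge_entity(title1, para1, title2, para2, question, answer):
    """Return the bridge entity, or None if it is unidentified."""
    t1_in_p2 = _contains(para2, title1)
    t2_in_p1 = _contains(para1, title2)

    # Situation 1: exactly one title occurs in the other paragraph.
    if t1_in_p2 != t2_in_p1:
        return title1 if t1_in_p2 else title2

    # Situation 2: neither title occurs verbatim; fall back to text overlap.
    if not t1_in_p2:
        o1, o2 = _overlap(title1, para2), _overlap(title2, para1)
        if o1 == o2:
            return None
        return title1 if o1 > o2 else title2

    # Situation 3: both occur; keep the title absent from question and answer.
    qa_text = question + " " + answer
    candidates = [t for t in (title1, title2) if not _contains(qa_text, t)]
    return candidates[0] if len(candidates) == 1 else None


# Toy paragraphs written for illustration, loosely following Figure 1.
bridge = extract_bridge_entity(
    "End of Days",
    "End of Days is a 1999 American film starring Arnold Schwarzenegger.",
    "Oh My God",
    "Oh My God is a song by Guns N' Roses released in 1999 on the "
    "soundtrack to the film End of Days.",
    "What year did Guns N Roses perform a promo for a movie starring "
    "Arnold Schwarzenegger as a former New York Police detective?",
    "1999",
)
```

In this toy case only the first situation applies, so `bridge` is the title of the first paragraph.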
2.4 Human Verification
Sub-question generation and intermediate answer extraction help to efficiently generate candidate sub-questions and their answers for multi-hop questions from HotpotQA. To ensure the quality of the evaluation set, the examples generated are manually verified. For each example, we present to an annotator the original multi-hop question $q$, the answer $a$, the two generated sub-questions $q_{sub1}$ and $q_{sub2}$ with their corresponding answers $a_{sub1}$ and $a_{sub2}$, and the two gold paragraphs $p_1$ and $p_2$, i.e., $(q, a, q_{sub1}, a_{sub1}, q_{sub2}, a_{sub2}, p_1, p_2)$. The annotators are instructed to verify whether $q$ is a two-hop question and whether $a$ is the correct answer. Erroneous examples found in this step are filtered out. Then, the annotators are required to review whether $q_{sub1}$ and $q_{sub2}$ are two semantically and syntactically correct sub-questions of $q$ and whether $a_{sub1}$ and $a_{sub2}$ are valid, and to correct them if not. In total, 1,000 examples are generated from the HotpotQA official development set and manually verified for use in our evaluation.
To determine whether existing top models understand the underlying reasoning path, we perform evaluation on three published top-performing QA models with publicly available open-source code: DFGN (Qiu et al., 2019; https://github.com/woshiyyya/DFGN-pytorch), DecompRC (Min et al., 2019; https://github.com/shmsw25/DecompRC), and CogQA (Ding et al., 2019; https://github.com/THUDM/CogQA).
For all experiments, we measure EM and F1 scores for the multi-hop question $q$ and its two sub-questions $q_{sub1}$ and $q_{sub2}$ on the 1,000 human-verified examples. Since the objective of our evaluation is to determine whether models are able to correctly answer the decomposed single-hop sub-questions whose parent multi-hop questions are correctly answered, we also collect the corresponding categorical statistics. To measure the correctness of a predicted answer, we first use exact string match as the only metric. However, during error analysis, we find that many predicted answers that partially match the gold answers should also be regarded as correct. Some representative examples are shown in Table 1. Although these predicted answers have zero EM scores, they are semantically equivalent to the correct answers given. Therefore, we define a more flexible metric named partial match (PM) as an additional evaluation of correctness. Given a gold answer text span $a_g$ and a predicted answer text $a_p$, they are partially matched if either one of the following two requirements is satisfied.
Table 1: Representative predicted answers that partially match the gold answers.
| Case | Gold Answer | Predicted Answer |
| 1 | from 1986 to 2013 | 1986 to 2013 |
| 2 | City of Angles (film) | City of Angles |
| 3 | Mondelez International, Inc. | the company Mondelez International |
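To make the EM and F1 metrics concrete, the following Python sketch computes both for a single answer pair. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles), which the text does not state explicitly; the function names are illustrative. It shows only why the Table 1 pairs receive zero EM while still overlapping heavily at the token level; it does not implement the partial match metric.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


# Gold/predicted pairs taken from Table 1.
pairs = [
    ("from 1986 to 2013", "1986 to 2013"),
    ("City of Angles (film)", "City of Angles"),
    ("Mondelez International, Inc.", "the company Mondelez International"),
]
scores = [(exact_match(p, g), f1_score(p, g)) for g, p in pairs]
```

Under these assumptions all three pairs have an EM of 0 but a clearly non-zero F1, which is the gap a more flexible correctness criterion is meant to close.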
Table 2 shows the performance of DFGN, DecompRC, and CogQA on multi-hop questions and their single-hop sub-questions. Compared to multi-hop questions, the performance of DFGN and DecompRC drops on the simpler sub-questions, especially on the second sub-questions (11.13 F1 reduction for DFGN and 6.84 F1 reduction for DecompRC). CogQA achieves slightly better performance on sub-questions. The EM and F1 scores are averaged over all examples. In order to understand whether models are able to answer the sub-questions of correctly answered multi-hop questions, we collect the correctness statistics with regard to each individual example. Table 3 presents the categorical statistics. The first four rows show the percentage of examples whose multi-hop question can be correctly answered. Among these examples, we notice that there is a high probability that the models fail to answer at least one of the sub-questions, as shown in rows 2 to 4. We refer to these examples as model failure cases. The percentage of model failure cases over all correctly answered multi-hop questions is defined as the model failure rate. Figure 3 shows the results for all models. All three models tested have a high model failure rate, indicating that the models learn to answer the multi-hop questions without understanding the underlying reasoning chains. The same phenomenon appears under both exact match and the less strict partial match evaluation.
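The model failure rate defined above can be computed from per-example correctness flags, as in the following Python sketch. The input format (one triple of booleans per example) and the function name are assumptions made for illustration, and the toy data is invented.

```python
def model_failure_rate(results):
    """Fraction of correctly answered multi-hop questions for which at least
    one sub-question is answered wrongly.

    results: iterable of (q_correct, sub1_correct, sub2_correct) booleans.
    """
    answered = [r for r in results if r[0]]
    if not answered:
        return 0.0
    failures = [r for r in answered if not (r[1] and r[2])]
    return len(failures) / len(answered)


# Invented toy data: four multi-hop questions are answered correctly,
# and three of them have at least one wrongly answered sub-question.
toy_results = [
    (True, True, True),
    (True, False, True),
    (True, True, False),
    (True, False, False),
    (False, True, True),
]
rate = model_failure_rate(toy_results)
```

Examples whose multi-hop question is answered wrongly are excluded from both the numerator and the denominator.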
After analyzing the error examples, we observe one common characteristic shared by model failure cases: there is a high similarity between the words in the second sub-question and the words near the answer in a paragraph. The models are able to locate the correct answer by local pattern matching, instead of going through the reasoning steps. For the example presented in Figure 1, the information in the second sub-question “What year did Guns N Roses perform …” alone is enough for the model to retrieve the correct answer “1999”. With a distracting paragraph containing the phrases “film … starring Arnold Schwarzenegger …”, the model is misled into answering the first sub-question wrongly.
[Table 3: Categorical statistics by question answered, under exact match and partial match, for DFGN, DecompRC, CogQA, and our model.]
4 Proposed New Model
4.1 Training Data Creation
To prepare the training dataset, we apply the automatic procedure of sub-question generation and intermediate answer extraction to all bridge-type questions from the HotpotQA training set. The human verification step is not performed. In total, we produced 41,362 annotated training examples with sub-questions and their corresponding answers.
4.2 Model Structure
The sub-question evaluation experiments show that existing models trained on the HotpotQA dataset may answer the question correctly without understanding the underlying reasoning chain. To remedy this problem, we propose a new model to handle the sub-questions explicitly. As shown in Figure 4, our model consists of four components. Given a question, we first select related paragraphs from all ten paragraphs and concatenate them to form the context. Then a question type classifier is used to determine whether the question is a single-hop or multi-hop question. Finally, the example is sent to the corresponding QA model for answer prediction. Instead of end-to-end training, all four components are trained individually.
4.2.1 Gold Paragraph Selection
HotpotQA provides ten paragraphs in the distractor setting and only two of them contain information to answer the question. To remove unrelated context and ease the computational burden of subsequent steps, we first perform paragraph selection on the given context.
We feed a question $q$ and each of the 10 paragraphs $p_i$ to a BERT model (Devlin et al., 2019) and obtain a softmax probability $P(p_i \mid q)$ of $p_i$ being a gold paragraph for $q$. A paragraph $p_i$ is chosen as a gold paragraph for question $q$ if $P(p_i \mid q)$ is larger than a threshold $\tau$, or if the probability is the highest or second highest among the 10 candidate paragraphs. To include most of the gold paragraphs for each question, we aim for high recall and acceptable precision. The threshold value that gives 98.6% recall and 68.7% precision is selected. The concatenation of all paragraphs predicted as gold for each question is used as the new context for all subsequent steps.
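The selection rule can be written down directly. In the Python sketch below, the probabilities are assumed to come from an already trained classifier; the function name, the toy probabilities, and the threshold of 0.25 are illustrative assumptions, not values from our experiments.

```python
def select_paragraphs(probs, threshold):
    """Return indices of paragraphs kept as gold candidates.

    A paragraph is kept if its probability exceeds the threshold, or if it
    ranks first or second among all candidates.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_two = set(ranked[:2])
    return [i for i, p in enumerate(probs) if p > threshold or i in top_two]


# Toy gold-paragraph probabilities for the ten candidate paragraphs.
probs = [0.91, 0.05, 0.40, 0.02, 0.10, 0.01, 0.03, 0.30, 0.02, 0.01]
selected = select_paragraphs(probs, threshold=0.25)
```

Keeping the top two paragraphs regardless of the threshold guarantees that at least two paragraphs are always passed on, which favors recall over precision.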
4.2.2 Question Type Classification
A question type classifier is constructed to explicitly predict whether a question is single-hop or multi-hop. We use a pre-trained BERT model for sequence classification as the classifier, and fine-tune it with the multi-hop questions and sub-questions generated. The model takes in a question and its new context as input and aims to predict the question type. The classification results on 1,000 evaluation examples are shown in Table 4. The accuracy is 82.5%. It is noteworthy that the recall for multi-hop questions is 96.9% and most of the error cases belong to misclassification of single-hop questions as multi-hop questions.
Table 5: Performance on the HotpotQA blind test set under the distractor setting.
| Model | EM | F1 |
| Baseline Model (Yang et al., 2018) | 45.60 | 59.02 |
| QFE (Nishida et al., 2019) | 53.86 | 68.06 |
| DecompRC (Min et al., 2019) | 55.20 | 69.63 |
| KGNN (Ye et al., 2019) | 50.81 | 65.75 |
| DFGN (Qiu et al., 2019) | 56.31 | 69.69 |
| ChainEx (Chen et al., 2019) | 61.20 | 74.11 |
| HGN (Fang et al., 2019) | 66.07 | 79.36 |
| SAE-large (Tu et al., 2019) | 66.92 | 79.62 |
| Our Model (SEval) | 61.87 | 74.37 |
4.2.3 Multi-Hop QA Model
We build our multi-hop QA model based on a BERT model for question answering. The question and the new context are concatenated as one sequence, divided by [SEP], and fed to the model, where [SEP] represents a separator token. Following the same strategy of BERT on SQuAD, we calculate the probability of each token in the context being the start position and end position of the answer span. The answer span with the highest sum of start and end probabilities is selected.
To apply the model to HotpotQA, we extend the prediction layer with answer type prediction and supporting facts prediction. Answer type prediction aims to predict whether an answer is “yes”, “no”, or an answer span. It is achieved by feeding the output vectors of BERT for the context to a linear layer, followed by a max-pooling layer over the whole sequence. For supporting facts prediction, we feed the BERT output of the tokens in each sentence to a linear layer followed by a softmax and predict whether the sentence is a supporting fact. To realize this, we extract the start and end positions of sentences in the context and include them in the feature set during the pre-processing step. This model is fine-tuned on the official HotpotQA dataset, except that the context is replaced with the paragraphs obtained by gold paragraph selection.
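The answer type head (a linear layer followed by max-pooling over the sequence) can be illustrated with a small pure-Python sketch. The class order, the function name, and the toy weights are assumptions for illustration; in practice the token vectors would be BERT outputs and the weights would be learned.

```python
ANSWER_TYPES = ("yes", "no", "span")  # assumed class order


def predict_answer_type(hidden, weight, bias):
    """hidden: T x d token vectors; weight: d x 3 matrix; bias: length-3 list.

    Applies the linear layer to every token, max-pools each class logit over
    the whole sequence, and returns the class with the largest pooled logit.
    """
    columns = list(zip(*weight))  # one weight column per answer type
    logits = [
        [sum(h * w for h, w in zip(vec, col)) + b
         for col, b in zip(columns, bias)]
        for vec in hidden
    ]
    pooled = [max(class_logits) for class_logits in zip(*logits)]
    return ANSWER_TYPES[pooled.index(max(pooled))]


# Toy inputs: two tokens with two-dimensional vectors.
hidden = [[1.0, 0.0], [0.0, 2.0]]
weight = [[0.5, -1.0, 0.0], [0.0, 0.2, 1.0]]
bias = [0.0, 0.0, 0.0]
answer_type = predict_answer_type(hidden, weight, bias)
```

Max-pooling lets a single strongly indicative token decide the answer type, regardless of where it appears in the context.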
4.2.4 Single-Hop QA Model
We also exploit the pre-trained BERT model for our single-hop QA model component. The decomposed sub-questions and their extracted answers from the HotpotQA training set are used to fine-tune the model.
Table 5 shows the performance of models on the HotpotQA blind test set under the distractor setting. Although our model emphasizes sub-question handling, it also achieves competitive performance on multi-hop questions. As shown in Table 2, our model outperforms two of the three QA systems on multi-hop questions in the human-verified dataset. It outperforms all other models on sub-questions. A large improvement is made on the first sub-questions. Table 3 and Figure 3 show that our model reduces the model failure rate significantly compared to the other three models. Both explicit single-hop QA modeling and the sub-question training data generated contribute to the improvement.
4.4 Ablation Test
Table 6 presents the ablation studies of our system. “– SingleQA” refers to the model with the question type classifier and single-hop QA model removed. This model performs slightly better on multi-hop questions, while the model failure rate is higher. The result suggests that having an explicit model to handle sub-questions is indeed helpful for intermediate answer extraction. On the other hand, our model performs much worse on multi-hop questions when the question type classifier and multi-hop QA models are removed, although it achieves better performance on the first sub-questions.
5 Related Work
5.1 Multi-hop QA Datasets
Earlier document-based QA datasets (Hermann et al., 2015; Rajpurkar et al., 2016; Trischler et al., 2017) mainly contain single-hop questions, with emphasis on evaluating models’ ability on local pattern matching. Existing models (Lan et al., 2019; Zhang et al., 2019) have achieved super-human performance. To address the ability of performing complex reasoning, several multi-hop QA datasets (Khashabi et al., 2018; Welbl et al., 2018; Yang et al., 2018) have been proposed. MultiRC (Khashabi et al., 2018) contains multiple-choice questions which can only be answered by integrating information from multiple sentences. They ensure this property by excluding the questions which can be answered given incomplete context. QAngaroo Wikihop (Welbl et al., 2018) constructs multi-hop questions by transforming Wikidata facts into questions and retrieving related Wikipedia articles using a bipartite graph connecting entities and documents.
Existing QA datasets suffer from a lack of interpretability. It is a good start for HotpotQA (Yang et al., 2018) to provide annotations for supporting facts. However, our work shows that models jointly trained with supporting facts prediction still fail to answer the sub-questions along the reasoning path. Therefore, answering complex multi-hop reading comprehension questions in an end-to-end manner without explicit modeling of the reasoning chain has severe drawbacks, resulting in non-explainable QA systems.
5.2 Multi-hop QA Models on HotpotQA
One group of models on HotpotQA adopts the architecture of top-performing single-hop QA models and enhances it for the multi-hop setting. They first transform a question and its context into vector representations via pre-trained word embeddings, then encode them via several layers of recurrent neural networks. Next, they update the vector representations for tokens in the context through interaction with the question vector in a multi-hop manner. The attention mechanism (Hermann et al., 2015) is commonly used to retrieve related evidence. The final representations for the context are sent to an answer prediction layer. Another group of successful models (Qiu et al., 2019; Ye et al., 2019; Fang et al., 2019) on HotpotQA focuses on constructing graphs based on named entities extracted from a question and its context. They perform reasoning over the entity graph using graph neural networks (Scarselli et al., 2009) and pass information back to the document representation for answer prediction. Although some models aim to provide explainable intermediate answers, their performance on sub-question evaluation is still unsatisfactory.
5.3 Adversarial Evaluation for QA Datasets
Jia and Liang (2017) first apply adversarial evaluation to QA models on SQuAD (Rajpurkar et al., 2016). They show that the performance of state-of-the-art models drops significantly when an additional distracting sentence is added to the context. Gan and Ng (2019) evaluate the robustness of models trained on SQuAD by asking them to answer paraphrased questions. They also find that paraphrased test questions lead to a significant decrease in performance on multiple state-of-the-art models. On the HotpotQA dataset, Jiang and Bansal (2019) demonstrate that existing models often answer a multi-hop question by exploiting some reasoning shortcut. To remedy the problem, they propose a new model using a control unit to guide the multi-hop reasoning process by dynamically attending to the question. All these works analyze the deficiencies of QA datasets by inserting distracting information into questions or contexts. Our work addresses this issue in a novel way. Without modifying the original data, we show the lack of reasoning ability of the existing multi-hop QA models by constructing an additional set of sub-questions for evaluation.
We propose a new way to evaluate whether multi-hop QA systems have learned the ability to perform reasoning over multiple documents by asking sub-questions. An automatic approach is designed to generate sub-questions for a multi-hop question. On a human-verified test set, we show that all three existing top models give worse performance on the sub-questions compared to our proposed model with an explicit question type classification component and a single-hop QA component. As an initial step towards a more explainable QA system, we hope our work could motivate the construction of multi-hop QA datasets with explicit reasoning paths annotated and the development of better multi-hop QA models.
References
- Chen et al. (2019). Multi-hop question answering via reasoning chains. arXiv e-prints, arXiv:1910.02610.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186.
- Ding et al. (2019). Cognitive graph for multi-hop reading comprehension at scale. In ACL, pp. 2694–2703.
- Fang et al. (2019). Hierarchical graph network for multi-hop question answering. arXiv e-prints, arXiv:1911.03631.
- Feldman and El-Yaniv (2019). Multi-hop paragraph retrieval for open-domain question answering. In ACL, pp. 2296–2309.
- Gan and Ng (2019). Improving the robustness of question answering systems to question paraphrasing. In ACL, pp. 6065–6075.
- Hermann et al. (2015). Teaching machines to read and comprehend. In NIPS, pp. 1693–1701.
- Jia and Liang (2017). Adversarial examples for evaluating reading comprehension systems. In EMNLP, pp. 2021–2031.
- Jiang and Bansal (2019). Avoiding reasoning shortcuts: adversarial evaluation, training, and model development for multi-hop QA. In ACL, pp. 2726–2736.
- Joshi et al. (2017). TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, pp. 1601–1611.
- Khashabi et al. (2018). Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In NAACL-HLT, pp. 252–262.
- Lan et al. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv e-prints, arXiv:1909.11942.
- Min et al. (2019). Multi-hop reading comprehension through question decomposition and rescoring. In ACL, pp. 6097–6109.
- Nishida et al. (2019). Answering while summarizing: multi-task learning for multi-hop QA with evidence extraction. In ACL, pp. 2335–2345.
- Qiu et al. (2019). Dynamically fused graph network for multi-hop reasoning. In ACL, pp. 6140–6150.
- Rajpurkar et al. (2018). Know what you don’t know: unanswerable questions for SQuAD. In ACL, pp. 784–789.
- Rajpurkar et al. (2016). SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, pp. 2383–2392.
- Scarselli et al. (2009). The graph neural network model. IEEE Trans. Neural Networks 20 (1), pp. 61–80.
- Trischler et al. (2017). NewsQA: A machine comprehension dataset. In Rep4NLP@ACL, pp. 191–200.
- Tu et al. (2019). Select, answer and explain: interpretable multi-hop reading comprehension over multiple documents. arXiv e-prints, arXiv:1911.00484.
- Welbl et al. (2018). Constructing datasets for multi-hop reading comprehension across documents. TACL 6, pp. 287–302.
- Yang et al. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pp. 2369–2380.
- Ye et al. (2019). Multi-paragraph reasoning with knowledge-enhanced graph neural network. arXiv e-prints, arXiv:1911.02170.
- Zhang et al. (2019). SG-net: syntax-guided machine reading comprehension. arXiv e-prints, arXiv:1908.05147.