Hierarchical Graph Network for Multi-hop Question Answering

11/09/2019 ∙ by Yuwei Fang, et al. ∙ Microsoft

In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes at different levels of granularity (i.e., questions, paragraphs, sentences, and entities), the representations of which are initialized with BERT-based context encoders. By weaving heterogeneous nodes into an integral unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously (e.g., paragraph selection, supporting facts extraction, and answer prediction). Given a constructed hierarchical graph for each question, the initial node representations are updated through graph propagation; and for each sub-task, multi-hop reasoning is performed by traversing through graph edges. Extensive experiments on the HotpotQA benchmark demonstrate that the proposed HGN approach outperforms prior state-of-the-art methods by a large margin in both Distractor and Fullwiki settings.







1 Introduction

In contrast to one-hop question answering Rajpurkar et al. (2016); Trischler et al. (2016); Lai et al. (2017), where answers can be derived from a single paragraph Wang and Jiang (2017); Seo et al. (2017); Liu et al. (2018); Devlin et al. (2019), recent studies have increasingly focused on multi-hop reasoning across multiple documents or paragraphs for question answering. Popular tasks include WikiHop Welbl et al. (2018), ComplexWebQuestions Talmor and Berant (2018), and HotpotQA Yang et al. (2018).

Figure 1: An example of multi-hop question answering from HotpotQA. The model needs to identify relevant paragraphs, determine supporting facts, and predict the answer correctly.

An example from HotpotQA is illustrated in Figure 1. In order to correctly answer the question (“The director of the romantic comedy ‘Big Stone Gap’ is based in what New York city”), the model first needs to identify P1 as a relevant paragraph, whose title contains keywords that appear in the question (“Big Stone Gap”). S1, the first sentence of P1, is then verified as supporting facts, which leads to the next-hop paragraph P2. From P2, the span “Greenwich Village, New York City” is selected as the predicted answer.

Most existing studies use a retriever to find paragraphs that potentially contain the right answer to the question (P1 and P2 in this case). Then, a Machine Reading Comprehension (MRC) model is applied to the selected paragraphs for answer prediction Nishida et al. (2019); Min et al. (2019b). However, even after identifying a reasoning chain through multiple paragraphs, it still remains a big challenge how to aggregate evidence from sources on different granularity levels (e.g., paragraphs, sentences, entities) for both answer and supporting facts prediction.

To tackle this challenge, some studies aggregate document information into an entity graph, based on which query-guided multi-hop reasoning is performed for answer/supporting facts prediction. Depending on the characteristics of the dataset, answers can be selected either from the entities in the constructed entity graph Song et al. (2018); Dhingra et al. (2018); De Cao et al. (2019); Tu et al. (2019); Ding et al. (2019), or from spans of documents by fusing entity representations back into token-level document representation Xiao et al. (2019). However, the constructed graph is often used for predicting answers only, but insufficient for finding supporting facts. Also, reasoning through a simple entity graph Ding et al. (2019) or a paragraph-entity hybrid graph Tu et al. (2019) is not sufficient for handling complicated questions that require multi-hop reasoning.

Intuitively, given a question that requires multiple hops through a set of documents in order to derive the right answer, a natural sequence of actions follows: (i) identifying relevant paragraphs; (ii) determining supporting facts in those paragraphs; and (iii) pinpointing the right answer based on the gathered evidence. To this end, the message passing algorithm in graph neural networks, which can pass on multi-hop information through graph propagation, has the potential to jointly predict both supporting facts and the answer for multi-hop questions.

Motivated by this, we propose a Hierarchical Graph Network (HGN) for multi-hop question answering, which provides multi-level fine-grained graphs with a hierarchical structure for joint answer and evidence prediction. Instead of using only entities as nodes, we construct a hierarchical graph for each question to capture clues from sources on different levels of granularity. Specifically, we introduce four types of graph nodes: question, paragraphs, sentences, and entities (see Figure 2). To obtain contextualized representations for these hierarchical nodes, large-scale pre-trained language models such as BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019) are used for contextual encoding. These initial representations are then passed through a graph neural network for graph propagation. The updated representations of different nodes are used to perform different sub-tasks (e.g., paragraph selection, supporting facts prediction and entity prediction) through a hierarchical manner. Since some answers may not be entities, a span prediction module is further introduced for final answer prediction.

The main contributions of this paper are three-fold. (i) We propose a Hierarchical Graph Network (HGN) for multi-hop question answering, where heterogeneous nodes are woven into an integral unified graph. (ii) Nodes from different granularity levels are utilized for different sub-tasks, providing effective supervision signals for both supporting facts extraction and final answer prediction. (iii) HGN achieves new state of the art in both Distractor and Fullwiki settings on the HotpotQA benchmark, outperforming previous work by a significant margin.

Figure 2: Model architecture of the proposed Hierarchical Graph Network. The constructed graph corresponds to the example in Figure 1. Green, blue, orange, and brown colors represent paragraph, sentence, entity, and question nodes, respectively. Some entities and hyperlinks are omitted for illustration simplicity.

2 Related Work

Multi-Hop QA

Multi-hop question answering requires a model to aggregate scattered pieces of evidence across multiple documents to predict the right answer. WikiHop Welbl et al. (2018) and HotpotQA Yang et al. (2018) are two recent datasets designed for this purpose. Specifically, WikiHop is constructed using the schema of the underlying knowledge bases, thus limiting answers to entities only. HotpotQA, on the other hand, is free-form text collected by Amazon Mechanical Turkers, which results in significantly more diverse questions and answers. HotpotQA also focuses more on explainability, by requiring supporting facts as the reasoning chain for deriving the correct answer. Two settings are provided in HotpotQA: the distractor setting requires techniques for multi-hop reading comprehension, while the fullwiki setting is more focused on information retrieval.

Existing work on the HotpotQA distractor setting tries to convert the multi-hop reasoning task into single-hop sub-problems. Specifically, QFE Nishida et al. (2019) regards evidence extraction as a query-focused summarization task, and reformulates the query in each hop. DecompRC Min et al. (2019b) decomposes a compositional question into simpler sub-questions, and then leverages single-hop MRC models to answer the sub-questions. A neural modular network is also proposed in Jiang and Bansal (2019b), where carefully designed neural modules are dynamically assembled for more interpretable multi-hop reasoning. Although the task is multi-hop by nature, recent studies Chen and Durrett (2019); Min et al. (2019a) also observed that models achieving high performance may not necessarily perform the expected multi-hop reasoning procedure, and may merely be leveraging reasoning shortcuts Jiang and Bansal (2019a).

Graph Neural Network for Multi-hop QA

Besides the work mentioned above, recent studies on multi-hop QA also focus on building graphs based on entities, and reasoning over the constructed graph using graph neural networks Kipf and Welling (2017); Veličković et al. (2018). For example, MHQA-GRN Song et al. (2018) and Coref-GRN Dhingra et al. (2018) construct an entity graph based on co-reference resolution or sliding windows. Entity-GCN De Cao et al. (2019) considers three different types of edges that connect different entities in the entity graph. HDE-Graph Tu et al. (2019) enriches information in the entity graph by adding document nodes and creating interactions among documents, entities and answer candidates. Cognitive Graph QA Ding et al. (2019) employs an MRC model to predict answer spans and possible next-hop spans, and then organizes them into a cognitive graph. DFGN Xiao et al. (2019) constructs a dynamic entity graph, where in each reasoning step irrelevant entities are softly masked out, and a fusion module is designed to improve the interaction between the entity graph and the documents.

Different from the above methods, our proposed model constructs a hierarchical graph, effectively exploring relations among clues of different granularities and employing different nodes to perform different tasks, such as supporting facts prediction and entity prediction.

3 Hierarchical Graph Network

As illustrated in Figure 2, the proposed Hierarchical Graph Network (HGN) consists of four main components: (i) Graph Construction Module (Sec. 3.1), through which a hierarchical graph is constructed to connect clues from different sources; (ii) Context Encoding Module (Sec. 3.2), where initial representations of graph nodes are obtained via a BERT-based encoder; (iii) Graph Reasoning Module (Sec. 3.3), where a graph-attention-based message passing algorithm is applied to jointly update node representations; and (iv) Multi-task Prediction Module (Sec. 3.4), where multiple sub-tasks, including paragraph selection, supporting facts prediction, entity prediction, and answer span extraction, are performed simultaneously. The following sub-sections describe each component in detail.

3.1 Graph Construction

The hierarchical graph is constructed in two steps: (i) identifying relevant multi-hop paragraphs; and (ii) adding edges that represent connections between sentences and entities within the selected paragraphs.

Paragraph Selection

Starting from the question, the first step is to identify relevant paragraphs (i.e., the first hop). We first retrieve those documents whose titles match the whole question. If multiple paragraphs are found, only the two with the highest ranking scores are selected. If title matching returns no relevant documents, we further search for paragraphs that contain entities appearing in the question. If this also fails, a BERT-based paragraph ranker (described below) is used to select the paragraph with the highest ranking score. The number of first-hop paragraphs is at most two.
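This title-match → entity-match → ranker fallback can be sketched as follows. The function and argument names (`title_match`, `rank_score`, the dict-of-titles layout) are hypothetical stand-ins for illustration; in particular `rank_score` abstracts the BERT-based paragraph ranker, not its actual implementation.

```python
def select_first_hop(question, docs, title_match, entities, rank_score, max_paras=2):
    """Sketch of the first-hop paragraph selection cascade described above.

    docs: {title: paragraph_text}; title_match(title, question) -> bool;
    rank_score(question, paragraph) -> float stands in for the BERT ranker.
    """
    # 1) Prefer documents whose title matches the question.
    cands = [t for t in docs if title_match(t, question)]
    if not cands:
        # 2) Fall back to paragraphs that contain a question entity.
        cands = [t for t in docs
                 if any(e.lower() in docs[t].lower() for e in entities)]
    if not cands:
        # 3) Last resort: let the ranker pick the single best paragraph.
        return [max(docs, key=lambda t: rank_score(question, docs[t]))]
    # Keep at most two paragraphs, ordered by ranker score.
    cands.sort(key=lambda t: rank_score(question, docs[t]), reverse=True)
    return cands[:max_paras]
```

A toy run with a lexical-overlap ranker reproduces the Figure 1 behaviour: the title "Big Stone Gap" matches the question, so that paragraph is selected first.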

Once the first-hop paragraphs are identified, the next step is to find facts and entities within those paragraphs that can lead to other relevant paragraphs (i.e., the second hop). Instead of relying on entity linking, which could be noisy, we use hyperlinks (provided by Wikipedia) in the first-hop paragraphs to discover second-hop paragraphs. Once the links are selected, we add edges between the sentences containing these links (source) and the paragraphs that the hyperlinks refer to (target), as illustrated by the dashed orange line in Figure 2. In order to allow information flow in both directions, the edges are considered bidirectional.

Through this two-hop selection process, we are able to obtain several candidate paragraphs. In order to reduce the introduced noise, we use a paragraph ranking model to select the paragraphs with the highest ranking scores in each step. This paragraph ranking model is based on a pre-trained BERT encoder, followed by a binary classification layer, to predict whether an input paragraph contains the ground-truth supporting facts or not.

Hierarchical Graph Construction

Paragraphs are comprised of sentences, and each sentence contains multiple entities. Text is thus naturally organized in a hierarchical structure, which motivates how we construct the hierarchical graph. For each paragraph node, we add an edge between the node and every sentence in the paragraph, each sentence corresponding to a sentence node. For each sentence node, we extract all the entities in the sentence and add edges between the sentence node and these entity nodes. Optionally, edges between paragraphs and edges between sentences can also be included in the final graph.

Each type of these nodes captures semantics from different information sources. Thus, the proposed hierarchical graph effectively exploits the structural information across all the different granularity levels to learn fine-grained representations, which can locate supporting facts and answers more accurately than simpler graphs with homogeneous nodes.

An example hierarchical graph is illustrated in Figure 2. We define the different types of edges as follows: (i) edges between the question node and paragraph nodes; (ii) edges between the question node and its corresponding entity nodes (entities appearing in the question, not shown for simplicity); (iii) edges between paragraph nodes and their corresponding sentence nodes (sentences within the paragraph); (iv) edges between sentence nodes and their linked paragraph nodes (linked through hyperlinks); (v) edges between sentence nodes and their corresponding entity nodes (entities appearing in the sentences); (vi) edges between paragraph nodes; and (vii) edges between sentence nodes that appear in the same paragraph. Note that a sentence is only connected to its previous and next neighboring sentences. The final graph consists of these seven types of edges and four types of nodes, which link the question to paragraphs, sentences, and entities in a hierarchical way.
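The seven edge types above can be sketched as a small construction routine. The data layout (nested lists of per-sentence entity ids, a set of hyperlink triples, string node ids like `"P0S1"`) is an assumption made for illustration, not the paper's actual data format.

```python
from itertools import combinations

def build_hierarchical_edges(paragraphs, hyperlinks, q_entities):
    """Sketch of the seven edge types; node ids are strings like 'P0', 'P0S1'.

    paragraphs[p][s]: entities mentioned in sentence s of paragraph p.
    hyperlinks: set of (para, sent, target_para) triples.
    q_entities: entity ids that also appear in the question.
    """
    edges = []
    for p, sents in enumerate(paragraphs):
        edges.append(("Q", f"P{p}"))                       # (i) question-paragraph
        for s, ents in enumerate(sents):
            edges.append((f"P{p}", f"P{p}S{s}"))           # (iii) paragraph-sentence
            for e in ents:
                edges.append((f"P{p}S{s}", e))             # (v) sentence-entity
                if e in q_entities:
                    edges.append(("Q", e))                 # (ii) question-entity
            if s + 1 < len(sents):                         # (vii) adjacent sentences
                edges.append((f"P{p}S{s}", f"P{p}S{s + 1}"))
    for p, s, tgt in hyperlinks:
        edges.append((f"P{p}S{s}", f"P{tgt}"))             # (iv) sentence-linked paragraph
    for a, b in combinations(range(len(paragraphs)), 2):
        edges.append((f"P{a}", f"P{b}"))                   # (vi) paragraph-paragraph
    return sorted(set(edges))  # deduplicated; treated as bidirectional downstream
```

Each returned pair is stored once and interpreted as bidirectional, matching the information flow described in Section 3.1.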

3.2 Context Encoding

Given the constructed hierarchical graph, the next step is to obtain the initial representations of all the graph nodes. To this end, we first combine all the selected paragraphs into the context C, which is concatenated with the question and fed into pre-trained BERT Devlin et al. (2019), followed by a bi-attention layer Seo et al. (2017). We denote the encoded question representation as Q ∈ ℝ^{m×d} and the encoded context representation as C ∈ ℝ^{n×d}, where m and n are the lengths of the question and the context, respectively, and each q_i, c_j ∈ ℝ^d.

A shared BiLSTM is applied on top of the context representation C, and the representations of different nodes are extracted from the output of the BiLSTM, denoted as M ∈ ℝ^{n×2d}. For entity/sentence/paragraph nodes, which are spans of the context, the representation is calculated from: (i) the hidden state of the backward LSTM at the start position; and (ii) the hidden state of the forward LSTM at the end position. For the question node, a max-pooling layer is used to obtain its representation. Specifically,

p_i = MLP_p([M[P_i^start][d:] ; M[P_i^end][:d]]),
s_i = MLP_s([M[S_i^start][d:] ; M[S_i^end][:d]]),
e_i = MLP_e([M[E_i^start][d:] ; M[E_i^end][:d]]),
q = max-pooling(Q),

where P_i^start, S_i^start, and E_i^start denote the start positions of the i-th paragraph/sentence/entity node, and P_i^end, S_i^end, and E_i^end denote the corresponding end positions. MLP(·) denotes an MLP layer, and [· ; ·] denotes the concatenation of two vectors. In summary, after context encoding, each p_i, s_i, e_i ∈ ℝ^d serves as the representation of the i-th paragraph/sentence/entity node, and the question node is represented as q ∈ ℝ^d.
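The span-based node extraction can be sketched with NumPy. The `(n, 2d)` layout with forward states in the first half and backward states in the second half, and the bias-free linear map standing in for the MLP, are simplifying assumptions.

```python
import numpy as np

def span_node_repr(M, start, end, d, W):
    """Sketch of extracting one node representation from BiLSTM output.

    M: (n, 2d) BiLSTM output; forward states assumed in M[:, :d] and
    backward states in M[:, d:]. Following the text, concatenate the
    backward hidden state at the span start with the forward hidden
    state at the span end, then apply a linear layer W of shape (2d, d)
    (a bias-free stand-in for the paper's MLP).
    """
    feat = np.concatenate([M[start, d:], M[end, :d]])  # shape (2d,)
    return feat @ W                                    # shape (d,)
```

Applying this once per paragraph, sentence, and entity span yields the p_i, s_i, and e_i vectors fed into graph reasoning.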

3.3 Graph Reasoning

After context encoding, HGN performs reasoning over the hierarchical graph, where the contextualized representations of all the graph nodes are transformed into higher-level features via a graph neural network. Specifically, let P = {p_i}, S = {s_i}, and E = {e_i}, where n_p, n_s, and n_e denote the number of paragraph/sentence/entity nodes in a graph. In experiments, we set n_p = 4, n_s = 40, and n_e = 60 (padded where necessary), and denote H = {q, P, S, E} ∈ ℝ^{g×d}, where g = n_p + n_s + n_e + 1 and d is the feature dimension of each node.

For graph propagation, we use a Graph Attention Network (GAT) Veličković et al. (2018) to perform message passing over the hierarchical graph. Specifically, GAT takes all the nodes as input, and updates each node's features through its neighbors in the graph. Formally,

h_i' = σ( Σ_{j ∈ N_i} α_{ij} U h_j ),

where U is a weight matrix to be learned, σ(·) denotes an activation function, and α_{ij} are the attention coefficients, which can be calculated by:

α_{ij} = softmax_j( LeakyReLU( W_{t(i,j)} [h_i ; h_j] ) ),

where W_{t(i,j)} is the weight matrix corresponding to the edge type t(i,j) between the i-th and j-th nodes, and LeakyReLU denotes the leaky rectified linear activation function. In summary, after graph reasoning, we obtain H' = {h_i'}, from which the updated representations for each type of node can be obtained, i.e., P', S', E', and q'.
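A minimal single-head version of this edge-type-aware attention might look like the following. The self-loop requirement, the LeakyReLU slope of 0.2, and the ReLU output activation are simplifying assumptions, not details stated in the paper.

```python
import numpy as np

def gat_layer(H, edges, edge_type, W, a):
    """One sketched graph-attention update over typed edges.

    H: (g, d) node features; edges[i] lists the neighbours of node i
    (self-loop assumed included); edge_type[(i, j)] indexes the attention
    vector a[t] of shape (2d,) used for that edge; W: (d, d) shared
    projection. Single head, ReLU output.
    """
    Z = H @ W
    H_new = np.zeros_like(Z)
    for i, nbrs in enumerate(edges):
        # Edge-type-specific attention scores for each neighbour.
        scores = np.array([np.concatenate([Z[i], Z[j]]) @ a[edge_type[(i, j)]]
                           for j in nbrs])
        scores = np.where(scores > 0, scores, 0.2 * scores)  # LeakyReLU
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                                 # softmax over neighbours
        H_new[i] = sum(alpha[k] * Z[j] for k, j in enumerate(nbrs))
    return np.maximum(H_new, 0)                              # output activation
```

With zero attention vectors the coefficients reduce to a uniform average over neighbours, which is a handy sanity check for the softmax normalization.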

Model | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
DecompRC Min et al. (2019b) | 55.20 | 69.63 | - | - | - | -
ChainEx Chen et al. (2019) | 61.20 | 74.11 | - | - | - | -
Baseline Model Yang et al. (2018) | 45.60 | 59.02 | 20.32 | 64.49 | 10.83 | 40.16
QFE Nishida et al. (2019) | 53.86 | 68.06 | 57.75 | 84.49 | 34.63 | 59.61
DFGN Xiao et al. (2019) | 56.31 | 69.69 | 51.50 | 81.62 | 33.62 | 59.82
LQR-Net Anonymous (2020a) | 60.20 | 73.78 | 56.21 | 84.09 | 36.56 | 63.68
P-BERT | 61.18 | 74.16 | 51.38 | 82.76 | 35.42 | 63.79
SAE | 60.36 | 73.58 | 56.93 | 84.63 | 38.81 | 64.96
TAP2 | 64.99 | 78.59 | 55.47 | 85.57 | 39.77 | 69.12
EPS+BERT | 65.79 | 79.05 | 58.50 | 86.26 | 42.47 | 70.48
HGN (ours) | 66.07 | 79.36 | 60.33 | 87.33 | 43.57 | 71.03
Table 1: Results on the test set of HotpotQA in the Distractor setting. HGN achieves state-of-the-art results at the time of submission (Sep. 27, 2019). Models without citations are unpublished work. BERT-wwm is used for context encoding. Leaderboard: https://hotpotqa.github.io/.

3.4 Multi-task Prediction

In this module, the updated node representations after graph reasoning are exploited for the different sub-tasks of QA: (i) paragraph selection, based on paragraph nodes; (ii) supporting facts prediction, based on sentence nodes; and (iii) answer prediction, based on entity nodes. Since the answer may not reside in an entity node, the loss from (iii) only serves as a regularization term, and the encoded context representation is directly used for answer span extraction.

Similar to Xiao et al. (2019), we use a cascade structure to resolve the output dependency, and jointly perform all the tasks in a multi-task way. The final objective is specified as:

L_joint = L_start + L_end + λ1 L_para + λ2 L_sent + λ3 L_entity + λ4 L_type,

where λ1, λ2, λ3, and λ4 are hyper-parameters, and each loss function is a cross-entropy loss, calculated over the logits described below.
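The weighted combination of sub-task losses can be sketched as follows. The dictionary keys and the convention that the span start/end losses enter unweighted while the four λ values (1, 5, 1, 1 in Sec. 4.1) weight the paragraph, sentence, entity, and answer-type terms are assumptions for illustration.

```python
def joint_loss(losses, lambdas=(1.0, 5.0, 1.0, 1.0)):
    """Sketch of combining the sub-task cross-entropy losses.

    losses: dict with keys 'start', 'end', 'para', 'sent', 'entity', 'type'
    (hypothetical names); lambdas: the four weighting hyper-parameters.
    """
    l1, l2, l3, l4 = lambdas
    return (losses["start"] + losses["end"]
            + l1 * losses["para"] + l2 * losses["sent"]
            + l3 * losses["entity"] + l4 * losses["type"])
```

The relatively large weight on the sentence loss reflects that supporting facts prediction contributes directly to the joint evaluation metric.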

For both paragraph selection (L_para) and supporting facts prediction (L_sent), we use a two-layer MLP as the binary classifier:

o_sent = MLP_sent(S'), o_para = MLP_para(P'),

where o_sent represents the logits that a sentence is selected as a supporting fact, and o_para represents the logits that a paragraph contains the ground-truth supporting facts.

We treat entity prediction (L_entity) as a multi-class classification problem. Candidate entities include all the entities in the constructed graph, plus an additional dummy entity indicating that the ground-truth answer does not exist among the entity nodes. Specifically,

o_entity = MLP_entity(E').
During inference, the above loss only serves as a regularization term, and the final answer is predicted by the answer span extraction module. For answer span extraction, a two-layer MLP on top of a BiLSTM is used to calculate the logits of every position being the start or end point of the ground-truth span:

o_start = MLP_start(BiLSTM([M ; G(q')])), o_end = MLP_end(BiLSTM([M ; G(q')])),

where G(·) is a function that maps q' from ℝ^d to ℝ^{n×d} by repeating q' multiple times, and n is the number of tokens in the context.
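At inference time the start/end logits must still be decoded into a span. A common decoding scheme (not detailed in the paper; the `max_len` cap in particular is an assumption) picks the highest-scoring valid pair:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair maximising summed logits, subject to
    start <= end < start + max_len. A sketch of standard span decoding,
    not necessarily the paper's exact procedure."""
    n = len(start_logits)
    best, best_score = (0, 0), -np.inf
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

The start <= end constraint rules out degenerate spans that independent argmaxes over the two logit vectors could otherwise produce.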

For answer-type classification (L_type), we use a two-layer MLP on top of the BiLSTM for multi-class classification (following previous work, the answer types are span, entity, yes, and no):

o_type = MLP_type(BiLSTM([M ; G(q')])[0]).

The final cross-entropy loss used for training is defined over all the above logits.

4 Experiments

In this section, we describe our experiments on the HotpotQA dataset, comparing HGN with state-of-the-art approaches and providing detailed analysis to validate the effectiveness of our proposed model.

4.1 Experimental Setup


We use the HotpotQA dataset Yang et al. (2018) for evaluation, which has become a popular benchmark for multi-hop QA. Specifically, two sub-tasks are included in this dataset: (i) answer prediction; and (ii) supporting facts prediction. For each sub-task, exact match (EM) and partial match (F1) are used to evaluate model performance, and a joint EM and F1 score is used to measure the final performance, which encourages the model to take both answer and evidence prediction into consideration.
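The joint metric combines the two sub-tasks multiplicatively: joint precision and recall are the products of the answer and supporting-fact precisions/recalls, from which the joint F1 follows as their harmonic mean. This sketch follows the HotpotQA evaluation protocol as we understand it; consult the official evaluation script for the authoritative definition.

```python
def joint_f1(ans_prec, ans_rec, sup_prec, sup_rec):
    """Joint F1 from per-task precision/recall, per the HotpotQA joint metric:
    multiply the answer and supporting-fact components, then take the
    harmonic mean."""
    p = ans_prec * sup_prec
    r = ans_rec * sup_rec
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

Because the components are multiplied, a model must do well on both answer and evidence prediction to score well jointly, which is exactly the incentive described above.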

In addition, there are two settings in HotpotQA: the Distractor and Fullwiki settings. In the Distractor setting, for each question, two gold paragraphs with ground-truth answers and supporting facts are provided, along with 8 "distractor" paragraphs collected via a bi-gram TF-IDF retriever Chen et al. (2017). The Fullwiki setting is more challenging: it contains the same questions as the Distractor setting, but does not provide relevant paragraphs. To obtain the right answer and supporting facts, the entire Wikipedia can be used to find relevant documents.

Model | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
TPReasoner Xiong et al. (2019) | 36.04 | 47.43 | - | - | - | -
Baseline Model Yang et al. (2018) | 23.95 | 32.89 | 3.86 | 37.71 | 1.85 | 16.15
QFE Nishida et al. (2019) | 28.66 | 38.06 | 14.20 | 44.35 | 8.69 | 23.10
MUPPET Feldman and El-Yaniv (2019) | 30.61 | 40.26 | 16.65 | 47.33 | 10.85 | 27.01
Cognitive Graph Ding et al. (2019) | 37.12 | 48.87 | 22.82 | 57.69 | 12.42 | 34.92
PR-BERT | 43.33 | 53.79 | 21.90 | 59.63 | 14.50 | 39.11
GoldEn Retriever Qi et al. (2019) | 37.92 | 48.58 | 30.69 | 64.24 | 18.04 | 39.13
Entity-centric BERT Godbole et al. (2019) | 41.82 | 53.09 | 26.26 | 57.29 | 17.01 | 39.18
SemanticRetrievalMRS Yixin Nie (2019) | 45.32 | 57.34 | 38.67 | 70.83 | 25.14 | 47.60
Transformer-XH Anonymous (2020c) | 48.95 | 60.75 | 41.66 | 70.01 | 27.13 | 49.57
MIR+EPS+BERT | 52.86 | 64.79 | 42.75 | 72.00 | 31.19 | 54.75
Graph Recur. Retriever Anonymous (2020b) | 56.04 | 68.87 | 44.14 | 73.03 | 29.18 | 55.31
HGN (ours) | 56.71 | 69.16 | 49.97 | 76.39 | 35.63 | 59.86
Table 2: Results on the test set of HotpotQA in the Fullwiki setting. HGN, when combined with the SemanticRetrievalMRS retrieval system, achieves state-of-the-art results at the time of submission (Oct. 7, 2019). Models without citations are unpublished work. RoBERTa-large is used for context encoding. Leaderboard: https://hotpotqa.github.io/.

Implementation Details

Our implementation is based on the Transformers library Wolf et al. (2019); we use BERT-wwm (whole word masking) or RoBERTa Liu et al. (2019) for context encoding. To construct the proposed hierarchical graph, we use spaCy222https://spacy.io to extract entities from both questions and sentences. The numbers of entities, sentences and paragraphs in one graph are limited to 60, 40 and 4, respectively. Since HotpotQA only requires two-hop reasoning, up to two paragraphs are connected to each question. Our paragraph ranking model is a binary classifier based on the BERT-base model. For the Fullwiki setting, we leverage the retrieved paragraphs and the paragraph ranker provided by Yixin Nie (2019). The hyper-parameters λ1, λ2, λ3 and λ4 are set to 1, 5, 1 and 1, respectively.


We compare HGN with both published and unpublished work in both settings. For the Distractor setting, we compare with DFGN Xiao et al. (2019), QFE Nishida et al. (2019), the official baseline Yang et al. (2018), and DecompRC Min et al. (2019b). Unpublished work includes TAP2, EPS+BERT, SAE, P-BERT, LQR-Net Anonymous (2020a), and ChainEx Chen et al. (2019).

For the Fullwiki setting, the published baselines include SemanticRetrievalMRS Yixin Nie (2019), Entity-centric BERT Godbole et al. (2019), GoldEn Retriever Qi et al. (2019), Cognitive Graph Ding et al. (2019), MUPPET Feldman and El-Yaniv (2019), QFE Nishida et al. (2019), and the official baseline Yang et al. (2018). Unpublished work includes Graph-based Recurrent Retriever Anonymous (2020b), MIR+EPS+BERT, Transformer-XH Anonymous (2020c), PR-BERT, and TPReasoner Xiong et al. (2019).

4.2 Experimental Results

Results on the Leaderboard

Table 1 and Table 2 summarize our results on the hidden test set of HotpotQA in the Distractor and Fullwiki setting, respectively. The proposed HGN outperforms both published and unpublished work on every metric by a significant margin. For example, HGN achieves a Joint EM/F1 score of 43.57/71.03 and 35.63/59.86 on the Distractor and Fullwiki setting, respectively, with an absolute improvement of 2.36/0.38 and 6.45/4.55 points over the previous state of the art. Below, we will conduct detailed analysis on the dev set to analyze the source of the performance gain.

Effectiveness of Paragraph Selection

Method | Precision | Recall | #Para.
Threshold-based | 60.28 | 98.27 | 3.26
Top 2 from ranker | 93.43 | 93.43 | 2
Top 4 from ranker | 49.39 | 98.48 | 4
2 paragraphs (ours) | 94.53 | 94.53 | 2
4 paragraphs (ours) | 49.45 | 98.74 | 4

Table 3: Performance of paragraph selection on the dev set of HotpotQA based on BERT-base.

The proposed HGN relies on effective paragraph selection to find relevant multi-hop paragraphs. Table 3 shows the performance of paragraph selection on the dev set of HotpotQA. In DFGN, paragraphs are selected based on a threshold to maintain high recall (98.27%), leading to low precision (60.28%). Compared to both threshold-based and pure top-k-based paragraph selection, our two-step selection process is more accurate, achieving 94.53% precision and 94.53% recall. Besides these two top-ranked paragraphs, we also include two further paragraphs with the next highest ranking scores, to obtain higher coverage of potential answers while sacrificing a little precision.

Table 4 summarizes the results on the dev set in the Distractor setting, using our paragraph selection approach for both DFGN and the plain BERT-base model. Note that the original DFGN does not finetune BERT, leading to much worse performance. In order to provide a fair comparison, we modify their released code to allow finetuning of BERT. Results show that our paragraph selection method outperforms the threshold-based one in both models.

Model | Ans F1 | Sup F1 | Joint F1
DFGN (paper) | 69.38 | 82.23 | 59.89
DFGN + threshold-based | 71.90 | 83.57 | 63.04
DFGN + 2 para. (ours) | 72.53 | 83.57 | 63.87
DFGN + 4 para. (ours) | 72.67 | 83.34 | 63.63
BERT-base + threshold-based | 71.95 | 82.79 | 62.43
BERT-base + 2 para. (ours) | 72.42 | 83.64 | 63.94
BERT-base + 4 para. (ours) | 72.67 | 84.86 | 64.24

Table 4: Results with selected paragraphs on the dev set in the Distractor setting.

Effectiveness of Hierarchical Graph

Model | Ans F1 | Sup F1 | Joint F1
w/o Graph | 80.58 | 85.83 | 71.02
PS Graph | 80.94 | 87.59 | 72.61
PSE Graph | 80.70 | 88.00 | 72.79
Hier. Graph | 81.00 | 87.93 | 73.01

Table 5: Ablation study on the effectiveness of the hierarchical graph on the dev set in the Distractor setting. RoBERTa-large is used for context encoding.

As described in Section 3.1, we construct our graph with four types of nodes and seven types of edges. For the ablation study, we build the graph step by step. First, we only consider edges from the question to paragraphs, from paragraphs to sentences, and from sentences to their linked paragraphs, i.e., only edge types (i), (iii) and (iv) are considered. We call this the PS Graph. Based on this, entity nodes and the edges related to each entity node (corresponding to edge types (ii) and (v)) are added. We call this the PSE Graph. Lastly, edge types (vi) and (vii) are added, resulting in the final hierarchical graph.

As shown in Table 5, the use of the PS Graph improves the joint F1 score over the plain RoBERTa model by 1.59 points. By further adding entity nodes, the joint F1 increases by 0.18 points. This indicates that the addition of entity nodes is helpful, but may also bring in noise, thus only leading to limited performance improvement. By including edges among sentences and paragraphs, our final hierarchical graph provides an additional improvement of 0.22 points. We hypothesize that this is due to the explicit connection between sentences leading to better representations.

Effectiveness of Multi-task Loss

As described in Section 3.4, different node representations are utilized for different downstream sub-tasks. Table 6 shows the ablation study results on the paragraph selection loss L_para and the entity prediction loss L_entity. The span extraction loss and supporting facts prediction loss are not ablated, since they are the essential final sub-tasks on which we evaluate the model. As shown in the table, using both the paragraph selection and entity prediction losses improves the joint F1 by 0.31 points, which demonstrates the effectiveness of optimizing all the losses jointly.

Effectiveness of Pre-trained Language Model

Objective | Ans F1 | Sup F1 | Joint F1
Full objective | 81.00 | 87.93 | 73.01
w/o L_para | 80.86 | 87.99 | 72.87
w/o L_entity | 80.89 | 87.71 | 72.73
w/o L_para & L_entity | 80.76 | 87.78 | 72.70

Table 6: Ablation study on the proposed multi-task loss. RoBERTa-large is used for context encoding.

Model | Ans F1 | Sup F1 | Joint F1
DFGN (BERT-base) | 69.38 | 82.23 | 59.89
EPS (BERT-wwm) | 79.05 | 86.26 | 70.48
HGN (BERT-base) | 74.07 | 85.62 | 66.01
HGN (BERT-wwm) | 79.69 | 87.38 | 71.45
HGN (RoBERTa) | 81.00 | 87.93 | 73.01

Table 7: Results with different pre-trained language models on the dev set in the Distractor setting. EPS is unpublished work with results on the test set, using BERT whole word masking (wwm).

To isolate the effects of pre-trained language models, we compare our HGN with prior state-of-the-art methods by using the same pre-trained language models. Results in Table 7 show that our HGN variants outperform DFGN and EPS, indicating that the performance gain comes from a better model design.

Figure 3: Examples of supporting facts prediction in the HotpotQA Distractor setting.

4.3 Case Study

We provide two example questions for case study. To answer the question in Figure 3 (left), the question first needs to be linked to the first relevant paragraph; a sentence within that paragraph is then connected to the second paragraph through the hyperlink ("John Surtees"). A plain BERT model without the constructed graph misses the second piece of evidence as additional supporting facts, while our HGN discovers and utilizes both pieces of evidence, since the connections among the question, sentences, and paragraphs are explicitly encoded in our hierarchical graph.

For the question in Figure 3 (right), the inference chain spans multiple supporting sentences across two paragraphs. The plain BERT model infers two of the evidence sentences correctly, but fails to predict the remaining supporting fact, while HGN succeeds, potentially due to the explicit connections between sentences in the constructed graph.

5 Conclusion

In this paper, we propose a new approach, Hierarchical Graph Network (HGN), for multi-hop question answering. To capture clues from different granularity levels, our HGN model weaves heterogeneous nodes into a single unified graph. Experiments with detailed analysis demonstrate the effectiveness of our proposed model, which achieves state-of-the-art performance on the HotpotQA benchmark. Currently, in the Fullwiki setting, an off-the-shelf paragraph retriever is adopted for selecting relevant context from a large corpus of text. Future work includes investigating the interaction and joint training between HGN and the paragraph retriever for further performance improvement.


  • Anonymous (2020a) Latent question reformulation and information accumulation for multi-hop machine reading. In Submitted to ICLR, Cited by: Table 1, §4.1.
  • Anonymous (2020b) Learning to retrieve reasoning paths over wikipedia graph for question answering. In Submitted to ICLR, Cited by: §4.1, Table 2.
  • Anonymous (2020c) Transformer-{xh}: multi-hop question answering with extra hop attention. In Submitted to ICLR, Cited by: §4.1, Table 2.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In ACL, Cited by: §4.1.
  • J. Chen and G. Durrett (2019) Understanding dataset design choices for multi-hop reasoning. In NAACL, Cited by: §2.
  • J. Chen, S. Lin, and G. Durrett (2019) Multi-hop question answering via reasoning chains. arXiv preprint arXiv:1910.02610. Cited by: Table 1, §4.1.
  • N. De Cao, W. Aziz, and I. Titov (2019) Question answering by reasoning across documents with graph convolutional networks. In NAACL, Cited by: §1, §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §1, §3.2.
  • B. Dhingra, Q. Jin, Z. Yang, W. W. Cohen, and R. Salakhutdinov (2018) Neural models for reasoning over multiple mentions using coreference. In NAACL, Cited by: §1, §2.
  • M. Ding, C. Zhou, Q. Chen, H. Yang, and J. Tang (2019) Cognitive graph for multi-hop reading comprehension at scale. In ACL, Cited by: §1, §2, §4.1, Table 2.
  • Y. Feldman and R. El-Yaniv (2019) Multi-hop paragraph retrieval for open-domain question answering. arXiv preprint arXiv:1906.06606. Cited by: §4.1, Table 2.
  • A. Godbole, D. Kavarthapu, R. Das, Z. Gong, A. Singhal, H. Zamani, M. Yu, T. Gao, X. Guo, M. Zaheer, et al. (2019) Multi-step entity-centric information retrieval for multi-hop question answering. arXiv preprint arXiv:1909.07598. Cited by: §4.1, Table 2.
  • Y. Jiang and M. Bansal (2019a) Avoiding reasoning shortcuts: adversarial evaluation, training, and model development for multi-hop qa. In ACL, Cited by: §2.
  • Y. Jiang and M. Bansal (2019b) Self-assembling modular networks for interpretable multi-hop reasoning. In EMNLP, Cited by: §2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §2.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale reading comprehension dataset from examinations. In EMNLP, Cited by: §1.
  • X. Liu, Y. Shen, K. Duh, and J. Gao (2018) Stochastic answer networks for machine reading comprehension. In ACL, Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §4.1.
  • S. Min, E. Wallace, S. Singh, M. Gardner, H. Hajishirzi, and L. Zettlemoyer (2019a) Compositional questions do not necessitate multi-hop reasoning. In ACL, Cited by: §2.
  • S. Min, V. Zhong, L. Zettlemoyer, and H. Hajishirzi (2019b) Multi-hop reading comprehension through question decomposition and rescoring. In ACL, Cited by: §1, §2, Table 1, §4.1.
  • K. Nishida, K. Nishida, M. Nagata, A. Otsuka, I. Saito, H. Asano, and J. Tomita (2019) Answering while summarizing: multi-task learning for multi-hop qa with evidence extraction. In ACL, Cited by: §1, §2, Table 1, §4.1, §4.1, Table 2.
  • P. Qi, X. Lin, L. Mehr, Z. Wang, and C. D. Manning (2019) Answering complex open-domain questions through iterative query generation. In EMNLP, Cited by: §4.1, Table 2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, Cited by: §1.
  • M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In ICLR, Cited by: §1, §3.2.
  • L. Song, Z. Wang, M. Yu, Y. Zhang, R. Florian, and D. Gildea (2018) Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. arXiv preprint arXiv:1809.02040. Cited by: §1, §2.
  • A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In NAACL, Cited by: §1.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2016) NewsQA: a machine comprehension dataset. arXiv preprint arXiv:1611.09830. Cited by: §1.
  • M. Tu, G. Wang, J. Huang, Y. Tang, X. He, and B. Zhou (2019) Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In ACL, Cited by: §1, §2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §2, §3.3.
  • S. Wang and J. Jiang (2017) Machine comprehension using match-lstm and answer pointer. In ICLR, Cited by: §1.
  • J. Welbl, P. Stenetorp, and S. Riedel (2018) Constructing datasets for multi-hop reading comprehension across documents. TACL. Cited by: §1, §2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §4.1.
  • Y. Xiao, Y. Qu, L. Qiu, H. Zhou, L. Li, W. Zhang, and Y. Yu (2019) Dynamically fused graph network for multi-hop reasoning. In ACL, Cited by: §1, §2, §3.4, Table 1, §4.1.
  • W. Xiong, M. Yu, X. Guo, H. Wang, S. Chang, M. Campbell, and W. Y. Wang (2019) Simple yet effective bridge reasoning for open-domain multi-hop question answering. arXiv preprint arXiv:1909.07597. Cited by: §4.1, Table 2.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP, Cited by: §1, §2, Table 1, §4.1, §4.1, §4.1, Table 2.
  • Y. Nie, S. Wang, and M. Bansal (2019) Revealing the importance of semantic retrieval for machine reading at scale. In EMNLP, Cited by: §4.1, §4.1, Table 2.