Multi-hop Reading Comprehension across Multiple Documents by Reasoning over Heterogeneous Graphs

05/17/2019 ∙ by Ming Tu, et al.

Multi-hop reading comprehension (RC) across documents poses new challenges over single-document RC because it requires reasoning over multiple documents to reach the final answer. In this paper, we propose a new model to tackle the multi-hop RC problem. We introduce a heterogeneous graph with different types of nodes and edges, named the Heterogeneous Document-Entity (HDE) graph. The advantage of the HDE graph is that it contains different granularity levels of information, including candidates, documents and entities in specific document contexts. Our proposed model can reason over the HDE graph, with node representations initialized by co-attention and self-attention based context encoders. We employ Graph Neural Network (GNN) based message passing algorithms to accumulate evidence on the proposed HDE graph. Evaluated on the blind test set of the Qangaroo WikiHop data set, our HDE graph based model (single model) achieves state-of-the-art results.




1 Introduction

Being able to comprehend a document and output the correct answer given a query/question about content in the document, often referred to as machine reading comprehension (RC) or question answering (QA), is an important and challenging task in natural language processing (NLP). Plenty of data sets have been constructed to facilitate research on this topic, such as SQuAD Rajpurkar et al. (2016, 2018), NarrativeQA Kočiskỳ et al. (2018) and CoQA Reddy et al. (2018). Many neural models have been proposed to tackle the machine RC/QA problem Seo et al. (2016); Xiong et al. (2016); Tay et al. (2018), and great success has been achieved, especially after the release of BERT Devlin et al. (2018).

Query: record_label get ready
Support doc 1: Mason Durell Betha (born August 27, 1977), better known by stage name Mase (formerly often stylized Ma$e or MA$E), is an American hip hop recording artist and minister. He is best known for being signed to Sean “Diddy” Combs’s label Bad Boy Records. …
Support doc 2: “Get Ready” was the only single released from Mase’s second album, Double Up. It was released on May 25, 1999, produced by Sean “Puffy” Combs, Teddy Riley and Andreao “Fanatic” Heard and featured R&B group, Blackstreet, it contains a sample of “A Night to Remember”, performed by Shalamar. …
Support doc 3: Bad Boy Entertainment (also known as Bad Boy Records) is an American record label founded in 1993 by Sean Combs. …
Candidates: bad boy records, record label, rock music, …
Answer: bad boy records

Figure 1: A WikiHop example. Words with different colors indicate the evidence across documents.

However, current research mainly focuses on machine RC/QA on a single document or paragraph, and still lacks the ability to reason across multiple documents when a single document is not enough to find the correct answer. To promote the study of multi-hop RC over multiple documents, two data sets were recently proposed: WikiHop Welbl et al. (2018) and HotpotQA Yang et al. (2018). These two data sets require multi-hop reasoning over multiple supporting documents to find the answer. In Figure 1, we show an excerpt from one sample in the WikiHop development set to illustrate the need for multi-hop reasoning.

Two types of approaches have been proposed on the multi-hop multi-document RC problem. The first is based on previous neural RC models. The earliest attempt in Dhingra et al. (2018) concatenated all supporting documents and designed a recurrent layer to explicitly exploit the skip connections between entities given automatically generated coreference annotations. Adding this layer to the neural RC models improved performance on multi-hop tasks. Recently, an attention based system Zhong et al. (2019) utilizing both document-level and entity-level information achieved state-of-the-art results on WikiHop data set, proving that techniques like co-attention and self-attention widely employed in single-document RC tasks are also useful in multi-document RC tasks.

The second type of research work is based on graph neural networks (GNN) for multi-hop reasoning. The study in Song et al. (2018) adopted two separate named entity recognition (NER) and coreference resolution systems to locate entities in support documents. Those entities serve as nodes in the GNN to enable multi-hop reasoning across documents. Work in De Cao et al. (2018) directly used mentions of candidates (found in documents by a simple exact matching strategy) as GNN nodes and calculated classification scores over mentions of candidates.

In this paper, we propose a new method to solve the multi-hop RC problem across multiple documents. Inspired by the success of GNN based methods Song et al. (2018); De Cao et al. (2018) for multi-hop RC, we introduce a new type of graph, called Heterogeneous Document-Entity (HDE) graph. Our proposed HDE graph has the following advantages:

  • Instead of graphs with single type of nodes Song et al. (2018); De Cao et al. (2018), the HDE graph contains different types of query-aware nodes representing different granularity levels of information. Specifically, instead of only entity nodes as in Song et al. (2018); De Cao et al. (2018), we include nodes corresponding to candidates, documents and entities. In addition, following the success of Coarse-grain Fine-grain Coattention (CFC) network Zhong et al. (2019), we apply both co-attention and self-attention to learn query-aware node representations of candidates, documents and entities;

  • The HDE graph enables rich information interaction among different types of nodes and thus facilitates accurate reasoning. Different types of nodes are connected with different types of edges to highlight the various structural information present among query, documents and candidates.

Through ablation studies, we show the effectiveness of our proposed HDE graph for the multi-hop multi-document RC task. Evaluated on the blind test set of WikiHop, our proposed end-to-end trained single neural model beats the current published state-of-the-art results in Zhong et al. (2019) and is the 2nd best model on the WikiHop leaderboard. Meanwhile, our ensemble model ranks 1st on the WikiHop leaderboard and surpasses the human performance (as reported in Welbl et al. (2018)) on this data set by 0.2% (as of May 30th, 2019). This is achieved without using pretrained contextual ELMo embeddings Peters et al. (2018).

2 Related Work

The study presented in this paper is directly related to existing research on multi-hop reading comprehension across multiple documents Dhingra et al. (2018); Song et al. (2018); De Cao et al. (2018); Zhong et al. (2019); Kundu et al. (2018). The method presented in this paper is similar to previous studies using GNN for multi-hop reasoning Song et al. (2018); De Cao et al. (2018). Our novelty is that we propose to use a heterogeneous graph instead of a graph with single type of nodes to incorporate different granularity levels of information. The co-attention and self-attention based encoding of multi-level information presented in each input is also inspired by the CFC model Zhong et al. (2019) because they show the effectiveness of attention mechanisms. Our model is very different from the other two studies Dhingra et al. (2018); Kundu et al. (2018): these two studies both explicitly score the possible reasoning paths with extra NER or coreference resolution systems while our method does not require these modules and we do multi-hop reasoning over graphs. Besides these studies, our work is also related to the following research directions.

Multi-hop RC: Several data sets in the literature require reasoning in multiple steps, for example bAbI Weston et al. (2015), MultiRC Khashabi et al. (2018) and OpenBookQA Mihaylov et al. (2018). Many systems have been proposed to solve the multi-hop RC problem on these data sets Sun et al. (2018); Wu et al. (2019). However, these data sets require multi-hop reasoning over multiple sentences or multiple pieces of common knowledge, while the problem we want to solve in this paper requires collecting evidence across multiple documents.

GNN for NLP: Recently, there has been a considerable amount of interest in applying GNN to NLP tasks, and great success has been achieved. For example, in neural machine translation, GNN has been employed to integrate syntactic and semantic information into encoders Bastings et al. (2017); Marcheggiani et al. (2018); Zhang et al. (2018) applied GNN to relation extraction over pruned dependency trees; the study by Yao et al. (2018) employed GNN over a heterogeneous graph to do text classification, which inspires our idea of the HDE graph; Liu et al. (2018) proposed a new contextualized neural network for sequence learning by leveraging various types of non-local contextual information in the form of information passing over GNN. These studies are related to our work in the sense that we both use GNN to improve the information interaction over long context or across documents.

3 Methodology

In this section, we describe different modules of the proposed Heterogeneous Document-Entity (HDE) graph-based multi-hop RC model. The overall system diagram is shown in Figure 2. Our model can be roughly categorized into three parts: initializing HDE graph nodes with co-attention and self-attention based context encoding, reasoning over HDE graph with GNN based message passing algorithms and score accumulation from updated HDE graph nodes representations.

3.1 Context encoding

Given a query $q$ with the form of $(s, r, ?)$, which represents subject, relation and unknown object respectively, a set of support documents $\mathcal{S}_q$ and a set of candidates $\mathcal{C}_q$, the task is to predict the correct answer $a^*$ to the query. To encode the information included in the text of the query, candidates and support documents, we use a pretrained embedding matrix Pennington et al. (2014) to convert word sequences to sequences of vectors. Let $X_q \in \mathbb{R}^{l_q \times d}$, $X_s^i \in \mathbb{R}^{l_s^i \times d}$ and $X_c^j \in \mathbb{R}^{l_c^j \times d}$ represent the embedding matrices of the query, the $i$-th supporting document and the $j$-th candidate of a sample, where $l_q$, $l_s^i$ and $l_c^j$ are the numbers of words in the query, the $i$-th supporting document and the $j$-th candidate respectively, and $d$ is the dimension of the word embedding. We use bidirectional recurrent neural networks (RNN) with gated recurrent units (GRU) Cho et al. (2014) to encode the contextual information present in the query, supporting documents and candidates separately. The outputs of the query, document and candidate encoders are $H_q \in \mathbb{R}^{l_q \times h}$, $H_s^i \in \mathbb{R}^{l_s^i \times h}$ and $H_c^j \in \mathbb{R}^{l_c^j \times h}$, where $h$ denotes the output dimension of the RNN encoders.

Figure 2: System diagram. $S$ and $C$ are the number of support documents and candidates respectively. We use yellow nodes to represent query-aware candidate representation, blue nodes to represent extracted query-aware entity representation and green nodes to represent query-aware document representation.

Entity extraction: entities play an important role in bridging multiple documents and connecting a query and the corresponding answer, as shown in Figure 1. For example, the entity “get ready” in the query and two entities “Mase” and “Sean Combs” co-occur in the 2nd support document, and both “Mase” and “Sean Combs” can lead to the correct answer “bad boy records”. Based on this observation, we propose to extract mentions of both the query subject $s$ and candidates from documents. We will show later that including mentions of the query subject improves performance. We use a simple exact match strategy De Cao et al. (2018); Zhong et al. (2019) to find the locations of mentions of the query subject and candidates, i.e. we need the start and end positions of each mention. Each mention is treated as an entity. Then, representations of entities can be taken out from the $i$-th document encoding $H_s^i$. We denote an entity’s representation as $E \in \mathbb{R}^{l_e \times h}$, where $l_e$ is the length of the entity.
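The exact match strategy can be illustrated with a short sketch. This is not the authors' code: the function name `find_mentions`, the lowercasing step and the exclusive end index are assumptions made for the example, and tokenization is reduced to whitespace splitting.

```python
def find_mentions(doc_tokens, phrase_tokens):
    """Return (start, end) token spans, end exclusive, where the phrase occurs.

    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    n, m = len(doc_tokens), len(phrase_tokens)
    if m == 0:
        return []
    doc = [t.lower() for t in doc_tokens]
    phrase = [t.lower() for t in phrase_tokens]
    return [(i, i + m) for i in range(n - m + 1) if doc[i:i + m] == phrase]


# Toy document based on the Figure 1 example, split on whitespace.
doc = "Get Ready was the only single released from Mase 's second album".split()
subject_spans = find_mentions(doc, ["get", "ready"])
mase_spans = find_mentions(doc, ["mase"])
print(subject_spans)  # [(0, 2)]
print(mase_spans)     # [(8, 9)]
```

Each returned span can then be used to slice the corresponding rows out of the encoded document to obtain the entity representation.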

Co-attention: Co-attention has achieved great success for single document reading comprehension tasks Seo et al. (2016); Xiong et al. (2016), and recently was applied to multiple-hop reading comprehension Zhong et al. (2019). Co-attention enables the model to combine learned query contextual information attended by document and document contextual information attended by query, with inputs of one query and one document. We follow the implementation of co-attention in Zhong et al. (2019).

We use the co-attention between a query and a supporting document for illustration. The same operations can be applied to other documents, or between the query and extracted entities. Given the RNN-encoded sequences of the query $H_q \in \mathbb{R}^{l_q \times h}$ and a document $H_s^i \in \mathbb{R}^{l_s^i \times h}$, the affinity matrix between the query and document can be calculated as

$$A_{qs}^i = H_s^i (H_q)^\top \in \mathbb{R}^{l_s^i \times l_q} \tag{1}$$

where $\top$ denotes matrix transpose. Each entry of the matrix $A_{qs}^i$ indicates how related two words are, one from the query and one from the document. For simplification, in later context, we ignore the superscript $i$, which indicates the operation on the $i$-th document.

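Next we derive the attention context of the query and document as follows:

$$C_q = \mathrm{softmax}(A_{qs}^\top) H_s \in \mathbb{R}^{l_q \times h} \tag{2}$$

$$C_s = \mathrm{softmax}(A_{qs}) H_q \in \mathbb{R}^{l_s \times h} \tag{3}$$

where $\mathrm{softmax}(\cdot)$ denotes column-wise normalization. We further encode the co-attended document context using a bidirectional RNN $f$ with GRU:

$$D_s = f(\mathrm{softmax}(A_{qs}) C_q) \in \mathbb{R}^{l_s \times h} \tag{4}$$

The final co-attention context is the column-wise concatenation of $C_s$ and $D_s$:

$$S_{ca} = [C_s; D_s] \in \mathbb{R}^{l_s \times 2h} \tag{5}$$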
We expect $S_{ca}$ to carry query-aware contextual information of supporting documents, as shown by Zhong et al. (2019). The same co-attention module can also be applied to query and candidates, and query and entities (as shown in Figure 2), to get $C_{ca}$ and $E_{ca}$. Note that we do not do co-attention between the query and entities corresponding to the query subject because the query subject is already a part of the query. To keep the dimensionality consistent, we apply a single-layer multi-layer perceptron (MLP) with $\tanh$ activation function to increase the dimension of the query subject entities to $2h$.
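The co-attention steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names (`matmul`, `transpose`, `softmax_rows`, `coattention`) are hypothetical, the bidirectional GRU is replaced by an identity placeholder `encode`, and the softmax axis is chosen so that each word's attention weights sum to one over the words it attends to.

```python
import math


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Normalize every row so that its entries are positive and sum to one."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def coattention(h_q, h_s, encode=lambda x: x):
    """h_q: l_q x h query encoding; h_s: l_s x h document encoding.

    `encode` stands in for the bidirectional GRU (identity by default).
    Returns the l_s x 2h co-attention context.
    """
    affinity = matmul(h_s, transpose(h_q))                # l_s x l_q
    c_q = matmul(softmax_rows(transpose(affinity)), h_s)  # l_q x h
    c_s = matmul(softmax_rows(affinity), h_q)             # l_s x h
    d_s = encode(matmul(softmax_rows(affinity), c_q))     # l_s x h
    return [cs + ds for cs, ds in zip(c_s, d_s)]          # row-wise concat


h_q = [[1.0, 0.0], [0.0, 1.0]]              # 2 query words, h = 2
h_s = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]  # 3 document words
s_ca = coattention(h_q, h_s)
print(len(s_ca), len(s_ca[0]))  # 3 4
```

The output keeps one row per document word and doubles the feature dimension, which matches the shape the text gives for the co-attention context.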

Self-attentive pooling: while co-attention yields a query-aware contextual representation of documents, self-attentive pooling is designed to convert the sequential contextual representation to a fixed dimensional non-sequential feature vector by selecting important query-aware information Zhong et al. (2019). Self-attentive pooling summarizes the information present in the co-attention output by calculating a score for each word in the sequence. The scores are normalized and a weighted sum based pooling is applied to the sequence to get a single feature vector as the summarization of the input sequence. Formally, the self-attention module can be formulated as the following operations given $S_{ca}$ as input:

$$a_s = \mathrm{softmax}(\mathrm{MLP}(S_{ca})) \in \mathbb{R}^{l_s \times 1} \tag{6}$$

$$s_{sa} = a_s^\top S_{ca} \in \mathbb{R}^{1 \times 2h} \tag{7}$$

where $\mathrm{MLP}(\cdot)$ is a two-layer MLP with $\tanh$ as activation function. Similarly, after self-attentive pooling, we can get $c_{sa}$ and $e_{sa}$ for each candidate and entity.
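A small sketch may help make the pooling step concrete. It assumes a generic scoring function in place of the two-layer MLP (here simply the sum of the vector's entries, chosen only for illustration); the name `self_attentive_pool` is hypothetical.

```python
import math


def self_attentive_pool(seq, score):
    """Pool a sequence of vectors into one vector.

    seq: list of l vectors of equal length; score: maps a vector to a scalar
    (a stand-in for the two-layer MLP described in the text).
    Returns the pooled vector and the normalized attention weights.
    """
    raw = [score(v) for v in seq]
    m = max(raw)
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(seq[0])
    pooled = [sum(w * v[k] for w, v in zip(weights, seq)) for k in range(dim)]
    return pooled, weights


seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = self_attentive_pool(seq, score=sum)
print(pooled, weights)
```

Whatever the sequence length, the result is a single vector with the same feature dimension as the input rows, which is what allows it to initialize a graph node.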

Our context encoding module differs from the one used in Zhong et al. (2019) in the following aspects: 1) we compute the co-attention between query and candidates, which is not present in the CFC model; 2) for entity word sequences, we first calculate co-attention with the query and then use self-attention to summarize each entity word sequence, while Zhong et al. (2019) first apply self-attention on entity word sequences to get a sequence of entity vectors in each document and then apply co-attention with the query.

3.2 Reasoning over HDE graph

Graph building: let an HDE graph be denoted as $G = \{V, \mathcal{E}\}$, where $V$ stands for node representations and $\mathcal{E}$ represents edges between nodes. In our proposed HDE graph based model, we treat each document, candidate and entity extracted from documents as nodes in the HDE graph, i.e., each document (candidate/entity) corresponds to one node in the HDE graph. These nodes represent different granularity levels of query-aware information: document nodes encode document-level global information regarding the query; candidate nodes encode query-aware information in candidates; entity nodes encode query-aware information in a specific document context or the query subject. The HDE graph is built to enable graph-based reasoning. It exploits useful structural information among query, support documents and candidates. We expect our HDE graph to support multi-hop reasoning that locates the answer nodes or entity nodes of answers given a query.

Self-attentive pooling generates vector representations for each candidate, document and entity, which can be directly employed to initialize the node representations $V$. For edge connections $\mathcal{E}$, we define the following types of edges between pairs of nodes to encode various structural information in the HDE graph:

  1. an edge between a document node and a candidate node if the candidate appears in the document at least once.

  2. an edge between a document node and an entity node if the entity is extracted from the document.

  3. an edge between a candidate node and an entity node if the entity is a mention of the candidate.

  4. an edge between two entity nodes if they are extracted from the same document.

  5. an edge between two entity nodes if they are mentions of the same candidate or query subject and they are extracted from different documents.

  6. all candidate nodes connect with each other.

  7. entity nodes that do not meet previous conditions are connected.

Type 4, 5, 7 edges are also employed in De Cao et al. (2018) where the authors show the effectiveness of those different types of edges. Similarly, we treat these different edges differently to make information propagate differently over these seven different types of edges. More details will be introduced in next paragraph about message passing over the HDE graph. In Figure 3, we illustrate a toy example of the proposed HDE graph.

Figure 3: A toy example of the HDE graph. The dash-dot lines connecting documents (green nodes) and candidates (yellow nodes) correspond to type 1 edges. The dashed lines connecting documents and entities (blue nodes) correspond to type 2 edges. The square-dot lines connecting entities and candidates correspond to type 3 edges. The red solid line connecting two entities corresponds to a type 4 edge. The purple solid line corresponds to a type 5 edge. The black solid lines connecting two candidates correspond to type 6 edges. For clearer visualization, we omit type 7 edges in this figure.
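The seven edge types listed above can be sketched as a small graph-construction routine. The data layout is an assumption made for illustration: each entity is a `(doc_index, target)` pair, where `target` is a candidate index or the string `"subject"` for a query-subject mention, and a candidate is taken to appear in a document exactly when one of its mentions was extracted from it. The function name `build_hde_edges` is hypothetical.

```python
from itertools import combinations


def build_hde_edges(num_candidates, entities):
    """Return a dict mapping edge type (1..7) to a set of node pairs.

    Nodes are tagged tuples: ("doc", i), ("cand", j) and ("ent", k).
    """
    edges = {t: set() for t in range(1, 8)}
    for k, (doc, target) in enumerate(entities):
        edges[2].add((("doc", doc), ("ent", k)))              # type 2
        if target != "subject":
            edges[1].add((("doc", doc), ("cand", target)))    # type 1
            edges[3].add((("cand", target), ("ent", k)))      # type 3
    for (a, (doc_a, tgt_a)), (b, (doc_b, tgt_b)) in combinations(enumerate(entities), 2):
        pair = (("ent", a), ("ent", b))
        if doc_a == doc_b:
            edges[4].add(pair)      # same document
        elif tgt_a == tgt_b:
            edges[5].add(pair)      # same candidate/subject, different documents
        else:
            edges[7].add(pair)      # all remaining entity pairs
    for a, b in combinations(range(num_candidates), 2):
        edges[6].add((("cand", a), ("cand", b)))              # type 6
    return edges


# Two candidates, four extracted mentions across three documents.
entities = [(0, 0), (1, "subject"), (2, 0), (2, 1)]
edges = build_hde_edges(num_candidates=2, entities=entities)
print({t: len(pairs) for t, pairs in edges.items()})
# {1: 3, 2: 4, 3: 3, 4: 1, 5: 1, 6: 1, 7: 4}
```

Keeping one edge set per type makes it straightforward to apply a separate transformation to each type during message passing.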

Message passing: we define how information propagates over the graph in order to do reasoning over the HDE graph. Different variants of GNN have different implementations of message passing strategies. In this study, we follow the message passing design in GCN Kipf and Welling (2016); De Cao et al. (2018) as it gives good performance on the validation set compared to other strategies Veličković et al. (2017); Xu et al. (2018). Generally, message passing over graphs can be achieved in two steps: aggregation and combination Hamilton et al. (2017), and this process can be conducted multiple times (usually referred to as layers or hops in the GNN literature). Here, we give the aggregation and combination formulation of the message passing over the proposed HDE graph. The first step aggregates information from neighbors of each node, which can be formulated as

$$z_i^k = \sum_{r \in \mathcal{R}} \frac{1}{|N_i^r|} \sum_{j \in N_i^r} f_r(h_j^k) \tag{8}$$

where $\mathcal{R}$ is the set of all edge types, $N_i^r$ is the set of neighbors of node $i$ with edge type $r$ and $h_j^k$ is the node representation of node $j$ in layer $k$ ($h_j^0$ is initialized with the self-attention outputs). $|N_i^r|$ indicates the size of the neighboring set. $f_r$ defines a transformation on the neighboring node representations, and can be implemented with an MLP. $z_i^k$ represents the aggregated information in layer $k$ for node $i$, and can be combined with the transformed node representation:

$$u_i^k = f_s(h_i^k) + z_i^k \tag{9}$$

where $f_s$ can also be implemented with an MLP.

It has been shown that GNN suffers from the smoothing problem if the number of layers is large Kipf and Welling (2016). The smoothing problem can result in similar node representations and a loss of discriminative ability when doing classification on nodes. To tackle this problem, we add a gating mechanism Gilmer et al. (2017) on the combined information $u_i^k$:

$$g_i^k = \mathrm{sigmoid}(f_g([u_i^k; h_i^k])) \tag{10}$$

$$h_i^{k+1} = \tanh(u_i^k) \odot g_i^k + h_i^k \odot (1 - g_i^k) \tag{11}$$

$\mathrm{sigmoid}(\cdot)$ denotes the sigmoid function on the transformed concatenation of $u_i^k$ and $h_i^k$. $g_i^k$ is then applied to the combined information to control the amount of information taken from the computed update or from the original node representation. $\tanh(\cdot)$ functions as a non-linear activation function. $\odot$ denotes element-wise multiplication.

In this study, $f_r$, $f_s$ and $f_g$ are all implemented with single-layer MLPs, the output dimension of which is $2h$. After $K$ rounds of message passing, all candidate, document and entity nodes will have their final updated node representation.
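One round of the aggregation, combination and gating steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the transformations are passed in as plain Python callables instead of learned MLPs, and the names `gated_layer`, `f_r`, `f_s` and `f_g` as function arguments are choices made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_layer(h, neighbors, f_r, f_s, f_g):
    """Apply one message passing step with gating.

    h: dict node -> vector (list of floats)
    neighbors: dict edge_type -> dict node -> list of neighbor nodes
    f_r: dict edge_type -> callable(vector) -> vector
    f_s: callable(vector) -> vector
    f_g: callable(vector of length 2d) -> vector of length d
    """
    new_h = {}
    for i, h_i in h.items():
        z = [0.0] * len(h_i)
        for r, adj in neighbors.items():
            nbrs = adj.get(i, [])
            for j in nbrs:
                msg = f_r[r](h[j])
                # Mean over the neighbors of this edge type, summed over types.
                z = [zk + mk / len(nbrs) for zk, mk in zip(z, msg)]
        u = [a + b for a, b in zip(f_s(h_i), z)]
        g = [sigmoid(v) for v in f_g(u + h_i)]  # gate on the concatenation
        new_h[i] = [math.tanh(uk) * gk + hk * (1.0 - gk)
                    for uk, gk, hk in zip(u, g, h_i)]
    return new_h


h = {"doc": [1.0, 0.0], "cand": [0.0, 1.0], "iso": [0.0, 0.0]}
neighbors = {1: {"doc": ["cand"], "cand": ["doc"]}}
identity = lambda v: list(v)
zero_gate = lambda v: [0.0] * (len(v) // 2)   # sigmoid(0) = 0.5 everywhere
new_h = gated_layer(h, neighbors, {1: identity}, identity, zero_gate)
print(new_h)
```

With the gate fixed at 0.5, each updated vector is an equal blend of the squashed update and the previous representation, and the isolated node with a zero vector stays at zero.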

3.3 Score accumulation

The final node representations of candidate nodes and of entity nodes corresponding to mentions of candidates are used to calculate classification scores. This procedure can be formulated as

$$a = f_C(H^C) + \mathrm{ACC}_{\max}(f_E(H^E)) \tag{12}$$

where $H^C \in \mathbb{R}^{C \times 2h}$ is the node representation of all candidate nodes and $C$ is the number of candidates. $H^E \in \mathbb{R}^{M \times 2h}$ is the node representation of all entity nodes that correspond to candidates, and $M$ is the number of those nodes. $\mathrm{ACC}_{\max}$ is an operation that takes the maximum over scores of entities that belong to the same candidate. $f_C$ and $f_E$ are implemented with two-layer MLPs with $\tanh$ activation function. The hidden layer size is half of the input dimension, and the output dimension is 1. We directly sum the scores from candidate nodes and entity nodes as the final scores over multiple candidates. Thus, the output score vector $a \in \mathbb{R}^{C \times 1}$ gives a distribution over all candidates. Since the task is multi-class classification, we use cross-entropy loss as the training objective, which takes $a$ and the labels as input.
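The accumulation step can be sketched with precomputed per-node scores. The scores below are made-up inputs standing in for the outputs of the two scoring MLPs; the function names and the choice to add nothing for candidates without any entity mention are assumptions of this sketch.

```python
import math


def accumulate_scores(cand_scores, entity_scores, entity_to_cand):
    """Add to each candidate score the maximum score among its entity mentions.

    entity_to_cand[k] is the candidate index that entity k is a mention of.
    Candidates without mentions keep their candidate-node score unchanged.
    """
    total = list(cand_scores)
    best = {}
    for s, c in zip(entity_scores, entity_to_cand):
        best[c] = max(s, best.get(c, -math.inf))
    for c, s in best.items():
        total[c] += s
    return total


def cross_entropy(scores, label):
    """Negative log-softmax of the labeled candidate."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]


cand_scores = [1.0, 0.5, -0.2]
entity_scores = [0.3, 2.0, 0.1]
entity_to_cand = [0, 0, 1]
total = accumulate_scores(cand_scores, entity_scores, entity_to_cand)
prediction = max(range(len(total)), key=total.__getitem__)
print(total, prediction)
```

In this toy case the first candidate wins because its best mention carries a high score, even though it has another mention with a low score.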

4 Experiments

4.1 Dataset

We use WikiHop Welbl et al. (2018) to validate the effectiveness of our proposed model. The queries of WikiHop are constructed with entities and relations from WikiData, while supporting documents are from WikiReading Hewlett et al. (2016). A bipartite graph connecting entities and documents is first built and the answer for each query is located by traversal on this graph. Candidates that are type-consistent with the answer and share the same relation in the query with the answer are included, resulting in a set of candidates. Thus, WikiHop is a multi-choice style reading comprehension data set. In total, there are about 43K samples in the training set, 5K samples in the development set and 2.5K samples in the test set. The test set is not provided and can only be evaluated on blindly. The task is to predict the correct answer given a query and multiple supporting documents. In the experiment, we train our proposed model on all training samples in WikiHop, and tune model hyperparameters on all samples in the development set. We only evaluate our proposed model on the unmasked version of WikiHop.


4.2 Experimental settings

Queries, support documents and candidates are tokenized into word sequences with NLTK Loper and Bird (2002). We empirically split the query into relation and subject entity. An exact matching strategy is employed to locate mentions of both the subject entity and candidates in supporting documents. 300-dimensional GloVe embeddings (with 840B tokens and 2.2M vocabulary size) Pennington et al. (2014) and 100-dimensional character n-gram embeddings Hashimoto et al. (2017) are used to convert words into 400-dimensional vector representations. Out-of-vocabulary words are initialized with random vectors. The embedding matrices are not updated during training. The proposed model is implemented with PyTorch Paszke et al. (2017). More details about experimental and hyperparameter settings can be found in the supplementary materials. The performance on the development set is measured after each training epoch, and the model with the highest accuracy is saved and submitted to be evaluated on the blind test set. We will make our code publicly available after the review process.

We also prepared an ensemble model consisting of 15 models with different hyperparameter settings and random seeds. We used the simple majority voting strategy to fuse the candidate predictions of different models together.
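Majority voting over model predictions can be sketched in a few lines. The function name `majority_vote` is hypothetical, and the tie-breaking rule (the answer that reaches the top count and was seen first wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction among the ensemble members.

    Ties go to the prediction that appears first in the list.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Predictions of three hypothetical ensemble members for one query.
votes = ["bad boy records", "record label", "bad boy records"]
winner = majority_vote(votes)
print(winner)  # bad boy records
```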

| Model | Dev accuracy (%) | Test accuracy (%) |
|---|---|---|
| Single models | | |
| BiDAF | - | 42.9 |
| Coref-GRU Dhingra et al. (2018) | 56.0 | 59.3 |
| MHQA-GRN Song et al. (2018) | 62.8 | 65.4 |
| Entity-GCN De Cao et al. (2018) | 64.8 | 67.6 |
| CFC Zhong et al. (2019) | 66.4 | 70.6 |
| Kundu et al. (2018) | 67.1 | - |
| DynSAN* | - | 71.4 |
| Proposed | 68.1 | 70.9 |
| Ensemble models | | |
| Entity-GCN De Cao et al. (2018) | 68.5 | 71.2 |
| DynSAN* | - | 73.8 |
| Proposed | 70.9 | 74.3 |

Table 1: Performance comparison among different models on WikiHop development and test set. The results of “BiDAF” are presented in the paper by Welbl et al. (2018). Models annotated with “*” are unpublished but available on WikiHop leaderboard. “-” indicates unavailable numbers.

| Model | Accuracy (%) | Drop (%) |
|---|---|---|
| Full model | 68.1 | - |
| - HDE graph | 65.5 | 2.6 |
| - different edge types | 66.7 | 1.4 |
| - candidate nodes scores | 67.1 | 1.0 |
| - entity nodes scores | 66.6 | 1.5 |
| - candidate nodes | 66.2 | 1.9 |
| - document nodes | 67.6 | 0.5 |
| - entity nodes | 63.6 | 4.5 |

Table 2: Ablation results on the WikiHop dev set.

4.3 Results

In Table 1, we show the results of our proposed HDE graph based model on both the development and test sets and compare it with previously published results. Our proposed HDE graph based model improves the published state-of-the-art accuracy on the development set from 67.1% Kundu et al. (2018) to 68.1%, and on the blind test set from 70.6% Zhong et al. (2019) to 70.9%. Compared to the best single model “DynSAN” (unpublished) on the WikiHop leaderboard, our proposed model is still 0.5% worse. Compared to two previous studies using GNN for multi-hop reading comprehension Song et al. (2018); De Cao et al. (2018), our model surpasses them by a large margin even though we do not use the stronger pre-trained contextual embedding ELMo Peters et al. (2018).

For the ensemble models, our proposed system achieves the state-of-the-art performance, which is also 0.2% higher than the reported human performance Welbl et al. (2018). Even though our single model is a little worse than the “DynSAN”, our ensemble model is better than both the ensembled “DynSAN” and the ensembled “Entity-GCN”.

| Model | Single-follow | Multi-follow |
|---|---|---|
| With HDE graph | 67.8 | 71.0 |
| Without HDE graph | 66.7 | 67.0 |

Table 3: Accuracy (%) comparison under different types of samples.

4.4 Ablation studies

In order to better understand the contribution of different modules to the performance, we conduct several ablation studies on the development set of WikiHop.

If we remove the proposed HDE graph and directly use the representations of candidates and entities corresponding to mentions of candidates (equation 7) for score accumulation, the accuracy on WikiHop development set drops 2.6% absolutely. This proves the efficacy of the proposed HDE graph on multi-hop reasoning across multiple documents.

If we treat all edge types equally without using different GNN parameters for different edge types (equation 9), the accuracy drops 1.4%, which indicates that the different information encoded by different types of edges is also important to retain good performance. If only scores of entity nodes (right part of equation 12) are considered in score accumulation, the accuracy on the dev set degrades by 1.0%; if only scores of candidate nodes (left part of equation 12) are considered, the accuracy degrades by 1.5%. This means that the scores on entity nodes contribute more to the classification, which is reasonable because entities carry context information in the document while candidates do not.

We also investigate the effect of removing different types of nodes. Note that removing nodes is not the same as removing scores from candidate/entity nodes. Removing scores means we do not use the scores on these nodes during score accumulation, but the nodes still exist during message passing on the HDE graph. Removing one type of nodes, in contrast, means the nodes and corresponding edges do not exist in the HDE graph. The ablation shows that removing entity nodes results in the largest degradation of performance while removing document nodes results in the least degradation. This finding is consistent with the study by De Cao et al. (2018), where they emphasize the importance of entities in multi-hop reasoning. The small contribution of document nodes is probably caused by too much information loss during self-attentive pooling over long sequences. Better ways are needed to encode document information into the graph. More ablation studies are included in the supplementary materials due to space constraints.

Figure 4: Plots between number of support documents (x-axis) and number of examples (left y-axis), and between number of support documents and accuracy (right y-axis).

Figure 5: Plots between number of candidates (x-axis) and number of examples (left y-axis), and between number of candidates and accuracy (right y-axis).

4.5 Result analysis

To investigate how the HDE graph helps multi-hop reasoning, we conduct experiments on the WikiHop development set where we discard the HDE graph and only use the candidate and entity representations output by self-attention. In Table 3, “Single-follow” (2069 samples in the dev set) means a single document is enough to answer the query, while “Multi-follow” (2601 samples) means multiple documents are needed. This information is provided in Welbl et al. (2018). We observe in Table 3 that the performance is consistently better for “with HDE graph” in both cases. In the “Single-follow” case the absolute accuracy improvement is 1.1%, while a significant 4.0% improvement is achieved in the “Multi-follow” case, which has even more samples than the “Single-follow” case. This shows that the proposed HDE graph is good at reasoning over multiple documents.

We also investigate how our model performs w.r.t. the number of support documents and the number of candidates given an input sample. In Figure 4, the blue line with square markers shows the number of support documents in one sample (x-axis) and the corresponding frequencies in the development set (y-axis). The orange line with diamond markers shows the change of accuracy with the increasing number of support documents. We only include numbers of support documents that appear more than 50 times in the development set. For example, there are about 300 samples with 5 support documents and the accuracy of our model on these 300 samples is about 80%. Overall, we find the accuracy decreases with the increasing number of support documents. This is reasonable because more documents possibly mean more entities and a bigger graph, which is more challenging for reasoning. Figure 5 indicates a similar trend (when the number of candidates is less than 20) with the increasing number of candidates, which we believe is partly caused by the larger HDE graph. Also, more candidates cause more confusion in the selection.

5 Conclusion

We propose a new GNN-based method for multi-hop RC across multiple documents. We introduce the HDE graph, a heterogeneous graph for multiple-hop reasoning over nodes representing different granularity levels of information. We use co-attention and self-attention to encode candidates, documents, entities of mentions of candidates and query subjects into query-aware representations, which are then employed to initialize graph node representations. Evaluated on WikiHop, our end-to-end trained single neural model delivers competitive results while our ensemble model achieves the state-of-the-art performance. In the future, we would like to investigate explainable GNN for this task, such as explicit reasoning path in Kundu et al. (2018), and work on other data sets such as HotpotQA.

6 Acknowledgements

We would like to thank Johannes Welbl from University College London for running evaluation on our submitted model.