Machine reading comprehension has been a popular topic in recent years, and a variety of models have been proposed for it, such as BiDAF, the Reinforced Mnemonic Reader, and ReasoNet. However, most existing work focuses on finding the evidence and answer within a single document.
In reality, however, many questions can only be answered by reasoning across multiple documents. Table 1 shows a multi-choice reading comprehension example from the WikiHop dataset. In this example, the question 'what is the place of death of alexander john ellis?' can only be answered after extracting and integrating the facts 'Alexander John Ellis is buried in Kensal Green Cemetery' and 'Kensal Green Cemetery is located in Kensington' from different documents, which makes the task considerably more challenging.
|Question: place of death, alexander john ellis, ?|
|Support doc1: Alexander John Ellis, was an English mathematician … is buried in Kensal Green Cemetery.|
|Support doc2: The areas of College Park and Kensal Green Cemetery are located in the London boroughs of Hammersmith & Fulham and Kensington & Chelsea, respectively.|
|Candidates: college park, france, Kensington, London|
The main challenge is that the evidence is scattered across different documents, which also contain a great deal of noise. We need to extract this evidence from multiple documents, and it is difficult to capture the dependencies among the pieces of evidence for reasoning. Many works use graph convolutional networks (GCNs) to address this problem, such as Entity-GCN, BAG, and HDE. They transform the documents into an entity graph and then feed the entity graph into a GCN to simulate the process of multi-hop reasoning.
However, these GCN-based approaches have two disadvantages. First, they generate entities only from the question and the candidate answers, missing much of the key information needed for multi-hop reasoning. For instance, in the example in Table 1, the entity 'Kensal Green Cemetery' is an important clue for answering the question, but the above approaches ignore it. Second, traditional GCNs update each central node based only on the aggregated information of its adjacent nodes and use this to simulate reasoning. The question information is thus not fully utilized, and a lot of irrelevant information propagates across documents during multi-hop reasoning.
In this paper, we propose a novel approach to address these problems. We introduce a path-based reasoning graph for multiple documents. Compared to traditional graphs, the path-based reasoning graph contains multiple reasoning paths from the question to the candidate answers, combining the ideas of GCN-based and path-based approaches. Concretely, we construct the path-based reasoning graph by extracting reasoning paths (e.g., Alexander John Ellis → Kensal Green Cemetery → Kensington) from the supporting documents and adding the reasoning nodes on these paths (e.g., Kensal Green Cemetery) to the entity graph. We then apply a Gated-RGCN to learn the representation of nodes. Compared to plain GCNs, Gated-RGCN uses attention and a question-aware gating mechanism to regulate how much information propagates across documents and to inject question information during reasoning, which is closer to the human reasoning process.
Our contributions can be summarized as follows:
We propose a path-based reasoning graph, which introduces information about reasoning paths into the graph;
We propose Gated-RGCN to optimize the convolution formula of RGCN, which is more suitable for multi-hop reading comprehension;
We evaluate our approach on the WikiHop dataset, and it achieves new state-of-the-art accuracy; in particular, our ensemble model surpasses the reported human performance.
2 Related Work
Recently, several categories of approaches have been proposed to tackle multi-hop reading comprehension across documents, including GCN-based approaches (Entity-GCN, BAG, HDE, MHQA-GRN, DFGN), memory-based approaches (Coref-GRU, EPAr), path-based approaches (PathNet), and attention-based approaches (CFC, DynSAN).
GCN-based approaches organize the supporting documents into a graph and then employ graph-neural-network-based message passing to perform multi-step reasoning. For example, Entity-GCN constructs an entity graph from the supporting documents, where nodes are mentions of the subject entity and the candidates, and edges are relations between mentions. BAG applies bi-directional attention between the entity graph and the query after GCN reasoning over the entity graph. HDE constructs a heterogeneous graph whose nodes correspond to candidates, documents, and entities. MHQA-GRN constructs a graph where each node is either an entity mention or a pronoun representing an entity, and edges fall into three types: same-typed, window-typed, and coreference-typed. DFGN proposes a dynamic fusion reasoning block based on graph neural networks. In contrast, our work proposes Gated-RGCN to optimize the graph convolution operation, which better regulates the usefulness of information propagating across documents and adds question information during reasoning.
Memory-based approaches try to aggregate evidence for each entity from multiple documents through a memory network. For example, Coref-GRU aggregates information from multiple mentions of the same entity by incorporating coreference into the GRU layers. EPAr uses a hierarchical memory network to construct a 'reasoning tree' containing a set of root-to-leaf reasoning chains, and then merges the evidence from all chains to make the final prediction.
PathNet is a typical path-based approach for multi-hop reading comprehension. It extracts paths from the documents for each candidate given a question, and then predicts the answer by scoring these paths. Our work brings the idea of path-based approaches into a GCN-based approach, which is better suited to multi-hop reasoning.
CFC and DynSAN are two typical attention-based approaches. CFC applies co-attention and self-attention to learn query-aware representations of candidates, documents, and entities, while DynSAN proposes a dynamic self-attention architecture to determine which tokens are important for constructing intra-passage or cross-passage token-level semantic representations. In our work, we employ an attention mechanism between the graph and the question at each layer of Gated-RGCN.
Meanwhile, in order to promote research on multi-hop QA, several datasets have been designed, including WikiHop, OpenBookQA, NarrativeQA, MultiRC, and HotpotQA. For example, WikiHop is a multi-choice reading comprehension dataset in which the task is to select the correct object entity from a set of candidates, given a query and a set of supporting documents. OpenBookQA focuses on multi-hop QA that requires a corpus of provided science facts (the open book) together with broad external common knowledge.
In addition, knowledge completion over knowledge graphs (KGs) and KG-based query answering are also related to our task, since both require multi-hop reasoning, i.e., finding the reasoning path between two entities in a KG. For example, MINERVA formulates multi-hop reasoning as a sequential decision problem and uses the REINFORCE algorithm to train an end-to-end model for multi-hop KG query answering. Meta-KGR also uses reinforcement learning to learn a relation-specific multi-hop reasoning agent that searches for reasoning paths and target entities, and further applies meta-learning to perform multi-hop reasoning over few-shot relations of knowledge graphs.
In this section, we first formulate the task of multi-hop reading comprehension across documents, and then elaborate our approach in detail.
3.1 Task Formulation
The task of multi-hop reading comprehension across documents can be formally defined as follows: given a question $q = (q_1, \dots, q_m)$ and a set of supporting documents $S$, the task is to find the correct answer from a set of answer candidates $C = \{c_1, \dots, c_n\}$, where $m$ is the number of words in the question and $n$ is the number of candidates in $C$.
In the WikiHop dataset, the question is given as a tuple, e.g., (place of death, alexander john ellis, ?) in Table 1, consisting of a subject entity $s$, a relation $r$ between $s$ and the unknown tail entity, and a placeholder for the answer. In our example, the question means 'where did Alexander John Ellis die?', and the answer candidates are {college park, france, Kensington, London}. Given the supporting documents, e.g., supporting doc1 and doc2 in Table 1, we should identify the correct answer from the candidates by reasoning across these documents.
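For concreteness, a WikiHop-style sample can be represented as below. The field names follow the dataset's JSON release, but the `split_query` helper is an illustrative assumption of this sketch, not part of the dataset's tooling:

```python
# A WikiHop-style sample; in the released JSON, the query string is the
# underscore-joined relation followed by the subject entity.
sample = {
    "query": "place_of_death alexander john ellis",
    "supports": [
        "Alexander John Ellis, was an English mathematician ... "
        "is buried in Kensal Green Cemetery.",
        "The areas of College Park and Kensal Green Cemetery are located in "
        "the London boroughs of Hammersmith & Fulham and Kensington & "
        "Chelsea, respectively.",
    ],
    "candidates": ["college park", "france", "kensington", "london"],
    "answer": "kensington",
}

def split_query(query: str) -> tuple[str, str]:
    """Split a WikiHop query into (relation, subject entity)."""
    relation, _, subject = query.partition(" ")
    return relation, subject

relation, subject = split_query(sample["query"])
```

The task is then to pick `sample["answer"]` out of `sample["candidates"]` using only the texts in `sample["supports"]`.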
As shown in Figure 1, our approach consists of three main components: graph construction, reasoning with Gated-RGCN, and the output layer. In the following sections, we elaborate on each component in detail.
3.2 Graph Construction
We construct an entity graph following Entity-GCN, which extracts all mentions of the subject entity and the candidate answers in the supporting documents as nodes of the graph. In addition, inspired by human reasoning, reasoning paths from the subject entity of the question to the candidates can be helpful for reasoning across documents, so we also add the reasoning entities on these paths to the entity graph. In our example, the path alexander john ellis → Kensal Green Cemetery → Kensington extracted from the documents indicates that the candidate Kensington may be the correct answer to the question (place of death, alexander john ellis, ?). Thus, we treat Kensal Green Cemetery as a reasoning entity and add it to the entity graph.
Formally, for a given question, we extract paths from the subject entity $s$ to each candidate $c$ from the supporting documents, e.g., $s \rightarrow e_1 \rightarrow c$, where $e_1$ is a reasoning entity. To find a path, we first locate a document that contains a mention of the subject entity $s$, and then collect all named entities and noun phrases that appear in the same sentence as $s$. In our example, Kensal Green Cemetery and alexander john ellis appear in the same sentence of supporting doc1, so we extract Kensal Green Cemetery as a reasoning entity. Next, we find another document that contains any of the reasoning entities; in our example, supporting doc2 contains the reasoning entity Kensal Green Cemetery. Finally, we check whether the reasoning entity appears in the same sentence as one of the candidates. If so, we add the path to the entity graph. For example, Kensal Green Cemetery and Kensington appear in the same sentence of supporting doc2, so the path alexander john ellis → Kensal Green Cemetery → Kensington is added to the entity graph.
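The two-step extraction above can be sketched as follows. The pre-computed sentence-to-entities map and all names here are illustrative assumptions, not the paper's implementation:

```python
from itertools import product

def extract_paths(subject, candidates, entities_per_sentence):
    """Sketch of two-hop path extraction: subject -> reasoning entity -> candidate.

    `entities_per_sentence` maps (doc_id, sent_id) to the set of entity
    strings found in that sentence by NER / noun-phrase chunking.
    """
    # Step 1: reasoning entities co-occur with the subject in some sentence.
    reasoning = set()
    for ents in entities_per_sentence.values():
        if subject in ents:
            reasoning |= ents - {subject}
    # Step 2: a candidate co-occurs with a reasoning entity in another sentence.
    paths = []
    for ents in entities_per_sentence.values():
        if subject in ents:
            continue  # look for the second hop in other sentences
        for e, c in product(reasoning & ents, set(candidates) & ents):
            paths.append((subject, e, c))
    return paths
```

On the Table 1 example, this yields the path (alexander john ellis, kensal green cemetery, kensington) along with spurious paths to other co-occurring candidates, which the graph reasoning must later filter out.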
Since each entity has a different context in different documents, we use individual mentions of the subject entity, reasoning entities, and candidate answers as the nodes of the entity graph. In our example, Kensal Green Cemetery appears in two different sentences, so we add a separate node for each position. Figure 2 shows an example of an entity graph, whose nodes are mentions of the subject entity, the reasoning entities, and the candidate answers, respectively.
Then, we define the following types of edges between pairs of nodes to encode various structural information in the entity graph.
an edge between a subject node and a reasoning node if they appear in the same sentence of a document (illustrated in Figure 2);
an edge between two reasoning nodes if they are adjacent on the same path (illustrated in Figure 2);
an edge between a reasoning node and a candidate node if they appear in the same sentence of a document (illustrated in Figure 2);
an edge between two nodes if they are mentions of the same candidate (illustrated in Figure 2);
an edge between two nodes if they appear in the same document;
an edge between any pair of nodes that does not meet the previous conditions.
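The typed-edge construction above can be sketched as follows. The node schema and the edge-type labels are illustrative assumptions of this sketch:

```python
from itertools import combinations

def adjacent_on_path(a, b, paths):
    """True if the two entities are consecutive on some extracted path."""
    return any(
        {a["entity"], b["entity"]} <= set(p)
        and abs(p.index(a["entity"]) - p.index(b["entity"])) == 1
        for p in paths
    )

def build_edges(nodes, paths):
    """Assign one edge type to every node pair, mirroring the rules above.

    Each node is a dict with keys: entity, kind ("subj"/"reason"/"cand"),
    doc, sent.
    """
    edges = []  # (i, j, edge_type)
    for (i, a), (j, b) in combinations(enumerate(nodes), 2):
        same_sent = a["doc"] == b["doc"] and a["sent"] == b["sent"]
        kinds = {a["kind"], b["kind"]}
        if same_sent and kinds == {"subj", "reason"}:
            edges.append((i, j, "subj-reason"))
        elif kinds == {"reason"} and adjacent_on_path(a, b, paths):
            edges.append((i, j, "reason-reason"))
        elif same_sent and kinds == {"reason", "cand"}:
            edges.append((i, j, "reason-cand"))
        elif kinds == {"cand"} and a["entity"] == b["entity"]:
            edges.append((i, j, "same-cand"))
        elif a["doc"] == b["doc"]:
            edges.append((i, j, "same-doc"))
        else:
            edges.append((i, j, "other"))  # catch-all edge type
    return edges
```

Because every remaining pair receives the catch-all type, the graph is complete and the edge type alone carries the structural information.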
3.3 Reasoning with Gated-RGCN
For graph initialization, we use the pre-trained ELMo and GloVe embeddings of each mention to model the contextual information of each node in different documents. These two vectors are concatenated and then encoded through a 1-layer linear network. Thus, the features of all nodes can be denoted as $H^{(0)} \in \mathbb{R}^{N \times d}$, where $N$ is the number of nodes in the graph and $d$ is the dimension of the node features.
After graph initialization, we employ a Gated Relational Graph Convolutional Network (Gated-RGCN) to realize multi-hop reasoning. First, we use R-GCN to aggregate messages from each node's direct neighbors. Specifically, at the $l$-th layer, the aggregated message $m_i^{(l)}$ for node $i$ is obtained via
$$m_i^{(l)} = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \sum_{r \in \mathcal{R}_{ij}} W_r^{(l)} h_j^{(l)},$$
where $\mathcal{N}_i$ is the set of neighbors of node $i$, $\mathcal{R}_{ij}$ is the set of relations between $i$ and $j$, $W_r^{(l)}$ is a relation-specific weight matrix, $|\mathcal{N}_i|$ denotes the size of $\mathcal{N}_i$, and $h_j^{(l)}$ is the hidden state of node $j$ at the $l$-th layer. Then, the update message $u_i^{(l)}$ for node $i$ is obtained by combining the aggregated message with the node's own information:
$$u_i^{(l)} = W_0^{(l)} h_i^{(l)} + m_i^{(l)},$$
where $W_0^{(l)}$ is a general weight matrix. A gate then controls how the hidden state is updated:
$$a_i^{(l)} = \sigma\!\left(f_a\!\left([u_i^{(l)}; h_i^{(l)}]\right)\right), \qquad h_i^{(l+1)} = \phi\!\left(u_i^{(l)}\right) \odot a_i^{(l)} + h_i^{(l)} \odot \left(1 - a_i^{(l)}\right),$$
where $\sigma$ is the sigmoid function, $[u_i^{(l)}; h_i^{(l)}]$ is the concatenation of $u_i^{(l)}$ and $h_i^{(l)}$, $f_a$ is implemented as a single-layer multi-layer perceptron (MLP), $\phi$ is a non-linear activation function, and $\odot$ denotes element-wise multiplication.
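One layer of the gated update described above can be sketched in numpy. The weight shapes, the choice of tanh for the activation $\phi$, and the per-neighbor-pair normalization are assumptions of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_rgcn_layer(h, neighbors, W_r, W0, Wa):
    """One gated relational graph-convolution layer (illustrative sketch).

    h: (N, d) node hidden states; neighbors[i]: list of (j, r) pairs;
    W_r: (R, d, d) relation-specific weights; W0: (d, d) self weight;
    Wa: (2d, 1) single-layer MLP weight for the gate.
    """
    N, d = h.shape
    u = h @ W0.T  # self transformation W0 h_i
    for i in range(N):
        if neighbors[i]:
            # normalized sum of relation-specific neighbor messages
            u[i] += np.mean([W_r[r] @ h[j] for j, r in neighbors[i]], axis=0)
    # gate a_i = sigmoid(f_a([u_i; h_i])), then gated mix of new and old state
    a = sigmoid(np.concatenate([u, h], axis=1) @ Wa)  # (N, 1)
    return a * np.tanh(u) + (1.0 - a) * h
```

A real implementation would batch the neighbor loop as sparse matrix products, but the arithmetic per node is the same.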
This gating mechanism regulates how much of the update message propagates to the next step, so it can prevent past information from being overwritten. However, traditional GCNs update the central node based only on the aggregated information of its adjacent nodes, and much irrelevant information is propagated in the process.
When humans solve reasoning problems, they select supporting information according to the query. Inspired by this, we add another question-aware gate to optimize the graph convolution procedure, making it more suitable for multi-hop reading comprehension. This gate regulates the aggregated message according to the question and simultaneously introduces the question information into the update message.
First, we encode the question with a bidirectional LSTM (BiLSTM) network, using GloVe as the word embeddings.
Then, the final question representation $q$ is obtained as an attention-weighted sum of the BiLSTM output vectors.
Finally, the update message $u_i^{(l)}$ is regulated by the question-aware gate:
$$\beta_i^{(l)} = \sigma\!\left(f_b\!\left([u_i^{(l)}; q]\right)\right), \qquad u_i^{(l)} \leftarrow \beta_i^{(l)} \odot \phi\!\left(W_q\, q\right) + \left(1 - \beta_i^{(l)}\right) \odot u_i^{(l)},$$
where $f_b$ is a single-layer MLP and $W_q$ projects the question representation into the node space.
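The question encoding and question-aware gate described above can be sketched as below. The pooling weights, gate parameterization, and the tanh transform of the question are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_vector(Q, w):
    """Attention-pooled question representation: q = sum_t alpha_t * Q_t.

    Q: (m, d) BiLSTM outputs for the m question tokens; w: (d,) scoring vector.
    """
    alpha = softmax(Q @ w)  # (m,) attention weights over tokens
    return alpha @ Q        # (d,) weighted sum

def question_gate(u, q, Wg):
    """Question-aware gate over the aggregated messages u of shape (N, d)."""
    q_rep = np.tile(q, (len(u), 1))  # broadcast q to every node
    g = 1.0 / (1.0 + np.exp(-np.concatenate([u, q_rep], axis=1) @ Wg))
    # mix transformed question information into each update message
    return g * np.tanh(q) + (1.0 - g) * u
```

The gate thus decides, per dimension and per node, how much of the message is replaced by question information before the hidden-state update.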
We stack this network for $L$ layers with all parameters shared, and finally obtain the node representations $H^{(L)}$ of the entity graph.
3.4 Output Layer
Similar to BAG, we apply bi-directional attention between the entity graph and the question. The similarity matrix $S \in \mathbb{R}^{N \times m}$ between nodes and question words is first calculated via
$$S_{ij} = \mathrm{avg}\!\left(f_s\!\left([h_i; q_j]\right)\right),$$
where $\mathrm{avg}$ is the average operation over the last dimension and $f_s$ is a single-layer MLP. Then, the node-to-question attention $A_{nq}$ and the question-to-node attention $A_{qn}$ are calculated via
$$A_{nq} = \mathrm{softmax}_{\mathrm{col}}(S)\, Q, \qquad A_{qn} = \mathrm{dup}_N\!\left(\mathrm{softmax}\!\left(\mathrm{max}_{\mathrm{col}}(S)\right) H\right),$$
where $\mathrm{softmax}_{\mathrm{col}}$ and $\mathrm{max}_{\mathrm{col}}$ denote performing the softmax and max functions across columns, respectively, and $\mathrm{dup}_N$ duplicates its argument $N$ times into shape $\mathbb{R}^{N \times d}$.
The output of the bi-directional attention layer is then fed to a 2-layer fully connected feed-forward network with a non-linear activation in each layer. Finally, a softmax function is applied to the output, producing a prediction score for each node in the graph. Each candidate may correspond to several nodes, since it may appear in multiple documents; we take the maximal probability over these nodes as the score of the candidate, and use cross-entropy as the loss function.
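The node-to-candidate aggregation and loss described above can be sketched as follows; the mapping array and function names are illustrative assumptions:

```python
import numpy as np

def candidate_scores(node_logits, node_to_cand, num_cands):
    """Take the maximum score over all mention nodes of each candidate.

    node_logits: (N,) per-node scores; node_to_cand[i]: candidate index of
    node i, or -1 for non-candidate (subject / reasoning) nodes.
    """
    scores = np.full(num_cands, -np.inf)
    for logit, c in zip(node_logits, node_to_cand):
        if c >= 0:
            scores[c] = max(scores[c], logit)
    return scores

def cross_entropy(scores, gold):
    """Softmax cross-entropy over the aggregated candidate scores."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[gold])
```

At prediction time the candidate with the highest aggregated score is returned; at training time `cross_entropy` is minimized against the gold candidate index.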
4.1 Dataset and Experimental Settings
We validate the effectiveness of our approach on WikiHop, a multi-choice reading comprehension dataset. It contains about 43K/5K/2.5K samples in the training, development, and test sets, respectively. The test set is not public and can only be evaluated blindly online.
In our implementation, we use NLTK to tokenize the supporting documents, question, and candidates into word sequences, and then find mentions of the subject entity and candidates in the supporting documents with an exact-matching strategy. To extract reasoning entities, we use Stanford CoreNLP to perform named entity recognition on the supporting documents.
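The exact-matching strategy amounts to finding every token-level occurrence of an entity string in a document; a minimal sketch (helper name assumed):

```python
def find_mentions(doc_tokens, entity_tokens):
    """Return (start, end) spans of every exact, case-insensitive occurrence
    of `entity_tokens` inside `doc_tokens` (both token lists)."""
    n = len(entity_tokens)
    ents = [t.lower() for t in entity_tokens]
    toks = [t.lower() for t in doc_tokens]
    return [(i, i + n) for i in range(len(toks) - n + 1) if toks[i:i + n] == ents]
```

Each returned span becomes one mention node in the entity graph, so the same entity yields one node per occurrence.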
We use the standard 1024-dimensional ELMo embeddings and 300-dimensional pre-trained GloVe embeddings (trained on 840B tokens of Web-crawl data) as word representations. The dimensions of the hidden states in the BiLSTM and GCN are fixed, the number of nodes and the query length are truncated to 600 and 25, respectively, and we stack multiple Gated-RGCN layers. During training, we set the mini-batch size to 16 and use Adam with a learning rate of 0.0002.
4.2 Main Results
We compare our approach with several previously published models and present the results in Table 2. Performance on multiple-choice QA is evaluated by the accuracy of choosing the correct answer. Table 2 reports the accuracy on both the development and test sets. Our approach achieves state-of-the-art accuracy on both sets against all types of approaches, including GCN-based approaches (Entity-GCN, BAG, HDE, MHQA-GRN), memory-based approaches (Coref-GRU, EPAr), path-based approaches (PathNet), and attention-based approaches (CFC, DynSAN).
Among ensemble models, our approach also achieves state-of-the-art performance, surpassing the reported human performance.
4.3 Ablation Studies
We conduct an ablation study to evaluate the contribution of each model component, and show the results in Table 3.
|Ablation|Accuracy (%)|Δ|
|(a) w/o reasoning entities|69.1|-1.7|
|(b) w/o question|70.2|-0.6|
|(c) w/o attention in question encoding|68.8|-2.0|
|(d) w/o edge types|69.0|-1.8|
|(e) reduce edge types|69.5|-1.3|
In (a), we remove all reasoning entities, so our graph degenerates to the entity graph of Entity-GCN. The accuracy on the development set drops to 69.1%, but it is still higher than the accuracy of Entity-GCN, which demonstrates the effectiveness of both the reasoning entities and the Gated-RGCN mechanism. In (b), we do not use the question during reasoning, while in (c), we encode the question with the BiLSTM but without the attention mechanism. From (b) and (c), we can see that the question is useful for reasoning. In (d), we treat all edge types equally, so Gated-RGCN is replaced by Gated-GCN in the reasoning, which reduces accuracy by 1.8% absolute. In (e), we only use the edge types defined in BAG, and the accuracy drops to 69.5%. From (d) and (e), we learn that distinguishing edge types is also critical for reasoning.
In this section, we conduct a series of experiments with different settings of our approach.
Different number of Gated-RGCN layers.
We evaluate our approach with different numbers of Gated-RGCN layers, and the results are shown in Figure 3(a). The accuracy first increases gradually and then drops, which demonstrates the effectiveness of Gated-RGCN. The eventual drop is expected, because more hops bring in more noise. As for why 4 hops works best, this may be because many samples are 1-hop problems, and the length of the reasoning path in these samples is exactly 4.
Different number of supporting documents.
We split the development set into subsets according to the number of supporting documents, and evaluate our approach on each subset. The results are shown in Figure 3(b). Accuracy decreases as the number of documents grows, since more documents bring in more irrelevant information, which is harmful to multi-hop QA. However, even with more than 16 supporting documents, our model achieves an accuracy of 69.5%, which is better than the overall accuracy of most models.
Different word embeddings.
Table 4 shows the results of our approach with different word embeddings. Using only ELMo or only GloVe causes a severe drop, while replacing ELMo with BERT delivers a competitive result. This shows that, for GCN-based approaches, the initialization of nodes is extremely critical, because this type of approach greatly compresses the information when constructing the graph.
In this paper, we propose a novel approach for multi-hop reading comprehension across documents. Our approach extends the entity graph by introducing reasoning entities, which form reasoning paths from the question to the candidates. In addition, our approach incorporates the question into multi-hop reasoning through a new gating mechanism that regulates how much useful information propagates from a node's neighbors to the node. Experiments show that our approach achieves state-of-the-art accuracy for both single and ensemble models.
Our future work will focus on the interpretability of multi-hop reading comprehension across documents. In addition, we would like to build the entity graph with reasoning entities dynamically during reasoning, and apply our model to other datasets.
This work is supported by the National Key Research and Development Project of China (No. 2018AAA0101900), the Fundamental Research Funds for the Central Universities (No. 2019FZA5013), the Zhejiang Provincial Natural Science Foundation of China (No. LY17F020015), the Chinese Knowledge Center of Engineering Science and Technology (CKCEST) and MOE Engineering Research Center of Digital Library.
-  (2006) NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pp. 69–72. Cited by: §4.1.
-  (2019) BAG: bi-directional attention entity graph convolutional network for multi-hop reasoning question answering. In NAACL-HLT, Cited by: §1, §2, §2, §3.3, §3.4, §4.2, §4.3, Table 2.
-  (2017) Go for a walk and arrive at the answer: reasoning over paths in knowledge bases using reinforcement learning. In ICLR, Cited by: §2.
-  (2019) Question answering by reasoning across documents with graph convolutional networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2306–2317. Cited by: §1, §2, §2, §3.2, §3.3, §4.2, §4.3, Table 2.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §4.4.
-  (2018) Neural models for reasoning over multiple mentions using coreference. In NAACL-HLT, Cited by: §2, §2, §4.2, Table 2.
-  (1997) Long short-term memory. Neural Computation 9, pp. 1735–1780. Cited by: §3.3.
-  (2017) Reinforced mnemonic reader for machine reading comprehension. In IJCAI, Cited by: §1.
-  (2019) Explore, propose, and assemble: an interpretable model for multi-hop reading comprehension. In ACL, Cited by: §2, §2, §4.2, Table 2.
-  (2018) Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In NAACL-HLT, Cited by: §2.
-  (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
-  (2017) The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328. Cited by: §2.
-  (2019) Exploiting explicit paths for multi-hop reading comprehension. In ACL (1), pp. 2737–2747. Cited by: §2, §2, §4.2, Table 2.
-  (2019) Adapting meta knowledge graph information for multi-hop reasoning over few-shot relations. ArXiv abs/1908.11513. Cited by: §2.
-  (2014) The Stanford CoreNLP natural language processing toolkit. In ACL, Cited by: §4.1.
-  (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, Cited by: §2.
-  (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §3.3.
-  (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. Cited by: §3.3.
-  (2019) Dynamically fused graph network for multi-hop reasoning. In ACL, Cited by: §2, §2.
-  (2017) Bidirectional attention flow for machine comprehension. In ICLR, Cited by: §1, Table 2.
-  (2017) ReasoNet: learning to stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1047–1055. Cited by: §1.
-  (2018) Exploring graph-structured passage representation for multi-hop reading comprehension with graph neural networks. ArXiv abs/1809.02040. Cited by: §2, §2, §4.2, Table 2.
-  (2019) Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs. In ACL, Cited by: §1, §2, §2, §3.3, §4.2, Table 2.
-  (2018) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302. Cited by: 3rd item, §1, §2, §3.1, §4.1, §4.2.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256. Cited by: §2.
-  (2018) HotpotQA: a dataset for diverse, explainable multi-hop question answering. In EMNLP, Cited by: §2.
-  (2019) Coarse-grain fine-grain coattention network for multi-evidence question answering. In ICLR, Cited by: §2, §2, §4.2, Table 2.
-  (2019) Token-level dynamic self-attention network for multi-passage reading comprehension. In ACL, Cited by: §2, §2, §4.2, Table 2.