Exploring Graph-structured Passage Representation for Multi-hop Reading Comprehension with Graph Neural Networks

by   Linfeng Song, et al.

Multi-hop reading comprehension focuses on one type of factoid question, where a system needs to properly integrate multiple pieces of evidence to correctly answer a question. Previous work approximates global evidence with local coreference information, encoding coreference chains with DAG-styled GRU layers within a gated-attention reader. However, coreference is limited in providing information for rich inference. We introduce a new method for better connecting global evidence, which forms more complex graphs compared to DAGs. To perform evidence integration on our graphs, we investigate two recent graph neural networks, namely graph convolutional network (GCN) and graph recurrent network (GRN). Experiments on two standard datasets show that richer global information leads to better answers. Our method performs better than all published results on these datasets.



1 Introduction

Recent years have witnessed a growing interest in the task of machine reading comprehension. However, most existing work (Hermann et al., 2015; Wang and Jiang, 2017; Seo et al., 2016; Wang et al., 2016; Weissenborn et al., 2017; Dhingra et al., 2017a; Shen et al., 2017) focuses on a factoid scenario where the questions can be answered by simply considering very local information, such as one or two sentences. For example, to correctly answer a question “What causes precipitation to fall?”, a QA system only needs to refer to one sentence in a passage: “… In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. …”

A more challenging yet practical extension is multi-hop reading comprehension (MHRC) (Welbl et al., 2018), where a system needs to properly integrate multiple pieces of evidence to correctly answer a question. Figure 1 shows an example, which has three passages, a question and several candidate choices. In order to correctly answer the question, a system has to integrate the facts “The Hanging Gardens are in Mumbai” and “Mumbai is a city in India”. There are also some irrelevant facts, such as “The Hanging Gardens provide sunset views over the Arabian Sea” and “The Arabian Sea is bounded by Pakistan and Iran”, which make the task more challenging, as an MHRC model has to distinguish the relevant facts from the irrelevant ones.

Despite being a practical task, so far MHRC has received relatively little research attention. One notable method, Coref-GRU (Dhingra et al., 2018), integrates multiple pieces of evidence associated with each entity mention by incorporating coreference information using a collection of GRU layers of a gated-attention reader (Dhingra et al., 2017a). However, the main disadvantage of Coref-GRU is that the coreferences it considers are usually local to a sentence, neglecting other useful global information. The top part of Figure 2 shows a directed acyclic graph (DAG) with only coreference edges. In particular, the two coreference edges indicate the two facts “The Hanging Gardens provide views over the Arabian Sea” and “Mumbai is a city in India”, from which we cannot infer the ultimate fact, “The Hanging Gardens are in India”, that is needed to correctly answer this instance.

[The Hanging Gardens], in [Mumbai], also known as Pherozeshah Mehta Gardens, are terraced gardens … [They] provide sunset views over the [Arabian Sea] …
[Mumbai] (also known as Bombay, the official name until 1995) is the capital city of the Indian state of Maharashtra. [It] is the most populous city in [India] …
The [Arabian Sea] is a region of the northern Indian Ocean bounded on the north by [Pakistan] and [Iran], on the west by northeastern [Somalia] and the Arabian Peninsula, and on the east by [India] …
Q: (The Hanging gardens, country, ?)
Options: Iran, India, Pakistan, Somalia, …
Figure 1: An example from WikiHop (Welbl et al., 2018), where some relevant entity mentions and their coreferences are highlighted.

In this paper, we propose a new approach for evidence integration by considering two more useful types of edges in addition to coreferences. The bottom part of Figure 2 shows one example graph generated by our approach. In particular, we consider three types of edges. The first type of edge connects the mentions of the same entity appearing across passages or further apart in the same passage. As shown in Figure 2, one instance connects the two “Mumbai” mentions across the two passages. Intuitively, same-typed edges help to integrate global evidence related to the same entity that is not covered by pronouns. The second type of edge connects two mentions of different entities within a context window. Such edges help to pass useful evidence further across entities. For example, in the bottom graph of Figure 2, the window-typed edges ① and ⑥ both help to pass evidence from “The Hanging Gardens” to “India”, the answer of this instance. Besides, window-typed edges enhance the relations between local mentions that can be missed by the sequential encoding baseline. Finally, coreference-typed edges are complementary to the previous two types.

With three types of edges, our generated graphs are complex and can have cycles, making it difficult to directly apply a DAG network (e.g. the structure of Coref-GRU). In addition, certain groups of nodes cannot be reached from each other through graph edges. Take Figure 2 as an example. Information of “They” and “Arabian Sea” in the first passage cannot reach “Mumbai”, “It” or “India” in the second passage, or vice versa. To handle these problems, we adopt graph neural networks (Scarselli et al., 2009), which can encode arbitrary graphs. In particular, we choose the graph convolutional network (GCN) and the graph recurrent network (GRN), as they have been shown to be successful at encoding semantic graphs (Song et al., 2018a), dependency graphs (Bastings et al., 2017; Marcheggiani and Titov, 2017; Song et al., 2018b) and raw texts (Zhang et al., 2018).

Figure 2: A DAG generated by Dhingra et al. (2018) (top) and a graph by considering all three types of edges (bottom) on the example in Figure 1.

Given an instance containing several passages and a list of candidates, we first use NER and coreference resolution tools to obtain entity mentions, and then create a graph out of the mentions and relevant pronouns. As the next step, evidence integration is executed on the graph by adopting a graph neural network on top of a sequential layer. The sequential layer learns a local representation for each mention, while the graph network learns a global representation. The answer is decided by matching the representations of the mentions against the question representation.

Experiments on WikiHop (Welbl et al., 2018) and ComplexWebQuestions (Talmor and Berant, 2018) show that the additional types of edges we introduce are highly useful for MHRC. On the holdout testset of WikiHop, our approach achieves an accuracy of 65.4%, which is the best over all published results on the leaderboard (http://qangaroo.cs.ucl.ac.uk/leaderboard.html) as of the paper submission time. On the testset of ComplexWebQuestions, it also achieves better numbers than all published results without additional annotations. To our knowledge, we are among the first to investigate graph neural networks on reading comprehension. (The concurrent unpublished work of Cao et al. (2018) also investigates the usage of graph convolutional networks on WikiHop. Our work proposes a different model architecture, and focuses more on the exploration and comparison of multiple edge types for building the graph-structured passage representation.)

Figure 3: Baselines. The upper dotted box is a DAG LSTM layer with additional coreference links, while the bottom is a typical BiLSTM layer. Either layer is used.

2 Baseline

As shown in Figure 3, we introduce two baselines, which are inspired by Dhingra et al. (2018). The first baseline, Local, uses a standard BiLSTM layer (shown in the green dotted box), where inputs are first encoded with a BiLSTM layer, then the representation vectors for the mentions in the passages are extracted, before being matched against the question for selecting an answer. The second baseline, Coref LSTM, differs from Local by replacing the BiLSTM layer with a DAG LSTM layer (shown in the orange dotted box) for encoding additional coreference information, as proposed by Dhingra et al. (2018).

2.1 Local: BiLSTM encoding

Given a list of relevant passages, we first concatenate them into one large passage $p_1, p_2, \dots, p_N$, where each $p_i$ is a passage word and $x_{p_i}$ is its embedding. The Local baseline adopts a BiLSTM to encode the passage, so that each hidden state $h^p_i$ contains the information of its local context. Similarly, the question words $q_1, q_2, \dots, q_M$ are first converted into embeddings $x_{q_1}, x_{q_2}, \dots, x_{q_M}$ before being encoded by another BiLSTM.
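The two encoders described above can be written in the usual bidirectional LSTM form. The following is a sketch in assumed notation ($h^p_i$ and $h^q_j$ for passage and question hidden states, $N$ and $M$ for the passage and question lengths), not necessarily the authors' exact formulation:

```latex
\begin{align*}
\overleftarrow{h}^p_i &= \mathrm{LSTM}\big(\overleftarrow{h}^p_{i+1},\, x_{p_i}\big), &
\overrightarrow{h}^p_i &= \mathrm{LSTM}\big(\overrightarrow{h}^p_{i-1},\, x_{p_i}\big), &
h^p_i &= [\overleftarrow{h}^p_i ; \overrightarrow{h}^p_i], & i &= 1, \dots, N \\
\overleftarrow{h}^q_j &= \mathrm{LSTM}\big(\overleftarrow{h}^q_{j+1},\, x_{q_j}\big), &
\overrightarrow{h}^q_j &= \mathrm{LSTM}\big(\overrightarrow{h}^q_{j-1},\, x_{q_j}\big), &
h^q_j &= [\overleftarrow{h}^q_j ; \overrightarrow{h}^q_j], & j &= 1, \dots, M
\end{align*}
```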

2.2 Coref LSTM: DAG LSTM with coreference

Taking the passage word embeddings $x_{p_1}, \dots, x_{p_N}$ and coreference information as the input, the DAG LSTM layer encodes each input word embedding (such as $x_{p_j}$) with LSTM-style gated operations (only the forward direction is described, for space considerations).

Here $N(j)$ represents all preceding words of $p_j$ in the DAG, and $i_j$, $o_j$ and $f_j$ are the input, output and forget gates, respectively. $W_x$, $U_x$ and $b_x$ ($x \in \{i, o, f, u\}$) are model parameters.
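One way to write such gated operations is sketched below. This is an assumed form that sums the hidden and cell states of all predecessors and shares a single forget gate $f_j$; a separate forget gate per predecessor is an equally common variant, and the authors' exact equations may differ. The symbols $u_j$ (candidate update) and $c^p_j$ (cell state) are introduced here for illustration.

```latex
\begin{align*}
\bar{h}_j &= \sum_{k \in N(j)} h^p_k, \qquad \bar{c}_j = \sum_{k \in N(j)} c^p_k \\
i_j &= \sigma\big(W_i x_{p_j} + U_i \bar{h}_j + b_i\big) \\
o_j &= \sigma\big(W_o x_{p_j} + U_o \bar{h}_j + b_o\big) \\
f_j &= \sigma\big(W_f x_{p_j} + U_f \bar{h}_j + b_f\big) \\
u_j &= \tanh\big(W_u x_{p_j} + U_u \bar{h}_j + b_u\big) \\
c^p_j &= f_j \odot \bar{c}_j + i_j \odot u_j \\
h^p_j &= o_j \odot \tanh\big(c^p_j\big)
\end{align*}
```

When every word has exactly one predecessor (the previous word), this reduces to a standard left-to-right LSTM.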

2.3 Representation extraction

After encoding both the passage and the question, we obtain a representation vector $h^{\epsilon}_k$ for each entity mention $\epsilon_k$, spanning from position $k_i$ to position $k_j$, by concatenating the hidden states of its start and end positions and passing them through a fully connected layer (Equation 1), whose model parameters $W_1$ and $b_1$ compress the concatenated vector. Note that the current multi-hop reading comprehension datasets all focus on the situation where the answer is a named entity mention. Similarly, the representation vector $h^q$ for the question is generated by concatenating the hidden states of its first and last positions and applying a second fully connected layer (Equation 2), where $W_2$ and $b_2$ are also model parameters.
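Under the description above, the two extraction steps can be sketched as follows. The notation is an assumption: $h^p_i$ and $h^q_j$ denote the concatenated forward and backward hidden states of the passage and question encoders, and $M$ is the question length.

```latex
\begin{align}
h^{\epsilon}_k &= W_1 \, [\, h^p_{k_i} ; h^p_{k_j} \,] + b_1 \tag{1} \\
h^q &= W_2 \, [\, h^q_1 ; h^q_M \,] + b_2 \tag{2}
\end{align}
```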

2.4 Attention-based matching

After obtaining the representation vectors for the question and the entity mentions in the passages, an additive attention model (Bahdanau et al., 2015) is adopted by treating all entity mention representations and the question representation as the memory and the query, respectively. In particular, the probability for an entity $\epsilon$ being the answer is calculated by summing the attention scores $\alpha_k$ over $N_\epsilon$ and normalizing by their sum over $N$ (Equation 3), where $N_\epsilon$ and $N$ represent all occurrences of entity $\epsilon$ and all occurrences of all entities, respectively. Previous work (Wang et al., 2018a) shows that summing the probabilities over all occurrences of the same entity mention is important for the multi-passage scenario. $\alpha_k$ is the attention score for the entity mention $\epsilon_k$, obtained by applying a softmax (Equation 5) to the matching results $e^0_k$ of an additive attention model (Equation 4), where $v_a$, $W_a$, $U_a$ and $b_a$ are model parameters.
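The matching step can be illustrated with a small pure-Python sketch. It assumes the standard additive-attention form $v_a^\top \tanh(W_a h^{\epsilon}_k + U_a h^q + b_a)$ followed by a softmax over mentions; the function names and the toy inputs are hypothetical and are not taken from the authors' code.

```python
import math
from collections import defaultdict


def additive_score(mention_vec, question_vec, W_a, U_a, b_a, v_a):
    """Unnormalized additive-attention score v_a^T tanh(W_a m + U_a q + b_a)."""
    hidden = []
    for row_w, row_u, bias in zip(W_a, U_a, b_a):
        pre = sum(w * m for w, m in zip(row_w, mention_vec))
        pre += sum(u * q for u, q in zip(row_u, question_vec))
        hidden.append(math.tanh(pre + bias))
    return sum(v * h for v, h in zip(v_a, hidden))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_probabilities(mention_scores, mention_entities):
    """Sum the attention weights of all mentions that refer to the same entity."""
    alphas = softmax(mention_scores)
    probs = defaultdict(float)
    for alpha, entity in zip(alphas, mention_entities):
        probs[entity] += alpha
    return dict(probs)


# Toy example: three mentions, two of which refer to the same entity.
scores = [1.0, 0.0, 1.0]
entities = ["India", "Iran", "India"]
probs = entity_probabilities(scores, entities)
```

Because the softmax already normalizes over all mentions, the per-entity sums form a probability distribution over entities, which is why an entity mentioned in several passages accumulates evidence from each occurrence.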

Comparison with Dhingra et al. (2018)

The Coref-GRU model (Dhingra et al., 2018) is based on the gated-attention reader (GA reader) (Dhingra et al., 2017a), which is designed for the cloze-style reading comprehension task (Hermann et al., 2015), where one token is selected from the input passages as the answer for each instance. To adapt their model for the WikiHop benchmark, where an answer candidate can contain multiple tokens, they first generate a probability distribution over the passage tokens with GA reader, and then compute the probability for each candidate $c$ by aggregating the probabilities of all passage tokens that appear in $c$ and renormalizing over the candidates.

In addition to using LSTM instead of GRU (model architectures are selected according to dev results), the main difference between our two baselines and Dhingra et al. (2018) is that our baselines consider each candidate as a whole unit, no matter whether it contains multiple tokens or not. This makes our models more effective on the datasets containing phrasal answer candidates.

Figure 4: Model framework.

3 Evidence integration with graph network

After obtaining the representation vectors for a question and the corresponding entity mentions, we build a graph out of the entity mentions by connecting relevant mentions with edges, and then integrate relevant information for each graph node (entity mention) with a graph recurrent network (GRN) (Zhang et al., 2018; Song et al., 2018a) or a graph convolutional network (GCN) (Kipf and Welling, 2017). Figure 4 shows the overall procedure.

3.1 Graph construction

As a first step, we create a graph from a list of input passages. The entity mentions within the passages are taken as the graph nodes. They are automatically generated by NER and coreference annotators, so that each graph node is either an entity mention or a pronoun representing an entity. We then create an edge between two nodes if one of the following situations holds:

  • They are occurrences of the same entity mention across passages, or within the same passage at a distance larger than a threshold $\tau_L$.

  • One is an entity mention and the other is its coreference. Our coreference information is automatically generated by a coreference annotator.

  • They are mentions of different entities in the same passage within a window threshold of $\tau_S$.

Between every two nodes that satisfy one of the situations above, we make two edges in opposite directions. As a result, each generated graph can also be considered as an undirected graph.
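The three rules above can be sketched in a few lines of Python. The mention format, the function name `build_edges`, the precedence among the rules (coreference first, then same, then window) and the token offsets in the toy example are all assumptions made for illustration; the default thresholds follow the values reported for WikiHop.

```python
from itertools import combinations


def build_edges(mentions, same_threshold=200, window_threshold=20):
    """Return a set of directed (src, dst, type) triples, two per connected pair.

    mentions: list of dicts with keys 'entity' (surface form), 'passage'
    (passage index), 'pos' (token offset within the passage) and 'coref_of'
    (index of the antecedent mention, or None).
    """
    edges = set()

    def connect(a, b, kind):
        # Two edges in opposite directions, so the graph is effectively undirected.
        edges.add((a, b, kind))
        edges.add((b, a, kind))

    for a, b in combinations(range(len(mentions)), 2):
        ma, mb = mentions[a], mentions[b]
        same_passage = ma["passage"] == mb["passage"]
        distance = abs(ma["pos"] - mb["pos"]) if same_passage else None
        if ma["coref_of"] == b or mb["coref_of"] == a:
            connect(a, b, "coref")
        elif ma["entity"] == mb["entity"]:
            if not same_passage or distance > same_threshold:
                connect(a, b, "same")
        elif same_passage and distance <= window_threshold:
            connect(a, b, "window")
    return edges


# Toy input loosely modeled on Figure 1 (token offsets are made up).
mentions = [
    {"entity": "The Hanging Gardens", "passage": 0, "pos": 0, "coref_of": None},
    {"entity": "Mumbai", "passage": 0, "pos": 4, "coref_of": None},
    {"entity": "They", "passage": 0, "pos": 15, "coref_of": 0},
    {"entity": "Arabian Sea", "passage": 0, "pos": 22, "coref_of": None},
    {"entity": "Mumbai", "passage": 1, "pos": 0, "coref_of": None},
    {"entity": "It", "passage": 1, "pos": 20, "coref_of": 4},
    {"entity": "India", "passage": 1, "pos": 28, "coref_of": None},
]
edges = build_edges(mentions)
```

In this toy input, the same-typed edge between the two “Mumbai” mentions is the only link between the two passages, which is the kind of cross-passage connection that coreference edges alone do not provide.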

3.2 Evidence integration with graph network

Tackling multi-hop reading comprehension requires inferring on global context. As the next step, we merge related information through the three types of edges just created. We investigate two recent graph networks: GRN and GCN.

Graph recurrent network (GRN) GRN models a graph as a single state, performing recurrent information exchange between graph nodes through graph state transitions. Formally, given a graph $G = (V, E)$, a hidden state vector $s^k$ is created to represent each node $\epsilon_k \in V$. The state of the graph can thus be represented as the collection of all node states, $g = \{s^k\}_{\epsilon_k \in V}$.

In order to integrate non-local evidence among nodes, information exchange between neighboring nodes is performed through recurrent state transitions, leading to a sequence of graph states $g_0, g_1, \dots, g_T$, where $g_t = \{s^k_t\}_{\epsilon_k \in V}$ and $T$ is a hyperparameter representing the number of graph state transitions, decided by a development experiment. For the initial state $g_0$, we initialize each $s^k_0$ (Equation 6) from $h^{\epsilon}_k$, the corresponding representation vector of entity mention $\epsilon_k$ calculated by Equation 1, and from the question representation $h^q$, using model parameters $W_3$ and $b_3$.

A gated recurrent neural network is used to model the state transition process. In particular, the transition from $g_{t-1}$ to $g_t$ consists of a hidden state transition for each node. At each step $t$, direct information exchange is conducted between a node and all its neighbors via LSTM (Hochreiter and Schmidhuber, 1997) operations (Equation 7), where $c^k_t$ is the cell vector to record memory for $s^k_t$, and $i^k_t$, $o^k_t$ and $f^k_t$ are the input, output and forget gates, respectively. $W_x$ and $b_x$ ($x \in \{i, o, f, u\}$) are parameters. $m^k_t$ is the sum of the neighborhood hidden states for the node $\epsilon_k$ (Equation 8), $m^k_t = \sum_{i \in N(k)} s^i_{t-1}$, where $N(k)$ represents the set of all neighbors of $\epsilon_k$. (We tried distinguishing the neighbors connected by different types of edges, but it did not improve the performance.)
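A single graph state transition of this kind can be sketched in pure Python as follows. This is a minimal illustration under the assumption that every gate is computed from the neighbor sum $m^k_t$ alone, with one weight matrix and one bias per gate; the helper names and the toy parameters are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def grn_step(states, cells, neighbors, params):
    """Apply one graph state transition to every node.

    states, cells: dict mapping node -> list of floats (previous s and c).
    neighbors:     dict mapping node -> list of neighboring nodes.
    params:        dict mapping gate name ('i', 'o', 'f', 'u') -> (W, b).
    """
    new_states, new_cells = {}, {}
    for k, s_prev in states.items():
        # m: sum of the neighbors' hidden states from the previous step.
        m = [0.0] * len(s_prev)
        for n in neighbors.get(k, []):
            m = [a + b for a, b in zip(m, states[n])]
        i_gate = [sigmoid(z) for z in affine(*params["i"], m)]
        o_gate = [sigmoid(z) for z in affine(*params["o"], m)]
        f_gate = [sigmoid(z) for z in affine(*params["f"], m)]
        u = [math.tanh(z) for z in affine(*params["u"], m)]
        # LSTM-style cell update, then gated output as the new node state.
        c_new = [f * c + i * x for f, c, i, x in zip(f_gate, cells[k], i_gate, u)]
        new_cells[k] = c_new
        new_states[k] = [o * math.tanh(c) for o, c in zip(o_gate, c_new)]
    return new_states, new_cells


# Toy example: two connected nodes with 1-dimensional states.
zero = ([[0.0]], [0.0])
params = {"i": zero, "o": zero, "f": zero, "u": ([[1.0]], [0.0])}
states = {"a": [1.0], "b": [3.0]}
cells = {"a": [0.0], "b": [0.0]}
neighbors = {"a": ["b"], "b": ["a"]}
states_1, cells_1 = grn_step(states, cells, neighbors, params)
```

All nodes are updated from the previous graph state, so after $t$ calls a node has received information from nodes up to $t$ hops away.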

Graph convolutional network (GCN) GCN is a convolution-based alternative to GRN for encoding graphs. Similar to GRN, a GCN model consists of two main steps: state initialization and state update. For state initialization, GCN adopts the same approach as GRN by initializing from the representation vectors of entity mentions, as shown in Equation 6. The main difference between GCN and GRN is the way node states are updated. GRN adopts gated operations (shown in Equation 7), while GCN uses a linear transformation followed by a sigmoid, $s^k_t = \sigma(W_g m^k_t + b_g)$, where $m^k_t$ is also the sum of the neighborhood hidden states defined in Equation 8, and $W_g$ and $b_g$ are model parameters.
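For comparison with the gated update, the GCN-style update can be sketched as below: sum the neighbors' previous states, apply a linear map, then a sigmoid. The function name and the toy values are illustrative assumptions.

```python
import math


def gcn_step(states, neighbors, W_g, b_g):
    """One GCN-style update: sigmoid(W_g m + b_g), with m the neighbor-state sum.

    states:    dict mapping node -> list of floats (previous states).
    neighbors: dict mapping node -> list of neighboring nodes.
    W_g, b_g:  weight matrix (list of rows) and bias vector.
    """
    new_states = {}
    for k, s_prev in states.items():
        m = [0.0] * len(s_prev)
        for n in neighbors.get(k, []):
            m = [a + b for a, b in zip(m, states[n])]
        pre = [sum(w * v for w, v in zip(row, m)) + bias for row, bias in zip(W_g, b_g)]
        new_states[k] = [1.0 / (1.0 + math.exp(-z)) for z in pre]
    return new_states


# Toy example: two connected nodes with 1-dimensional states.
states = {"a": [0.0], "b": [2.0]}
neighbors = {"a": ["b"], "b": ["a"]}
states_1 = gcn_step(states, neighbors, W_g=[[1.0]], b_g=[0.0])
```

Unlike the gated update, this step keeps no cell memory, so a node's new state depends only on its neighbors' previous states.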

3.3 Matching and combination

After evidence integration, we match the hidden states at each graph encoding step with the question representation using the same additive attention mechanism introduced in the Baseline section. In particular, for each entity mention $\epsilon_k$, the matching results for the baseline and for each graph encoding step are first generated, before being combined using a weighted sum to obtain the overall matching result $e_k$. Here $e^0_k$ is the baseline matching result for $\epsilon_k$, $e^t_k$ is the matching result after $t$ graph encoding steps, and $T$ is the number of graph encoding steps. The attention parameters used for the graph encoding steps and the weights and bias of the weighted sum are model parameters. In addition, a probability distribution is calculated from the overall matching results using softmax, similar to Equation 5. Finally, probabilities that belong to the same entity mention are merged to obtain the final distribution, as shown in Equation 3.
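A sketch of this matching and combination step is given below. The symbols $\hat{W}_a$, $\hat{U}_a$, $\hat{v}_a$, $\hat{b}_a$ (attention parameters for graph states), $w_c$ and $b_c$ (combination weights and bias) are assumed names, as are $s^k_t$ for the state of node $\epsilon_k$ after $t$ steps and $h^q$ for the question representation.

```latex
\begin{align*}
e^t_k &= \hat{v}_a^{\top} \tanh\big(\hat{W}_a s^k_t + \hat{U}_a h^q + \hat{b}_a\big), \qquad t = 1, \dots, T \\
e_k &= w_c^{\top} \, [\, e^0_k ; e^1_k ; \dots ; e^T_k \,] + b_c
\end{align*}
```

Including $e^0_k$ in the weighted sum lets the model fall back on the purely sequential encoding when the graph adds little.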

4 Training

We train both the baseline and our models using the cross-entropy loss, i.e. the negative log-likelihood $-\log p(\epsilon^* \mid X; \theta)$ of the correct answer, where $\epsilon^*$ is the ground-truth answer, and $X$ and $\theta$ are the input and model parameters, respectively. Adam (Kingma and Ba, 2014) with a learning rate of 0.001 is used as the optimizer. Dropout with a rate of 0.1 and an $\ell_2$ regularization term are used during training.

5 Experiments on WikiHop

We study the effectiveness of the three types of edges and the graph encoders using the WikiHop (Welbl et al., 2018) dataset.

5.1 Data

The dataset contains around 51K instances, including 44K for training, 5K for development and 2.5K for held-out testing. Each instance consists of a question, a list of associated passages, a list of candidate answers and a correct answer. One example is shown in Figure 1. We use Stanford CoreNLP (Manning et al., 2014) to obtain coreference and NER annotations. Then the entity mentions, pronoun coreferences and the provided candidates are taken as graph nodes to create an evidence graph. The distance thresholds ($\tau_L$ and $\tau_S$, in Graph construction) for making same-typed and window-typed edges are set to 200 and 20, respectively.

Figure 5: Dev performances of different transition steps.

5.2 Settings

We study the model behavior on the WikiHop devset, choosing the best hyperparameters for online system evaluation on the final holdout testset. Our word embeddings are initialized from the 300-dimensional pretrained GloVe word embeddings (Pennington et al., 2014) on Common Crawl, and are not updated during training.

For model hyperparameters, we set the graph state transition number to 3 according to development experiments. Each node takes information from at most 200 neighbors, where same-typed and coref-typed neighbors are kept first. The hidden vector sizes for both bidirectional LSTM and GRN layers are set to 300. (We tried larger hidden sizes for our baselines, but did not observe further improvement.)
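The neighbor cap can be sketched as a stable sort followed by truncation. The function name, the `(node, edge_type)` input format and the choice to give same-typed and coref-typed neighbors equal priority are assumptions for illustration.

```python
def cap_neighbors(neighbors, max_neighbors=200):
    """Keep at most max_neighbors entries, preferring 'same' and 'coref' edges.

    neighbors: list of (node_id, edge_type) pairs, with edge_type one of
    'same', 'coref' or 'window'.
    """
    rank = {"same": 0, "coref": 0, "window": 1}
    # sorted() is stable, so the original order is preserved within each rank.
    ordered = sorted(neighbors, key=lambda item: rank[item[1]])
    return ordered[:max_neighbors]


kept = cap_neighbors(
    [(1, "window"), (2, "same"), (3, "coref"), (4, "window")],
    max_neighbors=3,
)
```

With this ordering, window-typed neighbors are the first to be dropped when a node exceeds the cap.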

5.3 Development experiments

Figure 5 shows the devset performances of our model using GRN or GCN with different transition steps. The performance at transition step 0 corresponds to the baseline. The performances go up for both models when increasing the transition step to 3. Further increasing the transition step leads to a slight decrease in performance. One reason can be that executing more transition steps may also introduce more noise through richly connected edges. GRN shows better performances than GCN with large transition steps, indicating that GRN is better at capturing long-range dependencies. This is likely because the gated operations of GRN are better at handling the vanishing/exploding gradient problem than the linear operations of GCN.

Model             Dev    Test
GA w/ GRU         54.9   -
GA w/ Coref-GRU   56.0   59.3
Local             61.0   -
Coref LSTM        61.4   -
Coref GRN         61.4   -
MHQA-GRN          62.8   65.4
Table 1: Main results (unmasked) on WikiHop.

5.4 Main results

Table 1 shows the main comparison results with existing work, where GA w/ GRU and GA w/ Coref-GRU correspond to Dhingra et al. (2018), and their reported numbers are copied. The former is their baseline, a gated-attention reader (Dhingra et al., 2017a), and the latter is their proposed method. Local is our baseline encoding input passages with a BiLSTM, which only captures local information for each mention. Coref LSTM is our baseline that encodes input passages with coreference annotations by using a bidirectional DAG LSTM. This can be considered as a reimplementation of Dhingra et al. (2018) based on our framework. Coref GRN is another baseline that uses GRN for encoding coreference. It is an ablation study of our model on coreference DAGs, and is for contrasting a DAG network with a graph network. MHQA-GCN and MHQA-GRN correspond to our evidence integration approaches via graph encoding, adopting GCN and GRN for graph encoding, respectively.

Edge type Dev
all types 62.8
     w/o same 61.9
     w/o coref 61.7
     w/o window 62.4
only same 61.6
only coref 61.4
only window 61.1
Table 2: Ablation study on different types of edges using GRN as the graph encoder.

Our baselines and models show much higher accuracies compared with GA w/ GRU and GA w/ Coref-GRU, as our models are more compatible with the evaluated dataset. In particular, we consider each candidate answer as a single unit, while GA w/ GRU and GA w/ Coref-GRU calculate the probability for each candidate by summing up the probabilities of all tokens within the candidate.

Coref LSTM shows a gain of only 0.4 points over Local. On the other hand, MHQA-GCN and MHQA-GRN are 1.4 and 1.8 points more accurate than Local, respectively. This is mainly because our graphs are better connected than coreference DAGs and are more suitable for integrating relevant evidence. Coref GRN gives a comparable performance with Coref LSTM, showing that graph networks may not necessarily be better than DAG networks on encoding DAGs. However, the former are more general on encoding arbitrary graphs. Finally, MHQA-GRN shows a higher testing accuracy than all published results. (At submission time we observe a recent short arXiv paper (Cao et al., 2018), available on August 28th, showing an accuracy of 67.6 using ELMo (Peters et al., 2018), which is the only result better than ours. ELMo has achieved dramatic performance gains of 3+ points over a broad range of tasks. Our main contribution is studying an evidence integration approach, which is orthogonal to the contribution of ELMo on large-scale training. For a fairer comparison with existing work, we did not adopt ELMo, but we will conduct experiments with ELMo as well.)

5.5 Analysis

Effectiveness of edge types

Table 2 shows the effectiveness of different types of edges that we introduce. The first group shows the ablation study, which indicates the importance of each type of edge. Among all these types, removing window-typed edges causes the least performance drop. One possible reason is that some information captured by window-typed edges has been well captured by sequential encoding. However, window-typed edges are still useful, as they can help to pass evidence through to further nodes. Take Figure 2 as an example: window-typed edges help to pass information from “The Hanging Gardens” to “India”. The other two types of edges are more important than window-typed ones. Intuitively, they help to learn a better representation for an entity by integrating the contextual information from its coreferences and occurrences.

The second group of Table 2 shows the model performances when only one type of edge is used. The numbers generally demonstrate the same patterns as the first group. In addition, only same is slightly better than only coref. This is likely because some coreference information can also be captured by sequential encoding. None of the results with a single edge type is significantly better than our strong baseline, whereas the combination of all three types achieves a much better result. This indicates the importance of evidence integration over multiple edge types.

Figure 6: Distribution of distances between a question and an answer on the Devset.

Distance Figure 6 shows the percentage distribution of distances between a question and its closest answer when either all types of edges are adopted or only coreference edges are used. The subject of each question is used to locate the question on the corresponding graph. (As shown in Figure 1, each question is a three-element tuple of subject, relation and a question mark asking for the object.)

When all types of edges are used, the instances with distances less than or equal to 3 account for around 70% of all the instances. On the other hand, the instances with distances longer than 4 only account for 10%. This can be the reason why performances do not increase when more than 3 transition steps are performed in our model. The advantage of our graph construction method can be shown by contrasting the distance distributions over graphs generated by both the baseline and our method. We further evaluate both methods on a subset of the devset instances, where for each instance the distance between the answer and the question is at most 3 in our graph but is infinity on the coreference DAG. The performances of Coref LSTM and MHQA-GRN on this subset are 61.1 and 63.8, respectively. Compared with the performances on the whole devset (61.4 vs 62.8), the performance gap is increased by 1.3 points on this subset, which further confirms our observation.
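The distance used in this analysis can be computed with a breadth-first search from the question subject's nodes to the nearest answer node. The sketch below is an illustration only: the function name, the adjacency format and the toy graphs (loosely modeled on the example of Figure 1, with hypothetical node names) are assumptions.

```python
from collections import deque


def shortest_distance(neighbors, sources, targets):
    """Number of edges on the shortest path from any source to the closest target.

    neighbors: dict mapping node -> list of adjacent nodes.
    Returns float('inf') when no target is reachable.
    """
    targets = set(targets)
    seen = set(sources)
    queue = deque((node, 0) for node in seen)
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


# Toy graph with window, same and coref edges (node names are hypothetical).
full_graph = {
    "HangingGardens": ["Mumbai_p1"],
    "Mumbai_p1": ["HangingGardens", "Mumbai_p2"],
    "Mumbai_p2": ["Mumbai_p1", "It"],
    "It": ["Mumbai_p2", "India"],
    "India": ["It"],
}
# The same nodes with only the coreference edge kept.
coref_only = {"Mumbai_p2": ["It"], "It": ["Mumbai_p2"]}

d_full = shortest_distance(full_graph, ["HangingGardens"], ["India"])
d_coref = shortest_distance(coref_only, ["HangingGardens"], ["India"])
```

An infinite distance means that no number of transition steps can carry evidence from the question subject to the answer, which is the situation singled out in the subset evaluation above.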

6 Experiments on ComplexWebQuestions

In addition to WikiHop, we conduct experiments on the newly released ComplexWebQuestions version 1.1 (Talmor and Berant, 2018) to better evaluate our approach. Compared with WikiHop, where the complexity is implicitly specified in the passages, the complexity of this dataset is explicitly specified on the question side. One example question is “What city is the birthplace of the author of ‘Without end’”. A two-step reasoning is involved, with the first step being “the author of ‘Without end’” and the second being “the birthplace of X”, where X is the answer of the first step.

Model                                 Dev    Test
SimpQA                                30.6   -
SplitQA                               31.1   -
Local                                 31.2   28.1
MHQA-GRN                              33.2   30.1
MHQA-GRN w/ only same                 32.2   -
SplitQA w/ additional labeled data    35.6   34.2
Table 3: Results on the ComplexWebQuestions dataset.

In this dataset, web snippets (instead of passages as in WikiHop) are used for extracting answers. The baseline of Talmor and Berant (2018) (SimpQA) only uses a full question to query the web for obtaining relevant snippets, while their model (SplitQA) obtains snippets for both the full question and its sub-questions. With all the snippets, SplitQA models the QA process based on a computation tree of the full question. (A computation tree is a special type of semantic parse, which has two levels. The first level contains sub-questions and the second level is a composition operation.) In particular, they first obtain the answers for the sub-questions, and then integrate those answers based on the computation tree. In contrast, our approach creates a graph from all the snippets, thus the succeeding evidence integration process can join all associated evidence.

Main results As shown in Table 3, similar to the observations in WikiHop, both MHQA-GRN and MHQA-GCN achieve large improvements over Local, and MHQA-GRN gives slightly better accuracy. Both the baselines and our models use all web snippets, but MHQA-GRN and MHQA-GCN further consider the structural relations among entity mentions. SplitQA achieves a 0.5% improvement over SimpQA. (As of the submission time, the authors of ComplexWebQuestions have not reported testing results for the two methods. To make a fair comparison, we compare the devset accuracy.) Our Local baseline is comparable with SplitQA, and our graph-based models contribute a further 2% improvement over Local. This indicates that considering structural information on passages is important for the dataset.

Analysis   To deal with complex questions that require evidence from multiple passages to answer, previous work (Wang et al., 2018b; Lin et al., 2018; Wang et al., 2018c) collects evidence from occurrences of an entity in different passages. The above methods correspond to a special case of our method, i.e. MHQA with only the same-typed edges. From Table 3, our method gives a 1-point increase over MHQA-GRN w/ only same, and it gives a larger increase in WikiHop (comparing all types with only same in Table 2). Both results indicate that our method could capture more useful information for multi-hop QA tasks, compared to the methods developed for previous multi-passage QA tasks. This is likely because our method integrates not only evidence for an entity but also that for other related entities.

The leaderboard reports SplitQA with additional sub-question annotations and gold answers for sub-questions. These pairs of sub-questions and answers are used as additional data for training SplitQA. The above approach relies on annotations of ground-truth answers for sub-questions and semantic parses, and is thus not practically useful in general. However, the results have additional value since they can be viewed as an upper bound of SplitQA. Note that the gap between this upper bound and our MHQA-GRN is small, which further suggests that larger improvements can be achieved by introducing structural information on the passage side.

7 Related Work

Question answering with multi-hop reasoning   Most existing work on multi-hop QA focuses on hopping over knowledge bases or tables (Jain, 2016; Neelakantan et al., 2016; Yin et al., 2016), thus the problem is reduced to deduction on a readily-defined structure with known relations. On the other hand, we study multi-hop QA on textual data and we introduce an effective approach on creating graph structures over the textual input for solving our problems. Previous work (Hill et al., 2015; Shen et al., 2017) studying multi-hop QA on text does not create structures. In addition, they only evaluate on a simple task (Weston et al., 2015) with a very limited vocabulary and passage length. Our work is fundamentally different from theirs by modeling structures over the input, and we evaluate our models on more challenging tasks.

Recent work starts to exploit ways for creating structures from inputs. Talmor and Berant (2018) build a two-level computation tree over each question where the first-level nodes are sub-questions and the second-level node is a composition operation. The answers for the sub-questions are first generated, and then combined with the composition operation. They predefine two composition operations, so it is not general enough for other QA problems. On the other hand, Dhingra et al. (2018) create DAGs over passages with coreference. The DAGs are then encoded with a DAG network. Our work follows the second direction by creating graphs on the passage side. However, we consider more types of relations than coreference, making a thorough study on relation types. Besides, we also investigate recent graph networks on this problem.

Question answering over multiple passages Recent efforts in open-domain QA start to generate answers from multiple passages instead of from a single passage. However, most existing work on multi-passage QA selects the most relevant passage for answering the given question, thus reducing the problem to single-passage reading comprehension (Chen et al., 2017; Dunn et al., 2017; Dhingra et al., 2017b; Wang et al., 2018a; Clark and Gardner, 2018). Our method is fundamentally different by truly leveraging multiple passages.

A few multi-passage QA approaches merge evidence from multiple passages before selecting an answer (Wang et al., 2018b; Lin et al., 2018; Wang et al., 2018c). Similar to our work, they combine evidence from multiple passages, and thus fully utilize the input passages. The key difference is that their approaches focus on how the contexts of a single answer candidate from different passages could cover different aspects of a complex question, while our approach studies how to properly integrate the related evidence of an answer candidate, some of which comes from the contexts of different entity mentions. In particular, this increases the difficulty, since those contexts co-occur with neither the candidate answer nor the question. This is also demonstrated by our empirical comparison, where our approach shows much better performance than combining only the evidence of the same entity mentions.

8 Conclusion

We have introduced a new approach for tackling multi-hop reading comprehension (MHRC) with an evidence integration process. Given a question and a list of passages, we first use three types of edges to connect related evidence, and then adopt recent graph neural networks to encode the resulting graphs for performing evidence integration. Results show that the three types of edges are useful for combining global evidence and that the graph neural networks are effective at encoding the complex graphs produced by the first step. Our approach shows the highest performance among all published results on two standard MHRC datasets.