Propagate-Selector: Detecting Supporting Sentences for Question Answering via Graph Neural Networks

by   Seunghyun Yoon, et al.
Seoul National University

In this study, we propose a novel graph neural network, called propagate-selector (PS), which propagates information over sentences to understand information that cannot be inferred when considering sentences in isolation. First, we design a graph structure in which each node represents an individual sentence, and some pairs of nodes are selectively connected based on the text structure. Then, we develop an iterative attentive aggregation and a skip-combine method in which a node interacts with its neighborhood nodes to accumulate the necessary information. To evaluate the performance of the proposed approaches, we conducted experiments with the HotpotQA dataset. The empirical results demonstrate the superiority of our approach, which achieves the best performance compared to widely used answer-selection models that do not consider inter-sentential relationships.






1 Introduction

Understanding texts and being able to answer a question posed by a human is a long-standing goal in the artificial intelligence field. Given the rapid advancement of neural network-based models and the availability of large-scale datasets, such as SQuAD rajpurkar2016squad and TriviaQA joshi2017triviaqa, researchers have begun to concentrate on building automatic question-answering (QA) systems. One example of such a system is the machine-reading question-answering (MRQA) model, which provides answers to questions from given passages xiong2016dynamic; wang2017gated; shen2017reasonet.

Figure 1: An example from the dataset. Detecting supporting sentences is an essential step toward answering the question.

Recently, research has revealed that most questions in existing MRQA datasets do not require reasoning across sentences in the given context (passage); instead, they can be answered by looking at only a single sentence weissenborn2017making. Exploiting this characteristic, a simple model can achieve performance competitive with that of a sophisticated model. In most real-world QA scenarios, however, more than one sentence must be utilized to extract a correct answer.

To alleviate this limitation in the previous datasets, another type of dataset was developed in which answering the question requires reasoning over multiple sentences in the given passages yang2018hotpotqa; welbl2018constructing. Figure 1 shows an example of a recently released dataset, the HotpotQA. This dataset consists of not only question-answer pairs with context passages but also supporting sentence information for answering the question annotated by a human.

In this study, we are interested in building a model that exploits the relational information among sentences in passages and classifies the supporting sentences that contain the information essential for answering the question. To this end, we propose a novel graph neural network model, named Propagate-selector (PS), that can be directly employed as a subsystem in a QA pipeline. First, we design a graph structure to hold the information in the HotpotQA dataset by assigning each sentence to an independent graph node. Then, we connect undirected edges between nodes using the proposed graph topology (see the discussion in Section 4.1). Next, we allow PS to propagate information between nodes through iterative hops to perform reasoning across the given sentences. Through this propagation process, the model learns to understand information that cannot be inferred when considering sentences in isolation.

To the best of our knowledge, this is the first work to employ a graph neural network structure to find supporting sentences for a QA system. Through experiments, we demonstrate that the proposed method achieves better performances when classifying supporting sentences than those of the widely used answer-selection models wang2016compare; bian2017compare; shen2017inter; tran2018context.

2 Related Work

Previous researchers have also investigated neural network-based models for MRQA. One line of inquiry employs an attention mechanism between tokens in the question and passage to compute the answer span from the given text seo2016bidirectional; wang2017gated. As the task scope was extended from specific- to open-domain QA, several models have been proposed to select a relevant paragraph from the text to predict the answer span wang2018r; clark2018simple. However, none of these methods have addressed reasoning over multiple sentences.

To understand the relational patterns in the dataset, graph neural network algorithms have also been previously proposed. kipf2016semi proposed a graph convolutional network to classify graph-structured data. This model was further investigated for applications involving large-scale graphs hamilton2017inductive, for the effectiveness of aggregating and combining graph nodes via an attention mechanism velivckovic2017graph, and for adopting recurrent node updates palm2018recurrent. In addition, one trial applied graph neural networks to QA tasks; however, this usage was limited to entity-level rather than sentence-level understanding de2018question.

3 Task and Dataset

The specific problem we aim to tackle in this study is to classify supporting sentences in the MRQA task. We consider the target dataset HotpotQA, by yang2018hotpotqa, which is comprised of tuples $(Q, P, Y, A)$, where $Q$ is the question, $P = \{P_1, \dots, P_n\}$ is the set of passages given as context, and each passage is further comprised of a set of sentences $P_i = \{S^i_1, \dots, S^i_m\}$. Here, $y^i_j \in Y$ is a binary label indicating whether $S^i_j$ contains the information required to answer the question, and $A$ is the answer. In particular, we call a sentence $S^i_j$ a supporting sentence when $y^i_j$ is true. Figure 1 shows an example of the HotpotQA dataset.

In this study, we do not use the answer information from the dataset; we use only the tuples $(Q, P, Y)$ when classifying supporting sentences. We believe that this subproblem plays an important role in building a full QA pipeline because the proposed models for this task will be combined with other MRQA models in an end-to-end training process.

4 Methodology

4.1 Propagate-Selector

In this paper, we are interested in identifying supporting sentences, among the sentences in the given text, that contain information essential to answering the question. To build a model that can perform reasoning across multiple sentences, we propose a graph neural network model called Propagate-selector (PS). PS consists of the following parts: a graph topology, node representations, an attentive aggregation step, and a skip-combine node update, each described below.


Figure 2: Topology of the proposed model. Each node represents a sentence from the passage and the question.

Figure 2 depicts the topology of the proposed model. In an offline step, we organize the content of each instance in a graph in which each node represents a sentence from either the passages or the question. Then, we add edges between nodes using the following topology:

  • we fully connect nodes that represent sentences from the same passage (dotted-black);

  • we fully connect nodes that represent the first sentence of each passage (dotted-red);

  • we add an edge between the question and every node for each passage (dotted-blue).

In this way, we create paths by which sentence nodes can propagate information both within and across passages.
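As a concrete illustration, the three connection rules above can be sketched as follows (a hypothetical sketch, not the authors' released code; the node ids and the `build_edges` helper are invented for illustration):

```python
# Illustrative sketch of the proposed graph topology. Nodes are integer ids;
# `passages` maps each passage to the list of its sentence node ids, in order,
# and `q` is the question node id.

def build_edges(q, passages):
    """Return the set of undirected edges {(u, v) with u < v}."""
    edges = set()

    def add(u, v):
        if u != v:
            edges.add((min(u, v), max(u, v)))

    # (1) fully connect sentences within the same passage (dotted-black)
    for nodes in passages:
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                add(u, v)

    # (2) fully connect the first sentences of all passages (dotted-red)
    firsts = [nodes[0] for nodes in passages if nodes]
    for i, u in enumerate(firsts):
        for v in firsts[i + 1:]:
            add(u, v)

    # (3) connect the question node to every sentence node (dotted-blue)
    for nodes in passages:
        for v in nodes:
            add(q, v)

    return edges

# Two passages with sentence nodes 1-2 and 3-5; node 0 is the question.
edges = build_edges(0, [[1, 2], [3, 4, 5]])
```

Rules (1) and (2) together give every sentence a short path to every other sentence, while rule (3) keeps the question one hop away from all of them.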

Node representation: The question $Q \in \mathbb{R}^{d \times |Q|}$ and each sentence $S^i_j \in \mathbb{R}^{d \times |S|}$ (where $d$ is the dimensionality of the word embedding and $|Q|$ and $|S|$ represent the lengths of the token sequences in $Q$ and $S^i_j$, respectively) are processed to acquire sentence-level information. Recent studies have shown that a pretrained language model helps the model capture the contextual meaning of words in a sentence peters2018deep; devlin2019bert. Following these studies, we select the ELMo peters2018deep language model for the word-embedding layer of our model: $\mathbf{E}_Q = \mathrm{ELMo}(Q)$, $\mathbf{E}_S = \mathrm{ELMo}(S^i_j)$. Using these new representations, we compute the sentence representation as follows:


$$\mathbf{q} = \mathrm{RNN}(\mathbf{E}_Q; \theta), \qquad \mathbf{s}^i_j = \mathrm{RNN}(\mathbf{E}_S; \theta), \quad (1)$$

where $\mathrm{RNN}$ is the RNN function with weight parameters $\theta$, and $\mathbf{q} \in \mathbb{R}^{d'}$ and $\mathbf{s}^i_j \in \mathbb{R}^{d'}$ are node representations for the question and sentence, respectively (where $d'$ is the dimensionality of the RNN hidden units).
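The encoding step can be sketched in a few lines (an illustrative sketch only: a vanilla tanh RNN with random weights stands in for the paper's GRU over ELMo embeddings, and all names here are invented):

```python
import numpy as np

# Encode a sentence (a sequence of word vectors) into a single node
# representation by taking the final hidden state of an RNN.

rng = np.random.default_rng(0)
d, d_h = 4, 3                          # word-embedding dim, RNN hidden dim
W_x = rng.normal(size=(d_h, d)) * 0.1  # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h)) * 0.1  # hidden-to-hidden weights

def encode(embeddings):
    """embeddings: (seq_len, d) array -> (d_h,) node representation."""
    h = np.zeros(d_h)
    for x in embeddings:               # step through the token sequence
        h = np.tanh(W_x @ x + W_h @ h)
    return h                           # final hidden state = node vector

q_node = encode(rng.normal(size=(6, d)))   # question node, 6 tokens
s_node = encode(rng.normal(size=(9, d)))   # sentence node, 9 tokens
```

Regardless of sentence length, every node ends up as a fixed-size vector, which is what allows the graph layers that follow to treat questions and sentences uniformly.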

Aggregation: An iterative attentive aggregation over neighbor nodes is utilized to compute the amount of information to be propagated to each node in the graph as follows:

$$\mathbf{a}^{(k)}_v = \sum_{u \in \mathcal{N}(v)} \alpha^{(k)}_{vu}\, \mathbf{h}^{(k)}_u, \qquad \alpha^{(k)}_{vu} = \underset{u \in \mathcal{N}(v)}{\mathrm{softmax}}\!\left( \sigma\!\left( \mathbf{h}^{(k)\top}_v \mathbf{W}^{(k)} \mathbf{h}^{(k)}_u \right) \right), \quad (2)$$

where $\mathbf{a}^{(k)}_v$ is the aggregated information for the $v$-th node, computed as the attentive weighted summation of its neighbor nodes; $\alpha^{(k)}_{vu}$ is the attention weight between node $v$ and its neighbor nodes $u \in \mathcal{N}(v)$; $\mathbf{h}^{(k)}_u$ is the $u$-th node representation; $\sigma$ is a nonlinear activation function; and $\mathbf{W}^{(k)}$ is a learned model parameter. Because all the nodes belong to a graph structure in which iterative aggregation is performed among nodes, the superscript $k$ indicates that the computation occurs in the $k$-th hop (iteration).
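One hop of this aggregation can be sketched as follows (an illustrative sketch under the assumption of a bilinear attention score with a tanh nonlinearity; the weights and node features here are random placeholders, not learned values):

```python
import numpy as np

# One hop of attentive aggregation: score each neighbor u of node v with
# e_vu = tanh(h_v^T W h_u), softmax the scores over the neighborhood, then
# take the attention-weighted sum of the neighbor representations.

rng = np.random.default_rng(1)
d_h = 3
H = rng.normal(size=(5, d_h))           # current representations of 5 nodes
W = rng.normal(size=(d_h, d_h)) * 0.1   # learned bilinear parameter (random here)

def aggregate(v, neighbors, H, W):
    scores = np.array([np.tanh(H[v] @ W @ H[u]) for u in neighbors])
    alpha = np.exp(scores) / np.exp(scores).sum()       # softmax attention
    a_v = sum(a * H[u] for a, u in zip(alpha, neighbors))
    return a_v, alpha

a_v, alpha = aggregate(0, [1, 2, 3], H, W)   # aggregate node 0's neighborhood
```

Because the attention weights are normalized per neighborhood, a node with many neighbors still receives a message on the same scale as a node with few, which keeps the iterated hops stable.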

Update: The aggregated information for the $v$-th node, $\mathbf{a}^{(k)}_v$ in equation (2), is combined with its previous node representation to update the node. We apply a skip connection to allow the model to learn the amount of information to be updated in each hop as follows:

$$\mathbf{h}^{(k+1)}_v = \sigma\!\left( \mathbf{W}^{(k)}_u \left[ \mathbf{a}^{(k)}_v \,;\, \mathbf{h}^{(k)}_v \right] \right) + \mathbf{h}^{(k)}_v, \quad (3)$$

where $\sigma$ is a nonlinear activation function, $[\,\cdot\,;\,\cdot\,]$ indicates vector concatenation, and $\mathbf{W}^{(k)}_u$ is a learned model parameter.
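The skip-combine update above can be sketched as follows (illustrative only; the projection matrix is a random placeholder for the learned parameter):

```python
import numpy as np

# Skip-combine update: concatenate the aggregated neighborhood message with
# the node's previous representation, apply a learned projection and a
# nonlinearity, and add a skip connection back to the previous state.

rng = np.random.default_rng(2)
d_h = 3
W_u = rng.normal(size=(d_h, 2 * d_h)) * 0.1  # learned update parameter (random here)

def update(h_prev, a_v):
    combined = np.concatenate([a_v, h_prev])   # [a_v ; h_prev]
    return np.tanh(W_u @ combined) + h_prev    # transform + skip connection

h_new = update(rng.normal(size=d_h), rng.normal(size=d_h))
```

The additive skip term lets gradients flow directly through the hops, which is one reason stacking several hops remains trainable at all (the hop analysis in Section 5.3 shows it still degrades eventually).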

4.2 Optimization

Because our objective is to classify supporting sentences $S^i_j$ from the given tuples $(Q, P, Y)$, we define two types of loss to be minimized. One is a rank loss that computes the cross-entropy loss between the question and each sentence using the ground-truth label $y^i_j$ as follows:

$$\mathcal{L}_{rank} = \mathrm{CE}\!\left( f(\mathbf{q}, \mathbf{s}^i_j),\, y^i_j \right), \quad (4)$$

where $f$ is a feedforward network that computes a similarity score between the final representation of the question and each sentence. The other is an attention loss, which is defined in each hop as follows:

$$\mathcal{L}^{(k)}_{attn} = \mathrm{CE}\!\left( \alpha^{(k)}_{qi},\, y_i \right), \quad (5)$$

where $\alpha^{(k)}_{qi}$ indicates the relevance between the question node $q$ and the $i$-th sentence node in the $k$-th hop, as computed by equation (2).

Finally, these two losses are combined to construct the final objective function:

$$\mathcal{L} = \mathcal{L}_{rank} + \lambda \sum_{k} \mathcal{L}^{(k)}_{attn}, \quad (6)$$

where $\lambda$ is a hyperparameter.

properties train dev
# questions 90,447 7,405
# sentences 3,703,344 306,487
passages / question 9.95 9.95
sentences / passage 4.12 4.16
sentences / question 40.94 41.39
supporting sentences / question 2.39 2.43
avg tokens (question) 17.92 15.83
avg tokens (sentence) 22.38 22.41
Table 1: Properties of the dataset.

5 Experiments

We regard the task as the problem of selecting the supporting sentences from the passages to answer the question. Similar to the answer-selection task in the QA literature, we report model performance using the mean average precision (MAP) and mean reciprocal rank (MRR) metrics. To evaluate model performance, we use the HotpotQA dataset, described in section "Task and Dataset". Table 1 shows properties of the dataset. We conduct a series of experiments comparing baseline methods with the newly proposed models. All code developed for this research will be made available via a public web repository along with the dataset.
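For reference, the two metrics can be computed per question as follows (a standard-textbook sketch, not tied to the authors' evaluation code; MAP and MRR then average these values over all questions):

```python
# Average precision and reciprocal rank for one question, given model scores
# and binary supporting-sentence labels.

def average_precision(scores, labels):
    """Mean of precision@rank taken at each relevant (supporting) sentence."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def reciprocal_rank(scores, labels):
    """1 / rank of the highest-scored supporting sentence."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            return 1.0 / rank
    return 0.0

# Supporting sentences at indices 0 and 3; the model ranks index 3 second
# and index 0 last.
ap = average_precision([0.2, 0.9, 0.4, 0.8], [1, 0, 0, 1])
rr = reciprocal_rank([0.2, 0.9, 0.4, 0.8], [1, 0, 0, 1])
```

MAP rewards ranking all supporting sentences highly, while MRR only looks at the first one, which is why MRR values in the tables below are consistently higher than MAP.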

5.1 Implementation Details

To implement the Propagate-selector (PS) model, we first use a small version of ELMo (13.6 M parameters) that provides 256-dimensional context embeddings. This choice was based on the available batch size (50 for our experiments) when training the complete model on a single GPU (GTX 1080 Ti). When we tried the original version of ELMo (93.6 M parameters, 1024-dimensional context embeddings), we were able to increase the batch size only up to 20, which resulted in excessive training time (approximately 90 hours). For sentence encoding, we used a GRU chung2014empirical with a hidden-unit dimension of 200. The hidden-unit weight matrix of the GRU is initialized with orthogonal weights saxe2013exact. Dropout is applied for regularization at a ratio of 0.7, both for the RNN (in equation 1) and for the attention weight matrix (in equation 2). For the nonlinear activation function (in equations 2 and 3), we use the tanh function.

Regarding the vocabulary, we replaced words with a term frequency below 12 with "UNK" tokens, yielding a final vocabulary size of 138,156. We applied the Adam optimizer kingma2014adam with gradient clipping by norm at a threshold of 5.

Model            dev MAP  dev MRR  train MAP  train MRR
IWAN [1]          0.526    0.680    0.605      0.775
sCARNN [2]        0.534    0.698    0.620      0.792
CompAggr [3]      0.659    0.812    0.796      0.911
CompClip [4]      0.670    0.825    0.767      0.901
CompClip-LM [5]   0.696    0.841    0.748      0.873
PS-avg            0.566    0.708    0.889      0.959
PS-rnn            0.700    0.822    0.919      0.971
PS-rnn-elmo-s     0.716    0.841    0.813      0.916
PS-rnn-elmo       0.734    0.853    0.863      0.945
Table 2: Model performance on the HotpotQA dataset (top scores marked in bold). Models [1-5] are from shen2017inter; tran2018context; wang2016compare; bian2017compare; yoon2019compare, respectively.

5.2 Comparisons with Other Methods

Table 2 shows the model performances on the HotpotQA dataset. Because the dataset only provides training (trainset) and validation (devset) subsets, we report the model performances on these datasets. While training the model, we implement early termination based on the devset performance and measure the best performance. To compare the model performances, we choose widely used answer-selection models such as CompAggr wang2016compare, IWAN shen2017inter, CompClip bian2017compare, sCARNN tran2018context, and CompClip-LM yoon2019compare which were primarily developed to rank candidate answers for a given question. The CompClip-LM is based on CompClip and adopts ELMo in its word-embedding layer.

(a) hop-1
(b) hop-2
(c) hop-3
(d) hop-4
Figure 3: Attention weights between the question and sentences in the passages. As the number of hops increases, the proposed model correctly classifies supporting sentences (ground-truth index 4 and 17).

In addition to the main proposed model, PS-rnn-elmo, we also investigate three model variants: PS-rnn-elmo-s uses a small version of ELMo; PS-rnn uses GloVe pennington2014glove instead of ELMo as the word-embedding layer; and PS-avg employs average pooling over the word embeddings instead of RNN encoding in equation (1).

As shown in Table 2, the proposed PS-rnn-elmo shows a significant MAP performance improvement compared to the previous best model, CompClip-LM (0.696 to 0.734 absolute).

# hops  dev MAP  dev MRR  train MAP  train MRR
1        0.651    0.794    0.716      0.842
2        0.653    0.797    0.721      0.850
3        0.698    0.830    0.800      0.908
4        0.734    0.853    0.863      0.945
5        0.700    0.827    0.803      0.906
6        0.457    0.606    0.467      0.621
Table 3: Model performance (top scores marked in bold) as the number of hops increases.

5.3 Hop Analysis

Table 3 shows the model performance (PS-rnn-elmo) as the number of hops increases. We find that the model achieves the best performance in the 4-hop case but starts to degrade when the number of hops exceeds 4. We assume that the model experiences the vanishing gradient problem under a larger number of iterative propagations (hops). Table 4 shows model performance with the small version of ELMo (PS-rnn-elmo-s).

Figure 3 depicts the attention weight between the question node and each sentence node (hop-4 model case). As the hop number increases, we observe that the model properly identifies the supporting sentences (in this example, sentences #4 and #17). This behavior demonstrates that our proposed model correctly learns how to propagate the necessary information among the sentence nodes via the iterative process.

# hops  dev MAP  dev MRR  train MAP  train MRR
1        0.648    0.790    0.708      0.842
2        0.655    0.801    0.720      0.853
3        0.681    0.816    0.768      0.886
4        0.706    0.834    0.796      0.906
5        0.716    0.841    0.813      0.916
6        0.441    0.596    0.452      0.600
7        0.434    0.589    0.450      0.606
Table 4: Model performance with the small version of ELMo (top scores marked in bold) as the number of hops increases.

6 Conclusion

In this paper, we propose a graph neural network that finds the sentences crucial for answering a question. The experiments demonstrate that the model correctly classifies supporting sentences by iteratively propagating the necessary information through its novel architecture. We believe that our approach will play an important role in building a QA pipeline in combination with other MRQA models trained in an end-to-end manner.