Graph Sequential Network for Reasoning over Sequences

04/04/2020 · Ming Tu et al.

Recently, Graph Neural Networks (GNNs) have been applied successfully to various NLP tasks that require reasoning, such as multi-hop machine reading comprehension. In this paper, we consider a novel case where reasoning is needed over graphs built from sequences, i.e., graph nodes with sequence data. Existing GNN models fulfill this goal by first summarizing the node sequences into fixed-dimensional vectors, then applying GNN on these vectors. To avoid the information loss inherent in this early summarization and to make sequential labeling tasks on GNN output feasible, we propose a new type of GNN called Graph Sequential Network (GSN), which features a new message passing algorithm based on co-attention between a node and each of its neighbors. We validate the proposed GSN on two NLP tasks: interpretable multi-hop reading comprehension on HotpotQA and graph based fact verification on FEVER. Both tasks require reasoning over multiple documents or sentences. Our experimental results show that the proposed GSN attains better performance than standard GNN based methods.




1 Introduction

Graph neural network (GNN) has attracted much attention recently and has been applied to various tasks such as bio-medicine Zitnik et al. (2018), computational chemistry Gilmer et al. (2017), social networks Fan et al. (2019), computer vision Li and Gupta (2018), and natural language understanding Xiao et al. (2019); Tu et al. (2019b). GNN assumes structured graphical inputs, for example molecule graphs, protein-protein interaction networks, or language syntax trees, which can be represented with a graph G = (V, E), where V defines a set of nodes and E defines a set of edges, each of which connects two different nodes in V.

Different GNN variants have been proposed to learn graph representations, including Graph Convolutional Network (GCN) Kipf and Welling (2016), GraphSage Hamilton et al. (2017), Graph Isomorphism Network (GIN) Xu et al. (2018) and Graph Attention Network (GAT) Veličković et al. (2017). Existing GNN variants assume the features of each node to be a single vector, initialized with predefined features or learnt by feature encoding networks. In cases where each node is represented by a sequence of feature vectors, as is common in natural language processing (NLP) tasks, common practice takes the encoded sequential feature vectors through a summarization module, based either on simple average/max pooling or on parametric attentive pooling, to convert them into a fixed-dimensional feature vector. A GNN-based message passing algorithm is then applied to obtain node representations from these summarized feature vectors Tu et al. (2019b); Xiao et al. (2019); Zhou et al. (2019).
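As a concrete illustration of this early-summarization pipeline, the sketch below (our own illustrative NumPy code, not from the cited systems) collapses variable-length node sequences into fixed-dimensional vectors with mean pooling, and also shows a parametric attentive-pooling alternative; the names `mean_pool` and `attentive_pool` are ours.

```python
import numpy as np

def mean_pool(seq):
    """Collapse a (length, dim) sequence of token vectors into one (dim,) vector."""
    return seq.mean(axis=0)

def attentive_pool(seq, w):
    """Parametric attentive pooling: score each position with a learned
    vector w of shape (dim,) and take the attention-weighted sum."""
    scores = seq @ w                          # (length,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over positions
    return weights @ seq                      # (dim,)

# Three "sentence" nodes of different lengths, each a sequence of 4-d vectors.
rng = np.random.default_rng(0)
nodes = [rng.normal(size=(n, 4)) for n in (5, 7, 3)]
# Early summarization: one fixed-dimensional vector per node, fed to a GNN.
node_matrix = np.stack([mean_pool(s) for s in nodes])   # (3, 4)
```

Whatever the pooling choice, the per-token structure of each sequence is discarded before message passing, which is the limitation discussed next.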

However, this early summarization strategy (summarization before GNN based representation learning) brings inevitable information loss Seo et al. (2016) and creates an information-flow bottleneck that weakens the reasoning ability among graph nodes. Furthermore, early summarization makes sequential labeling tasks impossible, because the GNN outputs only one vector per input sequence while sequential labeling tasks need per-token outputs.

Figure 1: Diagram of the proposed GSN and its comparison with GNN when dealing with a graph built from multiple sequences (3 in the figure). Given the same input, the first row shows the common pipeline of GNN based models, and the second row shows the pipeline of our proposed GSN based models. We also show the feasible tasks: our proposed GSN can tackle sequential labeling tasks while GNN cannot.

To alleviate these limitations, in this paper we propose a new type of GNN, the Graph Sequential Network (GSN), to directly learn feature representations over graphs with a sequence at each node. GSN differs from previous GNN variants in the following ways:

  1. GSN can directly conduct message passing over nodes represented with sequential feature vectors, thus avoiding the information loss caused by pooling based early summarization.

  2. Both the input and output of the proposed GSN are sequences, making sequential labeling tasks on GSN output possible.

To achieve these advantages, we propose a new message passing algorithm based on co-attention between a node and each of its neighbors. Co-attention is commonly used in NLP tasks, especially machine reading comprehension (MRC), as a way to encode query-aware contextual information based on the affinity matrix between two sequences Xiong et al. (2016); Seo et al. (2016); Zhong et al. (2019). In the context of this paper, the advantage of co-attention is that it can encode neighbor-aware information for the current node, represented by a sequence of feature vectors, even when neighbors have different sequence lengths. The learned sequential representation of each node can then be used for node-level sequence classification, sequential labeling, or graph-level classification tasks. The general idea of our proposed GSN and its comparison with existing GNN based methods is shown in Figure 1.

To validate the effectiveness of the proposed GSN, we experiment on two NLP data sets: HotpotQA Yang et al. (2018) and the fact extraction and verification data set provided by the FEVER shared task 1.0 Thorne et al. (2018). Both tasks require the model to have reasoning ability, and top performance has been achieved with early summarization followed by GNN Zhou et al. (2019); Xiao et al. (2019); Fang et al. (2019); Tu et al. (2019a). With thorough experiments, we show that the proposed GSN achieves better performance than standard GNN, demonstrating its stronger ability to reason over sequences.

2 Related Work

GNN has been proposed as a powerful model for graph representation learning. Different from Convolutional Neural Networks (CNN), which work in Euclidean space, GNN operates on graph data, usually defined as a set of graph nodes and the edges connecting those nodes. GNN implements neural-network-like message passing algorithms to update each graph node's representation from its neighborhood. The resulting node representations encode structural information from the subgraph within K hops of each node, where K is the number of message passing steps.

Multiple GNN variants have been proposed with different message passing algorithms, for example GCN Kipf and Welling (2016), GraphSage Hamilton et al. (2017), GAT Veličković et al. (2017) and GIN Xu et al. (2018). Our proposed GSN can be regarded as a variant of GNN. However, GSN differs from previous GNN variants in that it operates on graphs with sequences as nodes, and thus needs a new message passing algorithm for nodes represented with sequences.

GNN for NLP: various recent NLP studies have adopted GNN and benefited from it. These works can be roughly categorized into two groups depending on how graphs are built. The first group usually builds graphs from parsing trees or develops graph-like Recurrent Neural Networks (RNN).

Bastings et al. (2017) and Marcheggiani et al. (2018) explored building graphs from syntactic or semantic parsing trees and inserted a GCN based sub-network into the encoder of sequence-to-sequence machine translation models. Zhang et al. (2018b) applied GCN to pruned syntactic dependency trees for relation extraction. Zhang et al. (2019) proposed to use GCN over syntactic dependency trees for aspect-based sentiment classification, and Vashishth et al. (2019) applied a similar idea to derive word embeddings based on GCN. Furthermore, the tree-LSTM Tai et al. (2015) and sent-LSTM Zhang et al. (2018a) models can also be regarded as implementations of GNN, because both explore recurrent message passing over tree-structured text. To summarize, the methods in the first group utilize the intrinsic linguistic properties of a sentence to guide graph building, and then employ GNN to learn better representations of text.

On the other hand, the second group of studies builds graphs in a more heuristic way (e.g., whether an entity appears in a sentence or paragraph), and over a wider range of context (e.g., multiple documents).

De Cao et al. (2019) and Xiao et al. (2019) both constructed graphs over entities in documents and leveraged GCN to achieve reasoning over multiple documents. Later, Tu et al. (2019b) proposed to add nodes representing documents to the graph to better model the global information in the context. Fang et al. (2019) built a hierarchical graph consisting of entity nodes, sentence nodes and paragraph nodes for multi-hop reasoning over multiple paragraphs. All these methods attained strong performance on multi-hop reading comprehension tasks. A similar idea was also explored for the text classification task Yao et al. (2018). The methods in this group aim to learn the relational information present in very long contexts, achieving reasoning ability by reformatting the context into a graph structure.

When dealing with graph nodes represented with sequences (multiple tokens in an entity, sentence or paragraph), all previous studies convert each sequence into a feature vector, which makes it possible to apply existing GNN algorithms. In contrast, the GSN proposed in this paper directly conducts message passing over the sequences on graph nodes.

3 Methodology

This section starts with a brief introduction on GNN. Then, we introduce the proposed GSN and how it is implemented with an emphasis on its difference with existing GNN variants. Finally, we elaborate on how to apply GSN to NLP tasks that require reasoning over sequences.

3.1 Graph Neural Network

Assume a graph represented by G = (V, E). V defines a set of N nodes, with each node v_i ∈ R^d denoting a d-dimensional feature vector; E defines a set of edges, each connecting two of the nodes. Here, we only consider undirected connections between nodes.

GNN is designed for machine learning tasks with structural data that can be represented by a graph informing the relational information among nodes. GNN has two basic operations, named aggregation and combination, in analogy to convolution and pooling in CNN Hamilton et al. (2017); Xu et al. (2018). One step of these two operations is usually called a hop, and the computation of the k-th hop can be formulated with aggregation and combination respectively as

a_i^{(k)} = AGGREGATE^{(k)}({h_j^{(k-1)} : j ∈ N(i)}),
h_i^{(k)} = COMBINE^{(k)}(h_i^{(k-1)}, a_i^{(k)}),

where AGGREGATE^{(k)} and COMBINE^{(k)} represent the aggregation and combination operations respectively; N(i) is the set of neighboring nodes of node i; and h_i^{(k)} is the representation of node i learned after the k-th hop. The aggregation step collects information from neighboring nodes, while the combination step fuses the collected information with the representation of the current node. For example, GCN implements these two steps in one formula:

h_i^{(k)} = Proj( (1/d_i) Σ_{j ∈ N(i) ∪ {i}} h_j^{(k-1)} ),

where Proj is a linear layer with a specific activation function and d_i is the degree of node i.
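The GCN-style update can be sketched in NumPy as follows. This is our own illustrative code, not the paper's implementation; the self-loop handling and the ReLU activation are assumptions about details the text leaves open.

```python
import numpy as np

def gcn_hop(H, adj, W):
    """One GCN hop: average each node's neighborhood (including itself),
    project with W, and apply a ReLU activation.
    H: (N, d) node features, adj: (N, N) 0/1 adjacency, W: (d, d_out)."""
    A = adj + np.eye(adj.shape[0])        # add self-loops
    deg = A.sum(axis=1, keepdims=True)    # node degrees (with self-loop)
    agg = (A @ H) / deg                   # mean aggregation over N(i) ∪ {i}
    return np.maximum(agg @ W, 0.0)       # Proj with ReLU

# Tiny 3-node path graph: 0 - 1 - 2.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
H0 = np.eye(3)                            # one-hot initial features
H1 = gcn_hop(H0, adj, np.eye(3))          # identity projection, for illustration
```

After one hop, each node's row mixes its own one-hot feature with its neighbors', which is exactly the aggregation-then-combination behavior described above.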

3.2 Graph Sequential Network

Assume a different graph represented by G = (V, E). V again defines a set of N nodes, with each node however denoted by a sequence of feature vectors H_i = [h_{i,1}, ..., h_{i,L_i}], where L_i is the sequence length of node i and each h_{i,j} is a d-dimensional feature vector. E also defines a set of edges connecting two of the nodes.

Like previous GNN variants, GSN also implements a two-step computation process, aggregation and combination, which can be formulated as

A_i^{(k)} = AGGREGATE^{(k)}({H_j^{(k-1)} : j ∈ N(i)}),
H_i^{(k)} = COMBINE^{(k)}(H_i^{(k-1)}, A_i^{(k)}),

where k indicates the k-th computation step. Still, the aggregation step calculates structure-aware feature representations A_i^{(k)} from the neighborhood of node i, and the combination step fuses A_i^{(k)} with node i's current feature representation.

To enable aggregation and combination over nodes specified by a sequence of feature vectors, we design new aggregation and combination functions and put them in one formula:

H_i^{(k)} = pool_{j ∈ N(i)} f_c(H_i^{(k-1)}, H_j^{(k-1)}),

where f_c defines the co-attention based aggregation function. For pool, there are two choices: element-wise max-pooling or average pooling (1/d_i) Σ_{j ∈ N(i)}, where d_i is the degree of node i. We can also extend GSN to the multi-relational setting as in Schlichtkrull et al. (2018), where there are multiple types of edges. The message passing algorithm with max-pooling based combination then becomes

H_i^{(k)} = max_{r ∈ R} max_{j ∈ N_r(i)} f_r(H_i^{(k-1)}, H_j^{(k-1)}),

where R is the set of all relation types and |R| is its size; N_r(i) is node i's neighbor set under relation r; and f_r is the parametrized aggregation function for relation r.
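The multi-relational max-pooling update can be sketched as below. This is illustrative NumPy code of our own: `toy_aggregate` is a stand-in for the co-attention aggregation f_r (described in Algorithm 1), used only to keep the example self-contained, and keeping the node's own sequence inside the max is our choice of combination.

```python
import numpy as np

def gsn_layer(H, rel_neighbors, aggregate_fns):
    """One multi-relational GSN hop with max-pooling combination.
    H: list of (L_i, d) node sequences.
    rel_neighbors: {relation: {node: [neighbor ids]}}.
    aggregate_fns: {relation: f_r(seq_i, seq_j) -> (L_i, d)}, one
    parametrized sequence-level aggregation function per relation type."""
    out = []
    for i, seq in enumerate(H):
        messages = [seq]  # keep the node's own representation in the max (our choice)
        for r, nbrs in rel_neighbors.items():
            for j in nbrs.get(i, []):
                messages.append(aggregate_fns[r](seq, H[j]))
        out.append(np.max(np.stack(messages), axis=0))  # element-wise max, (L_i, d)
    return out

# Toy stand-in aggregation (NOT the paper's BiDAF co-attention): scale the
# current node's sequence by its similarity to the neighbor's mean vector.
def toy_aggregate(seq_i, seq_j):
    sim = seq_i @ seq_j.mean(axis=0)          # (L_i,)
    return seq_i * sim[:, None]

H = [np.ones((4, 3)), 2 * np.ones((6, 3))]    # two nodes with different lengths
rels = {"same-doc": {0: [1], 1: [0]}}
H_next = gsn_layer(H, rels, {"same-doc": toy_aggregate})
```

Note that each output sequence keeps its own length L_i, which is what makes downstream sequential labeling possible.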

There are several implementations of co-attention Xiong et al. (2016); Seo et al. (2016); Zhong et al. (2019). Instead of the Recurrent Neural Network (RNN) based co-attention in Xiong et al. (2016); Zhong et al. (2019), we choose the Bidirectional Attention Flow (BiDAF) Seo et al. (2016) as the co-attention implementation for GSN for the following reasons: 1) it introduces far fewer weight parameters than (bidirectional) RNN based co-attention; 2) it is much faster than RNN based co-attention, especially when the graph is dense (i.e., has close to the maximum possible number of edges). Our implementation of BiDAF is summarized in Algorithm 1 (we omit the node and layer indices for clarity). We assume the input and output feature dimensions are both d, though this can be adjusted.

For each layer of GSN, the only weight parameters introduced are the similarity vector w_s and the output projection W_o; w_s has output size 1, so its parameter count is negligible, and W_o has 4d × d parameters when the input and output feature dimensions are the same.

Input: current node S ∈ R^{L×d} and one of its neighbors Q ∈ R^{M×d}
Output: neighbor-aware representation S̃ ∈ R^{L×d}
1  Compute the affinity matrix A ∈ R^{L×M}: A_{l,m} = w_s^T [S_{l,:} ; Q_{m,:} ; S_{l,:} ∘ Q_{m,:}];
2  Node-to-neighbor attention: Q̃ = softmax_row(A) Q;
3  Neighbor-to-node attention: b = softmax(maxrow(A)) ∈ R^L, s̃ = b^T S ∈ R^d;
4  G = [S ; Q̃ ; S ∘ Q̃ ; S ∘ s̃] ∈ R^{L×4d}, with s̃ broadcast to every row;
5  S̃ = G W_o.
Notes: X_{l,:} and X_{:,m} represent the l-th row and m-th column of a matrix respectively; ∘ stands for element-wise multiplication; "[;]" represents vector concatenation; maxrow represents taking maximum values over rows of a matrix.
Algorithm 1: Implementation of f_c.
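As a concrete sketch of BiDAF based co-attention between two node sequences, the NumPy code below follows the standard BiDAF formulation under the assumptions stated above (a similarity vector w_s and an output projection W_o); the paper's exact implementation may differ in details.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidaf_coattention(S, Q, w_s, W_o):
    """BiDAF-style co-attention between the current node S (L, d) and one
    of its neighbors Q (M, d); returns a neighbor-aware representation (L, d).
    w_s: (3d,) similarity weights; W_o: (4d, d) output projection."""
    L, M = S.shape[0], Q.shape[0]
    # Affinity matrix: A[l, m] = w_s · [S_l ; Q_m ; S_l ∘ Q_m]
    A = np.empty((L, M))
    for l in range(L):
        for m in range(M):
            A[l, m] = w_s @ np.concatenate([S[l], Q[m], S[l] * Q[m]])
    Q_att = softmax(A, axis=1) @ Q            # node-to-neighbor attention, (L, d)
    b = softmax(A.max(axis=1), axis=0)        # neighbor-to-node attention, (L,)
    s_att = b @ S                             # (d,), broadcast to every row below
    G = np.concatenate([S, Q_att, S * Q_att, S * s_att], axis=1)  # (L, 4d)
    return G @ W_o                            # (L, d)

rng = np.random.default_rng(1)
d = 4
S = rng.normal(size=(5, d))                   # current node, length 5
Q = rng.normal(size=(7, d))                   # neighbor node, length 7
out = bidaf_coattention(S, Q, rng.normal(size=3 * d), rng.normal(size=(4 * d, d)))
```

The output has the same length as the current node regardless of the neighbor's length, which is why neighbors with different sequence lengths pose no problem.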

3.3 Applications on NLP tasks

Some NLP tasks require reasoning over multiple sentences/paragraphs, such as multi-hop machine reading comprehension or fact verification over multiple sentences/documents Yang et al. (2018); Thorne et al. (2018). Previous studies have shown that applying GNN to a graph with sequence (phrase, sentence or document) embeddings as nodes can improve performance on these tasks Xiao et al. (2019); Tu et al. (2019a); Zhou et al. (2019); Fang et al. (2019). Instead of summarizing sequences into vectors and using them for graph node initialization, our proposed GSN avoids sequence summarization and directly takes sequence features as graph node representations. The co-attention based message passing of GSN learns neighbor-aware representations of the current node: the current node acts as the context sequence and each of its neighbors acts as the query sequence, as in co-attention for MRC Seo et al. (2016). Thus GSN enables the aggregation of relational information among sequences and strengthens the model's reasoning ability over sequences. Furthermore, based on its sequential output, GSN also makes sequential labeling tasks possible, such as predicting start and end positions for extraction based QA, which is impossible for current GNN variants. This property could bring more potential to sequential labeling tasks in NLP that require complex reasoning.

4 Experiments and Results

In this section we validate the efficacy of our proposed GSN on three NLP tasks: multi-hop MRC span extraction, multi-hop MRC supporting sentence prediction and fact verification using two data sets: HotpotQA Yang et al. (2018) and FEVER Thorne et al. (2018). Our goal in this study is not to achieve the state-of-the-art performance on these two data sets, but rather to show the effectiveness of the proposed GSN when compared to existing GNN models.

4.1 HotpotQA data set

HotpotQA is the first multi-hop QA data set that takes the explanation ability of models into account. HotpotQA is constructed by presenting crowd workers with multiple documents and asking them to provide a question, the corresponding answer, and the supporting sentences used to reach the answer. There are about 90K training samples, and 7.4K development and test samples each. HotpotQA presents two tasks: answer span prediction (extracting a text span from the context) and supporting facts prediction (predicting whether a sentence supports the answer or not). Models are evaluated based on the Exact Match (EM) and F1 scores of the two tasks. Joint EM and F1 scores are used as the overall performance measurements, which encourage the model to be accurate on both tasks for each example. In this study we apply the proposed GSN to the distractor setting of the data set.

Since each HotpotQA example comes with 10 documents, 8 of which are distractors and only the remaining 2 useful for answering the question, we use only the 2 gold documents as the context of each question, to focus on comparing GNN and GSN. The 2 gold documents contain multiple sentences: some are annotated as supporting sentences, and the answer span resides in one of the sentences. We concatenate the 2 gold documents as in Xiao et al. (2019); Tu et al. (2019a) and use BERT (base uncased) Devlin et al. (2019) to encode the "[CLS]+question+[SEP]+context+[SEP]" input. A sentence extractor is applied to the BERT output to obtain the sequential output of each sentence, using pre-calculated sentence start and end indices.

To build a graph over these sentences, we extract named entities (NEs) and noun phrases (NPs) from the sentences and the question, and connect two sentences if 1) they come from the same document; 2) they come from different documents but share the same NEs or NPs; or 3) they come from different documents but both contain one or more NEs or NPs that appear in the question. We treat these three types of edges differently, as in Equation 7. We then apply the proposed GSN to the built graph to obtain an updated sequential representation of each node. Finally, all sentences can be re-concatenated for the span prediction task, predicting the start and end indices, or summarized into fixed-dimensional vectors for supporting sentence prediction. The former task is optimized with a cross entropy (CE) loss, the latter with a binary CE loss. We also jointly optimize the two tasks with a weighted sum of the two loss terms. Since the point of our paper is to show the efficacy of the proposed GSN model, we only report results on the development set; getting numbers on the test set requires several other modules Xiao et al. (2019); Tu et al. (2019a) that are unrelated to our proposed GSN.
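The three edge-building rules can be sketched as below. This is our own illustrative code: the priority given to rule 2 over rule 3 for cross-document sentence pairs is an assumption (the text does not specify a tie-break), as is the toy input format of pre-extracted NE/NP sets.

```python
def build_sentence_graph(sentences, question_phrases):
    """Connect sentence nodes with the three edge types described above.
    sentences: list of (doc_id, phrases) pairs, where phrases is the set of
    NEs/NPs found in that sentence; question_phrases: NEs/NPs in the question.
    Returns a dict mapping each relation type to a set of undirected (i, j) edges."""
    edges = {"same-doc": set(), "shared-phrase": set(), "via-question": set()}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            (doc_i, ph_i), (doc_j, ph_j) = sentences[i], sentences[j]
            if doc_i == doc_j:                       # rule 1: same document
                edges["same-doc"].add((i, j))
            elif ph_i & ph_j:                        # rule 2: shared NE/NP
                edges["shared-phrase"].add((i, j))
            elif (ph_i & question_phrases) and (ph_j & question_phrases):
                edges["via-question"].add((i, j))    # rule 3: linked via question
    return edges

# Toy example with hypothetical documents d1-d3 and their extracted phrases.
sents = [("d1", {"WINNER"}), ("d1", {"album"}),
         ("d2", {"WINNER", "YG Entertainment"}), ("d2", {"2013"}),
         ("d3", {"South Korean boy group"})]
graph = build_sentence_graph(sents, {"WINNER", "South Korean boy group"})
```

Keeping the three edge sets separate matches the multi-relational message passing, where each relation type gets its own parametrized aggregation function.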

4.2 Results on HotpotQA

4.2.1 Experimental Settings

We present results on the HotpotQA data set in three experimental settings, to show that the proposed GSN performs better than baseline models on different tasks. In the first setting, we use a multi-relational GSN model on top of the BERT sequential output to only predict supporting sentences from the context. We compare it with a model based on multi-relational GCN Schlichtkrull et al. (2018) over early-summarized feature vectors to learn structure-aware sentence representations, which has been employed in previous studies on multi-hop MRC De Cao et al. (2019); Xiao et al. (2019); Tu et al. (2019a).

In the second setting, the multi-relational GSN model is applied on top of the BERT sequential output to only predict the answer span, which is a sequential labeling task on the output of the GSN model. Note that standard GNN models cannot perform this task, because the sentences are summarized into vectors and there is no way to make position (token) level predictions from GNN output. Thus, the baseline model we compare with directly classifies the tokens in the BERT sequential output.

In the third setting, we train the GSN model by jointly predicting the answer span at token level and the supporting sentences at sentence level. We compare with a baseline that jointly trains the answer span prediction and supporting sentence prediction baseline models.

All results are the average of five runs with different random seeds, and we also show the standard deviation of the numbers. Please refer to supplementary materials for details about implementation and hyperparameter settings.

4.2.2 Results

We report the results using the best hyperparameters for each experimental setting. First, in Table 1 we show the results of the baseline GCN-based model and the proposed GSN-based model for the answer-prediction-only and supporting-sentence-prediction-only tasks in terms of EM and F1 scores. The results show that the proposed GSN-based models perform better on both tasks with strong statistical significance compared to the baseline GCN-based models. The improvement in EM is slightly larger, indicating that GSN-based models are better than the baselines at finding the complete answer span or supporting sentences.

Table 2 shows the results when we predict both the answer span and the supporting sentences from the GSN output. Compared to the baseline model, the improvement is larger than for models trained on a single task, especially for supporting sentence prediction. With joint training, performance on answer span prediction drops while performance on supporting sentence prediction increases. We observed better joint EM and F1 scores with joint training than with separate training, for both the baseline and the GSN-based models. Thus, joint training still boosts overall performance, because jointly trained models must find the correct answer and the supporting sentences of a question simultaneously.

         ANS-only               SUP-only
         EM         F1          EM         F1
baseline 63.87±0.16 77.69±0.16  62.14±0.16 88.94±0.07
GSN      64.39±0.06 78.27±0.10  62.96±0.14 89.29±0.07
Table 1: Results comparison with the average and standard deviation of five runs. "ANS-only" and "SUP-only" indicate the model is only trained on the two separate tasks.
         ANS                    SUP                    JOINT
         EM         F1          EM         F1          EM         F1
baseline 62.99±0.16 76.90±0.31  61.35±0.17 88.73±0.09  41.76±0.40 69.64±0.28
GSN      63.56±0.31 77.26±0.11  63.26±0.16 89.35±0.04  43.51±0.27 70.43±0.14
Table 2: Results comparison with the average and standard deviation of five runs. "ANS", "SUP" and "JOINT" indicate the jointly trained models' performance in terms of measurements on answer span prediction, supporting sentence prediction and the joint tasks.

4.2.3 Analysis

         ANS-only       SUP-only       JOINT
         EM      F1     EM      F1     EM      F1
1 layer  64.38   78.24  62.36   89.09  42.70   70.23
2 layers 64.29   78.06  62.38   89.21  43.81   70.56
3 layers 64.23   77.96  63.07   89.37  43.36   70.24
Table 3: Effect of the number of GSN layers on performance.
             ANS            SUP            JOINT
             EM      F1     EM      F1     EM      F1
mean-pooling 63.47   77.04  61.93   89.03  42.61   70.01
max-pooling  63.92   77.39  63.36   89.39  43.81   70.56
Table 4: Results comparison between mean-pooling and max-pooling combination functions.

Effect of the number of layers: we investigate the influence of the number of GSN layers (hops) on performance. We vary the number of layers from 1 to 3 and record the same set of measurements in all three experimental settings. The results of the best random seed are presented in Table 3; we only show the joint measurements for the joint training experiment.

The results show that the three tasks with different training objectives exhibit quite different performance patterns with respect to the number of GSN layers: the answer prediction task achieves its best result with a 1-layer GSN (still better than without GSN, as shown in Table 1), while the supporting sentence prediction task requires a 3-layer GSN. When jointly training both tasks, the 2-layer GSN gives the best performance. This pattern is reasonable: the powerful BERT based encoder likely learns good token-level contextual representations, which benefit the answer prediction task (also at token level), so less graph based reasoning is required. Similarly, some recent studies Min et al. (2019) found that a considerable portion of the questions in HotpotQA can be answered without multiple hops. However, this is not the case for supporting sentence prediction, which requires the model to find sentences that may be far from each other in the context. In contrast, our proposed GSN is well suited to model the relational information among sentences, and we thus observe that more layers give better results for supporting sentence prediction.

Effect of the combination function: we introduced two choices of combination function in Equation 6. The results in Section 4.2.2 were all obtained with the max-pooling based combination function, as we found it to be consistently better than the mean-pooling alternative. To demonstrate this, we compare the two only for the last experimental setting, the joint training strategy with the best random seeds. Table 4 gives the details of the comparison.

Indication: as discussed previously, the two tasks on the HotpotQA data set can be regarded as a sequential labeling task (answer prediction) and a node classification task (supporting sentence prediction) respectively. Through experiments with different settings, we have shown that our proposed GSN model can 1) handle sequential labeling tasks that require reasoning over context, which existing GNN based models are unable to tackle; and 2) attain better performance than GNN based models on node classification tasks with sequences as nodes.

4.3 FEVER data set

The FEVER data set is provided by the FEVER shared task 1.0. The goal of the FEVER shared task is to develop automatic methods to extract evidence from Wikipedia and verify human-generated claims given this evidence. In this study, we focus on the latter task: fact verification. We use the same evidence extraction output as the baseline system Zhou et al. (2019). The resulting data set has a claim and multiple evidence sentences per sample, and the model needs to predict whether the evidence supports or refutes the claim, or whether there is not enough information to decide. In total, there are about 145K samples in the training set, and 20K samples in the development and test sets respectively. The baseline system Zhou et al. (2019) employs BERT to encode each claim-evidence pair and proposes a GAT based evidence aggregation model to exploit the relational information among multiple pieces of evidence. A graph is built connecting nodes that represent the encoded embedding of each claim-evidence pair, and fact verification becomes a graph classification task over this graph. An attention based read-out layer produces a graph feature vector, which is sent to a classifier to predict the target.

In our experiments, we follow exactly the same data preprocessing scripts as the baseline system. Our model design differs from the baseline system in the following aspects: 1) the baseline system trains the BERT-based sentence encoder and the GAT model separately, while we train the two parts together. 2) For joint training, to save GPU memory, we concatenate all evidence sentences as the context, which is then paired with the claim and sent to BERT. We employ an attention based pooling strategy to get each sentence embedding from the BERT output, given the start and end positions of each sentence in the context. We show later that these two modifications give slightly better results than the baseline system. 3) The GAT based evidence aggregator is replaced with our proposed GSN model. With GSN, there is no need for attention based pooling over the BERT output to obtain sentence embeddings; instead we directly input the sequential outputs of all sentences to the GSN model. After evidence aggregation, a two-step graph embedding extraction is applied: the first step converts GSN's output for each sentence into a vector, and the second step converts all sentence embeddings into a single vector to predict the target. We use attentive pooling for both steps. For all models, the training objective is the CE loss. Please refer to the supplementary materials for details about hyperparameter settings.
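The two-step attentive read-out can be sketched as follows. This is illustrative NumPy code of our own; in the actual system the scoring parameters are learned, whereas here `w_tok` and `w_sent` are fixed vectors for demonstration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(seq, w):
    """Attention-weighted sum of a (length, d) sequence, scored by vector w."""
    return softmax(seq @ w) @ seq

def graph_readout(node_seqs, w_tok, w_sent):
    """Two-step read-out: pool each sentence's GSN output into a vector,
    then pool the sentence vectors into one graph vector for classification."""
    sent_vecs = np.stack([attentive_pool(s, w_tok) for s in node_seqs])
    return attentive_pool(sent_vecs, w_sent)

seqs = [np.ones((4, 3)), np.zeros((2, 3))]     # two evidence sentences
g_vec = graph_readout(seqs, np.ones(3), np.ones(3))
```

The resulting graph vector would then be fed to a linear classifier over the three verdict labels.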

4.4 Results on FEVER data set

For the FEVER data set, the test set is blind and predictions need to be submitted to codalab for evaluation. The measurements are label accuracy (ACC), which measures the accuracy of fact verification, and the official FEVER score, which measures label accuracy conditioned on providing at least one complete set of evidence. We report our best results on the development set and their corresponding numbers on the test set, and compare our proposed GSN-based model with the numbers reported in the baseline paper Zhou et al. (2019) and with our implementation of the baseline system.

Table 5 shows the results of the three systems on both the FEVER development and test sets. Our re-implementation of the baseline system gets slightly better numbers than those reported by Zhou et al. (2019), and our proposed GSN based system improves over the baseline system by more than 1% in terms of both ACC and FEVER score. Since the fact verification task can be regarded as a graph classification task, this further demonstrates that the proposed GSN achieves better performance than GNN based models when using sequences on graph nodes.

                   dev            test
                   ACC    FEVER   ACC    FEVER
Zhou et al. (2019) 73.67  68.69   71.01  65.64
baseline (ours)    73.72  69.26   70.80  65.88
GSN                74.89  70.51   72.00  67.13
Table 5: Results on the FEVER development and test sets.

4.5 Visualization of attention in GSNs

Figure 2: Heatmap of an attention matrix in the BiDAF based co-attention between two sentences. The “[UNK]” token is caused by Hangul characters.

To illustrate how the co-attention based message passing algorithm works, Figure 2 shows the heatmap of an attention matrix (matrix A in Algorithm 1) between two nodes of the graph built from a sample in the HotpotQA development set. The question of this sample is "2014 S/S is the debut album of a South Korean boy group that was formed by who?". The current node is "2014 S/S is the debut album of South Korean group WINNER." (after tokenization, as shown on the y-axis of Figure 2), and its neighbor is "Winner (Hangul: 위너), often stylized as WINNER, is a South Korean boy group formed in 2013 by YG Entertainment and debuted in 2014." (after tokenization, as shown on the x-axis of Figure 2). The two sentences come from different documents. We apply softmax along the sequence direction of the neighbor node to inspect the attention pattern of each token in the current node over the tokens in the neighbor node. Almost all tokens in the current node assign high attention weights to the tokens "y" and "entertainment", which are the start and end positions of the answer span. Meanwhile, "winner" and "south korean boy group" are also attended to by tokens in the current node, because they act as the bridging entities leading to the final answer "YG Entertainment". This figure clearly shows that our GSN-based models can find multiple pieces of useful information with message passing over sequences. We include more visualizations in the supplementary materials.

5 Conclusion

This paper proposes the Graph Sequential Network, a novel neural architecture that facilitates reasoning over graphs with sequential data on the nodes. We develop a new message passing algorithm based on co-attention between the sequences on graph nodes. This scheme avoids the information loss inherent in the pooling based early summarization of existing GNN based models and improves reasoning ability at the sentence level. Through experiments on HotpotQA and FEVER, both of which require multi-hop reasoning, we show that our proposed GSN attains better performance than existing GNNs on different types of tasks. For future work, we would like to apply GSN to other NLP applications that require complex reasoning.