1 Introduction
Graph neural networks (GNNs) process graphs and map each node to an embedding vector zhang2018graph ; wu2019comprehensive . These node embeddings can be directly used for node-level applications, such as node classification kipf2017semi and link prediction schutt2017schnet . In addition, they can be used to learn the graph representation vector with graph pooling ying2018hierarchical ; zhang2018end ; lee2019self ; yuan2020structpool , in order to fit graph-level tasks yanardag2015deep . Many variants of GNNs have been proposed, such as ChebNets defferrard2016convolutional , GCNs kipf2017semi , GraphSAGE hamilton2017inductive , GATs velivckovic2018graph , LGCN gao2018large , and GINs xu2019powerful . Their advantages have been shown on various graph datasets and tasks errica2019fair . However, these GNNs share a multi-layer local aggregation framework, which is similar to convolutional neural networks (CNNs) lecun1998gradient on grid-like data such as images and texts.
In recent years, the importance of non-local aggregation has been demonstrated in many applications in the fields of computer vision wang2018non ; wang2020non and natural language processing vaswani2017attention . In particular, the attention mechanism has been widely explored to achieve non-local aggregation and capture long-range dependencies from distant locations. Basically, the attention mechanism measures the similarity between every pair of locations and enables information to be communicated among distant but similar locations. In terms of graphs, non-local aggregation is also crucial for disassortative graphs, while previous studies of GNNs focus on assortative graph datasets (Section 2.2). The recently proposed Geom-GCN pei2020geom attempts to capture long-range dependencies in disassortative graphs. It contains an attention-like step that computes the Euclidean distance between every pair of nodes. However, this step is computationally prohibitive for large-scale graphs, as the computational complexity is quadratic in the number of nodes. In addition, Geom-GCN employs pre-trained node embeddings tenenbaum2000global ; nickel2017poincare ; ribeiro2017struc2vec that are not task-specific, limiting its effectiveness and flexibility.
In this work, we propose a simple yet effective non-local aggregation framework for GNNs. At the heart of the framework lies an efficient attention-guided sorting, which enables non-local aggregation through classic local aggregation operators in general deep learning. The proposed framework can be flexibly used to augment common GNNs with low computational costs. Based on the framework, we build various efficient non-local GNNs. In addition, we perform detailed analysis on existing disassortative graph datasets and apply different non-local GNNs accordingly. Experimental results show that our non-local GNNs significantly outperform previous state-of-the-art methods on node classification tasks on six benchmark datasets of disassortative graphs.
2 Background and Related Work
2.1 Graph Neural Networks
We focus on learning the embedding vector for each node through graph neural networks (GNNs). Most existing GNNs are inspired by the success of convolutional neural networks (CNNs) lecun1998gradient and follow a local aggregation framework. In general, each layer of GNNs scans every node in the graph and aggregates local information from directly connected nodes, i.e., the 1-hop neighbors.
Specifically, a common layer of GNNs performs a two-step processing similar to the depthwise separable convolution chollet2017xception : spatial aggregation and feature transformation. The first step updates each node embedding using the embedding vectors of spatially neighboring nodes. For example, GCNs kipf2017semi and GATs velivckovic2018graph compute a weighted sum of node embeddings within the 1-hop neighborhood, where the weights come from the degrees of nodes and the interactions between nodes, respectively. GraphSAGE hamilton2017inductive applies max pooling, while GINs xu2019powerful simply sum the node embeddings. The feature transformation step is similar to a 1x1 convolution, where each node embedding vector is mapped into a new feature space through a shared linear transformation kipf2017semi ; hamilton2017inductive ; velivckovic2018graph or multi-layer perceptron (MLP) xu2019powerful . Different from these studies, LGCN gao2018large directly applies the regular convolution through top-k ranking.
Nevertheless, each layer of these GNNs only aggregates local information within the 1-hop neighborhood. While stacking multiple layers can theoretically enable communication between nodes across the multi-hop neighborhood, the aggregation is essentially local. In addition, deep GNNs usually suffer from the over-smoothing problem xu2018representation ; li2018deeper ; chen2019measuring .
2.2 Assortative and Disassortative Graphs
There are many kinds of graphs in the literature, such as citation networks kipf2017semi , community networks chen2019measuring , co-occurrence networks tang2009social , and webpage linking networks rozemberczki2019multi . We focus on graph datasets corresponding to node classification tasks. In particular, we categorize graph datasets into assortative and disassortative ones newman2002assortative ; ribeiro2017struc2vec according to the node homophily in terms of labels, i.e., how likely nodes with the same label are near each other in the graph.
Assortative graphs refer to those with high node homophily. Common assortative graph datasets are citation networks and community networks. On the other hand, graphs in disassortative graph datasets contain more nodes that have the same label but are distant from each other. Example disassortative graph datasets are co-occurrence networks and webpage linking networks.
As introduced above, most existing GNNs perform local aggregation only and achieve good performance on assortative graphs kipf2017semi ; hamilton2017inductive ; velivckovic2018graph ; gao2018large . However, they may fail on disassortative graphs, where informative nodes in the same class tend to be out of the local multi-hop neighborhood and non-local aggregation is needed. Thus, in this work, we explore non-local GNNs.
2.3 Attention Mechanism
The attention mechanism vaswani2017attention has been widely used in GNNs velivckovic2018graph ; gao2019graph ; knyazev2019understanding as well as in other deep learning models yang2016hierarchical ; wang2018non ; wang2020non . A typical attention mechanism takes three groups of vectors as inputs, namely the query vector q, the key vectors k_1, ..., k_m, and the value vectors v_1, ..., v_m. Note that the key and value vectors have a one-to-one correspondence and can sometimes be the same. The attention mechanism computes the output vector as

\hat{u} = \sum_{i=1}^{m} \frac{\exp\left(f(q, k_i)\right)}{\sum_{j=1}^{m} \exp\left(f(q, k_j)\right)} v_i,    (1)

where the f function could be any function that outputs a scalar attention score from the interaction between q and k_i, such as the dot product gao2019graph or even a neural network velivckovic2018graph . The definition of the three groups of input vectors depends on the models and applications.
Notably, existing GNNs usually use the attention mechanism for local aggregation velivckovic2018graph ; gao2019graph . Specifically, when aggregating information for node v, the query vector is the embedding vector of v, while the key and value vectors come from the node embeddings of v's directly connected nodes. This process is iterated for each v in V. It is worth noting that the attention mechanism can be easily extended for non-local aggregation wang2018non ; wang2020non , by letting the key and value vectors correspond to all the nodes in the graph when aggregating information for each node. However, this is computationally prohibitive for large-scale graphs, as iterating it over each node in a graph of n nodes requires O(n^2) time. In this work, we propose a novel non-local aggregation method that only requires O(n log n) time.
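As a concrete sketch, the fully non-local variant described above can be written in a few lines; the dot product is one possible choice of the f function (an assumption here, not a requirement of the mechanism), and the explicit n-by-n score matrix makes the quadratic cost visible:

```python
import numpy as np

def nonlocal_attention(Z):
    """Aggregate non-local information for every node by attending over
    all n nodes at once. The (n, n) score matrix is what makes this
    approach O(n^2) in the number of nodes."""
    scores = Z @ Z.T                                # f(q, k) for every node pair
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over key nodes
    return weights @ Z                              # weighted sum of value vectors
```

Each row of the output is a convex combination of all node embeddings, so every node can receive information from every other node, at quadratic cost.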
3 The Proposed Method
3.1 Non-Local Aggregation with Attention-Guided Sorting
We consider a graph G = (V, E), where V is the set of nodes and E is the set of edges. Each edge e = (u, v) in E connects two nodes, so that E is a subset of V x V. Each node v in V has a corresponding node feature vector x_v. The k-hop neighborhood of a node v refers to the set N_k(v) of nodes that can reach v within k edges in G. For example, the set of v's directly connected nodes is its 1-hop neighborhood N_1(v).
Our proposed non-local aggregation framework is composed of three steps, namely local embedding, attention-guided sorting, and non-local aggregation. In the following, we describe them one by one.
Local Embedding:
Our proposed framework is built upon a local embedding step that extracts local node embeddings from the node feature vectors. The local embedding step can be as simple as

z_v = \mathrm{MLP}(x_v) \in \mathbb{R}^{d}, \quad \forall v \in V,    (2)

where the MLP function refers to a multi-layer perceptron (MLP), and d is the dimension of the local node embedding z_v. Note that the MLP function is shared across all the nodes in the graph. Applying the MLP only takes the node itself into consideration, without aggregating information from the neighborhood.
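A minimal sketch of Equation (2), assuming a 2-layer MLP with ReLU (the depth, activation, and all weight shapes are illustrative, not prescribed by the framework):

```python
import numpy as np

def local_embedding(X, W1, b1, W2, b2):
    """Equation (2): each row of X (one node's feature vector) is mapped
    independently by a shared 2-layer MLP to a d-dimensional local
    embedding; no neighborhood information is aggregated."""
    h = np.maximum(X @ W1 + b1, 0.0)   # hidden layer with ReLU
    return h @ W2 + b2                 # local node embeddings, shape (n, d)
```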
On the other hand, graph neural networks (GNNs) can be used as the local embedding step as well, so that our proposed framework can be easily employed to augment existing GNNs. As introduced in Section 2.1, modern GNNs perform multi-layer local aggregation. Typically, for each node, one layer of a GNN aggregates information from its 1-hop neighborhood. Stacking L such local aggregation layers allows each node to access information that is L hops away. To be specific, the l-th layer of an L-layer GNN can be described as

z_v^{(l)} = \mathrm{TRANSFORM}^{(l)}\left(\mathrm{AGGREGATE}^{(l)}\left(\{z_u^{(l-1)} : u \in N_1(v) \cup \{v\}\}\right)\right),    (3)

where l = 1, 2, ..., L, z_v^{(0)} = x_v, and z_v = z_v^{(L)} represents the local node embedding. The AGGREGATE and TRANSFORM functions represent the spatial aggregation and feature transformation steps introduced in Section 2.1, respectively. With the above framework, GNNs can capture the node feature information from nodes within a local neighborhood as well as the structural information.
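The layer form of Equation (3) can be sketched as follows, with mean aggregation over the 1-hop neighborhood (plus the node itself) standing in for AGGREGATE and a shared linear map with ReLU standing in for TRANSFORM; actual GNNs such as GCN, GAT, and GraphSAGE instantiate these two functions differently:

```python
import numpy as np

def gnn_layer(Z, A, W):
    """One local-aggregation layer in the AGGREGATE/TRANSFORM form of
    Equation (3). A is the (n, n) adjacency matrix, Z the (n, d) node
    embeddings, W a shared transformation matrix."""
    A_hat = A + np.eye(A.shape[0])          # include the node itself (self-loop)
    deg = A_hat.sum(axis=1, keepdims=True)
    Z_agg = (A_hat / deg) @ Z               # mean over the 1-hop neighborhood
    return np.maximum(Z_agg @ W, 0.0)       # shared linear transform + ReLU
```

Stacking L calls of `gnn_layer` lets each node see information up to L hops away, which is exactly the local multi-hop aggregation the text describes.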
When either the MLP or a GNN is used as the local embedding step, the local node embedding z_v only contains local information of a node v. However, z_v can be used to guide non-local aggregation, as distant but informative nodes are likely to have similar node features and local structures. Based on this intuition, we propose the attention-guided sorting to enable the non-local aggregation.
Attention-Guided Sorting:
The basic idea of the attention-guided sorting is to learn an ordering of nodes, where distant but informative nodes are put near each other. Specifically, given the local node embeddings z_v obtained through the local embedding step, we compute one set of attention scores by

s_v = f(c, z_v), \quad \forall v \in V,    (4)

where c is a calibration vector that is randomly initialized and jointly learned during training yang2016hierarchical . In this attention operator, c serves as the query vector and the z_v are the key vectors. In addition, we also treat the z_v as the value vectors. However, unlike the attention mechanism introduced in Section 2.3, we use the attention scores to sort the value vectors instead of computing a weighted sum to aggregate them. Note that originally there is no ordering among nodes in a graph. To be specific, as s_v and z_v have a one-to-one correspondence through Equation (4), sorting the attention scores in non-decreasing order into (s_{v_1}, s_{v_2}, ..., s_{v_n}) provides an ordering among nodes, where n is the number of nodes in the graph. The resulting sequence of local node embeddings can be denoted as (z_{v_1}, z_{v_2}, ..., z_{v_n}).
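The attention-guided sorting reduces to a projection followed by an argsort; the sketch below assumes a dot-product f, the simple choice later adopted in Equation (7):

```python
import numpy as np

def attention_guided_sort(Z, c):
    """Equation (4) with a dot-product f: project every local node
    embedding onto the 1-dimensional space defined by the calibration
    vector c, then sort nodes by the resulting attention scores."""
    scores = Z @ c                     # one score per node; O(n) overall
    order = np.argsort(scores)         # non-decreasing scores; O(n log n)
    return order, scores[order], Z[order]
```

The returned `order` is the learned node ordering; `Z[order]` is the sorted sequence of local embeddings that the subsequent convolution consumes.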
The attention process in Equation (4) can also be understood as a projection of local node embeddings onto a 1-dimensional space. The projection depends on the concrete f function and the calibration vector c. As indicated by its name, the calibration vector is used to calibrate the 1-dimensional space, in order to push distant but informative nodes close to each other in this space. This goal is fulfilled through the following non-local aggregation step and the training of the calibration vector c, as demonstrated below.
Non-Local Aggregation:
We point out that, with the attention-guided sorting, non-local aggregation can be achieved by convolution, which is the most common local aggregation operator in deep learning. Specifically, given the sorted sequence of local node embeddings (z_{v_1}, z_{v_2}, ..., z_{v_n}), we compute

(\hat{z}_{v_1}, \hat{z}_{v_2}, \ldots, \hat{z}_{v_n}) = \mathrm{CONV}(z_{v_1}, z_{v_2}, \ldots, z_{v_n}),    (5)

where the CONV function represents a 1D convolution with appropriate padding. Note that the CONV function can be replaced by a 1D convolutional neural network as long as the number of input and output vectors remains the same.
To see how the CONV function performs non-local aggregation with the attention-guided sorting, we take an example where the CONV function is a 1D convolution of kernel size k. In this case, \hat{z}_{v_i} is computed from (z_{v_{i-(k-1)/2}}, ..., z_{v_{i+(k-1)/2}}), corresponding to the receptive field of the CONV function. As a result, if the attention-guided sorting leads to (z_{v_{i-(k-1)/2}}, ..., z_{v_{i+(k-1)/2}}) containing nodes that are distant from but informative to v_i, the output \hat{z}_{v_i} aggregates non-local information. Another view is to consider the attention-guided sorting as reconnecting nodes in the graph, where (v_{i-(k-1)/2}, ..., v_{i+(k-1)/2}) can be treated as the 1-hop neighborhood of v_i. After the CONV function, \hat{z}_v and z_v are concatenated as the input to a classifier to predict the label of the corresponding node, so that both non-local and local dependencies can be captured. In order to enable the end-to-end training of the calibration vector c, we modify Equation (5) into

(\hat{z}_{v_1}, \hat{z}_{v_2}, \ldots, \hat{z}_{v_n}) = \mathrm{CONV}(s_{v_1} z_{v_1}, s_{v_2} z_{v_2}, \ldots, s_{v_n} z_{v_n}),    (6)

where we multiply each attention score with the corresponding local node embedding. As a result, the calibration vector c receives gradients through the attention scores during training.
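A sketch of Equation (6), using a single convolution kernel shared across feature channels for brevity; this single-channel choice is a simplification of the learned multi-channel 1D CNN the framework allows:

```python
import numpy as np

def nonlocal_aggregate(Z_sorted, s_sorted, kernel):
    """Equation (6): scale each sorted local embedding by its attention
    score, then slide a 1D convolution of kernel size k along the sorted
    sequence with "same" padding, so the number of output vectors equals
    the number of input vectors."""
    X = Z_sorted * s_sorted[:, None]            # score-scaled embeddings
    k = len(kernel)
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))        # "same" padding keeps n outputs
    out = np.empty_like(X)
    for i in range(len(X)):
        out[i] = kernel @ Xp[i:i + k]           # aggregate a k-node window
    return out
```

Each output vector mixes the k consecutive nodes of the sorted sequence, which is exactly the "1-hop neighborhood of the reconnected graph" view described above.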
The remaining question is how to make sure that the attention-guided sorting pushes distant but informative nodes together. The short answer is that it is not necessary to guarantee this, as the need for non-local aggregation depends on the concrete graphs. In fact, our proposed framework grants GNNs the ability of non-local aggregation but lets the end-to-end training process determine whether to use non-local information. The back-propagation from the supervised loss will tune the calibration vector c and encourage \hat{z}_v to capture useful information that is not encoded by z_v. In the case of disassortative graphs, \hat{z}_v usually needs to aggregate information from distant but informative nodes. Hence, the calibration vector tends to arrange the attention-guided sorting to put distant but informative nodes together. On the other hand, in assortative graphs, nodes within the local neighborhood are usually much more informative than distant nodes. In this situation, \hat{z}_v may simply perform local aggregation that is similar to GNNs.
In Section 4, we demonstrate the effectiveness of our proposed non-local aggregation framework on six disassortative graph datasets. In particular, we achieve state-of-the-art performance on all the datasets with significant improvements over previous methods.
3.2 Time Complexity Analysis
We perform a theoretical analysis of the time complexity of our proposed framework. As discussed in Section 2.3, using the attention mechanism vaswani2017attention ; wang2018non ; wang2020non to achieve non-local aggregation requires O(n^2) time for a graph of n nodes. Essentially, the quadratic time complexity is due to the fact that the f function needs to be computed between every pair of nodes. In particular, the recently proposed Geom-GCN pei2020geom contains a similar non-local aggregation step. For each node v, Geom-GCN finds the set of nodes from which the Euclidean distance to v is less than a predefined number, so that the Euclidean distance between every pair of nodes needs to be computed. As the computation of the Euclidean distance between two nodes can be understood as the f function, Geom-GCN has at least O(n^2) time complexity.
In contrast, our proposed non-local aggregation framework requires only O(n log n) time. To see this, note that the f function in Equation (4) only needs to be computed once for each node, instead of once for each pair of nodes. As a result, computing the attention scores only takes O(n) time. Therefore, the time complexity of sorting, i.e. O(n log n), dominates the total time complexity of our proposed framework. In Section 4.6, we compare the real running time on different datasets among common GNNs, Geom-GCN, and our non-local GNNs as introduced in the next section.
3.3 Efficient Non-Local Graph Neural Networks
We apply our proposed non-local aggregation framework to build efficient non-local GNNs. Recall that our proposed framework starts with the local embedding step, followed by the attention-guided sorting and the non-local aggregation step.
In particular, the local embedding step can be implemented by either MLP or common GNNs, such as GCNs kipf2017semi or GATs velivckovic2018graph . MLP extracts the local node embedding only from the node feature vector and excludes the information from nodes within the local neighborhood. This property can be helpful on some disassortative graphs, where nodes within the local neighborhood provide more noise than useful information. On other disassortative graphs, informative nodes are located in both the local neighborhood and distant locations. In this case, GNNs are more suitable as the local embedding step. Depending on the disassortative graphs at hand, we build different non-local GNNs with either MLP or GNNs as the local embedding step. In Section 4.3, we show that these two categories of disassortative graphs can be distinguished through simple experiments, where we apply different non-local GNNs accordingly. Specifically, the number of layers is set to 2 for both MLP and GNNs in our non-local GNNs.
In terms of the attention-guided sorting, we only need to specify the f function in Equation (4). In order to make it as efficient as possible, we choose the simplest f function, defined as

f(c, z_v) = c^{\top} z_v,    (7)

where c is part of the training parameters, as described in Section 3.1.
With the attention-guided sorting, we can implement the non-local aggregation step through convolution, as explained in Section 3.1 and shown in Equation (6). Specifically, we set the CONV function to be a 2-layer convolutional neural network composed of two 1D convolutions, whose kernel size is set depending on the datasets. The activation function between the two layers is ReLU krizhevsky2012imagenet .
Finally, we use a linear classifier that takes the concatenation of \hat{z}_v and z_v as inputs and makes the prediction for the corresponding node. Depending on the local embedding step, we build three efficient non-local GNNs, namely non-local MLP (NL-MLP), non-local GCN (NL-GCN), and non-local GAT (NL-GAT). The models can be trained end-to-end with the classification loss.
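Putting the three steps together, a forward pass of an NL-MLP-style model can be sketched as below; all weight shapes, the random initialization, and the single-channel convolution are illustrative assumptions rather than the exact trained configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def nl_mlp_forward(X, d=8, k=3, n_classes=3):
    """Forward pass only: 2-layer MLP local embedding (Eq. 2),
    attention-guided sorting via a calibration vector (Eqs. 4, 7),
    score-scaled 1D convolution (Eq. 6), then a linear classifier on the
    concatenation of non-local and local embeddings."""
    n, f = X.shape
    W1 = 0.1 * rng.normal(size=(f, d))
    W2 = 0.1 * rng.normal(size=(d, d))
    c = rng.normal(size=d)                          # calibration vector
    kernel = rng.normal(size=k)
    Wc = 0.1 * rng.normal(size=(2 * d, n_classes))  # linear classifier

    Z = np.maximum(X @ W1, 0.0) @ W2                # local embedding
    order = np.argsort(Z @ c)                       # attention-guided sorting
    S = (Z @ c)[order][:, None] * Z[order]          # score-scaled sorted sequence
    Sp = np.pad(S, ((k // 2, k // 2), (0, 0)))      # "same" padding
    Z_hat_sorted = np.stack([kernel @ Sp[i:i + k] for i in range(n)])
    Z_hat = np.empty_like(Z_hat_sorted)
    Z_hat[order] = Z_hat_sorted                     # restore original node order
    return np.concatenate([Z_hat, Z], axis=1) @ Wc  # per-node class logits
```

Swapping the MLP for a 2-layer GCN or GAT in the first step yields the NL-GCN and NL-GAT variants, with everything after the local embedding unchanged.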
4 Experiments
4.1 Datasets
We perform experiments on six disassortative graph datasets rozemberczki2019multi ; tang2009social ; pei2020geom (Chameleon, Squirrel, Actor, Cornell, Texas, Wisconsin) and three assortative graph datasets kipf2017semi (Cora, Citeseer, Pubmed). These datasets are commonly used to evaluate GNNs on node classification tasks kipf2017semi ; velivckovic2018graph ; gao2018large ; pei2020geom . We provide detailed descriptions of the disassortative graph datasets in Section A in the supplementary.
In order to distinguish assortative and disassortative graph datasets, Pei et al. pei2020geom propose a metric beta to measure the homophily of a graph G, defined as

\beta = \frac{1}{|V|} \sum_{v \in V} \frac{\text{Number of } v\text{'s directly connected nodes that have the same label as } v}{\text{Number of } v\text{'s directly connected nodes}}.    (8)

Intuitively, a large beta indicates an assortative graph, and vice versa. The beta and other statistics are summarized in Table 1.
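Equation (8) can be computed directly from an edge list; skipping isolated nodes (which leave the per-node fraction undefined) is an implementation choice here, not something the metric specifies:

```python
from collections import defaultdict

def homophily(edges, labels):
    """Equation (8): for each node, the fraction of its directly connected
    neighbors sharing its label, averaged over the nodes of the graph."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    fractions = [
        sum(labels[u] == labels[v] for u in nbrs[v]) / len(nbrs[v])
        for v in labels if nbrs[v]
    ]
    return sum(fractions) / len(fractions)
```

A fully assortative triangle (all labels equal) yields beta = 1, while a single edge between differently labeled nodes yields beta = 0, matching the intuition that large beta means assortative.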
In our experiments, we focus on comparing the model performance on disassortative graph datasets, in order to demonstrate the effectiveness of our non-local aggregation framework. The performances on assortative graph datasets are provided for reference, indicating that the proposed framework does not hurt performance when non-local aggregation is not strongly desired.
Table 1: Statistics of the datasets.

             Assortative                     Disassortative
Datasets     Cora  Citeseer  Pubmed  Chameleon  Squirrel  Actor  Cornell  Texas  Wisconsin
beta
#Nodes
#Edges
#Features
#Classes
4.2 Baselines
We compare our proposed non-local MLP (NL-MLP), non-local GCN (NL-GCN), and non-local GAT (NL-GAT) with various baselines:

MLP is the simplest deep learning model. It makes predictions solely based on the node feature vectors, without aggregating any local or non-local information.

GCN kipf2017semi and GAT velivckovic2018graph are the most common GNNs. As introduced in Section 2.1, they only perform local aggregation.

Geom-GCN pei2020geom is a recently proposed GNN that can capture long-range dependencies. It is the current state-of-the-art model on several disassortative graph datasets. Geom-GCN requires the use of different node embedding methods, such as Isomap tenenbaum2000global , Poincaré nickel2017poincare , and struc2vec ribeiro2017struc2vec . We simply report the best results from pei2020geom for Geom-GCN and the following two variants, without specifying the node embedding method.

Geom-GCN-g pei2020geom is a variant of Geom-GCN that performs local aggregation only. It is similar to common GNNs.

Geom-GCN-s pei2020geom is a variant of Geom-GCN that does not force local aggregation. Its designed functionality is similar to that of our NL-MLP.
We implement MLP, GCN, GAT, and our methods using PyTorch paszke2017automatic and PyTorch Geometric fey2019fast . As has been discussed (https://openreview.net/forum?id=S1e2agrFvS&noteId=8tGKV1oSzCr), in fair settings, the results of GCN and GAT differ from those in pei2020geom . On each dataset, we follow pei2020geom and randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing. The experiments are run 10 times with different random splits, and the average test accuracy over these 10 runs is reported. Testing is performed when the validation accuracy reaches its maximum on each run. Apart from the details specified in Section 3.3, we tune the following hyperparameters individually for our proposed models: (1) the number of hidden units in {16, 48, 96}, (2) dropout rate in {0, 0.5, 0.8}, (3) weight decay in {0, 5e-4, 5e-5, 5e-6}, and (4) learning rate in {0.01, 0.05}.

4.3 Analysis of Disassortative Graph Datasets
As discussed in Section 3.3, the disassortative graph datasets can be divided into two categories. In disassortative graphs belonging to the first category, nodes within the local neighborhood provide more noise than useful information; therefore, local aggregation should be avoided in models on such graphs. As for the second category, informative nodes are located in both the local neighborhood and distant locations. Intuitively, a graph with a lower beta is more likely to be in the first category. However, beta alone is not an accurate way to determine the category.
Knowing the exact category of a disassortative graph is crucial, as we need to apply non-local GNNs accordingly. As analyzed above, the key difference lies in whether local aggregation is useful. Based on this insight, we can distinguish the two categories of disassortative graph datasets by comparing the performance of MLP and common GNNs (GCN, GAT) on each of the six disassortative graph datasets.
The results are summarized in Table 2. We can see that Actor, Cornell, Texas, and Wisconsin fall into the first category, while Chameleon and Squirrel belong to the second category. We add the performance on assortative graph datasets for reference, where local aggregation is effective so that GNNs tend to outperform MLP.
Table 2: Node classification accuracy (%) of MLP, GCN, and GAT.

             Assortative                     Disassortative
Datasets     Cora  Citeseer  Pubmed  Chameleon  Squirrel  Actor  Cornell  Texas  Wisconsin
MLP          35.1  81.6  81.3  84.9
GCN          88.4  67.6  54.9
GAT          88.4  76.1
4.4 Comparisons with Baselines
According to the insights from Section 4.3, we apply different non-local GNNs according to the category of the disassortative graph datasets and make comparisons with the corresponding baselines.
Table 3: Node classification accuracy (%) on the first category of disassortative graph datasets.

Datasets     Actor  Cornell  Texas  Wisconsin
MLP
Geom-GCN
Geom-GCN-s
NL-MLP       37.9   84.9     85.4   87.3
Specifically, we employ NL-MLP on Actor, Cornell, Texas, and Wisconsin. The corresponding baselines are MLP, Geom-GCN, and Geom-GCN-s, as Table 2 has shown that GCN and GAT perform much worse than MLP on these datasets, and Geom-GCN-g is similar to GCN and performs worse than Geom-GCN-s, as shown in Section B in the supplementary. The comparison results are reported in Table 3. While Geom-GCN-s is the previous state-of-the-art GNN on these datasets pei2020geom , we find that MLP consistently outperforms it by large margins. In particular, although Geom-GCN-s does not explicitly perform local aggregation, it is still outperformed by MLP. A possible explanation is that Geom-GCN-s uses pre-trained node embeddings, which aggregate information from the local neighborhood implicitly. In contrast, our NL-MLP is built upon MLP with the proposed non-local aggregation framework, which excludes the local noise and collects useful information from non-local informative nodes. NL-MLP sets the new state-of-the-art performance on these disassortative graph datasets.
Table 4: Node classification accuracy (%) on the second category of disassortative graph datasets.

Datasets     Chameleon  Squirrel
GCN
GAT
Geom-GCN
Geom-GCN-g
NL-GCN       70.1       59.0
NL-GAT
On Chameleon and Squirrel, which belong to the second category of disassortative graph datasets, we apply NL-GCN and NL-GAT accordingly. The baselines are GCN, GAT, Geom-GCN, and Geom-GCN-g. On these datasets, the baselines that explicitly perform local aggregation show advantages over MLP and Geom-GCN-s, as shown in Section B in the supplementary. Table 4 summarizes the comparison results. Our proposed NL-GCN achieves the best performance on both datasets. In addition, it is worth noting that our NL-GCN and NL-GAT are built upon GCN and GAT, respectively. They show improvements over their counterparts, which indicates that the advantages of our proposed non-local aggregation framework generalize to common GNNs.
We provide the results of all the models on all datasets in Section B in the supplementary for reference.
4.5 Analysis of the Attention-Guided Sorting
We analyze the results of the attention-guided sorting in our proposed framework, in order to show that our non-local GNNs indeed perform non-local aggregation.
Suppose the attention-guided sorting leads to the sorted sequence (z_{v_1}, ..., z_{v_n}), which goes through a convolution or CNN into (\hat{z}_{v_1}, ..., \hat{z}_{v_n}). As discussed in Section 3.1, we can consider the sequence as a reconnected graph G', where we treat nodes within the receptive field of \hat{z}_{v_i} as directly connected to v_i, i.e., as v_i's 1-hop neighborhood. The information within this new 1-hop neighborhood will be aggregated. If our non-local GNNs indeed perform non-local aggregation, the homophily beta' of the reconnected graph G' should be larger than the homophily beta of the original graph. Therefore, we compute beta' for each dataset to verify this statement. Following Section 4.4, we apply NL-MLP on Actor, Cornell, Texas, and Wisconsin, and NL-GCN on Chameleon and Squirrel.
Figure 1 compares beta' with beta for each dataset. We can observe that beta' is much larger than beta, indicating that distant but informative nodes are near each other in the reconnected graph G'. We also provide visualizations of the sorted sequence for Cornell and Texas, where nodes with the same label tend to be clustered together. These facts indicate that our non-local GNNs perform non-local aggregation with the attention-guided sorting.
4.6 Efficiency Comparisons
Table 5: Average training time per epoch.

             Chameleon  Squirrel
GCN
GAT
Geom-GCN
NL-GCN
As analyzed in Section 3.2, our proposed non-local aggregation framework is more efficient than previous methods based on the original attention mechanism, such as Geom-GCN pei2020geom . Concretely, our method requires only O(n log n) computation time in contrast to O(n^2). In this section, we compare the real running time to verify our analysis. Specifically, we compare NL-GCN with Geom-GCN as well as GCN and GAT. For Geom-GCN, we use the code provided in pei2020geom . Each model is trained for 500 epochs on each dataset and the average training time per epoch is reported.
The results are shown in Table 5. Although our NL-GCN is built upon GCN, it is only slightly slower than GCN and faster than GAT, showing the efficiency of our non-local aggregation framework. On the other hand, Geom-GCN is significantly slower due to its O(n^2) time complexity.
5 Conclusion
In this work, we propose a simple yet effective non-local aggregation framework for GNNs. The core of the framework is an efficient attention-guided sorting, which enables non-local aggregation through convolution. The proposed framework can be easily used to build non-local GNNs with low computational costs. We perform thorough experiments on node classification tasks to evaluate our proposed method. In particular, we experimentally analyze existing disassortative graph datasets and apply different non-local GNNs accordingly. The results show that our non-local GNNs significantly outperform previous state-of-the-art methods on six benchmark datasets of disassortative graphs, in terms of both accuracy and speed.
Acknowledgments

This work was supported in part by National Science Foundation grants IIS-1908198 and DBI-1922969.
References
 [1] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Proceedings of Neural Information Processing Systems Autodiff Workshop, 2017.

 [2] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
 [3] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
 [4] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 [5] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations, 2020.
 [6] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
 [7] Hongyang Gao and Shuiwang Ji. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 741–749, 2019.
 [8] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1416–1424, 2018.
 [9] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
 [10] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
 [11] Boris Knyazev, Graham W Taylor, and Mohamed Amer. Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems, pages 4204–4214, 2019.
 [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 [14] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In International Conference on Machine Learning, pages 3734–3743, 2019.
 [15] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [16] Mark EJ Newman. Assortative mixing in networks. Physical review letters, 89(20):208701, 2002.
 [17] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.
 [18] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In International Conference on Learning Representations, 2020.
 [19] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 385–394, 2017.
 [20] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. arXiv preprint arXiv:1909.13021, 2019.
 [21] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pages 991–1001, 2017.
 [22] Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 807–816, 2009.
 [23] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 [25] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
 [26] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
 [27] Zhengyang Wang, Na Zou, Dinggang Shen, and Shuiwang Ji. Non-local U-Nets for biomedical image segmentation. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
 [28] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 [29] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
 [30] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5449–5458, 2018.
 [31] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
 [32] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
 [33] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.
 [34] Hao Yuan and Shuiwang Ji. StructPool: Structured graph pooling via conditional random fields. In International Conference on Learning Representations, 2020.
 [35] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [36] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: Algorithms, applications and open challenges. In International Conference on Computational Social Networks, pages 79–91. Springer, 2018.
Appendix A Details of Disassortative Graph Datasets
Here are the details of the disassortative graph datasets used in our experiments:

Chameleon and Squirrel are Wikipedia networks [20], where nodes represent web pages from Wikipedia and edges indicate mutual links between pages. Node feature vectors are bag-of-words representations of informative nouns on the corresponding pages. Each node is labeled with one of five classes according to the average monthly traffic of the corresponding web page.

Actor is an actor co-occurrence network, where nodes denote actors and edges indicate co-occurrence on the same Wikipedia page. It is extracted from the film-director-actor-writer network proposed by Tang et al. [22]. Node feature vectors are bag-of-words representations of keywords in the actors' Wikipedia pages. Each node is labeled with one of five classes according to the topic of the actor's Wikipedia page.

Cornell, Texas, and Wisconsin come from the WebKB dataset collected by Carnegie Mellon University. Nodes represent web pages and edges denote hyperlinks between them. Node feature vectors are bag-of-words representations of the corresponding web pages. Each node is labeled with one of five classes: student, project, course, staff, or faculty.
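What makes these datasets disassortative, in the sense of Newman [16], is their low homophily: adjacent nodes rarely share a label, unlike citation graphs such as Cora. A common way to quantify this is the edge homophily ratio, the fraction of edges joining same-label nodes. The sketch below is illustrative only (not code from the paper); `edge_homophily` is a hypothetical helper operating on a plain edge list.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose endpoints share a label.

    edges: iterable of (u, v) node pairs; labels: dict mapping node -> class.
    Values near 1 indicate an assortative (homophilous) graph; values near
    0 indicate a disassortative graph such as Chameleon or Texas.
    """
    edges = list(edges)
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy 4-node cycle where only 1 of 4 edges joins same-class nodes.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
toy_labels = {0: "a", 1: "b", 2: "b", 3: "c"}
print(edge_homophily(toy_edges, toy_labels))  # 0.25
```

On a strongly homophilous graph the same measure approaches 1.0, which is the regime that purely local aggregation implicitly assumes.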
Appendix B Full Experimental Results
                   Assortative                      Disassortative
Datasets     Cora   Citeseer  Pubmed   Chameleon  Squirrel  Actor  Cornell  Texas  Wisconsin
MLP          –      –         –        –          –         –      –        –      –
GCN          –      –         –        –          –         –      –        –      –
GAT          –      –         –        –          –         –      –        –      –
Geom-GCN     –      –         –        –          –         –      –        –      –
Geom-GCN-g   –      80.6      90.7     –          –         –      –        –      –
Geom-GCN-s   –      –         –        –          –         –      –        –      –
NL-MLP       –      –         –        –          –         37.9   84.9     85.4   87.3
NL-GCN       –      –         –        70.1       59.0      –      –        –      –
NL-GAT       88.5   –         –        –          –         –      –        –      –