Non-Local Graph Neural Networks

05/29/2020 ∙ by Meng Liu, et al. ∙ Texas A&M University

Modern graph neural networks (GNNs) learn node embeddings through multilayer local aggregation and achieve great success in applications on assortative graphs. However, tasks on disassortative graphs usually require non-local aggregation. In this work, we propose a simple yet effective non-local aggregation framework with an efficient attention-guided sorting for GNNs. Based on it, we develop various non-local GNNs. We perform thorough experiments to analyze disassortative graph datasets and evaluate our non-local GNNs. Experimental results demonstrate that our non-local GNNs significantly outperform previous state-of-the-art methods on six benchmark datasets of disassortative graphs, in terms of both model performance and efficiency.


1 Introduction

Graph neural networks (GNNs) process graphs and map each node to an embedding vector zhang2018graph ; wu2019comprehensive . These node embeddings can be directly used for node-level applications, such as node classification kipf2017semi and link prediction schutt2017schnet . In addition, they can be used to learn the graph representation vector with graph pooling ying2018hierarchical ; zhang2018end ; lee2019self ; yuan2020structpool , in order to fit graph-level tasks yanardag2015deep . Many variants of GNNs have been proposed, such as ChebNets defferrard2016convolutional , GCNs kipf2017semi , GraphSAGE hamilton2017inductive , GATs velivckovic2018graph , LGCN gao2018large , and GINs xu2019powerful . Their advantages have been shown on various graph datasets and tasks errica2019fair . However, these GNNs share a multilayer local aggregation framework, which is similar to convolutional neural networks (CNNs) lecun1998gradient on grid-like data such as images and texts.

In recent years, the importance of non-local aggregation has been demonstrated in many applications in the fields of computer vision wang2018non ; wang2020non and natural language processing vaswani2017attention . In particular, the attention mechanism has been widely explored to achieve non-local aggregation and capture long-range dependencies from distant locations. Basically, the attention mechanism measures the similarity between every pair of locations and enables information to be communicated among distant but similar locations. In terms of graphs, non-local aggregation is also crucial for disassortative graphs, while previous studies of GNNs focus on assortative graph datasets (Section 2.2). The recently proposed Geom-GCN pei2020geom seeks to capture long-range dependencies in disassortative graphs. It contains an attention-like step that computes the Euclidean distance between every pair of nodes. However, this step is computationally prohibitive for large-scale graphs, as its computational complexity is quadratic in the number of nodes. In addition, Geom-GCN employs pre-trained node embeddings tenenbaum2000global ; nickel2017poincare ; ribeiro2017struc2vec that are not task-specific, which limits its effectiveness and flexibility.

In this work, we propose a simple yet effective non-local aggregation framework for GNNs. At the heart of the framework lies an efficient attention-guided sorting, which enables non-local aggregation through classic local aggregation operators in general deep learning. The proposed framework can be flexibly used to augment common GNNs with low computational costs. Based on the framework, we build various efficient non-local GNNs. In addition, we perform detailed analysis on existing disassortative graph datasets, and apply different non-local GNNs accordingly. Experimental results show that our non-local GNNs significantly outperform previous state-of-the-art methods on node classification tasks on six benchmark datasets of disassortative graphs.

2 Background and Related Work

2.1 Graph Neural Networks

We focus on learning the embedding vector for each node through graph neural networks (GNNs). Most existing GNNs are inspired by the success of convolutional neural networks (CNNs) lecun1998gradient and follow a local aggregation framework. In general, each layer of GNNs scans every node in the graph and aggregates local information from directly connected nodes, i.e., the 1-hop neighbors.

Specifically, a common GNN layer performs a two-step processing similar to the depthwise separable convolution chollet2017xception : spatial aggregation and feature transformation. The first step updates each node embedding using the embedding vectors of spatially neighboring nodes. For example, GCNs kipf2017semi and GATs velivckovic2018graph compute a weighted sum of node embeddings within the 1-hop neighborhood, where the weights come from the degrees of nodes and the interactions between nodes, respectively. GraphSAGE hamilton2017inductive applies max pooling, while GINs xu2019powerful simply sum the node embeddings. The feature transformation step is similar to the $1 \times 1$ convolution, where each node embedding vector is mapped into a new feature space through a shared linear transformation kipf2017semi ; hamilton2017inductive ; velivckovic2018graph or multilayer perceptron (MLP) xu2019powerful . Different from these studies, LGCN gao2018large directly applies the regular convolution through top-$k$ ranking.

Nevertheless, each layer of these GNNs only aggregates local information within the 1-hop neighborhood. While stacking multiple layers can theoretically enable communication between nodes across the multi-hop neighborhood, the aggregation is essentially local. In addition, deep GNNs usually suffer from the over-smoothing problem xu2018representation ; li2018deeper ; chen2019measuring .

2.2 Assortative and Disassortative Graphs

There are many kinds of graphs in the literature, such as citation networks kipf2017semi , community networks chen2019measuring , co-occurrence networks tang2009social , and webpage linking networks rozemberczki2019multi . We focus on graph datasets corresponding to the node classification tasks. In particular, we categorize graph datasets into assortative and disassortative ones newman2002assortative ; ribeiro2017struc2vec according to the node homophily in terms of labels, i.e., how likely nodes with the same label are near each other in the graph.

Assortative graphs refer to those with a high node homophily. Common assortative graph datasets are citation networks and community networks. On the other hand, graphs in disassortative graph datasets contain more nodes that have the same label but are distant from each other. Example disassortative graph datasets are co-occurrence networks and webpage linking networks.

As introduced above, most existing GNNs perform local aggregation only and achieve good performance on assortative graphs kipf2017semi ; hamilton2017inductive ; velivckovic2018graph ; gao2018large . However, they may fail on disassortative graphs, where informative nodes in the same class tend to be out of the local multi-hop neighborhood and non-local aggregation is needed. Thus, in this work, we explore the non-local GNNs.

2.3 Attention Mechanism

The attention mechanism vaswani2017attention has been widely used in GNNs velivckovic2018graph ; gao2019graph ; knyazev2019understanding as well as other deep learning models yang2016hierarchical ; wang2018non ; wang2020non . A typical attention mechanism takes three groups of vectors as inputs, namely a query vector $q$, key vectors $\{k_i\}$, and value vectors $\{v_i\}$. Note that key and value vectors have a one-to-one correspondence and can sometimes be the same. The attention mechanism computes the output vector $o$ as

$o = \sum_{i} \frac{\exp\big(f(q, k_i)\big)}{\sum_{j} \exp\big(f(q, k_j)\big)}\, v_i, \qquad (1)$

where $f$ could be any function that outputs a scalar attention score from the interaction between $q$ and $k_i$, such as the dot product gao2019graph or even a neural network velivckovic2018graph . The definition of the three groups of input vectors depends on the models and applications.

Notably, existing GNNs usually use the attention mechanism for local aggregation velivckovic2018graph ; gao2019graph . Specifically, when aggregating information for node $i$, the query vector is the embedding vector of node $i$, while the key and value vectors come from the node embeddings of $i$'s directly connected nodes. The process is iterated for every node. It is worth noting that the attention mechanism can be easily extended for non-local aggregation wang2018non ; wang2020non , by letting the key and value vectors correspond to all the nodes in the graph when aggregating information for each node. However, this is computationally prohibitive for large-scale graphs, as iterating it for each node in a graph of $n$ nodes requires $O(n^2)$ time. In this work, we propose a novel non-local aggregation method that only requires $O(n \log n)$ time.
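To make the generic mechanism concrete, the following is a minimal PyTorch sketch of Equation (1), assuming a softmax-normalized weighted sum and a dot-product score function; the names here (attention, score_fn) are illustrative rather than taken from any particular library.

```python
import torch

def attention(query, keys, values, score_fn):
    """Generic attention: weight each value by a normalized score f(query, key)."""
    # query: (d,); keys, values: (n, d); score_fn returns a scalar per (query, key) pair.
    scores = torch.stack([score_fn(query, k) for k in keys])  # (n,) raw attention scores
    weights = torch.softmax(scores, dim=0)                    # normalize scores to sum to 1
    return (weights.unsqueeze(-1) * values).sum(dim=0)        # weighted sum of values: (d,)

# Dot product as one possible choice of the score function f.
q, K, V = torch.randn(16), torch.randn(10, 16), torch.randn(10, 16)
out = attention(q, K, V, score_fn=lambda q, k: torch.dot(q, k))
```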

3 The Proposed Method

3.1 Non-Local Aggregation with Attention-Guided Sorting

We consider a graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Each edge $e \in E$ connects two nodes $u, v \in V$, so that $e = (u, v)$. Each node $v \in V$ has a corresponding node feature vector $x_v$. The $k$-hop neighborhood $N_k(v)$ of a node $v$ refers to the set of nodes that can reach $v$ within $k$ edges in $G$. For example, the set of $v$'s directly connected nodes is its 1-hop neighborhood $N_1(v)$.

Our proposed non-local aggregation framework is composed of three steps, namely local embedding, attention-guided sorting, and non-local aggregation. In the following, we describe them one by one.

Local Embedding:

Our proposed framework is built upon a local embedding step that extracts local node embeddings from the node feature vectors. The local embedding step can be as simple as

$z_v = \mathrm{MLP}(x_v) \in \mathbb{R}^{d}, \qquad (2)$

where the MLP function refers to a multilayer perceptron and $d$ is the dimension of the local node embedding $z_v$. Note that the MLP is shared across all the nodes in the graph. Applying an MLP takes only the node itself into consideration, without aggregating information from the neighborhood.
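As a concrete illustration, below is a minimal sketch of this MLP-based local embedding step in PyTorch, with illustrative layer sizes; it applies a shared 2-layer MLP to each node's feature vector independently, with no neighborhood aggregation.

```python
import torch
import torch.nn as nn

n, feat_dim, d = 1000, 1433, 64          # illustrative sizes: n nodes, input and embedding dims
local_embed = nn.Sequential(             # shared across all nodes in the graph
    nn.Linear(feat_dim, d),
    nn.ReLU(),
    nn.Linear(d, d),
)
x = torch.randn(n, feat_dim)             # node feature vectors x_v
z = local_embed(x)                       # local node embeddings z_v, shape (n, d)
```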

On the other hand, graph neural networks (GNNs) can be used as the local embedding step as well, so that our proposed framework can be easily employed to augment existing GNNs. As introduced in Section 2.1, modern GNNs perform multilayer local aggregation. Typically, for each node, one layer of a GNN aggregates information from its 1-hop neighborhood. Stacking $L$ such local aggregation layers allows each node to access information up to $L$ hops away. To be specific, the $\ell$-th layer of an $L$-layer GNN can be described as

$h_v^{(\ell)} = \mathrm{TRANSFORM}^{(\ell)}\Big(\mathrm{AGGREGATE}^{(\ell)}\big(\{\, h_u^{(\ell-1)} : u \in N_1(v) \cup \{v\} \,\}\big)\Big), \qquad (3)$

where $h_v^{(0)} = x_v$, and $z_v = h_v^{(L)}$ represents the local node embedding. The AGGREGATE and TRANSFORM functions represent the spatial aggregation and feature transformation steps introduced in Section 2.1, respectively. With the above framework, GNNs can capture the node feature information from nodes within a local neighborhood as well as the structural information.

When either MLP or GNNs are used as the local embedding step, the local node embedding $z_v$ only contains local information of a node $v$. However, $z_v$ can be used to guide non-local aggregation, as distant but informative nodes are likely to have similar node features and local structures. Based on this intuition, we propose the attention-guided sorting to enable the non-local aggregation.

Attention-Guided Sorting:

The basic idea of the attention-guided sorting is to learn an ordering of nodes, where distant but informative nodes are put near each other. Specifically, given the local node embeddings $z_v$ obtained through the local embedding step, we compute one set of attention scores by

$s_v = f(c, z_v), \quad v \in V, \qquad (4)$

where $c$ is a calibration vector that is randomly initialized and jointly learned during training yang2016hierarchical . In this attention operator, $c$ serves as the query vector and the $z_v$ are the key vectors. In addition, we also treat the $z_v$ as the value vectors. However, unlike the attention mechanism introduced in Section 2.3, we use the attention scores to sort the value vectors instead of computing a weighted sum to aggregate them. Note that originally there is no ordering among nodes in a graph. To be specific, as $s_v$ and $z_v$ have a one-to-one correspondence through Equation (4), sorting the attention scores in non-decreasing order into $(s_{(1)}, s_{(2)}, \dots, s_{(n)})$ provides an ordering among nodes, where $n$ is the number of nodes in the graph. The resulting sequence of local node embeddings can be denoted as $(z_{(1)}, z_{(2)}, \dots, z_{(n)})$.

The attention process in Equation (4) can also be understood as a projection of local node embeddings onto a 1-dimensional space. The projection depends on the concrete function $f$ and the calibration vector $c$. As indicated by its name, the calibration vector is used to calibrate the 1-dimensional space, in order to push distant but informative nodes close to each other in this space. This goal is fulfilled through the following non-local aggregation step and the training of the calibration vector $c$, as demonstrated below.
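A minimal sketch of the attention-guided sorting follows, assuming the dot-product score function later chosen in Equation (7); the variable names and sizes are illustrative.

```python
import torch

n, d = 1000, 64
z = torch.randn(n, d)                      # local node embeddings z_v from the local embedding step
c = torch.nn.Parameter(torch.randn(d))     # calibration vector, learned jointly during training

scores = z @ c                             # attention scores s_v = f(c, z_v), shape (n,)
order = torch.argsort(scores)              # non-decreasing order of scores, O(n log n)
z_sorted = z[order]                        # sorted sequence of local node embeddings
s_sorted = scores[order]                   # attention scores in the same order
```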

Non-Local Aggregation:

We point out that, with the attention-guided sorting, non-local aggregation can be achieved by convolution, which is the most common local aggregation operator in deep learning. Specifically, given the sorted sequence of local node embeddings $(z_{(1)}, z_{(2)}, \dots, z_{(n)})$, we compute

$\big(\hat{z}_{(1)}, \hat{z}_{(2)}, \dots, \hat{z}_{(n)}\big) = \mathrm{Conv1D}\big(z_{(1)}, z_{(2)}, \dots, z_{(n)}\big), \qquad (5)$

where the Conv1D function represents a 1D convolution with appropriate padding. Note that the Conv1D function can be replaced by a 1D convolutional neural network as long as the number of input and output vectors remains the same.

To see how the Conv1D function performs non-local aggregation with the attention-guided sorting, we take an example where the Conv1D function is a 1D convolution of kernel size $k$. In this case, $\hat{z}_{(i)}$ is computed from the $k$ local node embeddings centered at $z_{(i)}$ in the sorted sequence, corresponding to the receptive field of the Conv1D function. As a result, if the attention-guided sorting puts nodes that are distant from but informative to node $(i)$ into this receptive field, the output $\hat{z}_{(i)}$ aggregates non-local information. Another view is to consider the attention-guided sorting as re-connecting nodes in the graph, where the nodes within the receptive field of $\hat{z}_{(i)}$ can be treated as the 1-hop neighborhood of node $(i)$ in the re-connected graph. After the Conv1D function, $\hat{z}_v$ and $z_v$ are concatenated as the input to a classifier to predict the label of the corresponding node, where both non-local and local dependencies can be captured. In order to enable the end-to-end training of the calibration vector $c$, we modify Equation (5) into

$\big(\hat{z}_{(1)}, \hat{z}_{(2)}, \dots, \hat{z}_{(n)}\big) = \mathrm{Conv1D}\big(s_{(1)} z_{(1)}, s_{(2)} z_{(2)}, \dots, s_{(n)} z_{(n)}\big), \qquad (6)$

where we multiply each attention score with the corresponding local node embedding. As a result, the calibration vector receives gradients through the attention scores during training.
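A minimal sketch of this non-local aggregation step (Equation (6)) is given below, using a 2-layer 1D CNN with "same" padding; the kernel size and dummy inputs are illustrative, and z_sorted / s_sorted play the roles of the sorted embeddings and scores from the previous sketch.

```python
import torch
import torch.nn as nn

n, d, k = 1000, 64, 5
z_sorted, s_sorted = torch.randn(n, d), torch.randn(n)   # stand-ins for the sorted embeddings and scores

conv = nn.Sequential(                                    # Conv1D replaced by a small 1D CNN
    nn.Conv1d(d, d, kernel_size=k, padding=k // 2),      # "same" padding keeps the sequence length n
    nn.ReLU(),
    nn.Conv1d(d, d, kernel_size=k, padding=k // 2),
)

weighted = s_sorted.unsqueeze(-1) * z_sorted             # multiply each embedding by its attention score
z_hat = conv(weighted.t().unsqueeze(0)).squeeze(0).t()   # non-local embeddings, shape (n, d)
```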

The remaining question is how to make sure that the attention-guided sorting pushes distant but informative nodes together. The short answer is that it is not necessary to guarantee this, as the need for non-local aggregation depends on the concrete graph. In fact, our proposed framework grants GNNs the ability of non-local aggregation but lets the end-to-end training process determine whether to use non-local information. The back-propagation from the supervised loss will tune the calibration vector $c$ and encourage $\hat{z}_v$ to capture useful information that is not encoded by $z_v$. In the case of disassortative graphs, $\hat{z}_v$ usually needs to aggregate information from distant but informative nodes. Hence, the calibration vector tends to arrange the attention-guided sorting to put distant but informative nodes together. On the other hand, nodes within the local neighborhood are usually much more informative than distant nodes in assortative graphs. In this situation, $\hat{z}_v$ may simply perform local aggregation similar to that of GNNs.

In Section 4, we demonstrate the effectiveness of our proposed non-local aggregation framework on six disassortative graph datasets. In particular, we achieve the state-of-the-art performance on all the datasets with significant improvements over previous methods.

3.2 Time Complexity Analysis

We perform theoretical analysis of the time complexity of our proposed framework. As discussed in Section 2.3, using the attention mechanism vaswani2017attention ; wang2018non ; wang2020non to achieve non-local aggregation requires $O(n^2)$ time for a graph of $n$ nodes. Essentially, the quadratic time complexity is due to the fact that the $f$ function needs to be computed between every pair of nodes. In particular, the recently proposed Geom-GCN pei2020geom contains a similar non-local aggregation step. For each node $v$, Geom-GCN finds the set of nodes from which the Euclidean distance to $v$ is less than a pre-defined threshold, where the Euclidean distance between every pair of nodes needs to be computed. As the computation of the Euclidean distance between two nodes can be understood as the $f$ function, Geom-GCN has at least $O(n^2)$ time complexity.

In contrast, our proposed non-local aggregation framework requires only $O(n \log n)$ time. To see this, note that the $f$ function in Equation (4) only needs to be computed once for each node, instead of once for each pair of nodes. As a result, computing the attention scores only takes $O(n)$ time. Therefore, the time complexity of sorting, i.e., $O(n \log n)$, dominates the total time complexity of our proposed framework. In Section 4.6, we compare the real running time on different datasets among common GNNs, Geom-GCN, and our non-local GNNs as introduced in the next section.

3.3 Efficient Non-Local Graph Neural Networks

We apply our proposed non-local aggregation framework to build efficient non-local GNNs. Recall that our proposed framework starts with the local embedding step, followed by the attention-guided sorting and the non-local aggregation step.

In particular, the local embedding step can be implemented by either MLP or common GNNs, such as GCNs kipf2017semi or GATs velivckovic2018graph . MLP extracts the local node embedding only from the node feature vector and excludes information from nodes within the local neighborhood. This property can be helpful on some disassortative graphs, where nodes within the local neighborhood provide more noise than useful information. On other disassortative graphs, informative nodes are located both within the local neighborhood and at distant locations. In this case, GNNs are more suitable as the local embedding step. Depending on the disassortative graph at hand, we build different non-local GNNs with either MLP or GNNs as the local embedding step. In Section 4.3, we show that these two categories of disassortative graphs can be distinguished through simple experiments, where we apply different non-local GNNs accordingly. Specifically, the number of layers is set to 2 for both MLP and GNNs in our non-local GNNs.

In terms of the attention-guided sorting, we only need to specify the function $f$ in Equation (4). In order to make it as efficient as possible, we choose the simplest dot-product function

$f(c, z_v) = c^{\top} z_v, \qquad (7)$

where $c$ is part of the training parameters, as described in Section 3.1.

With the attention-guided sorting, we can implement the non-local aggregation step through convolution, as explained in Section 3.1 and shown in Equation (6). Specifically, we set the Conv1D function to be a 2-layer convolutional neural network composed of two 1D convolutions, with the kernel size chosen depending on the dataset. The activation function between the two layers is ReLU krizhevsky2012imagenet .

Finally, we use a linear classifier that takes the concatenation of $z_v$ and $\hat{z}_v$ as input and makes a prediction for the corresponding node. Depending on the local embedding step, we build three efficient non-local GNNs, namely non-local MLP (NLMLP), non-local GCN (NLGCN), and non-local GAT (NLGAT). The models can be trained end-to-end with the classification loss.
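Putting the pieces together, the following is a hedged sketch of an NLGCN-style model, assuming PyTorch Geometric's GCNConv as the 2-layer local embedding step; layer sizes and kernel size are illustrative, and this is not the authors' reference implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class NLGCNSketch(nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes, kernel_size=5):
        super().__init__()
        self.gcn1 = GCNConv(in_dim, hid_dim)            # 2-layer GCN as the local embedding step
        self.gcn2 = GCNConv(hid_dim, hid_dim)
        self.c = nn.Parameter(torch.randn(hid_dim))     # calibration vector
        pad = kernel_size // 2
        self.cnn = nn.Sequential(                       # 2-layer 1D CNN for non-local aggregation
            nn.Conv1d(hid_dim, hid_dim, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size, padding=pad),
        )
        self.classifier = nn.Linear(2 * hid_dim, num_classes)

    def forward(self, x, edge_index):
        z = torch.relu(self.gcn1(x, edge_index))
        z = self.gcn2(z, edge_index)                    # local node embeddings z_v
        s = z @ self.c                                  # attention scores s_v
        order = torch.argsort(s)                        # attention-guided sorting
        inv = torch.argsort(order)                      # permutation back to original node order
        weighted = s[order].unsqueeze(-1) * z[order]    # sorted, score-weighted sequence
        z_hat = self.cnn(weighted.t().unsqueeze(0)).squeeze(0).t()[inv]  # non-local embeddings
        return self.classifier(torch.cat([z, z_hat], dim=-1))            # joint local + non-local prediction
```

Swapping the two GCNConv layers for an MLP or GAT layers would give NLMLP- and NLGAT-style variants of this sketch, respectively.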

4 Experiments

4.1 Datasets

We perform experiments on six disassortative graph datasets rozemberczki2019multi ; tang2009social ; pei2020geom (Chameleon, Squirrel, Actor, Cornell, Texas, Wisconsin) and three assortative graph datasets kipf2017semi (Cora, Citeseer, Pubmed). These datasets are commonly used to evaluate GNNs on node classification tasks kipf2017semi ; velivckovic2018graph ; gao2018large ; pei2020geom . We provide detailed descriptions of disassortative graph datasets in Section A in the supplementary.

In order to distinguish assortative and disassortative graph datasets, Pei et al. pei2020geom propose a metric to measure the homophily of a graph $G$, defined as

$\beta = \frac{1}{|V|} \sum_{v \in V} \frac{\left|\{\, u \in N_1(v) : \mathrm{label}(u) = \mathrm{label}(v) \,\}\right|}{|N_1(v)|}. \qquad (8)$

Intuitively, a large $\beta$ indicates an assortative graph, and vice versa. The $\beta$ and other statistics are summarized in Table 1.
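For reference, a minimal pure-Python sketch of this homophily metric is shown below, under the assumption that $\beta$ is the average over nodes of the fraction of each node's 1-hop neighbors sharing its label (isolated nodes are simply skipped here); adj and labels are illustrative data structures.

```python
def homophily(adj, labels):
    """Average fraction of same-label neighbors over all nodes with at least one neighbor."""
    fracs = []
    for v, neighbors in adj.items():
        if not neighbors:
            continue  # skip isolated nodes in this sketch
        same = sum(labels[u] == labels[v] for u in neighbors)
        fracs.append(same / len(neighbors))
    return sum(fracs) / len(fracs)

# Toy 4-node graph with two classes.
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
print(homophily(adj, labels))  # 0.75
```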

In our experiments, we focus on comparing the model performance on disassortative graph datasets, in order to demonstrate the effectiveness of our non-local aggregation framework. The performances on assortative graph datasets are provided for reference, indicating that the proposed framework will not hurt the performance when non-local aggregation is not strongly desired.

Assortative Disassortative
Datasets Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin
#Nodes
#Edges
#Features
#Classes
Table 1: Statistics of the nine datasets used in our experiments. The definition of $\beta$ is provided in Section 4.1. $\beta$ can be used to distinguish assortative and disassortative graph datasets.

4.2 Baselines

We compare our proposed non-local MLP (NLMLP), non-local GCN (NLGCN), and non-local GAT (NLGAT) with various baselines:

  • MLP is the simplest deep learning model. It makes prediction solely based on the node feature vectors, without aggregating any local or non-local information.

  • GCN kipf2017semi and GAT velivckovic2018graph are the most common GNNs. As introduced in Section 2.1, they only perform local aggregation.

  • Geom-GCN pei2020geom is a recently proposed GNN that can capture long-range dependencies. It is the current state-of-the-art model on several disassortative graph datasets. Geom-GCN requires the use of different node embedding methods, such as Isomap tenenbaum2000global , Poincare nickel2017poincare , and struc2vec ribeiro2017struc2vec . We simply report the best results from pei2020geom for Geom-GCN and the following two variants without specifying the node embedding method.

  • Geom-GCN-g pei2020geom is a variant of Geom-GCN that performs local aggregation only. It is similar to common GNNs.

  • Geom-GCN-s pei2020geom is a variant of Geom-GCN that does not force local aggregation. The designed functionality is similar to our NLMLP.

We implement MLP, GCN, GAT, and our methods using PyTorch paszke2017automatic and PyTorch Geometric fey2019fast . As has been discussed at https://openreview.net/forum?id=S1e2agrFvS&noteId=8tGKV1oSzCr, in fair settings, the results of GCN and GAT differ from those reported in pei2020geom .

On each dataset, we follow pei2020geom and randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing. The experiments are run 10 times with different random splits, and the average test accuracy over these 10 runs is reported. Testing is performed when the validation accuracy reaches its maximum on each run. Apart from the details specified in Section 3.3, we tune the following hyperparameters individually for our proposed models: (1) the number of hidden units in {16, 48, 96}, (2) dropout rate in {0, 0.5, 0.8}, (3) weight decay in {0, 5e-4, 5e-5, 5e-6}, and (4) learning rate in {0.01, 0.05}.

4.3 Analysis of Disassortative Graph Datasets

As discussed in Section 3.3, the disassortative graph datasets can be divided into two categories. In disassortative graphs belonging to the first category, nodes within the local neighborhood provide more noise than useful information. Therefore, local aggregation should be avoided in models on such disassortative graphs. As for the second category, informative nodes are located both within the local neighborhood and at distant locations. Intuitively, a graph with lower $\beta$ is more likely to be in the first category. However, $\beta$ alone does not accurately determine the category.

Knowing the exact category of a disassortative graph is crucial, as we need to apply non-local GNNs accordingly. As analyzed above, the key difference lies in whether the local aggregation is useful. Based on this insight, we can distinguish two categories of disassortative graph datasets by comparing the performance between MLP and common GNNs (GCN, GAT) on each of the six disassortative graph datasets.

The results are summarized in Table 2. We can see that Actor, Cornell, Texas, and Wisconsin fall into the first category, while Chameleon and Squirrel belong to the second category. We add the performance on assortative graph datasets for reference, where the local aggregation is effective so that GNNs tend to outperform MLP.

Assortative Disassortative
Datasets Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin
MLP 35.1 81.6 81.3 84.9
GCN 88.4 67.6 54.9
GAT 88.4 76.1
Table 2: Comparisons between MLP and common GNNs (GCN, GAT). These analytical experiments are used to determine the two categories of disassortative graph datasets, as introduced in Section 4.3.

4.4 Comparisons with Baselines

According to the insights from Section 4.3, we apply different non-local GNNs according to the category of disassortative graph datasets, and make comparisons with corresponding baselines.

Datasets Actor Cornell Texas Wisconsin
MLP
Geom-GCN
Geom-GCN-s
NLMLP 37.9 84.9 85.4 87.3
Table 3: Comparisons between our NLMLP and strong baselines on the four disassortative graph datasets belonging to the first category as defined in Section 4.3.

Specifically, we employ NLMLP on Actor, Cornell, Texas, and Wisconsin. The corresponding baselines are MLP, Geom-GCN, and Geom-GCN-s, as Table 2 has shown that GCN and GAT perform much worse than MLP on these datasets, and Geom-GCN-g is similar to GCN and performs worse than Geom-GCN-s, as shown in Section B in the supplementary. The comparison results are reported in Table 3. While Geom-GCN-s is the previous state-of-the-art GNN on these datasets pei2020geom , we find that MLP consistently outperforms it by large margins. In particular, although Geom-GCN-s does not explicitly perform local aggregation, it is still outperformed by MLP. A possible explanation is that Geom-GCN-s uses pre-trained node embeddings, which implicitly aggregate information from the local neighborhood. In contrast, our NLMLP is built upon MLP with the proposed non-local aggregation framework, which excludes local noise and collects useful information from non-local informative nodes. NLMLP sets a new state of the art on these disassortative graph datasets.

Datasets Chameleon Squirrel
GCN
GAT
Geom-GCN
Geom-GCN-g
NLGCN 70.1 59.0
NLGAT
Table 4: Comparisons between our NLGCN, NLGAT and strong baselines on the two disassortative graph datasets belonging to the second category as defined in Section 4.3.

On Chameleon and Squirrel, which belong to the second category of disassortative graph datasets, we apply NLGCN and NLGAT accordingly. The baselines are GCN, GAT, Geom-GCN, and Geom-GCN-g. On these datasets, baselines that explicitly perform local aggregation show advantages over MLP and Geom-GCN-s, as shown in Section B in the supplementary. Table 4 summarizes the comparison results. Our proposed NLGCN achieves the best performance on both datasets. In addition, it is worth noting that our NLGCN and NLGAT are built upon GCN and GAT, respectively. They show improvements over their counterparts, which indicates that the advantages of our proposed non-local aggregation framework generalize across common GNNs.

We provide the results of all the models on all datasets in Section B in the supplementary for reference.

4.5 Analysis of the Attention-Guided Sorting

We analyze the results of the attention-guided sorting in our proposed framework, in order to show that our non-local GNNs indeed perform non-local aggregation.

Suppose the attention-guided sorting leads to the sorted sequence $(z_{(1)}, z_{(2)}, \dots, z_{(n)})$, which goes through a 1D convolution or CNN into $(\hat{z}_{(1)}, \hat{z}_{(2)}, \dots, \hat{z}_{(n)})$. As discussed in Section 3.1, we can consider the sequence as a re-connected graph $G'$, where we treat the nodes within the receptive field of $\hat{z}_{(i)}$ as directly connected to node $(i)$, i.e., its 1-hop neighborhood in $G'$. The information within this new 1-hop neighborhood will be aggregated. If our non-local GNNs indeed perform non-local aggregation, the homophily $\beta'$ of the re-connected graph $G'$ should be larger than that of the original graph $G$. Therefore, we compute $\beta'$ for each dataset to verify this statement. Following Section 4.4, we apply NLMLP on Actor, Cornell, Texas, and Wisconsin and NLGCN on Chameleon and Squirrel.

Figure 1 compares $\beta'$ with $\beta$ for each dataset. We can observe that $\beta'$ is much larger than $\beta$, indicating that distant but informative nodes are near each other in the re-connected graph $G'$. We also provide visualizations of the sorted node sequence for Cornell and Texas, where nodes with the same label tend to be clustered together. These facts indicate that our non-local GNNs perform non-local aggregation with the attention-guided sorting.
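A hedged sketch of how $\beta'$ could be computed is given below: after the attention-guided sorting, each node's new 1-hop neighborhood is taken to be the window of nodes inside the Conv1D receptive field, and the homophily helper sketched in Section 4.1 is then applied to the re-connected adjacency; the window size is illustrative.

```python
def reconnected_adjacency(order, window=5):
    """Build the re-connected graph G': the neighbors of a node are the nodes
    inside its window in the sorted sequence (excluding the node itself)."""
    half = window // 2
    adj = {}
    for pos, v in enumerate(order):                      # order: node ids in sorted positions
        lo, hi = max(0, pos - half), min(len(order), pos + half + 1)
        adj[v] = [order[p] for p in range(lo, hi) if p != pos]
    return adj

# beta_prime = homophily(reconnected_adjacency(order), labels)  # compare against beta of the original graph
```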

Figure 1: (a) Comparisons of the homophily between the original graph and the re-connected graph given by our NLGCN on Chameleon and Squirrel. (b) Comparisons of the homophily between the original graph and the re-connected graph given by our NLMLP on Actor, Cornell, Texas, and Wisconsin. (c) Visualization of sorted node sequence after the attention-guided sorting for Cornell and Texas. The colors denote node labels. Details are explained in Section 4.5.

4.6 Efficiency Comparisons

Chameleon Squirrel
GCN
GAT
Geom-GCN
NLGCN
Table 5: Comparisons in terms of real running time (milliseconds).

As analyzed in Section 3.2, our proposed non-local aggregation framework is more efficient than previous methods based on the original attention mechanism, such as Geom-GCN pei2020geom . Concretely, our method requires only $O(n \log n)$ computation time, in contrast to $O(n^2)$. In this section, we compare the real running time to verify our analysis. Specifically, we compare NLGCN with Geom-GCN as well as GCN and GAT. For Geom-GCN, we use the code provided in pei2020geom . Each model is trained for 500 epochs on each dataset and the average training time per epoch is reported.

The results are shown in Table 5. Although our NLGCN is built upon GCN, it is only slightly slower than GCN and faster than GAT, showing the efficiency of our non-local aggregation framework. On the other hand, Geom-GCN is significantly slower due to its $O(n^2)$ time complexity.

5 Conclusion

In this work, we propose a simple yet effective non-local aggregation framework for GNNs. The core of the framework is an efficient attention-guided sorting, which enables non-local aggregation through convolution. The proposed framework can be easily used to build non-local GNNs with low computational costs. We perform thorough experiments on node classification tasks to evaluate our proposed method. In particular, we experimentally analyze existing disassortative graph datasets and apply different non-local GNNs accordingly. The results show that our non-local GNNs significantly outperform previous state-of-the-art methods on six benchmark datasets of disassortative graphs, in terms of both accuracy and speed.

This work was supported in part by National Science Foundation grants IIS-1908198 and DBI-1922969.

References

  • [1] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Proceedings of the Neural Information Processing Systems Autodiff Workshop, 2017.
  • [2] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • [3] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
  • [4] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
  • [5] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations, 2020.
  • [6] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
  • [7] Hongyang Gao and Shuiwang Ji. Graph representation learning via hard and channel-wise attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 741–749, 2019.
  • [8] Hongyang Gao, Zhengyang Wang, and Shuiwang Ji. Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1416–1424, 2018.
  • [9] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • [10] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  • [11] Boris Knyazev, Graham W Taylor, and Mohamed Amer. Understanding attention and generalization in graph neural networks. In Advances in Neural Information Processing Systems, pages 4204–4214, 2019.
  • [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [14] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pooling. In International Conference on Machine Learning, pages 3734–3743, 2019.
  • [15] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [16] Mark EJ Newman. Assortative mixing in networks. Physical review letters, 89(20):208701, 2002.
  • [17] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.
  • [18] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In International Conference on Learning Representations, 2020.
  • [19] Leonardo FR Ribeiro, Pedro HP Saverese, and Daniel R Figueiredo. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 385–394, 2017.
  • [20] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. arXiv preprint arXiv:1909.13021, 2019.
  • [21] Kristof Schütt, Pieter-Jan Kindermans, Huziel Enoc Sauceda Felix, Stefan Chmiela, Alexandre Tkatchenko, and Klaus-Robert Müller. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. In Advances in Neural Information Processing Systems, pages 991–1001, 2017.
  • [22] Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 807–816, 2009.
  • [23] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • [24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [25] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • [26] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [27] Zhengyang Wang, Na Zou, Dinggang Shen, and Shuiwang Ji. Non-local U-Nets for biomedical image segmentation. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • [28] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
  • [29] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.
  • [30] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pages 5449–5458, 2018.
  • [31] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.
  • [32] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
  • [33] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pages 4800–4810, 2018.
  • [34] Hao Yuan and Shuiwang Ji. StructPool: Structured graph pooling via conditional random fields. In International Conference on Learning Representations, 2020.
  • [35] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [36] Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. Graph convolutional networks: Algorithms, applications and open challenges. In International Conference on Computational Social Networks, pages 79–91. Springer, 2018.

Appendix A Details of Disassortative Graph Datasets

Here are the details of disassortative graph datasets used in our experiments:

  • Chameleon and Squirrel are Wikipedia networks [20] where nodes represent web pages from Wikipedia and edges indicate mutual links between pages. Node feature vectors are bag-of-words representations of informative nouns in the corresponding pages. Each node is labeled with one of five classes according to the average monthly traffic of the web page.

  • Actor is an actor co-occurrence network, where nodes denote actors and edges indicate co-occurrence on the same Wikipedia page. It is extracted from the film-director-actor-writer network proposed by Tang et al. [22]. Node feature vectors are bag-of-words representations of keywords in the actors' Wikipedia pages. Each node is labeled with one of five classes according to the topic of the actor's Wikipedia page.

  • Cornell, Texas, and Wisconsin come from the WebKB dataset collected by Carnegie Mellon University. Nodes represent web pages and edges denote hyperlinks between them. Node feature vectors are bag-of-words representations of the corresponding web pages. Each node is labeled as one of student, project, course, staff, or faculty.

Appendix B Full Experimental Results

Assortative Disassortative
Datasets Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin
MLP
GCN
GAT
Geom-GCN
Geom-GCN-g 80.6 90.7
Geom-GCN-s
NLMLP 37.9 84.9 85.4 87.3
NLGCN 70.1 59.0
NLGAT 88.5
Table 6: Comparisons between our NLMLP, NLGCN, NLGAT and baselines on all the nine datasets.