Hierarchical graph neural nets can capture long-range interactions

07/15/2021 · by Ladislav Rampasek, et al. · Montréal Institute of Learning Algorithms

Graph neural networks (GNNs) based on message passing between neighboring nodes are known to be insufficient for capturing long-range interactions in graphs. In this project we study hierarchical message-passing models that leverage a multi-resolution representation of a given graph. This facilitates learning of features that span large receptive fields without loss of local information, an aspect not studied in preceding work on hierarchical GNNs. We introduce Hierarchical Graph Net (HGNet), which for any two connected nodes guarantees the existence of message-passing paths of at most logarithmic length w.r.t. the input graph size. Yet, under mild assumptions, its internal hierarchy maintains asymptotic size equivalent to that of the input graph. We observe that our HGNet outperforms conventional stacking of GCN layers, particularly on molecular property prediction benchmarks. Finally, we propose two benchmarking tasks designed to elucidate the capability of GNNs to leverage long-range interactions in graphs.


1 Introduction

Graph neural networks (GNNs), and the field of geometric deep learning, have seen rapid development in recent years [hamilton2020, bronstein2021G5] and have attained popularity in various fields involving graph and network structures. Prominent examples of GNN applications include molecular property prediction, physical systems simulation, combinatorial optimization, and interaction detection in images and text. Many current GNN designs are based on the principle of neural message passing [gilmer2017MP], where information is iteratively passed between neighboring nodes along existing edges. However, this paradigm is known to suffer from several deficiencies, including theoretical limits of representational capacity [xu2018gin] and observed limitations of information propagation over graphs [alon2020bottleneck, li2018insights, min2020scattering].

Two of the most prominent deficiencies of GNNs are known as oversquashing and oversmoothing. Information oversquashing refers to the exponential growth in the amount of information that has to be encoded by the network with each message-passing iteration, which rapidly grows beyond the capacity of a fixed hidden-layer representation [alon2020bottleneck]. Signal oversmoothing refers to the tendency of node representations to converge to local averages [li2018insights], which can also be observed in graph convolutional networks implementing low-pass filtering over the graph [min2020scattering]. A significant repercussion of these phenomena is that they limit the ability of most GNN architectures to represent long-range interactions (LRIs) in graphs. Namely, they struggle to capture dependencies between distant nodes, even when these have potentially significant impact on output prediction or appropriate internal feature extraction towards it. Capturing LRIs typically requires the number of GNN layers (i.e., implementing individual message-passing steps) to be proportional to the diameter of the graph, which in turn exacerbates the oversquashing of massive amounts of information and the oversmoothing that tends towards averaging over wide regions of the graph, if not the entire graph.

In this paper, we study the utilization of multiscale hierarchical meta-structures to enhance message passing in GNNs and facilitate capturing of LRIs. By leveraging hierarchical message passing between nodes, our Hierarchical Graph Net (HGNet) architecture can propagate information within $\mathcal{O}(\log n)$ steps instead of $\mathcal{O}(n)$, where $n$ is the number of nodes, leading to particular improvements for sparse graphs with large diameters.
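To make the distance argument concrete, the following sketch (ours, not from the paper; the fixed pairing scheme is a hypothetical stand-in for HGNet's learned coarsening) builds a simple hierarchy over a path graph and compares hop distances with and without it, assuming networkx:

```python
import networkx as nx

n = 64
G = nx.path_graph(n)  # flat graph: worst-case distance is n - 1 hops

# Build a toy hierarchy by repeatedly pairing consecutive nodes and
# connecting each node to its parent "super-node" (inter-level edges).
H = G.copy()
level, offset = list(range(n)), n
while len(level) > 1:
    parents = []
    for i in range(0, len(level), 2):
        parent, offset = offset, offset + 1
        for u in level[i:i + 2]:
            H.add_edge(u, parent)          # inter-level edge
        parents.append(parent)
    for a, b in zip(parents, parents[1:]):
        H.add_edge(a, b)                   # intra-level edges between parents
    level = parents

print(nx.diameter(G))                        # 63 hops: O(n)
print(nx.shortest_path_length(H, 0, n - 1))  # ~2 log2(n) hops via the hierarchy
```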

We note that a few works have recently proposed related approaches using hierarchical constructions, namely g-U-Net [gao2019gUnet] and GXN [li2020GXN]. g-U-Net employs a similarity-based top-k pooling called gPool for hierarchical construction, over which it implements bottom-up and simple top-down message passing. GXN introduced mutual-information-based pooling (VIPool) together with a more complex cross-level message passing. Next, MGKN [li2020multipole] introduced a multi-resolution GNN with a V-cycle algorithm specifically for learning solution operators of PDEs. Broadly related are also differentiable pooling methods such as DiffPool [ying2018diffpool], EdgePool [diehl2019edgepool], or GraphZoom [deng2020graphzoom]. However, these do not employ bidirectional hierarchical message passing.

While LRIs are widely accepted as being important both for theoretical studies and in practice, most benchmarks used to empirically validate GNN models do not clearly exhibit this property. Among existing benchmarks, the importance of LRIs is perhaps best justified in biochemistry datasets, where the 2D structure of proteins and molecules is used as their graph representation. However, edges of such graphs do not encode 3D forces and global properties, leaving it up to the model to learn to recognize such LRIs. Several highly specialized models have been proposed for molecular data, but these are typically not applicable to other domains, which also hinders analysis of whether their modeling improvements stem particularly from capturing LRIs. Therefore, in our experiments we primarily focus on quantifying the benefit of using a hierarchical structure compared to the standard practice of GNN layer stacking. We also introduce two benchmarking tasks designed to elucidate the capability of general-purpose GNNs to leverage LRIs. Here, we show that hierarchical models outperform their standard GNN counterparts when their hierarchical graph construction matches well with the original graph structure and the prediction task, while uncovering related limitations of gPool in g-U-Net.

2 Hierarchical graph net

To build a hierarchical message passing model, we need to construct a hierarchical graph representation and define an inter- and intra-level message passing mechanism.

2.1 Graph coarsening for hierarchical representation

Building a hierarchical representation principally involves iterative application of graph coarsening and pooling operations. Graph coarsening computes a mapping from nodes of a starting graph $G_\ell$ onto nodes of a new, smaller graph $G_{\ell+1}$, while the pooling step computes node and edge features of $G_{\ell+1}$ from $G_\ell$. Here we explore two different approaches: EdgePool [diehl2019edgepool] and the Louvain method for community detection [blondel2008Louvain].

EdgePool [diehl2019edgepool] is a method based on the principle of edge contractions. First, the raw score of an edge $(u, v)$ is obtained by a linear combination of the respective node features $x_u$ and $x_v$: $r_{uv} = w^\top [x_u \,\|\, x_v] + b$. Raw scores of the edges incident to a node $v$ are then normalized, $s_{uv} = \operatorname{softmax}_{u \in \mathcal{N}(v)} r_{uv}$, to obtain the final edge scores. Finally, a maximal set of edges is greedily selected according to their scores and then contracted to create a new graph $G_{\ell+1}$ from $G_\ell$, while nodes in $G_\ell$ that were not merged are carried forward to $G_{\ell+1}$. Two nodes in $G_{\ell+1}$ are then connected by an edge iff there exist two nodes in $G_\ell$ they were constructed from that had been adjacent in $G_\ell$.

Contraction of an edge $(u, v)$ results in a new node with features $x_{uv} = s_{uv} \cdot (x_u + x_v)$. Multiplying the new node features by the edge score facilitates gradient-based learning of the scoring function, which would otherwise be independent of the final objective function.
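A minimal sketch of the scoring step above, assuming PyTorch; the tensor names (x, edge_index, w, b) and the grouping of the softmax by source node are our reading of EdgePool, not reference code:

```python
import torch

def edgepool_scores(x, edge_index, w, b):
    """Raw score r_uv = w^T [x_u || x_v] + b for every edge, then a
    softmax over the edges that share a source node yields s_uv."""
    src, dst = edge_index                                 # each of shape [E]
    r = torch.cat([x[src], x[dst]], dim=-1) @ w + b       # raw scores, [E]
    num = (r - r.max()).exp()                             # stabilized exp
    denom = torch.zeros(x.size(0)).scatter_add_(0, src, num)
    return num / denom[src]                               # normalized s_uv
```

Contracted node features would then be formed as `s_uv * (x[u] + x[v])`, keeping the scoring parameters on the gradient path as noted above.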

The Louvain method for community detection [blondel2008Louvain] is a heuristic method based on greedy maximization of the modularity score of each community. It is an $\mathcal{O}(n \log n)$ algorithm without learnable parameters that is deterministic for a fixed random seed. The Louvain algorithm merges each cluster (community) into a single node and iteratively performs modularity clustering on the condensed graph until the score cannot be improved. The size of the condensed graph cannot be directly controlled, but in practice the method tends to yield satisfying contraction ratios.

To build a hierarchical meta-graph over a starting graph $G_\ell$, we use average node and edge feature pooling according to the modular communities identified in $G_\ell$ by the Louvain method to construct the following level $G_{\ell+1}$.
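A sketch of one Louvain coarsening and average-pooling step, assuming networkx (nx.community.louvain_communities) and integer node labels 0..n-1; the function and variable names are ours:

```python
import networkx as nx
import numpy as np

def louvain_coarsen(G, x, seed=0):
    """Cluster G into communities, average-pool node features per
    community, and connect communities that share an original edge."""
    comms = nx.community.louvain_communities(G, seed=seed)
    assign = {u: c for c, comm in enumerate(comms) for u in comm}
    x_coarse = np.stack([x[list(comm)].mean(axis=0) for comm in comms])
    G_coarse = nx.Graph()
    G_coarse.add_nodes_from(range(len(comms)))
    G_coarse.add_edges_from(
        (assign[u], assign[v]) for u, v in G.edges() if assign[u] != assign[v]
    )
    return G_coarse, x_coarse, assign
```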

Figure 1: HGNet with two hierarchical levels over an original graph $G_0$ of 12 vertices (in black) and 14 edges. The dashed lines represent inter-level edges. (left) Two levels of EdgePool coarsening, highlighted by red arrows, create the hierarchical structure. A GCN layer is applied before each EdgePool coarsening and at the final coarsest level $G_2$. (right) Message passing down the hierarchy is implemented by an RGCN layer at the $G_1$ and then $G_0$ levels, highlighted by green arrows, where inter-level edges are treated as a distinct edge type.

2.2 Hierarchical message passing in HGNet

Both EdgePool and the Louvain method provide a recipe for the construction of a hierarchical graph representation. We propose Hierarchical Graph Net (HGNet) based on either one of these approaches (see Figure 1), sharing the same hierarchical message-passing approach that we describe next. Our message passing both within and between levels is principally similar to that of g-U-Net. Consider a hierarchical meta-graph with $L$ levels over some $G_0$. The forward propagation in HGNet consists of a computational pass going up the hierarchy and of a pass going down the hierarchy, resulting in the final embedding of each node in $G_0$. In the upwards pass we first apply a GCN layer [kipf2016GCN] to $G_\ell$, starting with $G_0$, followed by node and edge pooling according to either EdgePool or the Louvain method to instantiate the next hierarchical level $G_{\ell+1}$. This process iterates until the final level $G_L$, at which point no more pooling is done and the downwards pass starts. In this downwards pass we utilize RGCN [schlichtkrull2018RGCN] layers at each level $G_\ell$, where we add special edges that connect merged nodes in $G_\ell$ with their respective representatives in $G_{\ell+1}$ by an edge of a unique type.
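The sketch below illustrates one step of the downward pass as we read it, assuming PyTorch Geometric; relation 0 marks intra-level edges of $G_\ell$ and relation 1 marks the added parent-to-child inter-level edges. This is our interpretation of the text, not the authors' reference code:

```python
import torch
from torch_geometric.nn import RGCNConv

def downward_step(conv, x_l, x_up, edge_index_l, parent):
    """One RGCN step at level l during the downward pass.

    x_l:          [n_l, d]  level-l features from the upward pass
    x_up:         [n_up, d] already-updated features one level above
    edge_index_l: [2, E_l]  intra-level edges of G_l
    parent:       [n_l]     index of each node's representative above
    """
    n_l = x_l.size(0)
    x = torch.cat([x_l, x_up], dim=0)
    inter = torch.stack([parent + n_l, torch.arange(n_l)])  # parent -> child
    edge_index = torch.cat([edge_index_l, inter], dim=1)
    edge_type = torch.cat([
        torch.zeros(edge_index_l.size(1), dtype=torch.long),  # relation 0
        torch.ones(n_l, dtype=torch.long),                    # relation 1
    ])
    return conv(x, edge_index, edge_type)[:n_l]  # keep the level-l rows

conv = RGCNConv(32, 32, num_relations=2)  # 32-dim hidden, as in Section 3
```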

Complexity. We now analyze the asymptotic complexity of our hierarchical meta-graph based on the EdgePool variant. Let us assume that in each round of edge contractions the size of the greedily selected matching is at least a constant fraction of the number of remaining nodes, i.e., $m_\ell \ge c\, n_\ell$ for some constant $0 < c \le 1/2$. Note that $c = 1/2$ when the selected set of edges is a perfect matching. That means after the first round there will be at most $(1-c)\,n$ nodes in the next level. Thus, the total number of nodes in the entire hierarchical structure over a $G_0$ with $n$ nodes is $\sum_{\ell \ge 0} (1-c)^{\ell}\, n = \mathcal{O}(n)$, while the number of possible levels is $\mathcal{O}(\log n)$. This construction therefore guarantees that, if $G_0$ is connected, the shortest path length between any two nodes is upper-bounded by $\mathcal{O}(\log n)$.

We can also expect the number of edges in our hierarchical graph to remain asymptotically equal to the number of edges in the input graph $G_0$. Assume there are $e$ edges in $G_0$ out of $\binom{n}{2}$ possible and that they are uniformly distributed. Then after one round of EdgePool, the number of edges in $G_1$ is expected to be approximately $(1-c)^2 e$, because the number of possible edges in $G_1$ compared to $G_0$ has decreased from $\binom{n}{2}$ to $\binom{(1-c)n}{2}$, i.e., we can expect a contraction factor of $(1-c)^2$ for the number of edges. Therefore, we can expect $\sum_{\ell \ge 0} (1-c)^{2\ell} e = \mathcal{O}(e)$ intra-level edges in total. From the construction of the hierarchy it is also clear that the number of inter-level edges (connecting nodes between adjacent hierarchical levels) is $\mathcal{O}(n)$, as the total number of nodes is $\mathcal{O}(n)$. Therefore, the total number of edges is expected to remain $\mathcal{O}(e + n)$, i.e., $\mathcal{O}(e)$ for a connected $G_0$.
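As a quick empirical check of these asymptotics (ours, not from the paper), the following sketch repeatedly contracts a greedy maximal matching on a random graph and accumulates the node and edge counts across all levels, assuming networkx:

```python
import networkx as nx

G = nx.gnm_random_graph(1024, 4096, seed=0)   # n = 1024, e = 4096
total_nodes = total_edges = levels = 0
while G.number_of_edges() > 0:
    total_nodes += G.number_of_nodes()
    total_edges += G.number_of_edges()
    for u, v in nx.maximal_matching(G):       # greedy maximal matching
        G = nx.contracted_nodes(G, u, v, self_loops=False)
    levels += 1

# Expect levels ~ log2(n), and totals within small constant factors of n, e.
print(levels, total_nodes, total_edges)
```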

Given a deep enough hierarchy and large enough node representation capacity, the final node embeddings can incorporate LRIs from the entire graph $G_0$, as well as local information. In the case of EdgePool, the asymptotic complexity of our HGNet remains that of GCN: even though our hierarchical graph has up to $\mathcal{O}(\log n)$ hierarchical levels, its size remains asymptotically unchanged under the assumptions above. For a standard message-passing GNN to theoretically achieve this capability, it is necessary to stack $\mathcal{O}(\mathrm{diam}(G_0))$ layers, which may be prohibitively expensive.

3 Results

Table 1: Legacy graph benchmarks. CiteSeer, Cora and PubMed provide only one standard data split, and therefore we show test accuracy averaged over three runs with different random seeds for these datasets. For graph classification tasks (right side of the table) we used 10-fold stratified cross-validation. Shown heatmaps are normalized per dataset (column).

In order to evaluate the performance of HGNet, we consider a wide variety of graph data, including transductive node classification and inductive graph-level classification. Our benchmarks include two settings of HGNet (namely, with EdgePool and Louvain hierarchical structures) and six competitive baseline models: GCN [kipf2016GCN], GCN+VN (GCN extended with a Virtual Node connected to all other nodes), GAT [velickovic2017GAT], ChebNet [tang2019chebnet], GIN [xu2018gin], and g-U-Net [gao2019gUnet]. The experimental setup is identical for all tested methods. Each method is trained for 200 epochs, followed by a selection of the best model based on validation performance; finally, performance on the test split is reported. In the case of GCN, GCN+VN, GAT, ChebNet and GIN, we always used a stack of 2 layers unless explicitly stated otherwise. In the case of g-U-Net, we reproduced the published hyperparameters [gao2019gUnet] as closely as possible. For each method we default to a 32-dimensional hidden node representation; other hyperparameters specific to certain tasks or datasets are described in the respective sections. We note that our reproduced g-U-Net results differ from the original publication [gao2019gUnet], as there only the best validation set results were reported rather than performance on independent test sets. This erroneous practice has occurred on several occasions in the relatively nascent field of graph deep learning [errica2019fair].

3.1 Node classification in citation networks

For our first benchmark, we consider semi-supervised node classification on the CiteSeer, Cora and PubMed citation networks [yang2016planetoid]. Our HGNet variants are configured with one hierarchical level, and g-U-Net with four levels as per its published hyperparameters. Citation networks are known to exhibit high homophily [zhu2020BeyondHomophily], i.e., nodes tend to have the same class label as most of their first-degree neighbors. First-order message-passing GNNs are known to perform well in high-homophily settings [zhu2020BeyondHomophily], which is validated by our experiments presented in Table 1, with the exception of GCN+VN and GIN. All three hierarchical methods (i.e., g-U-Net, HGNet-EdgePool, and HGNet-Louvain) attain very similar results, slightly behind the best-performing GAT, GCN, and ChebNet.

The low performance of GCN+VN, a model geared towards capturing global information, and the middle-of-the-pack performances of the hierarchical methods can be explained by the high homophily present in the data; this supports prior findings [huang2020lp] showing that global graph information is not vital in these datasets. Hence, given similar model capacity and experimental settings, methods favoring local information, such as GAT and GCN, outperform the more sophisticated ones. We conclude that CiteSeer, Cora and PubMed are not directly suitable for testing the ability of GNN models to capture global information or LRIs, despite their extensive use and popularity in such benchmarks [gao2019gUnet, li2020GXN].

Table 2: Citation networks with $k$-hop sanitized dataset splits. The reported metric is the average test accuracy over three training runs with different random seeds, while keeping the same resampled splits. Heatmaps are normalized per block given by a dataset and neighborhood size combination.

Resampled citation networks

In an effort to make the prediction tasks of the CiteSeer, Cora and PubMed citation networks more suitable for testing the models' ability to utilize information from farther nodes, we experimented with a specific resampling of their training, validation and test splits. The standard semi-supervised splits [yang2016planetoid] follow the same key for each dataset: 20 examples from each class are randomly selected for training, while 500 and 1000 examples are drawn uniformly at random for the validation and test splits. We used principally the same key, but a different random sampling strategy. Once a node is drawn, we enforce that none of its $k$-th degree neighbors is selected for any split. This approach guarantees that the $k$-hop neighborhood of each labeled node is "sanitized" of labels. As such, we prevent potential correct-class label imprinting in the representations of these $k$-th degree neighbors during semi-supervised transductive training. For a model to leverage such an imprinting benefit of homophily, it has to be able to reach beyond this $k$-hop neighborhood, assuming that the class homophily spans that far in the underlying data.
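A sketch of this resampling procedure as we understand it, assuming networkx; the function name and the shared `taken` blacklist are ours:

```python
import networkx as nx

def sample_sanitized(G, candidates, num, k, taken, rng):
    """Draw up to `num` nodes so that no drawn node lies within k hops
    of any previously drawn node (`taken` is shared across splits)."""
    chosen, pool = [], list(candidates)
    rng.shuffle(pool)
    for v in pool:
        if len(chosen) == num:
            break
        if v in taken:
            continue
        chosen.append(v)
        # blacklist v together with its entire k-hop neighborhood
        ball = nx.single_source_shortest_path_length(G, v, cutoff=k)
        taken.update(ball)
    return chosen
```

In our reading, the training split is drawn first (20 nodes per class), followed by the 500 validation and 1000 test nodes, all against the same blacklist.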

We experimented with $k \in \{1, 2\}$ for all 3 citation networks and kept the same hyperparameters from the prior experiments, but varied the number of stacked layers or hierarchy levels, as applicable, for each GNN method. Results averaged over runs with 3 random seeds are shown in Table 2. For $k = 1$ we see consistent degradation of performance for single-layer GNNs, while even one level of hierarchy provides a significant advantage for the hierarchical models. GAT and GCN recover competitive performance given two layers, which allows the models to reach the second-order neighborhood, where some nodes are labeled during training. Hierarchical models, however, do not benefit from using two levels, as with even just one level their receptive field is already large enough to reach beyond the first-order neighborhood of a node. In the case of $k = 2$ we observe similar behavior, but now hierarchical models typically benefit from employing two or three levels. This is particularly true for PubMed, the largest tested dataset. In this scenario we believe we have reached the limit of these datasets, in the sense that we do not expect third-degree or farther nodes to be consistently of significant relevance. We can see that for most methods the performance is relatively similar between two and three layers. Our resampling approach is fundamentally limited by the strong local homophily present in these citation networks and beyond $k = 2$ cannot be used to test the capability of the models to leverage LRIs.

3.2 Graph-level prediction

Table 3: OGB molecular benchmarks. HGNet results are obtained and presented as per OGB standards; shown are the mean and standard deviation from 10 runs with different random seeds. HGNet models have 1, 2, or 3 levels and otherwise mirror the hyperparameters of the OGB baselines, each of which has 5 layers. The metrics for baselines are from the OGB online leaderboard.

Table 4: Color-connectivity datasets. The average test accuracy in 10-fold stratified CV for various depths of the models.

We now turn our focus to graph-level classification. We start by benchmarking all methods using a set of commonly used datasets: COLLAB, IMDB-BINARY, IMDB-MULTI, D&D, NCI1, ENZYMES, and PROTEINS [morris2020tudataset]. In the second part we present a new set of datasets we designed to challenge the GNN methods in learning to recognize a complex set of features. In this section, we use global mean pooling for each method to obtain the graph-level representation from the individual nodes of a graph. Using this representation, a graph is finally classified by a 2-layer MLP classifier with a 128-dimensional hidden layer.
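A small sketch of this readout, assuming PyTorch Geometric's global_mean_pool; the class name is ours and the layer sizes follow the text:

```python
import torch
from torch_geometric.nn import global_mean_pool

class GraphHead(torch.nn.Module):
    """Mean-pool node embeddings per graph, then classify with a 2-layer MLP."""
    def __init__(self, dim=32, hidden=128, num_classes=2):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, num_classes),
        )

    def forward(self, node_emb, batch):
        return self.mlp(global_mean_pool(node_emb, batch))  # [num_graphs, C]
```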

Our experimental results on common graph-classification datasets are presented in Table 1 (right side). One of our HGNet variants is the best-performing method on 4 of the 7 datasets. GCN+VN performs well on molecular datasets where global information is important, as does HGNet. However, g-U-Net falls behind in this setting, likely due to the nature of top-k pooling in its gPool, which destroys local information and appears to have difficulty extracting complex global features.

OGB molecular benchmarks

We tested HGNet on two Open Graph Benchmark (OGB) [hu2020OGB] molecular property prediction datasets: ogbg-molpcba and ogbg-molhiv. For our HGNet we used the same experimental setup and GCN layer implementation as provided by OGB. Both EdgePool and Louvain versions of HGNet with 2 hierarchical levels (2L), composed of 3 GCN and 2 RGCN-like layers, outperform GCN with 5 layers (see Table 3). Employing a hierarchical meta-graph is more powerful than stacking the same number of layers. We note that adding global readouts via Virtual Node is remarkably beneficial in ogbg-molpcba, albeit at the cost of many additional parameters.

Color-connectivity task

Open Graph Benchmark and other recent initiatives are raising the bar for GNN benchmarking, as many established benchmarking datasets are too small or too simple to adequately test the expressive power of new GNN methods. However, the motivation to include a new dataset in a suite is typically based on interest in a particular application domain and the scale of the dataset. Unfortunately, none of the existing benchmarks provably require the capture of LRIs for a significant performance gain. This issue was not addressed in the benchmarking of prior hierarchical methods [gao2019gUnet, li2020GXN], except by [stachenfeld2020SMP], which proposed a shortest-path prediction task in random graphs. Here we propose to employ a task not previously used for GNN benchmarking: classifying the connectivity of same-colored nodes in graphs of varying topology. Our color-connectivity datasets are created by taking a graph and randomly coloring half of its nodes one color, e.g., red, and the other nodes blue, such that the red nodes either create a single connected island or two disjoint islands. The binary classification task is then to distinguish between these two cases. The node colorings were sampled by running two red-coloring random walks starting from two random nodes. We used 16x16 and 32x32 2D grids, as well as the Euroroad and Minnesota road networks [rossi2015NR], for the underlying graph topology. For each, we sampled a balanced set of 15,000 examples, except for the Minnesota network, for which we generated 6,000 examples due to memory constraints. Solving this task requires a combination of local and long-range information, while a global readout, e.g., via a Virtual Node, is expected to be unsatisfactory.
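The sketch below reconstructs this generation process as described (our reconstruction; the balancing of the two classes via rejection sampling is omitted), assuming networkx:

```python
import networkx as nx
import random

def sample_example(G, rng):
    """Color half the nodes red via two random walks; label whether the
    red nodes form one connected island (1) or two islands (0)."""
    nodes = list(G)
    red, target = set(), len(nodes) // 2
    walkers = [rng.choice(nodes), rng.choice(nodes)]
    while len(red) < target:
        for i, u in enumerate(walkers):
            red.add(u)
            if len(red) >= target:
                break
            walkers[i] = rng.choice(list(G[u]))  # step to a random neighbor
    islands = nx.number_connected_components(G.subgraph(red))
    return red, int(islands == 1)  # each walk is connected, so islands <= 2

G = nx.grid_2d_graph(16, 16)       # one of the tested topologies
red, label = sample_example(G, random.Random(0))
```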

HGNet-EdgePool is the single best method in this suite of benchmarks (Table 4). Given the nature of the data, we observe a large difference in how suitable the hierarchical graphs created by the different approaches are. In particular, gPool of g-U-Net fails to facilitate the learning process on large graphs. Next, the global readout via the Virtual Node in the GCN+VN model does not provide any improvement over the standard GCN, as it is evidently unable to capture complex features. On the other hand, we see that the ChebNet and GIN models perform well. ChebNet can learn filters that have a large receptive field in graph space, which is important in this case. We suspect that GIN is powerful enough to learn local heuristics that GCN and GAT fail to, which warrants further investigation.

4 Conclusion

Across many datasets, we saw hierarchical models outperform their standard GNN counterparts when the construction of the hierarchical graph (its inductive bias) matches well with the graph structure and prediction task. We have not compared to methods highly specialized for particular tasks, e.g., molecular property prediction, but rather focused on elucidating the effect of using a hierarchical structure compared to the standard approach of stacking GNN layers. Further research remains to be done in exploring combinations of various pooling approaches, hierarchical message-passing algorithms, and the utilization of, e.g., GIN layers instead of GCN. Our proposed color-connectivity task requires complex graph processing to which most existing message-passing GNNs do not scale. These datasets can serve as a common-sense validation for new and more powerful methods. Our testbed datasets can still be improved, as the node features are minimal and recognition of particular topological patterns (e.g., rings or other subgraphs) is not needed to solve the current task. Nevertheless, they represent a significant step forward in terms of understanding and benchmarking more complex graph neural networks.

Acknowledgments:

The authors would like to thank William L. Hamilton for insightful discussions and Semih Cantürk for help with proofreading of the manuscript.

References

Appendix

Dataset        | # Graphs | # Nodes (avg.) | # Edges (avg.) | # Node Features       | # Classes              | Evaluation            | Metric
Cora           | 1        | 2,708          | 5,429          | 1,433                 | 7                      | 10x RS standard split | accuracy
CiteSeer       | 1        | 3,327          | 4,552          | 3,703                 | 6                      | 10x RS standard split | accuracy
PubMed         | 1        | 19,717         | 44,338         | 500                   | 3                      | 10x RS standard split | accuracy
COLLAB         | 5,000    | 74.49          | 2,457.78       | node degree           | 3                      | 10-fold stratified CV | accuracy
IMDB-BINARY    | 1,000    | 19.77          | 96.53          | node degree           | 2                      | 10-fold stratified CV | accuracy
IMDB-MULTI     | 1,500    | 13             | 65.94          | node degree           | 3                      | 10-fold stratified CV | accuracy
D&D            | 1,178    | 284.32         | 715.66         | 89                    | 2                      | 10-fold stratified CV | accuracy
NCI1           | 4,110    | 29.87          | 32.3           | 37                    | 2                      | 10-fold stratified CV | accuracy
ENZYMES        | 600      | 32.63          | 62.14          | 3                     | 6                      | 10-fold stratified CV | accuracy
PROTEINS       | 1,113    | 39.06          | 72.82          | 3                     | 2                      | 10-fold stratified CV | accuracy
ogbg-molpcba   | 437,929  | 26             | 28.1           | 9 node f., 3 edge f.  | 128 binary multilabel  | 10x RS standard split | avg. precision
ogbg-molhiv    | 41,127   | 25.5           | 27.5           | 9 node f., 3 edge f.  | 2                      | 10x RS standard split | ROC-AUC
C-C 16x16 grid | 15,000   | 256            | 480            | 1                     | 2                      | 10-fold stratified CV | accuracy
C-C 32x32 grid | 15,000   | 1,024          | 1,984          | 1                     | 2                      | 10-fold stratified CV | accuracy
C-C Euroroad   | 15,000   | 1,174          | 1,417          | 1                     | 2                      | 10-fold stratified CV | accuracy
C-C Minnesota  | 6,000    | 2,642          | 3,304          | 1                     | 2                      | 10-fold stratified CV | accuracy
Table A.1: Datasets summary. The graph statistics are computed over all graphs in the respective datasets. For model evaluation we used either the standard train/validation/test split as provided with the respective benchmark dataset and repeated the experiment 10 times with different random seeds (10x RS standard split), or we used a 10-fold stratified cross-validation protocol (10-fold stratified CV). In transductive semi-supervised node classification with $k$-hop sanitized node pre-filtering we followed the same train/validation/test splitting procedure of [yang2016planetoid]; train: 20 random pre-filtered nodes per class, validation: 500 random nodes from the remaining pre-filtered nodes, and test: 1000 random nodes from the remaining pre-filtered nodes.
Figure A.1: Two negative (a, b: label = 0) and two positive (c, d: label = 1) examples from the 16x16 grid Color-connectivity dataset.
Figure A.2: Two negative (a, b: label = 0) and two positive (c, d: label = 1) examples from the 32x32 grid Color-connectivity dataset.