HGNet: Hierarchical Graph Net
Graph neural networks (GNNs) based on message passing between neighboring nodes are known to be insufficient for capturing long-range interactions in graphs. In this project we study hierarchical message passing models that leverage a multi-resolution representation of a given graph. This facilitates learning of features that span large receptive fields without loss of local information, an aspect not studied in preceding work on hierarchical GNNs. We introduce Hierarchical Graph Net (HGNet), which for any two connected nodes guarantees the existence of message-passing paths of at most logarithmic length w.r.t. the input graph size. Yet, under mild assumptions, its internal hierarchy maintains asymptotic size equivalent to that of the input graph. We observe that our HGNet outperforms conventional stacking of GCN layers, particularly on molecular property prediction benchmarks. Finally, we propose two benchmarking tasks designed to elucidate the capability of GNNs to leverage long-range interactions in graphs.
Graph neural networks (GNNs), and the field of geometric deep learning, have seen rapid development in recent years [hamilton2020, bronstein2021G5] and have attained popularity in various fields involving graph and network structures. Prominent examples of GNN applications include molecular property prediction, physical systems simulation, combinatorial optimization, and interaction detection in images and text. Many current GNN designs are based on the principle of neural message passing [gilmer2017MP], where information is iteratively passed between neighboring nodes along existing edges. However, this paradigm is known to suffer from several deficiencies, including theoretical limits of representational capacity [xu2018gin] and observed limitations of information propagation over graphs [alon2020bottleneck, li2018insights, min2020scattering].

Two of the most prominent deficiencies of GNNs are known as over-squashing and over-smoothing. Information over-squashing refers to the exponential growth in the amount of information that has to be encoded by the network with each message-passing iteration, which rapidly grows beyond the capacity of a fixed hidden-layer representation [alon2020bottleneck]. Signal over-smoothing refers to the tendency of node representations to converge to local averages [li2018insights], which can also be observed in graph convolutional networks implementing low-pass filtering over the graph [min2020scattering].

A significant repercussion of these phenomena is that they limit the ability of most GNN architectures to represent long-range interactions (LRIs) in graphs. Namely, they struggle to capture dependencies between distant nodes, even when these have a potentially significant impact on the output prediction or on the internal feature extraction towards it. Capturing LRIs typically requires the number of GNN layers (each implementing an individual message passing step) to be proportional to the diameter of the graph, which in turn exacerbates the over-squashing of a massive amount of information and the over-smoothing that tends towards averaging over wide regions of the graph, if not the entire graph.
In this paper, we study the utilization of multi-scale hierarchical meta-structures to enhance message passing in GNNs and facilitate the capture of LRIs. By leveraging hierarchical message passing between nodes, our Hierarchical Graph Net (HGNet) architecture can propagate information between any two connected nodes within $\mathcal{O}(\log n)$ steps instead of $\mathcal{O}(n)$, leading to particular improvements for sparse graphs with large diameters.
We note that a few works have recently proposed related approaches using hierarchical constructions, namely g-U-Net [gao2019gUnet] and GXN [li2020GXN]. g-U-Net employs a similarity-based top-k pooling called gPool for its hierarchical construction, over which it implements bottom-up and simple top-down message passing. GXN introduced mutual-information-based pooling (VIPool) together with a more complex cross-level message passing. Next, MGKN [li2020multipole] introduced a multi-resolution GNN with a V-cycle algorithm specifically for learning solution operators of PDEs. Broadly related are also differentiable pooling methods such as DiffPool [ying2018diffpool], EdgePool [diehl2019edgepool], and GraphZoom [deng2020graphzoom]. However, these do not employ two-directional hierarchical message passing.
While LRIs are widely accepted as being important both in theoretical studies and in practice, most benchmarks used to empirically validate GNN models do not clearly exhibit this property. Among them, the importance of LRIs is perhaps best justified in biochemistry datasets, where the 2D structure of proteins and molecules is used as their graph representation. However, edges of such graphs do not encode 3D forces and global properties, leaving it up to the model to learn to recognize such LRIs. Several highly specialized models have been proposed for molecular data, but these are typically not applicable to other domains, which also hinders analysis of whether their modeling improvements stem from capturing LRIs in particular. Therefore, in our experiments we primarily focus on quantifying the benefit of using a hierarchical structure compared to the standard practice of GNN layer stacking. We also introduce two benchmarking tasks designed to elucidate the capability of general-purpose GNNs to leverage LRIs. Here, we show that hierarchical models outperform their standard GNN counterparts when their hierarchical graph construction matches well with the original graph structure and the prediction task, while uncovering related limitations of gPool in g-U-Net.
To build a hierarchical message passing model, we need to construct a hierarchical graph representation and define inter- and intra-level message passing mechanisms.
Building a hierarchical representation principally involves iterative application of graph coarsening and pooling operations. Graph coarsening computes a mapping from the nodes of a starting graph $G^{(i)}$ onto the nodes of a new, smaller graph $G^{(i+1)}$, while the pooling step computes the node and edge features of $G^{(i+1)}$ from those of $G^{(i)}$. Here we explore two different approaches: EdgePool [diehl2019edgepool] and the Louvain method for community detection [blondel2008Louvain].
EdgePool [diehl2019edgepool] is a method based on the principle of edge contractions. First, the raw score of an edge $e_{uv}$ is obtained by a linear combination of the respective node features $x_u$ and $x_v$: $r_{uv} = \mathbf{w}^\top [x_u \,\|\, x_v] + b$. Raw scores of edges incident to a node $v$ are then normalized as $s_{uv} = \exp(r_{uv}) / \sum_{w \in \mathcal{N}(v)} \exp(r_{wv})$ to obtain the final edge scores. Finally, a maximal set of edges is greedily selected according to their scores and then contracted to create a new graph $G^{(i+1)}$ from $G^{(i)}$, while nodes of $G^{(i)}$ that were not merged are carried forward to $G^{(i+1)}$. Two nodes in $G^{(i+1)}$ are then connected by an edge iff the nodes of $G^{(i)}$ they were constructed from had been adjacent in $G^{(i)}$.
Contraction of an edge $e_{uv}$ results in a new node with features $\hat{x} = s_{uv} \cdot (x_u + x_v)$. Multiplying the new node features by the edge score facilitates gradient-based learning of the scoring function, which would otherwise be independent of the final objective function.
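As a concrete illustration, one EdgePool coarsening round can be sketched in plain NumPy. This is a minimal, parameter-frozen sketch of the mechanics described above, not the authors' implementation: the scoring parameters `w` and `b` are assumed to be learnable in a real model, and greedy selection is approximated by sorting edges by score.

```python
import numpy as np

def edgepool_step(x, edges, w, b):
    """One EdgePool coarsening round (sketch).

    x: (n, d) node features; edges: list of (u, v) pairs;
    w: (2d,) scoring weights; b: scalar bias.
    Returns coarsened features, coarsened edges, and node -> cluster mapping.
    """
    n = len(x)
    # Raw edge score: linear combination of the two endpoint features.
    raw = {e: float(np.concatenate([x[e[0]], x[e[1]]]) @ w + b) for e in edges}
    inc = {u: [] for u in range(n)}
    for (u, v) in edges:
        inc[u].append((u, v))
        inc[v].append((u, v))

    def score(e):
        # Local softmax normalization over edges incident to the target node.
        u, v = e
        denom = sum(np.exp(raw[f]) for f in inc[v])
        return np.exp(raw[e]) / denom

    # Greedily contract a maximal matching, highest-scoring edges first.
    cluster = [-1] * n
    new_x, k = [], 0
    for e in sorted(edges, key=score, reverse=True):
        u, v = e
        if cluster[u] == -1 and cluster[v] == -1:
            cluster[u] = cluster[v] = k
            # Gate merged features by the edge score to keep scoring trainable.
            new_x.append(score(e) * (x[u] + x[v]))
            k += 1
    for u in range(n):  # carry unmatched nodes forward unchanged
        if cluster[u] == -1:
            cluster[u] = k
            new_x.append(x[u])
            k += 1
    # Coarse nodes are adjacent iff some of their constituents were adjacent.
    new_edges = {(min(cluster[u], cluster[v]), max(cluster[u], cluster[v]))
                 for (u, v) in edges if cluster[u] != cluster[v]}
    return np.array(new_x), sorted(new_edges), cluster
```

For example, applying this to a 4-node path graph contracts it into a 2-node graph with a single edge, with each merged node's features gated by its edge score.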
The Louvain method for community detection [blondel2008Louvain] is a heuristic based on greedy maximization of the modularity score of each community. It is an algorithm without learnable parameters that is deterministic for a fixed random seed. The Louvain algorithm merges clusters (communities) into single nodes and iteratively performs modularity clustering on the condensed graph until the score cannot be improved further. The size of the condensed graph cannot be directly controlled, but the method yields satisfying contraction ratios in practice.

To build a hierarchical meta-graph over a starting graph $G^{(i)}$, we use average node and edge feature pooling according to the modular communities identified in $G^{(i)}$ by the Louvain method to construct the following level $G^{(i+1)}$.
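One level of this Louvain-based coarsening can be sketched with NetworkX's built-in Louvain implementation (`nx.community.louvain_communities`). This is an illustrative sketch only: the paper also pools edge features, which is omitted here, and node features are passed as a simple dict.

```python
import networkx as nx
import numpy as np

def louvain_coarsen(G, x, seed=0):
    """One Louvain coarsening level with average feature pooling (sketch).

    G: an undirected nx.Graph; x: dict node -> feature vector.
    Returns (coarse graph, pooled features, node -> community mapping).
    """
    # Deterministic for a fixed seed, with no learnable parameters.
    comms = nx.community.louvain_communities(G, seed=seed)
    mapping = {u: i for i, c in enumerate(comms) for u in c}
    H = nx.Graph()
    H.add_nodes_from(range(len(comms)))
    for u, v in G.edges():
        if mapping[u] != mapping[v]:
            H.add_edge(mapping[u], mapping[v])
    # Average node feature pooling within each community.
    pooled = {i: np.mean([x[u] for u in c], axis=0) for i, c in enumerate(comms)}
    return H, pooled, mapping
```

On a graph made of two triangles joined by a bridge edge, for instance, the two triangles are condensed into (at most) two coarse nodes.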
Both EdgePool and the Louvain method provide a recipe for the construction of a hierarchical graph representation. We propose Hierarchical Graph Net (HGNet) based on either one of these approaches (see Figure 1); both variants share the same hierarchical message passing scheme, which we describe next. Our message passing, both within and between levels, is principally similar to that of g-U-Net. Consider a hierarchical meta-graph with $L$ levels $G^{(1)}, \dots, G^{(L)}$ over some input graph $G^{(0)}$. The forward propagation in HGNet consists of a computational pass going up the hierarchy and a pass going down the hierarchy, resulting in the final embedding of each node of $G^{(0)}$. In the upward pass we first apply a GCN layer [kipf2016GCN] to $G^{(i)}$, starting with $G^{(0)}$, followed by node and edge pooling according to either EdgePool or the Louvain method to instantiate the next hierarchical level $G^{(i+1)}$. This process iterates until the final level $G^{(L)}$, at which point no more pooling is done and the downward pass starts. In the downward pass we utilize RGCN [schlichtkrull2018RGCN] layers at each level $G^{(i)}$, where we add special edges of a unique type that connect merged nodes of $G^{(i)}$ with their respective representatives in $G^{(i+1)}$.
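The control flow of the two passes can be sketched as follows. This is a structural sketch only: the real HGNet uses trainable GCN layers upward and RGCN layers downward, whereas here both are replaced by a parameter-free mean aggregation, and the top-down RGCN relation is approximated by averaging each node's own neighborhood update with its parent's embedding.

```python
import numpy as np

def mean_agg(x, edges):
    """Parameter-free stand-in for a GNN layer: mean over self + neighbors."""
    nbrs = [[i] for i in range(len(x))]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return np.stack([x[idx].mean(axis=0) for idx in nbrs])

def hgnet_forward(x0, levels):
    """Sketch of HGNet's upward and downward passes.

    levels: list of (edges, pool) per hierarchy level, where `pool` maps each
    node of that level to its parent in the next level (None at the top).
    """
    up, x = [], x0
    # Upward pass: message passing within a level, then pool to the next one.
    for edges, pool in levels:
        x = mean_agg(x, edges)
        up.append(x)
        if pool is not None:
            k = max(pool) + 1
            x = np.stack([x[[i for i, p in enumerate(pool) if p == c]].mean(axis=0)
                          for c in range(k)])
    # Downward pass: inter-level edges inject top-down context into each node.
    down = up[-1]
    for lvl in range(len(levels) - 2, -1, -1):
        edges, pool = levels[lvl]
        down = 0.5 * (mean_agg(up[lvl], edges) + down[np.array(pool)])
    return down
```

Even this crude sketch exhibits the key property: on a 4-node path with one pooling level, node 0's final embedding depends on node 3's input features, which a single flat message passing layer could not achieve.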
Complexity. We now analyze the asymptotic complexity of our hierarchical meta-graph based on the EdgePool variant. Let us assume that in each round of edge contractions the size of the greedily selected maximal matching is at least a constant fraction of the number of remaining nodes, i.e., at least $cn$ for some constant $c \in (0, \tfrac{1}{2}]$. Note that $c = \tfrac{1}{2}$ when the selected set of edges is a perfect matching. That means after the first round there will be $(1-c)n$ nodes in the next level. Thus, the total number of nodes in the entire hierarchical structure over a graph $G^{(0)}$ with $n$ nodes is $\sum_{i \ge 0} (1-c)^i n = \mathcal{O}(n)$, while the number of possible levels is $\mathcal{O}(\log n)$. This construction therefore guarantees that, if $G^{(0)}$ is connected, the shortest path length between any two nodes in the hierarchical meta-graph is upper-bounded by $\mathcal{O}(\log n)$.
We can also expect the number of edges in our hierarchical graph to remain asymptotically equal to the number of edges in the input graph $G^{(0)}$. Assume there are $m$ edges in $G^{(0)}$ out of $\binom{n}{2}$ possible and that they are uniformly distributed. Then after one round of EdgePool, the number of edges in $G^{(1)}$ is expected to be $m/4$, because the number of possible edges in $G^{(1)}$ compared to $G^{(0)}$ has decreased from $\binom{n}{2}$ to $\binom{n/2}{2} \approx \tfrac{1}{4}\binom{n}{2}$, i.e., we can expect a contraction factor of $\tfrac{1}{4}$ for the number of edges. Therefore, we can expect $\sum_{i \ge 0} m/4^i = \mathcal{O}(m)$ intra-level edges in total. From the construction of the hierarchy it is also clear that the number of inter-level edges (connecting nodes between adjacent hierarchical levels) is $\mathcal{O}(n)$, as the total number of nodes is $\mathcal{O}(n)$. Therefore, the total number of edges is expected to remain $\mathcal{O}(m + n)$.

Given a deep enough hierarchy and large enough node representation capacity, the final node embeddings can incorporate LRIs from the entire graph $G^{(0)}$, as well as local information. In the case of EdgePool, the asymptotic complexity of our HGNet remains that of GCN: even though our hierarchical graph has up to $\mathcal{O}(\log n)$ hierarchical levels, its size remains asymptotically unchanged under the assumptions above. For a standard message passing GNN to theoretically achieve this capability, it is necessary to stack a number of layers proportional to the graph diameter, which may be prohibitively expensive.
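The geometric-series bookkeeping above can be checked numerically. The following sketch assumes the perfect-matching case ($c = \tfrac{1}{2}$, so each round halves the nodes) and the uniform-edge assumption (each round quarters the expected edges), and tallies the levels, total nodes, and total expected intra-level edges of the hierarchy:

```python
import math

def hierarchy_size(n, m, c=0.5):
    """Expected hierarchy size when each EdgePool round contracts a matching
    covering a 2c fraction of the remaining nodes (c = 1/2: perfect matching).
    Returns (#levels, total nodes, total expected intra-level edges)."""
    levels, total_nodes, total_edges = 0, 0, 0.0
    while n > 1:
        total_nodes += n
        total_edges += m
        n = math.ceil((1 - c) * n)  # nodes remaining after contraction
        m = m / 4                   # expected edge contraction factor of 1/4
        levels += 1
    total_nodes += n
    return levels, total_nodes, total_edges
```

For $n = 1024$ this yields $\log_2 1024 = 10$ levels, just under $2n$ total nodes, and just under $\tfrac{4}{3}m$ intra-level edges, matching the $\mathcal{O}(n)$ and $\mathcal{O}(m)$ bounds.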
In order to evaluate the performance of HGNet, we consider a wide variety of graph data, including transductive node classification and inductive graph-level classification. Our benchmarks include two settings of HGNet (namely, with EdgePool and Louvain hierarchical structures) and six competitive baseline models: GCN [kipf2016GCN], GCN+VN (GCN extended with a Virtual Node connected to all other nodes), GAT [velickovic2017GAT], ChebNet [tang2019chebnet], GIN [xu2018gin], and g-U-Net [gao2019gUnet]. The experimental setup is identical for all tested methods. Each method is trained for 200 epochs, followed by selection of the best model based on validation performance; finally, performance on the test split is reported. In the case of GCN, GCN+VN, GAT, ChebNet and GIN, we always used a stack of 2 layers unless explicitly stated otherwise. In the case of g-U-Net, we reproduced the published hyperparameters [gao2019gUnet] as closely as possible. For each method we default to a 32-dimensional hidden node representation; other hyperparameters specific to certain tasks or datasets are described in the respective sections. We note that our reproduced g-U-Net results differ from the original publication [gao2019gUnet], as there only the best validation-set results were reported rather than performance on independent test sets. This erroneous practice has occurred on several occasions in the relatively nascent field of graph deep learning [errica2019fair].

For our first benchmark, we consider semi-supervised node classification on the CiteSeer, Cora and PubMed citation networks [yang2016planetoid]. Our HGNet variants are configured with one hierarchical level, and g-U-Net with four levels as per its published hyperparameters. Citation networks are known to exhibit high homophily [zhu2020BeyondHomophily], i.e., nodes tend to have the same class label as most of their first-degree neighbors. First-order message passing GNNs are known to perform well in high-homophily settings [zhu2020BeyondHomophily], which is validated by our experiments presented in Table 1, with the exception of GCN+VN and GIN. All three hierarchical methods (i.e., g-U-Net, HGNet-EdgePool, and HGNet-Louvain) attain very similar results, slightly behind the best performing GAT, GCN, and ChebNet.
The low performance of GCN+VN, a model geared towards capturing global information, and the middle-of-the-pack performance of the hierarchical methods can be explained by the high homophily present in the data, and they support prior findings [huang2020lp] showing that global graph information is not vital in these datasets. Hence, given similar model capacity and experimental settings, methods favoring local information, such as GAT and GCN, outperform the more sophisticated ones. We conclude that CiteSeer, Cora and PubMed are not directly suitable for testing the ability of GNN models to capture global information or LRIs, despite their extensive use and popularity in such benchmarks [gao2019gUnet, li2020GXN].
In an effort to make the prediction tasks of the CiteSeer, Cora and PubMed citation networks more suitable for testing the models' ability to utilize information from farther nodes, we experimented with a specific resampling of their training, validation and test splits. The standard semi-supervised splits [yang2016planetoid] follow the same key for each dataset: 20 examples from each class are randomly selected for training, while 500 and 1,000 examples are drawn uniformly at random for the validation and test splits, respectively. We used principally the same key, but a different random sampling strategy. Once a node is drawn, we enforce that none of its up-to-$k$-th degree neighbors is selected for any split. This approach guarantees that the $k$-hop neighborhood of each labeled node is "sanitized" of labels. As such, we prevent potential correct-class label imprinting in the representations of these $k$-th degree neighbors during semi-supervised transductive training. For a model to leverage such an imprinting benefit of homophily, it has to be able to reach beyond this $k$-hop neighborhood, assuming that the class homophily spans that far in the underlying data.
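A sanitized sampler of this kind can be sketched with a BFS over the adjacency structure. This is an illustrative sketch of the constraint described above (no selected node within $k$ hops of another selected node), not the authors' exact sampling code; the function and parameter names are our own.

```python
import random
from collections import deque

def khop(adj, src, k):
    """All nodes within k hops of src (excluding src itself), via BFS."""
    seen, q = {src}, deque([(src, 0)])
    while q:
        u, d = q.popleft()
        if d == k:
            continue
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append((v, d + 1))
    seen.discard(src)
    return seen

def sanitized_sample(adj, candidates, n_samples, k, seed=0):
    """Sample up to n_samples nodes such that no two selected nodes are
    within k hops of each other (their k-hop neighborhoods are label-free)."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    chosen, blocked = [], set()
    for u in pool:
        if len(chosen) == n_samples:
            break
        if u in blocked:
            continue
        chosen.append(u)
        blocked.add(u)
        blocked |= khop(adj, u, k)  # sanitize the k-hop neighborhood
    return chosen
```

On a 10-node path graph with $k = 1$, for example, the sampler returns nodes that are pairwise non-adjacent.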
We experimented with $k \in \{1, 2\}$ for all 3 citation networks and kept the same hyperparameters as in the prior experiments, but varied the number of stacked layers or hierarchy levels, as applicable, for each GNN method. Results averaged over runs with 3 random seeds are shown in Table 2. For $k=1$ we see consistent degradation of performance for single-layer GNNs, while even one level of hierarchy provides a significant advantage for the hierarchical models. GAT and GCN recover competitive performance given two layers, which allows these models to reach the second-order neighborhood of some nodes that are labeled during training. Hierarchical models, however, do not benefit from using two levels, as with even just one level their receptive field is already large enough to reach beyond the first-order neighborhood of a node. In the case of $k=2$ we observe similar behavior, but now the hierarchical models typically benefit from employing two or three levels. This is particularly true for PubMed, the largest tested dataset. In this scenario we believe we have reached the limit of these datasets, in the sense that we do not expect third-degree or farther nodes to be consistently of significant relevance; indeed, for most methods the performance is relatively similar between two and three layers. Our resampling approach is fundamentally limited by the strong local homophily present in these citation networks, and beyond $k=2$ it cannot be used to test the capability of the models to leverage LRIs.
We now turn our focus to graph-level classification. We start by benchmarking all methods on a set of commonly used datasets: COLLAB, IMDB-BINARY, IMDB-MULTI, D&D, NCI1, ENZYMES, and PROTEINS [morris2020tudataset]. In the second part we present a new set of datasets we designed to challenge GNN methods to learn to recognize a complex set of features. In this section, we use global mean pooling for each method to obtain the graph-level representation from the individual nodes of a graph. Using this representation, a graph is finally classified by a 2-layer MLP classifier with a 128-dimensional hidden layer.
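The graph-level prediction head just described can be sketched as follows. This is a minimal NumPy stand-in for the mean-pooling readout plus 2-layer MLP; the weight matrices are assumed to be learned parameters, and their shapes are illustrative.

```python
import numpy as np

def readout_and_classify(node_emb, W1, b1, W2, b2):
    """Graph-level head (sketch): global mean pooling over node embeddings,
    then a 2-layer MLP. Shapes: W1 (d, 128), W2 (128, n_classes)."""
    g = node_emb.mean(axis=0)         # (d,) graph-level embedding
    h = np.maximum(0.0, g @ W1 + b1)  # ReLU hidden layer (128-dimensional)
    return h @ W2 + b2                # class logits
```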
Our experimental results on common graph-classification datasets are presented in Table 1 (right side). One of our HGNet variants is the best performing method on 4 out of the 7 datasets. GCN+VN performs well on molecular datasets where global information is important, as does HGNet. However, g-U-Net falls behind in this setting, likely due to the nature of the top-k pooling in its gPool, which destroys local information and appears to have difficulty extracting complex global features.
We tested HGNet on two Open Graph Benchmark (OGB) [hu2020OGB] molecular property prediction datasets: ogbg-molpcba and ogbg-molhiv. For our HGNet we used the same experimental setup and GCN layer implementation as provided by OGB. Both the EdgePool and Louvain versions of HGNet with 2 hierarchical levels (2L), composed of 3 GCN and 2 RGCN-like layers, outperform GCN with 5 layers (see Table 3). Employing a hierarchical meta-graph is thus more powerful than stacking the same number of layers. We note that adding global readouts via a Virtual Node is remarkably beneficial on ogbg-molpcba, albeit at the cost of many additional parameters.
Open Graph Benchmark and other recent initiatives are raising the bar for GNN benchmarking, as many established benchmarking datasets are too small or too simple to adequately test the expressive power of new GNN methods. However, the motivation to include a new dataset in a suite is typically based on interest in a particular application domain and the scale of the dataset. Unfortunately, none of the existing benchmarks provably require the capture of LRIs for a significant performance gain. This issue was not addressed in the benchmarking of prior hierarchical methods [gao2019gUnet, li2020GXN], except by [stachenfeld2020SMP], which proposed a shortest-path prediction task in random graphs. Here we propose to employ a task not used for GNN benchmarking before: classifying the connectivity of same-colored nodes in graphs of varying topology. Our color-connectivity datasets are created by taking a graph and randomly coloring half of its nodes one color, e.g., red, and the other nodes blue, such that the red nodes either form a single connected island or two disjoint islands. The binary classification task is then to distinguish between these two cases. The node colorings were sampled by running two red-coloring random walks starting from two random nodes. We used 16x16 and 32x32 2D grids, as well as the Euroroad and Minnesota road networks [rossi2015NR], for the underlying graph topology. For each, we sampled a balanced set of 15,000 examples, except for the Minnesota network, for which we generated 6,000 examples due to memory constraints. Solving this task requires a combination of local and long-range information, while a global readout, e.g., via a Virtual Node, is expected to be unsatisfactory.
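The generation of a single color-connectivity example can be sketched as follows. This is an illustrative sketch of the sampling process described above (two red-coloring random walks on a 2D grid, with the label given by the connectivity of the red set), not the authors' released generator; note that since each walk's visited set is connected, the red nodes always form at most two islands.

```python
import random
from collections import deque

def grid_adj(w, h):
    """Adjacency lists of a w x h 2D grid graph."""
    adj = {(i, j): [] for i in range(w) for j in range(h)}
    for (i, j) in adj:
        for di, dj in ((1, 0), (0, 1)):
            if (i + di, j + dj) in adj:
                adj[(i, j)].append((i + di, j + dj))
                adj[(i + di, j + dj)].append((i, j))
    return adj

def sample_color_connectivity(w, h, seed=0):
    """Color half the grid red via two random walks.
    Returns the red node set and the label: 1 iff red is one connected island."""
    rng = random.Random(seed)
    adj = grid_adj(w, h)
    nodes = list(adj)
    target = len(nodes) // 2
    red, walkers = set(), rng.sample(nodes, 2)
    while len(red) < target:
        for i, u in enumerate(walkers):
            red.add(u)
            walkers[i] = rng.choice(adj[u])  # take one random-walk step
            if len(red) == target:
                break
    # Label by counting connected components of the red-induced subgraph.
    comps, seen = 0, set()
    for s in red:
        if s in seen:
            continue
        comps += 1
        seen.add(s)
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v in red and v not in seen:
                    seen.add(v)
                    q.append(v)
    return red, int(comps == 1)
```

A balanced dataset is then obtained by sampling many such examples and keeping equal numbers of single-island and two-island cases.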
HGNet-EdgePool is the single best method in this suite of benchmarks (Table 4). Given the nature of the data, we observe a large difference in how suitable the hierarchical graphs created by the different approaches are. In particular, the gPool of g-U-Net fails to facilitate the learning process on large graphs. Next, the global readout via a Virtual Node in the GCN+VN model does not provide any improvement over the standard GCN, as it is evidently not able to capture complex features. On the other hand, we see that the ChebNet and GIN models perform well. ChebNet can learn filters that have a large receptive field in graph space, which is important in this case. We suspect that GIN is powerful enough to learn local heuristics that GCN and GAT fail to, which warrants further investigation.
Across many datasets, we saw hierarchical models outperform their standard GNN counterparts when the construction of the hierarchical graph (its inductive bias) matches well with the graph structure and the prediction task. We did not compare against methods highly specialized for particular tasks, e.g., molecular property prediction, but rather focused on elucidating the effect of using a hierarchical structure compared to the standard approach of stacking GNN layers. Further research remains to be done in exploring combinations of various pooling approaches, hierarchical message passing algorithms, and the utilization of, e.g., GIN layers instead of GCN. Our proposed color-connectivity task requires complex graph processing to which most existing message-passing GNNs do not scale. These datasets can serve as a common-sense validation for new and more powerful methods. Our testbed datasets can still be improved, as the node features are minimal and recognition of particular topological patterns (e.g., rings or other subgraphs) is not needed to solve the current task. Nevertheless, they represent a significant step forward in terms of understanding and benchmarking more complex graph neural networks.
The authors would like to thank William L. Hamilton for insightful discussions and Semih Cantürk for help with proofreading of the manuscript.
| Dataset | # Graphs | Avg. # Nodes | Avg. # Edges | # Node Feat. | # Classes | Evaluation | Metric |
|---|---|---|---|---|---|---|---|
| Cora | 1 | 2,708 | 5,429 | 1,433 | 7 | 10x RS standard split | accuracy |
| CiteSeer | 1 | 3,327 | 4,552 | 3,703 | 6 | 10x RS standard split | accuracy |
| PubMed | 1 | 19,717 | 44,338 | 500 | 3 | 10x RS standard split | accuracy |
| COLLAB | 5,000 | 74.49 | 2457.78 | – | 3 | 10-fold stratified CV | accuracy |
| IMDB-BINARY | 1,000 | 19.77 | 96.53 | – | 2 | 10-fold stratified CV | accuracy |
| IMDB-MULTI | 1,500 | 13 | 65.94 | – | 3 | 10-fold stratified CV | accuracy |
| D&D | 1,178 | 284.32 | 715.66 | 89 | 2 | 10-fold stratified CV | accuracy |
| NCI1 | 4,110 | 29.87 | 32.3 | 37 | 2 | 10-fold stratified CV | accuracy |
| ENZYMES | 600 | 32.63 | 62.14 | 3 | 6 | 10-fold stratified CV | accuracy |
| PROTEINS | 1,113 | 39.06 | 72.82 | 3 | 2 | 10-fold stratified CV | accuracy |
| ogbg-molpcba | 437,929 | 26 | 28.1 | – | – | 10x RS standard split | avg. precision |
| ogbg-molhiv | 41,127 | 25.5 | 27.5 | – | 2 | 10x RS standard split | ROC-AUC |
| CC 16x16 grid | 15,000 | 256 | 480 | 1 | 2 | 10-fold stratified CV | accuracy |
| CC 32x32 grid | 15,000 | 1,024 | 1,984 | 1 | 2 | 10-fold stratified CV | accuracy |
| CC Euroroad | 15,000 | 1,174 | 1,417 | 1 | 2 | 10-fold stratified CV | accuracy |
| CC Minnesota | 6,000 | 2,642 | 3,304 | 1 | 2 | 10-fold stratified CV | accuracy |