Repository for benchmarking graph neural networks
Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. They have been successfully applied to a myriad of domains including chemistry, physics, social sciences, knowledge graphs, recommendation, and neuroscience. As the field grows, it becomes critical to identify the architectures and key mechanisms which generalize across graph sizes, enabling us to tackle larger, more complex datasets and domains. Unfortunately, it has been increasingly difficult to gauge the effectiveness of new GNNs and compare models in the absence of a standardized benchmark with consistent experimental settings and large datasets. In this paper, we propose a reproducible GNN benchmarking framework, with the facility for researchers to add new datasets and models conveniently. We apply this benchmarking framework to novel medium-scale graph datasets from mathematical modeling, computer vision, chemistry and combinatorial problems to establish key operations when designing effective GNNs. Precisely, graph convolutions, anisotropic diffusion, residual connections and normalization layers are universal building blocks for developing robust and scalable GNNs.
Since the pioneering works of (Scarselli et al., 2009; Bruna et al., 2013; Defferrard et al., 2016; Sukhbaatar et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017), graph neural networks (GNNs) have seen a great surge of interest in recent years with promising methods being developed. As the field grows, the question of how to build powerful GNNs has become central. What types of architectures, first principles or mechanisms are universal, generalizable, and scalable to large datasets of graphs and large graphs? Another important question is how to study and quantify the impact of theoretical developments for GNNs. Benchmarking provides a strong paradigm to answer these fundamental questions. It has proved to be beneficial in several areas of science for driving progress, identifying essential ideas, and solving domain-specific problems (Weber et al., 2019). Recently, the famous 2012 ImageNet (Deng et al., 2009) challenge provided a benchmark dataset that triggered the deep learning revolution (Krizhevsky et al., 2012; Malik, 2017). International teams competed to produce the best predictive model for image classification on a large-scale dataset. Since the breakthrough results on ImageNet, the Computer Vision community has forged the path forward towards identifying robust architectures and techniques for training deep neural networks (Zeiler & Fergus, 2014; Girshick et al., 2014; Long et al., 2015; He et al., 2016).
But designing successful benchmarks is highly challenging: it requires defining appropriate datasets, robust coding interfaces and common experimental settings for fair comparisons, all while being reproducible. Such requirements face several issues. First, how to define appropriate datasets? It may be hard to collect representative, realistic and large-scale datasets. This has been one of the most important issues with GNNs. Most published papers have focused on quite small datasets like CORA and the TU datasets (Kipf & Welling, 2017; Ying et al., 2018; Veličković et al., 2018; Xinyi & Chen, 2019; Xu et al., 2019; Lee et al., 2019), where all GNNs perform almost statistically the same. Somewhat counter-intuitively, baselines which do not consider graph structure perform equally well or, at times, better than GNNs (Errica et al., 2019). This has raised questions about the necessity of developing new and more complex GNN architectures, and even about the necessity of using GNNs at all (Chen et al., 2019). For example, in the recent works of Hoang & Maehara (2019) and Chen et al. (2019), the authors analyzed the capacity and components of GNNs to expose the limitations of the models on small datasets. They claim the datasets to be inappropriate for the design of complex structure-inductive learning architectures.
Another major issue in the GNN literature is defining common experimental settings. As noted in Errica et al. (2019), recent papers on the TU datasets do not have a consensus on training, validation and test splits, nor on evaluation protocols, making it unfair to compare the performance of new ideas and architectures. It is unclear how to perform good data splits beyond randomized splits, which are known to provide over-optimistic predictions (Lohr, 2009). Additionally, different hyper-parameters, loss functions and learning rate schedules make it difficult to identify new advances in architectures.
This paper brings the following contributions:
We release an open benchmark infrastructure for GNNs, hosted on GitHub and based on the PyTorch (Paszke et al., 2019) and DGL (Wang et al., 2019) libraries. We focus on ease-of-use for new users, making it easy to benchmark new datasets and GNN models.
We aim to go beyond the popular but small CORA and TU datasets by introducing medium-scale datasets with 12k-70k graphs of variable sizes (9-500 nodes). The proposed datasets are from mathematical modeling (Stochastic Block Models), computer vision (super-pixels), combinatorial optimization (Traveling Salesman Problem) and chemistry (molecules' solubility).
One of the goals of this work is to provide an easy-to-use collection of medium-scale datasets on which the different GNN architectures proposed in the past few years exhibit clear and statistically meaningful differences in terms of performance. We propose six datasets that are described in Table 1.
For the two computer vision datasets, each image from the classical MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009) datasets was converted into a graph using so-called super-pixels, see Section 5.2. The task is then to classify these graphs into categories. The graphs in the PATTERN and CLUSTER datasets were generated according to a Stochastic Block Model, see Section 5.4. The tasks consist of recognizing specific predetermined subgraphs (for the PATTERN dataset) or identifying clusters (for the CLUSTER dataset). These are node classification tasks. The TSP dataset is based on the Traveling Salesman Problem ("Given a list of cities, what is the shortest possible route that visits each city and returns to the origin city?"), see Section 5.5. We pose TSP on random Euclidean graphs as an edge classification/link prediction task, with the groundtruth value for each edge belonging to the TSP tour given by the Concorde solver (Applegate et al., 2006). ZINC, presented in Section 5.3, is an already existing real-world molecular dataset. Each molecule can be converted into a graph: each atom becomes a node and each bond becomes an edge. The task is to regress a molecule property known as the constrained solubility (Jin et al., 2018).
| Domain / Construction | Dataset | # graphs | # nodes |
| Computer Vision / Graphs constructed with super-pixels | MNIST | 70K | 40-75 |
| Chemistry / Real-world molecular graphs | ZINC | 12K | 9-37 |
| Artificial / Graphs generated from Stochastic Block Model | PATTERN | 14K | 50-180 |
| Artificial / Graphs generated from uniform distribution | TSP | | |
Each of the proposed datasets contains at least 12K graphs. This is in stark contrast with CORA and the popularly used TU datasets, which often contain only a few hundred graphs. On the other hand, the proposed datasets are mostly artificial or semi-artificial (except for ZINC), which is not the case with the CORA and TU datasets. We therefore view these benchmarks as complementary to each other. The main motivation of our work is to propose datasets that are large enough so that differences observed between various GNN architectures are statistically relevant.
where $h_i^{\ell}$ is the $d$-dimensional embedding representation of node $i$ at layer $\ell$, $\mathcal{N}_i$ is the set of nodes connected to node $i$ on the graph, $\deg_i$ is the degree of node $i$, $\sigma$ is a nonlinearity, and $W^{\ell}$ is a learnable parameter. We refer to this vanilla version of a graph neural network as GCN–Graph Convolutional Networks (Kipf & Welling, 2017). GraphSage (Hamilton et al., 2017) and GIN–Graph Isomorphism Network (Xu et al., 2019) propose simple variations of this averaging mechanism. In the mean version of GraphSage, the first equation of (1) is replaced with

$$\hat{h}_i^{\ell+1} = \sigma\Big( W^{\ell}\, \mathrm{Concat}\Big( h_i^{\ell},\ \frac{1}{\deg_i} \sum_{j \in \mathcal{N}_i} h_j^{\ell} \Big) \Big), \qquad (2)$$

and the embedding vectors $\hat{h}_i^{\ell+1}$ are projected onto the unit ball before being passed to the next layer. In the GIN architecture, the equations in (1) are replaced with

$$h_i^{\ell+1} = \mathrm{ReLU}\Big( W^{\ell}\, \mathrm{BN}\Big( (1+\epsilon)\, h_i^{\ell} + \sum_{j \in \mathcal{N}_i} h_j^{\ell} \Big) \Big), \qquad (3)$$

where $\epsilon$ and $W^{\ell}$ are learnable parameters and BN is the Batch Normalization layer (Ioffe & Szegedy, 2015). Importantly, GIN uses the features at all intermediate layers for the final prediction. In all the above models, each neighbor contributes equally to the update of the central node. We refer to these models as isotropic—they treat every "edge direction" equally.
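To make the isotropic update concrete, here is a minimal NumPy sketch of a GCN-style layer: average the neighbors' features, apply a shared linear map, then a nonlinearity. This is an illustration of the structure of Eq. (1) only, not the benchmark's DGL implementation; the tanh nonlinearity and the toy graph are arbitrary choices.

```python
import numpy as np

def gcn_layer(H, A, W, sigma=np.tanh):
    """One isotropic, GCN-style update: mean over each node's
    neighbors, shared linear map W, then a nonlinearity.
    Illustrative sketch only, not the benchmark's DGL code."""
    deg = A.sum(axis=1, keepdims=True)       # node degrees
    H_agg = (A @ H) / np.maximum(deg, 1.0)   # mean over neighbors
    return sigma(H_agg @ W)

# Toy path graph 0-1-2 with 2-d node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3, 2)
H1 = gcn_layer(H, A, np.eye(2))
```

The GraphSage-mean variant of Eq. (2) would additionally concatenate the central node's own features before the linear map, and GIN of Eq. (3) would sum rather than average the neighbors.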
On the other hand, MoNet–Gaussian Mixture Model Networks (Monti et al., 2017), GatedGCN–Gated Graph Convolutional Networks (Bresson & Laurent, 2017), and GAT–Graph Attention Networks (Veličković et al., 2018) propose anisotropic update schemes of the type

$$h_i^{\ell+1} = \sigma\Big( W_1^{\ell}\, h_i^{\ell} + \sum_{j \in \mathcal{N}_i} \eta_{ij}\, W_2^{\ell}\, h_j^{\ell} \Big), \qquad (4)$$

where the weights $\eta_{ij}$ are computed using various mechanisms (e.g. the attention mechanism in GAT or the gating mechanism in GatedGCN).
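The anisotropic scheme can be sketched in NumPy as follows. The per-edge weights here come from a softmax over a pairwise score, loosely in the spirit of GAT's attention (GatedGCN would instead use learned edge gates); the dot-product score and all toy values are placeholder choices, not the paper's exact formulations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def anisotropic_layer(H, A, W1, W2, score, sigma=np.tanh):
    """Anisotropic update: neighbor j contributes to node i with its
    own weight eta_ij, here a softmax over a pairwise score."""
    n = H.shape[0]
    H_new = np.zeros_like(H @ W1)
    for i in range(n):
        nbrs = np.nonzero(A[i])[0]
        if len(nbrs) == 0:                 # isolated node: self term only
            H_new[i] = sigma(H[i] @ W1)
            continue
        eta = softmax(np.array([score(H[i], H[j]) for j in nbrs]))
        msg = sum(w * (H[j] @ W2) for w, j in zip(eta, nbrs))
        H_new[i] = sigma(H[i] @ W1 + msg)
    return H_new

# Toy path graph 0-1-2, one-hot features, dot-product score.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3)
H_new = anisotropic_layer(H, A, np.eye(3), np.eye(3),
                          score=lambda hi, hj: float(hi @ hj))
```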
Finally, we also consider a hierarchical graph neural network, DiffPool–Differentiable Pooling (Ying et al., 2018), that uses the GraphSage formulation (2) at each stage of the hierarchy and for the pooling. Exact formulations for GNNs are available in the Supplementary Material. Refer to recent survey papers for a comprehensive overview of GNN literature (Bronstein et al., 2017; Zhou et al., 2018; Battaglia et al., 2018; Wu et al., 2019; Bacciu et al., 2019).
The field of GNNs has mostly used the CORA and TU datasets. These datasets are realistic but they are also small. CORA has 2.7k nodes, TU-IMDB has 1.5k graphs with 13 nodes on average and TU-MUTAG has 188 molecules with 18 nodes. Although small datasets are useful to quickly develop new ideas (such as the older Caltech object recognition datasets, with a few hundred examples: http://www.vision.caltech.edu/html-files/archive.html), they can become a liability in the long run as new GNN models will be designed to overfit the small test sets, instead of searching for more generalizable architectures. The CORA and TU datasets are examples of this overfitting problem.
As mentioned previously, another major issue with the CORA and TU datasets is the lack of reproducibility of experimental results. Most published papers do not use the same train-validation-test split. Besides, even for the same split, the performances of GNNs present a large standard deviation on a regular 10-fold cross-validation because the datasets are too small. Our numerical experiments clearly show this, see Section 5.1.
Errica et al. (2019) have recently introduced a rigorous evaluation framework to fairly compare 5 GNNs on 9 TU datasets for a single graph task–graph classification. This is motivated by earlier work by Shchur et al. (2018) on node classification, which highlighted GNN experimental pitfalls and the reproducibility issue. The paper by Errica et al. (2019) is an important first step towards a good benchmark. However, the authors only consider the small TU datasets and their rigorous evaluations are computationally expensive—they perform 47,000 experiments, where an experiment can last up to 48 hours. Additional tasks such as graph regression, node classification and edge classification are not considered, whereas the datasets are limited to the domains of chemistry and social networks. Open Graph Benchmark (http://ogb.stanford.edu/) is a recent initiative that is a very promising step toward the development of a benchmark of large real-world datasets from various domains.
This section presents our numerical experiments with the proposed open-source benchmarking framework. Most GNN implementations, GCN–Graph Convolutional Networks (Kipf & Welling, 2017), GAT–Graph Attention Networks (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), DiffPool–Differentiable Pooling (Ying et al., 2018), GIN–Graph Isomorphism Network (Xu et al., 2019), and MoNet–Gaussian Mixture Model Networks (Monti et al., 2017), were taken from the Deep Graph Library (DGL) (Wang et al., 2019) and implemented in PyTorch (Paszke et al., 2019). We upgrade all DGL GNN implementations with residual connections (He et al., 2016), batch normalization (Ioffe & Szegedy, 2015) and graph size normalization. GatedGCN–Gated Graph Convolutional Networks (Bresson & Laurent, 2017) is the final GNN we consider, with GatedGCN-E denoting the version which uses edge attributes/features, if available in the dataset. Additionally, we implement a simple graph-agnostic baseline which applies an MLP in parallel on each node's feature vector, independently of other nodes. This is optionally followed by a gating mechanism to obtain the Gated MLP baseline (see Supplementary Material for details). We run experiments for TU, MNIST, CIFAR10, ZINC and TSP on Nvidia 1080Ti GPUs, and for PATTERN and CLUSTER on Nvidia 2080Ti GPUs.
Our first experiment is graph classification on TU datasets (http://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets). We select three TU datasets—ENZYMES (480 train/60 validation/60 test graphs of sizes 2-126), DD (941 train/118 validation/119 test graphs of sizes 30-5748) and PROTEINS (889 train/112 validation/112 test graphs of sizes 4-620).
Here is the proposed benchmark protocol for TU datasets.
Splitting. We perform a 10-fold cross validation split which gives 10 sets of train, validation and test data indices in the ratio 8:1:1. We use stratified sampling to ensure that the class distribution remains the same across splits. The indices are saved and used across all experiments for fair comparisons.
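A stratified 10-fold split of the kind described above can be sketched as follows. This is a simplified illustration (not the benchmark's actual splitting code): fold f serves as the test set, the next fold as validation, and the rest as training data, giving a roughly 8:1:1 ratio for k=10.

```python
import numpy as np

def stratified_kfold(labels, k=10, seed=0):
    """Return k (train, val, test) index splits where each fold's
    class distribution mirrors the full dataset."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for c in np.unique(labels):
        # Deal each class's (shuffled) indices round-robin into folds,
        # so every fold keeps the overall class proportions.
        idx = rng.permutation(np.nonzero(labels == c)[0])
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    splits = []
    for f in range(k):
        test = np.array(folds[f])
        val = np.array(folds[(f + 1) % k])
        train = np.concatenate([folds[g] for g in range(k)
                                if g not in (f, (f + 1) % k)])
        # np.save could persist these arrays for reuse across experiments.
        splits.append((train, val, test))
    return splits
```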
Training. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate decay strategy. An initial learning rate is tuned using grid search for every GNN model. The learning rate is halved (reduce factor 0.5) if the validation loss does not improve over a fixed number of epochs (the patience). We do not set a maximum number of epochs; training is stopped when the learning rate decays below a fixed minimum value. The model parameters at the end of training are used for evaluation on test sets.
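The decay-on-plateau stopping rule above can be sketched as a plain training loop. The numeric defaults shown (initial rate, minimum rate, patience) are placeholders, since the paper's tuned values are not reproduced here; `step_fn` and `val_loss_fn` are hypothetical hooks standing in for one training epoch and one validation pass.

```python
def train_with_plateau_decay(step_fn, val_loss_fn, lr0=1e-3,
                             reduce_factor=0.5, patience=10,
                             min_lr=1e-6):
    """Halve the learning rate when the validation loss has not
    improved for `patience` epochs; stop once lr falls to min_lr
    or below. Returns (epochs_run, final_lr)."""
    lr, best, bad, epoch = lr0, float("inf"), 0, 0
    while lr > min_lr:
        step_fn(lr)              # one training epoch at the current lr
        epoch += 1
        loss = val_loss_fn()
        if loss < best - 1e-12:  # improvement: reset the patience counter
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:  # plateau: decay the learning rate
                lr *= reduce_factor
                bad = 0
    return epoch, lr
```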
We use classification accuracy between the predicted labels and groundtruth labels as our evaluation metric. Model performance is evaluated on the test split of the 10 folds for all TU datasets. The reported performance is the average and standard deviation over all the folds.
We use a graph classifier layer which first builds a graph representation by averaging all node features extracted from the last GNN layer and then passes this graph representation to an MLP.
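A minimal NumPy sketch of this readout, assuming nothing beyond the description above (the layer sizes and random weights are illustrative, not the benchmark's):

```python
import numpy as np

def graph_readout(H):
    """Mean-pool node features from the last GNN layer into a single
    graph-level representation."""
    return H.mean(axis=0)

def mlp_head(g, W1, b1, W2, b2):
    """Two-layer MLP producing class logits from the pooled vector."""
    h = np.maximum(g @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                # class logits

# Toy: 5 nodes with 8-d features mapped to 3 class logits.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
logits = mlp_head(graph_readout(H),
                  rng.standard_normal((8, 16)), np.zeros(16),
                  rng.standard_normal((16, 3)), np.zeros(3))
```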
Our goal is not to find the optimal set of hyper-parameters for each dataset, but to identify performance trends. Thus, we fix a budget of around 100k parameters for all GNNs and arbitrarily select 4 layers. The number of hidden features is estimated to match the budget. An exception is DiffPool, where we follow the authors' choice of using three graph convolutional layers each before and after the pooling layer; this is the minimal configuration for a DiffPool-based GNN model, and it pushes the number of learnable parameters above 100k for DD.
Our numerical results are presented in Table 2. All NNs have statistically similar test performance, as the standard deviation is quite large. We also report a second run of these experiments with the same experimental protocol, i.e., the same 10-fold splitting but different initialization. We observe a change of model ranking, which we attribute to the small size of the datasets and the non-determinism of gradient descent optimizers. We also observed that, for DD and PROTEINS, the graph-agnostic MLP baselines perform as well as, and sometimes better than, GNNs.
Table 2: Performance on the standard TU test sets (higher is better). Columns: Dataset, Model, #Param, then Acc ± s.d. and Epoch/Total time for each of seed 1 and seed 2. Two runs of all the experiments over the same hyperparameters but different random seeds are shown separately to note the differences in ranking and variation for reproducibility. The top 3 performance scores are highlighted as: First, Second, Third.
For the second experiment, we use the popular MNIST and CIFAR10 image classification datasets from computer vision. The original MNIST and CIFAR10 images are converted to graphs using super-pixels. Super-pixels represent small regions of homogeneous intensity in images, and can be extracted with the SLIC technique (Achanta et al., 2012). We use the SLIC super-pixels from Knyazev et al. (2019) (https://github.com/bknyaz/graph_attention_pool). MNIST contains 70K graphs of 40-75 nodes (i.e., number of super-pixels), and CIFAR10 is converted and split into train/validation/test sets in the same way.
For each sample, we build a $k$-nearest neighbor adjacency matrix with $W_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma_i^2)$, where $x_i, x_j$ are the 2-D coordinates of super-pixels $i, j$, and $\sigma_i$ is the scale parameter defined as the averaged distance of the $k$ nearest neighbors for each node. Figure 3 presents visualizations of the super-pixel graphs of MNIST and CIFAR10.
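The k-nearest-neighbor Gaussian-kernel construction above can be sketched as follows, assuming the super-pixel centers have already been extracted (e.g., with SLIC, which is not shown here); the value of k used by the paper is not reproduced, so it is left as a parameter.

```python
import numpy as np

def knn_gaussian_graph(coords, k):
    """Weighted kNN adjacency: W_ij = exp(-(d_ij / sigma_i)^2) for the
    k nearest neighbors j of i, where sigma_i is the mean distance to
    those neighbors. Assumes distinct coordinates (sigma_i > 0)."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    n = len(coords)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]   # skip self (distance 0)
        sigma = d[i, nbrs].mean()          # per-node scale parameter
        W[i, nbrs] = np.exp(-(d[i, nbrs] / sigma) ** 2)
    return W

# Toy: five collinear "super-pixel" centers.
coords = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]], dtype=float)
W = knn_gaussian_graph(coords, k=2)
```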
We propose the following benchmark setup.
Splitting. We use the standard MNIST and CIFAR10 splits.
Training. We use the Adam optimizer with a learning rate decay strategy. For all GNNs, an initial learning rate is set to , the reduce factor is , the patience value is 5, and the stopping learning rate is .
Accuracy. The performance metric is the classification accuracy between predicted and groundtruth labels.
Reproducibility. We report accuracy averaged over 4 runs with 4 different random seeds. At each run, the same seed is used for all NNs.
Graph classifier layer. We use the same graph classifier as the TU dataset experiments, Section 5.1.
Hyper-parameters and parameter budget. We determine the model hyperparameters following the TU dataset experiments (Section 5.1) with a budget of 100k. Results for graph classification on MNIST and CIFAR10 are presented in Table 3 and analyzed in Section 6.
We use the ZINC molecular graphs dataset to regress a molecular property known as the constrained solubility (Jin et al., 2018). ZINC contains 12K graphs of sizes 9-37 nodes/atoms, split into train/validation/test sets.
For each molecular graph, node features are the type of atoms and edge features are the type of edges.
The same experimental protocol as Section 5.2 is used, with the following changes:
Accuracy. The performance metric is the mean absolute error (MAE) between the predicted and the groundtruth constrained solubility.
Graph regression layer. The regression layer is similar to the graph classifier layer in section 5.2. Table 4 presents our numerical results, which we analyze in Section 6.
We consider the node-level tasks of graph pattern recognition (Scarselli et al., 2009) and semi-supervised graph clustering. The goal of graph pattern recognition is to find a fixed graph pattern $P$ embedded in larger graphs $G$ of variable sizes. Identifying patterns in different graphs is one of the most basic tasks for GNNs. The pattern and embedded graphs are generated with the stochastic block model (SBM) (Abbe, 2017). An SBM is a random graph which assigns communities to each node as follows: any two vertices are connected with probability $p$ if they belong to the same community, or with probability $q$ if they belong to different communities (the value of $q$ acts as the noise level). For all experiments, we generate graphs with 5 communities whose sizes are drawn uniformly at random within a fixed range. The signal on $G$ is generated with a uniform random distribution over a small vocabulary of values. We randomly generate patterns $P$ composed of a fixed number of nodes with given intra- and extra-probabilities (i.e., 50% of the nodes in $P$ are connected to $G$). The signal on $P$ is also generated as a random signal over the same vocabulary. The PATTERN dataset is split into train/validation/test graphs of variable sizes. The output signal has value 1 if the node belongs to $P$ and value 0 if it is in $G$.
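A minimal SBM sampler matching the definition above can be sketched in NumPy. The community sizes and the p, q values below are illustrative only; the dataset's actual generation parameters are not reproduced here.

```python
import numpy as np

def sbm(sizes, p, q, seed=0):
    """Sample a stochastic block model graph: nodes i, j are linked
    with probability p if they share a community and q otherwise.
    Returns the symmetric adjacency matrix and community labels."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)
    # Sample the strict upper triangle, then mirror it (no self-loops).
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return (upper | upper.T).astype(int), labels

# Toy: two communities of 5 nodes, noiseless (p=1, q=0).
A, lab = sbm([5, 5], p=1.0, q=0.0)
```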
The semi-supervised clustering task is another fundamental task in network science. We generate 6 SBM clusters with sizes randomly drawn within a fixed range and fixed intra- and inter-cluster probabilities. The CLUSTER dataset is split into train/validation/test graphs of variable sizes. We only provide a single randomly selected labeled node for each community. The output signal is defined as the cluster class label.
We follow the same experimental protocol as Section 5.2, with the following changes:
Accuracy. The performance metric is the average accuracy over the classes.
Node classification layer. For classifying each node, we pass the node features from the last GNN layer to an MLP. Results for node classification on CLUSTER and PATTERN are presented in Table 5 and analyzed in Section 6.
Leveraging machine learning for solving NP-hard combinatorial optimization problems (COPs) has been the focus of intense research in recent years (Vinyals et al., 2015; Bengio et al., 2018). Recently proposed deep learning-based solvers for COPs (Khalil et al., 2017; Li et al., 2018; Kool et al., 2019) combine GNNs with classical graph search to predict approximate solutions directly from problem instances (represented as graphs). Consider the intensively studied Travelling Salesman Problem (TSP): given a 2D Euclidean graph, one needs to find an optimal sequence of nodes, called a tour, with minimal total edge weights (tour length). TSP’s multi-scale nature makes it a challenging graph task which requires reasoning about both local node neighborhoods as well as global graph structure.
| MLP (Gated) | 105717 | not used | 95.18 ± 0.18 | 22.43s/0.73hr |
| MLP (Gated) | 106017 | not used | 56.78 ± 0.12 | 27.85s/0.68hr |
*GatedGCN-E uses the graph adjacency weight as edge feature.
| MLP (Gated) | 106970 | not used | 0.681 ± 0.005 | 1.16s/0.03hr |
*GatedGCN-E uses the molecule bond type as edge feature.
For our experiments with TSP, we follow the learning-based approach to COPs described in (Li et al., 2018; Joshi et al., 2019), where a GNN is the backbone architecture for assigning probabilities to each edge as belonging/not belonging to the predicted solution set. The probabilities are then converted into discrete decisions through graph search techniques. We create train, validation and test sets of TSP instances, where each instance is a graph of 2-D node locations sampled uniformly in the unit square. We generate problems of varying size and complexity by uniformly sampling the number of nodes for each instance.
In order to isolate the impact of the backbone GNN architectures from the search component, we pose TSP as a binary edge classification task,
with the groundtruth value for each edge belonging to the TSP tour given by Concorde.
For scaling to large instances, we use sparse nearest neighbor graphs instead of full graphs, following Khalil et al. (2017).
See Figure 7 for sample TSP instances of various sizes.
We follow the same experimental protocol as Section 5.2 with the following changes:
Training. The patience value is 10 by default. For additional experiments on the impact of model depth, we use a patience value of 5.
Accuracy. Given the high class imbalance, i.e., only the edges in the TSP tour have positive labels, we use the F1 score for the positive class as our performance metric.
Reproducibility. We report F1 scores averaged over 2 runs with 2 different random seeds.
Edge classifier layer. To make a prediction for each edge , we first concatenate node features and from the final GNN layer. The concatenated features are then passed to an MLP for prediction.
Non-learnt Baseline. In addition to reporting the performance of GNNs, we compare with a simple k-nearest neighbor heuristic baseline, defined as follows: predict true for the edges corresponding to the k nearest neighbors of each node, and false for all other edges. We set k=2 for optimal performance. Comparing GNNs to the non-learnt baseline tells us whether models learn something more sophisticated than identifying a node's nearest neighbors. Our numerical results are presented in Table 6 and analyzed in Section 6.
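The heuristic can be sketched in a few lines of NumPy. Whether an edge counts as positive when either endpoint (rather than both) picks it is an implementation choice made here for illustration; the paper does not spell this detail out.

```python
import numpy as np

def knn_edge_baseline(coords, k=2):
    """Non-learnt baseline: predict an edge as part of the tour iff
    it connects a node to one of its k nearest neighbors. Returns a
    boolean prediction matrix over all node pairs."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    n = len(coords)
    pred = np.zeros((n, n), dtype=bool)
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]   # skip self (distance 0)
        pred[i, nbrs] = True
    # Symmetrize: an edge is positive if either endpoint selects it.
    return pred | pred.T

# Toy: four collinear cities; the tour edges are the consecutive pairs.
coords = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
pred = knn_edge_baseline(coords, k=2)
```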
| MLP (Gated) | 103629 | not used | 50.13 ± 0.00 | 9.78s/0.12hr |
| MLP (Gated) | 104305 | not used | 20.97 ± 0.01 | 7.37s/0.09hr |
Graph-agnostic NNs (MLP) perform as well as GNNs on small datasets. Tables 2 and 3 show there is no significant improvement by using GNNs over graph-agnostic MLP baselines for the small TU datasets and the (simple) MNIST. Besides, MLP can sometimes do better than GNNs (Errica et al., 2019; Luzhnica et al., 2019), such as for the DD dataset.
GNNs improve upon graph-agnostic NNs for larger datasets. Tables 4 and 5 present a significant gain of performance for the ZINC, PATTERN, and CLUSTER datasets, in which all GNNs vastly outperform the two MLP baselines. Table 6 shows that all GNNs using residual connections surpass the MLP baselines for the TSP dataset. Results reported in Table 3 for the CIFAR10 dataset are less discriminative, although the best GNNs perform notably better than MLPs.
| k-NN Heuristic | k=2 | F1: 0.693 |
| MLP (Gated) | 115274 | not used | 0.548 ± 0.001 | 54.39s/2.44hr |
*GatedGCN-E uses the pairwise distance as edge feature.
Vanilla GCNs (Kipf & Welling, 2017) have poor performance. GCNs are the simplest form of GNNs. Their node representation update relies on an isotropic averaging operation over the neighborhood, Eq.(1). This isotropic property was analyzed in (Chen et al., 2019) and was shown to be unable to distinguish simple graph structures, explaining the low performance of GCNs across all datasets.
New isotropic GNN architectures improve on GCN. GraphSage (Hamilton et al., 2017) demonstrates the importance of using the central node information in the graph convolutional layer, Eq.(2). GIN (Xu et al., 2019) also employs the central node feature, Eq.(3), along with a new classifier layer that connects to convolutional features at all intermediate layers. DiffPool (Ying et al., 2018) considers a learnable graph pooling operation where GraphSage is used at each resolution level. These three isotropic GNNs significantly improve the performance of GCN for all datasets, apart from CLUSTER.
Anisotropic GNNs are accurate. Anisotropic models such as GAT (Veličković et al., 2018), MoNet (Monti et al., 2017) and GatedGCN (Bresson & Laurent, 2017) obtain the best results for each dataset, with the exception of PATTERN. Also, we note that GatedGCN performs consistently well across all datasets. Unlike isotropic GNNs that mostly rely on a simple sum over the neighboring features, anisotropic GNNs employ complex mechanisms (sparse attention mechanism for GAT, edge gates for GatedGCN) which are harder to implement efficiently. Our code for these models is not fully optimized and, as a result, much slower.
An additional advantage of this class of GNNs is their ability to explicitly use edge features, such as the bond type between two atoms in a molecule. In Table 4, for the ZINC molecular dataset, GatedGCN-E using the bond edge features significantly improved the MAE performance of GatedGCN without bonds.
Residual connections improve performance. Residual connections (RC), introduced in He et al. (2016), have become a universal ingredient in deep learning architectures for computer vision. Using residual links helps GNNs in two ways. First, it limits the vanishing gradient problem during backpropagation in deep networks. Second, it allows the inclusion of self-node information during convolution in models like GCN and GAT, which do not use it explicitly.
We first test the influence of RC with 4 layers. For MNIST (Table 3), RC do not improve the performance as most GNNs are able to easily overfit this dataset. For CIFAR10, RC enhance results for GCN, GIN, DiffPool and MoNet, but they do not help or degrade the performance for GraphSAGE, GAT and GatedGCN. For ZINC (Table 4), adding residual connections significantly improves GCN, DiffPool, GAT, and MoNet, but slightly degrades the performance of GIN, GraphSage and GatedGCN. For PATTERN and CLUSTER (Table 5), GCN is the only architecture that clearly benefits from RC, while the other models can see their accuracy increase or decrease in the presence of RC. For TSP (Table 6), models which do not implicitly use self-information (GCN, GAT, MoNet) benefit from skip connections, while the other GNNs retain almost the same performance.
Next, we evaluate the impact of RC for deep GNNs. Figure 8 and Table 7 present the results of deep GNNs for ZINC, CLUSTER and TSP as the number of layers is increased. Interestingly, all models benefit from residual links when the number of layers increases, except GIN, which is already equipped with skip connections for readout: the classification layer is always connected to all intermediate convolutional layers. In summary, our results suggest residual connections are an important building block for designing deep GNNs.
Normalization layers can improve learning. Most real-world graph datasets are collections of irregular graphs with varying graph sizes. Batching graphs of variable sizes may lead to node representations at different scales. Hence, normalizing activations can be helpful to improve learning and generalization. We use two normalization layers in our experiments—batch normalization (BN) from (Ioffe & Szegedy, 2015) and graph size normalization (GN). Graph size normalization is a simple operation where the resulting node features are normalized w.r.t. the graph size, i.e., $h_i \leftarrow h_i / \sqrt{V}$, where $V$ is the number of nodes. This normalization layer is applied after the convolutional layer and before the activation layer.
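Graph size normalization amounts to a one-line rescaling, sketched here on the whole node-feature matrix of a single graph:

```python
import numpy as np

def graph_size_norm(H):
    """Scale node features by 1/sqrt(V), V = number of nodes, so that
    small and large graphs batched together produce activations on
    comparable scales (applied after the conv layer, before the
    activation)."""
    V = H.shape[0]
    return H / np.sqrt(V)
```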
| Dataset | Model | #Param | BN & GN | No BN & GN |
| MLP (Gated) | 106970 | not used | 0.683 ± 0.004 |
| MLP (Gated) | 106017 | not used | 56.78 ± 0.12 |
| MLP (Gated) | 104305 | not used | 20.97 ± 0.01 |
We evaluate the impact of the normalization layers on ZINC, CIFAR10 and CLUSTER datasets in Table 8. For the three datasets, BN and GN significantly improve GAT and GatedGCN. Besides, BN and GN boost GCN performance for ZINC and CLUSTER, but do not improve for CIFAR10. GraphSage and DiffPool do not benefit from normalizations but do not lose performance, except for CLUSTER. Additionally, GIN slightly benefits from normalization for ZINC and CLUSTER, but degrades for CIFAR10. We perform an ablation study in the Supplementary Material to study the influence of each normalization layer. In summary, normalization layers can be critical to design sound GNNs.
In this paper, we propose a benchmarking framework to facilitate the study of graph neural networks, and to address experimental inconsistencies in the literature. We confirm that the widely used small TU datasets are inappropriate for examining innovations in this field, and introduce six medium-scale datasets within the framework. Our experiments on multiple tasks for graphs show: i) Graph structure is important as we move towards larger datasets; ii) GCN, the simplest isotropic version of GNNs, cannot learn complex graph structures; iii) Self-node information, hierarchy, attention mechanisms, edge gates and better readout functions are key structures to improve GCN; iv) GNNs can scale deeper using residual connections, and performance can be improved using normalization layers. As a final note, our benchmarking infrastructure, leveraging PyTorch and DGL, is fully reproducible and open to users on GitHub to experiment with new models and add datasets.
XB is supported by NRF Fellowship NRFF2017-10.
This section formally describes our experimental pipeline, illustrated in Figure 10. We detail the components of the setup including the input layers, the GNN layers and the MLP prediction layers.
Given a graph, we have node features $\alpha_i$ for each node $i$ and (optionally) edge features $\beta_{ij}$ for each edge connecting nodes $i$ and $j$. The input features $\alpha_i$ and $\beta_{ij}$ are embedded into $d$-dimensional hidden features $h_i^{0}$ and $e_{ij}^{0}$ via a simple linear projection (a learnable weight matrix plus bias for nodes and edges respectively) before passing them to a graph neural network. If the input node/edge features are one-hot vectors of discrete variables, then biases are not used.
Each GNN layer computes $d$-dimensional representations for the nodes/edges of the graph through recursive neighborhood diffusion (or message passing), where each graph node gathers features from its neighbors to represent local graph structure. Stacking $L$ GNN layers allows the network to build node representations from the $L$-hop neighborhood of each node.
Let $h_i^{\ell}$ denote the feature vector at layer $\ell$ associated with node $i$. The updated features $h_i^{\ell+1}$ at the next layer $\ell+1$ are obtained by applying non-linear transformations to the central feature vector $h_i^{\ell}$ and the feature vectors $h_j^{\ell}$ for all nodes $j$ in the neighborhood of node $i$ (defined by the graph structure). This guarantees that the transformation builds local receptive fields, as in standard ConvNets for computer vision, and is invariant to both graph size and vertex re-indexing.
Thus, the most generic version of a feature vector $h_i^{\ell+1}$ at vertex $i$ at the next layer in the graph network is:
$$h_i^{\ell+1} = f\big(h_i^{\ell},\, \{h_j^{\ell} : j \to i\}\big),$$
where $\{j \to i\}$ denotes the set of neighboring nodes $j$ pointed to node $i$, which can be replaced by $\mathcal{N}_i$, the set of neighbors of node $i$, if the graph is undirected. In other words, a GNN is defined by a mapping $f$ taking as input a vector $h_i^{\ell}$ (the feature vector of the center vertex) as well as an un-ordered set of vectors $\{h_j^{\ell}\}$ (the feature vectors of all neighboring vertices), see Figure 9. The arbitrary choice of the mapping $f$ defines an instantiation of a class of GNNs. See Table 9 for an overview of the GNNs we study in this paper.
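To make the generic update concrete, the following is a minimal NumPy sketch (not the repository's DGL implementation) of one message-passing step, with a toy choice of the mapping $f$; the function names and the neighbor-dictionary representation are illustrative assumptions.

```python
import numpy as np

def gnn_layer(H, neighbors, f):
    """Generic GNN update h_i <- f(h_i, {h_j : j in N_i}): each node combines
    its own feature vector with the un-ordered set of its neighbours'."""
    return np.stack([f(H[i], H[neighbors[i]]) for i in range(len(H))])

# Toy instantiation of the mapping f: centre vector plus neighbour mean.
def f(h_i, h_nbrs):
    return h_i + (h_nbrs.mean(axis=0) if len(h_nbrs) else np.zeros_like(h_i))

H = np.eye(3)                              # 3 nodes with one-hot features
neighbors = {0: [1], 1: [0, 2], 2: [1]}    # a path graph 0 - 1 - 2
H_next = gnn_layer(H, neighbors, f)        # H_next[0] == [1, 1, 0]
```

Any concrete GNN below (GCN, GraphSage, GIN, GAT, GatedGCN) corresponds to a particular choice of `f`.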
|Model|Self info.|Graph info.|Aggregation|Anisotropy|Heads/Kernels|
|---|---|---|---|---|---|
|GraphSage|✓|✓|Mean, Max, LSTM|✗|Single|
|GIN|✓|✓|Mean, Max, Sum|✗|Single|
|GatedGCN|✓|✓|Weighted mean|✓ (Edge gates)|Single|
In the simplest formulation of GNNs, Graph ConvNets (GCN) iteratively update node features via an isotropic averaging operation over the neighborhood node features, i.e.,
$$h_i^{\ell+1} = \mathrm{ReLU}\Big(U^{\ell}\, \frac{1}{\deg_i} \sum_{j \in \mathcal{N}_i} h_j^{\ell}\Big),$$
where $U^{\ell} \in \mathbb{R}^{d \times d}$ (a bias is also used, but omitted for clarity), and $\deg_i$ is the in-degree of node $i$, see Figure 13. Eq. (8) is called a convolution as it is a linear approximation of a localized spectral convolution. Note that it is possible to add the central node features in the update (8) by using self-loops or residual connections, see Section D.
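As an illustrative NumPy sketch of the isotropic GCN update above (a minimal dense-matrix version, not the batched DGL implementation; the bias is again omitted):

```python
import numpy as np

def gcn_layer(H, A, U):
    """Isotropic GCN update: h_i <- ReLU(U @ (1/deg_i) * sum_j h_j),
    with A[i, j] = 1 iff there is an edge j -> i."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)   # in-degree of each node
    return np.maximum(0.0, (A @ H / deg) @ U.T)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A = np.array([[0.0, 1.0, 1.0],     # node 0 averages nodes 1 and 2
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
U = np.eye(2)                      # identity weights, for illustration only
H_next = gcn_layer(H, A, U)        # H_next[0] == [0.5, 1.0]
```

Note that with the pure averaging update, a node's own features $h_i^{\ell}$ never enter `H_next[i]` unless self-loops or residual connections are added.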
GraphSage improves upon the simple GCN model by explicitly incorporating each node's own features from the previous layer in its update equation:
$$\hat{h}_i^{\ell+1} = \mathrm{ReLU}\Big(U^{\ell}\,\mathrm{Concat}\Big(h_i^{\ell},\, \frac{1}{\deg_i}\sum_{j \in \mathcal{N}_i} h_j^{\ell}\Big)\Big),$$
where $U^{\ell} \in \mathbb{R}^{d \times 2d}$, see Figure 13. Observe that the transformation applied to the central node features $h_i^{\ell}$ is different from the transformation applied to the neighborhood features $h_j^{\ell}$. The node features are then projected onto the $\ell_2$-unit ball before being passed to the next layer:
$$h_i^{\ell+1} = \frac{\hat{h}_i^{\ell+1}}{\|\hat{h}_i^{\ell+1}\|_2}.$$
The authors also define more sophisticated neighborhood aggregation functions, such as the Max-pooling or LSTM aggregators:
$$\hat{h}_i^{\ell+1} = \mathrm{ReLU}\Big(U^{\ell}\,\mathrm{Concat}\Big(h_i^{\ell},\, \max_{j \in \mathcal{N}_i} \mathrm{ReLU}\big(V^{\ell} h_j^{\ell}\big)\Big)\Big),$$
where $V^{\ell} \in \mathbb{R}^{d \times d}$ and the LSTM cell also uses learnable weights. In our experiments, we use the mean version of GraphSage, Eq. (10) (numerical experiments with the max version did not show significant differences, see Section E.2).
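The mean variant, including the $\ell_2$ projection, can be sketched in NumPy as follows (an illustrative dense version under the same assumptions as the GCN sketch; not the library implementation):

```python
import numpy as np

def graphsage_layer(H, A, U):
    """GraphSage-Mean: concatenate self features with the neighbourhood mean,
    apply ReLU(U @ .), then project onto the l2-unit ball."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    z = np.concatenate([H, A @ H / deg], axis=1)     # [h_i || mean_j h_j]
    h_hat = np.maximum(0.0, z @ U.T)                 # U has shape (d, 2d)
    norm = np.linalg.norm(h_hat, axis=1, keepdims=True)
    return h_hat / np.clip(norm, 1e-12, None)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))        # 4 nodes, d = 8
A = np.ones((4, 4)) - np.eye(4)    # fully connected toy graph
U = rng.normal(size=(8, 16))
H_next = graphsage_layer(H, A, U)
```

The concatenation is what lets the central node's transformation differ from the neighborhood transformation, and the norm projection keeps activations in a bounded range.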
The GIN architecture is based on the Weisfeiler-Lehman Isomorphism Test (Weisfeiler & Lehman, 1968), used to study the expressive power of GNNs. The node update equation is defined as:
$$h_i^{\ell+1} = \mathrm{ReLU}\Big(U^{\ell}\,\mathrm{ReLU}\Big(\mathrm{BN}\Big(V^{\ell}\Big((1+\epsilon)\,h_i^{\ell} + \sum_{j \in \mathcal{N}_i} h_j^{\ell}\Big)\Big)\Big)\Big),$$
where $\epsilon$ is a learnable constant, $U^{\ell}, V^{\ell} \in \mathbb{R}^{d \times d}$, and BN denotes Batch Normalization (described in subsequent sections), see Figure 13.
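A minimal NumPy sketch of the GIN update, with BatchNorm omitted for brevity (the function name and identity weights are illustrative assumptions):

```python
import numpy as np

def gin_layer(H, A, V, U, eps):
    """GIN update: sum-aggregate (1+eps)*self + neighbours, then a
    2-layer MLP (BatchNorm omitted here for brevity)."""
    s = (1.0 + eps) * H + A @ H                      # injective sum aggregation
    return np.maximum(0.0, np.maximum(0.0, s @ V.T) @ U.T)

H = np.eye(3)                                        # 3 nodes, one-hot features
A = np.array([[0., 1., 0.],                          # path graph 0 - 1 - 2
              [1., 0., 1.],
              [0., 1., 0.]])
V = np.eye(3)
U = np.eye(3)
H_next = gin_layer(H, A, V, U, eps=0.5)              # H_next[0] == [1.5, 1, 0]
```

Sum aggregation (rather than mean or max) is what makes the update injective on multisets of neighbor features, which underlies GIN's WL-test expressiveness argument.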
GAT uses the attention mechanism of (Bahdanau et al., 2014) to introduce anisotropy into the neighborhood aggregation function. The network employs a multi-headed architecture to increase the learning capacity, similar to the Transformer (Vaswani et al., 2017). The node update equation is given by:
$$h_i^{\ell+1} = \mathrm{Concat}_{k=1}^{K}\Big(\mathrm{ELU}\Big(\sum_{j \in \mathcal{N}_i} e_{ij}^{k,\ell}\, U^{k,\ell} h_j^{\ell}\Big)\Big),$$
where $U^{k,\ell} \in \mathbb{R}^{\frac{d}{K} \times d}$ are the $K$ linear projection heads, and $e_{ij}^{k,\ell}$ are the attention coefficients for each head, defined as:
$$e_{ij}^{k,\ell} = \mathrm{softmax}_j\big(\hat{e}_{ij}^{k,\ell}\big), \qquad \hat{e}_{ij}^{k,\ell} = \mathrm{LeakyReLU}\Big(V^{k,\ell}\,\mathrm{Concat}\big(U^{k,\ell} h_i^{\ell},\, U^{k,\ell} h_j^{\ell}\big)\Big),$$
where $V^{k,\ell} \in \mathbb{R}^{2\frac{d}{K}}$, see Figure 16. GAT learns a mean over each node's neighborhood features, sparsely weighted by the importance of each neighbor.
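The following NumPy sketch shows a single attention head (an illustrative loop-based version; the final ELU and multi-head concatenation are omitted, and an efficient implementation would be vectorized and parallelized across heads):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(H, A, U, v):
    """One GAT head: logits LeakyReLU(v . [Uh_i || Uh_j]) for each edge,
    softmax-normalised over each neighbourhood, then a weighted sum."""
    Z = H @ U.T                                # projected features
    out = np.zeros_like(Z)
    for i in range(len(H)):
        nbrs = np.flatnonzero(A[i])
        if nbrs.size == 0:
            continue
        logits = np.array([leaky_relu(v @ np.concatenate([Z[i], Z[j]]))
                           for j in nbrs])
        att = np.exp(logits - logits.max())
        att /= att.sum()                       # attention weights sum to 1
        out[i] = att @ Z[nbrs]                 # importance-weighted mean
    return out

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
A = np.ones((5, 5)) - np.eye(5)
U = rng.normal(size=(4, 4))
v = rng.normal(size=8)
out = gat_head(H, A, U, v)
```

Because the softmax weights sum to one over each neighborhood, each output is a convex combination of the projected neighbor features, i.e., a learned anisotropic mean.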
GatedGCN combines residual connections, batch normalization and edge gates to design another anisotropic variant of GCN. The authors propose to explicitly update edge features along with node features:
$$h_i^{\ell+1} = h_i^{\ell} + \mathrm{ReLU}\Big(\mathrm{BN}\Big(U^{\ell} h_i^{\ell} + \sum_{j \in \mathcal{N}_i} e_{ij}^{\ell} \odot V^{\ell} h_j^{\ell}\Big)\Big),$$
where $U^{\ell}, V^{\ell} \in \mathbb{R}^{d \times d}$, $\odot$ is the Hadamard product, and the edge gates $e_{ij}^{\ell}$ are defined as:
$$e_{ij}^{\ell} = \frac{\sigma\big(\hat{e}_{ij}^{\ell}\big)}{\sum_{j' \in \mathcal{N}_i} \sigma\big(\hat{e}_{ij'}^{\ell}\big) + \varepsilon}, \qquad \hat{e}_{ij}^{\ell+1} = \hat{e}_{ij}^{\ell} + \mathrm{ReLU}\Big(\mathrm{BN}\big(A^{\ell} h_i^{\ell} + B^{\ell} h_j^{\ell} + C^{\ell} \hat{e}_{ij}^{\ell}\big)\Big),$$
where $\sigma$ is the sigmoid function, $\varepsilon$ is a small fixed constant for numerical stability, and $A^{\ell}, B^{\ell}, C^{\ell} \in \mathbb{R}^{d \times d}$, see Figure 16. Note that the edge gates (23) can be regarded as a soft attention process, related to the standard sparse attention mechanism (Bahdanau et al., 2014). Different from other anisotropic GNNs, the GatedGCN architecture explicitly maintains edge features $\hat{e}_{ij}$ at each layer, following (Bresson & Laurent, 2019; Joshi et al., 2019).
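The gate normalization of Eq. (23) can be sketched in NumPy as follows (an illustrative dense version; real implementations gate only over the sparse edge set):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_gates(e_hat, A, eps=1e-6):
    """Soft-attention edge gates: sigmoid of the raw edge features,
    normalised over each node's neighbourhood (cf. Eq. (23))."""
    s = sigmoid(e_hat) * A[..., None]          # mask out non-edges
    return s / (s.sum(axis=1, keepdims=True) + eps)

n, d = 4, 3
rng = np.random.default_rng(0)
e_hat = rng.normal(size=(n, n, d))             # dense raw edge features
A = np.ones((n, n)) - np.eye(n)                # neighbourhood mask
gates = edge_gates(e_hat, A)
# For every node i and channel, the gates over its neighbours sum to ~1.
```

Unlike GAT's scalar attention per edge, these gates are $d$-dimensional, so each feature channel gets its own soft attention weight.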
Anisotropic GNNs, such as MoNet, GAT and GatedGCN, generally update node features as:
$$h_i^{\ell+1} = \mathrm{ReLU}\Big(U^{\ell} h_i^{\ell} + \sum_{j \in \mathcal{N}_i} \eta_{ij}\, V^{\ell} h_j^{\ell}\Big),$$
where $\eta_{ij}$ are learned edge weights. Eq. (25) can be regarded as a learnable non-linear anisotropic diffusion operator on graphs, where the discrete diffusion time is given by $\ell$, the index of the layer. Anisotropy does not come naturally on graphs. Since arbitrary graphs have no canonical orientations (unlike the up, down, left, right directions of an image), a diffusion process on graphs is isotropic by default, making all neighbors equally important. However, this may not be true in general; e.g., a neighbor in the same community as a node shares different information than a neighbor in a separate community. GAT makes the diffusion process anisotropic with the attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017). MoNet uses the node degrees as edge features, and GatedGCN employs learnable edge gates as in Marcheggiani & Titov (2017). We believe the anisotropic property to be critical in designing GNNs: anisotropic models learn the best edge representations for encoding the task-specific information flow on the graph structure.
Irrespective of the class of GNNs used, we augment each GNN layer with batch normalization (BN) (Ioffe & Szegedy, 2015), graph size normalization (GN) and residual connections (He et al., 2016). As such, we consider a more specific class of GNNs than (7):
$$h_i^{\ell+1} = h_i^{\ell} + \sigma\Big(\mathrm{Norm}\Big(g^{\ell}\big(h_i^{\ell},\, \{h_j^{\ell} : j \in \mathcal{N}_i\}\big)\Big)\Big),$$
where $\sigma$ is a non-linear activation function and $g^{\ell}$ is a specific GNN layer.
As a reminder, BatchNorm normalizes each mini-batch $\{h_i\}_{i=1}^{m}$ of feature vectors using the mini-batch mean and variance,
$$\mu = \frac{1}{m}\sum_{i=1}^{m} h_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (h_i - \mu)^2,$$
and then replaces each $h_i$ with its normalized version, followed by a learnable affine transformation:
$$\hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}, \qquad h_i \leftarrow \gamma \odot \hat{h}_i + \beta.$$
Batching graphs of variable sizes may lead to node representations at different scales, making it difficult to learn the optimal statistics $\mu$ and $\sigma^2$ for BatchNorm. Therefore, we consider a GraphNorm layer that normalizes the node features w.r.t. the graph size, i.e.,
$$\bar{h}_i = \frac{h_i}{\sqrt{V}},$$
where $V$ is the number of graph nodes. The GraphNorm layer is placed before the BatchNorm layer.
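Both normalizations can be sketched in a few lines of NumPy (an illustrative version without the running statistics a trained BatchNorm would track at inference):

```python
import numpy as np

def graph_size_norm(H):
    """Graph size normalisation: scale node features by 1/sqrt(V)."""
    return H / np.sqrt(len(H))

def batch_norm(H, gamma, beta, eps=1e-5):
    """BatchNorm over a mini-batch of feature vectors, followed by the
    learnable affine transform (gamma, beta)."""
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    return gamma * (H - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
H = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # 64 nodes, d = 8
out = batch_norm(graph_size_norm(H), gamma=np.ones(8), beta=np.zeros(8))
```

With `gamma=1` and `beta=0`, each output channel has approximately zero mean and unit variance, regardless of the graph-size rescaling applied first.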
In addition to benchmarking various classes of GNNs, we consider a simple baseline using graph-agnostic networks to obtain node features. We apply an MLP to each node's feature vector, independently of other nodes, i.e.,
$$h_i^{\ell+1} = \mathrm{ReLU}\big(U^{\ell} h_i^{\ell}\big),$$
where $U^{\ell} \in \mathbb{R}^{d \times d}$. This defines our MLP layer. We also consider a slight upgrade using a gating mechanism, which (independently) scales each node's final-layer features through a sigmoid function:
$$h_i \leftarrow \sigma\big(W h_i^{L}\big) \odot h_i^{L},$$
where $\sigma$ is the sigmoid function. This establishes our second baseline, called MLP-Gated.
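A minimal NumPy sketch of the two baselines follows; the gate projection `W_g` is a hypothetical learnable parameter, and this is one plausible form of the gating step rather than the exact benchmarked configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_baseline(H, weights):
    """Graph-agnostic baseline: an MLP applied to each node independently
    (no message passing, so the graph structure is ignored)."""
    for W in weights:
        H = np.maximum(0.0, H @ W.T)
    return H

def gate(H, W_g):
    """Sigmoid gating of the final-layer features (W_g is a hypothetical
    learnable projection): each feature is scaled into [0, h]."""
    return sigmoid(H @ W_g.T) * H

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                      # 6 nodes, d = 8
weights = [rng.normal(size=(8, 8)) for _ in range(2)]
W_g = rng.normal(size=(8, 8))
M = mlp_baseline(H, weights)                     # plain MLP baseline
H_out = gate(M, W_g)                             # MLP-Gated baseline
```

Comparing GNNs against these graph-agnostic baselines isolates how much of a model's performance actually comes from using the graph structure.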
The final component of each network is a prediction layer that computes task-dependent outputs, which are given to a loss function to train the network parameters in an end-to-end manner. The input to the prediction layer is the result of the final GNN layer for each node of the graph (except for GIN, which uses features from all intermediate layers).
To perform graph classification, we first build a $d$-dimensional graph-level vector representation $y_{\mathcal{G}}$ by averaging over all node features in the final GNN layer:
$$y_{\mathcal{G}} = \frac{1}{V}\sum_{i=1}^{V} h_i^{L}.$$
The graph features are then passed to an MLP, which outputs un-normalized logits/scores $y$ for each class:
$$y = P\,\mathrm{ReLU}\big(Q\, y_{\mathcal{G}}\big),$$
where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times d}$, and $C$ is the number of classes. Finally, we minimize the cross-entropy loss between the logits and the ground-truth labels.
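The readout and loss can be sketched in NumPy as follows (an illustrative single-graph version; the function names are assumptions):

```python
import numpy as np

def graph_logits(H, Q, P):
    """Mean-pool node features into a graph vector y_G, then an MLP
    producing C un-normalised class logits."""
    y_g = H.mean(axis=0)                    # average over all nodes
    return P @ np.maximum(0.0, Q @ y_g)

def cross_entropy(logits, label):
    """Cross-entropy via a numerically stable log-softmax."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 8))                # 10 nodes, d = 8
Q = rng.normal(size=(8, 8))
P = rng.normal(size=(3, 8))                 # C = 3 classes
logits = graph_logits(H, Q, P)
loss = cross_entropy(logits, label=1)
```

For graph regression, the same mean-pooled vector is mapped to a single real-valued score and trained with an L1 loss instead.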
For graph regression, we compute $y_{\mathcal{G}}$ using Eq. (33) and pass it to an MLP which gives the prediction score $y_{\mathrm{pred}}$:
$$y_{\mathrm{pred}} = P\,\mathrm{ReLU}\big(Q\, y_{\mathcal{G}}\big),$$
where $P \in \mathbb{R}^{1 \times d}$ and $y_{\mathrm{pred}} \in \mathbb{R}$. The L1 loss between the predicted score and the ground-truth score is minimized during training.
|Dataset|#Graphs|#Classes|Avg. Nodes|Avg. Edges|Node feat. (dim)|Edge feat. (dim)|
|---|---|---|---|---|---|---|
|ENZYMES|600|6|32.63|62.14|Node Attr (18)|N.A.|
|DD|1178|2|284.32|715.66|Node Label (89)|N.A.|
|PROTEINS|1113|2|39.06|72.82|Node Attr (29)|N.A.|
|MNIST|70000|10|70.57|564.53|Pixel+Coord (3)|Node Dist (1)|
|CIFAR10|60000|10|117.63|941.07|Pixel[RGB]+Coord (5)|Node Dist (1)|
|ZINC|12000|–|23.16|49.83|Node Label (28)|Edge Label (4)|
|PATTERN|14000|2|117.47|4749.15|Node Attr (3)|N.A.|
|CLUSTER|12000|6|117.20|4301.72|Node Attr (7)|N.A.|
|TSP|12000|2|275.76|6894.04|Coord (2)|Node Dist (1)|
For node classification, we independently pass each node's feature vector to an MLP to compute the un-normalized logits $y_i$ for each class:
$$y_i = P\,\mathrm{ReLU}\big(Q\, h_i^{L}\big),$$
where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times d}$. The cross-entropy loss, weighted inversely by the class size, is used during training.
To make a prediction for each graph edge $e_{ij}$, we first concatenate the node features $h_i^{L}$ and $h_j^{L}$ from the final GNN layer. The concatenated edge features are then passed to an MLP to compute the un-normalized logits $y_{ij}$ for each class:
$$y_{ij} = P\,\mathrm{ReLU}\big(Q\,\mathrm{Concat}\big(h_i^{L},\, h_j^{L}\big)\big),$$
where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times 2d}$. The standard cross-entropy loss between the logits and ground-truth labels is used.
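The edge prediction head can be sketched in NumPy as (an illustrative version; function names are assumptions):

```python
import numpy as np

def edge_logits(H, edges, Q, P):
    """Edge prediction: concatenate the endpoint features of each edge and
    pass them through an MLP to get per-edge class logits."""
    z = np.stack([np.concatenate([H[i], H[j]]) for i, j in edges])
    return np.maximum(0.0, z @ Q.T) @ P.T   # Q: (d, 2d), P: (C, d)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                 # final-layer node features, d = 4
edges = [(0, 1), (1, 2), (3, 4)]
Q = rng.normal(size=(4, 8))
P = rng.normal(size=(2, 4))                 # C = 2 (e.g., edge in TSP tour or not)
logits = edge_logits(H, edges, Q, P)        # one logit vector per edge
```

Note that plain concatenation makes the score direction-dependent; for undirected edges one could symmetrize, e.g., by averaging the logits of $(i,j)$ and $(j,i)$.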
In this section, we perform an ablation study on Batch Normalization (BN) and Graph Size Normalization (GN) to empirically study the impact of normalization layers in GNNs. The results, drawn from graph regression (ZINC), graph classification (CIFAR10) and node classification (CLUSTER), are summarized in Table 11. We draw the following inferences.
|Dataset|Model|#Param|BN & GN|BN & NO_GN|NO_BN & GN|NO_BN & NO_GN|
|---|---|---|---|---|---|---|
|ZINC|MLP|108975|not used|not used|not used|0.710 ± 0.001|
|ZINC|MLP (Gated)|106970|not used|not used|not used|0.683 ± 0.004|
|CIFAR10|MLP|104044|not used|not used|not used|56.01 ± 0.90|
|CIFAR10|MLP (Gated)|106017|not used|not used|not used|56.78 ± 0.12|
|CLUSTER|MLP|106015|not used|not used|not used|20.97 ± 0.01|
|CLUSTER|MLP (Gated)|104305|not used|not used|not used|20.97 ± 0.01|
|Dataset|#Param|NO_SL & RC|NO_SL & NO_RC|SL & RC|SL & NO_RC|
|---|---|---|---|---|---|
GCN, GAT and GatedGCN. For these GNNs, GN followed by BN consistently improves performance. This empirically demonstrates the necessity of normalizing the learnt features w.r.t. activation ranges and graph sizes for better training and generalization. Importantly, GN must always complement BN and is not useful on its own: solely using GN, in the absence of BN, hurts performance across all datasets compared to using no normalization at all.
GraphSage and GIN. Using BN and GN neither improves nor degrades the performance of GraphSage (except on CLUSTER). Intuitively, the implicit normalization in GraphSage, Eq. (11), acts similarly to BN, controlling the range of activations.
Numerical results for GIN are not consistent across the datasets. On ZINC and CLUSTER, BN is useful with or without GN. However, the sole usage of GN offers the best result. We hypothesize that this discrepancy is caused by a conflict between the internal BN in the GIN layer, Eq. (14), and the BN we place generically after each graph convolutional layer, Eq. (26).
The GCN model, Eq. (8), is not explicitly designed to include the central node feature $h_i^{\ell}$ in the computation of the next-layer feature $h_i^{\ell+1}$. To solve this issue, the authors decided to augment the graphs with self-loops (SL). An alternative to the self-loop trick is the use of residual connections (RC). RC allow us to explicitly include the central node features in the network architecture without increasing the size of the graphs. Besides, RC are particularly effective when designing deep networks, as they mitigate the vanishing gradient problem. This led us to carry out an ablation study on self-loops and residual connections. The results are presented in Table 12.
RC without SL gives the best performance. Overall, the best results are obtained with RC in the absence of SL. Similar performances are obtained for MNIST, CIFAR10 and PATTERN when using RC and SL. However, the use of SL increases the size of the graphs and, consequently, the computational time (up to 12%).
Decoupling central node features from neighborhood features is critical for node-level tasks. It is interesting to notice the significant gain in accuracy for node classification tasks (PATTERN and CLUSTER) when using RC and NO SL vs. using SL and NO RC.
When using SL and NO RC, the same linear transformation is applied to the central node and the neighborhood nodes. With RC and NO SL, the central vertex is treated differently from the neighborhood vertices. Such a distinction is essential for treating each node as class-wise distinct. For graph-level tasks like graph regression (ZINC) or graph classification (MNIST and CIFAR10), the final graph representation is the mean of the node features, hence the comparatively smaller margin of gain.
The TSP edge classification task, although artificially generated, presents an interesting empirical result. With the exception of GatedGCN, no GNN is able to outperform the non-learnt $k$-nearest neighbor heuristic baseline (F1 score: 0.69). This led us to further explore the suitability of the GatedGCN architecture for edge classification, see Table 13.
Impact of scale. We study how the GatedGCN architecture scales, training models across a wide range of parameter budgets. Naturally, performance improves with larger models and more layers. Somewhat counter-intuitively, even the smallest GatedGCN model, with 2 layers and 16 hidden features per layer, continued to outperform the non-learnt heuristic baseline and all other GNN architectures.
Explicitly maintaining edge features. To dissect the unreasonable effectiveness of GatedGCNs for edge tasks, we change the architecture's node update equation so that it does not explicitly maintain edge features across layers, i.e., we replace Eq. (24) with:
GatedGCN with Eq. (38) is similar to other anisotropic GNN variants, such as GAT and MoNet, which do not explicitly maintain an edge feature representation across layers.
We find that maintaining edge features is important for performance, especially for smaller models. Larger GatedGCNs without explicit edge features do not achieve the same performance as having edge features, but are still able to outperform the non-learnt heuristic. It will be interesting to further analyze the importance and trade-offs of maintaining edge features for real-world edge classification tasks.
|With edge repr: Eq.(24)||Without edge repr: Eq.(38)|
|k-NN Heuristic, k=2, F1: 0.693|
The Max-pool variant of GraphSage, Eq. (12), which applies an additional linear transformation to neighborhood features and takes an element-wise maximum across them, is isotropic but should be more powerful than the Mean variant, Eq. (10). For our main experiments, we used GraphSage-Mean in order to serve our aim of identifying the basic building blocks for upgrading the GCN architecture. Empirically, we found GraphSage-Max to have similar performance to GraphSage-Mean on TSP, see Table 14. In future work, it would be interesting to further study the application of linear transformations before/after neighborhood aggregation, as well as the trade-offs between various aggregation functions.
|k-NN Heuristic||k=2||F1: 0.693|
In our main experiments, we fix a parameter budget of 100k and arbitrarily select 4 GNN layers, then estimate the remaining model hyper-parameters to match the budget. However, for DiffPool (Ying et al., 2018), the number of trainable parameters exceeds 100k in one experiment, TU-DD. (Note that the input feature vector for each node in DD graphs is larger than in other datasets, see Table 10.) This is because the minimum requirement to constitute a DiffPool model is a single differentiable pooling layer, preceded and followed by a number of GNN layers. Thus, DiffPool effectively uses more GNN layers than all other models, as illustrated in Figure 17. Following Ying et al. (2018), we use GraphSage at each level of the hierarchy and for the pooling.
In order to be as close as possible to the budget of 100k parameters for DiffPool, we set the hidden dimension to be significantly smaller than other GNNs (e.g., as few as 8 for DD, compared to 128 for GCN). Despite smaller hidden dimensions, DiffPool still has the highest trainable parameters per experiment due to increased depth as well as the dense pooling layer. However, DiffPool performs poorly compared to other GNNs on the TU datasets, see Table 2. We attribute its low performance to the small hidden dimension.
For completeness' sake, we increase the hidden dimension for DiffPool to 32, which raises the parameter count to 592k for DD. Our results for DD and PROTEINS, presented in Table 15, show that larger DiffPool models match the performance of other GNNs (Table 2). Nonetheless, our goal is not to find the optimal set of hyper-parameters for a model, but to identify performance trends and important mechanisms for designing GNNs. In future work, it would be interesting to further study the design of hierarchical representation learning methods such as DiffPool.
|Dataset|Model|#Param|seed 1 Acc ± s.d.|seed 1 Epoch/Total|seed 2 Acc ± s.d.|seed 2 Epoch/Total|
|---|---|---|---|---|---|---|
Timing research code can be tricky due to differences in implementation and hardware acceleration; e.g., our implementation of GAT could be optimized by taking a parallelized approach to multi-head computation. Similarly, MoNet could be improved by pre-computing the in-degrees of batched graph nodes, which are used as pseudo edge features to compute Gaussian weights. Somewhat counter-intuitively, our GIN implementation is significantly faster than all other models, including vanilla GCN. Nonetheless, we take a practical view and report the average wall clock time per epoch and the total training time for each model. All experiments were implemented in DGL/PyTorch. We ran experiments for TU, MNIST, CIFAR10, ZINC and TSP on an Intel Xeon CPU E5-2690 v4 server with 4 Nvidia 1080Ti GPUs, and for PATTERN and CLUSTER on an Intel Xeon Gold 6132 CPU with 4 Nvidia 2080Ti GPUs. Each experiment was run on a single GPU, and 4 experiments were run on the server at any given time (on different GPUs). We ran each experiment for a maximum of 48 hours.
The hyperparameter settings for all models in the main paper across all benchmarked datasets are provided in Table 16.
|Model|L|hidden|out|Other|init lr|patience|min lr|#Param|
|---|---|---|---|---|---|---|---|---|
|MLP (Gated)|4|128|128|gated:True; readout:mean|1e-3|25|1e-6|79014|
|GIN|4|96|96|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|7e-3|25|1e-6|80770|
|DiffPool|3|64|–|embedding_dim:64; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean|7e-3|25|1e-6|94782|
|MoNet|4|80|80|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|25|1e-6|83538|
|MLP (Gated)|4|128|128|gated:True; readout:mean|1e-4|25|1e-6|87970|
|GIN|4|96|96|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|25|1e-6|85646|
|DiffPool|3|8|–|embedding_dim:8; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean|5e-4|25|1e-6|165342|
|MoNet|4|80|80|kernel:3; pseudo_dim_MoNet:2; readout:mean|7e-5|25|1e-6|89134|
|MLP (Gated)|4|128|128|gated:True; readout:mean|1e-4|25|1e-6|80290|
|GIN|4|96|96|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|7e-3|25|1e-6|79886|
|DiffPool|3|22|–|embedding_dim:22; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean|1e-3|25|1e-6|93780|
|MoNet|4|80|80|kernel:3; pseudo_dim_MoNet:2; readout:mean|7e-5|25|1e-6|84334|
|MLP (Gated)|4|150|150|gated:True; readout:mean|1e-3|5|1e-5|105717|
|GIN|4|110|110|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|5|1e-5|105434|
|DiffPool|3|32|–|embedding_dim:32; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean|1e-3|5|1e-5|106538|
|MoNet|4|90|90|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|5|1e-5|104049|
|GatedGCN|4|70|70|edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean|1e-3|5|1e-5|104217|
|MLP (Gated)|4|150|150|gated:True; readout:mean|1e-3|5|1e-5|106017|
|GIN|4|110|110|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|5|1e-5|105654|
|DiffPool|3|32|–|embedding_dim:16; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean|1e-3|5|1e-5|108042|
|MoNet|4|90|90|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|5|1e-5|104229|
|GatedGCN|4|70|70|edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean|1e-3|5|1e-5|104357|
|MLP (Gated)|4|135|135|gated:True; readout:mean|1e-3|5|1e-5|106970|
|GIN|4|110|110|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|5|1e-5|103079|
|DiffPool|3|56|–|embedding_dim:56; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean|1e-3|5|1e-5|110561|
|MoNet|4|90|90|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|5|1e-5|106002|
|GatedGCN|4|70|70|edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean|1e-3|5|1e-5|105735|
|MLP (Gated)|4|135|135|gated:True; readout:mean|1e-3|5|1e-5|103629|
|GIN|4|110|110|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|5|1e-5|100884|
|MoNet|4|90|90|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|5|1e-5|103775|
|MLP (Gated)|4|135|135|gated:True; readout:mean|1e-3|5|1e-5|104305|
|GIN|4|110|110|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|5|1e-5|103544|
|MoNet|4|90|90|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|5|1e-5|104227|
|MLP (Gated)|3|144|144|gated:True; readout:mean|1e-3|10|1e-5|115274|
|GIN|4|80|80|n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum|1e-3|10|1e-5|118574|
|MoNet|4|80|80|kernel:3; pseudo_dim_MoNet:2; readout:mean|1e-3|10|1e-5|94274|
|GatedGCN|4|64|64|edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean|1e-3|10|1e-5|94946|