Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. They have been successfully applied to a myriad of domains including chemistry, physics, social sciences, knowledge graphs, recommendation, and neuroscience. As the field grows, it becomes critical to identify the architectures and key mechanisms which generalize across graph sizes, enabling us to tackle larger, more complex datasets and domains. Unfortunately, it has been increasingly difficult to gauge the effectiveness of new GNNs and compare models in the absence of a standardized benchmark with consistent experimental settings and large datasets. In this paper, we propose a reproducible GNN benchmarking framework, with the facility for researchers to add new datasets and models conveniently. We apply this benchmarking framework to novel medium-scale graph datasets from mathematical modeling, computer vision, chemistry and combinatorial problems to establish key operations when designing effective GNNs. Precisely, graph convolutions, anisotropic diffusion, residual connections and normalization layers are universal building blocks for developing robust and scalable GNNs.
Since the pioneering works of (Scarselli et al., 2009; Bruna et al., 2013; Defferrard et al., 2016; Sukhbaatar et al., 2016; Kipf & Welling, 2017; Hamilton et al., 2017), graph neural networks (GNNs) have seen a great surge of interest in recent years with promising methods being developed. As the field grows, the question of how to build powerful GNNs has become central. What types of architectures, first principles or mechanisms are universal, generalizable, and scalable to large datasets of graphs and large graphs? Another important question is how to study and quantify the impact of theoretical developments for GNNs. Benchmarking provides a strong paradigm to answer these fundamental questions. It has proved to be beneficial in several areas of science for driving progress, identifying essential ideas, and solving domain-specific problems (Weber et al., 2019). Recently, the famous 2012 ImageNet (Deng et al., 2009) challenge has provided a benchmark dataset that has triggered the deep learning revolution (Krizhevsky et al., 2012; Malik, 2017). International teams competed to produce the best predictive model for image classification on a large-scale dataset. Since breakthrough results on ImageNet, the Computer Vision community has forged the path forward towards identifying robust architectures and techniques for training deep neural networks (Zeiler & Fergus, 2014; Girshick et al., 2014; Long et al., 2015; He et al., 2016).
But designing successful benchmarks is highly challenging: it requires defining appropriate datasets, robust coding interfaces and common experimental settings for fair comparisons, all while being reproducible. Such requirements face several issues. First, how to define appropriate datasets? It may be hard to collect representative, realistic and large-scale datasets. This has been one of the most important issues with GNNs. Most published papers have focused on quite small datasets like CORA and TU datasets (Kipf & Welling, 2017; Ying et al., 2018; Veličković et al., 2018; Xinyi & Chen, 2019; Xu et al., 2019; Lee et al., 2019), where all GNNs perform statistically almost the same. Somewhat counter-intuitively, baselines which do not consider graph structure perform equally well or, at times, better than GNNs (Errica et al., 2019). This has raised questions on the necessity of developing new and more complex GNN architectures, and even on the necessity of using GNNs (Chen et al., 2019). For example, in the recent works of Hoang & Maehara (2019) and Chen et al. (2019), the authors analyzed the capacity and components of GNNs to expose the limitations of the models on small datasets. They claim the datasets to be inappropriate for the design of complex structure-inductive learning architectures.
Another major issue in the GNN literature is to define common experimental settings. As noted in Errica et al. (2019), recent papers on TU datasets do not have a consensus on training, validation and test splits as well as evaluation protocols, making it unfair to compare the performance of new ideas and architectures. It is unclear how to perform good data splits beyond randomized splits, which are known to provide over-optimistic predictions (Lohr, 2009). Additionally, different hyperparameters, loss functions and learning rate schedules make it difficult to identify new advances in architectures.
This paper brings the following contributions:
We release an open benchmark infrastructure for GNNs, hosted on GitHub and based on the PyTorch (Paszke et al., 2019) and DGL (Wang et al., 2019) libraries. We focus on ease of use for new users, making it easy to benchmark new datasets and GNN models.
We aim to go beyond the popular but small CORA and TU datasets by introducing medium-scale datasets with 12k-70k graphs of variable sizes, from 9 to 500 nodes. The proposed datasets are from mathematical modeling (Stochastic Block Models), computer vision (superpixels), combinatorial optimization (Traveling Salesman Problem) and chemistry (molecules' solubility).
One of the goals of this work is to provide an easy-to-use collection of medium-scale datasets on which the different GNN architectures that have been proposed in the past few years exhibit clear and statistically meaningful differences in terms of performance. We propose six datasets that are described in Table 1.
For the two computer vision datasets, each image from the classical MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009) datasets was converted into a graph using so-called superpixels, see Section 5.2. The task is then to classify these graphs into categories. The graphs in the PATTERN and CLUSTER datasets were generated according to a Stochastic Block Model, see Section 5.4. The tasks consist of recognizing specific predetermined subgraphs (for the PATTERN dataset) or identifying clusters (for the CLUSTER dataset). These are node classification tasks. The TSP dataset is based on the Traveling Salesman Problem ("Given a list of cities, what is the shortest possible route that visits each city and returns to the origin city?"), see Section 5.5. We pose TSP on random Euclidean graphs as an edge classification/link prediction task, with the ground-truth value for each edge belonging to the TSP tour given by the Concorde solver (Applegate et al., 2006). ZINC, presented in Section 5.3, is an already existing real-world molecular dataset. Each molecule can be converted into a graph: each atom becomes a node and each bond becomes an edge. The task is to regress a molecule property known as the constrained solubility (Jin et al., 2018).

Table 1: Proposed benchmark datasets.

| Domain / Construction | Dataset | # graphs | # nodes |
|---|---|---|---|
| Computer Vision / Graphs constructed with superpixels | MNIST | 70K | 40-75 |
| Computer Vision / Graphs constructed with superpixels | CIFAR10 | 60K | 85-150 |
| Chemistry / Real-world molecular graphs | ZINC | 12K | 9-37 |
| Artificial / Graphs generated from Stochastic Block Model | PATTERN | 14K | 50-180 |
| Artificial / Graphs generated from Stochastic Block Model | CLUSTER | 12K | 40-190 |
| Artificial / Graphs generated from uniform distribution | TSP | 12K | 50-500 |
Each of the proposed datasets contains at least 12K graphs. This is in stark contrast with CORA and the popularly used TU datasets, which often contain only a few hundred graphs. On the other hand, the proposed datasets are mostly artificial or semi-artificial (except for ZINC), which is not the case with the CORA and TU datasets. We therefore view these benchmarks as complementary to each other. The main motivation of our work is to propose datasets that are large enough so that differences observed between various GNN architectures are statistically relevant.
In their simplest form (Sukhbaatar et al., 2016; Kipf & Welling, 2017), graph neural networks iteratively update node representations from one layer to the other according to the formula:
$$\hat{h}_i^{\ell+1} = \frac{1}{\deg_i} \sum_{j \in \mathcal{N}_i} h_j^{\ell}, \qquad h_i^{\ell+1} = \sigma\left(W^{\ell}\, \hat{h}_i^{\ell+1}\right), \tag{1}$$
where $h_i^{\ell}$ is the $d$-dimensional embedding representation of node $i$ at layer $\ell$, $\mathcal{N}_i$ is the set of nodes connected to node $i$ on the graph, $\deg_i$ is the degree of node $i$, $\sigma$ is a nonlinearity, and $W^{\ell}$ is a learnable parameter. We refer to this vanilla version of a graph neural network as GCN, Graph Convolutional Networks (Kipf & Welling, 2017). GraphSage (Hamilton et al., 2017) and GIN, Graph Isomorphism Network (Xu et al., 2019), propose simple variations of this averaging mechanism. In the mean version of GraphSage, the first equation of (1) is replaced with
$$\hat{h}_i^{\ell+1} = \mathrm{Concat}\left(h_i^{\ell},\ \frac{1}{\deg_i} \sum_{j \in \mathcal{N}_i} h_j^{\ell}\right), \tag{2}$$
and the embedding vectors are projected onto the unit ball before being passed to the next layer. In the GIN architecture, the equations in (1) are replaced with
$$\hat{h}_i^{\ell+1} = (1 + \epsilon)\, h_i^{\ell} + \sum_{j \in \mathcal{N}_i} h_j^{\ell}, \tag{3}$$
$$h_i^{\ell+1} = \sigma\left(U^{\ell}\, \sigma\left(\mathrm{BN}\left(V^{\ell}\, \hat{h}_i^{\ell+1}\right)\right)\right), \tag{4}$$
where $\epsilon$, $U^{\ell}$, $V^{\ell}$ are learnable parameters and BN is the Batch Normalization layer (Ioffe & Szegedy, 2015). Importantly, GIN uses the features at all intermediate layers for the final prediction. In all the above models, each neighbor contributes equally to the update of the central node. We refer to these models as isotropic: they treat every "edge direction" equally.
On the other hand, MoNet, Gaussian Mixture Model Networks (Monti et al., 2017), GatedGCN, Gated Graph Convolutional Networks (Bresson & Laurent, 2017), and GAT, Graph Attention Networks (Veličković et al., 2018), propose anisotropic update schemes of the type
$$h_i^{\ell+1} = w_i^{\ell}\, h_i^{\ell} + \sum_{j \in \mathcal{N}_i} w_{ij}^{\ell}\, h_j^{\ell}, \tag{5}$$
where the weights $w_i^{\ell}$ and $w_{ij}^{\ell}$ are computed using various mechanisms (e.g. attention mechanism in GAT or gating mechanism in GatedGCN).
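To make the distinction between isotropic and anisotropic updates concrete, here is a minimal pure-Python sketch. It is an illustration only, not the benchmark's implementation: the function names, the toy graph and the dot-product scoring function are assumptions, the anisotropic variant uses a softmax over neighbor scores as one possible weighting mechanism, and the self term is omitted for brevity.

```python
import math

def isotropic_update(h, neighbors, weight, i):
    """Mean-aggregate the neighbor embeddings of node i, then apply a linear map and ReLU."""
    nbrs = neighbors[i]
    dim = len(h[i])
    mean = [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
    out = [sum(weight[r][c] * mean[c] for c in range(dim)) for r in range(len(weight))]
    return [max(0.0, v) for v in out]

def anisotropic_update(h, neighbors, i, score):
    """Weight each neighbor j by a softmax over score(h_i, h_j) before summing."""
    nbrs = neighbors[i]
    raw = [score(h[i], h[j]) for j in nbrs]
    top = max(raw)  # subtract the max for numerical stability
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    w = [e / total for e in exps]
    dim = len(h[i])
    return [sum(w[k] * h[j][d] for k, j in enumerate(nbrs)) for d in range(dim)]

def dot(a, b):
    """Hypothetical scoring function: plain dot product of two embeddings."""
    return sum(x * y for x, y in zip(a, b))

# Toy graph: node 0 has neighbors 1 and 2 (values chosen for illustration).
h = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 0.0]}
neighbors = {0: [1, 2]}
identity = [[1.0, 0.0], [0.0, 1.0]]

iso = isotropic_update(h, neighbors, identity, 0)   # both neighbors count equally
aniso = anisotropic_update(h, neighbors, 0, dot)    # neighbor 2 dominates
```

In the isotropic case both neighbors contribute with weight 1/2, whereas the anisotropic update shifts almost all of the weight to the neighbor whose embedding is most aligned with the central node.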
Finally, we also consider a hierarchical graph neural network, DiffPool–Differentiable Pooling (Ying et al., 2018), that uses the GraphSage formulation (2) at each stage of the hierarchy and for the pooling. Exact formulations for GNNs are available in the Supplementary Material. Refer to recent survey papers for a comprehensive overview of GNN literature (Bronstein et al., 2017; Zhou et al., 2018; Battaglia et al., 2018; Wu et al., 2019; Bacciu et al., 2019).
The field of GNNs has mostly used the CORA and TU datasets. These datasets are realistic but they are also small. CORA has 2.7k nodes, TU-IMDB has 1.5k graphs with 13 nodes on average and TU-MUTAG has 188 molecules with 18 nodes. Although small datasets are useful to quickly develop new ideas (such as the older Caltech object recognition datasets, with a few hundred examples: http://www.vision.caltech.edu/htmlfiles/archive.html), they can become a liability in the long run as new GNN models will be designed to overfit the small test sets, instead of searching for more generalizable architectures. CORA and TU datasets are examples of this overfitting problem.
As mentioned previously, another major issue with CORA and TU datasets is the lack of reproducibility of experimental results. Most published papers do not use the same train-validation-test split. Besides, even for the same split, the performances of GNNs present a large standard deviation on a regular 10-fold cross-validation because the datasets are too small. Our numerical experiments clearly show this, see Section 5.1.
Errica et al. (2019) have recently introduced a rigorous evaluation framework to fairly compare 5 GNNs on 9 TU datasets for a single graph task, graph classification. This is motivated by earlier work by Shchur et al. (2018) on node classification, which highlighted GNN experimental pitfalls and the reproducibility issue. The paper by Errica et al. (2019) is an important first step towards a good benchmark. However, the authors only consider the small TU datasets and their rigorous evaluations are computationally expensive: they perform 47,000 experiments, where an experiment can last up to 48 hours. Additional tasks such as graph regression, node classification and edge classification are not considered, and the datasets are limited to the domains of chemistry and social networks. Open Graph Benchmark (http://ogb.stanford.edu/) is a recent initiative that is a very promising step toward the development of a benchmark of large real-world datasets from various domains.
This section presents our numerical experiments with the proposed open-source benchmarking framework. Most GNN implementations, namely GCN, Graph Convolutional Networks (Kipf & Welling, 2017), GAT, Graph Attention Networks (Veličković et al., 2018), GraphSage (Hamilton et al., 2017), DiffPool, Differentiable Pooling (Ying et al., 2018), GIN, Graph Isomorphism Network (Xu et al., 2019), and MoNet, Gaussian Mixture Model Networks (Monti et al., 2017), were taken from the Deep Graph Library (DGL) (Wang et al., 2019) and implemented in PyTorch (Paszke et al., 2019). We upgrade all DGL GNN implementations with residual connections (He et al., 2016), batch normalization (Ioffe & Szegedy, 2015) and graph size normalization. GatedGCN, Gated Graph Convolutional Networks (Bresson & Laurent, 2017), is the final GNN we consider, with GatedGCN-E denoting the version which uses edge attributes/features, if available in the dataset. Additionally, we implement a simple graph-agnostic baseline which applies an MLP to each node's feature vector in parallel, independently of other nodes. This is optionally followed by a gating mechanism to obtain the Gated MLP baseline (see Supplementary Material for details). We run experiments for TU, MNIST, CIFAR10, ZINC and TSP on Nvidia 1080Ti GPUs, and for PATTERN and CLUSTER on Nvidia 2080Ti GPUs.
Our first experiment is graph classification on TU datasets (http://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets). We select three TU datasets: ENZYMES (480 train/60 validation/60 test graphs of sizes 2-126), DD (941 train/118 validation/119 test graphs of sizes 30-5748) and PROTEINS (889 train/112 validation/112 test graphs of sizes 4-620).
Here is the proposed benchmark protocol for TU datasets.
Splitting. We perform a 10-fold cross validation split which gives 10 sets of train, validation and test data indices in the ratio 8:1:1. We use stratified sampling to ensure that the class distribution remains the same across splits. The indices are saved and used across all experiments for fair comparisons.
Training. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate decay strategy. The initial learning rate is tuned by grid search for every GNN model. The learning rate is reduced by half, i.e., reduce factor 0.5, if the validation loss does not improve for a fixed number of epochs (the patience). We do not set a maximum number of epochs; training is stopped when the learning rate decays to a fixed minimum value or less. The model parameters at the end of training are used for evaluation on test sets.
Accuracy. We use classification accuracy between the predicted labels and ground-truth labels as our evaluation metric. Model performance is evaluated on the test split of the 10 folds for all TU datasets. The reported performance is the average and standard deviation over all the folds.
Graph classifier layer. We use a graph classifier layer which first builds a graph representation by averaging all node features extracted from the last GNN layer and then passes this graph representation to an MLP.
Hyperparameters and parameter budget. Our goal is not to find the optimal set of hyperparameters for each dataset, but to identify performance trends. Thus, we fix a budget of around 100k parameters for all GNNs and arbitrarily select 4 layers. The number of hidden features is estimated to match the budget. An exception to this is DiffPool, where we resort to the authors' choice of using three graph convolutional layers each before and after the pooling layer. This is the minimal configuration that constitutes a DiffPool-based GNN model, which pushes the number of learnable parameters above 100k for DD.
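The splitting step of the protocol can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in the benchmark: the helper names `stratified_folds` and `train_val_test` are hypothetical, and choosing "the next fold" as the validation set is just one way to obtain an 8:1:1 ratio from 10 stratified folds.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so that each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # deal the shuffled indices of this class round-robin over the folds
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

def train_val_test(folds, fold_id):
    """Use fold `fold_id` as test set, the next fold as validation set, the rest as train set."""
    k = len(folds)
    val_id = (fold_id + 1) % k
    test = folds[fold_id]
    val = folds[val_id]
    train = [i for f in range(k) if f not in (fold_id, val_id) for i in folds[f]]
    return train, val, test

# Toy labels: 600 graphs in 6 balanced classes (sizes chosen to mirror ENZYMES).
labels = [c for c in range(6) for _ in range(100)]
folds = stratified_folds(labels)
train, val, test = train_val_test(folds, 0)
```

With these toy labels, each split contains 480 train, 60 validation and 60 test indices, and every class appears 10 times in the test fold. Saving the fold indices once and reusing them for every model is what keeps the comparison fair.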
Our numerical results are presented in Table 2. All NNs have statistically similar test performance, as the standard deviation is quite large. We also report a second run of these experiments with the same experimental protocol, i.e., the same 10-fold splitting but different initialization. We observe a change of model ranking, which we attribute to the small size of the datasets and the non-determinism of gradient descent optimizers. We also observed that, for DD and PROTEINS, the graph-agnostic MLP baselines perform as well as, and sometimes better than, GNNs.
| Dataset | Model | #Param | Seed 1: Acc ± s.d. | Seed 1: Epoch/Total | Seed 2: Acc ± s.d. | Seed 2: Epoch/Total |
|---|---|---|---|---|---|---|
| ENZYMES | MLP | 62502 | 59.67 ± 4.58 | 0.24s/0.22hr | 54.33 ± 4.90 | 0.26s/0.24hr |
| ENZYMES | MLP (Gated) | 79014 | 62.50 ± 4.10 | 0.20s/0.18hr | 63.67 ± 5.36 | 0.22s/0.20hr |
| ENZYMES | GCN | 80038 | 63.50 ± 4.44 | 0.83s/0.77hr | 59.33 ± 3.74 | 1.36s/1.19hr |
| ENZYMES | GraphSage | 82686 | 68.00 ± 5.95 | 0.90s/0.78hr | 67.33 ± 5.01 | 0.93s/0.84hr |
| ENZYMES | GIN | 80770 | 68.00 ± 6.62 | 0.59s/0.67hr | 68.17 ± 5.84 | 0.55s/0.61hr |
| ENZYMES | DiffPool | 94782 | 65.33 ± 2.96 | 2.05s/2.22hr | 67.50 ± 5.74 | 1.95s/2.01hr |
| ENZYMES | GAT | 80550 | 66.33 ± 5.52 | 6.69s/5.75hr | 67.33 ± 4.36 | 7.00s/5.93hr |
| ENZYMES | MoNet | 83538 | 59.33 ± 6.38 | 1.58s/1.46hr | 57.50 ± 5.28 | 1.74s/1.58hr |
| ENZYMES | GatedGCN | 89366 | 67.33 ± 6.42 | 2.31s/2.03hr | 68.00 ± 4.14 | 2.30s/2.00hr |
| DD | MLP | 71458 | 72.24 ± 3.43 | 1.17s/1.27hr | 70.88 ± 3.89 | 1.30s/1.34hr |
| DD | MLP (Gated) | 87970 | 78.53 ± 2.80 | 1.23s/1.25hr | 77.85 ± 2.90 | 1.02s/1.05hr |
| DD | GCN | 88994 | 77.84 ± 2.27 | 3.35s/2.37hr | 78.35 ± 2.06 | 4.51s/2.98hr |
| DD | GraphSage | 89402 | 77.59 ± 2.85 | 4.06s/2.58hr | 78.61 ± 2.02 | 4.67s/4.15hr |
| DD | GIN | 85646 | 74.11 ± 3.69 | 2.21s/1.80hr | 73.52 ± 3.93 | 2.02s/1.65hr |
| DD | DiffPool | 165342 | 65.91 ± 9.45 | 37.87s/33.42hr | 63.73 ± 1.49 | 37.48s/32.87hr |
| DD | GAT | 89506 | 77.42 ± 2.88 | 20.63s/11.86hr | 78.78 ± 2.70 | 23.20s/13.20hr |
| DD | MoNet | 89134 | 77.08 ± 2.71 | 25.39s/15.30hr | 76.75 ± 3.97 | 23.74s/14.92hr |
| DD | GatedGCN | 98386 | 78.35 ± 1.74 | 7.85s/7.20hr | 78.10 ± 1.93 | 8.44s/8.45hr |
| PROTEINS | MLP | 63778 | 76.27 ± 2.92 | 0.41s/0.29hr | 76.36 ± 2.45 | 0.35s/0.25hr |
| PROTEINS | MLP (Gated) | 80290 | 75.10 ± 3.85 | 0.36s/0.24hr | 74.83 ± 3.15 | 0.45s/0.28hr |
| PROTEINS | GCN | 81314 | 76.54 ± 3.63 | 1.74s/1.64hr | 75.10 ± 3.31 | 1.63s/1.55hr |
| PROTEINS | GraphSage | 83642 | 76.18 ± 4.14 | 1.72s/1.09hr | 74.83 ± 3.76 | 1.73s/1.22hr |
| PROTEINS | GIN | 79886 | 69.62 ± 5.13 | 1.04s/1.31hr | 69.43 ± 5.68 | 0.94s/1.25hr |
| PROTEINS | DiffPool | 93780 | 76.60 ± 2.11 | 3.99s/3.93hr | 75.90 ± 2.88 | 3.99s/3.90hr |
| PROTEINS | GAT | 81826 | 75.55 ± 3.09 | 11.55s/10.74hr | 74.39 ± 2.56 | 12.29s/11.40hr |
| PROTEINS | MoNet | 84334 | 77.26 ± 3.12 | 3.54s/2.99hr | 76.81 ± 2.75 | 3.22s/2.60hr |
| PROTEINS | GatedGCN | 90706 | 76.45 ± 3.77 | 3.58s/2.92hr | 76.00 ± 2.19 | 4.31s/2.96hr |
Table 2: Performance on the standard TU test sets (higher is better). Two runs of all the experiments with the same hyperparameters but different random seeds are shown separately to expose the differences in ranking and the variation between runs.
For the second experiment, we use the popular MNIST and CIFAR10 image classification datasets from computer vision. The original MNIST and CIFAR10 images are converted to graphs using superpixels. Superpixels represent small regions of homogeneous intensity in images, and can be extracted with the SLIC technique (Achanta et al., 2012). We use SLIC superpixels from (Knyazev et al., 2019) (https://github.com/bknyaz/graph_attention_pool). The MNIST train, validation and test graphs have sizes of 40-75 nodes (i.e., number of superpixels), and the CIFAR10 train, validation and test graphs have sizes of 85-150 nodes.
For each sample, we build a $k$-nearest neighbor adjacency matrix with $W_{ij} = \exp\left(-\|x_i - x_j\|^2 / \sigma_x^2\right)$, where $x_i, x_j$ are the 2D coordinates of superpixels $i, j$, and $\sigma_x$ is the scale parameter defined as the averaged distance of the $k$ nearest neighbors for each node. Figure 3 presents visualizations of the superpixel graphs of MNIST and CIFAR10.
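The construction of a nearest-neighbor adjacency from superpixel coordinates can be sketched as below. This is a hedged illustration, not the authors' preprocessing code: the Gaussian kernel form, the use of a single scale shared by all nodes, the function name `knn_gaussian_adjacency` and the value of `k` in the toy example are all assumptions.

```python
import math

def knn_gaussian_adjacency(coords, k):
    """Return ({(i, j): weight}, sigma) for the directed k-nearest-neighbor graph of 2D points."""
    n = len(coords)
    # pairwise Euclidean distances between all points
    dist = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn.append(order[:k])
    # assumed scale: average distance to the k nearest neighbors, taken over all nodes
    sigma = sum(dist[i][j] for i in range(n) for j in knn[i]) / (n * k)
    weights = {}
    for i in range(n):
        for j in knn[i]:
            weights[(i, j)] = math.exp(-(dist[i][j] ** 2) / sigma ** 2)
    return weights, sigma

# Toy example: four superpixel centers on the corners of a unit square, k = 2.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
weights, sigma = knn_gaussian_adjacency(coords, k=2)
```

In this toy example every node keeps its two adjacent corners, so the scale equals 1 and all eight directed edges receive the same weight. On real superpixel graphs the weights decay with distance, which is what the GatedGCN-E variant can consume as an edge feature.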
We propose the following benchmark setup.
Splitting. We use the standard MNIST and CIFAR10 splits.
Training. We use the Adam optimizer with a learning rate decay strategy. For all GNNs, the initial learning rate, the reduce factor and the stopping learning rate are set to common values, and the patience value is 5.
Accuracy. The performance metric is the classification accuracy between predicted and groundtruth labels.
Reproducibility. We report accuracy averaged over 4 runs with 4 different random seeds. At each run, the same seed is used for all NNs.
Graph classifier layer. We use the same graph classifier as the TU dataset experiments, Section 5.1.
Hyperparameters and parameter budget. We determine the model hyperparameters following the TU dataset experiments (Section 5.1) with a budget of 100k.
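The training rule shared by all experiments (decay the learning rate when the validation loss plateaus, and stop once the learning rate falls to a minimum value) can be sketched as follows. This is a minimal simulation, not the benchmark's training loop: the function `run_schedule` is hypothetical, and the initial learning rate, reduce factor and stopping learning rate used in the example are placeholders rather than the paper's settings. Only the patience of 5 is taken from the text.

```python
def run_schedule(val_losses, lr, factor, patience, min_lr):
    """Simulate reduce-on-plateau decay; return (epochs_run, final_lr)."""
    best = float("inf")
    bad_epochs = 0
    epochs = 0
    for loss in val_losses:
        epochs += 1
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            # decay once the loss has failed to improve for more than `patience` epochs
            if bad_epochs > patience:
                lr *= factor
                bad_epochs = 0
        # no maximum number of epochs: stop only when the learning rate is small enough
        if lr <= min_lr:
            break
    return epochs, lr

# Placeholder values for illustration: the loss improves twice, then stays flat.
val_losses = [1.0, 0.9] + [0.9] * 100
epochs, final_lr = run_schedule(val_losses, lr=1e-3, factor=0.5, patience=5, min_lr=1e-4)
```

With these placeholder settings the learning rate is halved four times and the run stops after 26 epochs. Tying the stopping criterion to the learning rate rather than to an epoch count lets each model train for as long as it keeps improving.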
Results for graph classification on MNIST and CIFAR10 are presented in Table 3 and analyzed in Section 6.
We use the ZINC molecular graphs dataset to regress a molecular property known as the constrained solubility (Jin et al., 2018). The ZINC train, validation and test graphs have sizes of 9-37 nodes/atoms.
For each molecular graph, node features are the type of atoms and edge features are the type of edges.
The same experimental protocol as Section 5.2 is used, with the following changes:
Accuracy. The performance metric is the mean absolute error (MAE) between the predicted and the groundtruth constrained solubility.
Graph regression layer. The regression layer is similar to the graph classifier layer in section 5.2.
Table 4 presents our numerical results, which we analyze in Section 6.
We consider the node-level tasks of graph pattern recognition (Scarselli et al., 2009) and semi-supervised graph clustering. The goal of graph pattern recognition is to find a fixed graph pattern $P$ embedded in larger graphs $G$ of variable sizes. Identifying patterns in different graphs is one of the most basic tasks for GNNs. The pattern and embedded graphs are generated with the stochastic block model (SBM) (Abbe, 2017). An SBM is a random graph which assigns communities to each node as follows: any two vertices are connected with probability $p$ if they belong to the same community, or with probability $q$ if they belong to different communities (the value of $q$ acts as the noise level). For all experiments, we generate graphs $G$ with 5 communities of randomly generated sizes. Each community follows an SBM with fixed probabilities $p$ and $q$, and the signal on $G$ is drawn uniformly at random from a small finite vocabulary. We randomly generate patterns $P$ with a fixed number of nodes, an intra-probability $p_P$ and an extra-probability $q_P = 0.5$ (i.e., 50% of nodes in $P$ are connected to $G$). The signal on $P$ is also generated as a random signal. The PATTERN train, validation and test graphs have sizes of 50-180 nodes. The output signal has value 1 if the node belongs to $P$ and value 0 if it is in $G$.
The semi-supervised clustering task is another fundamental task in network science. We generate 6 SBM clusters with randomly generated sizes and fixed probabilities $p$ and $q$. The CLUSTER train, validation and test graphs have sizes of 40-190 nodes. We only provide a single label randomly selected for each community. The output signal is defined as the cluster class label.
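The stochastic block model described above can be sampled with a few lines of Python. This is an illustrative sketch only: the function name `sample_sbm`, the community sizes and the probabilities in the example are assumptions and are not the settings used to build PATTERN or CLUSTER.

```python
import random

def sample_sbm(sizes, p, q, seed=0):
    """Sample an undirected SBM graph; return (community label per node, edge list)."""
    rng = random.Random(seed)
    community = [c for c, size in enumerate(sizes) for _ in range(size)]
    n = len(community)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # intra-community pairs connect with probability p, inter-community pairs with q
            prob = p if community[i] == community[j] else q
            if rng.random() < prob:
                edges.append((i, j))
    return community, edges

# Hypothetical settings: three communities, dense inside, sparse between.
community, edges = sample_sbm([10, 15, 20], p=0.6, q=0.1)

# Degenerate check: with p = 1 and q = 0 only intra-community pairs are connected.
_, clean_edges = sample_sbm([3, 2], p=1.0, q=0.0)
```

Raising `q` towards `p` blurs the community structure, which is why the text describes `q` as the noise level. For a clustering task, the `community` list would serve as the node-level target.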
We follow the same experimental protocol as Section 5.2, with the following changes:
Accuracy. The performance metric is the average accuracy over the classes.
Node classification layer. For classifying each node, we pass the node features from the last GNN layer to a MLP.
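One reading of "average accuracy over the classes" is the mean of the per-class accuracies, which prevents a large class from dominating the score. The sketch below follows that reading, which is an assumption, and the function name `class_averaged_accuracy` is hypothetical.

```python
def class_averaged_accuracy(y_true, y_pred):
    """Mean over classes of the fraction of correctly predicted nodes in each class."""
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

# Imbalanced toy example: a model that always predicts class 0.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
acc = class_averaged_accuracy(y_true, y_pred)
plain = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

Here the plain accuracy is 0.8 while the class-averaged accuracy is only 0.5, which matters for PATTERN, where pattern nodes are a minority of each graph.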
Results for node classification on CLUSTER and PATTERN are presented in Table 5 and analyzed in Section 6.
Leveraging machine learning for solving NP-hard combinatorial optimization problems (COPs) has been the focus of intense research in recent years (Vinyals et al., 2015; Bengio et al., 2018). Recently proposed deep learning-based solvers for COPs (Khalil et al., 2017; Li et al., 2018; Kool et al., 2019) combine GNNs with classical graph search to predict approximate solutions directly from problem instances (represented as graphs). Consider the intensively studied Traveling Salesman Problem (TSP): given a 2D Euclidean graph, one needs to find an optimal sequence of nodes, called a tour, with minimal total edge weight (tour length). TSP's multi-scale nature makes it a challenging graph task which requires reasoning about both local node neighborhoods and global graph structure.
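Table 3: Performance on the MNIST and CIFAR10 test sets (accuracy, higher is better), with and without residual connections.

| Dataset | Model | #Param | Residual: Acc | Residual: Epoch/Total | No Residual: Acc | No Residual: Epoch/Total |
|---|---|---|---|---|---|---|
| MNIST | MLP | 104044 | not used | not used | 94.46 ± 0.28 | 21.82s/1.02hr |
| MNIST | MLP (Gated) | 105717 | not used | not used | 95.18 ± 0.18 | 22.43s/0.73hr |
| MNIST | GCN | 101365 | 89.99 ± 0.15 | 78.25s/1.81hr | 89.05 ± 0.21 | 79.18s/1.76hr |
| MNIST | GraphSage | 102691 | 97.09 ± 0.02 | 75.57s/1.36hr | 97.20 ± 0.17 | 76.80s/1.42hr |
| MNIST | GIN | 105434 | 93.91 ± 0.63 | 34.30s/0.73hr | 93.96 ± 1.30 | 34.61s/0.74hr |
| MNIST | DiffPool | 106538 | 95.02 ± 0.42 | 170.55s/4.26hr | 94.66 ± 0.48 | 171.38s/4.45hr |
| MNIST | GAT | 110400 | 95.62 ± 0.13 | 375.71s/6.35hr | 95.56 ± 0.16 | 377.06s/6.35hr |
| MNIST | MoNet | 104049 | 90.36 ± 0.47 | 581.86s/15.31hr | 89.73 ± 0.48 | 567.12s/12.05hr |
| MNIST | GatedGCN | 104217 | 97.37 ± 0.06 | 128.39s/2.01hr | 97.36 ± 0.12 | 127.15s/2.13hr |
| MNIST | GatedGCN-E* | 104217 | 97.24 ± 0.10 | 135.10s/2.25hr | 97.47 ± 0.13 | 127.86s/2.15hr |
| CIFAR10 | MLP | 104044 | not used | not used | 56.01 ± 0.90 | 21.82s/1.02hr |
| CIFAR10 | MLP (Gated) | 106017 | not used | not used | 56.78 ± 0.12 | 27.85s/0.68hr |
| CIFAR10 | GCN | 101657 | 54.46 ± 0.10 | 100.91s/2.73hr | 51.64 ± 0.45 | 100.30s/2.44hr |
| CIFAR10 | GraphSage | 102907 | 65.93 ± 0.30 | 96.67s/1.88hr | 66.08 ± 0.24 | 96.00s/1.79hr |
| CIFAR10 | GIN | 105654 | 53.28 ± 3.70 | 45.29s/1.24hr | 47.66 ± 0.47 | 44.30s/0.93hr |
| CIFAR10 | DiffPool | 108042 | 57.99 ± 0.45 | 298.06s/10.17hr | 56.84 ± 0.37 | 299.64s/10.42hr |
| CIFAR10 | GAT | 110704 | 65.40 ± 0.38 | 389.40s/7.32hr | 65.48 ± 0.33 | 386.14s/7.75hr |
| CIFAR10 | MoNet | 104229 | 53.42 ± 0.43 | 836.32s/22.45hr | 50.99 ± 0.17 | 869.90s/21.79hr |
| CIFAR10 | GatedGCN | 104357 | 69.19 ± 0.28 | 146.80s/2.48hr | 68.92 ± 0.38 | 145.14s/2.49hr |
| CIFAR10 | GatedGCN-E* | 104357 | 68.64 ± 0.60 | 158.80s/2.74hr | 69.37 ± 0.48 | 145.66s/2.43hr |

*GatedGCN-E uses the graph adjacency weight as edge feature.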
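Table 4: Performance on the ZINC test set (MAE, lower is better), with and without residual connections.

| Model | #Param | Residual: MAE | Residual: Epoch/Total | No Residual: MAE | No Residual: Epoch/Total |
|---|---|---|---|---|---|
| MLP | 108975 | not used | not used | 0.710 ± 0.001 | 1.19s/0.02hr |
| MLP (Gated) | 106970 | not used | not used | 0.681 ± 0.005 | 1.16s/0.03hr |
| GCN | 103077 | 0.469 ± 0.002 | 3.02s/0.08hr | 0.525 ± 0.007 | 2.97s/0.09hr |
| GraphSage | 105031 | 0.429 ± 0.005 | 3.24s/0.10hr | 0.410 ± 0.005 | 3.20s/0.10hr |
| GIN | 103079 | 0.414 ± 0.009 | 2.49s/0.06hr | 0.408 ± 0.008 | 2.50s/0.06hr |
| DiffPool | 110561 | 0.466 ± 0.006 | 12.41s/0.34hr | 0.514 ± 0.007 | 12.36s/0.38hr |
| GAT | 102385 | 0.463 ± 0.002 | 20.97s/0.56hr | 0.496 ± 0.004 | 21.03s/0.62hr |
| MoNet | 106002 | 0.407 ± 0.007 | 11.69s/0.28hr | 0.444 ± 0.024 | 11.75s/0.34hr |
| GatedGCN | 105735 | 0.437 ± 0.008 | 6.36s/0.17hr | 0.422 ± 0.006 | 6.12s/0.17hr |
| GatedGCN-E* | 105875 | 0.363 ± 0.009 | 6.34s/0.17hr | 0.365 ± 0.009 | 6.17s/0.17hr |

*GatedGCN-E uses the molecule bond type as edge feature.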
For our experiments with TSP, we follow the learning-based approach to COPs described in (Li et al., 2018; Joshi et al., 2019), where a GNN is the backbone architecture for assigning probabilities to each edge as belonging/not belonging to the predicted solution set. The probabilities are then converted into discrete decisions through graph search techniques. We create train, validation and test sets of TSP instances, where each instance is a graph of $n$ node locations sampled uniformly in the unit square. We generate problems of varying size and complexity by uniformly sampling the number of nodes $n \in [50, 500]$ for each instance.
In order to isolate the impact of the backbone GNN architectures from the search component, we pose TSP as a binary edge classification task, with the ground-truth value for each edge belonging to the TSP tour given by Concorde. For scaling to large instances, we use sparse $k$-nearest neighbor graphs instead of full graphs, following Khalil et al. (2017). See Figure 7 for sample TSP instances of various sizes.
We follow the same experimental protocol as Section 5.2 with the following changes:
Training. The patience value is 10 by default. For additional experiments on the impact of model depth, we use a patience value of 5.
Accuracy. Given the high class imbalance, i.e., only the edges in the TSP tour have positive label, we use the F1 score for the positive class as our performance metric.
Reproducibility. We report F1 scores averaged over 2 runs with 2 different random seeds.
Edge classifier layer. To make a prediction for each edge $e_{ij}$, we first concatenate node features $h_i$ and $h_j$ from the final GNN layer. The concatenated features are then passed to an MLP for prediction.
Non-learnt Baseline. In addition to reporting performance of GNNs, we compare with a simple $k$-nearest neighbor heuristic baseline, defined as follows: predict true for the edges corresponding to the $k$ nearest neighbors of each node, and false for all other edges. We set $k = 2$ for optimal performance. Comparing GNNs to the non-learnt baseline tells us whether models learn something more sophisticated than identifying a node's nearest neighbors. Our numerical results are presented in Table 6 and analyzed in Section 6.

Table 5: Performance on the PATTERN and CLUSTER test sets (accuracy, higher is better), with and without residual connections.

| Dataset | Model | #Param | Residual: Acc | Residual: Epoch/Total | No Residual: Acc | No Residual: Epoch/Total |
|---|---|---|---|---|---|---|
| PATTERN | MLP | 105263 | not used | not used | 50.13 ± 0.00 | 8.68s/0.10hr |
| PATTERN | MLP (Gated) | 103629 | not used | not used | 50.13 ± 0.00 | 9.78s/0.12hr |
| PATTERN | GCN | 100923 | 74.36 ± 1.59 | 97.37s/2.06hr | 55.22 ± 0.17 | 97.46s/2.30hr |
| PATTERN | GraphSage | 98607 | 78.20 ± 3.06 | 79.19s/2.57hr | 81.25 ± 3.84 | 79.43s/2.14hr |
| PATTERN | GIN | 100884 | 96.98 ± 2.18 | 14.12s/0.32hr | 98.25 ± 0.38 | 14.11s/0.37hr |
| PATTERN | GAT | 109936 | 90.72 ± 2.04 | 229.76s/5.73hr | 88.91 ± 4.48 | 229.65s/8.78hr |
| PATTERN | MoNet | 103775 | 95.52 ± 3.74 | 879.87s/21.80hr | 97.89 ± 0.89 | 870.05s/24.86hr |
| PATTERN | GatedGCN | 104003 | 95.05 ± 2.80 | 115.55s/2.46hr | 97.24 ± 1.19 | 115.03s/2.59hr |
| CLUSTER | MLP | 106015 | not used | not used | 20.97 ± 0.01 | 6.54s/0.08hr |
| CLUSTER | MLP (Gated) | 104305 | not used | not used | 20.97 ± 0.01 | 7.37s/0.09hr |
| CLUSTER | GCN | 101655 | 47.82 ± 4.91 | 66.58s/1.26hr | 34.85 ± 0.65 | 66.81s/1.21hr |
| CLUSTER | GraphSage | 99139 | 44.89 ± 3.70 | 54.53s/1.05hr | 53.90 ± 4.12 | 54.40s/1.19hr |
| CLUSTER | GIN | 103544 | 49.64 ± 2.09 | 11.60s/0.27hr | 52.54 ± 1.03 | 11.57s/0.27hr |
| CLUSTER | GAT | 110700 | 49.08 ± 6.47 | 158.23s/4.08hr | 54.12 ± 1.21 | 158.46s/4.53hr |
| CLUSTER | MoNet | 104227 | 45.95 ± 3.39 | 635.77s/15.32hr | 39.48 ± 2.21 | 600.04s/11.18hr |
| CLUSTER | GatedGCN | 104355 | 54.20 ± 3.58 | 81.39s/2.26hr | 50.18 ± 3.03 | 80.66s/2.07hr |
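The non-learnt nearest neighbor baseline and the positive-class F1 metric from the TSP protocol can be sketched together as follows. This is a simplified illustration under stated assumptions: edges are treated as undirected sets, the helper names `knn_heuristic_edges` and `f1_positive` are hypothetical, and the toy instance is a unit square whose optimal tour is its perimeter.

```python
import math

def knn_heuristic_edges(coords, k):
    """Predict edge {i, j} as part of the tour if j is among the k nearest neighbors of i."""
    n = len(coords)
    predicted = set()
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(coords[i], coords[j]))
        for j in order[:k]:
            predicted.add(frozenset((i, j)))
    return predicted

def f1_positive(predicted, truth):
    """F1 score of the positive class, computed on sets of predicted and true tour edges."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Toy instance: four cities on the corners of a unit square.
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
tour = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}

score_k2 = f1_positive(knn_heuristic_edges(coords, k=2), tour)
score_k3 = f1_positive(knn_heuristic_edges(coords, k=3), tour)
```

On this toy instance `k=2` recovers the tour exactly, while `k=3` also predicts the two diagonals and loses precision. On larger random instances the heuristic is far from perfect, which is what makes it a useful reference point for the learnt models.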
Graph-agnostic NNs (MLP) perform as well as GNNs on small datasets. Tables 2 and 3 show there is no significant improvement by using GNNs over graph-agnostic MLP baselines for the small TU datasets and the (simple) MNIST. Besides, MLP can sometimes do better than GNNs (Errica et al., 2019; Luzhnica et al., 2019), such as for the DD dataset.
GNNs improve upon graph-agnostic NNs for larger datasets. Tables 4 and 5 present a significant gain of performance for the ZINC, PATTERN, and CLUSTER datasets, in which all GNNs vastly outperform the two MLP baselines. Table 6 shows that all GNNs using residual connections surpass the MLP baselines for the TSP dataset. Results reported in Table 3 for the CIFAR10 dataset are less discriminative, although the best GNNs perform notably better than MLPs.
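Table 6: Performance on the TSP test set (F1 score of the positive class, higher is better), with and without residual connections.

| Model | #Param | Residual: F1 | Residual: Epoch/Total | No Residual: F1 | No Residual: Epoch/Total |
|---|---|---|---|---|---|
| MLP | 94394 | not used | not used | 0.548 ± 0.003 | 53.92s/2.85hr |
| MLP (Gated) | 115274 | not used | not used | 0.548 ± 0.001 | 54.39s/2.44hr |
| GCN | 108738 | 0.627 ± 0.003 | 163.36s/11.26hr | 0.547 ± 0.003 | 164.41s/10.28hr |
| GraphSage | 98450 | 0.663 ± 0.003 | 145.75s/16.05hr | 0.657 ± 0.002 | 147.22s/14.33hr |
| GIN | 118574 | 0.655 ± 0.001 | 73.09s/5.44hr | 0.657 ± 0.001 | 74.71s/5.60hr |
| GAT | 109250 | 0.669 ± 0.001 | 360.92s/30.38hr | 0.567 ± 0.003 | 360.74s/20.55hr |
| MoNet | 94274 | 0.637 ± 0.010 | 1433.97s/41.69hr | 0.569 ± 0.002 | 1472.65s/42.44hr |
| GatedGCN | 94946 | 0.794 ± 0.004 | 203.28s/15.47hr | 0.791 ± 0.003 | 202.12s/15.20hr |
| GatedGCN-E* | 94946 | 0.802 ± 0.001 | 201.40s/15.19hr | 0.794 ± 0.003 | 201.32s/15.05hr |

kNN Heuristic (k=2): F1 0.693.
*GatedGCN-E uses the pairwise distance as edge feature.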
Vanilla GCNs (Kipf & Welling, 2017) have poor performance. GCNs are the simplest form of GNNs. Their node representation update relies on an isotropic averaging operation over the neighborhood, Eq.(1). This isotropic property was analyzed in (Chen et al., 2019) and was shown to be unable to distinguish simple graph structures, explaining the low performance of GCNs across all datasets.
New isotropic GNN architectures improve on GCN. GraphSage (Hamilton et al., 2017) demonstrates the importance of using the central node information in the graph convolutional layer, Eq.(2). GIN (Xu et al., 2019) also employs the central node feature, Eq.(3), along with a new classifier layer that connects to convolutional features at all intermediate layers. DiffPool (Ying et al., 2018) considers a learnable graph pooling operation where GraphSage is used at each resolution level. These three isotropic GNNs significantly improve the performance of GCN for all datasets, apart from CLUSTER.
Anisotropic GNNs are accurate. Anisotropic models such as GAT (Veličković et al., 2018), MoNet (Monti et al., 2017) and GatedGCN (Bresson & Laurent, 2017) obtain the best results for each dataset, with the exception of PATTERN. Also, we note that GatedGCN performs consistently well across all datasets. Unlike isotropic GNNs that mostly rely on a simple sum over the neighboring features, anisotropic GNNs employ complex mechanisms (sparse attention mechanism for GAT, edge gates for GatedGCN) which are harder to implement efficiently. Our code for these models is not fully optimized and, as a result, much slower.
An additional advantage of this class of GNNs is their ability to explicitly use edge features, such as the bond type between two atoms in a molecule. In Table 4, for the ZINC molecular dataset, GatedGCN-E using the bond edge features significantly improves on the MAE of GatedGCN without bonds.
Residual connections improve performance. Residual connections (RC), introduced in (He et al., 2016), have become a universal ingredient in deep learning architectures for computer vision. Using residual links helps GNNs in two ways. First, it limits the vanishing gradient problem during backpropagation in deep networks. Second, it allows the inclusion of self-node information during convolution in models like GCN and GAT, which do not use it explicitly.
We first test the influence of RC with $L = 4$ layers. For MNIST (Table 3), RC do not improve the performance as most GNNs are able to easily overfit this dataset. For CIFAR10, RC enhance results for GCN, GIN, DiffPool and MoNet, but they do not help or degrade the performance for GraphSage, GAT and GatedGCN. For ZINC (Table 4), adding residuality significantly improves GCN, DiffPool, GAT, and MoNet, but it slightly degrades the performance of GIN, GraphSage and GatedGCN. For PATTERN and CLUSTER (Table 5), GCN is the only architecture that clearly benefits from RC, while the other models can see their accuracy increase or decrease in the presence of RC. For TSP (Table 6), models which do not implicitly use self-information (GCN, GAT, MoNet) benefit from skip connections while other GNNs hold almost the same performance.
Next, we evaluate the impact of RC for deep GNNs. Figure 8 and Table 7 present the results of deep GNNs for ZINC, CLUSTER and TSP with $L \in \{4, 8, 16, 32\}$ layers. Interestingly, all models benefit from residual links when the number of layers increases, except GIN, which is already equipped with skip connections for readout: the classification layer is always connected to all intermediate convolutional layers. In summary, our results suggest residual connections are an important building block for designing deep GNNs.
Model  L  #Param  Residual: F1  Residual: Epoch/Total  No Residual: F1  No Residual: Epoch/Total
GCN  4  108738  0.628  165.15s/5.69hr  0.552  168.72s/6.14hr
GCN  8  175810  0.639  279.33s/9.86hr  0.568  281.56s/14.16hr
GCN  16  309954  0.651  502.59s/21.37hr  0.532  507.35s/10.72hr
GCN  32  578242  0.666  1042.46s/28.96hr  0.361  1031.62s/15.19hr
GIN  4  118574  0.653  71.41s/2.50hr  0.653  75.63s/3.34hr
GIN  8  223866  0.675  93.95s/4.26hr  0.674  93.41s/5.19hr
GIN  16  434450  0.681  146.09s/5.68hr  0.642  144.52s/2.89hr
GIN  32  855618  0.669  274.5s/3.81hr  0.6063  282.97s/4.40hr
GatedGCN  4  94946  0.792  214.67s/6.50hr  0.787  212.67s/9.75hr
GatedGCN  8  179170  0.817  374.39s/14.56hr  0.807  367.68s/19.72hr
GatedGCN  16  347618  0.833  685.41s/22.85hr  0.810  678.76s/22.07hr
GatedGCN  32  684514  0.843  1760.56s/48.00hr  0.722  1760.55s/33.27hr
Normalization layers can improve learning. Most real-world graph datasets are collections of irregular graphs with varying graph sizes. Batching graphs of variable sizes may lead to node representations at different scales. Hence, normalizing activations can be helpful to improve learning and generalization. We use two normalization layers in our experiments: batch normalization (BN) from (Ioffe & Szegedy, 2015) and graph size normalization (GN). Graph size normalization is a simple operation where the resulting node features $h_i$ are normalized w.r.t. the graph size, i.e., $\bar{h}_i = h_i \times \frac{1}{\sqrt{V}}$, where $V$ is the number of nodes. This normalization layer is applied after the convolutional layer and before the activation layer.
Dataset  Model  #Param  BN & GN  No BN & GN
ZINC
MLP  108975  not used  0.710 ± 0.001
MLP (Gated)  106970  not used  0.683 ± 0.004
GCN  103077  0.469 ± 0.002  0.490 ± 0.007
GraphSage  105031  0.429 ± 0.005  0.431 ± 0.005
GIN  103079  0.414 ± 0.009  0.426 ± 0.010
DiffPool  110561  0.466 ± 0.006  0.465 ± 0.008
GAT  102385  0.463 ± 0.002  0.487 ± 0.006
MoNet  106002  0.407 ± 0.007  0.477 ± 0.009
GatedGCN-E  105875  0.363 ± 0.009  0.399 ± 0.003
CIFAR10
MLP  104044  not used  56.01 ± 0.90
MLP (Gated)  106017  not used  56.78 ± 0.12
GCN  101657  54.46 ± 0.10  54.14 ± 0.67
GraphSage  102907  65.93 ± 0.30  65.98 ± 0.15
GIN  105654  53.28 ± 3.70  55.49 ± 1.54
DiffPool  108042  57.99 ± 0.45  56.70 ± 0.71
GAT  110704  65.40 ± 0.38  62.72 ± 0.36
GatedGCN-E  104357  68.64 ± 0.60  64.10 ± 0.44
CLUSTER
MLP  106015  not used  20.97 ± 0.01
MLP (Gated)  104305  not used  20.97 ± 0.01
GCN  101655  47.82 ± 4.91  27.05 ± 5.79
GraphSage  99139  44.89 ± 3.70  48.83 ± 3.84
GIN  103544  49.64 ± 2.09  47.60 ± 1.05
GAT  110700  49.08 ± 6.47  40.04 ± 4.90
GatedGCN  104355  54.20 ± 3.58  34.05 ± 2.57
We evaluate the impact of the normalization layers on ZINC, CIFAR10 and CLUSTER datasets in Table 8. For the three datasets, BN and GN significantly improve GAT and GatedGCN. Besides, BN and GN boost GCN performance for ZINC and CLUSTER, but do not improve for CIFAR10. GraphSage and DiffPool do not benefit from normalizations but do not lose performance, except for CLUSTER. Additionally, GIN slightly benefits from normalization for ZINC and CLUSTER, but degrades for CIFAR10. We perform an ablation study in the Supplementary Material to study the influence of each normalization layer. In summary, normalization layers can be critical to design sound GNNs.
In this paper, we propose a benchmarking framework to facilitate the study of graph neural networks, and address experimental inconsistencies in the literature. We confirm how the widely used small TU datasets are inappropriate to examine innovations in this field and introduce six medium-scale datasets within the framework. Our experiments on multiple tasks for graphs show: i) Graph structure is important as we move towards larger datasets; ii) GCN, the simplest isotropic version of GNNs, cannot learn complex graph structures; iii) Self-node information, hierarchy, attention mechanisms, edge gates and better readout functions are key structures to improve GCN; iv) GNNs can scale deeper using residual connections and performance can be improved using normalization layers. As a final note, our benchmarking infrastructure, leveraging PyTorch and DGL, is fully reproducible and open to users on GitHub to experiment with new models and add datasets.
XB is supported by NRF Fellowship NRFF201710.
This section formally describes our experimental pipeline, illustrated in Figure 10. We detail the components of the setup including the input layers, the GNN layers and the MLP prediction layers.
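Given a graph, we are given node features $\alpha_i \in \mathbb{R}^{a \times 1}$ for each node $i$ and (optionally) edge features $\beta_{ij} \in \mathbb{R}^{b \times 1}$ for each edge connecting node $i$ and node $j$. The input features $\alpha_i$ and $\beta_{ij}$ are embedded to $d$-dimensional hidden features $h_i^{\ell=0}$ and $e_{ij}^{\ell=0}$ via a simple linear projection before passing them to a graph neural network:
$$h_i^{0} = U^{0} \alpha_i + u^{0}, \qquad e_{ij}^{0} = V^{0} \beta_{ij} + v^{0}, \tag{6}$$
where $U^{0} \in \mathbb{R}^{d \times a}$, $V^{0} \in \mathbb{R}^{d \times b}$ and $u^{0}, v^{0} \in \mathbb{R}^{d}$. If the input node/edge features are one-hot vectors of discrete variables, then the biases $u^{0}, v^{0}$ are not used.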
Each GNN layer computes $d$-dimensional representations for the nodes/edges of the graph through recursive neighborhood diffusion (or message passing), where each graph node gathers features from its neighbors to represent local graph structure. Stacking $L$ GNN layers allows the network to build node representations from the $L$-hop neighborhood of each node.
Let $h_i^{\ell}$ denote the feature vector at layer $\ell$ associated with node $i$. The updated features $h_i^{\ell+1}$ at the next layer are obtained by applying nonlinear transformations to the central feature vector $h_i^{\ell}$ and the feature vectors $h_j^{\ell}$ for all nodes $j$ in the neighborhood of node $i$ (defined by the graph structure). This guarantees that the transformation builds local receptive fields, as in standard ConvNets for computer vision, and is invariant to both graph size and vertex re-indexing.
Thus, the most generic version of a feature vector $h_i^{\ell+1}$ at vertex $i$ at the next layer in the graph network is:
$$h_i^{\ell+1} = f\big(h_i^{\ell}, \{h_j^{\ell} : j \to i\}\big), \tag{7}$$
where $\{j \to i\}$ denotes the set of neighboring nodes $j$ pointing to node $i$, which can be replaced by $\{j \in N_i\}$, the set of neighbors of node $i$, if the graph is undirected. In other words, a GNN is defined by a mapping $f$ taking as input a vector $h_i^{\ell}$ (the feature vector of the center vertex) as well as an unordered set of vectors $\{h_j^{\ell}\}$ (the feature vectors of all neighboring vertices), see Figure 9. The arbitrary choice of the mapping $f$ defines an instantiation of a class of GNNs. See Table 9 for an overview of the GNNs we study in this paper.
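As a minimal sketch of this generic update, the plain-Python snippet below treats the mapping as any function of the central vector and the collection of neighbor vectors. The names `message_passing_step` and `sum_update` are hypothetical and do not come from the benchmark code, and the toy mapping (central feature plus neighbor sum) is chosen only for illustration. It shows that the result does not depend on the order in which neighbors are listed.

```python
from typing import Callable, Dict, List

Vector = List[float]


def message_passing_step(
    h: Dict[int, Vector],
    in_neighbors: Dict[int, List[int]],
    f: Callable[[Vector, List[Vector]], Vector],
) -> Dict[int, Vector]:
    """One generic GNN update: each node sees its own vector and the
    vectors of the nodes pointing to it."""
    return {i: f(h[i], [h[j] for j in in_neighbors[i]]) for i in h}


def sum_update(h_i: Vector, neigh: List[Vector]) -> Vector:
    """Toy mapping (an assumption): central feature plus neighbor sum."""
    out = list(h_i)
    for h_j in neigh:
        out = [a + b for a, b in zip(out, h_j)]
    return out


h = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [3.0, 3.0]}
nbrs = {0: [1, 2], 1: [0], 2: [0, 1]}
nbrs_shuffled = {0: [2, 1], 1: [0], 2: [1, 0]}  # same graph, different listing order

out = message_passing_step(h, nbrs, sum_update)
out_shuffled = message_passing_step(h, nbrs_shuffled, sum_update)
```

Any mapping that treats the neighbor list as an unordered collection (sum, mean, max) keeps this property; an order-dependent mapping would not.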
Model  Self info.  Graph info.  Aggregation  Anisotropy  Heads/Kernels
MLP  ✓  x  –  –  –
GCN  x  ✓  Mean  x  Single
GraphSage  ✓  ✓  Mean, Max, LSTM  x  Single
GIN  ✓  ✓  Mean, Max, Sum  x  Single
GAT  x  ✓  Weighted mean  ✓ (Attention)  Multi-head
MoNet  x  ✓  Weighted sum  ✓ (Pseudo-edges)  Multi-kernel
GatedGCN  ✓  ✓  Weighted mean  ✓ (Edge gates)  Single
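In the simplest formulation of GNNs, Graph ConvNets iteratively update node features via an isotropic averaging operation over the neighborhood node features, i.e.,
$$h_i^{\ell+1} = \text{ReLU}\Big(U^{\ell}\, \text{Mean}_{j \in N_i}\, h_j^{\ell}\Big) \tag{8}$$
$$\phantom{h_i^{\ell+1}} = \text{ReLU}\Big(U^{\ell}\, \frac{1}{\deg_i} \sum_{j \in N_i} h_j^{\ell}\Big), \tag{9}$$
where $U^{\ell} \in \mathbb{R}^{d \times d}$ (a bias is also used, but omitted for clarity), $\deg_i$ is the in-degree of node $i$, see Figure 13. Eq. (8) is called a convolution as it is a linear approximation of a localized spectral convolution. Note that it is possible to add the central node features $h_i^{\ell}$ in the update (8) by using self-loops or residual connections, see Section D.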
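The isotropic averaging update can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the benchmark implementation: the function names are hypothetical, the bias is omitted, and the weight matrix is set to the identity only to keep the numbers readable.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def gcn_layer(h, in_neighbors, U):
    """Isotropic update: average the neighbor features, apply U, then ReLU."""
    out = {}
    for i, nbrs in in_neighbors.items():
        d = len(h[i])
        deg = len(nbrs)
        if deg:
            mean = [sum(h[j][k] for j in nbrs) / deg for k in range(d)]
        else:
            mean = [0.0] * d  # isolated node: nothing to average
        out[i] = relu(matvec(U, mean))
    return out


h = {0: [1.0, -1.0], 1: [3.0, 1.0], 2: [5.0, 3.0]}
in_neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
U = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen only for readability

h_next = gcn_layer(h, in_neighbors, U)

# Changing node 0's own feature leaves its update unchanged, because the
# plain update never reads the central node.
h_changed = dict(h)
h_changed[0] = [100.0, 100.0]
h_next_changed = gcn_layer(h_changed, in_neighbors, U)
```

The last three lines make the missing self-information concrete: node 0's output depends only on nodes 1 and 2.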
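GraphSage improves upon the simple GCN model by explicitly incorporating each node’s own features from the previous layer in its update equation:
$$\hat{h}_i^{\ell+1} = \text{ReLU}\Big(U^{\ell}\, \text{Concat}\big(h_i^{\ell},\ \text{Mean}_{j \in N_i}\, h_j^{\ell}\big)\Big), \tag{10}$$
where $U^{\ell} \in \mathbb{R}^{d \times 2d}$, see Figure 13. Observe that the transformation applied to the central node features $h_i^{\ell}$ is different from the transformation applied to the neighborhood features $h_j^{\ell}$. The node features are then projected onto the unit ball before being passed to the next layer:
$$h_i^{\ell+1} = \frac{\hat{h}_i^{\ell+1}}{\|\hat{h}_i^{\ell+1}\|_2}. \tag{11}$$
The authors also define more sophisticated neighborhood aggregation functions, such as Max-pooling or LSTM aggregators:
$$\hat{h}_i^{\ell+1} = \text{ReLU}\Big(U^{\ell}\, \text{Concat}\big(h_i^{\ell},\ \text{Max}_{j \in N_i}\, \text{ReLU}(V^{\ell} h_j^{\ell})\big)\Big), \tag{12}$$
$$\hat{h}_i^{\ell+1} = \text{ReLU}\Big(U^{\ell}\, \text{Concat}\big(h_i^{\ell},\ \text{LSTM}^{\ell}_{j \in N_i}(h_j^{\ell})\big)\Big), \tag{13}$$
where $V^{\ell} \in \mathbb{R}^{d \times d}$ and the $\text{LSTM}^{\ell}$ cell also uses learnable weights. In our experiments, we use the mean version of GraphSage, Eq.(10) (numerical experiments with the max version did not show significant differences, see Section E.2).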
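A minimal sketch of the mean variant, assuming plain Python lists as feature vectors: the central feature is concatenated with the neighbor mean, passed through a linear map and ReLU, and then rescaled to unit L2 norm. The function name and the weight values are hypothetical and serve only to make the arithmetic easy to follow.

```python
import math


def graphsage_mean_layer(h, in_neighbors, U):
    """Concat(central, neighbor mean) -> linear map U (d x 2d) -> ReLU -> L2 normalize."""
    out = {}
    for i, nbrs in in_neighbors.items():
        d = len(h[i])
        mean = [sum(h[j][k] for j in nbrs) / len(nbrs) for k in range(d)]
        z = h[i] + mean  # list concatenation, length 2d
        a = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in U]
        norm = math.sqrt(sum(v * v for v in a))
        out[i] = [v / norm for v in a] if norm > 0 else a
    return out


h = {0: [3.0, 0.0], 1: [0.0, 4.0], 2: [0.0, 4.0]}
in_neighbors = {0: [1, 2], 1: [0], 2: [0]}
# Hypothetical weights: output 0 reads the central block, output 1 the neighbor block.
U = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

h_next = graphsage_mean_layer(h, in_neighbors, U)
node0_norm = math.sqrt(sum(v * v for v in h_next[0]))
```

Because the central and neighbor blocks occupy different columns of the weight matrix, the two are transformed differently, which is the point made in the text.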
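The GIN architecture is based on the Weisfeiler-Lehman Isomorphism Test (Weisfeiler & Lehman, 1968) to study the expressive power of GNNs. The node update equation is defined as:
$$\hat{h}_i^{\ell+1} = (1 + \epsilon)\, h_i^{\ell} + \sum_{j \in N_i} h_j^{\ell}, \tag{14}$$
$$h_i^{\ell+1} = \text{ReLU}\Big(U^{\ell}\, \text{ReLU}\big(\text{BN}(V^{\ell}\, \hat{h}_i^{\ell+1})\big)\Big), \tag{15}$$
where $\epsilon$ is a learnable constant, $U^{\ell}, V^{\ell} \in \mathbb{R}^{d \times d}$, BN denotes Batch Normalization (described in subsequent sections), see Figure 13.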
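The aggregation step of GIN can be sketched as follows. Only the sum aggregation with the learnable constant is shown; the MLP with batch normalization that follows it is omitted, and the function names are hypothetical. The toy graph is chosen to show a case where a sum separates two neighborhoods that a mean cannot.

```python
def gin_aggregate(h, in_neighbors, eps):
    """(1 + eps) * central feature + sum of neighbor features."""
    return {
        i: [(1.0 + eps) * h[i][k] + sum(h[j][k] for j in nbrs)
            for k in range(len(h[i]))]
        for i, nbrs in in_neighbors.items()
    }


def mean_aggregate(h, in_neighbors):
    """Neighbor mean, for comparison with the sum above."""
    return {
        i: [sum(h[j][k] for j in nbrs) / len(nbrs) for k in range(len(h[i]))]
        for i, nbrs in in_neighbors.items()
    }


# Node 0 has two neighbors with feature 1.0, node 3 has a single such neighbor.
h = {0: [0.0], 1: [1.0], 2: [1.0], 3: [0.0], 4: [1.0]}
in_neighbors = {0: [1, 2], 3: [4]}

summed = gin_aggregate(h, in_neighbors, eps=0.0)
averaged = mean_aggregate(h, in_neighbors)
```

Here `summed` differs between nodes 0 and 3 while `averaged` does not, which is the kind of multiset distinction the Weisfeiler-Lehman motivation is about.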
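GAT uses the attention mechanism of (Bahdanau et al., 2014) to introduce anisotropy in the neighborhood aggregation function. The network employs a multi-headed architecture to increase the learning capacity, similar to the Transformer (Vaswani et al., 2017). The node update equation is given by:
$$h_i^{\ell+1} = \text{Concat}_{k=1}^{K}\Big(\text{ELU}\Big(\sum_{j \in N_i} e_{ij}^{k,\ell}\, U^{k,\ell} h_j^{\ell}\Big)\Big), \tag{16}$$
where $U^{k,\ell} \in \mathbb{R}^{\frac{d}{K} \times d}$ are the $K$ linear projection heads, and $e_{ij}^{k,\ell}$ are the attention coefficients for each head defined as:
$$e_{ij}^{k,\ell} = \frac{\exp(\hat{e}_{ij}^{k,\ell})}{\sum_{j' \in N_i} \exp(\hat{e}_{ij'}^{k,\ell})}, \tag{17}$$
$$\hat{e}_{ij}^{k,\ell} = \text{LeakyReLU}\Big(V^{k,\ell}\, \text{Concat}\big(U^{k,\ell} h_i^{\ell},\ U^{k,\ell} h_j^{\ell}\big)\Big), \tag{18}$$
where $V^{k,\ell} \in \mathbb{R}^{\frac{2d}{K}}$, see Figure 16. GAT learns a mean over each node’s neighborhood features sparsely weighted by the importance of each neighbor.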
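The attention-weighted mean can be illustrated with a single-head sketch in plain Python. Several simplifications are assumptions made for brevity: the linear projection is taken to be the identity, the output nonlinearity is dropped, the LeakyReLU slope of 0.2 is a common default rather than a value stated here, and the function names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_layer(h, in_neighbors, a):
    """Single-head attention aggregation without the linear projection.

    `a` is a vector of length 2d that scores the concatenated pair (h_i, h_j).
    Returns the aggregated features and the attention coefficients per node.
    """
    out, coeffs = {}, {}
    for i, nbrs in in_neighbors.items():
        raw = [leaky_relu(sum(w * v for w, v in zip(a, h[i] + h[j]))) for j in nbrs]
        top = max(raw)  # subtract the max for a numerically stable softmax
        exp = [math.exp(r - top) for r in raw]
        total = sum(exp)
        alpha = [e / total for e in exp]
        coeffs[i] = alpha
        out[i] = [sum(al * h[j][k] for al, j in zip(alpha, nbrs))
                  for k in range(len(h[i]))]
    return out, coeffs


h = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
in_neighbors = {0: [1, 2]}

# A zero scoring vector gives uniform weights, i.e. the plain (isotropic) mean.
uniform_out, uniform_alpha = attention_layer(h, in_neighbors, [0.0, 0.0, 0.0, 0.0])
# A scoring vector that rewards the neighbor's first coordinate favors node 1.
focused_out, focused_alpha = attention_layer(h, in_neighbors, [0.0, 0.0, 2.0, 0.0])
```

With the zero scoring vector the layer reduces to the isotropic mean, so anisotropy comes entirely from the learned scores.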
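GatedGCN considers residual connections, batch normalization and edge gates to design another anisotropic variant of GCN. The authors propose to explicitly update edge features along with node features:
$$h_i^{\ell+1} = h_i^{\ell} + \text{ReLU}\Big(\text{BN}\Big(U^{\ell} h_i^{\ell} + \sum_{j \to i} e_{ij}^{\ell} \odot V^{\ell} h_j^{\ell}\Big)\Big), \tag{22}$$
where $U^{\ell}, V^{\ell} \in \mathbb{R}^{d \times d}$, $\odot$ is the Hadamard product, and the edge gates $e_{ij}^{\ell}$ are defined as:
$$e_{ij}^{\ell} = \frac{\sigma(\hat{e}_{ij}^{\ell})}{\sum_{j' \to i} \sigma(\hat{e}_{ij'}^{\ell}) + \varepsilon}, \tag{23}$$
$$\hat{e}_{ij}^{\ell} = \hat{e}_{ij}^{\ell-1} + \text{ReLU}\Big(\text{BN}\big(A^{\ell} h_i^{\ell-1} + B^{\ell} h_j^{\ell-1} + C^{\ell} \hat{e}_{ij}^{\ell-1}\big)\Big), \tag{24}$$
where $\sigma$ is the sigmoid function, $\varepsilon$ is a small fixed constant for numerical stability, $A^{\ell}, B^{\ell}, C^{\ell} \in \mathbb{R}^{d \times d}$, see Figure 16. Note that the edge gates (23) can be regarded as a soft attention process, related to the standard sparse attention mechanism (Bahdanau et al., 2014). Different from other anisotropic GNNs, the GatedGCN architecture explicitly maintains edge features at each layer, following (Bresson & Laurent, 2019; Joshi et al., 2019).
Anisotropic GNNs, such as MoNet, GAT and GatedGCN, generally update node features as: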
(25) 
Eq. (25) can be regarded as a learnable nonlinear anisotropic diffusion operator on graphs where the discrete diffusion time is defined by $\ell$, the index of the layer. Anisotropy does not come naturally on graphs. As arbitrary graphs have no specific orientations (like up, down, left, right directions in an image), a diffusion process on graphs is consequently isotropic, making all neighbors equally important. However, this may not be true in general, e.g., a neighbor in the same community as a node carries different information from a neighbor in a separate community. GAT makes the diffusion process anisotropic with the attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017). MoNet uses the node degrees as edge features, and GatedGCN employs learnable edge gates such as in Marcheggiani & Titov (2017). We believe the anisotropic property to be critical in designing GNNs. Anisotropic models learn the best edge representations for encoding special information flow on the graph structure for the task to be solved.
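As one concrete instance of anisotropic aggregation, the edge-gate mechanism can be sketched in plain Python: each incoming edge carries a feature vector, its sigmoid acts as a per-dimension gate, and gates are normalized over the incoming edges with a small constant in the denominator. This is a simplified sketch, with the linear maps, batch normalization, ReLU and residual connection left out; the function names and the `e_hat` dictionary layout are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_aggregate(h, e_hat, in_neighbors, eps=1e-6):
    """Weight each neighbor feature by a normalized, per-dimension edge gate.

    e_hat[(j, i)] is the d-dimensional feature of the edge j -> i.
    """
    out = {}
    for i, nbrs in in_neighbors.items():
        d = len(h[i])
        gates = {j: [sigmoid(v) for v in e_hat[(j, i)]] for j in nbrs}
        denom = [sum(gates[j][k] for j in nbrs) + eps for k in range(d)]
        out[i] = [sum(gates[j][k] / denom[k] * h[j][k] for j in nbrs)
                  for k in range(d)]
    return out


h = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [5.0, 5.0]}
in_neighbors = {0: [1, 2]}

# Equal edge features: both neighbors count the same (approximately the mean).
equal = gated_aggregate(h, {(1, 0): [0.0, 0.0], (2, 0): [0.0, 0.0]}, in_neighbors)
# Open gate for node 1, closed gate for node 2: the result is close to h[1].
selective = gated_aggregate(h, {(1, 0): [10.0, 10.0], (2, 0): [-10.0, -10.0]}, in_neighbors)
```

The two calls differ only in the edge features, so the change in output is due to the gates alone.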
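Irrespective of the class of GNNs used, we augment each GNN layer with batch normalization (BN) (Ioffe & Szegedy, 2015), graph size normalization (GN) and residual connections (He et al., 2016). As such, we consider a more specific class of GNNs than (7):
$$h_i^{\ell+1} = h_i^{\ell} + \sigma\big(\text{BN}(\hat{h}_i^{\ell+1})\big), \tag{26}$$
$$\hat{h}_i^{\ell+1} = g_{\text{GNN}}\big(h_i^{\ell}, \{h_j^{\ell} : j \to i\}\big), \tag{27}$$
where $\sigma$ is a nonlinear activation function and $g_{\text{GNN}}$ is a specific GNN layer.
As a reminder, BatchNorm normalizes each mini-batch of $m$ feature vectors using the mini-batch mean and variance, as follows:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} h_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (h_i - \mu_B)^2, \tag{28}$$
and then replaces each $h_i$ with its normalized version, followed by a learnable affine transformation:
$$h_i = \gamma \odot \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta, \tag{29}$$
where $\gamma, \beta \in \mathbb{R}^{d}$.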
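The batch normalization step can be written out in plain Python as a sketch of the training-time computation (running statistics for inference are left out). The function name and the value of the stability constant are assumptions.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature dimension over the mini-batch, then scale and shift."""
    m, d = len(batch), len(batch[0])
    mu = [sum(x[k] for x in batch) / m for k in range(d)]
    var = [sum((x[k] - mu[k]) ** 2 for x in batch) / m for k in range(d)]
    return [
        [gamma[k] * (x[k] - mu[k]) / math.sqrt(var[k] + eps) + beta[k] for k in range(d)]
        for x in batch
    ]


# Two feature vectors whose dimensions live on very different scales.
batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
column_means = [sum(x[k] for x in normalized) / len(normalized) for k in range(2)]
```

After normalization both dimensions have zero mean and roughly unit variance, regardless of their original scale.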
Batching graphs of variable sizes may lead to node representations at different scales, making it difficult to learn the optimal statistics $\mu_B$ and $\sigma_B$ for BatchNorm. Therefore, we consider a GraphNorm layer to normalize the node features w.r.t. the graph size, i.e.,
$$\bar{h}_i^{\ell} = h_i^{\ell} \times \frac{1}{\sqrt{V}}, \tag{30}$$
where $V$ is the number of graph nodes. The GraphNorm layer is placed before the BatchNorm layer.
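A minimal sketch of graph size normalization over a batch of graphs follows. It assumes the scaling factor is one over the square root of the number of nodes of the graph each node belongs to; that factor, the function name, and the list-of-lists batch layout are assumptions of this sketch.

```python
import math


def graph_size_norm(graphs):
    """Scale every node feature by 1 / sqrt(number of nodes in its own graph).

    `graphs` is a list of graphs, each given as a list of node feature vectors.
    """
    return [
        [[v / math.sqrt(len(g)) for v in node] for node in g]
        for g in graphs
    ]


small = [[1.0, 1.0]]                          # graph with 1 node
large = [[1.0, 1.0] for _ in range(4)]        # graph with 4 nodes
normed_small, normed_large = graph_size_norm([small, large])
```

Nodes of the larger graph are scaled down more strongly, so per-graph feature sums grow with the square root of the graph size rather than linearly.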
In addition to benchmarking various classes of GNNs, we consider a simple baseline using graph-agnostic networks for obtaining node features. We apply a MLP on each node’s feature vector, independently of other nodes, i.e.,
$$h_i^{\ell+1} = \text{ReLU}\big(U^{\ell} h_i^{\ell}\big), \tag{31}$$
where $U^{\ell} \in \mathbb{R}^{d \times d}$. This defines our MLP layer. We also consider a slight upgrade by using a gating mechanism, which (independently) scales each node’s final layer features through a sigmoid function:
$$h_i = \sigma\big(V h_i^{L}\big) \cdot h_i^{L}, \tag{32}$$
where $\sigma$ is the sigmoid function. This establishes our second baseline, called MLP-Gated.
The final component of each network is a prediction layer to compute task-dependent outputs, which will be given to a loss function to train the network parameters in an end-to-end manner. The input of the prediction layer is the result of the final GNN layer for each node of the graph (except GIN, which uses features from all intermediate layers).
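To perform graph classification, we first build a $d$-dimensional graph-level vector representation $y_G$ by averaging over all node features in the final GNN layer:
$$y_G = \frac{1}{V} \sum_{i=1}^{V} h_i^{L}. \tag{33}$$
The graph features are then passed to a MLP, which outputs unnormalized logits/scores $y_{\text{pred}} \in \mathbb{R}^{C}$ for each class:
$$y_{\text{pred}} = P\, \text{ReLU}(Q\, y_G), \tag{34}$$
where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times d}$, and $C$ is the number of classes. Finally, we minimize the cross-entropy loss between the logits and ground-truth labels.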
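The graph classification head can be sketched end to end in plain Python: mean readout over the final node features, a two-layer MLP producing logits, and the cross-entropy loss. The function names and the identity weight matrices are hypothetical and were picked so that the numbers can be checked by hand.

```python
import math


def mean_readout(node_feats):
    """Average the final-layer node features into one graph-level vector."""
    n = len(node_feats)
    return [sum(x[k] for x in node_feats) / n for k in range(len(node_feats[0]))]


def mlp_logits(y, Q, P):
    """Two-layer MLP: hidden = ReLU(Q y), logits = P hidden."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, y))) for row in Q]
    return [sum(w * v for w, v in zip(row, hidden)) for row in P]


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[label]


node_feats = [[1.0, 3.0], [3.0, 1.0]]     # final-layer features of a 2-node graph
Q = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical hidden weights (d x d)
P = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical output weights (C x d), C = 2

y_graph = mean_readout(node_feats)
logits = mlp_logits(y_graph, Q, P)
loss = cross_entropy(logits, label=0)
```

With equal logits for the two classes the loss equals log 2, the value expected for an uninformative prediction.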
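For graph regression, we compute $y_G$ using Eq.(33) and pass it to a MLP which gives the prediction score $y_{\text{pred}} \in \mathbb{R}$:
$$y_{\text{pred}} = P\, \text{ReLU}(Q\, y_G), \tag{35}$$
where $P \in \mathbb{R}^{1 \times d}$, $Q \in \mathbb{R}^{d \times d}$. The L1 loss between the predicted score and the ground-truth score is minimized during training.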
Dataset  #Graphs  #Classes  Avg. Nodes  Avg. Edges  Node feat. (dim)  Edge feat. (dim) 
ENZYMES  600  6  32.63  62.14  Node Attr (18)  N.A. 
DD  1178  2  284.32  715.66  Node Label (89)  N.A. 
PROTEINS  1113  2  39.06  72.82  Node Attr (29)  N.A. 
MNIST  70000  10  70.57  564.53  Pixel+Coord (3)  Node Dist (1) 
CIFAR10  60000  10  117.63  941.07  Pixel[RGB]+Coord (5)  Node Dist (1) 
ZINC  12000  –  23.16  49.83  Node Label (28)  Edge Label (4) 
PATTERN  14000  2  117.47  4749.15  Node Attr (3)  N.A. 
CLUSTER  12000  6  117.20  4301.72  Node Attr (7)  N.A. 
TSP  12000  2  275.76  6894.04  Coord (2)  Node Dist (1) 

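For node classification, we independently pass each node’s feature vector to a MLP for computing the unnormalized logits $y_{i,\text{pred}} \in \mathbb{R}^{C}$ for each class:
$$y_{i,\text{pred}} = P\, \text{ReLU}(Q\, h_i^{L}), \tag{36}$$
where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times d}$. The cross-entropy loss weighted inversely by the class size is used during training.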
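One way to realize a cross-entropy loss weighted inversely by the class size is sketched below. The exact weighting is not spelled out in the text, so the choice of a weight of one over the class count, the normalization by the total weight, and the function names are all assumptions of this sketch.

```python
import math
from collections import Counter


def inverse_class_weights(labels, num_classes):
    """Weight each class by 1 / (number of nodes carrying that label)."""
    counts = Counter(labels)
    return [1.0 / counts[c] if counts[c] else 0.0 for c in range(num_classes)]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-node cross-entropy losses."""
    total, norm = 0.0, 0.0
    for z, y in zip(logits, labels):
        top = max(z)
        log_norm = top + math.log(sum(math.exp(v - top) for v in z))
        total += weights[y] * (log_norm - z[y])
        norm += weights[y]
    return total / norm


labels = [0, 0, 0, 1]                      # class 0 is three times larger than class 1
weights = inverse_class_weights(labels, num_classes=2)
per_class_mass = [weights[c] * labels.count(c) for c in range(2)]

logits = [[0.0, 0.0] for _ in labels]      # uninformative predictions
loss = weighted_cross_entropy(logits, labels, weights)
```

With these weights each class contributes the same total mass to the loss, so a small class is not drowned out by a large one.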
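To make a prediction for each graph edge $e_{ij}$, we first concatenate node features $h_i^{L}$ and $h_j^{L}$ from the final GNN layer. The concatenated edge features are then passed to a MLP for computing the unnormalized logits $y_{ij,\text{pred}} \in \mathbb{R}^{C}$ for each class:
$$y_{ij,\text{pred}} = P\, \text{ReLU}\big(Q\, \text{Concat}(h_i^{L}, h_j^{L})\big), \tag{37}$$
where $P \in \mathbb{R}^{C \times d}$, $Q \in \mathbb{R}^{d \times 2d}$. The standard cross-entropy loss between the logits and ground-truth labels is used.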
In this section, we perform an ablation study on Batch Normalization (BN) and Graph Size Normalization (GN) to empirically study the impact of normalization layers in GNNs. The results, drawn from graph regression (ZINC), graph classification (CIFAR10) and node classification (CLUSTER), are summarized in Table 11. We draw the following inferences.
Dataset  Model  #Param  BN & GN  BN & NO_GN  NO_BN & GN  NO_BN & NO_GN
ZINC
MLP  108975  not used  not used  not used  0.710 ± 0.001
MLP (Gated)  106970  not used  not used  not used  0.683 ± 0.004
GCN  103077  0.469 ± 0.002  0.486 ± 0.006  0.537 ± 0.005  0.490 ± 0.007
GraphSage  105031  0.429 ± 0.005  0.428 ± 0.004  0.424 ± 0.008  0.431 ± 0.005
GIN  103079  0.414 ± 0.009  0.407 ± 0.011  0.456 ± 0.009  0.426 ± 0.010
DiffPool  110561  0.466 ± 0.006  0.469 ± 0.006  0.470 ± 0.004  0.465 ± 0.008
GAT  102385  0.463 ± 0.002  0.479 ± 0.004  0.509 ± 0.008  0.487 ± 0.006
MoNet  106002  0.407 ± 0.007  0.418 ± 0.008  0.455 ± 0.010  0.477 ± 0.009
GatedGCN-E  105875  0.363 ± 0.009  0.394 ± 0.003  0.389 ± 0.004  0.399 ± 0.003
CIFAR10
MLP  104044  not used  not used  not used  56.01 ± 0.90
MLP (Gated)  106017  not used  not used  not used  56.78 ± 0.12
GCN  101657  54.46 ± 0.10  54.92 ± 0.08  49.27 ± 0.97  54.14 ± 0.67
GraphSage  102907  65.93 ± 0.30  66.02 ± 0.19  65.96 ± 0.26  65.98 ± 0.15
GIN  105654  53.28 ± 3.70  51.07 ± 2.17  59.09 ± 2.24  55.49 ± 1.54
DiffPool  108042  57.99 ± 0.45  57.18 ± 1.01  57.25 ± 0.29  56.70 ± 0.71
GAT  110704  65.40 ± 0.38  65.45 ± 0.28  60.19 ± 0.82  62.72 ± 0.36
GatedGCN-E  104357  68.64 ± 0.60  69.74 ± 0.35  59.57 ± 1.30  64.10 ± 0.44
CLUSTER
MLP  106015  not used  not used  not used  20.97 ± 0.01
MLP (Gated)  104305  not used  not used  not used  20.97 ± 0.01
GCN  101655  47.82 ± 4.91  39.76 ± 6.90  21.00 ± 0.04  27.05 ± 5.79
GraphSage  99139  44.89 ± 3.70  45.58 ± 5.65  46.27 ± 6.61  48.83 ± 3.84
GIN  103544  49.64 ± 2.09  48.50 ± 2.39  44.11 ± 4.10  47.60 ± 1.05
GAT  110700  49.08 ± 6.47  43.32 ± 8.03  30.20 ± 7.82  40.04 ± 4.90
GatedGCN  104355  54.20 ± 3.58  50.38 ± 5.02  27.80 ± 3.32  34.05 ± 2.57
Dataset  #Param  NO_SL & RC: Acc  NO_SL & RC: Epoch/Total  NO_SL & NO_RC: Acc  NO_SL & NO_RC: Epoch/Total  SL & RC: Acc  SL & RC: Epoch/Total  SL & NO_RC: Acc  SL & NO_RC: Epoch/Total
ZINC  103077  0.469 ± 0.002  3.02s/0.08hr  0.469 ± 0.005  3.00s/0.08hr  0.482 ± 0.010  3.18s/0.08hr  0.491 ± 0.005  3.13s/0.09hr
MNIST  101365  89.99 ± 0.15  78.25s/1.81hr  89.29 ± 0.36  79.41s/1.85hr  89.72 ± 0.28  86.78s/2.11hr  89.29 ± 0.36  79.41s/1.85hr
CIFAR10  101657  54.46 ± 0.10  100.91s/2.73hr  51.40 ± 0.17  100.37s/2.67hr  53.87 ± 0.93  113.43s/3.23hr  52.03 ± 0.35  113.01s/2.60hr
PATTERN  100923  74.36 ± 1.59  97.37s/2.06hr  55.37 ± 0.22  97.60s/2.56hr  74.57 ± 1.39  102.01s/2.56hr  54.94 ± 0.43  101.69s/2.38hr
CLUSTER  101655  47.82 ± 4.91  66.58s/1.26hr  36.23 ± 0.68  66.85s/1.28hr  41.95 ± 7.14  70.55s/1.42hr  40.87 ± 0.79  70.37s/1.27hr
GCN, GAT and GatedGCN. For these GNNs, GN followed by BN consistently improves performance. This empirically demonstrates the necessity to normalize the learnt features w.r.t. activation ranges and graph sizes for better training and generalization. Importantly, GN must always complement BN and is not useful on its own: solely using GN, in the absence of BN, hurts performance across all datasets compared to using no normalization.
GraphSage and GIN. Using BN and GN neither improves nor degrades the performance of GraphSage (except on CLUSTER). Intuitively, the implicit normalization in GraphSage, Eq.(11), acts similarly to BN, controlling the range of activations.
Numerical results for GIN are not consistent across the datasets. For ZINC and CLUSTER, BN is useful with or without using GN. However, for CIFAR10, the sole usage of GN offers the best result (Table 11). We hypothesize this discrepancy to be caused by the conflict between the internal BN in the GIN layer, Eq.(14), and the BN we placed generically after each graph convolutional layer, Eq.(26).
The GCN model, Eq.(8), is not explicitly designed to include the central node feature $h_i^{\ell}$ in the computation of the next layer feature $h_i^{\ell+1}$. To solve this issue, the authors decided to augment the graphs with self-loops (SL). An alternative solution to the self-loop trick is the use of residual connections (RC). RC allow the central node features to be explicitly included in the network architecture without increasing the size of the graphs. Besides, RC are particularly effective when designing deep networks as they limit the vanishing gradient problem. This led us to carry out an ablation study on self-loops and residual connections. The results are presented in Table 12.
RC without SL gives the best performance. Overall, the best results are obtained with RC in the absence of SL. Similar performances are obtained for MNIST, CIFAR10 and PATTERN when using RC and SL. However, the use of SL increases the size of the graphs and, consequently, the computational time (up to 12%).
Decoupling central node features from neighborhood features is critical for node-level tasks. It is interesting to notice the significant gain in accuracy for node classification tasks (PATTERN and CLUSTER) when using RC and NO SL vs. using SL and NO RC.
When using SL and NO RC, the same linear transformation is applied to the central node and the neighborhood nodes. For RC and NO SL, the central vertex is treated differently from the neighborhood vertices. This distinction is essential for treating each node as class-wise distinct. For graph-level tasks like graph regression (ZINC) or graph classification (MNIST and CIFAR10), the final graph representation is the mean of the node features, which explains the comparatively smaller gain.
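The difference between the two options can be made concrete with a toy example in plain Python. The sketch below uses a mean-aggregation layer with a single scalar feature and an identity weight; the function names and the triangle graph are hypothetical and chosen only to keep the arithmetic visible.

```python
def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def relu_linear(U, x):
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in U]


def gcn_with_self_loop(h, neighbors, i, U):
    # The central node joins the averaged set, so it shares U with its neighbors.
    return relu_linear(U, mean([h[i]] + [h[j] for j in neighbors[i]]))


def gcn_with_residual(h, neighbors, i, U):
    # The central feature bypasses U and is added back after the convolution.
    conv = relu_linear(U, mean([h[j] for j in neighbors[i]]))
    return [a + b for a, b in zip(h[i], conv)]


# Triangle graph with three distinct scalar node features.
h = {0: [0.0], 1: [2.0], 2: [4.0]}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
U = [[1.0]]  # identity weight, for readability only

with_self_loops = {i: gcn_with_self_loop(h, neighbors, i, U) for i in h}
with_residuals = {i: gcn_with_residual(h, neighbors, i, U) for i in h}
```

With self-loops all three nodes end up with the same feature, since each averages the same set; with residual connections they stay distinct, which matters when every node needs its own label.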
The TSP edge classification task, although artificially generated, presents an interesting empirical result. With the exception of GatedGCN, no GNN is able to outperform the non-learnt nearest neighbor heuristic baseline (F1 score: 0.69). This led us to further explore the suitability of the GatedGCN architecture for edge classification, see Table 13.
Impact of scale. We study how the GatedGCN architecture scales, training models with as few as 3,610 parameters up to 684,514 parameters (Table 13). Naturally, performance improves with larger models and more layers. Somewhat counter-intuitively, the smallest GatedGCN model, with 2 layers and 16 hidden features per layer, continued to outperform the non-learnt heuristic baseline and all other GNN architectures.
Explicitly maintaining edge features. To dissect the unreasonable effectiveness of GatedGCNs for edge tasks, we change the architecture’s node update equations to not explicitly maintain an edge feature across layers, i.e., we replace Eq.(24) with:
$$\hat{e}_{ij}^{\ell} = A^{\ell} h_i^{\ell-1} + B^{\ell} h_j^{\ell-1}. \tag{38}$$
GatedGCN with Eq.(38) is similar to other anisotropic GNN variants, GAT and MoNet, which do not explicitly maintain an edge feature representation across layers.
We find that maintaining edge features is important for performance, especially for smaller models. Larger GatedGCNs without explicit edge features do not achieve the same performance as those with edge features, but are still able to outperform the non-learnt heuristic. It will be interesting to further analyze the importance and tradeoffs of maintaining edge features for real-world edge classification tasks.
L  Hidden  With edge repr, Eq.(24): #Param  F1  Epoch/Total  Without edge repr, Eq.(38): #Param  F1  Epoch/Total
2  16  3610  0.744  110.54s/10.04hr  2970  0.624  119.91s/6.66hr
4  32  24434  0.779  149.24s/11.11hr  19890  0.722  158.04s/8.78hr
4  64  94946  0.790  203.07s/16.31hr  77666  0.752  184.79s/10.27hr
32  64  684514  0.843  1760.56s/48.00hr  547170  0.783  1146.92s/48.00hr
kNN Heuristic, k=2, F1: 0.693
The Max-pool variant of GraphSage, Eq.(12), which adds an additional linear transformation to neighborhood features and takes an element-wise maximum across them, is isotropic but should be more powerful than the Mean variant, Eq.(10). For our main experiments, we used GraphSage-Mean in order to motivate our aim of identifying the basic building blocks for upgrading the GCN architecture. Empirically, we found GraphSage-Max to have similar performance to GraphSage-Mean on TSP, see Table 14. In future work, it would be interesting to further study the application of linear transformations before/after neighborhood aggregation as well as the tradeoffs between various aggregation functions.
Model  #Param  Residual: F1  Residual: Epoch/Total  No Residual: F1  No Residual: Epoch/Total
GraphSage (Mean)  98450  0.663 ± 0.003  145.75s/16.05hr  0.657 ± 0.002  147.22s/14.33hr
GraphSage (Max)  94522  0.664 ± 0.001  155.76s/12.94hr  0.667  156.23s/11.75hr
kNN Heuristic  k=2  F1: 0.693
In our main experiments, we fix a parameter budget of 100k and arbitrarily select 4 GNN layers. Then, we estimate the model hyperparameters to match the budget. However, for DiffPool (Ying et al., 2018), the number of trainable parameters is over 100k in one experiment, TU-DD (note that the input feature vector for each node in DD graphs is larger than in other datasets, see Table 10). Apparently, the minimum requirement to constitute a DiffPool model is a single differentiable pooling layer, preceded and followed by a number of GNN layers. Thus, DiffPool effectively uses more GNN layers compared to all other models, illustrated in Figure 17. Following Ying et al. (2018), we use GraphSage at each level of hierarchy and for the pooling.
In order to be as close as possible to the budget of 100k parameters for DiffPool, we set the hidden dimension to be significantly smaller than other GNNs (e.g., as few as 8 for DD, compared to 128 for GCN). Despite smaller hidden dimensions, DiffPool still has the highest number of trainable parameters per experiment due to increased depth as well as the dense pooling layer. However, DiffPool performs poorly compared to other GNNs on the TU datasets, see Table 2. We attribute its low performance to the small hidden dimension.
For completeness, we increase the hidden dimensions for DiffPool to 32, which raises the parameter count to 592k for DD. Our results for DD and PROTEINS, presented in Table 15, show that larger DiffPool models match the performance of other GNNs (Table 2). Nonetheless, our goal is not to find the optimal set of hyperparameters for a model, but to identify performance trends and important mechanisms for designing GNNs. In future work, it would be interesting to further study the design of hierarchical representation learning methods such as DiffPool.
Dataset  Model  #Param  Seed 1: Acc ± s.d.  Seed 1: Epoch/Total  Seed 2: Acc ± s.d.  Seed 2: Epoch/Total
DD  DiffPool (big)  592230  76.54 ± 2.90  39.41s/33.49hr  75.91 ± 2.76  37.83s/32.65hr
DD  DiffPool (budget)  165342  65.91 ± 9.45  37.87s/33.42hr  63.73 ± 1.49  37.48s/32.87hr
PROTEINS  DiffPool (big)  137390  77.30 ± 2.69  4.01s/4.09hr  76.70 ± 4.20  3.96s/3.97hr
PROTEINS  DiffPool (budget)  93780  76.60 ± 2.11  3.99s/3.93hr  75.90 ± 2.88  3.99s/3.90hr
Timing research code can be tricky due to differences in implementations and hardware acceleration, e.g., our implementation of GAT can be optimized by taking a parallelized approach for multi-head computation. Similarly, MoNet can be improved by pre-computing the in-degrees of batched graph nodes that are used as pseudo edge features to compute Gaussian weights. Somewhat counter-intuitively, our GIN implementation is significantly faster than all other models, including vanilla GCN. Nonetheless, we take a practical view and report the average wall clock time per epoch and the total training time for each model. All experiments were implemented in DGL/PyTorch. We run experiments for TU, MNIST, CIFAR10, ZINC and TSP on an Intel Xeon CPU E5-2690 v4 server with 4 Nvidia 1080Ti GPUs, and for PATTERN and CLUSTER on an Intel Xeon Gold 6132 CPU with 4 Nvidia 2080Ti GPUs. Each experiment was run on a single GPU and 4 experiments were run on the server at any given time (on different GPUs). We run each experiment for a maximum of 48 hours.
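The two reported quantities, average wall clock time per epoch and total training time under a time cap, can be collected with a small wrapper like the following sketch. The function name `timed_training` and the `train_one_epoch` callback are hypothetical; a real training loop would also stop on its learning-rate schedule, which is not modeled here.

```python
import time


def timed_training(train_one_epoch, max_epochs, max_hours=48.0):
    """Run up to max_epochs epochs, stopping once the wall clock budget is spent.

    Returns (average seconds per epoch, total hours, number of epochs run).
    """
    epoch_times = []
    start = time.perf_counter()
    for epoch in range(max_epochs):
        t0 = time.perf_counter()
        train_one_epoch(epoch)
        epoch_times.append(time.perf_counter() - t0)
        if time.perf_counter() - start > max_hours * 3600:
            break  # time budget exhausted
    total_hours = (time.perf_counter() - start) / 3600
    return sum(epoch_times) / len(epoch_times), total_hours, len(epoch_times)


seen_epochs = []
avg_sec, total_hr, n_epochs = timed_training(seen_epochs.append, max_epochs=3)
```

Measuring with `time.perf_counter` captures wall clock time, so the numbers include data loading and any other overhead inside the epoch callback.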
The hyperparameter settings for all models in the main paper across all benchmarked datasets are provided in Table 16.
Dataset  Model  L  hidden  out  Other  init lr  patience  min lr  #Params
ENZYMES 
MLP  4  128  128  gated:False; readout:mean  1e3  25  1e6  62502 
MLP (Gated)  4  128  128  gated:True; readout:mean  1e3  25  1e6  79014  
GCN  4  128  128  readout:mean  7e4  25  1e6  80038  
GraphSage  4  96  96  sage_aggregator:meanpool; readout:mean  7e4  25  1e6  82686  
GIN  4  96  96  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  7e3  25  1e6  80770  
DiffPool  3  64  –  embedding_dim:64; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean  7e3  25  1e6  94782  
GAT  4  16  128  n_heads:8; readout:mean  1e3  25  1e6  80550  
MoNet  4  80  80  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e3  25  1e6  83538  
GatedGCN  4  64  64  edge_feat:False; readout:mean  7e4  25  1e6  89366  
DD 
MLP  4  128  128  gated:False; readout:mean  1e4  25  1e6  71458 
MLP (Gated)  4  128  128  gated:True; readout:mean  1e4  25  1e6  87970  
GCN  4  128  128  readout:mean  1e5  25  1e6  88994  
GraphSage  4  96  96  sage_aggregator:meanpool; readout:mean  1e5  25  1e6  89402  
GIN  4  96  96  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e3  25  1e6  85646  
DiffPool  3  8  –  embedding_dim:8; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean  5e4  25  1e6  165342  
GAT  4  16  128  n_heads:8; readout:mean  5e5  25  1e6  89506  
MoNet  4  80  80  kernel:3; pseudo_dim_MoNet:2; readout:mean  7e5  25  1e6  89134  
GatedGCN  4  64  64  edge_feat:False; readout:mean  1e5  25  1e6  98386  
PROTEINS 
MLP  4  128  128  gated:False; readout:mean  1e4  25  1e6  63778 
MLP (Gated)  4  128  128  gated:True; readout:mean  1e4  25  1e6  80290  
GCN  4  128  128  readout:mean  7e4  25  1e6  81314  
GraphSage  4  96  96  sage_aggregator:meanpool; readout:mean  7e5  25  1e6  83642  
GIN  4  96  96  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  7e3  25  1e6  79886  
DiffPool  3  22  –  embedding_dim:22; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean  1e3  25  1e6  93780  
GAT  4  16  128  n_heads:8; readout:mean  1e3  25  1e6  81826  
MoNet  4  80  80  kernel:3; pseudo_dim_MoNet:2; readout:mean  7e5  25  1e6  84334  
GatedGCN  4  64  64  edge_feat:False; readout:mean  1e4  25  1e6  90706  
MNIST 
MLP  4  168  168  gated:False; readout:mean  1e3  5  1e5  104044 
MLP (Gated)  4  150  150  gated:True; readout:mean  1e3  5  1e5  105717  
GCN  4  146  146  readout:mean  1e3  5  1e5  101365  
GraphSage  4  108  108  sage_aggregator:meanpool; readout:mean  1e3  5  1e5  102691  
GIN  4  110  110  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e3  5  1e5  105434  
DiffPool  3  32  –  embedding_dim:32; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean  1e3  5  1e5  106538  
GAT  4  19  152  n_heads:8; readout:mean  1e3  5  1e5  110400  
MoNet  4  90  90  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e3  5  1e5  104049  
GatedGCN  4  70  70  edge_feat:False (edge_feat:True for GatedGCNE); readout:mean  1e3  5  1e5  104217  
CIFAR10 
MLP  4  168  168  gated:False; readout:mean  1e-3  5  1e-5  104044
MLP (Gated)  4  150  150  gated:True; readout:mean  1e-3  5  1e-5  106017
GCN  4  146  146  readout:mean  1e-3  5  1e-5  101657
GraphSage  4  108  108  sage_aggregator:meanpool; readout:mean  1e-3  5  1e-5  102907
GIN  4  110  110  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e-3  5  1e-5  105654
DiffPool  3  32  –  embedding_dim:16; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean  1e-3  5  1e-5  108042
GAT  4  19  152  n_heads:8; readout:mean  1e-3  5  1e-5  110704
MoNet  4  90  90  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e-3  5  1e-5  104229
GatedGCN  4  70  70  edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean  1e-3  5  1e-5  104357
ZINC 
MLP  4  150  150  gated:False; readout:mean  1e-3  5  1e-5  108975
MLP (Gated)  4  135  135  gated:True; readout:mean  1e-3  5  1e-5  106970
GCN  4  145  145  readout:mean  1e-3  5  1e-5  103077
GraphSage  4  108  108  sage_aggregator:meanpool; readout:mean  1e-3  5  1e-5  105031
GIN  4  110  110  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e-3  5  1e-5  103079
DiffPool  3  56  –  embedding_dim:56; sage_aggregator:meanpool; num_pool:1; pool_ratio:0.15; linkpred:True; readout:mean  1e-3  5  1e-5  110561
GAT  4  18  144  n_heads:8; readout:mean  1e-3  5  1e-5  102385
MoNet  4  90  90  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e-3  5  1e-5  106002
GatedGCN  4  70  70  edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean  1e-3  5  1e-5  105735
PATTERN 
MLP  4  150  150  gated:False; readout:mean  1e-3  5  1e-5  105263
MLP (Gated)  4  135  135  gated:True; readout:mean  1e-3  5  1e-5  103629
GCN  4  146  146  readout:mean  1e-3  5  1e-5  100923
GraphSage  4  106  106  sage_aggregator:meanpool; readout:mean  1e-3  5  1e-5  98607
GIN  4  110  110  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e-3  5  1e-5  100884
GAT  4  19  152  n_heads:8; readout:mean  1e-3  5  1e-5  109936
MoNet  4  90  90  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e-3  5  1e-5  103775
GatedGCN  4  70  70  edge_feat:False; readout:mean  1e-3  5  1e-5  104003
CLUSTER 
MLP  4  150  150  gated:False; readout:mean  1e-3  5  1e-5  106015
MLP (Gated)  4  135  135  gated:True; readout:mean  1e-3  5  1e-5  104305
GCN  4  146  146  readout:mean  1e-3  5  1e-5  101655
GraphSage  4  106  106  sage_aggregator:meanpool; readout:mean  1e-3  5  1e-5  99139
GIN  4  110  110  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e-3  5  1e-5  103544
GAT  4  19  152  n_heads:8; readout:mean  1e-3  5  1e-5  110700
MoNet  4  90  90  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e-3  5  1e-5  104227
GatedGCN  4  70  70  edge_feat:False; readout:mean  1e-3  5  1e-5  104355
TSP 
MLP  3  144  144  gated:False; readout:mean  1e-3  10  1e-5  94394
MLP (Gated)  3  144  144  gated:True; readout:mean  1e-3  10  1e-5  115274
GCN  4  128  128  readout:mean  1e-3  10  1e-5  108738
GraphSage  4  96  96  sage_aggregator:meanpool; readout:mean  1e-3  10  1e-5  98450
GIN  4  80  80  n_mlp_GIN:2; learn_eps_GIN:True; neighbor_aggr_GIN:sum; readout:sum  1e-3  10  1e-5  118574
GAT  4  16  128  n_heads:8; readout:mean  1e-3  10  1e-5  109250
MoNet  4  80  80  kernel:3; pseudo_dim_MoNet:2; readout:mean  1e-3  10  1e-5  94274
GatedGCN  4  64  64  edge_feat:False (edge_feat:True for GatedGCN-E); readout:mean  1e-3  10  1e-5  94946
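
To illustrate how the three learning-setup values in each row might interact, the following is a minimal sketch of a reduce-on-plateau schedule. It assumes that the three columns before the parameter count are the initial learning rate, the patience (in epochs), and the minimum learning rate, and that training ends once the learning rate drops below the minimum. The class name `PlateauSchedule` and the decay factor of 0.5 are hypothetical choices made for this example; the table does not specify them.

```python
from dataclasses import dataclass


@dataclass
class PlateauSchedule:
    """Toy reduce-on-plateau schedule driven by three table values."""
    lr: float             # initial learning rate, e.g. 1e-3
    patience: int         # epochs without improvement tolerated before decay
    min_lr: float         # training stops once lr falls below this value
    factor: float = 0.5   # ASSUMPTION: decay factor, not given in the table
    best: float = float("inf")
    bad_epochs: int = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True while training should continue."""
        if val_loss < self.best:
            # Validation loss improved: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                # Plateau detected: shrink the learning rate and start over.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr >= self.min_lr


# Usage with the values of an MNIST row (1e-3, 5, 1e-5) and a
# validation loss that never improves after the first epoch.
sched = PlateauSchedule(lr=1e-3, patience=5, min_lr=1e-5)
epochs = 0
while True:
    epochs += 1
    if not sched.step(val_loss=1.0):
        break
```

Under these assumptions, a smaller patience or a larger minimum learning rate shortens training on a plateau, which is one way to read the difference between the rows with patience 25 and minimum 1e-6 and those with patience 5 and minimum 1e-5.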