1 Introduction
Semi-supervised node classification in graphs is a classic problem in graph mining, with applications ranging from e-commerce to computational biology. Recently proposed graph neural network (GNN) architectures have achieved unprecedented results on this task and significantly advanced the state of the art. Despite their success, we cannot accurately judge the progress being made due to certain problematic aspects of the established empirical evaluation procedures. We can partially attribute this to the practice of replicating the experimental settings from earlier works, since they are perceived as standard. First, a number of proposed models have been tested exclusively on the same train/validation/test splits of the same three datasets (CORA, CiteSeer and PubMed) from planetoid. Such an experimental setup favors the model that overfits the most and defeats the main purpose of using a train/validation/test split: finding the model with the best generalization properties (friedman2001elements). Second, when evaluating the performance of a new model, researchers often use a training procedure that differs considerably from the one used for the baselines. This makes it difficult to identify whether the improved performance comes from (a) a superior architecture of the new model, or (b) a better-tuned training procedure and/or hyperparameter configuration that unfairly benefits the new model (lipton2018troubling).
In this paper we address these issues and perform a thorough experimental evaluation of four prominent GNN architectures on the transductive semi-supervised node classification task. We implement the four models, GCN (gcn), MoNet (monet), GraphSAGE (graphsage) and GAT (gat), within the same framework (code is available at https://www.kdd.in.tum.de/gnn-benchmark). Our evaluation focuses on two aspects. First, we use a standardized training and hyperparameter selection procedure for all models. In such a setting, differences in performance can with high certainty be attributed to differences in model architecture rather than to other factors. Second, we perform experiments on four well-known citation network datasets, and additionally introduce four new datasets for the node classification problem. For each dataset we use 100 random train/validation/test splits and perform 20 random initializations for each split. This setup allows us to assess the generalization performance of the models more accurately, instead of merely selecting the model that overfits one fixed test set.
Before we continue, we would like to make a disclaimer: we do not believe that accuracy on benchmark datasets is the only important characteristic of a machine learning algorithm. Developing and generalizing the theory behind existing methods, as well as establishing connections to (and adapting ideas from) other fields, are important research directions that move the field forward. However, thorough empirical evaluation is crucial for understanding the strengths and limitations of different models.
2 Models
We consider the problem of semi-supervised transductive node classification in a graph, as defined in planetoid. In this paper we compare the following four popular graph neural network architectures. Graph Convolutional Network (GCN) (gcn) is one of the earlier models; it performs a linear approximation to spectral graph convolutions. Mixture Model Network (MoNet) (monet) generalizes the GCN architecture and makes it possible to learn adaptive convolution filters. Graph Attention Network (GAT) (gat) introduces an attention mechanism that weighs nodes in the neighborhood differently during the aggregation step. Lastly, GraphSAGE (graphsage) focuses on inductive node classification, but can also be applied in the transductive setting. We consider three variants of the GraphSAGE model from the original paper, denoted as GS-mean, GS-meanpool and GS-maxpool.
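To make the comparison concrete, the GCN propagation rule $H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)})$ with $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ can be written out as a two-layer forward pass. The following is a minimal dense NumPy sketch for illustration only, not the reference implementation used in our experiments; all function and variable names are ours.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W1, W2):
    """Two-layer GCN: softmax(A_hat @ relu(A_hat @ X @ W1) @ W2)."""
    A_hat = normalized_adjacency(A)
    H = np.maximum(A_hat @ X @ W1, 0.0)        # hidden layer with ReLU activation
    Z = A_hat @ H @ W2                         # per-node class logits
    Z = Z - Z.max(axis=1, keepdims=True)       # numerically stable softmax
    return np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
```

GAT and MoNet replace the fixed weights of $\hat{A}$ with learned attention coefficients and Gaussian-kernel weights, respectively, while GraphSAGE aggregates sampled neighborhoods.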
The original papers and reference implementations of the above-mentioned models use different training procedures, including different early stopping strategies, learning rate decay schedules, and full-batch vs. mini-batch training (a more detailed description is provided in Appendix A). Such diverse experimental setups make it hard to empirically identify the driver behind the improved performance (lipton2018troubling). Thus, in our experiments we use a standardized training and hyperparameter tuning procedure for all models (more details in Sec. 3) to enable a fairer comparison.
In addition, we consider four baseline models. Logistic Regression (LogReg) and Multilayer Perceptron (MLP) are attribute-based models that do not consider the graph structure. Label Propagation (LabelProp) and Normalized Laplacian Label Propagation (LabelProp NL) (labelprop), on the other hand, only consider the graph structure and ignore the node attributes.
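As an illustration of the structure-only baselines, the following is a minimal sketch of iterative label propagation, written as our own simplified variant with hard clamping of the labeled nodes; it is not necessarily the exact formulation of labelprop used in the experiments.

```python
import numpy as np

def label_propagation(A, y, labeled_mask, num_classes, num_iter=50):
    """Propagate known labels along the row-normalized adjacency matrix A.

    y: integer labels; only entries where labeled_mask is True are used.
    Returns the predicted class for every node.
    """
    n = A.shape[0]
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # transition matrix
    Y = np.zeros((n, num_classes))
    Y[labeled_mask, y[labeled_mask]] = 1.0                    # one-hot known labels
    F = Y.copy()
    for _ in range(num_iter):
        F = P @ F                                             # spread labels to neighbors
        F[labeled_mask] = Y[labeled_mask]                     # clamp the labeled nodes
    return F.argmax(axis=1)
```

The normalized Laplacian variant (LabelProp NL) uses a symmetrically normalized propagation matrix instead of the row-normalized one.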
3 Evaluation
Datasets
For our experiments, we use four well-known citation network datasets: PubMed (namata2012pubmed), CiteSeer and CORA from sen2008collective, as well as the extended version of CORA from bojchevski2018deep, denoted as CORA-Full. We also introduce four new datasets for the node classification task: Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo. Descriptions of the new datasets, as well as statistics for all datasets, can be found in Appendix B. For all datasets, we treat the graphs as undirected and only consider the largest connected component.
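This preprocessing step (symmetrizing the graph and restricting it to the largest connected component) can be sketched as follows, assuming the graph is stored as a SciPy sparse adjacency matrix; the function name is ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def largest_connected_component(adj, features, labels):
    """Symmetrize the graph and keep only nodes in the largest connected component."""
    adj = adj.maximum(adj.T)                                   # treat the graph as undirected
    _, component = connected_components(adj, directed=False)   # component id per node
    keep = component == np.bincount(component).argmax()        # mask of the largest component
    return adj[keep][:, keep], features[keep], labels[keep]
```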
Setup
We keep the model architectures as they are in the original papers / reference implementations. This includes the type and sequence of layers, the choice of activation functions, the placement of dropout, and the choices as to where to apply $L_2$ regularization. We also fix the number of attention heads for GAT to 8 and the number of Gaussian kernels for MoNet to 2, as proposed in the respective papers. All models have 2 layers (input features → hidden layer → output layer).

For a more balanced comparison, however, we use the same training procedure for all models. That is, we use the same optimizer (Adam (adam) with default parameters), the same initialization (weights initialized according to glorot2010understanding, biases initialized with zeros), no learning rate decay, and the same maximum number of training epochs, early stopping criterion, patience and validation frequency (display step) for all models (Appendix C). We optimize all model parameters (attention weights for GAT, kernel parameters for MoNet, weight matrices for all models) simultaneously. In all cases we use full-batch training (using all nodes in the training set every epoch).

Lastly, we use the exact same strategy for hyperparameter selection for every model. We performed an extensive grid search over the learning rate, the size of the hidden layer, the strength of the $L_2$ regularization, and the dropout probability (Appendix C). We restricted the search space to ensure that every model has at most the same given number of trainable parameters. For every model, we picked the hyperparameter configuration that achieved the best average accuracy on the CORA and CiteSeer datasets (averaged over 100 train/validation/test splits and 20 random initializations each). The chosen best-performing configurations were used for all subsequent experiments and are listed in Table 4. In all cases, we use 20 labeled nodes per class as the training set, 30 nodes per class as the validation set, and the rest as the test set.
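The per-class random splits described above can be generated, for instance, as follows; this is a minimal sketch, and the function name and seed handling are ours.

```python
import numpy as np

def random_split(labels, train_per_class=20, val_per_class=30, seed=0):
    """Sample 20 train and 30 validation nodes per class; the remaining nodes form the test set."""
    rng = np.random.RandomState(seed)
    train_idx, val_idx = [], []
    for c in np.unique(labels):
        nodes = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(nodes[:train_per_class])
        val_idx.extend(nodes[train_per_class:train_per_class + val_per_class])
    train_idx, val_idx = np.array(train_idx), np.array(val_idx)
    test_idx = np.setdiff1d(np.arange(len(labels)), np.concatenate([train_idx, val_idx]))
    return train_idx, val_idx, test_idx
```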
Results

[Table 1: mean test set accuracy of GCN, GAT, MoNet, GS-mean, GS-maxpool, GS-meanpool, MLP, LogReg, LabelProp and LabelProp NL on all eight datasets; numeric values not recovered. GS-maxpool is marked N/A for one dataset.]
Table 1: Mean test set accuracy and standard deviation in percent, averaged over 100 random train/validation/test splits with 20 random weight initializations each, for all models and all datasets. For each dataset, the highest accuracy score is marked in bold. N/A stands for the dataset that could not be processed by the full-batch version of GS-maxpool because of GPU RAM limitations.

Table 1 shows the mean accuracies and their standard deviations. (Standard deviations are not the best representation of the variance of the accuracy scores, since the scores are not normally distributed. We still include them to give the reader a rough idea of the variance of the results for each model; a more accurate picture is given by the box plots in Appendix D.)
Among the GNN approaches, there is no clear winner that dominates across all datasets. In fact, for 5 out of 8 datasets, the scores of the 2nd and 3rd best approaches are less than 1% away from the average score of the best-performing method. If we were interested in comparing one model against the rest, we could perform pairwise t-tests, as done in klicpera2018personalized. Since we are interested in comparing all models to each other, we instead consider the relative accuracy of each model. For this, we take the best accuracy score for each split of each dataset (already averaged over 20 initializations) as 100%. Then, the score of each model is divided by this number, and the results for each model are averaged over all datasets and splits. We also rank the algorithms by their performance (1 = best performance, 10 = worst), and compute the average rank across all datasets and splits for each algorithm. The final scores are reported in Table 1(a). We observe that GCN achieves the best performance across all models. While this result may seem surprising, similar findings have been reported in other fields: simpler models often outperform more sophisticated ones if hyperparameter tuning is performed equally carefully for all methods (melis2017state; lucic2017gans). In future work, we plan to further investigate which specific properties of the graphs lead to the differences in performance between the GNN models.

Another surprising finding is the relatively low score and high variance of the results obtained by GAT on the Amazon Computers and Amazon Photo datasets. To investigate this phenomenon, we additionally visualize the accuracy scores achieved by different models on the Amazon Photo dataset in Figure 2 in the appendix. While the median scores of all GNN models are very close to each other, GAT produces extremely low scores (below 40%) for some weight initializations. While these outliers occur rarely (in 138 out of 2000 runs), they significantly lower the average score of GAT.
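The relative accuracy and average rank described above can be computed directly from the per-split accuracy scores. The following sketch assumes the scores are stored in an array of shape (models, datasets, splits), already averaged over the 20 initializations; the function name is ours.

```python
import numpy as np

def relative_accuracy_and_rank(scores):
    """scores[m, d, s]: accuracy of model m on split s of dataset d."""
    best = scores.max(axis=0, keepdims=True)                 # best model per dataset/split = 100%
    relative = (scores / best).mean(axis=(1, 2)) * 100.0      # average relative accuracy in percent
    # Rank 1 = best; ties are broken arbitrarily in this simple sketch.
    ranks = (-scores).argsort(axis=0).argsort(axis=0) + 1
    average_rank = ranks.mean(axis=(1, 2))
    return relative, average_rank
```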
Effect of the train/validation/test split
To demonstrate the effect of different train/validation/test splits on performance, we perform the following simple experiment. We run the four models on the datasets and respective splits from planetoid. As shown in Table 1(b), GAT achieves the best scores for the CORA and CiteSeer datasets, and GCN gets the top score for PubMed. If, however, we consider a different random split with the same train/validation/test set sizes, the ranking of the models is completely different, with GCN coming first on CORA and CiteSeer, and MoNet winning on PubMed. This shows how fragile and misleading results obtained on a single split can be. Taking further into account that the predictions of GNNs can change greatly under small data perturbations (zugner2018adversarial) clearly confirms the need for evaluation strategies based on multiple splits.
4 Conclusion
We have performed an empirical evaluation of four state-of-the-art GNN architectures on the node classification task. We introduced four new attributed graph datasets and open-sourced a framework that enables a fair and reproducible comparison of different GNN models. Our results highlight the fragility of experimental setups that consider only a single train/validation/test split of the data. We also find that, surprisingly, a simple GCN model can outperform more sophisticated GNN architectures if the same hyperparameter selection and training procedures are used and the results are averaged over multiple data splits. We hope that these results will encourage future work to use more robust evaluation procedures.
Acknowledgments
This research was supported by the German Research Foundation, Emmy Noether grant GU 1409/2-1.
References
Appendix A Differences in training procedures for GNN models
GCN

- Early stopping: stop optimization if the validation loss is larger than the mean of the validation losses of the last 10 epochs.
- Full-batch training.
- Maximum number of epochs: 200.
- Train set: 20 per class; validation set: 500 nodes; test set: 1000 nodes (as in the Planetoid split).

MoNet

- No early stopping.
- Full-batch training.
- Maximum number of epochs: 3000 for CORA, 1000 for PubMed.
- Train set: 20 per class; validation set: 500 nodes; test set: 1000 nodes (as in the Planetoid split).
- Alternating optimization of weight matrices and kernel parameters.
- Learning rate decay at predefined iterations (only for CORA).

GAT

- Early stopping: stop optimization if neither the validation loss nor the validation accuracy improve for 100 epochs.
- Full-batch training.
- Maximum number of epochs: 100000.
- Train set: 20 per class; validation set: 500 nodes; test set: 1000 nodes (as in the Planetoid split).

GraphSAGE

- No early stopping.
- Mini-batch training with a batch size of 512.
- Maximum number of epochs (each epoch consists of multiple mini-batches): 10.
Appendix B Datasets description and statistics
Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [mcauley2015amazon], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.
Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge (https://kddcup2016.azurewebsites.net/). Here, nodes are authors, which are connected by an edge if they co-authored a paper; node features represent the keywords of each author's papers, and class labels indicate the most active field of study of each author.
[Table: dataset statistics (classes, features, nodes, edges, label rate, edge density) for CORA, CiteSeer, PubMed, CORA-Full, Coauthor CS, Coauthor Physics, Amazon Computers and Amazon Photo; numeric values not recovered.]
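The two derived columns can be computed as follows. This sketch reflects our reading of the column names: label rate as the fraction of nodes used for training (20 per class), and edge density as the fraction of possible undirected edges that are present.

```python
def dataset_statistics(num_nodes, num_edges, num_classes, train_per_class=20):
    """Label rate and edge density for the statistics table (our interpretation)."""
    label_rate = train_per_class * num_classes / num_nodes
    edge_density = 2 * num_edges / (num_nodes * (num_nodes - 1))
    return label_rate, edge_density
```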
Appendix C Hyperparameter configurations and Early Stopping
Grid search was performed over the following search space (a sketch of how this grid can be enumerated is given after the list):

- Hidden size: [8, 16, 32, 64]
- Learning rate: [0.001, 0.003, 0.005, 0.008, 0.01]
- Dropout probability: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
- Attention coefficients dropout probability (only for GAT): [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
- $L_2$ regularization strength: [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1]
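Together with the constraint on the number of trainable parameters mentioned in Sec. 3, the grid can be enumerated, for example, as follows. This is a sketch: the parameter-count formula assumes a plain two-layer model, and the budget value is left as an argument rather than fixed.

```python
from itertools import product

HIDDEN_SIZES = [8, 16, 32, 64]
LEARNING_RATES = [0.001, 0.003, 0.005, 0.008, 0.01]
DROPOUTS = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
WEIGHT_DECAYS = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1]

def configurations(num_features, num_classes, max_params=None):
    """Enumerate the hyperparameter grid, optionally skipping configurations whose
    rough two-layer parameter count (weight matrices only) exceeds a budget."""
    for hidden, lr, dropout, weight_decay in product(
            HIDDEN_SIZES, LEARNING_RATES, DROPOUTS, WEIGHT_DECAYS):
        num_params = num_features * hidden + hidden * num_classes
        if max_params is not None and num_params > max_params:
            continue
        yield {"hidden_size": hidden, "learning_rate": lr,
               "dropout": dropout, "weight_decay": weight_decay}
```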
We train for the same fixed maximum number of epochs for all models. However, the actual training time is considerably shorter, since we use strict early stopping. Specifically, with our unified early stopping criterion, training stops if the total validation loss (loss on the data plus $L_2$ regularization loss) does not improve for a fixed number of epochs (the patience). Once training has stopped, we reset the state of the weights to the step with the lowest validation loss.
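The unified early-stopping procedure can be sketched as a generic training-loop wrapper. Here train_epoch, validation_loss, get_weights and set_weights are assumed interfaces standing in for the actual framework code, not its real API.

```python
import copy

def train_with_early_stopping(model, train_epoch, validation_loss, max_epochs, patience):
    """Stop when the total validation loss has not improved for `patience` epochs,
    then restore the weights from the best epoch (assumed model interface)."""
    best_loss, best_weights, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        train_epoch(model)                                     # one full-batch update (assumed)
        val_loss = validation_loss(model)                      # data loss + L2 term (assumed)
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0
            best_weights = copy.deepcopy(model.get_weights())  # remember the best state
        else:
            wait += 1
            if wait >= patience:
                break                                          # patience exhausted
    model.set_weights(best_weights)                            # reset to lowest validation loss
    return model
```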
[Table 4: chosen best-performing hyperparameter configurations for GCN, GAT, MoNet, GS-mean, GS-maxpool, GS-meanpool, MLP and LogReg; values not recovered.]
Appendix D Performance of different models across datasets
[Figures: box plots of the accuracy scores achieved by all models on each dataset, including the Amazon Photo scores referenced in the main text; images not recovered.]