Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study

08/24/2021, by Tianlong Chen, et al., The University of Texas at Austin and Rice University

Training deep graph neural networks (GNNs) is notoriously hard. Besides the standard plights in training deep architectures such as vanishing gradients and overfitting, the training of deep GNNs also uniquely suffers from over-smoothing, information squashing, and so on, which limits their potential power on large-scale graphs. Although numerous efforts have been proposed to address these limitations, such as various forms of skip connections, graph normalization, and random dropping, it is difficult to disentangle the advantages brought by a deep GNN architecture from those "tricks" necessary to train such an architecture. Moreover, the lack of a standardized benchmark with fair and consistent experimental settings poses an almost insurmountable obstacle to gauging the effectiveness of new mechanisms. In view of those, we present the first fair and reproducible benchmark dedicated to assessing the "tricks" of training deep GNNs. We categorize existing approaches, investigate their hyperparameter sensitivity, and unify the basic configuration. Comprehensive evaluations are then conducted on tens of representative graph datasets, including the recent large-scale Open Graph Benchmark (OGB), with diverse deep GNN backbones. Based on synergistic studies, we discover the combo of superior training tricks that leads us to attain new state-of-the-art results for deep GCNs across multiple representative graph datasets. We demonstrate that an organic combination of initial connection, identity mapping, and group and batch normalization attains the best performance on large datasets. Experiments also reveal a number of "surprises" when combining or scaling up some of the tricks. All codes are available at https://github.com/VITA-Group/Deep_GCN_Benchmarking.


1 Introduction

Graph neural networks (GNNs) Li et al. (2015) are powerful tools for modeling graph-structured data, and have been widely adopted in various real-world scenarios, including inferring individual relations in social and academic networks Tang and Liu (2009); Ying et al. (2018a); Huang et al. (2019); Gao et al. (2018); You et al. (2020b, a, 2020), modeling proteins for drug discovery Zitnik and Leskovec (2017); Wale et al. (2008), improving predictions in recommendation systems Monti et al. (2017); Ying et al. (2018a), and segmenting large point clouds Wang et al. (2019); Li et al. (2019), among others. While many classical GNNs have no more than a few layers, recent advances investigate deeper GNN architectures Li et al. (2019, 2021), and there are massive examples of graphs on which deeper GNNs help, such as “geometric" graphs representing structures such as molecules, point clouds Li et al. (2019), or meshes Gong et al. (2020). Deeper architectures are also found to deliver superior performance on large-scale graph datasets such as the latest Open Graph Benchmark (OGB) Hu et al. (2020).

However, training deep GNNs is notoriously challenging Li et al. (2019). Besides the standard plights in training deep architectures such as vanishing gradients and overfitting, the training of deep GNNs suffers several unique barriers that limit their potential power on large-scale graphs. One of them is over-smoothing, i.e., node features tending to become indistinguishable as a result of performing several rounds of recursive neighborhood aggregation Oono and Suzuki (2020). This behaviour was first observed in GCN models Li et al. (2018); NT and Maehara (2019), which act similarly to low-pass filters, and it prevents deep GNNs from effectively modeling the higher-order dependencies from multi-hop neighbors. Another phenomenon is the bottleneck phenomenon: while deep GNNs are expected to utilize longer-range interactions, the structure of the graph often results in exponential growth of the receptive field, causing “over-squashing” of information from exponentially many neighbours into fixed-size vectors Alon and Yahav (2020), and explaining why some deeper GNNs do not improve in performance compared to their shallower peers.

To address the aforementioned roadblocks, typical approaches can be categorized as architectural modifications, and regularization & normalization: we view both as “training tricks". The former includes various types of residual connections Li et al. (2019, 2018); Chen et al. (2020b); Klicpera et al. (2018); Zhang et al. (2020); Xu et al. (2018b); Liu et al. (2020); the latter includes random edge dropping Rong et al. (2020); Huang et al. (2020), pairwise distance normalization between node features (PairNorm) Zhao and Akoglu (2019), node-wise mean and variance normalization (NodeNorm) Zhou et al. (2020b), among many other normalization options Ioffe and Szegedy (2015); Ba et al. (2016); Wu and He (2018); Yang et al. (2020); Zhou et al. (2020a). While these techniques in general contribute to the effective training of deep GNNs with tens of layers, their gains are neither always significant nor consistent Zhou et al. (2020b). Furthermore, it is often non-straightforward to disentangle the gain from deepening the GNN architecture from the “tricks" necessary to train such a deeper architecture. In some extreme situations, contrary to the initial belief, newly proposed training techniques can improve shallower GNNs to even outperform deeper GNNs Zhou et al. (2020b), making the pursuit of depth unpersuasive. Those observations shed light on a missing key knob in studying deeper GNNs: we lack a standardized benchmark that offers fair and consistent comparison of the effectiveness of deep GCN training techniques. Without isolating the effects of deeper architectures from their training “tricks", one might never reach a convincing answer to whether deeper GNNs, ceteris paribus, should perform better.

1.1 Our Contributions

Aiming to establish such a fair benchmark, our first step is to thoroughly investigate the design philosophy and implementation details of dozens of popular deep GNN training techniques, including various residual connections, graph normalization, and random dropping. A summary can be found in Tables 1, A9, A10, and A11. Somewhat unfortunately, we find that even sticking to the same dataset and GNN backbone, the hyperparameter configurations (e.g., hidden dimension, learning rate, weight decay, dropout rate, training epochs, early stopping patience) are implemented highly inconsistently, often varying case by case, which makes it difficult to draw any fair conclusion.

To this end, in this paper we carefully examine those sensitive hyperparameters and unify them into one “sweet point" hyperparameter set, to be adopted by all experiments. This lays the foundation for a fair and reproducible benchmark of training deep GNNs. Then, we comprehensively explore diverse combinations of the available training techniques, over tens of classical graph datasets with commonly used deep GNN backbones. Our comprehensive study turns out to be worthwhile. We conclude the baseline training configurations with the superior combo of training tricks, which leads us to attaining new state-of-the-art results across multiple representative graph datasets including OGB. Specifically, we show that an organic combination of initial connection, identity mapping, and group and batch normalization has preferably strong performance on large datasets. All implementations and benchmarks are publicly available, and our maintenance plan is presented in Appendix A1.

Our experiments also reveal a number of “surprises". For example, ❶ as we empirically show, while initial connection Chen et al. (2020b) and jumping connection Xu et al. (2018b) are both “beneficial" training tricks when applied alone, combining them together deteriorates deep GNN performance. ❷ Although dense connection brings considerable improvement on large-scale graphs with deep GNNs, it sacrifices training stability to a severe extent. ❸ As another example, the gain from NodeNorm Zhou et al. (2020b) diminishes when applied to large-scale datasets or deeper GNN backbones. ❹ Moreover, using random dropping techniques alone often yields unsatisfactory performance. ❺ Lastly, we observe that adopting initial connection and group normalization is universally effective across tens of classical graph datasets. Those findings urge a more synergistic rethinking of those seminal works.

2 Related Works

Graph Neural Networks (GNNs) and Training Deep GNNs.

GNNs Zhou et al. (2018); Kipf and Welling (2017); Chen et al. (2019); Veličković et al. (2017); Zhang and Chen (2018); Ying et al. (2018b); Xu et al. (2018a); You et al. (2020a, 2021); Chen et al. (2021); Hu et al. (2021) have established state-of-the-art results on various tasks Kipf and Welling (2017); Veličković et al. (2017); Qu et al. (2019); Verma et al. (2019); Karimi et al. (2019); You et al. (2020b, 2020). There are also numerous GNN variants Dwivedi et al. (2020); Scarselli et al. (2008); Bruna et al. (2013); Kipf and Welling (2017); Hamilton et al. (2017); Battaglia et al. (2016); Monti et al. (2017); Veličković et al. (2018); Xu et al. (2019); Morris et al. (2019); Chen et al. (2019); Murphy et al. (2019). For example, Graph Convolutional Networks (GCNs) are widely adopted, and can be divided into spectral-domain-based methods Defferrard et al. (2016); Kipf and Welling (2017) and spatial-domain-based methods Simonovsky and Komodakis (2017); Hamilton et al. (2017).

Unlike other deep architectures, not every useful GNN has to be deep. For example, many graphs, such as social networks and citation networks, are “small-world" Barceló et al. (2020), i.e., one node can reach any other node in a few hops. Hereby, adding more layers does not help modeling on those graphs, since stacking just a few layers already yields receptive fields that provide global coverage. Many predictions in practice, such as in social networks, also mainly rely on short-range information from the local neighbourhood of a node. On the other hand, when the graph data is large in scale, does not have small-world properties, or its task requires long-range information to make predictions, deeper GNNs do become necessary. Examples include molecular graphs, as chemical properties of a molecule can depend on the atom combination at its opposite sides Matlock et al. (2019). Point clouds and meshes also benefit from GNN depth to capture a whole object and its context in the visual scene Li et al. (2019); Gong et al. (2020).

With the prosperity of GNNs, understanding their training mechanism and limitations is of remarkable interest. As GNNs stack spatial aggregations recursively Li et al. (2015); Hamilton et al. (2017), the node representations will collapse to indistinguishable vectors NT and Maehara (2019); Oono and Suzuki (2020); Chen et al. (2020a). This over-smoothing phenomenon hinders the training of deep GNNs and the modeling of dependencies on high-order neighbors. Recently, a series of techniques has been developed to relieve the over-smoothing issue, including skip connection, graph normalization, and random dropping. They are detailed next.
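To make this collapse concrete, below is a tiny self-contained toy demonstration (our own illustration, not from the paper): repeatedly applying a symmetrically normalized adjacency to random node features quickly shrinks the pairwise distances between node representations.

```python
import torch

torch.manual_seed(0)
n = 6
adj = (torch.rand(n, n) < 0.5).float()
adj = ((adj + adj.t()) > 0).float()                       # random symmetric graph
a_tilde = adj + torch.eye(n)                              # add self-loops
d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
a_hat = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)

x = torch.randn(n, 4)                                     # random node features
for layer in range(1, 33):
    x = a_hat @ x                                         # aggregation only, no weights
    if layer in (1, 4, 16, 32):
        # Mean pairwise distance between node rows shrinks rapidly with depth.
        dist = torch.cdist(x, x).mean().item()
        print(f"layer {layer:2d}: mean pairwise distance = {dist:.4f}")
```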

Skip Connection.

Motivated by ResNets He et al. (2016), the skip connection was applied to GNNs to exploit node embeddings from the preceding layers and relieve the over-smoothing issue. There are several skip connection patterns in deep GNNs, including: (1) residual connection connecting to the last layer Li et al. (2019, 2018), (2) initial connection connecting to the initial layer Chen et al. (2020b); Klicpera et al. (2018); Zhang et al. (2020), (3) dense connection connecting to all the preceding layers Li et al. (2019, 2018, 2020); Luan et al. (2019), and (4) jumping connection combining all the preceding layers only at the final graph convolutional layer Xu et al. (2018b); Liu et al. (2020). Note that jumping connection is a simplified version of dense connection that omits the complex skip connections at the intermediate layers.

Graph Normalization.

A series of normalization techniques has been developed for deep GNNs, including batch normalization (BatchNorm) Ioffe and Szegedy (2015), pair normalization (PairNorm) Zhao and Akoglu (2019), node normalization (NodeNorm) Zhou et al. (2020b), mean normalization (MeanNorm) Yang et al. (2020), differentiable group normalization (GroupNorm) Zhou et al. (2020a), and more. Their common mechanism is to re-scale node embeddings over an input graph to constrain pairwise node distances and thus alleviate over-smoothing. While BatchNorm and PairNorm normalize the whole input graph, GroupNorm first clusters nodes into groups and then normalizes each group independently. NodeNorm and MeanNorm operate on each node by re-scaling the associated node embedding with its standard deviation and mean values, respectively.

Random Dropping.

Dropout Srivastava et al. (2014) refers to randomly dropping hidden units in a neural network with a pre-fixed probability, which effectively prevents over-fitting. Similar ideas have been adapted to GNNs. DropEdge Rong et al. (2020) and DropNode Huang et al. (2020) have been proposed to randomly remove a certain number of edges or nodes from the input graph at each training epoch. Alternatively, they can be regarded as data augmentation methods, which help relieve both the over-fitting and over-smoothing issues in training very deep GNNs.

Methods Total epoch Learning rate & Decay Weight decay Dropout Hidden dimension
Chen et al. (2020b) 100 0.01 5e-4 0.6 64
Xu et al. (2018b) - 0.005 5e-4 0.5 {16, 32}
Klicpera et al. (2018) 10000 0.01 5e-3 0.5 64
Zhang et al. (2020) 1500 {0.001, 0.005, 0.01} - {0.1, 0.2, 0.3, 0.4, 0.5} 64
Luan et al. (2019) 3000 1.66e-4 1.86e-2 0.65277 1024
Liu et al. (2020) 100 0.01 0.005 0.8 64
Zhao and Akoglu (2019) 1500 0.005 5e-4 0.6 32
Min et al. (2020) 200 0.005 0 0.9 -
Zhou et al. (2020a) 1000 0.005 5e-4 0.6 -
Zhou et al. (2020b) 50 0.005 1e-5 0 -
Rong et al. (2020) 400 0.01 5e-3 0.8 128
Zou et al. (2019b) 100 0.001 - - 256
Hasanzadeh et al. (2020) 2000 0.005 5e-3 0 128
Table 1: Configurations of basic hyperparameters adopted to implement different approaches for training deep GNNs on Cora (Kipf and Welling, 2017). Configurations for other graph datasets are referred to Section A3.

3 A Fair and Scalable Study of Individual Tricks

3.1 Prerequisite: Unifying the Hyperparameter Configuration

We carefully examine previous implementations of deep GNNs Chen et al. (2020b); Klicpera et al. (2018); Liu et al. (2020); Luan et al. (2019); Xu et al. (2018b); Min et al. (2020); Zhao and Akoglu (2019); Zhou et al. (2020a); Rong et al. (2020); Zhou et al. (2020b), and list all their basic hyperparameters in Tables 1, A9, A10, and A11. Those hyperparameters play a significant role in those methods’ achievable performance, but their inconsistency challenges fair comparison of training techniques, which has traditionally been somewhat overlooked in the literature.

We tune them all with an exhaustive grid search, and identify the most common and effective setting as in Table 2, by choosing the configuration that performs the best across diverse backbones with different layer numbers on the corresponding dataset. That configuration includes the learning rate, weight decay, dropout rate, and the hidden dimension of the multilayer perceptrons (MLP) in the GNN. We recommend it as a “sweet point" hyperparameter configuration, and strictly follow it in our experiments.
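To make the tuning protocol concrete, here is a minimal sketch of such an exhaustive grid search; the candidate grids and the `build_model` / `train_and_eval` helpers are hypothetical placeholders, not the exact search space or code used in this benchmark.

```python
import itertools

# Hypothetical candidate grids; the selected "sweet point" values are the ones in Table 2.
GRID = {
    "lr": [0.001, 0.005, 0.01],
    "weight_decay": [0.0, 5e-4, 5e-3],
    "dropout": [0.0, 0.5, 0.6, 0.8],
    "hidden_dim": [64, 128, 256],
}

def grid_search(build_model, train_and_eval, dataset, depths=(2, 16, 32)):
    """Pick the single configuration whose validation accuracy, averaged over all
    backbone depths, is the best -- mirroring the selection rule described above."""
    best_cfg, best_acc = None, float("-inf")
    keys = list(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        cfg = dict(zip(keys, values))
        accs = [train_and_eval(build_model(dataset, num_layers=d, **cfg), dataset, **cfg)
                for d in depths]
        mean_acc = sum(accs) / len(accs)
        if mean_acc > best_acc:
            best_cfg, best_acc = cfg, mean_acc
    return best_cfg, best_acc
```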

For all experiments, deep GNNs are trained with a fixed maximum number of epochs and a fixed early-stopping patience. One hundred independent repetitions are conducted for each experiment, and the average node classification performance with standard deviations is reported in Tables 3, 4, 5, A13, A14, A15, A16, and A17. We perform a comprehensive benchmark study of dozens of training approaches on four classical datasets, i.e., Cora, Citeseer, PubMed (Kipf and Welling, 2017), and OGBN-ArXiv (Hu et al., 2020), with 2/16/32-layer GCN Kipf and Welling (2017) and simple graph convolution (SGC) Wu et al. (2019) backbones. More details of the adopted datasets are collected in the Appendix. The PyTorch framework and PyTorch Geometric Fey and Lenssen (2019) are used for all of our implementations. More specific setups for each group of training techniques are included in Sections 3.2, 3.3, and 3.4.

Settings Cora Citeseer PubMed OGBN-ArXiv
{Learning rate, Weight decay,
Dropout, Hidden dimension}
{, , , } {, , , } {, , , } {, , , }
Table 2: The “sweet point" hyperparameter configuration we used on representative datasets.

3.2 Skip Connection

Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN Residual 74.73 20.05 19.57 66.83 20.77 20.90 75.27 38.84 38.74 70.19 69.34 65.09
Initial 79.00 78.61 78.74 70.15 68.41 68.36 77.92 77.52 78.18 70.16 70.50 70.23
Jumping 80.98 76.04 75.57 69.33 58.38 55.03 77.83 75.62 75.36 70.24 71.83 71.87
Dense 77.86 69.61 67.26 66.18 49.33 41.48 72.53 69.91 62.99 70.08 71.29 70.94
None 82.38 21.49 21.22 71.46 19.59 20.29 79.76 39.14 38.77 69.46 67.96 45.48
SGC Residual 81.77 82.55 80.14 71.68 71.31 71.00 78.87 79.86 79.07 69.09 66.52 61.83
Initial 81.40 83.66 83.77 71.60 72.16 72.25 79.11 79.73 79.74 68.93 69.24 69.15
Jumping 77.75 83.42 83.88 69.96 71.89 71.88 77.42 79.99 80.07 68.76 70.61 70.65
Dense 77.31 81.24 77.66 70.99 67.75 66.35 77.12 72.77 74.84 69.39 71.42 71.52
None 79.31 75.98 68.45 72.31 71.03 61.92 78.06 69.18 66.61 61.98 41.58 34.22
Table 3: Test accuracy (%) under different skip connection mechanisms. Experiments are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with 2/16/32-layer GCN and SGC. Performance is averaged over 100 independent repetitions, and the standard deviations are presented in Table A13.

Formulations.

Skip connections Li et al. (2019, 2018); Chen et al. (2020b); Klicpera et al. (2018); Zhang et al. (2020); Xu et al. (2018b); Liu et al. (2020) alleviate vanishing gradients and over-smoothing, and substantially improve the accuracy and stability of training deep GCNs.

Let $\mathcal{G} = (A, X)$ denote the graph data, where $A \in \mathbb{R}^{N \times N}$ represents the adjacency matrix, $X \in \mathbb{R}^{N \times d}$ is the corresponding input node feature matrix, $N$ is the number of nodes in the graph $\mathcal{G}$, and $d$ is the dimension of each node feature $x_i$. The entries of the adjacency matrix $A$ can be weighted, i.e., $A_{ij} \in \mathbb{R}$. For an $L$-layer GCN, we can apply various types of skip connections after certain graph convolutional layers, using the current and preceding embeddings $X^{(\ell)}$, $0 \le \ell \le L$. Four representative types of skip connections are investigated:

Residual Connection. Li et al. (2019, 2018) $X^{(\ell)} \leftarrow (1-\alpha)\,X^{(\ell)} + \alpha\,X^{(\ell-1)}$.

Initial Connection. Chen et al. (2020b); Klicpera et al. (2018); Zhang et al. (2020) $X^{(\ell)} \leftarrow (1-\alpha)\,X^{(\ell)} + \alpha\,X^{(0)}$.

Dense Connection. Li et al. (2019, 2018, 2020); Luan et al. (2019) $X^{(\ell)} \leftarrow \mathrm{COMBINE}\big(X^{(0)}, X^{(1)}, \dots, X^{(\ell)}\big)$.

Jumping Connection. Xu et al. (2018b); Liu et al. (2020) $X^{\mathrm{final}} = \mathrm{COMBINE}\big(X^{(0)}, X^{(1)}, \dots, X^{(L)}\big)$.

where $\alpha$ in the residual and initial connections is a hyperparameter weighing the contribution of node features from the current layer and previous layers. In our case, the best-performing value of $\alpha$ is grid searched for each experiment. Jumping connection, as a simplified case of dense connection, is only applied at the end of the whole forward propagation process to combine the node features from all previous layers.
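As an illustration, the following PyTorch sketch applies the $\alpha$-weighted residual or initial connection after each graph convolution on a dense normalized adjacency matrix; it is a simplified sketch under our own assumptions, not the benchmark's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipGCN(nn.Module):
    """Deep GCN with a configurable skip pattern: 'residual', 'initial', or 'none'.
    A dense normalized adjacency matrix `adj_hat` (N x N) is assumed."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_layers, skip="initial", alpha=0.1):
        super().__init__()
        self.alpha, self.skip = alpha, skip
        dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [out_dim]
        self.weights = nn.ModuleList([nn.Linear(d_in, d_out, bias=False)
                                      for d_in, d_out in zip(dims[:-1], dims[1:])])
        self.proj = nn.Linear(in_dim, hidden_dim, bias=False)  # map X^(0) to hidden width

    def forward(self, x, adj_hat):
        h0 = self.proj(x)                      # X^(0), reused by the initial connection
        h = x
        for i, w in enumerate(self.weights):
            h_new = adj_hat @ w(h)             # plain graph convolution: A_hat X W
            if i < len(self.weights) - 1:      # skip connections on hidden layers only
                if self.skip == "residual" and h_new.shape == h.shape:
                    h_new = (1 - self.alpha) * h_new + self.alpha * h   # residual
                elif self.skip == "initial":
                    h_new = (1 - self.alpha) * h_new + self.alpha * h0  # initial
                h_new = F.relu(h_new)
            h = h_new
        return h
```

A dense connection would instead collect all intermediate embeddings and merge them with one of the combination functions introduced next.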

We lastly introduce the combination functions $\mathrm{COMBINE}(\cdot)$ used in dense and jumping connections: ❶ Concat, $\mathrm{COMBINE}(X^{(0)}, \dots, X^{(\ell)}) = [X^{(0)} \,\|\, \cdots \,\|\, X^{(\ell)}]\,W$; ❷ Maxpool, $\mathrm{COMBINE}(X^{(0)}, \dots, X^{(\ell)}) = \max(X^{(0)}, \dots, X^{(\ell)})$, where the $\max$ operation returns the maximum value of each column, i.e., of each dimension of the node features; ❸ Attention, where a layer-wise attention score $s^{(k)} = \sigma(X^{(k)} q)$ is computed with a shared learnable vector $q$, normalized across layers, and used to weight and sum the layer-wise embeddings. $\sigma$ is the activation function adopted following (Xu et al., 2018b; Liu et al., 2020), and learnable transformations such as $W$ and $q$ are utilized to rearrange the data dimensions so that dimensions can be matched throughout the computation. Note that, for every dense connection and jumping connection experiment, we examine all three options for $\mathrm{COMBINE}(\cdot)$ in the set {Concat, Maxpool, Attention} and report the one attaining the best performance.
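For reference, here is a minimal sketch of the three combination functions on a list of same-width layer embeddings; the attention variant is a simplified single-query form under our own assumptions and may differ in detail from the cited implementations.

```python
import torch
import torch.nn as nn


class Combine(nn.Module):
    """COMBINE(X^(0), ..., X^(l)) for dense / jumping connections.
    All layer embeddings are assumed to share the same hidden width d."""

    def __init__(self, hidden_dim, num_layers, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            # Linear map to bring the concatenated width back to hidden_dim.
            self.lin = nn.Linear(hidden_dim * num_layers, hidden_dim)
        elif mode == "attention":
            self.query = nn.Linear(hidden_dim, 1)   # one score per node per layer

    def forward(self, xs):                          # xs: list of (N, d) tensors
        if self.mode == "concat":
            return self.lin(torch.cat(xs, dim=-1))
        if self.mode == "maxpool":
            return torch.stack(xs, dim=0).max(dim=0).values  # element-wise max over layers
        if self.mode == "attention":
            stacked = torch.stack(xs, dim=1)                                  # (N, L, d)
            scores = torch.softmax(torch.tanh(self.query(stacked)), dim=1)    # (N, L, 1)
            return (scores * stacked).sum(dim=1)              # weighted sum over layers
        raise ValueError(f"unknown mode {self.mode}")
```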

Figure 1: Training loss under different skip connections on PubMed with a deep SGC backbone. Dense connection is omitted due to the different scale of its loss magnitude. More results are included in Section A5.

Experimental observations.

From Tables 3 and A13, we can draw the following observations: (i) In general, initial and jumping connections perform relatively better than the others, especially for training very deep GNNs (i.e., 16 and 32 layers). This demonstrates that these two skip connections more effectively mitigate the over-smoothing issue, assisting graph neural networks in going deeper. (ii) For shallow GCN backbones (e.g., 2 layers), using skip connections can incur performance degradation, yet it becomes beneficial on the large-scale OGBN-ArXiv dataset. Meanwhile, SGC backbones benefit from skip connections in most cases. (iii) Residual connection only works well for SGC on Cora, implying that although it also brings in shortcuts, its effect in overcoming over-smoothing is diminished by increasing depth or growing dataset size, presumably due to its more “local" information flow (directly from only the preceding layer), compared to initial and dense connections whose information flow can be imported directly from much earlier layers. (iv) Although dense connection on average brings significant accuracy improvements on OGBN-ArXiv with SGC, it sacrifices training stability and leads to considerable performance variance, as consistently shown in Table A13. (v) Figure 1 reveals that skip connections substantially accelerate the training of deep GNNs, which is aligned with the analysis of concurrent work (Xu et al., 2021).

3.3 Graph Normalization

Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN BatchNorm 69.91 61.20 29.05 46.27 26.25 21.82 67.15 58.00 55.98 70.44 70.52 68.74
PairNorm 74.43 55.75 17.67 63.26 27.45 20.67 75.67 71.30 61.54 65.74 65.37 63.32
NodeNorm 79.87 21.46 21.48 68.96 18.81 19.03 78.14 40.92 40.93 70.62 70.75 29.94
MeanNorm 82.49 13.51 13.03 70.86 16.09 7.70 78.68 18.92 18.00 69.54 70.40 56.94
GroupNorm 82.41 41.76 27.20 71.30 26.77 25.82 79.78 70.86 63.91 69.70 70.50 68.14
CombNorm 80.00 55.64 21.44 68.59 18.90 18.53 78.11 40.93 40.90 70.71 71.77 69.91
NoNorm 82.43 21.78 21.21 71.40 19.78 19.85 79.75 39.18 39.00 69.45 67.99 46.38
SGC BatchNorm 79.32 15.86 14.40 61.60 17.34 17.82 76.34 54.22 29.49 68.58 65.54 62.33
PairNorm 80.78 71.26 51.03 69.76 60.14 50.94 75.81 68.89 62.14 60.72 39.69 26.67
NodeNorm 78.09 78.77 73.93 63.42 61.81 60.22 71.64 71.50 73.30 63.21 26.81 16.18
MeanNorm 80.22 48.29 30.07 70.78 38.27 28.27 75.07 47.29 41.32 54.86 21.74 18.97
GroupNorm 82.81 75.81 74.94 72.32 67.54 61.75 78.87 76.43 74.62 66.12 67.29 66.11
CombNorm 77.65 75.16 74.45 63.66 59.97 54.52 71.67 71.50 72.23 65.73 54.37 47.52
NoNorm 79.38 75.93 68.75 72.36 71.06 62.64 78.01 69.06 66.55 61.96 41.43 34.24
Table 4: Test accuracy (%) under different graph normalizations. Experiments are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with 2/16/32-layer GCN and SGC. Performance is averaged over 100 independent repetitions, and the standard deviations are presented in Table A14.

Formulations.

Graph normalization mechanisms Ioffe and Szegedy (2015); Ba et al. (2016); Wu and He (2018); Zhao and Akoglu (2019); Zhou et al. (2020b); Yang et al. (2020); Zhou et al. (2020a) are also designed to tackle over-smoothing. Nearly all graph normalization techniques are applied to the intermediate node embedding matrix after it passes through several graph convolutions. We use $H^{(\ell)} \in \mathbb{R}^{N \times d_\ell}$ to denote the aggregated node embedding matrix at layer $\ell$, where $N$ is the number of nodes and $d_\ell$ is the dimension of each intermediate node representation $h_i$ (we omit the layer index $\ell$ in later descriptions for notational simplicity). Correspondingly, $H_{:,j}$ represents the $j$-th column of $H$, recording the values of the $j$-th dimension of each feature embedding. Our investigated normalization mechanisms are formally depicted as follows.

PairNorm. Zhao and Akoglu (2019) $\mathrm{PairNorm}(h_i) = s \cdot \tilde{h}_i \Big/ \sqrt{\tfrac{1}{N}\sum_{k=1}^{N}\|\tilde{h}_k\|_2^2}$, where $\tilde{h}_i = h_i - \tfrac{1}{N}\sum_{k=1}^{N} h_k$ is the centered embedding.

NodeNorm. Zhou et al. (2020b) $\mathrm{NodeNorm}(h_i) = h_i \,/\, \mathrm{std}(h_i)^{\frac{1}{p}}$.

MeanNorm. Yang et al. (2020) $\mathrm{MeanNorm}(H_{:,j}) = H_{:,j} - \mathrm{mean}(H_{:,j})$.

BatchNorm. Ioffe and Szegedy (2015) $\mathrm{BatchNorm}(H_{:,j}) = \gamma_j \cdot \frac{H_{:,j} - \mathrm{mean}(H_{:,j})}{\mathrm{std}(H_{:,j})} + \beta_j$.

GroupNorm. Zhou et al. (2020a) $S = \mathrm{softmax}(H W) \in \mathbb{R}^{N \times G}$, $H^{(g)} = \mathrm{BatchNorm}(S_{:,g} \odot H)$, and $\mathrm{GroupNorm}(H) = H + \lambda \sum_{g=1}^{G} H^{(g)}$,

where $s$ in PairNorm is a hyperparameter controlling the average pair-wise variance, $p$ in NodeNorm denotes the normalization order, and $\mathrm{std}(\cdot)$ and $\mathrm{mean}(\cdot)$ are functions calculating the standard deviation and mean value, respectively; the values of $s$ and $p$ are fixed in our implementation. $\odot$ in GroupNorm is the operation of row-wise multiplication, $W$ represents a learned transformation, $G$ means the number of groups, and $\lambda$ is the skip connection coefficient used in GroupNorm. Specifically, in our implementation, $G$ and $\lambda$ are chosen separately for Cora, for PubMed, and for Citeseer and OGBN-ArXiv. CombNorm normalizes layer-wise graph embeddings by successively applying the two top-performing GroupNorm and NodeNorm.
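As a concrete reference for the two non-parametric schemes, the sketch below implements PairNorm and NodeNorm on a dense embedding matrix, following the formulas as reconstructed above; it is an illustrative sketch, not the benchmark's exact code.

```python
import torch


def pair_norm(h: torch.Tensor, scale: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """PairNorm: center node embeddings, then rescale so that the average
    squared row norm (hence the average pairwise distance) is controlled by `scale`."""
    h = h - h.mean(dim=0, keepdim=True)                       # remove the graph-wide mean
    rownorm_mean = (h.pow(2).sum(dim=1).mean() + eps).sqrt()  # sqrt of mean squared row norm
    return scale * h / rownorm_mean


def node_norm(h: torch.Tensor, p: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """NodeNorm: rescale each node embedding by (its own standard deviation)^(1/p)."""
    std = h.std(dim=1, keepdim=True)
    return h / (std + eps).pow(1.0 / p)


# Toy usage on a random embedding matrix with 5 nodes and 4 features.
if __name__ == "__main__":
    h = torch.randn(5, 4)
    print(pair_norm(h).shape, node_norm(h).shape)
```

Both functions operate purely on the embedding matrix $H$ and can be dropped in after any graph convolution.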

Experimental observations.

Extensive results are collected in Tables 4 and A14. We conclude the following: (i) Dataset-dependent observations. On small datasets such as Cora and Citeseer, non-parametric normalization methods such as NodeNorm and PairNorm often present on-par performance with the parametric GroupNorm, and GroupNorm displays better performance than CombNorm (i.e., GroupNorm + NodeNorm). On the contrary, for large graphs such as OGBN-ArXiv, GroupNorm and CombNorm show a superior generalization ability over non-parametric normalizations, and adding NodeNorm on top of GroupNorm leads to extra improvements on deep GCN backbones. (ii) Backbone-dependent observations. We observe that even the same normalization technique usually exhibits diverse behaviors on different GNN backbones. For example, PairNorm is nearly always better than NodeNorm on 16- and 32-layer GCN on the three small datasets, and worse than NodeNorm on OGBN-ArXiv. However, the case is almost reversed when we experiment with the SGC backbone, where NodeNorm outperforms PairNorm on small graphs and is inferior on OGBN-ArXiv for deep GNNs. Meanwhile, we notice that SGC with normalization tricks achieves similar performance to the GCN backbone on the three small graphs, while consistently falling behind on the large-scale OGBN graph. (iii) Stability observations. In general, training deep GNNs with normalization incurs more instability (e.g., larger performance variance) as the depth of GNNs increases. BatchNorm demonstrates degraded performance with large variance on small graphs, and yet it is one of the most effective mechanisms on the large graph, with outstanding generalization and improved stability. (iv) Overall, across the five investigated norms, GroupNorm accounts for most of the top-performing scenarios, and PairNorm/NodeNorm occupy the second-best performance for GCN/SGC respectively, depending on the scale of the datasets. In contrast, MeanNorm performs the worst in training deep GCN/SGC, in terms of both achievable accuracy and performance stability.

3.4 Random Dropping

Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN No Dropout 80.68 28.56 29.36 71.36 23.19 23.03 79.56 39.85 40.00 69.53 66.14 41.96
Dropout 82.39 21.60 21.17 71.43 19.37 20.15 79.79 39.09 39.17 69.40 67.79 45.41
DropNode 77.10 27.61 27.65 69.38 21.83 22.18 77.39 40.31 40.38 66.67 67.17 43.81
DropEdge 79.16 28.00 27.87 70.26 22.92 22.92 78.58 40.61 40.50 68.67 66.50 51.70
LADIES 77.12 28.07 27.54 68.87 22.52 22.60 78.31 40.07 40.11 66.43 62.05 40.41
DropNode+Dropout 81.02 22.24 18.81 70.59 24.49 18.23 78.85 40.44 40.37 68.66 68.27 44.18
DropEdge+Dropout 79.71 20.45 21.10 69.64 19.77 18.49 77.77 40.71 40.51 66.55 68.81 49.82
LADIES+Dropout 78.88 19.49 16.92 69.02 27.17 18.54 78.53 41.43 40.70 66.35 65.13 39.99
SGC No Dropout 77.55 73.99 66.80 71.80 72.69 70.50 77.59 69.74 67.81 62.34 42.54 34.76
Dropout 79.37 75.91 68.40 72.35 71.21 62.35 78.04 69.12 66.53 61.96 41.47 34.22
DropNode 78.57 76.99 72.93 71.87 72.50 70.60 77.63 72.51 68.16 61.21 40.52 34.64
DropEdge 78.68 70.65 44.00 71.94 69.43 45.13 78.26 68.39 52.08 62.06 41.03 33.61
LADIES 78.50 78.35 72.71 71.88 71.69 69.80 77.65 74.86 72.27 61.49 38.96 33.17
DropNode+Dropout 80.60 74.83 55.04 72.33 70.30 65.85 78.10 67.98 52.01 61.54 39.48 32.63
DropEdge+Dropout 80.27 76.19 66.08 72.09 66.48 35.55 77.63 69.65 67.55 60.21 39.12 33.81
LADIES+Dropout 79.81 74.72 66.62 71.85 69.24 50.81 77.46 70.54 67.94 60.27 31.41 24.86
Table 5: Test accuracy (%) under different random dropping schemes. Experiments are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with 2/16/32-layer GCN and SGC. Performance is averaged over 100 repetitions, and standard deviations are reported in Tables A15, A16, and A17. Dropout rates are tuned for the best performance.

Formulations.

Random dropping Srivastava et al. (2014); Rong et al. (2020); Huang et al. (2020); Zou et al. (2019a) is another group of approaches to address the over-smoothing issue, by randomly removing or sampling a certain number of edges or nodes at each training epoch. Theoretically, it is effective in decelerating over-smoothing and relieving the information loss caused by dimension collapse (Rong et al., 2020; Huang et al., 2020).

As described above, $X^{(\ell)} \in \mathbb{R}^{N \times d_\ell}$ is the node embedding matrix at the $\ell$-th layer, and $d_\ell$ is the corresponding feature dimension. Let $\hat{A} = \mathrm{Norm}(A)$ denote a normalization operator applied to the adjacency matrix $A$, e.g., the symmetric normalization $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ with self-loops. Then, a vanilla GCN layer can be written as $X^{(\ell+1)} = \sigma\big(\hat{A} X^{(\ell)} W^{(\ell)}\big)$, where $W^{(\ell)}$ denotes the weight matrix of the $\ell$-th layer.
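A minimal dense-matrix sketch of this notation (symmetric normalization with self-loops is assumed here as a common convention):

```python
import torch
import torch.nn.functional as F


def normalize_adj(adj: torch.Tensor) -> torch.Tensor:
    """A_hat = D~^{-1/2} (A + I) D~^{-1/2} on a dense adjacency matrix."""
    a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
    d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1.0).pow(-0.5)    # degree^{-1/2}
    return d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)


def gcn_layer(a_hat: torch.Tensor, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """One vanilla GCN layer: sigma(A_hat X W)."""
    return F.relu(a_hat @ x @ weight)
```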

Dropout (Srivastava et al., 2014): $X^{(\ell+1)} = \sigma\big(\hat{A}\,(X^{(\ell)} \odot Z^{(\ell)})\,W^{(\ell)}\big)$, where $Z^{(\ell)} \in \{0,1\}^{N \times d_\ell}$ is a binary random matrix, and each element is drawn from a Bernoulli distribution.

DropEdge (Rong et al., 2020): $X^{(\ell+1)} = \sigma\big((\hat{A} \odot Z^{(\ell)})\,X^{(\ell)}\,W^{(\ell)}\big)$, where $Z^{(\ell)} \in \{0,1\}^{N \times N}$ is a binary random matrix, and each element is drawn from a Bernoulli distribution.

DropNode (Huang et al., 2020): $X^{(\ell+1)} = \sigma\big(Q^{\top} Q\,\hat{A}\,X^{(\ell)}\,W^{(\ell)}\big)$, where node sampling is governed by a binary random vector $z \in \{0,1\}^{N}$, and each element is drawn from a Bernoulli distribution. Suppose the indices of the selected nodes are $\{i_1, \dots, i_n\}$; $Q \in \{0,1\}^{n \times N}$ is a row selection matrix, where $Q_{jk} = 1$ if and only if $k = i_j$, and $Q_{jk} = 0$ otherwise.

LADIES (Zou et al., 2019a): $X^{(\ell+1)} = \sigma\big(Q^{\top} Q\,(\hat{A} \odot Z^{(\ell)})\,X^{(\ell)}\,W^{(\ell)}\big)$, where $Z^{(\ell)}$ is a binary random matrix restricted to the neighborhood of the nodes kept at the next layer, and each element is drawn from a Bernoulli distribution. $Q$ is the row selection matrix sharing the same definition as in DropNode.
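The sketch below illustrates the two simplest schemes, feature Dropout and DropEdge, on a dense normalized adjacency; DropNode and LADIES additionally involve row selection and are omitted for brevity. Note that the original DropEdge removes edges from the raw adjacency and then re-normalizes, so the mask-after-normalization form here is a simplification under our own assumptions.

```python
import torch
import torch.nn.functional as F


def drop_edge(a_hat: torch.Tensor, drop_rate: float, training: bool = True) -> torch.Tensor:
    """Zero out each entry of the (dense) normalized adjacency independently
    with probability `drop_rate` at every training step (the Bernoulli mask Z)."""
    if not training or drop_rate == 0.0:
        return a_hat
    keep_mask = torch.bernoulli(torch.full_like(a_hat, 1.0 - drop_rate))
    return a_hat * keep_mask


def gcn_layer_with_dropping(a_hat, x, weight, feat_drop=0.5, edge_drop=0.5, training=True):
    """sigma((A_hat * Z_edge)(X * Z_feat) W): Dropout on features, DropEdge on edges."""
    x = F.dropout(x, p=feat_drop, training=training)    # standard feature Dropout
    a_hat = drop_edge(a_hat, edge_drop, training=training)
    return F.relu(a_hat @ x @ weight)
```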

Dataset Identity Mapping GCNs
2 16 32
Cora with 82.98±0.75 67.23±3.61 40.57±11.7
without 82.38±0.33 21.49±3.84 21.22±3.71
Citeseer with 68.25±0.71 56.39±4.36 35.28±5.85
without 71.46±0.44 19.59±1.96 20.29±1.79
PubMed with 79.09±0.48 79.55±0.41 73.74±0.71
without 79.76±0.39 39.14±1.38 38.77±1.20
OGBN-ArXiv with 71.08±0.28 69.22±0.84 62.85±2.22
without 69.46±0.22 67.96±0.38 45.48±4.50
Table 6: Average test accuracy (%) and standard deviations over 100 independent repetitions, reported with or without identity mapping.

For these dropout techniques, the neuron dropout rate follows the “sweet point" configuration in Table 2, and the graph dropout rates (for edges or nodes) are tuned to attain the best results. All presented results adopt the layer-wise dropout scheme consistent with the one in (Zou et al., 2019b).

Experimental observations.

Main evaluation results of random dropping mechanisms are presented in Tables 5 and A15, and additional results are referred to Tables A16 and A17. Several interesting findings can be summarized as follows: (i) The overall performance of random dropping is not as significant as that of skip connection and graph normalization. Such random dropping tricks at most retard the performance drop, but cannot boost deep GNNs to be stronger than shallow GNNs. For example, on the Cora dataset, all dropouts fail to improve deep GCNs; on the OGBN-ArXiv dataset, all dropouts also fail to benefit deep SGCs. On other graph datasets, only a small subset of dropping approaches can relieve performance degradation, while the other tricks still suffer accuracy deterioration. (ii) The effects of random dropping vary drastically across models, datasets, and layer numbers. It is hard to find a consistently improving technique. To be specific, we observe: Sensitive to models: DropNode on SGC has been shown to be an effective training trick on Cora, but turns out to be useless when cooperating with GCN; Sensitive to datasets: on the PubMed dataset, LADIES is the best trick, while it becomes the worst one on OGBN-ArXiv; Sensitive to layer number: deepening GCN from 16 layers to 32 layers, the performance of DropEdge declines significantly (e.g., from 66.50% to 51.70% on OGBN-ArXiv in Table 5). A noteworthy exception is the LADIES trick on PubMed, which consistently achieves the best result for both deep GCN and SGC. (iii) Dropout (with the common rate setting from Table 2) is demonstrated to be beneficial and stronger than node or edge dropping for shallow GCNs (i.e., 2 layers). However, this dropout rate downgrades the performance when GCN goes deeper. Combining Dropout with graph dropping may not always be helpful. For instance, incorporating Dropout into LADIES downgrades the accuracy of 32-layer SGC on all four datasets (see Table 5). (iv) SGC with random dropping shows better tolerance for deep architectures, and more of the techniques presented in Table 5 can improve its generalization ability compared to the plain GCN. Meanwhile, with the same random dropping, SGC steadily surpasses GCN by a substantial performance margin when training deep GNNs (e.g., 32 layers) on the three small graph datasets, while GCN shows superior accuracy on the large-scale OGBN graph. (v) From the stability perspective, we notice that SGC consistently has a smaller performance variance than GCN, whose instability is significantly enlarged by increasing depth.

3.5 Other Tricks

Although there are multiple other tricks in the literature, such as shallow sub-graph sampling Zeng et al. (2020) and the geometric scattering transform of the adjacency matrix Min et al. (2020), most of them are difficult to incorporate into existing deep GNN frameworks. To be specific, shallow sub-graph sampling has to sample the shallow neighborhood of each node iteratively, and geometric scattering requires multiscale wavelet transforms on the adjacency matrix $A$. Neither of them directly propagates messages based upon $\hat{A}$, which deviates from the common fashion of deep GNNs. Fortunately, we find that the identity mapping proposed in Chen et al. (2020b) is effective in augmenting existing frameworks and relieving the over-smoothing and overfitting issues. It can be depicted as follows:

Identity Mapping. Chen et al. (2020b) $X^{(\ell+1)} = \sigma\big(\hat{A} X^{(\ell)} \big((1-\beta_\ell) I + \beta_\ell W^{(\ell)}\big)\big)$, where $\beta_\ell$ is computed by $\beta_\ell = \log(\frac{\lambda}{\ell} + 1)$, $\lambda$ is a positive hyperparameter, and $\beta_\ell$ decreases with the layer index $\ell$ to avoid overfitting. In our case, $\lambda$ is grid searched for different graph datasets, with one value shared by Cora, Citeseer, and PubMed and another used for OGBN-ArXiv.
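A minimal sketch of such an identity-mapping layer (GCNII-style), where $\beta_\ell$ decays with the layer index; the decay schedule and the default $\lambda$ here are illustrative assumptions rather than the benchmark's tuned values.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class IdentityMappingLayer(nn.Module):
    """X^(l+1) = sigma( A_hat X^(l) ((1 - beta_l) I + beta_l W^(l)) ),
    with beta_l = log(lambda / l + 1), so deeper layers stay closer to the identity."""

    def __init__(self, hidden_dim: int, layer_index: int, lam: float = 0.5):
        super().__init__()
        assert layer_index >= 1
        self.weight = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.beta = math.log(lam / layer_index + 1.0)

    def forward(self, x: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        support = a_hat @ x                                  # neighborhood aggregation
        out = (1.0 - self.beta) * support + self.beta * self.weight(support)
        return F.relu(out)
```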

Experimental observations.

As shown in Table 6, we conduct experiments on 2/16/32-layer GCNs with or without identity mapping. From the results on the {Cora, Citeseer, PubMed, OGBN-ArXiv} datasets, (i) we find that identity mapping consistently brings deep GCNs (e.g., 16 and 32 layers) significant accuracy improvements. This empirically evidences the effectiveness of the identity mapping technique in mitigating the over-smoothing issue, particularly on small graphs. However, on shallow GCNs (e.g., 2 layers), identity mapping does not work very well, either obtaining a marginal gain or even incurring some degradation. (ii) Although the generalization of deep GCNs on small-scale graphs (e.g., Cora and Citeseer) is largely enhanced, identity mapping also amplifies the performance instability.

Figure 2: Average test accuracy (%) over 100 independent repetitions on the Cora, PubMed, and OGBN-ArXiv graphs, respectively. “Ours" indicates our best combo of tricks on the corresponding dataset; “+"/“−" means adding/removing certain tricks from the trick combos. The top-performing setups are highlighted with red fonts.

4 Combining Individual Tricks into the Best Combo

The best trick combos (provided that the “sweet point" hyperparameter configurations in Table 2 are adopted): ❶ on Cora, deep SGC with Initial Connection, Dropout, and Label Smoothing; ❷ on PubMed, deep SGC with Jumping Connection and No Dropout; ❸ on Citeseer, deep GCN with Initial Connection, Identity Mapping, and Label Smoothing; ❹ on OGBN-ArXiv, deep GCNII with Initial Connection, Identity Mapping, GroupNorm, NodeNorm, and Dropout.

Our best combos of tricks and ablation study.

We summarize our best trick combos for Cora, Citeseer, PubMed, and OGBN-ArXiv above. Meanwhile, comprehensive ablation studies are conducted to support the superiority of our trick combos in Figure 2. For each dataset, we examine combination variants by adding, removing, or replacing certain tricks from our best trick combos. The results extensively endorse our carefully selected combinations.

Comparison to previous state-of-the-art frameworks.

To further validate the effectiveness of our explored best combo of tricks, we perform comparisons with previous state-of-the-art frameworks, including SGC (Wu et al., 2019), DAGNN (Liu et al., 2020), GCNII (Chen et al., 2020b), JKNet (Xu et al., 2018b), APPNP (Klicpera et al., 2018), and GPRGNN (Chien et al., 2021). As shown in Table 7, we find that organically combining existing training tricks consistently outperforms previous elaborately designed deep GCN frameworks, on both small- and large-scale graph datasets.

Model Cora (Ours: 85.48) Citeseer (Ours: 73.35) PubMed (Ours: 80.76) ArXiv (Ours: 72.70)
2 16 32 2 16 32 2 16 32 2 16 32
SGC (Wu et al., 2019) 79.31 75.98 68.45 72.31 71.03 61.92 78.06 69.18 66.61 61.98 41.58 34.22
DAGNN (Liu et al., 2020) 80.30 84.14 83.39 18.22 73.05 72.59 77.74 80.32 80.58 67.65 71.82 71.46
GCNII (Chen et al., 2020b) 82.19 84.69 85.29 67.81 72.97 73.24 78.05 80.03 79.91 71.24 72.61 72.60
JKNet (Xu et al., 2018b) 79.06 72.97 73.23 66.98 54.33 50.68 77.24 64.37 63.77 63.73 66.41 66.31
APPNP (Klicpera et al., 2018) 82.06 83.64 83.68 71.67 72.13 72.13 79.46 80.30 80.24 65.31 66.95 66.94
GPRGNN (Chien et al., 2021) 82.53 83.69 83.13 70.49 71.39 71.01 78.73 78.78 78.46 69.31 70.30 70.18
Table 7: Test accuracy (%) comparison with previous state-of-the-art frameworks. Experiments are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with 2/16/32-layer GNNs. Performance is averaged over 100 repetitions, and standard deviations are reported in Table A18. The superior performance achieved by our best trick combos with deep GCNs is highlighted in the first row.

Transferring trick combos across datasets.

The last sanity check is whether there exist certain trick combos that can be effective across multiple different graph datasets. We pick Initial Connection + GroupNorm as the trick combo, since these two techniques achieve optimal performance under most scenarios in Sections 3.2, 3.3, and 3.4, and this trick combo also performs on par with the best trick combos on Cora and OGBN-ArXiv. Specifically, we evaluate it on eight other open-source graph datasets: (i) two Co-author datasets Shchur et al. (2018) (CS and Physics), (ii) two Amazon datasets Shchur et al. (2018) (Computers and Photo), (iii) three WebKB datasets Pei et al. (2020) (Texas, Wisconsin, Cornell), and (iv) the Actor dataset Pei et al. (2020). In these transfer investigations, we follow the exact “sweet point" settings from Cora in Table 2, except that for the two Coauthor datasets the weight decay and the dropout rate, and for the two Amazon datasets the dropout rate, are set as adopted by Liu et al. (2020).

Category CS Physics Computers Photo Texas Wisconsin Cornell Actor
SGC 70.52±3.96 91.46±0.48 37.53±0.20 26.60±4.64 56.41±4.25 51.29±6.44 58.57±3.44 26.17±1.15
DAGNN 89.60±0.71 93.31±0.60 79.73±3.63 89.96±1.16 57.68±5.07 50.84±6.62 58.43±3.93 27.73±1.08
GCNII 71.67±2.68 93.15±0.92 37.56±0.43 62.95±9.41 69.19±6.56 70.31±4.75 74.16±6.48 34.28±1.12
JKNet 81.82±3.32 90.92±1.61 67.99±5.07 78.42±6.95 61.08±6.23 52.76±5.69 57.30±4.95 28.80±0.97
APPNP 91.61±0.49 93.75±0.61 43.02±10.16 59.62±23.27 60.68±4.50 54.24±5.94 58.43±3.74 28.65±1.28
GPRGNN 89.56±0.47 93.49±0.59 41.94±9.95 91.74±0.81 62.27±4.97 71.35±5.56 58.27±3.96 29.88±1.82
Ours on SGC 90.74±0.66 94.12±0.56 75.40±1.88 88.29±1.73 79.68±3.77 83.59±3.32 81.24±5.77 36.84±0.70
Table 8: Transfer studies of our trick combo based on SGC (Ours). Comparisons are conducted on eight held-out datasets with widely adopted deep GNN baselines, including GCNII and DAGNN. Performance is averaged over 100 independent repetitions. All trained GNNs share the same depth.

As shown in Table 8, our chosen trick combo universally boosts the worst-performing SGC into one of the top-performing candidates. It consistently surpasses SGC, DAGNN, GCNII, JKNet, APPNP, and GPRGNN by a substantial accuracy margin, except for the comparable performance on the CS, Computers, and Photo graph datasets. The superior transfer performance of our trick combo suggests it as a stronger baseline for future research in the deep GNN community.

5 Conclusion and Discussion of Broad Impact

Deep graph neural networks (GNNs) are a promising field that has so far been held back somewhat by the difficulty of identifying “truly" effective training tricks to alleviate the notorious over-smoothing issue. This work provides a standardized benchmark with fair and consistent experimental configurations to push this field forward. We broadly investigate dozens of existing approaches on tens of representative graphs. Based on extensive experimental results, we identify the combo of the most powerful training tricks, and establish new state-of-the-art performance for deep GNNs. We hope this lays a solid, fair, and practical evaluation foundation for the deep GNN research community by providing strong baselines and superior trick combos. All our codes are provided in the supplement for reproducible research.

We do not believe that this research poses any significant risk of societal harm, since it is scientific in nature. However, if deep GNNs are utilized by malicious users, our proposal may amplify the damage, since it boosts the power of deep GNNs with superior training trick combinations.

References

  • [1] U. Alon and E. Yahav (2020) On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205. Cited by: §1.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §1, §3.3.
  • [3] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J. P. Silva (2020) The logical expressiveness of graph neural networks. In International Conference on Learning Representations, Cited by: §2.
  • [4] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. kavukcuoglu (2016) Interaction networks for learning about objects, relations and physics. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Cited by: §2.
  • [5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv. Cited by: §2.
  • [6] D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun (2020) Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3438–3445. Cited by: §2.
  • [7] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li (2020) Simple and deep graph convolutional networks. arXiv preprint arXiv:2007.02133. Cited by: Table A10, Table A11, Table A9, Table A18, §1.1, §1, §2, Table 1, §3.1, §3.2, §3.2, §3.5, §3.5, §4, Table 7.
  • [8] T. Chen, Y. Sui, X. Chen, A. Zhang, and Z. Wang (2021) A unified lottery ticket hypothesis for graph neural networks. In Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 1695–1706. Cited by: §2.
  • [9] Z. Chen, S. Villar, L. Chen, and J. Bruna (2019) On the equivalence between graph isomorphism testing and function approximation with gnns. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [10] E. Chien, J. Peng, P. Li, and O. Milenkovic (2021) Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations. https://openreview.net/forum, Cited by: Table A18, §4, Table 7.
  • [11] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852. Cited by: §2.
  • [12] V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982. Cited by: §2.
  • [13] M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop, Cited by: §3.1.
  • [14] H. Gao, Z. Wang, and S. Ji (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §1.
  • [15] S. Gong, M. Bahri, M. M. Bronstein, and S. Zafeiriou (2020) Geometrically principled connections in graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11415–11424. Cited by: §1, §2.
  • [16] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034. Cited by: §2, §2.
  • [17] A. Hasanzadeh, E. Hajiramezanali, S. Boluki, M. Zhou, N. Duffield, K. Narayanan, and X. Qian (2020) Bayesian graph neural networks with adaptive connection sampling. In International Conference on Machine Learning, pp. 4094–4104. Cited by: Table A10, Table A9, Table 1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §2.
  • [19] T. Hu, F. Gama, T. Chen, Z. Wang, A. Ribeiro, and B. M. Sadler (2021) VGAI: end-to-end learning of vision-based decentralized controllers for robot swarms. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4900–4904. Cited by: §2.
  • [20] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: Table A11, 2nd item, §1, §3.1.
  • [21] W. Huang, Y. Rong, T. Xu, F. Sun, and J. Huang (2020) Tackling over-smoothing for general graph convolutional networks. arXiv e-prints, pp. arXiv–2008. Cited by: §1, §2, §3.4, §3.4.
  • [22] X. Huang, Q. Song, Y. Li, and X. Hu (2019) Graph recurrent networks with attributed random walks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 732–740. Cited by: §1.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. ICML. Cited by: §1, §2, §3.3, §3.3.
  • [24] M. Karimi, D. Wu, Z. Wang, and Y. Shen (2019) Explainable deep relational networks for predicting compound-protein affinities and contacts. arXiv preprint arXiv:1912.12553. Cited by: §2.
  • [25] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. ICLR. Cited by: Table A10, Table A9, 1st item, §2, Table 1, §3.1.
  • [26] J. Klicpera, A. Bojchevski, and S. Günnemann (2018) Predict then propagate: graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997. Cited by: Table A10, Table A9, Table A18, §1, §2, Table 1, §3.1, §3.2, §3.2, §4, Table 7.
  • [27] K. Kong, G. Li, M. Ding, Z. Wu, C. Zhu, B. Ghanem, G. Taylor, and T. Goldstein (2020) Flag: adversarial data augmentation for graph neural networks. arXiv preprint arXiv:2010.09891. Cited by: Table A11.
  • [28] G. Li, M. Müller, G. Qian, I. C. D. Perez, A. Abualshour, A. K. Thabet, and B. Ghanem (2021) Deepgcns: making gcns go as deep as cnns. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [29] G. Li, M. Muller, A. Thabet, and B. Ghanem (2019) Deepgcns: can gcns go as deep as cnns?. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276. Cited by: Table A11, §1, §1, §1, §2, §2, §3.2, §3.2, §3.2.
  • [30] G. Li, C. Xiong, A. Thabet, and B. Ghanem (2020) Deepergcn: all you need to train deeper gcns. arXiv preprint arXiv:2006.07739. Cited by: Table A11, §2, §3.2.
  • [31] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2, §3.2, §3.2, §3.2.
  • [32] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2015) Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §1, §2.
  • [33] M. Liu, H. Gao, and S. Ji (2020) Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 338–348. Cited by: Table A10, Table A11, Table A9, Table A18, §1, §2, Table 1, §3.1, §3.2, §3.2, §3.2, §4, §4, Table 7.
  • [34] S. Luan, M. Zhao, X. Chang, and D. Precup (2019) Break the ceiling: stronger multi-scale deep graph convolutional networks. arXiv preprint arXiv:1906.02174. Cited by: Table A10, Table A9, §2, Table 1, §3.1, §3.2.
  • [35] M. K. Matlock, A. Datta, N. Le Dang, K. Jiang, and S. J. Swamidass (2019) Deep learning long-range information in undirected graphs with wave networks. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
  • [36] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 43–52. Cited by: 4th item.
  • [37] Y. Min, F. Wenkel, and G. Wolf (2020) Scattering gcn: overcoming oversmoothness in graph convolutional networks. arXiv preprint arXiv:2003.08414. Cited by: Table A9, Table 1, §3.1, §3.5.
  • [38] F. Monti, M. M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. arXiv preprint arXiv:1704.06803. Cited by: §1, §2.
  • [39] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In AAAI, Cited by: §2.
  • [40] R. Murphy, B. Srinivasan, V. Rao, and B. Ribeiro (2019) Relational pooling for graph representations. In International Conference on Machine Learning, pp. 4663–4673. Cited by: §2.
  • [41] H. NT and T. Maehara (2019) Revisiting graph neural networks: all we have is low-pass filters. arXiv preprint arXiv:1905.09550. Cited by: §1, §2.
  • [42] K. Oono and T. Suzuki (2020) Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, Cited by: §1, §2.
  • [43] H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang (2020) Geom-gcn: geometric graph convolutional networks. In ICLR, Cited by: 5th item, §4.
  • [44] J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Lariviere, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and H. Larochelle (2021) Improving reproducibility in machine learning research. Journal of Machine Learning Research 22, pp. 1–20. Cited by: Appendix A2.
  • [45] M. Qu, Y. Bengio, and J. Tang (2019) GMNN: graph markov neural networks. arXiv preprint arXiv:1905.06214. Cited by: §2.
  • [46] Y. Rong, W. Huang, T. Xu, and J. Huang (2020) Dropedge: towards deep graph convolutional networks on node classification. In International Conference on Learning Representations. https://openreview.net/forum, Cited by: Table A10, Table A9, §1, §2, Table 1, §3.1, §3.4, §3.4.
  • [47] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.
  • [48] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: 3rd item, 4th item, §4.
  • [49] Y. Shi, Z. Huang, W. Wang, H. Zhong, S. Feng, and Y. Sun (2020) Masked label prediction: unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509. Cited by: Table A11.
  • [50] M. Simonovsky and N. Komodakis (2017) Dynamic edgeconditioned filters in convolutional neural networks on graphs. In CVPR, Cited by: §2.
  • [51] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §2, §3.4, §3.4.
  • [52] W. Sun, W. Jiang, E. Trulls, A. Tagliasacchi, and K. M. Yi (2020) ACNe: attentive context normalization for robust permutation-equivariant learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11286–11295. Cited by: Table A11.
  • [53] J. Tang, J. Sun, C. Wang, and Z. Yang (2009) Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 807–816. Cited by: 6th item.
  • [54] L. Tang and H. Liu (2009) Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 817–826. Cited by: §1.
  • [55] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §2.
  • [56] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, Cited by: §2.
  • [57] V. Verma, M. Qu, A. Lamb, Y. Bengio, J. Kannala, and J. Tang (2019) GraphMix: regularized training of graph neural networks for semi-supervised learning. arXiv preprint arXiv:1909.11715. Cited by: §2.
  • [58] N. Wale, I. A. Watson, and G. Karypis (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems 14 (3), pp. 347–375. Cited by: §1.
  • [59] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §1.
  • [60] F. Wu, T. Zhang, A. H. d. Souza Jr, C. Fifty, T. Yu, and K. Q. Weinberger (2019) Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153. Cited by: Table A18, §3.1, §4, Table 7.
  • [61] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §3.3.
  • [62] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §2.
  • [63] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [64] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, pp. 5453–5462. Cited by: Table A11, Table A9, Table A18, §1.1, §1, §2, Table 1, §3.1, §3.2, §3.2, §3.2, §4, Table 7.
  • [65] K. Xu, M. Zhang, S. Jegelka, and K. Kawaguchi (2021) Optimization of graph neural networks: implicit acceleration by skip connections and more depth. arXiv preprint arXiv:2105.04550. Cited by: §3.2.
  • [66] C. Yang, R. Wang, S. Yao, S. Liu, and T. Abdelzaher (2020) Revisiting "over-smoothing" in deep gcns. arXiv preprint arXiv:2003.13663. Cited by: §1, §2, §3.3, §3.3.
  • [67] Z. Yang, W. W. Cohen, and R. Salakhutdinov (2016) Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861. Cited by: 1st item.
  • [68] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983. Cited by: §1.
  • [69] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In NeurIPS, Cited by: §2.
  • [70] Y. You, T. Chen, Y. Shen, and Z. Wang (2021) Graph contrastive learning automated. arXiv preprint arXiv:2106.07594. Cited by: §2.
  • [71] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2020) Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33. Cited by: §1, §2.
  • [72] Y. You, T. Chen, Z. Wang, and Y. Shen (2020-13–18 Jul) When does self-supervision help graph convolutional networks?. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 10871–10880. External Links: Link Cited by: §1, §2.
  • [73] Y. You, T. Chen, Z. Wang, and Y. Shen (2020) L2-gcn: layer-wise and learned efficient training of graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2127–2135. Cited by: §1, §2.
  • [74] H. Zeng, M. Zhang, Y. Xia, A. Srivastava, A. Malevich, R. Kannan, V. Prasanna, L. Jin, and R. Chen (2020) Deep graph neural networks with shallow subgraph samplers. arXiv preprint arXiv:2012.01380. Cited by: §3.5.
  • [75] H. Zhang, T. Yan, Z. Xie, Y. Xia, and Y. Zhang (2020) Revisiting graph convolutional network on semi-supervised node classification from an optimization perspective. arXiv preprint arXiv:2009.11469. Cited by: Table A10, Table A11, Table A9, §1, §2, Table 1, §3.2, §3.2.
  • [76] M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. arXiv preprint arXiv:1802.09691. Cited by: §2.
  • [77] L. Zhao and L. Akoglu (2019) PairNorm: tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223. Cited by: Table A10, Table A9, §1, §2, Table 1, §3.1, §3.3, §3.3.
  • [78] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: §2.
  • [79] K. Zhou, X. Huang, Y. Li, D. Zha, R. Chen, and X. Hu (2020) Towards deeper graph neural networks with differentiable group normalization. Advances in Neural Information Processing Systems 33. Cited by: Table A10, Table A9, §1, §2, Table 1, §3.1, §3.3, §3.3.
  • [80] K. Zhou, Y. Dong, K. Wang, W. S. Lee, B. Hooi, H. Xu, and J. Feng (2020) Understanding and resolving performance degradation in graph convolutional networks. arXiv preprint arXiv:2006.07107. Cited by: Table A11, Table A9, §1.1, §1, §2, Table 1, §3.1, §3.3, §3.3.
  • [81] M. Zitnik and J. Leskovec (2017) Predicting multicellular function through multi-layer tissue networks. Bioinformatics 33 (14), pp. i190–i198. Cited by: §1.
  • [82] D. Zou, Z. Hu, Y. Wang, S. Jiang, Y. Sun, and Q. Gu (2019) Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §3.4, §3.4.
  • [82] D. Zou, Z. Hu, Y. Wang, S. Jiang, Y. Sun, and Q. Gu (2019) Layer-dependent importance sampling for training deep and large graph convolutional networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. External Links: Link Cited by: §3.4, §3.4.

Appendix A1 Intended Use and Maintenance Plan

Licensing.

Our repository is released under the MIT license. For more information, please see https://github.com/VITA-Group/Deep_GCN_Benchmarking/LICENSE.

Intended use.

Our benchmark platform is intended for data mining scientists and machine learning researchers who develop novel methods to tackle the problems (e.g., over-smoothing and overfitting) that arise when training deep GNNs. We implement a series of popular tricks necessary for building deep GNNs, include a set of promising deep models, provide interfaces to access a variety of graph datasets, and define a standard evaluation process. On top of our code framework, users can easily incorporate promising tricks and assemble deep GNNs, which can then be tested fairly and comprehensively, as sketched below. Please see our open repository for instructions on how to use the benchmark platform.
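To give a flavor of the intended workflow, the sketch below is purely hypothetical: the names `build_model` and `TRICKS` are illustrative placeholders, not the repository's actual interface (see the repository README for the real entry points).

```python
# Hypothetical sketch only: build_model and TRICKS are illustrative names,
# not the actual interface of the Deep_GCN_Benchmarking repository.
TRICKS = {
    "skip_connection": "initial",   # residual / initial / jumping / dense / none
    "normalization": "group",       # batch / pair / node / mean / group / none
    "dropping": "dropedge",         # dropout / dropnode / dropedge / ladies / none
}

def build_model(backbone="GCN", num_layers=32, tricks=None):
    """Assemble a deep GNN specification from a backbone and a combo of tricks."""
    # In the real platform each trick maps to a concrete module or training hook;
    # here we only show the shape of the configuration a user would pass in.
    return {"backbone": backbone, "num_layers": num_layers, **(tricks or TRICKS)}

model_spec = build_model("SGC", num_layers=16)
print(model_spec)
```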

Maintenance plan and contribution policy.

Our deep GNN benchmark is a community-driven platform. We host a leaderboard on our GitHub to compare the best-performing models to date, and we welcome pull requests from the machine learning community that incorporate new models. We are committed to actively maintaining the repository, enriching our model library, and updating the leaderboard to motivate more advanced GNNs in the future.

Appendix A2 Reproducibility Checklist

To ensure reproducibility, we use the Machine Learning Reproducibility Checklist v2.0, Apr. 7 2020 [44]. An earlier version of this checklist (v1.2) was used for NeurIPS 2019 [44].

  • For all models and algorithms presented,

    • A clear description of the mathematical settings, algorithm, and/or model. We clearly describe all of the settings, formulations, and algorithms in Section 3.

    • A clear explanation of any assumptions. We do not make assumptions.

    • An analysis of the complexity (time, space, sample size) of any algorithm. We do not provide such an analysis.

  • For any theoretical claim,

    • A clear statement of the claim. We do not make theoretical claims.

    • A complete proof of the claim. We do not make theoretical claims.

  • For all datasets used, check if you include:

    • The relevant statistics, such as number of examples. We use widely adopted public graph datasets in Section 3. We provide all the related statistics in Appendix A4.

    • The details of train/validation/test splits. We give this information in our repository.

    • An explanation of any data that were excluded, and all pre-processing steps. We did not exclude any data or perform any pre-processing.

    • A link to a downloadable version of the dataset or simulation environment. Our repository contains all of the instructions to download and run experiments on the datasets in this work. See https://github.com/VITA-Group/Deep_GCN_Benchmarking.

    • For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control. We do not collect or release new datasets.

  • For all shared code related to this work, check if you include:

    • Specification of dependencies. We give installation instructions in the README of our repository.

    • Training code. The training code is available in our repository.

    • Evaluation code. The evaluation code is available in our repository.

    • (Pre-)trained model(s). We do not release any pre-trained models. The code to run all experiments in our work can be found in the GitHub repository.

    • README file includes table of results accompanied by precise command to run to produce those results. We include a README with detailed instructions to reproduce all of our experimental results.

  • For all reported experimental results, check if you include:

    • The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results. We provide all details of the hyper-parameter tuning in Section 3.

    • The exact number of training and evaluation runs. One hundred independent repetitions are conducted for each experiment.

    • A clear definition of the specific measure or statistics used to report results. We use the classification accuracy on the test set, which is defined in Section 3.

    • A description of results with central tendency (e.g. mean) & variation (e.g. error bars). We report mean and standard deviation for all experiments, as indicated in Section 3.

    • The average runtime for each result, or estimated energy cost.

      We do not report the running time or energy cost.

    • A description of the computing infrastructure used. All detailed descriptions are presented in Section A4.

Appendix A3 More Technical Details of Investigated Algorithms

Diverse hyperparameter configurations of existing deep GNNs.

As shown in Tables A9, A10, and A11, these highly inconsistent hyperparameter settings pose severe challenges to fairly comparing the existing training tricks for deep GNNs.

Methods Total epoch Learning rate & Decay Weight decay Dropout Hidden dimension
Chen et al. (2020) [7] 100 0.01 5e-4 0.6 256
Xu et al. (2018) [64] - 0.005 5e-4 0.5 {16, 32}
Klicpera et al. (2018) [26] 10000 0.01 5e-3 0.5 64
Zhang et al. (2020) [75] 1500 {0.001, 0.005, 0.01} - {0.1, 0.2, 0.3, 0.4, 0.5} 64
Luan et al. (2019) [34] 3000 2.8218e-03 1.9812e-02 0.98327 5000
Liu et al. (2020) [33] 100 0.01 0.005 0.8 64
Zhao et al. (2019) [77] 1500 0.005 5e-4 0.6 32
Min et al. (2020) [37] 200 0.005 0 0.9 -
Zhou et al. (2020) [79] 1000 0.005 5e-4 0.6 -
Zhou et al. (2020) [80] 50 0.005 1e-5 0 -
Rong et al. (2020) [46] 400 0.009 1e-3 0.8 128
Zou et al. (2020) [83] 100 0.001 - - 256
Hasanzadeh et al. (2020) [17] 1700 0.005 5e-3 0 128
Table A9: Configurations adopted to implement different approaches for training deep GNNs on Citeseer [25].
Methods Total epoch Learning rate & Decay Weight decay Dropout Hidden dimension
Chen et al. (2020) [7] 100 0.01 5e-4 0.5 256
Klicpera et al. (2018) [26] 10000 0.01 5e-3 0.5 64
Zhang et al. (2020) [75] 1500 {0.001, 0.005, 0.01} - {0.1, 0.2, 0.3, 0.4, 0.5} 64
Luan et al. (2019) [34] 3000 0.001 0.02 0.65277 128
Liu et al. (2020) [33] 100 0.01 0.005 0.8 64
Zhao et al. (2019) [77] 1500 0.005 5e-4 0.6 {32, 64}
Zhou et al. (2020) [79] 1000 0.01 1e-3 0.6 -
Rong et al. (2020) [46] - 0.01 1e-3 0.5 128
Zou et al. (2020) [83] 100 0.001 - - 256
Hasanzadeh et al. (2020) [17] 2000 0.005 5e-3 0 128
Table A10: Configurations adopted to implement different approaches for training deep GNNs on PubMed [25].
Methods Total epoch Learning rate & Decay Weight decay Dropout Hidden dimension
Xu et al. (2018) [64] 500 0.01 - 0.5 128
Li et al. (2019, 2020) [29, 30] 500 0.01 - 0.5 128
Chen et al. (2020) [7] 1000 0.001 - 0.1 256
Zhang et al. (2020) [75] 1500 {0.001, 0.005, 0.01} - {0.1, 0.2, 0.3, 0.4, 0.5} 256
Liu et al. (2020) [33] 1000 0.005 - 0.2 256
Zhou et al. (2020) [80] 400 0.005 0.001 0.6 -
Shi et al. (2020) [49] 2000 0.001 0.0005 0.3 128
Kong et al. (2020) [27] 500 0.01 - 0.5 256
Sun et al. (2020) [52] 2000 0.002 - 0.75 256
Table A11: Configurations adopted to implement approaches for training deep GNNs on Ogbn-ArXiv [20].

Appendix A4 More Implementation Details

Datasets and computing facilities.

Table A12 provides the detailed properties and download links for all adopted datasets. We adopt these benchmark datasets because i) they are widely used to develop and evaluate GNN models, especially the deep GNNs studied in this paper; ii) they contain diverse graphs, ranging from small-scale to large-scale and from homogeneous to heterogeneous; and iii) they are collected from different applications, including citation networks, social networks, etc. All experiments on large graph datasets, e.g., OGBN-ArXiv, are conducted on a single 48GB Quadro RTX 8000 GPU; other experiments on small graphs such as Cora are run on a single 12GB GTX 2080 Ti GPU. Detailed descriptions of these datasets are listed below, followed by a short loading sketch.

  • Cora, Citeseer, Pubmed. They are scientific citation network datasets [67, 25], where nodes and edges represent scientific publications and their citation relationships, respectively. Each node is described by a bag-of-words representation, i.e., a 0/1-valued word vector indicating the absence/presence of the corresponding words in the publication. Each node is associated with a one-hot label, and the node classification task is to predict which field the corresponding publication belongs to.

  • OGBN-Arxiv. The OGBN-Arxiv dataset is a benchmark citation network collected in the Open Graph Benchmark (OGB) [20], which has recently been widely used to evaluate GNN models. Each node represents an arXiv paper from the computer science domain, and each directed edge indicates that one paper cites another. Each node is described by a 128-dimensional word embedding extracted from the title and abstract of the corresponding publication. Similar to Cora, Citeseer, and Pubmed, the node classification task is to predict the subject areas of the corresponding arXiv papers.

  • Coauthor CS, Coauthor Physics. They are co-authorship graph datasets [48] from the scientific fields of computer science and physics, respectively. Nodes represent authors, and an edge indicates that the two corresponding authors have co-authored a paper. Node features represent paper keywords of each author's papers. The node classification task is to predict the most active fields of study for the corresponding author.

  • Amazon Computers, Amazon Photo. These two datasets [48] are segments of the Amazon co-purchase graph [36]. Nodes are items, and edges indicate whether two items are frequently bought together. The node features are given by bag-of-words representations extracted from the product reviews, and the class labels denote the product categories.

  • Texas, Wisconsin, Cornell. They are three sub-datasets of WebKB (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb), collected from the computer science departments of various universities by Carnegie Mellon University [43]. Nodes represent webpages, and edges are hyperlinks between them. The node feature vectors are given by bag-of-words representations of the corresponding webpages. Each node is associated with a one-hot label indicating one of five categories: student, project, course, staff, and faculty.

  • Actor. This is an actor co-occurrence network, i.e., the actor-only induced subgraph of the film-director-actor-writer network [53]. Nodes correspond to actors, and edges denote co-occurrence on the same Wikipedia page. Node feature vectors are given by bag-of-words representations of keywords in the actors' Wikipedia pages. The node classification task is to predict the topic of the actor's Wikipedia page.
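All of the datasets above are available through standard loaders; the following is a minimal loading sketch, assuming the public PyTorch Geometric and OGB dataset classes (the benchmark repository wraps these behind its own data interface).

```python
# Minimal loading sketch using public PyTorch Geometric / OGB dataset classes;
# the benchmark repository provides its own wrappers around these loaders.
from torch_geometric.datasets import Planetoid, Coauthor, Amazon, WebKB, Actor
from ogb.nodeproppred import PygNodePropPredDataset

cora = Planetoid(root="data/Planetoid", name="Cora")       # also "Citeseer", "PubMed"
cs = Coauthor(root="data/Coauthor", name="CS")             # also "Physics"
computers = Amazon(root="data/Amazon", name="Computers")   # also "Photo"
texas = WebKB(root="data/WebKB", name="Texas")             # also "Wisconsin", "Cornell"
actor = Actor(root="data/Actor")
arxiv = PygNodePropPredDataset(name="ogbn-arxiv", root="data/OGB")

data = cora[0]  # a single graph with node features, edges, and labels
print(data.num_nodes, data.num_edges, cora.num_classes)
```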

Backbone Details.

We implement the GCN (https://github.com/tkipf/gcn) and SGC (https://github.com/Tiiiger/SGC) backbones to compare the different tricks studied in Section 3. Specifically, we follow the official model implementations of GCN and SGC to evaluate the tricks of graph normalization and random dropping. While the default dropout rate listed in Table 2 is applied to the backbones (GCN/SGC+NoNorm) when benchmarking normalization techniques, the dropout rate is selected from a candidate set for the backbones (GCN/SGC+Dropout) when benchmarking dropping methods. To evaluate skip connections and identity mapping, we further add feature transformation layers at the beginning and the end of the backbones, which project the initial features into the hidden embedding space and map the hidden embeddings to node labels, respectively. The backbones are denoted by GCN/SGC+None and GCN+without in Tables 3 and 6, respectively. These slight adaptations of the backbones explain the different baseline performances across the benchmark studies of skip connections, graph normalization, random dropping, and other tricks.
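The following is a minimal sketch of such an adapted backbone, assuming PyTorch Geometric's GCNConv; it mirrors the description above (input/output linear transformations around a stack of graph convolutions) rather than reproducing the repository's exact code.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class DeepGCNBackbone(torch.nn.Module):
    """Sketch of a GCN backbone with input/output feature transformations."""
    def __init__(self, in_dim, hidden_dim, out_dim, num_layers, dropout=0.5):
        super().__init__()
        self.input_proj = torch.nn.Linear(in_dim, hidden_dim)    # project features to hidden space
        self.convs = torch.nn.ModuleList(
            [GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers)])
        self.output_proj = torch.nn.Linear(hidden_dim, out_dim)  # map embeddings to labels
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = F.relu(self.input_proj(x))
        for conv in self.convs:
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = F.relu(conv(x, edge_index))
        return self.output_proj(x)  # logits; train with cross-entropy
```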

Dataset Nodes Edges Ave. Degree Features Classes Download Links
Cora 2,708 5,429 3.88 1,433 7 https://github.com/kimiyoung/planetoid/raw/master/data
Citeseer 3,327 4,732 2.84 3,703 6 https://github.com/kimiyoung/planetoid/raw/master/data
PubMed 19,717 44,338 4.50 500 3 https://github.com/kimiyoung/planetoid/raw/master/data
OGBN-ArXiv 169,343 1,166,243 13.77 128 40 https://ogb.stanford.edu/
Coauthor CS 18,333 81,894 8.93 6,805 15 https://github.com/shchur/gnn-benchmark/raw/master/data/npz/
Coauthor Physics 34,493 247,962 14.38 8,415 5 https://github.com/shchur/gnn-benchmark/raw/master/data/npz/
Amazon Computers 13,381 245,778 36.74 767 10 https://github.com/shchur/gnn-benchmark/raw/master/data/npz/
Amazon Photo 7,487 119,043 31.80 745 8 https://github.com/shchur/gnn-benchmark/raw/master/data/npz/
Texas 183 309 3.38 1,703 5 https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master
Wisconsin 183 499 5.45 1,703 5 https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master
Cornell 183 295 3.22 1,703 5 https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master
Actor 7,600 33,544 8.83 931 5 https://raw.githubusercontent.com/graphdml-uiuc-jlu/geom-gcn/master
Table A12: Graph datasets statistics and download links.

Appendix A5 More Experimental Results

More results of skip connections.

Table A13 collects the test accuracy achieved under different skip connection mechanisms, together with the standard deviations over independent runs.
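For reference, the sketch below illustrates how the residual and initial connection variants compared in Table A13 differ, following our reading of these mechanisms; the `alpha` weight and function names are illustrative, not the repository's exact code.

```python
import torch.nn.functional as F

def forward_with_skip(convs, x, edge_index, mode="initial", alpha=0.1):
    """Apply a stack of graph convolutions with an optional skip mechanism."""
    h0 = x  # embedding after the input projection, reused by the initial connection
    for conv in convs:
        h = F.relu(conv(x, edge_index))
        if mode == "residual":      # h_l = f(h_{l-1}) + h_{l-1}
            x = h + x
        elif mode == "initial":     # h_l = (1 - alpha) * f(h_{l-1}) + alpha * h_0
            x = (1 - alpha) * h + alpha * h0
        else:                       # "none": plain stacking
            x = h
    return x
```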

Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN Residual 74.73±3.41 20.05±4.66 19.57±4.94 66.83±1.37 20.77±1.25 20.90±1.34 75.27±1.72 38.84±1.39 38.74±1.33 70.19±0.17 69.34±0.30 65.09±1.08
Initial 79.00±1.01 78.61±0.84 78.74±0.86 70.15±0.84 68.41±0.93 68.36±1.10 77.92±1.76 77.52±1.55 78.18±0.49 70.16±0.21 70.50±0.28 70.23±0.26
Jumping 80.98±0.85 76.04±3.04 75.57±3.74 69.33±0.92 58.38±5.53 55.03±6.27 77.83±0.88 75.62±1.95 75.36±2.48 70.24±0.23 71.83±0.26 71.87±0.23
Dense 77.86±1.73 69.61±4.31 67.26±6.51 66.18±1.93 49.33±7.79 41.48±7.85 72.53±2.61 69.91±6.95 62.99±10.21 70.08±0.24 71.29±0.23 70.94±0.30
None 82.38±0.33 21.49±3.84 21.22±3.71 71.46±0.44 19.59±1.96 20.29±1.79 79.76±0.39 39.14±1.38 38.77±1.20 69.46±0.22 67.96±0.38 45.48±4.50
SGC Residual 81.77±0.28 82.55±0.41 80.14±0.40 71.68±0.33 71.31±0.57 71.00±0.49 78.87±0.29 79.86±0.25 79.07±0.35 69.09±0.13 66.52±0.23 61.83±0.36
Initial 81.40±0.26 83.66±0.38 83.77±0.38 71.60±0.33 72.16±0.30 72.25±0.38 79.11±0.23 79.73±0.23 79.74±0.24 68.93±0.11 69.24±0.16 69.15±0.17
Jumping 77.75±0.65 83.42±0.50 83.88±0.48 69.96±0.37 71.89±0.52 71.88±0.58 77.42±0.30 79.99±0.46 80.07±0.67 68.76±0.17 70.61±0.19 70.65±0.23
Dense 77.31±0.39 81.24±1.12 77.66±2.74 70.99±0.58 67.75±1.85 66.35±5.61 77.12±0.73 72.77±5.12 74.84±1.58 69.39±0.18 71.42±0.28 71.52±0.31
None 79.31±0.37 75.98±1.06 68.45±3.10 72.31±0.38 71.03±1.18 61.92±3.48 78.06±0.31 69.18±0.58 66.61±0.56 61.98±0.08 41.58±0.27 34.22±0.04
Table A13: Test accuracy (%) under different skip connection mechanisms. Experiments with 100 independent repetitions are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with {2, 16, 32}-layer GCN and SGC.

More results of graph normalizations.

Table A14 reports the performance with different graph normalization techniques. The standard deviations are included.
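As one representative of the normalization layers compared in Table A14, the sketch below implements PairNorm following our understanding of Zhao and Akoglu [77]: center the node features, then rescale them so that the average squared row norm stays constant. The benchmark itself relies on the authors' released implementations.

```python
import torch

def pair_norm(x, scale=1.0, eps=1e-6):
    """PairNorm sketch: x is the [num_nodes, dim] node feature matrix."""
    x = x - x.mean(dim=0, keepdim=True)                    # center features across nodes
    row_norm = (x.pow(2).sum(dim=1).mean() + eps).sqrt()   # sqrt of mean squared row norm
    return scale * x / row_norm                            # rescale to a fixed overall scale
```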

Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN BatchNorm 69.91±1.96 61.20±13.88 29.05±10.68 46.27±2.30 26.25±12.11 21.82±7.91 67.15±1.94 58.00±13.42 55.98±12.98 70.44±0.22 70.52±0.46 68.74±0.84
PairNorm 74.43±1.36 55.75±13.19 17.67±9.23 63.26±1.32 27.45±7.22 20.67±6.51 75.67±0.86 71.30±3.00 61.54±13.80 65.74±0.16 65.37±0.69 63.32±0.97
NodeNorm 79.87±0.80 21.46±9.50 21.48±9.41 68.96±1.00 18.81±2.12 19.03±2.35 78.14±0.79 40.92±0.29 40.93±0.29 70.62±0.23 70.75±0.43 29.94±2.29
MeanNorm 82.49±0.35 13.51±3.14 13.03±0.15 70.86±0.36 16.09±3.40 7.70±0.00 78.68±0.41 18.92±3.20 18.00±0.00 69.54±0.15 70.40±0.25 56.94±20.26
GroupNorm 82.41±0.33 41.76±15.07 27.20±11.61 71.30±0.46 26.77±5.68 25.82±5.73 79.78±0.39 70.86±5.55 63.91±8.26 69.70±0.19 70.50±0.31 68.14±1.10
CombNorm 80.00±0.86 55.64±13.23 21.44±7.52 68.59±1.03 18.90±2.17 18.53±1.68 78.11±0.76 40.93±0.29 40.90±0.28 70.71±0.21 71.77±0.31 69.91±0.64
NoNorm 82.43±0.33 21.78±3.32 21.21±3.48 71.40±0.48 19.78±1.95 19.85±2.00 79.75±0.33 39.18±1.38 39.00±1.59 69.45±0.27 67.99±0.31 46.38±3.87
SGC BatchNorm 79.32±0.76 15.86±8.86 14.40±7.02 61.60±1.31 17.34±4.05 17.82±3.86 76.34±0.75 54.22±15.63 29.49±11.83 68.58±0.15 65.54±0.38 62.33±0.61
PairNorm 80.78±0.23 71.26±2.12 51.03±3.25 69.76±0.33 60.14±2.40 50.94±4.76 75.81±0.34 68.89±0.49 62.14±4.33 60.72±0.03 39.69±2.36 26.67±10.82
NodeNorm 78.09±0.94 78.77±0.96 73.93±2.40 63.42±1.45 61.81±1.94 60.22±2.22 71.64±1.42 71.50±1.72 73.30±1.52 63.21±0.06 26.81±1.05 16.18±8.56
MeanNorm 80.22±0.43 48.29±8.55 30.07±8.07 70.78±0.75 38.27±5.79 28.27±5.77 75.07±0.44 47.29±6.02 41.32±1.26 54.86±0.03 21.74±9.95 18.97±9.52
GroupNorm 82.81±0.27 75.81±1.02 74.94±1.87 72.32±0.43 67.54±0.78 61.75±0.76 78.87±0.50 76.43±1.03 74.62±1.29 66.12±0.04 67.29±0.53 66.11±0.46
CombNorm 77.65±1.68 75.16±2.27 74.45±1.86 63.66±1.41 59.97±3.90 54.52±4.79 71.67±2.30 71.50±2.27 72.23±2.46 65.73±0.07 54.37±0.53 47.52±1.40
NoNorm 79.38±0.39 75.93±1.12 68.75±2.59 72.36±0.39 71.06±1.32 62.64±3.68 78.01±0.32 69.06±0.57 66.55±0.56 61.96±0.07 41.43±0.25 34.24±0.07
Table A14: Test accuracy (%) under different graph normalization mechanisms. Experiments with 100 independent repetitions are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with {2, 16, 32}-layer GCN and SGC.

More results of random droppings.

Table A15 shows the performance of diverse random dropping tricks, where all dropout rates are tuned for the best test accuracies. Tables A16 and A17 report the performance achieved under two fixed dropout rates, respectively, for better comparison. For all experiments, the standard deviations over independent repetitions are provided.
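To make the compared mechanisms concrete, the sketch below shows edge dropping in the spirit of DropEdge [46] as we read it: at every training step a random fraction of edges is removed before message passing (DropNode removes nodes instead, and LADIES samples neighbors layer by layer). The function below is a simplified illustration, not the repository's exact code.

```python
import torch

def drop_edge(edge_index, drop_rate=0.5, training=True):
    """Randomly drop a fraction of edges from a [2, num_edges] edge index."""
    if not training or drop_rate <= 0:
        return edge_index
    num_edges = edge_index.size(1)
    keep_mask = torch.rand(num_edges, device=edge_index.device) >= drop_rate
    return edge_index[:, keep_mask]
```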

Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN No Dropout 80.68±0.13 28.56±2.77 29.36±2.76 71.36±0.18 23.19±1.47 23.03±1.13 79.56±0.17 39.85±1.47 40.00±1.59 69.53±0.19 66.14±0.73 41.96±9.01
Dropout 82.39±0.36 21.60±3.54 21.17±3.29 71.43±0.46 19.37±1.88 20.15±1.85 79.79±0.37 39.09±1.23 39.17±1.43 69.40±0.12 67.79±0.54 45.41±3.63
DropNode 77.10±1.04 27.61±4.34 27.65±5.07 69.38±0.89 21.83±3.07 22.18±3.06 77.39±0.98 40.31±1.61 40.38±1.20 66.67±0.16 67.17±0.44 43.81±9.62
DropEdge 79.16±0.73 28.00±3.36 27.87±3.04 70.26±0.70 22.92±1.95 22.92±2.12 78.58±0.58 40.61±1.20 40.50±1.48 68.67±0.17 66.50±0.39 51.70±5.08
LADIES 77.12±0.82 28.07±5.07 27.54±4.19 68.87±0.84 22.52±2.42 22.60±2.23 78.31±0.82 40.07±1.49 40.11±1.36 66.43±0.21 62.05±0.80 40.41±6.22
DropNode+Dropout 81.02±0.75 22.24±8.30 18.81±4.30 70.59±0.80 24.49±6.50 18.23±0.55 78.85±0.64 40.44±1.22 40.37±1.36 68.66±0.14 68.27±0.33 44.18±4.90
DropEdge+Dropout 79.71±0.92 20.45±7.97 21.10±8.17 69.64±0.89 19.77±4.47 18.49±1.71 77.77±0.99 40.71±0.84 40.51±0.96 66.55±0.14 68.81±0.21 49.82±4.16
LADIES+Dropout 78.88±0.79 19.49±8.31 16.92±6.60 69.02±0.88 27.17±6.74 18.54±2.38 78.53±0.77 41.43±2.59 40.70±1.00 66.35±0.17 65.13±0.33 39.99±4.74
SGC No Dropout 77.55±0.08 73.99±0.03 66.80±0.01 71.80±0.00 72.69±0.05 70.50±0.02 77.59±0.03 69.74±0.07 67.81±0.03 62.34±0.07 42.54±0.25 34.76±0.06
Dropout 79.37±0.36 75.91±1.07 68.40±2.57 72.35±0.39 71.21±0.93 62.35±3.15 78.04±0.33 69.12±0.54 66.53±0.68 61.96±0.05 41.47±0.23 34.22±0.06
DropNode 78.57±0.27 76.99±0.31 72.93±0.39 71.87±0.27 72.50±0.20 70.60±0.11 77.63±0.32 72.51±0.29 68.16±0.33 61.21±0.08 40.52±0.11 34.64±0.05
DropEdge 78.68±0.24 70.65±0.80 44.00±0.90 71.94±0.32 69.43±0.57 45.13±0.93 78.26±0.32 68.39±0.26 52.08±0.79 62.06±0.05 41.03±0.23 33.61±0.06
LADIES 78.50±0.27 78.35±0.34 72.71±0.84 71.88±0.28 71.69±0.43 69.80±0.36 77.65±0.23 74.86±0.88 72.27±0.55 61.49±0.07 38.96±0.08 33.17±0.04
DropNode+Dropout 80.60±0.41 74.83±1.64 55.04±4.28 72.33±0.43 70.30±1.56 65.85±1.86 78.10±0.41 67.98±0.67 52.01±1.36 61.54±0.05 39.48±0.13 32.63±0.07
DropEdge+Dropout 80.27±0.50 76.19±1.28 66.08±3.55 72.09±0.45 66.48±3.38 35.55±2.79 77.63±0.37 69.65±0.67 67.55±0.79 60.21±0.07 39.12±0.10 33.81±0.06
LADIES+Dropout 79.81±0.56 74.72±1.34 66.62±2.49 71.85±0.39 69.24±1.83 50.81±4.09 77.46±0.33 70.54±0.36 67.94±0.63 60.27±0.07 31.41±0.03 24.86±1.15
Table A15: Test accuracy (%) under different random dropping mechanisms. Experiments with 100 independent repetitions are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with {2, 16, 32}-layer GCN and SGC. Dropout rates are tuned for the best performance.
Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN DropNode () 77.10±1.04 27.61±4.34 27.65±5.07 69.38±0.89 21.83±3.07 22.18±3.06 77.39±0.98 40.18±1.18 39.88±2.54 66.67±0.16 67.17±0.47 43.81±9.62
DropEdge () 79.16±0.73 28.00±3.36 27.87±3.04 70.26±0.70 22.92±1.95 22.92±2.12 78.58±0.58 40.32±1.63 40.21±1.65 68.67±0.17 66.38±0.60 45.74±5.65
LADIES () 77.12±0.82 28.07±5.07 27.54±4.19 68.87±0.84 22.52±2.42 22.60±2.23 78.31±0.82 39.98±1.76 39.82±1.63 66.43±0.21 62.05±0.80 40.41±6.22
DropNode+Dropout () 81.02±0.75 18.55±4.68 18.81±4.30 70.59±0.80 20.03±4.68 18.16±0.79 78.85±0.64 40.22±1.60 39.92±1.73 68.66±0.14 68.19±0.25 44.18±4.90
DropEdge+Dropout () 79.71±0.92 17.83±4.80 19.32±5.60 69.64±0.89 18.48±1.87 18.49±1.71 77.77±0.99 39.92±1.29 39.75±1.28 66.55±0.14 68.79±0.34 49.82±4.16
LADIES+Dropout () 78.88±0.79 19.49±8.31 16.92±6.60 69.02±0.88 27.17±6.74 18.54±2.38 78.53±0.77 40.01±2.52 39.92±1.36 66.35±0.17 65.13±0.33 39.99±4.74
SGC DropNode () 77.89±0.23 75.22±0.22 69.51±0.40 71.87±0.27 72.50±0.20 70.60±0.11 77.63±0.32 70.28±0.21 68.16±0.33 61.21±0.08 40.52±0.11 34.64±0.05
DropEdge () 78.16±0.24 70.65±0.80 44.00±0.90 71.94±0.32 69.43±0.57 45.13±0.93 78.26±0.32 68.39±0.26 52.08±0.79 62.06±0.05 41.03±0.23 33.61±0.06
LADIES () 77.85±0.28 76.93±0.39 72.14±0.38 71.88±0.28 71.69±0.43 69.80±0.36 77.65±0.23 72.46±0.40 69.93±0.46 61.49±0.07 38.96±0.08 33.17±0.04
DropNode+Dropout () 79.95±0.49 74.83±1.64 55.04±4.28 72.33±0.43 70.30±1.56 65.85±1.86 78.10±0.41 67.98±0.67 52.01±1.36 61.54±0.05 39.48±0.13 32.63±0.07
DropEdge+Dropout () 80.11±0.38 76.19±1.28 66.08±3.55 72.09±0.45 66.48±3.38 35.55±2.79 77.63±0.37 69.65±0.67 67.55±0.79 60.21±0.07 39.12±0.10 33.81±0.06
LADIES+Dropout () 79.63±0.44 74.72±1.34 66.62±2.49 71.85±0.39 69.24±1.83 50.81±4.09 77.46±0.33 69.81±0.82 67.94±0.63 60.27±0.07 31.41±0.03 24.86±1.15
Table A16: Test accuracy (%) under different random dropping mechanisms. Experiments with 100 independent repetitions are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with {2, 16, 32}-layer GCN and SGC. A single fixed dropout rate is shared by all methods.
Backbone Settings Cora Citeseer PubMed OGBN-ArXiv
2 16 32 2 16 32 2 16 32 2 16 32
GCN DropNode () 70.28±1.36 21.09±5.86 20.62±5.92 65.47±1.31 18.68±3.05 19.38±2.84 73.82±1.38 40.31±1.61 40.38±1.20 60.64±0.16 67.17±0.44 35.65±14.39
DropEdge () 75.94±1.19 25.38±4.20 25.27±4.37 68.06±1.08 21.06±2.68 21.34±2.53 76.33±1.03 40.61±1.20 40.50±1.48 66.86±0.23 66.50±0.39 51.70±5.08
LADIES () 70.59±1.15 24.57±7.84 25.77±7.43 65.19±1.39 20.64±2.98 19.98±2.67 75.32±1.39 40.07±1.49 40.11±1.36 60.46±0.19 55.86±0.48 34.11±6.28
DropNode+Dropout () 78.30±1.00 22.24±8.30 18.74±6.68 68.20±1.05 24.49±6.50 18.23±0.55 76.75±1.02 40.44±1.22 40.37±1.36 66.82±0.20 68.27±0.33 42.48±4.50
DropEdge+Dropout () 72.31±1.21 20.45±7.97 21.10±8.17 64.69±1.35 19.77±4.47 18.22±0.90 75.21±1.04 40.71±0.84 40.51±0.96 60.17±0.14 68.81±0.21 47.03±8.06
LADIES+Dropout () 71.65±1.13 17.82±7.84 16.79±7.79 64.28±1.36 23.63±5.84 18.05±1.48 75.89±1.34 41.43±2.59 40.70±1.00 59.96±0.22 59.21±0.40 30.21±1.94
SGC DropNode () 78.57±0.27 76.99±0.31 72.93±0.39 70.84±0.32 71.59±0.20 70.54±0.28 77.38±0.22 72.51±0.29 68.08±0.39 58.29±0.08 38.80±0.06 33.92±0.03
DropEdge () 78.68±0.24 63.92±1.59 29.50±1.66 71.83±0.26 57.23±1.45 19.97±0.48 77.95±0.31 65.97±1.24 48.38±1.69 60.93±0.08 37.01±0.18 28.30±0.70
LADIES () 78.50±0.27 78.35±0.34 72.71±0.84 69.98±0.29 70.66±0.44 69.24±0.39 77.27±0.34 74.86±0.88 72.27±0.55 58.89±0.09 35.78±0.05 31.35±0.02
DropNode+Dropout () 80.60±0.41 68.37±2.54 8.14±0.72 72.08±0.61 63.33±3.27 48.22±2.64 77.94±0.35 64.55±3.05 44.72±0.61 59.79±0.09 34.28±0.06 27.42±0.56
DropEdge+Dropout () 80.27±0.50 59.52±7.94 22.44±8.83 70.86±0.46 27.48±6.40 18.12±0.30 77.37±0.47 68.67±1.43 64.27±1.78 55.73±0.07 36.29±0.05 31.66±0.03
LADIES+Dropout () 79.81±0.56 67.00±3.14 42.08±6.53 70.25±0.49 46.27±4.73 34.90±0.58 76.88±0.45 70.54±0.36 67.86±0.99 56.02±0.08 27.71±0.01 15.65±7.59
Table A17: Test accuracy (%) under different random dropping mechanisms. Experiments with 100 independent repetitions are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with {2, 16, 32}-layer GCN and SGC. A second fixed dropout rate, different from that of Table A16, is shared by all methods.

More results of comparison with previous state-of-the-art frameworks.

We present a more complete comparison with previous state-of-the-art frameworks for training deep GNNs in Table A18. Error bars are also reported in the table.

Model Cora (Ours: 85.48±0.39) Citeseer (Ours: 73.35±0.79) PubMed (Ours: 80.76±0.26) OGBN-ArXiv (Ours: 72.70±0.15)
2 16 32 2 16 32 2 16 32 2 16 32
SGC [60] 79.31±0.37 75.98±1.06 68.45±3.10 72.31±0.38 71.03±1.18 61.92±3.48 78.06±0.31 69.18±0.58 66.61±0.56 61.98±0.08 41.58±0.27 34.22±0.04
DAGNN [33] 80.30±0.78 84.14±0.59 83.39±0.59 18.22±3.48 73.05±0.62 72.59±0.54 77.74±0.57 80.32±0.38 80.58±0.51 67.65±0.52 71.82±0.28 71.46±0.27
GCNII [7] 82.19±0.77 84.69±0.51 85.29±0.47 67.81±0.89 72.97±0.71 73.24±0.78 78.05±1.53 80.03±0.50 79.91±0.27 71.24±0.17 72.61±0.29 72.60±0.25
JKNet [64] 79.06±0.11 72.97±3.94 73.23±3.59 66.98±1.82 54.33±7.74 50.68±8.73 77.24±0.92 64.37±8.80 63.77±9.21 63.73±0.38 66.41±0.56 66.31±0.63
APPNP [26] 82.06±0.46 83.64±0.48 83.68±0.48 71.67±0.78 72.13±0.53 72.13±0.59 79.46±0.47 80.30±0.30 80.24±0.33 65.31±0.23 66.95±0.24 66.94±0.26
GPRGNN [10] 82.53±0.49 83.69±0.55 83.13±0.60 70.49±0.95 71.39±0.73 71.01±0.79 78.73±0.63 78.78±1.02 78.46±1.03 69.31±0.09 70.30±0.15 70.18±0.16
Table A18: Test accuracy (%) comparison with previous state-of-the-art frameworks. Experiments are conducted on Cora, Citeseer, PubMed, and OGBN-ArXiv with {2, 16, 32}-layer GNNs. Performance is averaged over 100 independent repetitions. The superior performance achieved by our best combination of tricks is highlighted in the first row.