1 Introduction
Graph Convolutional Networks (GCNs), which exploit the concept of message passing, or equivalently the neighborhood aggregation function, to extract high-level features from a node as well as its neighbors, have boosted the state of the art for a variety of tasks on graphs, including node classification [1, 32], social recommendation [6, 27], link prediction [21] and many others. In other words, GCNs have become one of the most crucial tools for graph representation learning.
Yet, when we revisit typical successful GCNs, such as the architecture developed in [16], one conspicuous observation is that they are all "shallow": the number of layers is never larger than 2. Deeper variants built by simply stacking more layers can, in principle, access more information, but perform worse [19, 31]. Even with the residual connections that have proved powerful in very deep Convolutional Neural Networks (CNNs), there is still no evidence that GCNs with more than 2 layers perform as well as the 2-layer one on popular benchmarks (e.g. Cora [28]). So the following questions remain: "what are the factors that prevent deeper GCNs from performing well" and "how can we eliminate those factors by developing techniques specific to graphs"; both motivate the study of this paper.
We begin by investigating two opposing factors: overfitting and over-smoothing. The overfitting issue arises when we utilize an over-parameterized model to fit a distribution with limited training data: the model fits the training data very well but generalizes poorly to the testing data. This issue does exist if we apply a deep GCN to small graphs (see the empirical comparisons between the 2-layer GCN and the 6-layer GCN on Cora in Fig. 1). Moreover, the overfitting issue is hard to solve satisfactorily even when we apply certain well-known tricks like weight penalization and dropout [13]; a more efficient method is in demand. By contrast, over-smoothing, at the other extreme, makes training a very deep GCN difficult. As first introduced by [19] and further explained in [30, 31, 17], graph convolutions essentially mix the representations of adjacent nodes with each other, such that, in the extreme case of infinitely many layers, all nodes' representations converge to a stationary point that is unrelated to the input features. We call this phenomenon over-smoothing of node features. To illustrate its effect, we have conducted an example experiment with a 32-layer GCN in Fig. 1, in which the training of such a very deep GCN is observed not to converge and to fail.
Both of the above issues can be addressed using DropEdge. The term "DropEdge" refers to randomly dropping out a certain rate of edges of the input graph at each training iteration. In its particular form, each edge is independently dropped with a fixed probability $p$, with $p$ being a hyperparameter determined by validation. There are several benefits of applying DropEdge to GCN training. First, DropEdge can be considered a data augmentation technique. By DropEdge, we actually generate different random deformed copies of the original graph; as such, we augment the randomness and the diversity of the input data, and are thus better able to prevent overfitting. It is analogous to performing random rotation, cropping, or flipping for robust CNN training in the context of images. Second, DropEdge can also be treated as a message passing reducer. In GCNs, message passing between adjacent nodes is conducted along edge paths. Removing certain edges makes node connections sparser, and hence avoids over-smoothing to some extent when the GCN goes very deep. Indeed, as we will show in this paper, DropEdge theoretically slows down the smoothing of the hidden node features by a certain ratio. Finally, DropEdge is related to but distinct from other concepts, such as the dropout trick [13] that randomly drops out the activation units of the network. Since activation dropout does not perform any data augmentation, its effect on alleviating overfitting is not as strong as that of DropEdge, and it does not help prevent over-smoothing either. We defer more discussion of DropEdge relative to other methods to § 4.1.
We also explore what kinds of architectures can facilitate the training of GCNs and are compatible with our DropEdge. To do so, we first review several successful CNN architectures that operate on images, and then recast them in the graph domain. We study ResNet [11], DenseNet [14] and InceptionNet [29] in this paper. The method in [16] has already brought the residual connections of ResNet into GCNs, but the performance is unsatisfactory. DenseNet, which further generalizes the idea of skip connections, connects each layer to every other layer in a feed-forward fashion. Here, for more efficiency, we instantiate the dense version of GCNs by retaining all short paths from intermediate layers to the output layer but removing all those between intermediate layers. From a graphical perspective, the outputs along the shortcut from a layer $k$ steps away are actually messages from $k$-hop neighborhoods; in other words, DenseNet allows us to obtain multi-hop messages with one single network. Also, the short connections enable direct back-propagation from the loss to lower layers, which alleviates the effect of vanishing gradients observed in deep neural networks.
InceptionNet is another typical CNN structure. In its original design, it performs convolutions in each layer with multiple sizes of kernels/receptive fields, so as to model objects of varying sizes. This property is also crucial for graph data, since the local structures within an input graph are diverse and deserve to be captured. We thus take inspiration from InceptionNet by adopting multiple atomic GCNs of different depths to represent receptive fields of different sizes, and then concatenating all their outputs as an inception block. Stacking these blocks one by one leads to our inception variant of GCNs. Figure 2 illustrates an example of a deep GCN with the inception backbone. We defer more details of the different GCNs to § 4.2.
In the experiments on four public benchmarks (Cora, Citeseer, Pubmed [28], and Reddit [9]), we demonstrate that the residual connections, dense connections and inception blocks are compatible with the DropEdge skill, and that they obtain promising testing error even when the number of layers is large (e.g. larger than 30 in Fig. 1); to the best of our knowledge, this is the deepest GCN that has ever been developed, and more importantly it performs promisingly. Moreover, when equipped with DropEdge, both dense GCNs and inception GCNs are found to promote the training consistently and hence give rise to better performance, compared to the plain GCNs.
2 Related Work
Inspired by the huge success of CNNs in computer vision, a large number of methods have emerged that redefine the notion of convolution on graphs under the umbrella of GCNs. The first prominent research on GCNs is presented in [2], which develops graph convolution based on spectral graph theory. Later, [16, 4, 12, 20, 18] apply improvements, extensions, and approximations to spectral-based GCNs. To contend with the scalability issue of spectral-based GCNs on large graphs, spatial-based GCNs have been rapidly developed [10, 24, 25, 7]. These methods directly perform convolution in the graph domain by aggregating information from neighbor nodes. Recently, several sampling-based methods have been proposed for fast graph representation learning, including the node-wise sampling method [10], the layer-wise approach [3] and its layer-dependent variant [15].

Despite the fruitful progress, most previous works only focus on shallow GCNs, while the deeper extension is seldom discussed. The work by [19] first introduces the concept of over-smoothing in GCNs, but it does not propose a deep GCN that addresses this issue. A follow-up study [17] relieves over-smoothing by using personalized PageRank that additionally involves the rooted node in the message passing loop; however, the accuracy is still observed to decrease when the depth of the GCN increases beyond 2. JKNet [31] employs skip connections for multi-hop message passing, and it enables different neighborhood ranges for better structure-aware representation learning. Unexpectedly, as shown in its experiments, the JKNets that obtain the best accuracy have depth less than 3 on all datasets, except on Cora where the best result is given by the 6-layer network. In this paper, we propose the notion of DropEdge to overcome both the overfitting and over-smoothing issues simultaneously, and combine it with various backbone architectures to drive an in-depth analysis of deep GCNs.
3 Notations and Preliminaries
Notations. Let $\mathcal{G} = (\mathbb{V}, \mathbb{E})$ represent the input graph, with nodes $\mathbb{V} = \{v_1, \dots, v_N\}$, edges $\mathbb{E}$, and $N = |\mathbb{V}|$ defining the number of the nodes. The node features are denoted as $X = \{x_1, \dots, x_N\}$, and the adjacency matrix $A \in \mathbb{R}^{N \times N}$ associates each edge $(v_i, v_j)$ with its element $A_{ij}$. The degrees for all nodes are given by $d = \{d_1, \dots, d_N\}$, where $d_i$ computes the sum of edge weights connected to node $v_i$. For simplicity, we define $D$ as the degree matrix with its diagonal elements given by $d$.
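To make these notations concrete, the following NumPy sketch builds the degree matrix $D$ and the renormalized adjacency used by GCNs from a toy 3-node path graph (the graph and variable names are illustrative, not from the paper):

```python
import numpy as np

# Toy undirected graph on 3 nodes: a path 0 - 1 - 2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

d = A.sum(axis=1)            # node degrees (sum of edge weights per node)
D = np.diag(d)               # degree matrix

# Renormalization trick of [16]: add self-loops, then normalize
# symmetrically: A_hat = D_hat^{-1/2} (A + I) D_hat^{-1/2}.
A_tilde = A + np.eye(A.shape[0])
d_hat = A_tilde.sum(axis=1)
A_hat = A_tilde / np.sqrt(np.outer(d_hat, d_hat))
```

As a sanity check, the renormalized matrix stays symmetric and its largest eigenvalue equals 1, which is why repeated propagation does not blow up.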
PlainGCN. We call the original GCN developed by Kipf and Welling [16] PlainGCN. By defining $h_v^{(l)}$ as the hidden feature in the $l$-th layer for node $v$, the feed-forward propagation becomes

$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right), \qquad (1)$$

where $H^{(l)} = \{h_1^{(l)}, \dots, h_N^{(l)}\}$ are the hidden vectors of the $l$-th layer; $\hat{A} = \hat{D}^{-1/2}(A + I)\hat{D}^{-1/2}$ is the renormalization of the adjacency matrix, with $\hat{D}$ the degree matrix of $A + I$; $\sigma(\cdot)$ is a nonlinear function, i.e. the ReLU function; and $W^{(l)}$ is the filter matrix in the $l$-th layer. We denote a one-layer GCN as computed by Eq. (1) as a Graph Convolutional Layer (GCL) in what follows.

4 DeepGCN
In this section, we first introduce the formulation of DropEdge, and then follow it up by presenting several backbone architectures to extend PlainGCNs.
4.1 DropEdge
To involve randomness in the training data, the DropEdge technique drops out edges of the input graph at each training iteration. Formally, it randomly enforces $Vp$ non-zero elements of the adjacency matrix $A$ to be zeros, where $V$ is the total number of edges and $p$ is the dropping rate. If we denote the resulting adjacency matrix as $A_{drop}$, then its relation to $A$ becomes

$$A_{drop} = A - A', \qquad (2)$$

where $A'$ is a sparse matrix expanded by a random subset of size $Vp$ of the original edges $\mathbb{E}$. Following the idea of [16], we also perform the renormalization trick on $A_{drop}$, leading to $\hat{A}_{drop}$.
It is clear that DropEdge can prevent overfitting, since the model is fed with a different $\hat{A}_{drop}$ at each training iteration. Despite the randomness, the inputs for different training iterations still share the same nodes and input features, hence all inputs can be considered to be drawn from a similar underlying distribution; in other words, the training is still meaningful.
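A minimal sketch of the sampling step, assuming a dense symmetric adjacency matrix and dropping whole undirected edges (the function name and the toy graph are ours, not the authors' implementation):

```python
import numpy as np

def drop_edge(A, p, rng):
    """DropEdge: zero out a random subset of edges of A at rate p.

    A is a symmetric (N, N) adjacency matrix. Roughly p * |E| of the
    undirected edges are removed; the matrix stays symmetric because
    each edge is zeroed in both directions.
    """
    iu, ju = np.triu_indices_from(A, k=1)
    mask = A[iu, ju] != 0                      # existing upper-tri edges
    edges = np.stack([iu[mask], ju[mask]], axis=1)
    n_drop = int(round(p * len(edges)))
    drop = rng.choice(len(edges), size=n_drop, replace=False)
    A_drop = A.copy()
    for i, j in edges[drop]:
        A_drop[i, j] = A_drop[j, i] = 0.0      # remove edge in both directions
    return A_drop

rng = np.random.default_rng(0)
A = np.ones((4, 4)) - np.eye(4)                # complete graph K4: 6 edges
A_drop = drop_edge(A, p=0.5, rng=rng)          # 3 of the 6 edges survive
```

At testing time, per the paper, the whole (undropped) graph is used; the dropping is applied only while training.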
Now we focus on the over-smoothing issue. To be specific, over-smoothing states that all nodes' features will degenerate to a stationary point and become isolated from the input features if we employ a GCN with an infinite number of layers. This impedes model training, for the discriminative information of the input features is eliminated. To reveal what incurs over-smoothing and understand how it acts, we consider the random-walk [23] version of the update in Eq. (1):

$$H^{(l+1)} = \hat{A}_{rw} H^{(l)}, \qquad (3)$$

where we have omitted the nonlinear function and the parameter matrix for simplicity, and $\hat{A}_{rw} = D^{-1} A$. Here, $\hat{A}_{rw}$ can be viewed as the transition probability matrix of a random walk on the graph. When the number of layers $l$ goes to infinity, we arrive at

$$\lim_{l \to \infty} H^{(l)} = \pi, \qquad (4)$$

where the stationary solution $\pi$ has been proved to satisfy $\pi = \hat{A}_{rw} \pi$ regardless of the input state [23]. This implies independence from the initial point, i.e. the input feature $H^{(0)} = X$. In other words, the information of the input features has vanished. In practice, we also observe the same trend in standard GCNs (see Fig. 6).
By virtue of DropEdge, we replace $A$ with $A_{drop}$ of Eq. (2). Although we will still reach the stationary point in the infinite-depth case, we are able to slow down the convergence speed by using $A_{drop}$ instead. This can be validated using the concept of mixing time, which has been studied in random walk theory [23]. As its name implies, the mixing time measures how fast the random walk converges to its limiting distribution. Its computation is given by $\tau(\epsilon) = \min\{l : \Delta(l) \le \epsilon\}$, where $\Delta(l)$ measures the distance between the distribution after $l$ steps and the stationary one.
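The slow-down can be illustrated numerically. The toy simulation below (ours, not an experiment from the paper) iterates the random-walk propagation of Eq. (3) on a complete graph and on the same node set with most edges removed (leaving a cycle); the sparser graph approaches its stationary point far more slowly:

```python
import numpy as np

def rw_smooth(A, H, steps):
    """Iterate the random-walk propagation H <- P H with P = D^{-1}(A + I)
    and record the distance between consecutive iterates, a proxy for how
    fast node features approach the stationary point."""
    A_tilde = A + np.eye(A.shape[0])           # self-loops: aperiodic walk
    P = A_tilde / A_tilde.sum(axis=1, keepdims=True)
    dists = []
    for _ in range(steps):
        H_next = P @ H
        dists.append(np.linalg.norm(H_next - H))
        H = H_next
    return dists

rng = np.random.default_rng(1)
H0 = rng.standard_normal((6, 3))

K6 = np.ones((6, 6)) - np.eye(6)               # complete graph: 15 edges
ring = np.zeros((6, 6))                        # keep only a 6-cycle of edges
for i in range(6):
    ring[i, (i + 1) % 6] = ring[(i + 1) % 6, i] = 1.0

d_full = rw_smooth(K6, H0, steps=20)           # collapses almost immediately
d_ring = rw_smooth(ring, H0, steps=20)         # still moving after 20 steps
```

The dense graph mixes essentially in one step, while the edge-starved cycle retains a visible distance to its limit after 20 steps, mirroring the mixing-time argument.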
Theorem 1.
If $\mathcal{G}'$ is derived from a graph $\mathcal{G}$ by removing an edge, then the mixing time can only increase, i.e.

$$\tau(\mathcal{G}') \ge \tau(\mathcal{G}). \qquad (5)$$
Please refer to the supplemental materials for the proof of Theorem 1.
Corollary 1.
By increasing the mixing time, the deeper layers of a deep GCN converge more slowly towards the limiting distribution [23]. Therefore, DropEdge alleviates the effect of over-smoothing and is more friendly to deeper models.
Our DropEdge is related to the dropout trick [13] and to node sampling methods [3, 15]. The dropout trick perturbs the feature matrix by randomly setting feature entries to zeros, which may reduce the effect of overfitting but does not help against over-smoothing. Node sampling methods drop nodes, either at random or based on the adjacency connections between layers, to reduce the computational complexity. However, node sampling only delivers a sub-graph and discards too much node feature information. In contrast, DropEdge only drops edges without losing the features of the nodes, so more input information is retained.
4.2 Network Architecture Design
Even though we can alleviate over-smoothing and train a deep PlainGCN model with DropEdge, we argue that PlainGCN has an inevitable shortcoming: it does not consider graph locality and treats all nodes within $k$ hops equally. Graph locality [22] is essential to obtain better node representations, since it pays more attention to the nodes that are close to the target node.
The authors of [16] have applied residual connections between hidden layers to facilitate the training of deeper models. The residual connection carries over information from the previous layer, and it can be implemented by additionally adding the identity mapping to the right side of Eq. (1). While it does enable efficient back-propagation via shortcuts, the residual connection is still insufficient for capturing multi-hop neighbor messages, since it only connects the input and output of the same layer. In this section, we generalize the idea of the residual connection and introduce two more powerful building blocks for GCNs: the Dense Block and the Inception Block, inspired by the success of DenseNet [14] and InceptionNet [29], respectively.
Dense Block.
The Dense Block consists of a fixed number of GCLs. Besides the feed-forward connections, we add shortcuts that link the input layer and each GCL to the output layer (see Figure 3(b)). All messages from the top GCL and the skip connections are concatenated together to form the eventual output. In this way, we encode multi-hop neighbor information into the output representation, which allows us to capture diverse local graph structures with different neighbor expanding ranges. For example, suppose a node's representation is sufficiently characterized by its 2-hop neighbor features but not those beyond; the model will then be trained to attend more closely to the 2-hop outputs and overlook the 3-hop ones. From a machine learning point of view, concatenating the outputs from sub-networks of different depths acts like performing an ensemble of different models, and the training will perform model selection, leading to better performance.
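A minimal NumPy sketch of the dense block, assuming square filter matrices so that every hop shares the hidden width (helper names and the toy graph are ours):

```python
import numpy as np

def gcl(A_hat, H, W):
    """One graph convolutional layer with ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

def dense_block(A_hat, H, Ws):
    """Dense block: stack GCLs and concatenate the block input together
    with every intermediate output along the feature axis, so the output
    carries 0-hop up to len(Ws)-hop messages."""
    outs = [H]
    for W in Ws:
        H = gcl(A_hat, H, W)
        outs.append(H)
    return np.concatenate(outs, axis=1)

rng = np.random.default_rng(0)
N, C = 4, 8
A_tilde = np.ones((N, N))                      # toy dense graph + self-loops
A_hat = A_tilde / np.sqrt(np.outer(A_tilde.sum(1), A_tilde.sum(1)))
H0 = rng.standard_normal((N, C))
Ws = [0.1 * rng.standard_normal((C, C)) for _ in range(3)]
out = dense_block(A_hat, H0, Ws)               # shape (N, 4 * C)
```

The first C columns of `out` are the untouched block input, which is exactly the shortcut from the input layer to the output described above.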
Inception Block.
The basic idea of the inception operation [29] is to capture multi-scale object patterns by using convolution kernels of different sizes. In the graph domain, the size of the receptive field is interpreted as the distance, in hops, from the target node to its neighborhood. Similar to the design of the Dense Block, we adopt a sub-network of depth $k$ to define a convolution kernel of size $k$. We concatenate such sub-networks of multiple depths to derive an inception block. As shown in Figure 3(a), we build an inception block that contains 3 sub-networks with depths ranging from 1 to 3. Unlike the Dense Block, the sub-networks in the Inception Block do not share any GCL, by which we expect architectures with more capacity to capture various local graph structures. This is more prone to overfitting since more parameters are involved, but it still works promisingly when combined with our DropEdge. Furthermore, the independence of each sub-network enables a more flexible model and brings more generalization ability to model complex graphs. We provide more discussion in the experimental section.
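An analogous sketch of the inception block, with three independent branches of depths 1 to 3 mirroring Figure 3(a) (helper names, shapes and the toy graph are ours):

```python
import numpy as np

def gcl(A_hat, H, W):
    """One graph convolutional layer with ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

def inception_block(A_hat, H, branches):
    """Inception block: independent sub-networks of increasing depth read
    the same input and their outputs are concatenated. `branches` is a
    list of per-branch weight lists, so no GCL is shared across branches."""
    outs = []
    for Ws in branches:
        Z = H
        for W in Ws:                           # depth of this branch
            Z = gcl(A_hat, Z, W)
        outs.append(Z)
    return np.concatenate(outs, axis=1)

rng = np.random.default_rng(0)
N, C = 4, 8
A_tilde = np.ones((N, N))                      # toy dense graph + self-loops
A_hat = A_tilde / np.sqrt(np.outer(A_tilde.sum(1), A_tilde.sum(1)))
H0 = rng.standard_normal((N, C))
# three branches of depth 1, 2 and 3
branches = [[0.1 * rng.standard_normal((C, C)) for _ in range(d)]
            for d in (1, 2, 3)]
out = inception_block(A_hat, H0, branches)     # shape (N, 3 * C)
```

Unlike the dense block, each branch here owns its weight list, which is the "no shared GCL" design choice discussed above.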
5 Experiments
Datasets.
Following the practice of previous works, we focus on four benchmark datasets varying in graph size and feature type: (1) classifying the research topic of papers in three citation datasets, Cora, Citeseer and Pubmed [28]; (2) predicting which community different posts belong to in the Reddit social network [10]. Note that the tasks on Cora, Citeseer and Pubmed are transductive, meaning all node features are accessible during training, while the task on Reddit is inductive, meaning the testing nodes are unseen during training. We apply the full-supervised training fashion used in [15, 3] on all datasets in our experiments.

Baselines. We compare our models against four baselines: the original GCN model [16] (denoted as PlainGCN), GraphSAGE [10], FastGCN [3] and AsGCN [15]. All baselines contain two layers, and we use their public code for re-implementation. As for DeepGCNs, we compare the variants that use different blocks. Specifically, DeepGCN-I and DeepGCN-D indicate using the inception block and the dense block, respectively. The DropEdge skill is equipped in our models by default; if not, we use the suffix "(ND)" for specification: for example, DeepGCN-D (ND) means no DropEdge is involved. We also apply DropEdge to PlainGCN, denoted as PlainGCN+DropEdge, to evaluate how DropEdge affects the performance of PlainGCN.
Implementation.
We implement our models in PyTorch [26] and use the Adam optimizer for network training. To ensure the reproducibility of the results, the random seed of all experiments is fixed. We fix the number of training epochs (one value for Cora, Citeseer and Pubmed, another for Reddit), and the early stopping criterion is applied during training on all datasets. We utilize the same set of hyperparameters for the models with and without DropEdge to avoid unintentional "hyperparameter hacking". For testing, we utilize the whole graph as input without using DropEdge. We defer more implementation details to the supplementary material.

5.1 Comparison with State-of-the-Art Methods
Table 1 summarizes the classification errors of our methods as well as the baselines on the four datasets. For DeepGCN, the number in parentheses indicates the number of layers it contains; we only report the best results among the architectures with depth ranging from 2 to 15. It is observed that our methods outperform all other baselines on all datasets significantly. When removing DropEdge, both DeepGCN-I and DeepGCN-D exhibit a remarkable performance drop, which demonstrates the necessity of conducting DropEdge in deep GCNs. We also find that PlainGCN with DropEdge delivers much better performance than without DropEdge (over 10% improvement on Cora and Pubmed), justifying the importance of DropEdge once again. Another interesting observation is that DeepGCN-I performs better than DeepGCN-D with DropEdge, but worse without DropEdge. This is consistent with our statement in § 4.2, as DeepGCN-I contains more parameters and is thus more prone to overfitting. Even so, DeepGCN-I works promisingly when combined with DropEdge. In general, the results here confirm the superiority of our proposed methods.
                              Transductive                       Inductive
                              Cora        Citeseer    Pubmed     Reddit
FastGCN                       15.00       22.40       12.00      6.30
GraphSAGE                     17.80       28.60       12.90      5.68
AsGCN                         12.56       20.34       9.40       3.73
PlainGCN                      14.00       22.70       10.20      3.52
DeepGCN-I (No DropEdge)       13.90       22.00       10.60      3.25
DeepGCN-D (No DropEdge)       13.30       20.90       9.70       3.52
PlainGCN+DropEdge             12.80       20.90       9.10       3.46
DeepGCN-I                     11.70 (6)   19.50 (6)   8.60 (14)  3.13 (10)
DeepGCN-D                     11.90 (11)  19.70 (4)   8.60 (10)  3.22 (14)
avg. % reduction by DropEdge  11.6%       8.3%        13.7%      4.6%
5.2 Ablation Studies
Now we conduct a more detailed analysis of our models. Due to the space limit, we only provide the results on Cora and defer the evaluations on the other datasets to the supplementary material. Note that this section mainly focuses on assessing each component of DeepGCNs, without the aim of pushing state-of-the-art results, so we do not perform delicate model selection in what follows. We construct the DeepGCN architecture with one input GCL, one inception/dense block and one output GCL. The number of layers in the inception/dense block is set to 4, thus the complete network depth is 6. As a comparison, we also implement another popular architecture: ResGCN, which shortly connects the input with the output in each intermediate GCL. The hidden dimension, learning rate and weight decay are fixed to 256, 0.005 and 0.0005, respectively. The random seed is fixed and no early stopping is adopted. We train all models for 200 epochs.
The Comparison among Different Architectures. We investigate the converging behaviors of different architectures without DropEdge. Figure 4 displays the training (dashed lines) and validation (bold lines) loss of all architectures of depth 6 on Cora (we provide more experimental results in the supplementary material). We also plot the results of the 2-layer PlainGCN as a reference.

We make two major observations from Figure 4. First, both DeepGCN-I and DeepGCN-D exhibit consistently lower training error and faster convergence compared with PlainGCN, which verifies the benefit of the architecture design of the dense and inception blocks. Second, compared with the 2-layer PlainGCN, all variants of DeepGCNs suffer from overfitting when the number of training epochs is larger than 30. We will demonstrate that carrying out DropEdge helps prevent overfitting in the following experiment.
How important is DropEdge? To justify the importance of DropEdge, we contrast the validation loss of models with and without DropEdge, fixing the dropping rate $p$ for all cases. We first check the results of PlainGCNs. Figure 4(a) summarizes the validation loss for different numbers of layers. It shows that DropEdge generally helps all PlainGCNs obtain lower validation loss, except for the 2-layer one; since the shallow PlainGCN could be free of the overfitting issue, operating DropEdge is unnecessary there. Regarding DeepGCNs, as depicted in Figure 4(b), the architectures with DropEdge outperform those without DropEdge significantly over different numbers of layers (4 and 6) and different types of building block (the inception, dense and also residual block). We also observe that the loss curves of the 6-layer DeepGCN-I and DeepGCN-D are close to each other but clearly lower than that of the 6-layer ResGCN. This confirms the superiority of the dense connection and the inception block over the residual design.
The Justification of Over-Smoothing. We now justify how DropEdge helps alleviate the over-smoothing issue. As discussed in § 4.1, the over-smoothing issue is incurred when the top-layer output of a GCN converges to the stationary point and becomes unrelated to the input features as the depth increases. In other words, the closer the output is to the stationary point, the more serious the over-smoothing issue becomes. Since we are unable to derive the stationary point explicitly, we instead compute the difference between the outputs of adjacent layers to measure the degree of over-smoothing, adopting the Euclidean distance for the difference computation. A lower distance means more serious over-smoothing.
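This metric can be sketched as follows; with identity filters on a toy dense graph, the adjacent-layer distances collapse towards zero within a few layers, qualitatively reproducing the over-smoothing trend (the setup is illustrative, not the paper's experiment):

```python
import numpy as np

def adjacent_layer_distances(A_hat, H, Ws):
    """Run a stack of GCLs and return the Euclidean distance between the
    outputs of adjacent layers; smaller values in the upper layers
    indicate stronger over-smoothing."""
    dists = []
    for W in Ws:
        H_next = np.maximum(A_hat @ H @ W, 0.0)
        dists.append(float(np.linalg.norm(H_next - H)))
        H = H_next
    return dists

rng = np.random.default_rng(0)
N, C = 6, 4
A_tilde = np.ones((N, N))                      # toy dense graph + self-loops
A_hat = A_tilde / np.sqrt(np.outer(A_tilde.sum(1), A_tilde.sum(1)))
H0 = rng.standard_normal((N, C))
Ws = [np.eye(C) for _ in range(8)]             # identity filters isolate
dists = adjacent_layer_distances(A_hat, H0, Ws)  # the smoothing effect
```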
Experiments are conducted on an 8-layer PlainGCN. All parameters are initialized from the same uniform distribution. Figure 6(a) shows the distances at different intermediate layers (from 2 to 6) under different edge dropping rates (0 and 0.8). Clearly, the over-smoothing issue becomes more serious as the layer index grows, which is consistent with our conjecture. Moreover, we find that the model with DropEdge ($p = 0.8$) reveals a higher distance and slower convergence than that without DropEdge ($p = 0$), implying the crucial importance of DropEdge in alleviating over-smoothing.

We are also interested in how over-smoothing behaves after model training. For this purpose, we display the results after 150 epochs of training in Figure 6(b). For PlainGCN without DropEdge ($p = 0$), the difference between the outputs of the 5th and 6th layers is equal to 0, indicating that the hidden features have converged to a certain stationary point. Compatible with this observation, Figure 6(c) shows that the training loss fails to decrease for PlainGCN ($p = 0$). By contrast, PlainGCN with DropEdge ($p = 0.8$) exhibits a more promising outcome, as the distance increases when the number of layers grows. This indicates that PlainGCN ($p = 0.8$) has successfully learned meaningful node representations after training, which can also be validated by the training loss in Figure 6(c).
Results of very deep GCNs. We test whether our methods still help for very deep GCNs, for example when the number of layers exceeds 30. To answer this question, we implement a 32-layer PlainGCN and our 32-layer DeepGCN-I on Cora with DropEdge applied. We report the training and validation loss in Figure 1. We observe that the training of the 32-layer PlainGCN fails to converge, probably due to over-smoothing, while our 32-layer DeepGCN works quite well for both training and validation. This is exciting, as our proposed techniques enable much broader choices of network design with much deeper layers.
6 Conclusion
We have presented DropEdge, a novel and efficient technique to facilitate the development of deep Graph Convolutional Networks (GCNs). By randomly dropping out a certain rate of edges, DropEdge includes more diversity in the input data to prevent overfitting, and reduces message passing in graph convolution to alleviate over-smoothing. Benefiting from DropEdge, we are able to consider various kinds of building blocks in deep GCNs, including the dense and inception blocks. Extensive experiments on Cora, Citeseer, Pubmed and Reddit have verified the effectiveness of DeepGCNs when the proposed techniques are embedded. We expect that our research will open up a new venue for more in-depth exploration of deep GCNs for broader potential applications.
References
 [1] (2011) Node classification in social networks. In Social network data analytics, pp. 115–148. Cited by: §1.
 [2] (2013) Spectral networks and locally connected networks on graphs. In Proceedings of International Conference on Learning Representations, Cited by: §2.
 [3] (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §2, §4.1, §5, §5.
 [4] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §2.
 [5] (2017) Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539. Cited by: §8.
 [6] (2000) Visualizing social networks. Journal of social structure 1 (1), pp. 4. Cited by: §1.
 [7] (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §2.
 [8] (2008) Minimizing effective resistance of a graph. SIAM review 50 (1), pp. 37–66. Cited by: §7.
 [9] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1.
 [10] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035. Cited by: §2, §5, §5.

 [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
 [12] (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §2.
 [13] (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §1, §1, §4.1.
 [14] (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1, §4.2.
 [15] (2018) Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems, pp. 4558–4567. Cited by: §2, §4.1, §5, §5.
 [16] (2017) Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §1, §2, §3, §4.1, §4.2, §5.
 [17] (2019) Predict then propagate: graph neural networks meet personalized pagerank. In Proceedings of the 7th International Conference on Learning Representations, Cited by: §1, §2.
 [18] (2017) CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67 (1), pp. 97–109. Cited by: §2.

 [19] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2.
 [20] (2018) Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
 [21] (2007) The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58 (7), pp. 1019–1031. Cited by: §1.
 [22] (1992) Locality in distributed graph algorithms. SIAM Journal on Computing 21 (1), pp. 193–201. Cited by: §4.2.
 [23] (1993) Random walks on graphs: a survey. Combinatorics, Paul Erdős is eighty 2 (1), pp. 1–46. Cited by: §4.1, §4.1, §7, Corollary 1.

 [24] (2017) Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §2.
 [25] (2016) Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: §2.
 [26] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §5.
 [27] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
 [28] (2008) Collective classification in network data. AI magazine 29 (3), pp. 93. Cited by: §1, §1, §5.
 [29] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1, §4.2, §4.2.
 [30] (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §1.
 [31] (2018) Representation learning on graphs with jumping knowledge networks. In Proceedings of the 35th International Conference on Machine Learning, Cited by: §1, §1, §2.
 [32] (2018) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
7 Proof of Theorem 1
To explain why the conductance of a graph can only decrease after removing edges from it, we adopt some concepts from electrical networks. Consider the graph as an electrical network where each edge represents a unit resistance. Then the effective resistance $R_{uv}$ from node $u$ to node $v$ is defined as the total resistance between $u$ and $v$. Moreover, the conductance of the graph is defined as follows.
Definition 1.
Let $\mathcal{G} = (\mathbb{V}, \mathbb{E})$ be a graph and $S \subset \mathbb{V}$, $\bar{S} = \mathbb{V} \setminus S$. The conductance of the set $S$ is defined as

$$\Phi(S) = \frac{\sum_{i \in S, j \in \bar{S}} A_{ij}}{\min\big(\sum_{i \in S} d_i, \; \sum_{j \in \bar{S}} d_j\big)},$$

and the conductance of the graph is defined as $\Phi(\mathcal{G}) = \min_{S \subset \mathbb{V}} \Phi(S)$.
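For tiny graphs, the conductance of Definition 1 can be computed by brute force; the example below (ours, purely illustrative) also checks the intuition behind Theorem 1 on a concrete pair: removing one edge from a 4-cycle (conductance 1/2) yields a 4-path (conductance 1/3):

```python
from itertools import combinations
import numpy as np

def conductance(A):
    """Brute-force graph conductance: minimize cut(S) / min(vol(S), vol(S_bar))
    over all non-trivial node subsets S (feasible only for tiny graphs)."""
    N = A.shape[0]
    d = A.sum(axis=1)
    best = np.inf
    nodes = range(N)
    for k in range(1, N):
        for S in combinations(nodes, k):
            S = list(S)
            T = [v for v in nodes if v not in S]
            cut = A[np.ix_(S, T)].sum()          # edge weight leaving S
            vol = min(d[S].sum(), d[T].sum())    # smaller side's volume
            best = min(best, cut / vol)
    return best

# 4-cycle vs. the 4-path obtained by removing one of its edges
C4 = np.array([[0, 1, 0, 1],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [1, 0, 1, 0]], dtype=float)
P4 = C4.copy()
P4[0, 3] = P4[3, 0] = 0.0
```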
By the graph theory in [23], the mixing time $\tau$ of the random walk on the graph is bounded by the conductance $\Phi$ as

$$\frac{1}{4\Phi} \le \tau(\epsilon) \le \frac{2}{\Phi^2}\left(\ln \frac{1}{\pi_{\min}} + \ln \frac{1}{\epsilon}\right), \qquad (6)$$

where $\pi_{\min} = \min_i \pi_i$ is the smallest entry of the stationary distribution.
Then, we will have:
Lemma 2.
Let Φ be the conductance of the graph. Then for the hidden vector h^(t) of the t-th layer (with t playing the role of the mixing time), its distance to the stationary point π is bounded by a small error ε, namely ‖h^(t) − π‖ ≤ ε, whenever
t ≥ (2/Φ²) (2 log N + log(1/ε)),
where N is the number of nodes.
Proof.
By constraining the upper bound of the mixing time via Inequality (6), we have
‖h^(t) − π‖ ≤ ε whenever t ≥ (2/Φ²) (log(1/π_min) + log(1/ε)).  (7)
Since π_min ≥ 1/N², we have log(1/π_min) ≤ 2 log N. Therefore, Inequality (7) becomes
t ≥ (2/Φ²) (2 log N + log(1/ε)). ∎
Corollary 2.
By decreasing the conductance of the graph, the lower bound of the mixing time increases.
Note that the threshold in Lemma 2 scales as 1/Φ². Therefore, in order to reach the same level of smoothing, a graph with smaller conductance needs a larger mixing time.
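The 1/Φ² scaling is easy to check numerically. The sketch below assumes the threshold takes the form t ≥ (2/Φ²)(2 log N + log(1/ε)); the absolute numbers are illustrative, but the ratio is not: halving the conductance quadruples the depth needed for the same smoothing level ε.

```python
import math

def smoothing_threshold(phi, n, eps):
    """Assumed Lemma 2 threshold: t >= (2 / phi^2) * (2 * log n + log(1 / eps))."""
    return (2.0 / phi ** 2) * (2.0 * math.log(n) + math.log(1.0 / eps))

t_dense = smoothing_threshold(phi=0.5, n=2708, eps=0.01)    # Cora-sized graph
t_sparse = smoothing_threshold(phi=0.25, n=2708, eps=0.01)  # half the conductance
# t_sparse / t_dense == 4: smaller conductance -> larger mixing time
```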
On the other hand, by Cheeger's inequality, the conductance of the graph can also be bounded by the first eigenvalue gap 1 − λ₂ of the renormalized adjacency matrix as
Φ(G)² / 2 ≤ 1 − λ₂ ≤ 2 Φ(G).
Note that the first eigenvalue λ₁ of the renormalized adjacency matrix is always equal to 1. Moreover, the first eigenvalue gap is bounded by the effective resistance as
1 − λ₂ ≤ (1/d_u + 1/d_v) / R_uv.
Therefore,
Φ(G) ≤ √(2 (1 − λ₂)) ≤ √(2 (1/d_u + 1/d_v) / R_uv).  (8)
By Rayleigh's monotonicity law for effective resistance, R_uv can only increase if an edge not incident to either u or v is removed from the circuit [8]. According to Inequality (8), this implies that the conductance of the graph can only decrease if an edge is removed from the graph G. Consequently, by Inequality (6), DropEdge slows down the mixing.
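The monotonicity of effective resistance can be verified numerically via the standard identity R_uv = (e_u − e_v)ᵀ L⁺ (e_u − e_v), where L⁺ is the Moore–Penrose pseudoinverse of the graph Laplacian. The sketch below (variable names are ours, not the paper's code) removes an edge not incident to u or v and watches R_uv rise:

```python
import numpy as np

def effective_resistance(adj, u, v):
    """R_uv = (e_u - e_v)^T L^+ (e_u - e_v) with L = D - A (unit resistances)."""
    L = np.diag(adj.sum(axis=1)) - adj
    L_pinv = np.linalg.pinv(L)
    e = np.zeros(adj.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ L_pinv @ e)

# 4-node cycle: R_01 = 1 Ohm in parallel with a 3 Ohm path = 0.75
cycle = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
r_before = effective_resistance(cycle, 0, 1)

# Removing edge (2, 3) -- not incident to node 0 or 1 -- breaks the 3 Ohm
# detour, so only the direct edge remains and R_01 rises to 1.0.
pruned = cycle.copy()
pruned[2, 3] = pruned[3, 2] = 0.0
r_after = effective_resistance(pruned, 0, 1)
```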
8 More Details in Experiments
Datasets
The statistics of all datasets are summarized in Table 2.
Datasets  Nodes  Edges  Classes  Features  Training/Validation/Testing  Type 

Cora  2,708  5,429  7  1,433  1,208/500/1,000  Transductive 
Citeseer  3,327  4,732  6  3,703  1,812/500/1,000  Transductive 
Pubmed  19,717  44,338  3  500  18,217/500/1,000  Transductive 
Reddit  232,965  11,606,919  41  602  152,410/23,699/55,334  Inductive 
Self Feature Modeling To emphasize the importance of the self features, we also implement a variant of GCN with self feature modeling [5]:
H^(l+1) = σ( Â H^(l) W^(l) + H^(l) W_self^(l) ),  (9)
where W_self^(l) is the parameter matrix associated with each node's own features.
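A minimal NumPy sketch of one such layer, assuming Eq. (9) takes the common form H^(l+1) = σ(Â H W + H W_self) with σ = ReLU (the function and variable names are ours, for illustration only):

```python
import numpy as np

def gcn_layer_with_self(H, A_hat, W, W_self):
    """One layer of the self-feature variant: ReLU(A_hat @ H @ W + H @ W_self)."""
    return np.maximum(A_hat @ H @ W + H @ W_self, 0.0)

# Shapes: N nodes, C_in input channels, C_out output channels
rng = np.random.default_rng(0)
N, C_in, C_out = 5, 8, 4
A_hat = np.eye(N)                     # stand-in for the renormalized adjacency
H = rng.standard_normal((N, C_in))
H_next = gcn_layer_with_self(H, A_hat,
                             rng.standard_normal((C_in, C_out)),
                             rng.standard_normal((C_in, C_out)))
```

The extra H @ W_self term lets each node retain a transformation of its own features even when aggregation over neighbors smooths the rest.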
Network Architecture and Hyperparameters
We conduct a random search to optimize the hyperparameters for each dataset in Section 5.1. The hyperparameter descriptions are summarized in Table 3, and Table 4 reports the network architectures and hyperparameters. In the "Architecture" column of Table 4, GCL denotes a graph convolution layer, Dn denotes a dense block with n layers, and In denotes an inception block with n layers.

Hyperparameter  Description 

hidden  the number of hidden dimensions in intermediate layers 
lr  learning rate 
weight_decay  L2 regularization weight 
p  DropEdge rate (percentage of edges dropped) 
dropout  dropout rate 
withloop  using self feature modeling 
withbn  using batch normalization 
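For reference, the DropEdge rate p above corresponds to uniformly discarding a fraction p of the edges at each training step; the sketch below is a simplification with illustrative names (the full method also re-normalizes the perturbed adjacency matrix afterwards):

```python
import numpy as np

def drop_edge(edges, p, rng):
    """Keep a uniformly random (1 - p) fraction of the edge list each step.

    edges: (m, 2) array of node-index pairs; returns the retained rows.
    """
    m = edges.shape[0]
    keep = rng.choice(m, size=int(round(m * (1.0 - p))), replace=False)
    return edges[np.sort(keep)]

edges = np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 0],
                  [0, 2], [1, 3], [2, 4], [3, 0], [4, 1]])
kept = drop_edge(edges, p=0.2, rng=np.random.default_rng(0))  # keeps 8 of 10
```

Resampling a fresh edge subset every epoch is what distinguishes this from a one-off graph sparsification.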
Dataset  Model  Architecture  Hyperparameters 

Cora  DeepGCNI  GCL  I1  I1  I1  I1  GCL  hidden:128, lr:0.007, weight_decay:5e-3, p:0.8, dropout:0.9, withbn 
DeepGCND  GCL  D3  D3  D3  GCL  hidden:512, lr:0.0005, weight_decay:5e-4, p:0.6, dropout:0.5  
Citeseer  DeepGCNI  GCL  I2  I2  GCL  hidden:256, lr:0.009, weight_decay:5e-3, p:0.85, dropout:0.9, withloop, withbn 
DeepGCND  GCL  D2  GCL  hidden:128, lr:0.012, weight_decay:5e-4, p:0.85, dropout:0.7, withloop  
Pubmed  DeepGCNI  GCL  I3  I3  I3  I3  GCL  hidden:64, lr:0.0005, weight_decay:1e-4, p:0.6, dropout:0.2, withloop, withbn 
DeepGCND  GCL  D4  D4  GCL  hidden:64, lr:0.001, weight_decay:1e-3, p:0.2, dropout:0.6, withloop, withbn  
Reddit  DeepGCNI  GCL  I2  I2  I2  I2  GCL  hidden:64, lr:0.002, weight_decay:1e-5, sampling_percent:0.6, dropout:0.4, withloop, withbn  
DeepGCND  GCL  D4  D4  D4  GCL  hidden:64, lr:0.003, weight_decay:1e-4, p:0.05, dropout:0.5, withloop 
8.1 More Results in Ablation Studies
The Comparison with Different Architectures.
Fig. 7 reports the training and validation loss on Cora and Citeseer.
How important is DropEdge?
Fig. 8 demonstrates the effect of DropEdge on validation loss: Fig. 8(a) summarizes the validation loss for different numbers of layers, with and without DropEdge, on Citeseer, while Fig. 8(b) summarizes the validation loss of the architectures described in Section 5.2, with and without DropEdge, on Citeseer.

The Justification of Oversmoothing.
Fig. 10 demonstrates more results on the distances of different intermediate layers under different edge dropping rates (0, 0.2, 0.4, 0.6, 0.8).