DropEdge: Towards the Very Deep Graph Convolutional Networks for Node Classification

07/25/2019 ∙ by Yu Rong, et al.

Existing Graph Convolutional Networks (GCNs) are shallow: the number of layers is usually no larger than 2. Deeper variants obtained by simply stacking more layers unfortunately perform worse, even with well-known tricks such as weight penalizing, dropout, and residual connections. This paper reveals that developing deep GCNs mainly encounters two obstacles: over-fitting and over-smoothing. Over-fitting weakens the generalization ability on small graphs, while over-smoothing impedes model training by isolating output representations from the input features as network depth increases. Hence, we propose DropEdge, a novel technique to alleviate both issues. At its core, DropEdge randomly removes a certain number of edges from the input graph, acting as both a data augmenter and a message passing reducer. More importantly, DropEdge enables us to recast a wider range of Convolutional Neural Networks (CNNs) from the image field to the graph domain; in particular, we study DenseNet and InceptionNet in this paper. Extensive experiments on several benchmarks demonstrate that our method allows deep GCNs to achieve promising performance, even when the number of layers exceeds 30, which is the deepest GCN that has ever been proposed.




1 Introduction

Graph Convolutional Networks (GCNs), which exploit message passing, or equivalently neighborhood aggregation, to extract high-level features from a node as well as its neighborhood, have advanced the state of the art for a variety of tasks on graphs, including node classification [1, 32], social recommendation [6, 27], link prediction [21] and many others. In other words, GCNs have become one of the most crucial tools for graph representation learning.

Yet, when we revisit typical successful GCNs, such as the architecture developed in [16], one conspicuous observation is that they are all “shallow”: the number of layers is never larger than 2. Deeper variants obtained by simply stacking more layers can in principle access more information, but perform worse [19, 31]. Even with the residual connections that have proved powerful in very deep Convolutional Neural Networks (CNNs), there is still no evidence that GCNs with more than 2 layers perform as well as the 2-layer one on popular benchmarks (e.g. Cora [28]). So the following questions remain: “what are the factors that impede deeper GCNs from performing well” and “how can we eliminate those factors by developing techniques specific to graphs”, both of which motivate the study in this paper.

Figure 1: Comparisons of the training loss (dashed line) and validation loss (bold line) between various architectures on Cora. We implement 2-layer PlainGCN, 6-layer PlainGCN / DeepGCN and 32-layer PlainGCN / DeepGCN. For DeepGCNs in particular, we use the inception backbone + DropEdge. Regarding PlainGCNs, the 6-layer network gets stuck in over-fitting, attaining lower training error but higher testing error than the 2-layer one; the 32-layer network fails to converge, probably due to over-smoothing. By contrast, our DeepGCNs work well for both training and testing.

We begin by investigating two opposing factors: over-fitting and over-smoothing. Over-fitting arises when we use an over-parameterized model to fit a distribution with limited training data, so that the learned model fits the training data very well but generalizes poorly to the testing data. This issue does exist if we apply a deep GCN to small graphs (see the empirical comparison between the 2-layer and 6-layer GCNs on Cora in Fig. 1). Moreover, over-fitting is hard to solve satisfactorily, even with well-known tricks such as weight penalizing and dropout [13]. A more effective method is in demand. Over-smoothing, at the other extreme, makes training a very deep GCN difficult. As first introduced by [19] and further explained in [30, 31, 17], graph convolutions essentially mix the representations of adjacent nodes with each other, such that, in the extreme case of an infinite number of layers, all nodes’ representations converge to a stationary point and become unrelated to the input features. We call this phenomenon over-smoothing of node features. To illustrate its effect, we have conducted an example experiment with a 32-layer GCN in Fig. 1, in which the training of such a very deep GCN is observed to fail to converge.

Both of the above issues can be addressed using DropEdge. The term “DropEdge” refers to randomly dropping out a certain rate of edges of the input graph at each training iteration. In its particular form, each edge is independently dropped with a fixed probability $p$, with $p$ being a hyper-parameter determined by validation. There are several benefits in applying DropEdge to GCN training. First, DropEdge can be considered a data augmentation technique. With DropEdge, we are actually generating different randomly deformed copies of the original graph; as such, we augment the randomness and the diversity of the input data, and are thus better able to prevent over-fitting. It is analogous to performing random rotation, cropping, or flipping for robust CNN training in the context of images. Second, DropEdge can also be treated as a message passing reducer. In GCNs, the message passing between adjacent nodes is conducted along edge paths. Removing certain edges makes node connections sparser, and hence avoids over-smoothing to some extent when the GCN goes very deep. Indeed, as we will show in this paper, DropEdge theoretically slows down the smoothing of the hidden node features by a certain ratio. Finally, DropEdge is related to but distinct from other concepts, such as the dropout trick [13] that randomly drops out the activation units of the network. Since activation dropout does not perform any data augmentation, its effect on alleviating over-fitting is not as strong as that of DropEdge, and it does not help prevent over-smoothing either. We defer further comparison of DropEdge with other methods to § 4.1.

We also explore what kinds of architectures can facilitate the training of GCNs and are compatible with DropEdge. To do so, we first review several successful CNN architectures that operate on images, and then recast them in the graph domain. We study ResNet [11], DenseNet [14] and InceptionNet [29] in this paper. The method in [16] has already carried the residual connections of ResNets over to GCNs, but the performance is unsatisfactory. DenseNet, which further generalizes the idea of skip connections, connects each layer to every other layer in a feed-forward fashion. Here, for efficiency, we instantiate the dense version of GCNs by retaining all shortcut paths from intermediate layers to the output layer but removing all others between intermediate layers. From a graph perspective, the outputs along the shortcut from a layer $k$ steps away are actually messages from $k$-hop neighborhoods; in other words, DenseNet allows us to obtain multi-hop message outputs with one single network. Also, the shortcut connections enable direct back-propagation from the loss to lower layers, which alleviates the vanishing gradients observed in deep neural networks.

InceptionNet is another typical CNN structure. By its original design, it performs convolutions in each layer with kernels/receptive fields of multiple sizes, so as to model variations in object size. This property is also crucial for graph data, because the local structures within an input graph are diverse and deserve to be captured. We therefore take inspiration from InceptionNet by adopting multiple atomic GCNs of different depths to represent receptive fields of different sizes, and then concatenating all their outputs as an inception block. Stacking these blocks one by one leads to our inception variant of GCNs. Figure 2 illustrates an example of a deep GCN with the inception backbone. We defer more details of the different GCNs to § 4.2.

In experiments on four public benchmarks (namely Cora, Citeseer, Pubmed [28], and Reddit [9]), we demonstrate that residual connections, dense connections and inception blocks are compatible with DropEdge, and that they obtain promising testing error even when the number of layers is large (e.g. larger than 30 in Fig. 1). To the best of our knowledge, this is the deepest GCN that has ever been developed, and more importantly it performs well. Moreover, when equipped with DropEdge, both dense GCNs and inception GCNs consistently improve training and hence give rise to better performance than the plain GCNs.

Figure 2: An example DeepGCN model with three inception blocks.

2 Related Work

Inspired by the huge success of CNNs in computer vision, a large number of methods have redefined the notion of convolution on graphs under the umbrella of GCNs. The first prominent research on GCNs is presented in [2], which develops graph convolution based on spectral graph theory. Later, [16, 4, 12, 20, 18] apply improvements, extensions, and approximations to spectral-based GCNs. To address the scalability issue of spectral-based GCNs on large graphs, spatial-based GCNs have been rapidly developed [10, 24, 25, 7]. These methods directly perform convolution in the graph domain by aggregating information from neighbor nodes. Recently, several sampling-based methods have been proposed for fast graph representation learning, including the node-wise sampling methods [10], the layer-wise approach [3] and its layer-dependent variant [15].

Despite the fruitful progress, most previous works focus only on shallow GCNs, while deeper extensions are seldom discussed. The work in [19] first introduces the concept of over-smoothing in GCNs, but it does not propose a deep GCN that addresses this issue. A follow-up study [17] tackles over-smoothing by using personalized PageRank, which additionally involves the root node in the message passing loop; however, the accuracy is still observed to decrease when the depth of the GCN increases beyond 2. JKNet [31] employs skip connections for multi-hop message passing, and it enables different neighborhood ranges for better structure-aware representation learning. Unexpectedly, as shown in its experiments, the JKNets that obtain the best accuracy have depth less than 3 on all datasets except Cora, where the best result is given by the 6-layer network. In this paper, we propose DropEdge to overcome both the over-fitting and over-smoothing issues simultaneously, and combine it with various backbone architectures to carry out an in-depth analysis of deep GCNs.

3 Notations and Preliminaries

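Notations. Let $\mathcal{G} = (\mathbb{V}, \mathcal{E})$ represent the input graph, with nodes $v_i \in \mathbb{V}$, edges $(v_i, v_j) \in \mathcal{E}$, and $N = |\mathbb{V}|$ defining the number of nodes. The node features are denoted as $X = \{x_1, \dots, x_N\} \in \mathbb{R}^{N \times C}$, and the adjacency matrix $A \in \mathbb{R}^{N \times N}$ associates each edge $(v_i, v_j)$ with its element $A_{ij}$. The degrees of all nodes are given by $d = \{d_1, \dots, d_N\}$, where $d_i$ computes the sum of edge weights connected to node $i$. For simplicity, we define $D$ as the degree matrix with its diagonal elements given by $d$.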

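PlainGCN. We refer to the original GCN developed by Kipf and Welling [16] as PlainGCN. By defining $h_i^{(l)}$ as the hidden feature in the $l$-th layer for node $i$, the feed-forward propagation becomes

$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right), \qquad (1)$$

where $H^{(l)} = \{h_1^{(l)}, \dots, h_N^{(l)}\}$ are the hidden vectors of the $l$-th layer; $\hat{A} = \hat{D}^{-1/2}(A + I)\hat{D}^{-1/2}$ is the re-normalization of the adjacency matrix, with $\hat{D}$ the degree matrix of $A + I$; $\sigma(\cdot)$ is a nonlinear function, i.e., the ReLU function; and $W^{(l)}$ is the filter matrix in the $l$-th layer. In what follows, we refer to a one-layer GCN computed by Eq. (1) as a Graph Convolutional Layer (GCL).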
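As a minimal NumPy sketch of one Graph Convolutional Layer, assuming the standard re-normalization of [16], $\hat{A} = \hat{D}^{-1/2}(A + I)\hat{D}^{-1/2}$, and a ReLU non-linearity. The function names `renormalize` and `gcl`, the toy path graph and the random weights are illustrative choices, not taken from the paper:

```python
import numpy as np

def renormalize(adj):
    """Re-normalization trick: D^{-1/2} (A + I) D^{-1/2}, with D the degrees of A + I."""
    a_tilde = adj + np.eye(adj.shape[0])        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcl(a_hat, h, w):
    """One Graph Convolutional Layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(a_hat @ h @ w, 0.0)

# Toy path graph 0 - 1 - 2 with 4 input features per node (illustrative values).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))       # node features X
w0 = rng.normal(size=(4, 2))      # filter matrix of the first layer

a_hat = renormalize(adj)
h1 = gcl(a_hat, x, w0)
print(h1.shape)  # (3, 2)
```

Stacking several such calls, each with its own filter matrix, gives a deeper PlainGCN under the same assumptions.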

4 DeepGCN

In this section, we first introduce the formulation of DropEdge, and then follow it up by presenting several backbone architectures to extend PlainGCNs.

4.1 DropEdge

To introduce randomness into the training data, the DropEdge technique drops out edges of the input graph at each training iteration. Formally, it randomly enforces $Vp$ non-zero elements of the adjacency matrix $A$ to be zeros, where $V$ is the total number of edges and $p$ is the dropping rate. If we denote the resulting adjacency matrix as $A_{drop}$, then its relation to $A$ becomes

$$A_{drop} = A - A',$$

where $A'$ is a sparse matrix expanded by a random subset of size $Vp$ from the original edges $\mathcal{E}$. Following the idea of [16], we also perform the re-normalization trick on $A_{drop}$, leading to $\hat{A}_{drop}$.
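The edge-dropping step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the graph is taken to be undirected with a symmetric adjacency matrix and no self-loops, both directions of a dropped edge are removed together, and the name `drop_edge` and the 4-cycle toy graph are hypothetical.

```python
import numpy as np

def drop_edge(adj, p, rng):
    """Return A_drop = A - A', where A' holds a random subset of about V*p edges.

    Assumes a symmetric adjacency matrix without self-loops (an assumption of
    this sketch); each undirected edge is counted once and dropped symmetrically.
    """
    rows, cols = np.nonzero(np.triu(adj, k=1))   # each undirected edge once
    num_edges = rows.size                        # V
    num_drop = int(round(num_edges * p))         # V * p, rounded
    picked = rng.choice(num_edges, size=num_drop, replace=False)
    a_prime = np.zeros_like(adj)
    a_prime[rows[picked], cols[picked]] = adj[rows[picked], cols[picked]]
    a_prime[cols[picked], rows[picked]] = adj[cols[picked], rows[picked]]
    return adj - a_prime

# Toy 4-cycle 0 - 1 - 2 - 3 - 0 (4 edges); p = 0.5 is an illustrative rate.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
a_drop = drop_edge(adj, p=0.5, rng=rng)
print(int(a_drop.sum()) // 2)  # 2 of the 4 edges remain
```

Calling `drop_edge` again with the same generator yields a different subgraph, which is what makes each training iteration see a different deformed copy of the graph.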

DropEdge can help prevent over-fitting because the model is fed with a different $A_{drop}$ at each training iteration. Despite the randomness, the inputs for different training iterations still share the same nodes and input features, so all inputs can be considered to be drawn from a similar underlying distribution; in other words, the training is still meaningful.

Now we focus more on the over-smoothing issue. Specifically, over-smoothing means that all nodes’ features degenerate to a stationary point and become isolated from the input features if we employ a GCN with an infinite number of layers. This impedes model training, because the discriminative information of the input features is eliminated. To reveal what incurs over-smoothing and understand how it acts, we consider the random-walk [23] version of the update in Eq. (1):

$$H^{(l+1)} = D^{-1} A \, H^{(l)}, \qquad (2)$$

where we have omitted the non-linear function and parameter matrix for simplicity. Here, $D^{-1} A$ can be viewed as the transition probability matrix of the random walk. When $l$ goes to infinity, we arrive at

$$H^{(\infty)} = \lim_{l \to \infty} \left(D^{-1} A\right)^{l} H^{(0)},$$

where the stationary solution is governed by the stationary distribution $\pi$ of the walk, which has been proved to satisfy $\pi_i = d_i / \sum_j d_j$ regardless of the input state [23]. This implies independence from the initial point, i.e. the input feature $H^{(0)} = X$. In other words, the information of the input features has vanished. In practice, we also observe the same trend in standard GCNs (see Fig. 6).
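The collapse of node features under repeated random-walk propagation can be checked numerically. The sketch below assumes the transition matrix $D^{-1}A$ on a small connected, non-bipartite toy graph (a triangle with one pendant node, chosen for illustration) and relies on the standard random-walk fact that the stationary distribution is proportional to the node degrees; the variable names are hypothetical.

```python
import numpy as np

# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
deg = adj.sum(axis=1)
p_rw = adj / deg[:, None]            # transition matrix D^{-1} A (rows sum to 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # illustrative input features, 3 per node

# Propagate 100 times without non-linearity or weights.
h = x.copy()
for _ in range(100):
    h = p_rw @ h

# Stationary distribution of the walk: pi_i = d_i / sum_j d_j.
pi = deg / deg.sum()
limit = pi @ x                       # the single row every node converges to

print(np.allclose(h, limit))         # True: all nodes share the same features
```

After enough propagation steps every node carries the same degree-weighted average of the inputs, so the nodes can no longer be told apart from their hidden features.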

By virtue of DropEdge, we replace $A$ with $A_{drop}$ in Eq. (2). Although we still reach the stationary point in the infinite-depth case, we are able to slow down the convergence speed by using $A_{drop}$ instead. This can be validated using the concept of mixing time, which has been studied in random walk theory [23]. As its name implies, mixing time measures how fast the random walk converges to its limiting distribution.

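Theorem 1.

If a graph $\mathcal{G}_{drop}$ is obtained from a graph $\mathcal{G}$ by removing an edge, then the mixing time can only increase, i.e. the mixing time of $\mathcal{G}_{drop}$ is no smaller than that of $\mathcal{G}$.

Please refer to the supplemental materials for the proof of Theorem 1.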

Corollary 1.

By increasing the mixing time, DropEdge makes the deeper layers of a deep GCN converge more slowly towards their limiting distribution [23]. Therefore, DropEdge alleviates the effect of over-smoothing and is friendlier to deeper models.

DropEdge is related to the dropout trick [13] and node sampling methods [3, 15]. Dropout perturbs the feature matrix by randomly setting features to zero, which may reduce over-fitting but does not help with over-smoothing. Node sampling methods drop nodes, either at random or based on the adjacency connections between layers, to reduce the computational complexity. However, node sampling only delivers a sub-graph and discards too much node feature information. In contrast, DropEdge only drops edges without losing the features of the nodes, so more input information is retained.

4.2 Network Architecture Design

Figure 3: Two basic building blocks in DeepGCN.

Even though we can alleviate over-smoothing and train the $L$-layer PlainGCN model with DropEdge, we argue that PlainGCN has an inherent shortcoming: it does not consider graph locality and treats all nodes within $L$ hops equally. Graph locality [22] is essential for obtaining better node representations, since it pays more attention to the nodes that are close to the target node.

The authors of [16] have applied residual connections between hidden layers to facilitate the training of deeper models. The residual connection carries over information from the previous layer, and it can be implemented by adding the identity mapping to the right-hand side of Eq. (1). While it does enable efficient back-propagation via shortcuts, the residual connection is still insufficient for capturing multi-hop neighbor messages, since it only connects the input and output of the same layer. In this section, we generalize the idea of the residual connection and introduce two more powerful building blocks for GCNs: the Dense Block and the Inception Block, inspired by the success of DenseNet [14] and InceptionNet [29], respectively.
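Reading "adding the identity mapping to the right-hand side of Eq. (1)" literally, and assuming that the hidden features of consecutive layers share the same dimension so the sum is well defined, a residual GCL can be written as the following sketch. Here $\hat{A}$ denotes the re-normalized adjacency matrix, $W^{(l)}$ the filter matrix and $\sigma$ the non-linearity of a GCL:

```latex
H^{(l+1)} = \sigma\!\left(\hat{A}\, H^{(l)} W^{(l)}\right) + H^{(l)}
```

The shortcut term $H^{(l)}$ only links a layer's own input to its output, which is why multi-hop messages are not exposed directly to the output layer in this design.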

Dense Block.

The Dense Block consists of a fixed number of GCLs. Besides the feed-forward connections, we add shortcuts that link each GCL and the input layer to the output layer (see Figure 3(b)). All messages from the top GCL and the skip connections are concatenated together to form the final output. In this way, we encode multi-hop neighbor information into the output representation, which allows us to capture diverse local graph structures with different neighbor expansion ranges. For example, if a node’s representation is sufficiently characterized by its 2-hop neighbor features but not those beyond, the model will be trained to attend more closely to the 2-hop outputs and overlook the 3-hop ones. From a machine learning point of view, concatenating the outputs from sub-networks of different depths acts like an ensemble of different models, and training performs model selection, leading to better performance.
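A minimal NumPy sketch of the Dense Block as described above: the GCLs are chained, and the block input plus the output of every GCL are concatenated along the feature axis. The helper names, the placeholder adjacency matrix and the layer widths are assumptions made for illustration.

```python
import numpy as np

def gcl(a_hat, h, w):
    """One Graph Convolutional Layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(a_hat @ h @ w, 0.0)

def dense_block(a_hat, h_in, weights):
    """Chain the GCLs and concatenate the input and every GCL output."""
    outputs = [h_in]                 # shortcut from the input layer
    h = h_in
    for w in weights:                # the GCLs share one feed-forward chain
        h = gcl(a_hat, h, w)
        outputs.append(h)            # shortcut from this GCL to the output
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
a_hat = np.full((5, 5), 0.2)         # placeholder for a re-normalized adjacency
h_in = rng.normal(size=(5, 4))       # 5 nodes, 4 input features
dims = [4, 8, 8, 8]                  # three GCLs with 8 hidden units each
weights = [rng.normal(size=(dims[i], dims[i + 1])) for i in range(3)]

out = dense_block(a_hat, h_in, weights)
print(out.shape)  # (5, 28): 4 input features + 3 * 8 hidden features
```

Because the output of the $k$-th GCL aggregates $k$-hop information, the concatenated output exposes every hop range to whatever layer follows the block.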

Inception Block.

The basic idea of the inception operation [29] is to capture multi-scale object patterns by using convolution kernels of different sizes. In the graph domain, the size of the receptive field is interpreted as the distance (number of hops) from the target node to its neighborhood. Similar to the design of the Dense Block, we adopt a sub-network of depth $k$ to define the convolution kernel of size $k$. We concatenate those sub-networks of multiple depths to derive an inception block. As shown in Figure 3(a), we build an inception block that contains 3 sub-networks with depths ranging from 1 to 3. Unlike in the Dense Block, the sub-networks in the Inception Block do not share any GCL, which gives the architecture more capacity to capture various local graph structures. This is more prone to over-fitting since more parameters are involved, but it still works well when combined with DropEdge. Furthermore, the independence of each sub-network yields a more flexible model and brings more generalization ability for modeling complex graphs. We provide more discussion in the experimental section.
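The Inception Block can be sketched in the same style: each branch is an independent stack of GCLs with its own filter matrices, and only the final output of each branch is concatenated. The helper names `inception_block` and `make_branch`, the placeholder adjacency matrix and the widths are hypothetical choices for this illustration.

```python
import numpy as np

def gcl(a_hat, h, w):
    """One Graph Convolutional Layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(a_hat @ h @ w, 0.0)

def inception_block(a_hat, h_in, branches):
    """Run independent sub-networks of different depths and concatenate them."""
    outs = []
    for branch in branches:          # each branch owns its filter matrices
        h = h_in
        for w in branch:
            h = gcl(a_hat, h, w)
        outs.append(h)
    return np.concatenate(outs, axis=1)

def make_branch(depth, in_dim, hidden, rng):
    """Random filter matrices for a sub-network with `depth` GCLs."""
    dims = [in_dim] + [hidden] * depth
    return [rng.normal(size=(dims[i], dims[i + 1])) for i in range(depth)]

rng = np.random.default_rng(0)
a_hat = np.full((5, 5), 0.2)         # placeholder for a re-normalized adjacency
h_in = rng.normal(size=(5, 4))       # 5 nodes, 4 input features

# Three sub-networks with depth 1, 2 and 3, as in the block described above.
branches = [make_branch(d, 4, 8, rng) for d in (1, 2, 3)]
out = inception_block(a_hat, h_in, branches)
num_filters = sum(len(b) for b in branches)

print(out.shape)     # (5, 24): three branches with 8 output features each
print(num_filters)   # 6 filter matrices, versus 3 for a 3-GCL dense block
```

The larger number of filter matrices is the extra capacity, and the extra over-fitting risk, that the text attributes to the inception design.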

5 Experiments


Following the practice of previous works, we focus on four benchmark datasets varying in graph size and feature type: (1) classifying the research topic of papers in three citation datasets: Cora, Citeseer and Pubmed [28]; (2) predicting which community different posts belong to in the Reddit social network [10]. Note that the tasks on Cora, Citeseer and Pubmed are transductive, meaning that all node features are accessible during training, while the task on Reddit is inductive, meaning that the testing nodes are unseen during training. We apply the fully supervised training setup used in [15, 3] on all datasets in our experiments.

Baselines. We compare our models against four baselines: the original GCN model [16] (denoted as PlainGCN), GraphSAGE [10], FastGCN [3] and AsGCN [15]. All baselines contain two layers, and we use their publicly released code for re-implementation. As for DeepGCNs, we compare the variants that use different blocks. Specifically, DeepGCN-I and DeepGCN-D denote the use of the inception block and the dense block, respectively. Our models use DropEdge by default; otherwise, we add the suffix “(ND)”: for example, DeepGCN-D (ND) means that no DropEdge is involved. We also apply DropEdge to PlainGCN, denoted as PlainGCN+DropEdge, to evaluate how DropEdge affects the performance of PlainGCN.


We implement our models in PyTorch and use the Adam optimizer for network training. To ensure the reproducibility of the results, the random seed of all experiments is fixed. The number of training epochs is fixed, with one value for Cora, Citeseer and Pubmed and another for Reddit. Early stopping is applied during training on all datasets. We use the same set of hyper-parameters for each model with and without DropEdge to avoid unintentional “hyper-parameter hacking”. For testing, we use the whole graph as the input without DropEdge. We defer more implementation details to the supplementary material.
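The train/test asymmetry described above (a freshly perturbed graph at each training iteration, the whole graph at test time, a fixed random seed) can be sketched as follows. This sketch uses the independent per-edge form of DropEdge and assumes an undirected graph without self-loops; the function name `sample_adjacency`, the rate `p = 0.5` and the three-iteration loop are illustrative, and the optimization step itself is omitted.

```python
import numpy as np

def sample_adjacency(adj, p, training, rng):
    """Training: drop each undirected edge independently with probability p.
    Testing: return the whole graph unchanged (no DropEdge)."""
    if not training:
        return adj
    upper = np.triu(adj, k=1)                   # each undirected edge once
    keep = rng.random(upper.shape) >= p         # keep with probability 1 - p
    kept = upper * keep
    return kept + kept.T                        # restore symmetry

# Toy 4-cycle; the fixed seed mirrors the fixed-seed protocol described above.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
p = 0.5                                         # hypothetical dropping rate

# One freshly sampled graph per training iteration (model update omitted).
train_graphs = [sample_adjacency(adj, p, True, rng) for _ in range(3)]

# Evaluation always sees the original graph.
a_test = sample_adjacency(adj, p, False, rng)
```

Keeping the sampling inside one function makes it harder to evaluate on a perturbed graph by accident.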

5.1 Comparison with State-of-the-art Methods

Table 1 summarizes the classification errors of our method and the baselines on the four datasets. For DeepGCN, the number in parentheses indicates the number of layers; we only report the best result among architectures with depth ranging from 2 to 15. Our methods outperform all baselines on all datasets. When DropEdge is removed, both DeepGCN-I and DeepGCN-D exhibit a remarkable performance drop, which shows the necessity of DropEdge in deep GCNs. We also find that PlainGCNs with DropEdge deliver better performance than those without DropEdge (a relative error reduction of roughly 9% on Cora and 11% on Pubmed according to Table 1), again justifying the importance of DropEdge. Another interesting observation is that DeepGCN-I performs better than DeepGCN-D with DropEdge on Cora, Citeseer and Reddit, but worse without DropEdge on Cora, Citeseer and Pubmed. This is consistent with our statement in § 4.2, as DeepGCN-I contains more parameters and is more prone to over-fitting. Even so, DeepGCN-I works well when combined with DropEdge. In general, the results here confirm the advantage of our proposed methods.

| Method | Cora (transductive) | Citeseer (transductive) | Pubmed (transductive) | Reddit (inductive) |
| --- | --- | --- | --- | --- |
| FastGCN | 15.00 | 22.40 | 12.00 | 6.30 |
| GraphSAGE | 17.80 | 28.60 | 12.90 | 5.68 |
| AsGCN | 12.56 | 20.34 | 9.40 | 3.73 |
| PlainGCN | 14.00 | 22.70 | 10.20 | 3.52 |
| DeepGCN-I (No DropEdge) | 13.90 | 22.00 | 10.60 | 3.25 |
| DeepGCN-D (No DropEdge) | 13.30 | 20.90 | 9.70 | 3.52 |
| PlainGCN+DropEdge | 12.80 | 20.90 | 9.10 | 3.46 |
| DeepGCN-I | 11.70 (6) | 19.50 (6) | 8.60 (14) | 3.13 (10) |
| DeepGCN-D | 11.90 (11) | 19.70 (4) | 8.60 (10) | 3.22 (14) |
| Avg. % error reduction from DropEdge | 11.6% | 8.3% | 13.7% | 4.6% |

Table 1: Error rates (%) in comparison with state-of-the-art methods.
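The last row of Table 1 is not defined explicitly in the text. One reading that matches the reported figures is the relative error reduction obtained by adding DropEdge, averaged over PlainGCN, DeepGCN-I and DeepGCN-D. The short check below recomputes the row under that assumption, using only the error rates listed in the table; the variable names are illustrative.

```python
datasets = ["Cora", "Citeseer", "Pubmed", "Reddit"]

# (errors without DropEdge, errors with DropEdge), in the dataset order above.
pairs = {
    "PlainGCN":  ([14.00, 22.70, 10.20, 3.52], [12.80, 20.90, 9.10, 3.46]),
    "DeepGCN-I": ([13.90, 22.00, 10.60, 3.25], [11.70, 19.50, 8.60, 3.13]),
    "DeepGCN-D": ([13.30, 20.90, 9.70, 3.52], [11.90, 19.70, 8.60, 3.22]),
}

avg_reduction = {}
for i, name in enumerate(datasets):
    # Relative error reduction per model, then averaged over the three models.
    rel = [(without[i] - with_de[i]) / without[i]
           for without, with_de in pairs.values()]
    avg_reduction[name] = round(100 * sum(rel) / len(rel), 1)

print(avg_reduction)
# {'Cora': 11.6, 'Citeseer': 8.3, 'Pubmed': 13.7, 'Reddit': 4.6}
```

For instance, on Cora the three relative reductions are about 8.6%, 15.8% and 10.5%, whose mean is 11.6%.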

5.2 Ablation Studies

We now continue with a more detailed analysis of our models. Due to the space limit, we only provide the results on Cora, and defer the evaluations on other datasets to the supplementary material. Note that this section mainly focuses on assessing each component of DeepGCNs, without the aim of pushing state-of-the-art results, so we do not perform careful model selection in what follows. We construct a DeepGCN architecture that contains one input GCL, one inception/dense block and one output GCL. The number of layers in the inception/dense block is set to 4, so the complete network depth is 6. As a comparison, we also implement another popular architecture: ResGCN, which adds a shortcut connection from input to output in each intermediate GCL. The hidden dimension, learning rate and weight decay are fixed to 256, 0.005 and 0.0005, respectively. The random seed is fixed and no early stopping is adopted. We train all models for 200 epochs.

Comparison of Different Architectures. We investigate the convergence behavior of different architectures without DropEdge. Figure 4 displays the training loss (dashed line) and validation loss (bold line) of all architectures of depth 6 on Cora (we provide more experimental results in the supplementary material). We also plot the results of the 2-layer PlainGCN as a reference.

We have two major observations from Figure 4. First, both DeepGCN-I and DeepGCN-D exhibit consistently lower training error and faster convergence than PlainGCN, which verifies the benefit of the architecture design of the dense and inception blocks. Second, compared with the 2-layer PlainGCN, all variants of DeepGCNs suffer from over-fitting once training exceeds 30 epochs. In the following experiment, we demonstrate that DropEdge helps prevent this over-fitting.

Figure 4: Comparison of different architectures; the left sub-figure shows training loss and the right shows validation loss. PlainGCN-$n$ denotes PlainGCN of depth $n$; similar notation is used for the other methods.

How important is DropEdge? To justify the importance of DropEdge, we contrast the validation loss of models with and without DropEdge. The dropping rate $p$ is fixed to the same value in all cases. We first check the results for PlainGCNs. Figure 5(a) summarizes the validation loss for different numbers of layers. It shows that DropEdge generally helps all PlainGCNs obtain lower validation loss, except for the 2-layer one; since the shallow PlainGCN may be free of over-fitting, applying DropEdge is unnecessary there. Regarding DeepGCNs, as depicted in Figure 5(b), the architectures with DropEdge significantly outperform those without DropEdge across different numbers of layers (4 and 6) and different types of building block (the inception, dense and residual blocks). We also observe that the loss curves of DeepGCN-I-6 and DeepGCN-D-6 are close to each other but clearly lower than that of ResGCN-6. This supports the advantage of the dense connection and inception block over the residual design.

Justification of Over-Smoothing. We now examine how DropEdge helps alleviate the over-smoothing issue. As discussed in § 4.1, over-smoothing occurs when the top-layer output of a GCN converges to the stationary point and becomes unrelated to the input features as depth increases. In other words, the closer the output is to the stationary point, the more serious the over-smoothing issue becomes. Since we are unable to derive the stationary point explicitly, we instead compute the difference between the outputs of adjacent layers to measure the degree of over-smoothing. We adopt the Euclidean distance for the difference computation. A lower distance means more serious over-smoothing.
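A minimal sketch of this measurement, assuming that the distance is the Euclidean (Frobenius) norm of the difference between the hidden feature matrices of consecutive layers and that all layers share the same width. Identity filter matrices and non-negative toy features are used here only to isolate the effect of propagation; the function names and the toy graph are hypothetical.

```python
import numpy as np

def renormalize(adj):
    """Re-normalization trick: D^{-1/2} (A + I) D^{-1/2}, with D the degrees of A + I."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcl(a_hat, h, w):
    """One Graph Convolutional Layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(a_hat @ h @ w, 0.0)

def adjacent_layer_distances(a_hat, x, weights):
    """Euclidean distance between the outputs of consecutive layers."""
    hidden = []
    h = x
    for w in weights:
        h = gcl(a_hat, h, w)
        hidden.append(h)
    return [float(np.linalg.norm(b - a)) for a, b in zip(hidden, hidden[1:])]

# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(4, 3)))      # non-negative toy features
weights = [np.eye(3)] * 8                # 8 layers with identity filters

dists = adjacent_layer_distances(renormalize(adj), x, weights)
print(len(dists))  # 7 distances for 8 layers
```

In this setting the distances shrink with depth, which is the signature of over-smoothing that the measurement is meant to expose.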

Experiments are conducted on an 8-layer PlainGCN. All parameters are initialized from the same uniform distribution. Figure 6 (a) shows the distances of different intermediate layers (from 2 to 6) under different edge dropping rates (0 and 0.8). Clearly, over-smoothing becomes more serious as the layer index grows, which is consistent with our conjecture. Moreover, we find that the model with DropEdge ($p = 0.8$) exhibits larger distances and slower convergence than that without DropEdge ($p = 0$), implying the crucial importance of DropEdge in alleviating over-smoothing.

We are also interested in how over-smoothing behaves after model training. For this purpose, we display the results after 150 epochs of training in Figure 6 (b). For PlainGCN ($p = 0$), the difference between the outputs of the 5th and 6th layers is equal to 0, indicating that the hidden features have converged to a stationary point. Consistent with this observation, Figure 6 (c) shows that the training loss fails to decrease for PlainGCN ($p = 0$). By contrast, PlainGCN ($p = 0.8$) exhibits more promising outcomes, as the distance increases when the number of layers grows. This indicates that PlainGCN ($p = 0.8$) has successfully learned meaningful node representations after training, which can also be validated by the training loss in Figure 6 (c).

Results of very deep GCNs. We test whether our methods still help for very deep GCNs, for example when the number of layers exceeds 30. To answer this question, we implement a 32-layer PlainGCN and our 32-layer DeepGCN-I on Cora, using a fixed edge dropping rate. We report the training and validation loss in Figure 1. We observe that the training of the 32-layer PlainGCN fails to converge, probably due to over-smoothing, while our 32-layer DeepGCN works quite well for both training and validation. This is exciting, as our proposed techniques enable broader choices for network design with much deeper layers.

(a) PlainGCN with and without DropEdge.
(b) Performance comparisons with and without DropEdge.
Figure 5: The validation loss on different architectures.
Figure 6: Analysis on the over-smoothing issue. Smaller distance means more serious over-smoothing.

6 Conclusion

We have presented DropEdge, a novel and efficient technique to facilitate the development of deep Graph Convolutional Networks (GCNs). By randomly dropping out a certain rate of edges, DropEdge introduces more diversity into the input data to prevent over-fitting, and reduces message passing in graph convolution to alleviate over-smoothing. Thanks to these benefits, DropEdge enables us to consider various kinds of building blocks in deep GCNs, including the dense and inception blocks. Extensive experiments on Cora, Citeseer, Pubmed and Reddit have verified the effectiveness of deep GCNs when the proposed techniques are embedded. We expect that our research will open up a new avenue for more in-depth exploration of deep GCNs for broader potential applications.


  • [1] S. Bhagat, G. Cormode, and S. Muthukrishnan (2011) Node classification in social networks. In Social network data analytics, pp. 115–148. Cited by: §1.
  • [2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. In Proceedings of International Conference on Learning Representations, Cited by: §2.
  • [3] J. Chen, T. Ma, and C. Xiao (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In Proceedings of the 6th International Conference on Learning Representations, Cited by: §2, §4.1, §5, §5.
  • [4] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852. Cited by: §2.
  • [5] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539. Cited by: §8.
  • [6] L. C. Freeman (2000) Visualizing social networks. Journal of social structure 1 (1), pp. 4. Cited by: §1.
  • [7] H. Gao, Z. Wang, and S. Ji (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §2.
  • [8] A. Ghosh, S. Boyd, and A. Saberi (2008) Minimizing effective resistance of a graph. SIAM review 50 (1), pp. 37–66. Cited by: §7.
  • [9] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1.
  • [10] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1025–1035. Cited by: §2, §5, §5.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • [12] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §2.
  • [13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §1, §1, §4.1.
  • [14] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1, §4.2.
  • [15] W. Huang, T. Zhang, Y. Rong, and J. Huang (2018) Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems, pp. 4558–4567. Cited by: §2, §4.1, §5, §5.
  • [16] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, Cited by: §1, §1, §2, §3, §4.1, §4.2, §5.
  • [17] J. Klicpera, A. Bojchevski, and S. Günnemann (2019) Predict then propagate: graph neural networks meet personalized pagerank. In Proceedings of the 7th International Conference on Learning Representations, Cited by: §1, §2.
  • [18] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein (2017) Cayleynets: graph convolutional neural networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67 (1), pp. 97–109. Cited by: §2.
  • [19] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2.
  • [20] R. Li, S. Wang, F. Zhu, and J. Huang (2018) Adaptive graph convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
  • [21] D. Liben-Nowell and J. Kleinberg (2007) The link-prediction problem for social networks. Journal of the American society for information science and technology 58 (7), pp. 1019–1031. Cited by: §1.
  • [22] N. Linial (1992) Locality in distributed graph algorithms. SIAM Journal on Computing 21 (1), pp. 193–201. Cited by: §4.2.
  • [23] L. Lovász et al. (1993) Random walks on graphs: a survey. Combinatorics, Paul Erdős is Eighty 2 (1), pp. 1–46. Cited by: §4.1, §4.1, §7, Corollary 1.
  • [24] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §2.
  • [25] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In International conference on machine learning, pp. 2014–2023. Cited by: §2.
  • [26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §5.
  • [27] B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
  • [28] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine 29 (3), pp. 93. Cited by: §1, §1, §5.
  • [29] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1, §4.2, §4.2.
  • [30] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §1.
  • [31] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In Proceedings of the 35th International Conference on Machine Learning, Cited by: §1, §1, §2.
  • [32] M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.

7 Proof of Theorem 1

To explain why the conductance of a graph can only decrease after edges are removed from it, we borrow some concepts from electrical networks. Treat the graph as an electrical network in which each edge is a unit resistor. The effective resistance between two nodes is then defined as the total resistance between them. The conductance of the graph is defined as follows.

Definition 1.

Let as a graph and , . The conductance of the is defined as:

and the conductance of the graph is defined as

By the graph theory in [23], the mixing time () of the graph is bounded by the conductance of the graph as


Then, we will have:

Lemma 2.

Let be the conductance of the graph. Then for the hidden vectors of the -th layer (mixing time), , its distance to the stationary point is bounded by a small error, , where . Namely , whenever

where is the number of nodes.


By constraining the upper bound of the mixing time via inequality (6), we have


Since , then . Therefore, the inequality (7) becomes

Corollary 2.

Decreasing the conductance of the graph increases the lower bound on the mixing time.

Note that . Therefore, to reach the same level of $\epsilon$-smoothing, a graph with smaller conductance needs a larger mixing time.
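As a toy illustration of Corollary 2 (not the authors' experiment), the NumPy sketch below repeatedly applies the re-normalized adjacency matrix to a one-hot feature vector and counts how many propagation steps pass before the result is within a tolerance of its stationary direction. The function names, the two 4-node graphs, and the tolerance `eps` are assumptions chosen for illustration; the path graph is the complete graph with three edges removed.

```python
import numpy as np


def renormalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense 0/1 adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def steps_until_smooth(adj, x, eps=1e-3, max_steps=1000):
    """Count propagation steps until x is within eps of the stationary direction."""
    a_hat = renormalized_adjacency(adj)
    # The eigenvector for eigenvalue 1 is proportional to sqrt(degree + 1).
    u = np.sqrt(adj.sum(axis=1) + 1.0)
    u /= np.linalg.norm(u)
    h = x.astype(float)
    for step in range(1, max_steps + 1):
        h = a_hat @ h
        residual = h - u * (u @ h)                  # part outside the stationary direction
        if np.linalg.norm(residual) < eps:
            return step
    return max_steps


# Complete graph on 4 nodes, and the path 0-1-2-3 obtained by removing three edges.
complete = np.ones((4, 4)) - np.eye(4)
path = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    path[i, j] = path[j, i] = 1.0

x0 = np.array([1.0, 0.0, 0.0, 0.0])
steps_complete = steps_until_smooth(complete, x0)
steps_path = steps_until_smooth(path, x0)
print(steps_complete, steps_path)
```

In this toy case the complete graph is fully smoothed after a single step, while the sparser path needs more steps to reach the same tolerance, which is the qualitative behaviour the corollary describes.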

On the other hand, the conductance of the graph can also be bounded by the first eigenvalue gap of the re-normalized adjacency matrix as

Note that the first eigenvalue of the re-normalized adjacency matrix is always equal to 1. Moreover, the first eigenvalue gap of the re-normalized adjacency matrix is bounded by the effective resistance as



By the properties of effective resistance [8], the effective resistance between two nodes can only increase when an edge not incident to either of them is removed from the circuit. According to Inequality (8), this implies that the conductance of the graph can only decrease when an edge is removed from the graph. Consequently, by Inequality (6), mixing slows down after DropEdge.
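The monotonicity of effective resistance used in this argument can be checked numerically on a small example. The sketch below is an illustration under assumptions not taken from the paper: unit resistors, a connected graph, and effective resistance computed from the Moore-Penrose pseudoinverse of the graph Laplacian (a standard identity). The helper names and the 5-edge graph are hypothetical.

```python
import numpy as np


def laplacian(num_nodes, edges):
    """Graph Laplacian of an undirected graph whose edges are unit resistors."""
    lap = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0
    return lap


def effective_resistance(num_nodes, edges, s, t):
    """Effective resistance between s and t via the Laplacian pseudoinverse.

    Assumes the graph is connected; otherwise the value is not meaningful.
    """
    lap_pinv = np.linalg.pinv(laplacian(num_nodes, edges))
    indicator = np.zeros(num_nodes)
    indicator[s], indicator[t] = 1.0, -1.0
    return float(indicator @ lap_pinv @ indicator)


# A 4-cycle 0-1-2-3-0 with the chord (0, 2).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
before = effective_resistance(4, edges, 0, 1)

# Remove the edge (2, 3), which touches neither node 0 nor node 1.
remaining = [edge for edge in edges if edge != (2, 3)]
after = effective_resistance(4, remaining, 0, 1)
print(before, after)
```

Here the effective resistance between nodes 0 and 1 rises from 5/8 to 2/3 once the edge (2, 3) is removed, and the graph stays connected.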

8 More Details in Experiments


The statistics of all datasets are summarized in Table 2.

Datasets   Nodes     Edges        Classes   Features   Training/Validation/Testing   Type
Cora       2,708     5,429        7         1,433      1,208/500/1,000               Transductive
Citeseer   3,327     4,732        6         3,703      1,812/500/1,000               Transductive
Pubmed     19,717    44,338       3         500        18,217/500/1,000              Transductive
Reddit     232,965   11,606,919   41        602        152,410/23,699/55,334         Inductive
Table 2: Dataset Statistics

Self Feature Modeling. To emphasize the importance of the self-features, we also implement a variant of GCN with self feature modeling [5]:


where .
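A minimal NumPy sketch of such a layer follows. It assumes the variant adds a separately weighted transform of each node's own features to the usual propagated term, i.e. H' = ReLU(Â H W + H W_self); this form is an assumption rather than a transcription of the authors' equation, and `gcn_self_layer`, `w_self`, and the toy shapes are hypothetical names and values.

```python
import numpy as np


def renormalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense 0/1 adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_self_layer(a_hat, h, w, w_self):
    """One graph convolution with an extra self-feature term (assumed form)."""
    neighbor_term = a_hat @ h @ w        # standard GCN propagation
    self_term = h @ w_self               # each node's own features, separate weights
    return np.maximum(neighbor_term + self_term, 0.0)   # ReLU


rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
a_hat = renormalized_adjacency(adj)
h = rng.normal(size=(3, 4))              # 3 nodes, 4 input features
w = rng.normal(size=(4, 2))              # 2 output features
w_self = rng.normal(size=(4, 2))
out = gcn_self_layer(a_hat, h, w, w_self)
print(out.shape)
```

Setting `w_self` to zeros recovers a plain graph convolution layer, which makes the extra term easy to ablate.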

Network Architecture and Hyperparameters

We use a random search strategy to optimize the hyperparameters for each dataset in Section 5.1. The hyperparameter descriptions are summarized in Table 3. Table 4 reports the network architecture and hyperparameters. In the “Architecture” column of Table 4, GCL denotes a graph convolution layer, Dn is a dense block with n layers, and In is an inception block with n layers (for example, D3 is a dense block with 3 layers).

Hyperparameter   Description
hidden           the number of hidden dimensions in intermediate layers
lr               learning rate
weight_decay     L2 regularization weight
p                DropEdge percent
dropout          dropout rate
withloop         using self feature modeling
withbn           using batch normalization
Table 3: Hyperparameter Description
Dataset    Model       Architecture                    Hyperparameters
Cora       DeepGCN-I   GCL - I1 - I1 - I1 - I1 - GCL   hidden:128, lr:0.007, weight_decay:5e-3, p:0.8, dropout:0.9, withbn
Cora       DeepGCN-D   GCL - D3 - D3 - D3 - GCL        hidden:512, lr:0.0005, weight_decay:5e-4, p:0.6, dropout:0.5
Citeseer   DeepGCN-I   GCL - I2 - I2 - GCL             hidden:256, lr:0.009, weight_decay:5e-3, p:0.85, dropout:0.9, withloop, withbn
Citeseer   DeepGCN-D   GCL - D2 - GCL                  hidden:128, lr:0.012, weight_decay:5e-4, p:0.85, dropout:0.7, withloop
Pubmed     DeepGCN-I   GCL - I3 - I3 - I3 - I3 - GCL   hidden:64, lr:0.0005, weight_decay:1e-4, p:0.6, dropout:0.2, withloop, withbn
Pubmed     DeepGCN-D   GCL - D4 - D4 - GCL             hidden:64, lr:0.001, weight_decay:1e-3, p:0.2, dropout:0.6, withloop, withbn
Reddit     DeepGCN-I   GCL - I2 - I2 - I2 - I2 - GCL   hidden:64, lr:0.002, weight_decay:1e-5, sampling_percent:0.6, dropout:0.4, withloop, withbn
Reddit     DeepGCN-D   GCL - D4 - D4 - D4 - GCL        hidden:64, lr:0.003, weight_decay:1e-4, p:0.05, dropout:0.5, withloop
Table 4: The Architecture and Hyperparameters.
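To make the role of the `p` hyperparameter concrete, the sketch below shows one way an edge list could be randomly thinned before each training epoch. It is an illustration under stated assumptions, not the authors' implementation: `drop_edge` is a hypothetical helper, `p` is read as the fraction of edges removed (Table 4 also uses the key `sampling_percent`, and the tables do not say whether the value is a drop rate or a keep rate), and resampling once per epoch is assumed.

```python
import random


def drop_edge(edges, p, seed=None):
    """Return a copy of `edges` with a fraction p removed uniformly at random.

    Assumption: p is the fraction of edges dropped. If the value denotes the
    fraction kept instead, pass 1 - p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    rng = random.Random(seed)
    num_drop = int(p * len(edges))
    dropped = set(rng.sample(range(len(edges)), num_drop))
    return [edge for idx, edge in enumerate(edges) if idx not in dropped]


# Toy undirected graph with 10 edges; a fresh subset is drawn for every epoch.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 4),
         (3, 4), (3, 5), (4, 5), (0, 5), (2, 5)]
for epoch in range(3):
    kept = drop_edge(edges, p=0.8, seed=epoch)
    print(epoch, kept)
```

The thinned edge list would then be used to rebuild the re-normalized adjacency matrix for that epoch, while the full edge list would be used at validation and test time.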

8.1 More Results in Ablation Studies

The Comparison with Different Architectures.

Fig. 7 reports the training and validation losses on Cora and Citeseer.

Figure 7: Comparison of different architectures; the left sub-figure shows training loss and the right shows validation loss. PlainGCN-n denotes PlainGCN of depth n; the same notation applies to the other methods.

How important is the DropEdge?

Fig. 8 shows the validation loss of different architectures with and without DropEdge. Fig. 8(a) summarizes the validation loss for different numbers of layers with and without DropEdge on Citeseer. Fig. 8(b) summarizes the validation loss of the architectures described in Section 5.2 with and without DropEdge on Citeseer.

Figure 8: The validation loss of different architectures with and without DropEdge on Cora.
(a) Comparison of PlainGCN with and without DropEdge.
(b) Performance comparisons with and without DropEdge.
Figure 9: The validation loss of different architectures.

The Justification of Over-smoothing.

Fig. 10 shows more results on the distances between different intermediate layers under different edge dropping rates (0, 0.2, 0.4, 0.6, 0.8).

Figure 10: More justification of over-smoothing issue with different edge dropping rates.