AdaGCN: Adaboosting Graph Convolutional Networks into Deep Models

08/14/2019 · Ke Sun et al., Peking University

The design of deep graph models still remains to be investigated, and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of the network. The proposed graph convolutional network, called AdaGCN (AdaBoosting Graph Convolutional Network), has the ability to efficiently extract knowledge from high-order neighbors and to integrate knowledge from different hops of neighbors into the network in an AdaBoost way. We also present the architectural differences between AdaGCN and existing graph convolutional methods to show the benefits of our proposal. Finally, extensive experiments demonstrate the state-of-the-art prediction performance and the computational advantage of our approach AdaGCN.


1 Introduction

Recently, research related to learning on graph-structured data has gained considerable attention in the machine learning community. Graph neural networks gori2005new, particularly graph convolutional networks kipf2016semi; defferrard2016convolutional; bruna2013spectral, have demonstrated remarkable ability on node classification kipf2016semi, link prediction zhu2016max and clustering tasks fortunato2010community. A wide range of works focuses on the design of graph convolutions. Graph Convolutional Networks (GCN) kipf2016semi employed a first-order approximation of spectral graph convolution and achieved surprising performance. GraphSAGE hamilton2017inductive was proposed in an inductive manner, applying an aggregator over a sampled fixed-size neighborhood of each node. Moreover, Graph Attention Networks (GAT) velivckovic2017graph introduced a self-attention mechanism into the averaging of features from the neighborhood of each node.

Despite their enormous success, almost all of these models have shallow architectures with only two or three layers. The shallow design of GCN appears counterintuitive, since deep versions of these models in principle have access to more information, yet perform worse. Oversmoothing li2018deeper has been proposed to explain why deep GCN fails: by repeatedly applying Laplacian smoothing, GCN may mix the node features from different clusters and make them indistinguishable. This also means that by stacking too many graph convolutional layers, the embedding of each node in GCN is inclined to converge to a certain value li2018deeper, making it harder for classification. These shallow architectures, restricted by the oversmoothing issue, limit the ability to extract knowledge from high-order neighbors, i.e., features from remote hops of neighbors of the current nodes. Therefore, it is crucial to design deep graph models such that high-order information can be aggregated in an effective way for better predictions. As a matter of fact, several works have implicitly designed deep graph models to address this issue. A straightforward solution kipf2016semi; xu2018representation inspired by ResNets is to add residual connections, but this practice is unsatisfactory both in prediction performance and in computational efficiency when building deep graph models, as shown in our experiments. More recently, JK (Jumping Knowledge Networks xu2018representation) introduced jumping connections into the final aggregation mechanism in order to extract knowledge from different layers of graph convolutions. However, this simple change of the GCN architecture exhibited inconsistent empirical performance for different aggregation operators, which cannot demonstrate a successful construction of deep layers. In addition, LanczosNet liao2019lanczosnet utilizes the Lanczos algorithm to construct low-rank approximations of the graph Laplacian and can then exploit multi-scale information. Moreover, APPNP (Approximate Personalized Propagation of Neural Predictions, klicpera2018predict) leverages the relationship between GCN and personalized PageRank to derive an improved global propagation scheme.

Based on the aforementioned works, we argue that a key direction for constructing deep graph models lies in the efficient exploration and effective combination of information from different orders of neighbors. Due to the apparent sequential relationship between different orders of neighbors, it is a natural choice to incorporate a boosting algorithm into the design of deep graph models. As an important realization of boosting theory, AdaBoost freund1999short is extremely easy to implement and remains competitive in terms of both practical performance and computational cost hastie2009multi. Moreover, boosting theory has been used to analyze the success of ResNets in the computer vision field huang2017learning, and AdaGAN tolstikhin2017adagan has already successfully incorporated a boosting algorithm into the training of GANs.

In this work, we focus on incorporating AdaBoost into the design of deep graph convolutional networks in a non-trivial way. Firstly, in pursuit of introducing the AdaBoost framework, we refine the type of graph convolution and thus obtain a novel RNN-like GCN architecture called AdaGCN. Our approach can efficiently extract knowledge from different orders of neighbors and then combine this information in an AdaBoost manner with iterative updating of the node weights. Also, we compare AdaGCN with existing methods from the perspective of architectural differences to show the benefits of our method. Finally, we conduct extensive experiments to demonstrate the state-of-the-art performance of our approach and its computational advantage over other alternatives.

2 Our Approach: AdaGCN

2.1 Establishment of AdaGCN

In the vanilla GCN model kipf2016semi for semi-supervised node classification, the graph embedding of nodes with two convolutional layers is formulated as:

Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}\big)   (1)

where $X$ and $A$ denote the feature matrix and the adjacency matrix, respectively, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ with $\tilde{A} = A + I_n$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. In addition, $W^{(0)}$ is the input-to-hidden weight matrix and $W^{(1)}$ is the hidden-to-output weight matrix, where the output dimension $K$ is the number of classes.
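To make Eq. (1) concrete, here is a minimal NumPy sketch of the two-layer GCN forward pass. The dense matrices, the toy 4-node graph and the random weights are illustrative assumptions only; a real implementation would keep $\hat{A}$ sparse and train $W^{(0)}$, $W^{(1)}$ by gradient descent.

```python
# Minimal sketch of the two-layer GCN forward pass in Eq. (1).
# Dense NumPy arrays and random weights are used purely for illustration.
import numpy as np

def normalize_adjacency(A):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def softmax(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_two_layer(A, X, W0, W1):
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)   # first graph convolution + ReLU
    return softmax(A_hat @ H @ W1)        # second graph convolution + softmax

# toy example: 4 nodes, 3 input features, 5 hidden units, 2 classes
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
Z = gcn_two_layer(A, X, np.random.randn(3, 5), np.random.randn(5, 2))
```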

Our key motivation for constructing deep graph models is to efficiently explore the information of high-order neighbors and then combine these messages from different orders of neighbors in an AdaBoost way. Nevertheless, if we naively extract information from high-order neighbors based on GCN, we are faced with the joint optimization of the parameter matrices of many stacked layers, which is definitely costly in computation. In this paper, we propose to remove ReLU to avoid the expensive joint optimization of multiple parameter matrices; Simplified Graph Convolution (SGC) felix2019simplifying also adopted this practice, arguing that the nonlinearity between GCN layers is not crucial and that the majority of the benefit arises from local averaging of the features of the neighborhood. The simplified graph convolution is then formulated as:

Z^{(l)} = \hat{A}^{\,l} X \widetilde{W}   (2)

where $\hat{A}^{\,l}$ denotes the $l$-th power of the normalized adjacency matrix $\hat{A}$ and $\widetilde{W}$ collapses the per-layer weight matrices into a single linear transformation. In particular, one crucial impact of ReLU in GCN is accelerating the convergence of the matrix multiplication, since ReLU is a contraction mapping. Without ReLU, this simplified graph convolution can not only avoid the aforementioned joint optimization over multiple parameter matrices but also improve the efficiency of the combination. This is mainly because the removal of the ReLU operation alleviates the oversmoothing issue, i.e., it slows the convergence of node embeddings to indistinguishable ones li2018deeper. Nevertheless, we find that this type of stacked linear transformation has insufficient power in representing the information of high-order neighbors, as revealed in the experiment described in Appendix A.1. Therefore, we utilize a nonlinear function $f_\theta$ to replace the linear transformation $\widetilde{W}$ and attain the graph convolution form of each base classifier in AdaGCN as follows:

Z^{(l)} = f_\theta\big(\hat{A}^{\,l} X\big)   (3)

where the $l$-th classifier $f^{(l)}$ in AdaGCN extracts knowledge from the features of the current nodes and the $l$-th hop of neighbors. As for the realization of multi-class AdaBoost, we apply the SAMME algorithm (Stagewise Additive Modeling using a Multi-class Exponential loss function) hastie2009multi, a natural and clean multi-class extension of two-class AdaBoost that adaptively combines weak classifiers.
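As a concrete illustration of Eq. (3), one possible realization of the base classifier is a plain two-layer fully connected network applied to the pre-propagated features $\hat{A}^{\,l}X$. This is only a sketch under the assumption that $f_\theta$ outputs class probabilities; the parameter names and shapes are hypothetical.

```python
# Sketch of one AdaGCN base classifier (Eq. (3)): a two-layer fully connected
# network f_theta applied to the pre-propagated feature matrix H_l = A_hat^l X.
import numpy as np

def f_theta(H_l, W1, b1, W2, b2):
    """Return Z^{(l)} = f_theta(A_hat^l X) as row-wise class probabilities."""
    hidden = np.maximum(H_l @ W1 + b1, 0.0)          # ReLU hidden layer
    logits = hidden @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)  # softmax over classes
```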

Figure 1: AdaGCN: an RNN-like architecture with base classifiers sharing the neural network architecture $f_\theta$. $w^{(l)}$ and $\theta^{(l)}$ denote the node weights and parameters computed after the $l$-th base classifier, respectively.

As illustrated in Figure 1, we apply the base classifier $f^{(l)}$ to extract knowledge from the current node features and the $l$-th hop of neighbors by minimizing the current weighted loss. Then we directly compute the weighted error rate $err^{(l)}$ and the corresponding weight $\alpha^{(l)}$ of the current base classifier as follows:

err^{(l)} = \frac{\sum_{i=1}^{n} w_i\, \mathbb{1}\big(c_i \neq f^{(l)}(x_i)\big)}{\sum_{i=1}^{n} w_i}, \qquad \alpha^{(l)} = \log\frac{1 - err^{(l)}}{err^{(l)}} + \log(K-1)   (4)

where $w_i$ in this formula denotes the weight of the $i$-th node and $K$ is the number of classes. To attain a positive $\alpha^{(l)}$, we only need $1 - err^{(l)} > 1/K$, i.e., the accuracy of each weak classifier should be better than random guess ($1/K$) rather than $1/2$ hastie2009multi. This can easily be met in order to guarantee that the weights are updated in the right direction. Then we adjust the weights of the nodes by increasing the weights on incorrectly classified ones:

w_i \leftarrow w_i \cdot \exp\big(\alpha^{(l)} \cdot \mathbb{1}\big(c_i \neq f^{(l)}(x_i)\big)\big), \quad i = 1, \ldots, n   (5)

After re-normalizing the weights, we compute $\hat{A}^{\,l+1}X$ to sequentially extract knowledge from the $(l{+}1)$-th hop of neighbors in the following base classifier $f^{(l+1)}$. One crucial point of AdaGCN is that, different from traditional AdaBoost, we only define one $f_\theta$, e.g., a two-layer fully connected neural network, that in practice is recursively optimized in each base classifier, similar to a recurrent neural network. This also implies that the parameters of the last base classifier are leveraged as the initialization of the next base classifier, which coincides with our intuition that the $(l{+}1)$-th hop of neighbors is directly connected from the $l$-th hop of neighbors. In this way, AdaGCN can be efficient in space complexity. Next, we combine the predictions from different orders of neighbors in an AdaBoost way to obtain the final prediction $Z$:

Z = \sum_{l=0}^{L} \alpha^{(l)} Z^{(l)}   (6)

Finally, we obtain the concise form of AdaGCN:

Z^{(l)} = f_{\theta}\big(\hat{A}^{\,l} X\big), \qquad Z = \operatorname{AdaBoost}\big(\{Z^{(l)}\}_{l=0}^{L}\big)   (7)
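The following sketch traces the combination loop of Eqs. (4)-(7) in its SAMME form (hard-label weighting); the algorithm actually used in practice is the SAMME.R variant described in Section 3. The helper `fit_base_classifier` is a hypothetical stand-in for training $f_\theta$ on the propagated features under the current node weights.

```python
# Sketch of the AdaGCN combination loop (Eqs. (4)-(7)), SAMME form.
# `fit_base_classifier(H, y_train, train_idx, w)` is assumed to train f_theta on the
# propagated features under node weights w and return a function H -> class probs.
import numpy as np

def adagcn_samme(A_hat, X, y_train, train_idx, n_classes, n_layers, fit_base_classifier):
    n = X.shape[0]
    w = np.full(len(train_idx), 1.0 / len(train_idx))   # node weights on the training set
    H = X.copy()                                         # A_hat^0 X
    Z = np.zeros((n, n_classes))                         # combined prediction
    for l in range(n_layers + 1):
        predict = fit_base_classifier(H, y_train, train_idx, w)
        probs = predict(H)
        pred = probs.argmax(axis=1)
        miss = (pred[train_idx] != y_train).astype(float)
        err = (w * miss).sum() / w.sum()                           # Eq. (4)
        alpha = np.log((1.0 - err) / max(err, 1e-12)) + np.log(n_classes - 1)
        w = w * np.exp(alpha * miss)                               # Eq. (5)
        w = w / w.sum()                                            # re-normalize
        Z += alpha * probs                                         # Eq. (6)
        H = A_hat @ H                                              # move to the (l+1)-th hop
    return Z.argmax(axis=1)
```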

As for the architecture of AdaGCN shown in Figure 1, it is a variant of an RNN with synchronous sequence input and output, similar to the scenario of frame-level video classification. In addition, we provide a more detailed description of the AdaGCN algorithm in Section 3.

2.2 Comparison with Existing Methods

Architectural difference from GCN kipf2016semi, SGC felix2019simplifying and JK xu2018representation.

As illustrated in Figures 1 and 2, there is an apparent difference among the architectures of GCN, SGC, JK and AdaGCN. Compared with these existing graph convolutional approaches, which sequentially convey intermediate results to compute the final prediction, our AdaGCN transmits the node weights and the aggregated features of different hops of neighbors, and constantly optimizes one $f_\theta$, one base classifier at a time. More importantly, in AdaGCN the node embedding is independent of the flow of computation in the network, and the sparse adjacency matrix is not directly involved in the computation of the individual network, because we compute $\hat{A}^{\,l}X$ in advance and then feed it, rather than the sparse matrix itself, into the classifier $f^{(l)}$, thus yielding a significant computation reduction which will be discussed further in Section 4.3.

Figure 2: Comparison of the graph model architectures. The aggregation block in the JK network denotes one aggregation layer with an aggregation function such as concatenation or max pooling.

Connection with APPNP.

We also establish a strong connection between AdaGCN and the state-of-the-art APPNP klicpera2018predict method, which leverages personalized PageRank to reconstruct graph convolutions in order to use information from a large and adjustable neighborhood. Firstly, we find that the form of APPNP is equivalent to an exponential moving average (EMA) over different orders of neighbors with shared parameters:

Z_{\mathrm{APPNP}} = \mathrm{softmax}\Big(\Big(\gamma \sum_{l=0}^{L-1} (1-\gamma)^{l}\hat{A}^{\,l} + (1-\gamma)^{L}\hat{A}^{\,L}\Big) f_{\theta}(X)\Big)   (8)

where $\gamma$ is the teleport factor. On the other hand, our AdaGCN can be viewed as an adaptive form of APPNP, formulated as:

Z_{\mathrm{AdaGCN}} = \sum_{l=0}^{L} \alpha^{(l)} f_{\theta^{(l)}}\big(\hat{A}^{\,l} X\big)   (9)

The first discrepancy between the two methods lies in the adaptive coefficients $\alpha^{(l)}$ of AdaGCN, which are determined by the error of the $l$-th classifier, rather than the fixed, exponentially decreasing weights of APPNP. In addition, AdaGCN employs classifiers with different parameters $\theta^{(l)}$ to learn the embeddings of different orders of neighbors, while APPNP shares these parameters in its form. We verify this benefit of our approach in the experiments shown in Section 4.2.
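As a tiny sanity check of the EMA reading of Eq. (8), the APPNP propagation weights $\gamma(1-\gamma)^l$ for hops $0,\dots,L-1$ plus the terminal weight $(1-\gamma)^L$ decay geometrically and sum to one; the values of $\gamma$ and $L$ below are arbitrary illustrative choices.

```python
# APPNP's fixed, exponentially decaying hop weights (Eq. (8)); AdaGCN instead
# learns an adaptive coefficient alpha^{(l)} per hop from the weighted error.
gamma, L = 0.1, 10
coeffs = [gamma * (1 - gamma) ** l for l in range(L)] + [(1 - gamma) ** L]
print([round(c, 4) for c in coeffs], sum(coeffs))   # decaying weights, sum == 1.0
```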

Computational Complexity.

For the time complexity, in contrast with Eq. (1) of GCN, our approach keeps the sparse matrix $\hat{A}$ out of the neural network and thus saves much of the expensive sparse-matrix computation inside the network; a detailed illustration is provided in Section 4.3. For the space complexity, by using a sparse representation for $\hat{A}$, we only require memory linear in the number of edges $|E|$. In practice, we use sparse-dense matrix multiplications to obtain $\hat{A}^{\,l}X$ in Eq. (7) and a two-layer fully connected network as $f_\theta$, so the total space complexity of AdaGCN is much smaller than that of a two-layer GCN kipf2016semi.
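A sketch of the precomputation that keeps the sparse matrix out of the network: $\hat{A}$ is stored once in CSR format (memory linear in the number of edges) and the propagated features $\hat{A}^{\,l}X$ are produced by repeated sparse-dense products before $f_\theta$ ever sees the data. The function names below are ours, not those of the released implementation.

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(A):
    """A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2} as a CSR matrix (O(|E|) memory).

    A is assumed to be a scipy.sparse adjacency matrix.
    """
    A_tilde = A + sp.eye(A.shape[0], format="csr")
    d = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    return (D_inv_sqrt @ A_tilde @ D_inv_sqrt).tocsr()

def propagated_features(A, X, num_hops):
    """Yield X, A_hat X, ..., A_hat^num_hops X via sparse-dense products."""
    A_hat = normalized_adjacency(A)
    H = X
    for _ in range(num_hops + 1):
        yield H
        H = A_hat @ H   # each product costs O(|E| * F); A_hat is never densified
```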

3 Algorithm

In practice, we employ SAMME.R hastie2009multi, an improved version of SAMME, in AdaGCN. SAMME leverages the predicted hard labels in the combination, and a natural improvement is to use real-valued confidence-rated predictions, such as weighted probability estimates, to update the additive model rather than the classifications themselves. This soft version of SAMME, called the SAMME.R (R for Real) algorithm hastie2009multi, has been demonstrated to have better generalization and faster convergence than SAMME. We elaborate the final version of our AdaGCN in Algorithm 1.

Input: feature matrix $X$, normalized adjacency matrix $\hat{A}$, a two-layer fully connected network $f_\theta$, number of layers $L$ and number of classes $K$.
Output: final combined prediction $C(x)$.

Algorithm 1 AdaGCN based on the SAMME.R Algorithm
1:  Initialize the node weights $w_i = 1/n$ on the training set, the neighbor feature matrix $\hat{A}^{0}X = X$ and the classifier $f^{(0)}$.
2:  for $l$ = 0 to $L$ do
3:     Fit the graph convolutional classifier $f^{(l)}$ on the neighbor feature matrix $\hat{A}^{\,l}X$ based on $w_i$ by minimizing the current weighted loss.
4:     Obtain the weighted probability estimates $p^{(l)}(x)$ for $f^{(l)}(x)$: $p_k^{(l)}(x) = \mathrm{Prob}_w(c = k \mid x),\ k = 1, \ldots, K$.
5:     Compute the individual prediction $h^{(l)}(x)$ for the current graph convolutional classifier $f^{(l)}$: $h_k^{(l)}(x) = (K-1)\big(\log p_k^{(l)}(x) - \frac{1}{K}\sum_{k'} \log p_{k'}^{(l)}(x)\big)$.
6:     Adjust the node weights $w_i$ for each node $x_i$ with label $c_i$ on the training set: $w_i \leftarrow w_i \cdot \exp\big(-\frac{K-1}{K}\, y_i^{\top} \log p^{(l)}(x_i)\big)$, where $y_{ik} = 1$ if $c_i = k$ and $y_{ik} = -\frac{1}{K-1}$ otherwise.
7:     Re-normalize all weights $w_i$.
8:     Update the $(l{+}1)$-hop neighbor feature matrix: $\hat{A}^{\,l+1}X = \hat{A}\,(\hat{A}^{\,l}X)$.
9:  end for
10:  Combine all predictions $h^{(l)}(x)$ for $l = 0, \ldots, L$: $C(x) = \arg\max_k \sum_{l=0}^{L} h_k^{(l)}(x)$.
11:  return Final combined prediction $C(x)$.
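For reference, here is a NumPy sketch of one SAMME.R update (steps 4-7 of Algorithm 1), under the assumption that the current base classifier returns class-probability estimates `probs` for all nodes; the variable names are ours.

```python
import numpy as np

def samme_r_step(probs, y_train, train_idx, w, K, eps=1e-12):
    """One SAMME.R update given class probabilities `probs` (n x K) of f^{(l)}."""
    log_p = np.log(np.clip(probs, eps, None))
    # step 5: individual prediction h^{(l)}(x) for the current classifier
    h = (K - 1) * (log_p - log_p.mean(axis=1, keepdims=True))
    # coded labels: y_k = 1 for the true class, -1/(K-1) otherwise
    y_coded = np.full((len(train_idx), K), -1.0 / (K - 1))
    y_coded[np.arange(len(train_idx)), y_train] = 1.0
    # step 6: node-weight update, then step 7: re-normalization
    w_new = w * np.exp(-(K - 1) / K * (y_coded * log_p[train_idx]).sum(axis=1))
    return h, w_new / w_new.sum()
```

The final prediction of step 10 is then the argmax over classes of the sum of the returned h matrices across all layers l.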

As for the choice of the number of layers $L$ in AdaGCN, increasing $L$ can exponentially decrease the empirical loss according to AdaBoost theory; however, from the perspective of VC-dimension (more details are provided in Appendix A.4), an overly large $L$ can yield overfitting of AdaGCN. In practice, $L$ can be determined via cross-validation.

4 Experiments

In the experimental section, we focus on verifying three advantages of AdaGCN. Firstly, AdaGCN suffices to adaboost graph convolutional networks into deep models as the depth increases, thereby circumventing the oversmoothing problem li2018deeper. Secondly, our approach achieves state-of-the-art performance on four datasets and shows consistent performance across different label rates of graphs. Thirdly, we demonstrate that our approach is competitive in computational efficiency.

Experimental Setup.

For the graph datasets, we select four commonly used graphs: CiteSeer, Cora-ML bojchevski2017deep; mccallum2000automating, PubMed sen2008collective, and MS-Academic shchur2018pitfalls. Note that the average shortest path lengths of these graphs are between 5 and 10, so deep graph models that can integrate knowledge from high-order neighbors are expected to perform better. Dataset statistics are summarized in Table 1.

Dataset Type Nodes Edges Classes Features Label Rate
CiteSeer Citation 3,327 4,732 6 3,703 3.6%
Cora Citation 2,708 5,429 7 1,433 5.2%
PubMed Citation 19,717 44,338 3 500 0.3%
MS Academic Co-author 18,333 81,894 15 6,805 1.6%
Table 1: Dataset statistics

Evaluation.

Recent graph neural networks suffer from overfitting to a single split of training, validation and test data. To address this problem, inspired by klicpera2018predict, we test all approaches on multiple random splits and initializations to conduct a rigorous study. Detailed dataset splits are described in Appendix A.3.

Setting of Baselines and AdaGCN.

We compare AdaGCN with GCN kipf2016semi and Simple Graph Convolution (SGC) felix2019simplifying in Section 4.1. In Section 4.2, we employ the same baselines as klicpera2018predict: V.GCN (vanilla GCN) kipf2016semi and GCN with our early stopping, N-GCN (network of GCNs) abu2018n, GAT (Graph Attention Networks) velivckovic2017graph, BT.FP (bootstrapped feature propagation) buchnik2018bootstrapped and JK (Jumping Knowledge networks with concatenation) xu2018representation. In the computation part, we additionally compare AdaGCN with FastGCN chen2018fastgcn and GraphSAGE hamilton2017inductive. We take the results of the baselines from klicpera2018predict, and the implementation of AdaGCN is adapted from APPNP with its best setting. For AdaGCN on all datasets, we fix the number of hidden units and use 14, 12, 14 and 6 layers respectively, owing to the different graph structures. In addition, we set the dropout rate to 0 and apply L2 regularization on the first linear layer. More detailed model parameters can be found in Appendix A.3.

Early Stopping on AdaGCN.

In statistical learning theory, early stopping is commonly viewed as a form of regularization. We apply the same early stopping mechanism across all methods as klicpera2018predict for a fair comparison. Furthermore, boosting theory can also perfectly incorporate early stopping: it has been shown that for several boosting algorithms, including AdaBoost, regularization via early stopping can provide guarantees of consistency zhang2005boosting; jiang2004process; buhlmann2003boosting.

4.1 Design of Deep Graph Models to Circumvent Oversmoothing Effect

It is well known that GCN suffers from oversmoothing li2018deeper as more graph convolutions are stacked. Combining knowledge from each layer to design deep graph models is, without doubt, a sensible way to circumvent the oversmoothing issue.

Figure 3: Comparison of test accuracy of different models as the number of layers increases.

In our experiment, we aim to explore the prediction performance of GCN, GCN with residual connections kipf2016semi, SGC and our AdaGCN with a growing number of layers. From Figure 3, it can easily be observed that oversmoothing leads to a rapid decrease in accuracy for GCN (blue line) as the number of layers increases. In contrast, SGC (green line) smooths much more slowly than GCN, owing to the lack of ReLU analyzed in Section 2.1. Similarly, GCN with residual connections (yellow line) partially mitigates the oversmoothing effect of the original GCN but fails to take advantage of information from different orders of neighbors to constantly improve the prediction performance. Remarkably, AdaGCN (red line) consistently enhances performance as the number of layers increases across the three datasets. This implies that AdaGCN can efficiently incorporate knowledge from different orders of neighbors and circumvent the oversmoothing of the original GCN in the process of constructing deep graph models. In addition, the fluctuation in performance for AdaGCN is much lower than for GCN, especially when the number of layers is large.

4.2 Prediction Performance

We conduct a rigorous study of AdaGCN on four datasets under multiple splits of the data. The results in Table 2 suggest the state-of-the-art performance of our approach, and the improvement over APPNP validates the benefit of the adaptive form of AdaGCN, except for the comparable performance on the MS Academic dataset.

Model Citeseer Cora-ML Pubmed MS Academic
V.GCN 73.51±0.48 82.30±0.34 77.65±0.40 91.65±0.09
GCN 75.40±0.30 83.41±0.39 78.68±0.38 92.10±0.08
N-GCN 74.25±0.40 82.25±0.30 77.43±0.42 92.86±0.11
GAT 75.39±0.27 84.37±0.24 77.76±0.44 91.22±0.07
JK 73.03±0.47 82.69±0.35 77.88±0.38 91.71±0.10
BT.FP 73.55±0.57 80.84±0.97 72.94±1.00 91.61±0.24
PPNP 75.83±0.27 85.29±0.25 OOM OOM
APPNP 75.73±0.30 85.09±0.25 79.73±0.31 93.27±0.08
PPNP (ours) 75.53±0.32 84.39±0.28 OOM OOM
APPNP (ours) 75.41±0.35 84.28±0.28 79.41±0.34 92.98±0.07
AdaGCN 76.22±0.20 85.46±0.25 79.76±0.27 92.87±0.07
P value 1.6e-8 4.2e-16 2.3e-4 –
Table 2: Average accuracy over 100 runs, with uncertainties showing the 95% confidence level calculated by bootstrapping. OOM denotes "out of memory". "(ours)" denotes results based on our implementation, which are slightly lower than the numbers from the original literature klicpera2018predict. P values of the paired t-test between APPNP (ours) and AdaGCN are provided in the last row; "–" represents non-significance, i.e., comparable performance.

In realistic settings, graphs usually have different numbers of labeled nodes, and thus it is necessary to investigate the robustness of methods under different numbers of labeled nodes. Here we use label rates to measure the different numbers of labeled nodes and then sample the corresponding number of labeled nodes per class on each graph. Table 3 presents the consistent state-of-the-art performance of AdaGCN under different label rates.

Citeseer Cora-ML Pubmed MS Academic
Label Rates 1.0% / 2.0% 2.0% / 4.0% 0.1% / 0.2% 0.6% / 1.2%
V.GCN 67.6±1.4 / 70.8±1.4 76.4±1.3 / 81.7±0.8 70.1±1.4 / 74.6±1.6 89.7±0.4 / 91.1±0.2
GCN 70.3±0.9 / 72.7±1.1 80.0±0.7 / 82.8±0.9 71.1±1.1 / 75.2±1.0 89.8±0.4 / 91.2±0.3
PPNP 72.5±0.9 / 74.7±0.7 80.1±0.7 / 83.0±0.6 OOM OOM
APPNP 72.2±1.3 / 74.2±1.1 80.1±0.7 / 83.2±0.6 74.0±1.5 / 77.2±1.2 91.7±0.2 / 92.6±0.2
AdaGCN 73.8±0.8 / 75.1±0.7 82.7±0.7 / 84.7±0.5 76.3±1.2 / 78.6±1.0 91.8±0.2 / 92.7±0.2
Table 3: Average accuracy under different label rates over 20 different splits of the datasets, with uncertainties showing the 95% confidence level calculated by bootstrapping.

4.3 Computational Efficiency

To verify the effectiveness of our method in terms of computation, we investigate the per-epoch training time of different graph neural networks. Since the training of AdaGCN only involves a two-layer fully connected network, with the sparse matrix excluded from the network, it is natural to expect our method to be quite competitive in computational efficiency.

Figure 4: Left: Per-epoch training time of AdaGCN vs. other methods over 5 runs on four datasets. Right: Per-epoch training time of AdaGCN compared with GCN and SGC as the number of layers increases; the accompanying digit denotes the slope of a fitted linear regression.

From the left part of Figure 4, it turns out that AdaGCN has the fastest per-epoch training time in comparison with the other methods, except for the comparable performance of FastGCN on Pubmed. With the need to propagate information through sparse adjacency matrices, GCN and GraphSAGE exhibit relatively slower training than AdaGCN. In addition, FastGCN shows somewhat inconsistent computational behavior, being fastest on Pubmed but slower than GCN on the Cora-ML and MS-Academic datasets. Furthermore, with multiple power iterations, APPNP involves repeated computation with the sparse matrix and thus has a relatively expensive computational cost.

More specifically, we have also explored the computational cost of ReLU and the sparse adjacency matrix in the right part of Figure 4 as the number of layers increases. In particular, we can easily observe that both SGC (blue line) and GCN (red line) show a linearly increasing tendency, and the discrepancy between them arises from ReLU and the additional parameters, yielding a larger slope in the fitted linear regression for GCN. Moreover, for SGC and GCN, the computation through sparse matrices inside the neural network almost dominates the total cost, especially when the number of layers is large enough. In contrast, our AdaGCN (pink line) displays an almost constant trend as the number of layers increases, because it avoids computation through sparse matrices inside the network, enjoying competitive computational efficiency.

5 Discussions

AdaBoost hastie2009multi; freund1999short has a rich theory, ranging from fitting a forward stagewise additive model and margin theory to a game-theoretic interpretation. In this regard, there is solid theory behind our AdaGCN. However, traditional AdaBoost is established under an i.i.d. hypothesis, while graphs have an inherent data-dependent property. Fortunately, it has been demonstrated that the statistical convergence and consistency of regularized boosting lugosi2001bayes; mannor2003greedy, including AdaBoost, can be preserved when the samples are weakly dependent lozano2013convergence, i.e., when the data sequences are stationary and algebraically β-mixing. This theoretical result can ensure the correctness of our approach. More explanations are provided in Appendix A.2 for interested readers.

6 Conclusion

In this paper, we propose a novel graph neural network architecture called AdaGCN, incorporating AdaBoost into the entire computation of the network. With the assistance of AdaBoost, our AdaGCN is capable of effectively exploring information from high-order neighbors and then efficiently integrating knowledge from different hops of neighbors. Our work paves a way towards better combining neighbors of different orders to design deep graph models, rather than only stacking a specific type of graph convolution.

References

  • [1] Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-gcn: Multi-scale graph convolution for semi-supervised node classification. International Workshop on Mining and Learning with Graphs (MLG), 2018a.
  • [2] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. ICLR, 2018.
  • [3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. ICLR, 2014.
  • [4] Eliav Buchnik and Edith Cohen. Bootstrapped graph diffusions: Exposing the power of nonlinearity. In Abstracts of the 2018 ACM International Conference on Measurement and Modeling of Computer Systems, pages 8–10. ACM, 2018.
  • [5] Peter Bühlmann and Bin Yu. Boosting with the l 2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
  • [6] Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance sampling. ICLR, 2018.
  • [7] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
  • [8] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks. ICML, 2019.
  • [9] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
  • [10] Yoav Freund, Robert Schapire, and Naoki Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.
  • [11] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734. IEEE, 2005.
  • [12] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • [13] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.
  • [14] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep resnet blocks sequentially using boosting theory. ICML, 2018.
  • [15] Wenxin Jiang et al. Process consistency for adaboost. The Annals of Statistics, 32(1):13–29, 2004.
  • [16] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
  • [17] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. ICLR, 2018.
  • [18] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. AAAI, 2018.
  • [19] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard S Zemel. Lanczosnet: Multi-scale deep graph convolutional networks. ICLR, 2019.
  • [20] Aurelie C Lozano, Sanjeev R Kulkarni, and Robert E Schapire. Convergence and consistency of regularized boosting with weakly dependent observations. IEEE Transactions on Information Theory, 60(1):651–660, 2013.
  • [21] Gábor Lugosi and Nicolas Vayatis. On the bayes-risk consistency of boosting methods. 2001.
  • [22] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification–consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4(Oct):713–742, 2003.
  • [23] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
  • [24] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):93, 2008.
  • [25] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. In Relational Representation Learning Workshop (R2L 2018), NeurIPS, 2018.
  • [26] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. Adagan: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424–5433, 2017.
  • [27] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. ICLR, 2018.
  • [28] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. ICML, 2018.
  • [29] Tong Zhang, Bin Yu, et al. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538–1579, 2005.
  • [30] Jun Zhu, Jiaming Song, and Bei Chen. Max-margin nonparametric latent feature models for link prediction. arXiv preprint arXiv:1602.07428, 2016.

Appendix A Appendix

A.1 Insufficient Representation Power of AdaSGC

As illustrated in Figure 5, as the number of layers increases, AdaSGC, with only linear transformations, has insufficient representation power both in extracting knowledge from high-order neighbors and in combining information from different orders of neighbors, while AdaGCN exhibits a consistent improvement in performance as the number of layers increases.

Figure 5: AdaSGC vs AdaGCN.

A.2 Explanation of the Consistency of Boosting on Dependent Data

Definition 1

(β-mixing sequences.) Let $\sigma_i^j$ be the $\sigma$-field generated by a strictly stationary sequence of random variables $Z_i, Z_{i+1}, \ldots, Z_j$. The $\beta$-mixing coefficient is defined by:

\beta(n) = \sup_k \, \mathbb{E} \sup\big\{\, |P(A \mid \sigma_1^k) - P(A)| : A \in \sigma_{k+n}^{\infty} \,\big\}   (10)

Then a sequence is called $\beta$-mixing if $\beta(n) \to 0$ as $n \to \infty$. Further, it is algebraically $\beta$-mixing if there is a positive constant $r_\beta$ such that $\beta(n) = O(n^{-r_\beta})$.

Definition 2

(Consistency) A classification rule $h_n$ is consistent for a certain distribution if $\mathbb{E}\,L(h_n) \to L^*$ as $n \to \infty$, where $L^*$ is the Bayes risk. It is strongly Bayes-risk consistent if $L(h_n) \to L^*$ almost surely.

The convergence and consistency of the regularized boosting method on stationary $\beta$-mixing sequences mainly require the following assumptions:

A1. Properties of the sample sequence: The samples are assumed to come from a stationary algebraically $\beta$-mixing sequence with $\beta$-mixing coefficients satisfying $\beta(n) = O(n^{-r_\beta})$, where $r_\beta$ is a positive constant.

A2. Properties of the cost function $\phi$: $\phi$ is assumed to be differentiable, strictly convex, strictly increasing and such that $\phi(0) = 1$ and $\lim_{x \to -\infty} \phi(x) = 0$.

A.3 Experimental Details

Dataset Splitting.

We choose a training set of a fixed number of nodes per class, an early stopping set of 500 nodes and a test set of the remaining nodes. Each experiment is run with 5 random initializations on each data split, leading to a total of 100 runs per experiment. In the standard setting, we randomly select 20 nodes per class. For the two different label rates on each graph, we select 6 and 11 nodes per class on CiteSeer, 8 and 16 nodes per class on Cora-ML, 7 and 14 nodes per class on PubMed, and 8 and 15 nodes per class on MS-Academic.

Model Parameters.

For all GCN-based approaches, we use the same hyper-parameters as in the original paper: a learning rate of 0.01, a 0.5 dropout rate, the original L2 regularization weight, and 16 hidden units. For FastGCN, we adopt the officially released code to conduct our experiments. PPNP and APPNP are adapted with the best settings reported in klicpera2018predict for the number of power iteration steps and the per-dataset teleport probability. In addition, we use a two-layer network, apply L2 regularization on the weights of the first layer, and use dropout on both layers and the adjacency matrix. The early stopping criterion uses a fixed patience and an (unreachably high) maximum number of epochs. The implementation of AdaGCN is adapted from PPNP and APPNP, with corresponding patience and maximum epochs used in the early stopping of AdaGCN. Moreover, SGC is re-implemented in a straightforward way without incorporating advanced optimization, for better illustration and comparison. The other baselines adopt the same parameters described in PPNP and APPNP.

A.4 Choice of the Number of Layers

As for the choice of the number of layers $L$ in AdaGCN, we provide a VC-dimension-based analysis to illustrate that a too large $L$ can yield overfitting of AdaGCN. For $L$ layers of AdaGCN, its hypothesis set is

\mathcal{F}_L = \Big\{ \mathrm{sgn}\Big(\sum_{l=0}^{L} \alpha^{(l)} f^{(l)}\Big) : \alpha^{(l)} \in \mathbb{R},\ f^{(l)} \in \mathcal{H} \Big\}   (11)

Then the VC-dimension of $\mathcal{F}_L$ can be bounded in terms of the VC-dimension $d$ of the family of base hypotheses $\mathcal{H}$:

\mathrm{VCdim}(\mathcal{F}_L) \le C\,(d+1)(L+1)\log_2\big((L+1)e\big)   (12)

where $C$ is a constant and the upper bound grows as $L$ increases. Combined with VC-dimension generalization bounds, these results imply that larger values of $L$ can lead to overfitting of AdaBoost. The same situation arises in AdaGCN, which suggests that there is no need to stack too many layers in AdaGCN in order to avoid overfitting. In practice, $L$ is typically determined via cross-validation.