I Introduction
Recently, learning from graph structural data has received considerable attention in many fields of artificial intelligence. Graph neural networks (GNNs), in particular graph convolutional networks (GCNs) [1, 2, 3], have shown remarkable success in learning graph structural data and have been applied to recommendation systems [4][5], molecular design [6][7], node classification [8], and clustering tasks [9]. Despite this success, almost all GCNs focus on graph data with homogeneous node embeddings or bag-of-words representations. Ivanov and Prokhorenkova [10] incorporated gradient boosted decision trees (GBDT) into GNNs and proposed the BGNN architecture, the first to deal with heterogeneous tabular data. Real-world applications, however, involve many types of heterogeneous data. Can we design a general framework that handles distinct types of heterogeneous data beyond heterogeneous tabular data? Moreover, as laid out by [11], the graph convolution of the GCN model is a special form of Laplacian smoothing, which mixes the node features from different clusters and therefore brings the potential problem of oversmoothing [11]. Sun et al. [12] developed an RNN-like GCN by employing AdaBoost, which can extract knowledge from high-order neighbors of the current nodes. However, does not imply , one apparent simple example is considering the situation (two layer), and Thus, this RNN-like GCN will definitely lose the information of the original graph structure. Oversmoothing, the ability to deal with heterogeneous data, and the working mechanisms of the GCN model remain open problems.
Recall that the stacking method [13], or stacked generalization, is an approach for ensembling multiple different classifiers, and consists of base models and a meta-model. Stacking is successful in feature extraction across different kinds of data, even when the data are disordered or irregular. This is because stacking has the following properties: (1) by combining multiple distinct learners in the base models, effective discernible features can be learned well; (2) the base models are fitted on the whole training data to compute the performance on the test data; (3) the meta-model is used to make predictions on the test data.
Combining stacking with the GCN model is therefore expected to be beneficial. To the best of our knowledge, there is no general framework for dealing with heterogeneous graph structural data by the GCN model. In this paper, we present a novel simplified stacking based architecture for handling graph data, SStaGCN, which combines the feature extraction ability of stacking and aggregation with GCN's ability to learn graph structure on graph data. This enables SStaGCN to inherit the advantages of both the stacking method and the graph convolutional network. Overall, the contributions of the paper are listed as follows:
(1). We propose a new general architecture that combines simplified stacking and GCN, and which is adaptive and flexible enough to tackle heterogeneous graph structural data of different types. To the best of the authors' knowledge, this is the first systematic work that applies a modified stacking approach to general graph structural data.
(2). A generalization bound is derived to highlight the role of stacking and aggregation from the viewpoint of learning theory.
(3). We extensively evaluate our approach against strong baselines in node prediction tasks. The experimental results indicate significant performance improvements on both homogeneous and heterogeneous node classification tasks over a variety of real-world graph structural data, and show that the oversmoothing phenomenon can be well alleviated.
The remainder of the paper is organized as follows. In Section II, we give a brief review of related work. Section III addresses the theoretical analysis and the proposed algorithm for GCN. Experiments on public citation networks and other heterogeneous datasets are presented in Section IV. Discussions and concluding remarks are given in Section V. The proof of the main result is provided in the Appendix.
II Related Work
Graph Convolutional Networks
GCNs can be routinely interpreted as extensions of traditional convolutional neural networks to the graph domain. Typically, there are two types of GCNs [14]: spatial GCNs and spectral GCNs. Spatial GCNs construct new feature vectors for each vertex using its neighborhood information, in which convolution is viewed as a “patch operator”. Spectral GCNs define the convolution by decomposing a graph signal in the spectral domain and then applying a spectral filter (Fourier or wavelet, [1], [15], [16]) to the spectral components [17], [18]. However, this model entails computing the Laplacian eigenvectors, which is expensive and impractical for large-scale graphs. Hammond et al. [19] used Chebyshev polynomials up to $K$-th order to approximate the spectral filter. Defferrard et al. [20] constructed a localized ChebyNet. Kipf and Welling [8] considered the case $K=1$ and proposed a simple but powerful model for the semi-supervised classification task. Wu et al. [21] removed the nonlinearities and collapsed the weight matrices between consecutive layers to obtain a simplified GCN, whilst [22, 23] considered the design of deep GCNs. Multi-scale deep GCNs were investigated in [24].
Ensemble learning based graph neural networks
GCNs may suffer from the oversmoothing problem and cannot handle heterogeneous graph data. Sun et al. [12] designed an RNN-like graph structure to extract knowledge from high-order neighbors of the current nodes, while Ivanov and Prokhorenkova [10] incorporated GBDT into GNN and developed BGNN to tackle heterogeneous tabular data. A natural question arises: is there a general GCN architecture that handles various heterogeneous graph structural data and mitigates the oversmoothing issue? This paper investigates these challenges and answers the above question.
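III The proposed approach: SStaGCN
III-A Graph convolutional networks
Consider an undirected graph $G=(V,E)$ with $n$ nodes $v_i \in V$ and edges $(v_i, v_j) \in E$. Denote by $A \in \mathbb{R}^{n \times n}$ the adjacency matrix with corresponding degree matrix $D_{ii} = \sum_j A_{ij}$. Obviously, for an undirected graph, $A_{ij} = A_{ji}$. In the conventional GCN models for the semi-supervised node classification task, the graph embedding of nodes with two convolutional layers is described as follows:
$$Z = \hat{A}\,\mathrm{ReLU}\big(\hat{A} X W^{(0)}\big) W^{(1)}, \tag{1}$$
where $Z \in \mathbb{R}^{n \times K}$ is the final embedding matrix (output logits) of nodes before softmax with $K$ the number of classes. $X \in \mathbb{R}^{n \times d}$ stands for the feature matrix with $d$ the input dimension. $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{D}$ is the degree matrix of $\tilde{A}$ and $\tilde{A} = A + I$ ($I$ stands for the identity matrix). Moreover, $W^{(0)} \in \mathbb{R}^{d \times H}$ is the input-to-hidden weight matrix for a hidden layer with $H$ feature maps, and $W^{(1)} \in \mathbb{R}^{H \times K}$ denotes the hidden-to-output weight matrix.
III-B Stacking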
Stacking, as a hierarchical model integration framework, is a well-known and widely used ensemble machine learning algorithm [25]. It uses a meta-learning algorithm to learn how to best combine the predictions of two or more base machine learning algorithms. A traditional stacking model involves two or more base models and a meta-model that combines the predictions of the base models. The base models are of different types; each is fitted on the training data and its predictions are compiled. The meta-model then tries to best combine the predictions of the base models. It is often simple, providing a smooth interpretation of the predictions made by the base models. Hence, linear models are often used as the meta-model, such as linear regression for regression tasks and logistic regression for classification tasks.
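The two-level structure described above can be illustrated with a small, self-contained sketch. It is not the configuration used in this paper: the two base models (a nearest-centroid rule and 1-nearest-neighbour), the lookup-table meta-model that stands in for the usual linear meta-model, the single hold-out split, and the toy data are all assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fit_centroid(X, y):
    """Base model 1 (assumed for illustration): nearest class centroid."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
    centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))

def fit_1nn(X, y):
    """Base model 2 (assumed for illustration): 1-nearest-neighbour."""
    return lambda x: y[min(range(len(X)), key=lambda i: sq_dist(x, X[i]))]

def fit_meta(base_preds, y):
    """Toy meta-model: map each tuple of base predictions to the most
    frequent true label seen for that tuple on held-out data."""
    table = defaultdict(Counter)
    for preds, label in zip(base_preds, y):
        table[tuple(preds)][label] += 1
    default = Counter(y).most_common(1)[0][0]
    lookup = {k: c.most_common(1)[0][0] for k, c in table.items()}
    return lambda preds: lookup.get(tuple(preds), default)

# Hypothetical toy data: part A trains the base models, part B the meta-model.
X_a = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_a = [0, 0, 0, 1, 1, 1]
X_b = [[0.5, 0.5], [1, 1], [5.5, 5.5], [6, 6]]
y_b = [0, 0, 1, 1]

base_models = [fit_centroid(X_a, y_a), fit_1nn(X_a, y_a)]
meta_inputs = [[m(x) for m in base_models] for x in X_b]
meta_model = fit_meta(meta_inputs, y_b)

def stacked_predict(x):
    """Level 0: base predictions; level 1: meta-model combines them."""
    return meta_model([m(x) for m in base_models])
```

In practice the meta-model would typically be trained on out-of-fold predictions of the base models rather than on a single hold-out split.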
III-C The proposed approach
As mentioned above, GCN may mix the node features from different clusters and make them indistinguishable. Therefore, it is necessary to aggregate more node information in an effective way for better predictions. Motivated by the traditional stacking approach and the work in [12, 10], and to reduce the computational cost, we only use the base models of the stacking approach and then aggregate their outputs to attain the node features of the graph data. Specifically, the proposed method proceeds as follows. First, in the first layer, we attain by input (node feature matrix) through base classifiers
(2) 
Thereafter, we get the pre-classification results through (). Second, we use an aggregation method to attain the final output results, i.e.,
(3) 
where denotes an aggregation method. The idea of aggregation is simple: it groups attribute values into a single value. Generally, we can choose the mean, attention, or voting approach. Recall:
Mean: the mean operator takes the element-wise mean of the components .
Attention [26]: The attention mechanism has been widely used in various fields of deep learning, including but not limited to image processing [27], speech recognition [28], and natural language processing [29]. The idea is motivated by the attention mechanism of human beings. Denote the query vector as the output of the base classifiers, and let the vector be the label of the data. Then we compute the attention coefficients between and as follows:
(4) 
where denotes cosine similarity. Finally, the input of the graph convolutional layer is aggregated by considering the following sum with attention score
(5) 
Voting [30]: Voting is the most intuitive of the ensemble learning methods and aims at selecting one or more winners. In this work, our objective is to choose the most popular prediction among the results of the base classifiers. Therefore, we take the majority voting approach. If several categories receive the same number of votes, one of them is selected at random.
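The three aggregation options can be sketched as follows. This is a minimal illustration, not the authors' implementation: each base classifier is assumed to output one integer label per node, the rounding of the mean is an assumption about how a label is recovered, and the attention weights are assumed to be cosine similarities normalised to sum to one, since the exact normalisation is not recoverable here.

```python
import math
import random
from collections import Counter

def mean_aggregate(preds):
    """Element-wise mean over base classifiers, rounded to an integer label."""
    n = len(preds)
    return [round(sum(col) / n) for col in zip(*preds)]

def vote_aggregate(preds, rng=random):
    """Majority vote per node; ties are broken uniformly at random."""
    out = []
    for col in zip(*preds):
        counts = Counter(col)
        top = max(counts.values())
        winners = sorted(c for c, k in counts.items() if k == top)
        out.append(winners[0] if len(winners) == 1 else rng.choice(winners))
    return out

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attention_aggregate(preds, key):
    """Weighted sum of base outputs. Weights are cosine similarities to `key`,
    normalised to sum to one (assumed normalisation; requires a positive sum)."""
    scores = [cosine(p, key) for p in preds]
    total = sum(scores)
    weights = [s / total for s in scores]
    return [sum(w * p[i] for w, p in zip(weights, preds)) for i in range(len(key))]

# Hypothetical outputs of three base classifiers on four nodes.
preds = [[0, 2, 1, 1],
         [0, 2, 1, 0],
         [2, 2, 1, 0]]
labels = [0, 2, 1, 0]  # hypothetical label vector used as the attention key

mean_out = mean_aggregate(preds)
vote_out = vote_aggregate(preds)
att_out = attention_aggregate(preds, labels)
```

On the first node the base predictions are (0, 0, 2): voting returns the majority class 0, whereas the rounded mean returns class 1, a label that no base classifier predicted.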
Thus, we attain a novel GCN model to deal with heterogeneous graph data by elegantly combining stacking, aggregation, and vanilla GCN, namely, SStaGCN. The first layer of SStaGCN utilizes the base models of stacking approach, and the second layer of SStaGCN uses aggregation method such as mean, attention or voting, which can enhance the feature extraction ability of conventional GCN models. The aggregated data will henceforth be used as the input of conventional GCN model, and we attain the final prediction results. The workflow of the proposed model is demonstrated in Algorithm 1 and Fig. 1.
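The last stage of this pipeline, feeding the aggregated features into a conventional GCN, can be sketched with the standard two-layer propagation rule of [8], $Z = \hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)}) W^{(1)}$ with $\hat{A} = \tilde{D}^{-1/2}(A+I)\tilde{D}^{-1/2}$. The toy graph, the one-hot encoding of aggregated labels, and the identity weight matrices below are assumptions made only for illustration; in a real model the weights are learned.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN logits: A_hat @ ReLU(A_hat @ X @ W0) @ W1."""
    A_hat = normalize_adjacency(A)
    H = relu(matmul(matmul(A_hat, X), W0))
    return matmul(matmul(A_hat, H), W1)

# Hypothetical 3-node path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
# Hypothetical aggregated pre-classification labels (0, 0, 1), one-hot encoded.
X = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]  # untrained placeholder weights
W1 = [[1.0, 0.0], [0.0, 1.0]]

Z = gcn_forward(A, X, W0, W1)  # one row of logits per node
```

Because the input here is a low-dimensional pre-classification result rather than the raw feature matrix, the first weight matrix is correspondingly small.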
III-D Generalization Bound Analysis
In this section, we give a theoretical generalization analysis of the proposed approach. In our analysis, we assume that the adjacency matrix and the node feature matrix are both fixed.
In learning theory, the risk of over the unknown population distribution is measured by
where is the loss function defined as a map: . Given training data and the adjacency matrix, the objective is to estimate the parameters from model (1) based on empirical data. Concretely, we attempt to minimize the empirical risk functional over some function class , which takes the form
where is the labelled sample obtained from the original training data via stacking and aggregation. Typically, the clustering algorithms will only produce discernible nodes. Hence, if . It is trivial that stacking and the three aggregation methods proposed in this paper will not violate the constraints, i.e., . Now we are in a position to present the theoretical generalization bound analysis.
Theorem 1
Suppose , , . Denote the number of neighbors of node (the set of node indices with observed labels), let , be any given predictor of a class of GCNs with one hidden layer. Assume that the loss function is Lipschitz continuous with Lipschitz constant . Then, for any , with probability at least , we have
where is the Frobenius norm, with .
Remark 1
Theorem 1 indicates that the dominant upper bound depends linearly on the maximum number of neighbors of the nodes , bounds of the weights and , which strongly depend on the dimension index . Obviously, will yield a tight generalization bound. When (binary classification case), the result stated here is similar to the one outlined in [31].
IV Experiments
IV-A Datasets
To evaluate the performance of the proposed SStaGCN model on distinct types of graph structural data, we utilize six real-world datasets for the semi-supervised node classification task, including three commonly used citation networks: Cora, CiteSeer, and Pubmed [32], and three other heterogeneous datasets: Houseclass, VKclass and DBLP [10]. As indicated in [10], Houseclass and VKclass are derived from the House and VK datasets, respectively, where the target labels are converted into several discrete classes due to the lack of publicly available heterogeneous graph-structured data.
In the citation networks, nodes represent documents, and (undirected) edges stand for the citation relationships between documents. The node features are representative words in the documents, and the label rate denotes the percentage of node labels used for training. The Cora dataset contains 2708 nodes, 5429 edges, 7 classes, and 1433 node features, the CiteSeer dataset contains 3327 nodes, 4732 edges, 6 classes, and 3703 node features, and the Pubmed dataset contains 19717 nodes, 44338 edges, and 3 classes. We select 140, 120, and 60 nodes for training on the Cora, CiteSeer, and Pubmed datasets, respectively. Each dataset uses 1000 nodes for testing and 500 nodes for cross-validation. The data splitting we use is the same as that of GCN, Graph Attention Network (GAT, [33]) and GWNN [15]. Details about the citation networks (resp. heterogeneous networks) are described in Table I (resp. Table II).
Table I
Dataset  Cora  CiteSeer  Pubmed
Nodes  2708  3327  19717
Edges  5429  4732  44338
Features  1433  3703  500
Classes  7  6  3
Label Rate  5.2%  3.6%  0.3%
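As a quick arithmetic check, the label rates in Table I follow from dividing the number of training nodes by the number of nodes. The training-set sizes below (140, 120, 60) are an assumption: they are the sizes of the standard split used by GCN [8], which this section states it follows.

```python
# Node counts taken from Table I.
nodes = {"Cora": 2708, "CiteSeer": 3327, "Pubmed": 19717}
# Assumed training-set sizes of the standard GCN split.
train = {"Cora": 140, "CiteSeer": 120, "Pubmed": 60}

# Label rate = training nodes / all nodes, expressed in percent.
label_rate = {name: 100.0 * train[name] / nodes[name] for name in nodes}

for name, rate in label_rate.items():
    print(f"{name}: {rate:.1f}%")
```

Rounded to one decimal place, these values agree with the Label Rate row of Table I.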
Table II
Dataset  Houseclass  VKclass  DBLP
Nodes  20640  54028  14475
Edges  182146  213644  40269
Features  6  14  5002
Classes  5  7  4
Min Target Nodes  0.14  13.48  745
Max Target Nodes  5.00  118.39  1197
IV-B Baselines
We compare SStaGCN with classical graph convolutional networks: ChebyNet, GCN, GAT, and APPNP [34], and ensemble learning based GCNs: AdaGCN and BGNN models.
IV-C Setting
As demonstrated in Fig. 1, the proposed SStaGCN model contains four layers, where the first and second layers together form the feature extraction layer. The first layer is based on the base models of the stacking method. Here we consider the combination of seven classical classifiers: KNN, Random Forest, Naive Bayes, Decision Tree, SVC [35], GBDT [36], and AdaBoost [37] in the first layer of SStaGCN. These are representative classical classifiers in the machine learning community, each with its own merits in dealing with distinct types of tasks. Moreover, we adopt three aggregation methods in the second layer: mean, attention, and voting. For the mean approach, we take the mean value of the output of the first layer and then round it. For the attention mechanism, we consider the label data as vector , the predicted value of the feature extraction layer as the query vector , and then compute the attention coefficients. For the voting approach, we employ the hard voting technique in ensemble learning [38].
Thereafter, the output of the second layer is used as the input of the first layer of the GCN, which has only two layers in our setting. The GCN considered in this paper has hidden units, the Adam optimizer [39] is the default optimizer, and cross entropy is used as the loss function. We set learning rate , number of iterations , weight decay , and dropout rate equals .
IV-D Results
Table III
Method  Cora  CiteSeer  Pubmed
ChebyNet  81.20  69.80 ± 0.00  74.40 ± 0.00
GCN  81.50 ± 0.00  70.30 ± 0.00  79.00 ± 0.00
GAT  83.00 ± 0.70  72.50 ± 0.70  79.00 ± 0.30
APPNP  85.09 ± 0.25  75.73 ± 0.30  79.73 ± 0.31
AdaGCN  85.97 ± 0.20  76.68 ± 0.20  79.95 ± 0.21
BGNN  41.97 ± 0.19  30.74 ± 0.10  10.32 ± 0.10
SimStacking  43.19 ± 2.05  62.7 ± 0.53  87.7 ± 0.23
SStaGCN (Mean)  90.35 ± 0.20  86.40 ± 0.12  82.30 ± 0.19
SStaGCN (Attention)  91.60 ± 0.18  87.20 ± 0.12  82.40 ± 0.23
SStaGCN (Voting)  93.10 ± 0.16  88.70 ± 0.14  92.07 ± 0.20
Table IV
Method  Houseclass  VKclass  DBLP
ChebyNet  54.74 ± 0.10  57.19 ± 0.36  32.14 ± 0.00
GCN  55.07 ± 0.13  56.40 ± 0.09  39.49 ± 1.37
GAT  56.50 ± 0.22  56.42 ± 0.19  76.83 ± 0.78
APPNP  57.03 ± 0.27  56.72 ± 0.11  79.47 ± 1.46
AdaGCN  26.20 ± 0.00  46.00 ± 0.00  10.06 ± 0.00
BGNN  66.7 ± 0.27  66.32 ± 0.20  86.94 ± 0.74
SimStacking  53.89 ± 0.29  56.64 ± 0.10  71.58 ± 0.64
SStaGCN (Mean)  72.35 ± 0.05  66.62 ± 0.17  82.31 ± 0.20
SStaGCN (Attention)  72.40 ± 0.12  77.64 ± 0.08  82.51 ± 0.22
SStaGCN (Voting)  76.13 ± 0.12  87.92 ± 0.07  92.60 ± 0.10
Table V
Method  Cora  CiteSeer  Pubmed
ChebyNet  77.99 ± 0.54  63.76 ± 0.34  77.74 ± 0.42
GCN  82.89 ± 0.30  70.65 ± 0.37  78.83 ± 0.32
GAT  83.59 ± 0.25  70.62 ± 0.29  77.77 ± 0.40
APPNP  84.29 ± 0.22  71.05 ± 0.38  79.66 ± 0.31
AdaGCN  79.55 ± 0.19  63.62 ± 0.19  78.55 ± 0.21
BGNN  40.81 ± 0.25  32.73 ± 0.13  8.46 ± 0.08
SimStacking  44.02 ± 1.61  60.86 ± 0.56  87.31 ± 0.13
SStaGCN (Mean)  90.66 ± 0.18  86.42 ± 0.12  82.30 ± 0.19
SStaGCN (Attention)  91.69 ± 0.14  87.24 ± 0.14  82.45 ± 0.23
SStaGCN (Voting)  92.76 ± 0.16  88.73 ± 0.14  92.07 ± 0.20
Table VI
Method  Houseclass  VKclass  DBLP
ChebyNet  31.34 ± 0.12  57.44 ± 0.27  26.84 ± 0.62
GCN  54.95 ± 0.14  56.52 ± 0.09  38.5 ± 0.97
GAT  56.54 ± 0.68  56.41 ± 0.07  77.1 ± 1.86
APPNP  57.88 ± 0.32  56.61 ± 0.07  79.34 ± 0.23
AdaGCN  25.01 ± 0.00  37.03 ± 0.00  9.60 ± 0.00
BGNN  66.48 ± 0.22  66.18 ± 0.11  87.2 ± 0.60
SimStacking  53.32 ± 0.15  56.11 ± 0.08  71.49 ± 0.31
SStaGCN (Mean)  72.23 ± 0.04  66.74 ± 0.21  82.13 ± 0.38
SStaGCN (Attention)  72.36 ± 0.09  77.62 ± 0.10  82.68 ± 0.12
SStaGCN (Voting)  75.45 ± 0.82  87.84 ± 0.04  92.64 ± 0.06
Table VII: p-values of the paired t-test of SStaGCN (Voting) with competitors on different datasets (Cora, CiteSeer, Pubmed, Houseclass, VKclass, and DBLP).
Models  Cora  CiteSeer  Pubmed  Houseclass  VKclass  DBLP
ChebyNet  2.59e-06  2.77e-08  1.17e-06  3.51e-10  1.11e-11  6.97e-09
GCN  4.19e-16  5.15e-17  1.84e-15  2.36e-08  2.25e-11  2.48e-07
GAT  2.10e-19  1.54e-20  8.84e-19  3.35e-08  1.38e-09  4.95e-06
APPNP  6.86e-42  7.12e-42  6.25e-41  7.05e-15  1.39e-21  2.26e-09
AdaGCN  1.82e-19  1.23e-20  8.42e-19  3.05e-38  7.94e-43  1.69e-35
BGNN  1.02e-24  2.33e-21  4.74e-22  6.54e-07  1.07e-08  0.11e-03
SimStacking  5.61e-12  3.79e-15  2.52e-09  4.56e-08  6.07e-11  1.07e-06
The results of the comparative evaluation for node classification are summarized in Tables III-XI, where SStaGCN (Mean) means that the mean mechanism is used in the second layer of SStaGCN; SStaGCN (Attention) and SStaGCN (Voting) have analogous meanings. We report the accuracy, F1-score (macro), and training time on the test set for the proposed SStaGCN model and the other methods. The experimental results demonstrate a significant improvement of the SStaGCN model over the baselines. Specifically, for public citation networks, SStaGCN (Voting) achieves almost (), (), and () improvement of the accuracy (resp. F1-score) for the Cora, CiteSeer and Pubmed datasets, respectively. For heterogeneous datasets, SStaGCN (Voting) obtains nearly (), (), and () improvement of the accuracy (resp. F1-score) for the Houseclass, VKclass, and DBLP datasets, respectively. AdaGCN performs poorly on the heterogeneous datasets. This may be because AdaGCN aims at designing deep GCNs, which may mix the node features of different clusters as the GCN layers go deeper. Intuitively, SStaGCN is able to enhance the performance of GCN and provides better qualitative results for distinct types of graph structural data.
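For reference, the two quality metrics reported here, accuracy and macro-averaged F1-score, can be computed as in the following minimal sketch. The label vectors are hypothetical and serve only to show the calculation; they are not taken from the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical true and predicted labels for six test nodes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro averaging gives every class the same weight regardless of its size, which is why it can differ noticeably from accuracy on imbalanced datasets.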
This impressive improvement can be explained as follows:
(1). The feature extraction step of SStaGCN can achieve a dimensionality reduction effect and make the graph data more discernible. For instance, the size of Cora dataset reduces to from after conducting feature extraction, which means we attain a relatively smaller as discussed in Remark 1, and this greatly improves the prediction ability and computation efficiency in the subsequent graph convolution model.
(2). In the aggregation step of the SStaGCN model, the mean and attention mechanisms destroy the pre-classification results to some extent, which is inappropriate for feature extraction, while the voting mechanism does not. Consistently, the experimental results show that SStaGCN (Voting) is the most effective variant on our datasets.
(3). Simplified stacking can extract effective node features but ignores the graph structure information, while the GCN model is weak at extracting node features. Hence, the SStaGCN model inherits the merits of both simplified stacking and GCN: it not only achieves higher classification accuracy but also reduces the computation time.
Tables VIII and IX compare the training time of SStaGCN and the other methods. We can see that BGNN runs fastest on the citation networks, followed by our SStaGCN method, whilst the proposed SStaGCN runs faster on heterogeneous datasets except on the DBLP dataset. We think the reason is that the feature extraction step takes extra computation time but yields more efficiency when the extracted features are fed into the GCN model.
Table VIII
Method  Cora  CiteSeer  Pubmed
ChebyNet  22.74 ± 1.24  30.87 ± 1.21  124.99 ± 1.77
GCN  13.41 ± 0.16  99.21 ± 0.98  55.61 ± 0.73
GAT  20.98 ± 0.46  30.74 ± 1.40  126.33 ± 1.80
APPNP  203.75 ± 0.15  55.40 ± 0.40  457.62 ± 12.77
AdaGCN  772.26 ± 83.56  2129.02 ± 148.97  2098.10 ± 275.88
BGNN  1.33 ± 0.00  2.40 ± 0.00  2.54 ± 0.00
SimStacking  11.9 ± 0.08  27.9 ± 0.24  79.1 ± 1.40
SStaGCN (Mean)  10.9 ± 0.13  17.2 ± 0.13  89.6 ± 2.20
SStaGCN (Attention)  11.2 ± 0.24  17.6 ± 0.41  87.6 ± 0.96
SStaGCN (Voting)  16.2 ± 0.19  29.6 ± 1.40  13.1 ± 2.61
Table IX
Method  Houseclass  VKclass  DBLP
ChebyNet  833.82 ± 0.00  1394.68 ± 0.00  8890.74 ± 0.00
GCN  46.06 ± 0.80  120.1 ± 3.35  268.5 ± 5.35
GAT  197.6 ± 3.08  410.9 ± 9.95  205.3 ± 2.00
APPNP  129.7 ± 3.26  383.8 ± 12.02  176.8 ± 5.58
AdaGCN  607.31 ± 0.00  511.41 ± 0.00  590.90 ± 0.00
BGNN  26.37 ± 1.65  93.47 ± 6.23  50.25 ± 3.04
SimStacking  16.41 ± 0.36  43.58 ± 2.78  380.812.82
SStaGCN (Mean)  52.94 ± 0.51  132.9 ± 1.61  188.7 ± 0.54
SStaGCN (Attention)  57.38 ± 1.75  133.2 ± 1.11  246.2 ± 0.37
SStaGCN (Voting)  59.69 ± 0.82  154.1 ± 1.02  310.7 ± 0.49
To illustrate the effect of the feature extraction step of the proposed model, we provide a visualization with t-SNE [40] in Figs. 2(a) and 3(a). These figures indicate that the combination of stacking and aggregation can extract the node features well and make the graph data more discriminative. Moreover, the paired t-test in Table VII demonstrates that the proposed SStaGCN model is significantly different from the simplified stacking and other GCN models.
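The statistic behind a paired t-test can be sketched as follows: for per-run scores of two methods, the test statistic is the mean of the paired differences divided by its standard error. The accuracy values below are hypothetical and are not the runs behind Table VII; converting the statistic to a p-value additionally requires the Student t distribution, which is omitted here.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

# Hypothetical accuracies (%) of two methods over five paired runs.
method_a = [92.9, 93.1, 93.3, 93.0, 93.2]
method_b = [81.4, 81.6, 81.5, 81.3, 81.7]

t_value, dof = paired_t_statistic(method_a, method_b)
```

A large absolute t-value with few degrees of freedom corresponds to a very small p-value, which is the pattern reported for the comparisons in Table VII.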
Table X indicates that we do not need all seven classifiers. Taking the Cora dataset as an example, we can observe that KNN, Random Forest, and Naive Bayes form the best combination, which attains the highest accuracy without much cost in computation time (only 16.60 seconds). This demonstrates that the classifiers have their own merits in handling different specific tasks.
KNN  Random Forest  Naive Bayes  Decision Tree  GBDT  Adaboost  SVC  Accuracy  Training Time 
91.2  13.90  
93.6  16.60  
84.2  567.9  
92.9  144.7  
93.1  149.5  
92.8  15.90  
93.4  18.80  
92.9  568.3  
92.9  570.7  
92.8  708.5 
Furthermore, to demonstrate the effect of simplified stacking on the oversmoothing problem, we add an experiment on oversmoothing. In Fig. 4(a), we can observe that a conventional GCN may mix the features of vertices from different clusters as the number of GCN layers increases. However, as demonstrated in Fig. 5(a) and Table XI (here the number of layers does not include the layers of the feature extraction part of SStaGCN, which has only two layers), the proposed SStaGCN can effectively ameliorate the oversmoothing phenomenon and improve the accuracy.
Table XI
Method  2-layer  3-layer  4-layer  5-layer  6-layer  7-layer
GCN  80.5  80.4  75.8  71.9  72.6  60.8
SStaGCN  93.3  88.8  87.5  86.4  84.8  84.3
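The mechanism behind oversmoothing can be seen in a toy computation: repeatedly applying the normalized adjacency matrix, as stacked graph convolutions do when weights and nonlinearities are ignored, drives the features of all nodes toward a common value. The sketch below uses an assumed 6-node ring graph, on which every node has the same degree so the normalized adjacency with self-loops reduces to a plain average over a node and its two neighbors; it illustrates Laplacian smoothing [11] only and does not reproduce the experiment in Table XI.

```python
def smooth(A_hat, x):
    """One propagation step: x <- A_hat @ x for a single scalar feature per node."""
    return [sum(a * v for a, v in zip(row, x)) for row in A_hat]

n = 6
# Ring graph with self-loops: each node averages itself and its two neighbors.
A_hat = [[1 / 3 if (i - j) % n in (0, 1, n - 1) else 0.0 for j in range(n)]
         for i in range(n)]

# Two "clusters" with clearly different initial features.
x = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

spreads = []  # max - min feature value after each propagation step
for _ in range(20):
    x = smooth(A_hat, x)
    spreads.append(max(x) - min(x))
```

The gap between the largest and smallest node feature shrinks toward zero as the number of propagation steps grows, so the two clusters become indistinguishable from the features alone.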
Overall, these experiments demonstrate the superiority of SStaGCN model over competitors.
IV-E Visualization
To further demonstrate the performance of SStaGCN, we plot the final classification features obtained via GCN, AdaGCN, BGNN, and our SStaGCN. Fig. 6(a) (resp. Fig. 7(a)) displays the final classification features of these methods on the CiteSeer (resp. DBLP) dataset. From Figs. 6(a) and 7(a), we can observe that relatively few points are misclassified by the proposed SStaGCN, whilst many classes are wrongly predicted and classified by GCN, AdaGCN and BGNN.
V Conclusion
Traditional GCNs cannot deal well with heterogeneous graph structural data. In this work, we propose a novel GCN architecture, namely, SStaGCN. SStaGCN first takes advantage of the stacking method and aggregation techniques to attain pre-classified data features, and then utilizes GCN to conduct prediction for heterogeneous graph data. In this way, SStaGCN can effectively explore and exploit node features for heterogeneous graph data. Our work paves a way towards better combining classical machine learning methods with GCN models, proposes a general framework for handling distinct types of graph structural data, and offers insights toward a better understanding of GCN. Extensive experiments demonstrate that the proposed model is superior to state-of-the-art competitors in terms of accuracy, F1-score, and training time. The proposed framework could be generalized to the regression setting. Furthermore, we believe the proposed method can tackle various distinct types of heterogeneous graph data, although the experiments are conducted on tabular data. A promising future research direction is to investigate deeper GCNs in our setting, as discussed by [12].
VI Appendix
In this part, we give the detailed proof of Theorem 1. Before addressing the proof, we give several lemmas. At first, let us present the contraction inequality of Rademacher complexity in the vector form.
Lemma 1
[41] Let be any set, and be a class of functions and let have Lipschitz constant . Then
where is an independent doubly indexed Rademacher sequence and is the th component of .
Lemma 2
[42] Consider a loss function . Denote , and let be independently selected according to the probability measure . Then for any , with probability at least ,
We first give a lemma which plays an essential role in the proof of Theorem 1.
Lemma 3
Let for each node , then
Proof 1
Denote as the submatrix of whose row and column indices belong to the set . Let be the feature matrix of the nodes in (subgraph of ). Hence,
where is the th row of the matrix with column indices belonging to the set . Notice that
and . Therefore,
Now we are in a position to give the proof of Theorem 1.
Proof 2
With a slight abuse of notation, we will use to denote due to the explanation on page 3. Denote , with . Applying Proposition 4 in [43] to the case , we obtain that the Lipschitz constant of the standard softmax function is . Let stand for the th row of the matrix , for function set
the empirical Rademacher complexity is defined as
where is an i.i.d. family of Rademacher variables independent of . By the contraction property of Rademacher complexity,
and notice Lemma 2, we only need to bound . Therefore, we have the following estimate by utilizing Lemma 1.
where , notice the property of inner product, the above estimate can be further bounded as
the last inequality follows by the property that . Now the key point is how to estimate the term . We will employ the idea introduced in [31] (in the proof of Theorem 1) to remove the “sup” term. Let , , , with , , and notice that , then we have
By the definition of Frobenius norm , the supremum of the above quantity under the constraint must be obtained when for some , and for all . Hence
Let be the th neighbor number of node (, ). Recall , with , therefore
Applying the conclusion and contraction property of Rademacher complexity again, we have
Therefore, we only need to estimate the term
Applying the Cauchy-Schwarz inequality yields that
where the last inequality is due to the i.i.d. condition of the Rademacher sequences. Plugging the conclusion of Lemma 3 into the above term leads to
and
This completes the proof by combining with Lemma 2.
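The step that combines the Cauchy-Schwarz inequality with the independence of the Rademacher variables rests on a standard fact: for fixed vectors $x_1,\dots,x_n$ and i.i.d. Rademacher signs $\sigma_i$, Jensen's inequality and $\mathbb{E}[\sigma_i\sigma_j]=0$ for $i\neq j$ give $\mathbb{E}\,\|\sum_i \sigma_i x_i\|_2 \le \sqrt{\sum_i \|x_i\|_2^2}$. The sketch below checks this generic inequality exactly by enumerating all sign patterns for a few hypothetical vectors; it is not the specific quantity bounded in the proof.

```python
import itertools
import math

def expected_norm(xs):
    """Exact E || sum_i sigma_i x_i ||_2 over all 2^n Rademacher sign patterns."""
    n, d = len(xs), len(xs[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        s = [sum(sig * x[j] for sig, x in zip(signs, xs)) for j in range(d)]
        total += math.sqrt(sum(v * v for v in s))
    return total / 2 ** n

# Hypothetical fixed vectors (e.g. rows of a feature matrix).
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.0, -1.0], [0.3, 0.7]]

lhs = expected_norm(xs)                              # E || sum_i sigma_i x_i ||
rhs = math.sqrt(sum(v * v for x in xs for v in x))   # sqrt(sum_i ||x_i||^2)
```

The cross terms vanish in expectation, which is why only the squared norms of the individual vectors appear on the right-hand side.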
References
 [1] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
 [2] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in IJCNN, 2005.
 [3] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017.
 [4] J. Sun, W. Guo, D. Zhang, Y. Zhang, F. Regol, Y. Hu, H. Guo, R. Tang, H. Yuan, X. He, and M. Coates, “A framework for recommending accurate and diverse items using bayesian graph convolutional neural networks,” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
 [5] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatiallyaware graph neural networks for relational behavior forecasting from sensor data,” 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9491–9497, 2020.
 [6] J. M. Stokes, K. Yang, K. Swanson, W. Jin, and J. J. Collins, “A deep learning approach to antibiotic discovery,” Cell, vol. 180, no. 4, pp. 688–702.e13, 2020.
 [7] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in AAAI, 2019.
 [8] T. Kipf and M. Welling, “Semisupervised classification with graph convolutional networks,” ICLR, 2017.
 [9] J. Zhu, “Maxmargin nonparametric latent feature models for link prediction,” in ICML, 2012.
 [10] S. Ivanov and L. Prokhorenkova, “Boost then convolve: Gradient boosting meets graph neural networks,” in ICLR, 2021.

 [11] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018.
 [12] K. Sun, Z. Lin, and Z. Zhu, “Adagcn: Adaboosting graph convolutional networks into deep models,” in ICLR, 2021.
 [13] S. Dz̆eroski and B. Z̆enko, “Is combining classifiers with stacking better than selecting the best one?” Machine Learning, vol. 54, pp. 255–273, 2004.
 [14] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
 [15] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” in ICLR, 2019.
 [16] M. Li, Z. Ma, Y. G. Wang, and X. Zhuang, “Fast haar transforms for graph neural networks,” Neural Networks, vol. 128, pp. 188–198, 2020.
 [17] A. Sandryhaila and J. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.

 [18] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
 [19] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
 [20] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016.
 [21] F. Wu, T. Zhang, A. Souza, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019.

 [22] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018.
 [23] G. Li, M. Mueller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in ICCV, 2019.
 [24] S. Luan, M. Zhao, X. W. Chang, and D. Precup, “Break the ceiling: Stronger multiscale deep graph convolutional networks,” in NIPS, 2019.
 [25] D. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp. 241–259, 1992.
 [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 6000–6010.
 [27] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014.

 [28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
 [29] W. Yin, H. Schütze, B. Xiang, and B. Zhou, “Abcnn: Attention-based convolutional neural network for modeling sentence pairs,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 259–272, 2016.
 [30] R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, “Boosting the margin: A new explanation for the effectiveness of voting methods,” in ICML, 1997.
 [31] S. Lv, “Generalization bounds for graph convolutional neural networks via rademacher complexity,” 2021.
 [32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. EliassiRad, “Collective classification in network data,” AI Mag., vol. 29, pp. 93–106, 2008.
 [33] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio’, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
 [34] J. Klicpera, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” in ICLR, 2019.
 [35] K. Pal and B. Patel, “Data classification with kfold cross validation and holdout accuracy estimation methods with 5 different machine learning techniques,” 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), pp. 83–87, 2020.
 [36] J. Friedman, “Greedy function approximation: A gradient boosting machine.” Annals of Statistics, vol. 29, pp. 1189–1232, 2001.
 [37] Y. Freund and R. E. Schapire, “A short introduction to boosting,” Journal of Japanese Society for Artificial Intelligence, vol. 14, pp. 771–780, 1999.
 [38] F. Schwenker, “Ensemble methods: Foundations and algorithms,” pp. 77–79, 2013.
 [39] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
 [40] L. V. D. Maaten and G. E. Hinton, “Visualizing data using tsne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
 [41] A. Maurer, “A vectorcontraction inequality for rademacher complexities,” in International Conference on Algorithmic Learning Theory, 2016.

 [42] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” in Conference on Computational Learning Theory and European Conference on Computational Learning Theory, 2001.
 [43] B. Gao and L. Pavel, “On the properties of the softmax function with application in game theory and reinforcement learning,” 2017.