SStaGCN: Simplified stacking based graph convolutional networks

11/16/2021
by   Jia Cai, et al.
Microsoft
Nanjing Audit University

Graph convolutional network (GCN) is a powerful model studied broadly in various graph structural data learning tasks. However, to mitigate the over-smoothing phenomenon and deal with heterogeneous graph structural data, the design of GCN models remains a crucial issue to be investigated. In this paper, we propose a novel GCN called SStaGCN (Simplified stacking based GCN), utilizing the ideas of stacking and aggregation, which is an adaptive general framework for tackling heterogeneous graph data. Specifically, we first use the base models of stacking to extract the node features of a graph. Subsequently, aggregation methods such as mean, attention, and voting techniques are employed to further enhance the node feature extraction ability. Thereafter, the node features are fed as inputs into a vanilla GCN model. Furthermore, a theoretical generalization bound analysis of the proposed model is explicitly given. Extensive experiments on 3 public citation networks and 3 heterogeneous tabular datasets demonstrate the effectiveness and efficiency of the proposed approach over state-of-the-art GCNs. Notably, the proposed SStaGCN can efficiently mitigate the over-smoothing problem of GCN.


I Introduction

Recently, learning from graph structural data has received considerable attention in many fields of artificial intelligence. Graph neural networks (GNNs), particularly graph convolutional networks (GCNs) [1, 2, 3], have shown remarkable success in learning graph structural data and have been applied to recommendation systems [4], computer vision [5], molecular design [6], natural language processing [7], node classification [8], and clustering tasks [9]. Despite their great success, almost all GCNs focus on graph data with homogeneous node embeddings or bag-of-words representations. Ivanov and Prokhorenkova [10] incorporated gradient boosted decision trees (GBDT) into GNNs and proposed the novel BGNN architecture, dealing with heterogeneous tabular data for the first time. Since real-world applications involve many other types of heterogeneous data, can we design a general framework that handles distinct types of heterogeneous data beyond the tabular case? Moreover, as laid out by [11], the graph convolution of the GCN model is a special form of Laplacian smoothing, which mixes the node features from different clusters and therefore brings the potential problem of over-smoothing [11]. Sun et al. [12] developed an RNN-like GCN by employing AdaBoost, which can extract knowledge from high-order neighbors of the current nodes. However, propagating high-order neighborhood information in this way is not equivalent to stacking graph convolutions layer by layer; already a simple two-layer example exhibits the discrepancy. Thus, this RNN-like GCN will inevitably lose information about the original graph structure. The over-smoothing problem, the ability to deal with heterogeneous data, and the working mechanisms of the GCN model all remain open.

Recall that the stacking method [13], or stacked generalization, is an approach to ensembling multiple different classifiers, which consists of base models and a meta-model. Stacking is successful at feature extraction when tackling different kinds of data, even when the data are disordered or irregular. This is because stacking has the following properties: (1) by combining multiple distinct learners as base models, effective and discernible features can be learned; (2) the base models are fitted on the whole training data and produce predictions on the test data; (3) the meta-model is used to combine these predictions into the final prediction on the test data.

Undoubtedly, we can benefit from combining stacking with the GCN model. To the best of our knowledge, there is no general framework for dealing with heterogeneous graph structural data with a GCN model. In this paper, we present a novel simplified stacking based architecture for handling graph data, SStaGCN, which combines the feature extraction ability of stacking and aggregation with GCN's ability to learn graph structure. This enables SStaGCN to inherit the advantages of both the stacking method and the graph convolutional network. Overall, the contributions of the paper are as follows:

(1). We propose a new general architecture that combines simplified stacking and GCN, which is adaptive and flexible enough to tackle heterogeneous graph structural data of different types. To the best of the authors' knowledge, this is the first systematic work that applies a modified stacking approach to general graph structural data.

(2). A generalization bound is derived to highlight the roles of stacking and aggregation from the viewpoint of learning theory.

(3). We extensively evaluate our approach against strong baselines on node prediction tasks. The experimental results indicate significant performance improvements on both homogeneous and heterogeneous node classification tasks over a variety of real-world graph structural data; moreover, the over-smoothing phenomenon is well alleviated.

The remainder of the paper is organized as follows. In Section II, we give a brief review of related work. Section III presents the proposed algorithm and its theoretical analysis. Experiments on public citation networks and three heterogeneous datasets are reported in Section IV. Discussions and concluding remarks are given in Section V. The proof of the main result is provided in the Appendix.

II Related Work

Graph Convolutional Networks

GCNs can be interpreted as extensions of traditional convolutional neural networks to the graph domain. Typically, there are two types of GCNs [14]: spatial GCNs and spectral GCNs. Spatial GCNs construct new feature vectors for each vertex from its neighborhood information, viewing convolution as a "patch operator". Spectral GCNs define convolution by decomposing a graph signal in the spectral domain and then applying a spectral filter (Fourier or wavelet, [1], [15], [16]) to the spectral components [17], [18]. However, this approach entails computing the Laplacian eigenvectors, which is expensive and impractical for large-scale graphs. Hammond et al. [19] used Chebyshev polynomials up to the $K$-th order to approximate the spectral filter. Defferrard et al. [20] constructed a $K$-localized ChebyNet. Kipf and Welling [8] considered the case $K=1$ and proposed a simple but powerful model for the semi-supervised classification task. Wu et al. [21] removed the nonlinearities and collapsed the weight matrices between consecutive layers, obtaining a simplified GCN, whilst [22, 23] considered the design of deep GCNs. Multi-scale deep GCNs were investigated in [24].

Ensemble learning based graph neural networks

GCNs may suffer from the over-smoothing problem and cannot handle heterogeneous graph data. Sun et al. [12] designed an RNN-like graph structure to extract knowledge from the high-order neighbors of the current nodes, while Ivanov and Prokhorenkova [10] incorporated GBDT into GNNs and developed BGNN to tackle heterogeneous tabular data. A natural question arises: is there a general GCN architecture that can handle various heterogeneous graph structural data and mitigate the over-smoothing issue? This paper investigates these challenges and answers the above questions.

III The proposed approach: SStaGCN

III-A Graph convolutional networks

Given an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes $v_i \in \mathcal{V}$ and edges $(v_i, v_j) \in \mathcal{E}$, denote by $A \in \mathbb{R}^{N \times N}$ the adjacency matrix with corresponding degree matrix $D_{ii} = \sum_j A_{ij}$. Obviously, for an undirected graph, $A$ is symmetric. In the conventional GCN model for the semi-supervised node classification task, the graph embedding of the nodes with two convolutional layers is described as follows:

$$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}\big), \qquad (1)$$

where $Z \in \mathbb{R}^{N \times F}$ is the final embedding matrix (output logits) of the nodes before the softmax, with $F$ the number of classes. $X \in \mathbb{R}^{N \times C}$ stands for the feature matrix, with $C$ the input dimension. $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{D}$ is the degree matrix of $\tilde{A}$ and $\tilde{A} = A + I_N$ ($I_N$ stands for the identity matrix). Moreover, $W^{(0)} \in \mathbb{R}^{C \times H}$ is the input-to-hidden weight matrix for a hidden layer with $H$ feature maps, and $W^{(1)} \in \mathbb{R}^{H \times F}$ denotes the hidden-to-output weight matrix.
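As a concrete illustration of Eq. (1), the following minimal NumPy sketch computes the normalization $\hat{A} = \tilde{D}^{-1/2}(A + I_N)\tilde{D}^{-1/2}$ and the two-layer forward pass on a toy graph; the graph, dimensions, and random weights below are illustrative assumptions, not the setup used in the experiments.

```python
# Minimal sketch of the two-layer GCN forward pass in Eq. (1):
# Z = softmax( Â ReLU( Â X W0 ) W1 ).
import numpy as np

def normalize_adjacency(A):
    """Compute Â = D̃^{-1/2} (A + I) D̃^{-1/2}, with D̃ the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A_hat, X, W0, W1):
    """Two-layer GCN: row-wise softmax of Â ReLU(Â X W0) W1."""
    H = np.maximum(A_hat @ X @ W0, 0.0)               # hidden layer with ReLU
    logits = A_hat @ H @ W1                            # output logits before softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)            # row-wise softmax

# Toy example: N = 4 nodes, C = 3 input features, H = 8 hidden units, F = 2 classes.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
W0, W1 = rng.normal(size=(3, 8)), rng.normal(size=(8, 2))
Z = gcn_forward(normalize_adjacency(A), X, W0, W1)     # (4, 2) class probabilities
```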

III-B Stacking

Stacking, as a hierarchical model integration framework, is a well-known and widely used ensemble machine learning algorithm [25]. It uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms. A traditional stacking model involves two or more base models and a meta-model that combines the predictions of the base models. The base models are different types of models fitted on the training data, whose predictions are compiled. The meta-model tries to best combine the predictions of the base models; it is often simple, providing a smooth interpretation of the predictions made by the base models. Hence, linear models are often used as the meta-model, such as linear regression for regression tasks and logistic regression for classification tasks.
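For concreteness, here is a minimal sketch of a traditional stacking model built with scikit-learn's StackingClassifier; the synthetic data and this particular choice of base learners are illustrative assumptions, not the combination adopted later in this paper.

```python
# Traditional stacking: several base classifiers, combined by a logistic
# regression meta-model trained on their out-of-fold predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
    cv=5,                                                # out-of-fold predictions for the meta-model
)
stack.fit(X, y)
print(stack.score(X, y))
```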

III-C The proposed approach

As mentioned above, GCN may mix the node features from different clusters and make them indistinguishable. Therefore, it is necessary to aggregate more node information in an effective way for better predictions. Motivated by the traditional stacking approach and the work in [12, 10], and in order to reduce the computational cost, we use only the base models of the stacking approach and then aggregate their outputs to obtain the node features of the graph data. Specifically, the proposed method can be described as follows. First, we obtain the pre-classification results from the input $X$ (node feature matrix) through $k$ base classifiers $f_1, \dots, f_k$:

$$h_i = f_i(X), \quad i = 1, \dots, k. \qquad (2)$$

Secondly, we use an aggregation method to obtain the final output, i.e.,

$$\tilde{X} = \mathrm{Agg}(h_1, \dots, h_k), \qquad (3)$$

where $\mathrm{Agg}(\cdot)$ denotes an aggregation method. The idea of an aggregation method is very simple: it groups the attribute values into a single value. Generally, we can choose the mean, attention, or voting approach; we recall these three mechanisms below.

Fig. 1: Workflow of the SStaGCN model.
0:  Input: Feature matrix $X$, normalized adjacency matrix $\hat{A}$, graph $\mathcal{G}$, base classifiers $f_i$, $i = 1, \dots, k$, aggregation method $\mathrm{Agg}$;
0:  Output: Final predictor $Z$;
1:  Attain $h_i$ ($i = 1, \dots, k$) via the base classifiers: $h_i = f_i(X)$;
2:  Aggregate: $\tilde{X} = \mathrm{Agg}(h_1, \dots, h_k)$;
3:  Feed $\tilde{X}$ into the vanilla GCN: $Z = \mathrm{softmax}(\hat{A}\,\mathrm{ReLU}(\hat{A} \tilde{X} W^{(0)}) W^{(1)})$;
4:  return $Z$;
Algorithm 1 SStaGCN.

Mean: the mean operator takes the element-wise mean of the components $h_1, \dots, h_k$.

Attention [26]: The attention mechanism has been widely used in various fields of deep learning, including but not limited to image processing [27], speech recognition [28], and natural language processing [29]. The idea of attention is motivated by the attention mechanism of human beings. Denote by $h_i$ ($i = 1, \dots, k$) the outputs of the base classifiers, and let the query vector $q$ be the label of the data. Then we compute the attention coefficients between $q$ and $h_i$ as follows:

$$\alpha_i = \frac{\exp\big(\mathrm{sim}(q, h_i)\big)}{\sum_{j=1}^{k} \exp\big(\mathrm{sim}(q, h_j)\big)}, \qquad (4)$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity. Finally, the input $\tilde{X}$ of the graph convolutional layer is aggregated by considering the following sum with attention scores:

$$\tilde{X} = \sum_{i=1}^{k} \alpha_i h_i. \qquad (5)$$
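The following NumPy sketch mirrors Eqs. (4)-(5) as reconstructed above: cosine similarity between the label vector and each base-classifier output, softmax-normalized into attention coefficients, followed by the weighted sum. The numeric encoding of labels and predictions is an illustrative assumption.

```python
# Attention aggregation: cosine similarity scores -> softmax -> weighted sum.
import numpy as np

def attention_aggregate(preds, labels):
    """preds: (k, n) array of base-classifier outputs; labels: (n,) label vector."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = np.array([cosine(labels, h) for h in preds])   # similarity per classifier
    alpha = np.exp(scores) / np.exp(scores).sum()            # softmax attention coefficients
    return alpha @ preds                                      # weighted sum over classifiers

preds = np.array([[0., 1., 2., 1.],      # classifier 1
                  [0., 1., 2., 2.],      # classifier 2
                  [1., 1., 2., 1.]])     # classifier 3
labels = np.array([0., 1., 2., 1.])
x_tilde = attention_aggregate(preds, labels)  # aggregated node features
```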

Voting [30]: Voting is the most intuitive of all ensemble learning methods; it aims at selecting one or more winners. In this work, our objective is to choose the most popular prediction among the results of the base classifiers, so we take the majority voting approach. If several categories receive the same number of votes, one of them is selected at random.
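A small sketch of this hard-voting rule, including the random tie-break, could look as follows (the toy predictions are illustrative).

```python
# Majority voting over base-classifier predictions with a random tie-break.
import numpy as np

def majority_vote(preds, rng=None):
    """preds: (k, n) integer class predictions from k base classifiers."""
    rng = rng or np.random.default_rng()
    out = np.empty(preds.shape[1], dtype=int)
    for j in range(preds.shape[1]):
        values, counts = np.unique(preds[:, j], return_counts=True)
        winners = values[counts == counts.max()]   # classes tied for the top count
        out[j] = rng.choice(winners)               # random tie-break
    return out

preds = np.array([[0, 1, 2, 1],
                  [0, 2, 2, 1],
                  [1, 2, 2, 0]])
print(majority_vote(preds, np.random.default_rng(0)))
```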

Thus, by elegantly combining stacking, aggregation, and the vanilla GCN, we attain a novel GCN model for heterogeneous graph data, namely SStaGCN. The first layer of SStaGCN utilizes the base models of the stacking approach, and the second layer uses an aggregation method such as mean, attention, or voting, which enhances the feature extraction ability of conventional GCN models. The aggregated data are then used as the input of the conventional GCN model, from which we obtain the final prediction results. The workflow of the proposed model is summarized in Algorithm 1 and Fig. 1.

III-D Generalization Bound Analysis

In this section, we give a theoretical generalization analysis of the proposed approach. In our analysis, we assume that the adjacency matrix and the node feature matrix are both fixed.

Theoretically, in learning theory, the risk of a predictor $f$ over the unknown population distribution $\mathcal{D}$ is measured by

$$R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\ell(f(x), y)\big],$$

where $\ell$ is the loss function defined as a map $\ell: \mathbb{R}^{F} \times \mathbb{R}^{F} \to \mathbb{R}_{+}$. Given training data and the adjacency matrix $A$, the objective is to estimate the parameters $W^{(0)}, W^{(1)}$ of model (1) based on the empirical data. Concretely, we attempt to minimize, over some function class, the empirical risk functional

$$\hat{R}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell\big(f(\tilde{x}_i), y_i\big),$$

where $\{(\tilde{x}_i, y_i)\}_{i=1}^{m}$ is the labelled sample obtained from the original training data via stacking and aggregation. Typically, the clustering algorithms will only produce discernible nodes; it is straightforward to check that stacking and the three aggregation methods proposed in this paper do not violate this constraint. We are now in a position to present the theoretical generalization bound analysis.

Theorem 1

Suppose the node features and the weight matrices $W^{(0)}$, $W^{(1)}$ are bounded. Denote by $d_i$ the number of neighbors of node $i$, where $i$ ranges over the set of node indices with observed labels, and let $f$ be any given predictor from a class of GCNs with one hidden layer. Assume that the loss function is Lipschitz continuous with Lipschitz constant $L$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the gap between the population risk and the empirical risk is bounded by a term whose dominant part depends linearly on $\max_i d_i$ and on the Frobenius-norm bounds $\|W^{(0)}\|_F$, $\|W^{(1)}\|_F$ of the weight matrices, plus a confidence term of order $\sqrt{\ln(1/\delta)}$, where $\|\cdot\|_F$ denotes the Frobenius norm.

Remark 1

Theorem 1 indicates that the dominant term of the upper bound depends linearly on the maximum number of neighbors of the labelled nodes and on the bounds of the weight matrices, which in turn depend strongly on the feature dimension. Obviously, a smaller feature dimension yields a tighter generalization bound. In the binary classification case, the result stated here is similar to the one outlined in [31].

IV Experiments

IV-A Datasets

To evaluate the performance of the proposed SStaGCN model on distinct types of graph structural data, we utilize real-world datasets for the semi-supervised node classification task, including 3 commonly used citation networks, Cora, CiteSeer, and Pubmed [32], and 3 heterogeneous datasets, Houseclass, VKclass, and DBLP [10]. As indicated in [10], Houseclass and VKclass are derived from the House and VK datasets, respectively, where the target labels are converted into several discrete classes due to the lack of publicly available heterogeneous graph-structured data.

In the citation networks, nodes represent documents, and (undirected) edges stand for the citation relationships between documents. The features of nodes are representative words in the documents, and the label rate denotes the percentage of node labels used for training. The Cora dataset contains 2708 nodes, 5429 edges, 7 classes, and 1433 node features; the CiteSeer dataset contains 3327 nodes, 4732 edges, 6 classes, and 3703 node features; and the Pubmed dataset contains 19717 nodes, 44338 edges, and 3 classes. We select 140, 120, and 60 nodes of the Cora, CiteSeer, and Pubmed datasets, respectively, for training. Each dataset uses 1000 nodes for testing and 500 nodes for cross-validation. The data splitting we used is the same as that of GCN, Graph Attention Network (GAT, [33]), and GWNN [15]. Details about the citation networks (heterogeneous networks resp.) are described in Table I (Table II resp.)

Dataset Cora CiteSeer Pubmed
Nodes 2708 3327 19717
Edges 5429 4732 44338
Features 1433 3703 500
Classes 7 6 3
Label Rate 5.2% 3.6% 0.3%
TABLE I: Summary of the citation networks.
Dataset Houseclass VKclass DBLP
Nodes 20640 54028 14475
Edges 182146 213644 40269
Features 6 14 5002
Classes 5 7 4
Min Target Nodes 0.14 13.48 745
Max Target Nodes 5.00 118.39 1197
TABLE II: Summary of the heterogeneous data.

IV-B Baselines

We compare SStaGCN with classical graph convolutional networks: ChebyNet, GCN, GAT, and APPNP [34], and ensemble learning based GCNs: AdaGCN and BGNN models.

IV-C Setting

As demonstrated in Fig. 1, the proposed SStaGCN model contains four layers, where the first and second layers are called the feature extraction layers. The first layer is based on the base models of the stacking method. Here we consider the combination of 7 classical classifiers, KNN, Random Forest, Naive Bayes, Decision Tree, SVC [35], GBDT [36], and AdaBoost [37], in the first layer of SStaGCN. These are representative classical classifiers in the machine learning community, each with its own merits in dealing with distinct types of tasks. Moreover, we adopt three aggregation methods in the second layer: mean, attention, and voting. For the mean approach, we take the mean value of the outputs of the first layer and then round it. For the attention mechanism, we take the label data and the predicted values of the feature extraction layer and compute the attention coefficients between them as in (4). For the voting approach, we employ the hard voting technique from ensemble learning [38].

Thereafter, the output of the second layer is taken as the input of the first layer of the GCN, which has only two layers in our setting. The GCN considered in this paper has a fixed number of hidden units, the Adam optimizer [39] is the default optimizer, and cross entropy is used as the loss function; the learning rate, number of iterations, weight decay, and dropout rate are fixed across experiments.
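A hedged sketch of such a training setup in PyTorch is given below; the hidden size, learning rate, weight decay, dropout rate, and number of epochs are placeholder values, since the exact settings are not reproduced here.

```python
# Two-layer GCN trained with Adam and cross-entropy on pre-normalized inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes, dropout=0.5):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w1 = nn.Linear(hidden_dim, num_classes, bias=False)
        self.dropout = dropout

    def forward(self, A_hat, X):
        h = F.relu(A_hat @ self.w0(X))
        h = F.dropout(h, p=self.dropout, training=self.training)
        return A_hat @ self.w1(h)          # logits; softmax is folded into the loss

def train(model, A_hat, X, y, train_mask, epochs=200, lr=0.01, weight_decay=5e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        logits = model(A_hat, X)
        loss = F.cross_entropy(logits[train_mask], y[train_mask])
        loss.backward()
        opt.step()
    return model

# Toy usage on a 4-node graph with random features (illustrative only).
A_hat = torch.eye(4)                                  # stand-in for the normalized adjacency
X = torch.randn(4, 3)
y = torch.tensor([0, 1, 1, 0])
mask = torch.tensor([True, True, False, False])
model = train(TwoLayerGCN(3, 16, 2), A_hat, X, y, mask, epochs=5)
```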

IV-D Results

Method Cora CiteSeer Pubmed
ChebyNet 81.20 69.80±0.00 74.40±0.00
GCN 81.50±0.00 70.30±0.00 79.00±0.00
GAT 83.00±0.70 72.50±0.70 79.00±0.30
APPNP 85.09±0.25 75.73±0.30 79.73±0.31
AdaGCN 85.97±0.20 76.68±0.20 79.95±0.21
BGNN 41.97±0.19 30.74±0.10 10.32±0.10
SimStacking 43.19±2.05 62.7±0.53 87.7±0.23
SStaGCN (Mean) 90.35±0.20 86.40±0.12 82.30±0.19
SStaGCN (Attention) 91.60±0.18 87.20±0.12 82.40±0.23
SStaGCN (Voting) 93.10±0.16 88.70±0.14 92.07±0.20
TABLE III: Average accuracy (%) on citation networks over repeated runs; confidence intervals computed via bootstrap.
Method Houseclass VKclass DBLP
ChebyNet 54.74±0.10 57.19±0.36 32.14±0.00
GCN 55.07±0.13 56.40±0.09 39.49±1.37
GAT 56.50±0.22 56.42±0.19 76.83±0.78
APPNP 57.03±0.27 56.72±0.11 79.47±1.46
AdaGCN 26.20±0.00 46.00±0.00 10.06±0.00
BGNN 66.7±0.27 66.32±0.20 86.94±0.74
SimStacking 53.89±0.29 56.64±0.10 71.58±0.64
SStaGCN (Mean) 72.35±0.05 66.62±0.17 82.31±0.20
SStaGCN (Attention) 72.40±0.12 77.64±0.08 82.51±0.22
SStaGCN (Voting) 76.13±0.12 87.92±0.07 92.60±0.10
TABLE IV: Average accuracy (%) on heterogeneous datasets over repeated runs; confidence intervals computed via bootstrap.
Method Cora CiteSeer Pubmed
ChebyNet 77.99±0.54 63.76±0.34 77.74±0.42
GCN 82.89±0.30 70.65±0.37 78.83±0.32
GAT 83.59±0.25 70.62±0.29 77.77±0.40
APPNP 84.29±0.22 71.05±0.38 79.66±0.31
AdaGCN 79.55±0.19 63.62±0.19 78.55±0.21
BGNN 40.81±0.25 32.73±0.13 8.46±0.08
SimStacking 44.02±1.61 60.86±0.56 87.31±0.13
SStaGCN (Mean) 90.66±0.18 86.42±0.12 82.30±0.19
SStaGCN (Attention) 91.69±0.14 87.24±0.14 82.45±0.23
SStaGCN (Voting) 92.76±0.16 88.73±0.14 92.07±0.20
TABLE V: Average F1-score (macro) on citation networks over repeated runs; confidence intervals computed via bootstrap.
Method Houseclass VKclass DBLP
ChebyNet 31.34±0.12 57.44±0.27 26.84±0.62
GCN 54.95±0.14 56.52±0.09 38.5±0.97
GAT 56.54±0.68 56.41±0.07 77.1±1.86
APPNP 57.88±0.32 56.61±0.07 79.34±0.23
AdaGCN 25.01±0.00 37.03±0.00 9.60±0.00
BGNN 66.48±0.22 66.18±0.11 87.2±0.60
SimStacking 53.32±0.15 56.11±0.08 71.49±0.31
SStaGCN (Mean) 72.23±0.04 66.74±0.21 82.13±0.38
SStaGCN (Attention) 72.36±0.09 77.62±0.10 82.68±0.12
SStaGCN (Voting) 75.45±0.82 87.84±0.04 92.64±0.06
TABLE VI: Average F1-score (macro) on heterogeneous datasets over repeated runs; confidence intervals computed via bootstrap.
Models Cora CiteSeer Pubmed Houseclass VKclass DBLP
ChebyNet 2.59e-06 2.77e-08 1.17e-06 3.51e-10 1.11e-11 6.97e-09
GCN 4.19e-16 5.15e-17 1.84e-15 2.36e-08 2.25e-11 2.48e-07
GAT 2.10e-19 1.54e-20 8.84e-19 3.35e-08 1.38e-09 4.95e-06
APPNP 6.86e-42 7.12e-42 6.25e-41 7.05e-15 1.39e-21 2.26e-09
AdaGCN 1.82e-19 1.23e-20 8.42e-19 3.05e-38 7.94e-43 1.69e-35
BGNN 1.02e-24 2.33e-21 4.74e-22 6.54e-07 1.07e-08 0.11e-03
SimStacking 5.61e-12 3.79e-15 2.52e-09 4.56e-08 6.07e-11 1.07e-06
TABLE VII: p-values of the paired t-test of SStaGCN (Voting) against the competitors on the different datasets (Cora, CiteSeer, Pubmed, Houseclass, VKclass, and DBLP).
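For reference, p-values of this kind can be obtained with a paired t-test over per-run scores, e.g. via scipy.stats.ttest_rel; the accuracy arrays in the sketch below are made-up numbers, not the values behind Table VII.

```python
# Paired t-test between per-run accuracies of two methods on the same splits.
from scipy import stats

sstagcn_acc = [93.1, 92.8, 93.4, 93.0, 92.9]   # hypothetical per-run accuracies
gcn_acc     = [81.4, 81.6, 81.5, 81.3, 81.7]

t_stat, p_value = stats.ttest_rel(sstagcn_acc, gcn_acc)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.2e}")
```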

The results of the comparative evaluation for node classification are summarized in Tables III-XI, where SStaGCN (Mean) indicates that the mean mechanism is utilized in the second layer of SStaGCN, and SStaGCN (Attention) and SStaGCN (Voting) have analogous meanings. We report the accuracy, F1-score (macro), and training time on the test set for the proposed SStaGCN model and the other methods. The experimental results demonstrate significant improvements of the SStaGCN model over the baselines. Specifically, on the public citation networks, SStaGCN (Voting) clearly improves the accuracy (resp. F1-score) over the strongest baseline on the Cora, CiteSeer, and Pubmed datasets. On the heterogeneous datasets, SStaGCN (Voting) likewise improves the accuracy (resp. F1-score) on Houseclass, VKclass, and DBLP. AdaGCN performs poorly on the heterogeneous datasets. This may be because AdaGCN aims at designing deep GCNs, which can mix the node features of different clusters as the GCN layers go deeper. Overall, SStaGCN enhances the performance of GCN and provides better qualitative results for distinct types of graph structural data.

This impressive improvement can be explained as follows:

(1). The feature extraction step of SStaGCN achieves a dimensionality reduction effect and makes the graph data more discernible. For instance, the feature dimension of the Cora dataset is greatly reduced after conducting feature extraction, which means we attain a relatively small dimension as discussed in Remark 1, and this greatly improves the prediction ability and computational efficiency of the subsequent graph convolution model.

(2). In the aggregation step of the SStaGCN model, the mean and attention mechanisms destroy the pre-classification results to some extent, which is inappropriate for feature extraction, while the voting mechanism does not. Accordingly, the experimental results demonstrate that SStaGCN (Voting) is more effective on our datasets.

(3). Simplified stacking can extract effective node features but ignores the graph structure information, while the GCN model is weak at extracting node features. Hence, the SStaGCN model inherits the merits of both simplified stacking and GCN: it not only achieves higher classification accuracy but also reduces the computation time.

Tables VIII and IX compare the training time of SStaGCN and the other methods. We can see that BGNN runs fastest on the citation networks, followed by our SStaGCN method, whilst the proposed SStaGCN runs faster on the heterogeneous datasets except for the DBLP dataset. We believe the reason is that the feature extraction step takes extra computation time, but it makes the subsequent GCN more efficient once the extracted features are fed into it.

Method Cora CiteSeer Pubmed
ChebyNet 22.74±1.24 30.87±1.21 124.99±1.77
GCN 13.41±0.16 99.21±0.98 55.61±0.73
GAT 20.98±0.46 30.74±1.40 126.33±1.80
APPNP 203.75±0.15 55.40±0.40 457.62±12.77
AdaGCN 772.26±83.56 2129.02±148.97 2098.10±275.88
BGNN 1.33±0.00 2.40±0.00 2.54±0.00
SimStacking 11.9±0.08 27.9±0.24 79.1±1.40
SStaGCN (Mean) 10.9±0.13 17.2±0.13 89.6±2.20
SStaGCN (Attention) 11.2±0.24 17.6±0.41 87.6±0.96
SStaGCN (Voting) 16.2±0.19 29.6±1.40 13.1±2.61
TABLE VIII: Average training time (s) on citation networks; confidence intervals computed via bootstrap.
Method Houseclass VKclass DBLP
ChebyNet 833.82±0.00 1394.68±0.00 8890.74±0.00
GCN 46.06±0.80 120.1±3.35 268.5±5.35
GAT 197.6±3.08 410.9±9.95 205.3±2.00
APPNP 129.7±3.26 383.8±12.02 176.8±5.58
AdaGCN 607.31±0.00 511.41±0.00 590.90±0.00
BGNN 26.37±1.65 93.47±6.23 50.25±3.04
SimStacking 16.41±0.36 43.58±2.78 380.8±12.82
SStaGCN (Mean) 52.94±0.51 132.9±1.61 188.7±0.54
SStaGCN (Attention) 57.38±1.75 133.2±1.11 246.2±0.37
SStaGCN (Voting) 59.69±0.82 154.1±1.02 310.7±0.49
TABLE IX: Average training time (s) on heterogeneous datasets; confidence intervals computed via bootstrap.

To illustrate the effect of the feature extraction step of the proposed model, we provide a visualization with t-SNE [40], as shown in Figs. 2 and 3. The figures indicate that the combination of stacking and aggregation extracts the node features well and makes the graph data more discriminative. Moreover, the paired t-test in Table VII demonstrates that the proposed SStaGCN model is significantly different from simplified stacking and the other GCN models.
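A brief sketch of such a t-SNE visualization with scikit-learn and matplotlib is shown below; the random feature matrix and labels are stand-ins for the actual node features and classes.

```python
# Project node features to 2-D with t-SNE and color the points by class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 64))          # stand-in node embeddings
labels = rng.integers(0, 6, size=300)          # stand-in class labels

emb2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(emb2d[:, 0], emb2d[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of node features (illustrative)")
plt.show()
```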

Fig. 2: Visualization of the classification features of GCN (a) and of the features after the feature extraction step of the SStaGCN model (b) on the CiteSeer dataset; node colors denote classes.
Fig. 3: Visualization of the classification features of GCN (a) and of the features after the feature extraction step of the SStaGCN model (b) on the DBLP dataset; node colors denote classes.

Table X indicates that we do not need all seven classifiers. Taking the Cora dataset as an example, we observe that KNN, Random Forest, and Naive Bayes form the best combination, which attains the highest accuracy without much cost in computation time (only about 16.6 seconds). This demonstrates that the classifiers have their own merits in handling different specific tasks.

KNN Random Forest Naive Bayes Decision Tree GBDT Adaboost SVC Accuracy Training Time
91.2 13.90
93.6 16.60
84.2 567.9
92.9 144.7
93.1 149.5
92.8 15.90
93.4 18.80
92.9 568.3
92.9 570.7
92.8 708.5
TABLE X: Accuracy and training time(s) on Cora dataset by combinations of different classifiers based on SStaGCN model.

Furthermore, to demonstrate the effect of simplified stacking on the over-smoothing problem, we also conduct an experiment on over-smoothing. In Fig. 4, we can observe that the conventional GCN may mix the features of vertices from different clusters when the number of GCN layers increases. However, as demonstrated in Fig. 5 and Table XI (here the number of layers does not include the layers of the feature extraction part of SStaGCN), the proposed SStaGCN can effectively ameliorate the over-smoothing phenomenon and improve the accuracy.

Fig. 4: Visualization of the final classification features of GCN on the Cora dataset with 2, 3, 4, 5, 6, and 7 layers; node colors denote classes.
Fig. 5: Visualization of the final classification features of SStaGCN on the Cora dataset with 2, 3, 4, 5, 6, and 7 layers; node colors denote classes.
Method 2-layer 3-layer 4-layer 5-layer 6-layer 7-layer
GCN 80.5 80.4 75.8 71.9 72.6 60.8
SStaGCN 93.3 88.8 87.5 86.4 84.8 84.3
TABLE XI: Accuracy comparison between GCN and SStaGCN models on Cora dataset using distinct number of layers.

Overall, these experiments demonstrate the superiority of SStaGCN model over competitors.

IV-E Visualization

Fig. 6: Visualization of the final classification features of (a) GCN, (b) AdaGCN, (c) BGNN, and (d) SStaGCN on the CiteSeer dataset; node colors denote classes.
Fig. 7: Visualization of the final classification features of (a) GCN, (b) AdaGCN, (c) BGNN, and (d) SStaGCN on the DBLP dataset; node colors denote classes.

To further demonstrate the performance of SStaGCN, we plot the final classification features produced by GCN, AdaGCN, BGNN, and our SStaGCN. Fig. 6 (resp. Fig. 7) displays the final classification features of the relevant methods on the CiteSeer (resp. DBLP) dataset. From Figs. 6 and 7, we observe that relatively few points are misclassified by the proposed SStaGCN, whilst many classes are wrongly predicted by GCN, AdaGCN, and BGNN.

V Conclusion

Traditional GCNs cannot deal well with heterogeneous graph structural data. In this work, we propose a novel GCN architecture, namely SStaGCN. SStaGCN first takes advantage of the stacking method and an aggregation technique to attain pre-classified data features, and then utilizes a GCN to conduct prediction for heterogeneous graph data. Our approach can effectively explore and exploit node features of heterogeneous graph data in a stacking manner. Our work paves a way towards combining classical machine learning methods with the design of GCN models, proposes a general framework for handling distinct types of graph structural data, and gives insights towards a better understanding of GCN. Extensive experiments demonstrate that the proposed model is superior to state-of-the-art competitors in terms of accuracy, F1-score, and training time. The proposed framework could be generalized to the regression setting. Furthermore, we believe the proposed method can tackle various distinct types of heterogeneous graph data, although our experiments are conducted on tabular data. A promising future research direction is to investigate deeper GCNs in our setting, as discussed by [12].

VI Appendix

In this part, we give the detailed proof of Theorem 1. Before presenting the proof, we give several lemmas. First, let us state the contraction inequality for Rademacher complexities in vector form.

Lemma 1

[41] Let $\mathcal{X}$ be any set, let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \ell_2$, and let $h_i: \ell_2 \to \mathbb{R}$ have Lipschitz constant $L$. Then

$$\mathbb{E}\,\sup_{f \in \mathcal{F}} \sum_{i} \epsilon_i\, h_i(f(x_i)) \;\le\; \sqrt{2}\, L\; \mathbb{E}\,\sup_{f \in \mathcal{F}} \sum_{i,k} \epsilon_{ik}\, f_k(x_i),$$

where $\{\epsilon_{ik}\}$ is an independent doubly indexed Rademacher sequence and $f_k(x_i)$ is the $k$-th component of $f(x_i)$.

Lemma 2

[42] Consider a loss function $\ell$ and the induced class $\mathcal{G} = \{\ell(f(\cdot), \cdot) : f \in \mathcal{F}\}$, and let the sample $\{(x_i, y_i)\}_{i=1}^{n}$ be drawn independently according to the probability measure $P$. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the expected loss of every $g \in \mathcal{G}$ is bounded by its empirical average plus a Rademacher complexity term of $\mathcal{G}$ and a confidence term of order $\sqrt{\ln(1/\delta)/n}$.

We first give a lemma which plays an essential role in the proof of Theorem 1.

Lemma 3

Let for each node , then

Proof 1

Denote as the sub-matrix of whose row and column indices belong to the set . Let be the feature matrix of the nodes in (subgraph of ). Hence,

where is the th row of the matrix with column index belong to the set . Notice that

and . Therefore,

Now we are in a position to give the proof of Theorem 1.

Proof 2

To allow a slight abuse of notations, we will use to denote due to the explanation on page 3. Denote , with . Applying Proposition 4 in [43] to the case , we can attain the Lipschitz constant for standard softmax function is . Let stands for the th row of the matrix , for function set

the empirical Rademacher complexity is defined as

where is an i.i.d. family of Rademacher variables independent of . By the contraction property of Rademacher complexity,

and notice Lemma 2, we only need to bound . Therefore, we have the following estimate by utilizing Lemma 1.

where , notice the property of inner product, the above estimate can be further bounded as

the last inequality follows by the property that . Now the key point is how to estimate the term . We will employ the idea introduced in [31] (in the proof of Theorem 1) to remove the “sup” term. Let , , , with ,, and notice that , then we have

By the definition of Frobenius norm , the supremum of the above quantity under the constraint must be obtained when for some , and for all . Hence

Let be the th neighbor number of node (, ). Recall , with , therefore

Applying the conclusion and contraction property of Rademacher complexity again, we have

Therefore, we only need to estimate the term

Applying the Cauchy-Schwarz inequality yields that

where the last inequality is due to the i.i.d. condition on the Rademacher sequences. Plugging the conclusion of Lemma 3 into the above term leads to

and

This completes the proof by combining with Lemma 2.

References

  • [1] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
  • [2] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in IJCNN, 2005.
  • [3] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017.
  • [4] J. Sun, W. Guo, D. Zhang, Y. Zhang, F. Regol, Y. Hu, H. Guo, R. Tang, H. Yuan, X. He, and M. Coates, “A framework for recommending accurate and diverse items using bayesian graph convolutional neural networks,” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
  • [5] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9491–9497, 2020.
  • [6] J. M. Stokes, K. Yang, K. Swanson, W. Jin, and J. J. Collins, “A deep learning approach to antibiotic discovery,” Cell, vol. 180, no. 4, pp. 688–702.e13, 2020.
  • [7] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in AAAI, 2019.
  • [8] T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” ICLR, 2017.
  • [9] J. Zhu, “Max-margin nonparametric latent feature models for link prediction,” in ICML, 2012.
  • [10] S. Ivanov and L. Prokhorenkova, “Boost then convolve: Gradient boosting meets graph neural networks,” in ICLR, 2021.
  • [11] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018.
  • [12] K. Sun, Z. Lin, and Z. Zhu, “Adagcn: Adaboosting graph convolutional networks into deep models,” in ICLR, 2021.
  • [13] S. Džeroski and B. Ženko, “Is combining classifiers with stacking better than selecting the best one?” Machine Learning, vol. 54, pp. 255–273, 2004.
  • [14] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
  • [15] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” in ICLR, 2019.
  • [16] M. Li, Z. Ma, Y. G. Wang, and X. Zhuang, “Fast haar transforms for graph neural networks,” Neural Networks, vol. 128, pp. 188–198, 2020.
  • [17] A. Sandryhaila and J. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
  • [18] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
  • [19] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
  • [20] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016.
  • [21] F. Wu, T. Zhang, A. Souza, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019.
  • [22] Q. Li, Z. Han, and X. M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018.
  • [23] G. Li, M. Mueller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in ICCV, 2019.
  • [24] S. Luan, M. Zhao, X. W. Chang, and D. Precup, “Break the ceiling: Stronger multi-scale deep graph convolutional networks,” in NIPS, 2019.
  • [25] D. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp. 241–259, 1992.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 6000–6010.
  • [27] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014.
  • [28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
  • [29] W. Yin, H. Schütze, B. Xiang, and B. Zhou, “Abcnn: Attention-based convolutional neural network for modeling sentence pairs,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 259–272, 2016.
  • [30] R. Schapire, Y. Freund, P. Barlett, and W. S. Lee, “Boosting the margin: A new explanation for the effectiveness of voting methods,” in ICML, 1997.
  • [31] S. Lv, “Generalization bounds for graph convolutional neural networks via rademacher complexity,” 2021.
  • [32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, “Collective classification in network data,” AI Mag., vol. 29, pp. 93–106, 2008.
  • [33] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio’, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
  • [34] J. Klicpera, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” in ICLR, 2019.
  • [35] K. Pal and B. Patel, “Data classification with k-fold cross validation and holdout accuracy estimation methods with 5 different machine learning techniques,” 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), pp. 83–87, 2020.
  • [36] J. Friedman, “Greedy function approximation: A gradient boosting machine.” Annals of Statistics, vol. 29, pp. 1189–1232, 2001.
  • [37] Y. Freund and R. E. Schapire, “A short introduction to boosting,” Journal of Japanese Society for Artificial Intelligence, vol. 14, pp. 771–780, 1999.
  • [38] F. Schwenker, “Ensemble methods: Foundations and algorithms,” pp. 77–79, 2013.
  • [39] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [40] L. V. D. Maaten and G. E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
  • [41] A. Maurer, “A vector-contraction inequality for rademacher complexities,” in International Conference on Algorithmic Learning Theory, 2016.
  • [42] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” in Computational Learning Theory and European Conference on Computational Learning Theory, 2001.
  • [43] B. Gao and L. Pavel, “On the properties of the softmax function with application in game theory and reinforcement learning,” 2017.