Recently, learning from graph-structured data has received considerable attention in many fields of artificial intelligence. Graph neural networks (GNNs), particularly graph convolutional networks (GCNs) [1, 2, 3], have shown remarkable success in learning from graph-structured data and have been applied to recommendation systems [4, 5], molecular design [6, 7], node classification [8], and clustering tasks [9]. Despite this success, almost all GCNs focus on graph data with homogeneous node embeddings or bag-of-words representations. Ivanov and Prokhorenkova [10] incorporated gradient boosted decision trees (GBDT) into a GNN and proposed the BGNN architecture, which was the first to deal with heterogeneous tabular data. Real-world applications, however, involve many types of heterogeneous data. Can we design a general framework that handles distinct types of heterogeneous data beyond heterogeneous tabular data? Moreover, as laid out in [11], the graph convolution of the GCN model is a special form of Laplacian smoothing, which mixes the node features from different clusters and therefore brings the potential problem of over-smoothing. Sun et al. [12] developed an RNN-like GCN by employing AdaBoost, which can extract knowledge from high-order neighbors of the current nodes. However, does not imply , one apparent simple example is considering the situation (two layer), and
Thus, this RNN-like GCN will lose information about the original graph structure. The over-smoothing problem, the ability to deal with heterogeneous data, and the working mechanisms of the GCN model remain open questions.
Recall that the stacking method [13], or stacked generalization, is an ensemble approach that combines multiple different classifiers and consists of base models and a meta-model. Stacking is successful at extracting features from different kinds of data, even when the data are disordered or irregular. This is because stacking has the following properties: (1) by combining multiple distinct learners in the base models, effective discernible features can be learned; (2) the base models are fitted on the whole training data to compute the performance on the test data; (3) the meta-model is used to make a prediction on the test data.
Combining stacking with the GCN model is therefore a natural way to benefit from both. To the best of our knowledge, there is no general framework for dealing with heterogeneous graph structural data with a GCN model. In this paper, we present a novel simplified stacking based architecture for handling graph data, SStaGCN, which combines the feature extraction ability of stacking and aggregation with GCN’s ability to learn graph structure. This enables SStaGCN to inherit the advantages of both the stacking method and the graph convolutional network. Overall, the contributions of the paper are as follows:
(1). We propose a new general architecture that combines simplified stacking and GCN, and that is adaptive and flexible enough to tackle heterogeneous graph structural data of different types. To the best of the authors’ knowledge, this is the first systematic work that applies a modified stacking approach to general graph structural data.
(2). A generalization bound is established to highlight the role of stacking and aggregation from the viewpoint of learning theory.
(3). We extensively evaluate our approach against strong baselines on node prediction tasks. The experimental results indicate significant performance improvements on both homogeneous and heterogeneous node classification tasks over a variety of real-world graph structural data, and show that the over-smoothing phenomenon can be well alleviated.
The remainder of the paper is organized as follows. In Section II, we give a brief review of related work. Section III presents the proposed algorithm and its theoretical analysis. Experiments on public citation networks and other heterogeneous datasets are presented in Section IV. Discussion and concluding remarks are given in Section V. The proof of the main result is provided in the Appendix.
II Related Work
Graph Convolutional Networks
GCNs can be interpreted as extensions of traditional convolutional neural networks to the graph domain. Typically, there are two types of GCNs [14]: spatial GCNs and spectral GCNs. Spatial GCNs construct new feature vectors for each vertex using its neighborhood information, where convolution is viewed as a “patch operator”. Spectral GCNs define the convolution by decomposing a graph signal on the spectral domain and then applying a spectral filter (Fourier or wavelet [15, 16]) to the spectral components [17, 18]. However, this model requires computing the Laplacian eigenvectors, which is computationally expensive and impractical for large-scale graphs. Hammond et al. [19] used Chebyshev polynomials up to the $K$th order to approximate the spectral filter. Defferrard et al. [20] constructed a $K$-localized ChebyNet. Kipf and Welling [8] considered the case $K=1$ and proposed a simple but powerful model for the semi-supervised classification task. Wu et al. [21] removed nonlinearities and collapsed the weight matrices between consecutive layers to obtain a simplified GCN, whilst [22, 23] considered the design of deep GCNs. Multi-scale deep GCNs were investigated in [24].
Ensemble learning based graph neural networks
GCNs may suffer from the over-smoothing problem and cannot handle heterogeneous graph data. Sun et al. [12] designed an RNN-like graph structure to extract knowledge from high-order neighbors of the current nodes, while Ivanov and Prokhorenkova [10] incorporated GBDT into a GNN and developed BGNN to tackle heterogeneous tabular data. A natural question arises: is there a general GCN architecture that handles various heterogeneous graph structural data and mitigates the over-smoothing issue? This paper investigates these challenges and aims to answer this question.
III The Proposed Approach: SStaGCN
III-A Graph Convolutional Networks
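Given an undirected graph $G=(V,E)$ with nodes $v_i \in V$ and edges $(v_i, v_j) \in E$, denote by $A \in \mathbb{R}^{N \times N}$ the adjacency matrix, with corresponding degree matrix $D_{ii} = \sum_j A_{ij}$. For an undirected graph, $A = A^\top$. In the conventional GCN models for the semi-supervised node classification task, the graph embedding of nodes with two convolutional layers is described as follows:
$$Z = \hat{A}\,\mathrm{ReLU}\big(\hat{A} X W^{(0)}\big) W^{(1)}. \tag{1}$$
Here $Z \in \mathbb{R}^{N \times F}$ is the final embedding matrix (output logits) of nodes before softmax, with $F$ the number of classes. $X \in \mathbb{R}^{N \times C}$ stands for the feature matrix, with $C$ the input dimension. $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{D}$ is the degree matrix of $\tilde{A}$ and $\tilde{A} = A + I$ ($I$ stands for the identity matrix). Moreover, $W^{(0)} \in \mathbb{R}^{C \times H}$ is the input-to-hidden weight matrix for a hidden layer with $H$ feature maps, and $W^{(1)} \in \mathbb{R}^{H \times F}$ denotes the hidden-to-output weight matrix.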
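The two-layer propagation described above can be sketched in plain Python. This is only a minimal illustration: it assumes the ReLU activation of the Kipf and Welling model, and the helper names as well as the toy graph, features, and weights (`adj`, `x`, `w0`, `w1`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2}, with D~ the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

def relu(m):
    """Apply max(0, .) element-wise."""
    return [[max(0.0, v) for v in row] for row in m]

def gcn_forward(adj, x, w0, w1):
    """Two-layer GCN logits: A_hat * ReLU(A_hat * X * W0) * W1 (before softmax)."""
    a_hat = normalize_adjacency(adj)
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    return matmul(a_hat, matmul(hidden, w1))

# Hypothetical toy input: path graph 0 - 1 - 2, two input features,
# two hidden units, two classes.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[1.0, -1.0], [0.5, 1.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]
logits = gcn_forward(adj, x, w0, w1)  # one row of class logits per node
```

Adding the identity before normalizing gives every node a self-loop, so a node's own features take part in the average over its neighborhood.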
III-B Stacking
Stacking, as a hierarchical model integration framework, is a well-known and widely used ensemble machine learning algorithm [25]. It uses a meta-learning algorithm to learn how to best combine the predictions of two or more base machine learning algorithms. A traditional stacking model involves two or more base models and a meta-model that combines the predictions of the base models. The base models use different types of models to fit the training data and compile predictions. The meta-model, which is often simple, tries to best combine the predictions of the base models and provides a smooth interpretation of them. Hence, linear models are often used as the meta-model, such as linear regression for regression tasks and logistic regression for classification tasks.
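To make the base-model and meta-model roles concrete, the following is a minimal sketch of traditional stacking in plain Python. Everything in it is an illustrative assumption rather than the setup used in this paper: the `Stump` base model, the `LogisticMeta` meta-model, and the toy data are hypothetical, and the meta-model is trained on in-sample base predictions for brevity, whereas stacking implementations commonly use out-of-fold predictions.

```python
import math

class Stump:
    """Toy base model: predicts class 1 when one feature exceeds a learned threshold."""

    def __init__(self, feature):
        self.feature = feature
        self.threshold = 0.0

    def fit(self, xs, ys):
        # Use the observed feature value with the fewest training errors as threshold.
        def errors(t):
            return sum(int(x[self.feature] > t) != y for x, y in zip(xs, ys))
        self.threshold = min(sorted({x[self.feature] for x in xs}), key=errors)
        return self

    def predict(self, xs):
        return [int(x[self.feature] > self.threshold) for x in xs]


class LogisticMeta:
    """Meta-model: logistic regression on base predictions, trained by plain SGD."""

    def __init__(self, lr=0.5, epochs=500):
        self.lr, self.epochs = lr, epochs
        self.w, self.b = [], 0.0

    def _logit(self, z):
        return sum(wi * zi for wi, zi in zip(self.w, z)) + self.b

    def fit(self, zs, ys):
        self.w, self.b = [0.0] * len(zs[0]), 0.0
        for _ in range(self.epochs):
            for z, y in zip(zs, ys):
                # Gradient of the logistic loss with respect to the logit.
                grad = 1.0 / (1.0 + math.exp(-self._logit(z))) - y
                self.w = [wi - self.lr * grad * zi for wi, zi in zip(self.w, z)]
                self.b -= self.lr * grad
        return self

    def predict(self, zs):
        return [int(self._logit(z) > 0.0) for z in zs]


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noisy.
xs = [[0.1, 0.2], [0.2, 0.9], [0.3, 0.1], [0.7, 0.8], [0.8, 0.3], [0.9, 0.7]]
ys = [0, 0, 0, 1, 1, 1]

base_models = [Stump(feature=0).fit(xs, ys), Stump(feature=1).fit(xs, ys)]
# One row per sample, one column per base model.
base_outputs = [list(row) for row in zip(*(m.predict(xs) for m in base_models))]
meta_model = LogisticMeta().fit(base_outputs, ys)
stacked_predictions = meta_model.predict(base_outputs)
```

The meta-model only sees the base predictions, so it learns how much to trust each base model rather than relearning the raw features.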
III-C The Proposed Approach
As mentioned above, GCN may mix the node features from different clusters and make them indistinguishable. Therefore, it is necessary to aggregate more node information in an effective way for better predictions. Motivated by the traditional stacking approach and the work in [12, 10], and to reduce the computational cost, we use only the base models of the stacking approach and then aggregate their outputs to obtain the node features of the graph data. Specifically, the proposed method can be described as follows. First, in the first layer, we attain by input (node feature matrix) through
Thereafter, we get the pre-classification results through (). Second, we use an aggregation method to attain the final output results, i.e.,
where denotes an aggregation method. The idea of aggregation is simple: it groups attribute values into a single value. Generally, we can choose the mean, attention, or voting approach. Recall:
Mean: the mean operator takes the element-wise mean of the components .
Attention [26]: the attention mechanism has been widely used in various fields of deep learning, including but not limited to image processing, speech recognition, and natural language processing. The idea is motivated by the attention mechanism of human beings. Denote the query vector as the output of the base classifiers, and let the key vector be the label of the data. Then we compute the attention coefficients between and as follows:
where denotes cosine similarity. Finally, the input of the graph convolutional layer is aggregated by considering the following sum with attention scores
Voting [30]: the voting method is the most intuitive of all ensemble learning methods and aims at selecting one or more winners. In this work, our objective is to choose the most popular prediction among the results of the base classifiers, so we take the majority voting approach. If several categories receive the same number of votes, one of them is selected at random.
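The three aggregation options can be sketched as follows. This is an illustrative reading of the descriptions above, not the exact formulas of the paper: the softmax normalization of the cosine scores, the round-half-up rule for the mean, and all function names and toy values are assumptions.

```python
import math
import random
from collections import Counter

def mean_aggregate(predictions):
    """Element-wise mean over base classifiers, rounded half up to a class index."""
    k = len(predictions)
    return [math.floor(sum(votes) / k + 0.5) for votes in zip(*predictions)]

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attention_weights(predictions, labels):
    """Softmax over the cosine similarity of each prediction vector with the labels."""
    scores = [math.exp(cosine(p, labels)) for p in predictions]
    total = sum(scores)
    return [s / total for s in scores]

def attention_aggregate(predictions, labels):
    """Attention-weighted sum of the base classifiers' prediction vectors."""
    weights = attention_weights(predictions, labels)
    return [sum(w * v for w, v in zip(weights, votes)) for votes in zip(*predictions)]

def vote_aggregate(predictions, rng=random):
    """Majority vote per node; ties are broken uniformly at random."""
    result = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        winners = sorted(label for label, c in counts.items() if c == best)
        result.append(winners[0] if len(winners) == 1 else rng.choice(winners))
    return result

# Hypothetical outputs of three base classifiers on four nodes, plus labels.
predictions = [[0, 1, 2, 1], [0, 1, 1, 1], [2, 1, 2, 0]]
labels = [0, 1, 2, 1]

mean_out = mean_aggregate(predictions)
attention_out = attention_aggregate(predictions, labels)
vote_out = vote_aggregate(predictions)
```

On the first node the classifiers predict classes 0, 0, and 2: voting keeps class 0, while the rounded mean returns class 1, which no classifier predicted. Averaging class indices can therefore distort the pre-classification result, whereas voting always returns one of the predicted classes.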
Thus, we obtain a novel GCN model, SStaGCN, that deals with heterogeneous graph data by combining stacking, aggregation, and the vanilla GCN. The first layer of SStaGCN uses the base models of the stacking approach, and the second layer uses an aggregation method such as mean, attention, or voting, which enhances the feature extraction ability of conventional GCN models. The aggregated data are then used as the input of a conventional GCN model to obtain the final prediction results. The workflow of the proposed model is shown in Algorithm 1 and Fig. 1.
III-D Generalization Bound Analysis
In this section, we give a theoretical generalization analysis of the proposed approach. In our analysis, we assume that the adjacency matrix and the node feature matrix are both fixed.
In learning theory, the risk of over the unknown population distribution is measured by
where is the loss function defined as a map: . Given training data and adjacency matrix
, the objective is to estimate the parameters of model (1) based on empirical data. Concretely, we attempt to minimize the empirical risk functional over some function class , which takes the form
where is the labelled sample obtained from the original training data via stacking and aggregation. Typically, the clustering algorithms will only produce discernible nodes. Hence, if . It is easy to see that stacking and the three aggregation methods proposed in this paper do not violate the constraints, i.e., . We are now in a position to present the generalization bound.
Theorem 1. Suppose , , . Denote the number of neighbors of node (the set of node indices with observed labels), let , be any given predictor of a class of GCNs with one hidden layer. Assume that the loss function is Lipschitz continuous with Lipschitz constant . Then, for any , with probability at least , we have
where is the Frobenius norm, with .
Remark 1. Theorem 1 indicates that the dominant term of the upper bound depends linearly on the maximum number of neighbors of the nodes , and on the bounds of the weights and , which strongly depend on the dimension index . Obviously, will yield a tight generalization bound. When (binary classification case), the result stated here is similar to the one outlined in [31].
IV Experiments
To evaluate the performance of the proposed SStaGCN model on distinct types of graph structural data, we use real-world datasets for the semi-supervised node classification task, including the public citation networks Cora, CiteSeer, and Pubmed [32], and other heterogeneous datasets: Houseclass, VKclass, and DBLP. As indicated in [10], Houseclass and VKclass are derived from the House and VK datasets, respectively, where the target labels are converted into several discrete classes due to the lack of publicly available heterogeneous graph-structured data.
In the citation networks, nodes represent documents, and (undirected) edges stand for the citation relationships between documents. The node features are representative words in the documents, and the label rate denotes the percentage of labelled nodes used for training. The Cora dataset contains nodes, edges, classes, and node features; the CiteSeer dataset contains nodes, edges, classes, and node features; and the Pubmed dataset contains nodes, edges, and classes. We select , , and nodes for the Cora, CiteSeer, and Pubmed datasets for training, respectively. Each dataset uses nodes for testing and nodes for cross-validation. The data split is the same as that of GCN, Graph Attention Network (GAT, [33]), and GWNN [15]. Details of the citation networks and of the heterogeneous networks are given in Table I and Table II, respectively.
|Statistic||Houseclass||VKclass||DBLP|
|Min Target Nodes||0.14||13.48||745|
|Max Target Nodes||5.00||118.39||1197|
We compare SStaGCN with classical graph convolutional networks (ChebyNet, GCN, GAT, and APPNP [34]) and with ensemble learning based GCNs (the AdaGCN and BGNN models).
As shown in Fig. 1, the proposed SStaGCN model contains four layers, where the first and second layers are called the feature extraction layers. The first layer is based on the base models of the stacking method. Here we consider the combination of KNN, Random Forest, Naive Bayes, Decision Tree, SVC [35], GBDT [36], and Adaboost [37] in the first layer of SStaGCN. These are representative classical classifiers in the machine learning community, and each has its own merits in dealing with distinct types of tasks. Moreover, we adopt three aggregation methods in the second layer: mean, attention, and voting. For the mean approach, we take the mean value of the output of the first layer and then round it. For the attention mechanism, we consider the label data as the key vector , the predicted value of the feature extraction layer as the query vector , and then compute the attention coefficients. For the voting approach, we employ the hard voting technique of ensemble learning [38].
Thereafter, the output of the second layer is used as the input of the first layer of the GCN, which has only two layers in our setting. The GCN considered in this paper has hidden units; Adam [39] is the default optimizer, and cross entropy is used as the loss function. We set the learning rate to , the number of iterations to , the weight decay to , and the dropout rate to .
Table VII: p-values of the paired t-test of SStaGCN (Voting) against competitors on different datasets (Cora, CiteSeer, Pubmed, Houseclass, VKclass, and DBLP).
The results of the comparative evaluation for node classification are summarized in Tables III-XI, where SStaGCN (Mean) means that the mean mechanism is used in the second layer of SStaGCN; SStaGCN (Attention) and SStaGCN (Voting) are defined similarly. We report the accuracy, F1-score (macro), and training time on the test set for the proposed SStaGCN model and the other methods. The experimental results demonstrate a significant improvement of the SStaGCN model over the baselines. Specifically, for public citation networks, SStaGCN (Voting) achieves almost (), (), and () improvement in accuracy (resp. F1-score) on the Cora, CiteSeer, and Pubmed datasets, respectively. For heterogeneous datasets, SStaGCN (Voting) obtains nearly (), (), and () improvement in accuracy (resp. F1-score) on the Houseclass, VKclass, and DBLP datasets, respectively. AdaGCN performs poorly on the heterogeneous datasets. This may be because AdaGCN aims at designing deep GCNs, which may mix the node features of different clusters as the layers go deeper. Overall, SStaGCN enhances the performance of GCN and provides better results for distinct types of graph structural data.
This impressive improvement can be explained as follows:
(1). The feature extraction step of SStaGCN achieves a dimensionality reduction effect and makes the graph data more discernible. For instance, the size of the Cora dataset reduces to from after feature extraction, which means we attain a relatively smaller as discussed in Remark 1, and this greatly improves the prediction ability and computational efficiency of the subsequent graph convolution model.
(2). In the aggregation step of the SStaGCN model, the mean and attention mechanisms destroy the pre-classification results to some extent, which is inappropriate for feature extraction, while the voting mechanism does not. This is consistent with the experimental results, in which SStaGCN (Voting) is the most effective on our datasets.
(3). Simplified stacking can extract effective node features but ignores the graph structure information, while the GCN model is weak at extracting node features. Hence, the SStaGCN model inherits the merits of both simplified stacking and GCN: it not only achieves higher classification accuracy but also reduces the computation time.
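For reference, the two quality metrics reported in the tables, accuracy and macro F1-score, can be computed as in the sketch below. The label vectors are hypothetical and serve only to show the arithmetic; they are not results from the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical true and predicted classes for six test nodes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)   # 4 of 6 correct
f1 = macro_f1(y_true, y_pred)    # mean of per-class F1 = (0.5 + 0.8 + 2/3) / 3
```

Macro averaging gives every class the same weight regardless of its size, which matters on datasets with imbalanced classes.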
Tables VIII and IX compare the training time of SStaGCN with that of the other methods. BGNN runs fastest on the citation networks, followed by our SStaGCN method, whilst the proposed SStaGCN runs fastest on the heterogeneous datasets, except on the DBLP dataset. We believe the reason is that the feature extraction step takes extra computation time but yields greater efficiency once the extracted features are fed into the GCN model.
To illustrate the effect of the feature extraction step of the proposed model, we provide a t-SNE [40] visualization in Figs. 2(a) and 3(a). These figures indicate that the combination of stacking and aggregation extracts the node features well and makes the graph data more discriminative. Moreover, the paired t-test in Table VII shows that the results of the proposed SStaGCN model differ significantly from those of simplified stacking and the other GCN models.
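As a reminder of how a paired t-test compares two methods over repeated runs, the sketch below computes the test statistic and its degrees of freedom. The accuracy values are hypothetical placeholders, not numbers from this paper, and the p-value itself (which needs the Student t distribution) is not computed here.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical accuracies of two methods on the same five random splits.
method_a = [0.90, 0.91, 0.89, 0.92, 0.90]
method_b = [0.81, 0.83, 0.80, 0.82, 0.81]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

The test is paired because both methods are evaluated on the same splits, so only the per-split differences enter the statistic. In practice, `scipy.stats.ttest_rel` returns both the statistic and the p-value.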
Table X indicates that we do not need all seven classifiers. Taking the Cora dataset as an example, we observe that KNN, Random Forest, and Naive Bayes form the best combination, which attains the highest accuracy without much cost in computation time (only seconds). This suggests that the classifiers have their own merits in handling different specific tasks.
|KNN||Random Forest||Naive Bayes||Decision Tree||GBDT||Adaboost||SVC||Accuracy||Training Time|
Furthermore, to demonstrate the effect of simplified stacking on the over-smoothing problem, we add an experiment on over-smoothing. In Fig. 4(a), we observe that the conventional GCN may mix the features of vertices from different clusters as the number of GCN layers increases. However, as shown in Fig. 5(a) and Table XI (where the number of layers does not include the two layers of the feature extraction part of SStaGCN), the proposed SStaGCN effectively ameliorates the over-smoothing phenomenon and improves the accuracy.
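The smoothing effect behind over-smoothing can be seen in a small numerical sketch. It makes simplifying assumptions that differ from a trained GCN: propagation is purely linear (no weight matrices, no nonlinearity), and the hypothetical graph is a 4-cycle, chosen because every node has the same degree, so the symmetric normalization reduces to plain averaging.

```python
def propagate(values, neighbors):
    """One smoothing step: average each node's value with those of its neighbors."""
    return [(values[i] + sum(values[j] for j in nbrs)) / (len(nbrs) + 1)
            for i, nbrs in enumerate(neighbors)]

# 4-cycle 0 - 1 - 2 - 3 - 0; each node has degree 2, so the normalized
# adjacency with self-loops equals (A + I) / 3.
neighbors = [[1, 3], [0, 2], [1, 3], [0, 2]]
features = [1.0, 0.0, 0.0, 0.0]  # one scalar feature per node

spreads = []
for _ in range(8):
    features = propagate(features, neighbors)
    # Gap between the largest and smallest node feature after this step.
    spreads.append(max(features) - min(features))
```

On this graph the gap between node features shrinks by a factor of three per step, so after a few propagation steps the nodes are nearly indistinguishable, which is the behavior that deeper stacks of graph convolutions risk amplifying.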
Overall, these experiments demonstrate the superiority of SStaGCN model over competitors.
To further demonstrate the performance of SStaGCN, we plot the final classification features obtained by GCN, AdaGCN, BGNN, and our SStaGCN. Fig. 6(a) (Fig. 7(a) resp.) displays the final classification features of these methods on the CiteSeer (DBLP resp.) dataset. From Figs. 6(a) and 7(a), we observe that relatively few points are misclassified by the proposed SStaGCN, whilst many points are wrongly classified by GCN, AdaGCN, and BGNN.
Traditional GCNs cannot deal well with heterogeneous graph structural data. In this work, we propose a novel GCN architecture, namely SStaGCN. SStaGCN first takes advantage of the stacking method and an aggregation technique to obtain pre-classified data features, and then uses a GCN to make predictions for heterogeneous graph data. In this way, SStaGCN can effectively explore and exploit node features of heterogeneous graph data. Our work paves a way towards better combining classical machine learning methods with GCN models, proposes a general framework for handling distinct types of graph structural data, and may give insights into a better understanding of GCN. Extensive experiments demonstrate that the proposed model is superior to state-of-the-art competitors in terms of accuracy, F1-score, and training time. The proposed framework could be generalized to the regression setting. Furthermore, we believe the proposed method can tackle various distinct types of heterogeneous graph data, although our experiments are conducted on tabular data. A promising future research direction is to investigate deeper GCNs in our setting as discussed by .
In this part, we give the detailed proof of Theorem 1. Before the proof, we state several lemmas. First, we present the contraction inequality for Rademacher complexity in vector form.
Lemma 1 [41]. Let be any set, and be a class of functions and let have Lipschitz constant . Then
where is an independent doubly indexed Rademacher sequence and is the th component of .
Lemma 2 [42]. Consider a loss function . Denote , and let be independently selected according to the probability measure . Then for any , with probability at least ,
We first give a lemma which plays an essential role in the proof of Theorem 1.
Lemma 3. Let for each node , then
Denote as the sub-matrix of whose row and column indices belong to the set . Let be the feature matrix of the nodes in (subgraph of ). Hence,
where is the th row of the matrix with column indices belonging to the set . Notice that
and . Therefore,
We are now in a position to give the proof of Theorem 1.
With a slight abuse of notation, we will use to denote due to the explanation on page 3. Denote , with . Applying Proposition 4 in to the case , we obtain that the Lipschitz constant of the standard softmax function is . Let stand for the th row of the matrix ; for the function set
the empirical Rademacher complexity is defined as
where is an i.i.d. family of Rademacher variables independent of . By the contraction property of Rademacher complexity,
where , notice the property of inner product, the above estimate can be further bounded as
the last inequality follows by the property that . Now the key point is how to estimate the term . We will employ the idea introduced in  (in the proof of Theorem 1) to remove the “sup” term. Let , , , with ,, and notice that , then we have
By the definition of Frobenius norm , the supremum of the above quantity under the constraint must be obtained when for some , and for all . Hence
Let be the th neighbor number of node (, ). Recall , with , therefore
Applying the conclusion and contraction property of Rademacher complexity again, we have
Therefore, we only need to estimate the term
Applying the Cauchy-Schwarz inequality yields that
where the last inequality is due to the i.i.d. condition of the Rademacher sequences. Plugging the conclusion of Lemma 3 into the above term leads to
This completes the proof by combining with Lemma 2.
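As a numerical illustration of the quantity bounded in the proof, the sketch below computes the empirical Rademacher complexity of a small finite function class exactly, by enumerating all sign vectors. It uses the standard scalar-valued definition (the expectation over signs of the supremum of the signed average), not the vector-valued class analyzed above, and the function values are hypothetical.

```python
from itertools import product

def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite function class.

    function_values[k][i] is the value of the k-th function on the i-th sample.
    """
    m = len(function_values[0])
    total = 0.0
    # Average over all 2^m equally likely Rademacher sign vectors.
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * v for s, v in zip(sigma, values))
                     for values in function_values) / m
    return total / 2 ** m

# Hypothetical class with two functions evaluated on two samples.
symmetric_class = [[1, 1], [-1, -1]]
single_function = [[1, 1]]
r_symmetric = empirical_rademacher(symmetric_class)
r_single = empirical_rademacher(single_function)
```

A class that contains a function together with its negation can follow the random signs and therefore has positive complexity, while a single fixed function cannot, so its complexity is zero. Enumeration is only feasible for very small samples; larger cases would need Monte Carlo sampling of the sign vectors.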
- [1] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral networks and locally connected networks on graphs,” in ICLR, 2014.
- [2] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in IJCNN, 2005.
- [3] W. L. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017.
- [4] J. Sun, W. Guo, D. Zhang, Y. Zhang, F. Regol, Y. Hu, H. Guo, R. Tang, H. Yuan, X. He, and M. Coates, “A framework for recommending accurate and diverse items using bayesian graph convolutional neural networks,” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
- [5] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9491–9497, 2020.
- [6] J. M. Stokes, K. Yang, K. Swanson, W. Jin, and J. J. Collins, “A deep learning approach to antibiotic discovery,” Cell, vol. 180, no. 4, pp. 688–702.e13, 2020.
- [7] L. Yao, C. Mao, and Y. Luo, “Graph convolutional networks for text classification,” in AAAI, 2019.
- [8] T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
- [9] J. Zhu, “Max-margin nonparametric latent feature models for link prediction,” in ICML, 2012.
- [10] S. Ivanov and L. Prokhorenkova, “Boost then convolve: Gradient boosting meets graph neural networks,” in ICLR, 2021.
- [11] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018.
- [12] K. Sun, Z. Lin, and Z. Zhu, “Adagcn: Adaboosting graph convolutional networks into deep models,” in ICLR, 2021.
- [13] S. Dz̆eroski and B. Z̆enko, “Is combining classifiers with stacking better than selecting the best one?” Machine Learning, vol. 54, pp. 255–273, 2004.
- [14] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
- [15] B. Xu, H. Shen, Q. Cao, Y. Qiu, and X. Cheng, “Graph wavelet neural network,” in ICLR, 2019.
- [16] M. Li, Z. Ma, Y. G. Wang, and X. Zhuang, “Fast haar transforms for graph neural networks,” Neural Networks, vol. 128, pp. 188–198, 2020.
- [17] A. Sandryhaila and J. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
- [18] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
- [19] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
- [20] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in NIPS, 2016.
- [21] F. Wu, T. Zhang, A. Souza, C. Fifty, T. Yu, and K. Q. Weinberger, “Simplifying graph convolutional networks,” in ICML, 2019.
- [22] Q. Li, Z. Han, and X. M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in AAAI, 2018.
- [23] G. Li, M. Mueller, A. Thabet, and B. Ghanem, “Deepgcns: Can gcns go as deep as cnns?” in ICCV, 2019.
- [24] S. Luan, M. Zhao, X. W. Chang, and D. Precup, “Break the ceiling: Stronger multi-scale deep graph convolutional networks,” in NIPS, 2019.
- [25] D. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp. 241–259, 1992.
- [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 6000–6010.
- [27] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014.
- [28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
- [29] W. Yin, H. Schütze, B. Xiang, and B. Zhou, “Abcnn: Attention-based convolutional neural network for modeling sentence pairs,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 259–272, 2016.
- [30] R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, “Boosting the margin: A new explanation for the effectiveness of voting methods,” in ICML, 1997.
- [31] S. Lv, “Generalization bounds for graph convolutional neural networks via rademacher complexity,” 2021.
- [32] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, “Collective classification in network data,” AI Mag., vol. 29, pp. 93–106, 2008.
- [33] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
- [34] J. Klicpera, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” in ICLR, 2019.
- [35] K. Pal and B. Patel, “Data classification with k-fold cross validation and holdout accuracy estimation methods with 5 different machine learning techniques,” 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), pp. 83–87, 2020.
- [36] J. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, pp. 1189–1232, 2001.
- [37] Y. Freund and R. E. Schapire, “A short introduction to boosting,” Journal of Japanese Society for Artificial Intelligence, vol. 14, pp. 771–780, 1999.
- [38] F. Schwenker, “Ensemble methods: Foundations and algorithms,” pp. 77–79, 2013.
- [39] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- [40] L. V. D. Maaten and G. E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
- [41] A. Maurer, “A vector-contraction inequality for rademacher complexities,” in International Conference on Algorithmic Learning Theory, 2016.
- [42] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” in Conference on Computational Learning Theory and European Conference on Computational Learning Theory, 2001.