1 Introduction
With great expressive power, graphs have been employed to represent a wide range of systems across various areas, including social networks (Kipf and Welling, 2016; Hamilton et al., 2017), physical systems (Battaglia et al., 2016; Sanchez-Gonzalez et al., 2018), protein-protein interaction networks (Fout et al., 2017) and knowledge graphs (Hamaguchi et al., 2017). Recently, research on analyzing graphs with machine learning has received more and more attention, mainly focusing on node classification (Kipf and Welling, 2016), link prediction (Zhu et al., 2016) and clustering tasks (Fortunato, 2010).

Graph convolution can be regarded as the extension of standard convolution from the Euclidean to the non-Euclidean domain. Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016)
generalize convolutional neural networks (CNNs) to graph-structured data from the perspective of spectral theory, building on prior works (Bruna et al., 2013; Defferrard et al., 2016). GCNs naturally integrate the connectivity patterns and feature attributes of graph-structured data, and it has been demonstrated that GCNs and their variants (Hamilton et al., 2017; Velickovic et al., 2017; Dai et al., 2018; Chen and Zhu, 2017) significantly outperform traditional multi-layer perceptron (MLP) models and traditional graph embedding approaches (Tang et al., 2015; Perozzi et al., 2014; Grover and Leskovec, 2016).

Nevertheless, it is well known that deep neural networks heavily depend on a large amount of labeled data. This requirement of large-scale data might not be met in many real scenarios for graphs with sparsely labeled nodes. GCNs and their variants are mainly established in the semi-supervised setting, where the graph usually has relatively plentiful labeled data. However, to the best of our knowledge, there is hardly any work on graphs in the weakly supervised setting (Zhou, 2017), especially learning a classification model with few examples from each class. In addition, GCNs usually have shallow architectures due to their intrinsic limitations (Li et al., 2018), thereby restricting the efficient propagation of label signals. To address this issue, Li et al. (2018) proposed Co-Training and Self-Training to enlarge the training set in a boosting-like manner. Although these methods can partially improve the performance of GCNs with few labeled data, it is difficult to pick a single consistently effective algorithm for real applications, since these methods perform inconsistently across distinct training sizes.
On the other hand, a recent surge of interest has focused on self-supervised learning, a popular form of unsupervised learning, which uses pretext tasks to replace human-annotated labels with "pseudo-labels" computed directly from the raw input data. On the basis of the analysis above, there are two main issues worth exploring further. Firstly, since it is hard to change the innately shallow architectures of GCNs, how can we design a consistently effective training algorithm based on GCNs to improve their generalization performance on graphs with few labeled nodes? Secondly, how can we leverage the advantage of self-supervised learning approaches, based on a large amount of unlabeled data, to refine the performance of the proposed training algorithm?
In this paper, we first analyze the Symmetric Laplacian Smoothing (Li et al., 2018) of GCNs and show that this intrinsic property dictates the shallow architectures of GCNs, thus restricting their generalization performance with only few labeled data due to the inefficient propagation of label information. Then we show the layer effect of GCNs on graphs with few labeled nodes: to maintain the best generalization, GCNs require more layers when fewer labeled data are available, in order to propagate the weak label signals more broadly. Further, to overcome the inefficient propagation of label information under the shallow architectures of GCNs, we first propose a more general training algorithm of GCNs based on Self-Training (Li et al., 2018), called the Multi-Stage Training Framework. Additionally, we apply DeepCluster (Caron et al., 2018), a popular method of self-supervised learning, to the graph embedding process of GCNs and design a novel aligning mechanism on clusters to construct classification pseudo-labels for each unlabeled node in the embedding space. Next we incorporate the DeepCluster approach and the aligning mechanism into the Multi-Stage Training Framework in an elegant way and formally propose the Multi-Stage Self-Supervised (M3S) Training Algorithm. Extensive experiments demonstrate that our M3S approach is superior to other state-of-the-art approaches across all the considered graph learning tasks with a limited number of labeled nodes. In summary, the contributions of the paper are listed below:

- We first probe the existence of the Layer Effect of GCNs on graphs with few labeled nodes, revealing that GCNs require more layers to maintain performance at lower label rates.

- We propose an efficient training algorithm called M3S, combining the Multi-Stage Training Framework and the DeepCluster approach. It exhibits state-of-the-art performance on graphs with low label rates.

- Our M3S Training Algorithm in fact provides a more general framework that leverages self-supervised learning approaches to improve multi-stage training, yielding efficient algorithms for learning tasks with only few labeled data.
2 Our Approach
Before introducing our M3S training algorithm, we first elaborate on the issue of inefficient propagation of information from limited labeled data, which stems from the symmetric Laplacian smoothing at the heart of GCNs and forms the motivation of our work. Then a Multi-Stage Training Framework and the DeepCluster approach are introduced, respectively, composing the basic components of our M3S algorithm. Finally, we formally present the Multi-Stage Self-Supervised (M3S) Training Algorithm in detail, a novel and efficient training method of GCNs for graphs with few labeled nodes.
2.1 Symmetric Laplacian Smoothing of Graph Convolutional Networks
In the GCN model (Kipf and Welling, 2016) for semi-supervised classification, the graph embedding of nodes with two convolutional layers is formulated as:

$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$ (1)

where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, $\tilde{A} = A + I_N$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. $X$ and $A$ denote the feature and adjacency matrices, respectively. $W^{(0)}$ is the input-to-hidden weight matrix and $W^{(1)}$ is the hidden-to-output weight matrix.
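To make the propagation rule concrete, here is a minimal NumPy sketch of this two-layer forward pass; the function names and the dense-matrix formulation are our own simplifying assumptions (practical implementations use sparse operations):

```python
import numpy as np

def normalize_adjacency(A):
    """Compute A_hat = D~^{-1/2} (A + I) D~^{-1/2}, the renormalized adjacency."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_two_layer(A, X, W0, W1):
    """Two-layer GCN forward pass: softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)       # first layer + ReLU
    logits = A_hat @ H @ W1                   # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # row-wise softmax
```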
Related work (Li et al., 2018) pointed out that the reason why GCNs work lies in the Symmetric Laplacian Smoothing of this spectral convolution, which is the key to the huge performance gain. We summarize it as follows:
$\hat{y}_i = \frac{1}{d_i + 1} x_i + \sum_{j=1}^{n} \frac{a_{ij}}{\sqrt{(d_i + 1)(d_j + 1)}}\, x_j$ (2)

where $\hat{y}_i$ is the first-layer embedding of node $i$ computed from the input features, $a_{ij}$ is the $(i,j)$ entry of the adjacency matrix $A$ and $d_i$ is the degree of node $i$. Its corresponding matrix formulation is as follows:

$\hat{Y} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X = \hat{A} X$ (3)
where $\hat{Y}$ is the one-layer embedding matrix of the feature matrix $X$. In addition, Li et al. (2018) showed that by repeatedly applying Laplacian smoothing many times, the embeddings of the vertices will eventually converge to values proportional to the square root of the vertex degrees, thus restricting the stacking of many convolutional layers.
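This convergence is easy to verify numerically. The following sketch, assuming a toy path graph and an iteration count of our choosing, applies the smoothing operator repeatedly and compares the result with the degree-based limit:

```python
import numpy as np

# Toy 4-node path graph; any connected graph shows the same behavior.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_tilde = A + np.eye(4)
d = A_tilde.sum(axis=1)                  # augmented degrees d_i + 1
A_hat = np.diag(d ** -0.5) @ A_tilde @ np.diag(d ** -0.5)

x = np.random.rand(4)                    # arbitrary initial feature
for _ in range(200):                     # repeated Laplacian smoothing
    x = A_hat @ x

print(x / np.sqrt(d))                    # entries become nearly identical:
                                         # x converges to c * sqrt(d_i + 1)
```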
In this case, a shallow GCN cannot sufficiently propagate the label information to the entire graph when only a few labels are available, yielding the unsatisfying performance of GCNs on graphs with few labeled nodes. To tackle this deficiency, we propose an effective training algorithm based on GCNs for the setting with only a small number of labels, dispensing with the inconsistent performance of the four algorithms proposed in (Li et al., 2018).
On the other hand, as shown in Section 3.1, the number of graph convolutional layers required for the best performance differs across label rates. Concretely, the lower the label rate of a graph, the more graph convolutional layers are required for more efficient propagation of label information.
2.2 Multi-Stage Training Framework
Inspired by the Self-Training algorithm proposed by Li et al. (2018), which works by adding the most confident predictions of each class to the label set, we propose a more general Multi-Stage Training Framework, described in Algorithm 1.
In contrast with original Self-Training, which explores the most confident nodes and adds them with predicted virtual labels only once, the Multi-Stage Training Algorithm executes this process K times. On graphs with limited labels, this framework repeatedly adds more confident labeled data and facilitates the propagation of label information, resulting in better performance compared with the original approaches.
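A minimal Python sketch of this framework follows; the helpers train_gcn and predict_proba are hypothetical placeholders for a standard GCN training loop and its softmax outputs, not the authors' implementation:

```python
def multi_stage_training(X, A, labeled, unlabeled, labels, K, t):
    """Multi-Stage Training sketch: for K stages, train a GCN, then move
    the t most confident unlabeled nodes per class into the labeled set."""
    for _ in range(K):
        model = train_gcn(X, A, labeled, labels)   # hypothetical helper
        proba = predict_proba(model, X, A)         # (n_nodes, n_classes)
        for c in range(proba.shape[1]):
            # Most confident unlabeled nodes for class c.
            top = sorted(unlabeled, key=lambda v: -proba[v, c])[:t]
            for v in top:
                labels[v] = c                      # assign virtual label
                labeled.add(v)
                unlabeled.discard(v)
    return train_gcn(X, A, labeled, labels)
```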
Nevertheless, the core of the Multi-Stage Training Framework lies in the accuracy of the nodes selected with virtual labels based on confidence, and thus it is natural to incorporate a self-checking mechanism that can guarantee the precision of the chosen labeled data.
2.3 DeepCluster
Recently, self-supervised learning (Doersch et al., 2015), a popular form of unsupervised learning, has shown its power in the field of computer vision; it utilizes pretext tasks to replace human-annotated labels with "pseudo-labels". A neat and effective approach of self-supervised learning is DeepCluster (Caron et al., 2018), which takes a set of embedding vectors produced by a ConvNet $f_\theta$ as input and groups them into $k$ distinct clusters based on a geometric criterion. More concretely, DeepCluster jointly learns a $d \times k$ centroid matrix $C$ and the cluster assignment $y_n$ of each data point $x_n$ (such as an image) by solving the following problem:

$\min_{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^{N} \min_{y_n \in \{0,1\}^k,\; y_n^\top \mathbf{1}_k = 1} \left\| f_\theta(x_n) - C y_n \right\|_2^2$ (4)
Solving this problem provides a set of optimal assignments $(y_n^*)$ and a centroid matrix $C^*$. These assignments are then used as pseudo-labels. In particular, DeepCluster alternates between clustering the embedding vectors produced by the ConvNet into pseudo-labels and updating the parameters of the ConvNet by predicting these pseudo-labels.
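In practice, the clustering step amounts to running k-means on the embedding vectors. A minimal sketch using scikit-learn's KMeans, an implementation choice we assume for illustration:

```python
from sklearn.cluster import KMeans

def deepcluster_pseudolabels(embeddings, n_clusters):
    """Cluster embedding vectors with k-means; the resulting assignments
    serve as pseudo-labels, and the centroids feed the aligning mechanism."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    return km.labels_, km.cluster_centers_
```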
For the node classification task in a graph, the representation process can also be viewed as graph embedding (Zhou et al., 2018), which makes DeepCluster applicable as well. Thus, we harness the innate graph embedding property of GCNs and execute k-means on the embedding vectors to cluster all nodes into distinct categories based on embedding distance. Next, an aligning mechanism is introduced to assign the nodes in each cluster to the nearest class of the classification task in the embedding space. Finally, the obtained pseudo-labels are leveraged to construct the self-checking mechanism of the Multi-Stage Self-Supervised Algorithm, as shown in Figure 1.

Aligning Mechanism
The target of the aligning mechanism is to transform the categories in clustering to the classes in classification based on embedding distance. For each cluster $m$ in the unlabeled data after k-means, the aligning mechanism computes:

$l_m = \arg\min_{l} \left\| c_m^{u} - c_l \right\|_2^2$ (5)

where $c_l$ denotes the centroid of class $l$ in the labeled data, $c_m^{u}$ denotes the centroid of cluster $m$ in the unlabeled data, and $l_m$ represents the aligned class whose centroid is closest to $c_m^{u}$ among all class centroids in the original labeled data. Through the aligning mechanism, we are able to assign the nodes of each cluster to a specific class in classification and then construct pseudo-labels for all unlabeled nodes according to their embedding distance.
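A vectorized sketch of Eq. (5), with argument names of our own choosing:

```python
import numpy as np

def align_clusters(cluster_centroids, class_centroids):
    """Map each cluster to the labeled class with the nearest centroid in
    the embedding space. Shapes: (k, d) clusters, (c, d) classes."""
    diff = cluster_centroids[:, None, :] - class_centroids[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)     # (k, c) squared distances
    return dist2.argmin(axis=1)          # aligned class index per cluster
```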
Extension
In fact, DeepCluster is a more general and economical way of constructing a self-checking mechanism via embedding distance. The naive self-checking approach is to compare the distance of each unlabeled node to the centroids of the classes in the labeled data, since the distance between each unlabeled node and the training centroids is a more precise measure than distances to the class centroids of the unlabeled data. Indeed, when the number of clusters equals the number of unlabeled nodes, our self-checking mechanism via DeepCluster coincides with this naive approach. Considering the expensive computation of naive self-checking, DeepCluster performs more efficiently and offers flexibility in the choice of the number of clusters.
2.4 M3S Training Algorithm
In this section, we formally present our Multi-Stage Self-Supervised (M3S) Training Algorithm, a novel training method for GCNs aimed at addressing the inefficient propagation of label information on graphs with few labeled nodes. The flow chart of our approach is illustrated in Figure 1.

The crucial difference of the M3S Training Algorithm compared with Multi-Stage Training is that it additionally utilizes the information of embedding distance to check the accuracy of the nodes selected with virtual labels from confidence-based Self-Training. Specifically, the M3S Training Algorithm elegantly combines the DeepCluster self-checking mechanism with the Multi-Stage Training Framework to choose nodes with more precise virtual labels in an efficient way. We provide a detailed description of the M3S approach in Algorithm 2.
Input: feature matrix $X$, adjacency matrix $A$, labeled set $L$ and unlabeled set $U$, graph convolutional network $f_\theta$.
Output: graph embedding $Z$.
For the M3S Training Algorithm, we first train a GCN model on the initial labeled set to obtain meaningful embedding vectors. Then we perform DeepCluster on the embedding vectors of all nodes to acquire their clustering labels. Furthermore, we align the labels of each cluster based on embedding distance to obtain the pseudo-label of each unlabeled node. In the following Self-Training process, for the selected top confident nodes of each class, we perform self-checking based on the pseudo-labels to guarantee that they belong to the same class in the embedding space, then add the filtered nodes to the labeled set and execute a new stage of Self-Training, as sketched below.
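A sketch of one M3S stage, combining the confidence-based selection with the pseudo-label self-check; the variable names follow the sketches above and are assumptions, not the paper's code:

```python
def m3s_stage(proba, pseudo_labels, labeled, unlabeled, labels, t):
    """One M3S stage: per class, take the t most confident unlabeled nodes,
    keep only those whose aligned pseudo-label agrees, and label them."""
    for c in range(proba.shape[1]):
        top = sorted(unlabeled, key=lambda v: -proba[v, c])[:t]
        for v in top:
            if pseudo_labels[v] == c:    # self-checking via DeepCluster
                labels[v] = c
                labeled.add(v)
                unlabeled.discard(v)
```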
Avoiding trivial solutions
It should be noted that a categorically balanced labeled set plays an important role on graphs with low label rates. In addition, DeepCluster tends to be caught in trivial solutions, which in fact exist in various methods that jointly learn a discriminative classifier and the labels (Caron et al., 2018). Highly unbalanced assignment of data to classes is a typical trivial solution of DeepCluster, which hinders generalization performance under few supervised signals. In this paper we provide a simple remedy: enlarging the number of clusters in k-means. On the one hand, setting more clusters gives a higher probability that nodes are evenly assigned to all categories. On the other hand, it contributes to a more precise computation of embedding distance, from the perspective of the extension of the DeepCluster self-checking mechanism. These points are discussed in the experimental part.
3 Experiments
In this section we conduct extensive experiments to demonstrate the effectiveness of our proposed M3S Algorithm on graphs with few labeled nodes. For the graph datasets, we select three commonly used citation networks: CiteSeer, Cora and PubMed (Sen et al., 2008). Dataset statistics are summarized in Table 1.
Table 1: Dataset statistics.

Dataset    Nodes   Edges   Classes   Features   Label Rate
CiteSeer   3327    4732    6         3703       3.6%
Cora       2708    5429    7         1433       5.2%
PubMed     19717   44338   3         500        0.3%
As for baselines, we choose Label Propagation (LP) (Wu et al., 2012) using ParWalks; Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016); and Self-Training, Co-Training, Union and Intersection (Li et al., 2018), all based on prediction confidence. On graphs with low label rates, we compare both our Multi-Stage Training Framework and the M3S Algorithm with these state-of-the-art approaches by varying the label rate for each dataset. We report the mean accuracy of 10 runs in all result tables to make a fair comparison.
3.1 Layer Effect on Graphs with Few Labels
Before comparing our algorithms with other methods, we point out the layer effect of GCNs under different label rates: to maintain the best performance, a GCN model in a semi-supervised task with a lower label rate requires more graph convolutional layers.
Figure 2 presents empirical evidence of the layer effect on graphs with few labels. We test the performance of GCNs with different numbers of layers under distinct label rates, and it is apparent that the number of layers yielding the best performance decreases as the label rate increases.

The existence of the layer effect demonstrates the need to propagate label information by stacking more convolutional layers. In the original GCNs (Kipf and Welling, 2016), the authors argued for applying two graph convolutional layers in standard node classification tasks. However, due to the Layer Effect, a proper number of layers should be chosen, especially on graphs with low label rates. In the following experiments, we choose the best number of layers for each method to compare their best performance.
3.2 Performance of Multi-Stage Training Framework
To gain a better understanding of the advantage of the Multi-Stage Training Framework, we make an extensive comparison between the Multi-Stage Framework with different numbers of stages and the Self-Training approach under different label rates.

From Figure 3, it is easy to observe that all self-training methods outperform the original GCNs by a large margin, especially when the graph has a low label rate, which is common in real applications. In addition, Multi-Stage Training is superior to traditional Self-Training, especially when there are fewer labeled nodes, and more stages tend to bring more improvement. Nevertheless, the discrepancy between the Multi-Stage Training algorithm and the Self-Training algorithm narrows as the label rate increases. Moreover, the improvement of all self-training methods over GCNs diminishes as well with increasing label rate. As for the reason, we argue that as the labeled set grows, the accuracy of the learned GCN model also increases, while the accuracy of the nodes explored via self-training approaches that of the current GCN, so the improvement diminishes. The limited precision of nodes selected based only on prediction confidence is exactly what the M3S Training Algorithm is devoted to improving.
3.3 Performance of M3S Training Algorithm
In this section, we conduct experiments comparing the Multi-Stage Self-Training Algorithm and the M3S Training Algorithm with other state-of-the-art approaches under different label rates across the three datasets.
Experimental Setup
All results are the mean accuracy of 10 runs, and the number of clusters in DeepCluster is fixed at 200 for all datasets to avoid trivial solutions. We select the best number of layers for each label rate. In particular, the best numbers of layers on Cora and CiteSeer are 4, 3, 3, 2, 2 and 3, 3, 3, 2, 2, respectively, for the 0.5%, 1%, 2%, 3%, 4% label rates, and are fixed at 4 for the 0.03%, 0.05%, 0.1% label rates on PubMed. The number of epochs of each stage in the Multi-Stage Training Framework, M3S and the other approaches is set to 200. For all methods involving GCNs, we use the same hyperparameters as in (Kipf and Welling, 2016): a learning rate of 0.01, a dropout rate of 0.5, an L2 regularization weight of $5 \times 10^{-4}$, and 16 hidden units, without a validation set, for fair comparison (Li et al., 2018). We treat the number of stages K as a hyperparameter. For the CiteSeer and PubMed datasets we each fix a single value of K, with which our proposed algorithms already outperform the other approaches easily. For the Cora dataset we choose K as 5, 4, 4, 2, 2 as the training size increases, since a higher label rate usually matches a smaller K.
Results shown in Tables 2, 3 and 4 verify the effectiveness of our M3S Training Algorithm, which consistently outperforms other state-of-the-art approaches by a large margin over a wide range of label rates across the three datasets. More specifically, we make four observations from the results:

- The performance of GCNs declines significantly when labeled data is scarce, due to the inefficient propagation of label information. For instance, on the Cora and PubMed datasets, the performance of GCNs is even inferior to Label Propagation (LP) when the training size is relatively small.

- Previous state-of-the-art algorithms, namely Co-Training, Self-Training, Union and Intersection, exhibit inconsistent performance relative to GCNs, so it is hard to employ any single one of them in real scenarios.

- The Multi-Stage Training Framework tends to be superior to Self-Training, especially with fewer labeled data, demonstrating the effectiveness of this framework on graphs with few labeled nodes.

- The M3S Training Algorithm leverages both the advantages of the Multi-Stage Training Framework and the self-checking mechanism constructed by DeepCluster, consistently outperforming other state-of-the-art approaches at all label rates. Additionally, it turns out that the lower the label rate of the graph, the larger the improvement the M3S Training Algorithm produces, making it well suited to graphs with few labeled nodes.
Sensitivity Analysis of Number of Clusters
Sensitivity analysis of the number of clusters serves as an extended discussion of our M3S Training Algorithm, in which we present the influence of the number of clusters in DeepCluster on the balance of the classes and on the final performance of the GCN. We use the "Max-Min Ratio" to measure the balance level of the classes, computed as the difference between the maximum and minimum class proportions of the unlabeled data after the aligning mechanism; a lower "Max-Min Ratio" indicates a more balanced class distribution. We choose two labeled nodes per class across the three datasets. As shown in Figure 4, where each column presents the behavior on a specific dataset, as the number of clusters increases, the classes tend to be more balanced until the number of clusters is large enough, which in turn improves the final performance of the M3S Training Algorithm. These results empirically demonstrate that more clusters help to avoid trivial solutions in DeepCluster, thus enhancing the performance of our method.
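For concreteness, a minimal sketch of this balance measure (the function name is ours):

```python
import numpy as np

def max_min_ratio(aligned_labels, n_classes):
    """Difference between the largest and smallest class proportions among
    unlabeled nodes after aligning; lower values mean better balance."""
    counts = np.bincount(aligned_labels, minlength=n_classes)
    ratios = counts / counts.sum()
    return ratios.max() - ratios.min()
```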
4 Discussions
Although in this work we employ only one kind of self-supervised approach on the graph learning task, the introduction of the self-checking mechanism constructed by DeepCluster in fact provides a more general framework for weakly supervised signals over a wide range of data types. On the one hand, it is worth exploring how to utilize the pseudo-labels produced by self-supervised learning more efficiently given few supervised labels, for instance by designing new aligning mechanisms or applying better self-supervised learning approaches. On the other hand, how to extend similar algorithms combined with self-supervised learning methods to other machine learning tasks, such as image classification and sentence classification, requires more endeavors in the future.
5 Conclusion
In this paper, we first clarify the Layer Effect of GCNs on graphs with few labeled nodes, demonstrating that more layers should be stacked to facilitate the propagation of label information at lower label rates. Then we propose the Multi-Stage Training Framework on the basis of Self-Training, adding confident data with virtual labels to the labeled set to enlarge the training set. In addition, we apply DeepCluster to the graph embedding process of GCNs and design a novel aligning mechanism to construct a self-checking mechanism that improves the Multi-Stage Training Framework. Our final proposed approach, the M3S Training Algorithm, outperforms other state-of-the-art methods at different label rates across all the considered graphs with few labeled nodes. Overall, the M3S Training Algorithm is a novel and efficient algorithm for graphs with few labeled nodes.
References
 Battaglia et al. [2016] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pages 4502–4510, 2016.
 Bruna et al. [2013] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 Caron et al. [2018] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Computer Vision–ECCV 2018, pages 139–156. Springer, 2018.
 Chen and Zhu [2017] Jianfei Chen and Jun Zhu. Stochastic training of graph convolutional networks. arXiv preprint arXiv:1710.10568, 2017.
Dai et al. [2018] Hanjun Dai, Zornitsa Kozareva, Bo Dai, Alex Smola, and Le Song. Learning steady-states of iterative algorithms over graphs. In International Conference on Machine Learning, pages 1114–1122, 2018.
 Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 Doersch et al. [2015] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
Fortunato [2010] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
Fout et al. [2017] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pages 6530–6539, 2017.
 Grover and Leskovec [2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
Hamaguchi et al. [2017] Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for out-of-knowledge-base entities: a graph neural network approach. arXiv preprint arXiv:1706.05674, 2017.
 Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Li et al. [2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. arXiv preprint arXiv:1801.07606, 2018.
Perozzi et al. [2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
Sanchez-Gonzalez et al. [2018] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. arXiv preprint arXiv:1806.01242, 2018.
Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
Tang et al. [2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
 Velickovic et al. [2017] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 1(2), 2017.
Wu et al. [2012] Xiao-Ming Wu, Zhenguo Li, Anthony M So, John Wright, and Shih-Fu Chang. Learning with partially absorbing random walks. In Advances in Neural Information Processing Systems, pages 3077–3085, 2012.
 Zhou et al. [2018] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
Zhou [2017] Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.
Zhu et al. [2016] Jun Zhu, Jiaming Song, and Bei Chen. Max-margin nonparametric latent feature models for link prediction. arXiv preprint arXiv:1602.07428, 2016.