1 Introduction
Graphs are a natural way to represent and organize data with complicated relationships. However, graph data is hard to process directly with machine learning methods, especially deep learning [11], which has achieved remarkable results in various fields. Learning a useful graph representation lies at the heart of many deep learning-based graph mining applications, such as node classification, link prediction, and community detection.
It is now widely accepted practice to embed structured data into vectors so that well-developed deep learning methods can be applied. Recently, semi-supervised methods, represented by the graph convolutional network, have become a hot topic in graph embedding, and many outstanding works have been proposed (throughout the paper, we use the graph convolutional network as the representative of semi-supervised methods). Kipf and Welling [9] proposed GCN, which is widely used today and formally brought graphs into the era of neural networks. Since then, plenty of works such as GraphSAGE [7] and Graph Attention Networks [16] have been proposed.
However, there are two key challenges in applying these semi-supervised methods to specific fields: 1) an extreme shortage of labeled data; 2) out-of-distribution prediction, when the distribution of the training set differs substantially from that of the test set. Works like GCN [9] and GAT [16] tend to hand-pick the training set in order to maintain distribution similarity. Concurrently with this work, some researchers seek to overcome these challenges by utilizing transfer learning [8, 19], but such 'pre-train' methods need extensive domain knowledge and long training times.
To address this research gap, we propose a sampling-based training framework for graph convolutional network methods, which is more scalable and requires no domain knowledge. We integrate a random walk-based sampling strategy into the training process of graph convolutional network models. In this way, our framework uses the sampling algorithm to find the most representative nodes of a graph; such algorithms have proven successful in graph measurement and graph structure estimation. The comparison between random walk-based sampled nodes and uniformly chosen nodes with a limited ratio is shown in Fig. 1. When the sampling scale is small, the law of large numbers does not apply, and uniformly chosen nodes are more likely to fall in a small subgraph, which loses much information during training. Several works [7, 3] also use the idea of sampling; the difference between their approach and ours is clarified in Section 2.
Our framework can significantly improve the performance of existing graph convolutional network methods, with the following advantages: 1) the random walk-based sampled training data maintain more accurate graph characteristics than uniformly chosen data, which eliminates the model deviation; 2) the well-sampled training nodes allow the parameters of graph convolutional network models to be estimated effectively; 3) the smaller scale of training data to be labelled lets existing models work with limited labelled data and saves a great deal of human effort. To demonstrate the effectiveness of the sampling method in training graph convolutional networks, we combine it with five state-of-the-art GCN-based methods and evaluate its performance on challenging multi-label node classification problems with limited labelled training data. The results show that each combined method outperforms its original counterpart with the same scale of training data, or achieves the same accuracy with less labelled data.
The contributions of this paper are summarized as follows:

To the best of our knowledge, this is the first work to use a sampling strategy as a preprocessing stage to improve the performance of graph convolutional network algorithms. It reduces the scale of labelled data and significantly lowers the human resource cost without changing the original methods.

We develop a general framework by integrating a random walk sampling strategy with graph convolutional network methods, which enables them to obtain better results and extends them to applications with an extremely small scale of labelled data.

We verify the validity of our idea by evaluating our framework on different real-world networks. The case study of multi-label node classification shows that our framework makes GCN-based methods outperform their original versions.
This work is organized as follows. In Section 2, we formulate the problem definition. In Section 3, we detail the sampling-based training framework for graph convolutional network related methods. We present our experimental settings and results in Section 4. Finally, we close with a discussion of related work in Section 5 and conclusions in Section 6.
2 Problem Definition
In this paper, we address the problem that most state-of-the-art graph convolutional network methods need a large number of labelled nodes as the training set.
We seek to solve this problem with the sampling-based training framework introduced in Section 1: we only need to label the tiny set of nodes we sample, and can then train the graph convolutional network model to the same level of performance as usual.
The most closely related works are FastGCN [3] and GraphSAGE [7]. These two works also use sampling (which we abbreviate as F-sampling, for feature sampling) to improve performance. However, it is necessary to clarify that their sampling strategies and ours are not the same concept. Specifically, both FastGCN and GraphSAGE focus on how to downsize the data stream passing from layer to layer in the graph convolutional network during training, by changing either the network structure or the neighbours' information.
In contrast, we use a sampling strategy (which we abbreviate as N-sampling, for training node sampling) as a preprocessing method and propose a general framework that can easily be extended to almost every kind of node representation method, without requiring knowledge of the specific internal workflow of those methods.
In short, F-sampling aims to select features for the training nodes by changing the workflow of the original models, whereas N-sampling aims to select more representative training nodes without changing anything in the original model.
3 Methodology
We seek to build a sampling-based framework for training graph convolutional networks. Our framework can be generalized into a three-stage algorithm. First, we perform a sampling process on the graph at hand.
(1) $(V_s, X_s, Y_s) = f(G, B)$
The function $f$ is the sampling strategy applied to the graph $G$. $V_s$, $X_s$, and $Y_s$ represent the sampled nodes, the corresponding feature matrix, and the label vectors, respectively, while $B$ denotes the expected sampling budget. The original training process is carried out after the training set has been sampled by Equation 1.
In general, we express the training strategy by Equation 2,
(2) 
The function F represents the training process on the sampled training dataset. After the model has been trained, it can be used as usual.
Our framework integrates these three stages in a pipeline, as shown in Fig. 2. The framework samples a small set of data as the training set, with a time complexity lower than linear. Moreover, because the sampling and training stages are independent, the framework can easily be applied to different models and datasets.
3.1 Sampling Strategy
Choosing the training dataset at random can be seen as uniform random sampling (UR), which selects each node with probability $1/|V|$, where $|V|$ is the number of nodes. Because of this randomness, the generalization performance of the resulting networks varies a lot.
To overcome the disadvantage of UR, we consider several commonly used sampling methods: Depth First Sampling (DFS), Breadth First Sampling (BFS), and Random Walk.
DFS and BFS are common ways to explore the graph structure, but both have inherent defects. If we set the sampling budget to a small number, DFS may sample a long path through the graph, losing the information along its way deep into the graph. BFS, on the contrary, may sample nodes only within a small part of the whole graph.
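The contrast between the two traversals can be seen on a toy graph. The following Python sketch is illustrative only: the adjacency-dict representation and the names `bfs_sample` and `dfs_sample` are assumptions made for this example, not part of the original framework.

```python
from collections import deque

def bfs_sample(adj, root, budget):
    """Return up to `budget` nodes in breadth-first order from `root`."""
    seen, order, queue = {root}, [], deque([root])
    while queue and len(order) < budget:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def dfs_sample(adj, root, budget):
    """Return up to `budget` nodes in depth-first order from `root`."""
    seen, order, stack = set(), [], [root]
    while stack and len(order) < budget:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        # Push in reverse so the first-listed neighbour is explored first.
        for v in reversed(adj[u]):
            if v not in seen:
                stack.append(v)
    return order

# Hypothetical toy graph: hub 0 with leaves 2 and 3, plus a chain 0-1-4-5-6.
adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1, 5], 5: [4, 6], 6: [5]}
print(bfs_sample(adj, 0, 4))  # stays around the hub
print(dfs_sample(adj, 0, 4))  # runs down the chain
```

With a budget of 4 and root 0, BFS returns the hub and its three direct neighbours, while DFS follows the chain 0, 1, 4, 5 and never sees nodes 2 and 3.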
Regular random walk (RW) is one of the popular methods for exploring the network structure by obtaining a series of nodes or edges. It starts from a root vertex $v_0$, pushes $v_0$ into the traversed node list $L$, and chooses its next hop uniformly from the neighbours of $v_0$, so that each neighbour is selected with probability $1/d_{v_0}$, where $d_{v_0}$ is the degree of node $v_0$. Repeating this move until the sampling budget is reached gives the whole traversed node list. This is the most common type of RW.
RW can be performed at low time cost, and it offers a trade-off between exploration and exploitation, which overcomes the weaknesses of BFS and DFS.
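The regular random walk described above can be sketched in a few lines of Python. This is a minimal illustration under assumed conventions (an adjacency dict, a hypothetical function name `random_walk`, and a fixed number of hops standing in for the budget); it is not the authors' implementation.

```python
import random

def random_walk(adj, root, steps, rng):
    """Traverse up to `steps` hops from `root`, picking each next hop uniformly."""
    walk = [root]
    current = root
    for _ in range(steps):
        neighbours = adj[current]
        if not neighbours:
            break  # a node without neighbours ends the walk early
        # Each neighbour is chosen with probability 1 / degree(current).
        current = rng.choice(neighbours)
        walk.append(current)
    return walk

# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = random_walk(adj, root=0, steps=10, rng=random.Random(0))
print(walk)
```

The returned list may contain repeated nodes; whether repeats count toward the budget is a design choice that this sketch leaves open.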
However, this type of regular random walk still suffers from several flaws. It places relatively high demands on the structure of the graph. A necessary condition for a regular RW to reach its stationary distribution is that the graph must be symmetric, connected, and non-bipartite. When the graph is not connected, a regular random walker only explores the subgraph where it starts. This would severely damage the result of training a graph network, because the model can never learn the structure or the information of the parts that are disconnected from the starting component. Worse still, if the start vertex has no neighbours, the walk cannot proceed at all.
Even when the graph is connected, if it is only weakly connected, a regular random walker can still get temporarily "trapped" inside a strongly connected subgraph [15]. It would take a long time to escape, which contradicts the low time complexity we seek to achieve.
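For an undirected (symmetric) graph, the connectivity and non-bipartiteness conditions mentioned above can be checked with a single breadth-first 2-colouring. The sketch below is an illustration only; the function name and the adjacency-dict input are assumptions, and the check is not part of the original framework.

```python
from collections import deque

def is_connected_and_nonbipartite(adj):
    """Check both conditions on an undirected graph given as an adjacency dict."""
    if not adj:
        return False
    start = next(iter(adj))
    colour = {start: 0}
    queue = deque([start])
    has_odd_cycle = False
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in colour:
                colour[v] = 1 - colour[u]
                queue.append(v)
            elif colour[v] == colour[u]:
                # An edge inside one colour class closes an odd cycle,
                # so the graph cannot be bipartite.
                has_odd_cycle = True
    connected = len(colour) == len(adj)
    return connected and has_odd_cycle

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(is_connected_and_nonbipartite(triangle))  # triangle: connected, odd cycle
print(is_connected_and_nonbipartite(path))      # path: connected but bipartite
```

A graph that fails this check is one on which a single regular random walker cannot be expected to converge to its stationary distribution.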
To overcome these drawbacks, we use a technique called 'Frontier Sampling' [14], an advanced sampling strategy based on random walk. It runs $M$ dependent walkers at the same time, which share one candidate list.
Walkers in Frontier Sampling are less likely to get stuck in a loosely connected part of the whole graph ('walker' and 'dimension' have the same meaning in Frontier Sampling, so we do not distinguish between them). This kind of method can also be parallelized easily. We carry out a comparison experiment on the sampling methods mentioned above in Section 4.5 to verify this analysis.
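A minimal Python sketch of Frontier Sampling, following the commonly described procedure (pick a walker with probability proportional to the degree of its current node, then move it along a uniformly chosen edge), is given below. Several details are assumptions made for illustration rather than the authors' Algorithm 1: the function name, the adjacency-dict input, counting only distinct nodes toward the budget, recording the starting nodes, and the `max_steps` guard.

```python
import random

def frontier_sampling(adj, budget, num_walkers, rng, max_steps=100_000):
    """Collect up to `budget` distinct nodes with `num_walkers` dependent walkers."""
    # Start every walker on a uniformly chosen node that has at least one neighbour.
    starts = [v for v in adj if adj[v]]
    frontier = rng.choices(starts, k=num_walkers)
    sampled, seen = [], set()

    def record(v):
        if v not in seen and len(sampled) < budget:
            seen.add(v)
            sampled.append(v)

    for v in frontier:
        record(v)
    for _ in range(max_steps):
        if len(sampled) >= budget:
            break
        # Pick the walker to move with probability proportional to its node's degree.
        degrees = [len(adj[u]) for u in frontier]
        i = rng.choices(range(num_walkers), weights=degrees, k=1)[0]
        # Move that walker along a uniformly chosen edge and record the new node.
        frontier[i] = rng.choice(adj[frontier[i]])
        record(frontier[i])
    return sampled

# Hypothetical toy graph: a ring of 20 nodes.
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
print(frontier_sampling(ring, budget=8, num_walkers=3, rng=random.Random(42)))
```

Because the walkers start at several random positions, a disconnected or loosely connected region is less likely to be missed than with a single walker started at one root.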
3.2 Implementation
The sampling strategy provides an easy and effective way to decrease the scale of the training set. We combine this strategy with GCN-based methods as a preprocessing stage, in pipeline form.
The sampling stage takes the graph $G$, the sampling budget $B$, and the number of walkers $M$ as input and generates the sampled list by performing Lines 1-13 of Algorithm 1.
The nodes in the sampled list, their corresponding features, and their labels are fed into the GCN-based methods. The models then work as they normally would. We formulate the whole structure of our framework as Algorithm 1, which is a typical three-stage framework.
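The separation between the sampling stage and the unchanged training stage can be sketched as a small pipeline function. Everything named here (`run_pipeline`, `first_k_sampler`, `majority_trainer`, `label_oracle`) is a hypothetical stand-in chosen for illustration; in practice the sampler would be a random walk-based strategy and the trainer an existing GCN-based model.

```python
from collections import Counter

def run_pipeline(adj, features, label_oracle, sampler, trainer, budget):
    """Sample nodes, label only those nodes, then train an unchanged model."""
    # Stage 1: decide which nodes are worth labelling.
    train_nodes = sampler(adj, budget)
    # Stage 2: gather features and request labels for the sampled nodes only.
    train_x = [features[v] for v in train_nodes]
    train_y = [label_oracle(v) for v in train_nodes]
    # Stage 3: hand the small training set to the original training routine.
    return trainer(adj, train_nodes, train_x, train_y)

# Toy stand-ins so the sketch runs end to end (hypothetical).
def first_k_sampler(adj, budget):
    return sorted(adj)[:budget]

def majority_trainer(adj, nodes, xs, ys):
    most_common = Counter(ys).most_common(1)[0][0]
    return lambda v: most_common  # a "model" that always predicts the majority label

adj = {0: [1], 1: [0, 2], 2: [1], 3: []}
features = {v: [float(v)] for v in adj}
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
requested = []

def label_oracle(v):
    requested.append(v)  # track how many labels were actually needed
    return labels[v]

model = run_pipeline(adj, features, label_oracle, first_k_sampler, majority_trainer, budget=3)
print(requested, model(3))
```

Because the trainer only receives the sampled nodes, swapping in a different sampler or a different model does not require touching the other stage.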
3.3 Feasibility
The node sequence sampled by Algorithm 1 preserves the graph structure more accurately than a uniformly randomly sampled one, which addresses the second challenge mentioned in Section 1: out-of-distribution prediction. Consider an important graph characteristic, label density. We assume that each vertex $v$ is associated with a label $l_v$. The density of label $l$ on graph $G$, denoted $\theta_l$, is defined by Equation 3, where $\mathbb{1}(\cdot)$ is the indicator function.
(3) $\theta_l = \frac{1}{|V|} \sum_{v \in V} \mathbb{1}(l_v = l)$
We use the same unbiased estimator proposed by Zhao et al., given as Equation 4.
(4) 
Here $\pi_v$ is the probability that node $v$ is sampled, which equals $1/|V|$ under uniform random sampling and $d_v/(2|E|)$ for a random walk at steady state, where $d_v$ is the degree of node $v$ and $|E|$ is the number of edges in the graph.
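To illustrate why the sampling probability matters when estimating label density, the sketch below compares a naive sample mean with a reweighted estimate on a star graph. The estimator used here is a standard self-normalised importance-weighted form, chosen as an assumption for illustration; it is not claimed to be identical to Equation 4. The function names and the toy data are likewise hypothetical.

```python
def true_density(labels, target):
    """Fraction of all nodes carrying the label `target`."""
    return sum(1 for l in labels.values() if l == target) / len(labels)

def reweighted_density(sample, labels, target, sampling_prob):
    """Self-normalised estimate: weight each sampled node by 1 / pi_v."""
    weights = [1.0 / sampling_prob(v) for v in sample]
    hits = sum(w for v, w in zip(sample, weights) if labels[v] == target)
    return hits / sum(weights)

# Star graph: hub 0 with four leaves, so |E| = 4 and 2|E| = 8.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
labels = {0: "x", 1: "y", 2: "y", 3: "y", 4: "y"}
num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2

def rw_prob(v):
    # Stationary probability of a random walk: degree / (2 * number of edges).
    return len(adj[v]) / (2 * num_edges)

# A walk on the star that visits nodes in their stationary proportions.
walk = [0, 1, 0, 2, 0, 3, 0, 4]

naive = sum(1 for v in walk if labels[v] == "x") / len(walk)
corrected = reweighted_density(walk, labels, "x", rw_prob)
print(true_density(labels, "x"), naive, corrected)
```

On this example the naive mean over the walk overstates the density of label "x" (0.5 against a true value of 0.2) because the high-degree hub is visited often, while the reweighted estimate recovers 0.2.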
Theorem 1: For a single random walker,
Proof.
For notation convenience, we depict as , as . The length of sampled nodes is . Combining the equation 3 and 4, the original inequality can be written as :
(5) 
The situation also suits the condition that . ∎
Theorem 1 states that the label density estimated from a node sequence sampled by a random walk is closer to the true label density than that estimated from a uniformly randomly sampled sequence.
Lemma 1: The M-dimensional random walk process is equivalent to the process of a single random walker over $G^M$, the $M$-th Cartesian power of $G$.
Lemma 1 was proved by Ribeiro et al. [15]; combining it with Theorem 1, we can deduce Theorem 2.
Theorem 2: Estimating label density with nodes sampled by 'Frontier Sampling' performs better than with nodes sampled uniformly.
Theorem 2 theoretically supports the assumption we made earlier and underpins the performance of Algorithm 1.
3.4 Time Complexity
We propose the framework mainly to reduce the scale of the training dataset, and thereby the labour and compute cost. The first stage, i.e., the sampling process, should therefore not have a high time complexity; otherwise, the data preprocessing stage would run contrary to our original idea.
According to Algorithm 1, the time complexity $T$ of the whole sampling procedure depends only on the sampling budget $B$,
(7) $T = O(B), \quad B < |V|$
where $|V|$ represents the number of nodes in the graph G. Thus, the time complexity of the sampling algorithm is lower than $O(|V|)$. Compared with the time complexity of graph convolutional network methods, the linear time complexity of sampling is acceptable.
4 Evaluation
We verify our proposal on the multi-class classification task using three real-world datasets: two citation networks and one social network. In the citation networks, nodes are papers and edges are citation relationships. Each paper has a feature vector that encodes its contents, and the classes indicate the categories of the papers. In the social network, nodes represent social media users, and an edge between two users denotes a follower-followed relation. The details of these datasets are presented in Table 1.
Dataset      Type      Nodes   Edges    Classes
Cora         Citation  2,707   5,429    7
Pubmed       Citation  19,717  44,338   3
BlogCatalog  Social    10,312  333,983  10
4.1 Experimental Settings
We apply our framework to five GCN-based methods to verify the validity of our proposal. For the sampling strategy, we use a fixed number of walkers. For the training data scale (sampling budget), we range it over 0.5%, 1%, 5%, and 10% of the nodes for each dataset. For the baseline algorithms, we choose the same scale of training data from the graphs uniformly at random; 100 nodes are randomly selected from the training set as the validation part. The prediction accuracy is evaluated on another 1,000 randomly selected nodes for each dataset. We use the cross-entropy loss as our loss function during the experiments. For the methods combined with our framework, we prefix the original method name with 'SS-'. We run each experiment 10 times and report the average as the final result. The experiment on FastGCN is based on the code released by the original authors (https://github.com/matenure/FastGCN), and all the other algorithms are implemented with the Deep Graph Library (DGL) (https://github.com/dmlc/dgl).
               Cora                      Pubmed                    BlogCatalog
Training Set   0.5%  1%    5%    10%     0.5%  1%    5%    10%     0.5%  1%    5%    10%
GCN            0.63  0.64  0.73  0.80    0.39  0.65  0.78  0.79    0.25  0.30  0.31  0.33
SS-GCN         0.70  0.71  0.75  0.83    0.63  0.73  0.78  0.81    0.28  0.30  0.33  0.34
GraphSAGE      0.63  0.57  0.79  0.85    0.70  0.79  0.82  0.83    0.27  0.27  0.32  0.34
SS-GraphSAGE   0.69  0.66  0.86  0.85    0.78  0.85  0.85  0.85    0.33  0.32  0.34  0.36
SGC            0.47  0.61  0.79  0.83    0.78  0.79  0.81  0.83    0.25  0.27  0.32  0.33
SS-SGC         0.56  0.68  0.81  0.84    0.81  0.82  0.83  0.84    0.33  0.33  0.33  0.35
FastGCN        0.18  0.26  0.33  0.33    0.56  0.59  0.63  0.63    0.19  0.23  0.27  0.27
SS-FastGCN     0.23  0.32  0.38  0.38    0.59  0.64  0.66  0.66    0.25  0.27  0.30  0.31
TAGCN          0.56  0.56  0.79  0.79    0.70  0.77  0.80  0.83    0.30  0.30  0.33  0.32
SS-TAGCN       0.70  0.71  0.80  0.84    0.79  0.80  0.86  0.85    0.31  0.32  0.34  0.34
APPNP          0.68  0.71  0.72  0.80    0.74  0.80  0.83  0.83    0.26  0.32  0.32  0.32
SS-APPNP       0.72  0.79  0.84  0.85    0.79  0.81  0.85  0.86    0.32  0.33  0.33  0.34
4.2 Baseline Methods
To validate the improvement that the sampling strategy brings to GCN-related methods, we evaluate it on several state-of-the-art GCN-related baselines:

GCN [9]: This is the first widely used graph convolutional network method for embedding the graph structure. It takes the graph structure and a small number of labeled nodes as input and outputs the node embedding vectors.

GraphSAGE [7]: This method builds on GCN and uses the neighbours' features to represent a node. In this way, the framework can deal with dynamic graph structure.

SGC [17]: This method speeds up GCN training by removing nonlinearities and collapsing weight matrices between consecutive layers.

FastGCN [3]: This method also tries to speed up GCN training by sampling active nodes between layers, which works like 'dropout' in traditional neural networks.

TAGCN [5]: This method designs a set of fixed-size learnable node filters to perform convolutions on graphs. It differs from the spectral-domain formulation of the original GCN.
4.3 Results
We now validate the effectiveness of our framework by combining it with 5 GCN-related baseline algorithms and comparing the results with the original ones. Specifically, we use the node classification task for evaluation. The experimental results are shown in Table 2. We bold the better result in each comparison pair; the detailed analysis of the table is as follows.
When we fix the training set at 0.5% and evaluate the performance of each algorithm, the methods with the sampling strategy outperform the original ones. This setting simulates the extreme situation in which labeled data are very scarce. The average lift is significant, and the greatest improvement occurs between GCN and SS-GCN on the Pubmed dataset.
As the training set scale grows, the improvement gets smaller but still exists. In some cases, with a relatively larger training set, the original algorithm is powerful enough to match its 'SS' counterpart. The shrinking gap is reasonable, because a large data scale ensures that uniform sampling can explore as much of the graph as the 'Frontier Sampling' we use.
If we take another perspective, we obtain more meaningful observations. Take the accuracy of SS-SGC on Pubmed with 0.5% of the dataset, 0.81, as a goal. The original SGC needs up to 5% of the training data to reach this accuracy on the same dataset, which is 10 times what SS-SGC needs. In other words, the sampling strategy can save 90% of the labelled data needed to reach the same accuracy. This observation matters even more in real-world scenarios, where saving labelled data can significantly improve the efficiency of the labelling process and of cross-validation among different models or datasets.
Overall, we can summarize the conclusions drawn from the results: 1) algorithms with the sampling strategy achieve an accuracy improvement under the same conditions; 2) methods with the sampling strategy achieve comparable performance with a fraction of the training data. These results verify that the sampling strategy can easily improve the performance of GCN-related methods without changing them, i.e., by simply adding the sampling strategy to the original methods as a pipeline stage.
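The per-pair gains at the smallest training scale can be recomputed directly from Table 2. The short Python sketch below does this for the 0.5% columns; the numbers are copied from the table, while the choice of absolute accuracy difference as the measure of gain is an assumption and is not necessarily the measure behind the averages quoted in the text.

```python
# Accuracies at the 0.5% training scale, copied from Table 2:
# (original method, same method with the sampling strategy).
acc = {
    "GCN":       {"Cora": (0.63, 0.70), "Pubmed": (0.39, 0.63), "BlogCatalog": (0.25, 0.28)},
    "GraphSAGE": {"Cora": (0.63, 0.69), "Pubmed": (0.70, 0.78), "BlogCatalog": (0.27, 0.33)},
    "SGC":       {"Cora": (0.47, 0.56), "Pubmed": (0.78, 0.81), "BlogCatalog": (0.25, 0.33)},
    "FastGCN":   {"Cora": (0.18, 0.23), "Pubmed": (0.56, 0.59), "BlogCatalog": (0.19, 0.25)},
    "TAGCN":     {"Cora": (0.56, 0.70), "Pubmed": (0.70, 0.79), "BlogCatalog": (0.30, 0.31)},
    "APPNP":     {"Cora": (0.68, 0.72), "Pubmed": (0.74, 0.79), "BlogCatalog": (0.26, 0.32)},
}

# Absolute accuracy gain for every (method, dataset) pair.
gains = {
    (method, dataset): round(with_ss - base, 2)
    for method, row in acc.items()
    for dataset, (base, with_ss) in row.items()
}
mean_gain = sum(gains.values()) / len(gains)
best_pair = max(gains, key=gains.get)

print(f"mean absolute gain: {mean_gain:.3f}")
print(f"largest gain: {best_pair} -> {gains[best_pair]:.2f}")
```

On these numbers the largest gain is 0.24 for GCN on Pubmed, which agrees with the observation above, and the mean absolute gain over the 18 pairs is about 0.07.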
4.4 Algorithm Efficiency
Well-distributed training data helps the model converge quickly, which is a common way to reduce training time. The sampling method we use is a simple but effective way to obtain well-distributed training data.
To evaluate its contribution to reducing training time, we carried out a case study on GraphSAGE and SS-GraphSAGE with the Pubmed dataset. We use 10% of the dataset as the training data and set the number of training epochs to 200.
The convergence speed of GraphSAGE and SS-GraphSAGE is plotted in Fig. (a); both the training and the test accuracy of SS-GraphSAGE consistently exceed those of the original GraphSAGE. If we take the final test accuracy, 0.83, as a threshold, SS-GraphSAGE reaches a similar test accuracy in only 62 epochs, which decreases the training time by 69% (1 - 62/200 = 0.69).
4.5 Sampling Strategy Comparison
We discussed several sampling strategies in Section 3.1. To verify that analysis, we conducted a case study on the Cora dataset with GraphSAGE. We replace the 'Frontier Sampling' used in Algorithm 1 with 'Uniform Sampling', 'Regular Random Walk', 'DFS', and 'BFS'. The experimental settings are the same as in Section 4.1. From the results shown in Fig. (b) (we take the natural logarithm of the training set scale for clarity of display), we can tell that the sampling method we use outperforms the others both theoretically and experimentally. Regular random walk performs the worst at the smallest sampling scale, because it is easily trapped, but its accuracy rises dramatically when the scale gets slightly larger, which is consistent with our analysis. Note that when the training scale gets larger, uniformly sampled data achieves accuracy close to ours, in line with the law of large numbers.
4.6 Parameter Influence
We numerically evaluate the influence of the sampling scale and the number of random walk dimensions with SS-GCN on the Cora dataset.
4.6.1 Sampling Scale
Fig. (a) shows the accuracy gap from the steady performance under different sampling budgets on the multi-label classification task. The steady accuracy is obtained by taking 50% of the nodes as training data. From the results, we observe that the sampling-based methods easily approach the steady performance with about 1% to 3% of the nodes as sampled training data. This reveals the power of adding the sampling strategy to GCN-related methods.
4.6.2 Number of Walkers
The sampling scheme performs an M-dimensional random walk. Fig. (b) shows the influence of the number of walkers on the accuracy and memory cost. The box plot shows the distribution of accuracy; the green line is the median accuracy, and the yellow line across the boxes connects the average accuracies. The accuracy shows only small fluctuations as the value of $M$ changes. The memory cost is also plotted in the same figure and stays at about the same level as $M$ varies.
5 Related Work
Two lines of research are related to our work, which are summarized as follows.
5.1 GCN-based Methods
Graph neural networks have drawn a lot of attention recently. They were first proposed in 2014 [2] and later modified by [4, 6]. Kipf and Welling's GCN [9] brought them into the spotlight. Since then, researchers have sought to build more effective network structures: GraphSAGE [7] was proposed to deal with dynamic graphs, and Graph Attention Network [16] was proposed to weight a node's neighbours. Some works also focus on training efficiency. FastGCN [3] is one of the pioneers in accelerating the training process by eliminating part of the neurons. SGC [17] goes a step further by simplifying the convolutional computation.
5.2 Graph Sampling Based on Random Walks
Sampling methods, especially random walk-based graph sampling methods, have been widely studied [1][15][18]. Leskovec and Faloutsos proposed an efficient way to downsize the sampling scale based on random walk, and Wei et al. worked out how to make sampling by random walk more efficient. Random walk-based sampling can also be used in overlay networks [12]. Based on prior knowledge about the graph structure, Zhao et al. proposed a graph sampling strategy using random walk with indirect jumps.
6 Conclusion and Future Work
Faced with the challenge of limited labelled data for semi-supervised methods, we propose a sampling-based framework to improve their performance without changing the original model. The evaluation of our proposal on real-world datasets shows that our framework achieves better results with a smaller scale of training data and surpasses the original methods.
In this paper, we take a small step toward improving the performance of semi-supervised methods, represented by GCN, under limited labelled data. Limited labelled data is, however, a very common problem in the study of graph embedding methods, and we will try to develop a more general framework that works for all of them. Moreover, we would like to study the problem of lowering the training time complexity with a smaller scale of training data.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61902308, 61822309, 61773310, U1736205, U1766215, Foundation of Xi’an Jiaotong University under grant xxj022019016, xtr022019002, and Initiative Postdocs Supporting Program BX20190275.
References
[1] (2010) Improving random walk estimation accuracy with uniform restarts. In International Workshop on Algorithms and Models for the Web-Graph, pp. 98–109.
[2] (2014) Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR 2014), CBLS, April 2014.
[3] (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247.
[4] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852.
[5] (2017) Topology adaptive graph convolutional networks. arXiv preprint arXiv:1710.10370.
[6] (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232.
[7] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
[8] (2019) Pre-training graph neural networks. arXiv preprint arXiv:1905.12265.
[9] (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
[10] (2018) Predict then propagate: graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997.
[11] (2015) Deep learning. Nature 521 (7553), pp. 436–444.
[12] (2006) Peer counting and sampling in overlay networks: random walk methods. In Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing, pp. 123–132.
[13] (1999) The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab.
[14] (2010) Estimating and sampling graphs with multidimensional random walks. In Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pp. 390–403.
[15] (2012) Sampling directed graphs with random walks. In 2012 Proceedings IEEE INFOCOM, pp. 1692–1700.
[16] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
[17] (2019) Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871.
[18] (2014) A general framework of hybrid graph sampling for complex network analysis. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, pp. 2795–2803.
[19] (2019) Graph transformer networks. In Advances in Neural Information Processing Systems, pp. 11960–11970.