1 Introduction
Semi-supervised learning on graphs aims to recover the labels of all nodes when only a very small proportion of node labels is given. This setting is popular in various real-world applications, since labels are often expensive and difficult to collect. In recent years, Graph Neural Networks (GNNs) have shown great power in semi-supervised learning on attributed graphs. A GNN often contains multiple layers; the nodes collect information from their neighborhoods iteratively, layer by layer. Representative methods include Graph Convolutional Network (GCN) [11], Graphsage [8], Graph Attention Networks [23], and so on.
Due to the multi-layer message aggregation scheme, GNNs are prone to over-smoothing after a few propagation steps [13]. That is to say, the node representations tend to become identical and lack distinction. In general, GNNs can afford only 2 layers, collecting information within a 2-hop neighborhood; otherwise the over-smoothing problem deteriorates performance. Several previous works aimed to address this problem and expand the size of the utilized neighborhood. One major line of work improves the information aggregation scheme. For example, NGCN [1] trains multiple instances of GCN over node pairs discovered at different distances in random walks, and PPNP/APPNP [12] introduces the teleport probability of personalized PageRank.
These solutions are purely based on semi-supervised learning, which imposes a natural bottleneck on modeling global information. As the receptive field grows larger and more nodes are involved, more powerful models are required to explore the complex relationships among nodes. However, the training labels are quite sparse in semi-supervised learning, and hence cannot support training highly complex models well. The existing works therefore often have to restrain the complexity of their models, which limits their ability to learn global information.
Unsupervised learning methods based on random walks, e.g., Deepwalk [16] and Node2vec [7], can be used to obtain global structural information for each node. These methods first sample node sequences that contain the structural regularity of the network, then maximize the likelihood of neighborhood node pairs within a certain distance, e.g., 10 hops. Hence, they can capture global structural features without label information.

In this paper, we propose a new pretraining-and-learning schema to address the global information preserving problem in semi-supervised learning. First, instead of improving the aggregation function via semi-supervision, we obtain the global structure and attribute features by pretraining on the graph with a random-walk strategy in an unsupervised manner. Second, we design a GNN-based method to conduct semi-supervised classification by learning from the pretrained features and the original graph. The two-stage pretraining-and-learning schema has several advantages. First, the global information modeling procedure is decoupled from the subsequent semi-supervised learning method. Therefore, the modeling of global information no longer suffers from sparse supervision or the over-smoothing problem. Second, the framework marries two lines of work and benefits from their advantages: the global information preserving ability brought by random walks and the local information aggregation ability of GNNs. Moreover, the method allows GNNs to be applied to plain graphs without attributes, since the unsupervised structure features can serve as graph attributes.
In all, our contributions are as follows.

We propose a method named Global Information for Graph Neural Network (GGNN) for semi-supervised classification on graphs. The proposed pretraining-and-learning framework allows GNN models to use global information for learning. Moreover, the schema enables GNN to be applied to plain graphs.

We design the global information as global structure and attribute features for each node, and propose a parallel GNN-based model to learn different aspects from the pretrained global features and the original graph.

Our method achieves state-of-the-art results in semi-supervised learning on both plain and attributed graphs. In particular, our results for attributed graph learning on Cora (84.31%) and Pubmed (80.95%) are new benchmark results.
The rest of the paper is organized as follows. Section 2 gives the preliminaries. We introduce our method in Section 3. Section 4 presents the experiments. Section 5 briefly summarizes related work. Finally, we conclude our work in Section 6.
2 Preliminary
2.1 Definition
First, we give the formal definitions of attributed/plain graphs and the problem we are going to solve.
Let G = (V, E) be a graph, where V denotes the node set and E denotes the edge set with adjacency matrix A. If A_ij is not equal to 0, there is a link from v_i to v_j with weight A_ij. If G is an attributed graph, there is a corresponding attribute matrix X, where the i-th row X_i denotes v_i's attributes and m denotes the total number of attributes. If G is a plain graph, no X is provided. The graph contains label information Y, where the i-th row Y_i denotes v_i's one-hot label vector, and k denotes the number of label classes.

During the training stage, the provided data is the entire adjacency matrix A and the node attributes X. Only the labels of the training nodes V_train are given. The task of semi-supervised learning on an attributed graph is to predict the labels of the remaining nodes. X is not provided for plain graph learning.
2.2 Graph Neural Networks
We now introduce how general GNNs solve the semi-supervised learning problem. Note that current GNNs can only be applied to attributed graphs; therefore, we assume X is given here.
Among the huge family of GNNs, Graph Convolutional Network (GCN) [11] is a simple and far-reaching method that motivates many follow-up works. Let Ã = A + I be the adjacency matrix with self-connections, where I is an identity matrix. The self-loops allow GCN to consider the attributes of the represented node itself when aggregating the neighborhood's attributes. Let Â = D̃^(-1/2) Ã D̃^(-1/2) be the normalized adjacency matrix, where D̃ denotes the diagonal degree matrix with D̃_ii = Σ_j Ã_ij. The two-layer GCN produces the following hidden states by aggregating neighborhood attributes iteratively:

H = Â ReLU(Â X W^(0)) W^(1),   (1)

where W^(0) and W^(1) are the trainable weight matrices. Each row in H denotes the final hidden states of a node, and each column corresponds to a prediction category. After that, a softmax function is used to produce the classification probability on each class:

P = softmax(H).   (2)
Finally, a loss function is applied to measure the difference between the predicted probabilities and the ground-truth labels.
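As a concrete sketch of Equations 1 and 2, the two-layer GCN forward pass can be written in a few lines of NumPy (the function names are ours, not from the paper):

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalized adjacency with self-loops:
    A_hat = D^(-1/2) (A + I) D^(-1/2), as used by GCN."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)      # first layer + ReLU
    Z = A_hat @ H @ W1                        # second layer (logits)
    Z = Z - Z.max(axis=1, keepdims=True)      # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)   # row-wise softmax
```

Each row of the returned matrix is a probability distribution over the prediction classes for one node.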
Many follow-up studies aimed to improve the aggregation function, such as assigning alternative weights to the neighborhood nodes [23], adding skip connections [8], introducing teleport probability [12], and so on. Abstracting from the specific designs, these methods can be viewed as a transformation from the original A and X to the final hidden states H:

H = f(A, X).   (3)
From Equation 1, we can see that only 2 hops of local information can be used. The over-smoothing problem prevents adding more layers, so global information is difficult to integrate. Moreover, the input attributes are necessary in the general GNN learning framework, so GNN cannot be applied to plain graphs directly. In the next section, we show how to solve these problems with the proposed pretraining-and-learning schema.
3 Proposed Method
In this section, we first give an overview of GGNN in the context of attributed graphs. Second, the method to obtain the global features is introduced. Third, a parallel GNN-based method is proposed to learn from all these features. Finally, we show how to extend GGNN to plain graphs.
3.1 Overview
The overview of the GGNN method is shown in Figure 1. First, the global structure feature matrix S and the global attribute feature matrix T are learned in an unsupervised way. Next, S, T, and the original attribute matrix X are fed to a parallel GNN-based model to learn their corresponding hidden states. Finally, the final hidden states H are the weighted sum of the three hidden state matrices H_S, H_T, and H_X.
3.2 Unsupervised Learning of Global Features
Herein, we propose to learn unsupervised features of graphs based on a random-walk strategy. Each node can utilize information within t steps of random walk, where t is often set to 10. The small-world phenomenon suggests that the average distance between nodes grows logarithmically with the number of graph nodes [3], and the undirected average distance of a very large Web graph is only a handful of hops [4]. Hence, 10 steps of random walk can already capture the global information of the graphs.
3.2.1 Global Structure Features
Similar to Deepwalk [16], the structure features are learned by first sampling context-source node pairs and then maximizing their co-occurrence probability. Note that graph attributes are not used here. We apply random walks to the graph to obtain short truncated node sequences. These sequences contain the structural regularity of the original graph. In the node sequences, the neighborhood nodes within a certain distance of the source node v are considered as v's context nodes, denoted as C(v).
To maximize the likelihood of the observed sourcecontext pairs, we try to minimize the following objective function:
L_S = - Σ_{v ∈ V} Σ_{c ∈ C(v)} log p(c | v),  where  p(c | v) = exp(S_c · S_v) / Σ_{u ∈ V} exp(S_u · S_v).   (4)

S denotes the global structure feature matrix, where the row S_v denotes v's global structure feature vector. The calculation of the denominator is computationally expensive, since it requires traversing the entire node set; we approximate it using negative sampling [15].
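A minimal sketch of this pretraining stage pairs Deepwalk-style context sampling with a negative-sampling surrogate for the objective in Equation 4 (the function names and the uniform negative sampler are our simplifications):

```python
import numpy as np

def sample_context_pairs(adj_list, walk_length=100, num_walks=10,
                         window=10, rng=None):
    """Run truncated random walks from every node and emit
    (source, context) pairs for nodes within `window` steps of each
    other, as in Deepwalk [16]."""
    rng = rng or np.random.default_rng()
    pairs = []
    for start in range(len(adj_list)):
        for _ in range(num_walks):
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj_list[walk[-1]]
                if not nbrs:
                    break
                walk.append(int(rng.choice(nbrs)))
            for i, src in enumerate(walk):
                for ctx in walk[max(0, i - window): i + window + 1]:
                    if ctx != src:
                        pairs.append((src, ctx))
    return pairs

def negative_sampling_loss(S, src, ctx, negatives):
    """Negative-sampling approximation [15] of Equation 4 for one
    (src, ctx) pair: push the observed pair's score up and the sampled
    negatives' scores down."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(S[src] @ S[ctx]))
    for n in negatives:
        loss -= np.log(sigmoid(-S[src] @ S[n]))
    return loss
```

In a full training loop, the rows of S would be updated by gradient descent on this loss over all sampled pairs.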
3.2.2 Global Attribute Features
The global attribute features are obtained by maximizing the likelihood of the context attributes. The underlying idea is that if the context attributes can be recovered from the source node, they have already been preserved by the learning model.
For each sampled context node c, some attributes of c are sampled as the context attributes of the source node v. In this paper, we sample one attribute per context node. Let A(v) be the sampled context attributes of v, and let F be the set of all attributes. We try to minimize the following objective:

L_T = - Σ_{v ∈ V} Σ_{a ∈ A(v)} log p(a | v),  where  p(a | v) = exp(w_a · T_v) / Σ_{a' ∈ F} exp(w_{a'} · T_v),

where T denotes the global attribute feature matrix and w denotes the parameters used to predict the attributes. The row T_v denotes v's global attribute feature vector.
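The context-attribute sampling described above can be sketched as follows, drawing one non-zero attribute per context node (the function name is ours; a binary or count-valued bag-of-words X is assumed):

```python
import numpy as np

def sample_context_attributes(pairs, X, rng=None):
    """For each (source, context) node pair, sample one non-zero
    attribute of the context node as a context attribute of the source
    (one attribute per context node, as in the paper)."""
    rng = rng or np.random.default_rng()
    attr_pairs = []
    for src, ctx in pairs:
        nz = np.flatnonzero(X[ctx])      # indices of the context node's attributes
        if nz.size:
            attr_pairs.append((src, int(rng.choice(nz))))
    return attr_pairs
```

The resulting (source, attribute) pairs feed the objective above in the same way the (source, context) node pairs feed Equation 4.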
[26] proposed an unsupervised graph learning method that utilizes context attributes, learning node representations by jointly optimizing two objective functions that preserve the neighborhood nodes and attributes. The main difference of our work is that we learn two feature vectors for each node separately, which provides richer information for the subsequent learning algorithm.
3.3 Parallel Graph Neural Networks
As shown in Figure 1, we propose a parallel model with GNN kernels to learn from the input matrices S, T, and X. The learning is semi-supervised.
3.3.1 Learning from the Heterogeneous Features
The motivation for applying multiple parallel GNN kernels to these feature matrices is as follows. First, the features are quite heterogeneous, especially since some of them are obtained via pretraining. The parallel kernels can learn different aspects from these features respectively. Second, the three feature matrices are highly correlated. For example, T is obtained partly based on X, and S and T are sampled based on the identical random-walk method. This makes it difficult for a single model to learn the complex relationships among them. The parallel setting effectively gives the model the implicit prior knowledge that these features are different, which makes optimization easier. Indeed, the parallel schema has been successful in previous works, such as multi-head attention [22, 23] and NGCN [1].
First, because the amplitude at each dimension of the pretrained S and T often varies considerably, it is better to normalize them; we apply a normalization transformation to each row of S and T.
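As one plausible instantiation of this step, assuming a z-score style row normalization (the paper's exact transform may differ; the function name is ours):

```python
import numpy as np

def row_standardize(F, eps=1e-8):
    """Standardize each row of a pretrained feature matrix to zero mean
    and unit variance. NOTE: this z-score form is an assumption; it is a
    common way to equalize amplitudes across pretrained features."""
    mu = F.mean(axis=1, keepdims=True)
    sigma = F.std(axis=1, keepdims=True)
    return (F - mu) / (sigma + eps)   # eps guards against constant rows
```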
Then, parallel GNN kernels are applied to learn from the three feature matrices.
3.3.2 Combine the Hidden States
A simple way to obtain the final hidden state matrix H is to linearly combine the three hidden state matrices:

H = H_X + α H_S + β H_T,   (8)

where α and β are coefficients between 0 and 1. Then a softmax function is applied to H, as in Equation 2, to get the prediction probability matrix P, where P_ij denotes the probability that node v_i's label is class j. The coefficients α and β are used to turn down the effect of the pretrained features, which is essential for optimization. Bengio et al. (2007) suggested that pretraining is useful for initializing a network in a region of the parameter space where optimization is easier. In training, easy samples often contribute more to the loss and dominate the gradient updates [14]. Similarly, we found that the easily trained component H_X dominates the learning procedure: if no weighting strategy is used, H_S and H_T barely contribute to the results, and the performance is far from promising.
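A minimal sketch of this combination step, assuming NumPy arrays for the three hidden-state matrices (the function name is ours):

```python
import numpy as np

def combine_and_predict(H_X, H_S, H_T, alpha, beta):
    """Equation 8 followed by the softmax of Equation 2: linearly combine
    the three hidden-state matrices, damping the pretrained branches with
    small coefficients alpha and beta, then normalize each row into a
    probability distribution over classes."""
    H = H_X + alpha * H_S + beta * H_T
    Z = H - H.max(axis=1, keepdims=True)      # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)
```

In the paper's experiments, alpha and beta are searched over small values (roughly 0.001 to 0.05), which keeps H_X as the dominant branch while still letting the pretrained features contribute.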
3.3.3 Training
We minimize the cross-entropy loss between P and the ground-truth labels Y to train the model [1]:

L = - 1ᵀ M (Y ∘ log P) 1,

where ∘ denotes the Hadamard product and M denotes a diagonal matrix with the entry at (v, v) set to 1 if v is a training node and 0 otherwise.
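The masked cross-entropy above can be sketched as follows, restricting the loss to the training nodes (a minimal NumPy sketch; the function name and the mean reduction over training nodes are ours):

```python
import numpy as np

def masked_cross_entropy(P, Y, train_mask):
    """Cross-entropy between predicted probabilities P and one-hot labels
    Y, computed only over the training nodes selected by the boolean
    `train_mask` (the role of the diagonal matrix M in the text)."""
    eps = 1e-12                                   # avoid log(0)
    losses = -(Y * np.log(P + eps)).sum(axis=1)   # per-node cross-entropy
    return losses[train_mask].mean()              # average over training nodes
```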
3.4 Learning on Plain Graphs
Plain graphs contain no attribute matrix X. This does not affect obtaining the global structure feature matrix S. The final hidden states then reduce to the structure branch, H = H_S.
Hence, learning on plain graphs also follows the pretrainingandlearning schema, where some components that depend on the graph attributes are removed.
4 Experiments
In this section, we want to address the following questions (the data, code, and pretrained vectors to reproduce our results will be released at https://github.com/zhudanhao/ggnn):

Q1: How do GGNNs perform in comparison to state-of-the-art GNN kernels on attributed graphs?

Q2: Are all the designed components in GGNN helpful for achieving stronger learning ability?

Q3: How does GGNN perform in comparison to state-of-the-art learning methods on plain graphs?
4.1 Experiments on Attributed Graphs (Q1)
Herein, we address the first question Q1 by comparing GGNNs with different GNN kernels. Note that the code of all the GNN methods is based on the implementations released by DGL (https://github.com/dmlc/dgl/tree/master/examples/pytorch). All results are averages over 10 runs with different random initialization seeds.
4.1.1 Datasets and Baselines
Table 1: Statistics of the datasets.

                   Cora   Citeseer  Pubmed
# Nodes            2708   3327      19717
# Edges            5429   4732      44338
# Attributes       1433   3703      500
# Classes          7      6         3
# Training nodes   140    120       60
# Valid nodes      500    500       500
# Test nodes       1000   1000      1000
The statistics of the datasets used in this study are shown in Table 1. The three standard citation graph benchmarks of Cora, Citeseer, and Pubmed [17] are widely used in various GNN studies [23, 25]. In the citation graphs, nodes denote papers and links denote undirected citations. Node attributes are elements of a bag-of-words representation of the documents. The class label is the research area of each paper.
Table 2: Precision (%) on the attributed graphs.

             Cora   Citeseer  Pubmed
GCN          81.47  71.01     79.1
GGCN         83.71  71.27     80.88
Graphsage    82.7   69.87     78.56
GGraphsage   83.84  70.2      78.89
APPNP        83.91  71.93     79.68
GAPPNP       84.31  72        80.95
The following baselines are compared in this paper.

Graph Convolutional Network (GCN) [11]: a simple type of GNN, already introduced in detail in Section 2. We use dropout to prevent overfitting [18]. We set the number of training epochs to 300, the number of layers to 2, and the dimension of hidden states to 16. Self-loops are used.

Graphsage [8]: a general framework that samples and aggregates features from a node's local neighborhood. We use the mean aggregator. We set the dropout rate to 0.5, the number of training epochs to 200, the number of layers to 2, and the dimension of hidden states to 16.

APPNP [12]: this method designs a new propagation procedure based on personalized PageRank, and hence can also model long-distance information relative to a source node. We set the dropout rate to 0.5, the number of training epochs to 300, the number of propagation steps to 10, the teleport probability to 0.1, and the dimension of hidden states to 64.
All models are optimized with Adam [10] where the initial learning rate is 0.01 and the weight decay is 0.0005 per epoch.
4.1.2 Training Details
In the unsupervised learning of global features, we conducted 10 random walks starting from each node. The walk length was 100. For each source node, the nearby nodes within 10 steps were considered as neighborhood nodes. The dimensions of both the global structure and attribute vectors were 8. The number of negative samples was 64.
In the semi-supervised learning, we used the three baseline models as kernels of GGNNs, named GGCN, GGraphsage, and GAPPNP. The parameters were exactly the same as those in the baseline methods. We searched α and β between 0.001 and 0.05. The test results were reported when the best validation results were obtained.
4.1.3 Results
The results are shown in Table 2. From the table, we find that all baseline kernels with global information achieve substantial gains on the classification task. For example, GGCN outperforms GCN by 2.24%, 0.26%, and 1.79% in precision on the three datasets respectively. The results demonstrate that the GGNN learning framework can effectively and consistently enhance the learning ability of the corresponding GNN kernels.
APPNP is also designed to enlarge the receptive field and can utilize global information; it outperforms the other baseline models. Although the improvements are not as large as those of GGCN, GAPPNP still significantly outperforms its APPNP kernel by 0.4%, 0.07%, and 1.27% respectively. The result shows that even for a propagation method that can powerfully utilize global information, our GGNN learning schema can still bring considerable precision gains. We believe the advantage comes from the pretraining-and-learning schema, since our global information is obtained via pretraining and no longer suffers from the limitations of weak supervision.
GAPPNP achieves the best results among all the methods. Note that its precisions on Cora (84.31%) and Pubmed (80.95%) are new state-of-the-art results. To the best of our knowledge, the previous best results are GraphNAS (84.2%) [6] on Cora and MixHop (80.8%) [2] on Pubmed.
In all, the results validate the effectiveness of the pretraining-and-learning schema, which can significantly improve the global information preserving ability of GNN-based methods.
4.2 Properties Analysis (Q2)
We now address the second question Q2 with parameter sensitivity and ablation analyses. Herein, we mainly use GCN as the kernel in the attributed graph setting.
4.2.1 Parameter Sensitivity
Figure 2 shows the precision w.r.t. α and β. Generally, different datasets require different α and β to achieve the best precision, and the best value is often around 0.01. The precision decreases quickly if we continue to increase the two parameters. The result shows that it is necessary to introduce the two hyperparameters to turn down the impact of the pretrained features. In fact, the pretrained components would contribute almost nothing without the weighting method.
Figure 3 shows the precision w.r.t. the dimension of the global features. The highest precision is achieved when the dimension is around 8 to 16.
4.2.2 Ablation Analysis
First, the effectiveness of the parallel learning framework is investigated. Note that the simplest way to utilize all three feature matrices S, T, and X is to concatenate them first and then feed the concatenated matrix to a single GNN kernel. Table 3 compares simple concatenation with our parallel method. We find that simple concatenation already outperforms GCN, which demonstrates that the pretraining-and-learning schema can utilize global information well. GGCN further improves on simple concatenation, which validates the effectiveness of the parallel learning method.
Table 3: Precision (%) of simple concatenation vs. the parallel method.

                       Cora   Citeseer  Pubmed
GCN                    81.47  71.01     79.1
Simple Concatenation   83.39  70.78     80.64
GGCN                   83.71  71.27     80.88
Second, we investigate the effect of the global features, as shown in Table 4. The results show that both pretrained features help to increase the model precision. The global structure features are more helpful on Cora, but less effective than the global attribute features on Pubmed and Citeseer. On Cora and Pubmed, the highest precisions are obtained when all the features are used. However, on Citeseer, the global structure features cannot further increase performance once the global attribute features are used. In all, both pretrained features contribute to improving the results of our proposed model, though the amount of improvement depends on the specific dataset.
Table 4: Effect of the global features (precision, %). S and T denote the global structure and attribute features respectively.

           Cora   Citeseer  Pubmed
X only     81.47  71.01     79.1
X + S      83.2   71.05     79.35
X + T      82.77  71.27     80.73
X + S + T  83.71  71.27     80.88
4.3 Experiments on Plain Graphs (Q3)
Table 5: Accuracy (%) on plain graphs with different training ratios.

Cora          10%    20%    30%    40%    50%    60%    70%    80%    90%
PNE           77.58  81.22  82.94  84.54  84.73  85.55  86.15  86.39  87.76
MWNE          74.94  80.83  82.83  83.68  84.71  85.51  87.01  87.27  88.19
GGCN (Plain)  76.88  80.5   82.65  85.06  85.57  86.23  87.67  87.05  89.16

Citeseer
PNE           54.79  60.87  64.67  66.95  68.59  70.00  72.06  73.41  74.76
MWNE          55.60  60.97  63.18  65.08  66.93  69.52  70.47  70.87  70.95
GGCN (Plain)  54.24  60.31  64.16  66.41  69.36  70.77  72.12  74.41  75.89
We address the last question Q3 by comparing GGNN with other plain graph learning methods.
4.3.1 Experiment Setup
We conducted semi-supervised classification on the two datasets of Cora and Citeseer. The results of the baseline methods are cited from their original papers. Pubmed is excluded from the comparison since it is not used in the baseline papers.
The training data was the entire plain graph (attributes X excluded) and part of the node labels. We used 0.1, 0.2, ..., 0.9 of the node labels to train the model respectively, and reported the classification accuracy on the rest of the data. All results are averages over 10 experiments with different random splits.
Our proposed model was GGCN (plain), with GCN as the kernel. In the unsupervised training, the dimension of the global structure vectors was 32. In the semi-supervised learning, the dimension of the hidden states was 256. No dropout was used. The rest of the parameters were the same as those in Section 4.1.2.
Two state-of-the-art semi-supervised learning methods on plain graphs were used as baselines.

MMDW [21]: the method jointly optimizes a max-margin classifier and the underlying social representation learning model.

PNE [5]: the method embeds nodes and labels into the same latent space.
4.3.2 Results
The results are shown in Table 5. GGCN (plain) achieves the highest precision on half of the data points (9 out of 18). In particular, the advantage is more obvious as the training ratio rises. When the training ratio is small, the categories with fewer instances may provide very few training instances, which makes it difficult for GNN to pass messages from these nodes. We believe this is why GGCN is less powerful when the training ratio is small.
In all, the results show that the GGNN learning framework can be successfully applied to plain graphs, achieving similar or better results than state-of-the-art methods.
5 Related Work
Several related works have tried to expand the receptive field of GNN and increase the neighborhood available at each node. PPNP/APPNP [12] improved the message-passing algorithm based on personalized PageRank. [24] proposed jumping knowledge networks that flexibly leverage different neighborhood ranges for each node. NGCN [1] trained multiple instances of GCN over node pairs discovered at different distances in random walks, and learned a combination of these instance outputs. However, because the semi-supervised setting lacks sufficient training labels, these methods have to control model complexity carefully, which limits their ability to explore global information. For example, our experiments have shown that GGNN with an APPNP kernel can still achieve promising improvements.
Some studies have also introduced unsupervised learning into GNNs to alleviate the insufficient supervision problem. [20] proposed an autoencoder architecture that learns a joint representation of both local graph structure and available node features for the multi-task learning of link prediction and node classification. GraphNAS [6] first generates variable-length strings that describe the architectures of graph neural networks, and then maximizes the expected accuracy of the generated architectures on validation data via reinforcement learning. However, these methods do not consider utilizing the global information of graphs.
6 Conclusion
In this paper, we propose a novel framework named GGNN, which is able to conduct semi-supervised learning on both plain and attributed graphs. Our pretraining-and-learning schema marries two lines of work and benefits from their advantages: the global information preserving ability brought by unsupervised learning and the local information aggregation ability of GNN. Extensive experiments confirm the effectiveness of GGNN.
For future work, we plan to test more sophisticated methods for combining the hidden states, and to study other unsupervised methods that can produce global features better suited to the learning ability of GNN.
References
 [1] (2018) NGCN: multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888.
 [2] (2019) MixHop: higher-order graph convolution architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067.
 [3] (1999) Internet: diameter of the world-wide web. Nature 401 (6749), pp. 130.
 [4] (2000) Graph structure in the web. Computer Networks 33 (1-6), pp. 309-320.
 [5] (2017) PNE: label embedding enhanced network embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 547-560.
 [6] (2019) GraphNAS: graph neural architecture search with reinforcement learning. arXiv preprint arXiv:1904.09981.
 [7] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855-864.
 [8] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024-1034.
 [9] (2014) Learning latent representations of nodes for classifying in heterogeneous social networks.
 [10] (2014) Adam: a method for stochastic optimization. Computer Science.
 [11] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
 [12] (2018) Predict then propagate: graph neural networks meet personalized PageRank. arXiv preprint arXiv:1810.05997.
 [13] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
 [14] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980-2988.
 [15] (2013) Distributed representations of words and phrases and their compositionality. In International Conference on Neural Information Processing Systems, pp. 3111-3119.
 [16] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701-710.
 [17] (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93.
 [18] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929-1958.
 [19] (2015) LINE: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067-1077.
 [20] (2018) Learning to make predictions on graphs with autoencoders. In 5th IEEE International Conference on Data Science and Advanced Analytics.
 [21] (2016) Max-margin Deepwalk: discriminative learning of network representation. In International Joint Conference on Artificial Intelligence.
 [22] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
 [23] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
 [24] (2018) Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.
 [25] (2016) Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861.
 [26] (2019) PCANE: preserving context attributes for network embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 156-168.