1 Introduction
Semi-supervised learning on graphs aims to recover the labels of all nodes when only a very small proportion of node labels is given. The setting is popular in a variety of real-world applications, since labels are often expensive and difficult to collect. In recent years, Graph Neural Networks (GNNs) have shown great power in semi-supervised learning on attributed graphs. A GNN usually contains multiple layers, and the nodes collect information from their neighborhood iteratively, layer by layer. Representative methods include the Graph convolutional network (GCN) [11], Graphsage [8], Graph attention networks [23] and so on.
Due to the multi-layer message aggregation scheme, GNNs are prone to over-smoothing after a few propagation steps [13]. That is to say, the node representations tend to become identical and lose their distinctiveness. In general, GNNs can afford only 2 layers, which collect information within 2 hops of the neighborhood; otherwise the over-smoothing problem deteriorates the performance. Several previous works aimed to address the problem and expand the size of the utilized neighborhood. One major line of work improved the information aggregation scheme. For example, N-GCN [1] trained multiple instances of GCNs over node pairs discovered at different distances in random walks. PPNP/APPNP [12] introduced the teleport probability of personalized PageRank.
These solutions are purely based on semi-supervised learning, which has a natural bottleneck in modeling global information. As the receptive field grows and more nodes are involved, more powerful models are required to explore the complex relationships among nodes. However, the training labels are quite sparse in semi-supervised learning, and hence cannot support training models of high complexity well. The existing works therefore often have to restrain the complexity of their models, which limits their ability to learn global information.
The unsupervised learning methods based on random walks, e.g. Deepwalk [16] and Node2vec [7], can be used to obtain the global structure information of each node. These methods first sample node sequences that contain the structural regularity of the network, and then try to maximize the likelihood of neighboring node pairs within a certain distance, e.g. 10 hops. Hence, they can capture global structure features without label information.
In this paper, we propose a new learning schema of pre-training and learning to address the problem of preserving global information in semi-supervised learning. First, instead of improving the aggregation function via semi-supervision, we obtain the global structure and attribute features by pre-training on the graph with a random walk strategy in an unsupervised manner. Second, we design a GNN based method to conduct semi-supervised classification by learning from the pre-trained features and the original graph. The two-stage schema of pretraining-and-learning has several advantages. First, the global information modeling procedure is decoupled from the subsequent semi-supervised learning method. Therefore, the modeling of the global information no longer suffers from sparse supervision or the over-smoothing problem. Second, the framework marries two lines of work and benefits from their advantages: the global information preserving ability brought by random walks and the local information aggregation ability induced by GNNs. Moreover, the method allows GNNs to be applied to plain graphs without attributes, since the unsupervised structure features can be used as graph attributes.
In all, our contributions are as follows.
- We propose a method named Global Information for Graph Neural Network (G-GNN) for semi-supervised classification on graphs. The proposed pretraining-and-learning framework allows GNN models to use global information for learning. Moreover, the schema enables GNNs to be applied to plain graphs.
- We design the global information as the global structure and attribute features of each node, and propose a parallel GNN based model to learn different aspects from the pre-trained global features and the original graph.
- Our method achieves state-of-the-art results in semi-supervised learning on both plain and attributed graphs. In particular, our results for attributed graph learning on Cora (84.31%) and Pubmed (80.95%) set new benchmark results.
The rest of the paper is organized as follows. Section 2 gives the preliminaries. We introduce our method in Section 3. Section 4 presents the experiments. Section 5 briefly summarizes related work. Finally, we conclude our work in Section 6.
2 Preliminaries
First, we give the formal definition of attributed and plain graphs, and of the problem we are going to solve.
Let $G = (V, E)$ be a graph, where $V = \{v_1, \dots, v_n\}$ denotes the node set and $E$ denotes the edges, with adjacency matrix $A \in \mathbb{R}^{n \times n}$. If $A_{ij}$ is not equal to 0, there is a link from $v_i$ to $v_j$ with weight $A_{ij}$. If $G$ is an attributed graph, there is a corresponding attribute matrix $X \in \mathbb{R}^{n \times f}$, where the $i$-th row denotes $v_i$'s attributes and $f$ denotes the total number of attributes. If $G$ is a plain graph, no $X$ is provided. The graph contains label information $Y \in \mathbb{R}^{n \times c}$, where the $i$-th row denotes $v_i$'s one-hot label vector and $c$ is the number of label classes.
During the training stage, the provided data are the entire adjacency matrix $A$ and the node attributes $X$. Only the labels of the training nodes $V_L \subset V$ are given. The task of semi-supervised learning on attributed graphs is to predict the labels of the remaining nodes $V \setminus V_L$. $X$ is not provided for plain graph learning.
2.2 Graph Neural Networks
We now introduce how general GNNs solve the semi-supervised learning problem. Note that current GNNs can only be applied to attributed graphs. Therefore, we assume $X$ is given here.
Among the huge family of GNNs, the Graph convolutional network (GCN) [11] is a simple and far-reaching method that motivates many of the following works. Let $\tilde{A} = A + I_n$ be the adjacency matrix with self-connections, where $I_n$ is an identity matrix. The self-loops allow GCN to consider the attributes of the represented node itself when aggregating the neighborhood's attributes. Let $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ be the normalized adjacency matrix, where $\tilde{D}$ denotes the diagonal degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. The two-layer GCN produces the following hidden states by aggregating neighborhood attributes iteratively:
$$H = \hat{A}\,\mathrm{ReLU}\big(\hat{A} X W^{(0)}\big) W^{(1)} \qquad (1)$$
where $H \in \mathbb{R}^{n \times c}$. Each row of $H$ denotes the final hidden states of a node, and each column corresponds to a prediction category. $W^{(0)}$ and $W^{(1)}$ are the trainable weight matrices. After that, a softmax function is applied row-wise to produce the classification probability of each class:
$$Z = \mathrm{softmax}(H) \qquad (2)$$
Finally, a loss function is applied to measure the difference between the predicted probabilities and the ground-truth labels.
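The forward pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`normalize_adjacency`, `gcn_forward`), the toy graph and the weight values are all hypothetical, ReLU is assumed as the hidden activation, and dense lists are used instead of sparse matrices for readability.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # self-loops keep every degree positive
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(adj, x, w0, w1):
    """Two-layer GCN: softmax(A_hat * ReLU(A_hat * X * W0) * W1)."""
    a_hat = normalize_adjacency(adj)
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    h = matmul(a_hat, matmul(hidden, w1))
    return softmax_rows(h)

# Toy example (hypothetical values): a 3-node path graph with 2 attributes and 2 classes.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w1 = [[1.0, -1.0], [0.5, 0.5]]
z = gcn_forward(adj, x, w0, w1)  # one probability row per node
```

Because the normalized adjacency is applied exactly twice, each node's output depends only on attributes within 2 hops.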
Many subsequent studies aimed to improve the aggregation function, for example by assigning alternative weights to the neighborhood nodes [23], adding skip connections [8], or introducing teleport probability [12]. Abstracting from the specific designs, these methods can be viewed as a transformation from the original $A$ and $X$ to the final hidden states $H$:
$$H = \mathrm{GNN}(A, X)$$
From Equation 1, we can see that only 2 hops of local information can be used. The over-smoothing problem prevents adding more layers, so global information is difficult to integrate. Moreover, the input attributes are necessary in the general learning framework of GNNs, so GNNs cannot be applied to plain graphs directly. In the next section, we show how to solve these problems with the proposed pretraining-and-learning schema.
3 Proposed Method
In this section, we first give an overview of G-GNN in the context of attributed graphs. Second, the method for obtaining the global features is introduced. Third, a parallel GNN based method is proposed to learn from all these features. Finally, we show how to extend G-GNN to plain graphs.
The overview of the G-GNN method is shown in Figure 1. First, the global structure feature matrix $X^{s}$ and the global attribute feature matrix $X^{a}$ are learned in an unsupervised way. Next, $X^{s}$, $X^{a}$ and the original attribute matrix $X$ are fed to a parallel GNN based model to learn their corresponding hidden states. Finally, the final hidden states $H$ are the weighted sum of the 3 hidden states $H^{s}$, $H^{a}$ and $H^{o}$, learned from $X^{s}$, $X^{a}$ and $X$ respectively.
3.2 Unsupervised Learning of Global Features
Herein, we propose to learn the unsupervised features of graphs based on a random walk strategy. Each node can utilize information within $t$ steps of a random walk, where $t$ is often set to 10. The small-world phenomenon suggests that the average distance between nodes grows logarithmically with the number of nodes in the graph [3], and the average undirected distance of a very large Web graph is likewise small [4]. Hence, $t = 10$ steps of random walk can already capture the global information of the graphs.
3.2.1 Global Structure Features
Similar to Deepwalk [16], the structure features are learned by first sampling context-source node pairs, and then maximizing their co-occurrence probability. Note that graph attributes are not used here. We apply random walks to the graph to obtain short truncated node sequences. These sequences contain the structural regularity of the original graph. In the node sequences, the neighboring nodes within a certain distance of the source node $v_i$ are considered as $v_i$'s context nodes, which are denoted as $C(v_i)$.
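A minimal sketch of how such source-context pairs might be sampled is given below. It assumes uniform neighbor selection, a symmetric context window and an adjacency-list input; the function names and the toy graph are hypothetical, and the walk counts are illustrative rather than experimental settings.

```python
import random

def random_walk(neighbors, start, walk_length, rng):
    """Truncated uniform random walk; stops early at a node with no neighbors."""
    walk = [start]
    while len(walk) < walk_length:
        nbrs = neighbors[walk[-1]]
        if not nbrs:
            break
        walk.append(rng.choice(nbrs))
    return walk

def context_pairs(walk, window):
    """Pair every position in the walk with the nodes at most `window` steps away."""
    pairs = []
    for i, source in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((source, walk[j]))
    return pairs

def sample_pairs(neighbors, walks_per_node, walk_length, window, seed=0):
    """Start `walks_per_node` walks from every node and collect all pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(walks_per_node):
        nodes = list(neighbors)
        rng.shuffle(nodes)  # visit start nodes in a random order on each pass
        for node in nodes:
            walk = random_walk(neighbors, node, walk_length, rng)
            pairs.extend(context_pairs(walk, window))
    return pairs

# Toy path graph 0 - 1 - 2 - 3 (hypothetical data).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pairs = sample_pairs(neighbors, walks_per_node=2, walk_length=6, window=2)
```

A wider window lets each source node see nodes that are many hops away in the graph, which is what gives the learned features their global character.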
To maximize the likelihood of the observed source-context pairs, we try to minimize the following objective function:
Here $X^{s} \in \mathbb{R}^{n \times d_s}$ denotes the global structure feature matrix, where the $i$-th row denotes $v_i$'s global structure feature vector and $d_s$ denotes the dimension of the vectors. Calculating the denominator is computationally expensive since it requires traversing the entire node set. We approximate it using negative sampling [15].
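As a sketch of this approximation, the per-pair loss under negative sampling can be written as follows. It assumes the standard skip-gram formulation, in which the full softmax over all nodes is replaced by one positive term and a handful of negative terms; the function names are hypothetical, and how the negative vectors are drawn is left to the caller because it is not specified here.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(source_vec, context_vec, negative_vecs):
    """Loss for one (source, context) pair and its sampled negative nodes.

    The observed pair is pushed towards a high score, while each sampled
    negative is pushed towards a low score, so no sum over all nodes is needed.
    """
    loss = -log_sigmoid(dot(source_vec, context_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(source_vec, neg))
    return loss

# Toy vectors (hypothetical values).
aligned = negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
opposed = negative_sampling_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

The cost per pair grows with the number of negatives rather than with the number of nodes in the graph.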
3.2.2 Global Attribute Features
The global attribute features are obtained by maximizing the likelihood of the context attributes. The underlying idea is that if the context attributes can be recovered from the source node, they have already been preserved by the learning model.
For each sampled context node $v_j \in C(v_i)$, some attributes of $v_j$ are sampled as the context attributes of $v_i$. In this paper, we sample one attribute for each context node. Let $CA(v_i)$ be the sampled context attributes of $v_i$, and $F$ be the set of all attributes, where $|F| = f$. We try to minimize the following objective.
where $X^{a} \in \mathbb{R}^{n \times d_a}$ denotes the global attribute feature matrix and $W^{a}$ denotes the parameters used to predict the attributes. The $i$-th row of $X^{a}$ denotes $v_i$'s global attribute feature vector and $d_a$ denotes the dimension of the vectors.
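The sampling step and a matching prediction loss can be sketched as below. This is an illustration under assumptions, not the paper's objective: it assumes that a context attribute is drawn uniformly from the non-zero attributes of each context node, and that attributes are predicted with a softmax over all attributes. The names `sample_context_attributes` and `attribute_softmax_loss` and the toy data are hypothetical.

```python
import math
import random

def sample_context_attributes(context_nodes, node_attributes, rng):
    """Draw one attribute index for each context node that has any attributes."""
    sampled = []
    for node in context_nodes:
        present = node_attributes.get(node, [])
        if present:
            sampled.append(rng.choice(present))
    return sampled

def attribute_softmax_loss(source_vec, attr_weights, target_attr):
    """Negative log-likelihood of one context attribute given the source vector.

    `attr_weights` holds one weight vector per attribute.
    """
    scores = [dot_product(source_vec, weights) for weights in attr_weights]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(v - top) for v in scores))
    return log_norm - scores[target_attr]

def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data: node -> indices of its non-zero attributes (hypothetical values).
node_attributes = {0: [2, 5], 1: [0], 2: []}
rng = random.Random(0)
context_attrs = sample_context_attributes([0, 1, 2, 1], node_attributes, rng)
```

In practice the softmax over all attributes could be approximated in the same way as the node softmax, but that choice is not stated in the text.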
PCANE [26] is an unsupervised graph learning method that utilizes the context attributes. It learns the node representations by jointly optimizing two objective functions that preserve the neighborhood nodes and attributes. The main difference in our work is that we learn two separate feature vectors for each node, which provides richer information for the subsequent learning algorithm.
3.3 Parallel Graph Neural Networks
As shown in Figure 1, we propose a parallel model with several GNN kernels to learn from the input matrices $X^{s}$, $X^{a}$ and $X$. The learning is semi-supervised.
3.3.1 Learning from the Heterogeneous Features
The motivation for applying multiple parallel GNN kernels to these feature matrices is as follows. First, the features are quite heterogeneous, especially when some of them are obtained via pre-training. The parallel kernels can learn different aspects from these features respectively. Second, the three feature matrices are highly correlated. For example, the global attribute features $X^{a}$ are obtained partly based on the original attributes $X$, and the global structure features $X^{s}$ and $X^{a}$ are sampled based on the identical random walk method. This makes it difficult for a single model to learn the complex relationships among them. The parallel setting effectively gives the model some implicit prior knowledge that these features are different, which makes the optimization easier. Indeed, parallel schemas have been successful in previous works, such as multi-head attention [22, 23] and N-GCN [1].
First, because the amplitude at each dimension of the pre-trained $X^{s}$ and $X^{a}$ often varies a lot, it is better to normalize them. For each row in $X^{s}$ or $X^{a}$, we make the following transformation:
Then, several kernels of GNN are proposed to learn from the three feature matrices.
3.3.2 Combine the Hidden States
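A simple way to obtain the final hidden state matrix $H$ is to linearly combine the 3 obtained hidden state matrices,
$$H = H^{o} + \alpha H^{s} + \beta H^{a}$$
where $H^{o}$, $H^{s}$ and $H^{a}$ are the hidden states learned from the original attributes, the global structure features and the global attribute features respectively, and $\alpha$ and $\beta$ are coefficients between 0 and 1.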
Then a softmax function is applied to $H$, as in Equation 2, to get the prediction probability matrix $Z$, where $Z_{ij}$ denotes the probability that node $v_i$'s label is class $j$. The coefficients $\alpha$ and $\beta$ are used to turn down the effect of the pre-trained features, which is essential for optimization. Bengio et al. (2007) suggested that pre-training is useful for initializing the network in a region of the parameter space where optimization is easier. In training, easy samples often contribute more to the loss and dominate the gradient updates [14]. Similarly, we found that the easily trained components learned from the pre-trained structure and attribute features also dominate the learning procedure. If no weighting strategy is used, the component learned from the original attributes barely contributes to the results, and hence the performance is far from promising.
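The weighted combination can be sketched in a few lines. The form below assumes that the hidden states from the original attributes enter with weight 1 while the two pre-trained components are scaled down by two small coefficients; the function name, the argument names and the numbers are hypothetical.

```python
def combine_hidden_states(h_orig, h_struct, h_attr, alpha, beta):
    """Element-wise h_orig + alpha * h_struct + beta * h_attr for equally shaped matrices."""
    return [
        [o + alpha * s + beta * a for o, s, a in zip(row_o, row_s, row_a)]
        for row_o, row_s, row_a in zip(h_orig, h_struct, h_attr)
    ]

# Toy hidden states for one node and two classes (hypothetical values).
h_orig = [[1.0, 2.0]]
h_struct = [[10.0, 20.0]]
h_attr = [[100.0, 200.0]]
h = combine_hidden_states(h_orig, h_struct, h_attr, alpha=0.01, beta=0.001)
```

With small coefficients, components of much larger magnitude are brought down to a scale comparable to that of the component learned from the original attributes.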
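3.3.3 Training
We minimize the cross-entropy loss between $Z$ and the ground-truth labels $Y$ to train the model [1]:
$$\mathcal{L} = -\sum_{i,j} \big(\mathrm{diag}(V_L)\,[Y \circ \log Z]\big)_{ij}$$
where $V_L$ is the set of labeled training nodes, $\circ$ denotes the Hadamard product, and $\mathrm{diag}(V_L)$ denotes a diagonal matrix whose entry at $(i, i)$ is set to 1 if $v_i \in V_L$ and 0 otherwise.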
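A masked cross-entropy of this kind can be sketched as follows. The sketch assumes one-hot label rows and a boolean training mask that plays the role of the diagonal matrix; the function name and the toy values are hypothetical.

```python
import math

def masked_cross_entropy(probs, labels, train_mask):
    """Sum of -log p(true class) over the nodes whose mask entry is true.

    Multiplying the label matrix element-wise with log-probabilities keeps only
    the true-class term of each row; the mask then drops unlabeled nodes.
    """
    total = 0.0
    for p_row, y_row, is_train in zip(probs, labels, train_mask):
        if is_train:
            total -= sum(y * math.log(p) for y, p in zip(y_row, p_row) if y)
    return total

# Two nodes, two classes (hypothetical values); only the first node is labeled.
probs = [[0.5, 0.5], [0.9, 0.1]]
labels = [[1, 0], [0, 1]]
loss = masked_cross_entropy(probs, labels, [True, False])
```

Nodes outside the training set still take part in message passing, but they contribute nothing to the loss.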
3.4 Learning on Plain Graphs
Plain graphs contain no attributes. This does not affect the learning of the global structure feature matrix $X^{s}$, so the final hidden states are computed from the structure features alone:
$$H = \mathrm{GNN}(A, X^{s})$$
Hence, learning on plain graphs also follows the pretraining-and-learning schema, where some components that depend on the graph attributes are removed.
4 Experiments
In this section, we want to address the following questions (the data, code and pre-trained vectors to reproduce our results will be released on https://github.com/zhudanhao/g-gnn):
Q1: How do G-GNNs perform in comparison to state-of-the-art GNN kernels on attributed graphs?
Q2: Are all the designed components in G-GNN helpful for achieving stronger learning ability?
Q3: How does G-GNN perform in comparison to state-of-the-art learning methods on plain graphs?
4.1 Experiments on Attributed Graphs (Q1)
Herein, we address the first question Q1 by comparing G-GNNs with different GNN kernels. Note that the code of all the GNN methods is based on the implementations released by DGL (https://github.com/dmlc/dgl/tree/master/examples/pytorch). All results are the average of 10 experiments with different random initialization seeds.
4.1.1 Datasets and Baselines
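Table 1: Statistics of the datasets used in this study.
| | Cora | Citeseer | Pubmed |
|---|---|---|---|
| # Training nodes | 140 | 120 | 60 |
| # Valid nodes | 500 | 500 | 500 |
| # Test nodes | 1000 | 1000 | 1000 |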
The statistics of the datasets used in this study are shown in Table 1. The three standard citation graph benchmark datasets of Cora, Citeseer and Pubmed [17] are widely used in various GNN studies [23, 25]. In the citation graphs, nodes denote papers and links denote undirected citations. Node attributes are the elements of the bag-of-words representation of the documents. The class label is the research area of each paper.
The following baselines are compared in this paper.
- Graph convolutional network (GCN) [11]: a simple type of GNN that has already been introduced in detail in Section 2. We use dropout [18] to prevent overfitting. We set the number of training epochs to 300, the number of layers to 2, and the dimension of hidden states to 16. Self-loops are used.
- Graphsage [8]: a general framework that samples and aggregates features from a node's local neighborhood. We use the mean aggregator. We set the dropout rate to 0.5, the number of training epochs to 200, the number of layers to 2, and the dimension of hidden states to 16.
- APPNP [12]: a propagation procedure based on personalized PageRank, which can therefore also model long-distance information relative to a source node. We set the dropout rate to 0.5, the number of training epochs to 300, the number of propagation steps to 10, the teleport probability to 0.1 and the dimension of hidden states to 64.
All models are optimized with Adam [10], where the initial learning rate is 0.01 and the weight decay is 0.0005 per epoch.
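To make the roles of the propagation steps and the teleport probability concrete, the APPNP-style propagation rule can be sketched as below. This is a simplified illustration assuming the usual personalized PageRank power iteration on a dense normalized adjacency matrix; the function names and the toy inputs are hypothetical, and the sketch is not the DGL implementation used in the experiments.

```python
def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def appnp_propagate(a_hat, h, steps=10, teleport=0.1):
    """Iterate z <- (1 - teleport) * a_hat * z + teleport * h, starting from z = h.

    At every step a fraction `teleport` of the original prediction `h` is mixed
    back in, which keeps the result tied to each node's own prediction even
    after many propagation steps.
    """
    z = [row[:] for row in h]
    for _ in range(steps):
        az = matmul(a_hat, z)
        z = [
            [(1.0 - teleport) * az[i][j] + teleport * h[i][j] for j in range(len(h[0]))]
            for i in range(len(h))
        ]
    return z

# Two connected nodes with self-loops give a normalized adjacency of all 0.5 (toy values).
a_hat = [[0.5, 0.5], [0.5, 0.5]]
h = [[1.0], [0.0]]
z = appnp_propagate(a_hat, h)
```

Setting the teleport probability to 0 would reduce the rule to repeated neighborhood averaging, which is exactly the regime where over-smoothing appears.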
4.1.2 Training Details
In the unsupervised learning of global features, we conducted 10 random walks starting from each node. The walk length was 100. For each source node, the nearby nodes within 10 steps were considered as the neighborhood nodes. The dimensions of both the global structure and attribute vectors were 8. The number of negative samples was 64.
In the semi-supervised learning, we used the three baseline models as kernels of G-GNNs, named G-GCN, G-Graphsage and G-APPNP. The parameters were exactly the same as those in the baseline methods. We searched the two combination coefficients $\alpha$ and $\beta$ between 0.001 and 0.05. The test results were reported at the point where the best validation results were obtained.
4.1.3 Results
The results are shown in Table 2. The table shows that all baseline kernels with global information achieve gains on the classification task. For example, G-GCN outperforms GCN by 2.24%, 0.26% and 1.79% in precision on the three datasets respectively. The results demonstrate that the learning framework of G-GNNs can effectively and consistently enhance the learning ability of the corresponding GNN kernels.
APPNP is also designed to enlarge the receptive field and can utilize global information. APPNP outperforms the other baseline models. Although the improvement is not as large as that of G-GCN, G-APPNP still outperforms its kernel APPNP by 0.4%, 0.07% and 1.27% respectively. The result shows that even for a propagation method that can already utilize global information powerfully, our learning schema of G-GNNs can still bring precision gains. We believe the advantage comes from the pretraining-and-learning schema, since our global information is obtained via pre-training and no longer suffers from the limitation brought by weak supervision.
G-APPNP achieves the best results among all the methods. Note that its precisions on Cora (84.31%) and Pubmed (80.95%) are new state-of-the-art results. To the best of our knowledge, the previous best results are GraphNAS (84.2%) [6] on Cora and MixHop (80.8%) [2] on Pubmed.
In all, the results validate the effectiveness of the pretraining-and-learning schema, which can significantly improve the global information preserving ability of GNN based methods.
4.2 Properties Analysis (Q2)
We now address the second question Q2. Parameter sensitivity and ablation analyses are given. Herein, we mainly use GCN as the kernel, and the setting is attributed graph learning.
4.2.1 Parameter Sensitivity
Figure 2 shows the precision w.r.t. the two combination coefficients $\alpha$ and $\beta$. Generally, different datasets require different $\alpha$ and $\beta$ to achieve the best precision, and the best values are often around 0.01. The precision decreases quickly if we continue to increase the two parameters. The result shows that it is necessary to introduce the two hyper-parameters to turn down the impact of the pre-trained features. In fact, the component learned from the original attributes contributes almost nothing without the weighting method.
Figure 3 shows the precision w.r.t. dimensions of the global features. The highest precision is achieved when the dimension is around 8 to 16.
4.2.2 Ablation Analysis
First, the effectiveness of the parallel learning framework is investigated. Note that the simplest way to utilize all 3 feature matrices (the global structure features, the global attribute features and the original attributes) is to concatenate them first, and then feed the concatenated feature matrix to a single GNN kernel. Table 3 compares the results of simple concatenation and our parallel method. We find that simple concatenation already outperforms GCN, which demonstrates that the pretraining-and-learning schema can utilize the global information well. G-GCN improves further over simple concatenation, which validates the effectiveness of the parallel learning method.
Second, we investigate the effect of the global features, as shown in Table 4. The results show that both pre-trained features help to increase the model precision. The global structure features are more helpful on Cora, but less effective than the global attribute features on Pubmed and Citeseer. On Cora and Pubmed, the highest precisions are obtained when all the features are used. However, on Citeseer, the global structure features do not increase the performance further if the global attribute features are used. In all, both pre-trained features contribute to improving the results of our proposed model, although the amount of improvement depends on the specific dataset.
4.3 Experiments on Plain Graphs (Q3)
We address the last question Q3 by comparing G-GNN with other plain graph learning methods.
4.3.1 Experiment Setup
We conducted the task of semi-supervised classification on the two datasets of Cora and Citeseer. The results of the baseline methods are cited from their original papers. Pubmed is excluded from the comparison since it is not used in the baseline papers.
The training data was the entire plain graph (attributes X are excluded) and part of the node labels. We used 0.1, 0.2, ..., 0.9 of the node labels to train the model respectively, and reported the classification accuracy on the rest of the data. All results were the average of 10 experiments with different random splits.
Our proposed model was G-GCN (plain), where GCN was used as the kernel. In the unsupervised training, the dimension of the global structure vectors was 32. In the semi-supervised learning, the dimension of the hidden states was 256. No dropout was used. The rest of the parameters were the same as those in Section 4.1.2.
Two state-of-the-art semi-supervised learning methods on plain graphs were used as baselines.
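- MMDW [21]: max-margin Deepwalk, a discriminative method for learning network representations.
- PNE [5]: The method embedded nodes and labels into the same latent space.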
The results are shown in Table 5. G-GCN (plain) achieves the highest precision on half of the total data points (9 out of 18). In particular, the advantage is more obvious as the training ratio rises. When the training ratio is small, the categories with fewer instances may provide only a very small number of training instances, which makes it difficult for the GNN to pass messages from these nodes. We believe this is the reason why G-GCN is less powerful when the training ratio is small.
In all, the results show that the learning framework of G-GNN can be successfully applied to plain graphs, and achieve similar or better results than state-of-the-art methods.
5 Related Work
Several related works have tried to expand the receptive field of GNNs and increase the neighborhood available at each node. PPNP/APPNP [12] improved the message passing algorithm based on personalized PageRank. The work in [24] proposed jumping knowledge networks that can flexibly leverage different neighborhood ranges for each node. N-GCN [1] trained multiple instances of GCNs over node pairs discovered at different distances in random walks, and learned a combination of these instance outputs. However, because the semi-supervised setting lacks sufficient training labels, these methods have to control the model complexity carefully, which limits their ability to explore global information. For example, our experiments have shown that G-GNNs with the APPNP kernel can still achieve a promising improvement over APPNP itself.
Some studies have also tried to introduce unsupervised learning into GNNs to alleviate the insufficient supervision problem. The work in [20] proposed an auto-encoder architecture that learned a joint representation of both local graph structure and available node features for the multi-task learning of link prediction and node classification. GraphNAS [6] first generated variable-length strings that described the architectures of graph neural networks, and then maximized the expected accuracy of the generated architectures on validation data based on reinforcement learning. However, these methods do not consider utilizing the global information of the graphs.
6 Conclusion
In this paper, we propose a novel framework named G-GNN, which is able to conduct semi-supervised learning on both plain and attributed graphs. Our pretraining-and-learning schema marries two lines of work and benefits from their advantages: the global information preserving ability brought by unsupervised learning and the local information aggregation ability induced by GNNs. Extensive experiments confirm the effectiveness of G-GNN.
For future work, we plan to test some more complicated methods that combine the hidden states, and study other unsupervised methods that can produce global features more suitable for the learning ability of GNN.
References
[1] (2018) N-gcn: multi-scale graph convolution for semi-supervised node classification. arXiv preprint arXiv:1802.08888.
[2] (2019) Mixhop: higher-order graph convolution architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067.
[3] (1999) Internet: diameter of the world-wide web. nature 401 (6749), pp. 130.
[4] (2000) Graph structure in the web. Computer networks 33 (1-6), pp. 309–320.
[5] (2017) PNE: label embedding enhanced network embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 547–560.
[6] (2019) GraphNAS: graph neural architecture search with reinforcement learning. arXiv preprint arXiv:1904.09981.
[7] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864.
[8] (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
[9] (2014) Learning latent representations of nodes for classifying in heterogeneous social networks.
[10] (2014) Adam: a method for stochastic optimization. Computer Science.
[11] (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
[12] (2018) Predict then propagate: graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997.
[13] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
[14] (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
[15] (2013) Distributed representations of words and phrases and their compositionality. In International Conference on Neural Information Processing Systems, pp. 3111–3119.
[16] (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710.
[17] (2008) Collective classification in network data. AI magazine 29 (3), pp. 93.
[18] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[19] (2015) Line: large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077.
[20] (2018) Learning to make predictions on graphs with autoencoders. In 5th IEEE International Conference on Data Science and Advanced Analytics.
[21] (2016) Max-margin deepwalk: discriminative learning of network representation. In International Joint Conference on Artificial Intelligence.
[22] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
[23] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
[24] (2018) Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536.
[25] (2016) Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861.
[26] (2019) PCANE: preserving context attributes for network embedding. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 156–168.