A graph is a flexible data structure that stores data and reflects the underlying topological relationships among the data. Because of this, graph structures are widely used in various fields, including social networks, biological protein-protein networks, drug molecule graphs, and knowledge networks. For example, in a drug molecule graph, a node can represent an atom, and an edge between two nodes indicates a chemical bond between the corresponding atoms. In this way, people can easily store data in the network and access relational knowledge about the interacting entities from this structured knowledge base at any time.
The traditional methods of extracting information from graphs depend on graph statistics (degree, clustering coefficient, centrality [6, 7], etc.) or on carefully designed kernel functions (or other characteristic functions). However, with the development of information technology, the amount of data increases rapidly, which makes graph networks more and more complex. Therefore, the traditional manual feature extraction methods become expensive, time-consuming and unreliable: they cannot extract valid information from complicated organizations.
In this case, representation learning plays a key role in efficiently extracting information from the graph. Graph representation learning is a technique for learning from graph-structured data that aims to transform complex original data into low-dimensional vector representations that are easy to process. In essence, it learns a function that maps the input graph or node to a point in a low-dimensional vector space. Compared with traditional methods, representation learning treats the problem of capturing graph information as part of the learning task itself, rather than as a mere preprocessing step. Because the features are obtained in a data-driven way, the trouble of traditional manual feature extraction is avoided.
The goal of representation learning is not simply to obtain results directly, but to obtain an efficient representation of the original data. In other words, the choice of representation usually depends on subsequent learning tasks, and a good representation should make the learning of downstream tasks easier. In recent years, representation learning on graphs has been very popular, and many good results have been obtained. Belkin et al. proposed Laplacian eigenmaps in 2002, which is one of the earliest and most famous representation learning methods. Since then, a large number of methods have been proposed; representative ones include GraRep (Cao et al.), DeepWalk (Perozzi et al.), node2vec (Grover et al.), GNN (Scarselli et al.), and GCN (Kipf et al.).
However, on the one hand, these methods are based on the binary relationships (i.e., edges) in the graph and cannot leverage the local structure of the graph; on the other hand, because the edges in the graph are sparse, many methods have difficulty generalizing in many cases. Our method is therefore proposed to leverage the high-order connection patterns that are essential for understanding the control and regulation of the basic structure of complex network systems, and to alleviate the problem of edge sparsity in the graph.
This paper develops a new framework, wGCN, a controllable and supervised representation learning method. wGCN can be regarded as a two-stage task: in the first stage, we execute a random walk algorithm on the original graph and use the generated walks to redistribute the weights of the graph network, obtaining the reconstructed graph; in the second stage, we feed the reconstructed graph into a graph convolutional neural network and combine the original features and labels for training. This new framework can improve the accuracy of prediction tasks at the cost of a little more time. We prove that it has the same order of time complexity as the graph convolutional neural network, and we conduct a large number of experiments on a variety of real datasets, on which it obtains test results that match or exceed those of the baselines.
The rest of this article is organized as follows. In Section 2, the related work in the past is summarized, and in Section 3, our representation learning method wGCN is introduced in detail. Our experiment will be introduced in Section 4, and the results will be given in Section 5. In Section 6, conclusions and future work will be discussed.
2 Related Work
This part focuses on previous work closely related to wGCN. Specifically, we first introduce random walks, and then introduce some classic graph representation learning methods that inspire our method.
2.1 Random Walk
A random walk is a process whose future steps and directions cannot be predicted from its past behavior. The core concept is that the conserved quantities carried by any irregular walker each correspond to a law of diffusion and transport; the random walk is an idealized mathematical model of Brownian motion.
A random walk on a graph starts from one node or a set of nodes and moves between nodes in the graph according to specific rules. For example, a walker randomly selects a vertex to start from; at each step it either walks to a neighbor of the current vertex or, with a given probability called the jump probability, jumps to an arbitrary node in the graph. Each step yields a probability distribution over the nodes to be visited, which is used as the input of the next step of the random walk, and this process is iterated. When certain conditions are met, the iteration converges to a stable probability distribution. Fig.1 gives a specific random walk on a graph. Random walks are widely used in the field of information retrieval; the well-known PageRank algorithm is a classic application.
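The iteration described above can be sketched as follows. This is a minimal illustration rather than code from the paper: the function name `random_walk_with_jump`, the parameter `jump_prob`, the uniform choice of jump target, and the treatment of nodes without neighbors are all assumptions made for the example.

```python
def random_walk_with_jump(adj, jump_prob=0.15, tol=1e-10, max_iter=1000):
    """Iterate the visiting distribution of a walker that follows a random
    edge with probability 1 - jump_prob and jumps to a uniformly chosen
    node with probability jump_prob. `adj` maps node -> list of neighbors."""
    nodes = sorted(adj)
    n = len(nodes)
    dist = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        # Probability mass that arrives through jumps.
        nxt = {v: jump_prob / n for v in nodes}
        for v in nodes:
            nbrs = adj[v]
            if nbrs:
                share = (1.0 - jump_prob) * dist[v] / len(nbrs)
                for u in nbrs:
                    nxt[u] += share
            else:
                # Assumption: a node without neighbors spreads its mass uniformly.
                share = (1.0 - jump_prob) * dist[v] / n
                for u in nodes:
                    nxt[u] += share
        delta = sum(abs(nxt[v] - dist[v]) for v in nodes)
        dist = nxt
        if delta < tol:  # the distribution has stabilized
            break
    return dist


# Small undirected example graph (hypothetical).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
stationary = random_walk_with_jump(graph)
```

In this example the best-connected node (node 2) ends up with the largest visiting probability, which is the intuition PageRank builds on.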
2.2 Representation Learning Method
Graph representation learning transforms complex raw graph data into a low-dimensional representation that is convenient for machine learning methods to process. To a certain extent, it can also be regarded as a method of dimensionality reduction. Depending on whether a neural network is used, graph representation learning methods can be divided into two categories: i) direct coding methods that do not use neural networks; ii) neural network-based coding methods.
Direct coding methods. Early methods for learning node representations are concentrated in the matrix factorization framework. Belkin et al. present the Laplacian eigenmaps method, which can be viewed as an encoder-decoder framework whose similarity is measured by Euclidean distance in the coding space. Following the Laplacian eigenmaps method, Ahmed et al. and Cao et al. propose Graph Factorization (GF) and GraRep, respectively, whose main difference lies in the way the basic matrix is used: GF uses the original adjacency matrix of the graph, while GraRep is based on various powers of the adjacency matrix, which capture high-order relationships. Mingdong et al. present High Order Proximity preserved Embedding (HOPE), which can preserve the asymmetric transitivity of a directed graph.
Fig.2: Illustration of the node2vec random walk. The walk has just moved from the previous node to the current node. Hyperparameters $p$ and $q$ control the (unnormalized) probability of the next step from the current node to each of its neighbors. As marked in subfigure (a), the probability of returning to the previous node is $1/p$; the probability of moving to a node that is also a neighbor of the previous node is $1$; and the probability of moving to a second-order neighbor of the previous node is $1/q$. Thus, when $p$ decreases, walking tends to return to the source node (i.e., “more local”); when $q$ decreases, walking tends to move away from the source node (i.e., “more global”). The consequence is shown in subfigure (b): red arrows show the walking that is “more local”, and blue arrows show the walking that is “more global”.
On the other hand, there are also some classic methods based on random walks instead of matrix factorization, which makes them more flexible. Two representative methods are DeepWalk and node2vec. DeepWalk uses random walks to decompose the graph, which is a nonlinear structure, into multiple linear node sequences; the node sequences are then treated as “sentences” (with the nodes as “words”) and processed by SkipGram. As for node2vec, it allows an adjustable random walk on the graph. In particular, node2vec uses two hyperparameters $p$ and $q$ to make the random walk “more local” or “more global”, in other words, closer to breadth-first search or to depth-first search (relative to the starting node, see Fig.2).
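The role of the two node2vec hyperparameters can be illustrated with a short sketch of the next-step probabilities. This is only an illustration: the function name `node2vec_step_probs`, the node labels `t`, `v`, `x1`, `x2`, `x3`, and the toy graph are hypothetical, and edge weights are assumed to be 1.

```python
def node2vec_step_probs(adj, prev, cur, p, q):
    """Normalized probabilities of the next node for a walk that has just
    moved prev -> cur, using the node2vec return (p) and in-out (q) parameters."""
    prev_nbrs = set(adj[prev])
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # step back to the previous node
        elif nxt in prev_nbrs:
            weights[nxt] = 1.0       # stay within the previous node's neighborhood
        else:
            weights[nxt] = 1.0 / q   # move further away from the previous node
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Hypothetical toy graph: the walk has just moved from "t" to "v".
adj = {
    "t": ["v", "x1"],
    "v": ["t", "x1", "x2", "x3"],
    "x1": ["t", "v"],
    "x2": ["v"],
    "x3": ["v"],
}
local_probs = node2vec_step_probs(adj, "t", "v", p=0.5, q=2.0)
global_probs = node2vec_step_probs(adj, "t", "v", p=2.0, q=0.5)
```

With a small `p` and a large `q`, most of the probability mass stays near the previous node; swapping the two values pushes the walk outward.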
Neural network-based coding methods. The direct encoding methods above generate a representation vector for each node independently, which results in no parameter sharing between nodes, high computational complexity, and underuse of node attribute information. To solve these problems, many graph neural network methods have been proposed in recent years. Scarselli et al. present the Graph Neural Network (GNN) model, which implements a function that maps a graph and one of its nodes to Euclidean space. Convolutional Neural Networks (CNNs) greatly reduce the number of parameters in image processing by using convolution kernels to gather the information of local pixels, but they encounter difficulties on graphs because graphs are irregular: the number of neighbors of a node is not fixed. Inspired by the success of CNNs, Kipf et al. propose the well-known Graph Convolutional Network (GCN) method, which applies the convolution operation to the graph structure so that the information of neighbor nodes is aggregated on the irregular graph.
We first perform random walks on the original graph, and then use the obtained “walks” to reconstruct the graph network. After that, the graph convolution operation is performed on the reconstructed graph to obtain the representation vectors of the nodes. We believe that this approach can combine the nodes of the graph at multiple levels to obtain a more informative representation. Finally, the obtained representation vectors are sent to a downstream classifier (such as k-NN or an MLP) to complete the node classification task.
Next, we introduce the technical details of our method. First, the random walk and the reconstructed graph are introduced in detail in Section 3.1; then, the wGCN embedding algorithm that generates embeddings for nodes is described in Section 3.2; finally, in Section 3.3, we analyze the complexity of the algorithm and prove the result.
3.1 Reconstructed Graph
In this section, we explain in detail how the graph is reconstructed using random walks. For convenience of explanation, some commonly used symbols are given below:
Formally, let graph $G = (V, E)$, where $V$ is the set of nodes in the network, $N = |V|$ is the number of nodes, and $E$ is the set of edges. Given a labeled network with node feature information $G = (V, E, X, Y)$, where $X \in \mathbb{R}^{N \times d}$ ($N$ is the number of nodes in the network, $d$ is the feature dimension) is the feature information matrix and $Y \in \mathbb{R}^{N \times c}$ ($c$ is the label dimension) is the label information matrix, our goal is to use the labels of some of the nodes for training and to generate a vector representation matrix $Z$ of the nodes.
Then, we give:
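Definition 1: Given a graph $G$ ($G = (V, E)$) and an initial node $v_0$, a random walk of length $l$ rooted at $v_0$ is denoted as $W_{v_0} = ((v_0, v_1), (v_1, v_2), \ldots, (v_{l-1}, v_l))$, where $v_i \in V$ and the two nodes in the same parenthesis are neighbors (i.e., there is an edge between $v_i$ and $v_{i+1}$). For convenience, we also write $W_{v_0} = (v_0, v_1, \ldots, v_l)$.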
As shown in Fig.3, we first perform random walks on the graph (subfigure (a)) to obtain “walks” (subfigure (b)). After that, we use the obtained “walks” to reconstruct the graph network. A node that appears in the same walk as a given node is considered related to that node, and the two are connected in the reconstructed graph (the red lines in subfigure (c)). We assign different weights according to the distance between the two nodes in the “walks”. Thus, in the reconstructed graph, the weight between two nodes is given by the following formula:
where indicates the original adjacency matrix, which is defined as follows:
and is a parameter to be determined indicating the decay speed with distance, satisfying . And k is the exponent, whose values are the distance between nodes in the ”walks”. For example, node and node are not connected in the original graph (subfigure (a)), thus ; and node and node have a distance of in ”walks” (subfigure (b)), thus ; in summary:
In the next section, we give the complete algorithm.
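As an illustration of the idea in this section, the sketch below generates uniform random walks and assigns each node found in a walk a weight that decays with its distance from the walk's root. It is a sketch under stated assumptions and not the paper's exact formula: the names `random_walk` and `walk_weights`, the parameter `decay`, the rule of keeping the largest contribution when a pair is seen several times, and the choice to weight pairs relative to the root only are all assumptions.

```python
import random


def random_walk(adj, start, length, rng):
    """Uniform random walk with `length` steps rooted at `start`."""
    walk = [start]
    for _ in range(length):
        nbrs = adj[walk[-1]]
        if not nbrs:  # dead end: stop early
            break
        walk.append(rng.choice(nbrs))
    return walk


def walk_weights(adj, walks_per_node, length, decay, seed=0):
    """Map (root, node) -> weight, where a node first reached k steps after
    the root contributes decay ** k (assumed form, with 0 < decay < 1)."""
    rng = random.Random(seed)
    weights = {}
    for root in adj:
        for _ in range(walks_per_node):
            walk = random_walk(adj, root, length, rng)
            for k, node in enumerate(walk):
                if k == 0 or node == root:
                    continue  # skip the root itself
                key = (root, node)
                # Assumption: keep the strongest contribution seen so far.
                weights[key] = max(weights.get(key, 0.0), decay ** k)
    return weights


# Hypothetical toy graph.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
w = walk_weights(graph, walks_per_node=3, length=4, decay=0.8)
```

Pairs that are not adjacent in the original graph can receive a non-zero weight here, which is how walks densify a sparse edge set.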
3.2 Embedding Algorithm
After entering the required data, we first initialize the random walk matrix to a zero matrix in line 1. The random walk matrix is the matrix generated by the random walks, whose elements indicate the weights attached to the nodes by the “walks”. In lines 2-14, the reconstructed graph matrix is built as described in Section 3.1:
1. Generate “walks” of a fixed length starting from each node in lines 2-5;
2. Update the elements of the random walk matrix in lines 7-13;
3. Obtain the reconstructed graph matrix in line 14.
Note that each “walk” returned by the random walk procedure is a list of nodes in which every node is a neighbor of the node before it. In addition, the adjacency matrix and the random walk matrix are mixed, weighted by a mixing ratio coefficient, and normalized to obtain the reconstructed graph matrix in line 14. The normalization is a symmetric normalization function:
$$\mathrm{Norm}(S) = \tilde{D}^{-1/2} (S + I) \tilde{D}^{-1/2},$$
where $S$ is a square matrix, $I$ is the identity matrix and $\tilde{D}$ is a diagonal matrix satisfying $\tilde{D}_{ii} = \sum_j (S + I)_{ij}$.
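A symmetric normalization of this kind can be sketched in a few lines. The version below assumes the GCN-style form, in which the identity matrix is added before scaling by the inverse square root of the row sums; the function name `symmetric_normalize` and the use of nested lists are choices made for the example only.

```python
import math


def symmetric_normalize(matrix):
    """Return D^{-1/2} (S + I) D^{-1/2}, where D is the diagonal matrix of
    row sums of S + I. `matrix` is a square list of lists with
    non-negative entries."""
    n = len(matrix)
    # Add self-loops (the identity matrix).
    with_loops = [
        [matrix[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Row sums are at least 1 because of the self-loops, so no division by zero.
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in with_loops]
    return [
        [inv_sqrt[i] * with_loops[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]


# Two connected nodes: every entry of the result is close to 0.5.
normalized = symmetric_normalize([[0.0, 1.0], [1.0, 0.0]])
```

Scaling rows and columns by the same factors keeps a symmetric input symmetric, which a row-only normalization would not.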
Then the graph convolution operation is performed. The number of graph convolutional neural network layers is specified by users in advance. And the initial representation of all nodes is expressed as: , in line 15; In lines 16-20, we perform a graph convolution operation based on reconstructed graph, in the formula , represents a weighted nonlinear aggregation function, whose purpose is to reorganize the information of the target node and its neighbors. Formally,
is the hidden representation of nodein the -th layer; is the number on the -row and -column of the reconstructed graph matrix , indicating the closeness between nodes of and ; is the parameter matrix to be trained of layer ; is the neighborhood nodes set of node in the reconstructed graph;
the activation is the ReLU function: $\mathrm{ReLU}(x) = \max(0, x)$.
Then, the final representation vector of each node is obtained. These representation vectors can be sent to a downstream classifier (such as a softmax classifier) to obtain the predicted category vectors.
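One graph convolution step of the kind described above can be sketched as a weighted aggregation followed by a linear map and ReLU. The sketch assumes the standard GCN layer form, ReLU(Â H W), applied to an already normalized matrix; the function name `graph_conv_layer` and the toy inputs are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def graph_conv_layer(norm_adj, features, weight):
    """Compute ReLU(norm_adj @ features @ weight) with nested lists.
    norm_adj: n x n normalized graph matrix, features: n x d_in,
    weight: d_in x d_out trainable parameter matrix."""
    n = len(norm_adj)
    d_in = len(weight)
    d_out = len(weight[0])
    # Weighted aggregation of each node's neighborhood (including itself).
    aggregated = [
        [sum(norm_adj[i][j] * features[j][c] for j in range(n)) for c in range(d_in)]
        for i in range(n)
    ]
    # Linear transformation followed by the ReLU nonlinearity.
    return [
        [relu(sum(aggregated[i][c] * weight[c][o] for c in range(d_in)))
         for o in range(d_out)]
        for i in range(n)
    ]


# Toy example: two nodes with one-hot features.
norm_adj = [[0.5, 0.5], [0.5, 0.5]]
features = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, -1.0], [1.0, -1.0]]
hidden = graph_conv_layer(norm_adj, features, weight)
```

Stacking several such layers lets information from nodes several hops away reach the target node.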
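If a softmax classifier is chosen, it takes the form
$$\mathrm{Softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{d} \exp(z_j)},$$
where $z$ is the representation vector, $\mathrm{Softmax}(z)_i$ is the $i$-th component of the vector $\mathrm{Softmax}(z)$, and $d$ is the dimension of the representation vector $z$.
The cross entropy function can then be used as the loss function to train the parameters of our model:
$$\mathcal{L} = -\sum_{v} y_v^{\top} \log \hat{y}_v,$$
where $y_v$ is the true label of node $v$, $\hat{y}_v$ is its predicted category vector, and the sum runs over the labeled training nodes.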
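The classifier and loss can be sketched directly from their standard definitions. The function names `softmax` and `cross_entropy` and the toy inputs are illustrative; subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import math


def softmax(z):
    """Softmax of a list of scores, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(predictions, labels):
    """Sum over labeled nodes of -y . log(y_hat), with one-hot rows in `labels`."""
    loss = 0.0
    for pred_row, label_row in zip(predictions, labels):
        for prob, target in zip(pred_row, label_row):
            if target > 0:  # only the true class contributes
                loss -= target * math.log(prob)
    return loss


# Two nodes, two classes (hypothetical scores and labels).
scores = [[0.0, 0.0], [2.0, 0.0]]
predicted = [softmax(row) for row in scores]
loss = cross_entropy(predicted, [[1, 0], [1, 0]])
```

The loss shrinks as the predicted probability of the true class approaches 1, which is what gradient-based training exploits.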
3.3 Complexity Analysis
Our method is based on GCN. From the related work of Kipf et al., we know that the computational complexity of the original GCN based on the following formula is $\mathcal{O}(|\mathcal{E}| C H F)$, where $\mathcal{E}$ is the edge set of the graph:
$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\, \mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right),$$
where $A$ is the adjacency matrix and $X$ is the feature matrix, and $\hat{A}$ is the normalized form of the adjacency matrix $A$. $W^{(0)} \in \mathbb{R}^{C \times H}$ is an input-to-hidden weight matrix and $W^{(1)} \in \mathbb{R}^{H \times F}$ is a hidden-to-output weight matrix, where $C$ is the number of input channels, $H$ is the number of feature maps in the hidden layer and $F$ is the number of feature maps in the output layer.
The time of our algorithm is mainly consumed in the training phase of the neural network. Thus, we will prove that the computational complexity of our method in the training phase of the neural network is also $\mathcal{O}(|\mathcal{E}| C H F)$, while keeping the number of hidden layers unchanged.
Let be the ”walks” length in random walk, denotes the number of ”walks” from each node, and denotes the original adjacency matrix, reconstructed graph matrix and random walk matrix respectively. We compare the number of non-zero elements in and .
Since each node has ”walks” of length , then the maximum number of non-zero elements per node in the random walk matrix is (in this case, the ”walks” guided by the starting node have no duplicate nodes except the starting node itself).
So has at most non-zero elements more than ( is the number of nodes). And in experiments, is set to 5 and is generally set to ( is the average degree of the graph), thus . Therefore, the computational complexity satisfies:
4 Experiments and Results
In section 4.1, we introduce the datasets used in the experiment, and the specific settings of the experiment are described in section 4.2. In section 4.3, a wide variety of baselines and previous approaches are introduced, and the results are shown in section 4.4.
4.1 Datasets
Citation Network Datasets: Citeseer, Cora, and Pubmed. In these three standard citation network datasets, nodes represent documents and edges (undirected) represent citation links. Each dataset contains a sparse feature vector for each document, and each document has a category label. As shown in Table 1, the Cora, Citeseer and Pubmed datasets contain 1433, 3703 and 500 features per node, respectively, and the numbers of label categories are 7, 6 and 3, respectively.
Social Network Datasets: Ego-Facebook and Ego-Gplus. The Ego-Facebook dataset consists of 'circles' (or 'friends lists') from Facebook, and it has many subsets. Take '107Ego(F)' as an example: this subset is a network with the node '107' as the core, where the nodes represent users and the edges (undirected) represent interactions between users. Each user has a feature attribute vector and a category label. Ego-Gplus is similar to Ego-Facebook, except that the data comes from Google+ and the edges are directed. We choose suitable subsets of Ego-Facebook and Ego-Gplus for the experiments (to facilitate the distinction, (F) denotes a subset of Ego-Facebook and (G) denotes a subset of Ego-Gplus). During preprocessing, records with missing information are removed.
4.2 Experimental Setup
Citation Network Datasets. On the citation network datasets, we apply a two-layer wGCN model. Specifically, on the Cora dataset, we perform the random walk operation 8 times for each node and stop each walk after it has passed 4 different nodes; the decay rate (introduced in Section 3.1) is set to 0.8 and the mixing ratio coefficient (introduced in Section 3.2) is set to 0.9. On the Citeseer dataset, the random walk operation is performed 3 times for each node and each walk stops after passing 4 different nodes; the decay rate is set to 0.8 and the mixing ratio coefficient is set to 0.73. On the Pubmed dataset, the random walk operation is performed 5 times for each node and each walk also stops after passing 4 different nodes; the decay rate and the mixing ratio coefficient are set to 0.8 and 0.9, respectively. The remaining parameter settings follow the settings in.
Social Network Datasets. On the social network datasets, we apply a three-layer wGCN model. Specifically, the shapes of the parameter matrices of the three-layer model, the number of random walk operations from each node, the decay rate and the mixing ratio coefficient on the 8 sub-datasets are shown in Table 2. In addition, the learning rates on the sub-datasets of Ego-Facebook and Ego-Gplus are set to 0.02 and 0.01, respectively. As before, each random walk stops after passing 4 different nodes.
4.3 Baselines and Previous Approaches
Citation Network Datasets. On the citation network datasets (Citeseer, Cora, and Pubmed), our method is compared with the same strong baselines and previous approaches as specified in , including label propagation (LP), semi-supervised embedding (SemiEmb), manifold regularization (ManiReg), iterative classification algorithm (ICA) and Planetoid, together with DeepWalk and GCN. Here, DeepWalk is a method based on random walks, as stated at the beginning of the article, whose sampling strategy can be seen as a special case of node2vec with $p = 1$ and $q = 1$. GCN, which is the first method to achieve convolution on the graph, is the best performing baseline method.
Social Network Datasets. On the social network datasets (Ego-Facebook and Ego-Gplus), our method is compared against DeepWalk, GraRep and, again, GCN, which is the strongest baseline in the above experiment. Here, GraRep works by using the adjacency matrix of each order and defining a more accurate loss function that allows non-linear combinations of different local relationship information to be integrated.
4.4 Results
The results of our comparative evaluation experiments are summarized in Tables 3 and 4.
As shown in Table 3, our method achieves the best results on all citation network datasets; compared with the strongest baseline, our method improves upon GCN by margins of 0.3, 2.1 and 0.5 on Cora, Citeseer and Pubmed, respectively.
Next, we use the social network datasets (Ego-Facebook and Ego-Gplus) for the experiments, and compare the results with GraRep, a classic method based on the adjacency matrix, DeepWalk, a classic method based on random walks, and GCN, the best performing baseline method. The experimental results are shown in Table 4:
Table 4: The results of classification accuracy on social network datasets.
Method     107Ego(F)   414Ego(F)   1684Ego(F)   1912Ego(F)
DeepWalk   77.5        79.2        64.4         66.5
GraRep     90.0        85.4        76.3         77.0
GCN        92.5        93.8        81.9         77.0
wGCN       95.0        97.9        85.0         81.0
As can be seen from Table 4, the results of our method on both the Ego-Facebook and the Ego-Gplus subsets are considerably higher than the results of DeepWalk and GraRep. Except for '5249Ego(G)', where the result equals that of GCN, the results of our method are also better than the results of GCN.
In this paper, we have designed a new framework, wGCN, that combines random walks with graph convolution: it reconstructs the graph through random walks to capture higher-order features and can effectively aggregate node information. We conduct experiments on a series of datasets (citation networks and social networks, directed and undirected). The results show that wGCN can effectively generate embeddings for nodes of unknown category and obtains better results than the baseline methods.
There are many extensions and potential improvements to our method, such as exploring more random walk methods with different strategies and extending wGCN to handle multi-graph mode or time-series-graph mode. Another interesting direction for future work is to extend the method to be able to handle edge features, which will allow the model to have wider applications.
This work is supported by the Research and Development Program of China (No.2018AAA0101100), the Fundamental Research Funds for the Central Universities, the International Cooperation Project No.2010DFR00700, Fundamental Research of Civil Aircraft No. MJ-F-2012-04, the Beijing Natural Science Foundation (1192012, Z180005) and National Natural Science Foundation of China (No.62050132).