1 Introduction
Many real-world data come naturally in the form of pairwise relations, such as protein-protein interactions in human cells, paper citations in scientific research, and drug-target interactions in drug discovery [1, 2, 3]. These linkages contain rich information on node properties, network structures, and network evolution. Predicting the existence of a relation, commonly referred to as link prediction, is a fundamental task in network science and of great importance in practice. For biological networks such as protein-protein interaction networks, metabolic networks, and food webs, the discovery and validation of links require significant experimental effort. Instead of blindly checking all possible links, link prediction can help scientists focus on the most likely links, sharply reducing experimental cost. For the WWW, social networks, and citation networks, link prediction can also help in recommending relevant pages, finding new friends, or discovering new citations [4, 5, 6].
The conventional link prediction methods can be divided into several groups. Local similarity-based approaches rest on the assumption that two nodes are more likely to be connected if they share many common neighbors [7, 8]. These approaches are fast and highly parallelizable since they only consider local structure; however, their biggest drawback is low prediction accuracy, especially when the network is sparse and large. Global similarity-based methods, in contrast, use the topological information of the whole network to calculate the similarity between node pairs [9, 8, 10]. Although these methods achieve better prediction accuracy, they usually suffer from high computational complexity, which makes them infeasible for graphs containing millions or billions of nodes. There are also probabilistic and statistical approaches that assume a known prior structure of the network, such as a hierarchical or circle structure [11, 12], but they do not overcome the accuracy problem either. Furthermore, the conventional approaches can hardly reveal the hidden information about node properties and network structure behind the linkages.
Recently, there has been a surge of algorithms that perform link prediction through network representation learning, which automatically extracts both local and global structural information about nodes from graphs. The idea behind these representation learning algorithms is to learn a mapping function that embeds nodes as points in a low-dimensional space encoding the information of the original graph. Network representation based methods, usually built on the Skip-Gram model or matrix factorization, such as DeepWalk, node2vec, LINE, and struc2vec [13, 14, 15, 16, 17], have achieved much higher link prediction accuracy than the conventional ones. Random walk based representation learning algorithms are task agnostic; the learned representations are then used for downstream graph-based machine learning tasks such as node classification, node ranking, and link prediction [18, 14]. These methods, however, have several limitations. First, there is no supervised information during the training process: node representation vectors are updated directly without considering the global network structure. Second, computational complexity is a major burden, since most Skip-Gram based methods require a large number of random walks over the whole graph [13, 14, 17]. Moreover, their expressive power is limited because the embedding process is fixed by the random walk strategy. Finally, the representations can hardly be extended to inductive learning, since the embedding vectors cannot be transferred to other similar graphs [19].
More recently, deep learning techniques based on neural networks have achieved triumphs in image processing
[20] and natural language processing [21]. This has stimulated extensions of these methods to graph structures for node classification and link prediction tasks by converting network structures into low-dimensional representations. For example, the graph convolutional network (GCN) [22] borrows the concept of convolution from the convolutional neural network (CNN) and convolves the graph directly according to its connectivity structure. After that, Velickovic et al. [23] further proposed graph attention networks (GAT) and obtained state-of-the-art accuracy in the node classification task. Following the self-attention mechanism, GAT computes the representation of each node by combining its neighbors' vectors in an adaptive way. The attention here is an adjustable weight on each neighbor node, updated dynamically according to the states of the nodes within a locally connected neighborhood. Nevertheless, the algorithms mentioned above and their extensions [24] have a scalability problem, since they take the whole graph as input and recursively expand neighborhoods across layers. This expansion is computationally expensive, especially when the graph is very large. Due to the scale-free property of many graphs, when hub nodes are sampled as 1st-order neighbors, their 2nd-order neighbors usually quickly fill up the memory, leading to a memory bottleneck. This problem prevents GAT and GCNs from being applied to large-scale networks.
GraphSAGE [19] tries to solve the memory bottleneck problem by sampling a fixed-size neighborhood during each iteration and then applying a specific aggregator over the extracted features. The sampling strategy in GraphSAGE yields impressive performance on node labeling tasks over several large-scale networks. FastGCN [25] then proposes to view the GCN [22] as integral transforms of embedding functions under probability measures. Its classification accuracy is highly comparable with the original GCN, while it gets rid of the reliance on the test data.
However, most neural network based methods apply node classification labels as the only supervised information rather than linkages. The representation quality relies on the supervised information, yet node labels are highly scarce in most real networks. Besides, linkages rather than node attributes contain much richer information on network structure and evolution. For example, according to the similarity and popularity theory [26], the links within a network not only reveal node similarity relations [27] but also encode node popularity information [7]. Take the formation process of a citation network as an instance: citations are made not only according to content similarity but also based on the popularity of the existing papers [28]. Thus, linkages rather than node labels should be used as the supervised information for learning node representations, because they encode at least both popularity and similarity information.
In this paper, we propose a new model named DeepLinker, which extends the GAT model to predict linkages in various networks. By adopting the attention mechanism, we can not only predict links but also learn node representations and node rankings; the learned attention weight paid to each node can be regarded as a kind of node centrality. However, the original GAT model cannot be directly applied to the supervised link prediction task for the following reasons. First, the complexity of node classification is O(N), while the complexity of link prediction is O(N^2), where N is the number of nodes; link prediction thus usually involves far more node feature computations than node classification. Second, the original GAT model needs to access the entire network while performing node classification inference, so it must be extended to process mini-batched graph data. However, we cannot directly sample the graph by following links to form a mini-batch, due to the well-known scale-free property of real networks: the expanded neighborhood of a hub node can quickly fill up a large portion of the graph. Last, although large mini-batches are preferable for reducing communication cost, they may slow down convergence in practice [29], since decreasing the mini-batch size typically increases the rate of convergence [30] in the optimization process. In the original GAT model, even a small mini-batch usually involves a large number of nodes, which decreases the convergence rate and usually leads to poor link prediction accuracy.
Here we solve the memory bottleneck and the convergence problem by incorporating the mini-batch sampling strategy of GraphSAGE with a fixed neighborhood size [19]. The difference between DeepLinker and GraphSAGE lies in the number of sampling rounds: DeepLinker samples only once and then fixes the sampled neighborhoods of all nodes for the whole training process, while GraphSAGE keeps changing the sampled neighbors in every epoch. We find that resampling neighbors in every epoch usually slows down convergence and leads to poor performance. Our model combines GAT and GraphSAGE in a novel structure designed particularly for link prediction. It computes the hidden representation of each node through a shared attention mechanism across its neighbors, and combines node vectors into edge features for the graph.
Extensive experiments are conducted on five representative networks. The results show that DeepLinker not only achieves state-of-the-art accuracy in link prediction but also obtains effective node representations for downstream tasks such as node ranking and node classification. We find that the nodes receiving more attention from their neighbors are either the elites in a Chinese co-investment network or the 'best papers' in the APS citation network. Moreover, once the model is well trained, the low-dimensional node vectors extracted from its last layer can be used to classify nodes with higher accuracy than other unsupervised learning algorithms, even with only a small fraction of labeled nodes for training. The less node label information is exploited, the greater the advantage of our model.
Our main contributions are summarized as follows:

We propose DeepLinker, which achieves state-of-the-art accuracy in the link prediction task.

We handle the memory bottleneck and mini-batch problems by fixing the neighborhood size, which yields a controllable cost for per-batch computation.

The trained attention coefficient matrix of DeepLinker plays a key role in revealing the latent significance of nodes. It helps identify the elites of a Chinese investment network and find the 'best papers' in a citation network.

DeepLinker can extract meaningful feature representations for nodes; this link prediction based node embedding method achieves high accuracy in the node classification task, especially when the training set is small.
Furthermore, our model and its performance on a large variety of tasks remind us of the language model in natural language processing (NLP), where algorithms learn effective representations of words and sentences for a large variety of downstream tasks by predicting the next word [31, 32]. Therefore, we claim that the link prediction task can be treated as the "language model" of graphs.
2 GAT architectures
To start with, we review the architecture of the GAT model, since our model is mainly based on it. GAT takes a set of node features as input, $h = \{h_1, h_2, \ldots, h_N\}$, $h_i \in \mathbb{R}^{F}$, where $N$ is the number of nodes and $F$ is the number of input node attributes. We use $h' = \{h'_1, h'_2, \ldots, h'_N\}$, $h'_i \in \mathbb{R}^{F'}$, to denote GAT's outputs, which also contain $N$ node features. The goal of GAT is to obtain sufficient expressive power to transform the input features into high-level output features. It first applies a learnable linear transformation, parameterized by a weight matrix $W \in \mathbb{R}^{F' \times F}$, to every node; it then uses a single-layer feedforward neural network $a$ to compute the attention coefficients between nodes. This computation process is shown in equation 1, where $^{T}$ represents transposition and $\|$ is the concatenation operation. Node $j$ is a neighbor of node $i$, and $\alpha_{ij}$ indicates the importance of node $j$'s features to node $i$ among all of $i$'s neighbors $\mathcal{N}_i$:

(1)  $\alpha_{ij} = \dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(a^{T}[W h_i \,\|\, W h_k]\right)\right)}$
Once the normalized attention coefficients $\alpha_{ij}$ are obtained, GAT aggregates each node's features as a combination of its neighbors' features, followed by a potential nonlinearity $\sigma$, as shown in equation 2:

(2)  $h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$
Finally, GAT employs multi-head attention to stabilize the learning process of the attention coefficients. Let $K$ denote the number of attention heads and $\alpha^{k}_{ij}$ the relative attention weight of node $j$'s features to node $i$ in the $k$-th head. The output features aggregated by the heads are either concatenated or averaged to form the final output features, as shown in equation 3 (with the concatenation $\|$ replaced by an average over the $K$ heads in the final layer):

(3)  $h'_i = \big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij} W^{k} h_j\right)$
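Equations 2 and 3 together can be sketched as below (again our own illustrative code under assumed data layouts, not the authors' implementation; ELU is used as the nonlinearity, matching the experimental setup described later):

```python
import numpy as np

def multi_head_aggregate(z, alpha_heads, neighbors, concat=True):
    """Sketch of equations 2-3: per-head neighborhood aggregation followed by
    concatenation (hidden layers) or averaging (final layer).

    z: list of K arrays (N, F') of linearly transformed features, one per head;
    alpha_heads: list of K dicts alpha[i][j], one per head;
    neighbors: dict node -> list of neighbor ids.
    """
    def sigma(x):                                   # ELU nonlinearity
        return np.where(x > 0, x, np.exp(x) - 1)
    outputs = []
    for zk, alpha in zip(z, alpha_heads):
        out = np.stack([
            sigma(sum(alpha[i][j] * zk[j] for j in neighbors[i]))
            for i in range(len(neighbors))
        ])
        outputs.append(out)
    # concatenate head outputs along the feature axis, or average them
    return np.concatenate(outputs, axis=1) if concat else np.mean(outputs, axis=0)
```

With K heads and F' features per head, concatenation yields K*F'-dimensional outputs, while averaging keeps the dimension at F'.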
3 DeepLinker architecture
To address the low link prediction accuracy problem with a deep learning architecture, we introduce DeepLinker, which has an encoder-decoder architecture. The overall architecture of DeepLinker is shown in Figure 1. The encoder maps each node to a vector representation $h'_i$; the decoder then constructs an edge vector by aggregating the two node vectors. Finally, a score function evaluates the link existence probability between the two nodes via the edge vector. One of the key ideas behind DeepLinker is to learn how to aggregate node features into edge vectors for the link prediction task.
As mentioned above, the limitations of GAT are the memory bottleneck and the mini-batch problem. Due to the scale-free property of most networks, once hub nodes are sampled as 1st-order neighbors in the GAT architecture, their 2nd-order neighbors usually quickly fill up the memory, which prevents GAT from being applied to larger networks. Besides, existing GPU-enabled tensor manipulation frameworks can only parallelize the computation of the normalized attention coefficients $\alpha_{ij}$ over same-sized neighborhoods, which prevents GAT from parallel computing.
3.1 Fixed-sized Neighborhood Sampling
Here we use a fixed-sized neighborhood sampling strategy to solve the memory bottleneck and the mini-batch problem. The undirected graph can be represented as $G = (V, E)$, with $V$ denoting the set of nodes and $E$ representing the edges in network $G$. For any two randomly selected nodes $v_i$ and $v_j$, we calculate the edge vector to predict the existence of a linkage between them. We sample each node's neighborhood to form a fixed-sized node mini-batch. Taking node $v_i$ as an example, we uniformly sample a fixed-sized set of its neighbors $\mathcal{N}_i$ instead of using the full neighborhood set as in GAT. Unlike GraphSAGE, which resamples the neighborhood in each training iteration, we sample only once during the whole training process. The sampling strategy is illustrated in Figure 1.
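The one-time fixed-size sampling could be sketched as follows (a hypothetical helper of our own; sampling with replacement for low-degree nodes is an assumption, since the paper does not specify how degrees below the fixed size are handled):

```python
import random

def fix_neighborhoods(adj, size, seed=0):
    """Sample `size` neighbors uniformly for each node, ONCE, and keep that
    sample fixed for the whole training run (unlike GraphSAGE's per-epoch
    resampling). `adj` maps each node to its full neighbor list."""
    rng = random.Random(seed)
    sampled = {}
    for node, nbrs in adj.items():
        if len(nbrs) >= size:
            sampled[node] = rng.sample(nbrs, size)           # without replacement
        else:
            sampled[node] = [rng.choice(nbrs) for _ in range(size)]  # pad by resampling
    return sampled
```

Because every node ends up with exactly `size` sampled neighbors, the attention computation can be batched over equal-sized neighborhoods, which is exactly what removes the parallelization obstacle noted above.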
We then compute the Hadamard product between the two output vectors to generate an edge vector and pass it through a sigmoid activation function to evaluate the edge existence probability, as shown in equation 4:

(4)  $p_{ij} = \mathrm{sigmoid}\left(h'_i \odot h'_j\right)$
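A minimal sketch of the edge scoring in equation 4 follows; since the text does not state how the edge vector is reduced to a scalar before the sigmoid, the summation used here is our own assumption (a learned linear readout would be an equally plausible choice):

```python
import numpy as np

def edge_probability(hi, hj):
    """Equation 4 sketch: the Hadamard (element-wise) product of the two node
    vectors forms the edge vector; a sigmoid of its summed activation gives
    the link probability. The sum-reduction is an assumption."""
    edge = hi * hj                        # Hadamard product -> edge vector
    return 1.0 / (1.0 + np.exp(-edge.sum()))
```

Note that the score is symmetric in its two arguments, which is consistent with the undirected graphs used throughout the paper.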
3.2 Training DeepLinker
The whole framework is trained by minimizing the following objective function:
(5)  $\mathcal{L} = -\sum_{(i,j) \in E_{\mathrm{train}} \cup E_{\mathrm{neg}}} \left[\, y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \,\right]$

where $y_{ij}$ is the label for the linkage between $v_i$ and $v_j$, with $y_{ij} = 0$ for nonexistence and $y_{ij} = 1$ for existence, and $E_{\mathrm{train}}$ is the training set of edges. We follow the convention in link prediction and randomly divide the full set of edges into three parts: one for training, one for validation, and one for testing. We train the model to predict not only existing but also nonexistent links. Here, $E_{\mathrm{neg}}$ is the set of negative samples, in which each element is a node pair $(v_i, v_j)$ such that both $v_i$ and $v_j$ are drawn from the nodes involved in $E_{\mathrm{train}}$ and there is no edge between them. In our experiments, we sample the same number of negative instances as positive ones.
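The negative sampling and the objective of equation 5 might be sketched as follows (our own illustrative helpers; the rejection-sampling loop and function names are assumptions):

```python
import random
import numpy as np

def sample_negatives(edges, nodes, k, seed=0):
    """Draw k node pairs that are NOT edges (the paper's negative set E_neg)."""
    rng = random.Random(seed)
    existing = set(edges) | {(j, i) for i, j in edges}   # undirected graph
    neg = set()
    while len(neg) < k:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and (u, v) not in existing:
            neg.add((u, v))
    return list(neg)

def bce_loss(p, y):
    """Equation 5: binary cross-entropy over positive and negative pairs.
    p: predicted probabilities; y: labels (1 = edge exists, 0 = it does not)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)      # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

In practice the same number of negatives as positives is drawn, so the two terms of the loss are balanced by construction.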
4 Experiments
In order to evaluate the performance of DeepLinker on real-world networks and to explore its potential applications, we conduct link prediction, node ranking, and node classification tasks over several networks, ranging from citation networks to venture capital co-investment networks. All experiments are run in the PyTorch machine learning environment with a CUDA backend.
4.1 Experiments setup
We utilize the following five networks in our experiments:

Cora network contains 2,708 nodes, 5,429 edges, and 7 classes. Each node has 1,433 attributes corresponding to elements of a bag-of-words representation of a document.

Citeseer network contains 3,327 nodes, 4,732 edges, and 6 classes. Each node has 3,703 attributes extracted from paper contents.

Pubmed is a citation network containing 19,717 nodes, 44,338 edges, and 3 classes. Each node has 500 attributes.

VC network is a venture capital co-investment network, with nodes representing venture capital (VC) firms and edges corresponding to co-investment events. It contains 1,436 nodes and 2,265 edges. Here we use the adjacency matrix as the one-hot input node attributes during training. In this network, 42 nodes are manually identified as leading VCs, which play a vital role in venture capital events; these nodes are regarded as the ground truth in the node ranking task. The VC investment network is built on the SiMuTon database [33].

APS graph has 1,012 nodes and 3,336 edges. The adjacency matrix is used as the one-hot input node attributes during training. We quantify each paper's impact and importance by counting the number of citations within 10 years (c_10) after its publication. The metric c_10 is used as the ground truth for measuring node importance.
We compare DeepLinker with the following baseline algorithms:

RA [8] is a traditional link prediction method in which the similarity between two nodes is measured by summing over their common neighbors, each weighted inversely proportional to its degree.

LINE [16] minimizes a loss function to learn embeddings while preserving the first- and second-order proximity among vertices in the graph.

Node2vec [14] adopts a biased random walk strategy and Skip-Gram to learn vertex embeddings. This embedding algorithm has been widely used in recent years.

GraphSAGE [19] learns node embeddings through a general inductive framework consisting of several feature aggregators. It usually adopts the supervised node classification task as the evaluation benchmark, under the assumption that a better embedding algorithm leads to higher node classification accuracy.
In our experiments, we keep all baseline architectures as described in their original papers. This includes the type and sequence of layers, the choice of activation functions, the placement of dropout, and the hyperparameter settings.
The DeepLinker architecture consists of two layers. The first layer is made up of 8 ($K_1 = 8$) attention heads for all networks. We set the hidden size to 32 for Cora and VC, 16 for Citeseer and APS, and 64 for the Pubmed network. The main purpose of the first layer is to compute the hidden features of the 1st-order neighbors. We then add nonlinearity by feeding the hidden features to an exponential linear unit (ELU), as shown in equation 2. The features aggregated by each head are concatenated in this layer.
The main purpose of the second layer is to compute the edge features used for evaluating the link existence probability. Here we use a single attention head ($K_2 = 1$) for the Cora, Citeseer, VC, and APS graphs. We find that the Pubmed graph requires a much larger number of attention heads, so we set $K_2 = 8$ for it. The features aggregated by the heads are averaged in this layer, and its output gives the final node representations. We then compute the Hadamard product between two node feature vectors to represent the edge vector, as shown in equation 4. Once the edge vector is obtained, a sigmoid activation function is applied to evaluate the link existence probability.
We initialize the parameters of DeepLinker with Glorot initialization [34] and train to minimize the binary cross-entropy of equation 5; for the training set we use the Adam SGD optimizer [35] with an initial learning rate of . We also apply an early stopping strategy on the link prediction accuracy of the validation set, with the patience set to 100 epochs.
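The early stopping rule above can be sketched as follows (a hypothetical helper of our own, not the authors' implementation):

```python
def early_stop(val_acc_history, patience=100):
    """Stop when validation link-prediction accuracy has not improved for
    `patience` consecutive epochs. `val_acc_history` holds one accuracy value
    per completed epoch, in order."""
    best_epoch = max(range(len(val_acc_history)), key=val_acc_history.__getitem__)
    return len(val_acc_history) - 1 - best_epoch >= patience
```

The training loop would call this after each epoch and keep the model checkpoint from the best validation epoch.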
We address the memory bottleneck by sampling a fixed number of neighbors (20) for both the 1st- and 2nd-order neighbor selections. The sampling strategy is illustrated in Figure 1. We set the batch size to 32 for Cora, Citeseer, and Pubmed, 16 for the APS network, and 8 for the VC network.
4.2 Link prediction
In this part, we evaluate the link prediction accuracy of DeepLinker and compare it with other link prediction algorithms. The goal of the link prediction task is to predict whether an edge exists between two given vertices. To start with, we randomly hide 10% of the edges in the original graph as the 'positive' samples of the test set. The test set contains not only these hidden edges but also an equal number of randomly selected disconnected node pairs that serve as 'negative' samples. We then use the remaining connected links and randomly selected disconnected pairs to form the training set. After that, we uniformly sample first- and second-order neighbors for each node. Finally, we feed the sampled nodes into DeepLinker, whose output is the edge existence probability between two nodes.
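The edge-hiding protocol above might be sketched as follows (our own illustrative code; the uniform shuffle and the function name are assumptions):

```python
import random

def split_edges(edges, test_frac=0.1, seed=0):
    """Hide a fraction of edges as positive test samples; the remainder are
    the training positives. Negative samples of equal size are drawn
    separately, as in the training objective of Section 3.2."""
    rng = random.Random(seed)
    edges = edges[:]            # do not mutate the caller's list
    rng.shuffle(edges)
    n_test = int(len(edges) * test_frac)
    return edges[n_test:], edges[:n_test]    # (train, test)
```

Because hidden edges are removed before neighborhood sampling, the model never sees a test edge during training, which keeps the evaluation honest.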
Two standard metrics, Accuracy and AUC (area under the curve) are used to quantify the accuracy of link prediction. As shown in Table 1, DeepLinker outperforms all the baseline methods in link prediction accuracy across all graphs.
Here, we propose two implementations of DeepLinker: DeepLinker (attention) and DeepLinker (all ones). Both implementations apply the neighborhood sampling and mini-batch strategies during training. DeepLinker (attention) trains the attention coefficients, which indicate how important the neighbors' features are to the present node, as illustrated in equation 1, whereas DeepLinker (all ones) sets the attention to 1 for all neighbors. As shown in Table 1, DeepLinker (attention) has much higher prediction accuracy than the other link prediction algorithms, and DeepLinker (all ones) achieves the best performance on all datasets. In fact, on larger networks such as Pubmed, no matter how we adjust the learning rate, the size of the linear transformation matrix, or the mini-batch size of the attention-based DeepLinker, the loss does not go down. At the initial stage of training, the neighbors' attributes strongly influence the representation of a node, but the attention coefficients in the DeepLinker (attention) architecture are restricted to the range from 0 to 1. These coefficients become too small, especially for a hub node with many neighbors, so the attention mechanism limits the expressive power of the neighbors' features on the current node.
Meanwhile, DeepLinker (all ones) sets all attention coefficients to 1, meaning that neighbors' features contribute equally to the present node. Although DeepLinker (all ones) is a much simpler architecture with fewer trainable parameters, to our surprise its performance is even better than DeepLinker (attention) on all networks, and the gap widens on large networks, as shown in Table 1. When training on the Pubmed network, the loss converges within 10 epochs when we set the attention coefficients to 1. Compared with DeepLinker (attention), DeepLinker (all ones) is thus a more suitable model for link prediction on large networks. Compared with the GraphSAGE-mean algorithm, DeepLinker (all ones) improves the accuracy by 5%, 2%, 2%, 1%, and 3%, respectively.
Table 1 also shows that the RA and LINE algorithms may fail to capture the essential patterns of the graph structure, since both yield low predictive accuracy. Node2vec performs better than LINE and RA, since the Skip-Gram model is better at extracting neighborhood information from the graph structure. The original GraphSAGE-mean and GAT models were designed for the node classification task only; here, by adding a logistic regression layer, we make them suitable for link prediction. Since there are no sampling or mini-batch training strategies in the original GAT model, it suffers from memory bottlenecks once the network becomes large. That is why we do not report the link prediction accuracy of the GAT model on the Citeseer and Pubmed networks.
To test the robustness of the model, we randomly remove 20% of the existing edges. Table 2 shows that both implementations of DeepLinker are more robust than the other algorithms, and DeepLinker (all ones) achieves the highest link prediction accuracy.
Table 1: Link prediction results (Accuracy/AUC) on the five networks.

Accuracy/AUC  Cora  Citeseer  Pubmed  VC network  APS network
GAT  0.79/0.88  NA  NA  0.77/0.83  0.80/0.89
DeepLinker (attention)  0.87/0.93  0.85/0.91  0.57/0.63  0.80/0.90  0.84/0.94
DeepLinker (all ones)  0.88/0.93  0.86/0.91  0.90/0.97  0.82/0.90  0.85/0.95
GraphSAGE-mean  0.83/0.89  0.84/0.90  0.88/0.96  0.81/0.87  0.82/0.88
node2vec  0.82/0.92  0.85/0.89  0.81/0.94  0.77/0.87  0.83/0.89
LINE  0.69/0.76  0.67/0.73  0.66/0.72  0.78/0.84  0.68/0.74
RA  0.41/0.75  0.32/0.73  0.31/0.69  0.33/0.76  0.35/0.78
Table 2: Link prediction results (Accuracy/AUC) after randomly removing 20% of the existing edges.

DeepLinker (attention)  0.84/0.90
DeepLinker (all ones)  0.85/0.91
GraphSAGE-mean  0.81/0.90
node2vec  0.80/0.89
LINE  0.52/0.53
RA  0.33/0.73
4.3 Attention coefficients for node centrality measuring
The learned attention coefficients of DeepLinker (attention) help extract the relationships between connected nodes in a given graph. Here we take the Chinese Venture Capital (VC) network and the APS citation network as examples to show how the attention coefficients help in measuring node centrality and ranking nodes.
One of the most important questions in the venture capital field is identifying the leading investors (leading VCs) among a large number of investment activities. Syndication in the Chinese venture capital market is typically led by leading VCs, who find good investment opportunities, set up investment plans, and organize partners. These leaders play a major role in investment activities; therefore, identifying them has practical significance. To establish the ground truth for identifying the leading VCs, we used the Delphi method to interview four experts in this field and obtain a name list of leaders among the VC firms [36]. Based on this questionnaire survey, we identify 42 elites (leading VCs) in this network.
The APS graph is a subgraph extracted from the APS (American Physical Society) journals website, with nodes representing papers and links representing citations. Measuring the centrality of papers helps scientists find significant, high-quality discoveries among thousands of publications. In this paper, we follow [37] and evaluate a paper's importance by counting the number of citations within the first 10 years (c_10) after its publication.
Intuitively, the more attention a node attracts, the more influential it is. We measure a node's influence by accumulating its neighbors' attention towards it across all heads in the second layer of DeepLinker (attention). We name this attention coefficient based node ranking method Attention Rank, as shown in equation 6:

(6)  $\mathrm{AR}(v_j) = \sum_{k=1}^{K} \sum_{i \,:\, j \in \mathcal{N}_i} \alpha^{k}_{ij}$

Attention Rank is a by-product of DeepLinker (attention). We first extract the normalized attention coefficients $\alpha^{k}_{ij}$ of each attention head in the second layer from a pretrained DeepLinker model, and then sum them over all neighbors and all heads.
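Equation 6 can be sketched as follows (our own illustrative code; the nested-dictionary layout for the coefficients is an assumption):

```python
def attention_rank(alpha_heads, n_nodes):
    """Attention Rank sketch: a node's score is the total attention its
    neighbors pay TO it, summed over all heads of the second layer.

    alpha_heads: list of dicts, one per head, where alpha[i][j] is the
    attention node i pays to its neighbor j."""
    score = [0.0] * n_nodes
    for alpha in alpha_heads:
        for i, nbrs in alpha.items():
            for j, a_ij in nbrs.items():
                score[j] += a_ij            # node j receives attention from i
    return score
```

Since each node's outgoing attention sums to one, a node can only obtain a high score by being attended to by many neighbors, which is what makes the score behave like a centrality measure.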
By calculating each VC's total amount of attention based on equation 6, we find that the elites (leading VCs) consistently attract more attention than the followers. Following the evaluation methods for node centrality measures in complex networks [38], we sort the VC nodes by their total amount of attention in decreasing order, and find that among the top 42 ranked VCs, 30 are elites in the ground truth set. Table 3 shows the top 16 VCs with the most attention; all of them are elites in the ground truth set. Besides, the top 16 VCs have a large overlap with a later released VC ranking website that discusses the 'best' venture capital firms in China [39].
Table 3: The top 16 VCs ranked by total attention (Attention Rank); all are elites in the ground truth set.

Rank  VC name  is_elite  Rank  VC name  is_elite
1  MORE/Shenzhen Capital Group  Yes  9  JAFCO ASIA  Yes
2  IDG Capital  Yes  10  FOETURE Capital  Yes
3  Sequoia  Yes  11  GGV Capital  Yes
4  Legend Capital  Yes  12  Walden International  Yes
5  Goldman Sachs  Yes  13  SBCVC  Yes
6  Intel Capital  Yes  14  DFJ Venture Capital  Yes
7  Northern Light Venture Capital  Yes  15  Qiming  Yes
8  DT Capital  Yes  16  Cowin  Yes
To quantitatively compare the ranking result with other ranking algorithms, we follow the method of the Webspam competition [40] and use Accuracy, defined as the ratio of hits on the ground truth, as the metric to evaluate the performance of different node ranking methods. The ranking performances are listed in Table 4.
To evaluate papers' importance in the APS graph, we first extract the pretrained attention coefficient matrix from DeepLinker and rank papers by their total attention in decreasing order. We then follow the experiments of [41] and use Spearman's rank correlation coefficient as the evaluation metric, since the ground truth values (c_10) are real-valued rather than binary.
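For reference, Spearman's rank correlation can be computed as below (a minimal sketch of our own that assumes distinct values, i.e., it ignores tie handling):

```python
def spearman(x, y):
    """Spearman's rank correlation via the classic formula
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    difference between the ranks of x_i and y_i. Ties are not handled."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A value of 1 means the two rankings agree perfectly, and -1 means they are exactly reversed, which is why it suits the real-valued c_10 ground truth better than a binary hit metric.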
For comparison, we choose several unsupervised graph ranking algorithms. The first is PageRank [42], a basic solution for ranking nodes. The second is Closeness Centrality and the third is Betweenness Centrality [43]. Closeness Centrality assumes that the most important nodes have shorter path lengths to other nodes, while Betweenness Centrality assumes that the most important nodes are involved in more shortest paths. We also compare Attention Rank with SPRank [3], which can efficiently and accurately identify high-quality papers (e.g., Nobel prize winning papers) and significantly outperforms PageRank in predicting the future citation growth of papers. Table 4 compares DeepLinker (attention) with the ranking methods above under their default parameter settings. Attention Rank significantly outperforms the other ranking methods without any adjustable parameters or human knowledge.
Table 4: Node ranking performance of Attention Rank and the baseline methods.

Dataset  Evaluation  PageRank  Closeness  Betweenness  SPRank  Attention Rank
VC  Accuracy  0.65  0.60  0.58  0.64  0.72
APS  Rank Corr.  0.32  0.08  0.03  0.38  0.42
4.4 Feature learning for node classification
DeepLinker can not only predict missing links and measure node centrality but also provide meaningful node representations. In Figure 2 we visualize the raw input attributes, the node2vec representation vectors, and the second-layer vectors of the pretrained DeepLinker (all ones) with the t-SNE visualization method [44]. In this figure each point represents a node of the Cora graph, with the color denoting its class label. The DeepLinker (all ones) visualization shows that, in general, nodes belonging to the same class cluster together. The NMI (Normalized Mutual Information) and Silhouette scores of the DeepLinker representation are much higher than those of the node2vec vectors and the raw attributes. For example, the Reinforcement Learning and Genetic Algorithms classes form quite independent communities, with the highest subnetwork densities (0.017 and 0.009) compared with the density of the whole network (0.001). This indicates that the link prediction based representation has incorporated node similarity information. We also observe that even though DeepLinker tends to separate nodes from different classes, there are still some overlapping areas containing nodes from different classes. This phenomenon may support the popularity versus similarity theory [26], which claims that network links are trade-offs between node similarity and other network properties such as node popularity.
Compared with the network representations learned from supervised node classification information, as in GAT, GCNs, and GraphSAGE, the representations learned from link information, as in DeepLinker and node2vec, contain richer structural information. Node representations learned from supervised node classification can only decode part of the hidden information. For example, in citation networks, papers belonging to the same subject generally have the same labels, so network embedding based on node labels would decode only the similarity between nodes. However, the evolution of citation networks relies not only on papers' subjects and labels; other factors such as authors' fame and popularity also play important roles in network formation [45]. Besides, supervised link information is easier to acquire than supervised label information.
To evaluate the effectiveness of the proposed DeepLinker representation, we follow the commonly adopted setting and compare different representation algorithms on the node classification task on the Cora and Citeseer networks. In node classification, each node has a target label, and the goal is to build a predictive model on the training set and predict the correct labels for the test set. In particular, after extracting the node representations of a given graph, we randomly select some nodes to form the training data; we then use the training nodes' representations and labels to train a logistic-regression classifier and use this model to predict the labels of the remaining nodes. We repeat the classification experiment 10 times and report the average Micro-F1 value over the 10 runs.
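The evaluation protocol above can be sketched in a few lines; the arrays here are placeholders for real embeddings and labels:

```python
# Minimal sketch of the protocol: train logistic regression on a random
# subset of node embeddings, then report mean Micro-F1 over 10 runs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 128))      # placeholder 128-dim embeddings
labels = rng.integers(0, 6, size=500)  # placeholder labels (Citeseer has 6)

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        emb, labels, train_size=0.1, random_state=seed, stratify=labels)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te), average="micro"))

print(f"mean Micro-F1 over 10 runs: {np.mean(scores):.3f}")
```

Only the downstream classifier is trained here; the embeddings themselves are frozen, so the score reflects representation quality rather than classifier capacity.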
We compare the node vectors learned from the link prediction task of DeepLinker (all ones) with the unsupervised GraphSAGE-mean and the widely used node2vec. To control for changing variables, we fix the embedding dimension to 128 for all algorithms and name them DeepLinker_128, GraphSAGE-mean_128, and node2vec_128. The first subgraph in Figure 3 shows that on the Citeseer graph, DeepLinker outperforms node2vec and GraphSAGE-mean, especially when the training set is small. In the second subgraph, when the training portion is less than 10%, DeepLinker also performs better than node2vec and GraphSAGE-mean. We believe that performing well when the training set is small is very important in node classification, because in real-world networks only a few nodes are labeled, and manually labeling a large number of nodes not only costs time and effort but also introduces biases.
Moreover, to improve node classification accuracy, we increase the embedding dimension by concatenating the vectors learned from different unsupervised algorithms. As shown in Figure 3, the concatenation of DeepLinker and GraphSAGE-mean achieves the highest classification accuracy on Cora and Citeseer, and the combination of DeepLinker and node2vec also achieves higher classification accuracy than the combination of GraphSAGE-mean and node2vec. In fact, on the Citeseer network, the concatenation of GraphSAGE-mean and node2vec performs even worse than 128-dimensional DeepLinker.
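The concatenation step is straightforward; the hypothetical arrays below stand in for the real 128-dimensional vectors of any two algorithms:

```python
# Sketch of raising the effective embedding dimension by concatenating the
# node vectors produced by two unsupervised algorithms.
import numpy as np

rng = np.random.default_rng(0)
deeplinker_128 = rng.normal(size=(100, 128))  # placeholder DeepLinker vectors
graphsage_128 = rng.normal(size=(100, 128))   # placeholder GraphSAGE-mean vectors

# Node-wise concatenation yields a 256-dimensional joint representation that
# the same downstream classifier can consume unchanged.
joint_256 = np.concatenate([deeplinker_128, graphsage_128], axis=1)
print(joint_256.shape)
```

Because the two algorithms are trained independently, the joint vector can capture complementary structural signals that neither embedding holds alone.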
5 Conclusion and discussion
In this paper, we propose a mini-batched link prediction model, DeepLinker, which is based on the graph attention architecture and a sampling strategy. DeepLinker can extract meaningful vertex representations and achieves state-of-the-art link prediction accuracy. The by-products of DeepLinker, the attention coefficients and the node vectors, show potential in node centrality measurement and node classification tasks, especially when the labeled training set is small. DeepLinker outperforms other unsupervised node representation learning methods in node classification and node visualization tasks, which may alleviate the dependency on large labeled data sets. We therefore believe that the link prediction task plays a role in graphs similar to that of the language model in natural language processing, and that the node representations learned by link prediction can be used for other downstream tasks in a wide range of areas.
Although adjusting the hyperparameters requires considerable effort, we still believe that network representation based on link prediction can lead to both a quantitative and a qualitative leap in graph processing. And although DeepLinker achieves high link prediction accuracy, we cannot yet explain the mechanism behind such good performance. Our future work will mainly focus on the theory behind DeepLinker.
References
 [1] Chen, X., Liu, M.X. & Yan, G.Y. Drug–target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems 8, 1970–1978 (2012).
 [2] Campillos, M., Kuhn, M., Gavin, A.-C., Jensen, L. J. & Bork, P. Drug target identification using side-effect similarity. Science 321, 263–266 (2008).
 [3] Zhou, J., Zeng, A., Fan, Y. & Di, Z. Ranking scientific publications with similarity-preferential mechanism. Scientometrics 106, 805–816 (2016).
 [4] Craven, M. et al. Learning to construct knowledge bases from the world wide web. Artificial intelligence 118, 69–113 (2000).
 [5] Popescul, A. & Ungar, L. H. Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data, vol. 2003 (Citeseer, 2003).
 [6] Liben-Nowell, D. & Kleinberg, J. The link-prediction problem for social networks. Journal of the American society for information science and technology 58, 1019–1031 (2007).
 [7] Barabási, A.L. & Albert, R. Emergence of scaling in random networks. science 286, 509–512 (1999).
 [8] Zhou, T., Lü, L. & Zhang, Y.C. Predicting missing links via local information. The European Physical Journal B 71, 623–630 (2009).
 [9] Liu, H., Hu, Z., Haddadi, H. & Tian, H. Hidden link prediction based on node centrality and weak ties. EPL (Europhysics Letters) 101, 18004 (2013).
 [10] Rücker, G. Network meta-analysis, electrical networks and graph theory. Research Synthesis Methods 3, 312–324 (2012).
 [11] Clauset, A., Moore, C. & Newman, M. E. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98 (2008).
 [12] Huang, Z. Link prediction based on graph topology: The predictive value of the generalized clustering coefficient (2006).
 [13] Perozzi, B., Al-Rfou, R. & Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710 (ACM, 2014).
 [14] Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864 (ACM, 2016).
 [15] Ou, M., Cui, P., Pei, J., Zhang, Z. & Zhu, W. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 1105–1114 (ACM, 2016).
 [16] Tang, J. et al. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, 1067–1077 (International World Wide Web Conferences Steering Committee, 2015).
 [17] Ribeiro, L. F., Saverese, P. H. & Figueiredo, D. R. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 385–394 (ACM, 2017).
 [18] Gu, W., Gong, L., Lou, X. & Zhang, J. The hidden flow structure and metric space of network embedding algorithms based on random walks. Scientific reports 7, 13114 (2017).
 [19] Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034 (2017).
 [20] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
 [21] Gehring, J., Auli, M., Grangier, D. & Dauphin, Y. N. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344 (2016).
 [22] Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
 [23] Velickovic, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903 1 (2017).
 [24] Defferrard, M., Bresson, X. & Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I. & Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, 3844–3852 (Curran Associates, Inc., 2016).
 [25] Chen, J., Ma, T. & Xiao, C. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
 [26] Papadopoulos, F., Kitsak, M., Serrano, M. Á., Boguná, M. & Krioukov, D. Popularity versus similarity in growing networks. Nature 489, 537 (2012).
 [27] Şimşek, Ö. & Jensen, D. Navigating networks by using homophily and degree. Proceedings of the National Academy of Sciences (2008).
 [28] Wu, Y., Fu, T. Z. & Chiu, D. M. Generalized preferential attachment considering aging. Journal of Informetrics 8, 650–658 (2014).
 [29] Byrd, R. H., Chin, G. M., Nocedal, J. & Wu, Y. Sample size selection in optimization methods for machine learning. Mathematical programming 134, 127–155 (2012).
 [30] Li, M., Zhang, T., Chen, Y. & Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 661–670 (ACM, 2014).
 [31] Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
 [32] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
 [33] Ke, Q. Zero2IPO research. https://www.pedata.cn/data/index.html (2014).
 [34] Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256 (2010).
 [35] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 [36] Gu, W., Liu, J. et al. Exploring smallworld network with an eliteclique: Bringing embeddedness theory into the dynamic evolution of a venture capital network. Social Networks 57, 70–81 (2019).
 [37] Wang, D., Song, C. & Barabási, A.L. Quantifying longterm scientific impact. Science 342, 127–132 (2013).
 [38] Aral, S. & Walker, D. Identifying influential and susceptible members of social networks. Science 1215842 (2012).
 [39] Nanalyze. The 'Best' Chinese Venture Capital Firms. https://www.nanalyze.com/2018/01/bestchineseventurecapitalfirms (2018).
 [40] Heidemann, J., Klier, M. & Probst, F. Identifying key users in online social networks: A PageRank based approach (2010).
 [41] Wang, Y., Tong, Y. & Zeng, M. Ranking scientific articles by exploiting citations, authors, journals, and time information. In AAAI (2013).
 [42] Page, L., Brin, S., Motwani, R. & Winograd, T. The pagerank citation ranking: Bringing order to the web. Tech. Rep., Stanford InfoLab (1999).
 [43] Freeman, L. C. Centrality in social networks conceptual clarification. Social networks 1, 215–239 (1978).
 [44] Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. Journal of machine learning research 9, 2579–2605 (2008).
 [45] Hunter, D., Smyth, P., Vu, D. Q. & Asuncion, A. U. Dynamic egocentric models for citation networks. In Proceedings of the 28th International Conference on Machine Learning (ICML11), 857–864 (2011).