Link Prediction via Deep Learning

10/10/2019, by Weiwei Gu et al.

Link prediction aims to infer missing links or predict future ones based on a currently observed partial network. It is a fundamental problem in network science, not only because the problem has a wide range of applications such as social network recommendation and information retrieval, but also because the linkages contain rich hidden information about node properties and network structure. However, conventional link prediction approaches neither achieve high prediction accuracy nor reveal the hidden information behind the links. To address this problem, we generalize the latest techniques of deep learning on graphs and present a new link prediction model, DeepLinker, which integrates the batched graph convolution technique of GraphSAGE and the attention mechanism of the graph attention network (GAT). Experiments on five graphs show that our model not only achieves state-of-the-art accuracy in link prediction, but also produces effective node rankings and node representations as byproducts of the link prediction task. Although the low-dimensional node representations are obtained without any node label information, they perform very well on downstream tasks such as node ranking and classification. Therefore, we claim that the link prediction task on graphs is like the language model in natural language processing, because it reveals the hidden information in the graph structure in an unsupervised way.


1 Introduction

Many real-world data come naturally in the form of pairwise relations, such as protein-protein interactions in human cells, paper citations in scientific research, and drug-target interactions in drug discovery [1, 2, 3]. These linkages contain rich information about node properties, network structure, and network evolution. Predicting the existence of a relation, commonly abbreviated as link prediction, is a fundamental task in network science and of great importance in practice. For biological networks such as protein-protein interaction networks, metabolic networks, and food webs, the discovery and validation of links require significant experimental effort. Instead of blindly checking all possible links, link prediction helps scientists focus on the most likely ones, which sharply reduces the experimental cost. For the WWW, social networks, and citation networks, link prediction can also help in recommending relevant pages, finding new friends, or discovering new citations [4, 5, 6].

Conventional link prediction methods can be divided into several groups. Local similarity based approaches make predictions under the assumption that two nodes are more likely to be connected if they have many common neighbors [7, 8]. These approaches are fast and highly parallel since they only consider local structure, but their biggest drawback is low prediction accuracy, especially when the network is sparse and large. Global similarity based methods, in contrast, use the topological information of the whole network to calculate the similarity between node pairs [9, 8, 10]. Although these methods achieve better prediction accuracy, they usually suffer from high computational complexity, which makes them infeasible for graphs containing millions or billions of nodes. There are also probabilistic and statistical approaches that assume a known prior structure of the network, such as a hierarchical or circle structure [11, 12], but they cannot overcome the problem of low accuracy either. Furthermore, the conventional approaches can hardly reveal the hidden information about node properties and network structure behind the linkages.

Recently, there has been a surge of algorithms that perform link prediction through network representation learning, which automatically extracts both local and global structural information about nodes from graphs. The idea behind these representation learning algorithms is to learn a mapping function that embeds nodes as points in a low-dimensional space encoding the information of the original graph. Network representation based methods, which usually build on the Skip-Gram model or matrix factorization, such as DeepWalk, node2vec, LINE, and struc2vec [13, 14, 15, 16, 17], have achieved much higher link prediction accuracy than the conventional approaches. Random walk based representation learning algorithms are task agnostic: the learned representations are then used to perform graph-based downstream machine learning tasks such as node classification, node ranking, and link prediction [18, 14].

However, these methods have several drawbacks. To start with, there is no supervised information during the training process; node representation vectors are updated directly without considering the global network structure. Besides, computational complexity is the biggest headache for these algorithms, since a large number of random walks over the whole graph are required by most Skip-Gram based methods [13, 14, 17]. Moreover, the expressive power is limited because the embedding process is fixed by the random walk strategy. Finally, the representations can hardly be extended to inductive learning, since the embedding vectors cannot be transferred to other similar graphs [19].

More recently, deep learning techniques based on neural networks have achieved triumphs in image processing [20] and natural language processing [21]. This has stimulated extensions of these methods to graph structures for node classification and link prediction tasks by converting the network structure into low-dimensional representations. For example, the graph convolutional network (GCN) [22] borrows the concept of convolution from the convolutional neural network (CNN) and convolves the graph directly according to its connectivity structure. Velickovic et al. [23] further proposed graph attention networks (GAT) and obtained state-of-the-art accuracy on the node classification task. Following the self-attention mechanism, GAT computes the representation of each node by combining its neighbors' vectors in an adaptive way. The attention here is an adjustable weight on each neighbor node, which can be updated dynamically according to the states of the nodes within a locally connected neighborhood.

Nevertheless, the algorithms mentioned above and their extensions [24] have a scalability problem, since they take the whole graph as input and recursively expand neighborhoods across layers. This expansion is computationally expensive, especially when the graph is very large. Due to the scale-free property of many graphs, when hub nodes are sampled as 1st-order neighbors, their 2nd-order neighbors usually quickly fill up the memory, which leads to a memory bottleneck. This problem prevents GAT and GCNs from being applied to large-scale networks.

GraphSAGE [19] tries to solve the memory bottleneck problem by sampling a fixed-size neighborhood during each iteration and then applying a specific aggregator as the feature extractor. The sampling strategy in GraphSAGE yields impressive performance on node labeling tasks over several large-scale networks. FastGCN [25] then proposes to view the GCN [22] as an integral transform of embedding functions under a probability measure. Its classification accuracy is highly comparable with the original GCN while getting rid of the reliance on the test data.

However, most of the neural network based methods use node classification labels, rather than linkages, as the only supervised information. As we know, the representation quality relies on the supervised information, yet node labels are highly scarce in most real networks. Besides, linkages rather than node attributes contain much richer information about network structure and evolution. For example, according to the popularity versus similarity theory [26], the links within a network not only reveal similarity relations between nodes [27] but also encode node popularity [7]. Take the formation of a citation network as an instance: citations are made not only according to content similarity but also based on the popularity of the existing papers [28]. Thus, linkages rather than node labels should be used as the supervised information for learning node representations, because they encode at least both popularity and similarity information.

In this paper, we propose a new model named DeepLinker, which extends the GAT model to predict linkages over various networks. By adopting the attention mechanism, we can not only make predictions on links but also learn node representations and node rankings; the learned attention weight paid to each node can be regarded as a kind of node centrality. However, the original GAT model cannot be directly applied to the supervised link prediction task for the following reasons. First, the complexity of node classification is O(N), while the complexity of link prediction is O(N^2), where N is the number of nodes; the link prediction task usually involves a much larger number of node feature computations than node classification. Second, the original GAT model needs to access the entire network while performing node classification inference, so it must be extended to process mini-batched graph data. However, we cannot directly sample the graph by following links to form a mini-batch, due to the well known scale-free property of real networks: the expansion of the neighborhood of a hub node can quickly fill up a large portion of the graph. Last, although large mini-batches are preferable for reducing communication cost, they may slow down convergence in practice [29], since a decrease in mini-batch size typically increases the rate of convergence [30] in the optimization process. In the original GAT model, even a small mini-batch usually involves a large number of nodes, which decreases the convergence rate and usually leads to poor link prediction accuracy.

Here we solve the memory bottleneck and the convergence problem by incorporating the mini-batch sampling strategy of GraphSAGE with a fixed neighborhood size [19]. The difference between DeepLinker and GraphSAGE lies in the sampling times: in DeepLinker we sample only once and then fix the sampled neighborhoods of all nodes during the training process, while GraphSAGE keeps changing the neighbors in every epoch. We also discover that changing neighbors in every epoch usually slows down convergence and leads to poor performance. Our model is a novel structure combining both GAT and GraphSAGE, particularly designed for link prediction. It computes the hidden representation of each node through a shared attention mechanism across its neighbors, and combines node vectors to represent edge features of the graph.

A large number of experiments are conducted on five representative networks. The results show that DeepLinker not only achieves state-of-the-art accuracy on link prediction but also obtains effective node representations for downstream tasks such as node ranking and node classification. We find that the nodes that receive the most attention from their neighbors are either the elites in a Chinese co-investment network or the 'best papers' in the APS citation network. Moreover, once the model is well trained, the low-dimensional node vectors extracted from the last layer of DeepLinker can be used to classify nodes with higher accuracy than other unsupervised learning algorithms, even when only a small fraction of labeled nodes is available for training; the less node label information is exploited, the larger the advantage of our model.

Our main contributions are summarized as follows:

  • We propose DeepLinker, which achieves the state-of-the-art accuracy in link prediction task.

  • We handle the memory bottleneck and mini-batch problems by fixing the neighborhood size, which yields a controllable per-batch computation cost.

  • The trained attentional coefficient matrix of DeepLinker plays a key role in revealing the latent significance of nodes. It helps in identifying the elites of a Chinese Investment Network and finding the ’best papers’ in a citation network.

  • DeepLinker can extract meaningful feature representations for nodes; this link-prediction-based node embedding achieves high accuracy on the node classification task, especially when the training set is small.

Even more, our model and its performance on a large variety of tasks remind us of the language model in natural language processing (NLP), where algorithms learn effective representations of words and sentences for a large variety of downstream tasks by predicting the next word [31, 32]. Therefore, we claim that the link prediction task can be treated as the "language model" for graphs.

2 GAT architectures

Figure 1:

The overall DeepLinker architecture. We take a simple five-node network as an example; the linkage relation between any two nodes is considered. Linked relations are drawn as solid lines and unlinked ones as dashed lines. We take the linked node 1 (in red) and node 2 (in yellow) as a training example. We first sample nodes 3 and 4 as their 1st-order neighbors, then sample nodes 1, 2, and 5 as their 2nd-order neighbors; nodes 1, 2, and 5 are also the 1st-order neighbors of nodes 3 and 4. After that we calculate node 1's and node 2's vector representations from their initial attributes as well as their neighbors' initial features via the GAT architecture. We obtain the edge vector by calculating the Hadamard product of node 1's and node 2's vector representations. Finally, a logistic regression function is applied to compute the linkage existence probability.

To start with, we review the architecture of the GAT model, since our model is mainly based on it. GAT takes a set of node features as input, h = {h_1, h_2, ..., h_N}, h_i ∈ R^F, where N is the number of nodes and F is the number of input node attributes. We use h' to denote GAT's outputs, which also contain N node features. The goal of GAT is to obtain sufficient expressive power to transform the input features into high-level output features. It first applies a learnable linear transformation, parameterized by a weight matrix W, to every node, and then uses a single-layer feed-forward neural network a to compute the attention coefficients between nodes. This computation is shown in equation 1, where ^T represents transposition and || is the concatenation operation. Node j is a neighbor of node i (j ∈ N_i), and α_ij indicates the importance of j's features to i, normalized over all of i's neighbors.

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}\,[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{T}\,[W h_i \,\|\, W h_k]\right)\right)} \qquad (1)$$

Once the normalized attention coefficients α_ij are obtained, GAT aggregates each node's features as a weighted combination of its neighbors' features, followed by a (potentially nonlinear) activation function σ, as shown in equation 2.

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W h_j\right) \qquad (2)$$

Finally, GAT employs multi-head attention to stabilize the learning of the attention coefficients. K denotes the number of attention heads, and α_ij^k denotes the k-th head's attention weight of node j's features to node i. The output features from the K heads are either concatenated or averaged to form each node's final output features, as shown (in the concatenation form) in equation 3:

$$h'_i = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, W^{k} h_j\right) \qquad (3)$$
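To make equations 1-3 concrete, the following PyTorch sketch implements a single multi-head GAT-style layer over a dense adjacency matrix with self-loops. It is an illustrative re-implementation under our own naming (GATLayer, dense adj), not the authors' released code, and it omits dropout and sparse-graph optimizations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """One multi-head attention layer over a dense adjacency matrix (illustrative sketch)."""

    def __init__(self, in_dim, out_dim, num_heads=8, concat=True):
        super().__init__()
        self.out_dim, self.concat = out_dim, concat
        self.W = nn.Parameter(torch.empty(num_heads, in_dim, out_dim))  # shared linear transform
        self.a = nn.Parameter(torch.empty(num_heads, 2 * out_dim))      # attention vector a
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) binary adjacency with self-loops
        Wh = torch.einsum('nf,hfo->hno', h, self.W)                      # (heads, N, out_dim)
        e_src = torch.einsum('hno,ho->hn', Wh, self.a[:, :self.out_dim])
        e_dst = torch.einsum('hno,ho->hn', Wh, self.a[:, self.out_dim:])
        e = F.leaky_relu(e_src.unsqueeze(2) + e_dst.unsqueeze(1), 0.2)   # unnormalized scores
        e = e.masked_fill(adj.unsqueeze(0) == 0, float('-inf'))          # keep only real neighbors
        alpha = torch.softmax(e, dim=2)                                  # normalized coefficients (eq. 1)
        out = F.elu(torch.einsum('hij,hjo->hio', alpha, Wh))             # neighborhood aggregation (eq. 2)
        if self.concat:                                                  # eq. 3: concatenate the heads ...
            return out.permute(1, 0, 2).reshape(h.size(0), -1)
        return out.mean(dim=0)                                           # ... or average them
```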

3 DeepLinker architecture

To address the low link prediction accuracy with a deep learning architecture, we introduce DeepLinker, which has an encoder-decoder structure. The overall architecture of DeepLinker is shown in Figure 1. An encoder maps each node to a vector representation; a decoder then constructs an edge vector by aggregating the two node vectors. Finally, a score function evaluates the link existence probability of the node pair from the edge vector. One of the key ideas behind DeepLinker is to learn how to aggregate node features into an edge vector for the link prediction task.

As mentioned above, the limitations of GAT are the memory bottleneck and the mini-batch problem. Due to the scale-free property of most networks, once hub nodes are sampled as 1st-order neighbors in the GAT architecture, their 2nd-order neighbors quickly fill up the memory, which prevents GAT from being applied to larger networks. Besides, existing GPU-enabled tensor manipulation frameworks can only parallelize the computation of the normalized attention coefficients (α_ij) over equally sized neighborhoods, which limits GAT's parallelism.

3.1 Fixed-sized Neighborhood Sampling

Here we use a fixed-size neighborhood sampling strategy to solve the memory bottleneck and the mini-batch problem. The undirected graph is represented as G = (V, E), with V denoting the set of nodes and E the set of edges of the network G. For any two randomly selected nodes u and v, we calculate an edge vector to predict the existence of a linkage between them. We sample each node's neighborhood to form a fixed-size node mini-batch. Taking node v as an example, we uniformly sample a fixed-size set of its neighbors, denoted N(v), instead of using the full neighborhood set as in GAT. Different from GraphSAGE, which resamples the neighborhood in every training iteration, we sample only once during the whole training process. The sampling strategy is illustrated in Figure 1.
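The sample-once strategy can be sketched as follows. The container names, the sample_size parameter, and the padding-by-resampling rule for low-degree nodes are our own assumptions; the text only specifies that a fixed-size neighbor set is drawn uniformly, once, before training.

```python
import random

def sample_fixed_neighborhoods(adj_list, sample_size, seed=0):
    """adj_list: dict node -> list of neighbor ids; every node is assumed to have >= 1 neighbor."""
    rng = random.Random(seed)
    fixed = {}
    for node, neighbors in adj_list.items():
        if len(neighbors) >= sample_size:
            fixed[node] = rng.sample(neighbors, sample_size)                   # uniform, no replacement
        else:
            fixed[node] = [rng.choice(neighbors) for _ in range(sample_size)]  # pad by resampling
    return fixed

# Called once before training; every epoch and mini-batch reuses the same neighborhoods.
# neighborhoods = sample_fixed_neighborhoods(adj_list, sample_size=20)
```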

We then compute the Hadamard product between the two output vectors to generate an edge vector, and pass it through a logistic (sigmoid) layer to evaluate the edge existence probability, as shown in equation 4:

$$p_{uv} = \mathrm{sigmoid}\left(w^{\top}\left(h'_u \odot h'_v\right)\right) \qquad (4)$$
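A minimal sketch of this decoder step, assuming a learnable logistic-regression weight on top of the Hadamard product (consistent with Figure 1); the module name and weight shape are illustrative.

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Score a node pair: sigmoid of a logistic regression on the Hadamard product."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, h_u, h_v):
        edge_vec = h_u * h_v                                      # Hadamard product (edge vector)
        return torch.sigmoid(self.linear(edge_vec)).squeeze(-1)   # link existence probability
```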

3.2 Training DeepLinker

The whole framework is trained by minimizing the following objective function:

$$\mathcal{L} = -\sum_{(u,v)\,\in\, E_{\mathrm{train}}\,\cup\, E^{-}} \left[\, y_{uv}\log p_{uv} + (1-y_{uv})\log\left(1-p_{uv}\right) \right] \qquad (5)$$

where y_uv is the label of the linkage between u and v, with y_uv = 0 for non-existence and y_uv = 1 for existence, and E_train is the training set of edges. We follow the convention in link prediction and randomly divide the full set of edges into three parts for training, validation, and testing. We train the model to predict not only existing but also non-existing links. Here, E^- is the set of negative samples, in which each element is a node pair (u, v) such that both u and v are drawn from the nodes involved in E_train and there is no edge between u and v. In our experiments, we sample the same number of negative instances as positive ones.
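The following sketch shows one way to realize equation 5 with negative sampling, assuming the edge scorer above and simple Python containers for edges; the function names sample_negative_edges and link_loss are our own.

```python
import random
import torch
import torch.nn.functional as F

def sample_negative_edges(nodes, edge_set, num_samples, seed=0):
    """Draw node pairs (from nodes involved in training edges) that are not linked."""
    rng = random.Random(seed)
    negatives = []
    while len(negatives) < num_samples:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if u != v and (u, v) not in edge_set and (v, u) not in edge_set:
            negatives.append((u, v))
    return negatives

def link_loss(scorer, node_vectors, pos_edges, neg_edges):
    """Binary cross-entropy over positive edges (label 1) and negative samples (label 0)."""
    pairs = pos_edges + neg_edges
    labels = torch.tensor([1.0] * len(pos_edges) + [0.0] * len(neg_edges))
    h_u = node_vectors[torch.tensor([u for u, _ in pairs])]
    h_v = node_vectors[torch.tensor([v for _, v in pairs])]
    return F.binary_cross_entropy(scorer(h_u, h_v), labels)
```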

4 Experiments

In order to evaluate the performance of DeepLinker on real-world networks and to explore its potential applications, we conduct link prediction, node ranking, and node classification tasks over several networks, ranging from citation networks to a venture capital co-investment network. All experiments are run in the PyTorch machine learning environment with a CUDA backend.

4.1 Experiments setup

We utilize the following five networks in our experiments:

  • Cora network contains 2,708 nodes, 5,429 edges, and 7 classes. Each node has 1,433 attributes corresponding to elements of a bag-of-words representation of a document.

  • Citeseer network contains 3,327 nodes, 4,732 edges, and 6 classes. Each node has 3,703 attributes extracted from paper contents.

  • Pubmed is a citation network that contains 19,717 nodes, 44,338 edges, and 3 classes. Each node has 500 attributes.

  • VC network is a venture capital co-investment network, with nodes representing venture capital (VC) firms and edges corresponding to co-investment events. It contains 1,436 nodes and 2,265 edges. Here we use the adjacency matrix as the one-hot input node attributes during training. On this network, 42 nodes are manually identified as leading VCs that play a vital role in venture capital events; these nodes are regarded as the ground truth in the node ranking task. The VC investment network is built on the SiMuTon [33] database.

  • APS graph has 1,012 nodes and 3,336 edges. The adjacency matrix is used as the one-hot input node attributes during training. We quantify each paper's impact and importance by counting the number of citations within 10 years (c_10) after its publication. The metric c_10 is used as the ground truth for measuring node importance.

We compare DeepLinker with the following baseline algorithms:

  1. RA [8] is a traditional link prediction method in which the similarity between two nodes is measured by summing over their common neighbors, each weighted inversely proportional to its degree.

  2. LINE [16] minimizes a loss function to learn embeddings while preserving the first- and second-order proximity among vertices in the graph.

  3. Node2vec [14] adopts a biased random walk strategy and Skip-Gram to learn vertex embeddings. This embedding algorithm has been widely used in recent years.

  4. GraphSAGE [19] learns node embeddings through a general inductive framework consisting of several feature aggregators. It usually adopts the supervised node classification task as the evaluation benchmark, with the assumption that a better embedding algorithm leads to higher node classification accuracy.

In our experiments, we keep all baseline models' architectures as they are described in their original papers. This includes the type and sequence of layers, the choice of activation functions, the placement of dropout, and the setting of hyper-parameters.

The DeepLinker architecture consists of two layers. The first layer is made up of 8 attention heads (K1 = 8) for all networks. Here we set the hidden size to 32 for Cora and VC, 16 for Citeseer and APS, and 64 for the Pubmed network. The main purpose of the first layer is to compute the hidden features of the 1st-order neighbors. We then add non-linearity by feeding the hidden features to an exponential linear unit (ELU), as shown in equation 2. The aggregated features from each head are concatenated in this layer.

The main purpose of the second layer is to compute the edge features used for evaluating the link existence probability. Here we use a single attention head (K2 = 1) for the Cora, Citeseer, VC, and APS graphs. We find that the Pubmed graph requires a much larger number of attention heads, so we set K2 = 8 for Pubmed. The aggregated features from each head are averaged in this layer. The output of the second layer is the final feature representation of each node. We then compute the Hadamard product between two node feature vectors to represent the edge vector, as shown in equation 4. Once the edge vectors are obtained, a sigmoid activation function is applied to evaluate the existence probability between the nodes.

We initialize the parameters of DeepLinker with Glorot initialization [34] and train the model to minimize the binary cross-entropy in equation 5 on the training set, using the Adam SGD optimizer [35]. We also apply an early stopping strategy based on the link prediction accuracy on the validation set, with the patience set to 100 epochs.

We solve the memory bottleneck by sampling a fixed number of neighbors (20) for both the 1st- and 2nd-order neighbor selection. The sampling strategy is illustrated in Figure 1. We set the batch size to 32 for Cora, Citeseer, and Pubmed, 16 for the APS network, and 8 for the VC network.

4.2 Link prediction

In this part, we evaluate the link prediction accuracy of DeepLinker and compare it with other link prediction algorithms. The goal of the link prediction task is to predict whether there exists an edge between two given vertices. To start with, we randomly hide 10% of the edges in the original graph as the 'positive' samples of the test set. The test set also contains an equal number of randomly selected disconnected node pairs that serve as 'negative' samples. We then use the remaining connected links and randomly selected disconnected ones to form the training set. After that, we uniformly sample first- and second-order neighbors for each node. Finally, we feed the sampled nodes into DeepLinker, whose output is the edge existence probability between two nodes.

Two standard metrics, accuracy and AUC (area under the ROC curve), are used to quantify the link prediction performance. As shown in Table 1, DeepLinker outperforms all the baseline methods in link prediction accuracy across all graphs.
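For reference, both metrics can be computed from the predicted probabilities over the balanced test set as in the sketch below; scikit-learn is assumed, and the 0.5 decision threshold is our assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_link_prediction(probs, labels, threshold=0.5):
    """probs, labels: arrays over the balanced test set of hidden edges (1) and non-edges (0)."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    acc = accuracy_score(labels, (probs >= threshold).astype(int))
    auc = roc_auc_score(labels, probs)
    return acc, auc
```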

Here, we propose two implementations of DeepLinker: DeepLinker (attention) and DeepLinker (all ones). Both implementations apply the neighborhood sampling and mini-batch strategies during training. DeepLinker (attention) trains the attention coefficients, which indicate how important the neighbors' features are to the present node, as illustrated in equation 1, while DeepLinker (all ones) sets the attention to 1 for all neighbors. As shown in Table 1, DeepLinker (attention) achieves much higher prediction accuracy than the other link prediction algorithms, and DeepLinker (all ones) achieves the best performance on all datasets. In fact, on larger networks such as Pubmed, no matter how we adjust the learning rate, the size of the linear transformation matrix, or the mini-batch size of the attention based DeepLinker, the loss does not go down. At the initial stage of training, the neighbors' attributes have a strong influence on a node's representation, but the attention coefficients in the DeepLinker (attention) architecture are restricted to the range from 0 to 1. These coefficients become too small, especially for a hub node with many neighbors, so the attention mechanism limits the expressive power of the neighbors' features on the current node.

Meanwhile, DeepLinker (all ones) sets all the attention coefficients to 1, which means the neighbors' features contribute equally to the present node. Although DeepLinker (all ones) is a much simpler architecture with fewer trainable parameters, to our surprise its performance is even better than DeepLinker (attention) on all networks, and the gap becomes larger on larger networks, as shown in Table 1. When training on the Pubmed network, the loss converges within 10 epochs once we set the attention coefficients to 1. Compared with DeepLinker (attention), DeepLinker (all ones) is thus a more suitable model for link prediction on large networks. Compared with the GraphSAGE-mean algorithm, DeepLinker (all ones) improves the accuracy by 5%, 2%, 2%, 1%, and 3%, respectively.

Table 1 shows that the RA and LINE algorithms may not capture the essential patterns of the graph structure, since their predictive accuracies are low. Node2vec performs better than LINE and RA because the Skip-Gram model is better at extracting neighborhood information from the graph structure. The original GraphSAGE-mean and GAT models are used for the node classification task only; here, by adding a logistic regression layer, we make them suitable for the link prediction task. Since there are no sampling and mini-batch training strategies in the original GAT model, once the network becomes large, the original GAT algorithm suffers from memory bottlenecks. That is why we do not report the link prediction accuracy of the GAT model on the Citeseer and Pubmed networks.

In order to test the robustness of the model, we randomly remove 20% of the existing edges. Table 2 shows that both implementations of DeepLinker are more robust than the other algorithms, and DeepLinker (all ones) achieves the highest link prediction accuracy.

Accuracy/AUC Cora Citeseer Pubmed VC network APS network
GAT 0.79/0.88 NA NA 0.77/0.83 0.80/0.89
DeepLinker (attention) 0.87/0.93 0.85/0.91 0.57/0.63 0.80/0.90 0.84/0.94
DeepLinker (all ones) 0.88/0.93 0.86/0.91 0.90/0.97 0.82/0.90 0.85/0.95
GraphSAGE-mean 0.83/0.89 0.84/0.90 0.88/0.96 0.81/0.87 0.82/0.88
node2vec 0.82/0.92 0.85/0.89 0.81/0.94 0.77 /0.87 0.83/0.89
LINE 0.69/0.76 0.67 / 0.73 0.66/0.72 0.78/0.84 0.68/0.74
RA 0.41/0.75 0.32/0.73 0.31/0.69 0.33/0.76 0.35/0.78
Table 1: Link prediction accuracy and AUC of different algorithms over several networks. DeepLinker (attention) trains the attention coefficients, while DeepLinker (all ones) sets all the attention coefficients to 1 among all the connected node pairs.
Method Accuracy/AUC
DeepLinker (attention) 0.84/0.90
DeepLinker (all ones) 0.85/0.91
GraphSAGE-mean 0.81/0.90
node2vec 0.80/0.89
LINE 0.52/0.53
RA 0.33/0.73
Table 2: Link prediction robustness test on the Cora graph with 20% of the existing edges removed.

4.3 Attention coefficients for node centrality measuring

The learned attention coefficients of DeepLinker (attention) help extract the relationships between connected nodes of a given graph. Here we take the Chinese venture capital (VC) network and the APS citation network as examples to show how the attention coefficients help in node centrality measurement and node ranking.

One of the most important questions in the venture capital analysis field is to identify the leading investors (leading VCs) among a large number of investment activities. Syndication in the Chinese venture capital market is typically led by leading VCs, who find good investment opportunities, set up investment plans, and organize the partners. These leaders play a major role in investment activities; therefore, identifying them has practical significance. To obtain the ground truth for discovering and identifying the leading VCs in the venture capital industry, we use the Delphi method and interview four experts in this field to get a name list of leaders among the VC firms [36]. Based on this questionnaire survey, we identify 42 elites (leading VCs) in this network.

The APS graph is a sub-graph extracted from the APS (American Physical Society journals) website, with nodes representing papers and links representing citations. Measuring the centrality of papers helps scientists find the significant, high-quality discoveries among thousands of publications. In this paper, we follow [37] and evaluate a paper's importance by counting the number of citations within the first 10 years (c_10) after its publication.

Intuitively, the more attention a node attracts, the more influential it is. We measure a node's influence by accumulating its neighbors' attention towards it across all heads in the second layer of DeepLinker (attention). We name this attention-coefficient-based node ranking method Attention Rank, as shown in equation 6:

$$AR(i) = \sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ji}^{k} \qquad (6)$$

Attention Rank is a byproduct of DeepLinker (attention). We first extract the normalized attention coefficients α_ji^k of each attention head k in the second layer from a pre-trained DeepLinker model, and then sum them over all neighbors and all heads.
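A sketch of this computation, assuming the second-layer coefficients have been extracted into a tensor alpha of shape (heads, N, N), where alpha[k, i, j] is the attention node i pays to neighbor j in head k (zero for non-neighbors); this tensor layout is our assumption, not a detail stated in the paper.

```python
import torch

def attention_rank(alpha):
    """alpha: (heads, N, N) second-layer coefficients; returns one score per node (equation 6)."""
    return alpha.sum(dim=0).sum(dim=0)   # total incoming attention over all heads and all source nodes

# ranking = torch.argsort(attention_rank(alpha), descending=True)
```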

By calculating VC’s total amount of attention based on equation 6, we find that the elites (leading VCs) always attract a larger amount of attention compared with the followers. Following the evaluation methods for node centrality measures in complex networks [38], we sort VC nodes according to the total amount of attention in a decreasing order, and find that we can hit 30 elites out of the top 42 elites of the ground truth set. Table 3 shows the top 16 VCs with the most attention, all of the VCs are elites in the ground truth set. Besides, we find the top 16 VCs has a larger overlaps with a later released VC ranking website that discusses about the ’best’ venture capital in China [39].

Rank VC name is_elite Rank VC name is_elite
1 MORE/Shenzhen Capital Group Yes 9 JAFCO ASIA Yes
2 IDG Capital Yes 10 FOETURE Capital Yes
3 Sequoia Yes 11 GGV Capital Yes
4 Legend Capital Yes 12 Walden International Yes
5 Goldman Sachs Yes 13 SBCVC Yes
6 Intel Capital Yes 14 DFJ Venture Capital Yes
7 Northern Light Venture Capital Yes 15 Qiming Yes
8 DT Capital Yes 16 Cowin Yes
Table 3: The top 16 VC firms that attract the most attention.

To compare the ranking result with other ranking algorithms, we follow the method of the Webspam competition [40] and use accuracy, defined as the ratio of hits on the ground truth, as the metric to evaluate the performance of the different node ranking methods. The ranking performances are listed in Table 4.

In order to evaluate papers’ importance for APS graph, we first extract the pre-trained attention coefficient matrix from DeepLinker and rank papers’ total attention in a decreasing order. We then follow the experiments of [41]

to use Spearman’s rank correlation coefficient as an evaluation metric since the ground truths (c_10) are real values instead of binary ones.

In this part we choose several unsupervised graph ranking algorithms for comparison. The first algorithm is PageRank [42], which is a basic solution for ranking nodes. The second is closeness centrality and the third is betweenness centrality [43]. Closeness centrality assumes that the most important nodes have shorter path lengths to other nodes, while betweenness centrality assumes that the most important nodes are involved in more shortest paths. We also compare Attention Rank with SPRank [3], which can efficiently and accurately identify high-quality papers (e.g. Nobel prize winning papers) and significantly outperforms PageRank in predicting the future citation growth of papers. Table 4 compares DeepLinker (attention) with the ranking methods mentioned above under their default parameter settings. We find that Attention Rank significantly outperforms the other ranking methods without any adjustable parameters or human knowledge.

Dataset Evaluation PageRank Closeness Betweenness SPRank Attention Rank
VC Accuracy 0.65 0.60 0.58 0.64 0.72
APS Rank Corr. 0.32 0.08 0.03 0.38 0.42
Table 4: Ranking performance comparison under unsupervised ranking methods.

4.4 Feature learning for node classification

DeepLinker can not only be applied to predicting missing links and measuring node centrality, but can also provide meaningful node representations. In Figure 2 we visualize the raw input attributes, the node2vec representation vectors, and the second-layer vectors of the pre-trained DeepLinker (all ones) using the t-SNE [44] visualization method. In this figure each point represents a node of the Cora graph, with its color denoting the class label. From the DeepLinker (all ones) visualization, we can tell that, in general, nodes belonging to the same class cluster together. The NMI (normalized mutual information) and Silhouette scores of the DeepLinker representations are much higher than those of the node2vec vectors and the raw attributes. For example, the Reinforcement Learning and Genetic Algorithms classes form quite independent communities, with the highest sub-network densities (0.017 and 0.009) compared with the whole network's density (0.001). This indicates that the link-prediction-based representation has incorporated node similarity information. We also discover that, even though DeepLinker tends to separate nodes from different classes, there are still some overlapping areas consisting of nodes from different classes. This phenomenon may support the popularity versus similarity theory [26], which claims that network links are trade-offs between node similarity and other network properties such as node popularity.

(Figure 2 panels, left to right: raw feature visualization, node2vec visualization, DeepLinker visualization.)
Figure 2: t-SNE visualization of the Cora graph from the raw features (left), the node2vec representations with the default parameter settings (middle), and the DeepLinker representations, where node features are extracted from the second layer of a pre-trained model (right). The clusters of the DeepLinker representations are clearly defined, with a Silhouette score of 0.38 compared with 0.09 for node2vec and 0.00 for the raw features. The NMI value of DeepLinker is 0.44, compared with 0.41 for the node2vec vectors and 0.13 for the raw features.
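A sketch of how such a comparison can be produced, assuming scikit-learn. Whether the Silhouette score and NMI are computed on the raw vectors or on the 2-D t-SNE coordinates, and how cluster assignments are obtained for NMI (here k-means with the known class count), are our assumptions rather than details stated in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, normalized_mutual_info_score

def embedding_quality(vectors, labels, n_classes=7, seed=0):
    """Project node vectors to 2-D and score class separation (assumed protocol)."""
    vectors = np.asarray(vectors)
    coords = TSNE(n_components=2, random_state=seed).fit_transform(vectors)
    sil = silhouette_score(vectors, labels)                                        # separation w.r.t. true classes
    clusters = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(vectors)
    nmi = normalized_mutual_info_score(labels, clusters)                           # agreement of clusters with labels
    return coords, sil, nmi
```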

Compared with the network representations learned from supervised node classification information, such as GAT, GCNs, and GraphSAGE, the representations learned from link information, such as DeepLinker and node2vec, contain richer structural information. Node representations learned from supervised node classification can decode only part of the hidden information. For example, in citation networks, papers belonging to the same subject generally have the same labels, so a network embedding based on node labels would decode only the similarity between nodes. However, the evolution of citation networks relies not only on papers' subjects and labels; other factors such as authors' fame and popularity also play important roles in network formation [45]. Besides, supervised link information is easier to acquire than supervised label information.

In order to evaluate the effectiveness of the proposed DeepLinker representations, we follow the commonly adopted setting and compare different representation algorithms on the node classification task over the Cora and Citeseer networks. In node classification, each node has a target label, and our goal is to build a predictive logistic regression model based on the training set and predict the correct labels of the test set. In particular, after extracting the node representations of a given graph, we randomly select some nodes to form the training data, use the training nodes' representations and labels to train a node classifier, and use this model to predict the labels of the remaining nodes. We repeat the classification experiment 10 times and report the average Micro-F1 value over the 10 runs.
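This evaluation protocol can be sketched as follows, assuming scikit-learn; the stratified split and the max_iter setting are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def classify_nodes(embeddings, labels, train_fraction=0.1, runs=10):
    """Train a logistic-regression classifier on a small labeled subset; report mean Micro-F1."""
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            embeddings, labels, train_size=train_fraction,
            random_state=seed, stratify=labels)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te), average='micro'))
    return float(np.mean(scores))
```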

We compare the node vectors learned from the link prediction task of DeepLinker (all ones) with the unsupervised GraphSAGE-mean and the widely used node2vec. In order to control variables, we fix the embedding dimension to 128 for all algorithms, and name them DeepLinker_128, GraphSAGE-mean_128, and node2vec_128. The first sub-graph in Figure 3 shows that on the Citeseer graph, DeepLinker outperforms node2vec and GraphSAGE-mean, especially when the training set is small. In the second sub-graph, when the training portion is less than 10%, DeepLinker also performs better than node2vec and GraphSAGE-mean. We believe that performing well when the training set is small is a very important property for the node classification task, because in real-world networks only a few graphs are labeled, and manually labeling a large number of nodes not only costs time and effort but also introduces biases.

Moreover, in order to improve the node classification accuracy, we increase the embedding dimension by concatenating the vectors learned from different unsupervised algorithms. As shown in Figure 3, the concatenation of DeepLinker and GraphSAGE-mean achieves the highest classification accuracy on Cora and Citeseer, and the combination of DeepLinker and node2vec also achieves higher classification accuracy than the combination of GraphSAGE-mean and node2vec. In fact, the concatenation of GraphSAGE-mean and node2vec performs even worse than the 128-dimensional DeepLinker on the Citeseer network.

Figure 3: Classification accuracy comparison under different representation learning methods.

5 Conclusion and discussion

In this paper, we propose a mini-batched link prediction model, DeepLinker, which is based on the graph attention architecture and a fixed-size sampling strategy. DeepLinker can extract meaningful vertex representations and achieves state-of-the-art link prediction accuracy. The byproducts of DeepLinker, the attention coefficients and the node vectors, show potential in node centrality measurement and node classification, especially when the labeled training set is small. DeepLinker outperforms other unsupervised node representation learning methods in node classification and node visualization tasks, which may alleviate the dependency on large labeled datasets. Therefore, we believe the link prediction task on graphs is like the language model in natural language processing, and the node representations learned by link prediction can be used for other downstream tasks in a wide range of areas.

Although adjusting the hyper-parameters requires much effort, we still believe that network representation based on link prediction can lead to both a quantitative and a qualitative leap in graph processing. And although DeepLinker has achieved quite high link prediction accuracy, we still cannot figure out the mechanism that leads to such good performance. Our future work will mainly focus on the hidden theory behind DeepLinker.

References

  • [1] Chen, X., Liu, M.-X. & Yan, G.-Y. Drug–target interaction prediction by random walk on the heterogeneous network. Molecular BioSystems 8, 1970–1978 (2012).
  • [2] Campillos, M., Kuhn, M., Gavin, A.-C., Jensen, L. J. & Bork, P. Drug target identification using side-effect similarity. Science 321, 263–266 (2008).
  • [3] Zhou, J., Zeng, A., Fan, Y. & Di, Z. Ranking scientific publications with similarity-preferential mechanism. Scientometrics 106, 805–816 (2016).
  • [4] Craven, M. et al. Learning to construct knowledge bases from the world wide web. Artificial intelligence 118, 69–113 (2000).
  • [5] Popescul, A. & Ungar, L. H. Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data, vol. 2003 (Citeseer, 2003).
  • [6] Liben-Nowell, D. & Kleinberg, J. The link-prediction problem for social networks. Journal of the American society for information science and technology 58, 1019–1031 (2007).
  • [7] Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. science 286, 509–512 (1999).
  • [8] Zhou, T., Lü, L. & Zhang, Y.-C. Predicting missing links via local information. The European Physical Journal B 71, 623–630 (2009).
  • [9] Liu, H., Hu, Z., Haddadi, H. & Tian, H. Hidden link prediction based on node centrality and weak ties. EPL (Europhysics Letters) 101, 18004 (2013).
  • [10] Rücker, G. Network meta-analysis, electrical networks and graph theory. Research Synthesis Methods 3, 312–324 (2012).
  • [11] Clauset, A., Moore, C. & Newman, M. E. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98 (2008).
  • [12] Huang, Z. Link prediction based on graph topology: The predictive value of the generalized clustering coefficient (2006).
  • [13] Perozzi, B., Al-Rfou, R. & Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710 (ACM, 2014).
  • [14] Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864 (ACM, 2016).
  • [15] Ou, M., Cui, P., Pei, J., Zhang, Z. & Zhu, W. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 1105–1114 (ACM, 2016).
  • [16] Tang, J. et al. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, 1067–1077 (International World Wide Web Conferences Steering Committee, 2015).
  • [17] Ribeiro, L. F., Saverese, P. H. & Figueiredo, D. R. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 385–394 (ACM, 2017).
  • [18] Gu, W., Gong, L., Lou, X. & Zhang, J. The hidden flow structure and metric space of network embedding algorithms based on random walks. Scientific reports 7, 13114 (2017).
  • [19] Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034 (2017).
  • [20] He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
  • [21] Gehring, J., Auli, M., Grangier, D. & Dauphin, Y. N. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344 (2016).
  • [22] Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • [23] Velickovic, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • [24] Defferrard, M., Bresson, X. & Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I. & Garnett, R. (eds.) Advances in Neural Information Processing Systems 29, 3844–3852 (Curran Associates, Inc., 2016).
  • [25] Chen, J., Ma, T. & Xiao, C. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247 (2018).
  • [26] Papadopoulos, F., Kitsak, M., Serrano, M. Á., Boguná, M. & Krioukov, D. Popularity versus similarity in growing networks. Nature 489, 537 (2012).
  • [27] Şimşek, Ö. & Jensen, D. Navigating networks by using homophily and degree. Proceedings of the National Academy of Sciences (2008).
  • [28] Wu, Y., Fu, T. Z. & Chiu, D. M. Generalized preferential attachment considering aging. Journal of Informetrics 8, 650–658 (2014).
  • [29] Byrd, R. H., Chin, G. M., Nocedal, J. & Wu, Y. Sample size selection in optimization methods for machine learning. Mathematical programming 134, 127–155 (2012).
  • [30] Li, M., Zhang, T., Chen, Y. & Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 661–670 (ACM, 2014).
  • [31] Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • [32] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • [33] Ke, Q. Zero2IPO research. https://www.pedata.cn/data/index.html (2014).
  • [34] Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256 (2010).
  • [35] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [36] Gu, W., Liu, J. et al. Exploring small-world network with an elite-clique: Bringing embeddedness theory into the dynamic evolution of a venture capital network. Social Networks 57, 70–81 (2019).
  • [37] Wang, D., Song, C. & Barabási, A.-L. Quantifying long-term scientific impact. Science 342, 127–132 (2013).
  • [38] Aral, S. & Walker, D. Identifying influential and susceptible members of social networks. Science 1215842 (2012).
  • [39] Nanalyze. The 'Best' Chinese Venture Capital Firms. https://www.nanalyze.com/2018/01/best-chinese-venture-capital-firms (2018).
  • [40] Heidemann, J., Klier, M. & Probst, F. Identifying key users in online social networks: A pagerank based approach (2010).
  • [41] Wang, Y., Tong, Y. & Zeng, M. Ranking scientific articles by exploiting citations, authors, journals, and time information. In AAAI (2013).
  • [42] Page, L., Brin, S., Motwani, R. & Winograd, T. The pagerank citation ranking: Bringing order to the web. Tech. Rep., Stanford InfoLab (1999).
  • [43] Freeman, L. C. Centrality in social networks conceptual clarification. Social networks 1, 215–239 (1978).
  • [44] Maaten, L. v. d. & Hinton, G. Visualizing data using t-sne. Journal of machine learning research 9, 2579–2605 (2008).
  • [45] Hunter, D., Smyth, P., Vu, D. Q. & Asuncion, A. U. Dynamic egocentric models for citation networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 857–864 (2011).