Code for SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction)
Traditional methods for link prediction can be categorized into three main types: graph structure feature-based, latent feature-based, and explicit feature-based. Graph structure feature methods leverage some handcrafted node proximity scores, e.g., common neighbors, to estimate the likelihood of links. Latent feature methods rely on factorizing networks' matrix representations to learn an embedding for each node. Explicit feature methods train a machine learning model on two nodes' explicit attributes. Each of the three types of methods has its unique merits. In this paper, we propose SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction), a new framework for link prediction which combines the power of all three types into a single graph neural network (GNN). A GNN is a new type of neural network which directly accepts graphs as input and outputs their labels. In SEAL, the input to the GNN is a local subgraph around each target link. We prove theoretically that our local subgraphs also preserve a great deal of high-order graph structure features related to link existence. Another key feature is that our GNN can naturally incorporate latent features and explicit features. This is achieved by concatenating node embeddings (latent features) and node attributes (explicit features) in the node information matrix for each subgraph, thus combining the three types of features to enhance GNN learning. Through extensive experiments, SEAL shows unprecedentedly strong performance against a wide range of baseline methods, including various link prediction heuristics and network embedding methods.
Link prediction is to predict whether two nodes in a network are likely to have a link. Given the ubiquitous existence of networks, it has many applications such as friend recommendation, movie recommendation, knowledge graph completion, and metabolic network reconstruction.
One class of simple yet effective approaches for link prediction is called heuristic methods. Heuristic methods compute some heuristic node similarity scores as the likelihood of links [1, 6]. Existing heuristics can be categorized based on the maximum hop of neighbors needed to calculate the score. For example, common neighbors (CN) and preferential attachment (PA) are first-order heuristics, since they only involve the one-hop neighbors of two target nodes. Adamic-Adar (AA) and resource allocation (RA) are second-order heuristics, as they are calculated from up to the two-hop neighborhood of the target nodes. We define $h$-order heuristics to be those heuristics which require knowing up to the $h$-hop neighborhood of the target nodes. There are also some high-order heuristics which require knowing the entire network. Examples include Katz, rooted PageRank (PR), and SimRank (SR). The heuristics table in Appendix A summarizes eight popular heuristics.
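As an illustration of the first-order and second-order heuristics named above, the following minimal sketch computes CN, PA, AA, and RA with their standard definitions. The dictionary-of-sets graph representation, the function names, and the toy graph are assumptions made for this example and are not taken from the SEAL code.

```python
import math

# Assumed representation: adj maps each node to the set of its 1-hop neighbors.

def common_neighbors(adj, x, y):
    """CN: number of shared 1-hop neighbors."""
    return len(adj[x] & adj[y])

def preferential_attachment(adj, x, y):
    """PA: product of the two node degrees."""
    return len(adj[x]) * len(adj[y])

def adamic_adar(adj, x, y):
    """AA: shared neighbors weighted by 1 / log(degree).

    A shared neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def resource_allocation(adj, x, y):
    """RA: shared neighbors weighted by 1 / degree."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

# Toy graph with edges a-b, a-c, a-d, b-c, c-d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

for name, fn in [("CN", common_neighbors), ("PA", preferential_attachment),
                 ("AA", adamic_adar), ("RA", resource_allocation)]:
    print(name, fn(adj, "b", "d"))
```

CN and PA only look at the neighbor sets of the two target nodes, while AA and RA also need the degrees of the shared neighbors, which is why they require the two-hop neighborhood.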
Although working well in practice, heuristic methods have strong assumptions on when links may exist. For example, the common neighbor heuristic assumes that two nodes are more likely to connect if they have many common neighbors. This assumption may be correct in social networks, but is shown to fail in protein-protein interaction (PPI) networks: two proteins sharing many common neighbors are actually less likely to interact.
In fact, the heuristics belong to a more generic class, namely graph structure features. Graph structure features are those features located inside the observed node and edge structures of the network, which can be calculated directly from the graph. Since heuristics can be viewed as predefined graph structure features, a natural idea is to automatically learn such features from the network. Zhang and Chen first studied this problem. They extract local enclosing subgraphs around links as the training data, and use a fully-connected neural network to learn which enclosing subgraphs correspond to link existence. Their method, called Weisfeiler-Lehman Neural Machine (WLNM), has achieved state-of-the-art link prediction performance. The enclosing subgraph for a node pair $(x, y)$ is the subgraph induced from the network by the union of $x$ and $y$'s neighbors up to $h$ hops. Figure 1 illustrates the 1-hop enclosing subgraphs for $(A, B)$ and $(C, D)$. These enclosing subgraphs are very informative for link prediction: all first-order heuristics such as common neighbors can be directly calculated from the 1-hop enclosing subgraphs.
However, it is shown that high-order heuristics such as rooted PageRank and Katz often have much better performance than first and second-order ones. To effectively learn good high-order features, it seems that we need a very large hop number $h$ so that the enclosing subgraph becomes the entire network. This results in unaffordable time and memory consumption for most practical networks. But do we really need such a large $h$ to learn high-order heuristics?
Fortunately, as our first contribution, we show that we do not necessarily need a very large $h$ to learn high-order graph structure features. We dive into the inherent mechanisms of link prediction heuristics, and find that most high-order heuristics can be unified by a $\gamma$-decaying theory. We prove that, under mild conditions, any $\gamma$-decaying heuristic can be effectively approximated from an $h$-hop enclosing subgraph, where the approximation error decreases at least exponentially with $h$. This means that we can safely use even a small $h$ to learn good high-order features. It also implies that the “effective order” of these high-order heuristics is not that high.
Based on our theoretical results, we propose a novel link prediction framework, SEAL, to learn general graph structure features from local enclosing subgraphs. SEAL fixes multiple drawbacks of WLNM. First, a graph neural network (GNN) [13, 14, 15, 16, 17] is used to replace the fully-connected neural network in WLNM, which enables better graph feature learning ability. Second, SEAL permits learning from not only subgraph structures, but also latent and explicit node features, thus absorbing multiple types of information. We empirically verified its much improved performance.
Our contributions are summarized as follows. 1) We present a new theory for learning link prediction heuristics, justifying learning from local subgraphs instead of entire networks. 2) We propose SEAL, a novel link prediction framework based on GNN (illustrated in Figure 1). SEAL outperforms all heuristic methods, latent feature methods, and recent network embedding methods by large margins. SEAL also outperforms the previous state-of-the-art method, WLNM.
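Notations Let $G = (V, E)$ be an undirected graph, where $V$ is the set of vertices and $E \subseteq V \times V$ is the set of observed links. Its adjacency matrix is $A$, where $A_{i,j} = 1$ if $(i, j) \in E$ and $A_{i,j} = 0$ otherwise. For any nodes $x, y \in V$, let $\Gamma(x)$ be the 1-hop neighbors of $x$, and $d(x, y)$ be the shortest path distance between $x$ and $y$. A walk $w = \langle v_0, \cdots, v_k \rangle$ is a sequence of nodes with $(v_i, v_{i+1}) \in E$. We use $|\langle v_0, \cdots, v_k \rangle|$ to denote the length of the walk $w$, which is $k$ here.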
Latent features and explicit features Besides graph structure features, latent features and explicit features are also studied for link prediction. Latent feature methods [3, 18, 19, 20] factorize some matrix representations of the network to learn a low-dimensional latent representation/embedding for each node. Examples include matrix factorization and the stochastic block model. Recently, a number of network embedding techniques have been proposed, such as DeepWalk, LINE and node2vec, which are also latent feature methods since they implicitly factorize some matrices too. Explicit features are often available in the form of node attributes, describing all kinds of side information about individual nodes. It is shown that combining graph structure features with latent features and explicit features can improve the performance [23, 24].
Graph neural networks Graph neural network (GNN) is a new type of neural network for learning over graphs [13, 14, 15, 16, 25, 26]. Here, we only briefly introduce the components of a GNN since this paper is not about GNN innovations but is a novel application of GNN. A GNN usually consists of 1) graph convolution layers which extract local substructure features for individual nodes, and 2) a graph aggregation layer which aggregates node-level features into a graph-level feature vector. Many graph convolution layers can be unified into a message passing framework.
Supervised heuristic learning There are some previous attempts to learn supervised heuristics for link prediction. The closest work to ours is the Weisfeiler-Lehman Neural Machine (WLNM), which also learns from local subgraphs. However, WLNM has several drawbacks. Firstly, WLNM trains a fully-connected neural network on the subgraphs’ adjacency matrices. Since fully-connected neural networks only accept fixed-size tensors as input, WLNM requires truncating different subgraphs to the same size, which may lose much structural information. Secondly, due to the limitation of adjacency matrix representations, WLNM cannot learn from latent or explicit features. Thirdly, theoretical justifications are also missing. We include more discussion on WLNM in Appendix D. Another related line of research is to train a supervised learning model on combinations of different heuristics. For example, the path ranking algorithm belongs to this line, and Nickel et al. propose to incorporate heuristic features into tensor factorization models. However, these models still rely on predefined heuristics: they cannot learn general graph structure features.
In this section, we aim to understand more deeply the mechanisms behind various link prediction heuristics, thus motivating the idea of learning heuristics from local subgraphs. Given the large number of graph learning techniques, we are not concerned with the generalization error of a particular method, but focus on the information preserved in the subgraphs for calculating existing heuristics.
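Definition 1. (Enclosing subgraph) For a graph $G = (V, E)$, given two nodes $x, y \in V$, the $h$-hop enclosing subgraph for $(x, y)$ is the subgraph $G^h_{x,y}$ induced from $G$ by the set of nodes $\{\, i \mid d(i, x) \le h \text{ or } d(i, y) \le h \,\}$.
The enclosing subgraph describes the “$h$-hop surrounding environment” of $(x, y)$. Since $G^h_{x,y}$ contains all $h$-hop neighbors of $x$ and $y$, we naturally have the following theorem.
Theorem 1. Any $h$-order heuristic for $(x, y)$ can be accurately calculated from $G^h_{x,y}$.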
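The enclosing subgraph definition can be sketched with two breadth-first searches. This is a minimal illustration under assumed conventions (a dictionary-of-sets adjacency structure and the hypothetical helper names `k_hop_nodes` and `enclosing_subgraph`); it is not the extraction routine of the released SEAL code.

```python
from collections import deque

def k_hop_nodes(adj, source, h):
    """Return all nodes whose shortest-path distance to `source` is at most h."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue  # do not expand beyond h hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def enclosing_subgraph(adj, x, y, h):
    """Induced subgraph on nodes within h hops of x or within h hops of y."""
    nodes = k_hop_nodes(adj, x, h) | k_hop_nodes(adj, y, h)
    return {u: adj[u] & nodes for u in nodes}

# Path graph 0-1-2-3-4-5.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(enclosing_subgraph(path, 2, 3, 1))
```

On the path graph, the 1-hop enclosing subgraph of the pair (2, 3) keeps nodes 1 to 4 and drops the two end nodes, which are two hops away from both targets.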
For example, a 2-hop enclosing subgraph will contain all the information needed to calculate any first and second-order heuristics. However, although first and second-order heuristics are well covered by local enclosing subgraphs, an extremely large $h$ seems to be still needed for learning high-order heuristics. Surprisingly, our following analysis shows that learning high-order heuristics is also feasible with a small $h$. We support this first by defining the $\gamma$-decaying heuristic. We will show that under certain conditions, a $\gamma$-decaying heuristic can be very well approximated from the $h$-hop enclosing subgraph. Moreover, we will show that almost all well-known high-order heuristics can be unified into this $\gamma$-decaying heuristic framework.
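Definition 2. ($\gamma$-decaying heuristic) A $\gamma$-decaying heuristic for $(x, y)$ has the following form:
$$\mathcal{H}(x, y) = \eta \sum_{l=1}^{\infty} \gamma^{l} f(x, y, l),$$
where $\gamma$ is a decaying factor between 0 and 1, $\eta$ is a positive constant or a positive function of $\gamma$ that is upper bounded by a constant, and $f$ is a nonnegative function of $x, y, l$ under the given network.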
Next, we will show that under certain conditions, a $\gamma$-decaying heuristic can be approximated from an $h$-hop enclosing subgraph, and the approximation error decreases at least exponentially with $h$.
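Theorem 2. Given a $\gamma$-decaying heuristic $\mathcal{H}(x, y) = \eta \sum_{l=1}^{\infty} \gamma^{l} f(x, y, l)$, if $f(x, y, l)$ satisfies:
(property 1) $f(x, y, l) \le \lambda^{l}$ where $\lambda < \frac{1}{\gamma}$; and
(property 2) $f(x, y, l)$ is calculable from $G^h_{x,y}$ for $l = 1, 2, \cdots, g(h)$, where $g(h) = ah + b$ with $a, b \in \mathbb{N}$ and $a > 0$,
then $\mathcal{H}(x, y)$ can be approximated from $G^h_{x,y}$ and the approximation error decreases at least exponentially with $h$.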
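Proof. We can approximate such a $\gamma$-decaying heuristic by summing over its first $g(h)$ terms:
$$\widetilde{\mathcal{H}}(x, y) := \eta \sum_{l=1}^{g(h)} \gamma^{l} f(x, y, l).$$
The approximation error can be bounded as follows:
$$\left| \mathcal{H}(x, y) - \widetilde{\mathcal{H}}(x, y) \right| = \eta \sum_{l=g(h)+1}^{\infty} \gamma^{l} f(x, y, l) \le \eta \sum_{l=ah+b+1}^{\infty} \gamma^{l} \lambda^{l} = \eta (\gamma\lambda)^{ah+b+1} (1 - \gamma\lambda)^{-1}. \qquad ∎$$
In practice, a small $\gamma\lambda$ and a large $a$ lead to a faster decreasing speed. Next we will prove that three popular high-order heuristics, Katz, rooted PageRank and SimRank, are all $\gamma$-decaying heuristics which satisfy the properties in Theorem 2. First, we need the following lemma.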
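Lemma 1. Any walk between $x$ and $y$ with length $l \le 2h + 1$ is included in $G^h_{x,y}$.
Proof. Given any walk $w = \langle x, v_1, \cdots, v_{l-1}, y \rangle$ with length $l$, we will show that every node $v_i$ is included in $G^h_{x,y}$. Consider any $v_i$. Assume $d(v_i, x) \ge h + 1$ and $d(v_i, y) \ge h + 1$. Then, $2h + 1 \ge l = |\langle x, v_1, \cdots, v_i \rangle| + |\langle v_i, \cdots, v_{l-1}, y \rangle| \ge d(v_i, x) + d(v_i, y) \ge 2h + 2$, a contradiction. Thus, $d(v_i, x) \le h$ or $d(v_i, y) \le h$. By the definition of $G^h_{x,y}$, $v_i$ must be included in $G^h_{x,y}$. ∎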
Next we will analyze Katz, rooted PageRank and SimRank one by one.
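The Katz index for $(x, y)$ is defined as
$$\mathrm{Katz}_{x,y} = \sum_{l=1}^{\infty} \beta^{l} \left| \mathrm{walks}^{\langle l \rangle}(x, y) \right| = \sum_{l=1}^{\infty} \beta^{l} [A^{l}]_{x,y},$$
where $\mathrm{walks}^{\langle l \rangle}(x, y)$ is the set of length-$l$ walks between $x$ and $y$, and $A^{l}$ is the $l^{\text{th}}$ power of the adjacency matrix of the network. The Katz index sums over the collection of all walks between $x$ and $y$, where a walk of length $l$ is damped by $\beta^{l}$ ($0 < \beta < 1$), giving more weight to shorter walks.
The Katz index is directly defined in the form of a $\gamma$-decaying heuristic with $\eta = 1$, $\gamma = \beta$, and $f(x, y, l) = |\mathrm{walks}^{\langle l \rangle}(x, y)|$. According to Lemma 1, $|\mathrm{walks}^{\langle l \rangle}(x, y)|$ is calculable from $G^h_{x,y}$ for $l \le 2h + 1$, thus property 2 in Theorem 2 is satisfied. Now we show when property 1 is satisfied.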
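Proposition 1. For any nodes $i, j$, $[A^{l}]_{i,j}$ is bounded by $d^{l}$, where $d$ is the maximum node degree of the network.
Proof. We prove it by induction. When $l = 1$, $A_{i,j} \le d$ for any $(i, j)$. Thus the base case is correct. Now, assume by induction that $[A^{l}]_{i,j} \le d^{l}$ for any $(i, j)$; we have
$$[A^{l+1}]_{i,j} = \sum_{k=1}^{|V|} [A^{l}]_{i,k} A_{k,j} \le d^{l} \sum_{k=1}^{|V|} A_{k,j} \le d^{l} d = d^{l+1}. \qquad ∎$$
Taking $\lambda = d$, property 1 in Theorem 2 is satisfied whenever $d < 1/\beta$.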
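To make the Katz case concrete, the sketch below truncates the Katz series after a fixed number of terms and compares the result with a much longer partial sum. Because $[A^l]_{i,j} \le d^l$, the tail after $L$ terms is at most the geometric tail $(\beta d)^{L+1} / (1 - \beta d)$ whenever $\beta d < 1$. The pure-Python matrix helpers, the 4-cycle test graph, and the choice $\beta = 0.1$ are assumptions made only for this example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_truncated(adj_matrix, beta, num_terms):
    """Sum of beta^l * A^l for l = 1 .. num_terms."""
    n = len(adj_matrix)
    power = [row[:] for row in adj_matrix]          # holds A^l, starting at l = 1
    total = [[0.0] * n for _ in range(n)]
    for l in range(1, num_terms + 1):
        for i in range(n):
            for j in range(n):
                total[i][j] += (beta ** l) * power[i][j]
        power = mat_mul(power, adj_matrix)           # advance to A^(l+1)
    return total

# 4-cycle: every node has degree d = 2.
A = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
beta, d = 0.1, 2
L = 3                                                # walks of length <= 2h + 1 with h = 1
approx = katz_truncated(A, beta, L)
reference = katz_truncated(A, beta, 60)              # long partial sum used as a stand-in for the full series
bound = (beta * d) ** (L + 1) / (1 - beta * d)
worst = max(reference[i][j] - approx[i][j] for i in range(4) for j in range(4))
print("largest truncation error:", worst, "bound:", bound)
```

With $h = 1$ the truncation already keeps every walk of length up to 3, and increasing $h$ by one multiplies the bound by $(\beta d)^2$.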
The rooted PageRank for node $x$ calculates the stationary distribution of a random walker starting at $x$, who iteratively moves to a random neighbor of its current position with probability $\alpha$ or returns to $x$ with probability $1 - \alpha$. Let $\pi_x$ denote the stationary distribution vector. Let $[\pi_x]_i$ denote the probability that the random walker is at node $i$ under the stationary distribution.
Let $P$ be the transition matrix with $P_{i,j} = \frac{1}{|\Gamma(v_j)|}$ if $(i, j) \in E$ and $P_{i,j} = 0$ otherwise. Let $e_x$ be a vector with the $x^{\text{th}}$ element being $1$ and others being $0$. The stationary distribution satisfies
$$\pi_x = \alpha P \pi_x + (1 - \alpha) e_x.$$
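When used for link prediction, the score for $(x, y)$ is given by $[\pi_x]_y$ (or $[\pi_x]_y + [\pi_y]_x$ for symmetry). To show that rooted PageRank is a $\gamma$-decaying heuristic, we introduce the inverse P-distance theory, which states that $[\pi_x]_y$ can be equivalently written as follows:
$$[\pi_x]_y = (1 - \alpha) \sum_{w:\, x \rightsquigarrow y} P[w]\, \alpha^{\mathrm{len}(w)},$$
where the summation is taken over all walks $w$ starting at $x$ and ending at $y$ (possibly touching $x$ and $y$ multiple times). For a walk $w = \langle v_0, v_1, \cdots, v_k \rangle$, $\mathrm{len}(w) := |\langle v_0, v_1, \cdots, v_k \rangle|$ is the length of the walk. The term $P[w]$ is defined as $\prod_{i=0}^{k-1} \frac{1}{|\Gamma(v_i)|}$, which can be interpreted as the probability of traveling $w$. Now we have the following theorem.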
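Theorem 3. The rooted PageRank heuristic is a $\gamma$-decaying heuristic which satisfies the properties in Theorem 2.
Proof. We first write $[\pi_x]_y$ in the following form:
$$[\pi_x]_y = (1 - \alpha) \sum_{l=1}^{\infty} \sum_{\substack{w:\, x \rightsquigarrow y \\ \mathrm{len}(w) = l}} P[w]\, \alpha^{l}.$$
Defining $f(x, y, l) := \sum_{w:\, x \rightsquigarrow y,\ \mathrm{len}(w) = l} P[w]$ leads to the form of a $\gamma$-decaying heuristic. Note that $f(x, y, l)$ is the probability that a random walker starting at $x$ stops at $y$ with exactly $l$ steps, which satisfies $\sum_{z \in V} f(x, z, l) = 1$. Thus, $f(x, y, l) \le 1 < \frac{1}{\alpha}$ (property 1). According to Lemma 1, $f(x, y, l)$ is also calculable from $G^h_{x,y}$ for $l \le 2h + 1$ (property 2). ∎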
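The stationary equation $\pi_x = \alpha P \pi_x + (1 - \alpha) e_x$ can be solved numerically by fixed-point iteration. The following is a minimal sketch, assuming a dictionary-of-sets graph with no isolated nodes; the function name, the value $\alpha = 0.85$, and the iteration count are illustrative choices rather than settings taken from the paper.

```python
def rooted_pagerank(adj, x, alpha=0.85, num_iters=200):
    """Iterate pi <- alpha * P * pi + (1 - alpha) * e_x, starting from e_x.

    P[i][j] = 1 / degree(j) if (i, j) is an edge, so each node j spreads its
    probability mass evenly over its neighbors. Assumes every node has degree >= 1.
    """
    pi = {v: (1.0 if v == x else 0.0) for v in adj}
    for _ in range(num_iters):
        new_pi = {v: 0.0 for v in adj}
        for j in adj:
            share = alpha * pi[j] / len(adj[j])
            for i in adj[j]:
                new_pi[i] += share
        new_pi[x] += 1.0 - alpha        # restart at the root node
        pi = new_pi
    return pi

adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
pi_b = rooted_pagerank(adj, "b")
pi_d = rooted_pagerank(adj, "d")
print("score for (b, d):", pi_b["d"] + pi_d["b"])   # symmetric variant of the score
```

Each iteration shrinks the distance to the fixed point by a factor of $\alpha$, which mirrors the geometric damping of long walks in the inverse P-distance form.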
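The SimRank score is motivated by the intuition that two nodes are similar if their neighbors are also similar. It is defined in the following recursive way: if $x = y$, then $s(x, y) := 1$; otherwise,
$$s(x, y) := \gamma\, \frac{\sum_{a \in \Gamma(x)} \sum_{b \in \Gamma(y)} s(a, b)}{|\Gamma(x)| \cdot |\Gamma(y)|},$$
where $\gamma$ is a constant between 0 and 1. According to , SimRank has an equivalent definition:
$$s(x, y) = \sum_{w:\, (x, y) \multimap (z, z)} P[w]\, \gamma^{\mathrm{len}(w)},$$
where $w: (x, y) \multimap (z, z)$ denotes all simultaneous walks such that one walk starts at $x$, the other walk starts at $y$, and they first meet at any vertex $z$. For a simultaneous walk $w = \langle (v_0, u_0), \cdots, (v_k, u_k) \rangle$, $\mathrm{len}(w) = k$ is the length of the walk. The term $P[w]$ is similarly defined as $\prod_{i=0}^{k-1} \frac{1}{|\Gamma(v_i)|\,|\Gamma(u_i)|}$, describing the probability of this walk. Now we have the following theorem.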
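Theorem 4. SimRank is a $\gamma$-decaying heuristic which satisfies the properties in Theorem 2.
Proof. We write $s(x, y)$ as follows:
$$s(x, y) = \sum_{l=1}^{\infty} \sum_{\substack{w:\, (x, y) \multimap (z, z) \\ \mathrm{len}(w) = l}} P[w]\, \gamma^{l}.$$
Defining $f(x, y, l) := \sum_{w:\, (x, y) \multimap (z, z),\ \mathrm{len}(w) = l} P[w]$ reveals that SimRank is a $\gamma$-decaying heuristic. Note that $f(x, y, l) \le 1 < \frac{1}{\gamma}$. It is easy to see that $f(x, y, l)$ is also calculable from $G^h_{x,y}$ for $l \le h$. ∎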
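The recursive SimRank definition can be evaluated by repeated substitution, starting from $s(x, x) = 1$ and $s(x, y) = 0$ for $x \ne y$. The sketch below is a naive all-pairs iteration meant only to illustrate the recursion; the function name, the value $\gamma = 0.8$, and the star-graph example are assumptions for this illustration.

```python
def simrank(adj, gamma=0.8, num_iters=10):
    """Naive SimRank: s(x, x) = 1, otherwise gamma times the average similarity
    over all pairs of neighbors of x and y."""
    nodes = list(adj)
    s = {(a, b): (1.0 if a == b else 0.0) for a in nodes for b in nodes}
    for _ in range(num_iters):
        new_s = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_s[(a, b)] = 1.0
                elif not adj[a] or not adj[b]:
                    new_s[(a, b)] = 0.0      # a node without neighbors has no similar pairs
                else:
                    total = sum(s[(u, v)] for u in adj[a] for v in adj[b])
                    new_s[(a, b)] = gamma * total / (len(adj[a]) * len(adj[b]))
        s = new_s
    return s

# Star graph: center 0 with leaves 1, 2, 3.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
scores = simrank(star)
print(scores[(1, 2)], scores[(0, 1)])
```

Two leaves of the star share their only neighbor, so their simultaneous walks meet after one step and their score equals $\gamma$; the center and a leaf can never meet at the same step, so their score stays 0.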
Discussion There exist several other high-order heuristics based on path counting or random walk which can also be incorporated into the $\gamma$-decaying heuristic framework. We omit the analysis here. Our results reveal that most high-order heuristics inherently share the same $\gamma$-decaying heuristic form, and thus can be effectively approximated from an $h$-hop enclosing subgraph with exponentially smaller approximation error. We believe the ubiquity of $\gamma$-decaying heuristics is not by accident: it implies that a successful link prediction heuristic should put exponentially smaller weight on structures far away from the target, as remote parts of the network intuitively make little contribution to link existence. Our results build the foundation for learning heuristics from local subgraphs, as they imply that local enclosing subgraphs already contain enough information to learn good graph structure features for link prediction, which is highly desirable since learning from the entire network is often infeasible. To summarize, from the small enclosing subgraphs extracted around links, we are able to accurately calculate first and second-order heuristics, and approximate a wide range of high-order heuristics with small errors. Therefore, given adequate feature learning ability of the model used, learning from such enclosing subgraphs is expected to achieve performance at least as good as a wide range of heuristics. There is some related work which empirically verifies that local methods can often estimate PageRank and SimRank well [31, 32]. Another related theoretical work establishes a condition on $h$ to achieve some fixed approximation error for ordinary PageRank.
In this section, we describe our SEAL framework for link prediction. SEAL does not restrict the learned features to be in some particular forms such as $\gamma$-decaying heuristics, but instead learns general graph structure features for link prediction. It contains three steps: 1) enclosing subgraph extraction, 2) node information matrix construction, and 3) GNN learning. Given a network, we aim to automatically learn a “heuristic” that best explains the link formations. Motivated by the theoretical results, this function takes local enclosing subgraphs around links as input, and outputs how likely the links are to exist. To learn such a function, we train a graph neural network (GNN) over the enclosing subgraphs. Thus, the first step in SEAL is to extract enclosing subgraphs for a set of sampled positive links (observed) and a set of sampled negative links (unobserved) to construct the training data.
A GNN typically takes $(A, X)$ as input, where $A$ (with slight abuse of notation) is the adjacency matrix of the input enclosing subgraph, and $X$ is the node information matrix, each row of which corresponds to a node’s feature vector. The second step in SEAL is to construct the node information matrix $X$ for each enclosing subgraph. This step is crucial for training a successful GNN link prediction model. In the following, we discuss this key step. The node information matrix $X$ in SEAL has three components: structural node labels, node embeddings and node attributes.
The first component in $X$ is each node’s structural label. A node labeling is a function $f_l: V \to \mathbb{N}$ which assigns an integer label $f_l(i)$ to every node $i$ in the enclosing subgraph. The purpose is to use different labels to mark nodes’ different roles in an enclosing subgraph: 1) The center nodes $x$ and $y$ are the target nodes between which the link is located. 2) Nodes with different relative positions to the center have different structural importance to the link. A proper node labeling should mark such differences. If we do not mark such differences, GNNs will not be able to tell where the target nodes are between which a link existence should be predicted, and will lose structural information.
Our node labeling method is derived from the following criteria: 1) The two target nodes $x$ and $y$ always have the distinctive label “1”. 2) Nodes $i$ and $j$ have the same label if $d(i, x) = d(j, x)$ and $d(i, y) = d(j, y)$. The second criterion is because, intuitively, a node $i$’s topological position within an enclosing subgraph can be described by its radius with respect to the two center nodes, namely $(d(i, x), d(i, y))$. Thus, we let nodes on the same orbit have the same label, so that the node labels can reflect nodes’ relative positions and structural importance within subgraphs.
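Based on the above criteria, we propose a Double-Radius Node Labeling (DRNL) as follows. First, assign label 1 to $x$ and $y$. Then, for any node $i$ with $(d(i, x), d(i, y)) = (1, 1)$, assign label $f_l(i) = 2$. Nodes with radius $(1, 2)$ or $(2, 1)$ get label 3. Nodes with radius $(1, 3)$ or $(3, 1)$ get 4. Nodes with $(2, 2)$ get 5. Nodes with $(1, 4)$ or $(4, 1)$ get 6. Nodes with $(2, 3)$ or $(3, 2)$ get 7. So on and so forth. In other words, we iteratively assign larger labels to nodes with a larger radius w.r.t. both center nodes, where the label $f_l(i)$ and the double-radius $(d(i, x), d(i, y))$ satisfy
1) if $d(i, x) + d(i, y) \ne d(j, x) + d(j, y)$, then $d(i, x) + d(i, y) < d(j, x) + d(j, y) \Leftrightarrow f_l(i) < f_l(j)$;
2) if $d(i, x) + d(i, y) = d(j, x) + d(j, y)$, then $d(i, x)\, d(i, y) < d(j, x)\, d(j, y) \Leftrightarrow f_l(i) < f_l(j)$.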
One advantage of DRNL is that it has a perfect hashing function
$$f_l(i) = 1 + \min(d_x, d_y) + (d/2)\left[(d/2) + (d\%2) - 1\right],$$
where $d_x := d(i, x)$, $d_y := d(i, y)$, $d := d_x + d_y$, and $(d/2)$ and $(d\%2)$ are the integer quotient and remainder of $d$ divided by $2$, respectively. This perfect hashing allows fast closed-form computations.
For nodes with $d(i, x) = \infty$ or $d(i, y) = \infty$, we give them a null label 0. Note that DRNL is not the only possible way of node labeling, but we empirically verified its better performance than no labeling and other naive labelings. We discuss more about node labeling in Appendix B. After getting the labels, we use their one-hot encoding vectors to construct $X$.
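The DRNL hashing function translates directly into a few lines of code. In this minimal sketch the function names are hypothetical, unreachable distances are represented by `None`, and the two target nodes are handled by an explicit check on a zero distance; these conventions are assumptions of the example rather than details of the released implementation.

```python
def drnl_label(dx, dy):
    """Double-Radius Node Labeling for a node at distances (dx, dy) from x and y.

    Convention assumed here: None marks an unreachable target (null label 0),
    and a distance of 0 marks a target node itself (label 1).
    """
    if dx is None or dy is None:
        return 0
    if dx == 0 or dy == 0:
        return 1
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * (d // 2 + d % 2 - 1)

def one_hot(label, num_labels):
    """One-hot encode an integer label into a list of length num_labels."""
    vec = [0] * num_labels
    vec[label] = 1
    return vec

for radius in [(1, 1), (1, 2), (1, 3), (2, 2), (1, 4), (2, 3)]:
    print(radius, drnl_label(*radius))
print(one_hot(drnl_label(1, 2), 8))
```

The loop reproduces the labels 2, 3, 4, 5, 6, 7 listed in the text for the corresponding double-radii, and the hash is symmetric in its two arguments.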
Other than the structural node labels, the node information matrix $X$ also provides an opportunity to include latent and explicit features. By concatenating each node’s embedding/attribute vector to its corresponding row in $X$, we can make SEAL simultaneously learn from all three types of features.
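The row-wise concatenation described above can be sketched as follows. The function name and argument layout are hypothetical; the sketch also assumes that the one-hot width `num_labels` is fixed in advance so that rows from different subgraphs have the same length.

```python
def build_node_information_matrix(labels, num_labels, embeddings=None, attributes=None):
    """Build X row by row: [one-hot structural label | embedding | attributes].

    labels:      list of integer structural labels, one per subgraph node
    num_labels:  width of the one-hot block (assumed fixed across subgraphs)
    embeddings:  optional list of per-node embedding vectors (latent features)
    attributes:  optional list of per-node attribute vectors (explicit features)
    """
    rows = []
    for idx, label in enumerate(labels):
        row = [1.0 if k == label else 0.0 for k in range(num_labels)]
        if embeddings is not None:
            row.extend(embeddings[idx])
        if attributes is not None:
            row.extend(attributes[idx])
        rows.append(row)
    return rows

labels = [1, 1, 2, 0]                       # two target nodes, one (1, 1) node, one null-labeled node
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
X = build_node_information_matrix(labels, num_labels=3, embeddings=embeddings)
for row in X:
    print(row)
```

Leaving `embeddings` and `attributes` as `None` gives the structure-only variant of $X$, which contains nothing but the one-hot labels.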
Generating the node embeddings for SEAL is nontrivial. Suppose we are given the observed network $G = (V, E)$, a set of sampled positive training links $E_p \subseteq E$, and a set of sampled negative training links $E_n$ with $E_n \cap E = \emptyset$. If we directly generate node embeddings on $G$, the node embeddings will record the link existence information of the training links (since $E_p \subseteq E$). We observed that GNNs can quickly find out such link existence information and optimize by only fitting this part of information. This results in bad generalization performance in our experiments. Our trick is to temporarily add $E_n$ into $E$, and generate the embeddings on $G' = (V, E \cup E_n)$. This way, the positive and negative training links will have the same link existence information recorded in the embeddings, so that the GNN cannot classify links by only fitting this part of information. We empirically verified the much improved performance of this trick to SEAL. We name this trick negative injection.
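Negative injection amounts to building a temporary edge set before running the embedding method. The sketch below only constructs that edge set; the function name is hypothetical, and the embedding step itself (for example node2vec) is left out because any embedding method could be run on the result.

```python
def inject_negative_links(observed_edges, negative_train_links):
    """Return the edge set of G' = (V, E ∪ E_n) for an undirected graph.

    Edges are stored as ordered pairs (u, v) with u <= v so that (u, v) and
    (v, u) count as the same link. Raises if a "negative" link is actually observed.
    """
    def canon(u, v):
        return (u, v) if u <= v else (v, u)

    observed = {canon(u, v) for u, v in observed_edges}
    injected = {canon(u, v) for u, v in negative_train_links}
    overlap = observed & injected
    if overlap:
        raise ValueError(f"negative links must be unobserved, got {sorted(overlap)}")
    return observed | injected

E = [(1, 2), (2, 3), (3, 4)]
E_n = [(4, 1), (1, 3)]
print(sorted(inject_negative_links(E, E_n)))
# Node embeddings would then be generated on this temporary edge set,
# while the enclosing subgraphs are still extracted from the original graph.
```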
We name our proposed framework SEAL (learning from Subgraphs, Embeddings and Attributes for Link prediction), emphasizing its ability to jointly learn from three types of features.
We conduct extensive experiments to evaluate SEAL. Our results show that SEAL is a superb and robust framework for link prediction, achieving unprecedentedly strong performance on various networks. We use AUC and average precision (AP) as evaluation metrics. We run all experiments 10 times and report the average AUC results and standard deviations. We leave the AP and time results to Appendix F. SEAL is flexible with what GNN or node embeddings to use. Thus, we choose a recent architecture, DGCNN, as the default GNN, and node2vec as the default embeddings. The code and data are available at https://github.com/muhanzhang/SEAL.
Datasets The eight datasets used are: USAir, NS, PB, Yeast, C.ele, Power, Router, and E.coli (please see Appendix C for details). We randomly remove 10% existing links from each dataset as positive testing data. Following a standard manner of learning-based link prediction, we randomly sample the same number of nonexistent links (unconnected node pairs) as negative testing data. We use the remaining 90% existing links as well as the same number of additionally sampled nonexistent links to construct the training data.
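The data split described above can be sketched as follows: 10% of the observed links are held out as positive test links, and an equal number of unconnected node pairs is sampled as negatives for both the training and the test portion. The function name, the fixed random seed, and the rejection-sampling strategy are assumptions of this example.

```python
import random

def split_links(nodes, edges, test_fraction=0.1, seed=0):
    """Split observed links into train/test positives and sample matching negatives."""
    rng = random.Random(seed)
    edges = sorted({tuple(sorted(e)) for e in edges})   # undirected, deduplicated
    nodes = sorted(nodes)
    num_pairs = len(nodes) * (len(nodes) - 1) // 2
    if num_pairs - len(edges) < len(edges):
        raise ValueError("not enough unconnected pairs to sample negatives")

    rng.shuffle(edges)
    num_test = int(round(test_fraction * len(edges)))
    test_pos, train_pos = edges[:num_test], edges[num_test:]

    existing = set(edges)
    negatives = set()
    while len(negatives) < len(edges):                  # rejection sampling of non-edges
        u, v = rng.sample(nodes, 2)
        pair = (u, v) if u < v else (v, u)
        if pair not in existing:
            negatives.add(pair)
    negatives = sorted(negatives)
    rng.shuffle(negatives)
    test_neg, train_neg = negatives[:num_test], negatives[num_test:]
    return train_pos, test_pos, train_neg, test_neg

# Ring graph with 20 nodes and 20 edges.
nodes = list(range(20))
edges = [(i, (i + 1) % 20) for i in range(20)]
train_pos, test_pos, train_neg, test_neg = split_links(nodes, edges)
print(len(train_pos), len(test_pos), len(train_neg), len(test_neg))
```

In a full pipeline the held-out positive test links would also be removed from the graph before extracting subgraphs or computing embeddings, so that test information does not leak into training.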
Comparison to heuristic methods We first compare SEAL with methods that only use graph structure features. We include eight popular heuristics (shown in the heuristics table in Appendix A): common neighbors (CN), Jaccard, preferential attachment (PA), Adamic-Adar (AA), resource allocation (RA), Katz, PageRank (PR), and SimRank (SR). We additionally include Ensemble (ENS) which trains a logistic regression classifier on the eight heuristic scores. We also include two heuristic learning methods: Weisfeiler-Lehman graph kernel (WLK) and WLNM, which also learn from (truncated) enclosing subgraphs. We omit path ranking methods as well as other recent methods which are specifically designed for knowledge graphs or recommender systems [23, 35]. As all the baselines only use graph structure features, we restrict SEAL to not include any latent or explicit features. In SEAL, the hop number $h$ is an important hyperparameter. Here, we select $h$ only from $\{1, 2\}$, since on one hand we empirically verified that the performance typically does not increase after $h \ge 3$, which validates our theoretical results that the most useful information is within local structures. On the other hand, even $h = 3$ sometimes results in very large subgraphs if a hub node is included. This raises the idea of sampling nodes in subgraphs, which we leave to future work. The selection principle is very simple: if the second-order heuristic AA outperforms the first-order heuristic CN on 10% validation data, then we choose $h = 2$; otherwise we choose $h = 1$. For datasets PB and E.coli, we consistently use $h = 1$ to fit into the memory. We include more details about the baselines and hyperparameters in Appendix D.
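The hop selection rule can be written as a small comparison of validation scores. The sketch below assumes that "outperforms" is measured by AUC on the validation links, which the text does not state explicitly; the function names and the example scores are likewise made up for illustration.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive link scores above a negative one
    (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def choose_hop(cn_pos, cn_neg, aa_pos, aa_neg):
    """Pick h = 2 if Adamic-Adar beats common neighbors on validation data, else h = 1."""
    return 2 if auc(aa_pos, aa_neg) > auc(cn_pos, cn_neg) else 1

# Made-up validation scores for positive and negative links.
cn_pos, cn_neg = [1, 1], [1, 0]
aa_pos, aa_neg = [0.9, 0.8], [0.5, 0.1]
print(auc(cn_pos, cn_neg), auc(aa_pos, aa_neg), choose_hop(cn_pos, cn_neg, aa_pos, aa_neg))
```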
Table 1 shows the results. Firstly, we observe that methods which learn from enclosing subgraphs (WLK, WLNM and SEAL) generally perform much better than predefined heuristics. This indicates that the learned “heuristics” are better at capturing the network properties than manually designed ones. Among learning-based methods, SEAL has the best performance, demonstrating GNN’s superior graph feature learning ability over graph kernels and fully-connected neural networks. From the results on Power and Router, we can see that although existing heuristics perform similarly to random guess, learning-based methods still maintain high performance. This suggests that we can even discover new “heuristics” for networks where no existing heuristics work.
Comparison to latent feature methods Next we compare SEAL with six state-of-the-art latent feature methods: matrix factorization (MF), stochastic block model (SBM), node2vec (N2V), LINE, spectral clustering (SPC), and variational graph auto-encoder (VGAE). Among them, VGAE uses a GNN too. Please note the difference between VGAE and SEAL: VGAE uses a node-level GNN to learn node embeddings that best reconstruct the network, while SEAL uses a graph-level GNN to classify enclosing subgraphs. Therefore, VGAE still belongs to latent feature methods. For SEAL, we additionally include the 128-dimensional node2vec embeddings in the node information matrix $X$. Since the datasets do not have node attributes, explicit features are not included.
Table 2 shows the results. As we can see, SEAL shows significant improvement over latent feature methods. One reason is that SEAL learns from both graph structures and latent features simultaneously, thus augmenting those methods that only use latent features. We observe that SEAL with node2vec embeddings outperforms pure node2vec by large margins. This implies that network embeddings alone may not be able to capture the most useful link prediction information located in the local structures. It is also interesting that compared to SEAL without node2vec embeddings (Table 1), joint learning does not always improve the performance. More experiments and discussion are included in Appendix F.
Learning link prediction heuristics automatically is a new field. In this paper, we presented theoretical justifications for learning from local enclosing subgraphs. In particular, we proposed a $\gamma$-decaying theory to unify a wide range of high-order heuristics and prove their approximability from local subgraphs. Motivated by the theory, we proposed a novel link prediction framework, SEAL, to simultaneously learn from local enclosing subgraphs, embeddings and attributes based on graph neural networks. Experimentally we showed that SEAL achieved unprecedentedly strong performance by comparing to various heuristics, latent feature methods, and network embedding algorithms. We hope SEAL can not only inspire link prediction research, but also open up new directions for other relational machine learning problems such as knowledge graph completion and recommender systems.
The work is supported in part by the III-1526012 and SCH-1622678 grants from the National Science Foundation and grant 1R21HS024581 from the National Institute of Health.
Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
An end-to-end deep learning architecture for graph classification. In AAAI, pages 4438–4445, 2018a.
Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
In this section, we discuss more about the differences among the three types of commonly used features for link prediction: graph structure features, latent features, and explicit features.
Graph structure features are located inside the observed node and edge structures of the network, and can be directly observed and computed. Link prediction heuristics belong to graph structure features. We show eight popular heuristics in the heuristics table of this appendix. In addition to link prediction heuristics, node centrality scores (degree, closeness, betweenness, PageRank, eigenvector, hubs etc.), graphlets, network motifs etc. all belong to graph structure features. Although effective in many domains, these predefined graph structure features are handcrafted: they only capture a small set of structure patterns, lacking the ability to express general structure patterns underlying different networks. Considering deep neural networks’ success in feature learning, a natural question to ask is whether we can automatically learn such features, no longer relying on predefined ones.
Graph structure features are inductive, meaning that these features are not associated with a particular node or network. For example, the common neighbor heuristic between any pair of nodes $x$ and $y$ is consistently calculated by counting the number of their common one-hop neighbors, invariant to where $x$ and $y$ are located. Thus, graph structure features are transferable to new nodes and new networks. This is in contrast to latent features, which are often transductive: a change in network structure requires a complete retraining to obtain the latent features again.
Latent features are latent properties or representations of nodes, often obtained by factorizing a specific matrix derived from a network, such as the adjacency matrix or the Laplacian matrix. Through factorization, a low-dimensional embedding is learned for each node. Latent features focus more on global properties and long range effects, because the network’s matrix is treated as a whole during factorization. Latent features cannot capture structural similarities between nodes, and usually need an extremely large dimension to express some simple heuristics. Latent features are also transductive. They cannot be transferred to new nodes or new networks. They are also less interpretable than graph structure features.
Network embedding methods [19, 21, 20, 39, 40, 41] have gained great popularity recently. They learn low-dimensional representations for nodes too. Recently, it is shown that network embedding methods (including DeepWalk, LINE, and node2vec) implicitly factorize some matrix representation of a network. For example, DeepWalk approximately factorizes $\log\left(\mathrm{vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}(D^{-1}A)^{r}\right)D^{-1}\right) - \log(b)$, where $\mathrm{vol}(G)$ is the sum of node degrees, $A$ is the adjacency matrix of the network $G$, $D$ is the diagonal degree matrix, $T$ is skip-gram’s window size, and $b$ is the number of negative samples. For LINE and node2vec, there also exist such matrices. Since network embedding methods also factorize matrix representations of networks, we may regard them as learning more expressive latent features through factorizing some more informative matrices.
Explicit features are often given by continuous or discrete node attribute vectors. In principle, any side information about the network other than its structure can be seen as explicit features. For example, in citation networks, word distributions are explicit features of document nodes. In social networks, a user’s profile information is also an explicit feature (however, their friendship information belongs to graph structure features).
Structural node labels are necessary for enclosing subgraphs because, unlike ordinary graphs, enclosing subgraphs intrinsically have a directionality. The center of an enclosing subgraph consists of the two nodes $x$ and $y$ between which the target link is located. Moving outward from the center, other nodes have increasingly large distances to $x$ and $y$. Node labeling marks such structural differences, thus providing additional structural information to facilitate GNN training.
When designing a node labeling for enclosing subgraphs, we first want to ensure that the target nodes $x$ and $y$ have a distinct label, so that the GNN can distinguish the target link to predict from other edges. Secondly, we want the node labels to reflect nodes’ relative positions in their enclosing subgraph. This relative position can be intuitively described by a node $i$’s double-radius with respect to $x$ and $y$, i.e., $(d(i,x), d(i,y))$.
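We restate our Double-Radius Node Labeling (DRNL) algorithm here. First, assign label 1 to $x$ and $y$. Then, for any node $i$ with $(d(i,x), d(i,y)) = (1,1)$, assign label $f_l(i) = 2$. Nodes with double-radius $(1,2)$ or $(2,1)$ get label 3. Nodes with double-radius $(1,3)$ or $(3,1)$ get 4. Nodes with $(2,2)$ get 5. Nodes with $(1,4)$ or $(4,1)$ get 6. Nodes with $(2,3)$ or $(3,2)$ get 7. So on and so forth. Our DRNL not only satisfies the above criteria, but also attains the additional benefits that for nodes $i$ and $j$:
1) if $d(i,x) + d(i,y) \neq d(j,x) + d(j,y)$, then $d(i,x) + d(i,y) < d(j,x) + d(j,y) \Leftrightarrow f_l(i) < f_l(j)$;
2) if $d(i,x) + d(i,y) = d(j,x) + d(j,y)$, then $d(i,x)\,d(i,y) < d(j,x)\,d(j,y) \Leftrightarrow f_l(i) < f_l(j)$.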
That is, the magnitude of node labels also reflects their distance to the center. Nodes with smaller arithmetic mean distance to the target nodes get smaller labels. If two nodes have the same arithmetic mean distance, the node with a smaller geometric mean distance to the target nodes gets a smaller label. Note that these additional benefits will not be available under one-hot encoding of node labels, since the magnitude information will be lost after one-hot encoding. However, such a labeling is potentially useful when node labels are directly used for training, or used to rank the nodes. Furthermore, our node labeling has a perfect hashing (10) which allows closed-form computation.
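The closed-form computation can be sketched as follows. The formula in `drnl_label` is an assumption on our part: it is one closed form that reproduces the label sequence listed above (2, 3, 4, 5, 6, 7, ...), and we take it to correspond to the hashing function referred to as (10), which is not reproduced in this section. The target nodes themselves receive label 1 and are not handled by this function.

```python
def drnl_label(dx, dy):
    """Return a DRNL-style label for a node with double-radius (dx, dy).

    Assumed closed form: 1 + min(dx, dy) + (d // 2) * (d // 2 + d % 2 - 1),
    where d = dx + dy. Intended for dx, dy >= 1 (non-target nodes).
    """
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * (d // 2 + d % 2 - 1)


for radius in [(1, 1), (1, 2), (1, 3), (2, 2), (1, 4), (2, 3)]:
    print(radius, drnl_label(*radius))  # labels 2, 3, 4, 5, 6, 7 in turn
```

Because the label is an integer function of the two distances only, it can be evaluated per node without building the lookup table explicitly.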
We present a lookup table for DRNL and an example labeled subgraph in Figure 2. Note that when calculating $d(i,x)$, we temporarily remove $y$ from the subgraph, and vice versa. This is because we aim to use the pure distance between $i$ and $x$ without the influence of $y$. If we do not remove $y$, $d(i,x)$ will be upper bounded by $d(i,y) + d(x,y)$, obscuring the “true distance” between $i$ and $x$.
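The effect of masking one target node while measuring distances to the other can be illustrated with a plain breadth-first search. This is a sketch under assumed conventions: the graph is a dictionary of neighbor sets, and the function name `distances_avoiding` and the toy graph are hypothetical.

```python
from collections import deque


def distances_avoiding(adj, source, removed=None):
    """BFS hop distances from source, treating node `removed` as absent."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v != removed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Toy subgraph: a short route x-y-i and a longer route x-a-b-i.
adj = {
    "x": {"y", "a"},
    "y": {"x", "i"},
    "a": {"x", "b"},
    "b": {"a", "i"},
    "i": {"y", "b"},
}
print(distances_avoiding(adj, "x")["i"])               # 2, the path runs through y
print(distances_avoiding(adj, "x", removed="y")["i"])  # 3, the distance without y
```

In this toy case the unmasked distance from the first target to node i is only one more than the distance from the second target, whereas masking exposes the longer route.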
Our node labeling algorithm is different from the Weisfeiler-Lehman algorithm used in WLNM. In WLNM, node labeling is used to define a node order in adjacency matrices; the labels themselves are not input to machine learning models. To rank nodes with the fewest ties, the node labels should be as fine as possible in WLNM. In comparison, the node labels in SEAL need not be very fine, as their purpose is to indicate nodes’ different roles within the enclosing subgraph, not to rank nodes. In addition, node labels in SEAL are encoded into node information matrices and input to machine learning models.
USAir is a network of US airlines with 332 nodes and 2,126 edges. The average node degree is 12.81. NS is a collaboration network of researchers in network science with 1,589 nodes and 2,742 edges. The average node degree is 3.45. PB is a network of US political blogs with 1,222 nodes and 16,714 edges. The average node degree is 27.36. Yeast is a protein-protein interaction network in yeast with 2,375 nodes and 11,693 edges. The average node degree is 9.85. C.ele is a neural network of C. elegans with 297 nodes and 2,148 edges. The average node degree is 14.46. Power is an electrical grid of western US with 4,941 nodes and 6,594 edges. The average node degree is 2.67. Router is a router-level Internet with 5,022 nodes and 6,258 edges. The average node degree is 2.49. E.coli is a pairwise reaction network of metabolites in E. coli with 1,805 nodes and 14,660 edges. The average node degree is 12.55.
Hyperparameters of heuristic and latent feature methods. Most hyperparameters are inherited from the original paper of each method. For Katz, we set the damping factor to 0.001. For PageRank, we set the damping factor to 0.85. For SimRank, we set the decay factor $\gamma$ to 0.8. For the stochastic block model (SBM), we use an existing implementation with 12 latent groups. For matrix factorization (MF), we use the libFM software with the default parameters. For node2vec, LINE, and spectral clustering, we first generate 128-dimensional embeddings from the observed networks with the default parameters of each software. Then, we use the Hadamard product of two nodes’ embeddings as a link’s embedding, as suggested in prior work, and train a logistic regression model with Liblinear using automatic hyperparameter selection. For VGAE, we use its default setting.
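For concreteness, the Hadamard-product link embedding mentioned above is simply the element-wise product of the two node embeddings. The sketch below uses plain Python lists; the function name and the three-dimensional example vectors are hypothetical (the experiments use 128-dimensional embeddings).

```python
def hadamard_edge_feature(emb_u, emb_v):
    """Element-wise product of two node embeddings, used as a link feature."""
    if len(emb_u) != len(emb_v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(emb_u, emb_v)]


print(hadamard_edge_feature([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))  # [0.5, -2.0, 6.0]
```

The resulting vector has the same dimension as the node embeddings and is symmetric in the two nodes, so it can be fed to a classifier such as logistic regression regardless of the order in which a link's endpoints are listed.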
WLNM. Weisfeiler-Lehman Neural Machine (WLNM) is a recent link prediction method that learns general graph structure features. It achieves state-of-the-art performance on various networks, outperforming all handcrafted heuristics. WLNM has three steps: enclosing subgraph extraction, subgraph pattern encoding, and neural network training. In the enclosing subgraph extraction step, for each node pair $(x, y)$, WLNM iteratively extracts $x$ and $y$’s one-hop neighbors, two-hop neighbors, and so on, until the enclosing subgraph has more than $K$ vertices, where $K$ is a user-defined integer. In the subgraph pattern encoding step, WLNM uses the Weisfeiler-Lehman algorithm to define an order for nodes within each enclosing subgraph, so that the neural network can read different subgraphs’ nodes in a consistent order and learn meaningful patterns. To unify the sizes of the enclosing subgraphs, after getting the vertex order, the last few vertices are deleted so that all the truncated enclosing subgraphs have the same size $K$. These truncated enclosing subgraphs are reordered and their fixed-size adjacency matrices are fed into a fully-connected neural network to train a link prediction model. Due to the truncation, WLNM cannot consistently learn from each link’s full $h$-hop neighborhood. The loss of structural information limits WLNM’s performance and restricts it from learning complete $h$-order graph structure features. Following prior work, we use the best-performing $K$ in our experiments.
|Graph structure features||Yes||No||Yes||Yes||Yes|
|Learn from full $h$-hop||No||n/a||Yes||No||Yes|
WLK. Weisfeiler-Lehman graph kernel (WLK) is a state-of-the-art graph kernel. Graph kernels make kernel machines feasible for graph classification by defining some positive semidefinite graph similarity scores. Most graph kernels measure graph similarity by decomposing graphs into small substructures and adding up the pair-wise similarities between these components. Common types of substructures include walks [54, 55], subgraphs [56, 57], paths, and subtrees [34, 59]. WLK is based on counting common rooted subtrees between two graphs. In our experiments, we train an SVM on the WL kernel matrix. We feed the same enclosing subgraphs as in SEAL to WLK. We search over the subtree depth on 10% validation links. WLK does not support continuous node information, but supports integer node labels. Thus, we feed the same structural node labels from (10) to WLK too.
We compare the characteristics of different link prediction methods in Table 4.
In the experiments, we use Deep Graph Convolutional Neural Network (DGCNN) as the default GNN engine of SEAL. DGCNN is a recent GNN architecture for graph classification. It has consistently good performance on various benchmark datasets with a single network architecture (avoiding hyperparameter tweaking). DGCNN is equipped with propagation-based graph convolution layers and a novel graph aggregation layer, called SortPooling. We illustrate the overall architecture of DGCNN in Figure 3. Given the adjacency matrix $A$ and the node information matrix $X$ of an enclosing subgraph, DGCNN uses the following graph convolution layer:
$$Z = f\big(\tilde{D}^{-1} \tilde{A} X W\big), \qquad (11)$$
where $\tilde{A} = A + I$, $\tilde{D}$ is a diagonal degree matrix with $\tilde{D}_{i,i} = \sum_j \tilde{A}_{i,j}$, $W$ is a matrix of trainable graph convolution parameters, $f$ is an element-wise nonlinear activation function, and $Z$ are the new node states. The mechanism behind (11) is that the initial node states $X$ are first linearly transformed by multiplying with $W$, and then propagated to neighboring nodes through the propagation matrix $\tilde{D}^{-1}\tilde{A}$. After graph convolution, the $i$-th row of $Z$ becomes:
$$Z_i = f\Big(\frac{1}{|\Gamma(i)| + 1}\Big[X_i W + \sum_{j \in \Gamma(i)} X_j W\Big]\Big),$$
where $\Gamma(i)$ is the set of neighbors of node $i$. This row summarizes the node information as well as the first-order structure pattern from $i$’s neighbors. DGCNN stacks multiple graph convolution layers (11) and concatenates each layer’s node states as the final node states, in order to extract multi-hop node features.
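A single propagation-based graph convolution of the form $Z = f(\tilde{D}^{-1}\tilde{A}XW)$, with $\tilde{A} = A + I$ and $\tilde{D}$ its diagonal row-sum matrix, can be sketched in NumPy as below. This is an illustrative sketch rather than the DGCNN implementation: the choice of `tanh` as the activation, the function name `graph_conv`, and the toy matrices are assumptions.

```python
import numpy as np


def graph_conv(A, X, W, f=np.tanh):
    """One graph convolution: f(D~^{-1} (A + I) X W)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    deg = A_tilde.sum(axis=1, keepdims=True)  # row sums: number of neighbors + 1
    return f((A_tilde @ (X @ W)) / deg)       # average over each node and its neighbors


# Toy 3-node path graph 0-1-2 with one-hot node information.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Z = graph_conv(A, X, W)
print(Z.shape)  # (3, 2)
```

Dividing by the row sums makes each output row the activation of the mean of the transformed states of a node and its neighbors, which is the row-wise form of the layer.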
A graph aggregation layer constructs a graph-level feature vector from individual nodes’ final states, which is used for graph classification. The most widely used aggregation operation is summing, i.e., nodes’ final states after graph convolutions are summed up as the graph’s representation. However, the averaging effect of summing might lose much of the individual nodes’ information as well as the topological information of the graph. DGCNN uses a novel SortPooling layer, which sorts the final node states according to the last graph convolution layer’s output to achieve an isomorphism-invariant node ordering. A max-$k$ pooling operation is then used to unify the sizes of the sorted representations of different graphs, which enables training a traditional 1-D CNN on the node sequence.
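The sort-then-truncate idea can be sketched in NumPy as follows. This is a simplified illustration, not the DGCNN code: the function name `sort_pooling`, sorting in descending order of the last channel with ties broken by earlier channels, and zero-padding graphs with fewer than `k` nodes are assumptions made for the example.

```python
import numpy as np


def sort_pooling(Z, k):
    """Sort node states by the last channel (descending), then keep or pad to k rows."""
    # np.lexsort uses the last key as the primary key, so the last column of Z
    # dominates and earlier columns break ties; negating gives descending order.
    order = np.lexsort((-Z).T)
    Z_sorted = Z[order]
    n, c = Z_sorted.shape
    if n >= k:
        return Z_sorted[:k]                             # truncate large graphs
    return np.vstack([Z_sorted, np.zeros((k - n, c))])  # zero-pad small graphs


Z = np.array([[1.0, 0.2],
              [2.0, 0.9],
              [3.0, 0.5]])
print(sort_pooling(Z, 2))        # rows with last-channel values 0.9 and 0.5
print(sort_pooling(Z, 4).shape)  # (4, 2)
```

Every graph is mapped to a fixed-size array whose row order does not depend on the original node numbering, which is what allows a 1-D CNN to be applied afterwards.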
We use the default setting of DGCNN, i.e., four graph convolution layers as in (11) with 32, 32, 32, 1 channels, a SortPooling layer (with $k$ set such that 60% of graphs have fewer than $k$ nodes), two 1-D convolution layers (16 and 32 output channels), and a dense layer (128 neurons). We train DGCNN on enclosing subgraphs for 50 epochs, and select the model with the smallest loss on the 10% validation data to predict the testing links.
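One way to pick the SortPooling size from the subgraph sizes is a simple percentile rule, sketched below. The function name `choose_sortpool_k` and the example sizes are hypothetical, and the rule implemented here ("at least 60% of graphs have at most $k$ nodes") is one reading of the setting described above, not necessarily the exact rule used by the DGCNN software.

```python
import math


def choose_sortpool_k(graph_sizes, percent=60):
    """Smallest k such that at least `percent`% of the graphs have at most k nodes."""
    sizes = sorted(graph_sizes)
    index = math.ceil(percent * len(sizes) / 100) - 1
    return sizes[max(index, 0)]


# Hypothetical node counts of ten enclosing subgraphs.
sizes = [5, 8, 10, 12, 20, 25, 30, 40, 50, 60]
print(choose_sortpool_k(sizes))  # 25, since six of the ten graphs have at most 25 nodes
```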
Note that, in any positive training link’s enclosing subgraph, we should always remove the edge between the two target nodes before feeding it into a graph classification model. This is because this edge will contain the link existence information, which is not available in any testing link’s enclosing subgraph.
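A minimal sketch of this preprocessing step is given below, assuming the enclosing subgraph is stored as a dictionary of neighbor sets; the function name `remove_target_edge` and the toy subgraph are hypothetical.

```python
def remove_target_edge(adj, x, y):
    """Return a copy of the subgraph adjacency with the edge between x and y removed."""
    pruned = {u: set(neighbors) for u, neighbors in adj.items()}
    pruned[x].discard(y)  # discard is a no-op if the edge is absent (negative links)
    pruned[y].discard(x)
    return pruned


# Toy enclosing subgraph of a positive link (x, y) with one common neighbor a.
subgraph = {"x": {"y", "a"}, "y": {"x", "a"}, "a": {"x", "y"}}
pruned = remove_target_edge(subgraph, "x", "y")
print(pruned["x"], pruned["y"])  # {'a'} {'a'}
```

Applying the same function to positive and negative links keeps the two kinds of subgraphs comparable, because neither then contains a direct edge between the target nodes.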
In this section, we show the additional experimental results. We first use 90% observed links as training links and 10% as testing links following the main paper’s experiments. The average precision (AP) comparison results with heuristic methods are shown in Table 5. The AP comparison results with latent feature methods are shown in Table 6. We can see that our proposed SEAL shows great performance improvement over all baselines in both AUC and AP.
To evaluate SEAL’s scalability, we show its single-GPU inference time performance in Table 7. As we can see, SEAL has good scalability. For networks with over 1E7 potential links, SEAL took less than an hour to make all the predictions. One possible way to further scale SEAL to social networks with millions of users is to first use some simple heuristics such as common neighbors to filter out most unlikely links and then use SEAL to make further recommendations. Another way is to restrict the candidate friend recommendations to be those who are at most 2 or 3 hops away from the target user, which will vastly reduce the number of candidate links to infer for each user and thus further increase the scalability.
|Dataset||USAir||NS||PB||Yeast||C.ele||Power||Router||E.coli|
|Number of potential links||5.49E+04||1.26E+06||7.46E+05||2.82E+06||4.40E+04||1.22E+07||1.26E+07||1.39E+06|
|Inference time per link (s)||6.05E-04||2.55E-04||2.04E-04||3.96E-04||4.13E-04||1.35E-04||2.13E-04||2.40E-04|
|Inference time for all potential links (s)||31||321||146||1106||16||1640||2681||328|
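The hop-restricted candidate generation suggested above can be sketched with a bounded breadth-first search. This is an illustration under assumed conventions (dictionary-of-sets graph, hypothetical function name `candidates_within_hops`, toy path graph), not part of the reported experiments.

```python
from collections import deque


def candidates_within_hops(adj, user, max_hops=2):
    """Nodes that are not yet linked to `user` but lie within `max_hops` hops."""
    dist = {user: 0}
    queue = deque([user])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Distance 0 is the user and distance 1 are existing neighbors; skip both.
    return {v for v, d in dist.items() if d >= 2}


# Toy path graph a-b-c-d-e.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
print(candidates_within_hops(adj, "a", max_hops=2))  # {'c'}
```

Only the returned candidates would then be scored by the link prediction model, instead of all nodes in the network.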
Next, we redo the comparisons under a 50%–50% train/test split. We randomly remove 50% of the existing links as positive testing links and use the remaining 50% of the existing links as positive training links. The same number of negative training and testing links are sampled from the nonexistent links as well. The AUC results are shown in Tables 8 and 9. The AP results are shown in Tables 10 and 11.
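The splitting protocol can be sketched as follows. This is a simplified illustration, not the code used in the experiments: the function name `split_links`, the fixed seed, and the use of rejection sampling for negative links are assumptions.

```python
import random


def split_links(nodes, edges, test_ratio=0.5, seed=0):
    """Split links into train/test positives and sample equally many negatives."""
    rng = random.Random(seed)
    nodes = list(nodes)
    positives = sorted({tuple(sorted(e)) for e in edges})  # undirected, deduplicated
    n_possible = len(nodes) * (len(nodes) - 1) // 2
    if n_possible - len(positives) < len(positives):
        raise ValueError("not enough nonexistent links to sample from")
    rng.shuffle(positives)
    n_test = int(len(positives) * test_ratio)
    test_pos, train_pos = positives[:n_test], positives[n_test:]

    existing = set(positives)
    negatives = set()
    while len(negatives) < len(positives):  # rejection sampling of nonexistent links
        pair = tuple(sorted(rng.sample(nodes, 2)))
        if pair not in existing:
            negatives.add(pair)
    negatives = sorted(negatives)
    rng.shuffle(negatives)
    test_neg, train_neg = negatives[:n_test], negatives[n_test:]
    return train_pos, test_pos, train_neg, test_neg


# Toy ring network with 10 nodes and 10 links.
nodes = list(range(10))
edges = [(i, (i + 1) % 10) for i in range(10)]
train_pos, test_pos, train_neg, test_neg = split_links(nodes, edges)
print(len(train_pos), len(test_pos), len(train_neg), len(test_neg))  # 5 5 5 5
```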
The results are consistent with the 90%–10% split setting. As we can see, SEAL is still the best among all methods in general. The performance gains over heuristic methods are even larger compared to the 90%–10% split. This indicates that SEAL is able to learn good heuristics even when the network is very incomplete. SEAL also shows clearer advantages over WLNM. On the other hand, we observe that VGAE becomes a strong baseline when the network is sparser, achieving the best AUC results on 3 out of 8 datasets. It is thus interesting to study whether replacing the node2vec embeddings in SEAL with the VGAE embeddings can further improve the performance. We leave this to future work.
We further conduct experiments with the setting of the node2vec paper on five networks: arXiv (18,722 nodes and 198,110 edges), Facebook (4,039 nodes and 88,234 edges), BlogCatalog (10,312 nodes, 333,983 edges and 39 attributes), Wikipedia (4,777 nodes, 184,812 edges and 40 attributes), and Protein-Protein Interactions (PPI) (3,890 nodes, 76,584 edges and 50 attributes). For each network, 50% of random links are removed and used as testing data, while keeping the remaining network connected. For Facebook and arXiv, all remaining links are used as positive training data. For PPI, BlogCatalog and Wikipedia, we sample 10,000 remaining links as positive training data. We compare SEAL (10 training epochs) with node2vec, LINE, SPC, VGAE, and WLNM. For node2vec, we use the parameters provided in its original paper if available. For SEAL and VGAE, the node attributes are used since only these two methods support explicit features.
Table 12 shows the results. As we can see, SEAL consistently outperforms all embedding methods. Especially on the last three networks, SEAL (with node2vec embeddings) outperforms pure node2vec by large margins. These results indicate that in many cases, embedding methods alone cannot capture the most useful link prediction information, while effectively combining the power of different types of features results in much better performance. SEAL also consistently outperforms WLNM.