Log In Sign Up

Graph Neural Networks for Link Prediction with Subgraph Sketching

Many Graph Neural Networks (GNNs) perform poorly compared to simple heuristics on Link Prediction (LP) tasks. This is due to limitations in expressive power such as the inability to count triangles (the backbone of most LP heuristics) and because they can not distinguish automorphic nodes (those having identical structural roles). Both expressiveness issues can be alleviated by learning link (rather than node) representations and incorporating structural features such as triangle counts. Since explicit link representations are often prohibitively expensive, recent works resorted to subgraph-based methods, which have achieved state-of-the-art performance for LP, but suffer from poor efficiency due to high levels of redundancy between subgraphs. We analyze the components of subgraph GNN (SGNN) methods for link prediction. Based on our analysis, we propose a novel full-graph GNN called ELPH (Efficient Link Prediction with Hashing) that passes subgraph sketches as messages to approximate the key components of SGNNs without explicit subgraph construction. ELPH is provably more expressive than Message Passing GNNs (MPNNs). It outperforms existing SGNN models on many standard LP benchmarks while being orders of magnitude faster. However, it shares the common GNN limitation that it is only efficient when the dataset fits in GPU memory. Accordingly, we develop a highly scalable model, called BUDDY, which uses feature precomputation to circumvent this limitation without sacrificing predictive performance. Our experiments show that BUDDY also outperforms SGNNs on standard LP benchmarks while being highly scalable and faster than ELPH.


Revisiting Graph Neural Networks for Link Prediction

Graph neural networks (GNNs) have achieved great success in recent years...

Link Prediction Based on Graph Neural Networks

Traditional methods for link prediction can be categorized into three ma...

Understanding and Extending Subgraph GNNs by Rethinking Their Symmetries

Subgraph GNNs are a recent class of expressive Graph Neural Networks (GN...

Integrating Transductive And Inductive Embeddings Improves Link Prediction Accuracy

In recent years, inductive graph embedding models, viz., graph neural ne...

Hashing-Accelerated Graph Neural Networks for Link Prediction

Networks are ubiquitous in the real world. Link prediction, as one of th...

Adversarial Permutation Guided Node Representations for Link Prediction

After observing a snapshot of a social network, a link prediction (LP) a...

Ordered Subgraph Aggregation Networks

Numerous subgraph-enhanced graph neural networks (GNNs) have emerged rec...

1 Introduction

Link Prediction (LP) is an important problem in graph ML with many industrial applications. For example, recommender systems can be formulated as LP; link prediction is also a key process in drug discovery and knowledge graph construction. There are three main classes of LP methods: (i)

heuristics (See Appendix C.1

) that estimate the distance between two nodes (e.g. personalized page rank (PPR) 

(Page et al., 1999) or graph distance (Zhou et al., 2009)) or the similarity of their neighborhoods (e.g Common Neighbors (CN), Adamic-Adar (AA) (Adamic and Adar, 2003), or Resource Allocation (RA) (Zhou et al., 2009)); (ii) unsupervised node embeddings or factorization methods, which encompass the majority of production recommendation systems (Koren et al., 2009; Chamberlain et al., 2020); and, recently, (iii) Graph Neural Networks, in particular of the Message-Passing type (MPNNs) (Gilmer et al., 2017; Kipf and Welling, 2017; Hamilton et al., 2017).111GNNs are a broader category than MPNNs. Since the majority of GNNs used in practice are of the message passing type, we will use the terms synonymously. GNNs excel in graph- and node-level tasks, but often fail to outperform node embeddings or heuristics on common LP benchmarks such as the Open Graph Benchmark (OGB) (Hu et al., 2020).

There are two related reasons why MPNNs tend to be poor link predictors. Firstly, due to the equivalence of message passing to the Weisfeiler-Leman (WL) graph isomorphism test (Xu et al., 2019; Morris et al., 2019), standard MPNNs are provably incapable of counting triangles (Chen et al., 2020)

and consequently of counting Common Neighbors or computing one-hop or two-hop LP heuristics such as AA or RA. Secondly, GNN-based LP approaches combine permutation-equivariant structural node representations (obtained by message passing on the graph) and a readout function that maps from two node representations to a link probability. However, generating link representations as a function of equivariant node representations encounters the problem that all nodes

in the same orbit induced by the graph automorphism group have equal representations. Therefore, the link probability is the same for all in the orbit independent of e.g. the graph distance (Figure 1).  (Srinivasan and Ribeiro, 2019).

Figure 1: Nodes 2 and 4 are in the same orbit induced by the graph’s automorphism group. As a result, a conventional GNN will assign the same probability to links (1,2) and (1,4).

This is a result of GNN’s built-in permutation equivariance, which produces equal representation for any nodes whose enclosing subgraphs (corresponding to the receptive field of the GNN) are isomorphic.222More precisely, WL-equivalent, which is a necessary but insufficient condition for isomorphism. We refer to this phenomenon as the automorphic node problem and define automorphic nodes (denoted ) to be those nodes that are indistinguishable by means of a given -layer GNN. On the other hand, transductive node embedding methods such as TransE (Bordes et al., 2013) and DeepWalk (Perozzi et al., 2014), or matrix factorization (Koren et al., 2009) do not suffer from this problem as the embeddings are not permutation equivariant.

Several methods have been proposed to improve GNN expressivity for LP. Most simply, adding unique node IDs immediately makes all structural node representations distinguishable, but at the expense of generalization (Abboud et al., 2021) and training convergence (Sato et al., 2021). Substructure counts may act as permutation-equivariant approximately unique identifiers (Bouritsas et al., 2022), but they require a precomputation step which may be computationally intractable in the worst case. More successfully, a family of structural features, sometimes referred to as labeling tricks, have recently been proposed that solve the automorphic node problem while still being equivariant and having good generalization (Li et al., 2020; Zhang et al., 2021; You et al., 2021). However, adding structural features amounts to computing structural node representations that are conditioned on an edge and so can no longer be efficiently computed in parallel. For the purpose of tractability, state-of-the-art methods for LP restrict computation to subgraphs enclosing a link, transforming link prediction into binary subgraph classification (Zhang et al., 2021; Zhang and Chen, 2018; Yin et al., 2022). Subgraph GNNs (SGNN) are inspired by the strong performance of LP heuristics compared to more sophisticated techniques and are motivated as an attempt to learn data-driven LP heuristics.

Despite impressive performance on benchmark datasets, SGNNs suffer from some serious limitations: (i) Constructing the subgraphs is expensive; (ii) Subgraphs are irregular and so batching them is inefficient on GPUs (iii); Each step of inference is almost as expensive as each training step because subgraphs must be constructed for every test link. These drawbacks preclude many applications, where scalability or efficient inference are required.

Main contributions.

(i) We analyze the relative contributions of SGNN components and reveal which properties of the subgraphs are salient to the LP problem. (ii) Based on our analysis, we develop an MPNN (ELPH) that passes subgraph sketches as messages. The sketches allow the most important qualities of the subgraphs to be summarized in the nodes. The resulting model removes the need for explicit subgraph construction and is a full-graph MPNN with the similar complexity to GCN. (iii) We prove that ELPH is strictly more expressive than MPNNs for LP and that it solves the automorphic node problem. (iv) As full-graph GNNs suffer from scalability issues when the data exceeds GPU memory, we develop BUDDY, a highly scalable model that precomputes sketches and node features. (v) We provide an open source Pytorch library for (sub)graph sketching that generates data sketches via message passing on the GPU. Experimental evaluation shows that our methods compares favorably to state-of-the-art both in terms of accuracy and speed.

2 Preliminaries


Let be an undirected graph comprising the set of nodes (vertices) and links (edges) . We denote by the geodesic distance (shortest walk length) between nodes and . Let be a node-induced subgraph of satisfying iff for any . We denote by a -hop subgraph enclosing the link , where is the union of the -hop neighbors of and and is the union of the links that can be reached by a -hop walk originating at and (for simplicity, where possible, we omit ). Similarly, is the -hop subgraph enclosing node . The given features of nodes are denoted by and the derived structure features by . The probability of a link is denoted by . When nodes and have isomorphic enclosing subgraphs (i.e., ), we write .

Sketches for Intersection Estimation.

We use two sketching techniques, HyperLogLog (Flajolet et al., 2007; Heule et al., 2013a) and MinHashing (Broder, 1997). Given sets and , HyperLogLog efficiently estimates the cardinality of the union and MinHashing

estimates the Jaccard index

. We combine these approaches to estimate the intersection of node sets produced by graph traversals  (Pascoe, 2013). These techniques represent sets as sketches, where the sketches are much smaller than the sets they represent. Each technique has a parameter controlling the trade-off between the accuracy and computational cost. Running times for merging two sets, adding an element to a set, and extracting estimates only depend on , but they are constant with respect to the size of the set. Importantly, the sketches of the union of sets are given by permutation-invariant operations (element-wise for minhash and element-wise for hyperloglog). More details are provided in Appendix C.3.

Graph Neural Networks for Link Prediction.

Message-passing GNNs (MPNNs) are parametric functions of the form , where and are matrix representations (of size and , where is the number of nodes and are the input and output dimensions, respectively) of input and output node features. Permutation equivariance implies that for any node permutation matrix . This is achieved in GNNs by applying a local permutation-invariant aggregation function (typically sum, mean, or max) to the neighbor features of every node (‘message passing’), resulting in a node-wise update of the form


where are learnable functions. MPNNs are upperbounded in their discriminative power by the Weisfeiler-Leman isomorphism test (WL) (Weisfeiler and Leman, 1968), a procedure iteratively refining the representation of a node by hashing its star-shaped neighborhood. As a consequence, since WL always identically represents automorphic nodes (), any MPNN would do the same: . Given the node representations computed by a GNN, link probabilities can then be computed as , where is a learnable readout function with the property that for any . This node automorphism problem is detrimental for LP as if even when . As an example, in Figure1 , therefore while . As a result, a GNN may suggest to link totally unrelated nodes ( may even be in a separate connected component to and , but still have equal probability of connecting to  (Srinivasan and Ribeiro, 2019)).

Subgraph GNNs (SGNN).

(Zhang et al., 2021; Zhang and Chen, 2018; Yin et al., 2022) convert LP into binary graph classification. For a pair of nodes and the enclosing subgraph , SGNNs produce node representations and one desires if and zero otherwise. In order to resolve the automorphic node problem, node features are augmented with structural features (Bouritsas et al., 2020) that improve the ability of networks to count substructures (Chen et al., 2020). SGNNs were originally motivated by the strong performance of heuristics on benchmark datasets and attempted to learn generalized heuristics. When the graph is large it is not tractable to learn heuristics over the full graph, but global heuristics can be well approximated from subgraphs that are augmented with structural features with an approximation error that decays exponentially with the number of hops taken to construct the subgraph (Zhang and Chen, 2018).

3 Analyzing Subgraph Methods for Link Prediction

SGNNs can be decomposed into the following steps: (i) subgraph extraction around every pair of nodes for which one desires to perform LP; (ii) augmentation of the subgraph nodes with structure features; (iii) feature propagation over the subgraphs using a GNN, and (iv) learning a graph-level readout function to predict the link. Steps (ii)–(iv) rely on the existence of a set of subgraphs (i), which is either constructed on the fly or as a preprocessing step. In the remainder of this section we discuss the inherent complexity of each of these steps and perform ablation studies with the goal of understanding the relative importance of each.

Figure 2: (a) Structure features in SGNNs. (b) LP readout function over the output of all nodes in (sum or mean) or just and (edge).
Structure Features
Figure 3:

The importance of DRNL structure features. Importance is based on the weights in a logistic regression model using all DRNL features without node features.

Structure features address limitations in GNN expressivity stemming from the inherent inability of message passing to distinguish automorphic nodes. In SGNNs, permutation-equivariant distances and are used. The three most well known are Zero-One (ZO) encoding (You et al., 2021), Double Radius Node Labeling (DRNL) (Zhang and Chen, 2018) and Distance Encoding (DE) (Zhang and Chen, 2018). To solve the automorphic node problem, all that is required is to distinguish and from , which ZO achieves with binary node labels. DRNL has and , where is a bijective map. Distance Encoding (DE) (Li et al., 2020) generalizes DRNL; each node is encoded with a tuple (See Figure 5). Both of these schemes therefore include a unique label for triangles / common neighbors (See Appendix C.4 for more details on labeling schemes). The relative performance of ZO, DRNL, DE and no structure features is shown in Figure 1(a)

. The use of structure features greatly improves performance and DRNL and DE slightly outperform ZO, with a more pronounced outperformance for the larger Pubmed dataset. It is important to note that structure features are conditioned on edges and so can not be easily paralellized in the same way that node features usually are in GNNs and must be calculated for each link at both training and inference time. In particular DRNL requires two full traversals of each subgraph, which for regular graphs is

, but for complex networks becomes .

In Figure 3 we investigate the relative importance of DRNL structure features. Feature importance is measured using the weights in a logistic regression link prediction model with only structure feature counts as inputs. We then sum up feature importances corresponding to different max distance and normalize the total sum to one. The Figure indicates that most of the predictive performance is concentrated in low distances.

Propagation / GNN

In SGNNs, structure features are usually embedded into a continuous space, concatenated to any node features and propagated over subgraphs. While this procedure is necessary for ZO encodings, it is less clear why labels that precisely encode distances and are directly comparable as distances should be embedded in this way. Instead of passing embedded DE or DRNL structure features through a GNN, we fixed the number of DE features by setting the max distance to three and trained an MLP directly on the counts of these nine features (e.g. (1,1): 3, (1,2): 1, etc.). Figure 3(a) shows that doing so, while leaving ceteris paribus (node features still propagated over the subgraph) actually improves performance in two out of three datasets. We also investigated if any features require SGNN propagation by passing both raw node features and structure feature counts through an MLP. The results in the right columns of Figure 3(b) indicate that this reduces performance severely, but by pre-propagating the features as (middle columns) it is possible to almost recover the performance of propagation with the SGNN (left columns).

(a) Propagating structure features
(b) Propagating node features.
Figure 4: Examining the extent to which an SGNN is needed to propagate features. (a) An MLP on structure feature counts outperforms SGNN propagation in two of three datasets. (b) An MLP on structure feature counts and raw node features performs poorly (no prop), but performance can be recovered if node features are propagated over the graph in preprocessing (preprop).
Figure 5: The DE node labeling scheme for link
Readout / Pooling Function

Given SGNN node representations on the subgraph, a readout function maps a representations to link probabilities. For graph classification problems, this is most commonly done by pooling node representations (graph pooling) typically with a mean or sum operation plus an MLP. For LP, an alternative is edge pooling with , usually with the Hadamard product. A major advantage of edge pooling is that it can be formulated subgraph free. Figure 2 indicates that edge pooling produces better predictive performance than either mean or sum pooling across all nodes in .

Analysis Summary

The main results of Section 3 are that (i) The inclusion of structure features leads to very large improvements across all datasets (Figure 1(a)); (ii) The processing of these features, by embedding them and propagating them with an SGNN is sub-optimal both in terms of efficiency and performance (Figure 3(a)); (iii) Most of the importance of the structure features is located in the lowest distances (Figure 3); and (iv) edge level readout functions greatly outperform mean or sum pooling over subgraphs (Figure 2). If, on one hand, subgraphs are employed as a tractable alternative to the full graph for each training edge, on the other, generating them remains an expensive operation ( time complexity for regular graphs and for complex networks with power law degree distributions 333Subgraphs can be pre-computed, but the subgraphs combined are much larger than the original dataset, exceeding available memory for even moderately-sized datasets., see Appendix C.2). Within this context, our analysis shows that if the information necessary to compute structure features for an edge can be encoded in the nodes, then it is possible to recover the predictive performance of SGNNs without the cost of generating a different subgraph for each edge. We build upon this observation to design an efficient yet expressive model in Section 4.

4 Link Prediction with Subgraph Sketching

We now develop a full-graph GNN model that uses node-wise subgraph sketches to approximate structure features such as the counts of DE and DRNL labels, which our analysis indicated are sufficient to encompass the salient patterns governing the existence of a link (Figure 3(a)).

4.1 Structure Features Counts

Let be the number of (, ) labels for the link (, ), which is equivalent to the number of nodes at distances exactly and from and respectively (See Figure 5). We compute for all , less than the receptive field , which guarantees a number of counts that do not depend on the graph size and mitigates overfitting. To alleviate the loss of information coming from a fixed , we also compute , counting the number of nodes at distance from and at distance from . We compute for all . Figure 6 shows how and relate to the neighborhoods of the two nodes. Overall, this results in counts for and counts for ( for the source and for the destination node), for a total of count features. These counts can be computed efficiently without constructing the whole subgraph for each edge. Defining , we have


where are the -hop neighbors of (i.e. nodes at distance from ).

Estimating Intersections and Cardinalities

and can be efficiently approximated with the sketching techniques introduced in section 2. Let and be the HyperLogLog and MinHash sketches of node ’s -hop neighborhood, obtained recursively from the initial node sketches and using the relations and . We can approximate the intersection of neighborhood sets as


where is the Hamming similarity, is the HyperLogLog cardinality estimation function (see Appendix C.3.1) and is taken elementwise. MinHashing approximates the Jaccard similarity () between two sets and HyperLogLog approximates set cardinality. can be approximated as . The cost of this operation only depends on additional parameters of HyperLogLog and MinHashing (outlined in more detail in Section 2) which allow for a tradeoff between speed and accuracy, but is nevertheless independent of the graph size. Using this approximations we obtain estimates and of the structure features counts.

Figure 6: Blue and red concentric circles indicate 1 and 2-hop neighborhoods of and respectively. Structure features and measure the cardinalities of intersections of these neighborhoods.

4.2 Efficient Link Prediction with Hashes (ELPH)

We present ELPH, a novel MPNN for link prediction. In common with other full-graph GNNs, it employs a feature propagation component with a link level readout function. However, by augmenting the messages with subgraph sketches, it achieves higher expressiveness for the same asymptotic complexity. ELPH’s feature propagation fits the standard MPNN formalism (Gilmer et al., 2017):


where are learnable functions, is a local permutation-invariant aggregation function (typically sum, mean, or max), are the original node features, and and are the MinHashing and HyperLogLog single node sketches respectively. Minhash and hyperloglog sketches of are computed by aggregating with and operators respectively the sketches of (Eq. 7). These sketches can be used to compute the intersection estimations and up to the -hop neighborhood, which are then used as edge features (Eq. 8). Intuitively, the role of the edge features is to modulate message transmission based on local graph structures, similarly to how attention is used to modulate message transmission based on feature couplings. A link predictor is then applied to the final node representations of the form


where , is a receptive field, is an MLP and is the element-wise product. decouples the readout function from subgraph generation as suggested in Figure 2. This approach moves computation from edgewise operations e.g. generating subgraphs, to nodewise operations (generating sketches) that encapsulate the most relevant subgraph information for LP. We have shown that (i) GNN propagation is either not needed (structure features) or can be preprocessed globally (given features) (ii) structure features can be generated from sketches instead of subgraphs (iii) an edge pooling readout function performs as well as a graph pooling readout. In combination these approaches circumvent the need to explicitly construct subgraphs. The resulting model efficiently combines node features and the graph structure without explicitly constructing subgraphs and is as efficient as GCN (Kipf and Welling, 2017).

Expressive power of ELPH

ELPH combines the advantageous traits of both GNNs and subgraph-based methods, i.e. tractable computational complexity and the access to pair-wise structural features. Furthermore, as the following results show, it is more expressive than MPNNs:

Proposition 4.1.

Let be the family of ELPH models as per Equations 7 - 9 and 10 where estimates are exact ()444

For sufficiently large samples, this result can be extended to approximate counts via unbiased estimators.

. does not suffer from the automorphic node problem.

While we rigorously define the automorphic node problem and prove the above in Section A.2, this result states that there exists non-automorphic links with automorphic nodes that an ELPH model is able to discriminate (thanks to structural features). Contrary to ELPHs, MPNNs are not able to distinguish any of these links; we build upon this consideration, as well as the observation that ELPH models subsume MPNNs, to obtain this additional result:

Theorem 4.2.

Let be the family of Message Passing Neural Networks (Equation 1). is strictly more powerful than ().

We prove this in Section A.2. Intuitively, the Theorem states that while all links separated by MPNNs are also separated by ELPHs, there also exist links separated by the latter family but not by the former.

5 Scaling ELPH with Preprocessing (BUDDY)

Similarly to other full-graph GNNs (e.g. GCN), ELPH is efficient when the dataset fits into GPU memory. When it does not, the graph must be batched into subgraphs. Batching is a major challenge associated with scalable GNNs and invariably introduces high levels of redundancy across batches. Here we introduce a large scale version of ELPH, called BUDDY, which uses preprocessing to side-step the need to have the full dataset in GPU memory.


Figure 3(b) indicated that a fixed propagation of the node features almost recovers the performance of learnable SGNN propagation. This propagation can be achieved by efficient sparse scatter operations and done only once in preprocessing. Sketches can also be precomputed in a similar way:

where and scatter_max and scatter_mean are defined similarly, are the original node features, and and are the MinHashing and HyperLogLog single node sketches respectively. Similarly to (Rossi et al., 2020), we concatenate features diffused at different hops to obtain the input node features: .

Link Predictor

We now have , where is an MLP and and are computed using and as explained in Section 4.1. Doing so effectively converts a GNN into an MLP, removing the need to sample batches of subgraphs when the dataset overflows GPU memory (Rossi et al., 2020). Further improvements in efficiency are possible when multiple computations on the same edge sets are required. In this case, the edge features can be precomputed and cached.


Denoting the complexity of hash operations as , the node representation dimension and the number of edges , the preprocessing complexity of BUDDY is . The first term being node feature propagation and the second sketch propagation. Preprocessing for SEAL and NBFNet are . A link probability is computed by (i) extracting -hop structural features, which costs and (ii) An MLP on structural and node features, which results in operations. Both are independent of the size of the graph. Since SGNNs construct subgraphs for each link they are for complex networks (See Section C.2). Finally, the total complexity of NBFNet is with an amortized time for a single link prediction of  (Zhu et al., 2021). However, this is only realized in situations where the link probability is required . Complexities are summarized in Table 1.

Expressiveness of BUDDY

Similarly to ELPH, BUDDY does not suffer from the automorphic node problem (see Proposition and its Proof in Appendix A.2). However, due to its non-parametric feature propagation, this class of models does not subsume MPNNs and we cannot exclude the presence of links separated by MPNNs, but not by BUDDY. Nevertheless, BUDDY empirically outperforms common MPNNs by large margins (see Section 7), while also being extremely scalable.

Complexity SEAL NBFNet BUDDY
Training (1 link)
Table 1: Model complexity. and are the number of nodes and edges respectively. We use -dimensional node features, hops for propagation and sketches of size .

6 Related Work

Subgraph methods for link prediction were introduced as the Weisfeiler Leman Neural Machine (WLNM) in (Zhang and Chen, 2017). As an MLP is used to learn from subgraphs instead of a GNN their model is not permutation-equivariant and uses a hashing-based WL algorithm (Kersting et al., 2014) to generate node indices for the MLP. Note that ‘hashing’ here refers to injective neighbor aggregation functions, distinct from the data sketching methods employed in the current work. The MLP in WLNM was replaced by a GNN in the seminal SEAL (Zhang and Chen, 2018), thus removing the need for a hashing scheme. Additional methodological improvements were made in (Zhang et al., 2021; Zhang and Li, 2021) and subgraph generation was improved in (Yin et al., 2022). Applications to recommender systems were addressed in (Zhang and Chen, 2019) and (Yin et al., 2022) use random walks to efficiently approximate subgraphs. The DRNL labeling scheme was introduced in SEAL and further labeling schemes are developed in (You et al., 2021; Li et al., 2020) and generalized with equivariant positional encodings (Wang et al., 2022). Neural Bellman-Ford Networks (Zhu et al., 2021) takes a different approach. It is concerned with single source distances where each layer of the neural networks is equivalent to an iteration of the Bellman-Ford algorithm and can be regarded as using a partial labeling scheme. The same graph and augmentation can be shared among multiple destinations and so it is faster (amortized) than SGNNs. However, it suffers from poor space complexity and thus requires impractically high memory consumption. Neo-GNN (Yun et al., 2021) directly learns a heuristic function using an edgewise and a nodewise neural network. #GNN (Wu et al., 2021) directly hashes node features, instead of graph structure, and is limited to node features that can be interpreted as set memberships. Another paradigm for LP uses self-supervised node embeddings combined with an (often) approximate distance metric. Amongst the best known models are TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), or complEx (Trouillon et al., 2016), which have been shown to scale to large LP systems in e.g. (Lerer et al., 2019; El-Kishky et al., 2022). Decoupling feature propagation and learning in GNNs was first implemented in SGC (Wu et al., 2019) and then scalably in SIGN (Rossi et al., 2020).

Cora Citeseer Pubmed Collab PPA Citation2 DDI
Metric HR@100 HR@100 HR@100 HR@50 HR@100 MRR HR@20
Table 2: Results on link prediction benchmarks. The top three models are colored by First, Second, Third. Where possible, baseline results are taken directly from the OGB leaderboard.

7 Experiments

Datasets, Baselines and Experimental Setup

We report results for the most widely used Planetoid citation networks Cora (McCallum et al., 2000), Citeseer (Sen et al., 2008) and Pubmed (Namata et al., 2012) and the OGB link prediction datasets (Hu et al., 2020). We report results for: Three heuristics that have been successfully used for LP, Adamic-Adar (AA) (Adamic and Adar, 2003), Resource Allocation (RA) (Zhou et al., 2009) and Common Neighbors (CN); two of the most popular GNN architectures: Graph Convolutional Network (GCN) (Kipf and Welling, 2017) and GraphSAGE (Hamilton et al., 2017); the state-of-the-art link prediction GNNs Neo-GNN (Yun et al., 2021), SEAL (Zhang and Chen, 2018) and NBFNet (Zhu et al., 2021). Additional details on datasets and the experimental setup are included in Appendix B.1


Results are presented in Table 2 with metrics given in the first row. Either ELPH or BUDDY achieve the best performance in five of the seven datasets, with SEAL being the closest competitor. Being a full-graph method, ELPH runs out of memory on the two largest datasets, while its scalable counterpart, BUDDY, performs extremely well on both. Despite ELPH being more general than BUDDY, there is no clear winner between the two in terms of performance, with ELPH outperforming BUDDY in three of the five datasets where we have results for both. Ablation studies for number of hops, sketching parameters and the importance of node and structure features are in Appendix D.2.

dataset time (s) SEAL dyn SEAL stat ELPH BUDDY
preproc 0 630 0 5
Pubmed train 70 30 25 1
inference 23 9 0.1 0.06
preproc 0 3,000,000 1,200
Citation train 300,000 200,000 OOM 1,500
inference 300,000 100,000 300
Table 3:

Wall times of our methods in comparison to SEAL. Training time is one epoch. Inference time is the full test set.


Wall times are shown in Table 3. We report numbers for both the static mode of SEAL (all subgraphs are generated and labeled as a preprocessing step), and the dynamic mode (subgraphs are generated on the fly). BUDDY is orders of magnitude faster both in training and inference. In particular, Buddy is 200–1000 faster than SEAL in dynamic mode for training and inference on the Citation dataset. Further runtimes and a breakdown of preprocessing costs are in Appendix D.1 with learning curves in Appendix F.

8 Conclusion

We have presented a new model for LP that is based on an analysis of existing state of the art models, but which acheives better time and space complexity and superior predictive performance on a range of standard benchmarks. The current work is limited to undirected graphs, or directed graphs that are first preprocessed to make them undirected as is common in GNN research. We leave as future work extensions to directed graphs and temporal / dynamically evolving graphs and investigations into the links with graph curvature (See Appendix C.5).


  • R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz (2021) The surprising power of graph neural networks with random node initialization. In IJCAI, Cited by: §1.
  • L. A. Adamic and E. Adar (2003) Friends and neighbors on the web. Social networks 25 (3), pp. 211–230. Cited by: §1, §7.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. Advances in neural information processing systems 26. Cited by: §1, §6.
  • G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein (2020) Improving graph neural network expressivity via subgraph isomorphism counting. arXiv:2006.09252. Cited by: §2.
  • G. Bouritsas, F. Frasca, S. P. Zafeiriou, and M. Bronstein (2022) Improving graph neural network expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • A. Z. Broder (1997) On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. Cited by: §2.
  • B. P. Chamberlain, E. Rossi, D. Shiebler, S. Sedhain, and M. M. Bronstein (2020) Tuning word2vec for large scale recommendation systems. In Fourteenth ACM Conference on Recommender Systems, pp. 732–737. Cited by: §1.
  • B. P. Chamberlain, J. Levy-Kramer, C. Humby, and M. P. Deisenroth (2018) Real-time community detection in full social networks on a laptop. PloS one 13 (1), pp. e0188702. Cited by: §C.3.2.
  • Z. Chen, L. Chen, S. Villar, and J. Bruna (2020) Can graph neural networks count substructures?. In NeurIPS, Cited by: §1, §2.
  • A. El-Kishky, T. Markovich, S. Park, C. Verma, B. Kim, R. Eskander, Y. Malkov, F. Portman, S. Samaniego, Y. Xiao, et al. (2022) TwHIN: embedding the twitter heterogeneous information network for personalized recommendation. arXiv preprint arXiv:2202.05387. Cited by: §6.
  • O. Ertl (2017) New cardinality estimation methods for hyperloglog sketches. arXiv preprint arXiv:1706.07290. Cited by: §A.2.
  • M. Fey and J. E. Lenssen (2019) Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §B.4.
  • P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In AofA: Analysis of Algorithms, P. Jacquet (Ed.), DMTCS Proceedings, Vol. DMTCS Proceedings vol. AH, 2007 Conference on Analysis of Algorithms (AofA 07), Juan les Pins, France, pp. 137–156. External Links: Link, Document Cited by: §C.3.1, §2.
  • J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §1, §4.2.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS, Cited by: §1, §7.
  • S. Heule, M. Nunkesser, and A. Hall (2013a) HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the EDBT 2013 Conference, Genoa, Italy. Cited by: §C.3.1, §2.
  • S. Heule, M. Nunkesser, and A. Hall (2013b) Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683–692. Cited by: §C.3.1.
  • W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020)

    Open graph benchmark: datasets for machine learning on graphs

    arXiv:2005.00687. Cited by: §1, §7.
  • K. Kersting, M. Mladenov, R. Garnett, and M. Grohe (2014) Power iterated color refinement. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 28. Cited by: §6.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §4.2, §7.
  • Y. Koren, R. Bell, and C. Volinsky (2009) Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37. Cited by: §1, §1.
  • A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich (2019) PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA. Cited by: §6.
  • P. Li, Y. Wang, H. Wang, and J. Leskovec (2020) Distance encoding: design provably more powerful neural networks for graph representation learning. Advances in Neural Information Processing Systems 33, pp. 4465–4478. Cited by: §C.4, §1, §3, §6.
  • L. Lü and T. Zhou (2011) Link prediction in complex networks: a survey. Physica A: statistical mechanics and its applications 390 (6), pp. 1150–1170. Cited by: §C.1.
  • H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman (2019) Provably powerful graph networks. In NeurIPS, pp. 2153–2164. Cited by: §A.1.
  • A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore (2000) Automating the construction of internet portals with machine learning. Information Retrieval 3 (2), pp. 127–163. Cited by: §7.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2019) Weisfeiler and leman go neural: higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, pp. 4602–4609. Cited by: §1.
  • G. Namata, B. London, L. Getoor, B. Huang, and U. EDU (2012) Query-driven active surveying for collective classification. In Proceedings Mining and Learning with Graphs, Cited by: §7.
  • L. Page, S. Brin, R. Motwani, and T. Winograd (1999) The pagerank citation ranking: bringing order to the web.. Technical report preprint. Cited by: §1.
  • A. Pascoe (2013) HyperLogLog and minhash-a union for itersections. Note:; accessed 09-May-2022 Cited by: §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019)

    PyTorch: an imperative style, high-performance deep learning library

    In NeurIPS, Cited by: §B.4.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
  • E. Rossi, F. Frasca, B. Chamberlain, D. Eynard, M. Bronstein, and F. Monti (2020) SIGN: scalable inception graph neural networks. arXiv:2004.11198. Cited by: §5, §5, §6.
  • R. Sato, M. Yamada, and H. Kashima (2021) Random features strengthen graph neural networks. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp. 333–341. Cited by: §1.
  • P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93. Cited by: §7.
  • B. Srinivasan and B. Ribeiro (2019) On the equivalence between positional node embeddings and structural graph representations. In International Conference on Learning Representations, Cited by: §A.1, §A.1, §A.2, §1, §2.
  • J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein (2022) Understanding over-squashing and bottlenecks on graphs via curvature. 10th International Conference on Learning Representations, ICLR 2019. Cited by: §C.5.
  • T. Trouillon, J. Welbl, S. Riedel, E. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 2071–2080. External Links: Link Cited by: §6.
  • H. Wang, H. Yin, M. Zhang, and P. Li (2022) Equivariant and stable positional encoding for more powerful graph neural networks. arXiv preprint arXiv:2203.00199. Cited by: §6.
  • B. Weisfeiler and A. Leman (1968) The reduction of a graph to canonical form and the algebra which appears therein. NTI Series 2 (9), pp. 12–16. Cited by: §2.
  • F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger (2019) Simplifying graph convolutional networks. In ICML, Cited by: §6.
  • W. Wu, B. Li, C. Luo, and W. Nejdl (2021) Hashing-accelerated graph neural networks for link prediction. In Proceedings of the Web Conference 2021, pp. 2910–2920. Cited by: §6.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In ICLR, Cited by: §1.
  • B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2015) Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §6.
  • H. Yin, M. Zhang, Y. Wang, J. Wang, and P. Li (2022) Algorithm and system co-design for efficient subgraph-based graph representation learning. arXiv preprint arXiv:2202.13538. Cited by: §1, §2, §6.
  • J. You, J. Gomes-Selman, R. Ying, and J. Leskovec (2021) Identity-aware graph neural networks. arXiv preprint arXiv:2101.10320. Cited by: §1, §3, §6.
  • S. Yun, S. Kim, J. Lee, J. Kang, and H. J. Kim (2021) Neo-gnns: neighborhood overlap-aware graph neural networks for link prediction. Advances in Neural Information Processing Systems 34, pp. 13683–13694. Cited by: §6, §7.
  • M. Zhang and Y. Chen (2017) Weisfeiler-lehman neural machine for link prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 575–583. Cited by: §6.
  • M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. Advances in neural information processing systems 31. Cited by: §D.1, §1, §2, §3, §6, §7.
  • M. Zhang and Y. Chen (2019) Inductive matrix completion based on graph neural networks. In International Conference on Learning Representations, Cited by: §6.
  • M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin (2021) Labeling trick: a theory of using graph neural networks for multi-node representation learning. Advances in Neural Information Processing Systems 34. Cited by: §A.1, §A.1, §A.1, §A.2, §1, §2, §6.
  • M. Zhang and P. Li (2021) Nested graph neural networks. Advances in Neural Information Processing Systems 34. Cited by: §6.
  • T. Zhou, L. Lü, and Y. Zhang (2009) Predicting missing links via local information. The European Physical Journal B 71 (4), pp. 623–630. Cited by: §1, §7.
  • Z. Zhu, Z. Zhang, L. Xhonneux, and J. Tang (2021) Neural bellman-ford networks: a general graph neural network framework for link prediction. Advances in Neural Information Processing Systems 34. Cited by: §5, §6, §7.

Appendix A Theoretical Analyses

a.1 Preliminaries

We introduce some preliminary concepts that will be useful in our analysis. Let us start with the definition of graph isomorphism and automorphism.

Definition A.1 (Graph isomorphism and automorphism).

Let , be two simple graphs. An isomorphism between is a bijective map which preserves adjacencies, that is: . If , is called an automorphism.

In view of the definition above, two graphs are also called isomorphic whenever there exists an isomorphism between the two. Intuitively, if two graphs are isomorphic they encode the same exact relational structure, up to relabeling of their nodes. As we shall see next, automorphisms convey a slightly different meaning: they allow to define graph symmetries by formalizing the concept of node structural roles.

The above definitions can be extended to attributed graphs, more conveniently represented through third-order tensors

, with the number of features [25, 51]. Indexes in the first two dimensions of tensor univocally correspond to vertexes in through a bijection . Here, elements correspond to node features, while elements to edge features, which include the connectivity information in . Within this context, a graph is defined as a tuple , and isomorphisms and automorphisms are interpreted as permutations acting on as:


where is the set of all bijections (permutations) over symbols. We then define the concepts of isomorphism and automorphism as:

Definition A.2 (Graph isomorphism and automorphism on attributed graphs).

Let be two attributed graphs represented by tensors . An isomorphism between is a bijective map (permutation) such that . For a graph , is an automorphism if .

Let us get back to the information conveyed by automorphisms by introducing the following definition:

Definition A.3 (Graph automorphism group).

Let be a simple graph and be the set of graph automorphisms defined on its vertex set. is the group having as base set and function composition () as its operation.

Group induces an equivalence relation : (it is easily proved that is indeed reflexive, symmetric and transitive, thus qualifying as an equivalence relation). The equivalence classes of partition the vertex set into orbits: . The set of orbits identifies all structural roles in the graph. Two vertexes belonging to the same orbit have the same structural role in the graph and are called ‘symmetric’ or ‘automorphic’.

In fact, it is possible to define automorphisms between node-pairs as well [36, 51]:

Definition A.4 (Automorphic node pairs).

Let be an attributed graph and be two pairs of nodes. We say are automorphic if there exists an automorphism such that . We write .

With the above it is possible to extend the definition of orbits to node-pairs, by considering those node-pair automorphisms which are naturally induced by node ones, i.e.: .

Notably, for two node pairs , , i.e., automorphism between nodes does not imply automorphism between node pairs. For instance, one counter-example of automorphic nodes involved in non-automorphic pairs is depicted in Figure 1, while another one is constituted by pair in Figure 7, whose automorphisms are reported in Tables 4 and 5. This ‘phenomenon’ is the cause of the so-called “automorphic node problem”.

Definition A.5 (Automorphic node problem).

Let be a family of models. We say suffers from the automorphic node problem if for any model , and any simple (attributed) graph we have that , .

The above property is identified as a ‘problem’ because there exist examples of non-automorphic node pairs composed by automorphic nodes. A model with this property would inevitably compute the same representations for these non-automorphic pairs, despite they feature significantly different characteristics, e.g. shortest-path distance or number of common neighbors.

Importantly, as proved by [36] and restated by [51], the model family of Message Passing Neural Networks suffer from the aforementioned problem:

Proposition A.6.

Let be the family of Message Passing Neural Networks (Equation 1) representing node-pairs as a function of their computed (equivariant) node representations. suffers from the automorphic node problem.

We conclude these preliminaries by introducing the concept of link discrimination, which gives a (more) fine-grained measure of the link representational power of model families.

Definition A.7 (Link discrimination).

Let be any simple (attributed) graph and a model belonging to some family . Let be two node pairs. We say discriminates pairs iff . We write . If there exists such a model , then family distinguishes between the two pairs and we write .

Accordingly, does not discriminate pairs when it contains no model instances which assign distinct representations to the pairs. We write .

We can compare model families based on their expressiveness, that is, their ability to discriminate node pairs:

Definition A.8 (More expressive).

Let be two model families. We say is more expressive than . We write .

Put differently, is more expressive than when for any two node pairs, if there exists a model in which disambiguates between the two pairs, then there exists a model in which does so as well. When the opposite is not verified, we say is strictly more expressive than :

Definition A.9 (Strictly more expressive).

Let be two model families. We say is strictly more expressive than . Equivalently, .

In other words, is strictly more expressive than when there exists no model in the latter family disambiguating between two non-automorphic pairs while there exist some models in the former which do so.

a.2 Deferred theoretical results and proofs

Figure 7: Graph with nodes and node-pairs coloured according to the orbit they belong to. Node-pairs corresponding to actual edges are depicted solid, dashed otherwise. There are automorphisms which map any node to any other (see Table 4). Hence all nodes are in the same, single orbit. On the contrary, node-pairs are partitioned into three distinct orbits (see Table 5). As it is possible to notice, pairs and are not automorphic, while their constituent nodes are.

Let us start by reporting the Proof for Proposition 4.1, stating that ELPH models do not suffer from the automorphic node problem.

Proof of Proposition 4.1.

In order to prove the Proposition it is sufficient to exhibit an ELPH model which distinguishes between two node-pairs whose nodes are automorphic. Consider , the chordless cycle graph with nodes, which we depict in Figure 7. Due to the symmetry of this graph, all nodes are in the same orbit, and are therefore automorphic: we report the set of all graph automorphisms in Table 4. Let us then consider pairs . The two pairs satisfy the premise in the definition of the automorphic node problem, as . One single ELPH message-passing layer () produces the following (exact) structural features. Pair : ; pair : . Thus, a one-layer ELPH model is such that if the readout layer . An MLP implementing the described is only required to nullify parts of its input and apply an identity mapping on the remaining ones. This MLP trivially exists (it can be even manually constructed). ∎

We now move to the counterpart of the above result for BUDDY models.

Proposition A.10.

Let be the family of BUDDY models as described per Equations 10 and pre-processed features , where estimates are exact (). does not suffer from the automorphic node problem.

Proof of Proposition a.10.

In order to prove the Proposition it is sufficient to notice that the readout in the above Proof completely neglects node features, while only replicating in output the structural features . This is also a valid BUDDY readout function, and allows BUDDY to output the same exact node-pair representations of the ELPH model described above. As we have observed, these are enough to disambiguate from graph depicted in Figure 7. This concludes the proof. ∎

In these results we have leveraged the assumption that estimates are exact. We observe that, if the MinHash and HyperLogLog estimators are unbiased, it would be possible to choose a sample size large enough to provide distinct estimates for the two counts of interest, so that the above results continue to hold. Unbiased cardinality estimators are, for instance, described in [11].

0 0 0 1 1 2 2 3 3 4 4 5 5
1 1 5 0 2 1 3 2 4 3 5 0 4
2 2 4 5 3 0 4 1 5 2 0 1 3
3 3 3 4 4 5 5 0 0 1 1 2 2
4 4 2 3 5 4 0 5 1 0 2 3 1
5 5 1 2 0 3 1 4 2 5 3 4 0
Table 4: Images, for each node, according to the automorphisms for graph . See Figure 7.
(0, 1) (0, 1) (0, 5) (0, 1) (1, 2) (1, 2) (2, 3) (2, 3) (3, 4) (3, 4) (4, 5) (0, 5) (4, 5)
(0, 2) (0, 2) (0, 4) (1, 5) (1, 3) (0, 2) (2, 4) (1, 3) (3, 5) (2, 4) (0, 4) (1, 5) (3, 5)
(0, 3) (0, 3) (0, 3) (1, 4) (1, 4) (2, 5) (2, 5) (0, 3) (0, 3) (1, 4) (1, 4) (2, 5) (2, 5)
(0, 4) (0, 4) (0, 2) (1, 3) (1, 5) (2, 4) (0, 2) (3, 5) (1, 3) (0, 4) (2, 4) (3, 5) (1, 5)
(0, 5) (0, 5) (0, 1) (1, 2) (0, 1) (2, 3) (1, 2) (3, 4) (2, 3) (4, 5) (3, 4) (4, 5) (0, 5)
(1, 2) (1, 2) (4, 5) (0, 5) (2, 3) (0, 1) (3, 4) (1, 2) (4, 5) (2, 3) (0, 5) (0, 1) (3, 4)
(1, 3) (1, 3) (3, 5) (0, 4) (2, 4) (1, 5) (3, 5) (0, 2) (0, 4) (1, 3) (1, 5) (0, 2) (2, 4)
(1, 4) (1, 4) (2, 5) (0, 3) (2, 5) (1, 4) (0, 3) (2, 5) (1, 4) (0, 3) (2, 5) (0, 3) (1, 4)
(1, 5) (1, 5) (1, 5) (0, 2) (0, 2) (1, 3) (1, 3) (2, 4) (2, 4) (3, 5) (3, 5) (0, 4) (0, 4)
(2, 3) (2, 3) (3, 4) (4, 5) (3, 4) (0, 5) (4, 5) (0, 1) (0, 5) (1, 2) (0, 1) (1, 2) (2, 3)
(2, 4) (2, 4) (2, 4) (3, 5) (3, 5) (0, 4) (0, 4) (1, 5) (1, 5) (0, 2) (0, 2) (1, 3) (1, 3)
(2, 5) (2, 5) (1, 4) (2, 5) (0, 3) (0, 3) (1, 4) (1, 4) (2, 5) (2, 5) (0, 3) (1, 4) (0, 3)
(3, 4) (3, 4) (2, 3) (3, 4) (4, 5) (4, 5) (0, 5) (0, 5) (0, 1) (0, 1) (1, 2) (2, 3) (1, 2)
(3, 5) (3, 5) (1, 3) (2, 4) (0, 4) (3, 5) (1, 5) (0, 4) (0, 2) (1, 5) (1, 3) (2, 4) (0, 2)
(4, 5) (4, 5) (1, 2) (2, 3) (0, 5) (3, 4) (0, 1) (4, 5) (1, 2) (0, 5) (2, 3) (3, 4) (0, 1)
Table 5: Images, for each node-pair, according to the induced node-pair automorphisms for graph . Pairs are considered as undirected (sets). See Figure 7. As it is possible to notice, there is no automorphism mapping pair to – hence the two are in distinct orbits.

A more precise assessment of the representational power of ELPH can be obtained by considering the node-pairs this model family is able to discriminate w.r.t. others.

Lemma A.11.

Let be the family of ELPH models (Equations 789, and 10), that of Message Passing Neural Networks (Equation 1). is more powerful than ().

Proof of Lemma a.11.

We prove this Lemma by noticing that the ELPH architecture generalizes that of an MPNN, so that an ELPH model can learn to simulate a standard MPNN by ignoring the structural features. Specifically, ELPH defaults to an MPNN with (1)


and (2)


This entails that, anytime a specific MPNN instance distinguishes between two non-automorphic node-pairs, there exists an ELPH model which does so as well: the one which exactly simulates such MPNN. ∎

Lemma A.12.

There exist node-pairs distinguished by an ELPH or BUDDY model (with exact cardinality estimates) which are not distinguished by any MPNN model.

Proof of Lemma a.12.

The Lemma is proved simply by considering non-automorphic pairs , from graph depicted in Figure 7. We have already shown how there exists both an ELPH and BUDDY model (computing exact estimates of , ) which separate the two. On the other hand, we have already observed how the nodes in the pairs are automorphic as all belonging to the same orbit. Thus, the two pairs cannot possibly be distinguished by any MPNN, since they suffer from the automorphic node problem [36, 51], as remarked in Proposition A.6. ∎

Proof of Theorem 4.2.

The Theorem follows directly from Lemmas A.11 and A.12. ∎

Does this expressiveness result also extend to BUDDY? While we have proved BUDDY does not suffer from the automorphic node problem, its non-parametric message-passing scheme is such that this family of models do not generally subsume MPNNs. On one hand, in view of Lemma A.12, we know there exist node-pairs separated by BUDDY but not by any MPNN; on the other, this does not exclude the presence of node-pairs for which the vice-versa is true.

Appendix B Additional Experimental Details

b.1 Datasets and Their Properties

Table 6 basic properties of the experimental datasets, together with the scaling of subgraph size with hops. Subgraph statistics are generated by expanding -hop subgraphs around 1000 randomly selected links. Regular graphs scale as

, however as these datasets are all complex networks the size of subgraphs grows far more rapidly than this, which poses serious problems for SGNNs. Furthermore, the size of subgraphs is highly irregular with high standard deviations making efficient parallelization in scalable architectures challenging. The one exception to this pattern is DDI. Due to the very high density of DDI, most two hop subgraphs include almost every node.

Cora Citeseer Pubmed Collab PPA DDI Citation2
#Nodes 2,708 3,327 18,717 235,868 576,289 4,267 2,927,963
#Edges 5,278 4,676 44,327 1,285,465 30,326,273 1,334,889 30,561,187
splits rand rand rand time throughput time protein
avg 3.9 2.74 4.5 5.45 52.62 312.84 10.44
avg 15.21 7.51 20.25 29.70 2769 97,344 109
1-hop size
2-hop size
Table 6:

Properties of link prediction benchmarks. Confidence intervals are

one standard deviation. Splits for the Planetoid datasets are random and Collab uses the fixed OGB splits. Where possible, baseline results for Collab are taken directly from the OGB leaderboard

b.2 Experimental Setup

In all cases, we use the largest connected component of the graph. LP tasks require links to play dual roles as both supervision labels and message passing links. For all datasets, at training time the message passing links are equal to the supervision links, while at test and validation time, disjoint sets of links are held out for supervision that are never seen at training time. The test supervision links are also never seen at validation time, but for the Planetoid and ogbl-collab555The OGB rules allow validation edges to be used for the ogbl-collab dataset. datasets, the message passing edges at test time are the union of the training message passing edges and the validation supervision edges. OGB datasets have fixed splits whereas for Planetoid, random 70-10-20 percent train-val-test splits were generated. On DDI, NBFnet was trained without learned node embeddings and with a batch size of 5.

b.3 Hyperparameters

The parameter used by HyperLogLog was 8 and the number of permutations used by MinHashing

was 128. All hyperparameters were tuned using Weights and Biases random search. The search space was over hidden dimension (64–512), learning rate (0.0001–0.01) and dropout (0–1), layers (1–3) and weight decay (0–0.001). Hyperparameters with the highest validation accuracy were chosen and results are reported on a test set that is used only once.

b.4 Implementation Details

Our code is implemented in PyTorch [31], using PyTorch geometric [12]. Code and instructions to reproduce the experiments will be made available following publication. We utilized either AWS p2 or p3 machines with 8 Tesla K80 and 8 Tesla V100 respectively to perform all the experiments in the paper.

Appendix C More on Structural Features

c.1 Heuristic Methods for Link Prediction

Heuristic methods are classified by the receptive field and whether they measure neighborhood similarity or path length. The simplest neighborhood similarity method is the 1-hop Common Neighbor (CN) count. Many other heuristics such as cosine similarity, the Jaccard index and the Probabilistic Mutual Information (PMI) differ from CN only by the choice of normalization. Adamic-Adar and Resource Allocation are two closely related second order heuristics that penalize neighbors by a function of the degree


where are the neighbors of . Shortest path based heuristics are generally more expensive to calculate than neighborhood similarity as they require global knowledge of the graph to compute exactly. The Katz index takes into account multiple paths between two nodes. Each path is given a weight of , where is a hyperparameter attenuation factor and is the path length. Similarly Personalized PageRank (PPR) estimates landing probabilities of a random walker from a single source node. For a survey comparing 20 different LP heuristics see e.g. [24].

c.2 Subgraph Generation Complexity

The complexity for generating regular -hop subgraph is , where is the degree of every node in the subgraph. For graphs that are approximately regular, with a well defined mean degree, the situation is slightly worse. However, in general we are interested in complex networks (such as social networks, recommendation systems or citation graphs) that are formed as a result of preferential attachment. Complex networks are typified by power law degree distributions of the form . As a result the mean is only well defined if

and the variance is finite only for

. As most real world complex networks fall into the range , we have that the maximum degree is only bounded by the number of edges in the graph and thus, so is the complexity of subgraph generation. Table 6 includes the average size of one thousand 1-hop and 2-hop randomly generated graphs for each dataset used in our experiments. In all cases, the size of subgraphs greatly exceeds the average degree baseline with very large variances.

c.3 Subgraph Sketches

We make use of both the HyperLogLog and MinHash sketching schemes. The former is used to estimate the size of the union, while the latter estimates the Jaccard similarity between two sets. Together they can be used to estimate the size of set intersections.

c.3.1 Hyperloglog

HyperLogLog efficiently estimates the cardinality of large sets. It accomplishes this by representing sets using a constant size data sketch. These sketches can be combined in time that is constant w.r.t the data size and linear in the sketch size using elementwise maximum to estimate the size of a set union.

The algorithm takes the precision as a parameter. From , it determines the number of registers to use. The sketch for a set is comprised of registers where . A hash function maps elements from into an array of 64-bits. The algorithm uses the first bits of the hash to associate elements with a register. From the remaining bits, it computes the number of leading zeros, tracking the maximum number of leading zeros per register. Intuitively a large number of leading zero bits is less likely and indicates a higher cardinality and for a single register the expected cardinality for a set where the maximum number of leading zeros is is . This estimate is highly noisy and there are several methods to combine estimates from different registers. We use hyperloglog++ [17]

, for which the standard error is numerically close to

for large enough .

Hyperloglog can be expressed in three functions Initialize, Union, and Card. Sketches are easily merged by populating a new set of registers with the element-wise max values for each register. To extract the estimate, the algorithm finds the harmonic mean of

for each of the the