## 1 Introduction

Graph neural networks (GNNs), inheriting the power of neural networks (hornik1989multilayer), have recently become the de facto standard for machine learning with graph-structured data (scarselli2008graph). While GNNs can easily outperform traditional algorithms on single-node tasks (such as node classification) and whole-graph tasks (such as graph classification), GNNs that predict over a set of nodes often achieve subpar performance. For example, for link prediction, GNN models such as GCN (kipf2016semi) and GAE (kipf2016variational) may perform even worse than simple heuristics such as common neighbors and Adamic-Adar (liben2007link) (see the performance comparison over the networks Collab and PPA in the Open Graph Benchmark (OGB) (hu2020open)). Similar issues appear widely in node-set-based tasks such as network motif prediction (liu2022neural; besta2021motif), motif counting (zhengdao2020can), relation prediction (wang2021relational; teru2020inductive) and temporal interaction prediction (wang2021inductive), which poses a big concern for applying GNNs to these relevant real-world applications.

The above failure is essentially induced by the loss of node identities during the intrinsic computation of GNNs. Nodes that get matched under graph automorphism will be associated with the same representation by GNNs and are thus indistinguishable (see Fig. 1(i)). A naive way to solve this problem is to pair GNNs with one-hot encodings of node indices as extra node features. However, this violates the fundamental inductive bias that GNNs are designed for, i.e., permutation equivariance, and thus may lead to poor generalization capability: the obtained GNNs are neither transferable (inductive) across different node sets and different graphs nor stable to network perturbation.

Many works have recently been proposed to address this issue. The key idea is to use augmented node features, where either random features (RF) or deterministic distance encoding (DE) can be adopted. Interested readers may refer to the book chapter (GNNBook-ch5-li) for a detailed discussion; here we give a brief review. RF by nature distinguishes nodes and guarantees permutation equivariance as long as the distribution used to generate RF is kept invariant across the nodes. Although GNNs paired with RF have been proven to be more expressive (murphy2019relational; sato2021random), the training procedure is often hard to converge and the predictions are noisy and inaccurate due to the injected randomness (abboud2020surprising). On the other hand, DE defines extra features by using the distance from a node to the node set where the prediction is to be made (li2020distance). This technique is theoretically sound and empirically performs well (zhang2018link; li2020distance; zhang2020revisiting). However, it introduces huge memory and time costs, because DE is specific to each node-set sample, so no intermediate computational results, e.g., the node representations in the canonical GNN pipeline, can be shared across different samples.
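To make the per-sample nature of DE concrete, here is a minimal sketch in which each node's extra feature is simply its shortest-path distance to the target node set (the helper name is hypothetical, and the actual encodings in the cited works are richer, e.g., distance tuples or landing probabilities):

```python
from collections import deque

def distance_encoding(adj, target_set):
    """BFS shortest-path distance from every node to the target node set.

    adj: dict mapping node -> list of neighbors; target_set: iterable of nodes.
    Unreachable nodes keep distance -1 (one common convention).
    """
    dist = {v: -1 for v in adj}
    q = deque()
    for t in target_set:
        dist[t] = 0
        q.append(t)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if dist[w] == -1:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

# Tiny path graph 0-1-2-3; predicting over the node set {0, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert distance_encoding(adj, {0, 3}) == {0: 0, 1: 1, 2: 1, 3: 0}
```

Because the target set changes with every predicted node set, these features must be recomputed per sample, which is exactly the overhead discussed above.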

To alleviate the computational cost of DE, the absolute positions of nodes in the graph may be used as extra features. We call this technique positional encoding (PE). PE may approximate DE by measuring the distances between positional features, and positional features can be shared across different node-set samples. However, the fundamental challenge is how to guarantee that GNNs trained with PE remain permutation equivariant and stable. Using the idea of RF, previous works randomize PE to pursue permutation equivariance. Specifically, you2019position designs PE as the distances from a node to some randomly selected anchor nodes. However, this approach suffers from slow convergence and achieves merely subpar performance. srinivasan2020equivalence states that PE using the eigenvectors of the randomly permuted graph Laplacian matrix keeps permutation equivariance. dwivedi2020generalization; kreuzer2021rethinking argue that such eigenvectors are unique up to their signs and thus propose PE that randomly perturbs the signs of those eigenvectors. Unfortunately, these methods are problematic: they fail to provide permutation equivariant GNNs when the Laplacian matrix has multiple (repeated) eigenvalues, which makes them risky to apply to many practical networks. For example, large social networks, when not connected, have multiple 0 eigenvalues; small molecule networks often have non-trivial automorphisms that may give multiple eigenvalues. Even if the eigenvalues are distinct, these methods are unstable: we prove that the sensitivity of the node representations to graph perturbation depends on the inverse of the smallest gap between two consecutive eigenvalues, which can actually be large when two eigenvalues are close (Lemma 3.4).

In this work, we propose a principled way of using PE to build more powerful GNNs. *The key idea is to use separate channels to update the original node features and the positional features. The GNN architecture keeps not only permutation equivariance w.r.t. node features but also rotation equivariance w.r.t. positional features.* This idea applies to a broad range of PE techniques that can be formulated as matrix factorization (qiu2018network), such as Laplacian Eigenmap (LE) (belkin2003laplacian) and Deepwalk (perozzi2014deepwalk). We design a GNN layer, PEG, that satisfies these requirements. PEG is provably stable: in particular, we prove that when LE is adopted as the PE, the sensitivity of the node representations learnt by PEG depends only on the eigengap between the largest eigenvalue used by LE and the next eigenvalue of the graph Laplacian, instead of the smallest gap between any two consecutive eigenvalues, which is what previous works depend on.

We evaluate PEG on the most important node-set-based task, link prediction, over 8 real-world networks. PEG achieves performance comparable to strong baselines based on DE while having much lower training and inference complexity, and it achieves significantly better performance than the other baselines that do not use DE. These performance gaps get further enlarged when we conduct domain-shift link prediction, where the networks used for training and testing come from different domains, which effectively demonstrates the strong generalization and transferability of PEG.

### 1.1 Other related works

As long as GNNs can be explained by a node-feature-refinement procedure (hamilton2017inductive; gilmer2017neural; morris2019weisfeiler; velivckovic2018graph; klicpera2019predict; chien2021adaptive), they suffer from the aforementioned node-ambiguity issue. Some GNNs cannot be explained as node-feature refinement, as they directly track node-set representations (maron2018invariant; morris2019weisfeiler; chen2019equivalence; maron2019provably). However, their complexity is high, as they need to compute the representation of each node set of a certain size, and they have only been studied for graph-level tasks. EGNN (satorras2021n) seems to be a relevant work, as it studies the setting where nodes have physical coordinates given a priori. However, it provides no analysis of PE.

A few works have studied the stability of GNNs using tools such as the graph scattering transform (for graph-level representations) (gama2019diffusion; gama2019stability) and graph signal filtering (for node-level representations) (levie2021transferability; ruiz2020graphon; ruiz2021graph; gama2020stability; nilsson2020experimental). They all focus on justifying the stability of the canonical GNN pipeline, graph convolution layers in particular. None of them considers positional features, let alone the stability of GNNs using PE.

## 2 Notations and Preliminaries

In this section, we prepare the notations and preliminaries that are useful later. First, we define graphs.

###### Definition 2.1 (Graph).

Unless specified, we always consider undirected graphs of $N$ nodes and let $[N] = \{1, 2, \dots, N\}$. One such graph can be denoted as $\mathcal{G} = (A, X)$, where $A \in \{0,1\}^{N \times N}$ is the adjacency matrix and $X \in \mathbb{R}^{N \times F}$ denotes the node features, whose $v$th row, $X_v$, is the feature vector of node $v$. A graph may have self-loops, i.e., $A$ may have nonzero diagonals. Denote by $D$ the diagonal degree matrix where $D_{vv} = \sum_{u \in [N]} A_{uv}$ for $v \in [N]$. Denote the normalized adjacency matrix as $\hat{A} = D^{-1/2} A D^{-1/2}$ and the normalized Laplacian matrix as $L = I - \hat{A}$.
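The normalized matrices above are easy to check numerically; a minimal NumPy sketch (assuming no isolated nodes, which would make $D^{-1/2}$ undefined):

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has degree > 0."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt

# Triangle graph: L is symmetric PSD, with eigenvalues 0, 1.5, 1.5.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
L = normalized_laplacian(A)
assert np.allclose(L, L.T)
assert np.allclose(np.linalg.eigvalsh(L), [0.0, 1.5, 1.5], atol=1e-8)
```

Note that the smallest eigenvalue of $L$ is always 0 (with multiplicity equal to the number of connected components), which is why disconnected graphs necessarily have repeated eigenvalues.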

###### Definition 2.2 (Permutation).

An $N$-dim permutation matrix $P$ is a matrix in $\{0,1\}^{N \times N}$ where each row and each column has only a single 1. All such matrices are collected in $\Pi(N)$, simplified as $\Pi$.

We denote the vector $2$-norm as $\|\cdot\|_2$, the Frobenius norm as $\|\cdot\|_F$ and the operator norm as $\|\cdot\|_{\mathrm{op}}$.

###### Definition 2.3 (Graph-matching).

Given two graphs $\mathcal{G}^{(i)} = (A^{(i)}, X^{(i)})$ and their normalized Laplacian matrices $L^{(i)}$ for $i = 1, 2$, their matching can be denoted by a permutation matrix $P^* \in \arg\min_{P \in \Pi} \|L^{(1)} - P L^{(2)} P^\top\|_F + \|X^{(1)} - P X^{(2)}\|_F$ that best aligns the graph structures and the node features.

Using $L^{(i)}$ instead of $A^{(i)}$ to represent the graph structures is for notational simplicity. Actually, for an unweighted graph, there is a bijective mapping between $A$ and $L$, so one can rewrite the first term using the $A^{(i)}$'s. Later, we use $P^*$ without specifying $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ if there is no confusion. The distance between the two graphs can be defined as $d(\mathcal{G}^{(1)}, \mathcal{G}^{(2)}) = \|L^{(1)} - P^* L^{(2)} P^{*\top}\|_F + \|X^{(1)} - P^* X^{(2)}\|_F$.

Next, we review eigenvalue decomposition and summarize arguments on its uniqueness in Lemma 2.6.

###### Definition 2.4 (Eigenvalue Decomposition (EVD)).

For a positive semidefinite (PSD) matrix $L \in \mathbb{R}^{N \times N}$, it has an eigenvalue decomposition $L = U \Lambda U^\top$, where $\Lambda$ is a real diagonal matrix with the eigenvalues of $L$, $\lambda_1 \le \lambda_2 \le \dots \le \lambda_N$, as its diagonal components, and $U = [u_1, u_2, \dots, u_N]$ is an orthogonal matrix whose $i$th column $u_i$ is the $i$th eigenvector, i.e., $L u_i = \lambda_i u_i$.

###### Definition 2.5 (Orthogonal Group in the Euclidean space).

$O(p)$ includes all $p$-by-$p$ orthogonal matrices. A subgroup of $O(p)$ consists of all the diagonal matrices with $\pm 1$ as the diagonal components.

###### Lemma 2.6.

EVD is not unique. If all the eigenvalues are distinct, i.e., $\lambda_1 < \lambda_2 < \dots < \lambda_N$, then $U$ is unique up to the signs of its columns, i.e., replacing any $u_i$ by $-u_i$ also gives an EVD. If there are multiple eigenvalues, say $\lambda_i = \lambda_{i+1} = \dots = \lambda_{i+k-1}$, then the corresponding eigenvectors $[u_i, u_{i+1}, \dots, u_{i+k-1}]$ lie in an orbit induced by the orthogonal group $O(k)$, i.e., replacing $[u_i, \dots, u_{i+k-1}]$ by $[u_i, \dots, u_{i+k-1}] Q$ for any $Q \in O(k)$ while keeping the eigenvalues and the other eigenvectors unchanged also gives an EVD.
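Both kinds of non-uniqueness in Lemma 2.6 can be verified directly; a small NumPy sketch on a matrix with a repeated eigenvalue:

```python
import numpy as np

# A PSD matrix with a repeated eigenvalue (1, with multiplicity 2).
M = np.diag([0.0, 1.0, 1.0])
w, U = np.linalg.eigh(M)        # eigenvalues sorted ascending: [0, 1, 1]

# Sign flips always preserve the EVD: S diagonal with +/-1 entries.
S = np.diag([1.0, -1.0, 1.0])
assert np.allclose((U @ S) @ np.diag(w) @ (U @ S).T, M)

# For the repeated eigenvalue, ANY rotation Q of its 2-dim eigenspace
# (here spanned by the last two eigenvectors) also preserves the EVD.
theta = 0.3
Q = np.eye(3)
Q[1:, 1:] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
assert np.allclose((U @ Q) @ np.diag(w) @ (U @ Q).T, M)
```

This is exactly why sign-flip randomization alone cannot cover the ambiguity of LE when eigenvalues repeat: the orbit is a continuous group, not a finite set of sign patterns.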

Next, we define Positional Encoding (PE), which associates each node with a vector in a metric space where the distance between two vectors can represent the distance between the nodes in the graph.

###### Definition 2.7 (Positional Encoding).

Given a graph $\mathcal{G} = (A, X)$, PE works on the graph structure and gives $Z = \mathrm{PE}(A) \in \mathbb{R}^{N \times p}$, where each row $Z_v$ gives the positional feature of node $v$.

The absolute values given by PE may not be useful, while the distances between the positional features are more relevant. So, we define PE-matching, which allows rotations to best match positional features and further defines the distance between two collections of positional features.

###### Definition 2.8 (PE-matching).

Consider two groups of positional features $Z^{(1)}, Z^{(2)} \in \mathbb{R}^{N \times p}$. Their matching is given by $Q^* \in \arg\min_{Q \in O(p)} \|Z^{(1)} - Z^{(2)} Q\|_F$. Later, $Q^*$ is used without specifying $Z^{(1)}, Z^{(2)}$ if it causes no confusion. Define the distance between them as $d(Z^{(1)}, Z^{(2)}) = \|Z^{(1)} - Z^{(2)} Q^*\|_F$.
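The minimization over $O(p)$ in Def. 2.8 is the classical orthogonal Procrustes problem and has a closed-form solution via the SVD; a minimal sketch (the helper names are ours):

```python
import numpy as np

def pe_match(Z1, Z2):
    """Best orthogonal Q aligning Z2 to Z1: argmin_{Q in O(p)} ||Z1 - Z2 Q||_F.
    Closed-form orthogonal Procrustes solution: Q = U V^T from SVD(Z2^T Z1)."""
    U, _, Vt = np.linalg.svd(Z2.T @ Z1)
    return U @ Vt

def pe_distance(Z1, Z2):
    Q = pe_match(Z1, Z2)
    return np.linalg.norm(Z1 - Z2 @ Q)

rng = np.random.default_rng(0)
Z1 = rng.standard_normal((5, 2))
# Z2 is Z1 rotated by 90 degrees: the PE-distance is (numerically) zero.
R = np.array([[0.0, -1.0], [1.0, 0.0]])
Z2 = Z1 @ R
assert pe_distance(Z1, Z2) < 1e-8
```

In other words, two sets of positional features that differ only by a rotation are considered identical, which is the invariance the rest of the paper builds on.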

## 3 Equivariant and Stable Positional Encoding for GNN

In this section, we study the equivariance and stability of a GNN layer that uses PE.

### 3.1 Our Goal: Building Permutation Equivariant and Stable GNN layers

A GNN layer is expected to be permutation equivariant and stable. These two properties, if satisfied, guarantee that the GNN layer is transferable and has better generalization performance. Permutation equivariance implies that model predictions should be irrelevant to how one indexes the nodes, which captures the fundamental inductive bias of many graph learning problems: A permutation equivariant layer can make the same prediction over a new testing graph as that over a training graph if the two graphs match each other perfectly, i.e. the distance between them is 0. Stability is actually an even stronger condition than permutation equivariance because it characterizes how much gap between the predictions of the two graphs is expected when they do not perfectly match each other.

Specifically, given a graph with node features $(A, X)$, we consider a GNN layer that updates the node features, denoted as $X' = \mathrm{GNN}(A, X)$. We define permutation equivariance and stability as follows.

###### Definition 3.1 (Permutation Equivariance).

A GNN layer is permutation equivariant if, for any $P \in \Pi$ and any graph $(A, X)$, $\mathrm{GNN}(P A P^\top, P X) = P\, \mathrm{GNN}(A, X)$.

###### Definition 3.2 (Stability).

A GNN layer is claimed to be stable if there exists a constant $C > 0$ such that for any two graphs $\mathcal{G}^{(1)} = (A^{(1)}, X^{(1)})$ and $\mathcal{G}^{(2)} = (A^{(2)}, X^{(2)})$, letting $P^*$ denote their matching, the layer satisfies $\|\mathrm{GNN}(A^{(1)}, X^{(1)}) - P^*\, \mathrm{GNN}(A^{(2)}, X^{(2)})\|_F \le C\, d(\mathcal{G}^{(1)}, \mathcal{G}^{(2)})$. By setting $A^{(1)} = P A^{(2)} P^\top$ and $X^{(1)} = P X^{(2)}$ for some $P \in \Pi$, the RHS becomes zero, so a stable GNN layer makes the LHS zero too. Hence, stability is a stronger condition than permutation equivariance.

Our goal is to guarantee that the GNN layer that utilizes PE is permutation equivariant and stable. To distinguish it from a GNN layer that does not use PE, we overload the notation and write $(X', Z') = \mathrm{GNN}(A, X, Z)$ for a GNN layer that uses PE, which takes the positional features $Z$ as an additional input and may update both the node features and the positional features. Now, we may define PE-equivariance and PE-stability for such a GNN layer.

###### Definition 3.3 (PE-stability & PE-equivariance).

Consider a GNN layer that uses PE. When it works on any two graphs $\mathcal{G}^{(1)}, \mathcal{G}^{(2)}$ and gives $(X'^{(i)}, Z'^{(i)}) = \mathrm{GNN}(A^{(i)}, X^{(i)}, Z^{(i)})$ for $i = 1, 2$, let $P^*$ be the matching between the two graphs. The layer is PE-stable if for some constant $C > 0$ we have

$$\|X'^{(1)} - P^* X'^{(2)}\|_F + d(Z'^{(1)}, P^* Z'^{(2)}) \;\le\; C\, d(\mathcal{G}^{(1)}, \mathcal{G}^{(2)}). \qquad (1)$$

Recall that $d(\cdot,\cdot)$ on positional features measures the distance between two sets of positional features as defined in Def. 2.8. Similarly to the above, a weaker condition than PE-stability is PE-equivariance: if $A^{(1)} = P A^{(2)} P^\top$ and $X^{(1)} = P X^{(2)}$ for some $P \in \Pi$, we expect a perfect match between the updated node features and positional features, i.e., $X'^{(1)} = P^* X'^{(2)}$ and $d(Z'^{(1)}, P^* Z'^{(2)}) = 0$.

Note that previous works also consider layers that update only the node features, i.e., $X' = \mathrm{GNN}(A, X, Z)$. In this case, PE-stability can be measured by removing the second term, the one on $Z'$, from Eq. 1.

### 3.2 PE-stable GNN layers based on Laplacian Eigenmap as Positional Encoding

*Figure 2: The ratio between the largest inverse eigengap among the smallest eigenvalues (Lemma 3.4) and the inverse eigengap between the $p$th and $(p+1)$th eigenvalues (Lemma 3.5).* Over Citeseer, the ratio is extremely large because there are multiple eigenvalues (it remains finite only due to the numerical approximation of the eigenvalues). Over Twitch (PT), even though there are no multiple eigenvalues, the ratio is mostly larger than 10.

To study the requirement on the GNN layer that achieves PE-stability, let us start from a particular PE technique, Laplacian eigenmap (LE) (belkin2003laplacian), to show the key insight behind our design. Later, we generalize the idea to other PE techniques. Let $Z = \mathrm{LE}(A) \in \mathbb{R}^{N \times p}$ denote LE, which consists of the eigenvectors that correspond to the $p$ smallest eigenvalues of the normalized Laplacian matrix $L$.

Previous works failed to design PE-equivariant or PE-stable GNN layers. srinivasan2020equivalence claims that if a GNN layer that does not rely on PE is permutation equivariant, PE-equivariance may be kept by adding $\mathrm{MLP}(Z)$ to the node features, where the MLP adjusts the dimension properly. This statement is problematic when $Z$ is not unique given the graph structure $A$. Specifically, though sharing the graph structure $A$, if different implementations lead to different LEs, the updated node features differ, which violates PE-equivariance. srinivasan2020equivalence suggests using graph permutation augmentation to address the issue, which makes assumptions on an invariant distribution of $Z$ and empirically does not work well. dwivedi2020generalization; kreuzer2021rethinking claim the uniqueness of LE up to the signs of the eigenvectors and suggest building a GNN layer that uses a randomly sign-perturbed $Z$, where the signs are uniformly sampled. They expect PE-equivariance in the sense of expectation. However, this statement is generally wrong because it depends on the condition that *all eigenvalues are distinct*, as stated in Lemma 2.6. Actually, for multiple eigenvalues, there are infinitely many eigenvectors that lie in the orbit induced by the orthogonal group. Many real graphs have multiple eigenvalues, such as disconnected graphs or graphs with some non-trivial automorphism. One may argue that the methods work when the eigenvalues are all distinct; however, the above failure may further yield PE-instability even when the eigenvalues are distinct but have small gaps, due to the following lemma.

###### Lemma 3.4.

For any PSD matrix $L$ without multiple eigenvalues, set the positional encoding $Z(L)$ as the eigenvectors given by the $p$ smallest eigenvalues, sorted as $\lambda_1 < \lambda_2 < \dots < \lambda_{p+1}$, of $L$. For any sufficiently small $\epsilon > 0$, there exists a perturbation $\Delta L$, $\|\Delta L\|_F \le \epsilon$, such that

$$\min_{S \in \{\pm 1\}^{p}} \|Z(L) - Z(L + \Delta L)\, S\|_F \;\ge\; c \min\Big\{1,\; \epsilon \max_{1 \le i \le p} \frac{1}{\lambda_{i+1} - \lambda_i}\Big\} \qquad (2)$$

for some absolute constant $c > 0$, where $S$ ranges over the diagonal $\pm 1$ matrices.

Lemma 3.4 implies that a small perturbation of the graph structure may yield a big change of the eigenvectors if there is a small eigengap. Consider two graphs $\mathcal{G}^{(1)}$ and $\mathcal{G}^{(2)}$ whose Laplacians are $L$ and $L + \Delta L$, i.e., $L$ with a small perturbation. The perturbation is small enough so that the matching $P^* = I$. However, the change of PE, even after removing the effect of changing signs, could be dominated by the largest inverse eigengap among the first $p+1$ eigenvalues, $\max_{1 \le i \le p} 1/(\lambda_{i+1} - \lambda_i)$. In practice, it is hard to guarantee that all these eigenpairs have large gaps, especially when a large $p$ is used to locate each node more accurately. Plugging this PE into a GNN layer gives the updated node features a large change and thus violates PE-stability.

**PE-stable GNN layers.** Although a particular eigenvector may be unstable, the eigenspace, i.e., the space spanned by the columns of $Z$, could be much more stable. This motivates our following design of PE-stable GNN layers. Formally, we use the following lemma to characterize the distance between the eigenspaces spanned by the LEs of two graph Laplacians. The error is controlled by the inverse eigengap $1/(\lambda_{p+1} - \lambda_p)$ between the $p$th and $(p+1)$th eigenvalues, which by properly setting $p$ is typically much smaller than $\max_{1 \le i \le p} 1/(\lambda_{i+1} - \lambda_i)$ in Lemma 3.4. We compute the ratio between these two values over some real-world graphs, as shown in Fig. 2.

###### Lemma 3.5.

For two PSD matrices $L^{(1)}, L^{(2)}$, set $Z^{(i)}$, $i = 1, 2$, as the eigenvectors given by the $p$ smallest eigenvalues of $L^{(i)}$. Suppose $L^{(1)}$ has eigenvalues $\lambda_1 \le \dots \le \lambda_N$ and $\lambda_{p+1} - \lambda_p = \delta > 0$. Then, for any permutation matrix $P$,

$$d(Z^{(1)}, P Z^{(2)}) \;\le\; \frac{c}{\delta}\, \|L^{(1)} - P L^{(2)} P^\top\|_F \qquad (3)$$

for some absolute constant $c > 0$.
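The contrast between unstable individual eigenvectors and a stable eigenspace can be checked numerically; a synthetic NumPy sketch (our own toy matrices, not from the paper):

```python
import numpy as np

# Two nearly-equal small eigenvalues (gap 1e-6), but a large gap (~2) to the third.
M = np.diag([0.0, 1e-6, 2.0])
E = np.zeros((3, 3))
E[0, 1] = E[1, 0] = 1e-4   # couples the two near-degenerate directions
E[1, 2] = E[2, 1] = 1e-4   # weakly couples the eigenspace to the rest
w1, U1 = np.linalg.eigh(M)
w2, U2 = np.linalg.eigh(M + E)

# An individual eigenvector swings by an O(1) amount, even after fixing its sign:
# the tiny perturbation fully mixes the two near-degenerate directions.
v1, v2 = U1[:, 0], U2[:, 0]
flip = 1.0 if v1 @ v2 >= 0 else -1.0
assert np.linalg.norm(v1 - flip * v2) > 0.5          # unstable (Lemma 3.4 regime)

# The 2-dim eigenspace barely moves: compare the orthogonal projectors onto
# the spans of the first two eigenvectors.
P1 = U1[:, :2] @ U1[:, :2].T
P2 = U2[:, :2] @ U2[:, :2].T
assert np.linalg.norm(P1 - P2) < 1e-3                # stable (Lemma 3.5 regime)
```

The first assertion illustrates the inverse-min-eigengap blow-up, while the second reflects that the subspace error is governed only by the (large) gap to the excluded eigenvalues.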

Inspired by the stability of the eigenspace, the idea to achieve PE-stability is to make the GNN layer invariant to the selection of bases of the eigenspace for the positional features. So, our proposed PE-stable GNN layer that uses PE should satisfy two necessary conditions: 1) Permutation equivariance w.r.t. all features; 2) Rotation equivariance w.r.t. positional features, i.e.,

$$\text{PE-stable layer cond. 1:}\quad \mathrm{GNN}(P A P^\top,\, P X,\, P Z) = (P X',\, P Z') \quad \text{for any } P \in \Pi, \qquad (4)$$

$$\text{PE-stable layer cond. 2:}\quad \mathrm{GNN}(A,\, X,\, Z Q) = (X',\, Z' Q) \quad \text{for any } Q \in O(p), \qquad (5)$$

where $(X', Z') = \mathrm{GNN}(A, X, Z)$.

Rotation equivariance makes the layer depend on the eigenspace instead of a particular selection of eigenvectors and thus achieves much better stability. Interestingly, these requirements can be satisfied by the recently proposed EGNN (satorras2021n), as one can view the physical coordinates of objects considered by EGNN as positional features. EGNN is briefly reviewed in Appendix F. Thm. 3.6 proves PE-equivariance under the conditions in Eqs. 4 and 5.

###### Theorem 3.6.

A GNN layer that satisfies Eqs. 4 and 5 is PE-equivariant.

Note that satisfying Eqs. 4 and 5 is insufficient to guarantee PE-stability, which further depends on the specific form of the GNN layer.

We implement such a layer in our model PEG with a further simplification that already achieves good empirical performance: use a GCN layer whose edge weights are set according to the distance between the positional features of the end nodes of each edge, and keep the positional features unchanged. This gives the PEG layer,

$$[X', Z'] = \mathrm{PEG}(A, X, Z) = \Big[\psi\big[(\hat{A} \odot \Xi)\, X W\big],\; Z\Big], \quad \text{where } \Xi_{uv} = \phi(\|Z_u - Z_v\|_2),\ u, v \in [N]. \qquad (6)$$

Here, $\psi$ is an element-wise activation function, $\phi$ is an MLP mapping from $\mathbb{R}$ to $\mathbb{R}$, $W$ is a learnable weight matrix, and $\odot$ is the Hadamard product. Note that if $\hat{A}$ is sparse, $\Xi_{uv}$ only needs to be computed for the edges $(u, v)$. We may also prove that the PEG layer satisfies the even stronger condition, PE-stability.

###### Theorem 3.7.

Under proper boundedness conditions on $\psi$, $\phi$ and $W$, the PEG layer in Eq. 6 is PE-stable.

Note that to achieve PE-stability, we need to normalize the initial node features to keep them bounded, and to control the constants associated with $\phi$ and $W$. The condition on $\psi$ is typically satisfied in practice, e.g., by setting $\psi$ as ReLU. Here, the most important term is the inverse eigengap $1/(\lambda_{p+1} - \lambda_p)$: PE-stability may only be achieved when there is an eigengap between $\lambda_p$ and $\lambda_{p+1}$, and the larger the eigengap, the more stable the layer. This observation may also be instructive for selecting $p$ in practice. As previous works may encounter a smaller eigengap (Lemma 3.4), their models will generally be more unstable. Also, the simplified form of Eq. 6 is *not necessary* to achieve PE-stability. However, as it has already given consistently better empirical performance than the baselines, we choose to keep using it.

### 3.3 Generalization to Other PE Techniques: DeepWalk and LINE

It is well known that LE, which computes the eigenvectors of the $p$ smallest eigenvalues, can be written in a low-rank matrix optimization form: $\min_{Z \in \mathbb{R}^{N \times p}} \mathrm{tr}(Z^\top L Z)$, s.t. $Z^\top Z = I$. Other PE techniques, such as Deepwalk (perozzi2014deepwalk), Node2vec (grover2016node2vec) and LINE (tang2015line), can be unified into a similar optimization form, where the positional features $Z$ are given by matrix factorization $\widehat{M} \approx W Z^\top$ ($\widehat{M}$ may be asymmetric, so $W$ may not equal $Z$) and $\widehat{M}$ satisfies

(7) *[the general rank-constrained objective, built from the quantities $\sigma$, $f(A)$ and $b$ described next; its precise form is discussed in Appendix G]*

Here, $f$ typically revises $A$ by combining degree normalization and a power series of $A$, $b$ corresponds to edge negative sampling that may be related to node degrees, and $\sigma$ is a component-wise log-sigmoid function; e.g., in LINE, $f$ and $b$ take specific forms involving degree normalization and the all-one vector. More discussion on Eq. 7 and the forms of $f$ and $b$ for other PE techniques is given in Appendix G.

According to the above optimization formulation of PE, all these PE techniques generate positional features based on matrix factorization, and the features are thus not unique: $Z$ always lies in an orbit induced by $O(p)$, i.e., if $Z$ solves Eq. 7, then $Z Q$ for any $Q \in O(p)$ solves it too. A GNN layer still needs to satisfy Eqs. 4 and 5 to guarantee PE-equivariance; PE-stability actually asks for even more.

###### Theorem 3.8.

Consider a general PE technique with an optimization objective as in Eq. 7 that computes the optimal solution $\widehat{M}$ and decomposes $\widehat{M} = W Z^\top$ to get the positional features $Z$; essentially, $Z$ consists of the right-singular vectors of $\widehat{M}$. If Eq. 7 has a unique solution $\widehat{M}$, and $f$ and $b$ therein are permutation equivariant, i.e., they commute with any node re-indexing of $A$, then a GNN layer that satisfies Eqs. 4 and 5 is PE-equivariant.

Note that the conditions on $f$ and $b$ are generally satisfied by Deepwalk, Node2vec and LINE. However, obtaining the optimal solution of Eq. 7 may not be guaranteed in practice because the low-rank constraint is non-convex. One may consider relaxing the low-rank constraint into a nuclear-norm constraint for some threshold (recht2010guaranteed), which reduces Eq. 7 to a convex optimization problem and thus satisfies the conditions in Thm. 3.8. Empirically, this relaxation and the step of computing the SVD seem unnecessary according to our experiments. The PE-stability is related to the value of the smallest non-zero singular value of $\widehat{M}$. We leave the full characterization of PE-equivariance and PE-stability for general PE techniques for future study.

## 4 Experiments

In this work, we use the most important node-set-based task, link prediction, to evaluate PEG, though it may apply to more general tasks. Two types of link prediction tasks are investigated: traditional link prediction (Task 1) and domain-shift link prediction (Task 2). In Task 1, the model is trained, validated and tested over the same graph while using disjoint link sets. In Task 2, the graph used for training/validation is different from the one used for testing. Both tasks reflect the effectiveness of a model, while Task 2 better demonstrates a model's generalization capability, which strongly depends on permutation equivariance and stability. All results are based on 10 random runs.

### 4.1 The Experimental Pipeline of PEG for Link Prediction

We use PEG to build GNNs. The pipeline contains three steps. First, we adopt certain PE techniques to compute the positional features $Z$. Second, we stack PEG layers according to Eq. 6. Third, for link prediction over a node pair, we concatenate the final representations of the two end nodes and adopt an MLP to make the final prediction. In the experiments, we test LE and Deepwalk as PE and name the models PEG-LE and PEG-DW respectively. To verify the wide applicability of our theory, we also apply PEG to GraphSAGE (hamilton2017inductive) by similarly using the distance between PEs to control the neighbor aggregation according to Eq. 6, and term the corresponding models PE-SAGE-LE and PE-SAGE-DW respectively.

For some graphs, especially small ones where the union of the link sets for training, validation and testing covers the entire graph, the model may overfit the positional features and suffer from a large generalization gap. This is because the links used as labels to supervise the model training are also used to generate the positional features. To avoid this issue, we consider a more careful way to use the training links. We adopt a 10-fold partition of the training set. For each epoch, we periodically pick one fold of the links to supervise the model while using the remaining links to compute the positional features. Note that the PEs for the different folds can be pre-computed by removing each fold of links in turn, which reduces the computational overhead. In practice, the testing stage often corresponds to online serving and has stricter time constraints; such a partition is not needed there, so there is no computational overhead for testing. We term the models trained in this way PEG-LE+ and PEG-DW+ respectively.
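The fold-based training scheme can be sketched as follows (the helper names are ours; in the real pipeline, each view's PE is pre-computed on the graph formed by its non-supervision edges):

```python
import random

def edge_folds(train_edges, k=10, seed=0):
    """Partition the training edges into k folds."""
    edges = list(train_edges)
    random.Random(seed).shuffle(edges)
    return [edges[i::k] for i in range(k)]

def epoch_views(folds):
    """Yield (supervision edges, edges used to compute PE) per view.
    Each fold supervises the loss once while PE comes from the rest."""
    for i, supervision in enumerate(folds):
        pe_edges = [e for j, f in enumerate(folds) if j != i for e in f]
        yield supervision, pe_edges   # PE for each view can be pre-computed

# Toy edge list: the complete graph on 5 nodes (10 edges), split into 5 folds.
edges = [(u, v) for u in range(5) for v in range(u + 1, 5)]
folds = edge_folds(edges, k=5)
sup, pe = next(iter(epoch_views(folds)))
assert len(sup) + len(pe) == len(edges)   # supervision and PE edges are disjoint
```

During training, the epochs cycle through the `k` views, so every link supervises the model while never leaking into its own positional features.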

### 4.2 Task 1 — Traditional link prediction

Datasets.

We use eight real graphs: Cora, CiteSeer and Pubmed (sen2008collective), Twitch (RU), Twitch (PT) and Chameleon (rozemberczki2021multi), and DDI and COLLAB (hu2021ogb). Over the first six graphs, we split the link set into 85%/5%/10% for training/validation/testing as positive examples and pair them with the same numbers of negative examples (missing edges). For the two OGB graphs, we adopt the dataset splits in (hu2020open). The links for validation and testing are removed during the training stage, and the links for validation are also not used in the testing stage. All models are trained until the loss converges, and the models with the best validation performance are used for testing.

Baselines.
We choose 6 baselines: *VGAE* (kipf2016variational), *P-GNN* (you2019position), *SEAL* (zhang2018link), *GNN Trans.* (dwivedi2020generalization), *LE* (belkin2003laplacian) and *Deepwalk (DW)* (perozzi2014deepwalk).
VGAE is a variational version of GAE (kipf2016variational) that utilizes a GCN to encode both the structural information and the node features, and then decodes the node embeddings to reconstruct the graph adjacency matrix. SEAL is particularly designed for link prediction by using enclosing-subgraph representations of target node pairs. P-GNN randomly chooses some anchor nodes and aggregates only from these anchor nodes. GNN Trans. adopts LE as PE, merges it into the node features, and utilizes an attention mechanism to aggregate information from neighboring nodes. For VGAE, P-GNN and GNN Trans., the inner product of two node embeddings is adopted to represent links. LE and DW are two network embedding methods, where the obtained node embeddings are already positional features and are directly used to predict links.

Implementation details. For VGAE, we consider six types of features: (1) node features (N.): the original feature of each node; (2) constant features (C.): node degree; (3) positional features (P.): PE extracted by Deepwalk; (4) one-hot features (O.): one-hot encodings of node indices; (5) random features (R.): random values; (6) node features plus positional features (N. + P.): concatenation of the two. P-GNN uses node features and the distances from a node to some randomly selected anchor nodes as positional features (N. + P.). GNN Trans. utilizes node features and LE as positional features (N. + P.). SEAL adopts Double-Radius Node Labeling (DRNL) to compute deterministic distance features (N. + D.). For PEG, we consider node features plus positional features (N. + P.) or constant features plus positional features (C. + P.).

| Method | Feature | Cora | Citeseer | Pubmed | Twitch-RU | Twitch-PT | Chameleon |
|---|---|---|---|---|---|---|---|
| VGAE | N. | 89.89 ± 0.06 | 90.11 ± 0.08 | 94.62 ± 0.02 | 83.13 ± 0.07 | 82.89 ± 0.08 | 97.98 ± 0.01 |
| | C. | 55.68 ± 0.05 | 61.45 ± 0.36 | 69.03 ± 0.03 | 85.37 ± 0.02 | 85.69 ± 0.09 | 83.13 ± 0.04 |
| | O. | 83.97 ± 0.05 | 77.22 ± 0.04 | 82.54 ± 0.04 | 84.76 ± 0.09 | 87.91 ± 0.05 | 97.67 ± 0.04 |
| | P. | 83.82 ± 0.12 | 78.68 ± 0.25 | 81.74 ± 0.15 | 85.06 ± 0.14 | 85.06 ± 0.14 | 97.91 ± 0.03 |
| | R. | 68.43 ± 0.42 | 71.21 ± 0.78 | 69.31 ± 0.23 | 68.42 ± 0.43 | 68.49 ± 0.73 | 73.44 ± 0.53 |
| | N. + P. | 87.96 ± 0.29 | 80.04 ± 0.60 | 85.26 ± 0.17 | 84.59 ± 0.37 | 88.27 ± 0.19 | 98.01 ± 0.12 |
| PGNN | N. + P. | 86.92 ± 0.02 | 90.26 ± 0.02 | 88.12 ± 0.06 | 83.21 ± 0.00 | 82.37 ± 0.02 | 94.25 ± 0.01 |
| GNN-Trans. | N. + P. | 79.31 ± 0.09 | 77.49 ± 0.02 | 81.23 ± 0.12 | 79.24 ± 0.33 | 75.44 ± 0.14 | 86.23 ± 0.12 |
| SEAL | N. + D. | 91.32 ± 0.91 | 89.49 ± 0.43 | 97.16 ± 0.28 | 92.12 ± 0.10 | 93.21 ± 0.06 | 99.31 ± 0.18 |
| LE | P. | 84.43 ± 0.02 | 78.36 ± 0.08 | 84.35 ± 0.04 | 78.80 ± 0.10 | 67.56 ± 0.02 | 88.47 ± 0.03 |
| DW | P. | 86.82 ± 0.18 | 87.93 ± 0.11 | 85.79 ± 0.06 | 83.10 ± 0.05 | 83.47 ± 0.03 | 92.15 ± 0.02 |
| PEG-DW | N. + P. | 89.51 ± 0.08 | 91.67 ± 0.12 | 87.68 ± 0.29 | 90.21 ± 0.04 | 89.67 ± 0.03 | 98.33 ± 0.01 |
| PEG-DW | C. + P. | 88.36 ± 0.10 | 88.48 ± 0.10 | 88.80 ± 0.11 | 90.32 ± 0.09 | 90.88 ± 0.05 | 97.30 ± 0.03 |
| PEG-LE | N. + P. | 94.20 ± 0.04 | 92.53 ± 0.09 | 87.70 ± 0.31 | 92.14 ± 0.05 | 92.28 ± 0.02 | 98.78 ± 0.02 |
| PEG-LE | C. + P. | 86.88 ± 0.03 | 76.96 ± 0.23 | 91.65 ± 0.02 | 90.21 ± 0.18 | 91.15 ± 0.13 | 98.73 ± 0.04 |
| PEG-DW+ | N. + P. | 93.32 ± 0.08 | 94.11 ± 0.14 | 97.88 ± 0.05 | 91.68 ± 0.01 | 92.15 ± 0.02 | 98.20 ± 0.01 |
| PEG-DW+ | C. + P. | 90.78 ± 0.09 | 91.22 ± 0.12 | 93.44 ± 0.05 | 90.22 ± 0.04 | 91.37 ± 0.05 | 97.50 ± 0.03 |
| PEG-LE+ | N. + P. | 93.78 ± 0.03 | 95.73 ± 0.09 | 97.92 ± 0.11 | 92.29 ± 0.11 | 92.37 ± 0.06 | 98.18 ± 0.02 |
| PEG-LE+ | C. + P. | 88.98 ± 0.14 | 78.61 ± 0.27 | 94.28 ± 0.05 | 92.35 ± 0.02 | 92.50 ± 0.06 | 97.79 ± 0.01 |

Results on ogbl-ddi (Hits@20 (%)) and ogbl-collab (Hits@50 (%)), with training and test times:

| Method | Training time (ddi) | Test time (ddi) | Validation (ddi) | Test (ddi) | Training time (collab) | Test time (collab) | Validation (collab) | Test (collab) |
|---|---|---|---|---|---|---|---|---|
| GCN | 29min 27s | 0.20s | 55.27 ± 0.53 | 37.11 ± 0.21 | 1h38min17s | 1.38s | 52.71 ± 0.10 | 44.62 ± 0.01 |
| GraphSAGE | 14min 26s | 0.24s | 67.11 ± 1.21 | 52.81 ± 8.75 | 38min 10s | 0.83s | 57.16 ± 0.70 | 48.45 ± 0.80 |
| SEAL | 2h 04min 32s | 12.04s | 28.29 ± 0.38 | 30.23 ± 0.24 | 2h29min05s | 51.28s | 64.95 ± 0.04 | 54.71 ± 0.01 |
| PGNN | 9min 49.39s | 0.28s | 2.66 ± 0.16 | 1.74 ± 0.19 | N/A | N/A | N/A | N/A |
| GNN-trans. | 53min 26s | 0.35s | 15.63 ± 0.14 | 9.22 ± 0.21 | 1h52min22s | 1.86s | 18.17 ± 0.25 | 11.19 ± 0.42 |
| DW | 36min 41s | 0.23s | 0.04 ± 0.00 | 0.02 ± 0.00 | 34min40s | 1.08s | 53.64 ± 0.03 | 44.79 ± 0.02 |
| LE | 33min 42s | 0.29s | 0.09 ± 0.00 | 0.02 ± 0.00 | 37min22s | 1.23s | 0.10 ± 0.01 | 0.12 ± 0.02 |
| PEG-DW | 29min 56s | 0.27s | 56.47 ± 0.35 | 43.80 ± 0.32 | 1h42min 05s | 1.51s | 63.98 ± 0.05 | 54.33 ± 0.06 |
| PEG-LE | 30min 32s | 0.29s | 57.49 ± 0.47 | 30.16 ± 0.47 | 1h42min03s | 1.42s | 56.52 ± 0.12 | 48.76 ± 0.92 |
| PE-SAGE-DW | 25min 11s | 0.31s | 68.05 ± 0.96 | 56.16 ± 5.50 | 56min54s | 0.97s | 63.43 ± 0.48 | 54.17 ± 0.54 |
| PE-SAGE-LE | 26min 19s | 0.32s | 68.38 ± 0.78 | 51.49 ± 9.71 | 55min59s | 0.98s | 58.66 ± 0.55 | 49.75 ± 0.67 |
| PEG-DW+ | 48min 03s | 0.28s | 59.70 ± 6.87 | 47.93 ± 0.21 | 1h37min43s | 1.43s | 62.31 ± 0.19 | 53.71 ± 8.02 |
| PEG-LE+ | 51min 25.35s | 0.29s | 58.44 ± 7.46 | 32.33 ± 0.14 | 1h33min29s | 1.39s | 52.91 ± 1.24 | 45.96 ± 9.98 |

Results are shown in Table 1 and Table 2. Over the small datasets in Table 1, VGAE with node features outperforms the other feature types on Cora, Citeseer and Pubmed because the node features therein are highly informative, while this is not true over the other three datasets. One-hot features and positional features achieve almost the same performance, which implies that GNNs naively using PE make positional features behave like one-hot features and may have instability issues. Constant features are not good because of the node ambiguity issue. Random features may introduce heavy noise that hampers model convergence. Concatenating node features with positional features gives better performance than only using positional features but is sometimes worse than only using node features, which is again due to the instability issue of naively using positional features.

Although PGNN and GNN Trans. utilize positional features, they achieve subpar performance. SEAL outperforms all the other state-of-the-art methods, which again demonstrates the effectiveness of distance features (li2020distance; zhang2020revisiting). PEG significantly outperforms all the baselines except SEAL. PEG+, by better using the training sets, achieves comparable or even better performance than SEAL, which demonstrates the contribution of the stable usage of PE. Moreover, PEG can achieve comparable performance in most cases even when only paired with constant features, which benefits from the extra expressive power given by PE (avoiding node ambiguity). Note that PE without GNNs (LE or DW alone) does not perform well, which justifies the benefit of joining GNNs with PE.

Regarding the OGB datasets, PGNN and GNN Trans. do not perform well either. Besides, PGNN cannot scale to large graphs such as *collab*. The results of DW and LE demonstrate that the original positional features provide only crude information, so pairing them with GNNs is helpful. PEG achieves the best results on *ddi* and performs competitively with SEAL on *collab*. The complexity of PEG is comparable to that of canonical GNNs. Note that we do not count the time for computing PE, as it depends on the particular implementation; if PEG is used for online serving, where time complexity matters more, PE can be computed in advance. For a fair comparison, we also do not count the pre-processing time of SEAL, PGNN or GNN Trans. Most importantly, PEG runs significantly faster than SEAL at test time, because SEAL needs to compute distance features for every link, while the PE in PEG is shared across links. Interestingly, DW seems better than LE as a PE technique for large networks.
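The complexity point can be made concrete: Laplacian-eigenmap PE is computed once per graph and merely indexed per candidate link, whereas per-link distance features must be recomputed for every query. The sketch below is a minimal NumPy illustration (the function name `laplacian_pe` and the toy graph are ours, not the paper's implementation).

```python
import numpy as np

def laplacian_pe(A, k):
    """k-dimensional Laplacian-eigenmap PE: eigenvectors of the normalized
    Laplacian associated with the k smallest nontrivial eigenvalues."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1:k + 1]          # skip the trivial (constant) eigenvector

# toy graph: a 4-cycle
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

Z = laplacian_pe(A, k=2)             # computed ONCE for the whole graph
links = [(0, 1), (1, 2), (0, 2)]
# per candidate link, PE is an O(1) lookup and concatenation
pairs = [np.concatenate([Z[u], Z[v]]) for u, v in links]
```

For large graphs one would use a sparse eigensolver (e.g. `scipy.sparse.linalg.eigsh`) instead of a dense `eigh`, but the sharing argument is identical.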

### 4.3 Task 2 — Domain-shift link prediction

Datasets. Task 2 better evaluates the generalization capability of models. We consider 3 groups of datasets, including citation networks (Cora→Citeseer and Cora→Pubmed) (sen2008collective), user-interaction networks (Twitch (EN)→Twitch (RU) and Twitch (ES)→Twitch (PT)) (rozemberczki2021multi) and biological networks (PPI) (hamilton2017inductive). For the citation and user-interaction networks, we use a 95%/5% split for training and validation over the training graph, and we use 10% of the existing links as positive test links in the test graph. For the PPI dataset, we randomly select 3 graphs as the training, validation and test datasets, and we sample 10% of the existing links in the validation/test graphs as positive validation/test links.

Baselines & Implementation details. We use the same baselines as in Task 1, except that we do not use one-hot features for VGAE, since the training and test graphs have different sizes. As for node features, we randomly project them to the same dimension and then row-normalize them. Other settings are the same as in Task 1. PE is applied to the training and test graphs separately, and PE over the test graphs is computed after removing the test links.
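The splitting protocol above can be sketched as follows. This is a minimal illustration under our reading of the setup, not the actual evaluation script; the helper names are ours.

```python
import numpy as np

def split_links(edges, rng, train_frac=0.95):
    """95%/5% train/validation split over the training graph's links."""
    edges = np.asarray(edges)
    perm = rng.permutation(len(edges))
    cut = int(train_frac * len(edges))
    return edges[perm[:cut]], edges[perm[cut:]]

def sample_test_links(edges, rng, frac=0.10):
    """Sample 10% of the test graph's links as positive test links.
    PE on the test graph is then computed on the remaining links."""
    edges = np.asarray(edges)
    idx = rng.permutation(len(edges))[: int(frac * len(edges))]
    mask = np.ones(len(edges), dtype=bool)
    mask[idx] = False
    return edges[idx], edges[mask]   # (test positives, residual graph)

rng = np.random.default_rng(0)
edges = np.array([(i, i + 1) for i in range(100)])  # toy edge list
train, val = split_links(edges, rng)
test_pos, residual = sample_test_links(edges, rng)
```

Negative links for evaluation would be sampled from the non-edges of the corresponding graph, as in the standard link-prediction setup.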

| Method | Features | Cora→Citeseer | Cora→Pubmed | ES→PT | EN→RU | PPI |
| --- | --- | --- | --- | --- | --- | --- |
| VGAE | N. | 62.74 ± 0.03 | 63.53 ± 0.27 | 51.52 ± 0.17 | 60.08 ± 0.02 | 83.24 ± 0.20 |
| VGAE | C. | 62.16 ± 0.08 | 56.89 ± 0.36 | 82.72 ± 0.26 | 91.23 ± 0.07 | 75.27 ± 0.89 |
| VGAE | P. | 70.59 ± 0.03 | 79.83 ± 0.27 | 82.24 ± 0.24 | 81.42 ± 0.01 | 77.61 ± 0.47 |
| VGAE | R. | 68.44 ± 0.63 | 71.27 ± 0.37 | 71.26 ± 0.36 | 69.37 ± 0.35 | 75.88 ± 0.49 |
| VGAE | N. + P. | 76.45 ± 0.55 | 65.62 ± 0.42 | 71.46 ± 0.31 | 84.00 ± 0.28 | 84.67 ± 0.22 |
| PGNN | N. + P. | 85.02 ± 0.28 | 76.88 ± 0.42 | 70.41 ± 0.07 | 63.27 ± 0.27 | 80.84 ± 0.03 |
| GNN-Trans. | N. + P. | 61.60 ± 0.52 | 76.35 ± 0.17 | 63.44 ± 0.34 | 62.87 ± 0.22 | 79.82 ± 0.17 |
| SEAL | N. + D. | 91.36 ± 0.93 | 89.62 ± 0.87 | 93.37 ± 0.05 | 92.34 ± 0.14 | 88.99 ± 0.12 |
| LE | P. | 77.62 ± 0.04 | 84.03 ± 0.22 | 67.75 ± 0.09 | 77.57 ± 0.15 | 72.14 ± 0.82 |
| DW | P. | 86.48 ± 0.14 | 86.97 ± 0.06 | 83.56 ± 0.03 | 83.41 ± 0.04 | 85.18 ± 0.20 |
| PEG-DW | N. + P. | 89.91 ± 0.03 | 87.23 ± 0.34 | 91.82 ± 0.04 | 91.14 ± 0.02 | 87.36 ± 0.11 |
| PEG-DW | C. + P. | 89.75 ± 0.04 | 89.58 ± 0.08 | 91.27 ± 0.04 | 90.26 ± 0.07 | 86.42 ± 0.20 |
| PEG-LE | N. + P. | 82.57 ± 0.02 | 92.34 ± 0.28 | 91.61 ± 0.05 | 91.93 ± 0.13 | 85.34 ± 0.14 |
| PEG-LE | C. + P. | 79.60 ± 0.04 | 88.89 ± 0.13 | 91.38 ± 0.10 | 92.40 ± 0.10 | 85.22 ± 0.16 |
| PEG-DW+ | N. + P. | 91.15 ± 0.06 | 90.98 ± 0.03 | 91.24 ± 0.16 | 91.91 ± 0.02 | 89.92 ± 0.17 |
| PEG-DW+ | C. + P. | 91.32 ± 0.01 | 90.93 ± 0.18 | 91.22 ± 0.02 | 92.14 ± 0.02 | 88.44 ± 0.29 |
| PEG-LE+ | N. + P. | 86.72 ± 0.05 | 93.34 ± 0.11 | 91.67 ± 0.13 | 92.24 ± 0.19 | 86.77 ± 0.36 |
| PEG-LE+ | C. + P. | 87.62 ± 0.04 | 92.21 ± 0.20 | 91.37 ± 0.19 | 93.12 ± 0.21 | 86.21 ± 0.27 |

Results are shown in Table 3. Compared with Table 1, we notice that for VGAE node features perform much worse (except on PPI) than in Task 1, which demonstrates the risk of relying on node features when the domain shifts. Positional features, which are not tied to one particular graph, are possibly more generalizable across graphs. Random features are generalizable but still hard to converge with. PGNN and GNN Trans. do not utilize positional features appropriately and perform far from ideal. Both SEAL and PEG outperform the other baselines significantly, which implies their good stability and generalization. PEG and SEAL again achieve comparable performance, while PEG has much lower training and testing complexity. Our results demonstrate the significance of using permutation-equivariant and stable PE.

## 5 Conclusion

In this work, we studied how GNNs should work with PE in principle, and proposed conditions that keep GNNs permutation equivariant and stable when PE is used to avoid the node ambiguity issue. Following these conditions, we proposed the PEG layer. Extensive experiments on link prediction demonstrate the effectiveness of PEG. In the future, we plan to generalize the theory to more general PE techniques and to test PEG on other graph learning tasks.

#### Acknowledgments

We greatly thank the reviewers for their actionable suggestions. H. Yin and P. L. are supported by the 2021 JPMorgan Faculty Award and the National Science Foundation (NSF) award HDR-2117997.

## References

## Appendix A Proof of Lemma 3.4

Recall the eigenvalues of the PSD matrix are . Suppose one EVD of . In , is the eigenvector of so . Without loss of generality, we set .

Suppose . Now, we perturb by slightly perturbing and . We set

Set for . Note that and . Then, the columns of still give a group of orthogonal bases.

Now we denote the above perturbation of as . Then, could be for any . Therefore, for sufficiently small ,

(8)

## Appendix B Proof of Lemma 3.5

The result of Lemma 3.5 can be derived from the Davis-Kahan theorem (davis1970rotation) and its variant (yu2015useful) that characterizes eigenspace perturbation. We apply Theorem 2, Eq. 3 of (yu2015useful) to two PSD matrices.

###### Theorem B.1 (Theorem 2 (yu2015useful)).

Let two PSD matrices , with eigenvalues such that . For , let have orthonormal columns satisfying for . Then, there exists an orthogonal matrix such that
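For reference, the bound of Theorem 2 in (yu2015useful), restated here in its original descending-eigenvalue convention (our restatement; the notation may differ slightly from the paper's):

```latex
% Let \Sigma, \hat{\Sigma} \in \mathbb{R}^{p \times p} be symmetric with
% eigenvalues \lambda_1 \ge \dots \ge \lambda_p and
% \hat{\lambda}_1 \ge \dots \ge \hat{\lambda}_p. Fix 1 \le r \le s \le p,
% let d = s - r + 1, with \lambda_0 := \infty and \lambda_{p+1} := -\infty,
% and let V = (v_r, \dots, v_s), \hat{V} = (\hat{v}_r, \dots, \hat{v}_s)
% collect the corresponding orthonormal eigenvectors. Then there exists an
% orthogonal \hat{O} \in \mathbb{R}^{d \times d} such that
\|\hat{V}\hat{O} - V\|_F
  \le \frac{2^{3/2}\,\min\!\big(\sqrt{d}\,\|\hat{\Sigma}-\Sigma\|_{\mathrm{op}},\;
                                \|\hat{\Sigma}-\Sigma\|_F\big)}
           {\min\!\big(\lambda_{r-1}-\lambda_r,\; \lambda_s-\lambda_{s+1}\big)}.
```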

By symmetry, applying the above theorem again, we know there exists

Because , then there exists ,

(10)

where .

When we apply a permutation matrix to permute the rows and columns of , then for any . Moreover, permuting the rows and columns of a PSD matrix will not change its eigenvalues. This means that can be replaced by in Eq.10 as long as is replaced by . Therefore,
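The permutation argument above is easy to verify numerically. The following sketch (ours, for illustration only) confirms that conjugating a PSD matrix by a permutation matrix leaves its eigenvalues unchanged and simply row-permutes its eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(42)
M = rng.standard_normal((5, 5))
A = M @ M.T                                  # a random PSD matrix
P = np.eye(5)[rng.permutation(5)]            # a random permutation matrix

evals_A = np.linalg.eigvalsh(A)
evals_PA = np.linalg.eigvalsh(P @ A @ P.T)   # rows and columns permuted
assert np.allclose(evals_A, evals_PA)        # spectrum is unchanged

vals, U = np.linalg.eigh(A)
# if A u = lam * u, then (P A P^T)(P u) = lam * (P u):
# the eigenvectors of the permuted matrix are the row-permuted eigenvectors
assert np.allclose((P @ A @ P.T) @ (P @ U), (P @ U) * vals)
```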

## Appendix C Proof of Theorem 3.6

To prove PE-equivariance, consider two graphs and that have perfect matching . So, , .

Let denote the Laplacian eigenmaps of , . Set , and and use Lemma 3.5. Because and , then there exists .

(11)