1 Introduction
Learning low-dimensional vector representations of nodes in graphs
(Hamilton et al., 2017b) has led to advances on tasks such as node classification (Kipf & Welling, 2017), link prediction (Grover & Leskovec, 2016), graph classification (Ying et al., 2018b) and graph generation (You et al., 2018b), with successful applications across domains such as social and information networks (Ying et al., 2018a), chemistry (You et al., 2018a), and biology (Zitnik & Leskovec, 2017). Node embedding methods can be categorized into Graph Neural Network (GNN) approaches (Scarselli et al., 2009), matrix-factorization approaches (Belkin & Niyogi, 2002), and random-walk approaches (Perozzi et al., 2014). Among these, GNNs are currently the most popular paradigm, largely owing to their efficiency and inductive learning capability (Hamilton et al., 2017a). By contrast, random-walk approaches (Perozzi et al., 2014; Grover & Leskovec, 2016)
are limited to transductive settings and cannot incorporate node attributes. In the GNN framework, the embedding of a node is computed by a GNN layer aggregating information from the node’s network neighbors via nonlinear transformation and aggregation functions
(Battaglia et al., 2018). Long-range node dependencies can be captured by stacking multiple GNN layers, allowing information to propagate over multiple hops (Xu et al., 2018). However, the key limitation of existing GNN architectures is that they fail to capture the position/location of a node within the broader context of the graph structure. For example, if two nodes reside in very different parts of the graph but have topologically the same (local) neighbourhood structure, their GNN computation graphs will be identical. Therefore, the GNN will embed them to the same point in the embedding space (we ignore node attributes for now). Figure 1 gives an example where a GNN cannot distinguish between two such nodes and will always embed them to the same point because they have isomorphic network neighbourhoods. Thus, GNNs will never be able to classify these nodes into different classes because, from the GNN's point of view, they are indistinguishable (again, not considering node attributes). Researchers have spotted this weakness (Xu et al., 2019) and developed heuristics to fix the issue: augmenting node features with one-hot encodings
(Kipf & Welling, 2017), or making GNNs deeper (Selsam et al., 2019). However, models trained with one-hot encodings cannot generalize to unseen graphs, and arbitrarily deep GNNs still cannot distinguish structurally isomorphic nodes (Figure 1).

Here we propose Position-aware Graph Neural Networks (P-GNNs), a new class of Graph Neural Networks for computing node embeddings that incorporate a node's positional information with respect to all other nodes in the network, while also retaining inductive capability and utilizing node features. Our key observation is that node position can be captured by a low-distortion embedding that quantifies the distance between a given node and a set of anchor nodes. Specifically, P-GNN uses a sampling strategy with theoretical guarantees to choose random subsets of nodes called anchor-sets. To compute a node's embedding, P-GNN first samples multiple anchor-sets in each forward pass, then learns a non-linear aggregation scheme that combines node feature information from each anchor-set and weighs it by the distance between the node and the anchor-set. Such aggregations can be naturally chained and combined into multiple layers to enhance model expressiveness. Bourgain's theorem (Bourgain, 1985) guarantees that only $O(\log^2 n)$ anchor-sets are needed to preserve the distances in the original graph with low distortion.
We demonstrate the P-GNN framework on various real-world graph-based prediction tasks. In settings where node attributes are not available, P-GNN's computation of the anchor-set distance vector is inductive across different node orderings and different graphs. When node attributes are available, a node's embedding is further enriched by aggregating information from all anchor-sets, weighted by the distance vector. Furthermore, we show theoretically that P-GNNs are more general and expressive than traditional message-passing GNNs. In fact, message-passing GNNs can be viewed as special cases of P-GNNs with degenerate distance metrics and anchor-set sampling strategies. In large-scale applications, computing distances between nodes can be prohibitively expensive. Therefore, we also propose P-GNN-Fast, which adopts approximate node distance computation. We show that P-GNN-Fast has the same computational complexity as traditional GNN models while still preserving the benefits of P-GNN.
We apply P-GNNs to 8 different datasets and several different prediction tasks, including link prediction and community detection (code and data are available at https://github.com/JiaxuanYou/PGNN/). Across all datasets and prediction tasks, we show that P-GNNs consistently outperform state-of-the-art GNN variants, with up to 66% improvement in ROC AUC.
2 Related Work
Existing GNN models belong to a family of graph message-passing architectures in which each node aggregates feature messages from its neighbors in the graph using different aggregation schemes: Graph Convolutional Networks use mean pooling (Kipf & Welling, 2017); GraphSAGE concatenates the node's own features with mean/max/LSTM-pooled neighbourhood information (Hamilton et al., 2017a); Graph Attention Networks aggregate neighbourhood information according to trainable attention weights (Velickovic et al., 2018); Message Passing Neural Networks further incorporate edge information during aggregation (Gilmer et al., 2017); and Graph Networks further consider global graph information during aggregation (Battaglia et al., 2018). However, all these models focus on learning node embeddings that capture local network structure around a given node. Such models are at most as powerful as the Weisfeiler-Lehman (WL) graph isomorphism test (Xu et al., 2019), which means that they cannot distinguish nodes at symmetric/isomorphic positions in the network (Figure 1). That is, without relying on node feature information, the above models will always embed nodes at symmetric positions into the same embedding vectors, so such nodes are indistinguishable from the GNN's point of view.
Heuristics that alleviate the above issues include assigning a unique identifier to each node (Kipf & Welling, 2017; Hamilton et al., 2017a) or using locally assigned node identifiers plus pre-trained transductive node features (Zhang & Chen, 2018). However, such models are not scalable and cannot generalize to unseen graphs where a canonical node ordering is not available. In contrast, P-GNNs capture positional information without sacrificing the other advantages of GNNs.
An alternative way to incorporate positional information is via graph kernels, which crucially rely on the positional information of nodes and which inspired our P-GNN model. Graph kernels implicitly or explicitly map graphs to a Hilbert space. Weisfeiler-Lehman and subgraph kernels have been incorporated into deep graph kernels (Yanardag & Vishwanathan, 2015) to capture structural properties of neighbourhoods. Gärtner et al. (2003) and Kashima et al. (2003) also proposed graph kernels based on random walks, which count the number of walks two graphs have in common (Sugiyama & Borgwardt, 2015). Kernels based on shortest paths were first proposed by Borgwardt & Kriegel (2005).
3 Preliminaries
3.1 Notation and Problem Definition
A graph can be represented as $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the node set and $\mathcal{E}$ is the edge set. In many applications where nodes have attributes, we additionally augment $G$ with the node feature set $\mathcal{X} = \{\mathbf{x}_v \,|\, v \in \mathcal{V}\}$, where $\mathbf{x}_v$ is the feature vector associated with node $v$.
Predictions on graphs are made by first embedding nodes into a low-dimensional space, which is then fed into a classifier, potentially in an end-to-end fashion. Specifically, a node embedding model can be written as a function $f : \mathcal{V} \rightarrow \mathbb{R}^k$ that maps nodes $v \in \mathcal{V}$ to $k$-dimensional vectors $\mathbf{z}_v = f(v)$.
3.2 Limitations of Structure-aware Embeddings
Our goal is to learn embeddings that capture the local network structure as well as retain the global network position of a given node. We call node embeddings position-aware if the embeddings of two nodes can be used to (approximately) recover their shortest path distance in the network. This property is crucial for many prediction tasks, such as link prediction and community detection. We show below that GNN-based embeddings cannot recover shortest path distances between nodes, which may lead to suboptimal performance in tasks where such information is needed.
Definition 1.
A node embedding $\mathbf{z}_v = f_p(v), \forall v \in \mathcal{V}$, is position-aware if there exists a function $g_p(\cdot, \cdot)$ such that $d_{sp}(u, v) = g_p(\mathbf{z}_u, \mathbf{z}_v)$, where $d_{sp}(\cdot, \cdot)$ is the shortest path distance in $G$.
Definition 2.
A node embedding $\mathbf{z}_v = f_s(v)$ is structure-aware if it is a function of up to the $q$-hop network neighbourhood of node $v$. Specifically, $\mathbf{z}_v = g_s(N_1(v), \ldots, N_q(v))$, where $N_h(v)$ is the set of nodes $h$ hops away from node $v$, and $g_s$ can be any function.
For example, most graph neural networks compute node embeddings by aggregating information from each node's $q$-hop neighbourhood and are thus structure-aware. In contrast, (long) random-walk-based embeddings such as DeepWalk and Node2Vec are position-aware, since their objective functions force nodes that are close in shortest path distance to also be close in the embedding space. In general, structure-aware embeddings cannot be mapped to position-aware embeddings. Therefore, when the learning task requires node positional information, using only structure-aware embeddings as input is not sufficient:
Proposition 1.
There exists a mapping $g$ that maps structure-aware embeddings $f_s(v)$ to position-aware embeddings $f_p(v)$, if and only if no pair of nodes have isomorphic local $q$-hop neighbourhood graphs.
Proposition 1 is proved in the Appendix. The proof is based on identifiability arguments similar to the proof of Theorem 1 in (Hamilton et al., 2017a), and it also explains why GNNs may in some cases perform well on tasks requiring positional information. However, in real-world graphs such as molecules and social networks, structural equivalences between nodes' local neighbourhood graphs are quite common, making it hard for GNNs to distinguish different nodes. Furthermore, the mapping $g$ essentially memorizes the shortest path distance between each pair of structure-aware node embeddings whose local neighbourhoods are unique. Therefore, even if a GNN perfectly learns such a mapping $g$, it cannot generalize this mapping to new graphs.
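To make this concrete, the following toy example (our own illustration, not from the paper) applies a single untrained mean-aggregation step, the core of a GCN-style layer, to a cycle graph with constant node features; every node ends up with an identical embedding, so no function of these embeddings can recover pairwise shortest path distances.

```python
import numpy as np

# Toy example (not from the paper): a 6-node cycle where every node has the same
# constant feature. One mean-aggregation step -- the core of a GCN-style layer --
# yields identical embeddings for all nodes, even though e.g. nodes 0 and 3 are
# three hops apart, so their shortest path distance cannot be recovered.
n = 6
A = np.zeros((n, n))
for i in range(n):                                 # adjacency matrix of the cycle
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

X = np.ones((n, 4))                                # constant node features
W = np.random.default_rng(0).normal(size=(4, 4))   # untrained weight matrix

deg = A.sum(axis=1, keepdims=True)
H = np.tanh((A @ X / deg) @ W)                     # mean over neighbors, then transform

print(np.allclose(H[0], H[3]))                     # True: the two nodes are indistinguishable
```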
4 Proposed Approach
In this section, we first describe the P-GNN framework that extends GNNs to learn position-aware node embeddings. We then discuss our model design choices. Last, we theoretically show how P-GNNs generalize existing GNNs and learn position-aware embeddings.
4.1 The Framework of P-GNNs
We propose Position-aware Graph Neural Networks, which generalize Graph Neural Networks with two key insights. First, when computing the node embedding, instead of only aggregating messages computed from a node's local network neighbourhood, we allow P-GNNs to aggregate messages from anchor-sets, which are randomly chosen subsets of all the nodes (Figure 2, left). Note that anchor-sets are resampled every time the model is run forward. Second, when performing message aggregation, instead of letting each node aggregate information independently, the aggregation is coupled across all the nodes in order to distinguish nodes with different positions in the network (Figure 2, middle). We design P-GNNs such that each node embedding dimension corresponds to messages computed with respect to one anchor-set, which makes the computed node embeddings position-aware (Figure 2, right).
P-GNNs contain the following key components:
$k$ anchor-sets $S_i \subset \mathcal{V}$ of different sizes.
Message computation function $F$ that combines feature information of two nodes with their network distance.
Matrix $M_v$ of anchor-set messages, where each row is an anchor-set message computed by $F$.
Trainable aggregation functions $\textsc{Agg}_M$ and $\textsc{Agg}_S$ that aggregate/transform feature information of the nodes within an anchor-set and then also aggregate it across the anchor-sets.
Trainable vector $\mathbf{w}$ that projects the message matrix $M_v$ to a lower-dimensional embedding $\mathbf{z}_v \in \mathbb{R}^k$.
Algorithm 1 summarizes the general framework of P-GNNs. A P-GNN consists of multiple P-GNN layers. Concretely, each P-GNN layer first samples $k$ random anchor-sets $S_1, \ldots, S_k$. Then, the $i$-th dimension of the output node embedding $\mathbf{z}_v$ represents messages computed with respect to anchor-set $S_i$. Each dimension of the embedding is obtained by first computing the message from each node in the anchor-set via the message computation function $F$, then applying a message aggregation function $\textsc{Agg}_M$, and finally applying a non-linear transformation to get a scalar via the trainable weight vector $\mathbf{w}$ and non-linearity $\sigma$. Specifically, the message from each node includes distances that reveal node positions as well as feature-based information from the input node features. The message aggregation functions are the same class of functions as used by existing GNNs. We further elaborate on these design choices in Section 4.3.
P-GNNs are position-aware. The output embeddings $\mathbf{z}_v$ are position-aware, as each dimension of the embedding encodes the information necessary to distinguish structurally equivalent nodes that reside in different parts of the graph. Note that if we permute the dimensions of all the node embeddings $\mathbf{z}_v$, the resulting embeddings are equivalent to the original embeddings, because they carry the same node positional information with respect to the (correspondingly permuted) anchor-sets $S_i$.
Multiple P-GNN layers can be naturally stacked to achieve higher expressive power. Note that, unlike in GNNs, we cannot feed the output embeddings $\mathbf{z}_v$ from the previous layer into the next layer, because the dimensions of $\mathbf{z}_v$ can be arbitrarily permuted; therefore, applying a fixed non-linear transformation over this representation is problematic. The deeper reason we cannot feed $\mathbf{z}_v$ to the next layer is that the position of a node is always relative to the chosen anchor-sets; thus, canonical position-aware embeddings do not exist. Therefore, P-GNNs also compute structure-aware messages $\mathbf{h}_v$, which are obtained via an order-invariant message aggregation function $\textsc{Agg}_S$ that aggregates messages across anchor-sets, and these are then fed into the next P-GNN layer as input.
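The following minimal NumPy sketch (our own simplification; the name `pgnn_layer` and the fixed projection vector `w` standing in for trainable weights are ours) illustrates the data flow of one P-GNN layer under the design choices discussed in Section 4.3: distance-weighted messages from each anchor-set, a mean $\textsc{Agg}_M$ within each anchor-set giving one embedding dimension per anchor-set, and a mean $\textsc{Agg}_S$ across anchor-sets producing the structure-aware message for the next layer. Anchor-sets are assumed to be sampled as described in Section 4.2.

```python
import numpy as np

def pgnn_layer(H, D, anchor_sets, w):
    """One (untrained) P-GNN layer sketch.
    H: (n, f) input node features/messages.
    D: (n, n) node distances (possibly q-hop truncated); np.inf if unreachable.
    anchor_sets: list of k non-empty integer index arrays, shared by all nodes.
    w: (2*f,) projection vector (a stand-in for the trainable weights).
    Returns z (n, k) position-aware embeddings and h_next (n, 2*f) structure-aware messages."""
    n, f = H.shape
    k = len(anchor_sets)
    S = 1.0 / (D + 1.0)                     # distance -> similarity s(v, u) in (0, 1]
    M = np.zeros((n, k, 2 * f))
    for i, anchors in enumerate(anchor_sets):
        for v in range(n):
            # F(v, u, h_v, h_u) = s(v, u) * concat(h_v, h_u), then AGG_M = mean over the set
            msgs = [S[v, u] * np.concatenate([H[v], H[u]]) for u in anchors]
            M[v, i] = np.mean(msgs, axis=0)
    z = np.tanh(M @ w)                      # one scalar per anchor-set -> (n, k)
    h_next = M.mean(axis=1)                 # AGG_S: order-invariant mean across anchor-sets
    return z, h_next                        # full model would also re-project h_next to f dims
```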
4.2 Anchor-set Selection
We rely on Bourgain's theorem to guide the choice of anchor-sets, such that the resulting representations are guaranteed to have low distortion. Specifically, distortion measures the faithfulness of an embedding in preserving distances when mapping from one metric space to another, and is defined as follows:
Definition 3.
Given two metric spaces $(\mathcal{V}, d)$ and $(\mathcal{Z}, d')$ and a function $f : \mathcal{V} \rightarrow \mathcal{Z}$, $f$ is said to have distortion $\alpha$ if $\forall u, v \in \mathcal{V}$, $\frac{1}{\alpha} d(u, v) \le d'(f(u), f(v)) \le d(u, v)$.
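As a small illustration (our own helper, not part of the paper), the distortion of a candidate embedding can be estimated empirically by comparing all pairwise distances before and after the mapping; the version below allows a global rescaling of the embedding, a slight relaxation of Definition 3.

```python
import numpy as np
from itertools import combinations

def distortion(D, Z, p=2):
    """Estimate the distortion of embeddings Z w.r.t. graph distances D[u][v]:
    the smallest alpha such that, after an optimal global rescaling, every pair
    satisfies (1/alpha) * d(u, v) <= ||z_u - z_v||_p <= d(u, v)."""
    ratios = []
    for u, v in combinations(range(len(Z)), 2):
        d_emb = np.linalg.norm(np.asarray(Z[u]) - np.asarray(Z[v]), ord=p)
        if D[u][v] > 0 and d_emb > 0:
            ratios.append(d_emb / D[u][v])
    ratios = np.asarray(ratios)
    return ratios.max() / ratios.min()      # ratio of worst stretch to worst shrinkage
```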
Theorem 1 states the Bourgain theorem (Bourgain, 1985), which shows the existence of a low-distortion embedding from any finite metric space into the $l_p$ metric space:
Theorem 1.
(Bourgain theorem) Given any finite metric space $(\mathcal{V}, d)$ with $|\mathcal{V}| = n$, there exists an embedding of $(\mathcal{V}, d)$ into $\mathbb{R}^k$ under any $l_p$ metric, where $k = O(\log^2 n)$, and the distortion of the embedding is $O(\log n)$.
A constructive proof of Theorem 1 (Linial et al., 1995) provides an algorithm to construct an $O(\log^2 n)$-dimensional embedding via anchor-sets, as summarized in Theorem 2:
Theorem 2.
(Constructive proof of Bourgain theorem) For a metric space $(\mathcal{V}, d)$, given $k = c \log^2 n$ random sets $S_{i,j} \subset \mathcal{V}$, $i = 1, \ldots, \log n$, $j = 1, \ldots, c \log n$, where $c$ is a constant, $S_{i,j}$ is chosen by including each point in $\mathcal{V}$ independently with probability $\frac{1}{2^i}$. An embedding method for $v \in \mathcal{V}$ is defined as:

$f(v) = \left( \frac{d(v, S_{1,1})}{k}, \frac{d(v, S_{1,2})}{k}, \ldots, \frac{d(v, S_{\log n, c \log n})}{k} \right)$   (1)

where $d(v, S_{i,j}) = \min_{u \in S_{i,j}} d(v, u)$. Then, $f$ is an embedding method that satisfies Theorem 1.
The proposed P-GNNs can be viewed as a generalization of the embedding method in Theorem 2, where the distance metric is generalized via the message computation function $F$ and the message aggregation function $\textsc{Agg}_M$, which account for both node feature information and position-based similarities (Section 4.3). Using this analogy, Theorem 2 offers two insights for selecting anchor-sets in P-GNNs. First, $O(\log^2 n)$ anchor-sets are needed to guarantee a low-distortion embedding. Second, these anchor-sets have sizes distributed exponentially. Here, we illustrate the intuition behind selecting anchor-sets of different sizes via the $1$-hop shortest path distance defined in Equation 2. Suppose that the model is computing the embedding for node $v$. We say an anchor-set hits $v$ if $v$ or any of its one-hop neighbours is included in the anchor-set. Small anchor-sets can provide positional information with high certainty, because when a small anchor-set hits $v$, we know that $v$ is located close to one of the very few nodes in that anchor-set. However, the probability that such a small anchor-set hits $v$ is low, and the anchor-set is uninformative if it misses $v$. On the contrary, large anchor-sets have a higher probability of hitting $v$, and thus sampling large anchor-sets results in higher sample efficiency. However, knowing that a large anchor-set hits $v$ provides little information about its position, since $v$ might be close to any of the many nodes in the anchor-set. Therefore, choosing anchor-sets of different sizes balances this trade-off and leads to efficient embeddings.
Following the above principle, P-GNNs choose $k = c \log^2 n$ random anchor-sets, denoted $S_{i,j}$ for $i = 1, \ldots, \log n$ and $j = 1, \ldots, c \log n$, where $c$ is a hyperparameter. To sample an anchor-set $S_{i,j}$, we include each node in $\mathcal{V}$ independently with probability $\frac{1}{2^i}$.
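A minimal sketch of this sampling scheme (our own code; the constant `c` and the seed are illustrative choices): it draws roughly $c \log^2 n$ anchor-sets whose expected sizes decay exponentially from $n/2$ down to about one node, as prescribed by Theorem 2.

```python
import numpy as np

def sample_anchor_sets(n, c=1, seed=0):
    """Draw ~c*log^2(n) anchor-sets S_{i,j}: for i = 1..log n and j = 1..c*log n,
    include each of the n nodes independently with probability 1/2^i, so expected
    set sizes range from n/2 down to roughly one node."""
    rng = np.random.default_rng(seed)
    log_n = int(np.ceil(np.log2(n)))
    anchor_sets = []
    for i in range(1, log_n + 1):
        for _ in range(c * log_n):
            members = np.flatnonzero(rng.random(n) < 1.0 / 2 ** i)
            if members.size:                       # skip the rare empty draw
                anchor_sets.append(members)
    return anchor_sets

# Example: with c = 1, a 400-node graph (e.g. a 20x20 grid) gets ceil(log2 400)^2 = 81 sets.
sets = sample_anchor_sets(400)
print(len(sets), sorted(len(s) for s in sets)[:5])
```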
4.3 Design decisions for P-GNNs

In this section, we discuss the design choices of the two key components of P-GNNs: the message computation function $F$ and the message aggregation functions $\textsc{Agg}$.
Message Computation Function $F$. The message computation function $F$ has to account for both position-based similarities and feature information. Position-based similarities are the key to revealing a node's positional information, while feature information may include other side information that is useful for the prediction task.
Position-based similarities can be computed via the shortest path distance or, for example, personalized PageRank (Jeh & Widom, 2003). However, since computing exact shortest path distances between all pairs of nodes is computationally expensive, we propose the following truncated $q$-hop shortest path distance:

$d_{sp}^{q}(v, u) = \begin{cases} d_{sp}(v, u), & \text{if } d_{sp}(v, u) \le q \\ \infty, & \text{otherwise} \end{cases}$   (2)

where $d_{sp}(v, u)$ is the shortest path distance between a pair of nodes. Note that the $q$-hop distance can be read off directly from the adjacency matrix and its powers, and thus requires little additional computation. Since we aim to map nodes that are close in the network to similar embeddings, we further transform the distance into $s(v, u) = \frac{1}{d_{sp}^{q}(v, u) + 1}$, which maps it to the $(0, 1]$ range.
Feature information can be incorporated into $F$ by passing in information from the neighbouring nodes, as in GCN (Kipf & Welling, 2017), or by concatenating the node features $\mathbf{h}_v$ and $\mathbf{h}_u$, similar to GraphSAGE (Hamilton et al., 2017a), although other approaches such as attention can be used as well (Velickovic et al., 2018). Combining position and feature information can then be achieved via concatenation or product; we find that a simple product works well empirically. Specifically, we use the following message computation function:

$F(v, u, \mathbf{h}_v, \mathbf{h}_u) = s(v, u)\,\textsc{concat}(\mathbf{h}_v, \mathbf{h}_u)$   (3)
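The sketch below (our own helper names `q_hop_distances` and `message`) computes the truncated $q$-hop distance of Equation 2 with a depth-bounded BFS, maps it to the $(0, 1]$ similarity $s(v, u) = 1/(d + 1)$, and forms the message of Equation 3 by scaling the concatenated features.

```python
import numpy as np
from collections import deque

def q_hop_distances(adj, v, q):
    """Shortest-path distances from node v, truncated at q hops (Eq. 2):
    entries beyond q hops are left at infinity. adj: list of neighbor lists."""
    n = len(adj)
    dist = np.full(n, np.inf)
    dist[v] = 0
    frontier = deque([v])
    while frontier:
        u = frontier.popleft()
        if dist[u] == q:          # stop expanding once the q-hop horizon is reached
            continue
        for w in adj[u]:
            if np.isinf(dist[w]):
                dist[w] = dist[u] + 1
                frontier.append(w)
    return dist

def message(h_v, h_u, d_vu):
    """Eq. 3: F(v, u, h_v, h_u) = s(v, u) * concat(h_v, h_u), with s = 1/(d + 1)."""
    s = 1.0 / (d_vu + 1.0)        # lies in (0, 1]; equals 0 for unreachable (inf) pairs
    return s * np.concatenate([h_v, h_u])
```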
Message Aggregation Functions $\textsc{Agg}$. Message aggregation functions aggregate information from a set of messages (vectors). Any permutation-invariant function, such as mean, max, or sum, can be used, and non-linear transformations are often applied before and/or after the aggregation to achieve higher expressive power (Zaheer et al., 2017). We find that the simple mean aggregation function provides good results, and we therefore use it to instantiate both $\textsc{Agg}_M$ and $\textsc{Agg}_S$.
5 Theoretical Analysis of P-GNNs
5.1 Connection to Existing GNNs
P-GNNs generalize existing GNN models. From the P-GNN point of view, existing GNNs use the same anchor-set message aggregation techniques, but use different anchor-set selection and sampling strategies, and only output the structure-aware embeddings $\mathbf{h}_v$.
GNNs use either deterministic or stochastic neighbourhood aggregation (Hamilton et al., 2017a). Deterministic GNNs can be expressed as special cases of P-GNNs that treat each individual node as an anchor-set and aggregate messages based on hop distance. In particular, the function $\textsc{Agg}_S$ in Algorithm 1 corresponds to the message aggregation function of a deterministic GNN. In each layer, most GNNs aggregate information from a node's one-hop neighbourhood (Kipf & Welling, 2017; Velickovic et al., 2018), corresponding to computing messages with the $1$-hop distance, or aggregate the $q$-hop neighbourhood directly (Xu et al., 2018), corresponding to computing messages within $q$-hop distance. For example, a GCN (Kipf & Welling, 2017) can be written as a P-GNN that treats each node as a size-1 anchor-set, computes messages with the $1$-hop distance, uses mean aggregation, and outputs the structure-aware embedding $\mathbf{h}_v$ in the final layer.
Stochastic GNNs can be viewed as P-GNNs that sample size-1 anchor-sets, where each node's choice of anchor-sets is different. For example, GraphSAGE (Hamilton et al., 2017a) can be viewed as a special case of P-GNNs where each node samples size-1 anchor-sets from its neighbourhood and then computes messages using the 1-hop shortest path distance, followed by aggregation. This understanding reveals the connection between stochastic GNNs and P-GNNs. First, P-GNNs use larger anchor-sets, thereby enabling higher sample efficiency (Section 4.2). Second, anchor-sets are shared across all nodes and serve as reference points in the network; consequently, the positional information of each node can be obtained from the shared anchor-sets.
5.2 Expressive Power of P-GNNs
Next, we show that P-GNNs provide a more general class of inductive bias for graph representation learning than GNNs, and are therefore more expressive for learning both structure-aware and position-aware node embeddings.
We motivate our idea by considering pairwise relation prediction between nodes. Suppose a pair of nodes $(u, v)$ is labeled with label $y_{uv}$ via a labeling function $y_{uv} = g(u, v, G)$, and our goal is to predict $y_{uv}$ for unseen node pairs. From the perspective of representation learning, we can solve the problem by learning an embedding function that computes node embeddings $\mathbf{z}_v$, where the objective is to maximize the likelihood of the conditional distribution $p(y_{uv} \mid \mathbf{z}_u, \mathbf{z}_v)$. Generally, an embedding function takes a given node and the graph as input and can be written as $\mathbf{z}_v = f(v, G)$, while $y_{uv}$ can then be expressed as a function of the node embeddings in the embedding space.
As shown in Section 3.2, GNNs instantiate $f$ via a function $f_s$ that takes a node $v$ and its $q$-hop neighbourhood graph as arguments. Note that the embedding of node $u$ is independent of the $q$-hop neighbourhood graph of node $v$, since knowing the neighbourhood structure of $u$ provides no information about the neighbourhood structure of $v$. In contrast, P-GNNs assume a more general type of inductive bias, where $f$ is instantiated via a function $f_p$ that aggregates messages from random anchor-sets shared across all the nodes, and nodes are differentiated based on their different distances to these anchor-sets. Under this formulation, each node's embedding is computed similarly to a stochastic GNN when combined with a proper $q$-hop distance computation (Section 5.1). However, since the anchor-sets are shared across all nodes, pairs of node embeddings are correlated through the anchor-sets and are thus no longer independent. This formulation implies a joint distribution over pairs of node embeddings, rather than a product of marginal distributions. In summary, learning node representations can be formalized with the following two types of objectives:

GNN representation learning objective: learn embeddings such that the embedding-space similarity $d_z(\mathbf{z}_u, \mathbf{z}_v)$ matches the target similarity $d_y(u, v)$, with $\mathbf{z}_u$ and $\mathbf{z}_v$ drawn independently from their marginal distributions.   (4)

P-GNN representation learning objective: learn embeddings such that $d_z(\mathbf{z}_u, \mathbf{z}_v)$ matches $d_y(u, v)$, with the pair $(\mathbf{z}_u, \mathbf{z}_v)$ drawn from the joint distribution induced by the shared anchor-sets.   (5)

Here $d_y$ is the target similarity metric determined by the learning task, for example indicating links between nodes or membership in the same community, and $d_z$ is the similarity metric in the embedding space, usually based on an $l_p$ norm.

Optimizing Equations 4 and 5 gives representations of nodes using the marginal and the joint distributions over node embeddings, respectively. If we treat the embeddings of a pair of nodes as random variables, then the mutual information between any pairwise labeling function and the jointly distributed embeddings is larger than that between the labeling function and the independently (marginally) distributed embeddings, i.e., the product of the marginal distributions. The gap in this mutual information is large when the target task depends on the positional information of nodes, which is exactly what the shared choice of anchor-sets captures. Thus, we conclude that P-GNNs, which embed nodes using the joint distribution of their distances to common anchor-sets, have more expressive power than existing GNNs.

5.3 Complexity Analysis
Here we discuss the complexity of the neural network computation. In P-GNNs, every node communicates with $k = O(\log^2 n)$ anchor-sets in a graph with $n$ nodes and $e$ edges. Suppose on average each anchor-set contains $m$ nodes; then there are $O(nkm)$ message communications in total. If we follow the exact anchor-set selection strategy of Section 4.2, the average anchor-set size is $O(n / \log n)$ and the complexity becomes $O(n^2 \log n)$. In contrast, the number of communications is $O(e)$ for existing GNNs. In practice, we observe that the computation can be sped up by using a simplified $\textsc{Agg}_M$, while only slightly sacrificing predictive performance: for each anchor-set, we only aggregate the message from the anchor node closest to a given node $v$. This removes the factor $m$ from the complexity of P-GNNs, making it $O(n \log^2 n)$. We use this implementation in the experiments.
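A sketch of this simplified aggregation (our own code, mirroring the hypothetical layer sketch in Section 4.1): per anchor-set, only the anchor closest to $v$ contributes a message, which drops the anchor-set-size factor from the cost.

```python
import numpy as np

def pgnn_layer_simplified(H, D, anchor_sets, w):
    """P-GNN layer with the simplified AGG_M: for each anchor-set, aggregate only
    the message from the anchor node closest to v, so only O(n * k) messages are
    computed instead of O(n * k * m)."""
    n, f = H.shape
    k = len(anchor_sets)
    S = 1.0 / (D + 1.0)                                  # s(v, u) = 1 / (d(v, u) + 1)
    M = np.zeros((n, k, 2 * f))
    for i, anchors in enumerate(anchor_sets):
        for v in range(n):
            u = anchors[np.argmin(D[v, anchors])]        # closest anchor to v
            M[v, i] = S[v, u] * np.concatenate([H[v], H[u]])
    z = np.tanh(M @ w)          # position-aware embedding: one dimension per anchor-set
    h_next = M.mean(axis=1)     # structure-aware message for the next layer
    return z, h_next
```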
6 Experiments
Table 1. P-GNNs compared to GNNs on link prediction tasks, measured in ROC AUC (standard deviation errors are given). Columns: Grid-T, Communities-T, Grid, Communities, PPI; rows: GCN, GraphSAGE, GAT, GIN, P-GNN-F-1L, P-GNN-F-2L, P-GNN-E-1L, P-GNN-E-2L. Grid-T and Communities-T refer to the transductive learning setting of Grid and Communities, where one-hot feature vectors are used as node attributes.
Table 2. P-GNNs compared to GNNs on pairwise node classification tasks, measured in ROC AUC (standard deviation errors are given). Columns: Communities, Protein; rows: GCN, GraphSAGE, GAT, GIN, P-GNN-F-1L, P-GNN-F-2L, P-GNN-E-1L, P-GNN-E-2L.
6.1 Datasets
We perform experiments on both synthetic and real datasets. We use the following datasets for a link prediction task:
Grid.
2D grid graph representing a 20 × 20 grid with 400 nodes and no node features.
Communities.
Connected caveman graph (Watts, 1999) with 1% of edges randomly rewired, consisting of 20 communities of 20 nodes each.
PPI.
24 protein–protein interaction networks (Zitnik & Leskovec, 2017). Each graph has 3000 nodes with average degree 28.8, and each node has a 50-dimensional feature vector.
We use the following datasets for pairwise node classification tasks, which include community detection and role equivalence prediction. (Inductive position-aware node classification is not well-defined due to the permutation of labels across different graphs; however, pairwise node classification, which only decides whether two nodes are of the same class, is well-defined in the inductive setting.)
Communities.
The same as the above-mentioned Communities dataset, with each node labeled with the community it belongs to.
Emails.
7 real-world email communication graphs from SNAP (Leskovec et al., 2007) with no node features. Each graph has 6 communities, and each node is labeled with the community it belongs to.
Protein.
1113 protein graphs from Borgwardt et al. (2005). Each node is labeled with a functional role of the protein and has a 29-dimensional feature vector.
6.2 Experimental setup
Next, we evaluate P-GNN models in both transductive and inductive learning settings.
Transductive learning. In the transductive learning setting, the model is trained and tested on a given graph with a fixed node ordering and has to be retrained whenever the node ordering is changed or a new graph is given. As a result, the model is allowed to augment node attributes with unique one-hot identifiers to differentiate the nodes. Specifically, we follow the experimental setting of Zhang & Chen (2018) and use two sets of 10% of the existing links, each paired with an equal number of non-existent links, as test and validation sets, with the remaining 80% of existing links and an equal number of non-existent links used as the training set. We report the test set performance when the best performance on the validation set is achieved, and we report results over 10 runs with different random seeds and train/validation splits.
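The split can be sketched as follows (our own illustration with a hypothetical helper `split_links`; the setup follows Zhang & Chen (2018)): an 80/10/10 partition of existing links, each part paired with an equal number of sampled non-existent links.

```python
import numpy as np

def split_links(edges, num_nodes, seed=0):
    """80/10/10 split of existing links, each paired with an equal number of
    uniformly sampled non-existent links as negatives."""
    rng = np.random.default_rng(seed)
    edges = [tuple(sorted(e)) for e in edges]
    rng.shuffle(edges)
    n = len(edges)
    splits = {"train": edges[: int(0.8 * n)],
              "val": edges[int(0.8 * n): int(0.9 * n)],
              "test": edges[int(0.9 * n):]}
    edge_set = set(edges)
    out = {}
    for name, pos in splits.items():
        neg = []
        while len(neg) < len(pos):                       # rejection-sample non-edges
            u, v = rng.integers(num_nodes, size=2)
            if u != v and tuple(sorted((u, v))) not in edge_set:
                neg.append((u, v))
        out[name] = (pos, neg)
    return out
```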
Inductive learning. We demonstrate the inductive learning performance of P-GNNs on pairwise node classification tasks, for which it is possible to transfer the positional information to a new, unseen graph. In particular, for inductive tasks, augmenting node attributes with one-hot identifiers restricts a model's generalization ability, because the model would need to generalize across scenarios where node identifiers can be arbitrarily permuted. Therefore, when the dataset does not come with node attributes, we only consider using constant, order-invariant node attributes, such as a constant scalar, in our experiments. Original node attributes are used if they are available.
We follow the transductive learning setting to sample links, but only use order-invariant attributes. When multiple graphs are available, we use 80% of the graphs for training and the remaining graphs for testing. Note that we do not allow the model to observe the ground-truth test graphs at training time. For the pairwise node classification task, we predict whether a pair of nodes belongs to the same community/class. In this case, a pair of nodes that do not belong to the same community is a negative example.
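A sketch of how such pairs can be formed (our own illustrative helper `sample_node_pairs`; labels are community or class IDs): positive pairs share a label, negative pairs do not, in equal numbers.

```python
import numpy as np

def sample_node_pairs(labels, num_pairs, seed=0):
    """Build a balanced set of node pairs for pairwise node classification:
    a pair is positive iff both nodes carry the same community/class label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = len(labels)
    pos, neg = [], []
    while len(pos) < num_pairs or len(neg) < num_pairs:
        u, v = rng.integers(n, size=2)
        if u == v:
            continue
        if labels[u] == labels[v] and len(pos) < num_pairs:
            pos.append((u, v))
        elif labels[u] != labels[v] and len(neg) < num_pairs:
            neg.append((u, v))
    return pos, neg   # train a binary classifier on (pos, neg); evaluate with ROC AUC
```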
6.3 Baseline models
So far we have shown that P-GNNs are a family of models that differ from existing GNN models. Therefore, we compare variants of P-GNNs against the most popular GNN models. To make a fair comparison, all models are set to have a similar number of parameters and are trained for the same number of epochs. We fix model configurations across all the experiments (implementation details are provided in the Appendix). We show that even the simplest P-GNN models can significantly outperform GNN models in many tasks, and designing more expressive P-GNN models is an interesting avenue for future work.
GNN variants. We consider 4 variants of GNNs, each with three layers: GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017a), Graph Attention Networks (GAT) (Velickovic et al., 2018), and Graph Isomorphism Networks (GIN) (Xu et al., 2019). Note that in the context of the link prediction task, our implementation of GCN is equivalent to GAE (Kipf & Welling, 2016).
P-GNN variants. We consider 2 variants of P-GNNs, each with either one or two layers (labeled 1L, 2L): (1) P-GNNs using the truncated 2-hop shortest path distance (P-GNN-F); (2) P-GNNs using the exact shortest path distance (P-GNN-E).
6.4 Results
Link prediction. In link prediction tasks, two nodes are generally more likely to form a link if they are close together in the graph. Therefore, the task can largely benefit from position-aware embeddings. Table 1 summarizes the performance of P-GNNs and GNNs on link prediction. We observe that P-GNNs significantly outperform GNNs across all datasets and variants of the link prediction task (inductive vs. transductive). P-GNNs perform well in all inductive link prediction settings, for example improving the ROC AUC score by up to 66% over the best GNN model on the Grid dataset. In the transductive setting, P-GNNs and GNNs achieve comparable performance. The explanation is that one-hot encodings of nodes help GNNs to memorize node IDs and differentiate symmetric nodes, but at the cost of expensive computation over high-dimensional one-hot input features and the failure to generalize to unobserved graphs. On the other hand, P-GNNs can discriminate symmetric nodes by their different distances to anchor-sets, and thus adding one-hot features does not help their performance. In addition, we observe that when graphs come with rich features (e.g., the PPI dataset), the performance gain of P-GNNs is smaller, because the node features may already capture positional information. Quantifying how much of the positional information is already captured by the input node features is an interesting direction left for future work. Finally, we show that the "fast" variant of the P-GNN model (P-GNN-F), which truncates the expensive shortest path distance computation at 2 hops, still achieves comparable results on many datasets.
Pairwise node classification. In pairwise node classification tasks, two nodes may belong to different communities but have similar neighbourhood structures, so GNNs, which focus on learning structure-aware embeddings, will not perform well on these tasks. Table 2 summarizes the performance of P-GNNs and GNNs on pairwise node classification tasks. The capability of learning position-aware embeddings is crucial on the Communities dataset, where all P-GNN variants nearly perfectly detect the membership of nodes in communities, while the best GNN can only achieve 0.620 ROC AUC, meaning that P-GNNs give a 56% relative improvement in ROC AUC over GNNs on this task. Similarly significant performance gains are observed on the Email and Protein datasets: 18% improvement in ROC AUC on Email and 39% improvement of P-GNN over GNN on Protein.
7 Conclusion
We propose Position-aware Graph Neural Networks, a new class of Graph Neural Networks for computing node embeddings that incorporate node positional information, while retaining inductive capability and utilizing node features. We show that P-GNNs consistently outperform existing GNNs across a variety of tasks and datasets.
Acknowledgements
This research has been supported in part by Stanford Data Science Initiative, NSF, DARPA, Boeing, Huawei, JD.com, and Chan Zuckerberg Biohub.
References
 Battaglia et al. (2018) Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 Belkin & Niyogi (2002) Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pp. 585–591, 2002.
 Borgwardt & Kriegel (2005) Borgwardt, K. M. and Kriegel, H.-P. Shortest-path kernels on graphs. In IEEE International Conference on Data Mining, pp. 74–81, 2005.
 Borgwardt et al. (2005) Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.
 Bourgain (1985) Bourgain, J. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1–2):46–52, 1985.
 Gärtner et al. (2003) Gärtner, T., Flach, P., and Wrobel, S. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pp. 129–143. 2003.
 Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. International Conference on Machine Learning, 2017.
 Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864. ACM, 2016.
 Hamilton et al. (2017a) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017a.
 Hamilton et al. (2017b) Hamilton, W. L., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017b.
 Jeh & Widom (2003) Jeh, G. and Widom, J. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, pp. 271–279. ACM, 2003.
 Kashima et al. (2003) Kashima, H., Tsuda, K., and Inokuchi, A. Marginalized kernels between labeled graphs. In International Conference on Machine Learning, pp. 321–328, 2003.
 Kipf & Welling (2016) Kipf, T. N. and Welling, M. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016.
 Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.
 Leskovec et al. (2007) Leskovec, J., Kleinberg, J., and Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2, 2007.
 Linial et al. (1995) Linial, N., London, E., and Rabinovich, Y. The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245, 1995.
 Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710. ACM, 2014.
 Scarselli et al. (2009) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
 Selsam et al. (2019) Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., and Dill, D. L. Learning a SAT solver from single-bit supervision. International Conference on Learning Representations, 2019.
 Sugiyama & Borgwardt (2015) Sugiyama, M. and Borgwardt, K. M. Halting in random walk kernels. In Advances in Neural Information Processing Systems, pp. 1639–1647, 2015.
 Velickovic et al. (2018) Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations, 2018.
 Watts (1999) Watts, D. J. Networks, dynamics, and the small-world phenomenon. American Journal of Sociology, 105(2):493–527, 1999.
 Xu et al. (2018) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.-i., and Jegelka, S. Representation learning on graphs with jumping knowledge networks. International Conference on Machine Learning, 2018.
 Xu et al. (2019) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations, 2019.
 Yanardag & Vishwanathan (2015) Yanardag, P. and Vishwanathan, S. V. N. Deep graph kernels. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374, 2015.

 Ying et al. (2018a) Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018a.
 Ying et al. (2018b) Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4805–4815, 2018b.
 You et al. (2018a) You, J., Liu, B., Ying, R., Pande, V., and Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems, 2018a.
 You et al. (2018b) You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. GraphRNN: Generating realistic graphs with deep autoregressive models. In International Conference on Machine Learning, 2018b.
 Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.
 Zhang & Chen (2018) Zhang, M. and Chen, Y. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems, 2018.
 Zitnik & Leskovec (2017) Zitnik, M. and Leskovec, J. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.