Introduction
Graph representation learning has attracted unprecedented attention recently due to the ubiquity of graph-structured data Hamilton et al. (2017b). There is a great diversity of graph-related applications, ranging from node classification and link (or subgraph) prediction to whole-graph classification. Graph neural networks (GNNs) are a powerful tool for graph representation learning Scarselli et al. (2008); Battaglia et al. (2018). To represent a structure of interest, GNNs in general follow two steps: (1) learn the representations of nodes; (2) read out the representations of a group of nodes of interest to make predictions. For example, in link prediction tasks, the representations of the two endpoint nodes are read out to represent the link Kipf and Welling (2016); Hamilton et al. (2017a). This design seems reasonable. However, there is a severe issue: node representations learnt by GNNs only reflect the contextual structure around the corresponding nodes; the correlation between two node representations cannot be sufficiently established. Here, the correlation between two nodes means a measure of their common contextual structure, e.g., their distance. Due to the absence of such correlation, GNNs may not achieve the desired accuracy on inference that depends on correlations between multiple nodes. An illustrative example of this issue, given by Srinivasan and Ribeiro, is shown in Fig. 1. Two nodes in a food web, Lynx and Orca, obtain the same node representations as they have isomorphic contextual structures. However, if one wants GNNs to tell whether Lynx or Orca is more likely to be the predator of Pelagic Fish (a link prediction task), GNNs cannot make a correct prediction since the node representations of Lynx and Orca are the same. Only if we associate each node with informative node attributes that reflect its identity can GNNs distinguish Lynx and Orca.
Inspired by the empirically successful work of Zhang and Chen (2018), Li et al. recently proposed distance encoding (DE) techniques to solve this issue Li et al. (2020). They paired the general GNN framework with DE and obtained the model DE-GNN (see Fig. 2). DE-GNN assigns every node in the graph a set of correlated distances to establish the correlation between the representations of a group of nodes of interest. In the case of Fig. 1, the node Seal is one hop away from Pelagic Fish, while its distances to Orca and Lynx differ (one hop and infinitely many hops, respectively). Such different distances empower DE-GNN to distinguish the relations between the two types of node pairs, (Pelagic Fish, Orca) vs. (Pelagic Fish, Lynx), and thus lead to a successful prediction. Note that this procedure naturally generalizes to the prediction of any higher-order structure (containing more than one node).
Although the success of DE for higher-order structure prediction is clear and significant, how DE helps GNNs with node classification is still unclear. This corresponds to the case when the node set in Fig. 2 stands for a single node that is to be classified.
Li et al. theoretically proved that the representation power of DE-GNN is better than that of traditional GNNs (see an example in Fig. 3) Li et al. (2020). However, whether GNNs need more representation power of structures to perform well on node classification remains unknown. Note that representation power only relates to how well GNNs can fit the training set; it does not necessarily imply good testing performance. In other words, if more representative GNNs memorize information that is irrelevant to the current task, the testing performance may deteriorate. Moreover, there could be different formulations of node classification tasks. We need to provide a better understanding of how and where DE fits these real-world node classification tasks. Otherwise, practitioners may find it challenging to properly use DE and GNNs in their applications, as different combinations of them would be suitable for different settings.

Most previous GNN-based node classification tasks focus on predicting labels that reflect the community a node belongs to Kipf and Welling (2017); Veličković et al. (2018); Hamilton et al. (2017a). We henceforth term these community-type labels (C-labels). For example, consider a co-authorship network where each node corresponds to a researcher and each edge indicates that two researchers have collaborated before. The C-label of a node could be the area that a researcher works in. Graphs with C-labels are typically homophilic, where nodes with the same labels are more likely to be mutually connected (see Fig. 4 right).
DE-GNN Li et al. (2020), however, has not been evaluated on predicting C-labels. Instead, it was originally tested on predicting another type of node labels, the structural roles that in practice correspond to the structural functions of nodes in a network Henderson et al. (2012); Ribeiro et al. (2017). We henceforth term these structure-type labels (S-labels). As an example of S-labels, still in the co-authorship network (see Fig. 4 left), S-labels may indicate whether a researcher is a core node (red diamonds) or a peripheral one (green triangles), or whether he/she works in a group (blue circles) or more independently (grey squares). Another real-world example of S-labels is the positions of different individuals in an organization, such as managers vs. employees, faculty vs. students, and so on. S-labels are typically indicated by the contextual structures of nodes. Graphs paired with S-labels are heterophilic, where nodes with the same labels are not necessarily adjacent and could be spread all across the graph. Although S-labels have rarely been used to evaluate GNNs until very recently, they are comparably important, if not more so, than C-labels in many graph-related applications Ahmed et al. (2020). There are some other types of node classification as well, but here we focus our discussion on the above two types as they cover almost all current node classification applications.
Because of the difference between C-labels and S-labels, the mechanisms that GNNs use to predict them are different. For C-labels, as nodes with the same labels are more likely to be connected (homophilic networks), GNNs work by iteratively smoothing the features of a node with those of its neighbors. The final predictions are also smooth over the graph structure. The obtained smoothness is consistent with the allocation of C-labels over graphs, which makes GNNs work well for C-label prediction.¹ However, the above smoothing procedure is not ideal for S-label prediction, as nodes with the same S-labels are spread across the graph, which induces heterophilic graphs. This setting in general means that the previous mechanism of GNNs cannot be applied. To solve this issue, two approaches have recently been proposed to make the feature-smoothing procedure of GNNs fit the heterophilic setting: some work explicitly or implicitly changes the graph structure Pei et al. (2020); Liu et al. (2020) to connect nodes with the same S-labels and smooth their features; other work expects GNNs to adaptively learn to act as high-pass graph filters that smooth raw features over nodes that are actually far away Chien et al. (2020); Zhu et al. (2020). Both strategies still depend on node features and essentially smooth raw node features, though the smoothed features may not come from direct neighbors. For heterophilic networks, in contrast, DE-GNN adopts a fundamentally different mechanism. It learns a representation of the "purely" contextual structure around each node and sets it as the node representation. This procedure can be totally independent of raw features, though adding raw node features may or may not help. As S-labels are, by definition, typically indicated by the contextual structures of nodes, DE-GNN can learn a suitable node representation for S-label prediction even without informative raw node features. Its effectiveness was demonstrated by predicting the passenger flow volume of different airports, where no raw node features were available Li et al. (2020).

¹ GNNs may work well for C-label prediction but they do not perform the best. GNNs typically work on a local enclosing subgraph around a node, and hence can capture only limited topological information to predict community labels. Note that community labels can only be robustly captured via long-range propagation, as theoretically demonstrated in Li et al. (2019). Here, long-range propagation means that features need to be propagated over a large number of hops (≥10) instead of just a few hops as typical GNNs do Kipf and Welling (2017); Veličković et al. (2018). This point can actually explain why many models, such as SGC Wu et al. (2019), APPNP Klicpera et al. (2019a) and C&S Huang et al. (2020), achieve even better performance in C-label prediction by simply removing non-linear activations and allowing node features to propagate far across the graph.
Based on the previous discussion, we may conclude that standard GNNs are mostly used for C-label prediction, which heavily depends on informative raw features, while DE-GNN works for S-label prediction and does not necessarily require raw features. In practice, it can be hard to tell whether a dataset contains C- or S-labels, and whether the raw features are informative for a certain type of labels. For example, the "purely" contextual structure (without raw node features) could provide useful information for C-label prediction if different communities have different internal subgraph structures. Specifically, researchers in the area of computer systems typically work in groups, since their research projects in general have heavy workloads, while researchers on the theory side tend to work in a more independent way. Then, whether a node belongs to some internal group (like the blue-circle nodes in Fig. 4 left) could be indicative of whether the researcher works in systems or theory. Therefore, DE-GNN has the potential to work better on C-label prediction as well. For S-label prediction, by definition, DE should always be helpful while the raw features are not necessarily helpful. Therefore, it is inconclusive whether we need to combine DE with raw features for S-label prediction.
Beyond the above analysis, in this work we provide an extensive study of DE-GNNs for node classification. We investigate how DE can be combined with GNNs and with raw features in eight different configurations over eight different datasets that contain either C-labels or S-labels. We consider whether or not to use distance encoding, to use raw features, and to propagate raw features, as well as the different combinations of these choices. At a high level, we reach the following conclusions, leaving the detailed analysis for later.

- DE always significantly improves GNNs for S-label predictions.

- Raw features are still informative in those heterophilic networks. However, whether or not to allow them to propagate in GNNs depends on the dataset.
GNNs and DE-GNN for Node Classification
In this section, we summarize the key steps of GNNs and the DE technique Li et al. (2020) for node classification.
Notation
We are given a graph G = (V, E) with a node feature matrix X, where V is the set of nodes and E is the set of edges. Denote its adjacency matrix as A. Each node v ∈ V has a feature vector x_v. The homophily ratio h is the fraction of edges in a graph that connect nodes with the same class label (i.e., intra-class edges). Based on the value of h, graphs can be classified into two types, namely homophily (large h) and heterophily (small h). It is a quantitative indicator for graph datasets of whether the labels lean towards the community type. Note that a heterophilic graph does not necessarily mean its labels are of the structure type, although these two concepts are often consistent in practice.
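To make the definition concrete, the homophily ratio can be computed directly from an edge list and the node labels. The following is a minimal sketch in plain Python on toy data of our own (not from the paper's datasets):

```python
def homophily_ratio(edges, labels):
    """Fraction of edges whose two endpoints share the same class label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy graph: nodes 0-3, two communities {0, 1} and {2, 3}, one cross edge.
edges = [(0, 1), (2, 3), (1, 2)]
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
print(homophily_ratio(edges, labels))  # 2 of 3 edges are intra-class -> 0.666...
```

A ratio near 1 indicates a homophilic graph; values well below 0.5, as in the heterophilic datasets discussed later, indicate heterophily.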
Graph Neural Networks (GNNs)
The l-th layer of an L-layer GNN (1 ≤ l ≤ L) can be formulated as

h_v^(l) = COMBINE( h_v^(l-1), AGG( { h_u^(l-1) : u ∈ N(v) } ) ),

where h_v^(l) is the representation of node v at the l-th layer. The AGG function determines how to aggregate information from node v's neighborhood N(v) (such as max or mean pooling), while the COMBINE function decides how to fuse the representation of node v from the previous layer with the aggregated information from its neighbors.
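As a minimal sketch of one such layer, assuming mean pooling for the aggregation step and a linear map followed by ReLU for the combination step (the weight matrices and the toy graph below are our own illustrative choices):

```python
import numpy as np

def gnn_layer(H, adj, W_self, W_neigh):
    """One GNN layer: aggregate = mean over neighbors, combine = linear + ReLU.
    H: (n, d) node representations from the previous layer.
    adj: dict mapping each node index to a list of neighbor indices."""
    agg = np.zeros_like(H)
    for v, neigh in adj.items():
        if neigh:                       # mean-pool the neighbors' representations
            agg[v] = H[list(neigh)].mean(axis=0)
    # Fuse the node's own representation with the aggregated neighborhood.
    return np.maximum(H @ W_self + agg @ W_neigh, 0.0)

# Tiny example: 3 nodes on a path 0-1-2, 2-dim features, identity weights.
H0 = np.eye(3, 2)
adj = {0: [1], 1: [0, 2], 2: [1]}
H1 = gnn_layer(H0, adj, np.eye(2), np.eye(2))  # H1[1] == [0.5, 1.0]
```

Stacking L such layers gives each node a representation that depends on its L-hop neighborhood, which is exactly the locality discussed above.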
Problem Setup
We focus on the semi-supervised node classification problem on a simple graph G = (V, E). Given a training set V_train ⊂ V with known class labels y_v for all v ∈ V_train, and a feature vector x_v for every v ∈ V, we aim to infer the unknown class labels y_u for all u ∈ V \ V_train.
Node Structural Features and Distance Encoding
Summary statistics of graphs (e.g., node degrees) are commonly used as structural features in many traditional machine learning approaches. Walk-based techniques, such as subgraph sampling and random walks, are also popular for extracting structural information by exploiting the node localities of graphs. In the following, we introduce another general class of structure-related features, termed distance encoding (DE)
Li et al. (2020). DE is an approach to encode the distance between a node in the graph and a target node set. For this study, we adopt a simplified version of DE by reducing the size of the node set to 1. Suppose we are to classify node u; the DE between any node v in the graph and u is formally defined as follows.

Definition 1. Given a target node u for classification and the adjacency matrix A, the distance encoding ζ(v|u) for any v ∈ V is a mapping based on the collection of landing probabilities of random walks from u to v, i.e.,

ζ(v|u) = g( (W)_{vu}, (W^2)_{vu}, (W^3)_{vu}, ... ),

where W = AD^{-1} is the random walk matrix (D is the diagonal degree matrix). g can be a learnable or a fixed function.
The distance encoding defined above is a general form that covers several graph-related measurements. If we set the mapping to return the position of the first non-zero entry in the landing-probability sequence, it gives the shortest-path distance (SPD) between the two nodes. The mapping could also be a feed-forward neural network. In practice, a finite prefix of the landing-probability sequence is sufficient to encode the contextual structure of a node. Compared to other techniques for representing the structural roles of nodes, e.g., motif/graphlet counting Ahmed et al. (2020), DE is computationally efficient. Meanwhile, DE is different from positional node embeddings such as Node2vec Grover and Leskovec (2016), because it captures the relative distance between a node and the target node. Such a distance is independent of the absolute positions of nodes in the graph, which ensures that DE-based approaches are inductive and can predict labels spread over sections of graphs that have not been used for training.
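The landing-probability sequence and its SPD special case can be sketched as follows. This is an illustrative NumPy implementation under our own assumptions (column-normalized random-walk matrix, a truncation length K, and helper names that are not from Li et al. (2020)):

```python
import numpy as np

def de_features(A, u, K=4):
    """Landing probabilities (W^k)_{vu} for k = 1..K and every node v,
    where W = A D^{-1} is the column-normalized random-walk matrix."""
    W = A / A.sum(axis=0)               # divide each column by the node degree
    x = np.zeros(len(A)); x[u] = 1.0    # start the walk at the target node u
    cols = []
    for _ in range(K):
        x = W @ x                       # x now holds the column (W^k)_{:,u}
        cols.append(x.copy())
    return np.stack(cols, axis=1)       # shape (n, K): row v is the sequence for v

def spd(A, u, K=4):
    """SPD(v, u): first k with (W^k)_{vu} > 0 (K + 1 if unreached within K)."""
    feats = de_features(A, u, K)
    d = np.full(len(A), K + 1)
    for v in range(len(A)):
        nz = np.nonzero(feats[v])[0]
        if v == u:
            d[v] = 0
        elif nz.size:
            d[v] = nz[0] + 1
    return d

# Path graph 0-1-2: distances to node 0 are [0, 1, 2].
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(spd(A, 0))  # [0 1 2]
```

Note that one matrix-vector product per power suffices because only the column of W^k indexed by the target node is needed, which is the efficiency argument made above.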
Utilizing Structural Features for Node Classification
In conventional GNNs, the first-layer representation of a node v is initialized with its raw feature vector x_v. Besides this, structural features such as DE and node degree can also be used for initialization by setting the initial node features to x_v ⊕ s_v, where s_v denotes the structural features of v and ⊕ denotes an operation (e.g., concatenation) combining the two types of features. In reality, especially for S-label prediction, the raw features from neighbors can be irrelevant while the structural features are informative. Thus, we also consider GNNs that first propagate only structural features, and then concatenate the last-layer representation of a node (which by then holds only "pure" structural information) with its raw features to make predictions, removing the effect of irrelevant neighbors' raw features.
Based on which type(s) of features a model uses (node degree, raw, and DE features) and where the raw features are added (the first or the last layer of the model), we design eight types of GNN variants for classification tasks in terms of structural roles. The detailed configuration of each model is listed in Table 1. Among them, one variant takes only raw features as the initial input without any structural features involved, which is equivalent to a standard GNN under a typical classification setting. Similarly, the variant that attaches raw features only at the last layer, without structural features, propagates nothing through the graph and essentially degenerates to a multi-layer perceptron (MLP).
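The two placements of raw features (first layer vs. last layer) can be sketched as below. The function names and the propagate/classify callables are hypothetical stand-ins for the stacked GNN layers and the final classifier head, not the paper's actual implementation:

```python
import numpy as np

def predict_first(struct_feats, raw_feats, propagate, classify):
    """'First': concatenate raw and structural features, then propagate both
    through the GNN layers before classifying."""
    h = np.concatenate([raw_feats, struct_feats], axis=1)
    return classify(propagate(h))

def predict_last(struct_feats, raw_feats, propagate, classify):
    """'Last': propagate only the structural features; attach raw features
    (untouched by neighbors) right before the classifier."""
    h = propagate(struct_feats)
    return classify(np.concatenate([h, raw_feats], axis=1))
```

The "Last" variant is the one motivated above: neighbors' possibly irrelevant raw features never enter the smoothing procedure, yet the node's own raw features still inform the final prediction.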
Model  DE  Raw Features  Degree  Description 

(First)  with all features  
(Last)  with all features  
without raw node features  
node degree only  
(First)  raw node features only  
(Last)  raw node features only  
(First)  without DE features  
(Last)  without DE features 
Model  Cora  Citeseer  Pubmed  Chameleon  Actor  Cornell  Texas  Wisconsin

GCN*  85.77  73.68  88.13  28.18  26.86  52.70  52.16  45.88
Geom-GCN*  85.27  77.99  90.05  60.90  31.63  60.81  67.57  64.12
SPD  85.76±1.32  74.42±0.88  85.74±0.32  65.49±2.11  35.30±0.96  68.38±7.46  79.19±6.29  75.60±6.74
SPD  71.81±2.18  71.61±2.22  85.42±0.58  51.74±2.67  36.26±1.52  81.35±3.51  84.05±3.72  83.60±4.27
SPD  39.23±6.36  27.79±1.31  44.65±0.51  48.22±2.05  25.56±0.75  50.27±7.24  53.51±5.51  50.00±6.19
RW  85.63±1.63  73.98±1.19  85.40±0.42  65.82±2.30  35.12±0.72  65.14±4.90  78.65±6.10  76.60±4.10
RW  71.05±2.39  71.01±1.43  85.76±0.52  52.00±2.04  36.42±1.11  81.08±4.83  82.97±5.28  84.00±4.98
RW  37.77±2.68  27.46±1.88  46.19±0.75  39.03±3.45  36.13±2.03  51.89±3.59  53.24±5.80  51.20±4.02
36.81±3.91  27.41±1.38  44.46±0.52  35.76±3.99  25.53±0.36  51.62±4.09  54.05±6.04  50.60±3.90
(SAGE)  86.03±1.94  74.24±1.39  86.38±0.57  60.75±2.01  34.38±0.96  69.73±6.10  76.76±5.82  78.60±3.58
(MLP)  71.16±1.93  71.32±1.22  85.12±0.50  51.78±1.83  35.76±1.15  80.00±3.86  81.62±2.65  81.00±4.40
86.42±1.79  73.92±1.42  86.35±0.55  61.82±1.69  34.36±1.17  71.89±6.64  79.46±4.55  78.20±4.51
72.99±1.61  71.80±1.73  85.03±0.76  49.67±1.34  35.49±0.93  78.65±5.33  78.65±6.33  83.00±4.02
Experiments
Datasets & Baselines
We utilize three types of public datasets for node classification tasks, namely citation networks, Wikipedia networks, and WebKB. An overview of the characteristics of each dataset is summarized in Table 3.
Citation networks include Cora, Citeseer, and Pubmed Sen et al. (2008); Namata et al. (2012), which are standard benchmarks for semi-supervised node classification. In these networks, nodes and edges denote papers and the citation relations between them, respectively. Node features are embeddings of the papers, and the labels are C-labels that indicate the research fields of the papers. Citation networks have a strong tendency towards homophily, with ratios above 0.7 (see Table 3).
Chameleon and Actor are subgraphs of webpages in Wikipedia Rozemberczki et al. (2019); Tang et al. (2009). Nodes represent webpages regarding specific topics or actor profiles, and edges denote mutual links between pages or co-occurrence on the same page. We use the labels generated by Pei et al.: nodes in Chameleon are classified into five categories based on monthly averaged visits. These node labels are S-labels, as the visiting traffic implies whether a node is a hub or not. Nodes in Actor have five classes in terms of the words appearing on the actor's Wikipedia page, and it is hard to tell whether these are C- or S-labels. Both graphs have low homophily ratios, around 0.2.
Cornell, Texas, and Wisconsin are three subsets of the WebKB dataset, a network of university webpages collected by CMU. In these networks, each node denotes a web page, manually classified into five categories: student, staff, faculty, course, and project. The labels indicate the structural roles of nodes in terms of job positions and intended audiences, and thus are S-labels.
Dataset  Cora  Cite.  Pubm.  Cham.  Actor  Corn.  Wisc.  Texa. 

Label Type  C  C  C  S    S  S  S 
Homophily Ratio  0.81  0.74  0.8  0.23  0.22  0.3  0.21  0.11 
# Nodes  2,708  3,327  19,717  2,277  7,600  183  183  251 
# Edges  5,428  4,732  44,338  36,101  33,544  295  309  499 
# Features  1,433  3,703  500  2,325  931  1,703  1,703  1,703 
# Class  7  6  3  4  4  5  5  5 
Benchmark Settings
For all benchmarks, we follow the same dataset setting as in Pei et al. (2020), randomly splitting the nodes of each class into 60%, 20%, and 20% for training, validation, and testing. We perform hyperparameter search for all models on the validation set (details are included in the appendix).
We choose SAGE as the GNN backbone and construct the eight GNN variants in Table 1. In particular, to evaluate the DE variants, we apply both the SPD-based one-hot vector and the sequence of landing probabilities of random walks. Models with these two types of DEs are labeled SPD and RW, respectively.
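A minimal sketch of the SPD-based one-hot encoding is given below; the clipping threshold max_dist is our own illustrative choice (the RW variant instead feeds the landing-probability sequence directly as a dense feature vector):

```python
import numpy as np

def spd_one_hot(dists, max_dist=3):
    """One-hot encode shortest-path distances, clipping everything beyond
    max_dist (including unreachable nodes) into the last bucket."""
    d = np.minimum(np.asarray(dists), max_dist)
    return np.eye(max_dist + 1)[d]

# Distances 0, 1, 2 land in their own buckets; 7 is clipped into bucket 3.
print(spd_one_hot([0, 1, 2, 7], max_dist=3))  # rows are one-hot at 0, 1, 2, 3
```

Clipping keeps the feature dimension fixed and handles disconnected node pairs, whose SPD is infinite, without special-casing.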
Results & Discussion
We report the results in Table 2.
Graphs with C-labels. On homophilic graphs, GNNs with DE provide a similar level of performance to the other baselines, even though the GCN-based methods are heavily optimized under a strict homophily assumption. As we can tell from the first three columns of Table 2, feeding DE to the model may not yield performance gains, but it does not hinder the model much either. Note that one may also use DE to control the aggregation procedure inside GNNs, termed DE-AGNN in Li et al. (2020); we did not include it in our current study. However, different types of graph diffusion (e.g., personalized-PageRank diffusion, heat-kernel diffusion), as a special application of DE to changing the way of propagation, can significantly improve node classification on homophilic graphs Klicpera et al. (2019b).
Graphs with S-labels. On heterophilic graphs, we observe consistent accuracy improvements for models with structural features (degree features, DE features, or both). Notably, the best performance on each dataset with S-labels is always achieved by one of the DE-assisted models, applying either SPD or RW. The results also suggest that raw node features still play a role in the classification of S-labels; in these particular cases, DE alone might not be sufficient. However, whether raw features should be propagated through the network or not is inconclusive: as the results of the paired (First)/(Last) variants suggest, it depends on the dataset in practice.
In summary, we can conclude that structural features help GNNs generalize to both homophilic and heterophilic graphs for node classification. The above results demonstrate that GNNs with both raw and structural features achieve the best performance on S-label graphs while maintaining comparable performance on C-label ones. This constitutes a more universal framework for both settings, compared to other models that implicitly assume strong homophily in the dataset. In particular, DE is significantly useful for predicting S-labels and naturally makes GNNs fit heterophilic graphs.
Conclusion
In this work, we revisited the issues and the potential of graph neural networks for semi-supervised node classification. Specifically, we investigated how structural features such as distance encoding work with raw features under this setting, and conducted an extensive empirical study of their individual and combined effectiveness for GNNs on node classification tasks. Our study includes both homophilic and heterophilic graphs with community- or structure-type labels. The experiments show that structural information enables GNNs to generalize to both types of labels and graph settings. In particular, they demonstrate the uniform effectiveness of applying distance encoding to assist GNNs in predicting structure-type labels.
References
- Role-based graph embeddings. IEEE Transactions on Knowledge and Data Engineering.
- Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
- Joint adaptive feature smoothing and topology extraction via generalized PageRank GNNs. arXiv preprint arXiv:2006.07988.
- node2vec: scalable feature learning for networks. In the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
- Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
- Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin 40 (3), pp. 52–74.
- RolX: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1231–1239.
- Combining label propagation and simple models outperforms graph neural networks. arXiv preprint arXiv:2010.13993.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Variational graph auto-encoders. NeurIPS Bayesian Deep Learning Workshop.
- Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
- Predict then propagate: graph neural networks meet personalized PageRank. In International Conference on Learning Representations.
- Diffusion improves graph learning. In Advances in Neural Information Processing Systems, pp. 13333–13345.
- Optimizing generalized PageRank methods for seed-expansion community detection. In Advances in Neural Information Processing Systems, pp. 11705–11716.
- Distance encoding: design provably more powerful GNNs for structural representation learning. arXiv preprint arXiv:2009.00142.
- Non-local graph neural networks. arXiv preprint arXiv:2005.14612.
- Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs.
- Geom-GCN: geometric graph convolutional networks. In International Conference on Learning Representations.
- struc2vec: learning node representations from structural identity. In the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394.
- Multi-scale attributed node embedding. arXiv preprint arXiv:1909.13021.
- The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
- Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
- On the equivalence between node embeddings and structural graph representations. In International Conference on Learning Representations.
- Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 807–816.
- Graph attention networks. In International Conference on Learning Representations.
- Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871.
- Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175.
- Generalizing graph neural networks beyond homophily. arXiv preprint arXiv:2006.11468.
Appendix
Hyperparameter Search
For a fair comparison, the size of the search space for each model is set to the same, covering the number of hidden units, initial learning rate, dropout rate, and weight decay, as listed below.
For our SAGE benchmark, the number of hidden layers is set to 1, except for Cora, Citeseer, and Actor, which need 2 or 3 layers; the number of hidden units is 32 (Cora, Pubmed, and Chameleon), 64 (Citeseer, WebKB), and 256 (Actor). The final hyperparameter settings for training are: initial learning rate lr=1e-4 with early stopping (patience of 50 epochs); weight decay l2=1e-6 for Cora, Pubmed, and Chameleon; dropout p=0.2 for Citeseer, Actor, and WebKB, and p=0.4 for Cora and Chameleon. We use Adam Kingma and Ba (2015) as the default optimizer.
The source code is publicly available on GitHub [repo]. Compared to the official PyTorch implementation of DE-GNN, our version introduces the following new features:

- The framework is able to repeatedly train instances of models without regenerating graph samples.
- The procedure of subgraph extraction and the number of GNN layers are decoupled: the model can now run on subgraphs with an arbitrary number of hops, without binding the extraction hops to the total number of propagation layers in the GNN.
- Sparse matrices are utilized to accelerate the computation of DE, especially for random-walk based features.