Graph representation learning has attracted unprecedented attention recently due to the ubiquity of graph-structured data Hamilton et al. (2017b). There is a great diversity of graph-related applications, ranging from node classification and link (or subgraph) prediction to whole-graph classification. Graph neural networks (GNNs) are a powerful tool for graph representation learning Scarselli et al. (2008); Battaglia et al. (2018). To represent a structure of interest, GNNs in general follow two steps: (1) learn the representations of nodes; (2) read out the representations of the group of nodes of interest to make predictions. For example, in link prediction tasks, the representations of the two nodes in a link are read out to represent this link Kipf and Welling (2016); Hamilton et al. (2017a). This design seems reasonable. However, it has a severe issue: node representations learnt by GNNs only reflect the contextual structure around the corresponding nodes, so the correlation between two node representations cannot be sufficiently established. Here, the correlation between two nodes means a measure of their common contextual structure, e.g., their distance. Without such correlation, GNNs may not achieve the desired accuracy on inference that depends on correlations between multiple nodes. An illustrative example of this issue, given by Srinivasan and Ribeiro, is shown in Fig. 1. Two nodes in a food web, Lynx and Orca, obtain the same node representation because they have isomorphic contextual structures. However, if one wants GNNs to tell whether Lynx or Orca is more likely to be the predator of Pelagic Fish (a link prediction task), GNNs cannot make a correct prediction, since the node representations of Lynx and Orca are the same. Only if we associate each node with informative node attributes that reflect its identity can GNNs distinguish Lynx from Orca.
Inspired by the empirically successful work of Zhang and Chen (2018), Li et al. recently proposed the distance encoding (DE) technique to solve this issue Li et al. (2020). They paired the general GNN framework with DE and obtained the model DE-GNN (see Fig. 2). DE-GNN assigns every node in the graph a set of correlated distances to establish the correlation between the representations of a group of nodes of interest. In the case of Fig. 1, the node Seal is 1 hop away from Pelagic Fish, but its distances to Orca and Lynx differ (1 hop and infinitely many hops, respectively). Such different distances empower DE-GNN to distinguish the relations between these two types of node pairs, (Pelagic Fish, Orca) vs. (Pelagic Fish, Lynx), and thus lead to a successful prediction. Note that this procedure generalizes naturally to predictions on any higher-order structure (containing more than one node).
Although the success of DE for higher-order structure prediction is clear and significant, how DE helps GNNs in node classification is still not clear. This corresponds to the case where the target node set in Fig. 2 consists of a single node that is to be classified. Li et al. theoretically prove that the representation power of DE-GNN is better than that of traditional GNNs (see an example in Fig. 3) Li et al. (2020). However, whether GNNs need more structural representation power to perform good node classification remains unknown. Note that representation power is only related to how well GNNs can fit the training set; it does not necessarily imply good testing performance. In other words, if more expressive GNNs memorize information that is irrelevant to the current task, the testing performance may deteriorate. Moreover, there can be different formulations of node classification tasks. We need to provide a better understanding of how and where DE fits these real-world node classification tasks. Otherwise, practitioners may find it challenging to properly use DE and GNNs in their applications, as different combinations of them are suitable for different settings.
Most previous GNN-based node classification tasks focus on predicting labels that reflect the community that a node belongs to Kipf and Welling (2017); Veličković et al. (2018); Hamilton et al. (2017a). We refer to these labels as community-type labels (C-labels) hereafter. For example, consider a co-authorship network where each node corresponds to a researcher and each edge indicates that two researchers have collaborated before. The C-label of a node could be the area that the researcher works in. Graphs with C-labels are typically homophilic, i.e., nodes with the same labels are more likely to be connected to each other (see Fig. 4 right).
DE-GNN Li et al. (2020), however, has not been evaluated on predicting C-labels. Instead, it was originally tested on predicting another type of node label, the structural role, which in practice corresponds to the structural function of a node in a network Henderson et al. (2012); Ribeiro et al. (2017). We refer to these labels as structure-type labels (S-labels) hereafter. As an example of S-labels, still in the co-authorship network (see Fig. 4 left), S-labels indicate whether a researcher is a core node (red diamonds) or a peripheral one (green triangles), or whether they work in a group (blue circles) or more independently (grey squares). Another real-world example of S-labels is the position of individuals in an organization, such as managers vs. employees, faculty vs. students, and so on. S-labels are typically indicated by the contextual structures of nodes. Graphs paired with S-labels are heterophilic, i.e., nodes with the same labels are not necessarily adjacent and may be spread all across the graph. Although S-labels have rarely been used to evaluate GNNs until very recently, they are as important as C-labels, if not more so, in many graph-related applications Ahmed et al. (2020). There are other types of node classification as well, but here we focus our discussion on the above two types, as they cover almost all current node classification applications.
Because of the difference between C-labels and S-labels, the mechanisms that GNNs use to predict them are different. For C-labels, as nodes with the same labels are more likely to be connected (homophilic networks), GNNs iteratively smooth the features of a node with those of its neighbors. The final predictions are also smooth over the graph structure. The obtained smoothness is consistent with the allocation of C-labels over graphs, which makes GNNs work well for C-label prediction (see footnote 1). However, the above smoothing procedure is not ideal for S-label prediction, as nodes with the same S-labels are spread across the graph, which induces heterophilic graphs. In this setting, the previous mechanism of GNNs generally cannot be applied. To solve this issue, two approaches have recently been proposed to make the feature-smoothing procedure of GNNs fit heterophilic settings: some works explicitly or implicitly change the graph structure Pei et al. (2020); Liu et al. (2020) to connect nodes with the same S-labels and smooth their features; others expect GNNs to adaptively learn to act as high-pass graph filters that smooth raw features on nodes that are actually far away Chien et al. (2020); Zhu et al. (2020). Both strategies still depend on node features and essentially smooth raw node features, though the features being smoothed may not come from direct neighbors. For heterophilic networks, in contrast, DE-GNN adopts a fundamentally different mechanism. It learns a representation of the “purely” contextual structure around each node and sets it as the node representation. This procedure can be totally independent of raw features, though adding raw node features may or may not help. As S-labels are, by definition, typically indicated by the contextual structures of nodes, DE-GNN can learn a suitable node representation for S-label prediction even without informative raw node features. Its effectiveness in this respect was demonstrated by predicting the passenger flow volume of different airports, where no raw node features were available Li et al. (2020).
Footnote 1: GNNs may work well for C-label prediction, but they do not perform the best. GNNs typically work on a local enclosing subgraph around a node and hence capture only limited topological information for predicting community labels. Note that community labels can only be robustly captured via long-range propagation, as theoretically demonstrated in Li et al. (2019). Here, long-range propagation means that features need to be propagated over a large number of hops (on the order of 10) instead of just the few hops that typical GNNs use Kipf and Welling (2017); Veličković et al. (2018). This point can explain why many models, such as SGC Wu et al. (2019), APPNP Klicpera et al. (2019a) and C&S Huang et al. (2020), achieve even better performance in C-label prediction by simply removing non-linear activations and allowing node features to propagate far across the graph.
Based on the previous discussion, we may conclude that standard GNNs have mostly been used for C-label prediction, which heavily depends on informative raw features, while DE-GNN works for S-label prediction and does not necessarily require raw features. In practice, it can be hard to tell whether a dataset contains C-labels or S-labels, and whether the raw features are informative for a certain type of label. For example, the “purely” contextual structure (without raw node features) can provide useful information for C-label prediction if different communities have different internal subgraph structures. Specifically, researchers in the area of computer systems typically work in groups, since their research projects generally carry a heavy workload, while researchers on the theory side tend to work more independently. Then, whether a node belongs to some internal group (like the blue-circle nodes in Fig. 4 left) can be indicative of whether the researcher works in systems or theory. Therefore, DE-GNN has the potential to work better on C-label prediction. For S-label prediction, by definition, DE should always be helpful, while the raw features are not necessarily helpful. Therefore, it is inconclusive whether we need to combine DE with raw features for S-label prediction.
Besides the above analysis, in this work we provide an extensive study of DE-GNNs for node classification. We investigate how DE can be combined with GNNs and with raw features in eight different configurations over eight different datasets that contain either C-labels or S-labels. The configurations cover whether or not to use distance encoding, whether to use raw features, whether to propagate raw features, and the different combinations of these choices. At a high level, we reach the following conclusions, leaving the detailed analysis for later.
DE always significantly improves GNNs for S-label predictions.
Raw features are still informative in those heterophilic networks. However, whether or not allowing them to propagate in GNNs depends on the datasets.
GNNs and DE-GNN for Node Classification
In this section, we summarize the key steps of GNNs and the DE technique Li et al. (2020) for node classification.
Given a graph $G = (V, E)$ with a node feature matrix $X$, where $V$ is the set of nodes and $E$ is the set of edges, denote its adjacency matrix as $A$. Each node $v \in V$ has a feature vector $x_v$.
Homophily ratio $h$ is the fraction of edges in a graph that connect nodes with the same class label (i.e., intra-class edges). Based on the value of $h$, graphs can be classified into two types, namely homophilic ($h \to 1$) and heterophilic ($h \to 0$). The ratio is a quantitative indicator of whether the labels of a graph dataset lean towards the community type or not. Note that a heterophilic graph does not necessarily have structure-type labels, although these two concepts are often consistent in practice.
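The homophily ratio defined above can be computed directly from an edge list and a label assignment. The following is a minimal sketch in plain Python; the function name `homophily_ratio` and the toy graph are illustrative assumptions, not part of the original study.

```python
def homophily_ratio(edges, labels):
    """Fraction of edges whose two endpoints share the same class label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Hypothetical toy graph: a triangle of class 0 attached to a pair of class 1.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}

h = homophily_ratio(edges, labels)
print(h)  # 4 of the 5 edges are intra-class
```

A value close to 1 indicates a homophilic graph, while a value close to 0 indicates a heterophilic one.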
Graph Neural Networks (GNNs)
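The $l$-th layer of an $L$-layer GNN ($l = 1, \dots, L$) can be formulated as
$$\mathbf{h}_v^{(l)} = \mathrm{COMBINE}\left(\mathbf{h}_v^{(l-1)},\ \mathrm{AGGREGATE}\left(\{\mathbf{h}_u^{(l-1)} : u \in \mathcal{N}(v)\}\right)\right),$$
where $\mathbf{h}_v^{(l)}$ is the representation of node $v$ at the $l$-th layer and $\mathcal{N}(v)$ is the neighborhood of $v$.
The function $\mathrm{AGGREGATE}$ determines how to aggregate information from node $v$'s neighborhood (e.g., max or mean pooling), while the function $\mathrm{COMBINE}$ decides how to fuse the representation of node $v$ from the previous layer with the information aggregated from its neighbors $\mathcal{N}(v)$.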
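As a rough illustration of the aggregate-then-fuse layer described above, the sketch below uses mean pooling for aggregation and plain concatenation for fusion. Both choices, and all function names, are assumptions made for illustration; a real layer such as the one in SAGE would additionally apply learnable weights and a non-linearity, which are omitted here.

```python
def mean_aggregate(neighbor_vectors, dim):
    """Aggregation step: element-wise mean of the neighbors' representations."""
    if not neighbor_vectors:
        return [0.0] * dim  # isolated node: nothing to aggregate
    n = len(neighbor_vectors)
    return [sum(vec[i] for vec in neighbor_vectors) / n for i in range(dim)]


def combine(self_vector, aggregated):
    """Fusion step: concatenate the node's previous representation with the aggregate."""
    return self_vector + aggregated


def gnn_layer(adj_list, reps):
    """Apply one (weight-free) message-passing layer to every node."""
    dim = len(next(iter(reps.values())))
    return {
        v: combine(reps[v], mean_aggregate([reps[u] for u in adj_list[v]], dim))
        for v in adj_list
    }


# Hypothetical path graph 0 - 1 - 2 with one-dimensional input features.
adj_list = {0: [1], 1: [0, 2], 2: [1]}
reps = {0: [1.0], 1: [2.0], 2: [3.0]}
out = gnn_layer(adj_list, reps)
print(out)
```

Stacking this function several times widens the receptive field of each node by one hop per layer.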
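We focus on the semi-supervised node classification problem on a simple graph $G$. Given a training set $V_{\mathrm{train}} \subset V$ with known class labels $y_v$ for all $v \in V_{\mathrm{train}}$, and a feature vector $x_v$ for every $v \in V$, we aim to infer the unknown class labels $y_v$ for all $v \in V \setminus V_{\mathrm{train}}$.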
Node Structural Features and Distance Encoding
Summary graph statistics (e.g., node degrees) are commonly used as structural features in many traditional machine learning approaches. Walk-based techniques, such as subgraph sampling and random walks, are also widely applied to extract structural information by exploiting node localities in graphs. In the following, we introduce another general class of structure-related features, termed Distance Encoding (DE) Li et al. (2020). DE is an approach to encode the distance between a node in the graph and a target node set. For this study, we adopt a simplified version of DE by reducing the size of the node set to 1. Suppose we are to classify node $v$; the DE between any node $u$ in the graph and $v$ is formally defined as follows.
Given a target node $v$ for classification and the adjacency matrix $A$, the distance encoding $\zeta(u \mid v)$ for any $u \in V$ is a mapping based on a collection of landing probabilities of random walks from $v$ to $u$, i.e.,
$$\zeta(u \mid v) = f(\ell_{uv}), \qquad \ell_{uv} = \left((W)_{uv}, (W^2)_{uv}, \dots, (W^k)_{uv}, \dots\right),$$
where $W = AD^{-1}$ is the random walk matrix and $D$ is the diagonal degree matrix. $f$ can be a learnable or fixed function.
$\zeta(u \mid v)$ defined above is a general form that covers several graph-related measurements. If we set $f(\ell_{uv})$ to the first non-zero position in $\ell_{uv}$, it gives the shortest-path distance (SPD) from $v$ to $u$, denoted $\zeta_{\mathrm{spd}}(u \mid v)$.
$f$ could also be a feed-forward neural network. In practice, a finite length of $\ell_{uv}$ is sufficient to encode the contextual structure of a node.
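The definition above can be made concrete with a small sketch that computes the sequence of random-walk landing probabilities and, from it, the shortest-path distance. It is written in plain Python with dense lists, assumes the column-normalized random walk matrix W = A D^{-1}, and uses hypothetical function names; it is not the authors' implementation.

```python
def random_walk_matrix(adj):
    """W = A D^{-1}: column j holds the one-step transition probabilities from node j."""
    n = len(adj)
    deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]  # column sums
    return [[adj[i][j] / deg[j] if deg[j] else 0.0 for j in range(n)] for i in range(n)]


def landing_probabilities(adj, target, k_max):
    """probs[u] = [(W^1)_{u,target}, ..., (W^k_max)_{u,target}]."""
    n = len(adj)
    W = random_walk_matrix(adj)
    dist = [1.0 if i == target else 0.0 for i in range(n)]  # walker starts at target
    probs = [[] for _ in range(n)]
    for _ in range(k_max):
        dist = [sum(W[i][j] * dist[j] for j in range(n)) for i in range(n)]
        for u in range(n):
            probs[u].append(dist[u])
    return probs


def spd_from_landing(probs_u, is_target):
    """Shortest-path distance as the first step with non-zero landing probability."""
    if is_target:
        return 0  # the target is at distance 0 from itself by convention
    for k, p in enumerate(probs_u, start=1):
        if p > 0:
            return k
    return float("inf")  # not reached within the truncated walk length


# Hypothetical path graph 0 - 1 - 2 - 3, classifying node 0.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
probs = landing_probabilities(adj, target=0, k_max=3)
spds = [spd_from_landing(probs[u], u == 0) for u in range(4)]
print(probs[3], spds)
```

Note that a truncated walk of length `k_max` can only resolve distances up to `k_max`; nodes farther away are reported as unreachable.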
Compared to other techniques to represent the structural roles of nodes, e.g. motif/graphlet counting Ahmed et al. (2020), DE is computationally efficient. Meanwhile, DE is different from positional node embeddings, such as Node2vec Grover and Leskovec (2016), because it captures the relative distance between a node and the target node. Such a distance is independent of the absolute positions of nodes in graphs, which makes sure that DE-based approaches are inductive to predict the labels spread in the sections of graphs that have not been used for training.
Utilizing Structural Features in Node Classification
In conventional GNNs, the representation at the first layer is initialized with the raw features. Besides raw features, structural features such as DE and node degree can also be used for initialization, by setting the initial node features to the combination of the raw features and the structural features under a chosen combining operation. In reality, especially for S-label prediction, the raw features of neighbors can be irrelevant while structural features are informative. Thus, we also consider GNNs that propagate only structural features first and then concatenate the last-layer representation of a node (which at that point holds only “pure” structural information) with its raw features to make predictions, which removes the effect of irrelevant raw features from neighbors.
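The two placements of raw features discussed above can be contrasted with a short sketch. It assumes concatenation as the combining operation and a weight-free mean propagation as a stand-in for the GNN layers; the names `first_variant` and `last_variant` are hypothetical labels for the two options.

```python
def mean_propagate(adj_list, feats, num_layers):
    """Repeatedly average each node's features with those of its neighbors."""
    for _ in range(num_layers):
        new = {}
        for v, nbrs in adj_list.items():
            group = [feats[v]] + [feats[u] for u in nbrs]
            new[v] = [sum(col) / len(group) for col in zip(*group)]
        feats = new
    return feats


def first_variant(adj_list, raw, struct, num_layers):
    """Raw and structural features are combined at the input and propagated together."""
    init = {v: raw[v] + struct[v] for v in adj_list}
    return mean_propagate(adj_list, init, num_layers)


def last_variant(adj_list, raw, struct, num_layers):
    """Only structural features are propagated; raw features are appended at the end."""
    hidden = mean_propagate(adj_list, struct, num_layers)
    return {v: hidden[v] + raw[v] for v in adj_list}


# Hypothetical path graph 0 - 1 - 2; node degree serves as the structural feature.
adj_list = {0: [1], 1: [0, 2], 2: [1]}
raw = {0: [1.0], 1: [0.0], 2: [0.0]}
struct = {0: [1.0], 1: [2.0], 2: [1.0]}

first = first_variant(adj_list, raw, struct, num_layers=1)
last = last_variant(adj_list, raw, struct, num_layers=1)
print(first[0], last[0])
```

In the second option, each node's raw features reach the classifier unchanged, so raw features of neighbors never mix into its representation.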
Depending on which type(s) of features a model uses (node degree, raw features, and DE features) and where raw features are added (the first or the last layer of the model), we design eight GNN variants for classification tasks in terms of structural roles. In the subsequent discussion, these variants are referred to by their indices. The detailed configuration of each variant is listed in Table 1. Here, the (First) variant with raw node features only (fifth row of Table 1) takes raw features as the initial input without any structural features, which is equivalent to a standard GNN under a typical classification setting. Similarly, the (Last) variant with raw node features only (sixth row of Table 1) propagates neither raw features nor structural features, and thus essentially degenerates to a multilayer perceptron (MLP).
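| Row | Where raw features are added | Features used |
| 1 | (First) | with all features |
| 2 | (Last) | with all features |
| 3 | - | without raw node features |
| 4 | - | node degree only |
| 5 | (First) | raw node features only |
| 6 | (Last) | raw node features only |
| 7 | (First) | without DE features |
| 8 | (Last) | without DE features |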
Datasets & Baselines
We utilize three types of public datasets for node classification tasks, namely, citation networks, Wikipedia network and WebKB. An overview of characteristics for each dataset is summarized in Table 3.
Citation networks include Cora, Citeseer, and Pubmed Sen et al. (2008); Namata et al. (2012), which are standard benchmarks for semi-supervised node classification. In these networks, nodes and edges denote papers and the citation relations between them, respectively. Node features are the embeddings of papers, and node labels are C-labels that indicate the research fields of the papers. Citation networks have a strong tendency toward homophily, with high homophily ratios.
Chameleon and Actor are subgraphs of webpages in Wikipedia Rozemberczki et al. (2019); Tang et al. (2009). Nodes represent webpages about specific topics or individual actors, and edges denote mutual links between pages or co-occurrence on the same page. We use the labels generated by Pei et al.: nodes in Chameleon are classified into five categories based on monthly averaged visits. These node labels are S-labels, as the visiting traffic implies whether a node is a hub or not. Nodes in Actor have five classes in terms of the words that appear on the actor’s Wikipedia page, and it is hard to tell whether they are C-labels or S-labels. Both graphs have low homophily ratios of around 0.2.
Cornell, Texas, and Wisconsin are three subsets of the WebKB dataset, a network map of university web pages collected by CMU. In these networks, each node denotes a web page, which is manually classified into five categories: student, staff, faculty, course and project. The labels indicate the structural roles of nodes in terms of job positions and intended audiences, and thus are S-labels.
For all benchmarks, we follow the same setting of dataset as in Pei et al. (2020) that randomly split nodes of each class into 60%, 20%, and 20% for training, validation and testing. We perform hyper-parameter searching for all models on the validation set (details are included in the appendix).
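A per-class random split of this kind can be sketched as follows. This is a minimal illustration in plain Python with a hypothetical helper `split_per_class` and a fixed seed; it is not the exact splitting code of Pei et al. (2020).

```python
import random
from collections import defaultdict


def split_per_class(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split the nodes of each class into train/validation/test parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, y in labels.items():
        by_class[y].append(node)

    train, val, test = [], [], []
    for nodes in by_class.values():
        nodes = sorted(nodes)  # fixed starting order, so the seed alone decides the split
        rng.shuffle(nodes)
        n_train = round(ratios[0] * len(nodes))
        n_val = round(ratios[1] * len(nodes))
        train += nodes[:n_train]
        val += nodes[n_train:n_train + n_val]
        test += nodes[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical labels: 20 nodes, two balanced classes.
labels = {i: i % 2 for i in range(20)}
train, val, test = split_per_class(labels)
print(len(train), len(val), len(test))
```

Splitting within each class keeps the class proportions roughly equal across the three parts, which matters for small datasets such as the WebKB graphs.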
We choose SAGE as the GNN backbone and construct the eight GNN variants in Table 1. In particular, to evaluate DE variants, we apply both the SPD-based one-hot vector and the sequence of landing probabilities of random walks. Models with these two types of DE are labeled with the suffixes -SPD and -RW, respectively.
Results & Discussion
We report the results in Table 2.
Graphs with C-labels. On homophilic graphs, GNNs with DE provide a level of performance similar to that of other baselines, even though those GCN-based methods are heavily optimized under a strict homophily assumption. As the first three columns of Table 2 show, feeding DE to the model may not bring performance gains over the variants without DE, but it does not hinder performance much either. Note that one may also use DE to control the aggregation procedure inside GNNs, termed DEA-GNN in Li et al. (2020); we did not include it in our current study. However, different types of graph diffusion (e.g., personalized-PageRank diffusion and heat-kernel diffusion), as a special application of DE that changes the way of propagation, can significantly improve node classification over homophilic graphs Klicpera et al. (2019b).
Graphs with S-labels. On heterophilic graphs, we observe consistent improvement in the accuracy of models with structural features (degree features, DE features, or both). Notably, the best performance on each dataset with S-labels is always achieved by one of the models with DE assistance, using either SPD or RW. The results also suggest that raw node features still play a role in the classification of S-labels and that, in this particular case, DE alone might not be sufficient. However, whether raw features should be propagated through the network from the first layer is inconclusive: as the comparisons between the (First) and (Last) variants suggest, it depends on the dataset in practice.
In summary, we can conclude that structural features generalize GNNs on both homophilic and heterophilic graphs for node classification. The above results demonstrate that GNNs with both raw and structural features have the best performance on S-label graphs while maintaining comparable performance on C-label ones. It is a more universal framework for both settings, compared to other models that have an implicit assumption of strong homophily for the dataset. In particular, DE is significantly useful to predict S-labels and naturally makes GNNs fit for heterophilic graphs.
In this work, we revisited the issues and the potential of graph neural networks for semi-supervised node classification. Specifically, we investigated how structural features such as distance encoding work with raw features in this setting, and conducted an extensive empirical study of their individual and joint effectiveness for GNNs in node classification tasks. Our study includes both homophilic and heterophilic graphs with community-type or structure-type labels. The experiments show that structural information is able to generalize GNNs to both types of labels and graph settings. In particular, they demonstrate the uniform effectiveness of applying distance encoding to assist GNNs in predicting structure-type labels.
- Ahmed et al. (2020). Role-based graph embeddings. IEEE Transactions on Knowledge and Data Engineering.
- Battaglia et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
- Chien et al. (2020). Joint adaptive feature smoothing and topology extraction via generalized pagerank gnns. arXiv preprint arXiv:2006.07988.
- Grover and Leskovec (2016). Node2vec: scalable feature learning for networks. In the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
- Hamilton et al. (2017a). Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034.
- Hamilton et al. (2017b). Representation learning on graphs: methods and applications. IEEE Data Engineering Bulletin 40 (3), pp. 52–74.
- Henderson et al. (2012). Rolx: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1231–1239.
- Huang et al. (2020). Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993.
- Kingma and Ba (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- Kipf and Welling (2016). Variational graph auto-encoders. NeurIPS Bayesian Deep Learning Workshop.
- Kipf and Welling (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
- Klicpera et al. (2019a). Predict then propagate: graph neural networks meet personalized pagerank. In International Conference on Learning Representations.
- Klicpera et al. (2019b). Diffusion improves graph learning. In Advances in Neural Information Processing Systems, pp. 13333–13345.
- Li et al. (2019). Optimizing generalized pagerank methods for seed-expansion community detection. In Advances in Neural Information Processing Systems, pp. 11705–11716.
- Li et al. (2020). Distance encoding–design provably more powerful gnns for structural representation learning. arXiv preprint arXiv:2009.00142.
- Liu et al. (2020). Non-local graph neural networks. arXiv preprint arXiv:2005.14612.
- Namata et al. (2012). Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs.
- Pei et al. (2020). Geom-gcn: geometric graph convolutional networks. In International Conference on Learning Representations.
- Ribeiro et al. (2017). Struc2vec: learning node representations from structural identity. In the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 385–394.
- Rozemberczki et al. (2019). Multi-scale attributed node embedding. arXiv preprint arXiv:1909.13021.
- Scarselli et al. (2008). The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
- Sen et al. (2008). Collective classification in network data. AI magazine 29 (3), pp. 93–93.
- Srinivasan and Ribeiro. On the equivalence between node embeddings and structural graph representations. In International Conference on Learning Representations.
- Tang et al. (2009). Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 807–816.
- Veličković et al. (2018). Graph attention networks. In International Conference on Learning Representations.
- Wu et al. (2019). Simplifying graph convolutional networks. In International Conference on Machine Learning, pp. 6861–6871.
- Zhang and Chen (2018). Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175.
- Zhu et al. (2020). Generalizing graph neural networks beyond homophily. arXiv preprint arXiv:2006.11468.
For a fair comparison, the search space is set to the same size for each model, covering the number of hidden units, initial learning rate, dropout rate, and weight decay, as listed below.
For our SAGE benchmark, the number of hidden layers is set to 1, except for Cora, Citeseer and Actor, which need 2 or 3 layers; the number of hidden units is 32 (Cora, Pubmed and Chameleon), 64 (Citeseer, WebKB), or 256 (Actor). The final hyper-parameter setting for training is: initial learning rate lr=1e-4 with early stopping (patience of 50 epochs); weight decay l2=1e-6 for Cora, Pubmed and Chameleon; dropout p=0.2 for Citeseer, Actor and WebKB, and p=0.4 for Cora and Chameleon. We use Adam Kingma and Ba (2015) as the default optimizer.
The source code is publicly available on GitHub [repo]. Compared to the official PyTorch implementation of DE-GNN, our version introduces the following new features:
the framework is able to repeatedly train instances of models without regenerating graph samples;
the procedure of subgraph extraction and the number of GNN layers are decoupled. The model now can run on subgraphs with arbitrary number of hops, but without binding the extraction hops to the total number of layer propagation in GNNs;
sparse matrices are utilized to accelerate the computation of DE, especially for random-walk based features.