New Benchmarks for Learning on Non-Homophilous Graphs

04/03/2021 · by Derek Lim et al. · Cornell University

Much data with graph structure satisfies the principle of homophily, meaning that connected nodes tend to be similar with respect to a specific attribute. As such, ubiquitous datasets for graph machine learning tasks have generally been highly homophilous, rewarding methods that leverage homophily as an inductive bias. Recent work has pointed out this particular focus, as new non-homophilous datasets have been introduced and graph representation learning models better suited for low-homophily settings have been developed. However, these datasets are small and poorly suited to truly testing the effectiveness of new methods in non-homophilous settings. We present a series of improved graph datasets with node label relationships that do not satisfy the homophily principle. Along with this, we introduce a new measure of the presence or absence of homophily that is better suited than existing measures in different regimes. We benchmark a range of simple methods and graph neural networks across our proposed datasets, drawing new insights for further research. Data and code can be found at




1. Introduction

Various types of real-world data have natural graph structures, in which objects (nodes) and relationships (edges) are encoded as a graph. As a result, numerous types of models have been developed for machine learning tasks on graph data. Much recent work in graph representation learning (Hamilton, 2020) has focused on using both node features and graph structure, especially in the general family of graph neural network (GNN) models.

To generate predictions, graph learning methods can leverage complex inductive biases relating to the topology of the graph (Battaglia et al., 2018). One common property of real-world graphs that is often leveraged as a strong inductive bias is homophily, which means that connected nodes tend to be similar in certain attributes (McPherson et al., 2001; Altenburger and Ugander, 2018). Here, we refer to a graph as homophilous if connected nodes are much more likely to have the same class label than if edges were independent of labels. In a non-homophilous graph, connected nodes are not significantly more likely to have the same class label, and may even be less likely to have the same class label than if edges were independent of labels. For instance, nodes may have little to no preference for connecting to any particular class in aggregate (e.g. in friendship networks where the classes represent gender (Laniado et al., 2016; Altenburger and Ugander, 2018)); or, nodes of a certain class may only be connected to a subset of the other classes (e.g. in citation networks where class labels are the year of publication of a paper).

Many GNNs explicitly assume homophily in their construction, and even those models that do not may perform poorly in non-homophilous graph settings (Zhu et al., 2020b; Jia and Benson, 2020). While some GNNs have been developed that work better in non-homophilous settings, their evaluation is often limited to a few graph datasets introduced by Pei et al. (2019) that have certain undesirable properties such as small size, narrow range of application areas, and synthetic classes (Zhu et al., 2020b; Liu et al., 2020; Zhu et al., 2020a; Chien et al., 2021; Chen et al., 2020; Yan et al., 2021).

In this work, we summarize issues with the non-homophilous graph datasets that are used in prior work and propose improved non-homophilous graph datasets that are substantially larger, span wider application areas, and capture different types of complex label-topology relationships. As previously-used homophily metrics have various shortcomings, we introduce a new metric that detects the presence or absence of homophily in graphs, and use it to analyze these graph datasets. We reintroduce strong simple methods for learning on graphs that have been overlooked in recent non-homophilous GNN work. Finally, we benchmark these simple methods along with general GNNs and non-homophilous GNNs on our proposed datasets. Our stronger empirical evaluation setup allows for a better understanding of graph learning methods in diverse settings.

2. Prior Work

2.1. Graph Representation Learning

Graph neural networks (Hamilton et al., 2017; Kipf and Welling, 2017; Veličković et al., 2018) have demonstrated their utility on a variety of graph machine learning tasks. Most GNNs are constructed by stacking graph neural network layers that propagate transformed node features and then aggregate them via different mechanisms. The neighborhood aggregation used in many existing GNNs implicitly leverages homophily, so they often fail to generalize on non-homophilous graphs (Zhu et al., 2020b; Balcilar et al., 2021). Indeed, a wide range of GNNs operate as low-pass graph filters (Nt and Maehara, 2019; Wu et al., 2019; Balcilar et al., 2021) that smooth features over the graph topology, which produces similar representations and thus similar predictions for neighboring nodes.

Due in part to the most common graph learning benchmarks exhibiting strong homophily, various graph representation learning methods have been developed that explicitly make use of an assumption of homophily in the data (Wu et al., 2019; Huang et al., 2021; Deng et al., 2020; Klicpera et al., 2019b; Bojchevski et al., 2020). By leveraging this assumption, several simple, inexpensive models are able to achieve state-of-the-art performance on homophilic datasets (Wu et al., 2019; Huang et al., 2021).
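For instance, SGC removes nonlinearities so that a GCN collapses into fixed feature propagation followed by a linear classifier. Below is a minimal dense-numpy sketch of the propagation step (the function name and dense representation are ours; the released implementation uses sparse operations):

```python
import numpy as np

def sgc_features(A, X, k=2):
    """Precompute SGC features: k steps of propagation with the symmetrically
    normalized adjacency matrix (with self-loops). Any linear classifier can
    then be trained on the returned features.
    A: (n, n) dense adjacency matrix, X: (n, d) node features."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                  # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    for _ in range(k):
        X = S @ X                          # repeated feature smoothing
    return X
```

Because the propagation smooths features over neighborhoods, the resulting linear model is cheap but bakes in a homophily assumption.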

2.2. Non-Homophilous Methods

Various GNNs have been proposed to achieve higher performance in low-homophily settings. For instance, Geom-GCN (Pei et al., 2019) introduces a geometric aggregation scheme, MixHop (Abu-El-Haija et al., 2019) uses a graph convolutional layer that mixes powers of the adjacency matrix, GPR-GNN (Chien et al., 2021) uses learnable weights that can be positive or negative in feature propagation, and H2GCN (Zhu et al., 2020b) shows that separating ego and neighbor embeddings, aggregating over higher-order neighborhoods, and combining intermediate representations improve GNN performance in low-homophily settings. Various methods that depend only on graph topology, based on label propagation or supervised learning models, have also been proposed for non-homophilous settings (Peel, 2017; Altenburger and Ugander, 2018; Chin et al., 2019; Zheleva and Getoor, 2009). Several design decisions recur across these methods and appear to strengthen performance in non-homophilous settings: using higher-order neighborhoods, decoupling neighbor information from ego information, and combining graph information at different scales.

2.3. Real-world Datasets

Recently, the Open Graph Benchmark (Hu et al., 2020) has provided a series of datasets and leaderboards that improve the quality of evaluation in graph representation learning; however, most of the node classification datasets tend to be homophilous, as noted in past work (Zhu et al., 2020b) and expanded upon in Appendix A.1. A comparable set of high-quality benchmarks to evaluate non-homophilous methods does not currently exist.

The most widely used datasets for evaluating non-homophilous graph representation learning methods were presented by Pei et al. (2019) (see our Appendix Table 4); however, these datasets have fundamental issues. First, they are very small: the Cornell, Texas, and Wisconsin datasets have between 180 and 250 nodes, and the largest of these datasets has 7,600 nodes. In analogy to certain pitfalls of graph neural network evaluation on small (homophilic) datasets discussed in (Shchur et al., 2018), evaluation on the datasets of Pei et al. (2019) is plagued by high variance across different train/test splits (see results in (Zhu et al., 2020b)). Also, the small size of these datasets may make models more prone to overfitting (Dwivedi et al., 2020), and does not allow for experiments on the scalability of GNNs designed for non-homophilous settings. As a result, they are not satisfactory for evaluating the performance of node classification models for non-homophilous graphs, and larger, real-world datasets are necessary.

Peel (2017) also studies node classification on network datasets with various types of relationships between edges and labels. However, they only study methods that act on node labels, and thus their datasets do not necessarily have node features. We take inspiration from their work, by testing on Pokec and Facebook networks with node features that we define, and by introducing other year-prediction tasks on citation networks that have node features.

2.4. Synthetic Data

Synthetically generated or synthetically modified non-homophilous graph data may also be used to evaluate graph machine learning methods, and past works have taken various approaches to this (Karimi et al., 2018; Abu-El-Haija et al., 2019; Zhu et al., 2020b; Kim and Oh, 2021; Chien et al., 2021). When generating graphs, they control homophily through modifications of generative graph models (Barabási and Albert, 1999; Abbe, 2017; Deshpande et al., 2018), and they tend to either sample node feature vectors from a real graph or draw them from multivariate Gaussian distributions. While synthetic data are useful for investigating properties of models in controlled settings, we focus on benchmarks from real-world data that exhibit diverse types of complex structure.

3. Datasets and Metrics

3.1. Measuring Homophily

Various metrics have been proposed to measure the homophily of a graph in a single scalar quantity. However, these metrics are sensitive to the number of classes and the number of nodes in each class. Thus, we introduce a metric that better captures the presence or absence of homophily. Our metric does not distinguish between different non-homophilous settings (such as heterophily or independent edges); we argue that there are too many degrees of freedom in non-homophilous settings for a single scalar quantity to be able to distinguish them all.

Let $G = (V, E)$ be a graph with $n$ nodes, none of which are isolated. Further let each node $v$ have a class label $y_v \in \{0, \ldots, C-1\}$ for some number of classes $C$, and denote by $C_k$ the set of nodes in class $k$. In recent non-homophilous graph representation learning work, the edge homophily (Zhu et al., 2020b) has been defined as the proportion of edges that connect two nodes of the same class:

$$h = \frac{|\{(u, v) \in E : y_u = y_v\}|}{|E|}. \qquad (1)$$

Another related measure is what we call the node homophily (Pei et al., 2019), defined as $\frac{1}{n} \sum_{v \in V} d_v^{(y_v)} / d_v$, in which $d_v$ is the number of neighbors of node $v$, and $d_v^{(y_v)}$ is the number of neighbors of $v$ that have the same class label. We focus on the edge homophily (1) in this work, but find that node homophily tends to have similar qualitative behavior in experiments.
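Both measures can be computed directly from an edge list; here is a minimal numpy sketch, assuming an undirected graph with each edge listed once (function names are ours):

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges that join two nodes of the same class."""
    edges, labels = np.asarray(edges), np.asarray(labels)
    return np.mean(labels[edges[:, 0]] == labels[edges[:, 1]])

def node_homophily(edges, labels, n):
    """Average over non-isolated nodes of the fraction of same-class neighbors."""
    labels = np.asarray(labels)
    same = np.zeros(n)
    deg = np.zeros(n)
    for u, v in edges:                     # count both endpoints (undirected)
        match = labels[u] == labels[v]
        same[u] += match; same[v] += match
        deg[u] += 1; deg[v] += 1
    return np.mean(same[deg > 0] / deg[deg > 0])
```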

The sensitivity of edge homophily to the number of classes and the size of each class limits its utility. Note that if edges were rewired randomly, independently of node labels, a node in class $k$ would be expected to have a proportion $|C_k|/n$ of same-class neighbors (Altenburger and Ugander, 2018). For a dataset with $C$ balanced classes, we would thus expect the edge homophily to be around $1/C$, so the interpretation of the measure depends on the number of classes. Also, if classes are imbalanced, then the edge homophily may be misleadingly large. For instance, if 99% of nodes were of one class, then most edges would likely be within that same class, so the edge homophily would be high.

We define a homophily measure that alleviates these shortcomings. Our measure is given as:

$$\hat{h} = \frac{1}{C-1} \sum_{k=0}^{C-1} \left[ h_k - \frac{|C_k|}{n} \right]_+, \qquad (2)$$

where $[a]_+ = \max(a, 0)$, and $h_k$ is the class-wise homophily metric

$$h_k = \frac{\sum_{v \in C_k} d_v^{(y_v)}}{\sum_{v \in C_k} d_v}. \qquad (3)$$

Note that $\hat{h} \in [0, 1]$, with a fully homophilous graph (in which every node is only connected to nodes of the same class) having $\hat{h} = 1$. Since each class-wise homophily metric $h_k$ only contributes positive deviations from the null expected proportion $|C_k|/n$, the class-imbalance problem is substantially mitigated. Also, graphs in which edges are independent of node labels are expected to have $\hat{h} \approx 0$, for any number of classes. Our measure detects the presence of homophily, but does not distinguish between the many types of possibly non-homophilous relationships. This is reasonable given the diversity of non-homophilous relationships: non-homophily can arise from independence of edges and classes, extreme heterophily, connections only among subsets of classes, or certain chemically or biologically determined relationships. Indeed, these relationships are very different, and are better captured by more than one scalar quantity, such as the compatibility matrices that are discussed in Appendix A. Figure 1 compares our measure $\hat{h}$ with the edge homophily ratio $h$.
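The measure can be computed from an edge list in a few lines; here is a sketch assuming an undirected graph with each edge listed once (function name ours):

```python
import numpy as np

def our_measure(edges, labels):
    """Class-insensitive homophily measure: the average positive deviation of
    each class's homophily h_k from its null proportion |C_k|/n."""
    labels = np.asarray(labels)
    n = len(labels)
    classes = np.unique(labels)
    C = len(classes)
    # degrees and same-class degrees, counting both endpoints of each edge
    deg = np.zeros(n); same = np.zeros(n)
    for u, v in edges:
        m = labels[u] == labels[v]
        same[u] += m; same[v] += m
        deg[u] += 1; deg[v] += 1
    total = 0.0
    for k in classes:
        mask = labels == k
        h_k = same[mask].sum() / max(deg[mask].sum(), 1)  # class-wise homophily
        total += max(h_k - mask.sum() / n, 0.0)           # positive deviations only
    return total / (C - 1)
```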

On certain datasets where previous measures are misleading, our measure shows its advantages. For example, some of our proposed datasets are class-imbalanced (e.g. YelpChi and ogbn-proteins), so they have high edge homophily, but our measure captures that they are indeed non-homophilous (Table 1). Moreover, as seen in Appendix Table 4, the edge homophily of Chameleon, Actor, and Squirrel are approximately the same, but the graph structures (Appendix Figure 5) and performance of different methods on these datasets vary significantly (Zhu et al., 2020b). According to our measure, Chameleon is more homophilous than Squirrel, which is in turn more homophilous than Actor, and past work has shown that models tend to perform better on Chameleon than Squirrel and better on Squirrel than Actor (Zhu et al., 2020b; Chien et al., 2021; Pei et al., 2019). Our measure suggests one possible axis of variation that may help explain this divergence, but we emphasize that there are many possible confounders. Further discussion is given in Appendix A.

Dataset # G/T # Nodes # Edges # Node Feat. # C Class types Edge hom. $h$ $\hat{h}$ (ours)
Twitch-explicit 7 1,912 - 9,498 31,299 - 153,138 2,545 2 explicit content .556 - .632 .049-.146
YelpChi 1 45,954 3,846,979 32 2 fake reviews .773 .052
deezer-europe 1 28,281 92,752 31,241 2 gender .525 .030
FB100 100 769-41,536 16,656-1,590,655 5 2 gender .434 - .917 .000 - .246
Pokec 1 1,632,803 30,622,564 65 2 gender .445 .000
ogbn-proteins 112 132,534 39,561,252 8 2 function .623 - .940 .090 - .260
arXiv-year 1 169,343 1,166,243 128 5 pub year .222 .272
snap-patents 1 2,923,922 13,975,788 269 5 time granted .073 .100
Table 1. Statistics of our proposed non-homophilous graph datasets. # C is the number of distinct node classes. # G/T is the number of different graphs or tasks. When there are multiple graphs or tasks, the minimum and maximum statistics are listed, with a hyphen “-” in between.
Figure 1. Examples of graphs with different label-topology relationships and comparison of our measure $\hat{h}$ with the edge homophily ratio $h$. The node classes are labeled by color. Pink edges link nodes of the same class, while purple edges link nodes of different classes. (a,b) Pure homophily and pure heterophily. Both measures equal 1 under pure homophily and 0 under pure heterophily. (c,d) Graphs where each node is connected to one member of each class. Edge homophily depends on the number of classes, while our measure does not. (e,f) Random Erdős-Rényi graphs in which edges are independent of labels. Edge homophily is sensitive to class imbalance, while our measure is not.

3.2. Proposed Datasets

Here, we detail the non-homophilous datasets that we propose for graph machine learning evaluation. These datasets come from a variety of contexts, and some have been used for evaluation of graph machine learning models in past work; in certain cases, we make adjustments such as modifying node labels and adding node features. We define node features for Facebook100, Pokec, and snap-patents, while we redefine node labels for arXiv-year and snap-patents. Basic dataset statistics are given in Table 1. The datasets are as follows:

Twitch-explicit contains 7 networks where Twitch users are nodes, and mutual friendships between them are edges (Rozemberczki et al., 2019). Node features are games liked, location, and streaming habits. Each graph is associated with users of a particular region. The class labels denote whether a streamer uses explicit language. We solely train and test on Twitch-DE, which has 9,498 nodes, 76,569 edges, edge homophily of .632, and $\hat{h}$ of .142.

YelpChi (Mukherjee et al., 2013) is a graph in which the nodes are reviews for hotels and restaurants in the Chicago area, and the class labels are fraudulent reviews and recommended reviews. The 32 node features are taken from (Rayana and Akoglu, 2015). We take the topology from (Dou et al., 2020); while there are different relations captured by edges, we treat them all as the same single edge type.

deezer-europe (Rozemberczki and Sarkar, 2020) is a social network of users on Deezer from European countries, where edges represent mutual follower relationships. The node features are based on artists liked by each user. Nodes are labeled with reported gender.

Facebook100 (Traud et al., 2012) consists of 100 Facebook friendship network snapshots from 2005, each of which has as nodes the Facebook users from a given American university. Each node is labeled with the reported gender of the user. The node features are major, second major/minor, dorm/house, year, and high school. We solely train and test on Penn94, which has 41,554 nodes, 1,362,229 edges, edge homophily of .470, and $\hat{h}$ of .046.

Pokec is the friendship graph of a Slovak online social network, where nodes are users and edges are directed friendship relations (Leskovec and Krevl, 2014). Nodes are labeled with reported gender. The node features are derived from profile information, such as geographical region, registration time, and age.

ogbn-proteins (Hu et al., 2020) has proteins as nodes and different biological relationships between proteins as edges. There are 112 tasks, in which each protein is given a binary label. There are no separate node features; instead, the means of the incoming edge features give 8-dimensional node features.

arXiv-year (Hu et al., 2020) is the ogbn-arxiv network with class labels given by the year that the paper is posted, instead of by subject areas. The nodes are arXiv papers, and directed edges connect a paper to other papers that it cites. The node features are averaged word2vec token features of both the title and abstract of the paper. The five classes are chosen by partitioning the posting dates so that class ratios are approximately balanced: 2013 and prior, 2014-2015, 2016-2017, 2018, and 2019-2020.

snap-patents (Leskovec et al., 2005; Leskovec and Krevl, 2014) is a dataset of utility patents granted between 1963 and 1999 in the US. Each node is a patent, and edges connect patents that cite each other. Node features are derived from patent metadata. The task is to predict the time at which a patent was granted. The five classes are: 1971 and prior, 1972-1980, 1981-1988, 1989-1994, and 1995-1999.
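The year-to-class partition used for snap-patents (and analogously arXiv-year) amounts to a simple binning step; here is a sketch using the cutoffs listed above (helper name ours):

```python
import numpy as np

def years_to_classes(years, cutoffs):
    """Map raw years to class labels using right-inclusive cutoffs.
    E.g. the snap-patents classes use cutoffs [1971, 1980, 1988, 1994]:
    <=1971 -> 0, 1972-1980 -> 1, ..., 1995-1999 -> 4."""
    return np.searchsorted(cutoffs, years, side='left')
```

In practice, the cutoffs would be chosen from the empirical year distribution so that the resulting class sizes are approximately balanced.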

3.3. General Non-homophilous Settings

Different settings in which non-homophilous relationships are prevalent have been identified in the literature and are represented by our proposed datasets:

  • Gender relations in social or interaction networks (Altenburger and Ugander, 2018; Chin et al., 2019; Jia and Benson, 2020) (deezer, FB100, Pokec).

  • Biological structures such as in food webs (Gatterbauer, 2014) and protein interactions (Newman, 2003) (ogbn-proteins).

  • Technological and internet relationships, such as in web page connections (Newman, 2003; Pei et al., 2019).

  • Malicious or fraudulent nodes, such as in auction networks (Chau et al., 2006; Pandit et al., 2007) (YelpChi).

  • Publication time in citation networks (Peel, 2017) (arXiv-year, snap-patents).

While not all example graph data from these contexts are non-homophilous, a diverse range are. In order to succeed in future applications in these contexts, it may be important to develop methods that are able to handle non-homophilous structures.

4. Experiments

Twitch-DE YelpChi deezer Penn94 (FB100) pokec ogbn-proteins arXiv-year snap-patents
L Prop (1 hop)
L Prop (2 hop)
SGC (1 hop)
SGC (2 hop)
C&S (1 hop)
C&S (2 hop)
GAT (M) (M) (M) (M)
GAT+JK (M) (M) (M) (M)
H2GCN (M) (M) (M) (M)
MixHop (M)
Table 2. Experimental results. Test accuracy is displayed for most datasets, while Twitch-DE, YelpChi, and ogbn-proteins display test ROC-AUC. Standard deviations are over 5 train/val/test splits, except for ogbn-proteins, which has a fixed split. The three best results per dataset are highlighted. (M) denotes that some (or all) hyperparameter settings ran out of memory.

4.1. Experimental Setup

We include both graph-agnostic and node-feature-agnostic methods as simple baselines; the node-feature-agnostic models of two-hop label propagation (Peel, 2017) and LINK (logistic regression on the adjacency matrix) (Zheleva and Getoor, 2009) have been found to perform well in various non-homophilous settings, but have often been overlooked by recent graph representation learning work. Also, we include SGC (Wu et al., 2019) and C&S (Huang et al., 2021) as simple methods that perform well on homophilic datasets, along with a two-hop propagation variant of C&S in analogy with two-step label propagation. In addition to representative general GNNs, we also include three GNNs recently proposed for non-homophilous settings. The full list of methods is:

  • Models that only use node features: MLP (Goodfellow et al., 2016).

  • Models that only use the graph topology: label propagation (standard and two-hop) (Zhou et al., 2004; Peel, 2017), LINK (Zheleva and Getoor, 2009).

  • Simple methods: SGC (Wu et al., 2019), C&S (Huang et al., 2021), two-hop variants.

  • General GNNs: GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), jumping knowledge networks (GCN+JK, GAT+JK) (Xu et al., 2018), and APPNP (Klicpera et al., 2019a).

  • Non-homophilous GNNs: H2GCN (Zhu et al., 2020b), MixHop (Abu-El-Haija et al., 2019), and GPR-GNN (Chien et al., 2021).
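Of the topology-only baselines above, LINK is particularly simple: logistic regression where each node's feature vector is its row of the adjacency matrix. Below is a minimal numpy sketch for binary labels (function name, optimizer, and hyperparameters are our own illustrative choices, not the original implementation):

```python
import numpy as np

def link_predict(A, labels, train_idx, epochs=200, lr=0.5):
    """LINK baseline: binary logistic regression on rows of the adjacency
    matrix A, trained with plain gradient descent."""
    n = A.shape[0]
    w = np.zeros(n); b = 0.0
    X, y = A[train_idx], labels[train_idx].astype(float)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad = p - y                             # dL/dlogit for cross-entropy
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return (1.0 / (1.0 + np.exp(-(A @ w + b))) > 0.5).astype(int)
```

Training on raw adjacency rows lets the model weight each potential neighbor separately, which is one way it can pick up non-homophilous signal without node features.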

Following other works in non-homophilous graph learning evaluation, we take a high proportion of training nodes (Zhu et al., 2020b; Pei et al., 2019; Yan et al., 2021); we run each method on the same five random 50/25/25 train/val/test splits for each dataset, besides ogbn-proteins, for which we use the original Open Graph Benchmark splits (Hu et al., 2020). All methods requiring gradient-based optimization are run for 500 epochs, with test performance reported for the learned parameters of highest validation performance. We use ROC-AUC as the metric for the class-imbalanced Twitch-DE (60.5% of nodes in majority class), YelpChi (85.5% of nodes in majority class), and ogbn-proteins datasets (52.7% to 98.0% in majority, depending on task). For other datasets, we use classification accuracy as the metric. Further experimental details can be found in Appendix B.
4.2. Experimental Results

Table 2 lists the results of each method across the datasets that we propose. Our new measure and new datasets reveal several important properties of non-homophilous node classification. Firstly, both methods that only use node features and methods that only use graph topology appear to perform better than random, thus demonstrating the quality of our datasets. Secondly, the stability of performance across runs is better for our datasets than those of Pei et al. (2019) (see (Zhu et al., 2020b) results). Moreover, as suggested by prior theory and experiments (Zhu et al., 2020b; Abu-El-Haija et al., 2019; Chien et al., 2021), the non-homophilous GNNs usually do well — though not necessarily on every dataset.

The core assumption of homophily in SGC and C&S that enables them to be simple and efficient does not hold on these non-homophilous datasets, and thus the performance of these methods is typically relatively low. Nevertheless, 2-hop C&S still achieves high performance on some datasets. Indeed, our results indicate that simple 2-hop modifications of learning methods often improve performance in low-homophily, though we note that 1-hop label propagation performs much better than expected on ogbn-proteins, possibly due to some implementation nuance. LINK, a frequently ignored baseline that in some sense acts on two-hop neighborhoods of each node (Altenburger and Ugander, 2018), performs well on many datasets — despite not using node feature information.
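The two-hop idea above can be illustrated with a label-propagation sketch that diffuses clamped one-hot training labels over a row-normalized two-hop adjacency (a simplified illustration under our own design choices, not the exact method of Peel (2017)):

```python
import numpy as np

def two_hop_label_prop(A, labels, train_idx, n_classes, alpha=0.9, iters=50):
    """Label propagation over two-hop neighborhoods: diffuse one-hot training
    labels with a row-normalized (A @ A) operator instead of A."""
    A2 = (A @ A > 0).astype(float)
    np.fill_diagonal(A2, 0.0)                      # drop trivial self 2-hops
    P = A2 / np.maximum(A2.sum(axis=1, keepdims=True), 1)
    Y0 = np.zeros((A.shape[0], n_classes))
    Y0[train_idx, labels[train_idx]] = 1.0         # clamp known labels
    Y = Y0.copy()
    for _ in range(iters):
        Y = alpha * P @ Y + (1 - alpha) * Y0
    return Y.argmax(axis=1)
```

On a heterophilous graph such as a bipartite graph, one-hop propagation pushes the wrong labels onto neighbors, while the two-hop operator connects nodes of the same class.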

Finally, one consequence of using larger datasets for benchmarks is that the tradeoff between scalability and learning performance of non-homophilous methods has become starker, with some methods facing memory issues. This tradeoff is especially important to consider in light of the fact that many scalable graph learning methods rely on implicit or explicit homophily assumptions (Wu et al., 2019; Huang et al., 2021; Deng et al., 2020; Bojchevski et al., 2020), and thus face issues when used in non-homophilous settings.

5. Conclusion

In this paper, we introduce a measure of the presence of homophily that alleviates issues with existing measures, propose new, high-quality non-homophilous graph learning datasets, and benchmark simple baselines and representative graph representation learning methods across our datasets. We hope that these contributions will provide researchers of non-homophilous graph learning methods with better tools to test their models and evaluate the utility of new techniques. While we benchmark on transductive node classification, the datasets we propose could be adapted to benchmark link prediction, clustering tasks, and inductive node classification in the case of Twitch-explicit and Facebook100. Future work could study these other tasks in low-homophily settings, reformulate current understandings of homophily that are most natural in node classification, and introduce new benchmarks for a wider range of applications.


We thank Austin Benson and Horace He for insightful discussions. We also thank the rest of Cornell University Artificial Intelligence for their support and discussion. This research was supported by Facebook AI.


Appendix A Compatibility Matrices and Statistics

Figure 2. Compatibility matrices of our proposed datasets. These datasets from a variety of different contexts exhibit a wide range of non-homophilous structures. We choose 2 of the 112 ogbn-proteins tasks to display.

Following previous work (Zhu et al., 2020b), for a graph with $C$ node classes we define the compatibility matrix $H \in \mathbb{R}^{C \times C}$ by

$$H_{k, l} = \frac{|\{(u, v) \in E : y_u = k,\; y_v = l\}|}{|\{(u, v) \in E : y_u = k\}|}.$$

This captures finer details of label-topology relationships in graphs than single scalar metrics (like edge homophily and our $\hat{h}$) capture. For classes $k$ and $l$, the entry $H_{k, l}$ measures the proportion of edges from nodes of class $k$ that are connected to nodes of class $l$. Compatibility matrices for our proposed datasets are shown in Figure 2. As evidenced by the different patterns, the proposed datasets show interesting types of label-topology relationships besides homophily. For instance, the citation datasets arXiv-year and snap-patents have primarily lower-triangular structure, since most citations reference past work. The Penn94 matrix is mostly uniform, so there is little to no gender preference in aggregate, though past work has shown other useful signals in the social structure of gender (Altenburger and Ugander, 2018). In Twitch-DE, while streamers that use explicit content (class 1) often connect to other streamers of class 1, streamers that do not use explicit content (class 0) also often connect to streamers of class 1, thus giving an overall non-homophilous structure. In Pokec, there is some heterophily, in that one gender has some preference for friends of the other gender.

a.1. Homophilous Data Statistics

Dataset # Nodes # Edges # C Edge hom. $h$ $\hat{h}$ (ours)
Cora 2,708 5,278 7 .81 .766
Citeseer 3,327 4,552 6 .74 .627
Pubmed 19,717 44,324 3 .80 .664
ogbn-arXiv 169,343 1,166,243 40 .66 .416
ogbn-products 2,449,029 61,859,140 47 .81 .459
oeis 226,282 761,687 5 .50 .532
Table 3. Statistics for homophilic graph datasets. # C is the number of node classes.

In contrast to the different compatibility matrix structures of our proposed non-homophilous datasets, much other graph data have primarily homophilous relationships, as can be seen in Figure 4 and Table 3. The Cora, CiteSeer, PubMed, ogbn-arxiv, and ogbn-products datasets are widely used as benchmarks for node classification (Yang et al., 2016; Hu et al., 2020), and are highly homophilous, as can be seen by the diagonally dominant structure of the compatibility matrices and by the high edge homophily and $\hat{h}$.

We collected the oeis dataset displayed in the bottom right of Figure 4. The nodes are entries in the Online Encyclopedia of Integer Sequences (Sloane, 2007), and directed edges link an entry to any other entry that it cites. In analogy to arXiv-year and snap-patents, the node labels are the time of posting of the sequence. However, in this case the graph relationships are homophilous, even as we vary the number of distinct classes (time periods). This is in part due to differences between posting in this online encyclopedia and publication of academic papers or patents. For instance, there is less overhead to posting an entry in the OEIS, so users often post separate related entries and variants of these entries in rapid succession. Also, an entry in the encyclopedia often inspires other people to work on similar entries, which can be created in much less time than an academic follow-up work to a given paper. These related entries tend to cite each other, which contributes to homophilic relationships over time. Thus, the data here does not follow the special temporal citation structure of academic publications and patents.

Figure 3. Comparison of edge homophily and our measure on random class-imbalanced graph data with edges independent of node labels. Three standard deviations are shaded. Our measure is mostly constant as the classes become more imbalanced, while edge homophily increases.

a.2. Previous Non-Homophilous Data

Dataset # Nodes # Edges # Node Feat. # C Context Edge hom. $h$ $\hat{h}$ (ours)
Chameleon 2,277 36,101 2,325 5 Wiki pages .23 .062
Cornell 183 295 1,703 5 Web pages .30 .047
Actor 7,600 29,926 931 5 Actors in movies .22 .011
Squirrel 5,201 216,933 2,089 5 Wiki pages .22 .025
Texas 183 309 1,703 5 Web pages .11 .001
Wisconsin 251 499 1,703 5 Web pages .21 .094
Table 4. Statistics for datasets from Pei et al. (2019). #C is the number of node classes.

For the six datasets in Pei et al. (2019) often used in evaluation of graph representation learning methods in non-homophilous regimes (Zhu et al., 2020b), basic statistics are listed in Table 4 and compatibility matrices are displayed in Figure 5. We propose many more datasets (including the 100 graphs from Facebook100) that have up to orders of magnitude more nodes and edges and come from a wider range of contexts. There are several cases of class-imbalance in these datasets, which may make the edge homophily misleading. As discussed in Section 3.1, our measure may be able to alleviate issues with edge homophily in measuring homophily of these datasets, and offers a way to distinguish between the Chameleon, Actor, and Squirrel datasets that all have similar edge homophily.

a.3. Class-Imbalance and Metrics

In this section, we present an experiment demonstrating a setting in which our metric is unaffected by imbalanced classes, while edge homophily is not. We generate graphs in which node labels are independent of edges: node labels are chosen at random, and edges are generated by the Erdős–Rényi random graph model (Erdős and Rényi, 1960). In particular, we fix the number of classes to two, the number of nodes to 100, and the probability of edge formation between every pair of nodes to .25. We then generate 100 samples of these random graphs and compute the mean and standard deviation of both edge homophily and our measure. As seen in Figure 3, our measure remains near zero as we increase the size of the majority class, while edge homophily increases with the size of the majority class.
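This behavior can be reproduced with a short simulation. The sketch below is an illustrative reimplementation (not the paper's code) that generates Erdős–Rényi graphs with label-independent edges and shows edge homophily tracking the majority-class fraction even though edges carry no label information:

```python
import numpy as np

def edge_homophily(adj, labels):
    """Fraction of edges whose two endpoints share a class label."""
    src, dst = np.nonzero(np.triu(adj, 1))  # each undirected edge once
    return float(np.mean(labels[src] == labels[dst]))

rng = np.random.default_rng(0)
n, p_edge = 100, 0.25  # settings used in the experiment above

def mean_edge_hom(majority_frac, samples=100):
    vals = []
    for _ in range(samples):
        # labels drawn independently of the graph structure
        labels = (rng.random(n) < majority_frac).astype(int)
        upper = np.triu(rng.random((n, n)) < p_edge, 1)
        adj = upper | upper.T  # symmetric Erdos-Renyi adjacency
        vals.append(edge_homophily(adj, labels))
    return float(np.mean(vals))

balanced = mean_edge_hom(0.5)  # close to .5
skewed = mean_edge_hom(0.9)    # close to .9**2 + .1**2 = .82
```

With label-independent edges, the expected edge homophily is q² + (1 − q)² for majority fraction q, which is exactly the spurious growth with imbalance that Figure 3 illustrates.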

Figure 4. Compatibility matrices of homophilic datasets. The diagonal dominance indicates strong homophily.
Figure 5. Compatibility matrices of datasets in Pei et al. (2019). The “film” dataset is also referred to as “Actor”. Note that there are no edges leading out of the nodes of class 1 in the Cornell dataset, so there is an empty row in its matrix.
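Compatibility matrices such as those in Figures 4 and 5 can be computed with a short routine. The following is our own illustrative sketch (the input conventions are hypothetical): it builds the row-normalized class-to-class edge distribution, and a class with no outgoing edges yields an all-zero row, as for class 1 in the Cornell dataset.

```python
import numpy as np

def compatibility_matrix(edges, labels, num_classes):
    """H[c, c'] = fraction of edges leaving class-c nodes that end at class-c' nodes.

    `edges` is an iterable of directed (u, v) pairs; pass each undirected
    edge in both directions. A class with no outgoing edges gets a zero row.
    """
    H = np.zeros((num_classes, num_classes))
    for u, v in edges:
        H[labels[u], labels[v]] += 1.0
    row_sums = H.sum(axis=1, keepdims=True)
    # safe row normalization: rows with no edges stay all-zero
    return np.divide(H, row_sums, out=np.zeros_like(H), where=row_sums > 0)
```

In a homophilous graph this matrix is diagonally dominant (Figure 4); in the datasets of Figure 5 most of the mass sits off the diagonal.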

Appendix B Experimental Details

For gradient-based optimization, we use the AdamW optimizer (Kingma and Ba, 2014; Loshchilov and Hutter, 2018) with weight decay .001 and a default learning rate, unless we tune the optimizer for a particular method (as noted below in B.1). In all cases, we use full-batch gradient descent across the entire graph dataset. Hyperparameter tuning is conducted using grid search for most methods. Tuning for C&S is done as in the original paper (Huang et al., 2021), which uses Optuna (Akiba et al., 2019) for Bayesian hyperparameter optimization. GCN and GCN-JK on ogbn-proteins use the hyperparameters of methods on the Open Graph Benchmark leaderboards (Hu et al., 2020). All graphs are treated as undirected besides arXiv-year and snap-patents, in which the directed nature of the edges captures useful temporal information; however, we find that label propagation and C&S (which builds on label propagation) perform better with undirected graphs in these cases, so we keep these graphs undirected for those methods.
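As a concrete illustration of this setup, the sketch below shows full-batch training with AdamW and weight decay .001. It is not the authors' code: the model is a stand-in linear layer rather than a GNN, and the feature shapes, learning rate, and epoch count are placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# stand-ins for a GNN and a graph dataset; shapes and lr are illustrative
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=1e-3)

x = torch.randn(100, 16)            # node features
y = torch.randint(0, 4, (100,))     # node class labels
train_mask = torch.rand(100) < 0.5  # nodes whose labels are seen in training

for epoch in range(100):
    optimizer.zero_grad()
    out = model(x)  # full batch: one forward pass over every node at once
    loss = F.cross_entropy(out[train_mask], y[train_mask])
    if epoch == 0:
        initial_loss = loss.item()
    loss.backward()
    optimizer.step()
final_loss = loss.item()
```

Full-batch training is feasible here because each dataset fits in GPU memory as a single graph; the (M) entries in Table 2 mark the cases where it does not.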

Simple methods are run on an Nvidia 2080 Ti with 11 GB of GPU memory. In cases where the 2080 Ti did not provide enough memory, we re-ran experiments on an Nvidia Titan RTX with 24 GB of GPU memory, reporting (M) in Table 2 if GPU memory was still insufficient.

B.1. Hyperparameters

Experimental results are reported for the hyperparameter settings below, where for each method and dataset we choose the setting that achieves the highest performance on the validation set. We choose hyperparameter grids that do not necessarily give optimal performance, but that hopefully cover enough regimes for each model to be reasonably evaluated on each dataset. Unless otherwise stated, each GNN uses dropout of .5 (Srivastava et al., 2014) and BatchNorm (Ioffe and Szegedy, 2015) in each layer. The hyperparameter grids for the different methods are:

  • MLP: hidden dimension, number of layers. We use ReLU activations.

  • Label propagation: we use 50 propagation iterations.

  • LINK: weight decay.

  • SGC: weight decay.

  • C&S: normalized adjacency matrices A1, A2 for the residual propagation and label propagation, where A is the adjacency matrix of the graph and D is the diagonal degree matrix. Both Autoscale and FDiff-scale were used for all experiments, and the scale was searched over in FDiff-scale settings. The base predictor is chosen as the best MLP model for each dataset.

  • GCN: lr, hidden dimension, except for snap-patents and pokec, where we omit hidden dimension = 64, and ogbn-proteins, where we use the experimental setup of Hu et al. (2020). Each activation is a ReLU. 2 layers were used for all experiments except for ogbn-proteins.

  • GAT: lr. For snap-patents and pokec: hidden channels and GAT heads. For all other datasets: hidden channels and GAT heads. We use the ELU activation (Clevert et al., 2015). 2 layers were used for all experiments.

  • GCN-JK: identical to GCN, also including the JK type.

  • GAT-JK: identical to GAT, also including the JK type.

  • APPNP: MLP hidden dimension, learning rate. BatchNorm is not used in the initial MLP. We truncate the propagation series at a fixed power of the adjacency.

  • H2GCN: hidden dimension, number of layers, dropout. The architecture follows Section 3.2 of Zhu et al. (2020b).

  • MixHop: hidden dimension, number of layers. Each layer uses the 0th, 1st, and 2nd powers of the adjacency and has ReLU activations. The last layer is a linear projection layer, instead of the attention output mechanism of Abu-El-Haija et al. (2019).

  • GPR-GNN: the basic setup and grid are the same as those of APPNP. We use their Personalized PageRank weight initialization.
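The label propagation baseline above can be sketched in a few lines. This is our own illustrative implementation, assuming Zhou et al.-style symmetric normalization and residual re-injection of the seed labels; the exact variant and hyperparameters used in the benchmark may differ.

```python
import numpy as np

def label_propagation(adj, labels, train_mask, num_classes, alpha=0.9, iters=50):
    # Symmetrically normalize the adjacency: S = D^(-1/2) A D^(-1/2)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg, dtype=float)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    S = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    # One-hot seed labels on the training nodes only
    Y0 = np.zeros((len(labels), num_classes))
    Y0[train_mask, labels[train_mask]] = 1.0
    Y = Y0.copy()
    for _ in range(iters):  # 50 propagation iterations, as in the grid above
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y.argmax(axis=1)
```

Because it uses no node features, label propagation isolates how much signal lives in the graph structure alone, which is exactly what homophily measures try to quantify.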