Generalizing Graph Neural Networks Beyond Homophily

06/20/2020 ∙ by Jiong Zhu, et al. ∙ University of Michigan, Carnegie Mellon University

We investigate the representation power of graph neural networks in the semi-supervised node classification task under heterophily or low homophily, i.e., in networks where connected nodes may have different class labels and dissimilar features. Most existing GNNs fail to generalize to this setting, and are even outperformed by models that ignore the graph structure (e.g., multilayer perceptrons). Motivated by this limitation, we identify a set of key designs – ego- and neighbor-embedding separation, higher-order neighborhoods, and combination of intermediate representations – that boost learning from the graph structure under heterophily, and combine them into a new graph convolutional neural network, H2GCN. Going beyond the traditional benchmarks with strong homophily, our empirical analysis on synthetic and real networks shows that, thanks to the identified designs, H2GCN has consistently strong performance across the full spectrum of low-to-high homophily, unlike competitive prior models without them.






Code Repositories


Boost learning for GNNs from the graph structure under challenging heterophily settings. (NeurIPS'20)


1 Introduction

We focus on the effectiveness of graph neural networks (GNNs) Zhang et al. (2020) in tackling the semi-supervised node classification task in challenging settings: the goal of the task is to infer the unknown labels of the nodes by using the network structure Zhu (2005), given partially labeled networks with node features (or attributes). Unlike most prior work that considers networks with strong homophily, we study the representation power of GNNs in settings with different levels of homophily or class label smoothness.

Homophily is a key principle of many real-world networks, whereby linked nodes often belong to the same class or have similar features (“birds of a feather flock together”) McPherson et al. (2001). For example, friends are likely to have similar political beliefs or age, and papers tend to cite papers from the same research area Newman (2018). GNNs model the homophily principle by propagating features and aggregating them within various graph neighborhoods via different mechanisms (e.g., averaging, LSTM) Kipf and Welling (2016); Hamilton et al. (2017); Veličković et al. (2018). However, in the real world, there are also settings where “opposites attract”, leading to networks with heterophily: linked nodes are likely from different classes or have dissimilar features. For instance, the majority of people tend to connect with people of the opposite gender in dating networks, different amino acid types are more likely to connect in protein structures, and fraudsters are more likely to connect to accomplices than to other fraudsters in online purchasing networks Pandit et al. (2007).

Since GNNs assume strong homophily, most existing models fail to generalize to networks with heterophily (or low/medium level of homophily). In such cases, we find that even models that ignore the graph structure altogether, such as multilayer perceptrons or MLPs, can outperform a number of existing GNNs. Motivated by this limitation, we make the following contributions:

Current Limitations: We reveal the limitation of GNNs to learn over networks with heterophily, which is ignored in the literature due to evaluation on few benchmarks with similar properties. § 3

Key Design Choices for Heterophily & New Model: We identify a set of key design choices that can boost learning from the graph structure in heterophily settings without trading off accuracy in homophily settings: (D1) ego- and neighbor-embedding separation, (D2) higher-order neighborhoods, and (D3) combination of intermediate representations. We justify the designs theoretically and empirically, combine them into a new model, H2GCN, that effectively adapts to both heterophily and homophily settings, and compare our framework to prior GNN models. § 3-4

Extensive Empirical Evaluation: We empirically analyze our model and competitive existing GNN models on both synthetic and real networks covering the full spectrum of low-to-high homophily (besides the typically-used benchmarks with high homophily). We show that H2GCN has consistently strong performance unlike existing models tailored to homophily. § 5

2 Notation and Preliminaries

Figure 1: Neighborhoods.

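We summarize our notation in Table A.1 (App. A). Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected, unweighted graph with nodeset $\mathcal{V}$ and edgeset $\mathcal{E}$. We denote a general neighborhood centered around $v$ as $N(v)$ ($\mathcal{G}$ may have self-loops), the corresponding neighborhood that does not include the ego (node $v$) as $\bar{N}(v)$, and the general neighbors of node $v$ at exactly $i$ hops/steps away (minimum distance) as $N_i(v)$. For example, $N_1(v) = \{u : (u, v) \in \mathcal{E}\}$ are the immediate neighbors of $v$. Other examples are shown in Fig. 1. We represent the graph by its adjacency matrix $\mathbf{A} \in \{0, 1\}^{n \times n}$ and its node feature matrix $\mathbf{X} \in \mathbb{R}^{n \times F}$, where the vector $\mathbf{x}_v$ corresponds to the ego-feature of node $v$, and $\{\mathbf{x}_u : u \in \bar{N}(v)\}$ to its neighbor-features.

We further assume a class label vector $\mathbf{y}$, which for each node $v$ contains a unique class label $y_v$. The goal of semi-supervised node classification is to learn a mapping $\ell : \mathcal{V} \to \mathcal{Y}$, where $\mathcal{Y}$ is the set of labels, given a set of labeled nodes $\mathcal{T}_{\mathcal{V}} = \{(v_1, y_1), (v_2, y_2), \ldots\}$ as training data.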

Graph neural networks From a probabilistic perspective, most GNN models assume the following local Markov property on node features: for each node $v \in \mathcal{V}$, there exists a neighborhood $N(v)$ such that $y_v$ only depends on the ego-feature $\mathbf{x}_v$ and neighbor-features $\{\mathbf{x}_u : u \in N(v)\}$. Most models derive the class label $y_v$ via the following representation learning approach:

$$\mathbf{r}_v^{(k)} = f\left(\mathbf{r}_v^{(k-1)}, \{\mathbf{r}_u^{(k-1)} : u \in N(v)\}\right), \quad \mathbf{r}_v^{(0)} = \mathbf{x}_v, \quad y_v = \arg\max\{\mathrm{softmax}(\mathbf{r}_v^{(K)} \mathbf{W})\}, \qquad (1)$$

where the embedding function $f$ is applied repeatedly in $K$ total rounds, node $v$’s representation (or hidden state vector) at round $k$, $\mathbf{r}_v^{(k)}$, is learned from its ego- and neighbor-representations in the previous round, and a softmax classifier with learnable weight matrix $\mathbf{W}$ is applied to the final representation of $v$. Most existing models differ in their definitions of neighborhoods $N(v)$ and embedding function $f$. A typical definition of neighborhood is $N_1(v)$, i.e., the 1-hop neighbors of $v$. As for $f$, in graph convolutional networks (GCN) Kipf and Welling (2016) each node repeatedly averages its own features and those of its neighbors to update its own feature representation. Using an attention mechanism, GAT Veličković et al. (2018) models the influence of different neighbors more precisely as a weighted average of the ego- and neighbor-features. GraphSAGE Hamilton et al. (2017) generalizes the aggregation beyond averaging, and models the ego-features distinctly from the neighbor-features in its subsampled neighborhood.

Homophily and heterophily In this work, we focus on heterophily in class labels. We first define the edge homophily ratio as a measure of the graph homophily level, and use it to define graphs with strong homophily/heterophily:

Definition 1

The edge homophily ratio $h = \frac{|\{(u, v) : (u, v) \in \mathcal{E} \wedge y_u = y_v\}|}{|\mathcal{E}|}$ is the fraction of edges in a graph which connect nodes that have the same class label (i.e., intra-class edges).

Definition 2

Graphs with strong homophily have high edge homophily ratio $h \to 1$, while graphs with strong heterophily (i.e., low/weak homophily) have small edge homophily ratio $h \to 0$.

The edge homophily ratio in Dfn. 1 gives an overall trend for all the edges in the graph. The actual level of homophily may vary within different pairs of node classes, i.e., there is a different tendency of connection between each pair of classes. In App. B, we give more details about capturing these more complex network characteristics via an empirical class compatibility matrix $\mathbf{H}$, whose $(i, j)$-th entry is the fraction of outgoing edges to nodes in class $j$ among all outgoing edges from nodes in class $i$.
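The two quantities above can be computed directly from an edge list. The following is a minimal sketch, assuming NumPy, integer class labels, and an undirected edge list in which each edge appears once; the function names and the toy graph are illustrative and not taken from the paper.

```python
import numpy as np

def edge_homophily_ratio(edges, labels):
    """Fraction of edges whose two endpoints share a class label (Dfn. 1)."""
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    same = labels[edges[:, 0]] == labels[edges[:, 1]]
    return float(same.mean())

def class_compatibility(edges, labels, num_classes):
    """Empirical compatibility matrix: entry (i, j) is the fraction of
    outgoing edges from class-i nodes that end in class-j nodes.
    Each undirected edge is counted once in each direction."""
    edges = np.asarray(edges)
    labels = np.asarray(labels)
    src = np.concatenate([edges[:, 0], edges[:, 1]])
    dst = np.concatenate([edges[:, 1], edges[:, 0]])
    counts = np.zeros((num_classes, num_classes))
    np.add.at(counts, (labels[src], labels[dst]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of classes without any edge stay all-zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Toy example (assumption): path graph 0-1-2-3 with labels [0, 0, 1, 1].
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
h = edge_homophily_ratio(edges, labels)             # 2 of 3 edges are intra-class
H = class_compatibility(edges, labels, num_classes=2)
```

On this toy path, two of the three edges are intra-class, so the ratio is 2/3, and each row of the compatibility matrix sums to one.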

Heterophily ≠ Heterogeneity. We remark that heterophily, which we study in this work, is a distinct network concept from heterogeneity. Formally, a network is heterogeneous Sun and Han (2012) if it has at least two types of nodes and different relationships between them (e.g., knowledge graphs), and homogeneous if it has a single type of nodes (e.g., users) and a single type of edges (e.g., friendship). The type of nodes in heterogeneous graphs does not necessarily match the class labels $y_v$, therefore both homogeneous and heterogeneous networks may have different levels of homophily.

3 Learning Over Networks with Heterophily

GCN Kipf and Welling (2016)
GAT Veličković et al. (2018)
GCN-Cheby Defferrard et al. (2016)
GraphSAGE Hamilton et al. (2017)
MixHop Abu-El-Haija et al. (2019)
H2GCN (ours)
Table 1: Example of a heterophily setting where existing GNNs fail to generalize, and a typical homophily setting: mean accuracy and standard deviation over three runs (cf. the appendix).


While many GNN models have been proposed, most of them are designed under the assumption of homophily, and are not capable of handling heterophily. As a motivating example, Table 1 shows the mean classification accuracy for several leading GNN models on our synthetic benchmark syn-cora, where we can control the homophily/heterophily level (see App. G for details on the data and setup). Here we consider two homophily ratios $h$, one for high heterophily and one for high homophily. We observe that under heterophily all existing methods fail to perform better than a Multilayer Perceptron (MLP) with 1 hidden layer, a graph-agnostic baseline that relies solely on the node features for classification (differences in accuracy of MLP for different $h$ are due to randomness). In particular, GCN Kipf and Welling (2016), GAT Veličković et al. (2018) and MixHop Abu-El-Haija et al. (2019) show up to 42% worse performance than MLP, highlighting that methods that work well under high homophily may not be appropriate for networks with low/medium homophily.

Motivated by this limitation, in the following subsections, we discuss and theoretically justify a set of key design choices that, when appropriately incorporated in a GNN framework, can improve the performance in the challenging heterophily settings. Then, we present H2GCN, a model that, thanks to these designs, adapts well to both homophily and heterophily (Table 1, last row). In Section 5, we provide comprehensive empirical analysis on both synthetic and real data with varying homophily levels, and show that H2GCN performs consistently well across different levels, and improves over MLP by effectively leveraging the graph structure in challenging settings.

3.1 Effective Designs for Networks with Heterophily

We have identified three key designs that—when appropriately integrated—can help improve the performance of GNN models in heterophily settings: (D1) ego- and neighbor-embedding separation; (D2) higher-order neighborhoods; and (D3) combination of intermediate representations.

3.1.1 (D1) Ego- and Neighbor-embedding Separation

The first design entails encoding each ego-embedding (i.e., a node’s embedding) separately from the aggregated embeddings of its neighbors, since they are likely to be dissimilar in heterophily settings. Formally, the representation (or hidden state vector) learned for each node $v$ at round $k$ is given as:

$$\mathbf{r}_v^{(k)} = \mathrm{COMBINE}\left(\mathbf{r}_v^{(k-1)}, \mathrm{AGGR}\left(\{\mathbf{r}_u^{(k-1)} : u \in \bar{N}(v)\}\right)\right), \qquad (2)$$

where the neighborhood $\bar{N}(v)$ does not include $v$ (no self-loops), the AGGR function aggregates representations only from the neighbors (in some way, e.g., average), and AGGR and COMBINE may be followed by a non-linear transformation. For heterophily, after aggregating the neighbors’ representations, the definition of COMBINE (akin to a ‘skip connection’ between layers) is critical: a simple way of combining the ego- and the aggregated neighbor-embeddings without ‘mixing’ them is to concatenate them, rather than average all of them as in the GCN model by Kipf and Welling (2016).

Intuition. In heterophily settings, by definition (Dfn. 2), the class label $y_v$ and original features $\mathbf{x}_v$ of a node and those of its neighboring nodes (esp. the direct neighbors $\bar{N}_1(v)$) may be different. However, the typical GCN design that mixes the embeddings through an average Kipf and Welling (2016) or weighted average Veličković et al. (2018) as the COMBINE function results in final embeddings that are similar across neighboring nodes (especially within a community or cluster) for any set of original features Rossi et al. (2020). While this may work well in the case of homophily, where neighbors likely belong to the same cluster and class, it poses severe challenges in the case of heterophily: it is not possible to distinguish neighbors from different classes based on the (similar) learned representations. Choosing a COMBINE function that separates the representations of each node and its neighbors allows for more expressiveness, where the skipped or non-aggregated representations can evolve separately over multiple rounds of propagation without becoming prohibitively similar.

Theoretical Justification. We prove theoretically that, under some conditions, a GCN layer that co-embeds ego- and neighbor-features is less capable of generalizing to heterophily than a layer that embeds them separately. We measure its generalization ability by its robustness to test/train data deviations. We give the proof of the theorem in App. C.1. Though the theorem applies to specific conditions, our empirical analysis shows that it holds in more general cases (§ 5).

Theorem 1

Consider a graph $\mathcal{G}$ without self-loops (§ 2) with node features $\mathbf{x}_v = \mathrm{onehot}(y_v)$ for each node $v$, and an equal number of nodes per class $y \in \mathcal{Y}$ in the training set $\mathcal{T}_{\mathcal{V}}$. Also assume that all nodes in $\mathcal{T}_{\mathcal{V}}$ have degree $d$, and proportion $h$ of their neighbors belong to the same class, while proportion $\frac{1-h}{|\mathcal{Y}|-1}$ of them belong to any other class (uniformly). Then for $h < \frac{1 - |\mathcal{Y}| + 2d}{2|\mathcal{Y}|d}$, a simple GCN layer formulated as $(\mathbf{A} + \mathbf{I})\mathbf{X}\mathbf{W}$ is less robust, i.e., misclassifies a node for smaller train/test data deviations, than an $\mathbf{A}\mathbf{X}\mathbf{W}$ layer that separates the ego- and neighbor-embeddings.

Observations. In Table 1, we observe that GCN, GAT, and MixHop, which ‘mix’ the ego- and neighbor-embeddings explicitly, perform poorly in the heterophily setting. (Footnote 1: These models consider self-loops, which turn each ego also into a neighbor, and thus mix the ego- and neighbor-representations. E.g., GCN and MixHop operate on the symmetric normalized adjacency matrix augmented with self-loops: $\hat{\mathbf{A}} = \hat{\mathbf{D}}^{-1/2}(\mathbf{A} + \mathbf{I})\hat{\mathbf{D}}^{-1/2}$, where $\mathbf{I}$ is the identity and $\hat{\mathbf{D}}$ the degree matrix of $\mathbf{A} + \mathbf{I}$.) On the other hand, GraphSAGE, which separates the embeddings (e.g., it concatenates the two embeddings and then applies a non-linear transformation), achieves 33-40% better performance in this setting.
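To make the contrast behind design D1 concrete, the sketch below compares a weight-free propagation step that mixes ego and neighbors with one that keeps them in separate blocks. It is an illustration under stated assumptions: NumPy, one-hot features, a perfectly heterophilous 4-cycle as the toy graph, and plain row normalization in place of the symmetric normalization used by GCN. The function names are hypothetical.

```python
import numpy as np

def mixed_layer(A, X):
    """GCN-style step without weights: average over the ego and its
    neighbors, i.e. a row-normalized (A + I) X."""
    A_hat = A + np.eye(A.shape[0])
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ X

def separated_layer(A, X):
    """D1-style step: concatenate the ego-features with the mean of the
    neighbor-features instead of averaging them together."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    return np.concatenate([X, (A / deg) @ X], axis=1)

# Toy graph (assumption): 4-cycle whose classes alternate 0, 1, 0, 1,
# so every edge connects nodes of different classes.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
y = np.array([0, 1, 0, 1])
X = np.eye(2)[y]                       # one-hot features, as in Theorem 1

mixed = mixed_layer(A, X)              # ego signal is outvoted by two neighbors
separated = separated_layer(A, X)      # first two columns still hold the ego class
```

In the mixed output, each node's largest entry points at the opposite class because two neighbors outvote the ego; in the separated output, the ego block is unchanged and the neighbor block carries the (opposite-class) context alongside it.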

3.1.2 (D2) Higher-order Neighborhoods

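The second design involves explicitly aggregating information from higher-order neighborhoods in each round $k$, beyond the immediate neighbors of each node:

$$\mathbf{r}_v^{(k)} = \mathrm{COMBINE}\left(\mathbf{r}_v^{(k-1)}, \mathrm{AGGR}\left(\{\mathbf{r}_u^{(k-1)} : u \in N_1(v)\}\right), \mathrm{AGGR}\left(\{\mathbf{r}_u^{(k-1)} : u \in N_2(v)\}\right), \ldots\right), \qquad (3)$$

where $N_i(v)$ corresponds to the neighbors of $v$ at exactly $i$ hops away, and the AGGR functions applied to different neighborhoods can be the same or different. This design augments the implicit aggregation over higher-order neighborhoods that most GNN models achieve through multiple rounds of first-order propagation based on variants of Eq. (2).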
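The neighborhoods at exactly a given number of hops (minimum distance) can be derived from the adjacency matrix. The sketch below is one possible way to do this with dense NumPy arrays; the function name `exact_hop_masks` and the toy path graph are assumptions for illustration, and a sparse implementation would be preferable for large graphs.

```python
import numpy as np

def exact_hop_masks(A, max_hop):
    """Return a dict mapping i to a boolean matrix M with M[v, u] True iff
    the shortest-path distance between v and u is exactly i (1 <= i <= max_hop)."""
    n = A.shape[0]
    adj = A > 0                         # fresh boolean copy of the adjacency
    np.fill_diagonal(adj, False)        # ignore self-loops
    reached = np.eye(n, dtype=bool)     # nodes already seen at a shorter distance
    frontier = np.eye(n, dtype=bool)    # nodes at distance i - 1
    masks = {}
    for i in range(1, max_hop + 1):
        step = (frontier.astype(int) @ adj.astype(int)) > 0
        frontier = step & ~reached      # keep only newly reached nodes
        reached |= frontier
        masks[i] = frontier
    return masks

# Toy example (assumption): path graph 0-1-2-3.
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
masks = exact_hop_masks(A, max_hop=2)
two_hop_of_0 = np.flatnonzero(masks[2][0])   # nodes exactly two hops from node 0
```

Because nodes reached at a shorter distance are masked out, a node never appears in its own 2-hop neighborhood, and 1-hop neighbors are not repeated in it.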

Intuition. To show why higher-order neighborhoods help in the heterophily settings, we first define homophily-dominant and heterophily-dominant neighborhoods:

Definition 3

$N(v)$ is expectedly homophily-dominant if $P(y_u = y_v \mid y_v) \geq P(y_u = y \mid y_v)$, $\forall u \in N(v)$ and $y \in \mathcal{Y}$, $y \neq y_v$. If the opposite inequality holds, $N(v)$ is expectedly heterophily-dominant.

From this definition, we can see that expectedly homophily-dominant neighborhoods are more beneficial for GNN layers, as in such neighborhoods the class label $y_v$ of each node $v$ can in expectation be determined by the majority of the class labels in $N(v)$. In the case of heterophily, we have seen empirically that although the immediate neighborhoods may be heterophily-dominant, the higher-order neighborhoods may be homophily-dominant and thus provide more relevant context.

Theoretical Justification. Below we formalize this observation for 2-hop neighborhoods, and prove one case when they are homophily-dominant in App. C.2:

Theorem 2

Consider a graph $\mathcal{G}$ without self-loops (§ 2) with label set $\mathcal{Y}$, where for each node $v$, its neighbors’ class labels $\{y_u : u \in N(v)\}$ are conditionally independent given $y_v$, and $P(y_u = y_v \mid y_v) = h$, $P(y_u = y \mid y_v) = \frac{1-h}{|\mathcal{Y}|-1}$, $\forall y \neq y_v$. Then, the 2-hop neighborhood $N_2(v)$ for a node $v$ will always be homophily-dominant in expectation.

Observations. Under heterophily, GCN-Cheby, which models different neighborhoods by combining Chebyshev polynomials to approximate a higher-order graph convolution operation Defferrard et al. (2016), outperforms GCN and GAT, which aggregate over only the immediate neighbors $N_1(v)$, by up to +31% (Table 1). MixHop, which explicitly models 1-hop and 2-hop neighborhoods (though it ‘mixes’ the ego- and neighbor-embeddings, violating design D1), also outperforms these two models.
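A quick numeric check of the statement in Theorem 2 is sketched below. It is a simplification, not the proof from App. C.2: it treats the label of a 2-hop neighbor as the result of two independent label transitions, each keeping the class with probability $h$ and switching uniformly otherwise, and it does not exclude walks that return to the ego. Under this assumption, writing $q = (1-h)/(|\mathcal{Y}|-1)$, the same-class probability after two steps exceeds that of any other single class by $(h - q)^2 \geq 0$. NumPy and the function name are assumptions.

```python
import numpy as np

def two_step_label_distribution(h, num_classes):
    """Class-to-class transition matrix after two hops, assuming each hop
    keeps the class with probability h and otherwise switches uniformly."""
    q = (1.0 - h) / (num_classes - 1)
    T = np.full((num_classes, num_classes), q)
    np.fill_diagonal(T, h)
    return T @ T

gaps = []
for num_classes in (2, 3, 5, 10):
    for h in np.linspace(0.0, 1.0, 11):
        T2 = two_step_label_distribution(h, num_classes)
        same = T2[0, 0]                 # probability of returning to the ego's class
        other = T2[0, 1]                # probability of any one specific other class
        gaps.append((h, num_classes, same - other))

# Example: strong heterophily with five classes.
T2_example = two_step_label_distribution(0.1, 5)
```

For $h = 0.1$ and five classes, this gives a same-class probability of 0.2125 against 0.196875 for each other class, so the 2-hop step favors the ego's class even though the 1-hop step does not.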

3.1.3 (D3) Combination of Intermediate Representations

The third design combines the intermediate representations of each node from multiple rounds at the final layer:

$$\mathbf{r}_v^{(\mathrm{final})} = \mathrm{COMBINE}\left(\mathbf{r}_v^{(1)}, \mathbf{r}_v^{(2)}, \ldots, \mathbf{r}_v^{(K)}\right), \qquad (4)$$

to explicitly capture local and global information via COMBINE functions that leverage each representation separately, e.g., concatenation or LSTM-attention Xu et al. (2018). This design is introduced in jumping knowledge networks Xu et al. (2018) and shown to increase the representation power of GCNs under homophily.

Intuition. Intuitively, each round collects information with different locality—earlier rounds are more local, while later rounds capture increasingly more global information (implicitly, via propagation). Similar to D2 (which models explicit neighborhoods), this design models the distribution of neighbor representations in low-homophily networks more accurately. It also allows the class prediction to leverage different neighborhood ranges in different networks, adapting to their structural properties.

Theoretical Justification. The benefit of combining intermediate representations can be theoretically explained from the spectral perspective. Assuming a GCN-style layer, where propagation can be viewed as spectral filtering, the adjacency matrix is a low-pass filter Wu et al. (2019), so intermediate outputs from earlier rounds contain higher-frequency components than outputs from later rounds. At the same time, the following theorem holds for graphs with heterophily, where we view class labels as graph signals (as in graph signal processing):

Theorem 3

Consider graph signals (label vectors) $\mathbf{s}, \mathbf{t} \in \{0, 1\}^{|\mathcal{V}|}$ defined on an undirected graph $\mathcal{G}$ with edge homophily ratios $h_s$ and $h_t$, respectively. If $h_s < h_t$, then signal $\mathbf{s}$ has higher energy (Dfn. 5) in high-frequency components than $\mathbf{t}$ in the spectrum of the unnormalized graph Laplacian $\mathbf{L}$.

In other words, in heterophily settings, the label distribution contains more information at higher than lower frequencies (see proof in App. C.3). Thus, by combining the intermediate outputs from different layers, this design captures both low- and high-frequency components in the final representation, which is critical in heterophily settings, and allows for more expressiveness in the general setting.
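The graph-signal view can be illustrated numerically. The sketch below does not reproduce the energy definition of Dfn. 5 (which is in the appendix); it relies only on the standard identity $\mathbf{s}^\top \mathbf{L} \mathbf{s} = \sum_{(u,v) \in \mathcal{E}} (s_u - s_v)^2$, which equals the eigenvalue-weighted sum of the signal's spectral energy. For a 0/1 label signal, this counts the edges that join different labels, so a less homophilous signal puts more weight on large eigenvalues (high frequencies). NumPy, the function names, and the 4-cycle are assumptions.

```python
import numpy as np

def unnormalized_laplacian(A):
    """L = D - A for an undirected graph with adjacency matrix A."""
    return np.diag(A.sum(axis=1)) - A

def spectral_energy(A, signal):
    """Eigenvalues of L (ascending) and the squared projections of the
    signal onto the corresponding eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(unnormalized_laplacian(A))
    return eigvals, (eigvecs.T @ signal) ** 2

def frequency_weighted_energy(A, signal):
    """Sum over frequencies of eigenvalue * energy; equals signal^T L signal."""
    eigvals, energy = spectral_energy(A, signal)
    return float(eigvals @ energy)

# Toy graph (assumption): 4-cycle 0-1-2-3-0.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
s = np.array([1.0, 0.0, 1.0, 0.0])   # every edge joins different labels (h = 0)
t = np.array([1.0, 1.0, 0.0, 0.0])   # half of the edges join equal labels (h = 0.5)

e_s = frequency_weighted_energy(A, s)
e_t = frequency_weighted_energy(A, t)
```

Here the fully heterophilous signal `s` cuts all four edges and the signal `t` cuts two, so the frequency-weighted energy of `s` is twice that of `t`.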

Observations. By concatenating the intermediate representations from two rounds with the embedded ego-representation (following the jumping knowledge framework Xu et al. (2018)), GCN’s accuracy under heterophily improves by 20% over its counterpart without design D3 (Table 1).

Summary of designs To sum up, D1 models (at each layer) the ego- and neighbor-representations distinctly, D2 leverages (at each layer) representations of neighbors at different distances distinctly, and D3 leverages (at the final layer) the learned ego-representations at previous layers distinctly.

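3.2 H2GCN: A Framework for Networks with Homophily or Heterophily

We now describe H2GCN, which combines designs D1-D3 to adapt to heterophily. It has three stages (Alg. 1, App. D): (S1) feature embedding, (S2) neighborhood aggregation, and (S3) classification.

The feature embedding stage (S1) uses a graph-agnostic dense layer to generate for each node $v$ the feature embedding $\mathbf{r}_v^{(0)}$ based on its ego-feature $\mathbf{x}_v$: $\mathbf{r}_v^{(0)} = \sigma(\mathbf{x}_v \mathbf{W}_e)$, where $\sigma$ is an optional non-linear function, and $\mathbf{W}_e \in \mathbb{R}^{F \times p}$ is a learnable weight matrix.

In the neighborhood aggregation stage (S2), the generated embeddings are aggregated and repeatedly updated within the node’s neighborhood for $K$ rounds. Following designs D1 and D2, the neighborhood $N(v)$ of our framework involves two sub-neighborhoods without the egos: the 1-hop graph neighbors $\bar{N}_1(v)$ and the 2-hop neighbors $\bar{N}_2(v)$, as shown in Fig. 1:

$$\mathbf{r}_v^{(k)} = \mathrm{COMBINE}\left(\mathrm{AGGR}\{\mathbf{r}_u^{(k-1)} : u \in \bar{N}_1(v)\}, \mathrm{AGGR}\{\mathbf{r}_u^{(k-1)} : u \in \bar{N}_2(v)\}\right). \qquad (5)$$

We set COMBINE as concatenation (so as not to mix different neighborhood ranges), and AGGR as a degree-normalized average of the neighbor-embeddings in sub-neighborhood $\bar{N}_i(v)$:

$$\mathrm{AGGR}\{\mathbf{r}_u^{(k-1)} : u \in \bar{N}_i(v)\} = \sum_{u \in \bar{N}_i(v)} \mathbf{r}_u^{(k-1)} \, d_{v,i}^{-1/2} \, d_{u,i}^{-1/2}, \qquad (6)$$

where $d_{v,i} = |\bar{N}_i(v)|$ is the $i$-hop degree of node $v$ (i.e., number of nodes in its $i$-hop neighborhood). Note that unlike Eq. (2), here we do not combine the ego-embedding of node $v$ with the neighbor-embeddings. We found that removing the typical non-linear embedding transformations per round, as in SGC Wu et al. (2019), works better (App. D.2), and in such case including the ego-embedding only in the final representation avoids redundancies. By design D3, the final representation of each node combines all its intermediate representations:

$$\mathbf{r}_v^{(\mathrm{final})} = \mathrm{COMBINE}\left(\mathbf{r}_v^{(0)}, \mathbf{r}_v^{(1)}, \ldots, \mathbf{r}_v^{(K)}\right), \qquad (7)$$

where we empirically find concatenation works better than max-pooling Xu et al. (2018) as the COMBINE function.

In the classification stage (S3), the node is classified based on its final embedding $\mathbf{r}_v^{(\mathrm{final})}$:

$$y_v = \arg\max\{\mathrm{softmax}(\mathbf{r}_v^{(\mathrm{final})} \mathbf{W}_c)\}, \qquad (8)$$

where $\mathbf{W}_c \in \mathbb{R}^{(2^{K+1}-1)p \times |\mathcal{Y}|}$ is a learnable weight matrix. We visualize our framework in App. D.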
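The three stages can be sketched as a forward pass in a few lines. This is an illustrative reading of the description above, not the authors' released implementation: it assumes dense NumPy arrays, ReLU as the optional non-linearity in S1, symmetric degree normalization for AGGR, and randomly initialized weights; all function and variable names are hypothetical, and training is omitted.

```python
import numpy as np

def sym_norm(M):
    """D^{-1/2} M D^{-1/2}; rows of isolated nodes stay zero."""
    deg = M.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return inv_sqrt[:, None] * M * inv_sqrt[None, :]

def h2gcn_forward(A, X, W_embed, W_classify, num_rounds=2):
    n = A.shape[0]
    eye = np.eye(n, dtype=bool)
    A1 = ((A > 0) & ~eye).astype(float)                          # 1-hop, ego excluded
    A2 = (((A1 @ A1) > 0) & (A1 == 0) & ~eye).astype(float)      # exactly 2 hops
    A1n, A2n = sym_norm(A1), sym_norm(A2)

    r = np.maximum(X @ W_embed, 0.0)       # S1: graph-agnostic embedding (ReLU assumed)
    reps = [r]
    for _ in range(num_rounds):            # S2: no per-round weights or non-linearity
        r = np.concatenate([A1n @ r, A2n @ r], axis=1)
        reps.append(r)
    r_final = np.concatenate(reps, axis=1)  # D3: keep every intermediate representation
    logits = r_final @ W_classify           # S3: softmax is monotone, so argmax suffices
    return logits.argmax(axis=1), r_final

# Toy inputs (assumptions): 5-node path graph, F = 4 features, p = 3, two classes.
rng = np.random.default_rng(0)
n, F, p, C, K = 5, 4, 3, 2, 2
A = np.zeros((n, n))
for u in range(n - 1):
    A[u, u + 1] = A[u + 1, u] = 1.0
X = rng.normal(size=(n, F))
W_embed = rng.normal(size=(F, p))
W_classify = rng.normal(size=((2 ** (K + 1) - 1) * p, C))
preds, r_final = h2gcn_forward(A, X, W_embed, W_classify, num_rounds=K)
```

Each round doubles the embedding width because the 1-hop and 2-hop aggregates are concatenated, so with $K = 2$ and $p = 3$ the final representation has $(2^{K+1} - 1)p = 21$ columns, which matches the number of rows of the classifier weights.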

Time complexity

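The feature embedding stage (S1) takes $O(\mathrm{nnz}(\mathbf{X})\,p)$, where $\mathrm{nnz}(\mathbf{X})$ is the number of non-0s in feature matrix $\mathbf{X}$, and $p$ is the dimension of the feature embeddings. The neighborhood aggregation stage (S2) takes $O(|\mathcal{E}|\,d_{\max})$ to derive the 2-hop neighborhoods via sparse-matrix multiplications, where $d_{\max}$ is the maximum degree of all nodes, and $O(2^K(|\mathcal{E}| + |\mathcal{E}_2|)\,p)$ for $K$ rounds of aggregation, where $|\mathcal{E}_2| = \frac{1}{2}\sum_{v \in \mathcal{V}} |\bar{N}_2(v)|$. We give a detailed analysis in App. D.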

4 Other Related Work

We discuss relevant work on GNNs here, and give other related work (e.g., classification under heterophily) in App. E. Besides the models mentioned above, there are various comprehensive reviews describing previously proposed architectures Zhang et al. (2020); Chami et al. (2020); Zhang et al. (2019). Recent work has investigated GNNs’ ability to capture graph information, proposing diagnostic measurements based on feature smoothness and label smoothness Hou et al. (2020) that may guide the learning process. To capture more graph information, other works generalize graph convolution outside of immediate neighborhoods. For example, apart from MixHop Abu-El-Haija et al. (2019) (cf. § 3.1), Graph Diffusion Convolution Klicpera et al. (2019) replaces the adjacency matrix with a sparsified version of a diffusion matrix (e.g., heat kernel or PageRank). Geom-GCN Pei et al. (2020) precomputes unsupervised node embeddings and uses neighborhoods defined by geometric relationships in the resulting latent space to define graph convolution. Some of these works Abu-El-Haija et al. (2019); Pei et al. (2020); Hou et al. (2020) acknowledge the challenges of learning from graphs with heterophily. Others have noted that node labels may have complex relationships that should be modeled directly. For instance, Graph Agreement Models Stretcu et al. (2019) augment the classification task with an agreement task, co-training a model to predict whether pairs of nodes share the same label. Graph Markov Neural Networks Qu et al. (2019) model the joint label distribution with a conditional random field, trained with expectation maximization using GNNs.

Method D1 D2 D3
GCN Kipf and Welling (2016)
GAT Veličković et al. (2018)
GCN-Cheby Defferrard et al. (2016)
GraphSAGE Hamilton et al. (2017)
MixHop Abu-El-Haija et al. (2019)
H2GCN (proposed)
Table 2: Design Comparison.

Comparison of H2GCN to existing GNN models As shown in Table 2, H2GCN differs from existing GNN models with respect to designs D1-D3, and their implementations (we give more details in App. D). Notably, H2GCN learns a graph-agnostic feature embedding in stage (S1), and skips the non-linear embeddings of aggregated representations per round that other models use (e.g., GraphSAGE, MixHop, GCN), resulting in a simpler yet powerful architecture.

5 Empirical Evaluation

In our analysis, we (1) compare H2GCN to existing GNN models on synthetic and real graphs with a wide range of low-to-high homophily values; and (2) evaluate the significance of designs D1-D3.

Baseline models We consider MLP with 1 hidden layer, and all the methods listed in Table 2. For H2GCN, we model the first- and second-order neighborhoods ($\bar{N}_1$ and $\bar{N}_2$), and consider two variants: H2GCN-1 uses one embedding round ($K = 1$) and H2GCN-2 uses two rounds ($K = 2$). We tune all the models on the same train/validation splits (see App. F for details).

5.1 Evaluation on Synthetic Benchmarks

Synthetic datasets & setup

We generate synthetic graphs with various homophily ratios $h$ (cf. table below) by adopting an approach similar to Karimi et al. (2017). In App. G, we describe the data generation process, the experimental setup, and the data statistics in detail. All methods share the same training, validation and test splits (25%, 25%, 50% per class), and we report the average accuracy and standard deviation (stdev) over three generated graphs per heterophily level and benchmark dataset.

Benchmark Name #Nodes #Edges #Classes #Features Homophily h #Graphs
syn-cora to 5 cora Sen et al. (2008b); Yang et al. (2016) [0, 0.1, …, 1] (3 per h)
syn-products to 10 ogbn-products Hu et al. (2020) [0, 0.1, …, 1] (3 per h)
Table 3: Statistics for Synthetic Datasets
(a) syn-cora (Table G.2)
(b) syn-products (Table G.3). GAT out of memory; MixHop acc .
Figure 2: Performance of GNN models on synthetic datasets. H2GCN-2 outperforms baseline models in most heterophily settings, while tying with other models in homophily.

Model comparison Figure 2 shows the mean test accuracy (and stdev) over all random splits of our synthetic benchmarks. We observe similar trends on both benchmarks: H2GCN has the best trend overall, outperforming the baseline models in most heterophily settings, while tying with other models in homophily. The performance of GCN, GAT and MixHop, which mix the ego- and neighbor-embeddings, increases with respect to the homophily level. But, while they achieve near-perfect accuracy under strong homophily, they are significantly less accurate than MLP (near-flat performance curve as it is graph-agnostic) for many heterophily settings. GraphSAGE and GCN-Cheby, which leverage some of the identified designs D1-D3 (Table 2, § 3), are more competitive in such settings. We note that all the methods, except GCN and GAT, learn more effectively under perfect heterophily ($h = 0$) than under weaker heterophily settings, as evidenced by the J-shaped performance curves in low-homophily ranges.

Significance of design choices Using syn-products, we show the significance of designs D1-D3 (§ 3.1) through ablation studies with variants of H2GCN (Fig. 3, Table G.4).

(D1) Ego- and Neighbor-embedding Separation. We consider H2GCN-1 variants that separate the ego- and neighbor-embeddings and model: (S0) neighborhoods $\bar{N}_1$ and $\bar{N}_2$ (i.e., H2GCN-1); (S1) only the 1-hop neighborhood $\bar{N}_1$ in Eq. (5); and their counterparts that do not separate the two embeddings and use: (NS0) neighborhoods $N_1$ and $N_2$ (including $v$); and (NS1) only the 1-hop neighborhood $N_1$. In Fig. 3(a), we see that the two variants that learn separate embedding functions significantly outperform the others (NS0/1) in heterophily settings, which shows that design D1 is critical for success in heterophily. Vanilla H2GCN-1 (S0) performs best for all homophily levels.

(D2) Higher-order Neighborhoods. For this design, we consider three variants of H2GCN-1 without specific neighborhoods: (N0) without the 0-hop neighborhood $N_0(v) = v$ (i.e., the ego-embedding); (N1) without $\bar{N}_1(v)$; and (N2) without $\bar{N}_2(v)$. Figure 3(b) shows that H2GCN-1 consistently performs better than all the variants, indicating that combining all sub-neighborhoods works best. Among the variants, in heterophily settings, $N_0(v)$ contributes most to the performance (N0 causes a significant decrease in accuracy), followed by $\bar{N}_1(v)$, and $\bar{N}_2(v)$. However, when homophily is high, the importance of sub-neighborhoods is reversed. Thus, the ego-features are the most important in heterophily, and higher-order neighborhoods contribute the most in homophily. The design of H2GCN allows it to effectively combine information from different neighborhoods, adapting to all levels of homophily.

(D3) Combination of Intermediate Representations. We consider three variants (K-0,1,2) of H2GCN-2 that drop from the final representation of Eq. (7) the 0th-, 1st-, or 2nd-round intermediate representation, respectively. We also consider using only the last (2nd-round) intermediate representation as final, which is akin to what the other GNN models do. Figure 3(c) shows that H2GCN-2, which combines all the intermediate representations, performs the best, followed by the variant K2 that skips the round-2 representation. The ego-embedding is the most important for heterophily (see trend of K0).

(a) Design D1: Embedding separation.
(b) Design D2: Higher-order neighborhoods.
(c) Design D3: Intermediate representations.
(d) Accuracy per degree in hetero/homo-phily.
Figure 3: (a)-(c): Significance of design choices D1-D3 via ablation studies. (d): Performance of H2GCN for different node degree ranges. In heterophily, the performance gap between low- and high-degree nodes is significantly larger than in homophily, i.e., low-degree nodes pose challenges.

The challenging case of low-degree nodes Figure 3(d) plots the mean accuracy of H2GCN variants on syn-products for different node degree ranges in both a heterophily and a homophily setting. We observe that under heterophily there is a significantly bigger performance gap between low- and high-degree nodes: 13% for H2GCN-1 (10% for H2GCN-2) vs. less than 3% under homophily. This is likely due to the importance of the distribution of class labels in each neighborhood under heterophily, which is harder to estimate accurately for low-degree nodes with few neighbors. On the other hand, in homophily, neighbors are likely to have similar classes, so the neighborhood size does not have as significant an impact on the accuracy.

5.2 Evaluation on Real Benchmarks

Real datasets & setup

We now evaluate the performance of our model and established GNN models on a variety of real-world datasets Tang et al. (2009); Rozemberczki et al. (2019); Sen et al. (2008b); Namata et al. (2012); Bojchevski and Günnemann (2018); Shchur et al. (2018) with edge homophily ratio $h$ ranging from strong heterophily to strong homophily, going beyond the traditional Cora, Pubmed and Citeseer graphs that have strong homophily (hence the good performance of existing GNNs on them). We summarize the data in Table 4 (top), and describe them in App. H, where we also point out potential data limitations. For all benchmarks (except Cora-Full), we use the feature vectors, class labels, and 10 random splits (48%/32%/20% of nodes per class for train/validation/test) provided by Pei et al. (2020). (Footnote 2: Pei et al. (2020) claims that the ratios are 60%/20%/20%, which is different from the actual data splits shared on GitHub.)

Model comparison

Table 4 gives the mean accuracy and stdev of H2GCN variants and other models. We observe that the H2GCN variants have consistently strong performance across the full spectrum of low-to-high homophily: H2GCN-2 achieves the best average rank (2.9) across all datasets (or homophily ratios $h$), followed by H2GCN-1 (3.7). Other models that use some of the designs D1-D3 (§ 3.1), including GraphSAGE and GCN-Cheby, also perform significantly better than GCN and GAT which lack these designs. Here, we also report the best results among the three recently-proposed GEOM-GCN variants (§ 4), directly from the paper Pei et al. (2020): other models (including ours) outperform this method significantly under heterophily. We note that MLP is a competitive baseline under strong heterophily, indicating that the existing models do not use the graph information effectively, or the latter is misleading in such cases. All models perform poorly on Squirrel and Actor likely due to their low-quality node features (small correlation with class labels). Also, Squirrel and Chameleon are dense, with many nodes sharing the same neighbors.

Dataset     Hom. ratio h   #Nodes   #Edges    #Classes
Texas       0.11           183      295       5
Wisconsin   0.21           251      466       5
Actor       0.22           7,600    26,752    5
Squirrel    0.22           5,201    198,493   5
Chameleon   0.23           2,277    31,421    5
Cornell     0.3            183      280       5
Cora Full   0.57           19,793   63,421    70
Citeseer    0.74           3,327    4,676     7
Pubmed      0.8            19,717   44,327    3
Cora        0.81           2,708    5,278     6

Model       Avg Rank
H2GCN-1     3.7
H2GCN-2     2.9
GraphSAGE   3.8
GCN-Cheby   3.9
MixHop      7.5
GCN         5.3
GAT*        7.6 (Cora Full: N/A)
MLP         5.3
Table 4: Real data: mean accuracy ± stdev over different data splits. Best graph-aware model highlighted in gray. Asterisk “*” denotes results obtained from Pei et al. (2020) and “N/A” results (for Cora Full) not reported in the paper. We note that GAT runs out of memory on Cora Full in our experiments.
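The "Avg Rank" column summarizes each model's position across datasets. The sketch below shows one plausible way to compute such a score; it assumes rank 1 goes to the highest accuracy on each dataset and breaks ties by input order. How the authors treat ties or missing entries is not stated here, and the model names and accuracies are made-up placeholders, not values from Table 4.

```python
def average_ranks(acc):
    """acc maps model name -> list of accuracies (one per dataset).

    Returns model name -> mean rank, where rank 1 is the best accuracy
    on a dataset. Ties are broken by input order (an assumption).
    """
    models = list(acc)
    n_datasets = len(next(iter(acc.values())))
    totals = {m: 0.0 for m in models}
    for j in range(n_datasets):
        ordered = sorted(models, key=lambda m: acc[m][j], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_datasets for m in models}


# Placeholder accuracies for three hypothetical models on three datasets.
toy_acc = {
    "model_a": [0.8, 0.6, 0.7],
    "model_b": [0.7, 0.9, 0.5],
    "model_c": [0.6, 0.5, 0.9],
}
ranks = average_ranks(toy_acc)
```

A lower average rank indicates a model that is consistently near the top across the homophily spectrum, even if it is not the single best on any one dataset.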

6 Conclusion

We have focused on characterizing the representation power of GNNs in challenging settings with heterophily or low homophily, which is understudied in the literature. We have highlighted the current limitations of GNNs, presented designs that increase representation power under heterophily and are theoretically justified with perturbation analysis and graph signal processing, and introduced a new model that adapts to both heterophily and homophily by effectively synthesizing these designs. We analyzed various challenging datasets, going beyond the often-used benchmark datasets (Cora, Pubmed, Citeseer), and leave the extension to a larger-scale experimental testbed as future work.

Broader Impact

Homophily and heterophily are not intrinsically ethical or unethical: both are phenomena that exist in nature, reflected in the popular proverbs “birds of a feather flock together” and “opposites attract”. However, existing GNN models implicitly assume homophily, thus ignoring the heterophily that may exist in some networks. As a result, if they are applied to networks that do not satisfy the assumption, the results may be biased, unfair, or erroneous.

Beyond the node classification problem that we tackle in this work, GNN models have been employed in a wide range of applications, such as recommendation systems, analysis of molecules and proteins, and more. In some of these cases, the homophily assumption may have ethical implications: For example, a GNN model that intrinsically assumes homophily may contribute to the so-called “filter bubble” phenomenon in a recommendation system (reinforcing existing beliefs/views, and downplaying the opposite ones), or make the minority groups less visible in social networks. In other cases, a reliance on homophily may hinder scientific progress: Among other domains, this is critical for the emerging research field of applying GNN models to molecular and protein structures, where the connected nodes often belong to different classes; the performance of existing GNNs may be poor in such cases (as we have shown in our analysis) and could hinder new discoveries. Moreover, if the input data contain many errors (e.g., wrong class labels, noisy network with incorrect and missing links), these may be propagated over the network, and lead to compounding errors in the classification results (this is common in most, if not all, machine learning problems).

Our work has the potential to rectify some of these negative consequences of existing GNN work. While our methodology does not change the amount of homophily in a network, moving beyond a reliance on homophily can be key to improving the fairness, diversity and performance of applications that use GNNs. We hope that this paper will raise more awareness and discussion regarding the homophily limitations of existing GNN models, and help researchers design models that can learn in both homophily and heterophily settings.


References
  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. 2019. MixHop: Higher-Order Graph Convolution Architectures via Sparsified Neighborhood Mixing. In International Conference on Machine Learning (ICML).
  • Barabasi and Albert (1999) A. L. Barabasi and R. Albert. 1999. Emergence of scaling in random networks. Science 286, 5439 (October 1999), 509–512.
  • Bojchevski and Günnemann (2018) Aleksandar Bojchevski and Stephan Günnemann. 2018. Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. In International Conference on Learning Representations.
  • Chami et al. (2020) Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Re, and Kevin Murphy. 2020. Machine Learning on Graphs: A Model and Comprehensive Taxonomy. CoRR abs/2005.03675 (2020).
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems. 3844–3852.
  • Eswaran et al. (2017) Dhivya Eswaran, Stephan Günnemann, Christos Faloutsos, Disha Makhija, and Mohit Kumar. 2017. Zoobp: Belief propagation for heterogeneous networks. Proceedings of the VLDB Endowment 10, 5 (2017), 625–636.
  • Gatterbauer et al. (2015) Wolfgang Gatterbauer, Stephan Günnemann, Danai Koutra, and Christos Faloutsos. 2015. Linearized and Single-Pass Belief Propagation. 8, 5 (2015).
  • Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS.
  • Hou et al. (2020) Yifan Hou, Jian Zhang, James Cheng, Kaili Ma, Richard T. B. Ma, Hongzhi Chen, and Ming-Chang Yang. 2020. Measuring and Improving the Use of Graph Information in Graph Neural Networks. In International Conference on Learning Representations.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv preprint arXiv:2005.00687 (2020).
  • J. Neville (2000) J. Neville and D. Jensen. 2000. Iterative classification in relational data. In Proc. AAAI Workshop on Learning Statistical Models from Relational Data. 13–20.
  • Karimi et al. (2017) Fariba Karimi, Mathieu Génois, Claudia Wagner, Philipp Singer, and Markus Strohmaier. 2017. Visibility of minorities in social networks. arXiv preprint arXiv:1702.00150 (2017).
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907 (2016).
  • Klicpera et al. (2019) Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion Improves Graph Learning. In Conference on Neural Information Processing Systems (NeurIPS).
  • Koutra et al. (2011) Danai Koutra, Tai-You Ke, U Kang, Duen Horng Chau, Hsing-Kuo Kenneth Pao, and Christos Faloutsos. 2011. Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms. 245–260.
  • Lu and Getoor (2003) Qing Lu and Lise Getoor. 2003. Link-Based Classification. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (ICML). AAAI Press, 496–503.
  • McPherson et al. (2001) Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology 27, 1 (2001), 415–444.
  • Namata et al. (2012) Galileo Namata, Ben London, Lise Getoor, and Bert Huang. 2012. Query-driven active surveying for collective classification.
  • Newman (2018) Mark Newman. 2018. Networks. Oxford university press.
  • Pandit et al. (2007) Shashank Pandit, Duen Horng Chau, Samuel Wang, and Christos Faloutsos. 2007. NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. In WWW. ACM, 201–210.
  • Pei et al. (2020) Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. 2020. Geom-GCN: Geometric Graph Convolutional Networks. In International Conference on Learning Representations.
  • Qu et al. (2019) Meng Qu, Yoshua Bengio, and Jian Tang. 2019. GMNN: Graph Markov Neural Networks. In International Conference on Machine Learning. 5241–5250.
  • Rossi et al. (2020) Ryan A. Rossi, Di Jin, Sungchul Kim, Nesreen Ahmed, Danai Koutra, and John Boaz Lee. 2020. From Community to Role-based Graph Embeddings. ACM Transactions on Knowledge Discovery from Data (TKDD) (2020).
  • Rozemberczki et al. (2019) Benedek Rozemberczki, Carl Allen, and Rik Sarkar. 2019. Multi-scale Attributed Node Embedding. arXiv:cs.LG/1909.13021
  • Sen et al. (2008b) Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008b. Collective Classification in Network Data. AI Magazine 29, 3 (2008), 93–106.
  • Sen et al. (2008a) Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. 2008a. Collective Classification in Network Data. AI Magazine 29, 3 (2008), 93–106.
  • Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of Graph Neural Network Evaluation. Relational Representation Learning Workshop, NeurIPS 2018 (2018).
  • Shuman et al. (2013) David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE signal processing magazine 30, 3 (2013), 83–98.
  • Stretcu et al. (2019) Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios, Sujith Ravi, and Andrew Tomkins. 2019. Graph Agreement Models for Semi-Supervised Learning. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8713–8723.
  • Sun and Han (2012) Yizhou Sun and Jiawei Han. 2012. Mining Heterogeneous Information Networks: Principles and Methodologies. Morgan & Claypool Publishers.
  • Tang et al. (2009) Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. 2009. Social influence analysis in large-scale networks. In KDD. ACM, 807–816.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. International Conference on Learning Representations (2018).
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying Graph Convolutional Networks. In International Conference on Machine Learning. 6861–6871.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (Proceedings of Machine Learning Research), Vol. 80. PMLR, 5449–5458.
  • Yang et al. (2016) Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861 (2016).
  • Yedidia et al. (2003) J.S. Yedidia, W.T. Freeman, and Y. Weiss. 2003. Understanding Belief Propagation and its Generalizations. 8 (2003), 236–239.
  • Zhang et al. (2019) Si Zhang, Hanghang Tong, Jiejun Xu, and Ross Maciejewski. 2019. Graph convolutional networks: a comprehensive review. Computational Social Networks (2019).
  • Zhang et al. (2020) Z. Zhang, P. Cui, and W. Zhu. 2020. Deep Learning on Graphs: A Survey. IEEE Transactions on Knowledge and Data Engineering (TKDE) (2020).
  • Zhu (2005) X. Zhu. 2005. Semi-supervised learning with graphs. (2005).

Appendix A Nomenclature

We summarize the main symbols used in this work and their definitions below:

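Symbols    Definitions
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$    graph $\mathcal{G}$ with nodeset $\mathcal{V}$, edgeset $\mathcal{E}$
$\mathbf{A}$    adjacency matrix of $\mathcal{G}$
$\mathbf{X}$    node feature matrix of $\mathcal{G}$
$\mathbf{x}_v$    $F$-dimensional feature vector for node $v$
$\mathbf{L}$    unnormalized graph Laplacian matrix
$\mathcal{Y}$    set of class labels
$y_v$    class label for node $v \in \mathcal{V}$
$\mathbf{y}$    $|\mathcal{V}|$-dimensional vector of class labels (for all the nodes)
$\mathcal{T}_\mathcal{V} = \{(v_1, y_1), (v_2, y_2), \ldots\}$    training data for semi-supervised node classification
$N(v)$    general type of neighbors of node $v$ in graph $\mathcal{G}$
$\bar{N}(v)$    general type of neighbors of node $v$ in $\mathcal{G}$ without self-loops (i.e., excluding $v$)
$N_i(v)$, $\bar{N}_i(v)$    $i$-hop/step neighbors of node $v$ in $\mathcal{G}$ (at exactly distance $i$) maybe-with/without self-loops, resp.
$\mathcal{E}_2$    set of pairs of nodes $(u, v)$ with shortest distance between them being 2
$d$, $d_{\max}$    node degree, and maximum node degree across all nodes $v \in \mathcal{V}$, resp.
$h$    edge homophily ratio
$\mathbf{H}$    class compatibility matrix
$\mathbf{r}_v^{(k)}$    node representations learned in GNN model at round / layer $k$
$K$    the number of rounds in the neighborhood aggregation stage
$\mathbf{W}$    learnable weight matrix for GNN model
$\sigma$    non-linear activation function
$\Vert$    vector concatenation operator
AGGR    function that aggregates node feature representations within a neighborhood
COMBINE    function that combines feature representations from different neighborhoods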
Table A.1: Major symbols and definitions.

Appendix B Homophily and Heterophily: Compatibility Matrix

As we mentioned in § 1, the edge homophily ratio $h$ in Definition 1 gives an overall trend for all the edges in the graph. The actual level of homophily may vary across pairs of node classes, i.e., the tendency of connection differs between each pair of classes. For instance, in an online purchasing network Pandit et al. (2007) with three classes (fraudsters, accomplices, and honest users), fraudsters connect with higher probability to accomplices and honest users. Moreover, within the same network, it is possible that some pairs of classes exhibit homophily, while others exhibit heterophily. In belief propagation Yedidia et al. (2003), a message-passing algorithm used for inference on graphical models, the different levels of homophily or affinity between classes are captured via the class compatibility, propagation or coupling matrix, which is typically pre-defined based on domain knowledge. In this work, we define the empirical class compatibility matrix $\mathbf{H}$ as follows:

Definition 4

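The class compatibility matrix $\mathbf{H}$ has entries $[\mathbf{H}]_{i,j}$ that capture the fraction of outgoing edges from a node in class $i$ to a node in class $j$:

$$[\mathbf{H}]_{i,j} = \frac{\left|\{(u,v) : (u,v) \in \mathcal{E} \wedge y_u = i \wedge y_v = j\}\right|}{\left|\{(u,v) : (u,v) \in \mathcal{E} \wedge y_u = i\}\right|}$$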

By definition, the class compatibility matrix is a stochastic matrix, with each row summing up to 1.
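The empirical compatibility matrix can be tabulated in a single pass over the edges. The following is a minimal sketch, assuming integer class labels in `0..num_classes-1` and an undirected graph stored as directed edges in both directions; the function name `compatibility_matrix` and the toy graph are hypothetical, and the all-zero row for a class with no outgoing edges is a choice made here, not something the definition specifies.

```python
def compatibility_matrix(edges, labels, num_classes):
    """Row-normalized counts of directed edges between classes."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for u, v in edges:
        counts[labels[u]][labels[v]] += 1
    H = []
    for row in counts:
        total = sum(row)
        # Assumption: a class with no outgoing edges gets an all-zero row.
        H.append([c / total if total else 0.0 for c in row])
    return H


# Hypothetical undirected toy graph, expanded to both edge directions.
undirected = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
edges = undirected + [(v, u) for u, v in undirected]
labels = [0, 0, 1, 1]
H = compatibility_matrix(edges, labels, 2)
```

In this toy graph each class sends more edges to the other class than to itself, so the off-diagonal entries dominate, and each row still sums to 1 as the stochastic-matrix property requires.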

Appendix C Proofs and Discussions of Theorems

C.1 Detailed Analysis of Theorem 1

Proof 1 (for Theorem 1)

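We first discuss the GCN layer formulated as $f(\mathbf{A}, \mathbf{X}; \mathbf{W}) = (\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}$. Given training set $\mathcal{T}_\mathcal{V}$, the goal of the training process is to optimize the weight matrix $\mathbf{W}$ to minimize the loss function $\mathcal{L}([\mathbf{z}]_{\mathcal{T}_\mathcal{V},:}, [\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:})$, where $[\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:}$ is the one-hot encoding of class labels provided in the training set, and $[\mathbf{z}]_{\mathcal{T}_\mathcal{V},:} = [(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_\mathcal{V},:}\mathbf{W}$ is the predicted probability distribution of class labels for each node $v$ in the training set $\mathcal{T}_\mathcal{V}$.

Without loss of generality, we reorder $\mathcal{T}_\mathcal{V}$ accordingly such that the one-hot encoding of labels for nodes in training set $[\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:}$ is in increasing order of the class label $y_v$:

$$[\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$

Now we look into the term $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_\mathcal{V},:}$, which is the aggregated feature vectors within neighborhood $N(v)$ for nodes in the training set. Since we assumed that all nodes in $\mathcal{T}_\mathcal{V}$ have degree $d$, proportion $h$ of their neighbors belong to the same class, while proportion $\frac{1-h}{|\mathcal{Y}|-1}$ of them belong to any other class uniformly, and one-hot representations of node features $\mathbf{x}_v = \mathrm{onehot}(y_v)$ for each node $v$, we obtain:

$$[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_\mathcal{V},:} = \begin{bmatrix} hd+1 & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \\ \vdots & \vdots & & \vdots \\ hd+1 & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \\ \frac{(1-h)d}{|\mathcal{Y}|-1} & hd+1 & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \\ \vdots & \vdots & & \vdots \\ \frac{(1-h)d}{|\mathcal{Y}|-1} & hd+1 & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \\ \vdots & \vdots & & \vdots \\ \frac{(1-h)d}{|\mathcal{Y}|-1} & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & hd+1 \\ \vdots & \vdots & & \vdots \\ \frac{(1-h)d}{|\mathcal{Y}|-1} & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & hd+1 \end{bmatrix}$$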

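For $[\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:}$ and $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_\mathcal{V},:}$ derived above, we can find an optimal weight matrix $\mathbf{W}_*$ such that $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_\mathcal{V},:}\mathbf{W}_* = [\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:}$, making the loss $\mathcal{L} = 0$. We can find $\mathbf{W}_*$ as follows: First, sample one node from each class to form a smaller set $\mathcal{T}_S \subset \mathcal{T}_\mathcal{V}$, so that we have:

$$[\mathbf{Y}]_{\mathcal{T}_S,:} = \mathbf{I}_{|\mathcal{Y}| \times |\mathcal{Y}|}, \qquad [(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_S,:} = \begin{bmatrix} hd+1 & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \\ \frac{(1-h)d}{|\mathcal{Y}|-1} & hd+1 & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{(1-h)d}{|\mathcal{Y}|-1} & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & hd+1 \end{bmatrix}$$

Note that $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_S,:}$ is a circulant matrix with a constant diagonal and constant off-diagonal entries; its inverse exists whenever the diagonal and off-diagonal entries differ, i.e., whenever $(hd+1)|\mathcal{Y}| - 1 - d \neq 0$. Using the Sherman-Morrison formula, we can find its inverse as:

$$\left([(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_S,:}\right)^{-1} = \frac{|\mathcal{Y}|-1}{(hd+1)|\mathcal{Y}|-1-d}\left(\mathbf{I} - \frac{(1-h)d}{(|\mathcal{Y}|-1)(d+1)}\,\mathbf{1}\mathbf{1}^\top\right)$$

Let $\mathbf{W}_* = \left([(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_S,:}\right)^{-1}$, and we have $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_S,:}\mathbf{W}_* = [\mathbf{Y}]_{\mathcal{T}_S,:} = \mathbf{I}_{|\mathcal{Y}| \times |\mathcal{Y}|}$. It is also easy to verify that $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_{\mathcal{T}_\mathcal{V},:}\mathbf{W}_* = [\mathbf{Y}]_{\mathcal{T}_\mathcal{V},:}$. $\mathbf{W}_*$ is the optimal weight matrix we can learn under $\mathcal{T}_\mathcal{V}$, since it satisfies $\mathcal{L} = 0$.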
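The closed-form inverse can be checked numerically with exact rational arithmetic. The sketch below assumes the one-node-per-class aggregated matrix has $hd+1$ on the diagonal and $\frac{(1-h)d}{|\mathcal{Y}|-1}$ elsewhere, as the degree and homophily assumptions of the proof imply; the helper names and the values of $h$, $d$ and $|\mathcal{Y}|$ are hypothetical.

```python
from fractions import Fraction


def aggregated_matrix(h, d, c):
    """c x c matrix of aggregated one-hot features, one row per class."""
    a = h * d + 1                  # same-class neighbors plus the ego node
    b = (1 - h) * d / (c - 1)      # neighbors in each other class
    return [[a if i == j else b for j in range(c)] for i in range(c)]


def closed_form_inverse(h, d, c):
    """Sherman-Morrison inverse of (a - b) I + b 11^T, with a, b as above."""
    a = h * d + 1
    b = (1 - h) * d / (c - 1)
    scale = 1 / (a - b)            # requires a != b
    shift = b / (d + 1)            # since a + (c - 1) b = d + 1
    return [[scale * ((1 if i == j else 0) - shift) for j in range(c)]
            for i in range(c)]


def matmul(P, Q):
    """Plain matrix product for lists of lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


# Hypothetical setting: h = 1/5, degree d = 10, c = 4 classes.
h, d, c = Fraction(1, 5), Fraction(10), 4
M = aggregated_matrix(h, d, c)
W_star = closed_form_inverse(h, d, c)
product = matmul(M, W_star)
identity = [[Fraction(int(i == j)) for j in range(c)] for i in range(c)]
```

Using `Fraction` avoids floating-point round-off, so the product can be compared to the identity matrix exactly rather than within a tolerance.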

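Now consider an arbitrary training datapoint $(v, y_v) \in \mathcal{T}_\mathcal{V}$, and a perturbation added to the neighborhood $N(v)$ of node $v$, such that the number of nodes with a randomly selected class label $y_p \in \mathcal{Y}$, $y_p \neq y_v$, is $\delta_1$ less than expected in $N(v)$. We denote the perturbed graph adjacency matrix as $\mathbf{A}_\Delta$. Without loss of generality, we assume node $v$ has $y_v = 1$, and the perturbed class is $y_p = 2$. In this case we have

$$[(\mathbf{A}_\Delta+\mathbf{I})\mathbf{X}]_{v,:} = \begin{bmatrix} hd+1 & \frac{(1-h)d}{|\mathcal{Y}|-1} - \delta_1 & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \end{bmatrix}$$

Applying the optimal weight matrix $\mathbf{W}_*$ we learned on $\mathcal{T}_\mathcal{V}$ to the aggregated feature on the perturbed neighborhood $[(\mathbf{A}_\Delta+\mathbf{I})\mathbf{X}]_{v,:}$, we obtain $[(\mathbf{A}_\Delta+\mathbf{I})\mathbf{X}]_{v,:}\mathbf{W}_*$, which equals:

$$\begin{bmatrix} 1 + \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)} & \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)} - \frac{(|\mathcal{Y}|-1)\,\delta_1}{(hd+1)|\mathcal{Y}|-1-d} & \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)} & \cdots \end{bmatrix}$$

Notice that we always have $1 + \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)} > \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)}$, thus the GCN layer formulated as $(\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}$ would misclassify only if the following inequality holds:

$$1 + \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)} < \frac{(1-h)d\,\delta_1}{((hd+1)|\mathcal{Y}|-1-d)(d+1)} - \frac{(|\mathcal{Y}|-1)\,\delta_1}{(hd+1)|\mathcal{Y}|-1-d}$$

Solving the above inequality for $\delta_1$, we get the amount of perturbation needed as

$$\begin{cases} \delta_1 > -\frac{(hd+1)|\mathcal{Y}|-1-d}{|\mathcal{Y}|-1}, & \text{when } 0 \leq h < \frac{1+d-|\mathcal{Y}|}{|\mathcal{Y}|d} \\[4pt] \delta_1 < -\frac{(hd+1)|\mathcal{Y}|-1-d}{|\mathcal{Y}|-1}, & \text{when } h > \frac{1+d-|\mathcal{Y}|}{|\mathcal{Y}|d} \end{cases} \tag{9}$$

and the least absolute amount of perturbation needed is $|\delta_1| = \left|\frac{(hd+1)|\mathcal{Y}|-1-d}{|\mathcal{Y}|-1}\right|$.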

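Now we move on to discuss the GCN layer formulated as $f(\mathbf{A}, \mathbf{X}; \mathbf{W}) = \mathbf{A}\mathbf{X}\mathbf{W}$ without self loops. Following similar derivations, we obtain the optimal weight matrix $\mathbf{W}_*$ which makes $\mathcal{L} = 0$ as:

$$\mathbf{W}_* = \left([\mathbf{A}\mathbf{X}]_{\mathcal{T}_S,:}\right)^{-1} = \frac{|\mathcal{Y}|-1}{(h|\mathcal{Y}|-1)d}\left(\mathbf{I} - \frac{1-h}{|\mathcal{Y}|-1}\,\mathbf{1}\mathbf{1}^\top\right)$$

Again if for an arbitrary $(v, y_v) \in \mathcal{T}_\mathcal{V}$, a perturbation is added to the neighborhood $N(v)$ of the node $v$, such that the number of nodes with a randomly selected class label $y_p \neq y_v$ is $\delta_2$ less than expected in $N(v)$, we have (taking $y_v = 1$ and $y_p = 2$ without loss of generality):

$$[\mathbf{A}_\Delta\mathbf{X}]_{v,:} = \begin{bmatrix} hd & \frac{(1-h)d}{|\mathcal{Y}|-1} - \delta_2 & \frac{(1-h)d}{|\mathcal{Y}|-1} & \cdots & \frac{(1-h)d}{|\mathcal{Y}|-1} \end{bmatrix}$$

Then applying the optimal weight matrix $\mathbf{W}_*$ that we learned on $\mathcal{T}_\mathcal{V}$ to the aggregated feature on perturbed neighborhood $[\mathbf{A}_\Delta\mathbf{X}]_{v,:}$, we obtain $[\mathbf{A}_\Delta\mathbf{X}]_{v,:}\mathbf{W}_*$, which equals:

$$\begin{bmatrix} 1 + \frac{(1-h)\,\delta_2}{(h|\mathcal{Y}|-1)d} & \frac{(1-h)\,\delta_2}{(h|\mathcal{Y}|-1)d} - \frac{(|\mathcal{Y}|-1)\,\delta_2}{(h|\mathcal{Y}|-1)d} & \frac{(1-h)\,\delta_2}{(h|\mathcal{Y}|-1)d} & \cdots \end{bmatrix}$$

Thus, the GCN layer formulated as $\mathbf{A}\mathbf{X}\mathbf{W}$ would misclassify when the following inequality holds:

$$1 + \frac{(1-h)\,\delta_2}{(h|\mathcal{Y}|-1)d} < \frac{(1-h)\,\delta_2}{(h|\mathcal{Y}|-1)d} - \frac{(|\mathcal{Y}|-1)\,\delta_2}{(h|\mathcal{Y}|-1)d}$$

Or the amount of perturbation is:

$$\begin{cases} \delta_2 > -\frac{(h|\mathcal{Y}|-1)d}{|\mathcal{Y}|-1}, & \text{when } 0 \leq h < \frac{1}{|\mathcal{Y}|} \\[4pt] \delta_2 < -\frac{(h|\mathcal{Y}|-1)d}{|\mathcal{Y}|-1}, & \text{when } h > \frac{1}{|\mathcal{Y}|} \end{cases} \tag{11}$$

As a result, the least absolute amount of perturbation needed is $|\delta_2| = \left|\frac{(h|\mathcal{Y}|-1)d}{|\mathcal{Y}|-1}\right|$.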

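By comparing the least absolute amount of perturbation needed for both formulations to misclassify ($|\delta_1|$ derived in Eq. (9) for the $(\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}$ formulation; $|\delta_2|$ derived in Eq. (11) for the $\mathbf{A}\mathbf{X}\mathbf{W}$ formulation), we can see that $|\delta_1| = |\delta_2|$ if and only if $\delta_1 = -\delta_2$, which happens when $h = \frac{1-|\mathcal{Y}|+2d}{2|\mathcal{Y}|d}$. When $h < \frac{1-|\mathcal{Y}|+2d}{2|\mathcal{Y}|d}$ (heterophily), we have $|\delta_1| < |\delta_2|$, which means the $(\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}$ formulation is less robust to perturbation than the $\mathbf{A}\mathbf{X}\mathbf{W}$ formulation.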
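The comparison between the two formulations can be spot-checked numerically. This sketch assumes that the least absolute perturbation for each formulation equals the absolute gap between the same-class and other-class entries of the aggregated one-hot features ($hd+1$ versus $\frac{(1-h)d}{|\mathcal{Y}|-1}$ with self-loops, and $hd$ versus the same value without), which follows from the one-hot setup of the proof; the function names and the chosen $d$ and $|\mathcal{Y}|$ are hypothetical.

```python
from fractions import Fraction


def least_perturbation(h, d, c, self_loop):
    """|delta|: gap between same-class and other-class aggregated counts."""
    a = h * d + (1 if self_loop else 0)
    b = (1 - h) * d / (c - 1)
    return abs(a - b)


def crossover(d, c):
    """Homophily ratio at which both formulations need the same |delta|."""
    return Fraction(1 - c + 2 * d, 2 * c * d)


d, c = 10, 4                                 # hypothetical degree and #classes
h_star = crossover(d, c)

low, high = Fraction(1, 10), Fraction(1, 2)  # below and above the crossover
delta1_low = least_perturbation(low, d, c, self_loop=True)
delta2_low = least_perturbation(low, d, c, self_loop=False)
delta1_high = least_perturbation(high, d, c, self_loop=True)
delta2_high = least_perturbation(high, d, c, self_loop=False)
```

Below the crossover the self-loop formulation tolerates a smaller perturbation than the formulation without self-loops, and the ordering flips above it, matching the direction of the inequality stated in the proof.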


From the above proof, we can see that the least absolute amount of perturbation $|\delta|$ needed for both GCN formulations is a function of the assumed homophily ratio $h$, the node degree $d$ for each node in the training set $\mathcal{T}_\mathcal{V}$, and the size of the class label set $|\mathcal{Y}|$. Fig. 4 shows the plots of $|\delta_1|$ and $|\delta_2|$ as functions of $h$, $|\mathcal{Y}|$ and $d$: from Fig. 3(a), we can see that the least absolute amount of perturbation $|\delta|$ needed for both formulations first decreases as the assumed homophily level $h$ increases, until $\delta$ reaches 0, where the GCN layer predicts the same probability for all class labels; after that, $\delta$ decreases further below 0, and $|\delta|$ increases as $h$ increases; the $(\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}$ formulation is less robust to perturbation than the $\mathbf{A}\mathbf{X}\mathbf{W}$ formulation at low homophily level until $h = \frac{1-|\mathcal{Y}|+2d}{2|\mathcal{Y}|d}$ as our proof shows, where