H2GCN
Boost learning for GNNs from the graph structure under challenging heterophily settings. (NeurIPS'20)
We investigate the representation power of graph neural networks in the semi-supervised node classification task under heterophily or low homophily, i.e., in networks where connected nodes may have different class labels and dissimilar features. Most existing GNNs fail to generalize to this setting, and are even outperformed by models that ignore the graph structure (e.g., multilayer perceptrons). Motivated by this limitation, we identify a set of key designs (ego- and neighbor-embedding separation, higher-order neighborhoods, and combination of intermediate representations) that boost learning from the graph structure under heterophily, and combine them into a new graph convolutional neural network, H2GCN. Going beyond the traditional benchmarks with strong homophily, our empirical analysis on synthetic and real networks shows that, thanks to the identified designs, H2GCN has consistently strong performance across the full spectrum of low-to-high homophily, unlike competitive prior models without them.
We focus on the effectiveness of graph neural networks (GNNs) Zhang et al. (2020) in tackling the semi-supervised node classification task in challenging settings: the goal of the task is to infer the unknown labels of the nodes by using the network structure Zhu (2005), given partially labeled networks with node features (or attributes). Unlike most prior work that considers networks with strong homophily, we study the representation power of GNNs in settings with different levels of homophily or class label smoothness.
Homophily is a key principle of many real-world networks, whereby linked nodes often belong to the same class or have similar features ("birds of a feather flock together") McPherson et al. (2001). For example, friends are likely to have similar political beliefs or age, and papers tend to cite papers from the same research area Newman (2018). GNNs model the homophily principle by propagating features and aggregating them within various graph neighborhoods via different mechanisms (e.g., averaging, LSTM) Kipf and Welling (2016); Hamilton et al. (2017); Veličković et al. (2018). However, in the real world, there are also settings where "opposites attract", leading to networks with heterophily: linked nodes are likely from different classes or have dissimilar features. For instance, the majority of people tend to connect with people of the opposite gender in dating networks; different amino acid types are more likely to connect in protein structures; and fraudsters are more likely to connect to accomplices than to other fraudsters in online purchasing networks Pandit et al. (2007).
Since GNNs assume strong homophily, most existing models fail to generalize to networks with heterophily (or low/medium level of homophily). In such cases, we find that even models that ignore the graph structure altogether, such as multilayer perceptrons or MLPs, can outperform a number of existing GNNs. Motivated by this limitation, we make the following contributions:
Current Limitations: We reveal the limitation of GNNs in learning over networks with heterophily, which has been overlooked in the literature due to evaluation on a few benchmarks with similar properties. § 3
Key Design Choices for Heterophily & New Model: We identify a set of key design choices that can boost learning from the graph structure in heterophily settings without trading off accuracy in homophily settings: (D1) ego- and neighbor-embedding separation, (D2) higher-order neighborhoods, and (D3) combination of intermediate representations. We justify the designs theoretically and empirically, combine them into a new model, H2GCN, that effectively adapts to both heterophily and homophily settings, and compare our framework to prior GNN models. § 3-4
Extensive Empirical Evaluation: We empirically analyze our model and competitive existing GNN models on both synthetic and real networks covering the full spectrum of low-to-high homophily (besides the typically-used benchmarks with high homophily). We show that H2GCN has consistently strong performance, unlike existing models tailored to homophily. § 5
We summarize our notation in Table A.1 (App. A). Let G = (V, E) be an undirected, unweighted graph with node set V and edge set E. We denote a general neighborhood centered around node v as N(v) (G may have self-loops), the corresponding neighborhood that does not include the ego (node v) as N̄(v), and the general neighbors of node v at exactly i hops/steps away (minimum distance) as N_i(v). For example, N_1(v) = {u : (u, v) ∈ E} are the immediate neighbors of v. Other examples are shown in Fig. 1. We represent the graph by its adjacency matrix A ∈ {0, 1}^{n×n} and its node feature matrix X ∈ R^{n×F}, where the vector x_v corresponds to the ego-feature of node v, and {x_u : u ∈ N̄(v)} to its neighbor-features. We further assume a class label vector y, which for each node v contains a unique class label y_v. The goal of semi-supervised node classification is to learn a mapping ℓ : V → Y, where Y is the set of class labels, given a set of labeled nodes T_V = {(v, y_v) : v ∈ V} as training data.
Graph neural networks. From a probabilistic perspective, most GNN models assume the following local Markov property on node features: for each node v, there exists a neighborhood N(v) such that the label y_v only depends on the ego-feature x_v and the neighbor-features {x_u : u ∈ N(v)}. Most models derive the class label y_v via the following representation learning approach:
r_v^(k) = f(r_v^(k-1), {r_u^(k-1) : u ∈ N(v)}),   y_v = arg max{softmax(r_v^(K) W)},   (1)

where the embedding function f is applied repeatedly for K total rounds, node v's representation (or hidden state vector) at round k, r_v^(k) (initialized as r_v^(0) = x_v), is learned from its ego- and neighbor-representations in the previous round, and a softmax classifier with learnable weight matrix W is applied to the final representation of v, r_v^(K). Most existing models differ in their definitions of neighborhoods N(v) and embedding function f. A typical definition of neighborhood is N_1(v), i.e., the 1-hop neighbors of v. As for f, in graph convolutional networks (GCN) Kipf and Welling (2016), each node repeatedly averages its own features and those of its neighbors to update its own feature representation. Using an attention mechanism, GAT Veličković et al. (2018) models the influence of different neighbors more precisely as a weighted average of the ego- and neighbor-features. GraphSAGE Hamilton et al. (2017) generalizes the aggregation beyond averaging, and models the ego-features distinctly from the neighbor-features in its subsampled neighborhood.

Homophily and heterophily. In this work, we focus on heterophily in class labels. We first define the edge homophily ratio h as a measure of the graph homophily level, and use it to define graphs with strong homophily/heterophily:
Definition 1. The edge homophily ratio h = |{(u, v) : (u, v) ∈ E ∧ y_u = y_v}| / |E| is the fraction of edges in a graph that connect nodes with the same class label (i.e., intra-class edges).
Definition 2. Graphs with strong homophily have a high edge homophily ratio h → 1, while graphs with strong heterophily (i.e., low/weak homophily) have a small edge homophily ratio h → 0.
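The edge homophily ratio of Dfn. 1 is straightforward to compute from an edge list. The sketch below (our own illustration in Python/NumPy; the function name and toy graphs are not from the paper) exercises the two extremes of Dfn. 2:

```python
import numpy as np

def edge_homophily(edges, labels):
    """Edge homophily ratio h (Dfn. 1): fraction of edges whose two
    endpoints share the same class label (intra-class edges)."""
    edges = list(edges)
    intra = sum(1 for u, v in edges if labels[u] == labels[v])
    return intra / len(edges)

labels = np.array([0, 1, 0, 1])
hetero_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # labels alternate around a 4-cycle
homo_pairs = [(0, 2), (1, 3)]                    # both edges join same-label nodes
print(edge_homophily(hetero_cycle, labels))  # 0.0 -> strong heterophily
print(edge_homophily(homo_pairs, labels))    # 1.0 -> strong homophily
```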
The edge homophily ratio in Dfn. 1 gives an overall trend for all the edges in the graph. The actual level of homophily may vary across different pairs of node classes, i.e., there is a different tendency of connection between each pair of classes. In App. B, we give more details on capturing these more complex network characteristics via an empirical class compatibility matrix H, whose (i, j)-th entry is the fraction of outgoing edges to nodes in class j among all outgoing edges from nodes in class i.
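The empirical class compatibility matrix described above can be estimated by counting edge endpoints per class pair. A minimal sketch (our own illustrative code, which treats each undirected edge as two directed ones):

```python
import numpy as np

def compatibility_matrix(edges, labels, n_classes):
    """Empirical class compatibility matrix: entry (i, j) is the fraction
    of outgoing edges from class-i nodes that land on class-j nodes."""
    H = np.zeros((n_classes, n_classes))
    for u, v in edges:
        H[labels[u], labels[v]] += 1   # edge seen from u's side
        H[labels[v], labels[u]] += 1   # edge seen from v's side (undirected)
    totals = H.sum(axis=1, keepdims=True)
    return H / np.maximum(totals, 1)   # row-normalize, guarding empty classes

labels = np.array([0, 1, 0, 1])
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # perfectly heterophilous 4-cycle
print(compatibility_matrix(cycle, labels, 2))
# [[0. 1.]
#  [1. 0.]]
```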
Heterophily vs. heterogeneity. We remark that heterophily, which we study in this work, is a distinct network concept from heterogeneity. Formally, a network is heterogeneous Sun and Han (2012) if it has at least two types of nodes and different relationships between them (e.g., knowledge graphs), and homogeneous if it has a single type of nodes (e.g., users) and a single type of edges (e.g., friendship). The type of nodes in heterogeneous graphs does not necessarily match the class labels y_v; therefore, both homogeneous and heterogeneous networks may have different levels of homophily.

[Table 1: mean accuracy and standard deviation over three runs on the synthetic benchmark syn-cora, under one high-heterophily and one high-homophily ratio, for GCN Kipf and Welling (2016), GAT Veličković et al. (2018), GCN-Cheby Defferrard et al. (2016), GraphSAGE Hamilton et al. (2017), MixHop Abu-El-Haija et al. (2019), MLP, and H2GCN (ours) (cf. App. G).]
While many GNN models have been proposed, most of them are designed under the assumption of homophily, and are not capable of handling heterophily. As a motivating example, Table 1 shows the mean classification accuracy of several leading GNN models on our synthetic benchmark syn-cora, where we can control the homophily/heterophily level (see App. G for details on the data and setup). Here we consider two homophily ratios h, one for high heterophily and one for high homophily. We observe that under heterophily all existing methods fail to perform better than a multilayer perceptron (MLP) with 1 hidden layer, a graph-agnostic baseline that relies solely on the node features for classification (differences in the accuracy of MLP for different h are due to randomness). In particular, GCN Kipf and Welling (2016), GAT Veličković et al. (2018) and MixHop Abu-El-Haija et al. (2019) show up to 42% worse performance than MLP, highlighting that methods that work well under high homophily may not be appropriate for networks with low/medium homophily.
Motivated by this limitation, in the following subsections we discuss and theoretically justify a set of key design choices that, when appropriately incorporated in a GNN framework, can improve performance in the challenging heterophily settings. Then, we present H2GCN, a model that, thanks to these designs, adapts well to both homophily and heterophily (Table 1, last row). In Section 5, we provide a comprehensive empirical analysis on both synthetic and real data with varying homophily levels, and show that H2GCN performs consistently well across different levels, improving over MLP by effectively leveraging the graph structure in challenging settings.
We have identified three key designs that, when appropriately integrated, can help improve the performance of GNN models in heterophily settings: (D1) ego- and neighbor-embedding separation; (D2) higher-order neighborhoods; and (D3) combination of intermediate representations.
The first design entails encoding each ego-embedding (i.e., a node's embedding) separately from the aggregated embeddings of its neighbors, since they are likely to be dissimilar in heterophily settings. Formally, the representation (or hidden state vector) learned for each node v at round k is given as:
r_v^(k) = COMBINE(r_v^(k-1), AGGR({r_u^(k-1) : u ∈ N̄(v)})),   (2)
where the neighborhood N̄(v) does not include v (no self-loops), the AGGR function aggregates representations only from the neighbors (in some way, e.g., average), and AGGR and COMBINE may be followed by a non-linear transformation. For heterophily, after aggregating the neighbors' representations, the definition of COMBINE (akin to a 'skip connection' between layers) is critical: a simple way of combining the ego- and the aggregated neighbor-embeddings without 'mixing' them is to concatenate them, rather than average all of them as in the GCN model by Kipf and Welling (2016).

Intuition. In heterophily settings, by definition (Dfn. 2), the class label y_v and original features x_v of a node and those of its neighboring nodes (esp. the direct neighbors N_1(v)) may be different. However, the typical GCN design that mixes the embeddings through an average Kipf and Welling (2016) or weighted average Veličković et al. (2018) as the COMBINE function results in final embeddings that are similar across neighboring nodes (especially within a community or cluster) for any set of original features Rossi et al. (2020). While this may work well in the case of homophily, where neighbors likely belong to the same cluster and class, it poses severe challenges in the case of heterophily: it is not possible to distinguish neighbors from different classes based on the (similar) learned representations. Choosing a COMBINE function that separates the representations of each node v and its neighbors N̄(v) allows for more expressiveness, where the skipped or non-aggregated representations can evolve separately over multiple rounds of propagation without becoming prohibitively similar.
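To make design D1 concrete, the toy sketch below (our own illustration, not the paper's code) contrasts a GCN-style averaging COMBINE, which dilutes the ego signal among dissimilar neighbors, with a concatenation COMBINE that keeps the two embeddings in separate coordinates:

```python
import numpy as np

def gcn_style_combine(x_ego, x_neigh):
    """GCN-style mixing: the ego is averaged together with its neighbors,
    so under heterophily its own signal is diluted."""
    return (x_ego + x_neigh.sum(axis=0)) / (1 + len(x_neigh))

def separated_combine(x_ego, x_neigh):
    """Design D1: concatenate the ego-embedding with the aggregated
    neighbor-embedding, keeping the two in separate slots."""
    return np.concatenate([x_ego, x_neigh.mean(axis=0)])

ego = np.array([1.0, 0.0])               # ego-feature typical of class 0
neigh = np.array([[0.0, 1.0],            # heterophilous neighbors (class 1)
                  [0.0, 1.0]])
print(gcn_style_combine(ego, neigh))     # ego signal diluted toward class 1
print(separated_combine(ego, neigh))     # [1. 0. 0. 1.]: ego signal preserved
```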
Theoretical Justification. We prove theoretically that, under certain conditions, a GCN layer that co-embeds ego- and neighbor-features is less capable of generalizing to heterophily than a layer that embeds them separately. We measure generalization ability by robustness to test/train data deviations. We give the proof of the theorem in App. C.1. Though the theorem applies under specific conditions, our empirical analysis shows that it holds in more general cases (§ 5).
Theorem 1. Consider a graph G without self-loops (§ 2) with node features x_v for each node v, and an equal number of nodes per class in the training set T_V. Also assume that all nodes in T_V have degree d, and proportion h of their neighbors belong to the same class, while proportion (1−h)/(|Y|−1) of them belong to any other class (uniformly). Then, for h below a certain threshold, a simple GCN layer formulated as (A + I)XW is less robust, i.e., it misclassifies a node for smaller train/test data deviations, than a layer that separates the ego- and neighbor-embeddings.
Observations. In Table 1, we observe that GCN, GAT, and MixHop, which 'mix' the ego- and neighbor-embeddings explicitly (these models consider self-loops, which turn each ego into a neighbor of itself, and thus mix the ego- and neighbor-representations; e.g., GCN and MixHop operate on the symmetric normalized adjacency matrix augmented with self-loops, Ã = D̃^{-1/2}(A + I)D̃^{-1/2}, where I is the identity matrix and D̃ the degree matrix of A + I), perform poorly in the heterophily setting. On the other hand, GraphSAGE, which separates the embeddings (e.g., it concatenates the two embeddings and then applies a non-linear transformation), achieves 33-40% better performance in this setting.
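For reference, the self-loop-augmented normalization used by GCN and MixHop can be computed as follows (a dense NumPy sketch for small graphs; real implementations use sparse matrices):

```python
import numpy as np

def sym_norm_adj(A):
    """GCN-style normalization: D^{-1/2} (A + I) D^{-1/2}, where D is the
    degree matrix of A + I. The added self-loops make every ego a neighbor
    of itself, mixing ego- and neighbor-representations (violating D1)."""
    A_loop = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_loop.sum(axis=1))
    return A_loop * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

A = np.array([[0.0, 1.0], [1.0, 0.0]])  # single undirected edge
print(sym_norm_adj(A))                  # every entry 0.5
```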
The second design involves explicitly aggregating information from higher-order neighborhoods in each round k, beyond the immediate neighbors N_1(v) of each node:
r_v^(k) = COMBINE(r_v^(k-1), AGGR({r_u^(k-1) : u ∈ N̄_1(v)}), AGGR({r_u^(k-1) : u ∈ N̄_2(v)}), …),   (3)
where N̄_i(v) corresponds to the neighbors of v at exactly i hops away, and the AGGR functions applied to different neighborhoods can be the same or different. This design augments the implicit aggregation over higher-order neighborhoods that most GNN models achieve through multiple rounds of first-order propagation based on variants of Eq. (2).
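Exact k-hop neighborhoods (minimum distance exactly k, ego excluded) can be derived with repeated matrix products. A dense NumPy sketch (our own illustration, assuming a 0/1 adjacency matrix without self-loops):

```python
import numpy as np

def khop_exact(A, k):
    """Indicator matrix of node pairs at minimum distance exactly k,
    given a 0/1 adjacency matrix A without self-loops."""
    n = A.shape[0]
    reach = np.eye(n, dtype=bool)            # nodes at distance < current hop
    frontier = A.astype(bool)                # nodes at distance exactly 1
    for _ in range(k - 1):
        reach |= frontier
        # expand one hop, then drop anything already reachable sooner
        frontier = ((frontier.astype(int) @ A) > 0) & ~reach
    return frontier.astype(int)

path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # path graph 0-1-2
print(khop_exact(path, 2))   # only the pair (0, 2) is exactly 2 hops apart
```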
Intuition. To show why higher-order neighborhoods help in heterophily settings, we first define homophily-dominant and heterophily-dominant neighborhoods:
Definition 3. A neighborhood N̄(v) is expectedly homophily-dominant if P(y_u = y_v | y_v) ≥ P(y_u = y | y_v) for all u ∈ N̄(v) and all y ∈ Y, y ≠ y_v. If the opposite inequality holds, N̄(v) is expectedly heterophily-dominant.
From this definition, we can see that expectedly homophily-dominant neighborhoods are more beneficial for GNN layers, as in such neighborhoods the class label y_v of each node v can, in expectation, be determined by the majority of the class labels in N̄(v). In the case of heterophily, we have seen empirically that although the immediate neighborhoods may be heterophily-dominant, the higher-order neighborhoods may be homophily-dominant and thus provide more relevant context.
Theoretical Justification. Below we formalize this observation for 2-hop neighborhoods, and prove one case in which they are homophily-dominant in App. C.2:
Theorem 2. Consider a graph G without self-loops (§ 2) with label set Y, where for each node v, its neighbors' class labels are conditionally independent given y_v, and P(y_u = y_v | y_v) = h, P(y_u = y | y_v) = (1−h)/(|Y|−1), ∀ y ≠ y_v. Then, the 2-hop neighborhood N̄_2(v) of a node v will always be homophily-dominant in expectation.
Observations. Under heterophily, GCN-Cheby, which models different neighborhoods by combining Chebyshev polynomials to approximate a higher-order graph convolution operation Defferrard et al. (2016), outperforms GCN and GAT, which aggregate over only the immediate neighbors N_1(v), by up to +31% (Table 1). MixHop, which explicitly models 1-hop and 2-hop neighborhoods (though it 'mixes' the ego- and neighbor-embeddings, violating design D1), also outperforms these two models.
The third design combines the intermediate representations of each node from multiple rounds at the final layer:
r_v^(final) = COMBINE(r_v^(1), r_v^(2), …, r_v^(K)),   (4)
to explicitly capture local and global information via COMBINE functions that leverage each representation separately, e.g., concatenation or LSTM-attention Xu et al. (2018). This design was introduced in jumping knowledge networks Xu et al. (2018) and shown to increase the representation power of GCNs under homophily.
Intuition. Intuitively, each round collects information with different locality—earlier rounds are more local, while later rounds capture increasingly more global information (implicitly, via propagation). Similar to D2 (which models explicit neighborhoods), this design models the distribution of neighbor representations in lowhomophily networks more accurately. It also allows the class prediction to leverage different neighborhood ranges in different networks, adapting to their structural properties.
Theoretical Justification. The benefit of combining intermediate representations can be theoretically explained from the spectral perspective. Assuming a GCN-style layer, where propagation can be viewed as spectral filtering, the adjacency matrix A is a low-pass filter Wu et al. (2019), so intermediate outputs from earlier rounds contain higher-frequency components than outputs from later rounds. At the same time, the following theorem holds for graphs with heterophily, where we view class labels as graph signals (as in graph signal processing):
Theorem 3. Consider two graph signals (label vectors) y_1 and y_2 defined on an undirected graph G, with edge homophily ratios h_1 and h_2, respectively. If h_1 < h_2, then signal y_1 has higher energy (Dfn. 5) in the high-frequency components of the spectrum of the unnormalized graph Laplacian L than y_2.
In other words, in heterophily settings the label distribution contains more information at higher than at lower frequencies (see proof in App. C.3). Thus, by combining the intermediate outputs from different layers, this design captures both low- and high-frequency components in the final representation, which is critical in heterophily settings and allows for more expressiveness in the general setting.
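A small numerical check of this spectral claim (our own construction, not from the paper): on a 6-node cycle with a two-class ±1 label encoding, an alternating signal (edge homophily h = 0) places all of its Laplacian spectral energy at high frequencies, while a two-block signal (h = 2/3) places most of it at low frequencies.

```python
import numpy as np

def spectral_energy(A, y):
    """Energy of signal y in each eigencomponent of the unnormalized
    Laplacian L = D - A (squared graph Fourier coefficients)."""
    L = np.diag(A.sum(axis=1)) - A
    eigvals, U = np.linalg.eigh(L)      # eigenvalues ascending: low -> high frequency
    coeffs = U.T @ y
    return eigvals, coeffs ** 2

n = 6
A = np.zeros((n, n))
for i in range(n):                      # build a 6-cycle
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

y_hetero = np.array([1., -1., 1., -1., 1., -1.])  # alternating labels, h = 0
y_homo   = np.array([1., 1., 1., -1., -1., -1.])  # two blocks, h = 2/3

for name, y in [("hetero", y_hetero), ("homo", y_homo)]:
    freqs, energy = spectral_energy(A, y)
    # fraction of energy above the midpoint of the Laplacian spectrum
    hi = energy[freqs > freqs.mean()].sum() / energy.sum()
    print(name, round(float(hi), 2))
```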
Observations. By concatenating the intermediate representations from two rounds with the embedded ego-representation (following the jumping knowledge framework Xu et al. (2018)), GCN's accuracy under heterophily increases by 20% over its counterpart without design D3 (Table 1).
Summary of designs. To sum up, D1 models (at each layer) the ego- and neighbor-representations distinctly, D2 leverages (at each layer) representations of neighbors at different distances distinctly, and D3 leverages (at the final layer) the learned ego-representations from previous layers distinctly.
We now describe H2GCN, which combines designs D1-D3 to adapt to heterophily. It has three stages (Alg. 1, App. D): (S1) feature embedding, (S2) neighborhood aggregation, and (S3) classification.
The feature embedding stage (S1) uses a graph-agnostic dense layer to generate for each node v the feature embedding r_v^(0) = σ(x_v W_e) based on its ego-feature x_v, where σ is an optional non-linear function and W_e is a learnable weight matrix.
In the neighborhood aggregation stage (S2), the generated embeddings are aggregated and repeatedly updated within the node's neighborhood for K rounds. Following designs D1 and D2, the neighborhood of our framework involves two sub-neighborhoods without the egos: the 1-hop graph neighbors N̄_1(v) and the 2-hop neighbors N̄_2(v), as shown in Fig. 1:
r_v^(k) = COMBINE(AGGR({r_u^(k-1) : u ∈ N̄_1(v)}), AGGR({r_u^(k-1) : u ∈ N̄_2(v)})),   (5)
We set COMBINE to concatenation (so as not to mix different neighborhood ranges), and AGGR to a degree-normalized average of the neighbor-embeddings in sub-neighborhood N̄_i(v):
AGGR({r_u^(k-1) : u ∈ N̄_i(v)}) = Σ_{u ∈ N̄_i(v)} r_u^(k-1) d_{i,u}^{-1/2} d_{i,v}^{-1/2},   (6)
where d_{i,v} = |N̄_i(v)| is the i-hop degree of node v (i.e., the number of nodes in its i-hop neighborhood). Note that, unlike Eq. (2), here we do not combine the ego-embedding of node v with the neighbor-embeddings. We found that removing the typical non-linear embedding transformations per round, as in SGC Wu et al. (2019), works better (App. D.2), and in that case including the ego-embedding only in the final representation avoids redundancies. By design D3, the final representation of each node v combines all its intermediate representations:
r_v^(final) = COMBINE(r_v^(0), r_v^(1), …, r_v^(K)),   (7)
where we empirically find that concatenation works better than max-pooling Xu et al. (2018) as the COMBINE function.

In the classification stage (S3), node v is classified based on its final embedding r_v^(final):
y_v = arg max{softmax(r_v^(final) W_c)},   (8)
where W_c is a learnable weight matrix. We visualize our framework in App. D.
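Putting stages (S1)-(S3) together, here is a compact dense NumPy sketch of an untrained forward pass in the spirit of Eqs. (5)-(8). It is our own simplification, not the released code: we use a row-normalized neighbor average for AGGR (rather than the symmetric degree normalization of Eq. (6)), ReLU for σ, and illustrative weight shapes.

```python
import numpy as np

def h2gcn_forward(A, X, W_e, W_c, K=2):
    """Sketch of the three stages: (S1) ego-feature embedding, (S2) K rounds
    of concatenated 1-hop/2-hop aggregation without the ego (cf. Eq. 5),
    (D3) concatenation of all intermediate representations (cf. Eq. 7),
    (S3) linear classification (cf. Eq. 8; argmax of logits = argmax of softmax)."""
    n = A.shape[0]
    I = np.eye(n, dtype=bool)
    N1 = A.astype(bool) & ~I                                  # 1-hop, ego excluded
    N2 = ((N1.astype(int) @ N1.astype(int)) > 0) & ~N1 & ~I   # exactly 2-hop
    row_avg = lambda M: M / np.maximum(M.sum(1, keepdims=True), 1)

    r = [np.maximum(X @ W_e, 0.0)]                            # (S1), ReLU as sigma
    for _ in range(K):                                        # (S2), no per-round nonlinearity
        r.append(np.hstack([row_avg(N1) @ r[-1],
                            row_avg(N2) @ r[-1]]))            # COMBINE = concat
    r_final = np.hstack(r)                                    # (D3)
    return np.argmax(r_final @ W_c, axis=1)                   # (S3)

rng = np.random.default_rng(0)
n, d, n_classes = 8, 4, 3
A = np.zeros((n, n))
for v in range(n):                                            # a ring graph
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1
X = rng.standard_normal((n, d))
W_e = rng.standard_normal((d, d))
W_c = rng.standard_normal((7 * d, n_classes))  # d + 2d + 4d dims with K = 2
print(h2gcn_forward(A, X, W_e, W_c).shape)     # (8,): one class index per node
```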
The feature embedding stage (S1) takes time proportional to the number of non-0s in the feature matrix X times the dimension of the feature embeddings. The neighborhood aggregation stage (S2) derives the 2-hop neighborhoods via sparse-matrix multiplications, with cost bounded by the number of edges times the maximum degree of all nodes, followed by K rounds of aggregation over the 1-hop and 2-hop neighborhoods. We give a detailed analysis in App. D.
We discuss relevant work on GNNs here, and give other related work (e.g., classification under heterophily) in App. E. Besides the models mentioned above, there are various comprehensive reviews describing previously proposed architectures Zhang et al. (2020); Chami et al. (2020); Zhang et al. (2019). Recent work has investigated GNNs' ability to capture graph information, proposing diagnostic measurements based on feature smoothness and label smoothness Hou et al. (2020) that may guide the learning process. To capture more graph information, other works generalize graph convolution beyond immediate neighborhoods. For example, apart from MixHop Abu-El-Haija et al. (2019) (cf. § 3.1), Graph Diffusion Convolution Klicpera et al. (2019) replaces the adjacency matrix with a sparsified version of a diffusion matrix (e.g., heat kernel or PageRank). Geom-GCN Pei et al. (2020) precomputes unsupervised node embeddings and uses neighborhoods defined by geometric relationships in the resulting latent space to define graph convolution. Some of these works Abu-El-Haija et al. (2019); Pei et al. (2020); Hou et al. (2020) acknowledge the challenges of learning from graphs with heterophily. Others have noted that node labels may have complex relationships that should be modeled directly. For instance, Graph Agreement Models Stretcu et al. (2019) augment the classification task with an agreement task, co-training a model to predict whether pairs of nodes share the same label. Graph Markov Neural Networks Qu et al. (2019)
model the joint label distribution with a conditional random field, trained with expectation maximization using GNNs.
Table 2: Designs D1-D3 in H2GCN and existing GNN models.

Method  D1  D2  D3
GCN Kipf and Welling (2016)  ✗  ✗  ✗
GAT Veličković et al. (2018)  ✗  ✗  ✗
GCN-Cheby Defferrard et al. (2016)  ✗  ✓  ✗
GraphSAGE Hamilton et al. (2017)  ✓  ✗  ✗
MixHop Abu-El-Haija et al. (2019)  ✗  ✓  ✗
H2GCN (proposed)  ✓  ✓  ✓
Comparison of H2GCN to existing GNN models. As shown in Table 2, H2GCN differs from existing GNN models with respect to designs D1-D3 and their implementations (we give more details in App. D). Notably, H2GCN learns a graph-agnostic feature embedding in stage (S1), and skips the non-linear embeddings of aggregated representations per round that other models use (e.g., GraphSAGE, MixHop, GCN), resulting in a simpler yet powerful architecture.
In our analysis, we (1) compare H2GCN to existing GNN models on synthetic and real graphs covering a wide range of low-to-high homophily values, and (2) evaluate the significance of designs D1-D3.
Baseline models. We consider MLP with 1 hidden layer, and all the methods listed in Table 2. For H2GCN, we model the first- and second-order neighborhoods (N̄_1(v) and N̄_2(v)), and consider two variants: H2GCN-1 uses one embedding round (K = 1) and H2GCN-2 uses two rounds (K = 2). We tune all the models on the same train/validation splits (see App. F for details).
We generate synthetic graphs with various homophily ratios h (cf. table below) by adopting an approach similar to Karimi et al. (2017). In App. G, we describe the data generation process, the experimental setup, and the data statistics in detail. All methods share the same training, validation and test splits (25%/25%/50% per class), and we report the average accuracy and standard deviation (stdev) over three generated graphs per heterophily level and benchmark dataset.
Benchmark Name  #Nodes  #Edges  #Classes  #Features  Homophily h  #Graphs
syn-cora  -  -  5  cora Sen et al. (2008b); Yang et al. (2016)  [0, 0.1, …, 1]  (3 per h)
syn-products  -  -  10  ogbn-products Hu et al. (2020)  [0, 0.1, …, 1]  (3 per h)
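A toy version of such a generator (our own simplification of the Karimi et al.-style process, not the exact procedure of App. G) grows a graph node by node, attaching each new edge to a same-class endpoint with probability h:

```python
import numpy as np

def synth_graph(n, n_classes, m, h, seed=0):
    """Grow a graph with approximate target edge homophily h: each new node
    attaches m edges, choosing a same-class endpoint with probability h and
    a different-class endpoint otherwise (uniformly within the chosen group)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n)
    edges = []
    for v in range(n_classes, n):                 # the first few nodes act as seeds
        same = [u for u in range(v) if labels[u] == labels[v]]
        diff = [u for u in range(v) if labels[u] != labels[v]]
        for _ in range(m):
            pool = same if (rng.random() < h and same) else diff
            if pool:
                edges.append((v, int(rng.choice(pool))))
    return edges, labels

for h in (0.1, 0.9):
    edges, labels = synth_graph(n=300, n_classes=2, m=2, h=h)
    achieved = sum(labels[u] == labels[v] for u, v in edges) / len(edges)
    print(h, round(float(achieved), 2))  # achieved homophily tracks the target h
```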
Model comparison. Figure 2 shows the mean test accuracy (and stdev) over all random splits of our synthetic benchmarks. We observe similar trends on both benchmarks: H2GCN has the best trend overall, outperforming the baseline models in most heterophily settings, while tying with other models under homophily. The performance of GCN, GAT and MixHop, which mix the ego- and neighbor-embeddings, increases with respect to the homophily level. But, while they achieve near-perfect accuracy under strong homophily, they are significantly less accurate than MLP (near-flat performance curve, as it is graph-agnostic) in many heterophily settings. GraphSAGE and GCN-Cheby, which leverage some of the identified designs D1-D3 (Table 2, § 3), are more competitive in such settings. We note that all the methods, except GCN and GAT, learn more effectively under perfect heterophily (h = 0) than under weaker heterophily settings, as evidenced by the J-shaped performance curves in low-homophily ranges.
Significance of design choices. Using syn-products, we show the significance of designs D1-D3 (§ 3.1) through ablation studies with variants of H2GCN (Fig. 3, Table G.4).
(D1) Ego- and Neighbor-embedding Separation. We consider H2GCN-1 variants that separate the ego- and neighbor-embeddings and model: (S0) neighborhoods N̄_1(v) and N̄_2(v) (i.e., H2GCN-1); (S1) only the 1-hop neighborhood N̄_1(v) in Eq. (5); and their counterparts that do not separate the two embeddings and use: (NS0) neighborhoods N_1(v) and N_2(v) (which include the ego v); and (NS1) only the 1-hop neighborhood N_1(v). In Fig. 3(a), we see that the two variants that learn separate embedding functions (S0/1) significantly outperform the others (NS0/1) in heterophily settings, which shows that design D1 is critical for success under heterophily. Vanilla H2GCN-1 (S0) performs best for all homophily levels.
(D2) Higher-order Neighborhoods. For this design, we consider three variants of H2GCN-1, each without one specific neighborhood: (N0) without the 0-hop neighborhood (i.e., the ego-embedding); (N1) without N̄_1(v); and (N2) without N̄_2(v). Figure 3(b) shows that H2GCN-1 consistently performs better than all the variants, indicating that combining all sub-neighborhoods works best. Among the variants, in heterophily settings, the ego-embedding contributes most to the performance (N0 causes a significant decrease in accuracy), followed by the two sub-neighborhoods N̄_1(v) and N̄_2(v). However, under strong homophily, the importance of the sub-neighborhoods is reversed. Thus, the ego-features are the most important under heterophily, and higher-order neighborhoods contribute the most under homophily. The design of H2GCN allows it to effectively combine information from different neighborhoods, adapting to all levels of homophily.
(D3) Combination of Intermediate Representations. We consider three variants (K0, K1, K2) of H2GCN-2 that drop from the final representation of Eq. (7) the 0th-, 1st- or 2nd-round intermediate representation, respectively. We also consider keeping only the last intermediate representation as final, which is akin to what the other GNN models do. Figure 3(c) shows that H2GCN-2, which combines all the intermediate representations, performs best, followed by the variant K2 that skips the round-2 representation. The ego-embedding is the most important under heterophily (see the trend of K0).
The challenging case of low-degree nodes. Figure 3(d) plots the mean accuracy of H2GCN variants on syn-products for different node degree ranges, both in a heterophily and in a homophily setting. We observe that under heterophily there is a significantly bigger performance gap between low- and high-degree nodes: 13% for H2GCN-1 (10% for H2GCN-2) vs. less than 3% under homophily. This is likely due to the importance of the distribution of class labels in each neighborhood under heterophily, which is harder to estimate accurately for low-degree nodes with few neighbors. On the other hand, under homophily, neighbors are likely to have similar classes, so the neighborhood size does not have as significant an impact on the accuracy.

We now evaluate the performance of our model and established GNN models on a variety of real-world datasets Tang et al. (2009); Rozemberczki et al. (2019); Sen et al. (2008b); Namata et al. (2012); Bojchevski and Günnemann (2018); Shchur et al. (2018) with edge homophily ratio h ranging from strong heterophily to strong homophily, going beyond the traditional Cora, Pubmed and Citeseer graphs that have strong homophily (hence the good performance of existing GNNs on them). We summarize the data in Table 4 (top), and describe them in App. H, where we also point out potential data limitations. For all benchmarks (except Cora Full), we use the feature vectors, class labels, and 10 random splits (48%/32%/20% of nodes per class for train/validation/test; Pei et al. (2020) claims ratios of 60%/20%/20%, which differs from the actual data splits shared on GitHub) provided by Pei et al. (2020).
Table 4 gives the mean accuracy and stdev of H2GCN variants and other models. We observe that the H2GCN variants have consistently strong performance across the full spectrum of low-to-high homophily: H2GCN-2 achieves the best average rank (2.9) across all datasets (i.e., homophily ratios h), followed by H2GCN-1 (3.7). Other models that use some of the designs D1-D3 (§ 3.1), including GraphSAGE and GCN-Cheby, also perform significantly better than GCN and GAT, which lack these designs. Here, we also report the best results among the three recently-proposed Geom-GCN variants (§ 4), taken directly from the paper Pei et al. (2020): other models (including ours) outperform this method significantly under heterophily. We note that MLP is a competitive baseline under strong heterophily, indicating that the existing models do not use the graph information effectively, or that the latter is misleading in such cases. All models perform poorly on Squirrel and Actor, likely due to their low-quality node features (small correlation with class labels). Also, Squirrel and Chameleon are dense, with many nodes sharing the same neighbors.
Table 4: Dataset statistics (top) and average rank of each method over all datasets (bottom).

Dataset       Texas  Wisconsin  Actor  Squirrel  Chameleon  Cornell  Cora Full  Citeseer  Pubmed  Cora
Hom. ratio h  0.11   0.21       0.22   0.22      0.23       0.3      0.57       0.74      0.8     0.81
#Nodes        183    251        7,600  5,201     2,277      183      19,793     3,327     19,717  2,708
#Edges        295    466        26,752 198,493   31,421     280      63,421     4,676     44,327  5,278
#Classes      5      5          5      5         5          5        70         7         3       6

Avg Rank: H2GCN-2 2.9; H2GCN-1 3.7; GraphSAGE 3.8; GCN-Cheby 3.9; Geom-GCN* 4.6; GCN 5.3; MLP 5.3; MixHop 7.5; GAT* 7.6. (*: result N/A on one dataset.)
We have focused on characterizing the representation power of GNNs in the challenging settings of heterophily or low homophily, which are understudied in the literature. We have highlighted the current limitations of GNNs, presented designs that increase their representation power under heterophily and are theoretically justified with perturbation analysis and graph signal processing, and introduced a new model that adapts to both heterophily and homophily by effectively synthesizing these designs. We analyzed various challenging datasets, going beyond the often-used benchmark datasets (Cora, Pubmed, Citeseer), and leave extending to a larger-scale experimental testbed as future work.
Homophily and heterophily are not intrinsically ethical or unethical; they are both phenomena found in nature, reflected in the popular proverbs "birds of a feather flock together" and "opposites attract". However, existing GNN models implicitly assume homophily, and thus ignore the heterophily that may exist in some networks. As a result, if they are applied to networks that do not satisfy this assumption, the results may be biased, unfair, or erroneous.
Beyond the node classification problem that we tackle in this work, GNN models have been employed in a wide range of applications, such as recommendation systems, analysis of molecules and proteins, and more. In some of these cases, the homophily assumption may have ethical implications: for example, a GNN model that intrinsically assumes homophily may contribute to the so-called "filter bubble" phenomenon in a recommendation system (reinforcing existing beliefs/views, and downplaying the opposite ones), or make minority groups less visible in social networks. In other cases, a reliance on homophily may hinder scientific progress: among other domains, this is critical for the emerging research field of applying GNN models to molecular and protein structures, where the connected nodes often belong to different classes; the performance of existing GNNs may be poor in such cases (as we have shown in our analysis) and could hinder new discoveries. Moreover, if the input data contain many errors (e.g., wrong class labels, noisy network with incorrect and missing links), these may be propagated over the network, and lead to compounding errors in the classification results (this is common in most, if not all, machine learning problems).
Our work has the potential to rectify some of these potential negative consequences of existing GNN work. While our methodology does not change the amount of homophily in a network, moving beyond a reliance on homophily can be a key to improve the fairness, diversity and performance for the applications using GNN. We hope that this paper will raise more awareness and discussions regarding the homophily limitations of existing GNN models, and help researchers design models which have the power of learning in both homophily and heterophily settings.
We summarize the main symbols used in this work and their definitions below:
| Symbols | Definitions |
|---|---|
| $G = (\mathcal{V}, \mathcal{E})$ | graph with node set $\mathcal{V}$, edge set $\mathcal{E}$ |
| $\mathbf{A}$ | adjacency matrix of $G$ |
| $\mathbf{X}$ | node feature matrix of $G$ |
| $\mathbf{x}_v$ | feature vector for node $v$ |
| $\mathbf{L}$ | unnormalized graph Laplacian matrix |
| $\mathcal{Y}$ | set of class labels |
| $y_v$ | class label for node $v$ |
| $\mathbf{y}$ | vector of class labels (for all the nodes) |
| $\mathcal{T}_{\mathcal{V}}$ | training data for semi-supervised node classification |
| $N(v)$ | general type of neighbors of node $v$ in graph $G$ |
| $\bar{N}(v)$ | general type of neighbors of node $v$ in $G$ without self-loops (i.e., excluding $v$) |
| $N_i(v)$, $\bar{N}_i(v)$ | $i$-hop/step neighbors of node $v$ in $G$ (at exactly distance $i$), with/without self-loops, resp. |
| $\mathcal{E}_2$ | set of pairs of nodes with shortest distance between them being 2 |
| $d_v$, $d_{\max}$ | node degree of $v$, and maximum node degree across all nodes $v$, resp. |
| $h$ | edge homophily ratio |
| $\mathbf{H}$ | class compatibility matrix |
| $\mathbf{r}_v^{(k)}$ | node representations learned in GNN model at round / layer $k$ |
| $K$ | the number of rounds in the neighborhood aggregation stage |
| $\mathbf{W}$ | learnable weight matrix for GNN model |
| $\sigma$ | non-linear activation function |
| $\Vert$ | vector concatenation operator |
| AGGR | function that aggregates node feature representations within a neighborhood |
| COMBINE | function that combines feature representations from different neighborhoods |
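As a minimal illustration of the AGGR and COMBINE operators above, the NumPy sketch below performs one neighborhood-aggregation round over 1-hop and 2-hop neighborhoods; mean aggregation and concatenation are illustrative choices for this sketch, not necessarily the exact operators of any particular model:

```python
import numpy as np

def aggr_mean(A, R):
    """AGGR: mean of neighbor representations R within the given neighborhood."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1  # avoid division by zero for nodes with empty neighborhoods
    return (A @ R) / deg

def combine_concat(*reps):
    """COMBINE: concatenate representations from different neighborhoods."""
    return np.concatenate(reps, axis=1)

# Toy graph: 3 nodes on a path (0-1-2), 2-dim one-hot-like features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])

# 2-hop neighborhood: nodes at exactly distance 2 (excluding self-loops).
A2 = ((np.linalg.matrix_power(A, 2) > 0) & (A == 0)).astype(float)
np.fill_diagonal(A2, 0.0)

# One round: combine the ego features with 1-hop and 2-hop aggregations.
R1 = combine_concat(X, aggr_mean(A, X), aggr_mean(A2, X))
print(R1.shape)  # (3, 6)
```

Concatenating the ego representation separately from the neighborhood aggregations mirrors the ego/neighbor-embedding separation design discussed in the paper.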
As we mentioned in § 1, the edge homophily ratio in Definition 1 gives an overall trend for all the edges in the graph. The actual level of homophily may vary among different pairs of node classes, i.e., each pair of classes may have a different tendency of connection. For instance, in an online purchasing network Pandit et al. (2007) with three classes (fraudsters, accomplices, and honest users), fraudsters connect with higher probability to accomplices and honest users. Moreover, within the same network, it is possible that some pairs of classes exhibit homophily, while others exhibit heterophily. In belief propagation Yedidia et al. (2003), a message-passing algorithm used for inference on graphical models, the different levels of homophily or affinity between classes are captured via the class compatibility, propagation, or coupling matrix, which is typically predefined based on domain knowledge. In this work, we define the empirical class compatibility matrix $\mathbf{H}$ as follows: its entries $\mathbf{H}_{ij}$ capture the fraction of outgoing edges from a node in class $i$ to a node in class $j$:

$$\mathbf{H}_{ij} = \frac{|\{(u,v) \in \mathcal{E} : y_u = i,\ y_v = j\}|}{|\{(u,v) \in \mathcal{E} : y_u = i\}|}$$
By definition, the class compatibility matrix is a stochastic matrix, with each row summing up to 1.
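Both the edge homophily ratio and the empirical class compatibility matrix can be computed directly from an edge list; a minimal sketch (each undirected edge is listed in both directions):

```python
from collections import Counter

def edge_homophily(edges, labels):
    """Fraction of (directed) edges connecting same-class endpoints."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def compatibility_matrix(edges, labels, classes):
    """H[i][j]: fraction of outgoing edges from class i that land in class j."""
    counts = Counter((labels[u], labels[v]) for u, v in edges)
    H = []
    for ci in classes:
        row_total = sum(counts[(ci, cj)] for cj in classes)
        H.append([counts[(ci, cj)] / row_total if row_total else 0.0
                  for cj in classes])
    return H  # each row sums to 1 (row-stochastic)

# Toy heterophilous graph: two classes, every edge crosses classes.
labels = {0: "a", 1: "b", 2: "a", 3: "b"}
edges = [(0, 1), (1, 0), (2, 3), (3, 2), (0, 3), (3, 0)]
print(edge_homophily(edges, labels))                    # 0.0
print(compatibility_matrix(edges, labels, ["a", "b"]))  # [[0.0, 1.0], [1.0, 0.0]]
```

In the toy graph, all edges connect class "a" to class "b", so the homophily ratio is 0 and the compatibility matrix is anti-diagonal.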
We first discuss the GCN layer formulated with self-loops as $\mathrm{softmax}\left((\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}\right)$ (degree normalization is omitted, since it is a constant factor in the $d$-regular setting considered here). Given the training set $\mathcal{T}_{\mathcal{V}}$, the goal of the training process is to optimize the weight matrix $\mathbf{W}$ to minimize the cross-entropy loss $\mathcal{L}$, where $\mathbf{Y}$ is the one-hot encoding of the class labels provided in the training set, and $\mathrm{softmax}\left((\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}\right)$ is the predicted probability distribution of class labels for each node in the training set $\mathcal{T}_{\mathcal{V}}$. Without loss of generality, we reorder the nodes such that the one-hot encodings of the labels for nodes in the training set are in increasing order of the class label $y_v$.
Now we look into the term $(\mathbf{A}+\mathbf{I})\mathbf{X}$, which gives the aggregated feature vectors within the neighborhoods of the nodes in the training set. Since we assumed that all nodes in $\mathcal{T}_{\mathcal{V}}$ have degree $d$, that proportion $h$ of their neighbors belong to the same class while proportion $\frac{1-h}{|\mathcal{Y}|-1}$ belong to each other class uniformly, and that the node features are one-hot encodings of the class labels, we obtain for each training node $v$: the entry of $[(\mathbf{A}+\mathbf{I})\mathbf{X}]_v$ at position $y_v$ equals $1 + dh$, and each remaining entry equals $\frac{d(1-h)}{|\mathcal{Y}|-1}$.
For the aggregated features and labels derived above, we can find an optimal weight matrix $\mathbf{W}^*$ such that $(\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}^* = \mathbf{Y}$ on the training set, minimizing the loss $\mathcal{L}$. We can use the following way to find $\mathbf{W}^*$: first, sample one node from each class to form a smaller set $\mathcal{S}$; the corresponding rows of the aggregated features then form the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix

$$\mathbf{M} = \left(1 + dh - \frac{d(1-h)}{|\mathcal{Y}|-1}\right)\mathbf{I} + \frac{d(1-h)}{|\mathcal{Y}|-1}\,\mathbf{J},$$

where $\mathbf{J}$ is the all-ones matrix, and the corresponding rows of $\mathbf{Y}$ form the identity matrix $\mathbf{I}$.

Note that $\mathbf{M}$ is a circulant matrix, therefore its inverse exists (whenever $1 + dh \neq \frac{d(1-h)}{|\mathcal{Y}|-1}$). Using the Sherman-Morrison formula, we can find its inverse as:

$$\mathbf{M}^{-1} = \frac{1}{1 + dh - \frac{d(1-h)}{|\mathcal{Y}|-1}}\left(\mathbf{I} - \frac{1}{1+d}\cdot\frac{d(1-h)}{|\mathcal{Y}|-1}\,\mathbf{J}\right).$$

Let $\mathbf{W}^* = \mathbf{M}^{-1}$, and we have $\mathbf{M}\mathbf{W}^* = \mathbf{I}$. It is also easy to verify that $(\mathbf{A}+\mathbf{I})\mathbf{X}\mathbf{W}^* = \mathbf{Y}$ on the full training set, since every training node in class $i$ has the same aggregated feature vector as the sampled node of class $i$ in $\mathcal{S}$. Hence $\mathbf{W}^*$ is the optimal weight matrix we can learn under $\mathcal{T}_{\mathcal{V}}$, since it minimizes $\mathcal{L}$.
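The closed-form inverse can be checked numerically. The sketch below builds $\mathbf{M} = (a-b)\mathbf{I} + b\mathbf{J}$ with $a = 1 + dh$ and $b = \frac{d(1-h)}{|\mathcal{Y}|-1}$ for example values of $d$, $h$, and $|\mathcal{Y}|$ (chosen arbitrarily for illustration):

```python
import numpy as np

C, d, h = 5, 10, 0.3           # number of classes, node degree, homophily ratio
a = 1 + d * h                  # own-class entry of the aggregated (A+I)X features
b = d * (1 - h) / (C - 1)      # entry for each other class

M = (a - b) * np.eye(C) + b * np.ones((C, C))
# Sherman-Morrison closed form: M^{-1} = 1/(a-b) * (I - b/(1+d) * J),
# using the identity (a-b) + C*b = 1 + d.
W = (np.eye(C) - b / (1 + d) * np.ones((C, C))) / (a - b)

print(np.allclose(M @ W, np.eye(C)))  # True
```

The identity $(a-b) + |\mathcal{Y}|\,b = 1 + d$ is what collapses the Sherman-Morrison correction term to the simple $\frac{b}{1+d}$ factor.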
Now consider an arbitrary training node $v$, and a perturbation added to the neighborhood of node $v$, such that the number of nodes with a randomly selected class label is $\delta$ less than expected in $N(v)$. We denote the perturbed graph adjacency matrix as $\tilde{\mathbf{A}}$. Without loss of generality, we assume node $v$ has $y_v = c_1$, and the perturbed class is $c_2 \neq c_1$. In this case, the entry of $[(\tilde{\mathbf{A}}+\mathbf{I})\mathbf{X}]_v$ at position $c_2$ becomes $\frac{d(1-h)}{|\mathcal{Y}|-1} - \delta$, while the other entries are unchanged.

Applying the optimal weight matrix $\mathbf{W}^*$ we learned on $\mathbf{A}$ to the aggregated features on the perturbed neighborhood, we obtain logits $\mathbf{z} = [(\tilde{\mathbf{A}}+\mathbf{I})\mathbf{X}]_v \mathbf{W}^*$ which satisfy

$$\mathbf{z}_{c_2} - \mathbf{z}_{c_1} = \frac{\delta}{\frac{d(1-h)}{|\mathcal{Y}|-1} - dh - 1} - 1.$$

Notice that we always have $\mathbf{z}_{c_1} > \mathbf{z}_j$ for every class $j \notin \{c_1, c_2\}$; thus, the GCN layer with self-loops would misclassify $v$ only if the following inequality holds:

$$\frac{\delta}{\frac{d(1-h)}{|\mathcal{Y}|-1} - dh - 1} > 1.$$

Solving the above inequality for $\delta$, we get the amount of perturbation needed as

$$\delta^* = \frac{d(1-h)}{|\mathcal{Y}|-1} - dh - 1, \qquad (9)$$

and the least absolute amount of perturbation needed is $|\delta^*|$.
Now we move on to discuss the GCN layer formulated without self-loops as $\mathrm{softmax}(\mathbf{A}\mathbf{X}\mathbf{W})$. Following similar derivations, we obtain the optimal weight matrix $\bar{\mathbf{W}}^*$ which makes $\mathbf{A}\mathbf{X}\bar{\mathbf{W}}^* = \mathbf{Y}$ on the training set as:

$$\bar{\mathbf{W}}^* = \frac{1}{dh - \frac{d(1-h)}{|\mathcal{Y}|-1}}\left(\mathbf{I} - \frac{1}{d}\cdot\frac{d(1-h)}{|\mathcal{Y}|-1}\,\mathbf{J}\right). \qquad (10)$$

Again, if for an arbitrary training node $v$ a perturbation is added to the neighborhood of node $v$, such that the number of nodes with a randomly selected class label $c_2 \neq y_v = c_1$ is $\delta$ less than expected in $N(v)$, the entry of $[\tilde{\mathbf{A}}\mathbf{X}]_v$ at position $c_2$ becomes $\frac{d(1-h)}{|\mathcal{Y}|-1} - \delta$.

Then, applying the optimal weight matrix $\bar{\mathbf{W}}^*$ that we learned on $\mathbf{A}$ to the aggregated features on the perturbed neighborhood, we obtain logits $\bar{\mathbf{z}} = [\tilde{\mathbf{A}}\mathbf{X}]_v \bar{\mathbf{W}}^*$ which satisfy

$$\bar{\mathbf{z}}_{c_2} - \bar{\mathbf{z}}_{c_1} = \frac{\delta}{\frac{d(1-h)}{|\mathcal{Y}|-1} - dh} - 1.$$

Thus, the GCN layer without self-loops would misclassify $v$ when the following inequality holds:

$$\frac{\delta}{\frac{d(1-h)}{|\mathcal{Y}|-1} - dh} > 1,$$

or the amount of perturbation is:

$$\bar{\delta}^* = \frac{d(1-h)}{|\mathcal{Y}|-1} - dh. \qquad (11)$$

As a result, the least absolute amount of perturbation needed is $|\bar{\delta}^*| = \left|\frac{d(1-h)}{|\mathcal{Y}|-1} - dh\right|$.
By comparing the least absolute amount of perturbation needed for both formulations to misclassify ($|\delta^*|$ derived in Eq. (9) for the formulation with self-loops; $|\bar{\delta}^*|$ derived in Eq. (11) for the formulation without self-loops), and noting that $\delta^* = \bar{\delta}^* - 1$, we can see that $|\delta^*| \geq |\bar{\delta}^*|$ if and only if $\bar{\delta}^* \leq \frac{1}{2}$, which happens when $h \geq \frac{1}{|\mathcal{Y}|} - \frac{|\mathcal{Y}|-1}{2d|\mathcal{Y}|}$. When $h < \frac{1}{|\mathcal{Y}|} - \frac{|\mathcal{Y}|-1}{2d|\mathcal{Y}|}$ (heterophily), we have $|\delta^*| < |\bar{\delta}^*|$, which means the formulation with self-loops is less robust to perturbation than the formulation without self-loops.
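A quick numerical sanity check of this comparison, assuming the thresholds take the forms $\bar{\delta}^* = \frac{d(1-h)}{|\mathcal{Y}|-1} - dh$ (without self-loops) and $\delta^* = \bar{\delta}^* - 1$ (with self-loops) from the analysis above:

```python
# Compare the least absolute perturbation needed to misclassify for the
# (A+I) formulation (|delta*|) vs. the A formulation (|delta_bar*|),
# assuming the closed forms from the analysis above.
def thresholds(C, d, h):
    delta_bar = d * (1 - h) / (C - 1) - d * h   # without self-loops
    delta = delta_bar - 1                        # with self-loops (A + I)
    return abs(delta), abs(delta_bar)

C, d = 5, 10
crossover = 1 / C - (C - 1) / (2 * d * C)       # h where |delta*| = |delta_bar*|
for h in (0.05, crossover, 0.5):
    with_self, without_self = thresholds(C, d, h)
    print(f"h={h:.3f}  |delta*|={with_self:.2f}  |delta_bar*|={without_self:.2f}")
# Below the crossover homophily level (heterophily), the (A+I) formulation
# needs strictly less perturbation to misclassify, i.e., it is less robust.
```

With $|\mathcal{Y}| = 5$ and $d = 10$, the crossover lands at $h = 0.16$, below which the self-loop formulation is the more fragile of the two.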
From the above proof, we can see that the least absolute amount of perturbation needed for both GCN formulations is a function of the assumed homophily ratio $h$, the node degree $d$ of each node in the training set, and the size $|\mathcal{Y}|$ of the class label set. Fig. 4 shows the plots of $\delta^*$ and $\bar{\delta}^*$ as functions of $h$, $d$, and $|\mathcal{Y}|$: from Fig. 4(a), we can see that the perturbation threshold for both formulations first decreases as the assumed homophily level $h$ increases, until it reaches 0, where the GCN layer predicts the same probability for all class labels; after that, the threshold decreases further below 0, and its absolute value increases as $h$ increases; the formulation with self-loops is less robust to perturbation than the formulation without self-loops at low homophily levels until $h = \frac{1}{|\mathcal{Y}|} - \frac{|\mathcal{Y}|-1}{2d|\mathcal{Y}|}$, as our proof shows, where