1. Introduction
Graph-structured data is ubiquitous in representing complex interactions between objects (Du et al., 2018; Chen et al., 2020b; Song et al., 2020). Graph Neural Networks (GNNs), as powerful tools for graph data modeling, have been widely developed for various real-world applications (Du et al., 2021; Yao et al., 2022; Wang et al., 2020). Based on the message passing mechanism, GNNs update node representations by aggregating messages from neighbors, thereby concurrently exploiting the rich information inherent in the graph structure and node attributes.
Traditional GNNs (Kipf and Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017) mainly focus on graphs that satisfy the property of homophily (i.e., most connected nodes belong to the same class). However, these GNNs usually cannot perform well on graphs with heterophily (i.e., most connected nodes belong to different classes) for node classification, because message passing between nodes from different classes makes their representations less distinguishable, leading to poor performance on the node classification task. These issues have motivated considerable work on GNNs for heterophily graphs. For example, some studies (Wang et al., 2021; Du et al., 2022b; Yan et al., 2021) adjust the message passing mechanism for heterophilic edges, while others (Abu-El-Haija et al., 2019; Zhu et al., 2020a; Chien et al., 2020; Pei et al., 2020) enlarge the receptive field of message passing. Note that all these works mitigate the distinguishability issue caused by heterophily from the perspective of GNN model design. However, there is an orthogonal and still underexplored perspective: rewiring the graph to reduce heterophily (i.e., increase homophily).
Graph rewiring (Alon and Yahav, 2020; Topping et al., 2021; Franceschi et al., 2019; Chen et al., 2020c) is a family of methods that decouples the input graph from the graph used for message passing, and boosts the performance of GNNs on node classification tasks by changing the message passing structure. Many works have utilized graph rewiring for different tasks. However, most existing graph rewiring techniques have been developed for graphs under homophily-related assumptions (sparsity (Louizos et al., 2017), smoothness (Ortega et al., 2018; Kalofolias, 2016), and low-rank structure (Zhu et al., 2020b)), and thus cannot be directly transferred to heterophily graphs. Unlike existing solutions that design specific GNN architectures adapted to heterophily graphs, in this paper we conduct a comprehensive study on graph rewiring and propose an effective rewiring algorithm to reduce graph heterophily, which makes GNNs perform better on both heterophily and homophily graphs.
First, we demonstrate the effects of increasing the homophily level of heterophily graphs in Sec. 3 with comprehensive controlled experiments. Note that the homophily (and heterophily) level can be measured with the Homophily Ratio (HR) (Pei et al., 2020; Zhu et al., 2020a), formally defined as the average label consistency over all connected node-pairs. From the analysis in Sec. 3.1, we find that both the node-level homophily ratio (Du et al., 2022b; Pei et al., 2020) and the node degree (which reflects the recall of nodes from the same class) affect the performance of GCN on the node classification task, and increasing either of the two leads to better GCN performance. This finding, i.e., that the classification performance of GCN on heterophily graphs can be improved by reducing the heterophily level of the graph, motivates us to design a graph rewiring strategy that increases the homophily level of heterophily graphs so that GNNs perform better on the rewired graphs.
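As a concrete illustration of the edge-level Homophily Ratio described above (the average label consistency over connected node-pairs), the following sketch computes it with numpy; the `(2, E)` COO edge layout is our assumption, not notation from the paper:

```python
import numpy as np

def homophily_ratio(edge_index: np.ndarray, labels: np.ndarray) -> float:
    """Edge-level homophily ratio: the fraction of edges whose two endpoints
    share the same class label. edge_index is a (2, E) array of source/target
    node ids (an assumed COO layout); labels is a (N,) array of class ids."""
    src, dst = edge_index
    return float(np.mean(labels[src] == labels[dst]))

# Toy graph: nodes 0 and 1 share class 0, node 2 has class 1.
edges = np.array([[0, 1, 2], [1, 0, 0]])   # edges 0->1, 1->0, 2->0
y = np.array([0, 0, 1])
print(homophily_ratio(edges, y))           # 2 of 3 edges are intra-class
```

A ratio near 1 indicates a homophily graph; ratios well below 0.5, as for Chameleon or Squirrel in Table 1, indicate heterophily.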
We then propose a learning-based graph rewiring approach for heterophily graphs, namely Deep Heterophily Graph Rewiring (DHGR). DHGR rewires the graph by adding/pruning edges on the input graph to reduce its heterophily level. It can be viewed as a plug-in graph pre-processing module that works together with many kinds of GNN models, including GNNs designed for homophily and for heterophily, to boost their performance on node classification tasks.
The key idea of DHGR is to reduce heterophily while keeping effectiveness, by adding homophilic edges and removing heterophilic ones. However, simply adding homophilic edges and removing heterophilic edges between nodes in the training set may increase the risk of overfitting and lead to poor performance (we show this in Sec. 5.4). Another challenge is that, unlike homophily graphs, where Laplacian smoothing can enhance the correlation between node features and labels, heterophily graphs do not satisfy the property of smoothness (Kalofolias, 2016; Ortega et al., 2018). In this paper, we propose to use the label-/feature-distribution of neighbors on the input graph as guidance signals to identify edge polarity (homophily/heterophily), and we demonstrate their effectiveness in Sec. 3.2.
Under the guidance of the neighbors' label-distribution, DHGR learns the similarity between each node-pair, forming a similarity matrix. Based on the learned similarity matrix, we rewire the graph by adding edges between high-similarity node-pairs and pruning edges connecting low-similarity node-pairs; the resulting graph structure is then fed into GNNs for node classification tasks. Besides, we design a scalable implementation of DHGR that avoids quadratic time and memory complexity with respect to the number of nodes, making our method applicable to large-scale graphs. Finally, extensive experiments on 11 real-world graph datasets, including both homophily and heterophily graphs, demonstrate the superiority of our method.
We summarize the contributions of this paper as follows:

We propose a new perspective, i.e., graph rewiring, to deal with heterophily graphs by reducing heterophily, making GNNs perform better.

We propose to use the neighbors' label-distribution as a guidance signal to identify homophilic and heterophilic edges, supported by comprehensive experiments.

We design a learnable plug-in module for graph rewiring on heterophily graphs, namely DHGR, and a highly efficient, scalable training algorithm for it.

We conduct extensive experiments on 11 real-world graphs, including both heterophily and homophily graphs. The results show that GNNs with DHGR consistently outperform their vanilla versions. In addition, DHGR provides further gains even when combined with GNNs specifically designed for heterophily graphs.
2. Preliminary
In this section, we define the important terminology and concepts used in this paper.
2.1. Graph Neural Networks
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ is the node set and $N = |\mathcal{V}|$ is the number of nodes in $\mathcal{G}$. Let $X \in \mathbb{R}^{N \times d}$ denote the feature matrix, whose $i$-th row $x_i$ is the $d$-dimensional feature vector of node $v_i$. $\mathcal{E}$ is the edge set, and $(u, v) \in \mathcal{E}$ iff $u$ and $v$ are connected. GNNs aim to learn representations for the nodes in the graph. Typically, GNN models follow a neighborhood aggregation framework, where node representations are updated by aggregating information from neighboring nodes. Let $h_v^{(l)}$ denote the output vector of node $v$ at the $l$-th hidden layer and let $h_v^{(0)} = x_v$. The $l$-th iteration of the aggregation step can be written as:

$$h_v^{(l)} = \mathrm{COMBINE}\Big(h_v^{(l-1)},\ \mathrm{AGG}\big(\{h_u^{(l-1)} : u \in \mathcal{N}(v)\}\big)\Big)$$

where $\mathcal{N}(v)$ is the set of neighbors of $v$. The AGG function gathers information from neighbors, and the COMBINE function fuses the information from the neighbors and the central node. For graph-level tasks, an additional READOUT function is required to obtain a global representation of the graph.
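The aggregation step above can be sketched in numpy with mean aggregation and a linear-plus-ReLU combine, in the style of GraphSAGE-mean; the weight shapes and activation choice are illustrative assumptions, not the paper's specification:

```python
import numpy as np

def gnn_layer(A: np.ndarray, H: np.ndarray, W_self: np.ndarray, W_neigh: np.ndarray):
    """One neighborhood-aggregation step. A is the (N, N) adjacency matrix,
    H the (N, d) hidden states. AGG: mean over neighbors; COMBINE: linear
    maps of the self and aggregated states, followed by ReLU."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)    # guard isolated nodes
    agg = (A @ H) / deg                               # AGG: mean of neighbor states
    return np.maximum(H @ W_self + agg @ W_neigh, 0)  # COMBINE + ReLU
```

Stacking such layers gives each node a receptive field of its multi-hop neighborhood, which is exactly what heterophily disrupts: on a heterophilic edge, `agg` mixes in features of a different class.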
2.2. Graph Rewiring
Given a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node features $X$ as input, Graph Rewiring (GR) aims to learn an optimal $\mathcal{G}' = (\mathcal{V}, \mathcal{E}')$ under a given criterion, where the edge set is updated while the node set is kept constant. Let $A$ and $A'$ denote the adjacency matrices of $\mathcal{G}$ and $\mathcal{G}'$, respectively. The rewired graph $\mathcal{G}'$ is used as the input of GNNs, and is expected to be more effective than directly inputting the original graph $\mathcal{G}$. As shown in Fig. 1, the pipeline of graph rewiring models usually involves two stages: similarity learning, and graph rewiring based on the learned similarity between pairs of nodes. Obviously, the design of the criterion (i.e., the objective function) plays a critical role in the similarity-learning stage. Thus, in the next section we first mine knowledge from data to abstract an effective criterion for graph rewiring.
3. Observations from data
We observe from data that two important properties of a graph, the node-level homophily ratio and the node degree, are strongly correlated with the performance of GNNs. (The node-level homophily ratio is the homophily ratio of one specific node, i.e., the fraction of its neighbors that belong to the same class as it.) These two properties provide vital guidance for optimizing the graph structure by graph rewiring. However, we cannot directly calculate the node-level homophily ratio because labels are only partially observable during training. Therefore, we introduce two other effective signals, the neighbors' observable label- and feature-distributions, which correlate strongly with the node-level homophily ratio. In this section, we first verify the relations between the two properties and the performance of GNNs, and then verify the correlations between the neighbor distributions and the node-level homophily ratio.
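The node-level homophily ratio defined above can be computed per node as follows; this is a minimal sketch, with the `(src, dst)` edge-pair layout as our assumption:

```python
from collections import defaultdict

def node_level_hr(edges, labels):
    """Node-level homophily ratio: for each node, the fraction of its
    in-neighbors that share its label. `edges` is a list of (src, dst)
    pairs, src -> dst being the message-passing direction (assumed layout).
    Returns {node: ratio} for nodes with at least one in-neighbor."""
    same, total = defaultdict(int), defaultdict(int)
    for u, v in edges:               # u is an in-neighbor of v
        total[v] += 1
        same[v] += int(labels[u] == labels[v])
    return {v: same[v] / total[v] for v in total}
```

During training only partially observed labels are available, which is exactly why Sec. 3.2 introduces the neighbor-distribution signals as observable proxies for this quantity.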
3.1. Effects of Nodelevel Homophily Ratio and Degree
First, we conduct validation experiments to verify the effects of the node-level homophily ratio (Pei et al., 2020; Du et al., 2022b) and the node degree on the performance of GCN, as guidance for graph rewiring. Specifically, we first construct graphs by quantitatively controlling the node-level homophily ratio and node degree, and then measure the performance of GCN on the constructed graphs as a basis for assessing the quality of the constructed graph structure. Note that since messages pass from source nodes to target nodes, the node degree mentioned in this paper refers to the in-degree. For example, given node degree $k$ and node-level homophily ratio $h$, we construct a directed graph where each node has $k$ different neighboring nodes pointing to it, $h \cdot k$ of which are same-class nodes, with the other neighbors randomly selected from the remaining different-class nodes of the graph.
As shown in Fig. 2, we conduct validation experiments on three graph datasets, including one homophily graph (Cora) and two heterophily graphs (Chameleon, Actor). In these experiments, we construct graphs over a grid of node degrees and node-level homophily ratios, 35 constructed graphs in total for each dataset. For each constructed graph, we train vanilla GCN (Kipf and Welling, 2016) three times and report the average test accuracy on the node classification task. From Fig. 2, we find that both the homophily graph and the heterophily graphs follow the same rule: with the degree fixed, the accuracy of GCN increases as the node-level homophily ratio increases; with the homophily ratio fixed, the accuracy of GCN increases as the degree increases. It should be noted that when the homophily ratio equals 0 (i.e., all neighboring nodes are from different classes), GCN may achieve higher accuracy than when the homophily ratio is very small but nonzero. Besides, once the homophily ratio exceeds a threshold, the GCN accuracy converges to 100%. In general, the GCN accuracy varies almost monotonically with the node-level homophily ratio and the node degree, which motivates us to use graph rewiring to increase both.
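The controlled construction described above can be sketched as follows; tie-breaking and sampling details (without replacement, rounding of the same-class count) are our assumptions, and each class pool must be large enough for the requested degree:

```python
import random

def build_controlled_graph(labels, k, h, seed=0):
    """Directed graph in which every node has exactly k in-neighbors,
    round(h * k) of them same-class, the rest sampled uniformly from
    different-class nodes. Edges are (src, dst) with src -> dst the
    message-passing direction."""
    rng = random.Random(seed)
    n_same = round(h * k)
    edges = []
    for v, y in enumerate(labels):
        same = [u for u, yu in enumerate(labels) if yu == y and u != v]
        diff = [u for u, yu in enumerate(labels) if yu != y]
        for u in rng.sample(same, n_same) + rng.sample(diff, k - n_same):
            edges.append((u, v))
    return edges
```

Sweeping `k` and `h` over a grid and training GCN on each resulting graph reproduces the shape of the study in Fig. 2.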
3.2. Effects of Neighbor’s Label/Feature Distribution
From Sec. 3.1, we conclude that graph rewiring can be used to reduce heterophily and make GNNs perform well on both homophily and heterophily graphs. However, it is not easy to accurately identify the edge polarity (homophily or heterophily) on a heterophily graph, which we would need in order to estimate the node-level homophily ratio. For a homophily graph, we can leverage its homophily property and use Laplacian smoothing (Ortega et al., 2018; Kalofolias, 2016) to make node representations more distinguishable. Heterophily graphs, however, do not satisfy the property of smoothness, so the available information is limited. A straightforward idea is to use node features to identify edge polarity, but this single signal carries limited information. In this paper, we propose to use the similarity between the neighbors' label-distributions of a node-pair as a measure of edge polarity. Besides, considering that not all node labels are observable, we also introduce the neighbors' feature-distribution (the mean of neighbor features), which is completely observable, as an additional signal.

Up to now, we have three signals (i.e., raw node features, and the label- and feature-distributions of neighbors) that can serve as measures of edge polarity. We quantitatively evaluate the effectiveness of the three signals and find, through the following empirical experiments and analysis, that the distribution signals are more informative than the raw node features. Specifically, we consider the label-/feature-distributions of the 1st-order and 2nd-order neighbors. We then calculate the similarity between each node-pair under each signal and compute the mutual information between the node-pair similarity and the edge polarity on the graph. The mutual information is defined as follows:
$$I(X; Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dy \, dx \qquad (1)$$

For discrete random variables, the integrals are replaced by sums.
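The discrete form of Eq. 1 can be evaluated directly once the similarity values are binned; this is a sketch of the analysis described above, with the binning left to the caller:

```python
import numpy as np

def mutual_information(x_bins: np.ndarray, y: np.ndarray) -> float:
    """Discrete mutual information I(X; Y) in nats (Eq. 1 with sums).
    x_bins: discretized similarity values per node-pair; y: binary edge
    polarity (1 = homophilic, 0 = heterophilic, an assumed encoding)."""
    mi = 0.0
    for xv in np.unique(x_bins):
        for yv in np.unique(y):
            pxy = np.mean((x_bins == xv) & (y == yv))   # joint probability
            px, py = np.mean(x_bins == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi
```

Higher mutual information between a similarity signal and edge polarity means that signal is a better predictor of whether an edge is homophilic, which is the comparison reported in Fig. 3.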
As shown in Fig. 3, we conduct this statistical analysis on three datasets (i.e., Cora, Chameleon, Actor). From Fig. 3, we find that the similarities of both the neighbors' label-distribution and the neighbors' feature-distribution have a stronger correlation with edge polarity than the raw node-feature similarity, and that the neighbors' label-distribution correlates more strongly than the neighbors' feature-distribution in most cases. This rule applies to both homophily and heterophily graphs.
4. Method
Based on the observations above, we design the Deep Heterophily Graph Rewiring method (DHGR) for heterophily graphs, which can be easily plugged into various existing GNN models. Following the pipeline in Fig. 1, DHGR first learns a similarity matrix representing the similarity between each node-pair based on the neighbor distributions (i.e., the label-distribution and feature-distribution of neighbors). Then we rewire the graph structure by adding edges between high-similarity node-pairs and pruning low-similarity edges on the original graph. Finally, the rewired graph is fed into other GNN models for node classification tasks.
4.1. Similarity Learner Based on Neighbor Distribution
Before rewiring the graph, we first learn the similarity between each pair of nodes. According to the analysis in Sec. 3.2, we design a graph learner that learns node-pair similarity based on the neighbor distributions. Since only the labels of nodes in the training set are available during training, we cannot observe the full label-distribution of neighbors. Therefore, we also leverage the feature-distribution of neighbors, which is fully observable, to enhance the similarity learning, with the intuition that node features correlate with labels in an attributed graph. The results in Sec. 3.2 also validate the effectiveness of the neighbors' feature-distribution.
The overview of the similarity learner used in DHGR is shown in Fig. 4. Specifically, for an attributed graph, we first calculate the observable label-distribution and feature-distribution of the $k$-hop neighbors of each node, using the node labels in the training set and all node features:

$$D_l^{(k)} = \big(D^{-1} A\big)^{k} \, \hat{Y}, \qquad D_f^{(k)} = \big(D^{-1} A\big)^{k} \, X \qquad (2)$$

where $D_l^{(k)}$ and $D_f^{(k)}$ are respectively the label-distribution and feature-distribution of the $k$-order neighbors in the graph, with $k$ ranging up to the maximum neighbor order. $\hat{Y}$ is the one-hot label matrix: its $i$-th row is the one-hot label vector of node $v_i$ if $v_i$ belongs to the training set, and a zero vector otherwise. $A$ is the adjacency matrix and $D$ is the corresponding degree diagonal matrix. Then, for each node, we obtain the observed label-distribution vector and feature-distribution vector of its neighbors. Next, we calculate the cosine similarity between each node-pair with respect to both distributions, yielding the label-distribution similarity matrix $S^l$ and the feature-distribution similarity matrix $S^f$:
$$S^l_{uv} = \cos\!\big(d^l_u, \, d^l_v\big), \qquad S^f_{uv} = \cos\!\big(d^f_u, \, d^f_v\big) \qquad (3)$$

where $d^l_u$ and $d^f_u$ denote the neighbor label- and feature-distribution vectors of node $u$, and

$$\cos(a, b) = \frac{a^{\top} b}{\lVert a \rVert_2 \, \lVert b \rVert_2} \qquad (4)$$
Note that before calculating the cosine similarity, we first center each input vector by subtracting its mean over all nodes. Considering that not all nodes have an observed label-distribution (e.g., if none of the neighbors of node $u$ belong to the training set, its observed label-distribution is a zero vector), we compensate with the feature-distribution of neighbors. In addition, we restrict the use of the neighbor label-distribution with a mask: for node $u$, we leverage its neighbor label-distribution only when the fraction of its neighbors that lie in the training set is larger than a threshold $\eta$:
$$m_u = \mathbb{1}\!\left[ \frac{|\mathcal{N}(u) \cap \mathcal{V}_{\mathrm{train}}|}{|\mathcal{N}(u)|} > \eta \right] \qquad (5)$$

where $m$ is the mask vector, $\mathcal{N}(u)$ is the neighbor set of node $u$, and $\mathcal{V}_{\mathrm{train}}$ is the set of nodes in the training set.
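The guidance signals of this subsection, the $k$-order neighbor distributions, their mean-centered cosine-similarity matrices, and the label-usage mask, can be sketched together in numpy; the exact normalization choices are assumptions on our part:

```python
import numpy as np

def neighbor_distribution_targets(A, X, Y_onehot, train_mask, k=1, eta=0.5):
    """Compute the label-/feature-distributions of k-order neighbors (cf. Eq. 2),
    their mean-centered cosine-similarity matrices (cf. Eqs. 3-4), and the
    label-usage mask (cf. Eq. 5). A: (N, N) adjacency; X: (N, d) features;
    Y_onehot: (N, c) one-hot labels; train_mask: (N,) 0/1 indicator."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    P = A / deg                                   # row-normalized adjacency D^-1 A
    Pk = np.linalg.matrix_power(P, k)
    Y_obs = Y_onehot * train_mask[:, None]        # zero label rows outside training set
    D_l, D_f = Pk @ Y_obs, Pk @ X                 # label-/feature-distributions

    def cos_sim(M):
        M = M - M.mean(axis=0)                    # center before cosine similarity
        M = M / np.linalg.norm(M, axis=1, keepdims=True).clip(min=1e-12)
        return M @ M.T

    S_l, S_f = cos_sim(D_l), cos_sim(D_f)
    # Trust the label-distribution only where enough neighbors are labeled
    labeled_frac = (A @ train_mask.astype(float)) / deg[:, 0]
    m = (labeled_frac > eta).astype(float)
    return S_l, S_f, m
```

Materializing the full $N \times N$ similarity matrices, as done here, is only viable for small graphs; Sec. 4.2 replaces it with mini-batch windows.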
Our similarity learner then aims to learn the similarity of node-pairs based on the neighbor distributions. Specifically, it first aggregates and transforms the features of neighboring nodes and then uses the aggregated node representations to compute the cosine similarity for each node-pair:
$$S_{uv} = \cos\!\big(z_u, \, z_v\big), \quad \text{with } z_u = \mathrm{MLP}\Big(\mathrm{AGG}\big(\{x_w : w \in \mathcal{N}(u)\}\big)\Big) \qquad (6)$$

$S_{uv}$ denotes the learned similarity between nodes $u$ and $v$, and the similarities of all node-pairs form the similarity matrix $S$. In practice, we also optionally concatenate the aggregated distribution feature with the transformed feature of the node itself when computing the similarity in Eq. 6. Finally, we use the $S^l$ and $S^f$ calculated in advance to guide the training of $S$, with the following two objective functions with respect to $S^f$ and $S^l$:
$$\mathcal{L}_f = \big\lVert S - S^f \big\rVert_F^2 \qquad (7)$$

$$\mathcal{L}_l = \big\lVert m \, m^{\top} \odot \big(S - S^l\big) \big\rVert_F^2 \qquad (8)$$
In practice, we first use $\mathcal{L}_f$ to reconstruct $S^f$ as a pre-training process, and then use $\mathcal{L}_l$ to reconstruct $S^l$ under the mask $m$ as a fine-tuning process.
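The two-stage objective around Eqs. 7-8 can be sketched with a linear learner $S = Z Z^{\top}$, $Z = F W$, trained by plain gradient descent; the linear map stands in for the paper's MLP learner, and all hyper-parameters below are illustrative assumptions:

```python
import numpy as np

def train_similarity_learner(F, S_f, S_l, m, dim=3, lr=5e-3,
                             pre_epochs=200, ft_epochs=50, seed=0):
    """Pre-train the learned similarity S = Z Z^T against the feature-
    distribution target S_f (Eq. 7), then fine-tune against the label-
    distribution target S_l under the mask m (Eq. 8). F: (N, p) input
    features of the learner; m: (N,) 0/1 label-usage mask."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((F.shape[1], dim))

    def step(T, M):
        nonlocal W
        Z = F @ W
        R = M * (Z @ Z.T - T)            # masked residual of the Frobenius loss
        W = W - lr * 4 * F.T @ (R @ Z)   # gradient of ||M * (ZZ^T - T)||_F^2

    ones = np.ones_like(S_f)
    for _ in range(pre_epochs):          # stage 1: reconstruct S_f
        step(S_f, ones)
    M_l = np.outer(m, m)
    for _ in range(ft_epochs):           # stage 2: reconstruct S_l under the mask
        step(S_l, M_l)
    Z = F @ W
    return Z @ Z.T                       # learned similarity matrix S
```

Cosine normalization of $Z$ is omitted here to keep the gradient closed-form; the staged schedule (feature target first, masked label target second) is the part that mirrors the text.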
4.2. A Scalable Implementation of DHGR
However, directly optimizing the objective functions above has quadratic computational complexity: for node attributes $X \in \mathbb{R}^{N \times d}$, the $O(N^2 d)$ cost is unacceptable for large graphs. We therefore design a scalable training strategy with stochastic mini-batches. Specifically, in each iteration we randomly select a batch of nodes and optimize only the corresponding block of the similarity matrix through a sliding window, where the batch size can be assigned a small number. We give the pseudocode in Algorithm 1.
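A minimal sketch of one such mini-batch draw, under the assumption that targets are recomputed on the fly from the per-node distribution vectors rather than stored as an $N \times N$ matrix:

```python
import numpy as np

def batch_targets(D, b, rng):
    """Sample a batch of b nodes and compute the b-by-b cosine-similarity
    window of their neighbor-distribution vectors on the fly, so neither
    the target nor the learned similarity matrix is ever materialized at
    N-by-N size. D: (N, c) distribution vectors. Cost per batch: O(b^2 c)."""
    idx = rng.choice(D.shape[0], size=b, replace=False)
    Db = D[idx]
    Db = Db / np.linalg.norm(Db, axis=1, keepdims=True).clip(min=1e-12)
    return idx, Db @ Db.T            # indices and the b-by-b target window
```

With a fixed small `b`, each epoch of roughly $N/b$ such batches touches $O(Nb)$ pairs instead of $O(N^2)$.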
4.3. Graph Rewiring with Learned Similarity
After obtaining the similarity of each node-pair, we use the learned similarity to rewire the graph. Specifically, we add edges between node-pairs with high similarity and remove edges with low similarity from the original graph. Three parameters control this process: the maximum number of edges that can be added for each node; a threshold that the similarity of node-pairs to be connected must exceed; and a pruning threshold below which existing edges are removed. The details of the graph rewiring process are given in Algorithm 2. Finally, the rewired graph can be fed into any GNN-based model for node classification tasks.
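The add/prune step can be sketched as follows; the parameter names `k_add`, `eps_add`, and `eps_del` are ours, standing in for the three unnamed parameters above, and the dense top-k scan is a simplification of the Ball-Tree search used in Algorithm 2:

```python
import numpy as np

def rewire(S, edges, k_add, eps_add, eps_del):
    """Rewiring sketch: for each node, add edges from its top-k_add most
    similar nodes whose similarity exceeds eps_add, and drop original
    edges whose similarity falls below eps_del. S: (N, N) learned
    similarity; edges: list of (src, dst) pairs."""
    N = S.shape[0]
    added = set()
    for v in range(N):
        sims = S[v].copy()
        sims[v] = -np.inf                        # forbid self-loops
        for u in np.argsort(-sims)[:k_add]:      # top-k_add candidates
            if sims[u] > eps_add:
                added.add((int(u), v))
    kept = [(u, v) for u, v in edges if S[u, v] >= eps_del]  # prune low-similarity edges
    return sorted(set(kept) | added)
```

Raising the similarity threshold trades fewer added edges for higher confidence that each added edge is homophilic, the trade-off studied in Sec. 5.5.2.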
| Dataset | Chameleon | Squirrel | Actor | FB100 | Flickr | Cornell | Texas | Wisconsin | Cora | CiteSeer | PubMed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Nodes | 2277 | 5201 | 7600 | 41554 | 89250 | 183 | 183 | 251 | 2708 | 3327 | 19717 |
| Edges | 36101 | 217073 | 30019 | 2724458 | 899756 | 298 | 325 | 511 | 10556 | 9104 | 88648 |
| Features | 2325 | 2089 | 932 | 4814 | 500 | 1703 | 1703 | 1703 | 1433 | 3703 | 500 |
| Classes | 5 | 5 | 5 | 2 | 7 | 5 | 5 | 5 | 7 | 6 | 3 |
| H.R. | 23.5% | 22.4% | 21.9% | 47.0% | 31.9% | 30.5% | 10.8% | 19.6% | 81.0% | 73.6% | 80.2% |
4.4. Complexity Analysis
We analyze the computational complexity of Algorithm 1 and Algorithm 2 with respect to the number of nodes $N$. For Algorithm 1, randomly sampling a batch of $b$ nodes costs $O(b)$. Let $d$ denote the feature dimension and $c$ the one-hot label dimension. Since the cosine similarity of two $d$-dimensional vectors costs $O(d)$, computing the $b \times b$ similarity matrices $S$, $S^l$, and $S^f$ costs $O(b^2 (d + c))$, and computing $\mathcal{L}_f$ and $\mathcal{L}_l$ costs $O(b^2)$. Hence one epoch of Algorithm 1 costs $O(N b (d + c))$, which is linear in $N$ since $b$, $d$, and $c$ are constants. For Algorithm 2, we use a Ball-Tree to compute the top-$K$ nearest neighbors; one top-$K$ query costs $O(\log N)$, so the first loop, which performs the top-$K$ queries for all nodes, costs approximately $O(N \log N)$. The second loop filters each edge of the original graph and thus costs $O(|\mathcal{E}|)$. Therefore the final complexity of Algorithm 2 is $O(N \log N + |\mathcal{E}|)$.
| GNN Model | Graph | Chameleon | Squirrel | Actor | Flickr | FB100 | Cornell | Texas | Wisconsin |
|---|---|---|---|---|---|---|---|---|---|
| GCN | vanilla | 37.68±3.06 | 26.39±0.88 | 28.90±0.57 | 49.68±0.45 | 74.34±0.20 | 55.56±3.21 | 61.96±1.27 | 52.35±7.07 |
| | DHGR | 70.83±2.03 | 67.15±1.43 | 36.29±0.12 | 51.01±0.25 | 77.01±0.14 | 67.38±5.33 | 81.78±0.89 | 76.47±3.62 |
| GAT | vanilla | 44.34±1.42 | 29.82±0.98 | 29.10±0.57 | 49.67±0.81 | 70.01±0.66 | 56.22±6.02 | 60.36±5.55 | 49.61±6.20 |
| | DHGR | 72.11±2.87 | 62.37±1.78 | 34.71±0.48 | 50.40±0.09 | 79.41±5.13 | 70.09±6.77 | 83.78±3.37 | 73.20±4.89 |
| GraphSAGE | vanilla | 49.06±1.88 | 36.73±1.21 | 35.07±0.15 | 50.21±0.31 | 75.99±0.09 | 80.08±2.96 | 82.03±2.77 | 81.36±3.91 |
| | DHGR | 69.57±1.28 | 68.08±1.55 | 37.17±0.11 | 50.85±0.05 | 76.56±0.10 | 82.88±5.56 | 85.68±2.72 | 83.16±1.72 |
| APPNP | vanilla | 40.44±2.02 | 29.20±1.45 | 30.02±0.89 | 49.05±0.10 | 74.22±0.11 | 56.76±4.58 | 55.10±6.23 | 54.59±6.13 |
| | DHGR | 70.35±2.62 | 60.31±1.51 | 36.93±0.86 | 49.36±0.05 | 75.46±0.11 | 68.11±6.59 | 81.58±4.36 | 77.65±3.06 |
| GCNII | vanilla | 57.37±2.35 | 39.51±1.63 | 31.05±0.14 | 50.34±0.22 | 77.06±0.12 | 61.70±5.91 | 62.43±7.37 | 52.75±4.23 |
| | DHGR | 74.57±2.56 | 58.38±1.79 | 36.03±0.12 | 50.73±0.31 | 78.38±0.91 | 72.97±6.73 | 81.08±6.02 | 78.24±4.99 |
| GPR-GNN | vanilla | 41.56±1.66 | 30.03±1.11 | 35.72±0.19 | 49.76±0.10 | 78.58±0.23 | 72.78±6.05 | 69.37±1.27 | 76.08±5.86 |
| | DHGR | 71.58±1.59 | 64.82±2.07 | 37.43±0.78 | 50.56±0.32 | 82.28±0.56 | 76.56±5.77 | 83.98±2.54 | 79.41±4.98 |
| H2GCN | vanilla | 49.21±2.57 | 34.58±1.61 | 35.61±0.31 | — | — | 79.06±6.36 | 80.27±5.41 | 80.20±4.51 |
| | DHGR | 69.19±1.91 | 72.24±1.52 | 36.51±0.67 | — | — | 82.06±6.27 | 84.86±5.01 | 85.01±5.51 |
| Avg Gain | | 25.51 | 32.44 | 4.23 | 0.70 | 3.15 | 8.27 | 15.89 | 15.17 |
5. Experiments
In this section, we first describe the experimental configuration, including the datasets, baselines, and setups used in this paper. We then report experiments comparing DHGR with other graph rewiring methods on the node classification task under the transductive learning scenario. Besides, we conduct extensive hyper-parameter and ablation studies to validate the effectiveness of DHGR.
5.1. Datasets
We evaluate the performance of DHGR and the existing methods on eleven real-world graphs: eight heterophily graph datasets (i.e., Chameleon, Squirrel, Actor, Cornell, Texas, Wisconsin (Pei et al., 2020), FB100 (Traud et al., 2012), Flickr (Zeng et al., 2019)) and three homophily graph datasets (i.e., Cora, CiteSeer, PubMed (Kipf and Welling, 2016)). Detailed information on these datasets is presented in Table 1. For graph rewiring methods, we use both the original graphs and the rewired graphs as input of GNN models to validate their performance on the node classification task.
5.2. Baselines
DHGR can be viewed as a plug-in module for other state-of-the-art GNN models. We select five GNN models tackling homophily: GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2017), GraphSAGE (Hamilton et al., 2017), APPNP (Klicpera et al., 2018), and GCNII (Chen et al., 2020a). To demonstrate the significant improvement on heterophily graphs brought by DHGR, we also choose two GNNs tackling heterophily (i.e., GPR-GNN (Chien et al., 2020) and H2GCN (Zhu et al., 2020a)). Besides, to validate the effectiveness of DHGR as a graph rewiring method, we compare DHGR with two Graph Structure Learning (GSL) methods (i.e., LDS (Franceschi et al., 2019) and IDGL (Chen et al., 2020c)) and one graph rewiring method (i.e., SDRF (Topping et al., 2021)), all of which aim at optimizing the graph structure. For GPR-GNN and H2GCN, we use the implementations from the benchmark (Lim et al., 2021), and for the other GNNs we use the official implementations provided by PyTorch Geometric. For all graph rewiring methods except SDRF, whose code is not available, we use the official implementations released with the original papers.
5.3. Experimental Setup
For all datasets, we use their publicly released data splits. For Chameleon, Squirrel, Actor, Cornell, Texas, and Wisconsin, ten randomly generated splits are provided by (Pei et al., 2020); we train models on each split with 3 random seeds for model initialization (30 trials in total per dataset) and report the average and standard deviation over all 30 results. For the other datasets (i.e., Cora, PubMed, CiteSeer (Kipf and Welling, 2016), Flickr (Zeng et al., 2019), FB100 (Lim et al., 2021)), we use the official splits from the corresponding papers. We train our DHGR models with 200 pre-training epochs and 30 fine-tuning epochs on all datasets, and we search the hyper-parameters of DHGR in the same space for all datasets: the max order of neighbors is searched in {1, 2}, the growing threshold in {3, 6, 8, 16}, and the pruning threshold in {0., 0.3, 0.6}, where we do not prune edges for homophily datasets. The batch size for training DHGR is searched in {5000, 10000}. For the other GSL methods (i.e., LDS (Franceschi et al., 2019), IDGL (Chen et al., 2020c)), we adjust their hyper-parameters according to the configurations used in their papers. For the GNNs used in this paper, we tune hyper-parameters in the same search space for fairness: the hidden dimension is searched in {32, 64} for all GNNs, and the number of layers is set to 2 for all GNNs except GCNII (Chen et al., 2020a), which is designed to be deeper and whose number of layers we search in {2, 64} according to its official implementation. We train 200/300/400 epochs for all models and select the best parameters on the validation set. The learning rate is searched in {1e-2, 1e-3, 1e-4} and the weight decay in {1e-4, 1e-3, 5e-3}; we use the Adam optimizer and run all models on an Nvidia Tesla V100 GPU.
5.4. Main Results
We conduct node classification experiments on both heterophily and homophily graph datasets; the results are presented in Table 2 and Table 3, respectively. We evaluate DHGR by comparing the classification accuracy of each GNN on the original graph and on the graph rewired by DHGR. We also calculate the average gain (AG) of DHGR over all models on each dataset, defined as follows:
$$AG = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \big(\mathrm{Acc}(m, \mathcal{G}') - \mathrm{Acc}(m, \mathcal{G})\big) \qquad (9)$$
where $\mathcal{M}$ is the set of GNN models, Acc is short for accuracy, $\mathcal{G}$ is the original graph, and $\mathcal{G}'$ is the graph rewired by DHGR. We also compare the proposed DHGR with other graph rewiring methods in terms of performance and running time; the results are reported in Table 4 and Fig. 5. From these results, we make the following observations:
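The average gain of Eq. 9 is a plain mean of per-model accuracy differences; a one-function sketch (the dict layout is an assumed interface):

```python
def average_gain(acc_original, acc_rewired):
    """Average gain (Eq. 9): mean accuracy improvement of DHGR-rewired
    graphs over the original graphs across a set of GNN models.
    Both arguments map model name -> accuracy on the same model set."""
    return sum(acc_rewired[m] - acc_original[m] for m in acc_original) / len(acc_original)

# Illustrative call with two of the GCN/GAT Chameleon accuracies from Table 2.
print(average_gain({'GCN': 37.68, 'GAT': 44.34}, {'GCN': 70.83, 'GAT': 72.11}))
```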
(1) All GNNs enhanced by DHGR, including GNNs designed for homophily and for heterophily, outperform their vanilla versions on the eight heterophily graph datasets. The average gain of DHGR on heterophily graphs reaches 32.44% on Squirrel, where vanilla GCN achieves only 26.39% test accuracy and even the state-of-the-art GNNs for heterophily (i.e., GPR-GNN, H2GCN) achieve no more than 40%. H2GCN enhanced by DHGR reaches an astonishing 72.24% test accuracy on Squirrel, almost doubling its vanilla result. For most other heterophily datasets, GNNs with DHGR also obtain significant accuracy improvements. This demonstrates the importance of the graph rewiring strategy for improving GNN performance on heterophily graphs, and the significant average gain further demonstrates the effectiveness of DHGR. For large-scale and edge-dense datasets such as Flickr and FB100, graph rewiring with DHGR still provides a competitive boost for GNNs, which verifies its effectiveness and scalability on large-scale graphs.
| GNN Model | Graph | Cora | CiteSeer | PubMed |
|---|---|---|---|---|
| GCN | vanilla | 81.09±0.39 | 70.13±0.45 | 78.38±0.39 |
| | DHGR | 82.70±0.41 | 70.79±0.12 | 79.10±0.33 |
| GAT | vanilla | 81.90±0.73 | 69.60±0.63 | 78.1±0.63 |
| | DHGR | 82.93±0.51 | 70.43±0.65 | 78.81±0.93 |
| GraphSAGE | vanilla | 80.62±0.47 | 70.30±0.57 | 77.1±0.23 |
| | DHGR | 81.30±0.26 | 71.11±0.65 | 77.63±0.16 |
| APPNP | vanilla | 83.25±0.42 | 70.46±0.31 | 78.9±0.45 |
| | DHGR | 83.86±0.40 | 71.60±0.35 | 79.61±0.53 |
| GCNII | vanilla | 83.11±0.37 | 70.90±0.73 | 79.46±0.33 |
| | DHGR | 83.93±0.28 | 71.96±0.67 | 79.49±0.39 |
| Avg Gain | | 0.95 | 0.90 | 0.54 |
(2) For homophily graphs (i.e., Cora, CiteSeer, PubMed), the proposed DHGR still provides competitive gains in node classification performance for the GNNs. Note that homophily graphs already have high homophily ratios (i.e., 81%, 74%, and 80% for Cora, CiteSeer, and PubMed), so even vanilla GCNs achieve strong results, and the benefit of adjusting the graph structure toward higher homophily is smaller than on heterophily graphs. Specifically, DHGR attains its best average gain on Cora, e.g., improving the classification accuracy of vanilla GCN from 81.1% to 82.7%. For the other two datasets, DHGR also provides an average gain of no less than 0.5% accuracy across all GNN models. These results demonstrate that our method provides significant improvements for heterophily graphs while maintaining competitive improvements for homophily graphs.
| Method | Chameleon | Squirrel | Actor | Texas |
|---|---|---|---|---|
| Vanilla GCN | 37.68±3.06 | 26.39±0.88 | 28.90±0.57 | 61.96±1.27 |
| RandAddEdge | 32.17±6.06 | 22.77±5.05 | 26.68±2.26 | 55.85±1.68 |
| RandDropEdge | 39.01±2.47 | 26.48±1.09 | 29.54±0.36 | 66.76±1.52 |
| Same-class AddEdge (train set) | 37.01±3.36 | 27.89±2.28 | 29.57±1.17 | 60.08±2.13 |
| LDS | 36.12±2.89 | 28.02±1.78 | 27.58±0.97 | 58.75±5.57 |
| IDGL | 37.28±3.36 | 23.57±2.07 | 27.17±0.85 | 67.57±5.85 |
| SDRF* | 44.46±0.17 | 41.47±0.21 | 29.85±0.07 | 70.35±0.60 |
| DHGR | 70.83±2.03 | 67.15±1.43 | 36.29±0.12 | 81.78±0.89 |
(3) To demonstrate the effectiveness of DHGR as a graph rewiring method, we also compare the proposed approach with other graph rewiring methods (i.e., LDS, IDGL, SDRF). Besides, we use two random graph structure transformations, namely RandAddEdge and RandDropEdge, which add or remove edges on the original graph with a probability of 0.5. To validate the effect of adding edges between same-class nodes using training labels, we also design a method that randomly adds edges between same-class nodes within the training set (since we can only observe the labels of nodes in the training set) with a probability of 0.5. As shown in Table 4, GCN with DHGR outperforms GCN with the other graph transformation methods on the four presented heterophily datasets. Note that the method that only uses training labels to add edges, though it increases the homophily ratio, cannot add edges beyond the training set. Adding homophilic edges only within the training set cannot guarantee an improvement of GCN's performance; instead, it makes the training nodes easier to distinguish and increases the risk of overfitting. The significant improvements achieved by DHGR demonstrate its effectiveness as a graph rewiring method.

(4) Note that the traditional paradigm of GSL methods (e.g., LDS, IDGL) trains a graph learner and a GNN in an end-to-end manner based on dense matrix optimization, which has higher complexity. The running times of DHGR and the two GSL methods are presented in Fig. 5; the running time of DHGR is significantly smaller than that of the GSL methods under the same device environment. We do not report the running time of SDRF because its code has not been publicly released.
5.5. Hyper-Parameter Study
To demonstrate the robustness of the proposed approach, in this section we study the effect of the four main hyper-parameters of DHGR: the batch size, the maximum number of added edges for each node, the threshold on the lowest similarity when adding edges, and the training ratio of the datasets.
Dataset  Squirrel  FB100
Batch size  (accuracy under decreasing training ratio →)
100  64.57  64.01  63.31  75.36  75.02  74.78
1000  66.01  65.68  64.53  76.21  76.30  75.01
5000  66.57  66.21  66.17  76.58  76.37  75.97
10000  67.79  67.66  66.32  77.32  76.57  76.32
Full batch  67.79  67.66  66.32  77.23  76.87  76.21
5.5.1. The effect of batchsize and training set ratio
Table 5 shows the results of GCN with DHGR on two heterophily datasets, varying the batch size of DHGR and the training ratio (the percentage of nodes in the training set). The batch size ranges from 100 up to full batch, i.e., using all nodes for training. Note that for the Squirrel dataset, which has only 5201 nodes, a batch size of 10000 already equals full batch. The results in Table 5 show that the proposed approach yields stable improvements across batch sizes and training ratios. Specifically, GCN with DHGR loses only about 3% accuracy when the batch size is decreased to 100, which is extremely small, and no more than 2% accuracy as the training ratio drops from 40% to 10%. Besides, we usually set the batch size of DHGR between 5000 and 10000 in real applications, because the overhead of storing and operating on a 10000×10000 matrix is completely acceptable. These results demonstrate the robustness of DHGR with respect to the batch size and training ratio.
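The memory argument above can be made concrete with a small sketch: computing node-pair similarities one row-batch at a time keeps peak memory proportional to batch_size × n rather than the full n × n matrix. This is a hypothetical illustration of the general idea, not DHGR's actual implementation; the function name and the choice of cosine similarity are ours.

```python
import numpy as np

def batched_cosine_similarity(features, batch_size=5000):
    """Compute pairwise cosine similarities one row-batch at a time.

    Illustrates why the batch size bounds memory: each step materializes
    only a (batch_size x n) block instead of the full n x n matrix.
    """
    # L2-normalize rows so that dot products become cosine similarities.
    norms = np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    x = features / norms
    n = x.shape[0]
    blocks = []
    for start in range(0, n, batch_size):
        blocks.append(x[start:start + batch_size] @ x.T)  # (B, n) block
    return np.concatenate(blocks, axis=0)
```

In a real pipeline the per-batch block would be consumed (e.g., thresholded) before the next one is computed, so the full matrix is never stored.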
5.5.2. The effect of the maximum number of added edges and the similarity threshold
Two hyper-parameters matter when rewiring graphs with DHGR: the maximum number of edges added for each node, and the threshold on the lowest similarity when adding edges. Given the similarity learned by DHGR, these two hyper-parameters largely determine the degree and homophily ratio of the rewired graph. Motivated by the observations presented in Sec. 3, we verify the effectiveness of DHGR for graph rewiring under different settings of the two hyper-parameters. Fig. 6 (a) shows the homophily ratio of the rewired graphs, and Fig. 6 (b) shows the node classification accuracy of GCN on them. We observe that the homophily ratio changes monotonically with each hyper-parameter when the other is fixed. Besides, the change in GCN node classification accuracy closely tracks the change in homophily ratio. This demonstrates the effectiveness and robustness of the rewired graphs learned by DHGR.
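As a concrete illustration of how these two hyper-parameters shape the rewired graph, the following sketch adds, for each node, at most a fixed number of new edges whose learned similarity exceeds a threshold, and measures the resulting edge homophily ratio. The function names and the dense similarity-matrix interface are our assumptions for illustration, not DHGR's actual API.

```python
import numpy as np

def edge_homophily(edge_index, labels):
    """Fraction of edges whose two endpoints share a class label."""
    src, dst = edge_index
    return float((labels[src] == labels[dst]).mean())

def add_edges_by_similarity(sim, max_per_node, threshold):
    """Connect each node to at most `max_per_node` most-similar nodes
    whose similarity is at least `threshold` (the two hyper-parameters
    discussed above). Returns the added edges as a 2 x E index array."""
    n = sim.shape[0]
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)  # never add self-loops
    new_src, new_dst = [], []
    for u in range(n):
        top = np.argsort(-s[u])[:max_per_node]  # most similar first
        for v in top:
            if s[u, v] >= threshold:
                new_src.append(u)
                new_dst.append(int(v))
    return np.array([new_src, new_dst])
```

With a good similarity function, loosening the per-node edge budget or the threshold trades off how many edges are added against how pure they are, which is exactly the trade-off swept in Fig. 6.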
5.6. Ablation Study
Considering that DHGR leverages three different types of information (i.e., raw features, label-distribution, and feature-distribution), we verify the effectiveness of each type by removing it from DHGR, yielding three variants: one removes the use of the neighbors' label-distribution (the fine-tuning process); one removes the use of the neighbors' feature-distribution (the pre-training process); and one does not use the concatenation of the distribution feature and the transformed feature of the node itself for similarity calculation in Eq. 6 (using only the distribution feature). As shown in Table 6, the node classification performance of GCN with rewired graphs from almost all variants deteriorates to some extent on the four selected datasets (i.e., Cora, Cornell, Texas, FB100). For the Texas dataset, the variant that does not utilize the neighbors' feature-distribution shows a slight improvement over the full DHGR; we attribute this to the poor performance of the feature-distribution on this dataset, as reflected by the variant that leverages only the feature-distribution and the node's own feature. Moreover, the result of DHGR on the Texas dataset decreases only slightly, by 0.2% accuracy, compared with that variant. The results of the ablation study demonstrate the effectiveness of the neighbor label-distribution for modeling heterophily graphs, and that the proposed approach makes full use of the useful information from neighbor distributions and raw features.
Methods  Cora  Cornell  Texas  FB100
w/o concatenation  80.97±0.05  65.38±5.53  79.67±1.79  75.95±0.16
w/o feature-distribution  81.30±0.13  67.08±6.08  82.02±1.06  76.68±0.56
w/o label-distribution  81.70±0.11  62.21±4.49  67.85±1.02  75.65±0.26
DHGR  82.63±0.41  67.38±5.33  81.78±0.89  77.01±0.14
6. Related Work
6.1. Graph Representation Learning
Graph Neural Networks (GNNs) have been popular for modeling graph data (Bi et al., 2022; Yang et al., 2021; Chen et al., 2021; Wang et al., 2019b; Du et al., 2022a). GCN (Kipf and Welling, 2016) proposed to use graph convolution based on neighborhood aggregation. GAT (Veličković et al., 2017) proposed to use an attention mechanism to learn weights for neighbors. GraphSAGE (Hamilton et al., 2017) was proposed with graph sampling for inductive learning on graphs. These early methods are designed for homophily graphs, and they perform poorly on heterophily graphs. Recently, some studies (Abu-El-Haija et al., 2019; Pei et al., 2020; Zhu et al., 2020a; Chien et al., 2020; Du et al., 2022b) propose to design GNNs for modeling heterophily graphs. MixHop (Abu-El-Haija et al., 2019) was proposed to aggregate representations from multi-hop neighbors to alleviate heterophily. Geom-GCN (Pei et al., 2020) proposed a bi-level aggregation scheme considering both node embeddings and structural neighborhoods. GPR-GNN (Chien et al., 2020) proposed to adaptively learn the Generalized PageRank (GPR) weights to jointly optimize node feature and structural information extraction. More recently, GBK-GNN (Du et al., 2022b) was designed with bi-kernels for homophilic and heterophilic neighbors, respectively.
6.2. Graph Rewiring
Traditional message passing GNNs usually assume that messages are propagated on the original graph (Kipf and Welling, 2016; Veličković et al., 2017; Hamilton et al., 2017; Chen et al., 2020a). Recently, there is a trend to decouple the input graph from the graph used for message passing. Examples include graph sampling methods for inductive learning (Hamilton et al., 2017; Zhang et al., 2019), motif-based methods (Monti et al., 2018), graph filters leveraging multi-hop neighbors (Abu-El-Haija et al., 2019), and approaches that change the graph either as a preprocessing step (Klicpera et al., 2019; Alon and Yahav, 2020) or adaptively for the downstream task (Kazi et al., 2022; Wang et al., 2019a). Besides, Graph Structure Learning (GSL) methods (Li et al., 2018; Franceschi et al., 2019; Chen et al., 2020c; Zhu et al., 2022; Gao et al., 2020; Wan and Kokel, 2021) aim at jointly learning an optimized graph structure and its corresponding node representations. Such methods of changing graphs for better performance on downstream tasks are often generically named graph rewiring (Topping et al., 2021). The works of (Alon and Yahav, 2020; Topping et al., 2021) proposed rewiring the graph as a way of reducing the bottleneck, a structural property of the graph that leads to over-squashing. Some GSL methods (Wan and Kokel, 2021; Gao et al., 2020) directly make the adjacency matrix a learnable parameter and optimize it with the GNN. Other GSL methods (Franceschi et al., 2019; Chen et al., 2020c) use a bi-level optimization pipeline, in which the inner loop handles the downstream task and the outer loop learns the optimal graph structure with a structure learner. Some studies (Ying et al., 2021; Dwivedi et al., 2021) also use transformer-like GNNs to construct global connections between all nodes. However, both GSL methods and graph transformer-based methods usually have higher time and space complexity than other graph rewiring methods.
Most existing graph rewiring methods rely on similar assumptions about graphs (e.g., sparsity (Louizos et al., 2017), low-rank (Zhu et al., 2020b), and smoothness (Ortega et al., 2018; Kalofolias, 2016)). However, the low-rank and smoothness properties are not satisfied by heterophily graphs. Thus, graph rewiring methods for modeling heterophily graphs still need to be explored.
7. Conclusion
In this paper, we propose a new perspective on modeling heterophily graphs via graph rewiring, which aims to improve the homophily ratio and degree of the original graph so that GNNs gain better performance on the node classification task. Besides, we design a learnable plug-in module of graph rewiring for heterophily graphs, named DHGR, which can be easily plugged into any GNN model to improve its performance on heterophily graphs. DHGR improves the homophily of a graph by adjusting the structure of the original graph based on the neighbors' label-distributions. We further design a scalable optimization strategy for training DHGR to guarantee linear computational complexity. Experiments on eleven real-world datasets demonstrate that DHGR provides significant performance gains for GNNs under heterophily, while achieving competitive performance under homophily. The extensive ablation studies further demonstrate the effectiveness of the proposed approach.
References
 MixHop: higher-order graph convolutional architectures via sparsified neighborhood mixing. In Proceedings of International Conference on Machine Learning, pp. 21–29. Cited by: §1, §6.1, §6.2.
 On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205. Cited by: §1, §6.2.
 MM-GNN: mix-moment graph neural network towards modeling neighborhood feature distribution. arXiv preprint arXiv:2208.07012. Cited by: §6.1.
 Simple and deep graph convolutional networks. In Proceedings of International Conference on Machine Learning, pp. 1725–1735. Cited by: §5.2, §5.3, §6.2.
 Fast hierarchy preserving graph embedding via subspace constraints. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3580–3584. Cited by: §6.1.
 TSSRGCN: temporal spectral spatial retrieval graph convolutional network for traffic flow forecasting. In 2020 IEEE International Conference on Data Mining (ICDM), pp. 954–959. Cited by: §1.
 Iterative deep graph learning for graph neural networks: better and robust node embeddings. Advances in Neural Information Processing Systems 33, pp. 19314–19326. Cited by: §1, §5.2, §5.3, §6.2.
 Adaptive universal generalized pagerank graph neural network. In Proceedings of International Conference on Learning Representations, Cited by: §1, §5.2, §6.1.
 Understanding and improvement of adversarial training for network embedding from an optimization perspective. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 230–240. Cited by: §6.1.
 TabularNet: a neural network architecture for understanding semantic structures of tabular data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 322–331. Cited by: §1.
 GBK-GNN: gated bi-kernel graph neural networks for modeling both homophily and heterophily. In Proceedings of the ACM Web Conference 2022, pp. 1550–1558. Cited by: §1, §6.1.
 Traffic events oriented dynamic traffic assignment model for expressway network: a network flow approach. IEEE Intelligent Transportation Systems Magazine 10 (1), pp. 107–120. Cited by: §1.
 Graph neural networks with learnable structural and positional representations. arXiv preprint arXiv:2110.07875. Cited by: §6.2.
 Learning discrete structures for graph neural networks. In Proceedings of International conference on machine learning, pp. 1972–1982. Cited by: §1, §5.2, §5.3, §6.2.
 Exploring structureadaptive graph learning for robust semisupervised classification. In Proceedings of 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §6.2.
 Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035. Cited by: §1, §5.2, §6.1, §6.2.
 How to learn a graph from smooth signals. In Proceedings of Artificial Intelligence and Statistics, pp. 920–929. Cited by: §1, §3.2, §6.2.
 Differentiable graph module (DGM) for graph convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §6.2.
 Semisupervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §3.1, §5.1, §5.2, §5.3, §6.1, §6.2.
 Predict then propagate: graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997. Cited by: §5.2.
 Diffusion improves graph learning. arXiv preprint arXiv:1911.05485. Cited by: §6.2.
 Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3546–3553. Cited by: §6.2.
 New benchmarks for learning on non-homophilous graphs. arXiv preprint arXiv:2104.01404. Cited by: §5.2, §5.3.
 Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312. Cited by: §1, §6.2.
 MotifNet: a motif-based graph convolutional network for directed graphs. In Proceedings of the 2018 IEEE Data Science Workshop (DSW), pp. 225–228. Cited by: §6.2.
 Graph signal processing: overview, challenges, and applications. Proceedings of the IEEE 106 (5), pp. 808–828. Cited by: §1, §3.2, §6.2.
 Geom-GCN: geometric graph convolutional networks. arXiv preprint arXiv:2002.05287. Cited by: §1, §3.1, Table 1, §5.1, §5.3, §6.1.
 Inferring explicit and implicit social ties simultaneously in mobile social networks. Science China Information Sciences 63 (4), pp. 1–3. Cited by: §1.
 Understanding oversquashing and bottlenecks on graphs via curvature. arXiv preprint arXiv:2111.14522. Cited by: §1, §5.2, §6.2.
 Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391 (16), pp. 4165–4180. Cited by: §5.1.
 Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §5.2, §6.1, §6.2.
 Graph sparsification via metalearning. DLG@ AAAI. Cited by: §6.2.
 Powerful graph convolutional networks with adaptive propagation mechanism for homophily and heterophily. arXiv preprint arXiv:2112.13562. Cited by: §1.
 Cocogum: contextual code summarization with multirelational gnn on umls. Microsoft, Tech. Rep. MSRTR202016. Cited by: §1.
 Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38 (5), pp. 1–12. Cited by: §6.2.
 Tag2Gauss: learning tag representations via gaussian distribution in tagged networks. In IJCAI, pp. 3799–3805. Cited by: §6.1.
 Two sides of the same coin: heterophily and oversmoothing in graph convolutional neural networks. arXiv preprint arXiv:2102.06462. Cited by: §1.
 Domain adaptive classification on heterogeneous information networks. In Proceedings of the TwentyNinth International Conference on International Joint Conferences on Artificial Intelligence, pp. 1410–1416. Cited by: §6.1.
 TrajGAT: a graphbased longterm dependency modeling approach for trajectory similarity computation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 2275–2285. Cited by: §1.
 Do transformers really perform badly for graph representation?. Advances in Neural Information Processing Systems 34. Cited by: §6.2.
 GraphSAINT: graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931. Cited by: §5.1, §5.3.
 Bayesian graph convolutional neural networks for semisupervised classification. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 5829–5836. Cited by: §6.2.
 Beyond homophily in graph neural networks: current limitations and effective designs. Advances in Neural Information Processing Systems 33, pp. 7793–7804. Cited by: §1, §1, §5.2, §6.1.
 A survey on graph structure learning: progress and opportunities. arXiv preprint arXiv:2103.03036. Cited by: §6.2.
 Cagnn: clusteraware graph neural networks for unsupervised graph representation learning. arXiv preprint arXiv:2009.01674. Cited by: §1, §6.2.