1. Introduction
Graph data are ubiquitous today, e.g., friendship graphs in social networks, phone call or message graphs in telecommunication, user-item interaction graphs in recommender systems, and protein-protein interaction graphs in biology. For graph-based tasks such as node classification, link prediction and graph classification, graph neural networks (GNNs) achieve excellent performance thanks to their ability to utilize both graph structure and feature information (on nodes or edges). Most GNN models can be formulated under the message passing framework, in which each node passes messages to its neighbors in the graph and aggregates messages from the neighbors to update its own embedding.
Different attempts have been made to design algorithms and models for graph analytics. Random-walk-based methods include DeepWalk (Perozzi et al., 2014), which uses random walk paths as the input to a skip-gram model to learn node embeddings, and node2vec (Grover and Leskovec, 2016), which learns node embeddings by combining breadth-first and depth-first random walks. Motivated by graph spectral theory, graph convolutional network (GCN) (Kipf and Welling, 2017) conducts graph convolution using the adjacency matrix to avoid high-complexity spectral decomposition. Instead of using the adjacency matrix to derive the weights for message aggregation, graph attention network (GAT) (Veličković et al., 2018) uses an attention module to learn the weights from data. Simple graph convolution network (SGC) (Wu et al., 2019) removes the non-linearity in GCN based on the observation that the good performance of GCN mainly comes from local averaging rather than non-linearity. There are also many other GNN models such as GraphSAGE (Hamilton et al., 2017), jumping knowledge network (JKNet) (Xu et al., 2018), geometric graph convolutional network (Geom-GCN) (Pei et al., 2020), and gated graph neural network (GGNN) (Li et al., 2015), and we refer readers to a comprehensive survey in (Zhou et al., 2018).
In this paper, we focus on the problem of semi-supervised node classification, which is also what most GNN models are designed for. We observed that most existing research attempts to design more effective GNN models, while the quality of the input data has not received much attention. However, data quality and model quality can be equally important for good performance. (Here, data quality is problem-specific: given a GNN model and a specific problem, high data quality means that the GNN model produces good output for the problem on the input data. In this paper, we discuss data quality with respect to the node classification problem.) For example, if the input graph contains only intra-class edges (i.e., edges connecting nodes from the same class) and no inter-class edges (i.e., edges connecting nodes from different classes), node classification can achieve perfect accuracy with only one training sample from each connected component. Moreover, classification tasks are usually easier with more training samples.
At first glance, data quality (i.e., the quality of the input graph structure and training nodes) is fixed as part of the problem input and cannot be improved. However, we observed that existing GNN models already achieve good classification accuracy, and thus their outputs can actually be used to update the input data to improve its quality. The GNN models can then be trained on the improved data to achieve better performance. We call this idea self-enhanced GNN and propose two algorithms under this framework, namely topology update (TU) and training node augmentation (TNA).
As GNN models essentially smooth the embeddings of neighboring nodes, inter-class edges harm the performance as they make it difficult to distinguish nodes from different classes. To this end, TU removes inter-class edges and adds intra-class edges according to node labels predicted by a GNN model. Our analysis shows that TU reduces the percentage of inter-class edges in the input graph as long as the performance of the GNN model is good enough. Since the number of labeled nodes is usually small for semi-supervised node classification, TNA enlarges the training set by treating the predicted labels of multiple GNN models as the ground truth. Our analysis shows that jointly considering the predicted labels of multiple diverse GNN models reduces errors in the enlarged training set, and we develop a method to create diversity among multiple GNN models. In addition, we propose techniques such as threshold-based selection, validation-based tuning and class balance to stabilize the performance of TU and TNA. Both TU and TNA are general techniques that can be easily combined with existing GNN models.
We conducted extensive experiments on three well-known GNN models, GCN, GAT and SGC, and seven widely used benchmark datasets. The results show that self-enhanced GNN consistently improves the performance of different GNN models: the reduction in classification error is 16.2% on average and can be up to 35.1%. Detailed profiling shows that TU and TNA indeed improve the input data quality for node classification. Specifically, TU effectively improves an input graph for the task by deleting inter-class edges and adding intra-class edges, while most of the nodes added by TNA are assigned the right label. Based on these results, one interesting future direction is to apply the idea of self-enhanced GNN to other problems where GNNs are also used, such as link prediction and graph classification.
2. Topology Update
Denote a graph as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. There are $n = |V|$ nodes and $m = |E|$ edges in the graph. The ground-truth label of a node $v$ is $y_v$. We define the noise ratio of the graph as

$$\lambda = \frac{|\{(u, v) \in E : y_u \neq y_v\}|}{|E|}. \qquad (1)$$

Noise ratio measures the percentage of inter-class edges (i.e., edges $(u, v)$ with $y_u \neq y_v$) in the graph.
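For concreteness, the definition can be computed directly from an edge list. The sketch below assumes a PyG-style `edge_index` tensor and a `labels` tensor; the function name `noise_ratio` is ours.

```python
import torch

def noise_ratio(edge_index: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of inter-class edges, i.e., edges (u, v) with y_u != y_v.

    edge_index: LongTensor of shape [2, m] (PyG convention); labels: [n].
    """
    src, dst = edge_index
    inter = (labels[src] != labels[dst]).sum().item()
    return inter / edge_index.size(1)
```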
Motivation. In Figure 1, we show the relation between classification accuracy and noise ratio for the CORA dataset, where edge deletion randomly removes inter-class edges from the graph and edge addition randomly adds intra-class edges (i.e., edges $(u, v)$ with $y_u = y_v$) that are not present in the original graph. The results show that the classification accuracy of all three models is higher at lower noise ratio. This is understandable since GNN models are generally low-pass filters that smooth the embeddings of neighboring nodes (NT and Maehara, 2019). As inter-class edges encourage nodes from different classes to have similar embeddings, they make the classification task difficult. Therefore, we make the following assumption.
Assumption 1. Lower noise ratio leads to better classification performance for popular GNNs such as GCN, GAT and SGC.
2.1. Topology Update Algorithms
For Figure 1, we delete/add edges using the ground-truth labels. However, we may not have access to the ground-truth labels in a practical node classification problem. As popular GNN models already provide quite accurate predictions of the true labels, we can use their output for edge editing. Denote a GNN model trained for a node classification problem with $K$ classes as a mapping function $g: V \to [K]$, where $[K]$ is the integer set $\{1, 2, \ldots, K\}$. Edge deletion and edge addition can be conducted using Algorithm 1 and Algorithm 2, respectively.
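Algorithms 1 and 2 are not reproduced in this text. A minimal sketch consistent with their description (delete an edge when the predicted labels of its endpoints differ; add a candidate pair when they agree) might look as follows, assuming PyG-style `[2, m]` edge tensors and a `pred` tensor holding the labels predicted by $g$:

```python
import torch

def edge_deletion(edge_index: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
    """Algorithm 1 sketch: keep an edge only if both endpoints receive the
    same predicted label; edges with differing predicted labels are deleted."""
    src, dst = edge_index
    keep = pred[src] == pred[dst]
    return edge_index[:, keep]

def edge_addition(edge_index: torch.Tensor, pred: torch.Tensor,
                  candidates: torch.Tensor) -> torch.Tensor:
    """Algorithm 2 sketch: among candidate node pairs (shape [2, c]), add
    those whose endpoints share a predicted label."""
    src, dst = candidates
    new_edges = candidates[:, pred[src] == pred[dst]]
    return torch.cat([edge_index, new_edges], dim=1)
```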
In the following, we show that Algorithm 1 and Algorithm 2 reduce the noise ratio of the input graph if the classification accuracy of the GNN model is high enough. We first present some assumptions and definitions that will be used in the analysis.
Assumption 2 (Symmetric Error). The GNN model $g$ has a classification accuracy of $p$ and makes symmetric errors, i.e., for every node $v$, we have $\mathbb{P}[g(v) = y_v] = p$ and $\mathbb{P}[g(v) = c] = \frac{1-p}{K-1}$ for $c \neq y_v$ and $c \in [K]$, where $K$ is the number of classes and $y_v$ is the ground-truth label of node $v$.
Note that symmetric error is a common assumption in the literature (Chen et al., 2019a) and our analysis methodology is not limited to symmetric error. As the GNN model makes random errors (and hence the topology update algorithms also make random errors), we use the expected noise ratio as a replacement for the noise ratio $\lambda$. For the graph after editing, i.e., $G' = (V, E')$, we define the expected noise ratio as $\tilde{\lambda}' = \frac{\mathbb{E}[m'_{\text{inter}}]}{\mathbb{E}[m'_{\text{inter}}] + \mathbb{E}[m'_{\text{intra}}]}$, in which $\mathbb{E}[m'_{\text{inter}}]$ is the expected number of inter-class edges in $G'$ and $\mathbb{E}[m'_{\text{intra}}]$ is the expected number of intra-class edges in $G'$. We can compare the expected noise ratio of $G'$ with the noise ratio $\lambda$ of the original graph $G$.
Theorem 1. Under Assumption 2, the graph $G'$ produced by Algorithm 1 (edge deletion) satisfies $\tilde{\lambda}' < \lambda$ if the classification accuracy $p > 1/K$.

Proof. The probability that an intra-class edge in $G$ is kept in $G'$ by Algorithm 1 is $q_{\text{intra}} = p^2 + \frac{(1-p)^2}{K-1}$. Therefore, $\mathbb{E}[m'_{\text{intra}}] = (1 - \lambda) m \cdot q_{\text{intra}}$, where $m$ is the number of edges in $G$. The probability that an inter-class edge is kept is $q_{\text{inter}} = \frac{2p(1-p)}{K-1} + \frac{(K-2)(1-p)^2}{(K-1)^2}$, and thus $\mathbb{E}[m'_{\text{inter}}] = \lambda m \cdot q_{\text{inter}}$. We have

$$\tilde{\lambda}' = \frac{\lambda \, q_{\text{inter}}}{\lambda \, q_{\text{inter}} + (1 - \lambda) \, q_{\text{intra}}}.$$

Solving $\tilde{\lambda}' < \lambda$ gives $q_{\text{inter}} < q_{\text{intra}}$, which is satisfied when $p > 1/K$. ∎
Theorem 1 shows that edge deletion reduces noise ratio under a mild condition on the classification accuracy of the GNN model, i.e., $p > 1/K$. For example, for a node classification problem with 5 classes, it only requires the classification accuracy $p > 0.2$. To analyze the expected noise ratio of the graph after edge addition, we further assume that the classes are balanced, i.e., each class has $n/K$ nodes.
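As a quick sanity check of this condition, the snippet below evaluates the keep probabilities $q_{\text{intra}}$ and $q_{\text{inter}}$ from the proof of Theorem 1 (as reconstructed above); the helper name `keep_probs` is ours.

```python
def keep_probs(p: float, K: int):
    """Probabilities (under Assumption 2) that both endpoints of an edge get
    the same predicted label, for intra- and inter-class edges respectively."""
    a = (1 - p) / (K - 1)
    q_intra = p ** 2 + (1 - p) * a            # = p^2 + (1-p)^2 / (K-1)
    q_inter = 2 * p * a + (K - 2) * a ** 2
    return q_intra, q_inter

for p in (0.2, 0.4, 0.6, 0.8):
    q_intra, q_inter = keep_probs(p, K=5)
    # q_intra equals q_inter exactly at p = 1/K = 0.2; q_intra > q_inter above it
    print(f"p={p}: q_intra={q_intra:.4f}, q_inter={q_inter:.4f}")
```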
Theorem 2. Under Assumption 2 and with balanced classes, the graph $G'$ produced by Algorithm 2 (edge addition) satisfies $\tilde{\lambda}' < \lambda$ if the classification accuracy $p$ exceeds a bound determined by $\lambda$, $K$, $n$ and $m$, derived in the proof below.

Proof. Denote the expected number of added intra-class edges as $\mathbb{E}[a_{\text{intra}}]$ and the expected number of added inter-class edges as $\mathbb{E}[a_{\text{inter}}]$. To ensure $\tilde{\lambda}' < \lambda$, it suffices to show that $\frac{\mathbb{E}[a_{\text{inter}}]}{\mathbb{E}[a_{\text{inter}}] + \mathbb{E}[a_{\text{intra}}]} < \lambda$. As there are $\frac{K-1}{2K} n^2$ possible inter-class edges and $\frac{n(n-K)}{2K}$ possible intra-class edges in $G$, we have

$$\mathbb{E}[a_{\text{inter}}] = \Big( \tfrac{K-1}{2K} n^2 - \lambda m \Big) q_{\text{inter}}, \qquad \mathbb{E}[a_{\text{intra}}] = \Big( \tfrac{n(n-K)}{2K} - (1 - \lambda) m \Big) q_{\text{intra}},$$

where $q_{\text{inter}}$ and $q_{\text{intra}}$ are the probability of keeping an inter-class edge and an intra-class edge in $G'$, respectively. Their expressions are given in the proof of Theorem 1. The $\lambda m$ and $(1 - \lambda) m$ terms are included to exclude the overlaps between the edges in the original graph and the edges that may be added by Algorithm 2. With Assumption 2, we substitute the expressions of $q_{\text{inter}}$ and $q_{\text{intra}}$ into the inequality above. Solving it for $p$ gives the result. ∎
The bound on $p$ in Theorem 2 is complex to interpret, but it can be approximated and simplified under additional assumptions, e.g., that the overlap terms are small enough to be ignored and that $K$ is very small compared to $n$. Note that the simplified bound imposes a higher requirement on the classification accuracy of $g$ than the $p > 1/K$ condition for edge deletion. Thus, as we will show in the experiments, the performance improvement of edge addition is usually smaller than that of edge deletion.
Theorem 1 and Theorem 2 can be extended to more general assumptions. For example, the symmetric error assumption can be replaced with an error matrix $P \in [0, 1]^{K \times K}$, where $P_{ij}$ is the probability of classifying class $i$ as class $j$. The number of nodes in each class can also be different. The analysis methodology in the proofs can still be applied but the bounds will take more complex forms. In addition, we show in the experiments that edge deletion and addition can be conducted simultaneously.

2.2. Optimizations for TU
For practical topology update, we use the following techniques to improve Algorithm 1 and Algorithm 2.
Threshold-based selection. The GNN model usually outputs a distribution over the classes (e.g., using softmax) rather than a single decision. For a node $v$, we denote its class distribution provided by the model as $\mathbf{s}_v \in [0, 1]^K$ with $\sum_{c=1}^{K} \mathbf{s}_v[c] = 1$. For edge deletion, we first generate a candidate edge set based on the classification labels using Algorithm 1. For each candidate edge $(u, v)$, we calculate the correlation between the class distributions of its endpoints (i.e., $\mathbf{s}_u^\top \mathbf{s}_v$) and select the edges with $\mathbf{s}_u^\top \mathbf{s}_v < \epsilon_1$ for actual deletion, where $\epsilon_1$ is a threshold. For edge addition, we also generate a candidate set using Algorithm 2 first and add only edges with $\mathbf{s}_u^\top \mathbf{s}_v > \epsilon_2$. Moreover, we constrain the number of added edges to be less than 2 times the number of edges in the original graph to avoid making the cost of model training too high (the cost of GNN training is proportional to the number of edges). Threshold-based selection makes Algorithm 1 and Algorithm 2 more conservative, and it also helps to avoid deleting intra-class edges and adding inter-class edges.
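A sketch of this selection step, assuming `probs` holds the softmax class distributions $\mathbf{s}_v$ and `cand` holds the candidate edges produced by Algorithm 1 or Algorithm 2 (both names are ours):

```python
import torch

def select_deletions(cand: torch.Tensor, probs: torch.Tensor, eps1: float):
    """Among deletion candidates from Algorithm 1, actually delete only edges
    whose class distributions have low correlation: s_u . s_v < eps1."""
    src, dst = cand
    corr = (probs[src] * probs[dst]).sum(dim=1)
    return cand[:, corr < eps1]

def select_additions(cand: torch.Tensor, probs: torch.Tensor, eps2: float):
    """Among addition candidates from Algorithm 2, add only edges whose
    class distributions have high correlation: s_u . s_v > eps2."""
    src, dst = cand
    corr = (probs[src] * probs[dst]).sum(dim=1)
    return cand[:, corr > eps2]
```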
Validation-based tuning. We use the validation set to tune the thresholds $\epsilon_1$ and $\epsilon_2$. For each candidate threshold, we use it to make the topology update decisions and generate a new graph $G'$. We then train a GNN model on the updated graph and test its accuracy on the validation set. A number of candidate thresholds are checked and the one that provides the best validation accuracy is adopted. Validation-based tuning allows us to reject topology update (by setting the thresholds such that no edges are deleted or added) when it cannot improve performance, e.g., when the noise ratio of the graph is already very low or the accuracy of the model is not good enough.
Efficiency issue. For edge addition, naively computing the label correlation for all possible node pairs incurs high complexity, especially for large graphs. Therefore, for each node $v$, we only find the top-$k$ nodes (e.g., $k = 2$ or $3$) that have the largest label correlation with $v$ and use them as the candidates for edge addition. This corresponds to the well-known all-pair maximum inner product search problem, for which there are many efficient solutions such as LEMP (Teflioudi and Gemulla, 2016) and FEXIPRO (Li et al., 2017).
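LEMP and FEXIPRO are not shown here; a simple brute-force alternative that computes the correlations in chunks to bound memory is sketched below, under the same assumed `probs` tensor:

```python
import torch

def topk_candidates(probs: torch.Tensor, k: int = 3, chunk: int = 1024):
    """For each node, find the k nodes with the largest label correlation
    (inner product of class distributions), computed in chunks of rows."""
    n = probs.size(0)
    pairs = []
    for start in range(0, n, chunk):
        rows = torch.arange(start, min(start + chunk, n))
        corr = probs[rows] @ probs.t()                  # [len(rows), n]
        corr[torch.arange(rows.numel()), rows] = -1.0   # exclude self-pairs
        top = corr.topk(k, dim=1).indices               # [len(rows), k]
        src = rows.unsqueeze(1).expand_as(top)
        pairs.append(torch.stack([src.reshape(-1), top.reshape(-1)]))
    return torch.cat(pairs, dim=1)                      # [2, n * k] candidates
```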
3. Training Node Augmentation
Motivation. In Figure 2, we examine the influence of the number of training nodes on classification accuracy. The results show that using more training nodes consistently leads to higher classification accuracy for GCN, GAT and SGC. Unfortunately, for the semi-supervised node classification problem, usually only a very small number of labeled nodes are available. To enlarge the training set, an intuitive idea is to train a GNN model to label some nodes and add those nodes to the training set. However, a GNN model usually makes a considerable number of errors in its label predictions, and naively using the predicted labels as ground-truth labels may lead to worse performance.
3.1. Training Node Augmentation Algorithm
For a GNN model $g$ that outputs a distribution $\mathbf{s}_v$ over the classes, we define the confidence ($c_v$) and prediction result ($\hat{y}_v$) of node $v$ as

$$\hat{y}_v = \arg\max_{c \in [K]} \mathbf{s}_v[c], \qquad c_v = \mathbf{s}_v[\hat{y}_v],$$

where $\hat{y}_v$ is the label of $v$ predicted by $g$ and $c_v$ is the likelihood of $\hat{y}_v$. Usually $\hat{y}_v$ is more likely to be correct (i.e., $\hat{y}_v = y_v$) when $c_v$ is large (we show this in Figure 6, Appendix B). Utilizing $c_v$ and $\hat{y}_v$, we present the training node augmentation (TNA) procedure in Algorithm 3, which produces an enlarged training set $\mathcal{T}'$ using the outputs of multiple GNN models. In Algorithm 3, $\mathcal{T}$ and $\mathcal{V}$ denote the original training set and validation set. Before adding a node to $\mathcal{T}'$, we check whether it is already in $\mathcal{T}$ or $\mathcal{V}$ to avoid assigning a new label to nodes in the two sets.
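Algorithm 3 itself is not reproduced in this text. The sketch below illustrates the confidence computation and the selection rule described above for a list of $M$ models; all names are ours and the exact bookkeeping in Algorithm 3 may differ:

```python
import torch
import torch.nn.functional as F

def confidence_and_prediction(logits: torch.Tensor):
    """Per-node prediction y_hat = argmax_c s_v[c] and confidence c_v = s_v[y_hat]."""
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    return conf, pred

def augment_training_set(conf_list, pred_list, eps3, train_mask, val_mask):
    """Mark node v for addition to T' when all models agree on its label,
    every model is confident (c_v > eps3), and v is not already in T or V."""
    agree = torch.ones_like(train_mask)
    for pred in pred_list[1:]:
        agree &= pred == pred_list[0]
    confident = torch.stack(conf_list).min(dim=0).values > eps3
    added = agree & confident & ~train_mask & ~val_mask
    return added, pred_list[0]  # labels of added nodes from the agreed prediction
```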
Algorithm 3 is based on two key ideas. The first is to consider only nodes with a high confidence (i.e., $c_v > \epsilon_3$) as candidates to be added to $\mathcal{T}'$, since GNN models tend to produce more accurate label predictions at higher confidence. Similar to the case of topology update, we tune the value of $\epsilon_3$ based on the accuracy (of the model trained using $\mathcal{T}'$) on the validation set. The second and most important idea is to utilize the diversity of multiple GNN models to reduce the number of errors in $\mathcal{T}'$. With multiple diverse models, even if some classifiers assign a wrong label to node $v$, it will not be added to $\mathcal{T}'$ as long as one classifier gives the right label. In the following, we formalize this intuition with an analysis of the case of using two GNN models $g_1$ and $g_2$, i.e., $M = 2$.
Following Assumption 2, we assume that both $g_1$ and $g_2$ have a classification accuracy of $p$ and make symmetric errors. We also simplify Algorithm 3 and assume that a node $v$ is added to $\mathcal{T}'$ if the two models give the same label (i.e., $g_1(v) = g_2(v)$). Algorithm 3 can be viewed as a special case of this simplified algorithm with the additional confidence constraint, as it only adds high-confidence nodes. The accuracy of $\mathcal{T}'$ is defined as the probability that a node added to $\mathcal{T}'$ receives its ground-truth label. We are interested in the relation between $p_1$ and $p_2$, which are the accuracies of $\mathcal{T}'$ when using one model and two models for TNA, respectively. As the two models are trained on the same graph structure, it is unrealistic to assume that they are independent. Therefore, we make the following assumption on how they correlate.
Assumption 3 (Model Correlation). The correlation between the two GNN models $g_1$ and $g_2$ can be formulated as follows:

$$\mathbb{P}[g_2(v) = c \mid g_1(v) = c'] = e \cdot \mathbb{1}[c = c'] + (1 - e) \cdot \mathbb{P}[g_2(v) = c],$$

where $c, c' \in [K]$, $\mathbb{P}[g_2(v) = c]$ follows the symmetric error model in Assumption 2, and $e \leq 1$ is a correlation parameter. We also assume that $e \geq 0$ as the two models should be positively correlated.
Theorem 3 (Train Set Accuracy). Under Assumption 3 and assuming $p > 1/K$, we have the following results on the accuracy of $\mathcal{T}'$:

(1) $p_2 \geq p_1$;

(2) $p_2$ is maximized when $e = 0$, in which case the two models $g_1$ and $g_2$ are independent.
Proof. Under Assumption 3 and the symmetric error model, the probability that both models give the right label can be expressed as

$$\mathbb{P}[g_1(v) = g_2(v) = y_v] = p \left( e + (1 - e) p \right),$$

and the probability that the two models agree on a wrong label is

$$\mathbb{P}[g_1(v) = g_2(v) \neq y_v] = (1 - p) \left( e + (1 - e) \tfrac{1 - p}{K - 1} \right).$$

Note that this parameterization is consistent with $g_2$ having a classification accuracy of $p$. We can thus express $p_2$ as

$$p_2 = \frac{p \left( e + (1 - e) p \right)}{p \left( e + (1 - e) p \right) + (1 - p) \left( e + (1 - e) \frac{1 - p}{K - 1} \right)}.$$

It can be verified that $p_2 \geq p_1 = p$ when $p > 1/K$ and $0 \leq e \leq 1$, which proves the first part of the theorem. For the second part of the theorem, differentiating $p_2$ with respect to $e$ shows that $p_2$ is a decreasing function of $e$ when $p > 1/K$. As $e \geq 0$, $p_2$ is maximized when $e = 0$. In this case, $\mathbb{P}[g_2(v) = c \mid g_1(v) = c'] = \mathbb{P}[g_2(v) = c]$ does not depend on the prediction of $g_1$, which means that the two models are independent. ∎
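To make the theorem concrete, the snippet below evaluates the reconstructed expression for $p_2$; e.g., with $p = 0.7$ and $K = 6$, two independent models ($e = 0$) raise the accuracy of $\mathcal{T}'$ from 0.7 to about 0.96:

```python
def train_set_accuracy(p: float, K: int, e: float) -> float:
    """p2: probability that a node added on the agreement of two models has
    the right label, under the correlation model reconstructed above."""
    a = (1 - p) / (K - 1)
    both_right = p * (e + (1 - e) * p)
    agree_wrong = (1 - p) * (e + (1 - e) * a)
    return both_right / (both_right + agree_wrong)

print(train_set_accuracy(0.7, 6, e=0.0))  # ~0.96: independent models filter most errors
print(train_set_accuracy(0.7, 6, e=1.0))  # 0.70: fully correlated models give no benefit
```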
3.2. Optimizations for TNA
Creating diversity in GNN models. A straightforward method to generate multiple different GNN models is random initialization, which trains the same model with different parameter initializations. We show the number of errors (i.e., nodes with wrong labels) in $\mathcal{T}'$ using random initialization under different thresholds $\epsilon_3$ (adjusting $\epsilon_3$ controls the number of added nodes) in Figure 3. The results show that with 2 models, random initialization does not significantly outperform a single model. We conducted detailed profiling and found that this is because the two models lack diversity. For example, two randomly initialized models provide the same label prediction for 2,900 nodes (out of a total of 3,327 nodes) on the CiteSeer dataset and the prediction accuracy on these agreed nodes is 71.9%. We found that this phenomenon is consistent across different GNN models and datasets. It has been observed that GNN models resemble the label propagation algorithm in some sense (Wang and Leskovec, 2020), and the results of label propagation are entirely determined by the graph structure and the labeled nodes. Therefore, two GNN models trained with different random initializations tend to produce the same label predictions because they use the same graph structure and training set.
Motivated by this finding, we propose to generate multiple GNN models with better diversity using train set swapping, which randomly repartitions the visible set (training and validation set, i.e., $\mathcal{T} \cup \mathcal{V}$) for each model. Train set swapping first unites the original training set $\mathcal{T}$ and validation set $\mathcal{V}$. Then $|\mathcal{T}|$ nodes in the visible set are randomly selected as the training set for a model and the remaining samples go to the validation set. The motivation is to use a different training set for each GNN model for better diversity. We plot the errors in the $\mathcal{T}'$ produced by train set swapping in Figure 3. The results show that train set swapping generates significantly fewer errors than random initialization when adding the same number of nodes. This is because the models have better diversity than with random initialization: they agree on the label prediction of only 2,230 nodes on the CiteSeer dataset, and the prediction accuracy on the agreed nodes is 85.4%, which is significantly higher than the 71.9% accuracy for random initialization.
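A minimal sketch of train set swapping, assuming index tensors for the original split (the function name is ours):

```python
import torch

def swap_train_val(train_idx: torch.Tensor, val_idx: torch.Tensor,
                   generator: torch.Generator = None):
    """Randomly repartition the visible set (train + val) into a new
    train/val split of the same sizes; call once per model."""
    visible = torch.cat([train_idx, val_idx])
    perm = visible[torch.randperm(visible.numel(), generator=generator)]
    return perm[:train_idx.numel()], perm[train_idx.numel():]
```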
Class balance. A trick that is crucial for the performance of TNA is ensuring that each class has a similar number of nodes in the enlarged training set $\mathcal{T}'$. We observed that different classes can have very different numbers of nodes. For example, for the Coauthor CS dataset, the number of nodes in the largest class is 4.78x that of the smallest class. If we assume that every node has the same probability of being added to $\mathcal{T}'$, the large classes can have significantly more training samples than the small classes. We found that TNA can even degrade accuracy (compared to without TNA) in this case. We conjecture that this is because an unbalanced training set encourages the GNN model to label nodes as coming from the large classes, which does not generalize. Therefore, we constrain each class to have the same number of nodes in $\mathcal{T}'$. If the number of nodes to be added to $\mathcal{T}'$ for a class is larger than that for the smallest class, we add only the nodes with the largest confidence for this class.
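A sketch of this balancing step, under the same assumed names as before:

```python
import torch

def balance_classes(added_idx: torch.Tensor, pred: torch.Tensor,
                    conf: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Cap every class at the size of the smallest class among the preselected
    nodes, keeping the highest-confidence nodes within each class."""
    per_class = [added_idx[pred[added_idx] == c] for c in range(num_classes)]
    quota = min(idx.numel() for idx in per_class)
    kept = []
    for idx in per_class:
        order = conf[idx].argsort(descending=True)
        kept.append(idx[order[:quota]])
    return torch.cat(kept)
```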
Table 1. Overall classification accuracy (%) of self-enhanced GNN (SEG) and the error reduction over the base models.

Model  CORA  CiteSeer  PubMed  Coauthor CS  Coauthor Physics  Amazon Computers  Amazon Photo
GCN  78.7  66.5  75.5  90.7  93.1  71.9  85.2 
GCN+SEG  82.3  71.1  80.0  92.9  93.9  80.2  90.4 
Error Reduction  16.9%  13.7%  18.4%  23.7%  11.6%  29.5%  35.1% 
GAT  79.0  65.7  75.3  89.9  92.0  82.2  89.6 
GAT+SEG  81.4  70.0  78.9  91.6  93.5  83.7  90.8
Error Reduction  11.4%  12.3%  14.6%  16.8%  18.8%  8.4%  11.5% 
SGC  77.4  65.0  73.3  91.3  93.3  81.1  89.3 
SGC+SEG  82.2  70.2  78.1  93.1  94.1  82.8  89.9
Error Reduction  21.2%  14.9%  8.5%  16.8%  18.8%  9.0%  7.7% 
4. Experimental Results
Settings. The experiments were conducted on seven widely used benchmark datasets for node classification. Due to space limitations, we give the statistics of the datasets in Table 6, Appendix A.1. We evaluated the performance of topology update (TU) and training node augmentation (TNA) on three well-known GNN models, i.e., GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and SGC (Wu et al., 2019). We configured all three models to have two layers because GNN models usually perform best with two layers due to over-smoothing (Oono and Suzuki, 2019), and increasing the number of layers also increases the computation cost exponentially due to neighbor propagation. All weights of the models were initialized according to Glorot and Bengio (2010) and all biases were initialized as zeros. The models were trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.01. For both TU and TNA, we used a grid search to tune their parameters (i.e., the thresholds $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$) on the validation set. The detailed settings of other hyper-parameters can be found in Appendix A.2.
We followed the evaluation protocol proposed by Shchur et al. (2018) and recorded the average classification accuracy and standard deviation over 10 different dataset splits. For each split, 20 and 30 nodes from each class were randomly sampled as the training set and validation set, respectively, and the other nodes were used as the test set. Under each split, we ran 10 random initializations of the model parameters and used the average accuracy of the 10 initializations as the performance of this split. The motivation of this evaluation protocol is to exclude the influence of the randomness in the data split on performance, which was found to be significant.
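A sketch of one such split, following the per-class 20/30/rest protocol (the function name is ours):

```python
import torch

def random_split(labels: torch.Tensor, num_classes: int,
                 n_train: int = 20, n_val: int = 30,
                 generator: torch.Generator = None):
    """One dataset split under the protocol of Shchur et al. (2018): 20 train
    and 30 validation nodes per class, all remaining nodes used for testing."""
    train, val = [], []
    for c in range(num_classes):
        idx = (labels == c).nonzero(as_tuple=True)[0]
        idx = idx[torch.randperm(idx.numel(), generator=generator)]
        train.append(idx[:n_train])
        val.append(idx[n_train:n_train + n_val])
    train, val = torch.cat(train), torch.cat(val)
    rest = torch.ones(labels.numel(), dtype=torch.bool)
    rest[train] = False
    rest[val] = False
    test = rest.nonzero(as_tuple=True)[0]
    return train, val, test
```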
4.1. Overall Performance Results
We first present the overall performance results of self-enhanced GNN (abbreviated as SEG) in Table 1. The reported performance of SEG is the best performance obtained using TU, TNA, or (TU + TNA). In practice, we may choose among TU, TNA, and (TU + TNA) based on their prediction accuracy on the validation set. The results in Table 1 show that SEG consistently improves the performance of the 3 GNN models on the 7 datasets, where the reduction in classification error is 16.2% on average and can be as high as 35.1%. The result is particularly significant because it shows that SEG is an effective, general framework that improves the performance of well-known models that are already recognized to be effective.
In the subsequent subsections, we analyze the performance of TU and TNA individually, as well as examine how they influence data quality.
4.2. Results for Topology Update
Table 2. Classification accuracy (%) and error reduction for topology update (Delete: edge deletion; Add: edge addition; Modify: edge deletion and addition).

Model  CORA  CiteSeer  PubMed  Coauthor CS  Coauthor Physics  Amazon Computers  Amazon Photo
GCN+Delete  79.2  66.5  75.6  91.8  93.2  80.1  89.0 
Error Reduction  2.3%  0.0%  0.4%  11.8%  1.4%  29.2%  25.7% 
GAT+Delete  79.3  65.8  75.3  90.9  92.2  82.8  90.3 
Error Reduction  1.4%  0.3%  0.0%  9.9%  2.5%  3.4%  4.9% 
SGC+Delete  77.8  65.5  73.6  92.6  93.5  82.0  89.6 
Error Reduction  1.8%  1.4%  1.1%  14.9%  3.0%  4.8%  2.8% 
GCN+Add  78.8  66.8  75.6  90.7  93.2  78.9  88.2 
Error Reduction  0.5%  0.9%  0.4%  0.0%  1.4%  24.9%  20.3% 
GAT+Add  79.1  65.7  75.7  90.0  92.1  82.6  89.7 
Error Reduction  0.5%  0.0%  1.6%  1.0%  1.2%  2.2%  1.0% 
SGC+Add  77.5  65.7  73.8  91.5  93.5  81.6  89.4 
Error Reduction  0.4%  2.0%  1.9%  2.3%  3.0%  1.6%  0.9% 
GCN+Modify  79.4  67.1  75.9  91.7  93.4  79.2  88.5 
Error Reduction  3.3%  1.8%  1.6%  10.8%  4.3%  26.0%  22.3% 
GAT+Modify  79.1  65.8  76.0  90.7  92.1  82.4  90.1 
Error Reduction  0.5%  0.3%  2.8%  7.9%  1.2%  1.1%  4.8% 
SGC+Modify  78.5  66.7  74.0  92.7  93.5  81.7  89.4 
Error Reduction  4.9%  4.9%  2.6%  16.1%  3.0%  2.1%  0.9% 
The performance results of TU are reported in Table 2. To control the complexity of parameter search, we constrained the number of added edges to be the same as the number of deleted edges for Modify. The following observations can be made from the results in Table 2.
First, TU improves the performance of GCN, GAT and SGC in most cases and the improvement is significant in some cases. For example, the error reduction is over 25% for GCN on the Amazon Photo dataset. The error reduction is zero in 4 out of the 63 cases because threshold tuning (for $\epsilon_1$ and $\epsilon_2$) on the validation set rejects TU when it cannot improve the performance. Thus, even in the worst case, TU does not degrade the performance of the base models.
Second, edge deletion generally achieves greater performance improvements than edge addition. This is because there is a large number of possible inter-class edges (e.g., $\frac{K-1}{2K} n^2$ when the classes are balanced). Even if the probability of adding an inter-class edge is small (the same as the probability of keeping an inter-class edge in $G'$ in edge deletion), the algorithm may still add a considerable number of inter-class edges in expectation.
Third, the performance improvement of TU is relatively smaller for CiteSeer and PubMed than for the other datasets, which can be explained as follows. The accuracy of GCN, GAT and SGC on CiteSeer and PubMed is considerably lower than on the other datasets. As a result, the TU algorithms are also more likely to make wrong decisions (i.e., deleting intra-class edges or adding inter-class edges) since TU decisions are guided by the model predictions. Motivated by this observation, we experimented with a dual-model edge deletion/addition algorithm on CiteSeer, which uses the intersection of the edge deletion/addition decisions of two GNN models. The intuition is similar to the idea of TNA, which utilizes the diversity of different GNN models to reduce errors. The dual-model algorithm improves the error reduction of single-model edge deletion/addition from 0.0% and 0.9% to 0.6% and 2.1%, respectively.
Fourth, although GCN, GAT and SGC have high accuracy for both Coauthor CS and Coauthor Physics, TU has considerably greater performance improvements on Coauthor CS than on Coauthor Physics. This is because the noise ratio of the original Coauthor Physics graph is much lower than the Coauthor CS graph (6.85% vs. 19.20%), and thus reducing noise ratio has smaller influence on the performance for Coauthor Physics.
Table 3. TU decisions on the CORA dataset: (correct, wrong) decision counts and the noise ratio of the graph after the update.

Model  Edge Deletion (correct, wrong)  Noise Ratio  Edge Addition (correct, wrong)  Noise Ratio
GCN  (332, 218)  14.19%  (4692, 85)  10.82% 
GAT  (309, 212)  14.59%  (5995, 165)  10.21% 
SGC  (242, 116)  15.47%  (3807, 25)  11.28% 
Finally, TU generally achieves greater performance improvements for GCN and SGC than for GAT. We plot the distributions of the attention weights of GAT on the edges that are deleted and kept by Algorithm 1 in Figure 4. The results show that the deleted edges have significantly smaller attention weights than the kept edges. As we mainly delete inter-class edges, the results suggest that GAT can prevent the inter-class edges from smoothing the embeddings of nodes from different classes by assigning them small weights. This explains why GAT is less sensitive to changes in noise ratio. However, GAT cannot truly set the weights of the inter-class edges to 0 as it uses the softmax function to compute the attention weights. In contrast, Algorithm 1 can completely remove inter-class edges and can thus further improve the performance of GAT in most cases.
Table 4. Classification accuracy (%) and error reduction for training node augmentation.

Model  CORA  CiteSeer  PubMed  Coauthor CS  Coauthor Physics  Amazon Computers  Amazon Photo
GCN+TNA  82.1  70.6  80.0  91.8  93.7  80.2  89.5 
Error Reduction  16.0%  12.2%  18.4%  11.8%  8.7%  29.5%  29.1% 
GAT+TNA  81.4  70.0  78.9  91.1  93.4  82.7  90.8 
Error Reduction  11.4%  12.3%  14.6%  11.9%  5.7%  2.8%  11.5% 
SGC+TNA  82.2  70.2  73.3  92.0  93.9  82.8  89.9 
Error Reduction  21.2%  14.9%  0.0%  8.0%  9.0%  9.0%  5.6% 
We also examined the edge deletion and addition decisions made by TU in Table 3. For both edge deletion and addition, we report the number of correct decisions (i.e., removing inter-class edges for deletion and adding intra-class edges for addition) and wrong decisions (i.e., removing intra-class edges for deletion and adding inter-class edges for addition), as well as the noise ratio of the CORA graph after TU. The results show that TU effectively reduces the noise ratio. Most of the added edges are intra-class edges and only a few are inter-class edges. Edge deletion effectively removes inter-class edges, but a considerable number of intra-class edges are also removed. This is because there are many more intra-class edges in the graph than inter-class edges, and thus the expected number of removed intra-class edges may not be small even if the probability of removing an intra-class edge is small.
4.3. Results for Training Node Augmentation
Table 5. The number of nodes added by TNA and the errors in the added nodes.

Model  GCN  GAT  SGC
# Added Nodes  826  714  637 
# Errors  83  72  45 
Error Ratio  10.05%  10.08%  7.06% 
We present the performance results of TNA in Table 4, which show that TNA improves the performance of GCN, GAT and SGC in 20 out of the 21 cases. The performance improvements are significant in many cases, e.g., 29.1% for GCN on the Amazon Photo dataset. The performance improvements are large on CORA and CiteSeer for all three GNN models. We conjecture that this is because the two datasets are relatively small and thus adding more training samples has a large impact on performance. To explain the good performance of TNA, we examined the number of added nodes and the errors in the added nodes in Table 5. The results show that most of the added nodes are assigned the correct label. Compared with GAT and GCN, fewer nodes are added for SGC and the error ratio is also lower. This may be because the model of SGC is simpler than GAT and GCN (without non-linearity) and thus SGC is more sensitive to noise in the training samples.
We examined the two important designs in TNA, i.e., class balance and multi-model diversity. We experimented with a version of TNA without class balance for GCN on the Amazon Photo dataset, which records a classification accuracy of 86.67%. In contrast, the classification accuracy with class balance is 89.54%, as reported in Table 4. We plot in Figure 4(a) the class distribution of the nodes added by TNA without class balance, which shows that the number of nodes in the largest class is 11.6 times that of the smallest class. The results show that without class balance, the enlarged training set can be highly skewed.
To demonstrate the benefits of using the diversity of multiple models in TNA, we report the relation between the test accuracy and the number of models (used for node selection) on the CiteSeer dataset in Figure 4(b). The results show that using 2 models provides a significant improvement in classification accuracy over 1 model, but the improvement drops when using more models. This is because it is difficult for more models to agree with each other, and thus a low confidence threshold $\epsilon_3$ needs to be used to add a sufficient number of nodes. However, a low confidence threshold means that the added nodes are more likely to contain errors.
5. Related Work
GNN models. Many GNN models have been proposed in recent years, including GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), SGC (Wu et al., 2019), GraphSAGE (Hamilton et al., 2017), Geom-GCN (Pei et al., 2020), GGNN (Li et al., 2015), JKNet (Xu et al., 2018), ChebNet (Defferrard et al., 2016), Highway GNN (Zilly et al., 2017) and MoNet (Monti et al., 2017). These works focus on improving the performance of a task, e.g., the prediction accuracy of node classification, compared with prior methods. In contrast, our method, self-enhanced GNN, aims to improve the quality of the input data. By providing data of higher quality, self-enhanced GNN offers a general framework that can be easily applied to existing GNN models to further improve their performance.
To the best of our knowledge, our work is most related to Chen et al. (2019b) and Li et al. (2018). Chen et al. (2019b) observed that the performance of GNN models usually degrades when using more than 2 layers due to local smoothing, and proposed to remove/add edges in a graph to mitigate the over-smoothing problem of GNN models. In contrast, we come from the perspective of data quality and observe that lower noise ratio leads to higher classification accuracy. In addition, we provide theoretical analysis to show that adding/removing edges can reduce noise ratio if the performance of a model is good enough. The idea of enlarging the training set with co-training and self-training was proposed in (Li et al., 2018), which corresponds to the single-model case of our training node augmentation algorithm. However, as we have shown in our analysis and profiling results in Section 3.2, using the diversity of multiple models and explicitly balancing the classes in the training set are crucial for performance. In fact, the results reported in (Li et al., 2018) also show that the performance of GNNs (e.g., GCN) actually degrades in many cases when applying co-training and self-training with a single model.
Noisy label training. Self-enhanced GNN is partly motivated by noisy label training, which aims to learn good models from data with noisy labels, i.e., where a large number of training samples come with wrong labels. Representative works along this line include Decoupling (Malach and Shalev-Shwartz, 2017), MentorNet (Jiang et al., 2017), Noisy Cross-Validation (Chen et al., 2019a) and Co-teaching (Han et al., 2018). These works focus on how to select samples with possibly correct labels from a noisy dataset to conduct model training, and our multi-model sample selection method is motivated by them. However, as GNNs work on graph data, self-enhanced GNN handles not only noise in labels but also noise in the graph structure (i.e., inter-class edges) with topology update. Given the excellent performance of GNNs on graph data, a potential direction is to apply self-enhanced GNN to noisy label training. Under the assumption that samples with similar features are likely to share the same label, a similarity graph (e.g., a nearest-neighbor graph based on image descriptors) can be constructed on a noisy dataset, and noisy label training can then be modeled as a semi-supervised node classification problem on graphs.
6. Conclusions
We presented self-enhanced GNN. The main idea is to improve the quality of the input data using the outputs of existing GNN models, so that the proposed method can be used as a general framework to improve the performance of different existing GNN models. Two algorithms were developed under this idea: topology update, which deletes/adds edges to remove inter-class edges and add potential intra-class edges in an input graph, and training node augmentation, which enlarges the training set by adding nodes with high classification confidence. Theoretical analyses were provided to motivate the algorithm designs and a comprehensive experimental evaluation was conducted to validate the performance of the algorithms. The results show that self-enhanced GNN is an effective general framework that consistently improves the performance of different GNN models on a broad set of datasets.
References
Chen et al. (2019b) Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2019b. Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View. arXiv preprint arXiv:1909.03211 (2019).
Chen et al. (2019a) Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. 2019a. Understanding and Utilizing Deep Neural Networks Trained with Noisy Labels. In International Conference on Machine Learning. 1062–1070.
Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems. 3844–3852.
Fey and Lenssen (2019) Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 9). PMLR, 249–256. http://proceedings.mlr.press/v9/glorot10a.html
Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 855–864.
Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems. 1024–1034.
Han et al. (2018) Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. In Advances in Neural Information Processing Systems. 8527–8537.
Jiang et al. (2017) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2017. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. arXiv preprint arXiv:1712.05055 (2017).
Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations (ICLR).
Li et al. (2017) Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis. 2017. FEXIPRO: Fast and Exact Inner Product Retrieval in Recommender Systems. In Proceedings of the 2017 ACM International Conference on Management of Data. 835–850.
Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated Graph Sequence Neural Networks. arXiv preprint arXiv:1511.05493 (2015).
Malach and Shalev-Shwartz (2017) Eran Malach and Shai Shalev-Shwartz. 2017. Decoupling "When to Update" from "How to Update". In Advances in Neural Information Processing Systems. 960–970.
McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. 43–52.
Monti et al. (2017) Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. 2017. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5115–5124.
NT and Maehara (2019) Hoang NT and Takanori Maehara. 2019. Revisiting Graph Neural Networks: All We Have is Low-Pass Filters. arXiv:stat.ML/1905.09550.
Oono and Suzuki (2019) Kenta Oono and Taiji Suzuki. 2019. Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. arXiv:cs.LG/1905.10947.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Pei et al. (2020) Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. 2020. Geom-GCN: Geometric Graph Convolutional Networks. In International Conference on Learning Representations. https://openreview.net/forum?id=S1e2agrFvS
Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
Shchur et al. (2018) Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of Graph Neural Network Evaluation. Relational Representation Learning Workshop, NeurIPS 2018 (2018).
Teflioudi and Gemulla (2016) Christina Teflioudi and Rainer Gemulla. 2016. Exact and Approximate Maximum Inner Product Search with LEMP. ACM Transactions on Database Systems (TODS) 42, 1 (2016), 1–49.
Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations. https://openreview.net/forum?id=rJXMpikCZ
Wang and Leskovec (2020) Hongwei Wang and Jure Leskovec. 2020. Unifying Graph Convolutional Neural Networks and Label Propagation. https://openreview.net/forum?id=rkgdYhVtvH
Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 6861–6871.
Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. arXiv preprint arXiv:1806.03536 (2018).
Yang et al. (2016) Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. arXiv preprint arXiv:1603.08861 (2016).
Zhou et al. (2018) Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2018. Graph Neural Networks: A Review of Methods and Applications. arXiv preprint arXiv:1812.08434 (2018).
Zilly et al. (2017) Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2017. Recurrent Highway Networks. In Proceedings of the 34th International Conference on Machine Learning. 4189–4198.
Appendix A Details of Experimental Evaluation
All the code of this work is released via the following anonymous link: https://gofile.io/?c=h0S6ya, and will be open-sourced later. The datasets used in the experiments have been widely used for the evaluation of GNN models and are all publicly available.
A.1. Models and Datasets
We evaluated our methods on 3 popular GNN models, i.e., GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and SGC (Wu et al., 2019), and on 7 datasets. Among them, CORA, CiteSeer and PubMed are 3 well-known citation networks, for which we used the version provided by Yang et al. (2016). Amazon Computers and Amazon Photo are derived from the Amazon co-purchase graph in McAuley et al. (2015). Coauthor CS and Coauthor Physics are obtained from the Microsoft Academic Graph for the KDD Cup 2016 challenge (https://www.kdd.org/kdd-cup/view/kdd-cup-2016). For these 4 datasets, we used the version preprocessed by Shchur et al. (2018). The statistics of the datasets are summarized in Table 6, where $\lambda$ is the noise ratio defined in Section 2.
Table 6. Dataset statistics ($\lambda$ is the noise ratio defined in Section 2).

Dataset  Classes  Features  Nodes  Edges  $\lambda$
CORA  7  1,433  2,485  5,069  0.19 
CiteSeer  6  3,703  2,110  3,668  0.26 
PubMed  3  500  19,717  44,324  0.19 
Coauthor CS  15  6,805  18,333  81,894  0.19 
Coauthor Physics  5  8,415  34,493  247,962  0.06 
Amazon Computers  10  767  13,381  245,778  0.22 
Amazon Photo  8  745  7,487  119,043  0.17 
A.2. Implementation Details
Evaluation protocol. To eliminate the influence of random factors and ensure that the performance comparison is fair, we adopted the evaluation protocol provided by Shchur et al. (2018). A 20/30/rest per-class split into train/validation/test sets was used for all the datasets. In the experiments, we evaluated each model on 10 randomly generated dataset splits, and under each split, we ran the model 10 times using different random seeds. We report the mean value and standard deviation of the test accuracies across the 100 runs for each model on each dataset. For the experiments comparing self-enhanced GNN with the base GNN models (i.e., GCN, GAT and SGC), all model implementations and evaluation settings were kept fixed and identical.
Structure of the base models. Our GCN model implementation has 2 GCN convolutional layers with a hidden size of 16 and ReLU activation. A dropout layer with a dropout rate of 0.5 is used after the first GCN layer. Our GAT model implementation has 2 GAT layers with an attention coefficient dropout probability of 0.6. The first layer is an 8-head attention layer with a hidden size of 8 per head. The second layer has a hidden size equal to the number of classes. The activation function is ELU. Two dropout layers with a dropout rate of 0.6 are used between the input layer and the first GAT layer, and between the first GAT layer and the second GAT layer. Our SGC model implementation has an SGC convolutional layer with 2 hops (equivalent to 2 SGC layers according to the SGC definition).

Model training. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.01 and $L_2$ regularization. We did not use learning rate decay or early stopping. As the difficulty of model training varies across datasets, we used a different number of training epochs for each dataset, i.e., CORA 400 epochs, CiteSeer 400 epochs, PubMed 400 epochs, Amazon Computers 1000 epochs, Amazon Photo 2000 epochs, Coauthor CS 2000 epochs and Coauthor Physics 400 epochs.
Software. All models and algorithms in the experiments are implemented on PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey and Lenssen, 2019). The software versions are python=3.6.9, torch=1.2.0, CUDA=10.2.89, pytorch_geometric=1.3.2.
Topology update. For Delete, before edge deletion, we remove all self-loop edges in the original graph. The edges are then deleted according to Algorithm 1 with the threshold $\epsilon_1$. After edge deletion, we add back the removed self-loop edges. For Add, we constrain the number of added edges to be less than 4 times the number of edges in the original graph; this threshold is used to decide the number of candidate edges for addition. We take the top candidate edges among the potential edges according to the label correlation (i.e., $\mathbf{s}_u^\top \mathbf{s}_v$). After filtering out the edges already in the graph, we add new edges using Algorithm 2. For Modify, we constrain the total number of added edges to be the same as the number of deleted edges because tuning the parameters for edge deletion and addition jointly would result in high complexity. This constraint also helps preserve the graph topology to some degree by not changing the structure too much. We conduct edge deletion first, and then add the same number of edges as were deleted. We ensure that deleted edges will not be added back.
Training node augmentation. For training node augmentation, we use two models trained with swapped training and validation sets to label the nodes in the test set. Only the nodes that receive the same label prediction from both models can be added to the augmented training set. A confidence threshold $\epsilon_3$ is used to control the number of preselected nodes for addition. We count the number of nodes from each class among the preselected nodes and find the class with the minimum number of preselected nodes. This number is used to cap the number of added nodes for all classes (i.e., the class balance trick) to avoid introducing additional biases.
Joint use of TU and TNA. For experiments that jointly use topology update and training node augmentation, we applied the two techniques independently and used the thresholds selected by each algorithm individually to avoid the high complexity of joint parameter tuning. We considered three configurations, i.e., TU only, TNA only, and TU + TNA (where disabling an algorithm amounts to setting its thresholds to the rejecting values), and selected the best configuration using the validation accuracy. The reported result is the test accuracy of the selected configuration. Therefore, our framework still has the potential to perform even better if more fine-grained tuning of the threshold parameters is conducted. All the thresholds mentioned above are determined solely by the classification accuracy on the validation set.
Appendix B Additional Experimental Results
Relation between confidence and classification accuracy. In Algorithm 3, we only add nodes with a high confidence into the enlarged training set $\mathcal{T}'$. In Figure 6, we plot the relation between confidence and classification accuracy. The results show that the model is more likely to give the right label prediction at high confidence.
Relation between label correlation and label alignment. Recall that the label correlation between a pair of nodes $u$ and $v$ is defined as $\mathbf{s}_u^\top \mathbf{s}_v$, in which $\mathbf{s}_v$ is the class distribution for node $v$ predicted by a model. For topology update, we delete edges with small label correlation and add edges with large label correlation. In Figure 7, we plot the relation between label correlation and the probability that a pair of nodes have the same label (called label alignment). The results show that a pair of nodes is more likely to be in the same class at higher label correlation.