Self-Enhanced_GNN
Self-Enhanced GNN: Improving Graph Neural Networks Using Model Outputs (IJCNN2021)
Graph neural networks (GNNs) have received much attention recently because of their excellent performance on graph-based tasks. However, existing research on GNNs focuses on designing more effective models without paying much attention to the quality of the input data itself. In this paper, we propose self-enhanced GNN, which improves the quality of the input data using the outputs of existing GNN models for better performance on semi-supervised node classification. As graph data consist of both topology and node labels, we improve input data quality from both perspectives. For topology, we observe that higher classification accuracy can be achieved when the ratio of inter-class edges (connecting nodes from different classes) is low and propose topology update to remove inter-class edges and add intra-class edges. For node labels, we propose training node augmentation, which enlarges the training set using the labels predicted by existing GNN models. As self-enhanced GNN improves the quality of the input graph data, it is general and can be easily combined with existing GNN models. Experimental results on three well-known GNN models and seven popular datasets show that self-enhanced GNN consistently improves the performance of the three models. The reduction in classification error is 16.2% on average and can be as high as 35.1%.
Graph data are ubiquitous today, e.g., friendship graphs in social networks, phone call or message graphs in telecommunication, user-item interaction graphs in recommender systems, and protein-protein interaction graphs in biology. For graph-based tasks such as node classification, link prediction and graph classification, graph neural networks (GNNs) achieve excellent performance thanks to their ability to utilize both graph structure and feature information (on nodes or edges). Most GNN models can be formulated under the message passing framework, in which each node passes messages to its neighbors in the graph and aggregates messages from the neighbors to update its own embedding.
Different attempts have been made to design algorithms and models for graph analytics. Among random walk based methods, DeepWalk (Perozzi et al., 2014) uses random walk paths as the input to a skip-gram model to learn node embeddings, while node2vec (Grover and Leskovec, 2016) learns node embeddings by combining breadth-first random walk and depth-first random walk. Motivated by graph spectral theory, graph convolutional network (GCN) (Kipf and Welling, 2017) conducts graph convolution using the adjacency matrix to avoid the high complexity of spectral decomposition. Instead of using the adjacency matrix to derive the weights for message aggregation, graph attention network (GAT) (Veličković et al., 2018) uses an attention module to learn the weights from data. Simple graph convolution network (SGC) (Wu et al., 2019) proposes to remove the non-linearity in GCN as it observes that the good performance of GCN mainly comes from local averaging rather than non-linearity. There are also many other GNN models such as GraphSAGE (Hamilton et al., 2017), jumping knowledge network (JK-Net) (Xu et al., 2018), geometric graph convolutional network (Geom-GCN) (Pei et al., 2020), and gated graph neural network (GGNN) (Li et al., 2015); we refer readers to the comprehensive survey by Zhou et al. (2018).
In this paper, we focus on the problem of semi-supervised node classification, which is also what most GNN models are designed for. We observed that most existing research attempts to design more effective GNN models, while the quality of the input data has not received much attention. However, data quality and model quality can be equally important for good performance. (Here, data quality is problem-specific: given a GNN model and a specific problem, high data quality means that the GNN model produces good output for the problem on the input data. In this paper, we discuss data quality with respect to the node classification problem.) For example, if the input graph contains only intra-class edges (i.e., edges connecting nodes from the same class) and no inter-class edges (i.e., edges connecting nodes from different classes), node classification can achieve perfect accuracy with only one training sample from each connected component. Moreover, classification tasks are usually easier with more training samples.
At first glance, data quality (i.e., the quality of the input graph structure and training nodes) is the fixed problem input and cannot be improved. However, we observed that existing GNN models already achieve good classification accuracy, and thus their outputs can actually be used to update the input data to improve its quality. Then, the GNN models can be trained on the improved data to achieve better performance. We call this idea self-enhanced GNN and propose two algorithms under this framework, namely topology update (TU) and training node augmentation (TNA).
As GNN models essentially smooth the embeddings of neighboring nodes, inter-class edges harm the performance as they make it difficult to distinguish nodes from different classes. To this end, TU removes inter-class edges and adds intra-class edges according to node labels predicted by a GNN model. Our analysis shows that TU reduces the percentage of inter-class edges in the input graph as long as the performance of the GNN model is good enough. Since the number of labeled nodes is usually small for semi-supervised node classification, TNA enlarges the training set by treating the predicted labels of multiple GNN models as the ground truth. We show by analysis that jointly considering the predicted labels of multiple diverse GNN models reduces errors in the enlarged training set. We also develop a method to create diversity among multiple GNN models. In addition, we propose techniques such as threshold-based selection, validation-based tuning and class balance to stabilize the performance of TU and TNA. Both TU and TNA are general techniques that can be easily combined with existing GNN models.
We conducted extensive experiments on three well-known GNN models, GCN, GAT and SGC, and seven widely used benchmark datasets. The results show that self-enhanced GNN consistently improves the performance of different GNN models. The reduction in the classification error is 16.2% on average and can be up to 35.1%. Detailed profiling finds that TU and TNA indeed improve the input data quality for node classification. Specifically, TU effectively improves an input graph for the task by deleting inter-class edges and adding intra-class edges, while most of the nodes added by TNA are assigned the correct label. Based on the results, one interesting future direction is to apply the idea of self-enhanced GNN to other problems such as link prediction and graph classification where GNNs are also used.
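Denote a graph as $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. There are $n = |V|$ nodes and $m = |E|$ edges in the graph. The ground-truth label of a node $v$ is $l(v)$. We define the noise ratio of the graph as

$$\alpha = \frac{|\{(u, v) \in E : l(u) \neq l(v)\}|}{|E|}. \tag{1}$$

Noise ratio measures the percentage of inter-class edges (i.e., $(u, v) \in E$ with $l(u) \neq l(v)$) in the graph.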
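The noise ratio can be computed directly from an edge list and a label map. The following is a minimal sketch; the function name `noise_ratio` and the toy graph are illustrative assumptions, not taken from the paper's code.

```python
def noise_ratio(edges, labels):
    """Fraction of edges whose two endpoints carry different labels."""
    edges = list(edges)
    if not edges:
        raise ValueError("noise ratio is undefined for a graph with no edges")
    inter = sum(1 for u, v in edges if labels[u] != labels[v])
    return inter / len(edges)


# Hypothetical toy graph: 4 nodes, 2 classes, 4 undirected edges.
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
toy_edges = [(0, 1), (2, 3), (1, 2), (0, 3)]
print(noise_ratio(toy_edges, toy_labels))  # two of the four edges are inter-class
```

Passing predicted labels instead of ground-truth labels gives an estimate of the noise ratio when the true labels are unavailable.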
Motivation. In Figure 1, we show the relation between classification accuracy and noise ratio for the CORA dataset, where edge deletion randomly removes inter-class edges in the graph and edge addition randomly adds intra-class edges (i.e., node pairs with the same ground-truth label) that are not present in the original graph. The results show that the classification accuracy of all the three models is higher with lower noise ratio. This is understandable since GNN models are generally low-pass filters that smooth the embeddings of neighboring nodes (NT and Maehara, 2019). As inter-class edges encourage nodes from different classes to have similar embeddings, they make the classification task difficult. Therefore, we make the following assumption.
Assumption 1. Lower noise ratio leads to better classification performance for popular GNNs such as GCN, GAT and SGC.
For Figure 1, we delete/add edges using the ground-truth labels. However, we may not have access to the ground-truth labels in a practical node classification problem. As popular GNN models already provide quite accurate predictions of the true labels, we can use their output for edge editing. Denote a GNN model trained for a node classification problem with $K$ classes as a mapping function $f: V \to [K]$, where $V$ is the set of nodes and $[K]$ is the integer set $\{1, 2, \ldots, K\}$. Edge deletion and edge addition can be conducted using Algorithm 1 and Algorithm 2, respectively.
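Algorithm 1 and Algorithm 2 are not reproduced in this text, so the sketch below only follows the prose description: deletion drops edges whose endpoints receive different predicted labels, and addition connects unlinked node pairs that receive the same predicted label. The function names and the toy prediction map are assumptions made for illustration.

```python
from itertools import combinations


def delete_edges(edges, pred):
    """Keep only edges whose endpoints receive the same predicted label."""
    return [(u, v) for u, v in edges if pred[u] == pred[v]]


def add_edges(nodes, edges, pred):
    """Add every missing undirected edge between nodes with equal predicted labels."""
    existing = {frozenset(e) for e in edges}
    added = [
        (u, v)
        for u, v in combinations(sorted(nodes), 2)
        if pred[u] == pred[v] and frozenset((u, v)) not in existing
    ]
    return list(edges) + added


# Hypothetical predictions for five nodes and a three-edge path graph.
pred = {0: 1, 1: 1, 2: 2, 3: 2, 4: 1}
edges = [(0, 1), (1, 2), (2, 3)]
kept = delete_edges(edges, pred)
grown = add_edges(pred.keys(), edges, pred)
print(kept, grown)
```

The addition step enumerates all node pairs, which is quadratic in the number of nodes and only practical for small graphs.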
In the following, we show that Algorithm 1 and Algorithm 2 reduce the noise ratio of the input graph if the classification accuracy of the GNN model is high enough. We first present some assumptions and definitions that will be used in the analysis.
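Assumption 2 (Symmetric Error). The GNN model $f$ has a classification accuracy of $p$ and makes symmetric errors, i.e., for every node $v$, we have $\Pr[f(v) = l(v)] = p$ and $\Pr[f(v) = k] = \frac{1 - p}{K - 1}$ for $k \in \{1, \ldots, K\}$ and $k \neq l(v)$, where $K$ is the number of classes and $l(v)$ is the ground-truth label of node $v$.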
Note that symmetric error is a common assumption in the literature (Chen et al., 2019a) and our analysis methodology is not limited to symmetric error. As the GNN model makes random errors (and hence the topology update algorithms also make random errors), we use the expected noise ratio as a replacement for the noise ratio. For the graph after editing, denoted $G'$, we define the expected noise ratio as $\bar{\alpha}' = \frac{\mathbb{E}[m'_{\mathrm{inter}}]}{\mathbb{E}[m'_{\mathrm{inter}}] + \mathbb{E}[m'_{\mathrm{intra}}]}$, in which $\mathbb{E}[m'_{\mathrm{inter}}]$ is the expected number of inter-class edges in $G'$ and $\mathbb{E}[m'_{\mathrm{intra}}]$ is the expected number of intra-class edges in $G'$. We can compare the expected noise ratio of $G'$ with the noise ratio of the original graph $G$.
The probability that an intra-class edge in
is kept in by Algorithm 1 is . Therefore, , where is the number of edges in . The probability that an inter-class edge is kept is , and thus . We haveSolving gives , which is satisfied when . ∎
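To make the quantities in this argument concrete, the sketch below evaluates the keep probabilities and the expected noise ratio after edge deletion numerically. It relies on the symmetric error assumption and, additionally, on the predictions at the two endpoints of an edge being independent; that independence is an assumption of this sketch. The function and parameter names are illustrative.

```python
def keep_probabilities(p, num_classes):
    """Probability that edge deletion keeps an intra-class / inter-class edge."""
    k = num_classes
    wrong = (1.0 - p) / (k - 1)  # probability mass on each specific wrong class
    # Intra-class edge: both endpoints correct, or both wrong on the same class.
    keep_intra = p * p + (k - 1) * wrong * wrong
    # Inter-class edge: one endpoint is predicted as the other's true class,
    # or both land on the same third class.
    keep_inter = 2.0 * p * wrong + (k - 2) * wrong * wrong
    return keep_intra, keep_inter


def expected_noise_ratio_after_deletion(alpha, p, num_classes):
    """Expected noise ratio of the graph left after deletion (edge count cancels)."""
    keep_intra, keep_inter = keep_probabilities(p, num_classes)
    inter = alpha * keep_inter          # expected inter-class edges per original edge
    intra = (1.0 - alpha) * keep_intra  # expected intra-class edges per original edge
    return inter / (inter + intra)


print(expected_noise_ratio_after_deletion(alpha=0.2, p=0.8, num_classes=5))
```

Under these assumptions the two keep probabilities coincide when the accuracy equals random guessing, $p = 1/K$, in which case deletion leaves the expected noise ratio unchanged.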
Theorem 1 shows that edge deletion reduces noise ratio under a mild condition on the classification accuracy of the GNN model, i.e., . For example, for a node classification problem with 5 classes, it only requires the classification accuracy . To analyze the expected noise ratio of the graph after edge addition, we further assume that the classes are balanced, i.e., each class has nodes.
Denote the expected number of added intra-class edges as and the expected number of added inter-class edges as . To ensure , it suffices to show that . As there are possible inter-class edges and intra-class edges in , we have
where and are the probability of keeping an inter-class edge and an intra-class edge in , respectively. Their expressions are given in the proof of Theorem 1. The and terms are included to exclude the overlaps between the edges in the original graph and the edges that may be added by Algorithm 2. With , we have
Solving gives the result. ∎
The bound on in Theorem 2 is complex for interpretation but we can approximate it as if we assume that the term is small enough to be ignored and is very small compared to . The bound can be further simplified as if we assume that is small and approximate with . Note that is a higher requirement on the classification accuracy of than for edge deletion. Thus, as we will show in the experiments, the performance improvement of edge addition is usually smaller than edge deletion.
Theorem 1 and Theorem 2 can be extended to more general assumptions. For example, the symmetric error assumption can be replaced with an error matrix $M \in \mathbb{R}^{K \times K}$ (with $K$ the number of classes), where $M_{ij}$ is the probability of classifying class $i$ as class $j$. The number of nodes in each class can also be different. The analysis methodology in the proofs can still be applied but the bounds will be in more complex forms. In addition, we show in the experiments that edge deletion and addition can be conducted simultaneously.

For practical topology update, we use the following techniques to improve Algorithm 1 and Algorithm 2.
Threshold-based selection. The GNN model usually outputs a distribution on the classes (e.g., using softmax) rather than a single decision. For a node $v$, we denote its class distribution provided by the model as $h_v \in \mathbb{R}^K$ with $h_v[k] \geq 0$ for $k = 1, \ldots, K$ and $\sum_{k=1}^{K} h_v[k] = 1$, where $K$ is the number of classes. For edge deletion, we first generate a candidate edge set based on the classification labels using Algorithm 1. For each candidate edge $(u, v)$, we calculate the correlation between their class distributions (i.e., $h_u^\top h_v$) and select the edges with $h_u^\top h_v < \tau_d$ for actual deletion, where $\tau_d$ is a threshold. For edge addition, we also generate a candidate set using Algorithm 2 first and add only edges with $h_u^\top h_v > \tau_a$, where $\tau_a$ is a second threshold. Moreover, we constrain the number of added edges to be less than 2 times the number of edges in the original graph to avoid making the cost of model training too high (the cost of GNN training is proportional to the number of edges). Threshold-based selection makes Algorithm 1 and Algorithm 2 more conservative and it also helps to avoid deleting intra-class edges and adding inter-class edges.
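A minimal sketch of threshold-based selection follows. It assumes the correlation between two class distributions is their inner product (consistent with the maximum inner product search mentioned under the efficiency issue), that deletion keeps candidates below a threshold and addition keeps candidates above one, and that strongest candidates are preferred when the cap on added edges applies. The names `tau_del` and `tau_add` and the toy probabilities are hypothetical.

```python
import numpy as np


def select_for_deletion(candidates, probs, tau_del):
    """Keep candidate edges whose class distributions correlate below tau_del."""
    return [(u, v) for u, v in candidates if float(probs[u] @ probs[v]) < tau_del]


def select_for_addition(candidates, probs, tau_add, num_original_edges):
    """Keep candidates correlating above tau_add, strongest first, capped in number."""
    scored = [(float(probs[u] @ probs[v]), (u, v)) for u, v in candidates]
    strong = sorted((s for s in scored if s[0] > tau_add), reverse=True)
    cap = max(2 * num_original_edges - 1, 0)  # strictly fewer than 2x the original edges
    return [edge for _, edge in strong[:cap]]


# Hypothetical softmax outputs for four nodes and two classes.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.55, 0.45]])
to_delete = select_for_deletion([(0, 2), (1, 2)], probs, tau_del=0.2)
to_add = select_for_addition([(0, 1), (0, 3), (1, 3)], probs, tau_add=0.6,
                             num_original_edges=3)
print(to_delete, to_add)
```

In the toy example, the pair (0, 3) shares a predicted label with node 0 but is rejected for addition because node 3 is predicted with low confidence, which is the conservative behavior the thresholds are meant to provide.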
Validation-based tuning. We use the validation set to tune the deletion and addition thresholds. For each candidate threshold, we use it to make the topology update decisions and generate a new graph. Then we train a GNN model on the updated graph and test its accuracy on the validation set. A number of candidate thresholds are checked and the one that provides the best validation accuracy is adopted. Validation-based tuning allows us to reject topology update (by choosing thresholds under which no edge is deleted or added) when it cannot improve performance, e.g., when the noise ratio of the graph is already very low or the accuracy of the model is not good enough.
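The tuning loop can be sketched as a plain grid search. The callables `update_graph` and `train_and_validate` are placeholders for the topology update and the GNN training routine, and returning `None` to signal that the update is rejected is a convention of this sketch, not of the paper.

```python
def tune_threshold(candidates, update_graph, train_and_validate, baseline_accuracy):
    """Return (best_threshold, best_accuracy); None means the update is rejected."""
    best_threshold, best_accuracy = None, baseline_accuracy
    for tau in candidates:
        graph = update_graph(tau)             # apply topology update with this threshold
        accuracy = train_and_validate(graph)  # retrain, then score on the validation set
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = tau, accuracy
    return best_threshold, best_accuracy


# Stub routines standing in for real training: validation accuracy per threshold.
fake_scores = {0.1: 0.70, 0.3: 0.78, 0.5: 0.74}
accepted = tune_threshold(fake_scores, lambda tau: tau, fake_scores.get, 0.75)
rejected = tune_threshold(fake_scores, lambda tau: tau, fake_scores.get, 0.80)
print(accepted, rejected)
```

Because every candidate threshold requires retraining the model, the size of the grid directly multiplies the training cost.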
Efficiency issue. For edge addition, naively computing the label correlation for all possible node pairs incurs high complexity, especially for large graphs. Therefore, for each node $v$, we only find the top-$k$ nodes (e.g., $k$ = 2 or 3) that have the largest label correlation with $v$ and use them as the candidates for edge addition. This corresponds to the well-known all-pair maximum inner product search problem, for which there are many efficient solutions such as LEMP (Teflioudi and Gemulla, 2016) and FEXIPRO (Li et al., 2017).
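For small graphs, the top-k candidates can be found by brute force, as sketched below. This is a quadratic-cost baseline for illustration only; it is not LEMP or FEXIPRO, and the function name and toy probabilities are assumptions.

```python
import numpy as np


def top_k_partners(probs, k):
    """For each node, indices of the k other nodes with the largest inner product."""
    scores = probs @ probs.T
    np.fill_diagonal(scores, -np.inf)    # a node is never its own partner
    order = np.argsort(-scores, axis=1)  # columns sorted by descending correlation
    return order[:, :k]


# Hypothetical class distributions: nodes 0 and 1 lean to class 0, nodes 2 and 3 to class 1.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
partners = top_k_partners(probs, k=1)
print(partners.ravel().tolist())
```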
Motivation. In Figure 2, we examine the influence of the number of training nodes on classification accuracy. The results show that using more training nodes consistently leads to higher classification accuracy for GCN, GAT and SGC. Unfortunately, for the semi-supervised node classification problem, usually only a very small number of labeled nodes are available. To enlarge the training set, an intuitive idea is to train a GNN model to label some nodes and add those nodes to the training set. However, a GNN model usually makes a considerable number of errors in its label prediction, and naively using the predicted labels as the ground-truth labels may lead to worse performance.
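For a GNN model $f$ that outputs a distribution $h_v$ on the $K$ classes for each node $v$, we define the confidence ($c_f(v)$) and prediction result ($f(v)$) of node $v$ as

$$c_f(v) = \max_{k = 1, \ldots, K} h_v[k], \qquad f(v) = \arg\max_{k = 1, \ldots, K} h_v[k],$$

where $f(v)$ is the label of $v$ predicted by $f$ and $c_f(v)$ is the likelihood of $f(v)$. Usually $f(v)$ is more likely to be correct (i.e., equal to the ground-truth label of $v$) when $c_f(v)$ is large (we show this in Figure 6, Appendix B). Utilizing $c_f(v)$ and $f(v)$, we present the training node augmentation (TNA) procedure in Algorithm 3, which produces an enlarged training set $\mathcal{T}'$ using the outputs of multiple GNN models. In Algorithm 3, $\mathcal{T}$ and $\mathcal{S}$ denote the original training set and validation set. Before adding a node to $\mathcal{T}'$, we check if it is already in $\mathcal{T}$ or $\mathcal{S}$ to avoid assigning a new label to nodes in the two sets.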
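Algorithm 3 is not reproduced in this text, so the following is only a sketch of the described procedure. It assumes a node is added when every model predicts the same label with confidence above a common threshold, and that the node is outside the original training and validation sets. All function and variable names are illustrative.

```python
import numpy as np


def confidence_and_prediction(probs):
    """Per-node confidence (largest class probability) and predicted class index."""
    return probs.max(axis=1), probs.argmax(axis=1)


def augment_training_nodes(prob_list, threshold, train_nodes, val_nodes):
    """Return {node: label} for nodes on which all models agree with high confidence."""
    confs, preds = zip(*(confidence_and_prediction(p) for p in prob_list))
    confs, preds = np.stack(confs), np.stack(preds)
    agree = (preds == preds[0]).all(axis=0)      # every model predicts the same label
    confident = (confs > threshold).all(axis=0)  # every model is above the threshold
    excluded = set(train_nodes) | set(val_nodes)
    return {
        int(v): int(preds[0, v])
        for v in np.flatnonzero(agree & confident)
        if int(v) not in excluded
    }


# Hypothetical outputs of two models on four nodes; node 3 is already a training node.
model_a = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.95, 0.05]])
model_b = np.array([[0.85, 0.15], [0.3, 0.7], [0.1, 0.9], [0.9, 0.1]])
new_labels = augment_training_nodes([model_a, model_b], 0.7, train_nodes=[3], val_nodes=[])
print(new_labels)
```

In the toy example node 1 is skipped because the two models disagree on it, which is how model diversity filters out likely errors.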
Algorithm 3 is based on two key ideas. The first one is only considering nodes with a high confidence (i.e., confidence above a threshold) as the candidates to be added to the enlarged training set, since GNN models tend to produce more accurate label predictions at higher confidence. Similar to the case of topology update, we tune the value of the confidence threshold based on the accuracy (of the model trained using the enlarged training set) on the validation set. The second and most important idea is to utilize the diversity of multiple GNN models to reduce the number of errors in the enlarged training set. With multiple diverse models, even if some classifiers assign a wrong label to a node, the node will not be added to the enlarged training set as long as one classifier gives the right label. In the following, we formalize this intuition with an analysis under the case of using two GNN models.
Following Assumption 2, we assume that both and have a classification accuracy of and make symmetric error. We also simplify Algorithm 3 and assume that a node is added to if the two models give the same label (i.e., ). Algorithm 3 can be viewed as a special case of this simplified algorithm with as it adds high-confidence nodes. The accuracy of is defined as . We are interested in the relation between and , which are the accuracies of when using one model and two models for TNA, respectively. As the two models are trained on the same graph structure, it is unrealistic to assume that they are independent. Therefore, we make the following assumption on how they correlate.
(Model Correlation) The correlation between the two GNN models and can be formulated as follows
where and , and . We also assume that as the two models should be positively correlated.
(Train Set Accuracy) Under Assumption 3 and assume that , we have the following results on the accuracy of
(1) ;
(2) is maximized when , in which case the two models and are independent.
The probability that gives the right label can be expressed as
We assume that has a classification accuracy of and solving gives the relation between and as . We can express as
Substituting into the above expression gives . Solving gives the following result
It can be verified that when . Therefore, we have regardless of the value of and , which proves the first part of the theorem. For the second part of theorem, we have
As , is a decreasing function of . As , is maximized when . In this case, we can obtain that by solving . shows that does not depend on , which means that the two models are independent. ∎
Creating diversity in GNN models. A straightforward method to generate multiple different GNN models is random initialization, which trains the same model with different parameter initializations. We show the number of errors (i.e., nodes with wrong labels) in the enlarged training set using random initialization under different confidence thresholds (adjusting the threshold controls the number of added nodes) in Figure 3. The results show that with 2 models, random initialization does not significantly outperform a single model. We conducted detailed profiling and found that this is because the two models lack diversity. For example, two randomly initialized models provide the same label prediction for 2,900 nodes (out of a total number of 3,327 nodes) on the CiteSeer dataset and the prediction accuracy in these agreed nodes is 71.9%. We found that this phenomenon is consistent across different GNN models and datasets. It is observed that GNN models resemble the label propagation algorithm in some sense (Wang and Leskovec, 2020) and the results of label propagation are totally determined by the graph structure and the labeled nodes. Therefore, two GNN models trained with different random initializations tend to produce the same label prediction because they use the same graph structure and training set.
Motivated by this finding, we propose to generate multiple GNN models with better diversity using train set swapping, which randomly re-partitions the visible set (i.e., the union of the training set and validation set) for each model. Train set swapping first unites the original training set and validation set. Then nodes in the visible set are randomly selected as the training set for a model and the remaining samples go to the validation set. The motivation is to use a different training set to train each GNN model for better diversity. We plot the errors in the enlarged training set produced by train set swapping in Figure 3. The results show that train set swapping generates significantly fewer errors than random initialization when adding the same number of nodes. This is because the models have better diversity than with random initialization and they agree on the label prediction of only 2,230 nodes on the CiteSeer dataset. The prediction accuracy in the agreed nodes is 85.4%, which is significantly higher than the 71.9% accuracy for random initialization.
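Train set swapping can be sketched as a seeded re-partition of the visible set. The sketch assumes each model's training set keeps the size of the original training set and ignores any per-class stratification; both choices, and the function name, are assumptions made for illustration.

```python
import random


def swap_train_set(train_nodes, val_nodes, seed):
    """Randomly re-partition the visible set, keeping the original training-set size."""
    visible = sorted(set(train_nodes) | set(val_nodes))
    rng = random.Random(seed)  # a separate seed per model gives a separate partition
    rng.shuffle(visible)
    cut = len(set(train_nodes))
    return visible[:cut], visible[cut:]


# Hypothetical visible set: nodes 0-3 for training, nodes 4-9 for validation.
new_train, new_val = swap_train_set(range(4), range(4, 10), seed=0)
print(new_train, new_val)
```

Calling the function with a different seed for each model yields a different training set per model while leaving the test nodes untouched.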
Class balance. A trick that is crucial for the performance of TNA is ensuring that each class has a similar number of nodes in the enlarged training set. We observed that different classes can have a very different number of nodes. For example, for the Coauthor CS dataset, the number of nodes in the largest class is 4.78x that of the smallest class. If we assume that every node has the same probability of being added to the enlarged training set, the large classes can have significantly more training samples than the small classes. We found that TNA can even degrade the accuracy (compared to without TNA) in this case. We conjecture that this is because an unbalanced training set encourages the GNN model to label nodes as belonging to the large classes, which does not generalize. Therefore, we constrain each class to have the same number of nodes in the enlarged training set. If the number of nodes to be added for a class is larger than that for the smallest class, then we add only the nodes with the largest confidence for this class.
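The class balance rule can be sketched as a per-class quota equal to the size of the smallest candidate class, filled in order of decreasing confidence. The input format (node, label, confidence) and the function name are assumptions of this sketch.

```python
from collections import defaultdict


def balance_classes(candidates):
    """candidates: iterable of (node, label, confidence). Keep equally many per class."""
    by_class = defaultdict(list)
    for node, label, conf in candidates:
        by_class[label].append((conf, node))
    if not by_class:
        return {}
    quota = min(len(items) for items in by_class.values())  # size of the smallest class
    balanced = {}
    for label, items in by_class.items():
        # Most confident candidates first; drop the rest beyond the quota.
        for conf, node in sorted(items, reverse=True)[:quota]:
            balanced[node] = label
    return balanced


# Hypothetical candidates: three for class "a", one for class "b".
picked = balance_classes([(0, "a", 0.9), (1, "a", 0.8), (2, "a", 0.95), (3, "b", 0.7)])
print(picked)
```

A class with no candidates at all is simply absent from the result, so the quota is computed over the classes that have at least one candidate.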
Model | CORA | CiteSeer | PubMed | Coauthor CS | Coauthor Physics | Amazon Computers | Amazon Photo |
---|---|---|---|---|---|---|---|
GCN | 78.7 | 66.5 | 75.5 | 90.7 | 93.1 | 71.9 | 85.2 |
GCN+SEG | 82.3 | 71.1 | 80.0 | 92.9 | 93.9 | 80.2 | 90.4 |
Error Reduction | 16.9% | 13.7% | 18.4% | 23.7% | 11.6% | 29.5% | 35.1% |
GAT | 79.0 | 65.7 | 75.3 | 89.9 | 92.0 | 82.2 | 89.6 |
GAT+SEG | 81.4 | 70.0 | 78.9 | 91.6 | 93.5 | 83.7 | 90.8 |
Error Reduction | 11.4% | 12.3% | 14.6% | 16.8% | 18.8% | 8.4% | 11.5% |
SGC | 77.4 | 65.0 | 73.3 | 91.3 | 93.3 | 81.1 | 89.3 |
SGC+SEG | 82.2 | 70.2 | 78.1 | 93.1 | 94.1 | 82.8 | 89.9 |
Error Reduction | 21.2% | 14.9% | 8.5% | 16.8% | 18.8% | 9.0% | 7.7% |
Settings. The experiments were conducted on seven widely used benchmark datasets for node classification. Due to the space limitation, we give the statistics of the datasets in Table 6, Appendix A.1. We evaluated the performance of topology update (TU) and training node augmentation (TNA) on three well-known GNN models, i.e., GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and SGC (Wu et al., 2019). We configured all the three models to have two layers because GNN models usually perform the best with two layers due to over-smoothing (Oono and Suzuki, 2019) and increasing the layers also increases the computation cost exponentially due to neighbor propagation. All weights for the models were initialized according to Glorot and Bengio (2010) and all biases were initialized as zeros. The models were trained using the Adam (Kingma and Ba, 2014) optimizer and the learning rate was set to 0.01. For both TU and TNA, we utilized a grid search to tune their parameters (i.e., the thresholds , and ) on the validation set. The detailed settings of other hyper-parameters can be found in Appendix A.2.
We followed the evaluation protocol proposed by Shchur et al. (2018) and recorded the average classification accuracy and standard deviation of 10 different dataset splits. For each split, 20 and 30 nodes from each class were randomly sampled as the training set and validation set, respectively, and the other nodes were used as the test set. Under each split, we ran 10 random initializations of the model parameters and used the average accuracy of the 10 initializations as the performance of this split. The motivation of this evaluation protocol was to exclude the influence of the randomness in data split on performance, which was found to be significant.
We first present the overall performance results of self-enhanced GNN (abbreviated as SEG) in Table 1. The reported performance of SEG is the best performance that can be obtained using TU, TNA, or (TU + TNA). In practice, we may choose to use TU, TNA, or (TU + TNA) by their prediction accuracy on the validation set. The results in Table 1 show that SEG consistently improves the performance of the 3 GNN models on the 7 datasets, where the reduction in classification error is 16.2% on average and can be as high as 35.1%. The result is significant particularly because it shows that SEG is an effective, general framework that improves the performance of well-known models that are already recognized to be effective.
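The text does not spell out how error reduction is computed; the GCN entries in Table 1 are consistent with the relative drop in classification error, which the sketch below assumes. The function name is illustrative.

```python
def error_reduction(base_accuracy, new_accuracy):
    """Relative drop in classification error, in percent (accuracies given in percent)."""
    base_error = 100.0 - base_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (base_error - new_error) / base_error


print(round(error_reduction(78.7, 82.3), 1))  # GCN vs. GCN+SEG on CORA
print(round(error_reduction(85.2, 90.4), 1))  # GCN vs. GCN+SEG on Amazon Photo
```

Under this reading, the same gain in accuracy points yields a larger error reduction when the base accuracy is already high, which is worth keeping in mind when comparing datasets.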
In the subsequent subsections, we analyze the performance of TU and TNA individually, as well as examine how they influence data quality.
Model | CORA | CiteSeer | PubMed | Coauthor CS | Coauthor Physics | Amazon Computers | Amazon Photo |
---|---|---|---|---|---|---|---|
GCN+Delete | 79.2 | 66.5 | 75.6 | 91.8 | 93.2 | 80.1 | 89.0 |
Error Reduction | 2.3% | 0.0% | 0.4% | 11.8% | 1.4% | 29.2% | 25.7% |
GAT+Delete | 79.3 | 65.8 | 75.3 | 90.9 | 92.2 | 82.8 | 90.3 |
Error Reduction | 1.4% | 0.3% | 0.0% | 9.9% | 2.5% | 3.4% | 4.9% |
SGC+Delete | 77.8 | 65.5 | 73.6 | 92.6 | 93.5 | 82.0 | 89.6 |
Error Reduction | 1.8% | 1.4% | 1.1% | 14.9% | 3.0% | 4.8% | 2.8% |
GCN+Add | 78.8 | 66.8 | 75.6 | 90.7 | 93.2 | 78.9 | 88.2 |
Error Reduction | 0.5% | 0.9% | 0.4% | 0.0% | 1.4% | 24.9% | 20.3% |
GAT+Add | 79.1 | 65.7 | 75.7 | 90.0 | 92.1 | 82.6 | 89.7 |
Error Reduction | 0.5% | 0.0% | 1.6% | 1.0% | 1.2% | 2.2% | 1.0% |
SGC+Add | 77.5 | 65.7 | 73.8 | 91.5 | 93.5 | 81.6 | 89.4 |
Error Reduction | 0.4% | 2.0% | 1.9% | 2.3% | 3.0% | 1.6% | 0.9% |
GCN+Modify | 79.4 | 67.1 | 75.9 | 91.7 | 93.4 | 79.2 | 88.5 |
Error Reduction | 3.3% | 1.8% | 1.6% | 10.8% | 4.3% | 26.0% | 22.3% |
GAT+Modify | 79.1 | 65.8 | 76.0 | 90.7 | 92.1 | 82.4 | 90.1 |
Error Reduction | 0.5% | 0.3% | 2.8% | 7.9% | 1.2% | 1.1% | 4.8% |
SGC+Modify | 78.5 | 66.7 | 74.0 | 92.7 | 93.5 | 81.7 | 89.4 |
Error Reduction | 4.9% | 4.9% | 2.6% | 16.1% | 3.0% | 2.1% | 0.9% |
The performance results of TU are reported in Table 2. To control the complexity of parameter search, we constrained the number of added edges to be the same as the number of deleted edges for Modify. The following observations can be made from the results in Table 2.
First, TU improves the performance of GCN, GAT and SGC in most cases and the improvement is significant in some cases. For example, the error reduction is over 25% for GCN on the Amazon Photo dataset. The error reduction is zero in 4 out of the 63 cases because threshold tuning (for the deletion and addition thresholds) on the validation set rejects TU as it cannot improve the performance. Thus, even in the worst case, TU does not degrade the performance of the base models.
Second, edge deletion generally achieves greater performance improvements than edge addition. This is because there is a large number of possible inter-class edges (e.g., when the classes are balanced). Even if the probability of adding an inter-class edge is small (the same as the probability of keeping an inter-class edge in in edge deletion), the algorithm may still add a considerable number of inter-class edges in expectation.
Third, the performance improvement of TU is relatively smaller for CiteSeer and PubMed than that for the other datasets, which can be explained as follows. The accuracy of GCN, GAT and SGC for CiteSeer and PubMed is considerably lower than that for the other datasets. As a result, the TU algorithms are also more likely to make wrong decisions (i.e., deleting intra-class edges or adding inter-class edges) since TU decisions are guided by the model predictions. Motivated by this observation, we experimented with a dual-model edge deletion/addition algorithm on CiteSeer, which uses the intersection of the edge deletion/addition decisions of two GNN models. The intuition is similar to the idea of TNA, which utilizes the diversity of different GNN models to reduce errors. The dual-model algorithm improves the error reduction of single-model edge deletion/addition from 0.0% and 0.9% to 0.6% and 2.1%, respectively.
Fourth, although GCN, GAT and SGC have high accuracy for both Coauthor CS and Coauthor Physics, TU has considerably greater performance improvements on Coauthor CS than on Coauthor Physics. This is because the noise ratio of the original Coauthor Physics graph is much lower than the Coauthor CS graph (6.85% vs. 19.20%), and thus reducing noise ratio has smaller influence on the performance for Coauthor Physics.
Model | Edge Deletion (correct, wrong) | Noise Ratio | Edge Addition (correct, wrong) | Noise Ratio |
---|---|---|---|---|
GCN | (332, 218) | 14.19% | (4692, 85) | 10.82% |
GAT | (309, 212) | 14.59% | (5995, 165) | 10.21% |
SGC | (242, 116) | 15.47% | (3807, 25) | 11.28% |
Finally, TU generally achieves greater performance improvements for GCN and SGC than for GAT. We plot the distributions of the attention weights of GAT on the edges that are deleted and kept by Algorithm 1 in Figure 4. The results show that the deleted edges have significantly smaller attention weights than the kept edges. As we mainly delete inter-class edges, the results suggest that GAT can prevent the inter-class edges from smoothing the embeddings of nodes from different classes by assigning them small weights. This explains why GAT is less sensitive to changes in noise ratio. However, GAT cannot really set the weights of the inter-class edges to 0 as it uses the softmax function to compute the attention weights. In contrast, Algorithm 1 can completely remove inter-class edges and can thus even improve the performance of GAT further in most cases.
Model | CORA | CiteSeer | PubMed | Coauthor CS | Coauthor Physics | Amazon Computers | Amazon Photo |
---|---|---|---|---|---|---|---|
GCN+TNA | 82.1 | 70.6 | 80.0 | 91.8 | 93.7 | 80.2 | 89.5 |
Error Reduction | 16.0% | 12.2% | 18.4% | 11.8% | 8.7% | 29.5% | 29.1% |
GAT+TNA | 81.4 | 70.0 | 78.9 | 91.1 | 93.4 | 82.7 | 90.8 |
Error Reduction | 11.4% | 12.3% | 14.6% | 11.9% | 5.7% | 2.8% | 11.5% |
SGC+TNA | 82.2 | 70.2 | 73.3 | 92.0 | 93.9 | 82.8 | 89.9 |
Error Reduction | 21.2% | 14.9% | 0.0% | 8.0% | 9.0% | 9.0% | 5.6% |
We also examined the edge deletion and addition decisions made by TU in Table 3. For both edge deletion and addition, we report the number of correct decisions (i.e., removing inter-class edges for deletion and adding intra-class edges for addition) and wrong decisions (i.e., removing intra-class edges for deletion and adding inter-class edges for addition), and the noise ratio of the CORA graph after TU. The results show that TU effectively reduces noise ratio. Most of the added edges are intra-class edges and only a few are inter-class edges. Edge deletion effectively removes inter-class edges but a considerable number of intra-class edges are also removed. This is because there are many more intra-class edges in the graph than inter-class edges, and thus the expected number of removed intra-class edges may not be small even if the probability of removing an intra-class edge is small.
Model | GCN | GAT | SGC |
---|---|---|---|
# Added Nodes | 826 | 714 | 637 |
# Errors | 83 | 72 | 45 |
Error Ratio | 10.05% | 10.08% | 7.06% |
We present the performance results of TNA in Table 4, which show that TNA improves the performance of GCN, GAT and SGC in 20 out of the 21 cases. The performance improvements are significant in many cases, e.g., 29.1% for GCN on the Amazon Photo dataset. The performance improvements are large on CORA and CiteSeer for all three GNN models. We conjecture that this is because the two datasets are relatively small and thus adding more training samples has a large impact on the performance. To explain the good performance of TNA, we examined the number of added nodes and the errors in the added nodes in Table 5. The results show that most of the added nodes are assigned the correct label. Compared with GAT and GCN, a smaller number of nodes is added for SGC and the error ratio is also lower. This may be because the model of SGC is simpler than GAT and GCN (without nonlinearity) and thus SGC is more sensitive to noise in the training samples.
We examined the two important designs in TNA, i.e., class balance and multi-model diversity. We experimented with a version of TNA without class balance for GCN on the Amazon Photo dataset, which records a classification accuracy of 86.67%. In contrast, the classification accuracy with class balance is 89.54% as reported in Table 4. We plot in Figure 4(a) the class distribution of the nodes added by TNA without class balance, which shows that the number of nodes in the largest class is 11.6 times that of the smallest class. The results show that without class balance, the enlarged training set can be highly skewed.
To demonstrate the benefits of using the diversity of multiple models in TNA, we report the relation between the test accuracy and the number of models (used for node selection) on the CiteSeer dataset in Figure 4(b). The results show that using 2 models provides a significant improvement in classification accuracy over 1 model, but the improvement drops when using more models. This is because it is harder for more models to agree with each other, and thus a low confidence threshold needs to be used to add a good number of nodes. However, a low confidence threshold means that the added nodes are likely to contain errors.
GNN models. Many GNN models have been proposed in recent years, including GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018), SGC (Wu et al., 2019), GraphSAGE (Hamilton et al., 2017), Geom-GCN (Pei et al., 2020), GGNN (Li et al., 2015), JK-Net (Xu et al., 2018), ChebNet (Defferrard et al., 2016), Highway GNN (Zilly et al., 2017) and MoNet (Monti et al., 2017). These works focus on improving the performance of a task, e.g., the prediction accuracy of node classification, compared with prior methods. In contrast, our method, self-enhanced GNN, aims to improve the quality of the input data. By providing data with higher quality, self-enhanced GNN provides a general framework that can be easily applied to existing GNN models to further improve their performance.
To the best of our knowledge, our work is most related to Chen et al. (2019b) and Li et al. (2018). Chen et al. (2019b) observed that the performance of GNN models usually degrades when using more than 2 layers due to local smoothing and proposed to remove/add edges in a graph to mitigate the over-smoothing problem of GNN models. In contrast, we start from the perspective of data quality and observe that a lower noise ratio leads to higher classification accuracy. In addition, we provide theoretical analysis to show that adding/removing edges can reduce the noise ratio if the performance of a model is good enough. The idea of enlarging the training set with co-training and self-training was proposed in (Li et al., 2018), which corresponds to the single-model case of our training node augmentation algorithm. However, as we have shown in our analysis and profiling results in Section 3.2, using the diversity of multiple models and explicitly balancing the classes in the training set are crucial for performance. In fact, the results reported in (Li et al., 2018) also show that the performance of GNNs (e.g., GCN) degrades in many cases when applying co-training and self-training with a single model.
Noisy label training. Self-enhanced GNN is partly motivated by noisy label training, which aims at learning good models from data with noisy labels, i.e., a large number of training samples come with wrong labels. Representative works along this line include Decoupling (Malach and Shalev-Shwartz, 2017), MentorNet (Jiang et al., 2017), Noisy Cross-Validation (Chen et al., 2019a) and Co-teaching (Han et al., 2018). These works focus on how to select samples with possibly correct labels from a noisy dataset to conduct model training, and our multi-model sample selection method is motivated by these works. However, as GNNs work on graph data, self-enhanced GNN handles not only noise in labels but also noise in the graph structure (i.e., inter-class edges) with topology update. Given the excellent performance of GNNs on graph data, a potential direction is to apply self-enhanced GNN to noisy label training. With the assumption that samples with similar features are likely to share the same label, a similarity graph (e.g., a k-nearest-neighbor graph based on image descriptors) can be constructed on a noisy dataset and noisy label training can be modeled as a semi-supervised node classification problem on graphs.
We presented self-enhanced GNN. The main idea is to improve the quality of the input data using the outputs of existing GNN models, so that the proposed method can be used as a general framework to improve the performance of different existing GNN models. Two algorithms were developed under this idea, i.e., topology update, which deletes/adds edges to remove inter-class edges and add potential intra-class edges in an input graph, and training node augmentation, which enlarges the training set by adding nodes with high classification confidence. Theoretical analyses were provided to motivate the algorithm designs and comprehensive experimental evaluation was conducted to validate the performance of the algorithms. The results show that self-enhanced GNN is an effective general framework that consistently improves the performance of different GNN models on a broad set of datasets.
International Conference on Machine Learning. 1062–1070.
Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Yee Whye Teh and Mike Titterington (Eds.), Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy, 249–256. http://proceedings.mlr.press/v9/glorot10a.html
Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5115–5124.
All the code of this work is released via the following anonymous link (https://gofile.io/?c=h0S6ya) and will be open source later. The datasets used in the experiments have been widely used for the evaluation of GNN models and they are all publicly available.
We evaluated our methods on 3 popular GNN models, i.e., GCN (Kipf and Welling, 2017), GAT (Veličković et al., 2018) and SGC (Wu et al., 2019). We used 7 datasets to evaluate our methods. Among them, CORA, CiteSeer and PubMed are 3 well-known citation networks and we used the version provided by Yang et al. (2016). Amazon Computers and Amazon Photo are derived from the Amazon co-purchase graph in McAuley et al. (2015). Coauthor CS and Coauthor Physics are obtained from the Microsoft Academic Graph for the KDD Cup 2016 challenge (https://www.kdd.org/kdd-cup/view/kdd-cup-2016). For these 4 datasets, we used the version pre-processed by Shchur et al. (2018). The statistics of the datasets are summarized in Table 6, where the last column gives the noise ratio defined in Section 2.
Dataset | Classes | Features | Nodes | Edges | Noise ratio |
---|---|---|---|---|---|
CORA | 7 | 1,433 | 2,485 | 5,069 | 0.19 |
CiteSeer | 6 | 3,703 | 2,110 | 3,668 | 0.26 |
PubMed | 3 | 500 | 19,717 | 44,324 | 0.19 |
Coauthor CS | 15 | 6,805 | 18,333 | 81,894 | 0.19 |
Coauthor Physics | 5 | 8,415 | 34,493 | 247,962 | 0.06 |
Amazon Computers | 10 | 767 | 13,381 | 245,778 | 0.22 |
Amazon Photo | 8 | 745 | 7,487 | 119,043 | 0.17 |
Evaluation protocol. To eliminate the influence of random factors and ensure that the performance comparison is fair, we adopted the evaluation protocol provided by Shchur et al. (2018). A 20/30/rest split for the train/val/test sets was used for all the datasets. In the experiments, we evaluated each model on 10 randomly generated dataset splits, and under each split, we ran the model 10 times using different random seeds. We reported the mean value and standard deviation of the test accuracies across the 100 runs for each model on each dataset. For the experiments comparing Self-Enhanced GNN with the base GNN models (i.e., GCN, GAT and SGC), all model implementations and evaluation settings were kept fixed and identical.
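The aggregation over splits and seeds can be sketched as follows. This is an illustration only: `run_once` is a hypothetical stand-in that returns a synthetic accuracy in place of building a split and training a model, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def run_once(split_seed, model_seed):
    """Hypothetical stand-in for: build the split, train, return test accuracy."""
    rng = random.Random(split_seed * 1000 + model_seed)
    return 0.80 + 0.02 * rng.random()  # synthetic accuracy, not a real result


def evaluate(num_splits=10, runs_per_split=10):
    """Mean and standard deviation of test accuracy over all runs."""
    accuracies = [
        run_once(split_seed, model_seed)
        for split_seed in range(num_splits)      # 10 random dataset splits
        for model_seed in range(runs_per_split)  # 10 seeds under each split
    ]
    # Sample standard deviation is an assumption made for this sketch.
    return statistics.mean(accuracies), statistics.stdev(accuracies), len(accuracies)


mean_acc, std_acc, num_runs = evaluate()
```

Pooling all 100 accuracies into one list means the reported deviation reflects both split-to-split and seed-to-seed variation.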
Structure of the base models. Our GCN model implementation has 2 GCN convolutional layers with a hidden size of 16. The activation function is ReLU. A dropout layer with a dropout rate of 0.5 is used after the first GCN layer. Our GAT model implementation has 2 GAT layers with an attention coefficient dropout probability of 0.6. The first layer is an 8-head attention layer with a hidden size of 8, and the second is the output layer. The activation function is ELU. Two dropout layers with a dropout rate of 0.6 are used, one between the input layer and the first GAT layer and one between the first and second GAT layers. Our SGC model implementation has one SGC convolutional layer with 2 hops (equivalent to 2 SGC layers according to the SGC definition).
Model training. We used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.01 and L2 regularization. We did not use learning rate decay or early stopping. As the difficulty of model training varies across datasets, we used a different number of training epochs for each dataset, i.e., CORA 400 epochs, CiteSeer 400 epochs, PubMed 400 epochs, Amazon Computers 1000 epochs, Amazon Photo 2000 epochs, Coauthor CS 2000 epochs and Coauthor Physics 400 epochs.
Software. All models and algorithms in the experiments are implemented on PyTorch (Paszke et al., 2019) and PyTorch-Geometric (Fey and Lenssen, 2019). The software versions are python=3.6.9, torch=1.2.0, CUDA=10.2.89, pytorch_geometric=1.3.2.
Topology update. For Delete, we first remove all self-loop edges from the original graph. The edges are then deleted according to Algorithm 1 with a threshold. After edge deletion, we add back the removed self-loop edges. For Add, we constrain the number of added edges to be less than 4 times the number of edges in the original graph. This threshold is used to decide the number of candidate edges for addition. We take the top-ranked candidates from the potential edges according to label correlation. After filtering out the edges already in the graph, we add new edges using Algorithm 2. For Modify, we constrain the total number of added edges to be the same as the number of deleted edges because tuning the parameters for edge deletion and addition jointly would result in high complexity. This constraint also helps maintain the graph topology to some degree by not changing the structure too much. We conduct edge deletion first, and then add the same number of edges as were deleted. We ensure that deleted edges are not added back.
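The Modify procedure can be sketched as below. This is not a reproduction of Algorithm 1 or Algorithm 2: it assumes, for illustration, that label correlation is the inner product of the two predicted class distributions, and it enumerates all node pairs as addition candidates, which is only practical for small graphs. The function names, the threshold value, and the toy predictions are all hypothetical.

```python
from itertools import combinations


def label_correlation(p_u, p_v):
    """Assumed form: inner product of two predicted class distributions."""
    return sum(a * b for a, b in zip(p_u, p_v))


def modify_topology(edges, probs, delete_threshold):
    """Delete low-correlation edges, then add the same number of new edges."""
    edge_set = {tuple(sorted(e)) for e in edges}
    # Self-loops are set aside before deletion and restored afterwards.
    self_loops = {e for e in edge_set if e[0] == e[1]}
    edge_set -= self_loops

    deleted = {e for e in edge_set
               if label_correlation(probs[e[0]], probs[e[1]]) < delete_threshold}
    kept = edge_set - deleted

    # Candidates exclude existing edges and just-deleted edges (both in edge_set),
    # so a deleted edge is never added back.
    candidates = [pair for pair in combinations(sorted(probs), 2)
                  if pair not in edge_set]
    candidates.sort(key=lambda e: label_correlation(probs[e[0]], probs[e[1]]),
                    reverse=True)
    added = set(candidates[:len(deleted)])  # as many additions as deletions
    return kept | added | self_loops, deleted, added


# Toy predictions (made up): nodes 0, 1, 4 look like class 0; nodes 2, 3 like class 1.
probs = {0: [0.9, 0.1], 1: [0.8, 0.2], 2: [0.1, 0.9], 3: [0.2, 0.8], 4: [0.95, 0.05]}
edges = [(0, 1), (1, 2), (2, 3), (0, 0)]
new_edges, deleted, added = modify_topology(edges, probs, delete_threshold=0.5)
```

On the toy input, the edge between nodes 1 and 2 is deleted because the two predicted distributions disagree, and the pair (0, 4) is added because it has the highest correlation among non-adjacent pairs.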
Training node augmentation. For training node augmentation, we use two models trained with swapped training and validation sets to label the nodes in the test set. Only the nodes that receive the same label prediction from the two models can be added to the augmented training set. A confidence threshold is used to control the number of pre-selected nodes for addition. We count the number of pre-selected nodes in each class and find the class with the fewest pre-selected nodes. This number is used as the number of added nodes for every class (i.e., the class balance trick) to avoid introducing additional biases.
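The selection rule described above can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: a node's confidence is taken as the smaller of the two models' probabilities for the agreed label, the most confident nodes are kept when a class is trimmed, and classes with no pre-selected node are ignored when computing the per-class quota. `select_nodes` and the toy probabilities are hypothetical.

```python
from collections import defaultdict


def select_nodes(probs_a, probs_b, candidates, threshold):
    """Pick pseudo-labeled nodes for the augmented training set.

    probs_a / probs_b map node id -> list of class probabilities from the
    two models. Returns a dict: class -> list of selected node ids.
    """
    per_class = defaultdict(list)
    for node in candidates:
        pa, pb = probs_a[node], probs_b[node]
        label_a = max(range(len(pa)), key=pa.__getitem__)
        label_b = max(range(len(pb)), key=pb.__getitem__)
        confidence = min(pa[label_a], pb[label_b])  # assumed combination rule
        # Keep the node only if both models agree and both are confident.
        if label_a == label_b and confidence >= threshold:
            per_class[label_a].append((confidence, node))

    if not per_class:
        return {}
    # Class balance: every class contributes as many nodes as the smallest one.
    # Classes with no pre-selected node are ignored here (an assumption).
    quota = min(len(scored) for scored in per_class.values())
    selected = {}
    for label, scored in per_class.items():
        scored.sort(reverse=True)  # most confident first (assumed tie-breaking)
        selected[label] = [node for _, node in scored[:quota]]
    return selected


# Toy predictions (made up) for five unlabeled nodes and two classes.
probs_a = {0: [0.9, 0.1], 1: [0.95, 0.05], 2: [0.2, 0.8], 3: [0.6, 0.4], 4: [0.85, 0.15]}
probs_b = {0: [0.92, 0.08], 1: [0.97, 0.03], 2: [0.1, 0.9], 3: [0.3, 0.7], 4: [0.4, 0.6]}
selected = select_nodes(probs_a, probs_b, candidates=[0, 1, 2, 3, 4], threshold=0.8)
```

In the toy case nodes 3 and 4 are rejected because the two models disagree, and class 0 is trimmed from two nodes to one so that it matches class 1.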
Joint use of TU and TNA. For experiments that jointly used topology update and training node augmentation, we applied the two techniques independently and used the thresholds selected by each algorithm individually to avoid the high complexity of joint parameter tuning. We considered three configurations, i.e., topology update only, training node augmentation only, and both together, where each enabled algorithm uses its individually selected optimal parameter and the other is disabled. We selected the best configuration using the validation accuracy. The reported result is the test accuracy of the selected configuration. Therefore, our framework still has the potential to perform even better if more fine-grained tuning of the threshold parameters is conducted.
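A minimal sketch of this selection step is given below, assuming the three configurations are topology update only, training node augmentation only, and both together. The accuracy numbers are made up for illustration and `select_configuration` is a hypothetical helper.

```python
def select_configuration(results):
    """Return the configuration with the highest validation accuracy.

    results maps a configuration name to a
    (validation_accuracy, test_accuracy) pair.
    """
    best = max(results, key=lambda name: results[name][0])
    return best, results[best][1]


# Made-up numbers, for illustration only.
results = {
    "TU only": (0.815, 0.809),
    "TNA only": (0.823, 0.818),
    "TU + TNA": (0.821, 0.826),
}
best_name, reported_test_acc = select_configuration(results)
```

The choice depends only on validation accuracy, so the reported test accuracy is not necessarily the highest one among the three configurations, as the toy numbers show.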
All the thresholds mentioned above are determined solely by the classification accuracy on the validation set.
Relation between confidence and classification accuracy. In Algorithm 3, we only add nodes with a high confidence to the enlarged training set. In Figure 6, we plot the relation between confidence and classification accuracy. The results show that the model is more likely to give the right label prediction under high confidence.
Relation between label correlation and label alignment. Recall that the label correlation between a pair of nodes is defined from the class distributions predicted by a model for the two nodes. For topology update, we delete edges with small label correlation and add edges with large label correlation. In Figure 7, we plot the relation between label correlation and the probability that a pair of nodes have the same label (called label alignment). The results show that a pair of nodes is more likely to be in the same class under higher label correlation.