Large-scale graphs, which are characterized by massive numbers of nodes and edges, are ubiquitous in real-world applications, such as social networks Ching et al. (2015); Wang et al. (2018a); Perozzi et al. (2014) and knowledge graphs Wang et al. (2018b, b). Although graph neural networks (GNNs) have shown effectiveness in many fields Kipf and Welling (2016); Veličković et al. (2017); Wu et al. (2019); Chen et al. (2020b); Xu et al. (2018), most of them rely on propagating messages over the whole graph dataset and are mainly developed for relatively small graphs. On large graphs, such message passing paradigms lead to prohibitive computation and memory requirements.
Recently, several scalable GNN algorithms have been proposed to handle large-scale graphs, among which sub-graph sampling methods are dominant in the literature Hamilton et al. (2017); Chen et al. (2018, 2017); Chiang et al. (2019); Zeng et al. (2019). Specifically, instead of training on the full graph, the sampling methods sample subsets of nodes and edges to form a sub-graph at each step, which is treated as an independent mini-batch. For example, Cluster-GCN Chiang et al. (2019) first clusters the input graph into sub-graph groups, and then forms each batch from a fixed number of groups (referred to as the batch size) during model training. LGCN Gao et al. (2018) samples sub-graphs via breadth-first search, motivated by cropping small patches from a large image.
Nevertheless, the label bias existing in the sampled sub-graphs could make GNN models over-confident about their predictions, which leads to overfitting and lowers the generalization accuracy Ghoshal et al. (2020). Note that in real-world assortative graphs Tang et al. (2013), closely connected nodes are likely to share the same label or positively related labels. Sub-graph sampling methods usually assign these related nodes to the same sub-graph, which leads to label bias in a batch: its node label distribution is significantly different from that of the other batches. Taking Cluster-GCN as an example, where a community with the same node labels is clustered as a sub-graph, the label distribution varies dramatically among batches, as shown in Figure 1. Compared with traditional deep neural networks trained on uniform batches Müller et al. (2019); Li et al. (2020); Ding et al. (2019); Xu et al. (2020), cross-entropy minimization on a biased batch drives the GNN model to attend only to the correctness of the biased ground-truth category by producing an extremely large prediction probability. Such over-confident prediction overfits the training set (e.g., the decreasing training loss of Cluster-GCN in Figure 1) but generalizes poorly to the testing set (e.g., the increasing testing loss). To overcome the over-confident prediction and overfitting, label smoothing has been proposed to soften the one-hot class label by mixing it with a uniform distribution Szegedy et al. (2016). By penalizing over-confident prediction towards the ground-truth label, label smoothing (LS) has been used to improve the generalization performance across a range of tasks, including image classification Müller et al. (2019); Xu et al. (2020); Lienen and Hüllermeier (2021); Yuan et al. (2020), semantic parsing Ghoshal et al. (2020) et al. (2020); Gao et al. (2020).
However, it is non-trivial to apply label smoothing to regularize large-scale graph training, because it must adapt to two structural levels: the local node and the global graph. First, unlike generic machine learning problems with independent data instances, in graph data it is generally assumed that the class labels of connected nodes are positively related in many real-world applications Tang et al. (2013); Huang et al. (2020). In other words, for a specific local node, its label prediction highly depends on the label distribution of its neighbors. Traditional label smoothing, which mixes the one-hot hard target with a uniform distribution, could wrongly regularize nodes in graph data. Second, considering the global graph, the relevance between different pairs of labels could vary. For example, in academic networks, civil engineering researchers tend to collaborate more with ecology researchers than with physicists Newman (2004). The optimal label smoothing should be conditioned on the relevance between the ground-truth label and the related labels. The fixed label smoothing paradigm, which mixes in a uniform distribution, fails to model such global label relevance specific to the downstream task.
To bridge the gap, in this paper, we develop a simple yet effective adaptive label smoothing (ALS) method to regularize the representation learning on large-scale graphs. We aim to answer two research questions. First, given a local node, how can we estimate its smoothed label that is aware of the neighborhood structure? Second, how can we learn the global label relevance for a specific task? Through exploring these questions, we make three significant contributions as follows.
- We are the first to analyze the label bias problem in sub-graph sampling methods for large-scale graph training. Biased batch training could make a GNN model produce over-confident predictions and overfit the training set.
- We present an adaptive label smoothing method decoupled into the following stages: a label propagation pre-processing step that aggregates the local neighborhood label distribution, and a label refinement step that maps the pre-processed neighborhood labels to the global smoothed label adaptively. Our method is simple and memory efficient, and scales to large graphs with a negligible extra step to map the desired smoothed label.
- We propose a label smoothing pacing function to allocate different smoothing strengths along the training process, in order to avoid over-regularization at the beginning.
The empirical results show that our adaptive label smoothing could relieve the overfitting issue and yield better node classification accuracies on top of most scalable learning frameworks.
2 Label Bias, Over-confident Prediction and Overfitting
Notations and problem definition.
We denote matrices with boldface capital letters (e.g., $\mathbf{A}$), vectors with boldface lowercase letters (e.g., $\mathbf{x}$ or $\mathbf{y}$), and scalars with lowercase letters (e.g., $n$). We use $A_{ij}$ to index the $(i,j)$-th element of matrix $\mathbf{A}$, and use $x_i$ to represent the $i$-th entry of vector $\mathbf{x}$. In this work, we focus on node classification tasks, and propose using label smoothing to address the over-confident prediction and overfitting issues in large-scale graph analysis. A graph is represented by $G = \{\mathbf{A}, \mathbf{X}\}$, where $\mathbf{A}$ denotes the adjacency matrix, $\mathbf{X}$ denotes the feature matrix, and $n$ is the number of nodes. Each node $v_i$ is associated with a feature vector $\mathbf{x}_i$ (the $i$-th row of $\mathbf{X}$) and a one-hot class label $\mathbf{y}_i \in \{0,1\}^{c}$, where $c$ is the number of class labels. Given a training set $\mathcal{V}_{l}$ with labels, the goal is to classify the nodes in the unlabeled set $\mathcal{V}_{u}$ via learning effective node representations. Let $f_{\theta}$ denote the GNN model, where $\theta$ denotes the model parameters. The prediction for a node $v_i$ is the class-probability vector $\hat{\mathbf{y}}_i$ produced by $f_{\theta}$. Recalling the batch training in a large-scale graph, the plain cross-entropy loss in a batch is given by:
$$\mathcal{L} = \frac{1}{|\mathcal{V}_B|} \sum_{v_i \in \mathcal{V}_B} \ell(\mathbf{y}_i, \hat{\mathbf{y}}_i), \tag{1}$$
where $\ell(\cdot,\cdot)$ is the cross-entropy function. Sub-graphs are sampled to build each batch, and $\mathcal{V}_B$ contains the nodes in the sampled sub-graphs for training. $|\mathcal{V}_B|$ denotes the number of training nodes in $\mathcal{V}_B$.
We empirically study the label bias phenomenon and the over-confident prediction and overfitting issues of Cluster-GCN models trained on the ogbn-products dataset. Other sub-graph sampling methods with similar issues are shown in the Appendix. First, to evaluate the label bias, we define the probability of nodes with class $j$ in a batch as $p_j = \frac{1}{|\mathcal{V}_B|}\sum_{v_i \in \mathcal{V}_B} y_{ij}$, where $\mathcal{V}_B$ is the set of training nodes in the batch and $y_{ij} \in \{0, 1\}$ indicates whether node $v_i$ belongs to class $j$. The mean and standard deviation of $p_j$ among batches are shown in the right part of Figure 1. We observe that the standard deviation is significantly larger when the batch size is small, while the mean value is relatively stable. This means that nodes within a small batch tend to belong to certain classes, instead of being evenly distributed across all classes. Such label bias is inherent in sub-graph sampling methods, such as clustering Chiang et al. (2019) and random walk sampling Zeng et al. (2019), since positively related nodes are more likely to be selected together in a sampled sub-graph.
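The batch-level statistic described above can be computed in a few lines. The following is a minimal sketch in plain Python, not the authors' code; the function names and the toy batches are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def class_proportions(batch_labels, num_classes):
    """Return the fraction of nodes in one batch that carry each class label."""
    counts = [0] * num_classes
    for label in batch_labels:
        counts[label] += 1
    return [count / len(batch_labels) for count in counts]


def proportion_stats(batches, num_classes):
    """Return (mean, standard deviation) of each class fraction across batches."""
    per_batch = [class_proportions(b, num_classes) for b in batches]
    stats = []
    for j in range(num_classes):
        values = [p[j] for p in per_batch]
        stats.append((mean(values), pstdev(values)))
    return stats


# Made-up labels: two "clustered" batches versus two well-mixed batches.
biased = [[0, 0, 0, 0], [1, 1, 1, 1]]
mixed = [[0, 1, 0, 1], [1, 0, 1, 0]]
biased_stats = proportion_stats(biased, num_classes=2)
mixed_stats = proportion_stats(mixed, num_classes=2)
```

On this toy input both settings have the same mean class fraction, but only the clustered batches show a non-zero spread across batches, which is the pattern the label bias analysis looks for.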
We further analyze the training and testing losses of Cluster-GCN in Figure 1. In contrast to the uniform batch labels used in traditional machine learning, the label bias in the batches leads the GNN model to over-confidently attend to (i.e., produce large prediction probabilities for) the ground-truth classes when minimizing the vanilla cross-entropy loss. The over-confident prediction overfits the training set and accelerates the decrease of the training loss, but generalizes poorly to the testing set, as shown by the increasing testing loss. Compared with a small batch size, a larger batch size reduces the variance of the class probability across batches and relieves the label bias to some extent, which brings a smaller testing loss and better generalization performance. However, a large batch size increases computation and memory costs, which is not in line with the purpose of sub-graph batch training on large graphs.
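To overcome over-confidence and overfitting, label smoothing has been proposed to mix the one-hot hard labels with a uniform distribution in image classification Müller et al. (2019); Xu et al. (2020); Lienen and Hüllermeier (2021); Yuan et al. (2020) et al. (2020); Meister et al. (2020). To be specific, considering a training node $v_i$ with one-hot label $\mathbf{y}_i$, its smoothed label is given by $\tilde{\mathbf{y}}_i = (1-\alpha)\,\mathbf{y}_i + \alpha\,\mathbf{u}$, where $\mathbf{u}$ is the uniform distribution over the classes and $\alpha$ is the regularization strength. Then the cross-entropy loss is given by:
$$\mathcal{L}_{\mathrm{LS}} = \frac{1}{|\mathcal{V}_B|} \sum_{v_i \in \mathcal{V}_B} \ell(\tilde{\mathbf{y}}_i, \hat{\mathbf{y}}_i), \tag{2}$$
where, as before, $\ell$ is the cross-entropy function, $\hat{\mathbf{y}}_i$ is the model prediction, and $\mathcal{V}_B$ is the set of training nodes in the batch.
By minimizing $\mathcal{L}_{\mathrm{LS}}$, we prevent the model from being over-confident on the ground-truth class, via a penalty that enforces non-negligible prediction probabilities on the other classes.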
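To make the effect of uniform label smoothing concrete, here is a small self-contained sketch in plain Python. The smoothing strength of 0.1 and the two prediction vectors are illustrative assumptions, not values taken from the experiments.

```python
import math


def smooth_label(one_hot, alpha):
    """Mix a one-hot label with the uniform distribution over its classes."""
    num_classes = len(one_hot)
    return [(1.0 - alpha) * y + alpha / num_classes for y in one_hot]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities.

    Classes with zero target mass are skipped because they contribute nothing.
    """
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


y = [0.0, 1.0, 0.0, 0.0]                   # one-hot hard label
y_smooth = smooth_label(y, alpha=0.1)      # illustrative strength
confident = [0.001, 0.997, 0.001, 0.001]   # over-confident prediction
moderate = [0.05, 0.85, 0.05, 0.05]        # less extreme prediction

hard_prefers_confident = cross_entropy(y, confident) < cross_entropy(y, moderate)
smooth_prefers_moderate = (
    cross_entropy(y_smooth, moderate) < cross_entropy(y_smooth, confident)
)
```

With the hard target, the over-confident prediction has the lower loss; with the smoothed target the ordering flips, which is the penalty on over-confidence that the text describes.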
3 Adaptive Label Smoothing
The above label smoothing fails to adapt to graph data because it ignores two informative attributes: the label distribution within the local neighborhood and the global label relevance. First, the nodes' labels are not independently distributed, and correlate positively with those of their local neighbors Huang et al. (2020); Wang and Leskovec (2020); Jia and Benson (2020); Zhou et al. (2004); Zhu et al. (2003). The vanilla label smoothing assumes that node labels are independent and identically distributed, and applies the uniform distribution to regularize representation learning. It could mislead the model prediction to attend to negatively related labels. Second, in the overall graph, the relevance differs from one pair of labels to another. For example, the pair-wise collaboration strengths among engineering, ecology, and physics researchers are unbalanced in academic networks Newman (2004). In hierarchical GNNs, it is commonly assumed that the connections between labeled communities should be sparse Ying et al. (2018b); Zhou et al. (2019b), instead of uniform and fully connected. The vanilla label smoothing with a fixed uniform distribution cannot properly learn the latent global label relevance for the downstream task.
3.1 Proposed Techniques
In this work, we propose ALS to calibrate the label bias and regularize sub-graph batch training on large-scale graphs. ALS consists of three parts: (1) a pre-processing step of label propagation to obtain the prior knowledge of the neighborhood label distribution; (2) a label refinement step to correct the prior knowledge and learn the global label relevance in the training phase; and (3) a smooth pacing function to gradually schedule the smoothing strength and avoid overly strong label regularization.
Based on the expectation that two connected nodes are likely to have the same label according to graph homophily, label propagation passes labels iteratively to learn the label predictions Jia and Benson (2020); Wang and Leskovec (2020); Ando and Zhang (2007); Belkin et al. (2006). However, most of these methods involve trainable parameters and are expensive to train. To scale to large graphs, we simplify label propagation by removing the trainable weights, and conduct it only as a pre-processing step. Specifically, let $\mathbf{Y}^{(k)}$ denote the propagated label matrix obtained from the $k$-th iteration of label propagation. At the $(k+1)$-th iteration, we update the propagated label matrix as follows:
The initial label matrix $\mathbf{Y}^{(0)}$ consists of one-hot hard labels for training nodes and zero vectors otherwise. $\beta$ is the residual strength that preserves the initial training labels, and $\mathbf{D}$ is the diagonal degree matrix of $\mathbf{A}$. The label propagation in Eq. (3) is similar in spirit to Zhu and Ghahramani (2002), but we preserve $\mathbf{Y}^{(0)}$ to avoid the training labels being overwhelmed. After $K$ iterations of label propagation, we obtain the prior knowledge of the neighborhood label distribution up to $K$ hops away, denoted $\mathbf{Y}^{(K)}$. Such prior knowledge provides enough neighborhood information to be refined. Our label propagation thus offers a good trade-off between efficiency and effectiveness.
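As an illustration of parameter-free label propagation with a residual connection to the initial labels, the sketch below runs a few iterations on a tiny graph in plain Python. The concrete update, a row-normalized neighbor average weighted by one minus the residual strength plus the residual strength times the initial labels, is an assumption made for this example, as are the function name, the adjacency-list format, and the toy graph.

```python
def propagate_labels(neighbors, initial, beta, num_steps):
    """Parameter-free label propagation with a residual to the initial labels.

    neighbors: dict mapping each node to a list of its neighbors.
    initial:   dict mapping each node to a list of class scores (one-hot for
               training nodes, all zeros otherwise).
    Assumed update per step (illustrative, row-normalized):
        new[v] = (1 - beta) * mean(current[u] for u in neighbors[v])
                 + beta * initial[v]
    """
    num_classes = len(next(iter(initial.values())))
    current = {v: list(row) for v, row in initial.items()}
    for _ in range(num_steps):
        updated = {}
        for v, nbrs in neighbors.items():
            agg = [0.0] * num_classes
            for u in nbrs:
                for j in range(num_classes):
                    agg[j] += current[u][j]
            degree = max(len(nbrs), 1)  # guard against isolated nodes
            updated[v] = [
                (1.0 - beta) * agg[j] / degree + beta * initial[v][j]
                for j in range(num_classes)
            ]
        current = updated
    return current


# Toy path graph 0 - 1 - 2 - 3; only the two end nodes are labeled.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
initial = {0: [1.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 0.0], 3: [0.0, 1.0]}
propagated = propagate_labels(neighbors, initial, beta=0.5, num_steps=2)
```

After two steps, each unlabeled node carries more mass on the class of its nearer labeled end point, while the labeled nodes keep most of their original label thanks to the residual term.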
In this step, we aim to refine the propagated label matrix $\mathbf{Y}^{(K)}$. Specifically, given the propagated label $\mathbf{y}^{(K)}_i$ of training node $v_i$ (i.e., the $i$-th row of $\mathbf{Y}^{(K)}$), we correct it through multiplication with a trainable matrix $\mathbf{W} \in \mathbb{R}^{c \times c}$. The smoothed label used to regularize model training is then given by:
Notably, element $W_{jk}$ indicates the latent relevance between classes $j$ and $k$, and is shared globally by all nodes over the graph. Considering node $v_i$, the real relevance to class $k$ is corrected to be proportional to $\sum_{j} y^{(K)}_{ij} W_{jk}$. To learn the global label relevance in $\mathbf{W}$ well, we train $\mathbf{W}$ jointly with the classification task and compute the batch loss as follows:
$\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the KL distance of two probability distribution vectors, and
$\gamma$ is a positive hyperparameter. Compared with the traditional label smoothing in Eq. (2), ALS relaxes the uniform distribution to learn the optimal soft label and adapt to the downstream task. On one hand, parameter $\mathbf{W}$ is updated to learn the global label relevance and produce a reasonable smoothed label. On the other hand, the KL distance constraint is exploited to prevent the smoothed label from collapsing into the one-hot hard target and to guarantee its divergence from that target.
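The refinement step can be pictured with a short plain-Python sketch. It assumes that the corrected soft label is a softmax over the propagated label multiplied by the class-relevance matrix, and that the result is mixed with the one-hot label using the smoothing strength; both the softmax normalization and the hand-picked matrix values are assumptions for illustration (in training, the matrix would be learned jointly with the classifier, and the KL term is omitted here).

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def refine(propagated_row, weight):
    """Map a propagated label through a c x c matrix, then normalize (assumed)."""
    num_classes = len(weight)
    logits = [
        sum(propagated_row[i] * weight[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]
    return softmax(logits)


def adaptive_smooth(one_hot, propagated_row, weight, alpha):
    """Mix the one-hot label with the refined soft label."""
    soft = refine(propagated_row, weight)
    return [(1.0 - alpha) * y + alpha * s for y, s in zip(one_hot, soft)]


# Hypothetical relevance matrix: classes 0 and 1 are related, class 2 is not.
W = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
smoothed = adaptive_smooth(
    one_hot=[1.0, 0.0, 0.0],
    propagated_row=[0.6, 0.3, 0.1],
    weight=W,
    alpha=0.2,
)
```

Unlike uniform smoothing, the mass taken from the ground-truth class is not shared equally: the related class 1 receives more of it than the unrelated class 2.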
Considering batch training on a large-scale graph, a constant smoothing strength may overly regularize the model at the initial training phase. Given the randomly initialized parameter $\mathbf{W}$, the soft label will mislead the model prediction to attend to unrelated classes. Motivated by the batch pacing in curriculum learning Bengio et al. (2009); Jiang et al. (2015), we propose a smooth pacing function to gradually schedule the appropriate smoothing strength $\alpha_t$ at the $t$-th epoch. In the early phase, since over-confident prediction has not yet appeared, we use a small $\alpha_t$ to let the model learn the correct prediction. As training proceeds, we gradually increase $\alpha_t$ to regularize the model. Specifically, we consider the following two categories of pacing function: (1) a linear pacing function, parameterized by the pacing rate $\lambda$ and the maximum smoothing strength $\alpha_{\max}$; (2) an exponential pacing function, where $\alpha_0$ is the initial smoothing strength. In our ALS, we replace the constant smoothing strength in Eq. (5) with $\alpha_t$.
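The two pacing schedules can be sketched as follows. The exact functional forms used here (a linear ramp capped at the maximum strength, and an exponential growth from the initial strength with the same cap) are assumptions for illustration, as are the function names and the numeric settings.

```python
import math


def linear_pacing(epoch, rate, alpha_max):
    """Assumed linear schedule: grow by `rate` per epoch, capped at alpha_max."""
    return min(rate * epoch, alpha_max)


def exponential_pacing(epoch, rate, alpha_init, alpha_max):
    """Assumed exponential schedule: start at alpha_init, capped at alpha_max."""
    return min(alpha_init * math.exp(rate * epoch), alpha_max)


# Illustrative settings only.
linear_schedule = [linear_pacing(t, rate=0.02, alpha_max=0.1) for t in range(8)]
exp_schedule = [
    exponential_pacing(t, rate=0.5, alpha_init=0.01, alpha_max=0.1)
    for t in range(8)
]
```

Both schedules start with little or no smoothing, so the model can first fit the hard labels, and then saturate at the maximum strength once over-confidence becomes the main risk.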
3.2 Model Analysis
Based on sparse matrix multiplication, the time complexity of label propagation is linear in the number of nonzeros in $\mathbf{A}$. Since the label propagation is conducted in the pre-processing step, it can scale to large graphs on CPU platforms with large memory. We thus ignore its memory complexity. The computation of label refinement mainly lies in the matrix multiplication with $\mathbf{W}$. For any backbone network, the extra time and memory complexities are therefore small. As a result, our ALS can augment any scalable algorithm to handle large-scale graphs.
We apply ALS to augment Cluster-GCN and train on the ogbn-products dataset. As shown in Figure 1, compared with the plain Cluster-GCN, our ALS has larger training losses but smaller testing losses. In other words, our model has better generalization performance on the testing set by avoiding overfitting on the training set. In particular, the label bias problem is much more severe with the small batch size. In this case, the testing loss of the plain Cluster-GCN increases significantly due to the extremely over-confident prediction and the overfitting on the training set. In contrast, our model can still avoid the over-confident prediction, even increasing the training loss at the end of training. The label smoothing regularization brings and maintains a lower testing loss.
Comparison to previous work.
Although label smoothing has been applied in computer vision and natural language processing Müller et al. (2019); Li et al. (2020), it has not been studied as a regularizer of GNN models for graph data analytics. Previous GNNs are mainly developed to process small graphs. In this paper, we observe the label bias problem resulting from sub-graph batch training on large-scale graphs, and analyze the over-confident prediction and overfitting issues. Compared with the traditional label smoothing with a uniform distribution, we propose ALS to adapt to graph data. We are aware that there have recently been some label smoothing works that learn the soft label Ghoshal et al. (2020); Ding et al. (2019), which is similar to our label refinement module. However, they are not targeted at graph data and do not incorporate the graph structure. In the experiments, we empirically demonstrate that all three modules in ALS are crucial to regularize large-scale graph training.
Another similar line of work is label propagation. Most previous methods involve trainable weights and cannot scale to large graphs Jia and Benson (2020); Wang and Leskovec (2020); Ando and Zhang (2007); Belkin et al. (2006). The simple and scalable methods either directly use the propagated labels to predict the testing nodes Zhu and Ghahramani (2002), or concatenate them to node features as the nodes' inputs Sun and Wu (2021). In this work, we exploit the pre-processed soft label to regularize the model training. In the experiments, we empirically show that label smoothing is a better way to exploit this prior label knowledge.
4 Experiments
In this section, we empirically evaluate the effectiveness of ALS on several real-world datasets. Overall, we aim to answer four research questions as follows. Q1: Can ALS effectively regularize the model to obtain better generalization performance, compared with the plain model and label smoothing with a uniform distribution? Q2: How does each module of ALS affect its performance? Q3: How does ALS perform compared with other ways of exploiting the prior label knowledge? Q4: How do the hyperparameters influence the performance of ALS?
4.1 Experiment Setup
We evaluate our proposed models on graphs with different scales using node classification tasks, following the previous large-scale graph representation learning efforts. These benchmark datasets include Flickr Zeng et al. (2019), Reddit Hamilton et al. (2017), ogbn-products and ogbn-mag Hu et al. (2020), whose node numbers are 89K, 233K, 2449K and 1940K, respectively. Their data statistics are provided in Appendix.
We mainly evaluate ALS on the scalable backbone frameworks based upon sub-graph sampling. Since the precomputing methods are another important line of scalable graph representation learning, we also apply our method to them to demonstrate its general effectiveness. For the sub-graph sampling based methods, we adopt the popular backbone frameworks of GraphSAGE Hamilton et al. (2017), Cluster-GCN Chiang et al. (2019) and GraphSAINT Zeng et al. (2019). For the precomputing based methods, we choose the backbone frameworks of MLP and SIGN Rossi et al. (2020). The detailed descriptions of these five frameworks are provided in the Appendix. Note that we aim to demonstrate the general effectiveness of ALS in improving model generalization for the diverse scalable learning frameworks, instead of achieving the state-of-the-art performance on each classification task. Therefore, for each experiment on benchmark datasets, we conduct and compare three implementations: the plain scalable model trained with the cross-entropy loss in Eq. (1), the model augmented with conventional label smoothing (LS) as shown in Eq. (2), and the model augmented with ALS as shown in Eq. (5).
We directly use the implementations of the backbone networks either from their official repositories or based on the official examples of PyTorch Geometric. We further implement LS and ALS over each backbone model. For LS with a uniform distribution, we set the constant smoothing strength to the value that is widely applied in regularizing image classification. For our ALS, we choose appropriate hyperparameters for the residual strength and the number of steps in the label propagation, and also determine the weight of the KL distance constraint as well as the smooth pacing rate. While the linear smooth pacing is adopted in the sub-graph sampling methods, the exponential pacing function is used in the precomputing methods. The detailed choices on the four datasets are shown in the Appendix. We study the influence of these hyperparameters in the experiments, and show that our model is not sensitive to them within a wide value range.
4.2 Experiment Results
Generalization improvement by label smoothing.
To answer the research question Q1, Table 1 summarizes the comprehensive comparisons among the plain model without any label smoothing, the regularized model with LS, and the regularized model with ALS over each combination of backbone framework and dataset. It is observed that our ALS can achieve superior performances in cases out of the 20 in total. Specifically, compared with the plain frameworks based on sub-graph sampling, both LS and ALS can generally improve test accuracy. The sampling methods assign connected nodes, which possibly share the same label, to one sub-graph, which leads to label bias within a batch. The label bias makes the model over-confidently attend to the prediction of the ground-truth class, and may mislead the model into local minima and decrease its generalization ability. While LS uses the uniform distribution to regularize the model's prediction on other classes, our proposed ALS calibrates the model more accurately by considering the local neighbors and using the global label refinement to correct the smoothed label.
Compared with the plain precomputing model, our ALS can still generally improve the test accuracy, although LS tends to deteriorate model performance. Trained on a GeForce RTX 2080 Ti GPU, the official implementations of MLP and SIGN use a full batch of training nodes. Since label bias is not a big concern in such full-batch training scenarios, the crude LS over-regularizes models and further hinders accurate predictions on the ground-truth class. Notably, ALS exploits the informative neighborhood label distribution to calibrate the model prediction, considering that connected nodes should be close in the label space. Furthermore, the label refinement module learns to correct the smoothed label and is trained jointly with the cross-entropy classification loss, which allows it to adapt to the desired classification task.
To demonstrate how each module of ALS affects the generalization performance and answer the research question Q2, we perform ablation studies over two sub-graph sampling based backbone networks, i.e., Cluster-GCN and GraphSAINT. In particular, to study the contribution of label propagation in ALS, we ablate it and replace the propagated soft label with the one-hot hard label when computing the smoothed label used for model training. To ablate the label refinement, we use the soft label obtained from the label propagation directly to compute the smoothed label, without the trainable correction. At the same time, we remove the KL distance constraint in Eq. (5). To ablate the smooth pacing module, we use a constant smoothing strength to generate the smoothed label.
We summarize the ablation studies on the three modules of ALS in Table 2. It is observed that the ablation of any module decreases the test accuracy, which empirically demonstrates their importance in adapting label smoothing to regularize graph representation learning. Compared with the ablation of label propagation, removing label refinement and smooth pacing damages the performance of ALS more severely. Even with inaccurate prior knowledge of the neighborhood label distribution, the label refinement module can be supervised to refine the smoothed label correctly and regularize the model prediction. The smooth pacing allocates a smaller smoothing strength at the initial training phase, when the smoothed label is far from being well refined, and then gradually increases it to keep the model from overfitting.
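| Backbone | Variant | Flickr | Reddit | ogbn-products | ogbn-mag |
|---|---|---|---|---|---|
| Cluster-GCN | w/o label propagation | 50.06 ± 0.24 | 95.18 ± 0.15 | 80.61 ± 0.40 | 37.85 ± 0.20 |
| Cluster-GCN | w/o label refinement | 49.97 ± 0.25 | 95.06 ± 0.16 | 80.19 ± 0.41 | 37.62 ± 0.28 |
| Cluster-GCN | w/o smooth pacing | 50.09 ± 0.20 | 95.12 ± 0.13 | 80.53 ± 0.62 | 37.87 ± 0.30 |
| GraphSAINT | w/o label propagation | 51.59 ± 0.13 | 95.23 ± 0.08 | 79.31 ± 0.44 | 47.91 ± 0.35 |
| GraphSAINT | w/o label refinement | 51.67 ± 0.19 | 95.17 ± 0.09 | 79.11 ± 0.55 | 47.79 ± 0.39 |
| GraphSAINT | w/o smooth pacing | 51.64 ± 0.24 | 95.13 ± 0.09 | 79.09 ± 0.61 | 47.83 ± 0.24 |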
Comparison of prior label knowledge.
Besides label smoothing, there are two other lines of work that scale to large graphs by exploiting the prior knowledge of the neighborhood label distribution, i.e., the propagated label matrix $\mathbf{Y}^{(K)}$. First, the label propagation method uses $\mathbf{Y}^{(K)}$ to predict test nodes without any learnable parameters. Second, one can concatenate $\mathbf{Y}^{(K)}$ with the node features and treat the result as the input of the classification model. To answer the research question Q3, we compare these three different methods of exploiting $\mathbf{Y}^{(K)}$. Specifically, we implement label smoothing and input concatenation over the GraphSAINT backbone, and directly adopt the proposed label propagation module.
Table 3 summarizes the test accuracies on the four benchmark datasets, where ALS generally achieves superior performance. The label propagation fails to learn informative label predictions on the ogbn-mag dataset, since it cannot combine the neighbor labels effectively without modeling the diverse node/edge types in heterogeneous graphs. Without any learnable parameters, label propagation cannot adapt the prior label knowledge to classification tasks. Compared with the input concatenation, our label smoothing approach directly regularizes the model prediction to avoid over-confident prediction and thus obtains better generalization performance.
| Method | Flickr | Reddit | ogbn-products | ogbn-mag |
|---|---|---|---|---|
| GraphSAINT+Label Input | 51.75 ± 0.22 | 95.10 ± 0.09 | 76.90 ± 0.32 | 47.20 ± 0.29 |
Recently, the scalable learning method C&S was proposed to refine model predictions in a post-processing step Huang et al. (2020), and shows promising performance on ogbn-products. By training a simple MLP to obtain the initial label predictions, C&S propagates prediction errors and labels to obtain smoothed prediction results. To demonstrate that ALS is general to any scalable learning framework, we use LS and ALS to regularize the training of the MLP module in C&S, and compare test accuracies in Table 4. By regularizing the MLP to obtain better initial label predictions, our proposed ALS can further improve the test accuracy up to , which is the state-of-the-art performance on the leaderboard.
To answer the research question Q4, we conduct experiments with different values of the pacing rate $\lambda$, loss hyperparameter $\gamma$, residual strength $\beta$, and label propagation step $K$. Figure 2 illustrates the hyperparameter studies of GraphSAINT+ALS on ogbn-products. In general, within appropriate value ranges, most of these hyperparameter settings can achieve test accuracies larger than , which is much better than the baseline GraphSAINT. Specifically, an overly small (or large) pacing rate $\lambda$ damages the test performance due to insufficient (or excessive) label smoothing regularization. The loss hyperparameter $\gamma$ should be large enough (e.g., ), so as to prevent the learned smoothed label from collapsing into the one-hot hard target and to guarantee its regularization effect. The superior performance brought by demonstrates the importance of the neighborhood label distribution in learning the structure-aware smoothed label. Based on Eq. (3), with a smaller $\beta$, we tend to aggregate neighborhood labels during the label propagation. Similar to the common GNN settings, a small value of $K$ is sufficient to aggregate the positively related neighbors and model the smoothed label correctly.
Global label relevance visualization.
We visualize a row-wise transformation of the global label relevance matrix $\mathbf{W}$. Note that $\mathbf{W}$ is learned with the GraphSAINT backbone on the Flickr dataset. As shown in Figure 3, we observe that each class label has unbalanced relevance strengths to the other classes. This is in line with our motivation that the pair-wise label relevance differs from pair to pair and is far from the uniform distribution. As demonstrated in the previous experiments, modeling such global label relevance delivers superior performance compared with LS.
5 Related Work
Graph neural networks.
GNNs have shown superiority in processing graphs, i.e., data with rich relational structures. GNNs can be categorized into spectral domain and spatial domain models. The spectral models Bruna et al. (2013) extend convolution on images to graph data by modeling on the spectrum of the graph Laplacian. Models designed from the spatial perspective simplify the spectral models. Spatial models such as ChebNet Defferrard et al. (2016), GCN Kipf and Welling (2016), GAT Veličković et al. (2017) and GIN Xu et al. (2018) can also be understood from the message passing perspective. GNNs are playing increasingly crucial roles in various applications such as recommender systems Ying et al. (2018a); Chen et al. (2020a); Zhou et al. (2021a), social network analysis Fan et al. (2019); Zhou et al. (2019a, 2020, 2021b), and biochemical module analysis Gilmer et al. (2017); Zhou et al. (2019b).
Scalable graph representation learning.
Two types of methods have been proposed to tackle scalability issue of GNNs, including sub-graph sampling methods Hamilton et al. (2017); Ying et al. (2018a); Chen et al. (2018); Gao et al. (2018); Chen et al. (2017); Chiang et al. (2019); Zeng et al. (2019) and precomputing methods Wu et al. (2019); Rossi et al. (2020); Bojchevski et al. (2020). To reduce computation and memory cost, the sub-graph sampling methods feed GNNs only with a small batch of sub-graphs, which consist of subsets of nodes and edges. Specifically, ClusterGCN Chiang et al. (2019) conducts training on sampled sub-graphs in each batch, but the sub-graphs are obtained through clustering algorithms. GraphSAINT Zeng et al. (2019) samples sub-graphs that are appropriately connected for information propagation, where a normalization technique is also proposed to eliminate bias. The major limitation for sub-graph based methods is that distant nodes in the original graph are unlikely to be fed into the GNNs in the same batch, thus leading to label bias in the trained models. The precomputing methods of SIGN Rossi et al. (2020) and SGCWu et al. (2019) remove trainable weights, and propagate node features over the graph in advance to store the smoothed features.
Label propagation distributes the observed node labels over the graph following the connection between nodes Zhou et al. (2004); Zhu (2005). It has been used for semi-supervised training on graph data Zhang et al. (2006), where node labels are partially observed. The assumption behind is that labels and features change smoothly over the edges of the graph. It has also been proposed to combine feature propagation with label propagation towards a unified message passing scheme Shi et al. (2020). Some recent work connects GNNs with label propagation Wang and Leskovec (2020); Jia and Benson (2020) by studying how labels/features spread over a graph and how the initial feature/label of one node influences the prediction of another node.
Label smoothing improves the generalization Szegedy et al. (2016); Pereyra et al. (2017) and robustness Pang et al. (2017); Goibert and Dohmatob (2019); Shen et al. (2019) of a deep neural network. Label smoothing replaces one-hot labels with smoothed labels. It has been shown that label smoothing has a similar effect to randomly replacing some of the ground-truth labels with incorrect values in each mini-batch Xie et al. (2016). Pang et al. (2017) propose reverse cross entropy for gradient smoothing, which encourages a model to better distinguish adversarial examples from normal ones in the representation space. Wang et al. (2020) propose the graduated label smoothing method, where high-confidence predictions are assigned a higher smoothing penalty than low-confidence ones.
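As a small illustration of the standard uniform variant, the sketch below mixes a one-hot target with the uniform distribution over classes and evaluates the cross-entropy against it. The helper names `smooth_labels` and `cross_entropy` and the value `alpha=0.1` are assumptions chosen for the example; this is not the adaptive scheme proposed in this paper.

```python
import math


def smooth_labels(label, num_classes, alpha=0.1):
    """Uniform label smoothing: (1 - alpha) * one_hot(label) + alpha / num_classes."""
    return [(1.0 - alpha) * (1.0 if c == label else 0.0) + alpha / num_classes
            for c in range(num_classes)]


def cross_entropy(target, probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


hard_target = [0.0, 0.0, 1.0, 0.0]
soft_target = smooth_labels(2, num_classes=4, alpha=0.1)

# A very confident prediction is cheap under the hard target but is
# penalized under the smoothed target.
confident_probs = [0.001, 0.001, 0.997, 0.001]
loss_hard = cross_entropy(hard_target, confident_probs)
loss_soft = cross_entropy(soft_target, confident_probs)
```

Because the smoothed target keeps some mass on the non-ground-truth classes, an over-confident prediction incurs a larger loss than it does under the one-hot target, which is the regularizing effect described above.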
In this paper, we point out the inherent label bias within the sampled sub-graphs used for mini-batch training on large-scale graphs. We empirically show that, when minimizing the vanilla cross-entropy loss, such label bias makes GNN models over-confident in predicting the ground-truth class and leads to overfitting. To overcome the label bias and the resulting over-confident predictions, we propose adaptive label smoothing, which replaces the one-hot hard target with a smoothed label that allocates prediction confidence to other classes to avoid overfitting. Specifically, we learn the smoothed label from the prior knowledge of the local neighborhood label distribution, together with a global label refinement that adapts to the graph data at hand. The experiments show that our algorithm generally improves the test performance by relieving overfitting on biased labels.
-  (2007) Learning on graph with laplacian regularization. Advances in neural information processing systems 19, pp. 25. Cited by: §3.1, §3.2.
-  (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples.. Journal of machine learning research 7 (11). Cited by: §3.1, §3.2.
-  (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §3.1.
-  (2020) Scaling graph neural networks with approximate pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2464–2473. Cited by: §5.
-  (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §5.
-  (2017) Stochastic training of graph convolutional networks with variance reduction. arXiv preprint arXiv:1710.10568. Cited by: §1, §5.
-  (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247. Cited by: §1, §5.
-  Revisiting graph based collaborative filtering: a linear residual graph convolutional network approach. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 27–34. Cited by: §5.
-  (2020) Simple and deep graph convolutional networks. In International Conference on Machine Learning, pp. 1725–1735. Cited by: §1.
-  (2019) Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266. Cited by: 2nd item, §1, §2, §4.1, §5.
-  (2015) One trillion edges: graph processing at facebook-scale. Proceedings of the VLDB Endowment 8 (12), pp. 1804–1815. Cited by: §1.
-  (2016) Convolutional neural networks on graphs with fast localized spectral filtering. arXiv preprint arXiv:1606.09375. Cited by: §5.
-  (2019) Adaptive regularization of labels. arXiv preprint arXiv:1908.05474. Cited by: §1, §3.2.
-  (2019) Graph neural networks for social recommendation. In The World Wide Web Conference, pp. 417–426. Cited by: §5.
-  (2018) Large-scale learnable graph convolutional networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1416–1424. Cited by: §1, §5.
-  (2020) Towards a better understanding of label smoothing in neural machine translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 212–223. Cited by: §1.
-  (2020) Learning better structured representations using low-rank adaptive label smoothing. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3.2.
-  (2017) Neural message passing for quantum chemistry. In International conference on machine learning, pp. 1263–1272. Cited by: §5.
-  (2019) Adversarial robustness via adversarial label-smoothing. arXiv preprint. Cited by: §5.
-  (2017) Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216. Cited by: 1st item, §A.1, §1, §4.1, §4.1, §5.
-  (2020) Open graph benchmark: datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687. Cited by: §A.1, §4.1.
-  (2020) Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993. Cited by: §1, §3, §4.2.
-  (2020) Residual correlation in graph neural network regression. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 588–598. Cited by: §3.1, §3.2, §3, §5.
-  (2015) Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29. Cited by: §3.1.
-  (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: 5th item, §1, §5.
-  (2020) Regularization via structural label smoothing. In International Conference on Artificial Intelligence and Statistics, pp. 1453–1463. Cited by: §1, §3.2.
-  (2021) From label smoothing to label relaxation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, AAAI, Online, February 2-9, 2021, Cited by: §1, §2.
-  (2020) Generalized entropy regularization or: there’s nothing special about label smoothing. arXiv preprint arXiv:2005.00820. Cited by: §1, §2.
-  (2019) When does label smoothing help?. arXiv preprint arXiv:1906.02629. Cited by: §1, §2, §3.2.
-  (2004) Coauthorship networks and patterns of scientific collaboration. Proceedings of the national academy of sciences 101 (suppl 1), pp. 5200–5205. Cited by: §1, §3.
-  (2017) Towards robust detection of adversarial examples. arXiv preprint arXiv:1706.00633. Cited by: §5.
-  (2017) Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548. Cited by: §5.
-  (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
-  (2020) Sign: scalable inception graph neural networks. arXiv preprint arXiv:2004.11198. Cited by: 5th item, §4.1, §5.
-  Defending against adversarial attacks by suppressing the largest eigenvalue of fisher information matrix. arXiv preprint arXiv:1909.06137. Cited by: §5.
-  (2020) Masked label prediction: unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509. Cited by: §5.
-  (2021) Scalable and adaptive graph neural networks with self-label-enhanced training. arXiv preprint arXiv:2104.09376. Cited by: §3.2.
-  Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §1, §5.
-  (2013) Exploiting homophily effect for trust prediction. In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 53–62. Cited by: §1, §1.
-  (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §5.
-  (2020) Unifying graph convolutional neural networks and label propagation. arXiv preprint arXiv:2002.06755. Cited by: §3.1, §3.2, §3, §5.
-  (2018) Billion-scale commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 839–848. Cited by: §1.
-  (2018) Acekg: a large-scale knowledge graph for academic data mining. In Proceedings of the 27th ACM international conference on information and knowledge management, pp. 1487–1490. Cited by: §1.
-  (2020) On the inference calibration of neural machine translation. arXiv preprint arXiv:2005.00963. Cited by: §5.
-  (2019) Simplifying graph convolutional networks. In International conference on machine learning, pp. 6861–6871. Cited by: §1, §5.
-  (2016) Disturblabel: regularizing cnn on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4753–4762. Cited by: §5.
-  (2018) How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826. Cited by: §1, §5.
-  (2020) Towards understanding label smoothing. arXiv preprint arXiv:2006.11653. Cited by: §1, §2.
-  (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983. Cited by: §5, §5.
-  (2018) Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804. Cited by: §3.
-  (2020) Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3903–3911. Cited by: §1, §2.
-  (2019) Graphsaint: graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931. Cited by: 3rd item, §A.1, §1, §2, §4.1, §4.1, §5.
-  Hyperparameter learning for graph based semi-supervised learning algorithms. In NIPS, Vol. 33, pp. 101. Cited by: §5.
-  (2004) Learning with local and global consistency. In Advances in neural information processing systems, pp. 321–328. Cited by: §3, §5.
-  (2021) Temporal augmented graph neural networks for session-based recommendations. arXiv preprint. Cited by: §5.
-  (2020) Towards deeper graph neural networks with differentiable group normalization. arXiv preprint arXiv:2006.06972. Cited by: §5.
-  (2021) Dirichlet energy constrained learning for deep graph neural networks. arXiv preprint arXiv:2107.02392. Cited by: §5.
-  (2019) Auto-gnn: neural architecture search of graph neural networks. arXiv preprint arXiv:1909.03184. Cited by: §5.
-  (2019) Multi-channel graph neural networks. arXiv preprint arXiv:1912.08306. Cited by: §3, §5.
-  Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning (ICML-03), pp. 912–919. Cited by: §3.
-  (2002) Learning from labeled and unlabeled data with label propagation. arXiv preprint. Cited by: §3.1, §3.2.
-  (2005) Semi-supervised learning literature survey. arXiv preprint. Cited by: §5.
Appendix A Appendix
The dataset statistics of Flickr, Reddit, ogbn-products, and ogbn-mag are listed in Table 5. Flickr is a social network, where the nodes represent images and the edges denote the shared properties between two images. The node classification task in Flickr is to categorize the types of images. Reddit is a social network, where the nodes represent posts in the Reddit forum and an edge indicates that the same user commented on both posts. The node classification task in Reddit is to predict the communities of online posts based on user comments. ogbn-products is an Amazon product co-purchasing network, where the nodes represent products sold on Amazon and the edges indicate the co-purchasing relationships between two products. The node classification task in ogbn-products is to predict the category of a product. ogbn-mag is a heterogeneous network extracted from the Microsoft Academic Graph. It contains four types of entities: papers, authors, institutions, and fields of study. The directed edges are categorized into four types: an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. The node classification task in ogbn-mag is to predict the venue (conference or journal) of each paper.
|Datasets||# Nodes||# Edges||# Classes||# Features||Train / Validation / Test|
|Flickr||89,250||899,756||7||500||0.50 / 0.25 / 0.25|
|Reddit||232,965||11,606,919||41||602||0.66 / 0.10 / 0.24|
|ogbn-products||2,449,029||61,859,140||47||100||0.08 / 0.02 / 0.90|
|ogbn-mag||1,939,743||21,111,007||349||128||0.85 / 0.09 / 0.06|
A.2 Backbone Frameworks
We evaluate our ALS on two main categories of scalable graph representation learning frameworks: one based on sub-graph sampling and the other based on precomputing. Although we aim at solving the label bias and over-confident prediction in the sub-graph sampling methods, we show that our method applies to both categories of scalable backbone frameworks. Specifically, we adopt the following five representative backbone frameworks:
GraphSAGE (sub-graph sampling based). It is a node-wise sampling method that uniformly samples a batch of training nodes and their neighbors of different orders. The sampled nodes and neighbors construct several sub-graphs that together form a batch.
Cluster-GCN (sub-graph sampling based). It first applies a node clustering algorithm to partition the input graph into a series of sub-graphs. During the training phase, each batch is directly formed by a random subset of the preprocessed sub-graphs.
MLP (precomputing based). MLP is widely used to classify nodes based on the precomputed node features. Herein, MLP directly uses the original node features, which has been shown to achieve good classification performance on graph data. In the precomputing methods, each node can be regarded as an independent sample that does not connect to its neighbors. The batch is thus directly represented by an independent subset of training nodes.
SIGN (precomputing based). In the preprocessing step, SIGN applies a message-passing strategy and precomputes node features of multiple orders, obtained by repeatedly multiplying the original node feature matrix by the normalized adjacency matrix used in GCN. The precomputed node features of different orders are concatenated together to augment the original node features. In the training phase, the batch is constructed from a random subset of training nodes, and taken as input to the downstream MLP classification model.
We implement the above scalable backbone frameworks according to the official examples of PyTorch Geometric (https://github.com/rusty1s/pytorch_geometric). The basic model hyperparameters are defined in the examples or determined according to the corresponding publications, including batch size, learning rate, weight decay, training epochs, hidden units, dropout rate, etc. All of the sub-graph sampling methods apply a three-layer GNN model, while the precomputing methods use a three-layer MLP. Following the official examples, we use full batch training in the precomputing methods.
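The precomputing step used by SIGN-style backbones can be sketched as follows. This is a minimal dense, pure-Python illustration and not the PyTorch Geometric implementation used in the experiments: it assumes the GCN normalization with self-loops, D^{-1/2} (A + I) D^{-1/2}, and the names `gcn_normalized_adjacency`, `matmul`, `sign_precompute`, and `num_hops` are hypothetical.

```python
import math


def gcn_normalized_adjacency(edges, num_nodes):
    """Dense D^{-1/2} (A + I) D^{-1/2} for an undirected edge list (assumed GCN normalization)."""
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    for i in range(num_nodes):
        A[i][i] = 1.0  # add self-loops
    deg = [sum(row) for row in A]
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(M, X):
    """Plain dense matrix product of M (n x n) and X (n x d)."""
    return [[sum(M[i][k] * X[k][j] for k in range(len(X)))
             for j in range(len(X[0]))]
            for i in range(len(M))]


def sign_precompute(edges, X, num_hops):
    """Concatenate [X, A_hat X, ..., A_hat^num_hops X] along the feature dimension."""
    A_hat = gcn_normalized_adjacency(edges, len(X))
    blocks, current = [X], X
    for _ in range(num_hops):
        current = matmul(A_hat, current)
        blocks.append(current)
    return [[value for block in blocks for value in block[i]]
            for i in range(len(X))]


# Toy path graph 0 - 1 - 2 with a single scalar feature per node.
toy_features = sign_precompute([(0, 1), (1, 2)], [[1.0], [0.0], [0.0]], num_hops=2)
```

After this step each row of `toy_features` is an independent sample, so a downstream MLP can be trained on arbitrary subsets of rows without touching the graph again.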
A.3 Implementation Details
We further implement LS and ALS over each backbone model. For LS with uniform distribution, we set the constant smoothing strength as , which is widely applied in regularizing image classification. For our ALS, we use the linear pacing and exponential pacing functions for the sub-graph sampling methods and precomputing methods, respectively. To have a fair comparison with LS, we set in our ALS. For each combination of backbone framework and dataset, we choose the appropriate hyperparameters of residual strength and step in the label propagation, and also determine the KL distance constraint as well as the smooth pacing rate. The detailed hyperparameters involved in ALS are shown in Table 6. Notably, in contrast to the sub-graph sampling methods, we use a negative pacing rate in the precomputing methods. Instead of using mini-batch training, the official examples of MLP and SIGN apply full batch training. This means that the trainable parameters and the soft label can be well updated in the initial training phase. Therefore, we use a decreasing smoothing strength in the precomputing methods, where the models are strictly regularized by the difficult smoothed label from the beginning. At the end of training, the precomputing methods are relaxed to learn the easy one-hot hard target to improve the test performance. The initial smoothing strengths in the exponential pacing function for MLP are: in Flickr, in Reddit, and in ogbn-products & ogbn-mag. The values of for SIGN are: in Flickr, in Reddit, in ogbn-products, and in ogbn-mag.
A.4 Running Environment
All the experiments are implemented with PyTorch, and tested on a machine with 24 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors, 128GB of CPU memory, and one GeForce RTX 3090 GPU with 24 GB of memory.
A.5 Label Bias, Over-confident Prediction and Overfitting Observations
As shown in Figure 1, the sub-graph sampling method of Cluster-GCN introduces label bias, and leads to over-confident prediction and overfitting on the training set. These problems damage the model’s generalization performance on the testing set. Our ALS learns to replace the one-hot hard target with the smoothed label, which relieves these three problems and improves the generalization ability. In this section, we report the training losses, the testing losses, and the label biases of all the sub-graph sampling methods on ogbn-products, ogbn-mag, and Flickr. We show the experimental results in Figures 4-11. Note that the batch sizes of GraphSAGE, Cluster-GCN, and GraphSAINT are defined by the corresponding sub-graph sampling functions in PyTorch Geometric (https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html), i.e., NeighborSampler, ClusterLoader, and GraphSAINTRandomWalkSampler. While the batch sizes of GraphSAGE and GraphSAINT specify how many training samples to load per batch, the batch size of Cluster-GCN determines how many clustered sub-graphs to sample. We make the following empirical observations:
All the sub-graph sampling methods introduce label bias within a batch. The standard deviation of the batch label distribution is extremely large, and is comparable to its mean value. In other words, the nodes within a small batch tend to belong to certain classes, and the label distributions vary dramatically between batches. In general, the smaller the batch size is, the larger this standard deviation will be.
Compared with the plain backbone frameworks, our ALS has larger training losses. This is because ALS replaces the one-hot hard target with the smoothed label, which distributes label confidence over both the ground-truth class and the other classes. By minimizing the regularized loss in Eq. (5), ALS reduces the model’s prediction probability on the ground-truth class, and thus increases the training loss. The regularized prediction probability on the ground-truth class helps the model avoid over-confident prediction and overfitting on the training set.
Compared with the plain backbone frameworks, our ALS generally has smaller testing losses and better generalization ability. Since the model is regularized to avoid over-confident prediction, the smooth prediction probability in ALS generalizes more easily to the unseen testing set.
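The first observation above can be illustrated with a small synthetic sketch. It measures how much the per-class proportion of labels fluctuates from batch to batch, and compares cluster-like batches against uniformly shuffled ones. The statistic computed by `label_bias` (the standard deviation across batches, averaged over classes) is an assumption made for illustration and is not necessarily the quantity plotted in the figures; the synthetic labels and batch sizes are likewise hypothetical.

```python
import random
import statistics


def batch_label_distribution(batch_labels, num_classes):
    """Fraction of nodes in the batch that belong to each class."""
    n = len(batch_labels)
    return [sum(1 for y in batch_labels if y == c) / n for c in range(num_classes)]


def label_bias(batches, num_classes):
    """Std of each class proportion across batches, averaged over classes."""
    dists = [batch_label_distribution(b, num_classes) for b in batches]
    per_class_std = [statistics.pstdev([d[c] for d in dists])
                     for c in range(num_classes)]
    return statistics.mean(per_class_std)


num_classes, nodes_per_class = 4, 100
all_labels = [c for c in range(num_classes) for _ in range(nodes_per_class)]

# Cluster-like batches: in an assortative graph, each cluster is dominated by one class.
clustered_batches = [[c] * nodes_per_class for c in range(num_classes)]

# Uniform batches: shuffle the same labels and split them into equally sized batches.
rng = random.Random(0)
shuffled = all_labels[:]
rng.shuffle(shuffled)
uniform_batches = [shuffled[i:i + nodes_per_class]
                   for i in range(0, len(shuffled), nodes_per_class)]

bias_clustered = label_bias(clustered_batches, num_classes)
bias_uniform = label_bias(uniform_batches, num_classes)
```

In this toy setting the clustered batches show a much larger fluctuation than the uniformly shuffled ones, mirroring the contrast between sub-graph sampling and uniform mini-batches discussed in the paper.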