CLNode: Curriculum Learning for Node Classification

by   Xiaowen Wei, et al.

Node classification is a fundamental graph-based task that aims to predict the classes of unlabeled nodes, for which Graph Neural Networks (GNNs) are the state-of-the-art methods. In current GNNs, training nodes (or training samples) are treated equally throughout training. The quality of the samples, however, varies greatly according to the graph structure. Consequently, the performance of GNNs could be harmed by two types of low-quality samples: (1) Inter-class nodes situated near class boundaries that connect neighboring classes. These nodes' representations lack the typical characteristics of their corresponding classes. Because GNNs are data-driven approaches, training on these nodes could degrade the accuracy. (2) Mislabeled nodes. In real-world graphs, nodes are often mislabeled, which can significantly degrade the robustness of GNNs. To mitigate the detrimental effect of the low-quality samples, we present CLNode (Curriculum Learning for Node Classification), which automatically adjusts the weights of samples during training based on their quality. Specifically, we first design a neighborhood-based difficulty measurer to accurately measure the quality of samples. Subsequently, based on these measurements, we employ a training scheduler to adjust the sample weights in each training epoch. To evaluate the effectiveness of CLNode, we conduct extensive experiments by applying it to four representative backbone GNNs. Experimental results on six real-world networks demonstrate that CLNode is a general framework that can be combined with various GNNs to improve their accuracy and robustness.



1 Introduction

Node classification is a fundamental graph-based task. Given a graph and a limited number of labeled nodes, the goal is to assign labels to the unlabeled nodes [16]. The state-of-the-art node classification methods are Graph Neural Networks (GNNs) [27]. Generally, GNNs adopt the neighborhood aggregation mechanism to update node representations by aggregating the messages passed from their neighbors. Benefiting from this mechanism, GNNs learn low-dimensional representations for nodes while preserving the topological information and node feature attributes, which are then used to predict the labels.

Figure 1: (a) Illustration of node difficulty. During neighborhood aggregation, one node aggregates messages from four classes, which results in an unclear representation and makes it a difficult node (low-quality sample) for its class. By contrast, a node whose aggregated messages all come from its own class is an easy node. (b) When only easy nodes are used as the training set to perform node classification, all nodes are correctly labeled. (c) Compared to subfigure (b), a difficult node is added as a training sample, making an unlabeled node more likely to be misclassified.

Although many GNN-based node classification works [6, 11, 14, 20, 26, 28] have been proposed, these works simply assume that all training nodes (or training samples) contribute equally during training. In fact, due to the topological structure of graphs, the quality of samples varies widely. Being data-driven approaches, GNNs exhibit degraded performance when trained on low-quality samples.

By means of the neighborhood aggregation mechanism, GNNs learn a representation for each node in the graph. We define training nodes whose representations lack the typical characteristics of their corresponding classes as difficult nodes, because it is difficult for GNNs to learn class characteristics from these low-quality samples. In contrast, easy nodes are high-quality samples whose representations have the representative characteristics of their classes. We illustrate difficult nodes and easy nodes using the paper citation network in Fig.1(a); here, each paper is classified into a specific field as its class. As the figure shows, the cross-field paper connects papers from multiple classes. In the process of neighborhood aggregation, it aggregates messages from neighbors in several classes and thus obtains an unclear representation that loses the typical characteristics of its own class, indicating that it is a difficult node. Conversely, all the messages aggregated by the easy node come from its own class, which helps it obtain the representative characteristics of that class. This observation raises the question of whether current GNNs are able to handle training samples of such uneven quality.

In the process of training current GNNs, all training examples are treated equally. However, the representations of easy nodes are clear, typical, and easily recognizable, so training on such nodes helps GNNs find clear decision boundaries. Conversely, difficult nodes should be used carefully, as their representations lack the typical characteristics of their classes. There are two types of difficult nodes that can degrade the performance of GNNs: (1) Inter-class nodes situated near class boundaries that connect neighbors from multiple classes. By aggregating messages from neighbors, these nodes obtain unclear representations; as a result, training on them can degrade the accuracy of GNNs. For example, Fig.1(b) shows a node classification task in which only easy nodes are used as training samples. In this case, GNNs can learn the typical class characteristics and thus predict the correct labels for all unlabeled nodes. However, as shown in Fig.1(c), if we add a difficult node to the training set, an unlabeled node that shares a similar representation with it becomes more likely to be misclassified into the difficult node's ground-truth class. (2) Mislabeled nodes. Real-world graphs often contain label noise, i.e., nodes can be mislabeled [3]. Mislabeled nodes are difficult nodes because their representations lack the characteristics of their label classes. Current GNNs are easily perturbed by training on these mislabeled difficult nodes, and are thus not robust to label noise [13]. Based on the above analysis, alleviating the negative impact of difficult nodes can improve the accuracy and robustness of GNNs. This observation suggests the utility of curriculum learning [1] in this context. Curriculum learning is a training strategy for machine learning models that can alleviate the negative impact of noise in data. Motivated by this, we consider applying curriculum learning to node classification.

In more detail, curriculum learning is a training strategy that trains a machine learning model from easier data to harder data [23]. The basic idea is to initially train the model with an easier training subset, then gradually introduce more difficult samples to the training process. By excluding low-quality difficult samples during initial training, curriculum learning alleviates overfitting to data noise, thus improving models' accuracy and robustness [9, 18]. Motivated by this, scholars have applied curriculum learning in a wide range of fields, such as computer vision (CV) [4, 5] and natural language processing (NLP) [19, 22]. The most critical component of curriculum learning is the difficulty measurer, which measures the difficulty (quality) of samples. In previous works, researchers have often designed difficulty measurers by observing the sample features; for example, sentence length is a popular difficulty measurer in NLP tasks because shorter sentences are easier for models to learn [15]. However, node difficulty cannot be measured directly from node features in a similar way. One feasible way to measure node difficulty is to utilize the graph structure. For example, if a node connects neighbors from multiple classes, it is likely to be an inter-class difficult node. However, this is challenging because only a limited number of node labels are available.

In this paper, we attempt to tackle the above challenging problem by proposing a Curriculum Learning framework for Node Classification, called CLNode. The key idea behind CLNode is to enhance the performance of the backbone GNN by incrementally introducing samples into the training process, starting with easy samples and progressing to difficult ones. In more detail, we first propose a neighborhood-based difficulty measurer to accurately measure the difficulty of samples. Based on these measurements, we propose a continuous training scheduler that introduces nodes to train GNNs in an easy-to-difficult fashion. CLNode is a general framework that can be combined with various GNNs to improve their node classification performance. The key contributions of this paper can be summarized as follows:

  • We propose CLNode, a novel curriculum learning based framework for node classification. CLNode first accurately identifies low-quality difficult nodes, then employs a selective training strategy to alleviate the negative impact of these nodes.

  • We demonstrate that CLNode can be directly plugged into existing GNNs. Without increasing the time complexity, CLNode enhances backbone GNNs by simply introducing samples to the training process in order from easy to difficult.

  • We conduct extensive experiments on six datasets. The results demonstrate that compared with baseline methods without curriculum learning, CLNode effectively improves the accuracy of backbone GNNs and enhances their robustness to label noise.

2 Related Work

Node Classification and Graph Neural Networks.  Node classification [16] aims to predict labels for unlabeled nodes in a given graph. As a fundamental task on graphs, node classification has various applications, including fraud detection [8], security and privacy analytics [21], and community detection [7, 10].

Recently, GNNs have emerged as promising approaches for analyzing graph data. Due to the long history of GNNs, we refer readers to [27, 31] for a comprehensive review. Based on the definition of graph convolution, GNNs can be broadly divided into two categories, namely spectral-based [2, 11, 26] and spatial-based [6, 20, 28]. [2] first explored spectral-based GNNs by utilizing a spectral filter on the spectral space. In a follow-up work, [11] proposed GCN, which simplifies the graph convolution operation. [26] proposed to remove the nonlinearity in GCN and thereby speed up the model. Compared to spectral-based methods, spatial-based methods define convolutions directly on graphs by performing operations on spatially close neighbors. [6] proposed GraphSAGE, a general inductive framework that generates embeddings for nodes by sampling local neighbors. [28] developed the Jumping Knowledge Network (JK-Net) for representation learning and devised an alternative graph structure-based strategy to select neighbors for each node. Compared to spectral-based methods, spatial-based methods are easier to apply to large networks. Although GNNs have achieved great success, they simply assume that all nodes in the training set contribute equally during training; consequently, training on the low-quality difficult nodes can significantly degrade their accuracy and robustness.

Curriculum Learning.  Inspired by the learning principle underlying human cognitive processes, curriculum learning [1] is proposed as a training strategy that trains machine learning models from easier samples to harder samples. Previous studies [1, 25] have shown that curriculum learning improves generalization capacity and guides the model towards a better parameter space. Motivated by this, scholars have exploited the power of curriculum learning in a wide range of fields, including CV [4, 5], NLP [15, 19, 22], partial label learning [12], graph classification [24], etc. To the best of our knowledge, however, no work has yet attempted to apply curriculum learning to node classification.

3 Preliminaries


3.1 Notation

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ is the vertex set while $\mathcal{E}$ is the edge set. The input feature vector of node $v$ is $x_v$, and the neighborhood of node $v$ is $\mathcal{N}(v)$. For the node classification task, a labeled node set $\mathcal{V}_L \subset \mathcal{V}$ is given, with $\mathcal{Y}_L$ denoting their ground-truth labels. $\mathcal{C}$ is the set of classes. The goal is to learn a function that predicts the classes of the unlabeled nodes in the graph.

3.2 Graph Neural Networks

Generally, a GNN involves two key computations for each node $v$ at every layer: (1) neighborhood aggregation: aggregating the messages passed from $\mathcal{N}(v)$; (2) representation update: updating $v$'s representation from its representation in the previous layer and the aggregated messages. Formally, the $k$-th layer representation of node $v$ is given by

$$h_v^{(k)} = \mathrm{UPDATE}\left(h_v^{(k-1)},\ \mathrm{AGGREGATE}\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right). \tag{1}$$

The final representation $h_v$, i.e., the output of the last layer, is used for various downstream tasks. For the node classification task, $h_v$ is usually set to be a $|\mathcal{C}|$-dimensional vector, where $h_{v,c}$ represents the probability that $v$ belongs to class $c$. The class with the highest probability is then used as the predicted class:

$$\hat{y}_v = \arg\max_{c \in \mathcal{C}} h_{v,c}. \tag{2}$$

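As a minimal illustration of the two computations above, the following Python sketch performs one round of mean aggregation followed by a simple update, then predicts by taking the highest-scoring class. The names `gnn_layer` and `predict`, the choice of mean aggregation, and the fixed 0.5/0.5 mixing are assumptions made for illustration; real backbone GNNs use learned weight matrices and nonlinearities that are omitted here.

```python
def gnn_layer(h, neighbors):
    """One round of mean aggregation followed by a simple update.

    h: dict mapping node -> list of floats (previous-layer representation)
    neighbors: dict mapping node -> list of neighbor nodes
    """
    new_h = {}
    for v, h_v in h.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            # Neighborhood aggregation: element-wise mean of neighbor messages.
            agg = [sum(h[u][i] for u in nbrs) / len(nbrs) for i in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        # Update: mix the node's own representation with the aggregate
        # (illustrative fixed weights instead of learned parameters).
        new_h[v] = [0.5 * own + 0.5 * msg for own, msg in zip(h_v, agg)]
    return new_h


def predict(h):
    """Return the index of the highest-scoring class for every node."""
    return {v: max(range(len(h_v)), key=h_v.__getitem__) for v, h_v in h.items()}
```

In this toy setting, a node whose neighbors all carry a different class signal can have its own representation pulled toward that class, which is the effect that makes inter-class nodes hard to learn from.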
3.3 Curriculum Learning

Curriculum learning alleviates the negative impact of difficult samples by using a curriculum to train the model. A curriculum is a sequence of training criteria $Q_1, Q_2, \ldots$ over the training epochs. Each criterion $Q_t$ is a reweighting of the training set. The initial criterion $Q_1$ consists of easier samples; as $t$ increases, the weights of difficult samples in $Q_t$ are gradually increased. In essence, designing such a curriculum for node classification requires us to design a difficulty measurer and a training scheduler. Here, the difficulty measurer decides the difficulty of each node in the training set; subsequently, based on the difficulty, the training scheduler generates $Q_t$ at any training epoch $t$ to train the model.

4 Methodology

In this section, we present the details of the proposed CLNode framework. As shown in Fig.2, the key idea behind our approach is to enhance the performance of the backbone GNN by incrementally introducing the training nodes to the training process from easy to difficult. Specifically, CLNode comprises two components: (i) a neighborhood-based difficulty measurer (Fig.2(b)). We first perform a standard node classification to assign pseudo-labels to unlabeled nodes, after which we design a score function to measure the difficulty of training nodes. (ii) a continuous training scheduler (Fig.2(c)). After determining the difficulty of training nodes, we design a training scheduler to train with easy nodes initially and then continuously introduce difficult nodes into the training process. We present the details of these components in the following subsections.

Figure 2: Overall framework of the proposed CLNode. (a) The input graph, where colored nodes represent the labeled training nodes. (b) Illustration of the neighborhood-based difficulty measurer. We first assign pseudo-labels to unlabeled nodes, after which a score function is used to measure the difficulty of nodes in the training set. (c) A training scheduler is used to train the backbone GNN according to an easy-to-difficult curriculum.

4.1 Neighborhood-based Difficulty Measurer

In this subsection, we describe the process of measuring node difficulty. Specifically, there are two types of nodes that should be considered more difficult: (1) inter-class nodes situated near class boundaries that connect neighbors from multiple classes. The representations of these nodes lack the typical characteristics of their classes, meaning that training on these nodes could degrade the accuracy of GNNs. (2) mislabeled nodes. In real-world graphs, nodes are often mislabeled, and training on these nodes significantly degrades the robustness of GNNs.

In general, neighborhood aggregation benefits from the homophily of graphs, i.e., a node $v$'s neighbors tend to have the same label as $v$. The two types of difficult nodes violate homophily in different ways: the neighbors of an inter-class node have diverse labels because they belong to multiple classes, while the neighbors of a mislabeled node tend to have different labels from it. Taking a step further, the difficulty of nodes can be measured by the label distribution of their neighborhood. However, in the node classification task, we only have limited labels $\mathcal{Y}_L$, meaning that we cannot directly measure the label distribution. Therefore, the designed neighborhood-based difficulty measurer consists of two steps: the first is to assign pseudo-labels to unlabeled nodes, and the second is to measure the difficulty of nodes by the label distribution of their neighborhood.

We begin by assigning pseudo-labels to unlabeled nodes. As shown in Fig.2(b), we first use the backbone GNN model to perform a standard node classification. For convenience, we refer to the model first used to get the pseudo-labels as $f_1$:

$$\hat{Y} = f_1(\mathcal{G}, \mathcal{V}_L, \mathcal{Y}_L), \tag{3}$$

where $\hat{Y}$ denotes the predicted labels for all nodes on the graph. According to $\hat{Y}$, the label distribution can be measured. However, directly using $\hat{Y}$ for all nodes may result in inaccurately measured difficulties. For example, assume that a node $u$ in $\mathcal{V}_L$ is mislabeled. Generally, $u$ should be measured to be a difficult node. However, $f_1$ may assign the correct label to $u$; accordingly, if we use $\hat{Y}[u]$ as $u$'s label to measure the label distribution of its neighborhood, we may mistake $u$ for an easy node. Therefore, to better measure the difficulty of nodes, we retain the ground-truth labels for nodes in the training set:

$$\tilde{Y}[v] = \begin{cases} \mathcal{Y}_L[v], & v \in \mathcal{V}_L \\ \hat{Y}[v], & \text{otherwise.} \end{cases} \tag{4}$$

After obtaining $\tilde{Y}$, for each node $u$ in $\mathcal{V}_L$, we calculate its difficulty with reference to the label distribution of its neighborhood. The first type of difficult nodes (inter-class nodes) have diverse neighbors that belong to multiple classes. In order to identify these inter-class difficult nodes, we calculate the diversity of the neighborhood's labels:

$$D_{div}(u) = -\sum_{c \in \mathcal{C}} P_c(u) \log P_c(u), \qquad P_c(u) = \frac{\left|\{v \in \hat{\mathcal{N}}(u) : \tilde{Y}[v] = c\}\right|}{|\hat{\mathcal{N}}(u)|}, \tag{5}$$

where $\hat{\mathcal{N}}(u)$ denotes $\mathcal{N}(u) \cup \{u\}$. A larger $D_{div}$ indicates a more diverse neighborhood. Taking Fig.2(b) as an example, the $D_{div}$ of node 1 is 0.53, which is much larger than that of node 15, indicating that node 1 has more diverse neighbors than node 15. Nodes with large $D_{div}$ are more likely to be inter-class nodes, the representations of which lack the typical characteristics of their classes. By contrast, nodes with small $D_{div}$ tend to be intra-class nodes whose representations are clear, making it easy to learn class characteristics from them. By paying less attention to inter-class difficult nodes than intra-class nodes, CLNode learns more useful information and effectively improves the accuracy of backbone GNNs.

$D_{div}$ measures the diversity of the neighborhood to identify the inter-class difficult nodes. However, it does not reliably identify mislabeled nodes. For example, a mislabeled node $u$ could have a small $D_{div}(u)$ because all its neighbors belong to the same class. To identify mislabeled nodes, we calculate the proportion of neighbors whose labels are inconsistent with $\tilde{Y}[u]$:

$$D_{cons}(u) = \frac{\left|\{v \in \mathcal{N}(u) : \tilde{Y}[v] \neq \tilde{Y}[u]\}\right|}{|\mathcal{N}(u)|}. \tag{6}$$

$D_{cons}$ reflects the difficulty based on the label consistency of the neighborhood. By using $D_{cons}$, mislabeled nodes can be identified as difficult nodes because their labels tend to be inconsistent with those of their neighbors. In this way, CLNode reduces the weights of mislabeled nodes during training, thus improving the robustness of the backbone GNNs to label noise. Considering both the diversity and consistency of the neighborhood, we finally define the difficulty of $u$ as follows:

$$D(u) = D_{div}(u) + \alpha D_{cons}(u), \tag{7}$$

where $\alpha$ is a hyper-parameter that controls the weight of $D_{cons}$.
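The two-step measurer described in this subsection can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the diversity term is assumed to be the Shannon entropy (natural logarithm) of the labels in the neighborhood including the node itself, and the consistency term is taken over the neighbors only.

```python
import math
from collections import Counter


def merge_labels(pseudo_labels, train_labels):
    """Keep ground-truth labels for training nodes, pseudo-labels elsewhere."""
    merged = dict(pseudo_labels)
    merged.update(train_labels)
    return merged


def node_difficulty(u, neighbors, labels, alpha=1.0):
    """Difficulty of node u = diversity + alpha * inconsistency."""
    nbrs = list(neighbors.get(u, []))
    hood = nbrs + [u]  # neighborhood including u itself (assumption)
    counts = Counter(labels[v] for v in hood)
    # Diversity: entropy of the label distribution in the neighborhood.
    diversity = -sum((k / len(hood)) * math.log(k / len(hood))
                     for k in counts.values())
    # Consistency: share of neighbors whose label differs from u's label.
    if nbrs:
        inconsistency = sum(labels[v] != labels[u] for v in nbrs) / len(nbrs)
    else:
        inconsistency = 0.0
    return diversity + alpha * inconsistency
```

Because the training nodes keep their given labels in `merge_labels`, a mislabeled training node surrounded by correctly pseudo-labeled neighbors receives a high inconsistency score even when the first model would have predicted its true class.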

4.2 Continuous Training Scheduler

After measuring the difficulty of each node in $\mathcal{V}_L$, we use a curriculum-based training strategy to train a better GNN model (Fig.2(c)). To distinguish it from $f_1$, we denote the model trained with the curriculum as $f_2$. We propose a continuous training scheduler to generate the easy-to-difficult curriculum. In more detail, we first sort the training set in ascending order of node difficulty; subsequently, a pacing function $g(t)$ is used to map each training epoch $t$ to a scalar $\lambda_t$ whose range is $(0, 1]$, meaning that a proportion $\lambda_t$ of the easiest nodes are used as the training set at the $t$-th training epoch. Let $\lambda_0$ denote the initial proportion of the available easiest examples, while $T$ denotes the training epoch when $g(t)$ reaches 1 for the first time. We consider three pacing functions, namely linear, root, and geometric:

  • linear: $g(t) = \min\left(1,\ \lambda_0 + (1 - \lambda_0)\,\frac{t}{T}\right)$

  • root: $g(t) = \min\left(1,\ \sqrt{\lambda_0^2 + (1 - \lambda_0^2)\,\frac{t}{T}}\right)$

  • geometric: $g(t) = \min\left(1,\ 2^{\,\log_2 \lambda_0 - \log_2 \lambda_0 \cdot \frac{t}{T}}\right)$
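The three schedules can be sketched in Python as follows. The closed forms used here are an assumption: they are the standard linear, root, and geometric interpolations from the initial proportion at epoch 0 to 1 at epoch `T`, and the names `linear_pace`, `root_pace`, `geometric_pace`, and `lam0` are illustrative.

```python
import math


def linear_pace(t, T, lam0):
    """Proportion grows at a uniform rate from lam0 to 1."""
    return min(1.0, lam0 + (1.0 - lam0) * t / T)


def root_pace(t, T, lam0):
    """Concave schedule: difficult nodes are introduced in fewer epochs."""
    return min(1.0, math.sqrt(lam0 ** 2 + (1.0 - lam0 ** 2) * t / T))


def geometric_pace(t, T, lam0):
    """Convex schedule: more epochs are spent on the easy subset."""
    return min(1.0, 2.0 ** (math.log2(lam0) - math.log2(lam0) * t / T))
```

With `lam0 = 0.25` and `T = 100`, the proportions at epoch 50 are 0.625 (linear), about 0.729 (root), and 0.5 (geometric), which matches the concave and convex shapes described for the root and geometric functions.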


The visualization of these three pacing functions is presented in Fig.4. As shown in the figure, the linear function increases the difficulty of training samples at a uniform rate; the root function is a concave function that introduces more difficult nodes in fewer epochs, while the geometric function is a convex function that trains for a greater number of epochs on the subset of easy nodes before introducing difficult nodes. By using the pacing function to continuously introduce training nodes into the training process, we assign appropriate training weights to nodes of different levels of difficulty. Specifically, the more difficult a training node is, the later it is introduced into the training process, meaning it has a smaller training weight.

Moreover, it is worth noting that we do not stop training immediately when $t = T$, because at this time, the backbone GNN may not have fully explored the knowledge of samples which have been recently introduced. Instead, we use the early stopping mechanism to decide when to stop training: we first set a patience score, the value of which begins to decrease when the accuracy on the validation set no longer improves. Once the score reaches 0, the training process will be terminated.
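A patience-based stopping rule of this kind can be sketched as below. The function name and the list of per-epoch validation accuracies are hypothetical stand-ins for evaluating a real model, and resetting the patience score whenever the accuracy improves is an assumption (a common convention) that the text does not spell out.

```python
def train_with_patience(val_accuracies, patience=3):
    """Return (last_epoch, best_accuracy) under patience-based early stopping.

    val_accuracies: iterable of validation accuracy per epoch, standing in
    for evaluating the model after each training epoch.
    """
    best = float("-inf")
    remaining = patience
    last_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        last_epoch = epoch
        if acc > best:
            best = acc
            remaining = patience  # improvement: reset the patience score
        else:
            remaining -= 1        # no improvement: spend one unit of patience
            if remaining == 0:
                break             # patience exhausted: stop training
    return last_epoch, best
```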

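Input: A graph $\mathcal{G}$, the labeled node set $\mathcal{V}_L$, the ground-truth label set $\mathcal{Y}_L$, the backbone GNN $f$, the hyper-parameter $\alpha$ in Eq.(7), the pacing function $g(t)$.
Output: The predicted labels $\hat{Y}$.
1: Initialize parameters of two GNN models $f_1$ and $f_2$.
2: $\hat{Y} \leftarrow$ Eq.(3)
3: $\tilde{Y} \leftarrow$ Eq.(4)
4: for $u \in \mathcal{V}_L$ do
5:      $D(u) \leftarrow$ Eq.(7)
6: end for
7: Sort $\mathcal{V}_L$ according to difficulty in ascending order
8: Let $t = 1$.
9: while not converge do
10:     $\lambda_t \leftarrow g(t)$
11:     Take the first $\lambda_t$ proportion of the sorted $\mathcal{V}_L$ as the training subset $\mathcal{V}_t$
12:     Use $f_2$ to predict the labels
13:     Calculate cross-entropy loss on $\mathcal{V}_t$
14:     Back-propagation on $f_2$ for minimizing the loss
15:     $t \leftarrow t + 1$
16: end while
17: Predict $\hat{Y}$ with $f_2$
Algorithm 1 CLNode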
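The curriculum part of the algorithm only changes which training nodes contribute to the loss in each epoch. The following Python sketch isolates that step; `curriculum_subsets` is a hypothetical helper, the pacing function is passed in as a callable, and truncating the subset size with `int` (with a minimum of one node) is an assumption. The forward pass, loss, and back-propagation of the backbone GNN are deliberately left out.

```python
def curriculum_subsets(train_nodes, difficulty, pace, num_epochs):
    """Yield (epoch, subset) pairs following an easy-to-difficult curriculum.

    train_nodes: iterable of labeled node ids
    difficulty: dict mapping node id -> difficulty score
    pace: callable mapping an epoch index to a proportion in (0, 1]
    """
    ordered = sorted(train_nodes, key=lambda v: difficulty[v])  # easiest first
    for t in range(num_epochs):
        lam = pace(t)
        size = max(1, int(lam * len(ordered)))  # keep at least one node
        # In a full training loop, the loss for epoch t would be computed
        # on this subset only.
        yield t, ordered[:size]
```

Since the subset is a prefix of one fixed ordering, the sort is done once before training and each epoch only needs a slice.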

4.3 Pseudo-code and Complexity Analysis


In this subsection, we present the pseudo-code of CLNode and explore its time complexity. The process of CLNode is detailed in Algorithm 1. Lines 2–6 describe the neighborhood-based difficulty measurer and lines 7–16 describe the process of training the backbone GNN with a curriculum. As the pseudo-code shows, CLNode is easy to implement and can be directly plugged into any backbone GNN, as it only changes the training set in each training epoch (lines 10–15).

For the convenience of complexity analysis, we consider GCN as the backbone GNN. The time complexity of an -layer GCN model in one epoch is , where is the number of edges, is the number of nodes, and is the number of node feature attributes. We assume that GCN converges after epochs, such that the time complexity of a standard GCN is . The time complexity of training is the same as that of GCN. Next, for calculating the difficulty, we traverse the neighbors of each node; thus, the time complexity is . The time complexity of sorting is . Because in the node classification task, the sorting operation does not increase the time complexity. Finally, we analyze the time complexity of training . The time complexity of in one epoch is the same as that of GCN. We first train epochs using the curriculum, after which we train using the whole until it converges. The training of the first epochs can be seen as pre-training with high-quality samples. Therefore, will converge before epochs. Based on the above analysis, the upper bound on the time complexity of CLNode is . In our experiments, we observed the best performance when was in the range of ; therefore, CLNode does not increase the time complexity level of backbone GNN.

5 Experiments

In this section, we first evaluate the improvement in accuracy achieved by CLNode over various backbone GNNs on the node classification task. We next conduct experiments on graphs with label noise to demonstrate the robustness of CLNode. Subsequently, we conduct ablation studies to demonstrate the effectiveness of the different components of CLNode. Finally, we discuss the training time and parameter sensitivity of CLNode.

We conduct node classification on six benchmark datasets: Cora, CiteSeer [16], Coauthor Physics (Co-Physics), Coauthor CS (Co-CS), Amazon Computers (A-Computers), and Amazon Photo (A-Photo) [17]. Cora and CiteSeer are paper citation networks, Co-Physics and Co-CS are social networks that represent collaboration relationships between scientists, and A-Computers and A-Photo are co-purchase networks that represent common purchase relationships between products. The detailed statistics of these datasets are listed in Table 1.

Dataset       Nodes   Edges    Features  Classes
Cora          2708    5429     1433      7
CiteSeer      3327    4732     3703      6
Co-Physics    34493   247962   8415      5
Co-CS         18333   81894    6805      15
A-Computers   13381   245778   767       10
A-Photo       7487    119043   745       8
Table 1: Statistics of six datasets.

We use four popular GNNs as the backbone models, namely GCN [11], SGC [26], GraphSAGE [6], and JK-Net [28], which are representative of a broad range of GNNs. In more detail, GCN is a typical convolution-based GNN, SGC simplifies GCN by removing the non-linearity components, GraphSAGE can be applied to inductive learning, and JK-Net is a representative deep GNN. We use backbone GNNs without curriculum learning as baselines to explore the improvement achieved by CLNode.

With regard to the hyper-parameters in the baselines, we make only minor changes to the setups suggested by their authors. To facilitate fair comparison, the backbone GNNs' parameters in CLNode are exactly the same as in the baselines. For CLNode, we set $\alpha$ to 1.0 and use the geometric pacing function by default. For the hyper-parameters of the training scheduler, $\lambda_0$ is searched in the range of {0.25, 0.5, 0.75}, while the search space of $T$ is {50, 100, 150}.

5.1 Node Classification


In this subsection, we conduct node classification experiments with standard splits and random splits. The standard splits follow [11] in using 20 labeled nodes per class as the training set for six datasets. Under random splits, we randomly label a specific proportion of nodes as the training set. Specifically, we conduct random splits on Cora and CiteSeer with label rates of 1%, 3%, and 5%, respectively. In each dataset, we use 500 nodes for validation to perform early stopping, and 1000 nodes for testing.
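For concreteness, a random split at a given label rate could be generated as in the sketch below. The function `random_split` is hypothetical, and drawing the training, validation, and test sets as disjoint slices of one shuffled node list is an assumption, since the text only fixes the label rate and the validation and test sizes.

```python
import random


def random_split(nodes, label_rate, num_val=500, num_test=1000, seed=0):
    """Split nodes into disjoint train/validation/test lists."""
    nodes = list(nodes)
    rng = random.Random(seed)  # fixed seed so a trial can be reproduced
    rng.shuffle(nodes)
    num_train = max(1, round(label_rate * len(nodes)))
    train = nodes[:num_train]
    val = nodes[num_train:num_train + num_val]
    test = nodes[num_train + num_val:num_train + num_val + num_test]
    return train, val, test
```

For a graph with 2708 nodes, a label rate of 1% yields 27 training nodes under this rounding rule.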

We report the mean test accuracy and standard deviation for 10 trials. The results under standard splits and random splits are summarized in Tables 2 and 3, respectively. The experimental results demonstrate that CLNode can be combined with the four selected backbone GNNs and improves their accuracy on node classification in most cases. Under the standard split of Cora, CLNode improves the test accuracy of backbone GNNs by 0.9% (GCN), 0.8% (SGC), 1.3% (GraphSAGE), and 2.6% (JK-Net). The results indicate that CLNode effectively alleviates the negative impact of difficult nodes, thereby enabling more useful information to be learned from samples of uneven quality.

Backbone   Method    Cora      CiteSeer  Co-Physics  Co-CS     A-Computers  A-Photo
GCN        Original  81.4±0.8  70.4±1.6  93.0±0.9    91.1±0.5  83.1±1.2     90.9±0.7
           +CLNode   82.3±0.6  71.0±0.6  94.3±0.6    92.4±0.3  83.8±1.0     92.8±0.5
SGC        Original  80.3±1.1  70.8±2.0  93.2±0.5    91.9±0.3  82.5±0.8     88.7±1.3
           +CLNode   81.1±1.0  71.8±1.1  93.1±0.8    91.9±0.5  83.4±0.4     91.6±0.4
GraphSAGE  Original  80.2±0.9  69.3±1.5  92.1±0.9    91.7±0.7  79.3±4.4     90.4±1.1
           +CLNode   81.5±0.7  70.2±1.3  93.7±0.5    91.6±0.7  83.6±1.3     92.0±0.8
JK-Net     Original  75.2±2.6  64.9±1.2  93.1±1.0    90.1±0.7  81.2±2.0     90.9±0.8
           +CLNode   77.8±1.4  64.5±2.4  94.7±0.6    90.7±0.6  82.9±0.7     91.8±0.6
Table 2: Test Accuracy (%) on six benchmark datasets with standard splits.

                      Cora                            CiteSeer
Backbone   Method     1%        3%        5%          1%        3%        5%
GCN        Original   63.8±3.2  76.9±0.5  81.3±0.8    53.2±1.9  65.4±0.9  66.3±0.7
           +CLNode    69.4±1.4  79.7±0.4  81.9±0.5    58.2±1.7  66.9±0.6  67.5±0.6
           (Improv.)  5.6%      2.8%      0.6%        5.0%      1.5%      1.2%
SGC        Original   54.4±1.2  76.7±0.1  80.4±0.8    50.4±1.7  65.1±0.6  65.8±0.6
           +CLNode    64.2±0.3  79.5±0.5  81.8±0.5    54.0±0.4  67.3±0.8  67.8±0.7
           (Improv.)  9.8%      2.8%      1.4%        3.6%      2.2%      2.0%
GraphSAGE  Original   54.0±1.1  74.4±1.6  79.9±0.9    51.0±2.4  64.2±2.5  65.2±1.4
           +CLNode    63.5±2.0  77.7±1.5  81.0±0.5    54.4±2.0  66.6±1.0  67.1±0.8
           (Improv.)  9.5%      3.3%      1.1%        3.4%      2.4%      1.9%
JK-Net     Original   64.5±3.6  78.5±1.2  80.1±1.4    53.2±2.8  65.9±1.0  66.7±1.2
           +CLNode    68.6±2.3  79.4±0.5  81.2±1.0    57.0±2.2  67.0±1.3  67.4±1.2
           (Improv.)  4.1%      0.9%      1.1%        3.8%      1.1%      0.7%
Table 3: Test Accuracy (%) on Cora and CiteSeer with random splits.

We also observe that when there are fewer labeled nodes, the improvement achieved by CLNode is more obvious (Table 3). For example, under the random splits of Cora, CLNode improves GCN by 5.6%, 2.8%, and 0.6% at label rates of 1%, 3%, and 5%, respectively. This is because, due to the sampling, the training distribution $P_{train}$ is different from the test (target) distribution $P_{target}$. Easy nodes in the training set contain high-confidence knowledge, on which $P_{train}$ and $P_{target}$ are consistent. By contrast, the low-quality difficult nodes in the training set appear as the unique characteristics of $P_{train}$; thus, training on difficult nodes causes models to overfit $P_{train}$ and degrades the accuracy on $P_{target}$. When there are fewer training samples, $P_{train}$ tends to be more significantly different from $P_{target}$; in this case, training on difficult nodes is especially harmful, because overfitting $P_{train}$ can significantly degrade the accuracy on $P_{target}$. Conversely, CLNode pays more attention to the easy nodes on which $P_{train}$ and $P_{target}$ are consistent, thereby learning more useful knowledge. For many real-world graphs, the labeling process can be tedious and costly, resulting in limited labels. It would therefore be highly beneficial to use CLNode in these situations.

5.2 Robustness to Noise

In this subsection, we investigate whether CLNode enhances the robustness of backbone GNNs to label noise. In a noisily labeled graph, each label has a probability of $p$ to be flipped to another class, where $p$ denotes the noise rate. Following [3, 29, 30], we corrupt the labels of the training and validation sets with two kinds of label noise here:

  • Uniform noise. The label has a probability of $p$ to be mislabeled as any other class.

  • Pair noise. We assume that samples in one class can only be mislabeled as their closest class; that is, labels have a probability of $p$ to flip to their pair class.
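The two corruption schemes can be sketched in Python as follows. The function `corrupt_labels` is hypothetical, labels are assumed to be integers in `[0, num_classes)`, and the pair class is assumed to be `(y + 1) % num_classes` purely as a placeholder, because the text does not specify how the closest class is determined.

```python
import random


def corrupt_labels(labels, num_classes, p, kind="uniform", seed=0):
    """Return a noisy copy of `labels` with noise rate p."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            if kind == "uniform":
                # Flip to any other class, chosen uniformly at random.
                y = rng.choice([c for c in range(num_classes) if c != y])
            elif kind == "pair":
                # Flip to the pair class (placeholder pairing, see lead-in).
                y = (y + 1) % num_classes
            else:
                raise ValueError(f"unknown noise kind: {kind}")
        noisy.append(y)
    return noisy
```

Under either scheme a corrupted label always differs from the original one, so the expected fraction of wrong labels equals the noise rate.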

Figure 3: Test Accuracy (%) on Cora with various levels of label noise.

We conduct experiments on Cora under the standard split and vary $p$ over {0%, 5%, …, 30%} to compare the performance of CLNode and the baseline GNNs under different levels of noise. The average test accuracy over 10 runs is shown in Fig.3; we only report the results using GCN and SGC as backbone GNNs because we have similar observations for other GNNs. From Fig.3, we can observe that as the noise rate increases, the performance of all baselines drops dramatically. CLNode also suffers under conditions of increased label noise; however, when there is more noise in the graph, the performance gap between CLNode and the baselines increases. This demonstrates that CLNode can effectively enhance the robustness of backbone GNNs to both kinds of label noise, since CLNode considers these mislabeled nodes as difficult nodes and reduces their weights for training, while the baseline GNNs treat all nodes as equal and consequently overfit to noise.

5.3 Ablation Study

In this subsection, we conduct ablation studies to demonstrate the effectiveness of the different components of CLNode. To verify that Eq.(7) accurately identifies the two types of difficult nodes (i.e., inter-class nodes and mislabeled nodes) by considering the diversity and consistency of the neighborhood, we design two score functions to replace Eq.(7):

  • Measuring difficulty only with the diversity term; i.e., we remove the consistency term from Eq.(7) and use only the diversity of the neighborhood to measure the difficulty.

  • Measuring difficulty only with the consistency term; i.e., we remove the diversity term from Eq.(7) and use only the consistency of the neighborhood to measure the difficulty.
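To make the ablation concrete, the sketch below scores nodes with a diversity term, a consistency term, or both. Eq.(7) is not reproduced in this section, so both terms here are illustrative stand-ins and not the paper's definitions: diversity is taken to be the entropy of the labels in a node's closed neighborhood, and the consistency term is taken to be the fraction of neighbors whose label disagrees with the node's own. All function names are hypothetical.

```python
import math
from collections import Counter

def diversity(node, neighbors, labels):
    """ASSUMED diversity term: label entropy over the node and its neighbors."""
    hood = [node] + list(neighbors[node])
    counts = Counter(labels[v] for v in hood)
    total = len(hood)
    return sum((n / total) * math.log(total / n) for n in counts.values())

def inconsistency(node, neighbors, labels):
    """ASSUMED consistency term: share of neighbors with a different label."""
    nbrs = neighbors[node]
    if not nbrs:
        return 0.0
    return sum(labels[v] != labels[node] for v in nbrs) / len(nbrs)

def difficulty(node, neighbors, labels, mode="both"):
    """Score a node with one term ('div' or 'cons') or their sum ('both')."""
    if mode == "div":
        return diversity(node, neighbors, labels)
    if mode == "cons":
        return inconsistency(node, neighbors, labels)
    if mode == "both":
        return diversity(node, neighbors, labels) + inconsistency(node, neighbors, labels)
    raise ValueError("mode must be 'div', 'cons' or 'both'")

# Toy path-like graph: nodes 0-2 share a class, nodes 3-4 share another.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
labels = [0, 0, 0, 1, 1]
for v in neighbors:
    print(v, round(difficulty(v, neighbors, labels), 3))
```

Under these stand-in definitions, nodes 0 and 1 sit inside one class and score zero, while nodes 2 and 3 on the class boundary score higher, which is the behavior the ablation is probing.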

We use these two score functions to replace Eq.(7) for the ablation studies; below, we refer to the ablated methods as CLNode(div) and CLNode(cons), respectively. Experiments are conducted on six datasets under standard splits, where the graphs are corrupted with uniform label noise and the noise rate is set to 20%. The results of 10 trials are reported in Table 4, from which we can observe the following: (1) Generally, both CLNode(div) and CLNode(cons) outperform the baseline method, which demonstrates that they measure the difficulty of nodes from different perspectives and thus alleviate the impact of different types of difficult nodes; (2) CLNode achieves the best result on every dataset (tied with CLNode(cons) on A-Computers), indicating that by combining these two perspectives to evaluate node difficulty, CLNode effectively identifies both types of difficult nodes, thus enhancing the accuracy and robustness of backbone GNNs.

Method              Cora      CiteSeer  Co-Physics  Co-CS     A-Computers  A-Photo
GCN (Original)      73.7±2.4  60.4±1.6  92.0±1.6    89.0±1.4  79.2±1.8     86.6±1.9
GCN +CLNode(div)    76.6±2.4  66.0±1.6  91.7±1.3    90.9±1.2  80.3±1.4     89.0±1.7
GCN +CLNode(cons)   76.8±1.6  65.1±1.9  92.5±1.0    90.7±0.9  81.0±1.6     89.4±1.4
GCN +CLNode         77.5±1.4  66.1±1.4  92.8±1.0    91.2±0.8  81.0±2.2     89.5±2.3
Table 4: Test Accuracy (%) on noisily labeled graphs with different score functions.

Figure 4: Visualization of three pacing functions.

Dataset      linear  root  geometric
Cora         82.2    82.0  82.3
CiteSeer     70.4    70.1  71.0
Co-Physics   93.8    93.6  94.3
Co-CS        92.3    91.5  92.4
A-Computers  83.7    83.6  83.8
A-Photo      92.8    92.6  92.6
Table 5: Test Accuracy (%) on six benchmark datasets with different pacing functions.

In Table 5, we evaluate the sensitivity of CLNode to three pacing functions: linear, root, and geometric. We use GCN as the backbone GNN and compare the accuracy of CLNode using these three pacing functions. We find that the geometric pacing function has a slight advantage on most datasets (five of the six; linear is best on A-Photo). As shown in Fig.4, the geometric function trains for a greater number of epochs on the subset of easy nodes before introducing difficult nodes. By contrast, the root and linear functions introduce more difficult nodes after fewer training epochs. We therefore believe that, to alleviate the negative impact of difficult nodes, the high-confidence knowledge in easy nodes should be fully explored before more difficult nodes are introduced.
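The three schedules compared here can be sketched as below. The closed forms are the ones commonly used in curriculum-learning work and are an assumption, since this section does not restate them; `lam0` (the initial fraction of the training set, which must be positive for the geometric form) and `T` (the epoch at which the whole training set becomes available) are illustrative parameter names.

```python
import math

def linear(t, lam0, T):
    """Fraction of training nodes available at epoch t, growing linearly."""
    return min(1.0, lam0 + (1.0 - lam0) * t / T)

def root(t, lam0, T):
    """Square-root growth: grows quickly at first, then slows down."""
    return min(1.0, math.sqrt(lam0 ** 2 + (1.0 - lam0 ** 2) * t / T))

def geometric(t, lam0, T):
    """Geometric growth: stays near lam0 for longer (requires lam0 > 0)."""
    return min(1.0, 2.0 ** (math.log2(lam0) - math.log2(lam0) * t / T))

lam0, T = 0.25, 100
for t in (0, 25, 50, 75, 100):
    print(t,
          round(linear(t, lam0, T), 3),
          round(root(t, lam0, T), 3),
          round(geometric(t, lam0, T), 3))
```

All three forms start at `lam0` and reach 1 at epoch `T`; strictly between those endpoints the geometric schedule lies below the linear one and the root schedule above it, which matches the observation that the geometric function spends more epochs on the easy subset.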

5.4 Training Time Analysis and Parameter Sensitivity

Figure 5: The test accuracy (z-axis) of CLNode under different values of the hyper-parameters λ0 and T.

Method                Cora   CiteSeer
GCN (Original)        1.88   2.16
GCN +CLNode           3.91   4.08
SGC (Original)        1.05   1.12
SGC +CLNode           2.14   2.07
GraphSage (Original)  1.10   1.47
GraphSage +CLNode     2.35   2.89
JK-Net (Original)     5.08   4.87
JK-Net +CLNode        9.82   10.34
Table 6: Training time (seconds) of baseline methods and CLNode.

As shown in Table 6, we conduct experiments on Cora and CiteSeer with standard splits and report the training time on an NVIDIA GTX 2080 Ti GPU, where the results are averaged over 10 runs. We observe that CLNode takes a few seconds longer than the baselines. Because CLNode first employs a GNN to obtain pseudo-labels, its training time is roughly twice that of the baselines, which corroborates the complexity analysis in subsection 4.3. With affordable extra training time, CLNode improves the accuracy (especially when there are few labeled nodes) and robustness of baseline GNNs; therefore, we think it is beneficial to apply CLNode to real-world systems.
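The "roughly twice" statement can be checked directly against the reported timings. The short script below only recomputes ratios from the numbers in the training-time table; the dictionary layout and the function name `overhead_ratios` are illustrative.

```python
# Training times in seconds as (baseline, with CLNode), copied from the table above.
TIMES = {
    "GCN":       {"Cora": (1.88, 3.91), "CiteSeer": (2.16, 4.08)},
    "SGC":       {"Cora": (1.05, 2.14), "CiteSeer": (1.12, 2.07)},
    "GraphSage": {"Cora": (1.10, 2.35), "CiteSeer": (1.47, 2.89)},
    "JK-Net":    {"Cora": (5.08, 9.82), "CiteSeer": (4.87, 10.34)},
}

def overhead_ratios(times):
    """Map (model, dataset) to CLNode time divided by baseline time."""
    return {
        (model, dataset): with_cl / base
        for model, per_dataset in times.items()
        for dataset, (base, with_cl) in per_dataset.items()
    }

ratios = overhead_ratios(TIMES)
for (model, dataset), ratio in sorted(ratios.items()):
    print(f"{model:10s} {dataset:9s} {ratio:.2f}x")
```

On these numbers every ratio lies between about 1.85 and 2.14, which is consistent with an overhead of roughly one extra training pass for the pseudo-label step.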

Last but not least, we investigate how the hyper-parameters λ0 and T affect the performance of CLNode. λ0 controls the initial number of training samples, while T controls the speed at which difficult samples are introduced into the training process. To explore the parameter sensitivity, we vary λ0 and T over {0, 0.1, …, 1} and {20, 40, …, 200}, respectively. We use GCN as the backbone GNN and report the results on Cora with a 1% label rate. The results in Fig.5 show the following: (1) Generally, as λ0 increases, the performance tends to first increase and then decrease; specifically, the performance is relatively good when λ0 is between 0.4 and 0.8. Too small a λ0 leaves few training samples in the initial training process, so the model cannot learn efficiently. Conversely, an overly large λ0 introduces more difficult nodes during initial training and thus degrades the accuracy. In the extreme case where λ0 is set to 1, CLNode uses the entire training set throughout the training process and therefore degenerates to the baseline method. (2) Similarly, as T increases, the test accuracy tends to first increase and then decrease. Too small a T quickly introduces more difficult nodes, thus degrading the backbone GNN's performance; in contrast, an extremely large T causes the backbone GNN to be trained mainly on the easy subset, losing the information contained in difficult nodes.
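The effect of the two hyper-parameters on what the model sees can be illustrated with a small subset selector. This is a sketch under stated assumptions, not the authors' scheduler: it uses a linear pacing function, takes a hypothetical initial fraction `lam0` and pacing length `T`, and always keeps at least one node so that `lam0 = 0` still yields a non-empty subset.

```python
def training_subset(difficulty, epoch, lam0, T):
    """Indices of the easiest nodes available at `epoch` (linear pacing assumed)."""
    fraction = min(1.0, lam0 + (1.0 - lam0) * epoch / T)
    # Sort node indices from easiest (lowest score) to most difficult.
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    # Design choice of this sketch: never return an empty subset.
    size = max(1, int(fraction * len(order)))
    return order[:size]

scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
for lam0 in (0.0, 0.5, 1.0):
    sizes = [len(training_subset(scores, t, lam0, T=100)) for t in (0, 50, 100)]
    print("lam0 =", lam0, "subset sizes at epochs 0/50/100:", sizes)
```

With `lam0 = 1` the selector returns every node from the first epoch, which is the degenerate case described in the text where the curriculum reduces to ordinary training; a larger `T` stretches the same growth over more epochs.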

6 Conclusion

In this paper, we study the problem of training GNNs on samples of uneven quality. Current GNNs do not consider the quality of samples; as a result, training on difficult nodes degrades their accuracy and robustness. To address these issues, we propose a novel framework, CLNode, to alleviate the negative impact of difficult nodes. Specifically, we design a neighborhood-based difficulty measurer that measures the difficulty of nodes from the label distribution of the neighborhood. Based on these measurements, a continuous training scheduler introduces the nodes to the backbone GNN in an easy-to-difficult curriculum for training. Extensive experiments on six benchmark datasets demonstrate that CLNode is a general framework that can be combined with four representative backbone GNNs to improve their accuracy. Further experiments on noisily labeled graphs show that CLNode enhances the robustness of backbone GNNs. An interesting future direction is to explore the application of curriculum learning to more graph-related tasks, e.g., link prediction.


References

  • [1] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML. pp. 41–48 (2009)
  • [2] Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
  • [3] Dai, E., Aggarwal, C., Wang, S.: NRGNN: Learning a label noise resistant graph neural network on sparsely and noisily labeled graphs. In: SIGKDD. pp. 227–236 (2021)
  • [4] Guo, S., Huang, W., Zhang, H., Zhuang, C., Dong, D., Scott, M.R., Huang, D.: Curriculumnet: Weakly supervised learning from large-scale web images. In: ECCV. pp. 135–150 (2018)
  • [5] Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep networks. In: ICML. pp. 2535–2544 (2019)
  • [6] Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: NeurIPS. pp. 1025–1035 (2017)
  • [7] He, D., You, X., Feng, Z., Jin, D., Yang, X., Zhang, W.: A network-specific markov random field approach to community detection. In: AAAI. pp. 306–313 (2018)
  • [8] Hooi, B., Song, H.A., Beutel, A., Shah, N., Shin, K., Faloutsos, C.: Fraudar: Bounding graph fraud in the face of camouflage. In: SIGKDD. pp. 895–904 (2016)
  • [9] Jiang, L., Zhou, Z., Leung, T., Li, L.J., Fei-Fei, L.: Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML. pp. 2304–2313 (2018)
  • [10] Jin, D., Liu, Z., Li, W., He, D., Zhang, W.: Graph convolutional networks meet markov random fields: Semi-supervised community detection in attribute networks. In: AAAI. pp. 152–159 (2019)
  • [11] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
  • [12] Lyu, G., Feng, S., Jin, Y., Li, Y.: Partial label learning via self-paced curriculum strategy. In: ECML-PKDD. pp. 489–505 (2020)
  • [13] NT, H., Jin, C.J., Murata, T.: Learning graph neural networks with noisy labels. arXiv preprint arXiv:1905.01591 (2019)
  • [14] Ogawa, Y., Maekawa, S., Sasaki, Y., Fujiwara, Y., Onizuka, M.: Adaptive node embedding propagation for semi-supervised classification. In: ECML-PKDD. pp. 417–433 (2021)
  • [15] Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., Mitchell, T.M.: Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848 (2019)
  • [16] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Magazine 29(3),  93 (2008)
  • [17] Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018)
  • [18] Shu, Y., Cao, Z., Long, M., Wang, J.: Transferable curriculum for weakly-supervised domain adaptation. In: AAAI. pp. 4951–4958 (2019)
  • [19] Tay, Y., Wang, S., Luu, A.T., Fu, J., Phan, M.C., Yuan, X., Rao, J., Hui, S.C., Zhang, A.: Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. In: ACL. pp. 4922–4931 (2019)
  • [20] Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: ICLR (2018)
  • [21] Wang, B., Jia, J., Gong, N.Z.: Graph-based security and privacy analytics via collective classification with joint weight learning and propagation. In: NDSS (2019)
  • [22] Wang, C., Wu, Y., Liu, S., Zhou, M., Yang, Z.: Curriculum pre-training for end-to-end speech translation. In: ACL. pp. 3728–3738 (2020)
  • [23] Wang, X., Chen, Y., Zhu, W.: A survey on curriculum learning. IEEE TPAMI (2021)
  • [24] Wang, Y., Wang, W., Liang, Y., Cai, Y., Hooi, B.: Curgraph: Curriculum learning for graph classification. In: WWW. pp. 1238–1248 (2021)
  • [25] Weinshall, D., Cohen, G., Amir, D.: Curriculum learning by transfer learning: Theory and experiments with deep networks. In: ICML. pp. 5238–5246 (2018)
  • [26] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., Weinberger, K.: Simplifying graph convolutional networks. In: ICML. pp. 6861–6871 (2019)
  • [27] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32(1), 4–24 (2020)
  • [28] Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.i., Jegelka, S.: Representation learning on graphs with jumping knowledge networks. In: ICML. pp. 5453–5462 (2018)
  • [29] Xu, Y., Yan, Y., Xue, J.H., Lu, Y., Wang, H.: Small-vote sample selection for label-noise learning. In: ECML-PKDD. pp. 729–744 (2021)
  • [30] Yu, X., Han, B., Yao, J., Niu, G., Tsang, I.W., Sugiyama, M.: How does disagreement help generalization against label corruption? In: ICML. pp. 7164–7173 (2019)
  • [31] Zhang, Z., Cui, P., Zhu, W.: Deep learning on graphs: A survey. IEEE TKDE 34(1), 249–270 (2022)