1 Introduction
Large datasets are beneficial to modern machine learning models, especially neural networks. Many studies have shown that the accuracy of machine learning models grows log-linearly with the amount of training data (Zhou, 2017). Currently, complex machine learning models can only achieve superhuman classification results when trained on very large datasets. However, large datasets are usually expensive to collect and to label exactly. One way to create large datasets is crowdsourcing, but this approach introduces a higher level of labeling error into the datasets and requires a lot of human resources (Georgakopoulos et al., 2016). As a consequence, neural networks are prone to very high generalization error when trained on noisily labeled data. Figure 1 shows the accuracy of a graph neural network trained on the MUTAG dataset. Training accuracy tends to remain high while testing accuracy degrades as more label noise is added to the training data.
Graph neural networks (GNNs) are a new class of neural networks that learn from graph-structured data. Typically, GNNs classify graph vertices or the whole graph itself. Given the graph structure and the data (e.g., feature vectors) on each vertex as input, GNN training aims to learn a predictive model for classification. This new class of neural networks enables end-to-end learning from a wider range of data formats. Building large-scale GNNs requires large and clean datasets. Since graph data is arguably harder to label than image data at both the vertex level and the graph level, graph neural networks should have a mechanism to adapt to training label error or noise.
In this paper, we take the noise-correction approach to train a graph neural network with noisy labels. We study two state-of-the-art graph neural network models: Graph Isomorphism Network (Xu et al., 2019) and GraphSAGE (Hamilton et al., 2017). Both models are trained under symmetric artificial label noise and tested on uncorrupted testing data. We then apply label noise estimation and loss correction techniques (Patrini et al., 2016, 2017) to propose our denoising graph neural network model (DGNN).
2 Method
2.1 Graph Neural Networks
Notations and Assumption
Let $G = (V, E, X)$ be a graph with vertex set $V$, edge set $E$, and vertex feature matrix $X \in \mathbb{R}^{|V| \times d}$, where $d$ is the dimensionality of the vertex features. Our task is graph classification with noisy labels. Given a set of graphs $\{G_1, \ldots, G_n\}$ and their labels $\{y_1, \ldots, y_n\}$, we aim to learn a neural network model for graph label prediction, $\hat{y} = f(G)$. We assume that the training data is corrupted by a noise process $N$, where $N_{ij}$ is the probability of label $i$ being corrupted to label $j$. We further assume that $N$ is symmetric, which corresponds to the symmetric label noise setting. The noise matrix $N$ is unknown, so we estimate it by learning a correction matrix $C$ from the noisy training data.
GNN Models
The most recent approach to the graph classification problem is to learn a graph-level feature vector $h_G$. There are several ways to learn $h_G$. The GCN approach of Kipf & Welling (2017) approximates the Fourier transformation of signals (feature vectors) on graphs to learn the representation of a special vertex that serves as the representative of the graph. Similar approaches can be found in the context of compressive sensing. To overcome the disadvantages of GCN-like methods, such as memory consumption and limited scalability, the nonlinear neural message-passing method was proposed.
GraphSAGE (Hamilton et al., 2017) proposes an algorithm consisting of two operations: aggregate and pooling. The aggregate step computes the information on each vertex using its local neighborhood, then pooling computes the output for each vertex. These output vectors are then used for classification at the vertex level or the graph level. More recently, the GIN model (Xu et al., 2019) generalizes the concept in GraphSAGE to propose a unified message-passing framework for graph classification.
2.2 Learning Noisy Label Data
Surrogate Loss
Using an alternative loss function to deal with noisy label data is a common practice in the weakly supervised learning literature (Natarajan et al., 2013; Biggio et al., 2012; Georgakopoulos et al., 2016; Patrini et al., 2016, 2017). We apply the backward loss correction procedure to graph neural networks: $\ell^{\leftarrow}(\hat{p}(y \mid G)) = C^{-1}\,\ell(\hat{p}(y \mid G))$, where $\ell$ is the vector of per-class losses and $C$ is the correction matrix. This loss can be intuitively understood as going backward one step in the noise process (Patrini et al., 2017).
We study the symmetric noise setting, where label $i$ is corrupted to label $j$ with the same probability as $j$ to $i$ ($N_{ij} = N_{ji}$) (Biggio et al., 2012). We use a symmetric Markov matrix $N$ to describe the noise process with $c$ labels. Furthermore, to simplify the experimental settings, for a given noise rate $p$ we set $N_{ii} = 1 - p$ and $N_{ij} = p / (c - 1)$ for $i \neq j$. For example, when $p = 0.2$ and $c = 3$, the noise matrix is:
$$N = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{pmatrix}$$
The matrix above can be interpreted as follows: every label is kept with probability $0.8$ and corrupted to one of the other labels with total probability $0.2$ (the sum of the off-diagonal elements in a row).
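The symmetric noise process described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code: the function names are hypothetical, and spreading the off-diagonal mass evenly over the other classes is an assumption chosen to be consistent with the 20% noise rate and the diagonal value of 0.8 used in the experiments.

```python
import numpy as np

def symmetric_noise_matrix(num_classes, noise_rate):
    """Row-stochastic symmetric matrix: keep a label with probability
    1 - noise_rate and spread noise_rate evenly over the other classes
    (the even split is an assumption of this sketch)."""
    off_diagonal = noise_rate / (num_classes - 1)
    N = np.full((num_classes, num_classes), off_diagonal)
    np.fill_diagonal(N, 1.0 - noise_rate)
    return N

def corrupt_labels(labels, N, rng):
    """Resample each label y from row N[y] of the noise matrix."""
    labels = np.asarray(labels)
    return np.array([rng.choice(N.shape[0], p=N[y]) for y in labels])

N = symmetric_noise_matrix(num_classes=3, noise_rate=0.2)
rng = np.random.default_rng(0)
clean = rng.integers(0, 3, size=1000)
noisy = corrupt_labels(clean, N, rng)
print(N)
print("observed flip rate:", np.mean(clean != noisy))
```

With enough samples the observed flip rate should be close to the chosen noise rate, which gives a quick sanity check before training on the corrupted labels.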
2.3 Denoising Graph Neural Networks
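Formally, we define our graph neural network model following the message-passing approach proposed by Xu et al. (2019). The feature vector of a vertex $v$ at the $k$-th hop (or layer) is given by the AGGREGATE and COMBINE functions:
$$a_v^{(k)} = \text{AGGREGATE}^{(k)}\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right), \qquad h_v^{(k)} = \text{COMBINE}^{(k)}\left(h_v^{(k-1)}, a_v^{(k)}\right) \tag{1}$$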
Here, $\mathcal{N}(v)$ denotes the neighborhood set of vertex $v$, and $K$ is the predefined number of “layers”, corresponding to the network’s receptive field. The final representation of graph $G$ is calculated using a READOUT function. Then, we train the neural network by optimizing the surrogate backward loss.
$$h_G = \text{READOUT}\left(\left\{h_v^{(K)} : v \in G\right\}\right) \tag{2}$$
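To make the AGGREGATE, COMBINE, and READOUT steps concrete, the following is a minimal NumPy sketch of message passing on a toy graph. It is a simplification under stated assumptions: neighbors are aggregated by summation, COMBINE is a single linear map followed by ReLU (GIN itself uses a multi-layer perceptron), READOUT is a sum over vertices, and the weights are random rather than trained. All function names are hypothetical.

```python
import numpy as np

def message_passing_layer(A, H, W, eps=0.0):
    """One hop of sum-aggregation message passing.

    AGGREGATE: a_v is the sum of neighbor features (A @ H).
    COMBINE:   ReLU(((1 + eps) * h_v + a_v) @ W), a one-layer stand-in
               for the MLP used in GIN (an assumption of this sketch).
    """
    aggregated = A @ H
    return np.maximum(((1.0 + eps) * H + aggregated) @ W, 0.0)

def readout(H):
    """Graph-level vector: sum of the final vertex representations."""
    return H.sum(axis=0)

# Toy graph: a path 0 - 1 - 2 with one-hot vertex features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 4))]  # two layers

H = X
for W in weights:
    H = message_passing_layer(A, H, W)
h_G = readout(H)
print(h_G.shape)
```

Because both the aggregation and the readout are sums, relabeling the vertices leaves the graph-level vector unchanged, which is the property a graph classifier needs.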
DGNN differs from GIN only in the surrogate loss function described above. To train a DGNN model, we first train a GIN model on the noisy data to estimate the correction matrix $C$, then we train DGNN using the estimated correction matrix.
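The backward-corrected loss can be sketched as follows, following the general recipe of Patrini et al. (2017): compute the cross-entropy loss for every possible class, then multiply by the inverse of the correction matrix. This is an illustrative NumPy version, not the training code used in the paper; the function names, the example correction matrix, and the random logits are all assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def backward_corrected_losses(logits, C):
    """corrected[n, i] = sum_j inv(C)[i, j] * (-log p_hat(j | G_n))."""
    per_class = -log_softmax(logits)
    return per_class @ np.linalg.inv(C).T

def backward_loss(logits, noisy_labels, C):
    """Mean corrected loss evaluated at the observed (noisy) labels."""
    corrected = backward_corrected_losses(logits, C)
    return corrected[np.arange(len(noisy_labels)), noisy_labels].mean()

# Hypothetical correction matrix for 3 classes and a 20% noise rate.
C = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))      # stand-in for GNN outputs on 4 graphs
noisy_labels = np.array([0, 2, 1, 0])
print(backward_loss(logits, noisy_labels, C))
```

If the correction matrix equals the true noise matrix, averaging the corrected loss over the noise process recovers the ordinary loss on the clean label, which is why this correction is described as stepping backward through the noise. With the identity matrix the function reduces to plain cross-entropy.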
We train our DGNN model using three different noise estimators: Conservative (DGNNC), Anchors (DGNNA), and Exact (DGNNE). The exact loss correction is introduced for comparison purposes. The hyperparameters of our models are set similarly to those of the GIN model in the previous paragraph. For the conservative and anchor correction matrix estimation, we train two models on the same noisy dataset: the first model is trained without loss correction, and the second model is trained using the correction matrix estimated from the first model. For all neural network models, we use the ReLU activation unit as the nonlinearity.
3 Empirical Results
We test our framework on nine well-studied datasets for the graph classification task: four bioinformatics datasets (MUTAG, PTC, NCI1, PROTEINS) and five social network datasets (COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-MULTI-5K) (Yanardag & Vishwanathan, 2015). We follow the preprocessing suggested by Xu et al. (2019) and use one-hot encodings of vertex degrees as vertex features for the social networks (except the REDDIT datasets). Table 1 gives an overview of each dataset. Since these datasets have exact labels for each graph, we introduce symmetric label noise artificially.
| Dataset | #graphs | #classes | Avg. #vertices |
|---|---|---|---|
| IMDBB | 1000 | 2 | 19.8 |
| IMDBM | 1500 | 3 | 13.0 |
| RDTB | 2000 | 2 | 429.6 |
| RDTM5K | 5000 | 5 | 508.5 |
| COLLAB | 5000 | 3 | 74.5 |
| MUTAG | 188 | 2 | 17.9 |
| PROTEINS | 1113 | 2 | 39.1 |
| PTC | 344 | 2 | 25.5 |
| NCI1 | 4110 | 2 | 29.8 |
3.1 Noise Estimation
Conservative Estimation
We estimate the corruption probability with the Conservative Estimator described in the previous sections. For each noise configuration, we train the original neural network (GIN) on the noisy data and use the network's responses to fill each row of the correction matrix $C$. Table 2 gives an overview of how far the conservative estimate diverges from the correct noise matrix. The matrix norm is the entrywise $\ell_p$ norm with $p = 1$, i.e., the sum of the absolute values of all entries.
Anchor Estimation
We follow the noise estimation method introduced in Patrini et al. (2017) (Equations (12) and (13)) to estimate the noise probability using an unseen set of samples. These anchor samples are assumed to have correct labels, so they can be used to estimate the noise matrix under the expressivity assumption. In our experiments, these samples are taken from the testing data (one per class). Table 2 reports how close the resulting estimates are to the true noise matrix.
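As a rough illustration of anchor-based estimation, the sketch below fills row $i$ of the correction matrix with the softmax output that a model trained on noisy labels produces for the anchor of class $i$. When no anchors are supplied, it falls back to the selection rule of Patrini et al. (2017), picking for each class the sample with the highest predicted probability. The function name and the toy probabilities are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def estimate_from_anchors(probs, anchor_idx=None):
    """Estimate a c x c correction matrix from model outputs.

    probs:      (n_samples, c) softmax outputs of a model trained on
                noisy labels.
    anchor_idx: optional list of c sample indices, one trusted anchor per
                class. If None, the sample with the largest predicted
                probability for each class is used as its anchor.
    """
    probs = np.asarray(probs, dtype=float)
    if anchor_idx is None:
        anchor_idx = probs.argmax(axis=0)
    return probs[np.asarray(anchor_idx)]

# Toy softmax outputs for four graphs and two classes (made-up numbers).
probs = np.array([[0.77, 0.23],
                  [0.40, 0.60],
                  [0.20, 0.80],
                  [0.60, 0.40]])

C_auto = estimate_from_anchors(probs)                      # argmax anchors
C_given = estimate_from_anchors(probs, anchor_idx=[3, 1])  # supplied anchors
print(C_auto)
print(C_given)
```

Each row is a probability vector, so the estimate is row-stochastic by construction, but nothing forces it to be symmetric; a symmetric estimate could be obtained by averaging it with its transpose if the symmetric noise assumption is to be enforced.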
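| Dataset (#classes) | diag($N$) | Avg. diag($C$), conservative | $\lVert N - C \rVert$, conservative | Avg. diag($C$), anchors | $\lVert N - C \rVert$, anchors |
|---|---|---|---|---|---|
| IMDBB (2) | 0.8 | 0.99 | 0.76 | 0.77 | 0.12 |
| IMDBM (3) | 0.8 | 0.99 | 1.14 | 0.85 | 0.30 |
| RDTB (2) | 0.8 | 0.99 | 0.76 | 0.75 | 0.20 |
| RDTM5K (5) | 0.8 | 0.99 | 1.90 | 0.81 | 0.10 |
| COLLAB (3) | 0.8 | 0.99 | 1.14 | 0.75 | 0.30 |
| MUTAG (2) | 0.8 | 0.99 | 0.76 | 0.74 | 0.24 |
| PROTEINS (2) | 0.8 | 0.99 | 0.76 | 0.78 | 0.08 |
| PTC (2) | 0.8 | 0.99 | 0.76 | 0.63 | 0.68 |
| NCI1 (2) | 0.8 | 0.99 | 0.76 | 0.74 | 0.24 |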
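The distances in Table 2 can be checked with a short calculation. The sketch below assumes that both the true noise matrix and the averaged estimate are symmetric with a constant diagonal and evenly spread off-diagonal mass, and that the distance is the sum of absolute entrywise differences. Under these assumptions, a true diagonal of 0.8 and a conservative estimate of 0.99 give the distances reported for 2, 3, and 5 classes. The helper names are hypothetical.

```python
import numpy as np

def symmetric_matrix(num_classes, diag):
    """Symmetric row-stochastic matrix with a constant diagonal."""
    off_diagonal = (1.0 - diag) / (num_classes - 1)
    M = np.full((num_classes, num_classes), off_diagonal)
    np.fill_diagonal(M, diag)
    return M

def entrywise_l1(A, B):
    """Sum of absolute differences over all matrix entries."""
    return np.abs(A - B).sum()

distances = {}
for num_classes in (2, 3, 5):
    true_noise = symmetric_matrix(num_classes, 0.8)
    conservative = symmetric_matrix(num_classes, 0.99)
    distances[num_classes] = entrywise_l1(true_noise, conservative)
    print(num_classes, round(distances[num_classes], 2))
```

The same helper applied to an anchor estimate, for example a diagonal of 0.77 with 2 classes, gives 0.12, matching the corresponding entry in the anchor columns.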
Exact Assumption
In this experiment setting, we assume that the noise matrix is exactly known from some other estimation process. In practice, such an assumption might not be realistic. However, under the symmetric noise assumption, the diagonal of the correction matrix can be tuned as a hyperparameter.
3.2 Graph Classification
We compare our model with the original Graph Isomorphism Network (GIN) (Xu et al., 2019). The hyperparameters are fixed across all datasets as follows: epochs=20, num_layers=5, num_mlp_layers=2, batch_size=64. We keep these hyperparameters fixed because a similar trend of accuracy degradation is observed independently of hyperparameter tuning. Besides GIN, we consider the GraphSAGE model (Hamilton et al., 2017) under the same noisy setting. We use the default settings for GraphSAGE suggested in the original paper.
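| Model | MUTAG | IMDBM | RDTB | RDTM5K | COLLAB | IMDBB | PROTEINS | PTC | NCI1 |
|---|---|---|---|---|---|---|---|---|---|
| GIN | .7327 | .4476 | .6695 | .3677 | .6544 | .6573 | .6257 | .4824 | .6472 |
| GraphSAGE | .7072 | .4373 | - | - | - | .6410 | .6583 | .4892 | .6053 |
| DGNNC | .5727 | .4747 | .5005 | .2000 | .5979 | .6940 | .6693 | .5557 | .6170 |
| DGNNA | .7102 | .4505 | .5307 | .2000 | .6917 | .7088 | .6769 | .5001 | .6405 |
| DGNNE | .7002 | .4633 | .5270 | .2022 | .6960 | .7190 | .6917 | .5235 | .6638 |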
We fix the noise rate at 20% for the experiments in Table 3 and report the mean accuracy over 10-fold cross-validation. The worst-performing variant of our model is the conservative estimation model. Due to the overestimation of the softmax unit within the cross-entropy loss, the model’s confidence on all training data is close to $1$. Such overconfidence leads to a wrong correction matrix estimate (Table 2), which in turn leads to worse performance. In contrast to DGNNC, DGNNA and DGNNE outperform the original model on most datasets. This improvement comes from the fact that the correction matrix is well approximated. Figure 2 suggests that the DGNNC model might work well under higher label noise settings.
4 Conclusion
In this paper, we have introduced the use of loss correction for graph neural networks to deal with symmetric graph label noise. We experimented with two practical noise estimation methods and compared them to the case in which the exact noise matrix is known. Our empirical results show some improvement in noise tolerance when the correction matrix is correctly estimated. In practice, we can treat the diagonal of the correction matrix as a hyperparameter and tune it on clean validation data.
Acknowledgments
This work was supported by JSPS Grant-in-Aid for Scientific Research (B) (Grant Number 17H01785) and JST CREST (Grant Number JPMJCR1687).
References
 Biggio et al. (2012) Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
 Georgakopoulos et al. (2016) Spiros V Georgakopoulos, Dimitris K Iakovidis, Michael Vasilakakis, Vassilis P Plagianakos, and Anastasios Koulaouzidis. Weakly-supervised convolutional learning for detection of inflammatory gastrointestinal lesions. In 2016 IEEE international conference on imaging systems and techniques (IST), pp. 510–514. IEEE, 2016.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017.
 Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 Natarajan et al. (2013) Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pp. 1196–1204, 2013.
 Patrini et al. (2016) Giorgio Patrini, Frank Nielsen, Richard Nock, and Marcello Carioni. Loss factorization, weakly supervised learning and label noise robustness. In International conference on machine learning, pp. 708–717, 2016.

 Patrini et al. (2017) Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952, 2017.
 Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
 Yanardag & Vishwanathan (2015) Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374. ACM, 2015.
 Zhou (2017) Zhi-Hua Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2017.