1 Introduction
Deep neural networks (DNNs) have achieved impressive successes in many research fields, but training them relies on collecting massive clean-annotated training samples such as ImageNet [9] and MSCOCO [21]. Since annotation is usually done manually by human experts, it is expensive and time-consuming to obtain large clean labeled data sets, which prevents deep learning models from being trained successfully. On the other hand, it is feasible to access numerous data through internet search engines [11, 31, 40, 17] or hashtags, whose labels are easy to collect but relatively inaccurate. Thus, exploiting data sets with corrupted labels instead of clean ones to solve classification tasks with DNNs, which is called the noisy label problem, has come into the spotlight.
There is a large body of literature dealing with noisy labeled data, and a majority of methods exploit the so-called memorization effect, a special characteristic of DNNs: they eventually memorize all the data (i.e. perfectly classify the training data) but memorize clean labeled samples earlier and noisy samples later [3, 16]. Hence, we can identify clean data in training data contaminated with noisy labels by choosing samples with small loss values. Due to its simplicity and superiority, many follow-up studies have been built on this small-loss strategy and have achieved great success ([34] and references therein).
But the small-loss strategy has several weaknesses. First, during the training phase, it is difficult to know the training epoch (or iteration) at which the discrepancy of loss values between clean data and noisy data is large, since it heavily depends on various factors including the data set, model architecture, optimizer type and even learning schedule. Second, it becomes hard to identify clean-annotated samples via the small-loss strategy when the training labels are heavily polluted. Besides, the memorization effect may not appear when we analyze data with an imbalanced label distribution. As imbalanced data arise frequently in many real-world domains, this shortcoming can be an obstacle to applying the small-loss strategy in many industry fields.
To tackle these issues with the memorization effect, we develop a novel and powerful method called INN (Integration with the Nearest Neighbors). We start from a new and interesting observation: the output values of a trained DNN at neighbor regions of clean labeled and noisy samples are consistently much different regardless of training epochs. We call this phenomenon the consistency effect. Motivated by the consistency effect, the INN method averages the output values over the neighbor regions of a given sample and declares the sample noisy if the average is small. See Figure 1 for an illustration of the INN method.
Admittedly, the INN requires more computation than the small-loss method. Still, this additional expense is worth paying since the INN successfully overcomes the small-loss method's limitations. The INN works well even when the training labels are heavily contaminated or have an imbalanced distribution, situations in which the small-loss method struggles. This stability and superiority make the INN easily applicable to various supervised learning tasks without much effort.
We can also combine the INN with an existing learning method for the noisy label problem that is based on the small-loss strategy (e.g. DivideMix [19]) to construct deep networks of high accuracy. We replace the parts where the memorization effect and loss information are used with the consistency effect and the INN information. We show that these modifications considerably enhance prediction performance, especially when the training labels contain much noise or have an imbalanced distribution.
This paper is organized as follows. In Section 2, we provide brief reviews for related studies dealing with noisy labels, and detailed descriptions of the INN are given in Section 3. Various experimental analyses including performance test and ablation study are given in Section 4 and final concluding remarks follow in Section 5. The key contributions of this work are as follows.

We find a new observation called the consistency effect: the output values of a trained DNN at neighbor regions of clean labeled and noisy samples are consistently much different regardless of training epochs.

Built on the consistency effect, we propose a method called the INN to identify clean-annotated data from given training data.

We empirically demonstrate that the INN can separate clean and noisy samples accurately and stably even under heavy label corruption and imbalanced label distribution, and that it also helps to construct superior prediction models.
2 Related works
The noisy label problem has been studied for several decades [1, 47, 27]. The core issue in solving the noisy label problem with DNNs is that DNNs, because of their excessive complexity, easily overfit all training samples, including noisy labeled ones, which results in inferior generalization performance. Here we review related studies on efficient algorithms for training robust classifiers under noisy annotations, organized around two key concepts: loss correction and the memorization effect. We also describe several approaches that, like the INN, exploit information from a target sample's neighborhood.
Loss-correction-based algorithms aim to improve the generalization error by modifying objective functions [29, 45]. The noise adaptive layer-based algorithm [12] added additional noisy channels that estimate the correct labels. The iterative noisy label detection [38, 37] used the weighted softmax loss function where the weights are updated iteratively based on the feature maps of the current DNN model. Some algorithms to estimate ground-truth labels directly have been developed [36, 41]. The meta-learning algorithm was also applied to resolve the noisy label problem [20]. There was an attempt to propose a new loss function more robust than standard loss functions [23].
Approaches based on the memorization effect focused on the gap between the output values of clean labeled and noisy labeled samples during an early stage of the training phase. The decouple method [26] proposed a meta-algorithm called decoupling, which decides when to update. D2L [25] distinguished clean labeled data from noisy ones by employing a local dimensionality measure, and ELR [22] found that the gradients of clean labeled samples vanish faster at the early learning stage. There were several algorithms to train noise-robust prediction models by using only a subset of the training data chosen by their loss or prediction values [13, 42, 32, 8, 33, 28]. Some studies fitted a two-component mixture model to the per-sample loss distribution [2, 19].
Some works tried to utilize the information of neighborhoods to filter out noisy labeled data, similar to the INN's idea. The distance-weighted kNN [10] was an initial work that considers the nearest samples with their distance-based weights. Deep kNN [4] proposed a filtering strategy based on the label information of nearest neighbors, and MentorMix [15] applied MixUp [44] to MentorNet [16] to consider linear combinations of two inputs. We discuss the difference between the INN and these methods exploiting neighbors' information in Section 3.3.
3 Integration with the nearest neighbors
3.1 Notations and definitions
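For a given input vector $x$, let $y$ and $y^*$ be its observable and ground-truth labels, respectively, where $y, y^* \in \{1, \ldots, K\}$. Of course, $y$ might be different from $y^*$. We say that the sample $(x, y)$ is cleanly labeled if $y = y^*$ and noisily labeled if $y \neq y^*$. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a training data set with $n$ samples. Define $\mathcal{D}_c = \{(x_i, y_i) \in \mathcal{D} : y_i = y_i^*\}$ and $\mathcal{D}_n = \mathcal{D} \setminus \mathcal{D}_c$. Our goal is to identify the clean labeled subset $\mathcal{D}_c$ from $\mathcal{D}$ accurately.
Let $f(\cdot; \theta)$ (abbr. $f$) be a discriminative DNN parametrized by $\theta$ which maps an input $x$ to a $K$-dimensional conditional probability vector with the softmax layer. Also let $f_j$ be the $j$th component of $f$, that is, we can represent $f$ as $f = (f_1, \ldots, f_K)$.
3.2 Consistency effect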
Before we start, we explain the main motivation of our method. For a given training sample $(x, y)$, we consider a point in the neighbor region of $x$ that is determined by $x$ and $x^{nn}$, where $x^{nn}$ is the nearest neighbor training input of $x$ on the feature space, i.e. the training input whose feature is closest to that of $x$, the feature being the output of the penultimate layer of a prediction model pretrained on the training data. We investigate how the prediction values of the training inputs and of their neighbor regions behave differently depending on label cleanness. To this end, we estimate a prediction model by minimizing the standard cross-entropy on the training data and, at each training epoch, calculate four expectations.
Two of them are the expectations of the prediction values of the clean and noisy data, respectively, and the other two are the expectations of the neighbor regions' prediction values over the clean and noisy labeled data, respectively. From Figure 3, we can see a typical phenomenon related to the memorization effect: the first two expectations are much different at early epochs, but the difference diminishes as training proceeds. That means that it becomes hard to discriminate noisy data from clean ones by comparing the prediction values at each sample at later stages of the training phase. But it is difficult to decide how many epochs are necessary for fully utilizing the memorization effect.
On the other hand, the difference between the two neighbor-region expectations is clearly significant regardless of training epochs. That is, the prediction values of neighbor regions for each sample are informative for separating clean and noisy data even when the number of training epochs is large. We call this new observation the consistency effect. This consistent discrepancy occurs for two reasons. First, when $(x, y)$ is a noisy labeled sample, the label $y$ of the input $x$ and the label $y^{nn}$ of its neighborhood training sample tend not to coincide, which yields a small prediction value for $y$ at the neighbor region. Second, even if $y$ and $y^{nn}$ are equal, $x^{nn}$ may not be the nearest neighbor on the input space (i.e. $x^{nn}$ is not close to $x$). Hence, there exists a region between $x$ and $x^{nn}$ at which the value of $f_y$ becomes small.
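The nearest-neighbor search on the feature space used above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `features` is a hypothetical array that is assumed to hold the penultimate-layer outputs of all training inputs, the function name is ours, and the Euclidean distance is used as the dissimilarity measure.

```python
import numpy as np

def nearest_neighbors(features: np.ndarray, k: int = 1) -> np.ndarray:
    """For each row of `features`, return the indices of its k nearest
    other rows under the Euclidean distance (hypothetical helper)."""
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(features ** 2, axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # A sample must not be selected as its own neighbor.
    np.fill_diagonal(dists, np.inf)
    # Sort each row by distance and keep the k closest indices.
    return np.argsort(dists, axis=1)[:, :k]
```

The full distance matrix needs memory quadratic in the number of samples, so for data sets of the size used in the experiments a batched or approximate search would likely be needed.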
3.3 INN method
In this section, we propose a novel method to identify clean labeled samples, motivated by the consistency effect. As observed in Section 3.2, it is important to take into account the prediction values at a neighbor region of each sample. Let $f$ be a prediction model trained with a loss function on the training data for a given number of training epochs. In this study, we use the MixUp objective function [44] as the loss function. For a given training sample $(x, y)$ and its neighborhood training input $x^{nn}$, a naturally induced score to identify whether $(x, y)$ is clean or not would be the prediction value $f_y$ at the neighbor-region point considered in Section 3.2. From further experiments, we modify the score as follows. First, it is observed that the consistency effect occurs at many input vectors between $x$ and $x^{nn}$ other than that single point. Thus, to exploit the consistency effect fully, we consider integrating the prediction function over the whole interval between $x$ and $x^{nn}$ to have
$$\int_0^1 f_y\big(\lambda x + (1-\lambda) x^{nn}\big)\, d\lambda.$$
Second, using multiple neighbor samples helps identify clean labeled data more accurately. Based on these two arguments, we propose the INN score given as:
$$\mathrm{INN}(x, y; f, k) = \frac{1}{k} \sum_{x' \in \mathcal{N}_k(x)} \int_0^1 f_y\big(\lambda x + (1-\lambda) x'\big)\, d\lambda, \qquad (1)$$
where $\mathcal{N}_k(x)$ is the set of $k$ nearest neighbor training inputs of $x$ on the feature space described in Section 3.2. Figure 3 illustrates the effects of these two modifications. The integration in (1) can be easily approximated by the trapezoidal rule as follows:
$$\int_0^1 f_y\big(\lambda x + (1-\lambda) x'\big)\, d\lambda \approx \frac{1}{2M} \sum_{m=1}^{M} \big[ f_y(x_{m-1}) + f_y(x_m) \big],$$
where $x_m = x' + \frac{m}{M}(x - x')$ and $M$ is the number of trapezoids. In practice, we fix the values of $k$ and $M$ to 10. The larger the score is, the more we can regard $(x, y)$ as being cleanly labeled. Hereafter, we will abbreviate $\mathrm{INN}(x, y; f, k)$ to $\mathrm{INN}(x, y)$. Even after many training epochs, since the consistency effect still remains in the prediction model, the INN method separates clean labeled data from noisy ones well. The following simple lemma supports the validity of our method.
Lemma 1.
Let be a prediction model which perfectly overfits the MixUp loss function
where CE is the crossentropy loss function, , and
is the Beta distribution with a hyperparameter
. Also assume that for each training input , its nearest neighbor set satisfies , where is the observed label of and is the ground-truth label of . Then, the following inequality holds:
The proof is in the supplementary materials. Lemma 1 means that if we have a prediction model trained with the MixUp and good nearest neighborhood sets, then the INN perfectly separates the clean labeled data from the noisy labeled data.
As mentioned in Section 2, there are several approaches similar to the INN that utilize the nearest neighborhoods' information to filter out noisy data [38, 20, 4, 15]. They mainly take advantage of the labels of the neighborhoods. When the training labels are heavily polluted, most of the nearest samples are also noisy, so relying only on the label information might lead to bad results. In contrast, the INN focuses on the regions between the inputs and their neighbor training samples. We therefore expect the INN to be robust to highly noisy data.
The algorithm of the INN method is summarized in Algorithm 1.
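The score computation of this section, i.e. the prediction value of the observed label integrated along the segment between a sample and each of its nearest neighbors, approximated by the trapezoidal rule and averaged over the neighbors, might be sketched as below. This is only an illustration under assumptions: `predict` is a hypothetical callable mapping an array of flattened inputs of shape (m, d) to class probabilities of shape (m, K), the function name is ours, and averaging over neighbors (rather than summing) is our reading of the description.

```python
import numpy as np

def inn_score(predict, x, y, neighbors, num_trapezoids=10):
    """Trapezoidal estimate of the integrated prediction value of label y
    between x and each neighbor, averaged over neighbors (a sketch)."""
    # Grid of mixing weights in [0, 1] with num_trapezoids intervals.
    lambdas = np.linspace(0.0, 1.0, num_trapezoids + 1)
    integrals = []
    for x_nn in neighbors:
        # Points on the segment from x_nn (weight 0) to x (weight 1).
        points = x_nn[None, :] + lambdas[:, None] * (x - x_nn)[None, :]
        probs_y = predict(points)[:, y]
        # Trapezoidal rule with uniform spacing 1 / num_trapezoids.
        integral = (probs_y[:-1] + probs_y[1:]).sum() / (2.0 * num_trapezoids)
        integrals.append(integral)
    return float(np.mean(integrals))
```

Each sample costs `len(neighbors) * (num_trapezoids + 1)` forward evaluations, which is consistent with the INN being more expensive than a single per-sample loss evaluation.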
4 Experimental analysis
In this section, we empirically show the superiority of the INN in three respects. First, the INN is not sensitive to the choice of the training epochs and provides consistent performance. Second, the INN is significantly better than the small-loss strategy when the training data contain many polluted labels. Finally, when the training labels are imbalanced, the small-loss strategy may not work, while our method still succeeds in finding clean labeled data. We also combine the INN with an existing small-loss-based learning framework to construct better deep prediction networks. Additional ablation studies follow.
4.1 Experimental settings
Data sets
We carry out extensive experiments, including performance tests and ablation studies, on three data sets: CIFAR10 and CIFAR100 [18] and Clothing1M [40]. Both CIFAR10 and CIFAR100 consist of 50K training data and 10K test data with an input size of $32 \times 32 \times 3$, all of which are cleanly labeled. Clothing1M is a large-scale data set with real-world noisy labels containing 1M training data collected from online shopping websites. We use the subset of the Clothing1M data set whose ground-truth labels are known. The subset consists of 48K samples with a noise level of roughly 20%.
To impose label noise on CIFAR10 and CIFAR100, we consider symmetric and asymmetric settings as other studies did [45, 41]. In the symmetric noise setting, the label of each sample in the training data set is replaced, with a given probability (the noise rate), by a random label generated from the uniform distribution on $1$ to $K$ ($K = 10$ for CIFAR10 and $K = 100$ for CIFAR100). In the asymmetric noise setting for CIFAR10, with a given probability, a noisy label is generated by one of the following mappings: truck→automobile, bird→airplane, deer→horse and cat↔dog. For CIFAR100, labels are asymmetrically contaminated by flipping a given label to the next label with a given probability according to the transition chain: class1→class2→…→class100→class1.
Architectures and implementation details
We need two models, the prediction model and the model that provides the feature space for the nearest neighbor search, and we use the same architecture for both in all experiments. For CIFAR10&100 we utilize PreActResNet18 (PRN, [14]) with randomly initialized weights, and for Clothing1M we use ResNet50 (RN, [35]) with weights pretrained on ImageNet. We train all the deep networks using the SGD algorithm with a momentum of 0.9 and a mini-batch size of 128, set the initial learning rate to 0.02, and reduce it by a factor of 5 after one half and after three-fourths of the training procedure, respectively. All of our results in the following experiments are averages over three trials run with random initial weights and mini-batch arrangements.
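The symmetric noise setting and the CIFAR100 next-class transition chain described in the data set paragraph above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and the `seed` argument are ours, and labels are assumed to be 0-indexed integers (the text indexes classes from 1).

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """With probability `noise_rate`, replace each label by a label drawn
    uniformly from all classes (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    random_labels = rng.integers(0, num_classes, size=labels.shape[0])
    return np.where(flip, random_labels, labels)

def add_next_class_noise(labels, noise_rate, num_classes, seed=0):
    """With probability `noise_rate`, flip each label to the next class,
    wrapping the last class around to the first (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    flip = rng.random(labels.shape[0]) < noise_rate
    return np.where(flip, (labels + 1) % num_classes, labels)
```

Because the uniform draw in `add_symmetric_noise` can return the original class, the fraction of labels that actually change is slightly below `noise_rate`.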
4.2 Stability test of the INN
In this section, we show the stability and superiority of the INN for identifying clean labeled samples from training data. For CIFAR10&100, we consider eight cases (Case11 to Case42) by varying noise rates and noise types.

CIFAR10 with (Case11) and (Case12) symmetrically noisy labels

CIFAR10 with (Case21) and (Case22) asymmetrically noisy labels

CIFAR100 with (Case31) and (Case32) symmetrically noisy labels

CIFAR100 with (Case41) and (Case42) asymmetrically noisy labels
We vary the number of training epochs of the prediction model from 50 to 300 and calculate the clean/noisy classification AUC values of the training data induced by the INN for each number of training epochs. As baselines, we consider two small-loss methods whose loss functions are the standard cross-entropy (CE) and the sum of the cross-entropy and the negative entropy (CE+NE), respectively. We evaluate the small-loss methods' clean/noisy classification AUC values based on their per-sample losses.
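The clean/noisy classification AUC used in this evaluation can be computed from per-sample scores as in the following sketch. The function name is ours and the pairwise formulation is chosen for readability; it assumes that a larger score indicates a clean sample, so for a small-loss baseline the negated per-sample loss would be passed as the score.

```python
import numpy as np

def clean_noisy_auc(scores, is_clean):
    """AUC of separating clean from noisy samples: the probability that a
    random clean sample scores higher than a random noisy one (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    is_clean = np.asarray(is_clean, dtype=bool)
    clean, noisy = scores[is_clean], scores[~is_clean]
    # Compare every clean score with every noisy score.
    diff = clean[:, None] - noisy[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)
```

The pairwise comparison uses memory proportional to the product of the two group sizes, so this form is only meant for small illustrations; a rank-based computation would be preferable on full training sets.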
The results are depicted in Figure 4. We can clearly see that the INN provides consistent and high-quality results for all cases considered, regardless of training epochs. In contrast, the performance of the baselines becomes worse as the number of training epochs increases. This implies that the INN gives a more stable and powerful performance for clean/noisy classification. We also repeat these experiments with another deep architecture, WideResNet-28-2 [43], and leave the results in the supplementary materials.
As for Clothing1M, we take the feature extractor from the RN pretrained on ImageNet and train only the prediction model, with the number of epochs varying from 10 to 300. We compare the AUC values of the INN with those based on the loss values of RN trained by CE. Figure 5 shows that the INN is superior and more stable throughout the whole training procedure compared to the competitor, which again confirms the effectiveness of the INN.
4.3 INN with heavy noisy rates
We experiment with situations where the training data are highly contaminated with noisy labels. We analyze CIFAR10&100 and change their ground-truth labels with a high probability. As in the previous analysis, we compare the INN to the two small-loss-based methods trained with the CE and CE+NE, respectively.
The best clean/noisy classification AUC values of the training data for each method are summarized in Table 1. In the heavy noise case, the proportion of clean labeled data is not large. Thus, reducing the loss of clean labeled data may not be an optimal direction for reducing the overall loss in the early learning stages, which degrades the small-loss strategy. In contrast, the INN can still identify clean data from noisy ones effectively even when many noisy labels exist, and it outperforms the small-loss methods by large margins. We conjecture that this performance difference arises because the consistency effect is less sensitive to the number of noisy labels.
Data set  CIFAR10  CIFAR10  CIFAR100  CIFAR100  CIFAR100
Noise type  Symm.  Symm.  Symm.  Symm.  Asymm.
Noise rate  0.8  0.9  0.8  0.9  0.4
CE  0.857 (0.011)  0.756 (0.015)  0.809 (0.005)  0.690 (0.009)  0.589 (0.002)
CE+NE  0.854 (0.012)  0.750 (0.017)  0.807 (0.015)  0.695 (0.011)  0.608 (0.003)
INN  0.885 (0.014)  0.817 (0.018)  0.853 (0.016)  0.717 (0.013)  0.671 (0.005)
Table 1: The best clean/noisy classification AUC values of the INN and the small-loss methods. The averaged values over three trials are listed, and standard deviations are given in parentheses.
4.4 Analysis of imbalanced data
We also analyze noisy data whose ground-truth labels are imbalanced. In this section, we consider a two-class classification task. From CIFAR10, we randomly sample two classes; the first and second are regarded as the majority class and the minority class, respectively. We gather all images in the first class and a random 10% of the images in the second class. We relabel the majority and minority classes as 0 and 1, respectively, and for each sample we flip its label with a probability of 0.3 to generate training data with noisy labels. We compare the normalized score distributions of the INN and the small-loss method (CE) on the training data. For each distribution, we make four histograms by considering two factors: 1) whether the ground-truth label is 0 or 1 and 2) whether the observed label is 0 or 1.
As can be seen in Figure 6, we observe that the model trained by the standard CE does not always prioritize memorizing the cleanlabeled data anymore, i.e. data with , rather memorizing the data with , which implies the memorization effect may not occur in these imbalanced cases. Furthermore, their loss distributions are very unstable with respect to the training epochs. We report the histograms at other training epochs in the supplementary materials. In contrast, the consistency effect is still observable though not clear. The INN can separate clean samples from noisy ones to some extent, even if they belong to the minor class.
4.5 Constructing noiserobust classifiers
Data set  CIFAR10  CIFAR10  CIFAR100  CIFAR100  CIFAR100
Noise type  Symm.  Symm.  Symm.  Symm.  Asymm.
Noise rate  0.8  0.9  0.8  0.9  0.4
CrossEntropy  62.9  42.7  19.9  10.1  42.7
Co-teaching [42]  67.4  47.9  27.9  13.7  -
P-correction [41]  77.5  58.9  31.1  15.3  -
MLNT [20]  -  59.1 (1.12)  -  -  -
M-correction [2]  86.8  69.1  48.2  24.3  -
DivideMix [19]  93.2  76.0  60.2  31.5  -
DivideMix*  92.90 (1.08)  71.34 (1.43)  58.26 (1.01)  31.36 (0.67)  59.26 (1.08)
INN-DivideMix  93.48 (1.01)  81.20 (1.05)  59.04 (0.98)  33.11 (0.82)  63.04 (1.10)
Table 2: Test accuracies (%) on CIFAR10 and CIFAR100 with heavy label noise.
Data set: Imbalanced CIFAR10
Classes  DivideMix (Best)  DivideMix (Last)  INN-DivideMix (Best)  INN-DivideMix (Last)
1 and 2  89.23 (0.40)  82.21 (0.57)  92.82 (0.27)  87.75 (0.77)
3 and 8  87.49 (0.68)  82.24 (1.65)  93.02 (0.12)  91.06 (0.18)
4 and 5  79.41 (0.68)  78.35 (0.98)  82.82 (0.20)  82.79 (0.13)
7 and 9  85.64 (0.35)  79.80 (1.92)  86.33 (0.45)  84.15 (0.25)
Table 3: Test accuracies (%) of the best and last models on imbalanced CIFAR10.
The INN is also helpful for learning high-performance deep classification models from noisy training data. Many conventional learning frameworks built on the small-loss strategy begin by training models on the whole training data for a few epochs, minimizing a standard loss function such as CE. Due to the memorization effect, these early models tend to memorize the clean labeled training data first. Initialized with these models, the frameworks then apply their own training strategies. After each training epoch, they update the per-sample loss values with the current model and utilize this information at the next training epoch.
We can adapt these frameworks with two simple modifications based on the INN. First, we replace their initial models with new ones trained with the INN. We fit a two-component mixture model to the INN scores of the training data and split the data into labeled and unlabeled data based on the posterior probability of belonging to the clean cluster (the cluster with the larger mean). With the labeled and unlabeled data, we train prediction models for a few epochs using a semi-supervised learning method, such as the MixMatch [7], and then use them as the initial estimates. Second, after each training epoch, we recalculate the INN scores with the current prediction model and utilize these scores instead of the loss values at the next training epoch.
In this work, we provide an example that combines the INN with the DivideMix [19], known as one of the state-of-the-art methods for learning models from data with noisy labels. We give a detailed algorithm of this combination in the supplementary materials. We stress again that small-loss-based learning frameworks other than the DivideMix can also be combined with the INN to train better prediction models, and that there is much room to develop our simple application further.
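The splitting step described above, fitting a two-component mixture to the scores and taking the posterior probability of the component with the larger mean, could look like the sketch below. The mixture family is not stated in this section, so a one-dimensional Gaussian mixture fitted by EM is assumed here; the function name, the initialization and the iteration count are ours.

```python
import numpy as np

def clean_posterior(scores, num_iters=100):
    """Fit a two-component 1-D Gaussian mixture by EM and return, for each
    score, the posterior probability of the larger-mean component (a sketch)."""
    scores = np.asarray(scores, dtype=float)
    # Initialize the two means at the extremes and share the overall variance.
    means = np.array([scores.min(), scores.max()])
    variances = np.full(2, scores.var() + 1e-6)
    weights = np.array([0.5, 0.5])
    for _ in range(num_iters):
        # E-step: responsibilities, computed in log space for stability.
        log_pdf = (-0.5 * (scores[:, None] - means) ** 2 / variances
                   - 0.5 * np.log(2.0 * np.pi * variances))
        log_resp = np.log(weights) + log_pdf
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances.
        totals = resp.sum(axis=0) + 1e-12
        weights = totals / scores.size
        means = (resp * scores[:, None]).sum(axis=0) / totals
        variances = (resp * (scores[:, None] - means) ** 2).sum(axis=0) / totals + 1e-6
    # The component with the larger mean is treated as the clean cluster.
    return resp[:, int(np.argmax(means))]
```

Samples whose posterior exceeds a chosen threshold (for example 0.5, which is an assumption rather than a value from the text) would form the labeled set, and the remaining samples the unlabeled set.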
To assess the prediction performance of our modification, we analyze the same noisy data sets as in Sections 4.3 and 4.4. We compare the test accuracy of the DivideMix modified with the INN (INN-DivideMix) with those of other baselines; the results are reported in Tables 2 and 3. (Our source code is built on the public GitHub code of the DivideMix.) We can see that our modified learning framework works better than the existing methods, including the DivideMix, in the cases where the training labels are highly polluted or imbalanced. Moreover, the performance gap between the best and last models is smaller with our method than with the original DivideMix, which means the INN makes the learning procedure more stable. We also run additional accuracy tests on data sets with less severe noise rates and observe that our method gives no significant improvement there, since the DivideMix already works well. We report these results in the supplementary materials.
4.6 Ablation study
We empirically investigate how the choices of the dissimilarity measure and the loss function affect the INN; the detailed analyses are described in the supplementary materials. We observe that the Euclidean distance on the penultimate layer of a DNN trained on the training data is the best dissimilarity measure. For the loss function, the MixUp objective function yields the best results.
We also evaluate the total training time of the INN and compare it to that of the standard small-loss method. We use a single NVIDIA TITAN XP GPU, and the results are in the supplementary materials. Running the INN takes much more time than running the competitor since it needs to extract the nearest neighbors for each sample, which is one of its limitations. Further study is required to lighten the computational burden of the INN.
5 Concluding remarks
In this study, we proposed a novel approach, called the INN, to identify clean labeled samples from training data with noisy labels. It is based on a new finding called the consistency effect: discrepancies between the predictions at neighbor regions of clean and noisy data are consistently observed. We empirically demonstrated that the INN is stable and superior even when the training labels are heavily contaminated or imbalanced.
It would be interesting to apply our methods to unsupervised anomaly detection problems [39, 46, 48, 5]. After annotating labels (normal or abnormal) to the training data in a certain way, we can regard the task as a two-class noisy label problem. We expect that our methods would solve anomaly detection problems successfully.
References
 [1] (1988) Learning from noisy examples. Machine Learning 2, pp. 343–370. Cited by: §2.
 [2] (2019) Unsupervised label noise modeling and loss correction. In 36th International Conference on Machine Learning, pp. 312–321. Cited by: Table D.1, Table D.2, §2, Table 2.
 [3] (2017) A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242. Cited by: §1.
 [4] (202013–18 Jul) Deep kNN for noisy labels. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 540–550. External Links: Link Cited by: §2, §3.3.
 [5] (2020) Classificationbased anomaly detection for general data. In International Conference on Learning Representations, External Links: Link Cited by: §5.
 [6] (2019) MixMatch: a holistic approach to semisupervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059. Cited by: §D.1.1.
 [7] (2019) MixMatch: a holistic approach to semisupervised learning. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: §4.5.
 [8] (2019) Understanding and utilizing deep neural networks trained with noisy labels. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 1062–1070. Cited by: §2.

[9]
(2009)
Imagenet: a largescale hierarchical image database.
In
2009 IEEE Conference on Computer Vision and Pattern Recognition
, pp. 248–255. Cited by: §1.  [10] (1976) The distanceweighted knearestneighbor rule. IEEE Transactions on Systems, Man, and Cybernetics SMC6 (4), pp. 325–327. External Links: Document Cited by: §2.
 [11] (2010) Learning object categories from internet image searches. Proceedings of the IEEE 98 (8), pp. 1453–1466. Cited by: §1.
 [12] (2017) Training deep neuralnetworks using a noise adaptation layer. In 5th International Conference on Learning Representations, Cited by: §2.
 [13] (2018) Coteaching: robust training of deep neural networks with extremely noisy labels. In 32nd International Conference on Neural Information Processing Systems, pp. 8536–8546. Cited by: §2.
 [14] (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §4.1.
 [15] (202013–18 Jul) Beyond synthetic noise: deep learning on controlled noisy labels. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 4804–4815. External Links: Link Cited by: §2, §3.3.
 [16] (2018) Mentornet: learning datadriven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: Table D.2, §1, §2.
 [17] (2016) The unreasonable effectiveness of noisy data for finegrained recognition. In European Conference on Computer Vision, pp. 301–320. Cited by: §1.
 [18] (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto. Cited by: §4.1.
 [19] (2020) Dividemix: learning with noisy labels as semisupervised learning. In 8th International Conference on Learning Representations, Cited by: Table D.1, Table D.2, §1, §2, §4.5, Table 2.
 [20] (2019) Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5051–5059. Cited by: Table D.1, Table D.2, §2, §3.3, Table 2.
 [21] (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §1.
 [22] (2020) Earlylearning regularization prevents memorization of noisy labels. In Advances in Neural Information Processing Systems 33 preproceedings (NeurIPS 2020), Cited by: §2.
 [23] (2020) Curriculum loss: robust learning and generalization against label corruption. In 8th International Conference on Learning Representations, Cited by: §2.
 [24] (2018) Dimensionalitydriven learning with noisy labels. In International Conference on Machine Learning, pp. 3355–3364. Cited by: Table D.2.
 [25] (2018) Dimensionalitydriven learning with noisy labels. In 35th International Conference on Machine Learning, pp. 5907–5915. Cited by: §2.
 [26] (2017) Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 960–970. Cited by: §2.
 [27] (2010) A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review 33, pp. 275–306. Cited by: §2.
 [28] (2020) SELF: learning to filter noisy labels with selfensembling. In International Conference on Learning Representations, Cited by: §2.
 [29] (2017) Making deep neural networks robust to label noise: a loss correction approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2233–2241. Cited by: §2.
 [30] (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. Cited by: Table D.2.
 [31] (2010) Harvesting image databases from the web. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (4), pp. 754–766. Cited by: §1.
 [32] (2019) Learning with bad training data via iterative trimmed loss minimization. In 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 5739–5748. Cited by: §2.
 [33] (2019) SELFIE: refurbishing unclean samples for robust deep learning. K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 5907–5915. Cited by: §2.
 [34] (2020) Learning from noisy labels with deep neural networks: a survey. arXiv preprint arXiv:2007.08199. Cited by: §1.
 [35] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §4.1.
 [36] (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: §2.
 [37] (2019) Combating label noise in deep learning using abstention. Proceedings of the 36th International Conference on Machine Learning, pp. 6234–6243. Cited by: Table D.2, §2.
 [38] (2018) Iterative learning with open-set noisy labels. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 8688–8696. Cited by: §2, §3.3.

 [39] (2015) Learning discriminative reconstructions for unsupervised outlier removal. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1511–1519. Cited by: §5.
 [40] (2015) Learning from massive noisy labeled data for image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2691–2699. Cited by: §1, §4.1.
 [41] (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7017–7025. Cited by: Table D.1, Table D.2, §2, §4.1, Table 2.
 [42] (2019) How does disagreement help generalization against label corruption?. In 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 7164–7173. Cited by: Table D.1, Table D.2, §2, Table 2.
 [43] (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), R. C. Wilson, E. R. Hancock, and W. A. P. Smith (Eds.), pp. 1–12. Cited by: Appendix B, §4.2.
 [44] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §E.2, §2, §3.3.
 [45] (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In 32nd International Conference on Neural Information Processing Systems, pp. 8792–8802. Cited by: §2, §4.1.

 [46] (2017) Anomaly detection with robust deep autoencoders. KDD ’17, New York, NY, USA, pp. 665–674. ISBN 9781450348874. Cited by: §5.
 [47] (2004) Class noise vs. attribute noise: a quantitative study. Artificial Intelligence Review 22, pp. 177–210. Cited by: §2.

 [48] (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations. Cited by: §5.
6 Supplementary materials for the INN
Supplementary materials for INN: A Method Identifying Clean-annotated Samples via Consistency Effect in Deep Neural Networks
Appendix A Proof of Lemma 1
Let and be a given training sample and a set of its nearest training inputs, respectively. Since the model perfectly overfits the MixUp loss function, it is linear over between any two given training inputs. So, for the following equality holds:
where is the corresponding observed label of . Hence, the INN score can be rewritten as:
(2) 
By using , we also have
(3) 
(4) 
if is cleanly labeled, i.e. , and
(5)  
if is noisy labeled. From (4) and (5), we have the following inequality and the proof is completed:
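The linearity step in this proof can be illustrated with a short calculation. The notation below is assumed for illustration only and is not taken from the paper: $f$ denotes the prediction model, $(x, y)$ a training sample with one-hot observed label $y$, $(x', y')$ one of its nearest training inputs with its observed label, and $\lambda \in [0, 1]$ the interpolation weight. If $f$ matches the MixUp targets exactly on the segment between $x$ and $x'$, and if the prediction is averaged uniformly along that segment (also an assumption here), the average is the midpoint of the two labels:

```latex
f\bigl(\lambda x + (1-\lambda)\,x'\bigr) = \lambda\, y + (1-\lambda)\, y'
\quad\Longrightarrow\quad
\int_0^1 f\bigl(\lambda x + (1-\lambda)\,x'\bigr)\, d\lambda
  = y \int_0^1 \lambda \, d\lambda + y' \int_0^1 (1-\lambda)\, d\lambda
  = \tfrac{1}{2}\,(y + y').
```

Under these assumptions, the component of the average that corresponds to the observed class of $x$ equals $1$ when the neighbor carries the same observed label and $1/2$ otherwise, so neighbors that agree with the observed label raise the value and neighbors that disagree lower it.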
Appendix B INN with another architecture
We carry out additional experiments to test the performance of the INN with another architecture. Here, we consider WideResNet-28-2 [43] and follow the implementation settings of the main manuscript, such as the strategies to impose noisy labels and the learning schedules.
We depict the results in Figure B.1. Similar to the results with the PRN architecture, the clean/noisy classification performance of the INN is insensitive to the choice of the training epoch and consistently exceeds that of the other two competitors, which are based on loss values.
Appendix C Instability of the small-loss method for imbalanced data
Figure C.1 illustrates histograms of the loss values at various training epochs when we analyze imbalanced data. The same data sets used in Section 4.4 of the main manuscript are considered. The loss distribution is unstable across epochs, so it is hard to choose an optimal training epoch.
Appendix D Modification of DivideMix with INN
D.1 Detailed algorithm description
Here, we describe our modification (INN-DivideMix) in detail by comparing it with the original DivideMix algorithm.
D.1.1 Step 1
Original DivideMix
DivideMix trains two prediction models, starting from two independent initial parameters, by minimizing the sum of the cross-entropy and the negative entropy on the whole training data for a prespecified number of epochs. The number of epochs usually ranges from 10 to 30, depending on the data set. This step aims to obtain two initial models that are, to some extent, robust to noisy labels via the memorization effect.
INN-DivideMix
First, we run the INN method twice with different random initializations to obtain two INN score sets for the training data. For each INN score set, we separate the training data into labeled and unlabeled data by fitting a two-component Beta Mixture Model (BMM) to the INN scores with the Expectation-Maximization algorithm. We regard samples whose posterior probability of belonging to the clean cluster (the cluster with the larger mean) is larger than 0.5 as labeled data, and treat the remaining samples as unlabeled data by discarding their labels. Then, we use a semi-supervised learning (SSL) method to train two prediction models for a prespecified number of epochs, one on each pair of labeled and unlabeled data. Any SSL method can be applied; in our experiments, we adopt MixMatch [6].
D.1.2 Step 2
Original DivideMix
DivideMix updates the two prediction models pre-trained in the first step. For each prediction model, DivideMix fits a two-component Gaussian Mixture Model (GMM) to the per-sample loss distribution of the model to divide the training set into a labeled data set and an unlabeled data set, which is called the co-divide procedure. Then, the two models exchange the co-divided data sets, and each is trained on the exchanged data sets with an SSL method that modifies MixMatch with the co-refinement and co-guessing techniques. DivideMix alternates the co-divide procedure and the modified MixMatch method for a prespecified number of epochs to construct the final prediction models.
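The co-divide split described above can be sketched with scikit-learn's `GaussianMixture`. This is a minimal illustration and not the DivideMix implementation: the function name `split_by_loss_gmm`, the min-max scaling of the losses, and the posterior threshold of 0.5 are assumptions made here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def split_by_loss_gmm(losses, threshold=0.5, seed=0):
    """Split samples into labeled / unlabeled with a two-component GMM on losses.

    Hypothetical helper: `threshold` and the min-max scaling are assumptions.
    Returns a boolean mask (True = treated as labeled) and the posterior
    probability of the low-loss component for every sample.
    """
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    # Scale losses to [0, 1] so the fit does not depend on the loss magnitude.
    span = losses.max() - losses.min()
    scaled = (losses - losses.min()) / (span if span > 0 else 1.0)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(scaled)
    # The component with the smaller mean loss is taken as the clean one.
    clean_component = int(np.argmin(gmm.means_.ravel()))
    clean_prob = gmm.predict_proba(scaled)[:, clean_component]
    return clean_prob > threshold, clean_prob
```

Samples with `mask == True` keep their labels, and the remaining samples are passed to the SSL step without labels.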
INN-DivideMix
Similar to the original DivideMix, our method starts with the two prediction models initialized in our first step. For each prediction model, we calculate the INN score set. Note that we need a prediction model and a feature model to conduct the INN. Here, we set and to the current prediction model and its penultimate layer, i.e., the highest hidden layer, respectively. With each INN score set, we fit a two-component Beta Mixture Model (BMM) and split the training data into labeled and unlabeled data by using the posterior probabilities. Then, we apply the same SSL algorithm used in the original DivideMix to the two pairs of labeled and unlabeled data sets to update the two prediction models. We repeat the above learning procedure for a prespecified number of epochs. We summarize our modification in Algorithm D.1. Algorithm D.1 requires three kinds of training epochs, and . In practice, we set for all cases.
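The BMM-based split used in both steps can be sketched as follows. The text only states that a two-component BMM is fitted with the Expectation-Maximization algorithm; the weighted method-of-moments update in the M-step, the median-based initialization, the clipping constant, and the function name `fit_two_beta_mixture` are assumptions chosen here to keep the example short.

```python
import numpy as np
from scipy.stats import beta as beta_dist


def fit_two_beta_mixture(scores, n_iter=30, eps=1e-4):
    """Fit a two-component Beta mixture to scores in [0, 1] with EM.

    Hypothetical helper. Returns the posterior probability that each sample
    belongs to the component with the larger mean (the "clean" cluster) and
    the fitted (a, b) parameters of both components.
    """
    # Keep scores strictly inside (0, 1) so the Beta density stays finite.
    x = np.clip(np.asarray(scores, dtype=float), eps, 1.0 - eps)
    # Initialize responsibilities with a hard split at the median score.
    upper = (x > np.median(x)).astype(float)
    resp = np.stack([1.0 - upper, upper], axis=1)
    for _ in range(n_iter):
        # M-step: mixing weights and weighted method-of-moments Beta parameters.
        mix = resp.sum(axis=0) / resp.sum()
        params = []
        for k in range(2):
            w = resp[:, k] / (resp[:, k].sum() + 1e-12)
            mean = np.sum(w * x)
            var = max(np.sum(w * (x - mean) ** 2), 1e-8)
            common = mean * (1.0 - mean) / var - 1.0
            params.append((mean * common, (1.0 - mean) * common))
        # E-step: posterior probability of each component for every sample.
        lik = np.stack(
            [mix[k] * beta_dist.pdf(x, *params[k]) for k in range(2)], axis=1
        )
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-300)
    means = np.array([a / (a + b) for a, b in params])
    clean = int(np.argmax(means))  # clean cluster = component with larger mean
    return resp[:, clean], params


# Usage on synthetic scores (illustrative data, not results from the paper).
rng = np.random.default_rng(0)
demo_scores = np.concatenate([rng.beta(2, 8, 500), rng.beta(8, 2, 500)])
clean_posterior, _ = fit_two_beta_mixture(demo_scores)
labeled_mask = clean_posterior > 0.5  # remaining samples are used without labels
```

The threshold of 0.5 on the clean-cluster posterior follows the rule stated in Step 1.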
D.2 Performance tests of INN-DivideMix
We report the last test accuracy of INN-DivideMix on severely contaminated data sets. Table D.1 shows that modifying DivideMix with the INN improves on DivideMix, most notably on CIFAR-10 with symmetric noise rate 0.9 (80.51 versus 68.79 for DivideMix*) and on CIFAR-100 with asymmetric noise rate 0.4 (59.34 versus 53.79 for DivideMix*).
The prediction performance of INN-DivideMix in ordinary cases, where the training data do not contain many noisily labeled samples, is provided in Table D.2. Our modification gives no visible improvement in these situations because DivideMix already works well, leaving little room for improvement.
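Table D.1: Test accuracy on severely contaminated data sets.

| Data set | CIFAR-10 | CIFAR-10 | CIFAR-100 | CIFAR-100 | CIFAR-100 |
| --- | --- | --- | --- | --- | --- |
| Noise type | Symm. | Symm. | Symm. | Symm. | Asymm. |
| Noise rate | 0.8 | 0.9 | 0.8 | 0.9 | 0.4 |
| Cross-Entropy | 26.1 | 16.8 | 8.8 | 3.5 | - |
| Co-teaching [42] | 45.5 | 30.1 | 15.5 | 8.8 | - |
| P-correction [41] | 76.5 | 58.2 | 20.7 | 8.8 | - |
| MLNT [20] | - | - | - | - | - |
| M-correction [2] | 86.6 | 68.7 | 47.6 | 20.5 | - |
| DivideMix [19] | 92.9 | 75.4 | 59.6 | 31.0 | - |
| DivideMix* | 92.71(1.02) | 68.79(1.31) | 57.78(1.07) | 31.16(1.21) | 53.79(1.09) |
| INN-DivideMix | 93.04(1.19) | 80.51(1.05) | 58.81(1.22) | 32.36(0.86) | 59.34(0.94) |
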
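Table D.2: Test accuracy in ordinary cases.

| Data set | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-10 | CIFAR-100 | CIFAR-100 | CIFAR-100 | CIFAR-100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Noise type | Symm. | Symm. | Symm. | Symm. | Symm. | Symm. | Symm. | Symm. |
| Noise rate | 0.2 | 0.4 | 0.5 | 0.6 | 0.2 | 0.4 | 0.5 | 0.6 |
| Cross-Entropy | 86.8(82.7) | - | 79.4(57.9) | - | 62.0(61.8) | - | 46.7(37.3) | - |
| Co-teaching [42] | 89.5(88.2) | - | 85.7(84.1) | - | 65.6(64.1) | - | 51.8(45.3) | - |
| P-correction [41] | 92.4(92.0) | - | 89.1(88.7) | - | 69.4(68.1) | - | 57.5(56.4) | - |
| MentorNet [16] | 92.0(-) | 89.0(-) | - | - | 73.0(-) | 68.0(-) | - | - |
| D2L [24] | 85.1(-) | 83.4(-) | - | 72.8(-) | 62.2(-) | 52.0(-) | - | 42.3(-) |
| MLNT [20] | 92.9(92.0) | - | 89.3(88.8) | - | 68.5(67.7) | - | 59.2(58.0) | - |
| Reweight [30] | 86.9(-) | - | - | - | 61.3(-) | - | - | - |
| Abstention [37] | 93.4(-) | 90.9(-) | - | 87.6(-) | 75.8(-) | 68.2(-) | - | 59.4(-) |
| M-correction [2] | 94.0(93.8) | - | 92.0(91.9) | - | 73.9(73.4) | - | 66.1(65.4) | - |
| DivideMix [19] | 96.1(95.7) | 94.9(-) | 94.6(94.4) | 94.3(-) | 77.3(76.9) | 75.2(-) | 74.6(74.2) | 72.0(-) |
| DivideMix* | 96.08(95.68) | 94.95(94.51) | 94.81(94.13) | 94.29(93.95) | 77.23(76.84) | 74.92(74.12) | 74.08(73.56) | 72.58(71.99) |
| INN-DivideMix | 96.11(95.82) | 95.01(94.86) | 94.99(94.83) | 94.87(94.56) | 77.27(76.86) | 74.67(74.14) | 75.18(74.83) | 72.69(72.21) |
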
Appendix E Detailed descriptions of ablation studies
E.1 Dissimilarity measure
As a dissimilarity measure, we use the Euclidean () distance between the feature representations (the penultimate layer representations) generated by a pretrained prediction model, , where is the highest feature output of a pretrained prediction model parametrized by . Here, is trained by minimizing the standard cross-entropy (CE) function on . As an alternative feature representation, we consider using an external training data set such as ImageNet. We obtain another feature representation by training based on the ImageNet data set () and compare the two representation output functions trained on 1) and 2) , respectively, by evaluating the AUC values of their corresponding INN scores. The integrand prediction model used in formula (1) of the main manuscript is trained by minimizing CE based on for a maximum of 300 training epochs. Every 50 training epochs, we calculate the INN scores on with and evaluate the AUC values for the clean/noisy sample classification problem. The results are depicted in the left panel of Figure E.1, which shows that using feature representations trained with the given training data set () gives similar results.
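A nearest-neighbor search under the Euclidean distance in feature space can be sketched with scikit-learn's `NearestNeighbors`. This is a minimal illustration under stated assumptions: `features` stands for an array of penultimate-layer representations that has already been extracted, and both the function name `nearest_neighbor_indices` and the default `k=5` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_neighbor_indices(features, k=5):
    """Return, for every sample, the indices of its k nearest other samples.

    Hypothetical helper. Distances are Euclidean distances between rows of
    `features`. Assumes there are no exact duplicate rows, so that each
    sample is returned as its own closest neighbor and can be dropped.
    """
    features = np.asarray(features, dtype=float)
    # Ask for k + 1 neighbors because the query set equals the fitted set.
    searcher = NearestNeighbors(n_neighbors=k + 1, metric="euclidean")
    searcher.fit(features)
    _, indices = searcher.kneighbors(features)
    return indices[:, 1:]  # drop the first column (the sample itself)
```

The returned index array has shape `(n_samples, k)` and can be used to look up the neighbors of each training input.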
E.2 Loss function
We investigate two loss functions to estimate the prediction model : 1) the standard cross-entropy (CE) function and 2) the MixUp (MU) objective function [44]. Every 50 training epochs, we calculate their INN scores on the training data and evaluate their AUC values. The right panel of Figure E.1 shows that the AUC values obtained with the MU function are higher and relatively stable.
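The AUC evaluation of a score set for the clean/noisy classification problem can be sketched with scikit-learn's `roc_auc_score`. The sketch assumes that a higher score indicates a cleanly labeled sample and that ground-truth clean/noisy flags are available, as they are when the noise is injected synthetically; the function name `clean_noisy_auc` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def clean_noisy_auc(scores, is_clean):
    """AUC of separating clean (positive) from noisy (negative) samples.

    Hypothetical helper: `scores` are per-sample scores where larger means
    "more likely clean", and `is_clean` holds the ground-truth flags.
    """
    return roc_auc_score(
        np.asarray(is_clean, dtype=int), np.asarray(scores, dtype=float)
    )


# Toy example: the clean samples score 0.9 and 0.2, the noisy ones 0.8 and 0.1,
# so 3 of the 4 clean/noisy pairs are ranked correctly.
toy_auc = clean_noisy_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])  # 0.75
```

For a loss-based competitor, where smaller values indicate clean samples, the negated loss would be passed as the score.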
E.3 Training time analysis
We analyze the total training time of the INN on CIFAR-10 and compare it with that of the small-loss competitor. We use a single NVIDIA TITAN XP GPU, and the results are summarized in Tables E.1 and E.2. The elapsed time of the INN (4486.3 sec) is about 2.2 times that of the small-loss method (2013.8 sec), mainly because searching the nearest neighbors (2520.4 sec) takes more time than any other step. Thus, to overcome this limitation of the INN, the procedure for finding neighbors needs to be made lighter.
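Table E.1: Elapsed time of the INN.

| Training two architectures | Searching the nearest neighbors | Calculating the INN scores | Total |
| --- | --- | --- | --- |
| 21.5 sec/ep × 50 ep | 2520.4 sec | 890.9 sec | 4486.3 sec |

Table E.2: Elapsed time of the small-loss method.

| Training an architecture | Calculating the loss scores | Total |
| --- | --- | --- |
| 40.2 sec/ep × 50 ep | 3.8 sec | 2013.8 sec |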
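
The totals in the two timing tables can be checked with a few lines of arithmetic. The sketch below only recombines the reported numbers; the helper name `total_seconds` is hypothetical.

```python
def total_seconds(sec_per_epoch, epochs, *extra_steps):
    """Training time plus the time of any additional steps, in seconds."""
    return sec_per_epoch * epochs + sum(extra_steps)


# INN: training, nearest-neighbor search, and score calculation.
inn_total = total_seconds(21.5, 50, 2520.4, 890.9)
# Small-loss method: training and loss-score calculation.
small_loss_total = total_seconds(40.2, 50, 3.8)
# Ratio of the two elapsed times.
ratio = inn_total / small_loss_total

print(round(inn_total, 1), round(small_loss_total, 1), round(ratio, 2))
```

Both totals match the reported values, and the nearest-neighbor search alone accounts for more than half of the INN total.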