INN: A Method Identifying Clean-annotated Samples via Consistency Effect in Deep Neural Networks

06/29/2021 · by Dongha Kim, et al.

In many classification problems, collecting massive clean-annotated data is not easy, and thus much research has been done on handling data with noisy labels. Most recent state-of-the-art solutions for the noisy label problem are built on the small-loss strategy, which exploits the memorization effect. While it is a powerful tool, the memorization effect has several drawbacks. The performance is sensitive to the choice of the training epoch required for utilizing the memorization effect. In addition, when the labels are heavily contaminated or imbalanced, the memorization effect may not occur, in which case methods based on the small-loss strategy fail to identify clean labeled data. We introduce a new method called INN (Integration with the Nearest Neighbors) to refine clean labeled data from training data with noisy labels. The proposed method is based on a new discovery that the prediction pattern at the neighbor regions of clean labeled data is consistently different from that of noisy labeled data, regardless of training epochs. The INN method requires more computation but is much more stable and powerful than the small-loss strategy. Through various experiments, we demonstrate that the INN method successfully resolves the shortcomings of the memorization effect and thus helps construct more accurate deep prediction models from training data with noisy labels.


1 Introduction

Deep neural networks (DNNs) have achieved impressive successes in many research fields, but their training relies on massive clean-annotated data sets such as ImageNet [9] and MS-COCO [21], which are difficult to collect. Since annotating procedures are usually done manually by human experts, obtaining large clean labeled data is expensive and time-consuming, which prevents deep learning models from being trained successfully. On the other hand, it is feasible to access numerous data through internet search engines [11, 31, 40, 17] or hashtags, whose labels are easy to collect but relatively inaccurate. Thus, exploiting data sets with corrupted labels instead of clean ones to solve classification tasks with DNNs, which is called the noisy label problem, has come into the spotlight.

A large body of literature deals with noisy labeled data, and a majority of methods exploit the so-called memorization effect, a special characteristic of DNNs: they eventually memorize all training data (i.e., perfectly classify the training data), but they memorize clean labeled samples earlier and noisy labeled samples later [3, 16]. Hence, we can identify clean data from training data contaminated with noisy labels by choosing samples with small loss values. Due to its simplicity and effectiveness, many follow-up studies based on this small-loss strategy have been proposed and have achieved great success ([34] and references therein).

Figure 1: An illustration of the INN method. The circle and the square are inputs with a clean label and a noisy label, respectively. Numbered dots are their nearest inputs. Each graph presents the value of a given prediction model along the dashed line. The INN method takes the average of the areas under each of the graphs.

But the small-loss strategy has several weaknesses. First, during the training phase, it is difficult to know the training epoch (or iteration) at which the discrepancy of loss values between clean data and noisy data is large, since it heavily depends on various factors including the data set, model architecture, optimizer type and even the learning schedule. Second, it becomes hard to identify clean-annotated samples from the training data via the small-loss strategy when the training labels are heavily polluted. Besides, the memorization effect may not appear when we analyze data with an imbalanced label distribution. Since imbalanced data arise frequently in many real-world domains, this shortcoming can be an obstacle to applying the small-loss strategy in many industrial fields.

To tackle these issues with the memorization effect, we develop a novel and powerful method called INN (Integration with the Nearest Neighbors). We start with a new and interesting observation that the output values of a trained DNN at neighbor regions of clean labeled and noisy labeled samples are consistently quite different, regardless of training epochs. We call this phenomenon the consistency effect. Motivated by the consistency effect, the INN method averages the output values over the neighbor regions of a given sample and regards the sample as noisy if the average is small. See Figure 1 for an illustration of the INN method.

The INN requires more computation than the small-loss method. Still, this additional expense is worth paying, since the INN successfully overcomes the small-loss method's limitations. The INN works well even when the training labels are heavily contaminated or imbalanced, situations in which the small-loss method struggles. This stability and superiority make the INN easily applicable to various supervised learning tasks without much effort.

We can also combine the INN with an existing learning method for the noisy label problem based on the small-loss strategy (e.g., DivideMix [19]) to construct deep networks of high accuracy: we replace the parts where the memorization effect and loss information are used with the consistency effect and the INN scores. We show that these modifications enhance prediction performance considerably, especially when the training labels contain many noises or have an imbalanced distribution.

This paper is organized as follows. In Section 2, we provide a brief review of related studies dealing with noisy labels, and detailed descriptions of the INN are given in Section 3. Various experimental analyses, including performance tests and ablation studies, are given in Section 4, and concluding remarks follow in Section 5. The key contributions of this work are as follows.

  • We report a new observation, called the consistency effect: the output values of a trained DNN at neighbor regions of clean labeled and noisy labeled samples are consistently quite different regardless of training epochs.

  • Building on the consistency effect, we propose a method called the INN to identify clean-annotated data from given training data.

  • We empirically demonstrate that the INN can separate clean and noisy samples accurately and stably even under heavy label corruption and imbalanced label distributions, and that it also helps construct superior prediction models.

2 Related works

The noisy label problem has been studied for several decades [1, 47, 27]. The core difficulty in solving the noisy label problem with DNNs is that DNNs easily over-fit all training samples, including noisy labeled ones, because of their large capacity, resulting in inferior generalization performance. Here we review related studies on training robust classifiers under noisy annotations, organized around two key concepts: loss correction and the memorization effect. We also describe several approaches exploiting the information of a target sample's neighborhoods, as the INN does.

Loss correction based algorithms aim to improve the generalization error by modifying objective functions [29, 45]. The noise adaptive layer-based algorithm [12] added additional noisy channels that estimate the correct labels. The iterative noisy label detection methods [38, 37] used a weighted softmax loss function whose weights are updated iteratively based on the feature maps of the current DNN model. Some algorithms estimate ground-truth labels directly [36, 41]. Meta-learning has also been applied to the noisy label problem [20], and a loss function more robust than standard loss functions has been proposed [23].

Approaches based on the memorization effect focus on the gap between the output values of clean labeled and noisy labeled samples during an early stage of the training phase. The decouple method [26] proposed a meta-algorithm called decoupling which decides when to update. D2L [25] distinguished clean labeled data from noisy ones by employing a local dimensionality measure, and ELR [22] exploited the faster gradient vanishing of clean labeled samples at the early learning stage. Several algorithms train noise-robust prediction models by using only a subset of the training data selected by their loss or prediction values [13, 42, 32, 8, 33, 28]. Some studies fit a two-component mixture model to the per-sample loss distribution [2, 19].

Some works utilize the information of neighborhoods to filter out noisy labeled data, similar to the INN's idea. The distance-weighted k-NN [10] was an initial work that considers the nearest samples with distance-based weights. Deep k-NN [4] proposed a filtering strategy based on the label information of the nearest neighbors, and MentorMix [15] applied MixUp [44] to MentorNet [16] to consider linear combinations of two inputs. We discuss the difference between the INN and these methods exploiting neighbor information in Section 3.3.

3 Integration with the nearest neighbors

3.1 Notations and definitions

For a given input vector $\mathbf{x} \in \mathcal{X}$, let $\tilde{y}$ and $y$ be its observable and ground-truth labels, respectively, where $\tilde{y}, y \in \{1, \ldots, C\}$. Of course, $\tilde{y}$ might be different from $y$. We say that the sample $(\mathbf{x}, \tilde{y})$ is cleanly labeled if $\tilde{y} = y$ and noisily labeled if $\tilde{y} \neq y$. Let $\mathcal{D} = \{(\mathbf{x}_i, \tilde{y}_i)\}_{i=1}^{n}$ be a training data set with $n$ samples. Define $\mathcal{D}^{clean} = \{(\mathbf{x}_i, \tilde{y}_i) \in \mathcal{D} : \tilde{y}_i = y_i\}$ and $\mathcal{D}^{noisy} = \mathcal{D} \setminus \mathcal{D}^{clean}$. Our goal is to identify the clean labeled subset $\mathcal{D}^{clean}$ from $\mathcal{D}$ accurately.

Let $f(\cdot;\theta)$ (abbr. $f$) be a discriminative DNN parametrized by $\theta$ which maps an input to a $C$-dimensional conditional probability vector with the softmax layer. Also let $f_c$ be the $c$-th component of $f$; that is, we can represent $f$ as $f = (f_1, \ldots, f_C)$.

3.2 Consistency effect

Before we start, we explain the main motivation of our method. For a given training sample $(\mathbf{x}, \tilde{y}) \in \mathcal{D}$, we define $\bar{\mathbf{x}} = (\mathbf{x} + \mathbf{x}^{nn})/2$, where $\mathbf{x}^{nn}$ is the nearest neighbor training input of $\mathbf{x}$ on the feature space (i.e., $h(\mathbf{x}^{nn})$ is closest to $h(\mathbf{x})$) and $h$ is the output of the penultimate layer of a prediction model pre-trained on $\mathcal{D}$. Then, we can regard $\bar{\mathbf{x}}$ as locating in the neighbor region of $\mathbf{x}$. We investigate how the prediction values of the training inputs and their neighbors behave differently according to label cleanness. We estimate a prediction model $f(\cdot;\theta)$ by minimizing the standard cross-entropy on $\mathcal{D}$. At each training epoch, we calculate the four expectations defined as

$$C^{\mathrm{tr}} = \frac{1}{|\mathcal{D}^{clean}|}\sum_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{clean}} f_{\tilde{y}}(\mathbf{x};\theta), \qquad N^{\mathrm{tr}} = \frac{1}{|\mathcal{D}^{noisy}|}\sum_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{noisy}} f_{\tilde{y}}(\mathbf{x};\theta),$$
$$C^{\mathrm{nb}} = \frac{1}{|\mathcal{D}^{clean}|}\sum_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{clean}} f_{\tilde{y}}(\bar{\mathbf{x}};\theta), \qquad N^{\mathrm{nb}} = \frac{1}{|\mathcal{D}^{noisy}|}\sum_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{noisy}} f_{\tilde{y}}(\bar{\mathbf{x}};\theta).$$

The two values $C^{\mathrm{tr}}$ and $N^{\mathrm{tr}}$ are the expectations of the predictions of clean and noisy data, respectively, and $C^{\mathrm{nb}}$ and $N^{\mathrm{nb}}$ are the expectations of the prediction values at the neighbor regions over the clean and noisy labeled data, respectively. From Figure 2, we can see a typical phenomenon related to the memorization effect: $C^{\mathrm{tr}}$ and $N^{\mathrm{tr}}$ are quite different at early epochs, but the difference diminishes as training proceeds. That means it becomes hard to discriminate noisy data from clean ones by comparing $f_{\tilde{y}}(\mathbf{x};\theta)$ at each sample at later stages of the training phase. Moreover, it is difficult to decide how many epochs are necessary to amply utilize the memorization effect.

On the other hand, the difference between $C^{\mathrm{nb}}$ and $N^{\mathrm{nb}}$ is clearly significant regardless of training epochs. That is, the prediction values at the neighbor regions of each sample remain informative for separating clean and noisy data even when the number of training epochs is large. We call this new observation the consistency effect. This consistent discrepancy occurs for two reasons. First, when $(\mathbf{x}, \tilde{y})$ is a noisy labeled sample, its label $\tilde{y}$ and the label of its nearest neighbor training sample, denoted by $\tilde{y}^{nn}$, tend not to coincide, which yields a small prediction value $f_{\tilde{y}}(\bar{\mathbf{x}};\theta)$. Second, even if $\tilde{y}$ and $\tilde{y}^{nn}$ are equal, $\mathbf{x}^{nn}$ may not be the nearest neighbor on the input space (i.e., $\mathbf{x}^{nn}$ is not close to $\mathbf{x}$). Hence, there exists a region between $\mathbf{x}$ and $\mathbf{x}^{nn}$ at which the value of $f_{\tilde{y}}(\cdot;\theta)$ becomes small.
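To make these quantities concrete, below is a minimal NumPy sketch (ours, not from the paper) that computes the four expectations at one training epoch from per-sample softmax outputs; the array names probs, probs_mid, noisy_labels and is_clean are hypothetical placeholders.

import numpy as np

def consistency_summary(probs, probs_mid, noisy_labels, is_clean):
    """Compute C_tr, N_tr, C_nb, N_nb at one training epoch.

    probs, probs_mid : (n, C) softmax outputs at the inputs and at the
                       midpoints with their nearest feature-space neighbors.
    noisy_labels     : (n,) observed (possibly corrupted) integer labels.
    is_clean         : (n,) boolean mask, True if observed == ground truth
                       (known only in controlled experiments).
    """
    idx = np.arange(len(noisy_labels))
    p_self = probs[idx, noisy_labels]      # f_y~(x)
    p_mid = probs_mid[idx, noisy_labels]   # f_y~(x_bar)
    return {
        "C_tr": p_self[is_clean].mean(),
        "N_tr": p_self[~is_clean].mean(),
        "C_nb": p_mid[is_clean].mean(),
        "N_nb": p_mid[~is_clean].mean(),
    }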

Figure 2: Comparison of $C^{\mathrm{tr}}$, $N^{\mathrm{tr}}$, $C^{\mathrm{nb}}$ and $N^{\mathrm{nb}}$ for various training epochs. We use the 30% symmetrically noisy CIFAR10 as $\mathcal{D}$.
Figure 3: Comparison of AUC values for clean/noisy sample classification of the INN obtained by varying some factors. We consider (Left) how to utilize the prediction information of neighbor regions and (Right) the number of nearest neighborhoods. We use the 40% asymmetrically noisy CIFAR10 as $\mathcal{D}$.

3.3 INN method

In this section, we propose a new method, motivated by the consistency effect, to identify clean labeled samples. As observed in Section 3.2, it is important to take into account the prediction values at a neighbor region of each sample. Let $f(\cdot;\hat\theta)$ be a prediction model trained with a loss function $l$ on $\mathcal{D}$ for $T$ training epochs. In this study, we use the MixUp objective function [44] as the loss function $l$. For a given training sample $(\mathbf{x}, \tilde{y})$ and its nearest neighbor training input $\mathbf{x}^{nn}$, a naturally induced score to identify whether $(\mathbf{x}, \tilde{y})$ is clean or not would be $f_{\tilde{y}}(\bar{\mathbf{x}};\hat\theta)$, where $\bar{\mathbf{x}} = (\mathbf{x} + \mathbf{x}^{nn})/2$. From further experiments, we modify this score as follows. First, we observe that the consistency effect occurs at many input vectors between $\mathbf{x}$ and $\mathbf{x}^{nn}$ other than $\bar{\mathbf{x}}$. Thus, to exploit the consistency effect fully, we integrate the prediction function over the whole interval between $\mathbf{x}$ and $\mathbf{x}^{nn}$ to obtain

$$\int_0^1 f_{\tilde{y}}\big(\lambda \mathbf{x} + (1-\lambda)\mathbf{x}^{nn};\hat\theta\big)\, d\lambda .$$

Second, using multiple neighbor samples helps identify clean labeled data more accurately. Based on these two arguments, we propose the INN score given as

$$\mathrm{INN}(\mathbf{x},\tilde{y}) = \frac{1}{K} \sum_{\mathbf{x}' \in N_K(\mathbf{x})} \int_0^1 f_{\tilde{y}}\big(\lambda \mathbf{x} + (1-\lambda)\mathbf{x}';\hat\theta\big)\, d\lambda, \qquad (1)$$

where $N_K(\mathbf{x})$ is the set of $K$ nearest neighbor training inputs of $\mathbf{x}$ on the feature space described in Section 3.2. Figure 3 illustrates the effects of these two modifications. The integral in (1) can be easily approximated by the trapezoidal rule as

$$\int_0^1 f_{\tilde{y}}\big(\lambda \mathbf{x} + (1-\lambda)\mathbf{x}';\hat\theta\big)\, d\lambda \approx \sum_{m=1}^{M} \frac{f_{\tilde{y}}\big(\lambda_{m-1}\mathbf{x} + (1-\lambda_{m-1})\mathbf{x}';\hat\theta\big) + f_{\tilde{y}}\big(\lambda_m \mathbf{x} + (1-\lambda_m)\mathbf{x}';\hat\theta\big)}{2M},$$

where $\lambda_m = m/M$ and $M$ is the number of trapezoids. In practice, we fix the values of $K$ and $M$ to 10. The larger the score is, the more we can regard $(\mathbf{x}, \tilde{y})$ as being cleanly labeled; hereafter, we abbreviate $\mathrm{INN}(\mathbf{x},\tilde{y})$ as $\mathrm{INN}(\mathbf{x})$. Even after many training epochs, the consistency effect still remains in the prediction model, so the INN method separates clean labeled data from noisy ones well. The following simple lemma supports the validity of our method.

Lemma 1.

Let $f(\cdot;\hat\theta)$ be a prediction model which perfectly over-fits the MixUp loss function

$$\mathbb{E}_{\lambda \sim \mathrm{Beta}(\alpha,\alpha)} \left[ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{CE}\Big( f\big(\lambda \mathbf{x}_i + (1-\lambda)\mathbf{x}_j;\theta\big),\ \lambda\, \mathbf{e}_{\tilde{y}_i} + (1-\lambda)\, \mathbf{e}_{\tilde{y}_j} \Big) \right],$$

where CE is the cross-entropy loss function, $\mathbf{e}_c$ denotes the one-hot vector of class $c$, and $\mathrm{Beta}(\alpha,\alpha)$ is the Beta distribution with a hyperparameter $\alpha$. Also assume that for each training input $\mathbf{x}_i$, its nearest neighbor set $N_K(\mathbf{x}_i)$ satisfies $\tilde{y}' = y_i$ for every $\mathbf{x}' \in N_K(\mathbf{x}_i)$, where $\tilde{y}'$ is the observed label of $\mathbf{x}'$ and $y_i$ is the ground-truth label of $\mathbf{x}_i$. Then, the following inequality holds:

$$\min_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{clean}} \mathrm{INN}(\mathbf{x},\tilde{y}) > \max_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{noisy}} \mathrm{INN}(\mathbf{x},\tilde{y}).$$

0:  Training data D = {(x_i, ỹ_i)}_{i=1}^n, a prediction model f(·;θ), a pre-trained feature model h(·), the number of neighborhoods K, a training epoch T, an optimizer
1:  d(x, x') ← ||h(x) − h(x')||_2  //dissimilarity measure
2:  for t = 1 to T do
3:      Update θ with the optimizer on D  //train f(·;θ) using MixUp
4:  end for
5:  S ← ∅  //Define INN score set
6:  for i = 1 to n do
7:      N_K(x_i) ← the K training inputs nearest to x_i w.r.t. d  //K nearest neighborhoods of x_i
8:      s_i ← INN(x_i, ỹ_i) computed by (1) with the trapezoidal rule using M trapezoids
9:      //s_i is the INN score of (x_i, ỹ_i)
10:     S ← S ∪ {s_i}  //append s_i to S
11: end for
Output: S
Algorithm 1 INN (In practice, we fix K and M to 10.)

The proof is in the supplementary materials. Lemma 1 means that if we have a prediction model trained with the MixUp loss and good nearest neighborhood sets, then the INN separates $\mathcal{D}^{clean}$ from $\mathcal{D}^{noisy}$ perfectly.

As mentioned in Section 2, there are several approaches similar to the INN that utilize the nearest neighborhoods' information to filter out noisy data [38, 20, 4, 15]. They mainly take advantage of the labels of the neighborhoods. When the training labels are heavily polluted, most of the nearest samples also become noisy, so relying only on the label information might lead to poor results. In contrast, the INN focuses on the regions between the inputs and their neighbor training samples, so we expect the INN to be robust to highly noisy data.

The algorithm of the INN method is summarized in Algorithm 1.
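For reference, the following is a minimal PyTorch-style sketch of the scoring stage of Algorithm 1 under a few assumptions: model is an already-trained network returning class logits, feature_fn returns penultimate-layer features, and inputs/noisy_labels fit in memory. The brute-force neighbor search and the helper names are our simplifications, and the MixUp training loop is omitted.

import torch

@torch.no_grad()
def inn_scores(model, feature_fn, inputs, noisy_labels, k=10, m=10, batch=256):
    """Compute INN scores for every training sample (larger = more likely clean)."""
    device = next(model.parameters()).device
    inputs = inputs.to(device)
    noisy_labels = noisy_labels.to(device)

    # 1) Penultimate-layer features and brute-force k-nearest neighbors.
    #    (For large data sets, a batched or approximate neighbor search would be used.)
    feats = torch.cat([feature_fn(inputs[i:i + batch]) for i in range(0, len(inputs), batch)])
    dists = torch.cdist(feats, feats)                     # Euclidean distances in feature space
    dists.fill_diagonal_(float("inf"))                    # exclude the sample itself
    nn_idx = dists.topk(k, largest=False).indices         # (n, k)

    # 2) Trapezoidal approximation of the line integral between x and each neighbor.
    lambdas = torch.linspace(0, 1, m + 1, device=device)  # lambda_0, ..., lambda_M
    scores = torch.zeros(len(inputs), device=device)
    for j in range(k):
        x_nb = inputs[nn_idx[:, j]]
        vals = []
        for lam in lambdas:
            mixed = lam * inputs + (1 - lam) * x_nb
            probs = torch.cat([model(mixed[i:i + batch]).softmax(dim=1)
                               for i in range(0, len(mixed), batch)])
            vals.append(probs.gather(1, noisy_labels.view(-1, 1)).squeeze(1))
        vals = torch.stack(vals, dim=1)                    # (n, m + 1)
        scores += torch.trapezoid(vals, lambdas, dim=1)    # integral over [0, 1]
    return (scores / k).cpu()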

4 Experimental analysis

In this section, we empirically show the superiority of the INN in three aspects. First, the INN is not sensitive to the choice of the training epoch and provides consistent performance. Second, the INN is significantly better than the small-loss strategy when many polluted labels are in the training data. Finally, in situations where the training labels are imbalanced, the small-loss strategy may not work, while our method still succeeds in finding clean labeled data. We also combine the INN with an existing small-loss-based learning framework to construct better deep prediction networks. Additional ablation studies follow.

4.1 Experimental settings

Data sets

We carry out extensive experiments, including performance tests and ablation studies, by analyzing three data sets: CIFAR10&100 [18] and Clothing1M [40]. Both CIFAR10 and CIFAR100 consist of 50K training data and 10K test data with an input size of 32×32, all of which are cleanly labeled. Clothing1M is a large-scale data set with real-world noisy labels containing 1M training data collected from online shopping websites. We use the subset of the Clothing1M data set whose ground-truth labels are known; this subset consists of 48K samples with a noise level of roughly 20%.

As for imposing noisy labels on CIFAR10 and CIFAR100, we consider symmetric and asymmetric settings as other studies did [45, 41]. In the symmetric noise setting, for each sample in the training data set, its label is contaminated with a probability $r$ to a random label generated from the uniform distribution on $\{1, \ldots, C\}$ ($C = 10$ for CIFAR10 and $C = 100$ for CIFAR100). In the asymmetric noise setting for CIFAR10, with a probability $r$, a noisy label is generated by one of the following mappings: truck→automobile, bird→airplane, deer→horse and cat↔dog. For CIFAR100, labels are asymmetrically contaminated by flipping a given label to the next label with a probability $r$ according to the transition chain: class 1→class 2→⋯→class 100→class 1.
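As an illustration, here is a sketch of one way to impose these noise settings on a NumPy label array; the function name and arguments are ours, and the symmetric setting draws the replacement label uniformly over all classes as described above.

import numpy as np

def corrupt_labels(labels, noise_rate, num_classes, noise_type="symmetric", rng=None):
    """Return a noisy copy of `labels` under the symmetric or asymmetric setting."""
    rng = np.random.default_rng(rng)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate            # which samples get corrupted
    if noise_type == "symmetric":
        # replace by a label drawn uniformly from {0, ..., C-1}
        noisy[flip] = rng.integers(0, num_classes, size=flip.sum())
    elif noise_type == "asymmetric_cifar10":
        # truck->automobile, bird->airplane, deer->horse, cat<->dog (CIFAR10 class ids)
        mapping = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}
        for src, dst in mapping.items():
            noisy[flip & (labels == src)] = dst
    elif noise_type == "asymmetric_cifar100":
        # flip to the next class along the chain 0 -> 1 -> ... -> 99 -> 0
        noisy[flip] = (labels[flip] + 1) % num_classes
    return noisy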

Architectures and implementation details

We need two models, the prediction model $f$ and the feature model $h$, and we use the same architecture for both in all experiments. For CIFAR10&100 we utilize PreActResNet18 (PRN, [14]) with randomly initialized weights, and for Clothing1M we use ResNet50 (RN, [35]) with weights pre-trained on ImageNet. We train all the deep networks using the SGD algorithm with a momentum of 0.9 and a mini-batch size of 128, set the initial learning rate to 0.02, and reduce it by a factor of 5 when one half and three fourths of the learning procedure have passed, respectively. All of our results in the following experiments are averaged over three trials executed with random initial weights and mini-batch arrangements.
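A minimal PyTorch sketch of this optimization setting follows; the model constructor and the total number of epochs are placeholders, and the mini-batch size of 128 would be set in the data loader.

import torch

def make_optimizer(model, total_epochs):
    """SGD with momentum 0.9, initial lr 0.02, divided by 5 at 1/2 and 3/4 of training."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer,
        milestones=[total_epochs // 2, (3 * total_epochs) // 4],
        gamma=0.2,  # reduce the learning rate by a factor of 5
    )
    return optimizer, scheduler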


Figure 4: Comparison of AUC values for clean/noisy sample classification between the INN and the small-loss methods with two losses. The plot positioned at the $i$-th row from the top and the $j$-th column from the left is the result of Case $i$-$j$.

4.2 Stability test of the INN

In this section, we show the stability and superiority of the INN for identifying clean labeled samples from training data. For CIFAR10&100, we consider eight cases (Case1-1 to Case4-2) by varying noise rates and noise types.

  • CIFAR10 with symmetrically noisy labels at two noise rates (Case 1-1 and Case 1-2)

  • CIFAR10 with asymmetrically noisy labels at two noise rates (Case 2-1 and Case 2-2)

  • CIFAR100 with symmetrically noisy labels at two noise rates (Case 3-1 and Case 3-2)

  • CIFAR100 with asymmetrically noisy labels at two noise rates (Case 4-1 and Case 4-2)

Figure 5: Comparison of AUC values for clean/noisy sample classification for Clothing1M.

We consider various training epochs $T$ for $f$, from 50 to 300, and calculate the clean/noisy classification AUC values on the training data induced by the INN for each training epoch. We consider two small-loss methods as baselines, whose loss functions are the standard cross-entropy (CE) and the sum of the cross-entropy and negative entropy (CE+NE), respectively. We evaluate the small-loss methods' clean/noisy classification AUC values based on their per-sample losses.
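When the ground-truth labels are known, this clean/noisy classification AUC can be computed directly from per-sample scores, e.g. with the short sketch below (scikit-learn assumed; for the small-loss baselines, the negated per-sample loss plays the role of the score).

from sklearn.metrics import roc_auc_score

def clean_noisy_auc(scores, noisy_labels, true_labels):
    """AUC of detecting clean samples from NumPy arrays: higher scores should mean clean."""
    is_clean = (noisy_labels == true_labels).astype(int)  # 1 = clean, 0 = noisy
    return roc_auc_score(is_clean, scores)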

The results are depicted in Figure 4. We can clearly see that the INN provides consistent and high-quality results for all considered cases, regardless of training epochs. In contrast, the performances of the baselines become worse as the training epoch increases. This implies that the INN gives a more stable and powerful performance for clean/noisy classification. We also repeat these experiments with another deep architecture, WideResNet28-2 [43], and leave the results in the supplementary materials.

As for Clothing1M, we extract the weights of $h$ from RN pre-trained on ImageNet and train only $f$ for epochs varying from 10 to 300. We compare the AUC values of the INN with those based on the loss values of RN trained with CE. Figure 5 shows that the INN is superior and more stable throughout the whole training procedure compared to the competitor, which again assures the effectiveness of the INN.

4.3 INN with heavy noise rates

We experiment with situations where the training data are highly contaminated with noisy labels. We analyze CIFAR10&100 and corrupt their ground-truth labels with a high probability. As in the previous analysis, we compare the INN to the two small-loss-based methods trained with CE and CE+NE, respectively.

The best clean/noisy classification AUC values of the training data for each method are summarized in Table 1. In the heavy noise case, the proportion of clean labeled data is not large. Thus, reducing the loss of the clean labeled data may not be an optimal direction for reducing the overall loss in the early learning stages, leading to the degradation of the small-loss strategy. In contrast, the INN can still identify clean data from noisy ones effectively even when many noisy labels exist, and it outperforms the small-loss methods by large margins. We conjecture that this performance difference arises because the consistency effect is less sensitive to the number of noisy labels.

Data set        CIFAR10                          CIFAR100
Noise type      Symm.                            Symm.                            Asymm.
Noise rate (r)  0.8            0.9               0.8            0.9               0.4
CE              0.857 (0.011)  0.756 (0.015)     0.809 (0.005)  0.690 (0.009)     0.589 (0.002)
CE+NE           0.854 (0.012)  0.750 (0.017)     0.807 (0.015)  0.695 (0.011)     0.608 (0.003)
INN             0.885 (0.014)  0.817 (0.018)     0.853 (0.016)  0.717 (0.013)     0.671 (0.005)
Table 1: The best clean/noisy classification AUC values of the INN and small-loss methods. The values averaged over three trials are listed, with standard deviations in parentheses.

4.4 Analysis of imbalanced data

We also analyze noisy data where the ground-truth labels are imbalanced. In this section, we consider a two-class classification task. From CIFAR10, we randomly sample two classes: the first and second classes are regarded as a majority class and a minority class, respectively. We gather all images in the first class and 10% randomly sampled images in the second class. We relabel the majority and minority classes as 0 and 1, respectively, and for each sample we flip its label with a probability of 0.3 to generate training data with noisy labels. We compare the normalized score distributions of the INN and the small-loss method (CE) on the training data. For each distribution, we make four histograms by considering two factors: 1) whether the ground-truth label is 0 or 1 and 2) whether the observed label is 0 or 1.
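A sketch of this construction, assuming NumPy arrays images and labels for the full CIFAR10 training set (the function and variable names are ours):

import numpy as np

def make_imbalanced_noisy_pair(images, labels, major_cls, minor_cls,
                               keep_ratio=0.1, flip_prob=0.3, rng=None):
    """Build a two-class, imbalanced, noisily labeled training set from CIFAR10."""
    rng = np.random.default_rng(rng)
    major_idx = np.where(labels == major_cls)[0]                  # keep all majority images
    minor_idx = np.where(labels == minor_cls)[0]
    minor_idx = rng.choice(minor_idx, size=int(keep_ratio * len(minor_idx)), replace=False)
    idx = np.concatenate([major_idx, minor_idx])
    x = images[idx]
    y_true = (labels[idx] == minor_cls).astype(int)               # majority -> 0, minority -> 1
    flip = rng.random(len(y_true)) < flip_prob                    # symmetric label flipping
    y_noisy = np.where(flip, 1 - y_true, y_true)
    return x, y_noisy, y_true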


Figure 6: Score histograms for (Upper) the small-loss strategy (CE) and (Lower) the INN for imbalanced data. We fix the number of training epochs to 50 for the INN and 10 for the small-loss method. (Left to Right) We pick four class pairs, and the first class of each pair is treated as the majority class. Data with the four combinations of ground-truth and observed labels are colored blue, orange, green and red, respectively.

As can be seen in Figure 6, the model trained by the standard CE does not always prioritize memorizing the clean labeled data; instead, it first memorizes some of the noisy labeled data, which implies that the memorization effect may not occur in these imbalanced cases. Furthermore, the loss distributions are very unstable with respect to the training epochs. We report the histograms at other training epochs in the supplementary materials. In contrast, the consistency effect is still observable, though less clear. The INN can separate clean samples from noisy ones to some extent, even if they belong to the minority class.

4.5 Constructing noise-robust classifiers

Data set           CIFAR10                      CIFAR100
Noise type         Symm.                        Symm.                        Asymm.
Noise rate (r)     0.8            0.9           0.8            0.9           0.4
Cross-Entropy      62.9           42.7          19.9           10.1          42.7
Co-teaching [42]   67.4           47.9          27.9           13.7          -
P-correction [41]  77.5           58.9          31.1           15.3          -
MLNT [20]          -              59.1 (1.12)   -              -             -
M-correction [2]   86.8           69.1          48.2           24.3          -
DivideMix [19]     93.2           76.0          60.2           31.5          -
DivideMix*         92.90 (1.08)   71.34 (1.43)  58.26 (1.01)   31.36 (0.67)  59.26 (1.08)
INN+DivideMix      93.48 (1.01)   81.20 (1.05)  59.04 (0.98)   33.11 (0.82)  63.04 (1.10)
Table 2: Comparison of the test accuracies (%) of various methods. The results marked with * are re-implemented by us, and the other results except for INN+DivideMix are copied from the comparison table in [19] or their original papers. We report only the best results here and give the last-epoch accuracies in the supplementary materials.
Data set   Imbalanced CIFAR10
Classes    DivideMix                     INN+DivideMix
           Best           Last           Best           Last
1 and 2    89.23 (0.40)   82.21 (0.57)   92.82 (0.27)   87.75 (0.77)
3 and 8    87.49 (0.68)   82.24 (1.65)   93.02 (0.12)   91.06 (0.18)
4 and 5    79.41 (0.68)   78.35 (0.98)   82.82 (0.20)   82.79 (0.13)
7 and 9    85.64 (0.35)   79.80 (1.92)   86.33 (0.45)   84.15 (0.25)
Table 3: The test accuracies (%) of the DivideMix and the modified DivideMix with the INN (INN+DivideMix). The averaged values and standard deviations are reported.

The INN also helps learn high-performance deep classification models from noisy training data. Many conventional learning frameworks built on the small-loss strategy begin by training models on the whole training data by minimizing a standard loss function, such as CE, for a few epochs. Due to the memorization effect, these early estimated models tend to memorize the clean labeled training data first. Initialized with these models, the frameworks then conduct their own strategies to train the final models. After each training epoch, they update the per-sample loss values with the current model and utilize this information in the next training epoch.

We modify such frameworks in two simple ways using the INN. First, we replace their initial models with ones trained with the INN: we fit a two-component mixture model to the INN scores of the training data and split the data into labeled and unlabeled parts based on the posterior probability of belonging to the clean cluster (the cluster with the larger mean). With the labeled and unlabeled data, we train prediction models for a few epochs using a semi-supervised learning method, such as the MixMatch [7], and then use them as the initial estimates. Second, after each training epoch, we recalculate the INN scores with the current prediction model and utilize these scores instead of the loss values in the next training epoch.
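A sketch of the splitting step follows; for brevity it uses scikit-learn's GaussianMixture as a stand-in for the two-component mixture model and thresholds the posterior of the component with the larger mean at 0.5.

import numpy as np
from sklearn.mixture import GaussianMixture

def split_by_inn_scores(inn_scores, threshold=0.5):
    """Return boolean masks (labeled, unlabeled) from a two-component mixture fit."""
    scores = np.asarray(inn_scores).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)
    clean_comp = int(np.argmax(gmm.means_.ravel()))      # component with the larger mean
    p_clean = gmm.predict_proba(scores)[:, clean_comp]   # posterior of being clean
    labeled = p_clean > threshold
    return labeled, ~labeled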

In this work, we provide an example that combines the INN with the DivideMix [19], known as one of the state-of-the-art methods for learning models from data with noisy labels. We give a detailed algorithm of this combination in the supplementary materials. We again stress that any small-loss-based learning framework other than the DivideMix can also be combined with the INN to train better prediction models, and that there is much room to develop our simple application further.

To assess the prediction performance of our modification, we analyze the same noisy data sets as in Sections 4.3 and 4.4. We compare the test accuracy of the modified DivideMix with the INN (INN+DivideMix) against the other baselines; the results are reported in Tables 2 and 3. (Our source code is built on the public GitHub code of the DivideMix.) Our modified learning framework works better than the existing methods, including the DivideMix, in the cases where the training labels are highly polluted or imbalanced. Moreover, the performance gap between the best and last models trained with our method is smaller than that of the original DivideMix, which indicates that the INN makes the learning procedure more stable. We also run additional accuracy tests on data sets whose noise rates are not severe and observe no significant improvement over the DivideMix, since it already works well in those settings. We report these results in the supplementary materials.

4.6 Ablation study

We empirically investigate how the choices of the dissimilarity measure and the loss function affect the INN; detailed analyses are described in the supplementary materials. We observe that the Euclidean distance on the penultimate layer of a DNN trained on the training data is the best dissimilarity measure, and that the MixUp objective function yields the best results as the loss function.

We also evaluate the total training time of the INN and compare it to that of the standard small-loss method. We use a single NVIDIA TITAN XP GPU, and the results are in the supplementary materials. Running the INN takes much more time than the competitor since it needs to extract the nearest neighbors of each sample, which is one of our limitations. Further study to lighten the computational burden of the INN is required.

5 Concluding remarks

In this study, we proposed a new approach, called the INN, to identify clean labeled samples from training data with noisy labels. It is based on a new finding, called the consistency effect, that discrepancies between the predictions at neighbor regions of clean and noisy data are observed consistently over training epochs. We empirically demonstrated that the INN is stable and superior even when the training labels are heavily contaminated or imbalanced.

It would be interesting to apply our methods to unsupervised anomaly detection problems [39, 46, 48, 5]. After annotating labels (normal or abnormal) to the training data in a certain way, we can regard the task as a two-class noisy label problem. We expect that our methods would solve anomaly detection problems successfully.

References

  • [1] D. Angluin and P. Laird (1988) Learning from noisy examples. Machine Learning 2, pp. 343–370. Cited by: §2.
  • [2] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness (2019) Unsupervised label noise modeling and loss correction. In 36th International Conference on Machine Learning, pp. 312–321. Cited by: Table D.1, Table D.2, §2, Table 2.
  • [3] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242. Cited by: §1.
  • [4] D. Bahri, H. Jiang, and M. Gupta (2020-13–18 Jul) Deep k-NN for noisy labels. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 540–550. External Links: Link Cited by: §2, §3.3.
  • [5] L. Bergman and Y. Hoshen (2020) Classification-based anomaly detection for general data. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • [6] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5049–5059. Cited by: §D.1.1.
  • [7] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: §4.5.
  • [8] P. Chen, B. B. Liao, G. Chen, and S. Zhang (2019) Understanding and utilizing deep neural networks trained with noisy labels. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 1062–1070. Cited by: §2.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1.
  • [10] S. A. Dudani (1976) The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (4), pp. 325–327. External Links: Document Cited by: §2.
  • [11] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman (2010) Learning object categories from internet image searches. Proceedings of the IEEE 98 (8), pp. 1453–1466. Cited by: §1.
  • [12] J. Goldberger and E. Ben-Reuven (2017) Training deep neural-networks using a noise adaptation layer. In 5th International Conference on Learning Representations, Cited by: §2.
  • [13] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. W. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In 32nd International Conference on Neural Information Processing Systems, pp. 8536–8546. Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §4.1.
  • [15] L. Jiang, D. Huang, M. Liu, and W. Yang (2020-13–18 Jul) Beyond synthetic noise: deep learning on controlled noisy labels. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 4804–4815. External Links: Link Cited by: §2, §3.3.
  • [16] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: Table D.2, §1, §2.
  • [17] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pp. 301–320. Cited by: §1.
  • [18] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto. Cited by: §4.1.
  • [19] J. Li, R. Socher, and S. C. Hoi (2020) Dividemix: learning with noisy labels as semi-supervised learning. In 8th International Conference on Learning Representations, Cited by: Table D.1, Table D.2, §1, §2, §4.5, Table 2.
  • [20] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2019) Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5051–5059. Cited by: Table D.1, Table D.2, §2, §3.3, Table 2.
  • [21] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §1.
  • [22] S. Liu, J. Niles-Weed, N. Razavian, and C. Fernandez-Granda (2020) Early-learning regularization prevents memorization of noisy labels. In Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), Cited by: §2.
  • [23] Y. Lyu and I. W. Tsang (2020) Curriculum loss: robust learning and generalization against label corruption. In 8th International Conference on Learning Representations, Cited by: §2.
  • [24] X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pp. 3355–3364. Cited by: Table D.2.
  • [25] X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. In 35th International Conference on Machine Learning, pp. 5907–5915. Cited by: §2.
  • [26] E. Malach and S. Shalev-Shwartz (2017) Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 960–970. Cited by: §2.
  • [27] D. F. Nettleton, A. Orriols-Puig, and A. Fornells (2010) A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review 33, pp. 275–306. Cited by: §2.
  • [28] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox (2020) SELF: learning to filter noisy labels with self-ensembling. In International Conference on Learning Representations, Cited by: §2.
  • [29] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2233–2241. Cited by: §2.
  • [30] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. Cited by: Table D.2.
  • [31] F. Schroff, A. Criminisi, and A. Zisserman (2010) Harvesting image databases from the web. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (4), pp. 754–766. Cited by: §1.
  • [32] Y. Shen and S. Sanghavi (2019) Learning with bad training data via iterative trimmed loss minimization. In 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 5739–5748. Cited by: §2.
  • [33] H. Song, M. Kim, and J. Lee (2019) SELFIE: refurbishing unclean samples for robust deep learning. K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 5907–5915. Cited by: §2.
  • [34] H. Song, M. Kim, D. Park, and J. Lee (2020) Learning from noisy labels with deep neural networks: a survey. arXiv preprint arXiv:2007.08199. Cited by: §1.
  • [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §4.1.
  • [36] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: §2.
  • [37] S. Thulasidasan, T. Bhattacharya, J. Bilmes, G. Chennupati, and J. Mohd-Yusof (2019) Combating label noise in deep learning using abstention. Proceedings of the 36th International Conference on Machine Learning, pp. 6234–6243. Cited by: Table D.2, §2.
  • [38] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia (2018) Iterative learning with open-set noisy labels. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 8688–8696. Cited by: §2, §3.3.
  • [39] Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun (2015) Learning discriminative reconstructions for unsupervised outlier removal. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1511–1519. External Links: Document Cited by: §5.
  • [40] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2691–2699. Cited by: §1, §4.1.
  • [41] K. Yi and J. Wu (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7017–7025. Cited by: Table D.1, Table D.2, §2, §4.1, Table 2.
  • [42] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama (2019) How does disagreement help generalization against label corruption?. In 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), pp. 7164–7173. Cited by: Table D.1, Table D.2, §2, Table 2.
  • [43] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), E. R. H. Richard C. Wilson and W. A. P. Smith (Eds.), pp. 1–12. Cited by: Appendix B, §4.2.
  • [44] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §E.2, §2, §3.3.
  • [45] Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In 32nd International Conference on Neural Information Processing Systems, pp. 8792–8802. Cited by: §2, §4.1.
  • [46] C. Zhou and R. C. Paffenroth (2017) Anomaly detection with robust deep autoencoders. In KDD '17, New York, NY, USA, pp. 665–674. External Links: ISBN 9781450348874, Link, Document Cited by: §5.
  • [47] X. Zhu and X. Wu (2004) Class noise vs. attribute noise: a quantitative study. Artificial Intelligence Review 22, pp. 177–210. Cited by: §2.
  • [48] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, External Links: Link Cited by: §5.

6 Supplementary materials for the INN


Appendix A Proof of Lemma 1

Let $(\mathbf{x}, \tilde{y}) \in \mathcal{D}$ and $N_K(\mathbf{x})$ be a given training sample and the set of its $K$ nearest training inputs, respectively. Since the model perfectly over-fits the MixUp loss function, it is linear on the segment between any two training inputs. So, for $\mathbf{x}' \in N_K(\mathbf{x})$ and $\lambda \in [0,1]$ the following equality holds:

$$f_{\tilde{y}}\big(\lambda \mathbf{x} + (1-\lambda)\mathbf{x}';\hat\theta\big) = \lambda f_{\tilde{y}}(\mathbf{x};\hat\theta) + (1-\lambda) f_{\tilde{y}}(\mathbf{x}';\hat\theta) = \lambda f_{\tilde{y}}(\mathbf{x};\hat\theta) + (1-\lambda)\,\mathbb{1}(\tilde{y}' = \tilde{y}),$$

where $\tilde{y}'$ is the corresponding observed label of $\mathbf{x}'$. Hence, the INN score can be rewritten as:

$$\mathrm{INN}(\mathbf{x},\tilde{y}) = \frac{1}{K}\sum_{\mathbf{x}' \in N_K(\mathbf{x})} \int_0^1 \Big[\lambda f_{\tilde{y}}(\mathbf{x};\hat\theta) + (1-\lambda)\,\mathbb{1}(\tilde{y}' = \tilde{y})\Big]\, d\lambda. \qquad (2)$$

By using $f_{\tilde{y}}(\mathbf{x};\hat\theta) = 1$ and $\int_0^1 \lambda\, d\lambda = \int_0^1 (1-\lambda)\, d\lambda = 1/2$, we also have

$$\mathrm{INN}(\mathbf{x},\tilde{y}) = \frac{1}{2} + \frac{1}{2K}\sum_{\mathbf{x}' \in N_K(\mathbf{x})} \mathbb{1}(\tilde{y}' = \tilde{y}). \qquad (3)$$

With (2) and (3) and the assumption that $\tilde{y}' = y$ for every $\mathbf{x}' \in N_K(\mathbf{x})$,

$$\mathrm{INN}(\mathbf{x},\tilde{y}) = 1 \qquad (4)$$

if $(\mathbf{x},\tilde{y})$ is cleanly labeled, i.e. $\tilde{y} = y$, and

$$\mathrm{INN}(\mathbf{x},\tilde{y}) = \frac{1}{2} \qquad (5)$$

if $(\mathbf{x},\tilde{y})$ is noisily labeled. From (4) and (5), we have the following inequality and the proof is completed:

$$\min_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{clean}} \mathrm{INN}(\mathbf{x},\tilde{y}) = 1 > \frac{1}{2} = \max_{(\mathbf{x},\tilde{y}) \in \mathcal{D}^{noisy}} \mathrm{INN}(\mathbf{x},\tilde{y}).$$

Appendix B INN with another architecture

We carry out additional experiments for the performance test of the INN using another architecture. Here, we consider WideResNet28-2 [43] and we follow the main manuscript’s implementation settings, such as the strategies to impose noisy labels and learning schedules.

We depict the results in Figure B.1. Similar to the results with the PRN architecture, the clean/noisy classification performances of the INN are insensitive to the choice of the training epoch and consistently outperform the two loss-based competitors.


Figure B.1: Comparison of AUC values for clean/noisy sample classification between the INN and the two small-loss methods. We utilize the WideResNet28-2 architecture. The plot positioned at the $i$-th row from the top and the $j$-th column from the left is the result of Case $i$-$j$. As for the description of the eight cases, see the main manuscript.

Appendix C Instability of the small-loss method for imbalanced data

Figure C.1 illustrates histograms of the loss values at various training epochs when we analyze the imbalanced data. The same data sets used in Section 4.4 of the main manuscript are considered. We can see that the loss distribution is unstable, so it is hard to choose an optimal training epoch.


Figure C.1: The histograms of the loss values. (Upper to Lower) We consider five training epochs (1, 5, 10, 15 and 20). (Left to Right) We pick four class pairs, and the first class of each pair is treated as the majority class. Data with the four combinations of ground-truth and observed labels are colored blue, orange, green and red, respectively.

Appendix D Modification of DivideMix with INN

D.1 Detailed algorithm description

Here, we describe our modification (INN+DivideMix) in detail by comparing it with the original DivideMix algorithm.

D.1.1 Step 1

Original DivideMix

The DivideMix trains two prediction models by minimizing the sum of the cross-entropy and negative entropy with two independent initializations for a pre-specified number of training epochs using the whole training data. This number of epochs usually ranges from 10 to 30, depending on the data set. This part aims to obtain two initialization models that are robust to noisy labels to some extent via the memorization effect.

INN+DivideMix

First, we conduct the INN method twice to obtain two corresponding INN score sets of the training data with random initializations. With each INN score set, we separate the training data into labeled and unlabeled data by fitting a two-component Beta Mixture Model (BMM) to the INN scores using the Expectation-Maximization algorithm. We regard samples whose posterior probability of belonging to the clean cluster (the cluster with the larger mean) is larger than 0.5 as labeled data and treat the remaining samples as unlabeled data by discarding their labels. Then, we utilize a semi-supervised learning (SSL) method to train two prediction models for a pre-specified number of training epochs with each pair of labeled and unlabeled data. Any SSL method can be applied; in our experiments, we adopt the MixMatch [6].

D.1.2 Step 2

Original DivideMix

The DivideMix updates the two prediction models pre-trained in the first step. For each prediction model, the DivideMix fits a two-component Gaussian Mixture Model (GMM) to the per-sample loss distribution of the model to divide the training set into a labeled data set and an unlabeled data set, which is called the co-divide procedure. The two models then exchange the co-divided data sets and are trained on the exchanged data sets with an SSL method that modifies the MixMatch with the co-refinement and co-guessing techniques. The DivideMix alternates the co-divide procedure and the modified MixMatch for a pre-specified number of training epochs to construct the final prediction models.

INN+DivideMix

Similar to the original DivideMix, our method starts with the two initialized prediction models from the first step. For each prediction model, we calculate the INN score set. Note that we need a prediction model $f$ and a feature model $h$ to conduct the INN; here, we set $f$ and $h$ to the current prediction model and its penultimate (i.e., highest hidden) layer, respectively. With each INN score set, we fit a two-component Beta Mixture Model (BMM) and split the training data into labeled and unlabeled data by using the posterior probabilities. Then, we utilize the same SSL algorithm used in the original DivideMix with the two pairs of labeled and unlabeled data sets to update the two prediction models. We repeat the above learning procedure for a pre-specified number of training epochs. We summarize our modification in Algorithm D.1. Algorithm D.1 requires three kinds of training epochs, $T_1$, $T_2$ and $T_3$; in practice, we use the same values for all cases.

0:  Training data D, two prediction models f^(1)(·;θ_1) and f^(2)(·;θ_2), three integers T_1, T_2 and T_3, an optimizer
1:  Train the two prediction models f^(1), f^(2) and the two feature models h^(1), h^(2) with the MixUp loss for T_1 epochs.
2:  S^(1), S^(2) ← INN scores of D  //two INN score sets with (f^(1), h^(1)) and (f^(2), h^(2)), respectively
3:  w^(1), w^(2) ← posteriors from two-component BMMs fitted to S^(1), S^(2)  //clean posterior probabilities
4:  Mark samples with posterior probability larger than 0.5 as clean
5:  (L^(1), U^(1)), (L^(2), U^(2)) ← splits induced by w^(1), w^(2)  //two pairs of labeled and unlabeled data
6:  for t = 1 to T_2 do
7:      Update f^(1) with (L^(1), U^(1)) and f^(2) with (L^(2), U^(2))
8:      //conduct the MixMatch
9:  end for
10: for t = 1 to T_3 do
11:     h^(1) ← penultimate layer of f^(1)  //update f^(1) and h^(1)
12:     h^(2) ← penultimate layer of f^(2)  //update f^(2) and h^(2)
13:     S^(1), S^(2) ← INN scores of D recomputed with the current models  //two INN score sets
14:     w^(1), w^(2) ← posteriors from two-component BMMs  //clean posterior probabilities
15:     Mark samples with posterior probability larger than 0.5 as clean
16:     (L^(1), U^(1)), (L^(2), U^(2)) ← splits exchanged between the two models  //two pairs of labeled and unlabeled data
17:     for k = 1 to 2 do
18:         Update f^(k) with its exchanged pair  //conduct the SSL part of the DivideMix
19:     end for
20: end for
Output: f^(1) and f^(2)  //final outputs
Algorithm D.1 INN+DivideMix (lines 1–9: Step 1, lines 10–20: Step 2)

D.2 Performance tests of INN+DivideMix

We report the last test accuracy results of the INN+DivideMix on severely contaminated data sets. Table D.1 shows that the modified DivideMix with the INN improves on the DivideMix by large margins.

The prediction performance of the INN in ordinary cases, where there are not many noisy labeled samples in the training data, is provided in Table D.2. Our modification does not give a visible enhancement in these situations since the DivideMix already works well, so there is not much room to improve on the original.

Data set           CIFAR10                      CIFAR100
Noise type         Symm.                        Symm.                        Asymm.
Noise rate (r)     0.8            0.9           0.8            0.9           0.4
Cross-Entropy      26.1           16.8          8.8            3.5           -
Co-teaching [42]   45.5           30.1          15.5           8.8           -
P-correction [41]  76.5           58.2          20.7           8.8           -
MLNT [20]          -              -             -              -             -
M-correction [2]   86.6           68.7          47.6           20.5          -
DivideMix [19]     92.9           75.4          59.6           31.0          -
DivideMix*         92.71 (1.02)   68.79 (1.31)  57.78 (1.07)   31.16 (1.21)  53.79 (1.09)
INN+DivideMix      93.04 (1.19)   80.51 (1.05)  58.81 (1.22)   32.36 (0.86)  59.34 (0.94)
Table D.1: Comparison of the last test accuracies (%) of various methods on severely polluted data sets. The averaged values (and standard deviations) are listed. The results marked with * are re-implemented by us, and the other results except for INN+DivideMix are copied from the comparison table in [19] or their original papers.
Data set           CIFAR10                                                    CIFAR100
Noise type         Symm.                                                      Symm.
Noise rate (r)     0.2            0.4            0.5            0.6           0.2            0.4            0.5            0.6
Cross-Entropy      86.8 (82.7)    -              79.4 (57.9)    -             62.0 (61.8)    -              46.7 (37.3)    -
Co-teaching [42]   89.5 (88.2)    -              85.7 (84.1)    -             65.6 (64.1)    -              51.8 (45.3)    -
P-correction [41]  92.4 (92.0)    -              89.1 (88.7)    -             69.4 (68.1)    -              57.5 (56.4)    -
MentorNet [16]     92.0 (-)       89.0 (-)       -              -             73.0 (-)       68.0 (-)       -              -
D2L [24]           85.1 (-)       83.4 (-)       -              72.8 (-)      62.2 (-)       52.0 (-)       -              42.3 (-)
MLNT [20]          92.9 (92.0)    -              89.3 (88.8)    -             68.5 (67.7)    -              59.2 (58.0)    -
Reweight [30]      86.9 (-)       -              -              -             61.3 (-)       -              -              -
Abstention [37]    93.4 (-)       90.9 (-)       -              87.6 (-)      75.8 (-)       68.2 (-)       -              59.4 (-)
M-correction [2]   94.0 (93.8)    -              92.0 (91.9)    -             73.9 (73.4)    -              66.1 (65.4)    -
DivideMix [19]     96.1 (95.7)    94.9 (-)       94.6 (94.4)    94.3 (-)      77.3 (76.9)    75.2 (-)       74.6 (74.2)    72.0 (-)
DivideMix*         96.08 (95.68)  94.95 (94.51)  94.81 (94.13)  94.29 (93.95) 77.23 (76.84)  74.92 (74.12)  74.08 (73.56)  72.58 (71.99)
INN+DivideMix      96.11 (95.82)  95.01 (94.86)  94.99 (94.83)  94.87 (94.56) 77.27 (76.86)  74.67 (74.14)  75.18 (74.83)  72.69 (72.21)
Table D.2: Comparison of the best (and last) test accuracies (%) of various methods on data sets that are not heavily polluted. The results marked with * are re-implemented by us, and the other results except for INN+DivideMix are copied from the comparison table in [19] or their original papers.

Appendix E Detailed descriptions of ablation studies

E.1 Dissimilarity measure

As a dissimilarity measure, we utilize the Euclidean ($L_2$) distance between the feature representations (the penultimate layer representations) generated by a pre-trained prediction model, $d(\mathbf{x}, \mathbf{x}') = \lVert h(\mathbf{x}) - h(\mathbf{x}') \rVert_2$, where $h$ is the highest feature output of a pre-trained prediction model. Here, $h$ is trained by minimizing the standard cross-entropy (CE) function on $\mathcal{D}$. As an alternative feature representation, we consider using an external training data set such as ImageNet. We obtain another feature representation by training $h$ on the ImageNet data set and compare the two representation functions, trained on 1) $\mathcal{D}$ and 2) ImageNet, respectively, by evaluating the AUC values of their corresponding INN scores. The integrand prediction model $f$ used in formula (1) of the main manuscript is trained by minimizing CE on $\mathcal{D}$ with a maximum of 300 training epochs. Every 50 training epochs, we calculate the INN scores on $\mathcal{D}$ with $f$ and evaluate the AUC values for the clean/noisy sample classification problem. The results are depicted in the left panel of Figure E.1, which shows that using feature representations trained with the given training data set ($\mathcal{D}$) gives similar results.

E.2 Loss function

We investigate two loss functions for estimating the prediction model $f$: 1) the standard cross-entropy (CE) function and 2) the MixUp (MU) objective function [44]. Every 50 training epochs, we calculate the INN scores on the training data and evaluate the AUC values. The right panel in Figure E.1 shows that the AUC values obtained with the MU function are higher and relatively stable.
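For completeness, a minimal sketch of a MixUp training objective of this kind for one mini-batch follows; this is our simplified version, pairing samples by a random permutation within the batch, with alpha denoting the Beta hyperparameter.

import numpy as np
import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, num_classes, alpha=1.0):
    """MixUp training loss for one mini-batch of inputs x and observed labels y."""
    lam = np.random.beta(alpha, alpha)                 # mixing coefficient ~ Beta(alpha, alpha)
    perm = torch.randperm(x.size(0), device=x.device)  # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    log_probs = F.log_softmax(model(x_mix), dim=1)
    return -(y_mix * log_probs).sum(dim=1).mean()      # cross-entropy with soft targets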

Figure E.1: AUC values for clean/noisy sample classification of the INN method obtained by varying two settings: (Left) the training data set for learning $h$ and (Right) the loss function for $f$. We analyze the 40% asymmetrically noisy CIFAR10.

E.3 Training time analysis

We analyze the total training time of the INN on CIFAR10 and compare it to the small-loss competitor. We use a single NVIDIA TITAN XP GPU, and the results are summarized in Tables E.1 and E.2. The elapsed time of the INN is about twice that of the small-loss method, since searching for the nearest neighborhoods requires much more time than the other steps. Thus, to overcome this limitation of the INN in terms of elapsed time, we need to modify and lighten the procedure for computing the neighborhoods.


Training the two architectures    Searching the nearest neighbors    Calculating the INN scores    Total
21.5 sec/epoch × 50 epochs        2520.4 sec                         890.9 sec                     4486.3 sec
Table E.1: Total elapsed time for running the INN. We train the two deep architectures ($f$ and $h$) for 50 epochs by minimizing the MixUp loss function.

Training an architecture      Calculating the loss scores    Total
40.2 sec/epoch × 50 epochs    3.8 sec                        2013.8 sec
Table E.2: Total elapsed time for running the small-loss strategy. We train a deep architecture for 50 epochs by minimizing the sum of the cross-entropy and negative entropy.