Learning to Purify Noisy Labels via Meta Soft Label Corrector

08/03/2020 · by Yichen Wu, et al.

Recent deep neural networks (DNNs) can easily overfit to biased training data with noisy labels. Label correction strategies are commonly used to alleviate this issue by designing a method to identify suspected noisy labels and then correct them. Current approaches to correcting corrupted labels usually need pre-defined label correction rules or manually preset hyper-parameters. These fixed settings make them hard to apply in practice, since accurate label correction usually relates to the concrete problem, the training data, and the temporal information hidden in the dynamic iterations of the training process. To address this issue, we propose a meta-learning model which estimates soft labels through a meta-gradient descent step under the guidance of noise-free meta data. By viewing the label correction procedure as a meta-process and using a meta-learner to automatically correct labels, we can iteratively obtain rectified soft labels adapted to the current training problem without manually preset hyper-parameters. Moreover, our method is model-agnostic and can be combined with any existing model with ease. Comprehensive experiments substantiate the superiority of our method over current SOTA label correction strategies on both synthetic and real-world problems with noisy labels.


1 Introduction

The remarkable success of deep neural networks (DNNs) on various tasks heavily relies on pre-collected large-scale datasets with high-quality annotations [13, 19]. However, practical annotated training datasets almost always contain a certain amount of noisy (incorrect) labels, easily causing overfitting and degrading the generalization performance of the trained DNNs [47]. In fact, such biased training data are commonly encountered in practice, due to the coarse annotation sources used for collecting them, like web searches [25] and crowd-sourcing [42]. Therefore, how to train DNNs robustly with such biased training data is a critical issue in the current machine learning field.

To address this problem, various methods have been proposed [2, 34, 16], which can be coarsely categorized into sample selection and label correction approaches. Sample selection approaches tackle this challenge mainly by adopting sample re-weighting schemes that impose importance weights on samples according to their losses. Typical methods include boosting and self-paced learning [20, 28]. Recently, some pioneering works [32, 34] further made such weighting schemes more adaptive and automatic by employing a small set of clean meta data to guide the network training process. All these methods are built on the basis of discarding suspected noisy samples during training. However, these corrupted samples contain beneficial information that could improve the accuracy and robustness of the network, especially in large noise-ratio scenarios [5].

Label correction approaches alleviate this issue by attempting to find noisy labels and correct them to their underlying true ones. For example, [14, 30, 35] revised class probabilities by estimating a noise transition matrix, aiming to recover the underlying ground-truth label distribution and guide the training process towards the correct classes. However, owing to the difficulty in estimating the noise transition matrix or the true labels, the network training can easily accumulate errors, especially when the number of classes or mislabeled samples is large [34, 16]. Another common methodology is to directly rectify noisy labels by exploiting the predictions of the network; e.g., Reed et al. [31] employed a bootstrapping loss to incorporate a perceptual consistency term (assigning a new label generated by the convex combination of the current network prediction and the original noisy label) into the learning process. Along this research line, SELFIE [36] is known for using the co-teaching strategy to select clean samples and progressively refurbish noisy labels with those most frequently predicted by previously learned models. Another typical work is Joint Optimization [38], which uses two progressive steps to alternately update the labels of the whole dataset and the classifier weights, based on the knowledge delivered in the dynamic iterations of the algorithm. Besides, U-correction [2] built a two-component Beta Mixture Model (BMM) to estimate the probability of a sample being mislabeled and corrected noisy labels with a bootstrapping loss. From the perspective of label correction, all these methods can be viewed as different means of generating soft labels to replace the original targets. Albeit capable of correcting noisy labels to a certain extent, the performance of these methods heavily relies on the reliability of the generated soft labels, which in turn depends on the accuracy of the classifier trained on the noisy dataset. When the classifier performs poorly, the false label information it supplies will further degrade the quality of the obtained classifier. Moreover, these methods usually need manually preset hyper-parameters to fit different training data, which makes them hard to generalize to the varied and diverse scenarios encountered in real cases.

To solve the above problems, in this paper we design a meta soft label corrector (MSLC), which corrects corrupted labels iteratively from the angle of meta-learning. Concretely, we treat the label correction procedure as a two-stage optimization process. In the first stage, we generate soft labels through MSLC by utilizing the original targets and different temporal information from the predictions of the base model; we then update MSLC by a gradient descent step that minimizes the loss on clean meta data. In the second stage, we train the base learner to fit the soft labels generated by MSLC in the first stage. By optimizing the two stages alternately, the method effectively utilizes the guidance of meta data and improves performance under noisy labels. The contributions of this paper can be summarized as follows:

  • We propose a meta soft label corrector that maps an input label to a corrected soft label without using conventional pre-defined generating rules, thus making the label correction process more flexible and easily adaptable to complicated real datasets with different types and levels of noise.

  • Under the guidance of noise-free meta data, our method can adaptively make use of the temporal predictions of the model to generate more accurate pseudo-labels without manually preset combination coefficients.

  • Our proposed model is model-agnostic and can be added on top of any existing model at hand. Comprehensive synthetic and real experiments validate the superiority of the proposed method for robust deep learning with noisy labels. This can be finely interpreted by its clearly better capability of distinguishing noisy from clean labels and the superior quality of the new soft labels generated by MSLC.

2 Meta Soft Label Corrector

2.1 Analysis of the existing label correction methods

For a $c$-class classification problem, let $\mathcal{X}$ denote the feature space and $\mathcal{Y}$ the label space. Given training data $D=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i$ is the $i$-th sample with its (possibly noisy) label denoted as $y_i$, we denote the classifier network as a function $f(\cdot;\mathbf{w})$ with input $x$ and output a predicted label distribution, where $\mathbf{w}$ represents the network parameters. In order to learn the model on the dataset $D$, the parameters $\mathbf{w}$ can be optimized under a chosen loss function.

Label correction methods focus on how to generate more accurate pseudo-labels to replace the original noisy ones, so as to improve the performance of the classifier. E.g., Reed et al. [31] proposed a static hard bootstrapping loss to deal with label noise, in which the training objective at step $t$ is

$\mathcal{L}^{(t)} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(\lambda\, y_i + (1-\lambda)\, z_i^{(t)},\ f(x_i;\mathbf{w})\big)$,    (1)

where $z_i^{(t)}$ is the label predicted by the classifier at step $t$, $\lambda\, y_i + (1-\lambda)\, z_i^{(t)}$ can be seen as a soft pseudo-label that replaces the original target with preset parameter $\lambda$, and $\ell$ is a chosen loss function. In a formulation similar to Eq. (1), other methods design their own strategies to generate pseudo-labels. For example, SELFIE [36] sets a loss threshold to separate out low-loss instances as clean samples, decides which samples are corrupted according to the volatility of their predictions, and then corrects the corrupted ones with the most frequently predicted label in previous iterations. Furthermore, Arazo et al. [2] learned the coefficient $\lambda$ dynamically for every sample by using a Beta Mixture Model, an unsupervised method that groups the sample losses into two categories, and chose the prediction of the current step similarly to Eq. (1).
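To make Eq. (1) concrete, below is a minimal PyTorch sketch of the hard bootstrapping loss; the fixed mixing coefficient `lam` and the helper name are illustrative (Reed et al. [31] use a preset value, but this paper does not prescribe one).

```python
import torch
import torch.nn.functional as F

def hard_bootstrap_loss(logits, noisy_targets, lam=0.8):
    """Cross-entropy against lam * y + (1 - lam) * z, where z is the model's
    own hard prediction (one-hot argmax), as in Eq. (1)."""
    num_classes = logits.size(1)
    y = F.one_hot(noisy_targets, num_classes).float()         # original noisy labels
    z = F.one_hot(logits.argmax(dim=1), num_classes).float()  # current hard predictions
    soft_target = lam * y + (1.0 - lam) * z                   # preset convex combination
    return -(soft_target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```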

Different from the form of Eq. (1), Joint Optimization [38] first trained the model on the original targets with the cross-entropy loss and a large learning rate for several epochs, and then used the network predictions to generate pseudo-labels without using the original labels at all. The loss function they used is

$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(\bar{y}_i^{(t)},\ f(x_i;\mathbf{w})\big)$, with $\bar{y}_i^{(t)} = \frac{1}{n}\sum_{j=1}^{n} f\big(x_i;\mathbf{w}^{(t-j)}\big)$,    (2)

where the pseudo-label $\bar{y}_i^{(t)}$ is the average of the predictions from the past $n$ epochs. With finely set hyper-parameters, it can achieve robust performance.
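A small sketch of this pseudo-label bookkeeping is given below, assuming a fixed-size buffer of per-epoch softmax outputs; the class and method names are illustrative, not the authors' implementation, and the average is only meaningful once `n_past` epochs have been recorded.

```python
import torch

class PredictionBank:
    """Stores the last n_past epochs of softmax outputs and averages them
    into soft pseudo-labels, as in Eq. (2)."""
    def __init__(self, num_samples, num_classes, n_past=10):
        self.history = torch.zeros(n_past, num_samples, num_classes)
        self.n_past = n_past

    def record(self, epoch, indices, probs):
        # Overwrite this epoch's slot with the mini-batch's predictions.
        self.history[epoch % self.n_past, indices] = probs.detach().cpu()

    def pseudo_labels(self, indices):
        # Soft targets = mean prediction over the stored epochs.
        return self.history[:, indices].mean(dim=0)
```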

It can be seen that existing label correction methods exploit a manually designed mechanism for correcting labels. However, compared with designing a mechanism specifically for one investigated problem, it is a far more difficult task to construct a unified label correction methodology that can adapt finely to different applications and datasets; this constitutes the main task of this work.

Moreover, these methods may cause severe error accumulation due to the low quality of the new soft labels that replace the original ones. Bootstrap [31] and U-correction [2] combine the observed label with the current prediction to generate new soft labels. However, there exists significant variation in the predictions of the base model, especially for samples whose labels are corrupted. Joint Optimization [38] used the predictions of earlier networks to alleviate this problem, but it replaces all observed targets with new soft labels regardless of whether they are clean, which raises the issue that originally clean labels may be wrongly corrected.

2.2 Structure of the proposed MSLC

To alleviate the aforementioned issues of current methods, we want to build a learning framework that generates pseudo-labels with the following data-adaptive label corrector at each training step $t$:

$\hat{y}_i^{(t)} = h\big(y_i,\ s_i^{(t)};\ \Theta\big)$,    (3)

where $\hat{y}_i^{(t)}$ is the soft pseudo-label generated by our proposed MSLC, $y_i$ denotes the original label, $s_i^{(t)}$ represents the side information that helps make such a label correction, and $\Theta$ denotes the hyper-parameters involved in this function. The questions are now how to specify $s_i^{(t)}$ and the parametric form of $h$, and how to learn the involved parameters $\Theta$.

With the meta soft label corrector of Eq. (3), the final training objective at step $t$ can be written as:

$\mathcal{L}^{(t)} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(\hat{y}_i^{(t)},\ f(x_i;\mathbf{w})\big)$.    (4)

Synthesizing the helpful experience analyzed in the previous section, we use the current prediction $f(x_i;\mathbf{w}^{(t)})$ and the last-epoch pseudo-label $\hat{y}_i^{(t-1)}$ as the side information in Eq. (3) for helping correct the input label $y_i$, i.e.,

$\hat{y}_i^{(t)} = h\big(y_i,\ f(x_i;\mathbf{w}^{(t)}),\ \hat{y}_i^{(t-1)};\ \Theta\big)$.    (5)

Note that earlier generated pseudo-labels could easily be adopted in our method as well. Our experiments show that one previous pseudo-label can already guarantee good performance; we thus use this simple setting, but more could readily be explored in the future. In this sense, both pseudo-label utilization manners introduced above (e.g., [31] and [38]) can be seen as special cases of ours, but with manually preset combination coefficients instead of coefficients automatically learned directly from data.

It is worth noting that U-correction [2] adopts an unsupervised model to learn this hyper-parameter; however, possibly due to the alternating updating procedure of the unsupervised model and the base classifier, although it can fit a fixed loss distribution well, it cannot split noisy samples accurately during the training process (see Section 3.2). To alleviate these issues, we view the label correction procedure as a meta-process and use a meta-learner to automatically correct labels. Inspired by [31] and [38], we simply set the corrected label to be a convex combination of $y_i$, $f(x_i;\mathbf{w}^{(t)})$ and $\hat{y}_i^{(t-1)}$. That is:

$\hat{y}_i^{(t)} = \alpha_i\, y_i + (1-\alpha_i)\big(\beta_i\, \hat{y}_i^{(t-1)} + (1-\beta_i)\, f(x_i;\mathbf{w}^{(t)})\big)$,    (6)

where $g_1(\cdot;\theta_1)$ and $g_2(\cdot;\theta_2)$ are two networks whose outputs represent the coefficients of this convex combination, with their parameters denoted as $\theta_1$ and $\theta_2$, respectively, and thus $\Theta=\{\theta_1,\theta_2\}$. These two coefficient networks, $g_1$ and $g_2$, then constitute the main parts of our proposed soft label corrector, which is intuitively shown in Fig. 1. Through the two networks, the input target information, i.e., $y_i$, $f(x_i;\mathbf{w}^{(t)})$ and $\hat{y}_i^{(t-1)}$, is combined into a new soft target $\hat{y}_i^{(t)}$, which replaces the original label in the training process. Here $\alpha_i$ and $\beta_i$ are the output values of $g_1$ and $g_2$, respectively.

Figure 1: The Structure of MSLC
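As a minimal sketch of Eq. (6) and Fig. 1: the coefficient networks below follow the structure reported in Appendix A (one hidden layer of 100 nodes, with sigmoid outputs to keep the coefficients in [0, 1]); that they take the per-sample loss as input is our assumption for illustration (the conclusion suggests loss is the current input), and all names are hypothetical.

```python
import torch
import torch.nn as nn

class CoefficientNet(nn.Module):
    """One-hidden-layer MLP emitting a combination coefficient in [0, 1]."""
    def __init__(self, hidden=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, per_sample_loss):            # (B,) -> (B, 1)
        return self.net(per_sample_loss.unsqueeze(1))

def mslc_correct(y, pred_t, pseudo_prev, g1, g2, per_sample_loss):
    """Eq. (6): alpha weighs the original label; beta weighs the last-epoch
    pseudo-label against the current prediction. All labels are (B, C)."""
    alpha = g1(per_sample_loss)                    # assumed input: sample loss
    beta = g2(per_sample_loss)
    return alpha * y + (1 - alpha) * (beta * pseudo_prev + (1 - beta) * pred_t)
```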

Our proposed MSLC exploits a meta-learning regime and can better distinguish noisy from clean samples than the unsupervised manner. We also take more temporal prediction information into consideration, so that the generated new soft labels are more accurate and severe error accumulation is effectively prevented.

2.3 Training with meta dataset

We now introduce how to learn the hyper-parameters $\Theta$ of the MSLC in Eq. (6). We readily employ a meta-data-driven learning regime as used in [34], which exploits a small but noise-free dataset (i.e., meta data) for learning the hyper-parameters on the training samples. The meta dataset contains the meta-knowledge of the underlying label distribution of clean samples; it is thus rational to exploit it as a sound guide to help estimate $\Theta$ for our task. Such data can be seen as conventional validation data (but of high quality), with a much smaller size than the training data, and is thus feasible to pre-collect. In this work, we denote the meta dataset as

$D^{meta} = \big\{(x_j^{meta},\ y_j^{meta})\big\}_{j=1}^{M}$,    (7)

where $M$ ($M \ll N$) is the number of data samples in the meta dataset. By utilizing the meta dataset, we can then design the entire training framework for the noisy-label correction model of Eq. (4).

Specifically, we formulate the following bi-level minimization problem:

$\Theta^{*} = \arg\min_{\Theta}\ \frac{1}{M}\sum_{j=1}^{M} \ell^{meta}\big(y_j^{meta},\ f(x_j^{meta};\mathbf{w}^{*}(\Theta))\big)$, s.t. $\mathbf{w}^{*}(\Theta) = \arg\min_{\mathbf{w}}\ \frac{1}{N}\sum_{i=1}^{N} \ell\big(\hat{y}_i(\Theta),\ f(x_i;\mathbf{w})\big)$,    (8)

where $\ell^{meta}$ is the meta loss on the meta dataset. After obtaining $\Theta^{*}$, we get a soft label corrector that inclines to ameliorate noisy labels into correct ones, further improving the quality of the trained classifier.

Optimizing the parameters and hyper-parameters in Eq. (8) requires two nested loops of optimization, which tends to be computationally inefficient [9]. We thus exploit the SGD technique to speed up the algorithm by approximately solving the problem in a mini-batch updating manner [34, 8], jointly ameliorating $\mathbf{w}$ and $\Theta$. The algorithm flowchart is shown in Fig. 2.

Figure 2: Main flowchart of the proposed MSLC
Require: Training data $D$, meta data $D^{meta}$, batch sizes $n$, $m$, MaxEpoch $T$.
Ensure: Classifier network parameter $\mathbf{w}$.
1:  Initialize classifier parameter $\mathbf{w}^{(0)}$ and meta-learner parameter $\Theta^{(0)}$.
2:  for $t = 0$ to $T-1$ do
3:      $\{(x_i, y_i)\}_{i=1}^{n}$ ← SampleMiniBatch($D$, $n$).
4:      $\{(x_j^{meta}, y_j^{meta})\}_{j=1}^{m}$ ← SampleMiniBatch($D^{meta}$, $m$).
5:      Update $\Theta^{(t+1)}$ by Eq. (10).
6:      Update $\mathbf{w}^{(t+1)}$ by Eq. (11).
7:      Update the pseudo-labels $\hat{y}_i$ by the current classifier with parameter $\mathbf{w}^{(t+1)}$.
8:  end for
Algorithm 1 The Learning Algorithm of the Meta Soft Label Corrector

The algorithm mainly includes the following steps. First, denote the mini-batch of training samples as $\{(x_i,y_i)\}_{i=1}^{n}$; the training loss then becomes $\frac{1}{n}\sum_{i=1}^{n}\ell\big(\hat{y}_i(\Theta),\,f(x_i;\mathbf{w})\big)$. We can then deduce the formula for one-step updating with respect to $\mathbf{w}$ as

$\hat{\mathbf{w}}^{(t)}(\Theta) = \mathbf{w}^{(t)} - \eta\,\frac{1}{n}\sum_{i=1}^{n} \nabla_{\mathbf{w}}\, \ell\big(\hat{y}_i(\Theta),\ f(x_i;\mathbf{w})\big)\big|_{\mathbf{w}^{(t)}}$,    (9)

where $\eta$ is the learning rate. Then, with the current mini-batch of meta data samples $\{(x_j^{meta},y_j^{meta})\}_{j=1}^{m}$, we can perform one-step updating for solving $\Theta$, that is,

$\Theta^{(t+1)} = \Theta^{(t)} - \tau\,\frac{1}{m}\sum_{j=1}^{m} \nabla_{\Theta}\, \ell^{meta}\big(y_j^{meta},\ f(x_j^{meta};\hat{\mathbf{w}}^{(t)}(\Theta))\big)\big|_{\Theta^{(t)}}$,    (10)

where $\tau$ is the learning rate of the meta-learner. After obtaining $\Theta^{(t+1)}$, we can calculate the pseudo-label $\hat{y}_i^{(t)}$ by Eq. (3) and update $\mathbf{w}$, that is,

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\frac{1}{n}\sum_{i=1}^{n} \nabla_{\mathbf{w}}\, \ell\big(\hat{y}_i^{(t)},\ f(x_i;\mathbf{w})\big)\big|_{\mathbf{w}^{(t)}}$.    (11)

The predicted pseudo-labels can then be updated by using the current classifier with parameter $\mathbf{w}^{(t+1)}$. The entire algorithm is summarized in Algorithm 1.
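For concreteness, here is a condensed PyTorch sketch of one iteration of Algorithm 1 (Eqs. (9)-(11)), written with torch.func so the virtual step stays differentiable with respect to $\Theta$. The `corrector` callable (implementing Eq. (6)), the parameter dictionary, and all names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def soft_ce(params, model, x, targets):
    # Cross-entropy against soft targets, with the classifier evaluated
    # functionally at an explicit parameter dictionary.
    logits = functional_call(model, params, (x,))
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

def one_iteration(model, params, corrector, theta_opt, batch, meta_batch, lr):
    x, y = batch
    xm, ym = meta_batch
    # Eq. (9): virtual one-step descent on w, kept differentiable w.r.t. Theta
    # through the soft targets produced by the corrector.
    loss = soft_ce(params, model, x, corrector(x, y))
    grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
    virtual = {k: p - lr * g for (k, p), g in zip(params.items(), grads)}
    # Eq. (10): descend Theta on the meta loss evaluated at the virtual weights.
    meta_loss = F.cross_entropy(functional_call(model, virtual, (xm,)), ym)
    theta_opt.zero_grad()
    meta_loss.backward()
    theta_opt.step()
    # Eq. (11): the actual update of w with soft labels from the refreshed
    # corrector (detached, so this is a plain descent step).
    targets = corrector(x, y).detach()
    grads = torch.autograd.grad(soft_ce(params, model, x, targets),
                                list(params.values()))
    with torch.no_grad():
        for k, g in zip(params.keys(), grads):
            params[k] -= lr * g
    return params
```

Here `params` would be a dictionary of the classifier's tensors with requires_grad enabled (e.g., built from model.named_parameters()), and `theta_opt` optimizes only the corrector's parameters $\Theta$.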

3 Experimental Results

To evaluate the capability of the proposed method, we conduct experiments on CIFAR-10 and CIFAR-100 [18] under different types and levels of noise, as well as on a real-world large-scale noisy dataset, Clothing1M [44]. Both CIFAR-10 and CIFAR-100 contain 50k training images and 10k test images of size 32×32. For CIFAR-10/100, we use two types of label noise: symmetric and asymmetric. Symmetric: we follow [47, 38] for label noise addition, which generates label corruptions by flipping the labels of a given proportion of training samples to one of the other class labels uniformly (the true label may be randomly maintained). Asymmetric: we use the setting in [45], which is designed to mimic the structure of real-world label noise. Concretely, with a given probability we disturb a label to its similar class, e.g., truck → automobile, bird → airplane, deer → horse, cat ↔ dog. For CIFAR-100, a similar probability is set, but the label flips only happen within each super-class, as described in [14].
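A sketch of the two noise generators described above is given below; the CIFAR-10 class-index map encodes the similar-class pairs listed in the text, and the function names and seeding are illustrative.

```python
import numpy as np

def symmetric_noise(labels, ratio, num_classes, seed=0):
    """Flip a `ratio` fraction of labels uniformly to one of the classes
    (so a flipped label may randomly keep its true value)."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(len(labels)) < ratio
    labels[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return labels

# CIFAR-10 indices: truck(9)->automobile(1), bird(2)->airplane(0),
# deer(4)->horse(7), cat(3)<->dog(5).
ASYM_MAP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}

def asymmetric_noise(labels, ratio, seed=0):
    """With probability `ratio`, move a label to its similar class."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    for i, y in enumerate(labels):
        if int(y) in ASYM_MAP and rng.random() < ratio:
            labels[i] = ASYM_MAP[int(y)]
    return labels
```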

Baselines. The compared methods include: Fine-tuning, which fine-tunes the result of Cross-Entropy on the meta data to further enhance its performance. GCE [48], which employs a robust loss combining the benefits of both the CE loss and the mean absolute error loss against label noise. GLC [14], which estimates the noise transition matrix by using a small clean-label dataset. MW-Net [34], which uses an MLP to learn a weighting function. Bootstrap [31], which deals with label noise by adding a perceptual term to the standard CE loss. Joint Optimization [38], which updates the labels and the model at the same time by using the pseudo-labels it generates. U-correction [2], which models sample losses with a BMM and applies MixUp; for fair comparison, we compare with its proposed method without the mixup augmentation.

Experiment Details. We use ResNet-34 [13] as the classifier network for all baseline experiments in Table 1. We use two multi-layer perceptrons (MLPs), each with one hidden layer of 100 nodes, as the network structures of $g_1$ and $g_2$, respectively. In the proposed method, we choose cross-entropy as the loss function and begin to correct labels at the 80th epoch (i.e., there is an initial warm-up).

Noise type                   |        Symmetric Noise                          | Asymmetric Noise
Dataset                      |  CIFAR-10               |  CIFAR-100            | CIFAR-10    | CIFAR-100
Method (noise ratio)         | 0.2   0.4   0.6   0.8   | 0.2   0.4   0.6   0.8 | 0.2   0.4   | 0.2   0.4
Cross-Entropy           Best | 90.22 87.33 83.2  54.79 | 68.03 61.18 46.43 17.91 | 92.85 90.22 | 69.05 65.14
                        Last | 86.33 79.61 72.99 54.26 | 63.67 46.92 30.96 8.29  | 91.29 87.23 | 63.68 50.10
Fine-tuning             Best | 91.17 87.34 83.75 56.28 | 67.81 62.55 50.82 19.05 | 93.11 91.04 | 69.55 65.75
                        Last | 88.27 82.16 79.36 54.82 | 63.97 51.14 38.22 18.86 | 92.35 89.49 | 66.43 55.08
GCE [48]                Best | 90.27 88.50 83.70 57.27 | 71.36 63.39 58.06 16.51 | 90.11 85.24 | 69.56 57.50
                        Last | 90.15 88.01 82.87 57.22 | 71.02 52.15 45.31 15.71 | 89.33 82.04 | 66.36 56.81
GLC [14]                Best | 91.43 88.52 84.08 64.21 | 69.30 63.24 56.12 18.59 | 92.46 91.74 | 71.40 67.73
                        Last | 90.13 87.04 82.63 62.19 | 66.62 59.03 51.96 8.08  | 92.41 91.02 | 70.01 66.68
MW-Net [34]             Best | 91.48 87.34 81.98 65.88 | 69.79 65.44 55.42 19.62 | 93.44 91.64 | 67.54 60.24
                        Last | 90.11 86.42 81.62 64.78 | 68.37 64.81 55.04 19.20 | 91.95 90.88 | 66.71 59.53
Bootstrap [31]          Best | 91.46 88.75 84.03 63.80 | 69.79 63.73 57.20 17.63 | 93.08 91.18 | 70.93 67.82
                        Last | 88.00 83.57 78.69 63.41 | 63.00 47.08 35.86 17.04 | 91.02 85.59 | 63.46 49.18
Joint Optimization [38] Best | 90.85 90.27 86.49 66.39 | 63.84 59.82 49.13 18.95 | 93.39 91.43 | 66.90 64.82
                        Last | 89.77 88.58 85.57 65.92 | 60.10 56.85 47.68 17.38 | 92.12 90.20 | 66.69 59.31
U-correction [2]        Best | 92.05 89.07 85.64 68.23 | 68.37 62.37 55.19 17.10 | 91.85 90.34 | 67.71 66.75
                        Last | 90.21 85.45 83.15 64.78 | 67.42 55.40 55.04 9.33  | 90.92 84.31 | 63.82 60.64
Ours                    Best | 93.46 91.42 87.39 69.87 | 72.51 68.98 60.81 24.32 | 94.39 92.81 | 72.66 70.51
                        Last | 93.38 91.21 87.25 68.88 | 72.02 68.70 60.25 20.53 | 94.11 92.48 | 70.20 69.24
Table 1: Test accuracy (%) of all competing methods on CIFAR-10 and CIFAR-100 under symmetric and asymmetric noise with different noise levels. The best results (achieved by our method in every column) are highlighted in bold in the original table.

3.1 Comparison with State-of-the-Art Methods

Table 1 shows the results of all competing methods on CIFAR-10 and CIFAR-100 under symmetric and asymmetric noise as aforementioned. To compare the methods in more detail, we report both the best test accuracy and the average test accuracy over the last 5 epochs. It can be observed that our method achieves the best performance across both datasets and all noise rates. Specifically, even under a relatively high noise ratio (e.g., on CIFAR-10 with 80% symmetric noise), our algorithm retains competitive classification accuracy (69.87%). It is worth noting that U-correction achieves a comparable best accuracy (68.23%), while its accuracy drops to 64.78% in later training, probably due to its error accumulation. This indicates that our proposed meta soft label corrector has better convergence under the guidance of meta data during the training process. It can also be seen that MW-Net performs poorly in the asymmetric condition, possibly because all classes share one weighting function in that method, which is unreasonable when the noise is asymmetric. Comparatively, our proposed MSLC has a higher degree of freedom and thus performs much better with asymmetric noise.

Figure 3: The corrected label accuracy under different noise types and noise ratios on CIFAR-100: (a) 40% symmetric noise, (b) 60% symmetric noise, (c) 40% asymmetric noise.

Fig. 3 plots the corrected label accuracy, computed by comparing the hard form of the pseudo-labels in Eq. (3) with the ground truth. As can be seen in Fig. 3, the corrected labels generated by our method are the most accurate. The accuracy of MW-Net always stays below the proportion of clean samples, since it intrinsically tries to select the clean samples while ignoring the corrupted ones through its weighting mechanism. From Fig. 3 (a)(c), we can see that the corrected label accuracy of U-correction decreases slightly, which might be caused by its massive false corrections (further analyzed in Section 3.2). Moreover, although the accuracy of Joint Optimization increases throughout training, its performance is limited by the strategy of using pseudo-labels to replace all targets, which carries the risk of corrupting the originally clean labels (also analyzed in Section 3.2).

Table 2 reports the results on the real noisy dataset Clothing1M, which consists of 1 million clothing images belonging to 14 classes (e.g., T-shirt, Shirt, Knitwear) collected from online shopping websites, together with additional smaller sets with clean labels for validation (14K) and testing (10K). Since the labels are generated from the surrounding texts of the images provided by the sellers, they contain many erroneous labels. From Table 2, it can be observed that the proposed method achieves the best performance, which indicates that our meta soft label corrector can be applied to real, complicated datasets.

Method                 Accuracy     Method                      Accuracy
1  Cross-Entropy       68.94        4  Joint Optimization [38]  72.23
2  Bootstrapping [31]  69.12        5  MW-Net [34]              73.72
3  U-correction [2]    71.00        6  Ours                     74.02
Table 2: Test accuracy (%) of different models on the real-world noisy dataset Clothing1M.

3.2 Analysis of the proposed MSLC

Figure 4: Comparison of confusion matrices before and after correction on CIFAR-10 with (a) 40% asymmetric noise and (b) 40% symmetric noise.
Figure 5: Analysis of the proposed method on CIFAR-100 with symmetric noise: (a) the output weight $\alpha$ of $g_1$ on clean/noisy samples; (b) the corrected label accuracy on clean/noisy data (corresponding to Fig. 6 (a)), with the whole dataset split according to the ground truth.
Dataset                             CIFAR-10                                  CIFAR-100
$\alpha$                   0      0.2    0.4    0.6    0.8    Ours     0      0.2    0.4    0.6    0.8    Ours
Accuracy          Best    89.84  90.49  91.04  90.34  89.46  91.27    67.42  68.52  68.25  67.13  67.08  68.84
                  Last    89.46  90.19  90.91  89.64  89.20  91.11    66.93  68.06  67.83  66.61  66.24  68.35
Corrected Label Accuracy  92.23  93.36  94.24  93.44  91.94  94.52    81.47  83.29  83.04  81.28  81.24  83.98
Table 3: Test accuracy (%) of the ablation study on CIFAR-10/100 under 40% symmetric noise, varying the manually set coefficient $\alpha$. Mean accuracy over 3 repetitions is reported.

Fig. 4 shows the confusion matrices of our method under symmetric and asymmetric noise on CIFAR-10. The left column of Fig. 4 (a) and (b) is the noise transition matrix, which is the guideline for generating the synthesized noisy datasets, and the right column is the matrix after correction by our proposed method, whose x-axis denotes the hard-form corrected labels. By comparing the left and right columns of Fig. 4 (a) and (b), we can see that the probability mass of most diagonal terms is substantially raised after correction, which indicates the high correction accuracy of our proposed MSLC.

Fig. 5 demonstrates the output weights of $g_1$ and the corrected label accuracy on clean and noisy samples, respectively. From Fig. 5 (a), we can see that the weights of clean and noisy samples are significantly different, which means our meta soft label corrector inclines to keep the original labels when they are clean and to use other target information when they are noisy. Fig. 5 (b) shows that our method can largely correct the noisy samples while retaining the original clean ones. It is worth noting that U-correction retains more than 99% of clean samples; however, we have found through experiments that this is because it inclines to treat most of the samples as clean ones during the training process, which limits its ability to correct noisy samples, as shown in the right column of Fig. 5 (b). As for Joint Optimization, we can see from the left column of Fig. 5 (b) that its training process corrupts the originally clean labels, since it replaces all original labels with prediction targets without considering whether they are clean or not.

To further analyze the effectiveness of the network $g_1$, we compared its learned coefficient $\alpha$ with a set of manually set values on CIFAR-10 and CIFAR-100. It can be observed from Table 3 that the performance is worst when $\alpha$ is set to 0, which means that directly choosing the predictions of the current model cannot accurately correct the original labels. On the other hand, the best manually set $\alpha$ changes with the dataset: for CIFAR-10, the best test accuracy is 91.04 in the $\alpha = 0.4$ case, while for CIFAR-100 the best is 68.52 with $\alpha = 0.2$. Compared with setting the hyper-parameter manually, our algorithm learns it more flexibly and achieves the best performance in both test accuracy and corrected label accuracy.

4 Related Work

Sample Selection: The main idea of this approach is to filter clean samples out of the data and train the learner only on these selected ones. Some methods along this line design specific selection strategies. For example, Decouple [27] utilized two networks, selected the samples on which the two networks' label predictions disagree, and used them for updating. Similarly, Co-teaching [12] also used two networks, but chose small-loss samples as clean ones for each network. Other methods tend to select clean samples by assigning weights to the losses of all training samples and iteratively updating these weights based on the loss values during the training process. A typical method is SPL (self-paced learning), which assigns smaller weights to samples with larger losses, since those are more likely to be noisy [20, 15, 49]. Very recently, inspired by the idea of meta-learning, some advanced sample re-weighting methods have been proposed. Typically, MentorNet [16] pre-trained an additional teacher network with clean samples to guide the training process. Ren et al. [32] introduced a small set of validation data into the training procedure and re-weighted the backward losses of the mini-batch samples such that the updated gradient minimized the losses on those validation data. These methods usually have a more complex weighting scheme, which makes them able to deal with more general data biases and select clean samples more accurately. In these methods, however, most noisy data useful for learning visual representations [29, 10] are discarded from training, leaving large room for further performance improvement.

Label Correction: The traditional label correction approach aims to correct noisy labels into true ones through an additional inference step, such as conditional random fields [40], knowledge graphs [24], or directed graphical models [44]. More recently, the transition matrix approach assumes that there exists a probability matrix describing how true labels are most likely flipped into "noisy" ones. There are mainly two ways to estimate this noise transition matrix. One is to train the classifier by pre-estimating the noise transition matrix under the anchor-point prior assumption. The other is to jointly estimate the noise transition matrix and the classifier parameters in a unified framework without employing anchor points [37, 17, 11, 43]. Besides these, some other methods exploit the predictions of the network to rectify labels. For example, Joint Optimization [38] optimizes the parameters and updates the labels at the same time by using the averaged prediction results of the network. SELFIE [36] used the co-teaching strategy to select clean samples and progressively refurbish noisy samples by using the most frequently predicted labels of previously learned models. Arazo et al. [2] proposed a two-component Beta Mixture Model to decide whether the data is corrupted or not, and then corrected the labels by introducing the bootstrapping loss.

5 Conclusion

Combining with meta-learning, we proposed a novel label correction method that can adaptively ameliorate corrupted labels for robust deep learning when the training data are corrupted. Compared with current label correction methods, which use pre-fixed generation mechanisms and require manually set hyper-parameters, our method performs this task in a flexible, automatic, and adaptive data-driven manner. Experimental results show the consistent superiority of our method on datasets with different types and levels of noise. In future studies, we will try to construct a new structure for the meta soft label corrector whose input is not limited to loss information, so that a well-trained model can transfer to other datasets under different noise levels.

References

  • [1] S. E. Reed and H. Lee (2015) Training deep neural networks on noisy labels with bootstrapping. Accepted as a workshop contribution at ICLR, pp. 1–11.
  • [2] E. Arazo, D. Ortego, P. Albert, N. E. O'Connor, and K. McGuinness (2019) Unsupervised label noise modeling and loss correction. arXiv preprint arXiv:1904.11238.
  • [3] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In ICML.
  • [4] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. In NeurIPS.
  • [5] H. Chang, E. Learned-Miller, and A. McCallum (2017) Active bias: training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, pp. 1002–1012.
  • [6] P. Chen, B. B. Liao, G. Chen, and S. Zhang (2019) Understanding and utilizing deep neural networks trained with noisy labels. In ICML.
  • [7] M. Dehghani, A. Mehrjou, S. Gouws, J. Kamps, and B. Schölkopf (2017) Fidelity-weighted learning. arXiv preprint arXiv:1711.02799.
  • [8] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135.
  • [9] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910.
  • [10] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR.
  • [11] J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer.
  • [12] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pp. 8527–8537.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [14] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel (2018) Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10456–10465.
  • [15] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann (2014) Easy samples first: self-paced reranking for zero-example multimedia search. In Proceedings of the 22nd ACM International Conference on Multimedia, pp. 547–556.
  • [16] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313.
  • [17] I. Jindal, M. Nokleby, and X. Chen (2016) Learning deep networks from noisy labels with dropout regularization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 967–972.
  • [18] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS.
  • [20] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In NeurIPS.
  • [21] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
  • [22] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
  • [23] J. Li, R. Socher, and S. C. Hoi (2020) DivideMix: learning with noisy labels as semi-supervised learning. In ICLR.
  • [24] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li (2017) Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1910–1918.
  • [25] W. Liu, Y. Jiang, J. Luo, and S. Chang (2011) Noise resistant graph ranking for improved web image search. In CVPR 2011, pp. 849–856.
  • [26] X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. M. Erfani, S. Xia, S. Wijewickrema, and J. Bailey (2018) Dimensionality-driven learning with noisy labels. arXiv preprint arXiv:1806.02612.
  • [27] E. Malach and S. Shalev-Shwartz (2017) Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pp. 960–970.
  • [28] D. Meng, Q. Zhao, and L. Jiang (2017) A theoretical understanding of self-paced learning. Information Sciences 414, pp. 319–328.
  • [29] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710.
  • [30] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952.
  • [31] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2015) Training deep neural networks on noisy labels with bootstrapping. In ICLR.
  • [32] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343.
  • [33] Y. Shen and S. Sanghavi (2019) Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pp. 5739–5748.
  • [34] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019) Meta-Weight-Net: learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pp. 1917–1928.
  • [35] J. Shu, Q. Zhao, Z. Xu, and D. Meng (2020) Meta transition adaptation for robust deep learning with noisy labels. arXiv preprint arXiv:2006.05697.
  • [36] H. Song, M. Kim, and J. Lee (2019) SELFIE: refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907–5915.
  • [37] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus (2014) Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080.
  • [38] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560.
  • [39] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
  • [40] A. Vahdat (2017) Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pp. 5596–5605.
  • [41] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia (2018) Iterative learning with open-set noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8688–8696.
  • [42] P. Welinder, S. Branson, P. Perona, and S. J. Belongie (2010) The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, pp. 2424–2432.
  • [43] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama (2019) Are anchor points really indispensable in label-noise learning? In Advances in Neural Information Processing Systems, pp. 6835–6846.
  • [44] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699.
  • [45] J. Yao, H. Wu, Y. Zhang, I. W. Tsang, and J. Sun (2019) Safeguarded dynamic label regression for noisy supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9103–9110.
  • [46] K. Yi and J. Wu (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In CVPR.
  • [47] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
  • [48] Z. Zhang and M. Sabuncu (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pp. 8778–8788.
  • [49] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann (2015) Self-paced learning for matrix factorization. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Appendix A More Setting Details on Our Method

Network Structure.  For the classifier network, we choose ResNet-34. For the meta-learners, inspired by [34], we adopt a single multilayer perceptron (MLP) with one hidden layer containing 100 nodes in both networks $g_1$ and $g_2$ to output the weight.

Synthetic Datasets.  We conducted these experiments across both synthetic datasets (i.e., CIFAR-10 and CIFAR-100 with different types and levels of noise) under the same configuration, leading to consistent improvements over the state of the art. Our proposed meta label corrector was trained in two steps: first, a warm-up stage trained with only the cross-entropy loss; second, the two meta-learners are introduced to correct labels under the guidance of a small set of meta data with clean labels. We used SGD with a momentum of 0.9 and weight decay, and a batch size of 100. The learning rate is set to 0.1 and divided by 10 after 80 and 100 epochs, for a total of 120 epochs. After training the first step for 80 epochs, we used Adam to train the two meta-learners.

Clothing1M Data.  In training on the Clothing1M dataset, we used ResNet-50 pre-trained on ImageNet to align the experimental condition with previous studies [30, 38]. For preprocessing, we resized the images, performed mean subtraction, and cropped the middle region. For the classifier network, we used SGD with a momentum of 0.9, weight decay, and a batch size of 32. The initial learning rate is divided by 10 after 5 epochs. We trained the network for 10 epochs and began updating labels from the 2nd epoch (i.e., we only warm up for the 1st epoch). For the meta-learners, we used Adam to optimize the training process.

Accuracy of Corrected Labels.  In Section 3.1, we plot the accuracy of corrected labels to show the effectiveness of our proposed method. Since both Joint Optimization [38] and U-correction [2] have warm-up operations, for a more comprehensive comparison we let the three methods (i.e., Joint Optimization, U-correction, and ours) begin to correct labels from the 80th epoch. For MW-Net, we simply follow its original setting (i.e., sample re-weighting starts from the 1st epoch without warm-up). We normalized its sample weights and consider the samples whose weight is greater than 0.5 as its preserved clean samples (the remaining samples are treated as corrupted). From the perspective of label correction, its corrected label accuracy is the proportion of the originally clean samples that it retains.
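A small sketch of this evaluation convention, under the assumptions just stated (min-max normalization of the weights and the 0.5 threshold; the helper name is illustrative):

```python
import numpy as np

def mwnet_corrected_label_accuracy(weights, is_clean):
    """weights: raw per-sample MW-Net weights; is_clean: boolean ground-truth
    mask. Returns the fraction of originally clean samples that are retained
    (normalized weight > 0.5)."""
    w = (weights - weights.min()) / (weights.max() - weights.min() + 1e-12)
    preserved = w > 0.5
    return (preserved & is_clean).sum() / is_clean.sum()
```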

Appendix B More Experimental Results

To further analyze the corrected label accuracy of the different methods demonstrated in Section 3.1, we plot Fig. 6 to show the accuracy on clean and noisy labels concretely. Fig. 6 (a)(c) reflect how many originally clean samples are rectified mistakenly, and Fig. 6 (b)(d) represent how many originally noisy samples are corrected accurately.

From Fig. 6 (b), it can be seen that the accuracy on noisy labels decreases during the training process of U-correction [2]. That is because it uses an unsupervised clustering method to split the clean and noisy samples, which easily treats most samples as clean ones when processing imbalanced data where the clean samples are the majority. Joint Optimization [38] performs well on the correction of noisy samples, as shown in Fig. 6 (b)(d). By simply using predicted labels to replace the original ones, however, this strategy also causes the critical issue that many originally clean samples are corrupted, as shown in Fig. 6 (a)(c).