1 Introduction
The remarkable success of deep neural networks (DNNs) on various tasks relies heavily on pre-collected large-scale datasets with high-quality annotations [13, 19]. However, practical annotated training datasets often contain a certain amount of noisy (incorrect) labels, which easily cause overfitting and lead to poor generalization of the trained DNNs [47]. Such biased training data are commonly encountered in practice because of the coarse annotation sources used to collect them, such as web searches [25] and crowdsourcing [42]. Therefore, how to train DNNs robustly with such biased training data is a critical issue in the current machine learning field.
To address this problem, various methods have been proposed [2, 34, 16], which can be coarsely categorized as sample selection and label correction approaches. Sample selection approaches tackle this challenge mainly by adopting sample reweighting schemes that impose importance weights on samples according to their loss. Typical methods include boosting and self-paced learning [20, 28]. Recently, some pioneering works [32, 34] have made such weighting schemes more adaptive and automatic by employing a small set of clean meta data to guide the network training process. All these methods are built on discarding the suspected noisy samples during training. However, these corrupted samples contain beneficial information that could improve the accuracy and robustness of the network, especially in large noise-ratio scenarios [5].
Label correction approaches alleviate this issue by attempting to find noisy labels and correct them to their underlying true ones. For example, [14], [30], and [35] revised class probabilities by estimating the noise transition matrix, aiming to recover the underlying ground-truth label distribution and guide the training process towards the correct classes. However, owing to the difficulty of estimating the noise transition matrix or the true labels, the network training can easily accumulate errors, especially when the number of classes or mislabeled samples is large [34, 16]. Another common methodology is to directly rectify the noisy labels by exploiting the predictions of the network. For example, Reed et al. [31] employed a bootstrapping loss that incorporates a perceptual consistency term (assigning a new label generated by a convex combination of the current network prediction and the original noisy label) into the learning process. Along this research line, SELFIE [36] uses the co-teaching strategy to select clean samples and progressively refurbishes noisy labels with the labels most frequently predicted by previously learned models. Another typical work is Joint Optimization [38], which uses two progressive steps to alternately update the labels of the whole dataset and the classifier weights, based on the knowledge delivered during the iterations of the algorithm. Besides, Ucorrection [2] built a two-component Beta Mixture Model (BMM) to estimate the probability of a sample being mislabeled and corrected noisy labels with a bootstrapping loss.
From the perspective of label correction, all these methods can be viewed as different means of generating soft labels to replace the original targets. Although they can correct noisy labels to a certain extent, their performance relies heavily on the reliability of the generated soft labels, which depends on the accuracy of the classifier trained on the noisy dataset. When the classifier performs poorly, the false label information it supplies further degrades the quality of the obtained classifier. Moreover, these methods usually need proper hyperparameters to be preset manually to fit different training data, which makes them hard to generalize to the varied and diverse scenarios of real cases.
To solve the above problems, in this paper we design a meta soft label corrector (MSLC), which corrects corrupted labels iteratively from the angle of meta-learning. Concretely, we treat the label correction procedure as a two-stage optimization process. In the first stage, we generate soft labels through MSLC by utilizing the original targets and different temporal information of predictions from the base model, and then update the MSLC by a gradient descent step that minimizes the loss on clean meta data. In the second stage, we train the base learner to fit the soft labels generated by MSLC in the first stage. By optimizing the two stages alternately, the method effectively utilizes the guidance of meta data and improves performance under noisy labels. The contributions of this paper can be summarized as follows:

Our method proposes a meta soft label corrector that maps an input label to a corrected soft label without using conventional predefined generating rules, making the label correction process more flexible and easily adaptable to complicated real datasets with different types and levels of noise.

Under the guidance of noise-free meta data, our method adaptively makes use of the temporal predictions of the model to generate more accurate pseudo-labels without manually preset combination coefficients.

Our proposed model is model-agnostic and can be added on top of any existing model at hand. Comprehensive synthetic and real experiments validate the superiority of the proposed method on robust deep learning with noisy labels. This can be interpreted by its clearly better capability of distinguishing noisy from clean labels and the superior quality of the new soft labels generated by MSLC.
2 Meta Soft Label Corrector
2.1 Analysis of the existing label correction methods
Consider a c-class classification problem with a feature space and a label space. The training data consist of samples, each paired with an observed (possibly noisy) label. The network is a function that maps an input from the feature space to an output in the label space and is determined by its network parameters. To learn the model from the given dataset, these parameters are optimized with a chosen loss function.
Label correction methods focus on generating more accurate pseudo-labels that replace the original noisy ones and thereby improve the performance of the classifier. For example, Reed et al. [31] proposed a static hard bootstrapping loss to deal with label noise, in which the training objective for the current step is
(1) 
where the classifier's predicted label at the current step is combined with the original target, through a preset mixing parameter, into a soft pseudo-label that replaces the original target, and the loss is a chosen loss function. With a formulation similar to Eq. (1), other methods design their own strategies to generate pseudo-labels. For example, SELFIE [36] sets a threshold to separate low-loss instances as clean samples, decides which samples are corrupted according to the volatility of their predictions, and then corrects these samples with the label most frequently predicted in previous iterations. Furthermore, Arazo et al. [2] learned the mixing parameter dynamically for every sample by using a Beta Mixture Model, an unsupervised method that groups the sample losses into two categories, and chose the prediction of the current step as the predicted label, similar to Eq. (1).
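As a hedged illustration of the hard bootstrapping idea described above, the following sketch builds the soft target as a convex combination of the observed one-hot label and the one-hot form of the classifier's current prediction, and then scores it with cross-entropy. The function names and the convention that `lam` weights the observed label are assumptions made for this example, not notation taken from the text.

```python
import math


def hard_bootstrap_target(noisy_onehot, probs, lam):
    """Mix the observed one-hot label with the one-hot predicted class.

    lam is an assumed preset weight on the observed label; the remaining
    1 - lam goes to the classifier's current hard prediction.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    predicted = probs.index(max(probs))
    hard_pred = [1.0 if k == predicted else 0.0 for k in range(len(probs))]
    return [lam * y + (1.0 - lam) * z for y, z in zip(noisy_onehot, hard_pred)]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a (soft) target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))
```

With `lam = 1.0` the target reduces to the observed label, and with `lam = 0.0` it reduces to the classifier's own hard prediction, which makes the sensitivity to the preset parameter easy to see.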
Different from the form of Eq. (1), Joint Optimization [38] trained the model on the original targets with the cross-entropy loss at a large learning rate for several epochs, and then used the network predictions to generate pseudo-labels without using the original labels. The loss function they used is
(2) 
where the pseudo-labels are the average of the predictions from past epochs. With finely set hyperparameters, it can achieve robust performance.
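The averaging of predictions over past epochs can be sketched as below. This is a minimal illustration under stated assumptions: the class name `PredictionAverager`, the fixed-size window, and the per-sample list layout are hypothetical choices for the example, not details given in the text.

```python
from collections import deque


class PredictionAverager:
    """Keep the predictions of the last few epochs and average them."""

    def __init__(self, num_epochs):
        # Only the most recent num_epochs epochs are retained (assumed window).
        self.history = deque(maxlen=num_epochs)

    def update(self, epoch_probs):
        """Store one epoch of per-sample class-probability vectors."""
        self.history.append([list(p) for p in epoch_probs])

    def pseudo_labels(self):
        """Return the per-sample average over the stored epochs."""
        if not self.history:
            raise ValueError("no predictions recorded yet")
        n_epochs = len(self.history)
        n_samples = len(self.history[0])
        n_classes = len(self.history[0][0])
        return [
            [sum(epoch[i][k] for epoch in self.history) / n_epochs
             for k in range(n_classes)]
            for i in range(n_samples)
        ]
```

Averaging across epochs smooths out the epoch-to-epoch variation of individual predictions, which is the motivation given for using earlier predictions rather than only the current one.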
It can be seen that the existing label correction methods exploit a manually set mechanism for correcting labels. However, compared with designing a mechanism for one specific problem, it is more difficult to construct a single label correction methodology that adapts well to different applications and datasets, which constitutes the main task of this work.
Moreover, these methods may cause severe error accumulation because of the low quality of the new soft labels that replace the original ones. Bootstrap [31] and Ucorrection [2] combined the observed label with the current prediction to generate new soft labels. However, the predictions of the base model vary significantly, especially for samples whose labels are corrupted. Joint Optimization [38] used the predictions of earlier networks to alleviate this problem, but it replaced all the observed targets with the new soft labels regardless of whether they were clean, which may wrongly correct clean original labels.
2.2 Structure of the proposed MSLC
To alleviate the aforementioned issues of current methods, we want to build a learning framework that generates pseudo-labels with the following data-adaptive label corrector at each training step:
(3) 
where the corrector takes the original label together with side information that is helpful for label correction and outputs a soft pseudo-label, and the function involves its own hyperparameters. The remaining questions are how to specify the side information and the parametric form of the corrector, and how to learn its hyperparameters.
With the meta soft label corrector of Eq. (3), the final training objective for the current step can be written as:
(4) 
Synthesizing the experience analyzed in the previous section, we use the current network prediction and an earlier generated pseudo-label as the side information in Eq. (3) for correcting the input label [Footnote 1: More of the earlier generated pseudo-labels could easily be adopted in our method. Our experiments show that a single one already guarantees good performance, so we use this simple setting, but using more could readily be explored in the future. In this sense, both pseudo-label utilization manners (e.g., [31] and [38]) introduced in Section 2.1 can be seen as special cases of ours, but with manually preset combination coefficients instead of ones learned automatically from data.], i.e.,
(5) 
It is worth noting that Ucorrection [2] adopts an unsupervised model to learn the hyperparameter. However, possibly because the unsupervised model and the base classifier are updated alternately, it fits a fixed loss distribution well but cannot split noisy samples accurately during training (see Section 3.2). To alleviate these issues, we view the label correction procedure as a meta-process and use a meta-learner to correct labels automatically. Inspired by [31] and [38], we simply set the corrected label to be a convex combination of the original label, the current prediction, and the earlier pseudo-label. That is:
(6) 
where the two coefficients of this convex combination are the outputs of two networks, each with its own parameters. These two coefficient networks constitute the main parts of our proposed soft label corrector, which is illustrated in Fig. 1. Through the two networks, the input target information (the original label, the current prediction, and the earlier pseudo-label) is combined convexly to form a new soft target, which replaces the original label in the training process.
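A convex combination of three targets controlled by two coefficients can be sketched as follows. The nested form used here, and the names `alpha` and `beta`, are assumptions chosen for illustration; in the method described above the two coefficients would be produced by the coefficient networks, whereas this sketch simply takes them as numbers in [0, 1].

```python
def corrected_soft_label(original, pred_now, pred_earlier, alpha, beta):
    """Combine three target vectors convexly (assumed nested parameterization).

    alpha weights the original label; the remaining mass 1 - alpha is split
    between the current prediction (beta) and the earlier pseudo-label
    (1 - beta). All weights are non-negative and sum to one.
    """
    for coeff in (alpha, beta):
        if not 0.0 <= coeff <= 1.0:
            raise ValueError("coefficients must lie in [0, 1]")
    return [
        alpha * y + (1.0 - alpha) * (beta * p + (1.0 - beta) * q)
        for y, p, q in zip(original, pred_now, pred_earlier)
    ]
```

Because the weights are non-negative and sum to one, the output remains a valid probability vector whenever the three inputs are. Fixing `alpha` and `beta` by hand corresponds to the manually preset schemes discussed earlier.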
Our proposed MSLC exploits meta-learning and distinguishes noisy from clean samples better than the unsupervised approach. We also take more temporal prediction information into consideration, so the generated soft labels are more accurate and severe error accumulation is effectively prevented.
2.3 Training with meta dataset
We now introduce how to learn the hyperparameters of the MSLC in Eq. (6). We employ a meta-data-driven learning regime as used in [34], which exploits a small but noise-free dataset (i.e., meta data) to learn the hyperparameters for the training samples. The meta dataset contains the meta-knowledge of the underlying label distribution of clean samples, so it is rational to exploit it as a sound guide for estimating the hyperparameters in our task. Such data can be seen as conventional validation data (but of high quality), much smaller in size than the training data, and thus feasible to pre-collect. In this work, we denote the meta dataset as
(7) 
where the number of samples in the meta dataset is much smaller than the number of training samples. By utilizing the meta dataset, we can then design the entire training framework for the noisy label correction model of Eq. (4).
Specifically, we formulate the following bilevel minimization problem:
(8) 
where the outer objective is the meta loss on the meta dataset. After obtaining the optimal hyperparameters, we get the soft label corrector, which tends to correct noisy labels into true ones and thereby improves the quality of the trained classifier.
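One way to write such a bilevel problem is sketched below. All symbols are assumptions introduced for this illustration: $w$ for the classifier parameters, $\theta$ for the corrector hyperparameters, $g_{\theta}$ for the corrector with side information $I_i$, $f$ for the classifier, $l$ for the loss, and $N$ and $M$ for the sizes of the training and meta datasets.

```latex
\begin{aligned}
\theta^{*} &= \arg\min_{\theta}\; \frac{1}{M}\sum_{j=1}^{M}
  l\bigl(y_j^{meta},\, f(x_j^{meta};\, w^{*}(\theta))\bigr), \\
\text{s.t.}\quad w^{*}(\theta) &= \arg\min_{w}\; \frac{1}{N}\sum_{i=1}^{N}
  l\bigl(g_{\theta}(y_i, I_i),\, f(x_i;\, w)\bigr).
\end{aligned}
```

The inner problem fits the classifier to the corrected soft labels, while the outer problem evaluates the resulting classifier on the clean meta data, so the corrector is judged only through the classifier it induces.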
Optimizing the parameters and hyperparameters of Eq. (8) requires two nested loops of optimization, which tends to be computationally inefficient [9]. We thus exploit SGD to speed up the algorithm by approximately solving the problem in a mini-batch updating manner [34, 8], jointly improving the classifier parameters and the corrector hyperparameters. The algorithm flowchart is shown in Fig. 2.
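The algorithm mainly includes the following steps. First, given a mini-batch of training samples, the training loss is evaluated on that mini-batch with the soft labels produced by the corrector. We can then derive the one-step update of the classifier parameters, expressed as a function of the corrector hyperparameters, as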
(9) 
where the step size is the learning rate. Then, with the current mini-batch of meta data samples, we perform a one-step update of the corrector hyperparameters by descending the meta loss, that is
(10) 
After obtaining the updated hyperparameters, we calculate the pseudo-labels by Eq. (3) and update the classifier parameters, that is
(11) 
The predicted pseudo-labels can then be updated by using the current classifier with the updated parameters. The entire algorithm is summarized in Algorithm 1.
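The three alternating updates above can be illustrated with a deliberately tiny analogue. This is a sketch under strong simplifying assumptions, not the paper's algorithm: the classifier is a single scalar weight `w` in a one-dimensional regression with squared loss (instead of a network with cross-entropy), the corrector has one scalar hyperparameter `theta` whose sigmoid mixes the noisy target with a fixed earlier prediction, and the meta-gradient is derived by hand via the chain rule. All names and default step sizes are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_targets(theta, noisy_y, earlier_pred):
    """Corrected targets: a * noisy label + (1 - a) * earlier prediction."""
    a = sigmoid(theta)
    return [a * y + (1.0 - a) * p for y, p in zip(noisy_y, earlier_pred)]


def grad_w(w, xs, targets):
    """Gradient w.r.t. w of mean 0.5 * (w * x - target)^2."""
    return sum((w * x - t) * x for x, t in zip(xs, targets)) / len(xs)


def meta_loss(w, meta_x, meta_y):
    """Mean squared loss on the clean meta data."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in zip(meta_x, meta_y)) / len(meta_x)


def train(w, theta, xs, noisy_y, earlier_pred, meta_x, meta_y,
          lr=0.05, meta_lr=0.5, steps=200):
    for _ in range(steps):
        a = sigmoid(theta)
        # Step 1: virtual one-step update of w, viewed as a function of theta.
        w_virtual = w - lr * grad_w(w, xs, soft_targets(theta, noisy_y, earlier_pred))
        # Step 2: update theta by descending the meta loss at w_virtual.
        dmeta_dw = grad_w(w_virtual, meta_x, meta_y)
        dtarget_dtheta = [a * (1.0 - a) * (y - p)
                          for y, p in zip(noisy_y, earlier_pred)]
        dw_dtheta = lr * sum(x * d for x, d in zip(xs, dtarget_dtheta)) / len(xs)
        theta = theta - meta_lr * dmeta_dw * dw_dtheta
        # Step 3: actual update of w with targets from the updated corrector.
        w = w - lr * grad_w(w, xs, soft_targets(theta, noisy_y, earlier_pred))
    return w, theta
```

In a practical implementation the hand-derived chain rule in Step 2 would be replaced by automatic differentiation through the virtual update; the toy only shows how the meta loss reaches the corrector's hyperparameter through the one-step look-ahead.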
3 Experimental Results
To evaluate the capability of the proposed method, we conduct experiments on CIFAR-10 and CIFAR-100 [18] under different types and levels of noise, as well as on a real-world large-scale noisy dataset, Clothing1M [44]. Both CIFAR-10 and CIFAR-100 contain 50k training images and 10k test images of size 32 × 32. For CIFAR-10/100, we use two types of label noise: symmetric and asymmetric. Symmetric: We follow [47, 38] for label noise addition, which generates label corruptions by flipping the labels of a given proportion of training samples to one of the class labels uniformly (so the true label may be randomly retained). Asymmetric: We use the setting in [45], which is designed to mimic the structure of real-world label noise. Concretely, we set a probability of flipping a label to its similar class, e.g., truck → automobile, bird → airplane, deer → horse, cat ↔ dog. For CIFAR-100, a similar probability is set, but label flips only happen within each super-class, as described in [14].
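The two synthetic noise types can be sketched as follows. The function names, the seeding scheme, and the exact sampling procedure are assumptions made for illustration; the class indices follow the standard CIFAR-10 ordering (airplane 0, automobile 1, bird 2, cat 3, deer 4, dog 5, frog 6, horse 7, ship 8, truck 9), and treating cat and dog as flipping in both directions is likewise an assumption.

```python
import random

# Assumed CIFAR-10 flips: truck->automobile, bird->airplane, deer->horse,
# and cat<->dog in both directions.
CIFAR10_ASYMMETRIC_MAP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}


def symmetric_noise(labels, ratio, num_classes, seed=0):
    """Redraw the labels of a given proportion of samples uniformly.

    The new label is drawn from all classes, so the true label may be kept.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    num_flip = int(round(ratio * len(labels)))
    for i in rng.sample(range(len(labels)), num_flip):
        noisy[i] = rng.randrange(num_classes)
    return noisy


def asymmetric_noise(labels, prob, mapping=CIFAR10_ASYMMETRIC_MAP, seed=0):
    """Flip each label to its similar class with the given probability."""
    rng = random.Random(seed)
    return [
        mapping[y] if y in mapping and rng.random() < prob else y
        for y in labels
    ]
```

Because the symmetric redraw can return the original class, the fraction of labels that actually change is somewhat lower than the nominal noise ratio.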
Baselines. The compared methods include: Finetuning, which finetunes the result of CrossEntropy on the meta data to further enhance its performance; GCE [48], which employs a robust loss combining the benefits of both the CE loss and the mean absolute error loss against label noise; GLC [14], which estimates the noise transition matrix by using a small clean-label dataset; MWNet [34], which uses an MLP net to learn the weighting function; Bootstrap [31], which deals with label noise by adding a perceptual term to the standard CE loss; Joint Optimization [38], which updates the labels and the model at the same time by using the pseudo-labels it generates; and Ucorrection [2], which models the sample loss with a BMM and applies MixUp. For a fair comparison, we only compare with its proposed method without mixup augmentation.
Experiment Details. We use ResNet34 [13] as the classifier network for all baseline experiments in Table 1. We use two multilayer perceptrons (MLPs) with 100 hidden layers as the network structures of the two coefficient networks. In the proposed method, we choose cross-entropy as the loss function and begin to correct labels at the 80th epoch (i.e., there is an initial warm-up).
Table 1: Test accuracy of all competing methods on CIFAR-10 and CIFAR-100 under symmetric and asymmetric noise (best, and average over the last 5 epochs).
Noise type             |      | Symmetric Noise                                               | Asymmetric Noise
Dataset                |      | CIFAR-10                      | CIFAR-100                     | CIFAR-10      | CIFAR-100
Method / Noise ratio   |      | 0.2   | 0.4   | 0.6   | 0.8   | 0.2   | 0.4   | 0.6   | 0.8   | 0.2   | 0.4   | 0.2   | 0.4
CrossEntropy           | Best | 90.22 | 87.33 | 83.2  | 54.79 | 68.03 | 61.18 | 46.43 | 17.91 | 92.85 | 90.22 | 69.05 | 65.14
CrossEntropy           | Last | 86.33 | 79.61 | 72.99 | 54.26 | 63.67 | 46.92 | 30.96 | 8.29  | 91.29 | 87.23 | 63.68 | 50.10
Finetuning             | Best | 91.17 | 87.34 | 83.75 | 56.28 | 67.81 | 62.55 | 50.82 | 19.05 | 93.11 | 91.04 | 69.55 | 65.75
Finetuning             | Last | 88.27 | 82.16 | 79.36 | 54.82 | 63.97 | 51.14 | 38.22 | 18.86 | 92.35 | 89.49 | 66.43 | 55.08
GCE[48]                | Best | 90.27 | 88.50 | 83.70 | 57.27 | 71.36 | 63.39 | 58.06 | 16.51 | 90.11 | 85.24 | 69.56 | 57.50
GCE[48]                | Last | 90.15 | 88.01 | 82.87 | 57.22 | 71.02 | 52.15 | 45.31 | 15.71 | 89.33 | 82.04 | 66.36 | 56.81
GLC[14]                | Best | 91.43 | 88.52 | 84.08 | 64.21 | 69.30 | 63.24 | 56.12 | 18.59 | 92.46 | 91.74 | 71.40 | 67.73
GLC[14]                | Last | 90.13 | 87.04 | 82.63 | 62.19 | 66.62 | 59.03 | 51.96 | 8.08  | 92.41 | 91.02 | 70.01 | 66.68
MWNet[34]              | Best | 91.48 | 87.34 | 81.98 | 65.88 | 69.79 | 65.44 | 55.42 | 19.62 | 93.44 | 91.64 | 67.54 | 60.24
MWNet[34]              | Last | 90.11 | 86.42 | 81.62 | 64.78 | 68.37 | 64.81 | 55.04 | 19.20 | 91.95 | 90.88 | 66.71 | 59.53
Bootstrap[31]          | Best | 91.46 | 88.75 | 84.03 | 63.80 | 69.79 | 63.73 | 57.20 | 17.63 | 93.08 | 91.18 | 70.93 | 67.82
Bootstrap[31]          | Last | 88.00 | 83.57 | 78.69 | 63.41 | 63.00 | 47.08 | 35.86 | 17.04 | 91.02 | 85.59 | 63.46 | 49.18
Joint Optimization[38] | Best | 90.85 | 90.27 | 86.49 | 66.39 | 63.84 | 59.82 | 49.13 | 18.95 | 93.39 | 91.43 | 66.90 | 64.82
Joint Optimization[38] | Last | 89.77 | 88.58 | 85.57 | 65.92 | 60.10 | 56.85 | 47.68 | 17.38 | 92.12 | 90.20 | 66.69 | 59.31
Ucorrection[2]         | Best | 92.05 | 89.07 | 85.64 | 68.23 | 68.37 | 62.37 | 55.19 | 17.10 | 91.85 | 90.34 | 67.71 | 66.75
Ucorrection[2]         | Last | 90.21 | 85.45 | 83.15 | 64.78 | 67.42 | 55.40 | 55.04 | 9.33  | 90.92 | 84.31 | 63.82 | 60.64
Ours                   | Best | 93.46 | 91.42 | 87.39 | 69.87 | 72.51 | 68.98 | 60.81 | 24.32 | 94.39 | 92.81 | 72.66 | 70.51
Ours                   | Last | 93.38 | 91.21 | 87.25 | 68.88 | 72.02 | 68.70 | 60.25 | 20.53 | 94.11 | 92.48 | 70.20 | 69.24
3.1 Comparison with StateoftheArt Methods
Table 1 shows the results of all competing methods on CIFAR-10 and CIFAR-100 under symmetric and asymmetric noise as described above. To compare the methods in more detail, we report both the best test accuracy and the test accuracy averaged over the last 5 epochs. It can be observed that our method achieves the best performance on both datasets and at all noise rates. Specifically, even under relatively high noise ratios on CIFAR-10 with symmetric noise, our algorithm retains competitive classification accuracy. It is worth noting that Ucorrection achieves a best accuracy comparable with ours, while its accuracy decreases in later training (compare its Best and Last rows in Table 1), probably due to error accumulation. This indicates that our proposed meta soft label corrector converges better under the guidance of meta data during training. It can also be seen that MWNet performs poorly under asymmetric noise, possibly because all classes share one weighting function in that method, which is unreasonable when the noise is asymmetric. By comparison, our proposed MSLC has a higher degree of freedom and thus performs much better with asymmetric noise.
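The two reported quantities, the best test accuracy and the average over the last 5 epochs, can be computed from a per-epoch accuracy log as in the sketch below. The function name and the list-based input are assumptions for illustration.

```python
def best_and_last(acc_per_epoch, last_k=5):
    """Return (best accuracy, mean accuracy over the final last_k epochs)."""
    if len(acc_per_epoch) < last_k:
        raise ValueError("need at least last_k epochs of accuracy values")
    best = max(acc_per_epoch)
    tail = acc_per_epoch[-last_k:]
    return best, sum(tail) / last_k
```

A large gap between the two values signals that a method peaks early and then degrades, which is how the Best and Last rows are read in the comparison above.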
Fig. 6 plots the corrected label accuracy, computed by comparing the hard form of the pseudo-labels of Eq. (3) with the ground truth. As can be seen in Fig. 6, the corrected labels generated by our method are the most accurate. The accuracy of MWNet is always below the proportion of clean samples, since it intrinsically tries to select the clean samples while ignoring the corrupted ones through its weighting mechanism. From Fig. 6 (a)(c), we can see that the corrected label accuracy of Ucorrection decreases slightly, which might be caused by its massive false corrections (this is analyzed further in Section 3.2). Moreover, although the accuracy of JointOptimization increases all the time, its performance is limited by its strategy of using the pseudo-labels to replace all the targets, which risks corrupting the original clean labels (see also Section 3.2).
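As a minimal sketch of this metric, the corrected label accuracy can be computed by taking the arg-max (hard form) of each soft pseudo-label and comparing it with the ground-truth class. The function name and input layout are assumptions for illustration.

```python
def corrected_label_accuracy(soft_labels, true_labels):
    """Fraction of samples whose hard (arg-max) pseudo-label is correct."""
    if len(soft_labels) != len(true_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    hard = [p.index(max(p)) for p in soft_labels]
    correct = sum(1 for h, t in zip(hard, true_labels) if h == t)
    return correct / len(true_labels)
```

Since the ground-truth labels of the training set are needed, this quantity is only available in synthetic-noise experiments where the clean labels are known.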
Table 2 shows the results on the real noisy dataset Clothing1M, which consists of 1 million clothing images belonging to 14 classes (e.g., T-shirt, Shirt, Knitwear) from online shopping websites, plus additional smaller sets with clean labels for validation (14K) and testing (10K). Since the labels are generated from the surrounding texts of the images provided by the sellers, they contain many erroneous labels. From Table 2, it can be observed that the proposed method achieves the best performance, which indicates that our meta soft label corrector can be applied to complicated real datasets.
3.2 Analysis of the proposed MSLC



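Table 3: Test accuracy and corrected label accuracy on CIFAR-10 and CIFAR-100 with manually set coefficient values versus the learned one (Ours).
Dataset                  |      | CIFAR-10                                      | CIFAR-100
Coefficient value        |      | 0     | 0.2   | 0.4   | 0.6   | 0.8   | Ours  | 0     | 0.2   | 0.4   | 0.6   | 0.8   | Ours
Accuracy                 | Best | 89.84 | 90.49 | 91.04 | 90.34 | 89.46 | 91.27 | 67.42 | 68.52 | 68.25 | 67.13 | 67.08 | 68.84
Accuracy                 | Last | 89.46 | 90.19 | 90.91 | 89.64 | 89.20 | 91.11 | 66.93 | 68.06 | 67.83 | 66.61 | 66.24 | 68.35
Corrected Label Accuracy |      | 92.23 | 93.36 | 94.24 | 93.44 | 91.94 | 94.52 | 81.47 | 83.29 | 83.04 | 81.28 | 81.24 | 83.98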
Fig. 4 shows the confusion matrices of our method under symmetric and asymmetric noise on CIFAR-10. The left column of Fig. 4 (a) and (b) is the noise transition matrix, which is the guideline for generating the synthesized noisy datasets. The right column is the matrix after correction by our proposed method, whose x-axis denotes the hard-form corrected labels. Comparing the left and right columns of Fig. 4 (a) and (b), we can see that most diagonal terms are high after correction, which indicates the high correction accuracy of our proposed MSLC.
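A matrix of this kind can be sketched as a row-normalized confusion matrix between true labels and hard corrected labels. Placing the true classes on the rows and the corrected classes on the columns is an assumption consistent with the x-axis description above; the function name is hypothetical.

```python
def confusion_matrix(true_labels, corrected_labels, num_classes):
    """Row-normalized counts: entry [t][c] is P(corrected = c | true = t)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, c in zip(true_labels, corrected_labels):
        counts[t][c] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Classes that never occur get an all-zero row instead of dividing by 0.
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix
```

Applied to the noisy labels instead of the corrected ones, the same function yields an empirical estimate of the noise transition matrix, so the two columns of the figure can be compared entry by entry.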
Fig. 5 shows the output weights of the coefficient network and the corrected label accuracy on clean and noisy samples, respectively. From Fig. 5 (a), we can see that the weights of clean and noisy samples are significantly different, which means that our meta soft label corrector tends to keep the original labels when they are clean and to rely on the other target information when they are noisy. Fig. 5 (b) shows that our method corrects a large share of the noisy samples while retaining the original clean samples. It is worth noting that Ucorrection retains more than 99% of clean samples; however, our experiments show that this is because it tends to treat most samples as clean during training, which limits its ability to correct noisy samples, as shown in the right column of Fig. 5 (b). As for JointOptimization, the left column of Fig. 5 (b) shows that its training process corrupts the original clean labels, since it replaces all original labels with prediction targets without considering whether they are clean.
To further analyze the effectiveness of the coefficient network, we compare its learned coefficient with a set of manually set values on CIFAR-10 and CIFAR-100. It can be observed from Table 3 that performance is relatively poor when the coefficient is set to 0, which means that directly choosing the predictions of the current model cannot accurately correct the original labels. On the other hand, the best manually set value changes with the dataset. Specifically, for CIFAR-10 the best test accuracy among the manual settings is 91.04, obtained at 0.4, while for CIFAR-100 the best is 68.52, obtained at 0.2. Compared with setting the hyperparameter manually, our algorithm learns it more flexibly and achieves the best performance in both test accuracy and corrected label accuracy.
4 Related Work
Sample Selection: The main idea of this approach is to filter out clean samples from the data and train the learner only on these selected ones. Some methods along this line design their own selection strategies. For example, Decouple [27] utilizes two networks, selects the samples on which their label predictions differ, and uses them for updating. Similarly, Coteaching [12] also uses two networks, but chooses small-loss samples as clean ones for each network. Other methods select clean samples by assigning weights to the losses of all training samples and iteratively updating these weights based on the loss values during training. A typical method is SPL (self-paced learning), which assigns smaller weights to samples with larger loss, since they are more likely to be noisy [20, 15, 49]. Very recently, inspired by the idea of meta-learning, some advanced sample reweighting methods have been proposed. Typically, MentorNet [16] pre-trains an additional teacher network with clean samples to guide the training process. Ren et al. [32] used a small set of validation data to guide the training procedure and reweighted the backward losses of the mini-batch samples such that the updated gradient minimized the losses on the validation data. These methods usually have a more complex weighting scheme, which enables them to deal with more general data bias and to select clean samples more accurately. In these methods, however, most noisy data that are useful for learning visual representations [29, 10] are discarded from training, which leaves large room for further performance improvement.
Label Correction: The traditional label correction approach aims to correct noisy labels to true ones through an additional inference step, such as conditional random fields [40, 24] or directed graphical models [44]. More recently, the transition matrix approach assumes that there exists a probability matrix according to which the true labels are flipped into "noisy" ones. There are mainly two approaches to estimating the noise transition matrix. One is to train the classifier by pre-estimating the noise transition matrix with the anchor point prior assumption. The other is to jointly estimate the noise transition matrix and the classifier parameters in a unified framework without employing anchor points [37, 17, 11, 43]. Besides, some other methods exploit the predictions of the network to rectify labels. For example, Joint Optimization [38] optimizes the parameters and updates the labels at the same time by using the average prediction results of the network. SELFIE [36] uses the co-teaching strategy to select clean samples and progressively refurbishes noisy samples by using the labels most frequently predicted by previously learned models. Arazo et al. [2] proposed a two-component Beta Mixture Model to decide whether a sample is corrupted, and then corrected it by introducing the bootstrapping loss.
5 Conclusion
Combining with meta-learning, we proposed a novel label correction method that adaptively corrects corrupted labels for robust deep learning when the training data are corrupted. Compared with current label correction methods that use a prefixed generation mechanism and require manually set hyperparameters, our method performs this task in a flexible, automatic, and adaptive data-driven manner. Experimental results show the consistent superiority of our method on datasets with different types and levels of noise. In future work, we will try to construct a new structure of the meta soft label corrector whose input is not only the loss information, so that a well-trained model can transfer to other datasets under different noise levels.
References
 [1] (2015) Training deep neural networks on noisy labels with bootstrapping. In Accepted as a workshop contribution at ICLR, pp. 1–11.
 [2] (2019) Unsupervised label noise modeling and loss correction. arXiv preprint arXiv:1904.11238. Cited by: Appendix A, Appendix B, §1, §1, §2.1, §2.1, §2.2, Table 1, Table 2, §3, §4.
 [3] (2017) A closer look at memorization in deep networks. In ICML,

[4]
(2019)
Mixmatch: a holistic approach to semisupervised learning
. In NeurIPS, 
[5]
(2017)
Active bias: training more accurate neural networks by emphasizing high variance samples
. In Advances in Neural Information Processing Systems, pp. 1002–1012. Cited by: §1.  [6] (2019) Understanding and utilizing deep neural networks trained with noisy labels. In ICML,
 [7] (2017) Fidelityweighted learning. arXiv preprint arXiv:1711.02799.
 [8] (2017) Modelagnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1126–1135. Cited by: §2.3.
 [9] (2018) Bilevel programming for hyperparameter optimization and metalearning. arXiv preprint arXiv:1806.04910. Cited by: §2.3.
 [10] (2018) Unsupervised representation learning by predicting image rotations. In ICLR 2018, Cited by: §4.
 [11] (2016) Training deep neuralnetworks using a noise adaptation layer. Cited by: §4.
 [12] (2018) Coteaching: robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pp. 8527–8537. Cited by: §4.
 [13] (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §3.
 [14] (2018) Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in neural information processing systems, pp. 10456–10465. Cited by: §1, Table 1, §3, §3.
 [15] (2014) Easy samples first: selfpaced reranking for zeroexample multimedia search. In Proceedings of the 22nd ACM international conference on Multimedia, pp. 547–556. Cited by: §4.
 [16] (2018) MentorNet: learning datadriven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: §1, §1, §4.
 [17] (2016) Learning deep networks from noisy labels with dropout regularization. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 967–972. Cited by: §4.
 [18] (2009) Learning multiple layers of features from tiny images. Cited by: §3.
 [19] (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §1.
 [20] (2010) Selfpaced learning for latent variable models. In NeurIPS, Cited by: §1, §4.
 [21] (2016) Temporal ensembling for semisupervised learning. arXiv preprint arXiv:1610.02242.
 [22] (2013) Pseudolabel: the simple and efficient semisupervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 2.
 [23] (2020) Dividemix: learning with noisy labels as semisupervised learning. In ICLR,

[24]
(2017)
Learning from noisy labels with distillation.
In
Proceedings of the IEEE International Conference on Computer Vision
, pp. 1910–1918. Cited by: §4.  [25] (2011) Noise resistant graph ranking for improved web image search. In CVPR 2011, pp. 849–856. Cited by: §1.
 [26] (2018) Dimensionalitydriven learning with noisy labels. arXiv preprint arXiv:1806.02612.
 [27] (2017) Decoupling" when to update" from" how to update". In Advances in Neural Information Processing Systems, pp. 960–970. Cited by: §4.
 [28] (2017) A theoretical understanding of selfpaced learning. Information Sciences 414, pp. 319–328. Cited by: §1.

[29]
(2017)
Learning features by watching objects move.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 2701–2710. Cited by: §4.  [30] (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952. Cited by: Appendix A, §1.
 [31] (2015) Training deep neural networks on noisy labels with bootstrapping. In ICLR, Cited by: §1, §2.1, §2.1, §2.2, Table 1, Table 2, §3, footnote 1.
 [32] (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. Cited by: §1, §4.
 [33] (2019) Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, pp. 5739–5748.
 [34] (2019) Metaweightnet: learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pp. 1917–1928. Cited by: Appendix A, §1, §1, §2.3, §2.3, Table 1, Table 2, §3.
 [35] (2020) Meta transition adaptation for robust deep learning with noisy labels. arXiv preprint arXiv:2006.05697. Cited by: §1.
 [36] (2019) SELFIE: refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907–5915. Cited by: §1, §2.1, §4.
 [37] (2014) Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080. Cited by: §4.
 [38] (2018) Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: Appendix A, Appendix A, Appendix B, §1, §2.1, §2.1, §2.2, Table 1, Table 2, §3, §3, §4, footnote 1.
 [39] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
 [40] (2017) Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pp. 5596–5605. Cited by: §4.
 [41] (2018) Iterative learning with open-set noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8688–8696.
 [42] (2010) The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems, pp. 2424–2432. Cited by: §1.
 [43] (2019) Are anchor points really indispensable in label-noise learning?. In Advances in Neural Information Processing Systems, pp. 6835–6846. Cited by: §4.
 [44] (2015) Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2691–2699. Cited by: §3, §4.

 [45] (2019) Safeguarded dynamic label regression for noisy supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9103–9110. Cited by: §3.
 [46] (2019) Probabilistic end-to-end noise correction for learning with noisy labels. In CVPR.
 [47] (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1, §3.
 [48] (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, pp. 8778–8788. Cited by: Table 1, §3.
 [49] (2015) Self-paced learning for matrix factorization. In Twenty-ninth AAAI Conference on Artificial Intelligence, Cited by: §4.
Appendix A More Setting Details on Our Method
Network Structure. For the classifier network, we choose ResNet-34. For the meta learners, inspired by [34], we adopt a multilayer perceptron (MLP) with one hidden layer containing 100 nodes for each of the two meta-learner networks to output the weight.
Synthetic Datasets. We ran these experiments on both synthetic datasets (i.e., CIFAR-10 and CIFAR-100 with different types and levels of noise) under the same configuration, which led to consistent improvements over the state of the art. Our proposed meta label corrector was trained in two steps: first, a warm-up that learns the structured data with only the cross-entropy loss; second, the introduction of two meta learners to correct labels under the guidance of a small set of meta data with clean labels. We used SGD with a momentum of 0.9, a weight decay of , and a batch size of 100. The learning rate is set to 0.1 and divided by 10 after 80 and 100 epochs, for a total of 120 epochs. After training the first step for 80 epochs, we used and Adam to train the two meta learners.
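The meta-learner architecture described under Network Structure can be illustrated with a minimal, dependency-free sketch. Only the single hidden layer of 100 nodes comes from the text. The scalar input, the ReLU and sigmoid activations, the initialization range, and the class name `TinyMLP` are assumptions made for illustration, not details of our implementation.

```python
import math
import random


class TinyMLP:
    """Illustrative one-hidden-layer MLP: scalar input -> 100 hidden units -> scalar weight.

    Assumptions (not from the paper): scalar input, ReLU hidden activation,
    sigmoid output, and uniform initialization in [-0.1, 0.1].
    """

    def __init__(self, hidden=100, seed=0):
        rng = random.Random(seed)
        self.w1 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]  # input -> hidden
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]  # hidden -> output
        self.b2 = 0.0

    def forward(self, x):
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, w * x + b) for w, b in zip(self.w1, self.b1)]
        # Linear output squashed to (0, 1) so it can act as a weight.
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-z))


net = TinyMLP()
weight = net.forward(2.3)  # e.g. feed a per-sample loss value (hypothetical input)
```

In practice such a network would be trained with an automatic-differentiation framework; the sketch only shows the forward shape of the mapping.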
Clothing1M Data. For training on the Clothing1M dataset, we used ResNet-50 pretrained on ImageNet to align the experimental conditions with previous studies [30, 38]. We resized the images as , performed mean subtraction, and cropped the middle for preprocessing. For the classifier network, we used SGD with a momentum of 0.9, a weight decay of , and a batch size of 32. The initial learning rate is set as and divided by 10 after 5 epochs. We trained the network for 10 epochs and began updating labels from the 2nd epoch (i.e., we only warm up for the 1st epoch). For the meta learners , we set the initial learning rate and used Adam to optimize the training process.
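The step learning-rate schedules described above for the classifier network can be sketched as a small helper. The function name `step_lr`, the 0-indexed epoch convention, and the reading of "after 80 epochs" as "from epoch index 80 onward" are assumptions for illustration.

```python
def step_lr(epoch, base_lr=0.1, milestones=(80, 100), gamma=0.1):
    """Return the learning rate for a 0-indexed epoch under a step schedule.

    The rate is multiplied by `gamma` once for every milestone already reached.
    Defaults follow the CIFAR setting in the text: 0.1, divided by 10 after
    80 and 100 epochs.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 120 CIFAR training epochs.
cifar_schedule = [step_lr(e) for e in range(120)]
```

The Clothing1M setting would correspond to `milestones=(5,)` over 10 epochs, with `base_lr` set to that experiment's initial learning rate.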
Accuracy of Corrected Label. In Section 3.1, we plot the accuracy of corrected labels to show the effectiveness of our proposed method. Since both Joint Optimization [38] and U-correction [2] have warm-up operations, for a more comprehensive comparison we let the three methods (i.e., Joint Optimization, U-correction, and ours) begin to correct labels from the 80th epoch. For MW-Net, we follow its original settings (i.e., it starts sample reweighting from the 1st epoch without warm-up). We normalized its sample weights and regard the samples whose weight is greater than 0.5 as its preserved clean samples (the remaining samples are treated as corrupted). From the perspective of label correction, its corrected-label accuracy is the proportion of clean samples it retains among the original clean ones.
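The evaluation rule for the reweighting baseline can be sketched as follows. The text does not state how the weights are normalized, so min-max normalization is assumed here; the function name `retained_clean_ratio` and the toy inputs are likewise hypothetical.

```python
def retained_clean_ratio(weights, is_clean, threshold=0.5):
    """Fraction of originally clean samples whose normalized weight exceeds `threshold`.

    Assumption: weights are min-max normalized to [0, 1]; the paper only says
    they are "normalized".
    """
    lo, hi = min(weights), max(weights)
    span = hi - lo
    # If all weights are equal, treat every sample as kept.
    normalized = [(w - lo) / span if span > 0 else 1.0 for w in weights]

    clean_total = sum(1 for c in is_clean if c)
    if clean_total == 0:
        return 0.0
    kept_clean = sum(
        1 for n, c in zip(normalized, is_clean) if c and n > threshold
    )
    return kept_clean / clean_total


# Toy example: three originally clean samples, two originally noisy ones.
ratio = retained_clean_ratio(
    weights=[0.9, 0.8, 0.1, 0.2, 0.7],
    is_clean=[True, True, True, False, False],
)
```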
Appendix B More Experimental Results
To further analyze the corrected label accuracy of the different methods shown in Section 3.1, Fig. 6 plots the accuracy separately on the originally clean and the originally noisy labels. Fig. 6(b)(d) reflect how many original clean samples are rectified mistakenly, and Fig. 6(a)(c) represent how many original noisy samples are corrected accurately.
Fig. 6(b) shows that the accuracy on noisy labels decreases during the training of U-correction [2]. This is because it uses an unsupervised clustering method to split clean and noisy samples, which tends to treat most samples as clean when the data are imbalanced and clean samples are the majority. Joint Optimization [38] performs well on the correction of noisy samples, as shown in Fig. 6(b)(d). However, by simply replacing the original labels with predicted ones, this strategy also causes the critical issue that many original clean samples are corrupted, as shown in Fig. 6(a)(c).
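The clean/noisy breakdown discussed above can be computed as in the following sketch. It assumes hard (integer) corrected labels and access to the ground-truth labels of the synthetic datasets; the function name `split_correction_accuracy` and the toy labels are hypothetical.

```python
def split_correction_accuracy(corrected, noisy, true):
    """Return (accuracy on originally clean samples, accuracy on originally noisy samples).

    A sample counts as originally clean when its given (possibly noisy) label
    equals the ground truth. A corrected label is accurate when it equals the
    ground truth.
    """
    clean_hits = clean_total = noisy_hits = noisy_total = 0
    for c, n, t in zip(corrected, noisy, true):
        if n == t:
            clean_total += 1
            clean_hits += int(c == t)
        else:
            noisy_total += 1
            noisy_hits += int(c == t)
    clean_acc = clean_hits / clean_total if clean_total else 0.0
    noisy_acc = noisy_hits / noisy_total if noisy_total else 0.0
    return clean_acc, noisy_acc


# Toy example with five samples; the last two given labels are noisy.
clean_acc, noisy_acc = split_correction_accuracy(
    corrected=[0, 1, 0, 0, 0],
    noisy=[0, 1, 2, 1, 0],
    true=[0, 1, 2, 0, 1],
)
```

A low value of the first number indicates that originally clean labels are being corrupted, while a high value of the second indicates that noisy labels are being repaired.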