1 Introduction
Label noise is ubiquitous in real-world data. It may be caused by unintentional mistakes of manual or automatic annotators (Yan et al., 2014; Andreas et al., 2017). It may also be introduced by malicious attackers (Jacob et al., 2017). Noisy labels impair the performance of a model (Smyth et al., 1994; Brodley and Friedl, 1999), especially a deep neural network, which tends to have strong memorization power (Frénay and Verleysen, 2014; Zhang et al., 2017). Improving the robustness of a model trained with noisy labels is a crucial yet challenging task in many applications (Volodymyr and Geoffrey E., 2012; Wu et al., 2018).
Many methods have been proposed to train a robust model on data with label noise. One may recalibrate the model by explicitly estimating a noise transition matrix, namely, the probability of one label being corrupted into another (Goldberger and Ben-Reuven, 2017; Patrini et al., 2017). One may also introduce hidden layers (Reed et al., 2014), a prior on the data distribution (Lee et al., 2019), or a modified loss function (Van Rooyen et al., 2015; Shen and Sanghavi, 2019; Zhang and Sabuncu, 2018) to improve the robustness of the model. However, these methods either assume strong global priors on the data or lack sufficient supervision for the neural network to achieve satisfying performance. Furthermore, global model-correction mechanisms tend to rely on a few parameters; estimating these parameters can be challenging, and estimation errors can cause the training to fail.
To adapt to heterogeneous noise patterns and to fully exploit the power of deep neural networks, data-recalibrating methods have been proposed to focus on individual data instead of an overall model adjustment (Malach and Shalev-Shwartz, 2017; Jiang et al., 2018; Han et al., 2018; Tanaka et al., 2018; Wang et al., 2018; Ren et al., 2018; Cheng et al., 2020). These methods learn to recalibrate the model on each individual datum depending on its own context. They gradually collect clean data whose labels are trustworthy. As more clean data are collected, the quality of the trained models improves. These methods slowly accumulate useful/trustworthy information and eventually attain state-of-the-art quality models.
Despite the success of data-recalibrating methods, their underlying mechanism remains elusive. It is unclear why the neural nets trained on noisy labels can help select clean data. A theoretical underpinning will not only explain the phenomenon, but also advance the methodology. One major challenge for these methods is to control the data recalibration quality. It is hard to monitor the model’s recalibrating decision on individual data. An aggressive selection of clean data can unknowingly accumulate irreversible errors. On the other hand, an overly conservative strategy can be very slow in training, or stop with insufficient clean data and mediocre models. A theoretical guarantee will help develop models with self-assurance that the decision on each datum is reasonably close to the truth.
In this paper, we provide the first theoretical explanation for data-recalibrating methods. Our main theorem states that a noisy classifier (i.e., one trained on noisy labels) can identify whether a label has been corrupted. In particular, we prove that when the noisy classifier has low confidence on the label of a datum, such a label is likely corrupted. In fact, we can quantify the threshold of confidence, below which the label is likely to be corrupted, and above which it is likely not. We also empirically show that the bound in our theorem is tight.
Our theoretical result not only explains existing data-recalibrating methods, but also suggests a new solution for the problem. As a second contribution of this paper, we propose a novel method for noisy-labeled data. Based on our theorem and statistical principles, we verify the purity of a label through a likelihood ratio test w.r.t. the prediction of a noisy classifier and the threshold value of confidence. The label is corrected or left intact depending on the test result. We prove that this simple label-correction algorithm has a guaranteed success rate and will recover the true labels with high probability. We incorporate the label-correction algorithm into the training of deep neural networks. We validate our method on different datasets with various noise patterns and levels. Our theoretically founded method outperforms state-of-the-art methods owing to its simplicity and its principled design.
Our paper shows that a theorem that is well grounded in applications will inspire elegant and powerful algorithms even in deep learning settings. Our contribution is twofold:


We provide a theorem quantifying how a noisy classifier’s prediction correlates with the purity of a datum’s label. This provides a theoretical explanation for data-recalibrating methods for noisy labels.

Inspired by the theorem, we propose a new label-correction algorithm with a guaranteed success rate. We train neural networks using the new algorithm and achieve superior performance.
The code of this paper can be found at https://github.com/pingqingsheng/LRT.git.
1.1 Related Work
One representative strategy for handling label noise is to model the noise transition matrix and use it to correct the loss. For example, Patrini et al. (2017) propose to correct the loss function with an estimated noise pattern. The resulting loss is an unbiased estimator of the ground truth loss, and enables the trained model to achieve better performance. However, such an estimator relies on strong assumptions and could be inaccurate in certain scenarios.
Reed et al. (2014) consider modeling the noise pattern with a hidden layer. The learning of this hidden layer is regularized with a feature reconstruction loss, yet without a guarantee that the true label distribution is learned. Another method mentioned in their work is to minimize the entropy of the neural network output; however, this method tends to predict a single class. To address this weakness, Dan et al. (2019) propose to utilize a small number of trusted, clean data to pretrain a network and estimate the noise pattern. However, such clean data may not always be available in practice.
Alternatively, another direction proposes to design models that are intrinsically robust to noisy data. Crammer et al. (2009) introduce a regularized confidence weighting learning algorithm (AROW), which attempts to preserve the weight distribution as much as possible while requiring the model to maintain discrimination ability. The follow-up work (Crammer and Lee 2010) improves this algorithm by herding the updating direction via a specific velocity field (NHERD), and achieves better performance. Both of these works impose constraints on parameters, which, however, could prevent classifiers from adapting to complex datasets. Another similar strategy assumes a Gaussian distribution for features, and models the data with a robust generative classifier (Lee et al., 2019). However, such an assumption may not generalize to other complex scenarios.
Devansh et al. (2017) show that deep neural networks tend to learn meaningful patterns before they overfit to noisy ones. Based on this observation, they propose to add Gaussian or adversarial noise to the input when training with noisy labels, and empirically show that such data perturbation is able to make the resulting model more robust. Other commonly adopted techniques, such as weight decay and dropout, are also shown to be effective in increasing the robustness of the trained classifier (Devansh et al. 2017; Zhang et al. 2017). However, the intrinsic reasons for this phenomenon still remain unclear and overfitting to noisy labels is extremely likely.
Data-recalibrating methods select clean data while eliminating noisy ones during training. For example, Malach and Shalev-Shwartz (2017) and Han et al. (2018) train two networks simultaneously, and update the networks only with samples that are considered clean by both networks. Similarly, Jiang et al. (2018) also use two networks: the first one is pretrained to learn a curriculum, and then utilized to select clean samples for training the second network. These methods deliver promising results but lack control of the quality of the collected clean data.
Finally, beyond the deep learning framework, there are several theoretical works that demonstrate the robustness of a variety of losses to label noise (Long and Servedio 2010; Nagarajan et al. 2013; Ghosh et al. 2015; Van Rooyen et al. 2015). Following the work of Wang and Chaudhuri (2018), Gao et al. (2016) propose an algorithm that can converge to the Bayes optimal classifier under different noise settings. Moreover, they provide an in-depth discussion regarding the performance of k-nearest neighbor (KNN) classifiers. However, KNN is computationally intensive and difficult to incorporate into a learning context. Within the framework of deep learning, more effort is needed to bridge theory and practice.
2 The Main Theorem: Probing Label Purity Using the Noisy Classifier
Our main theorem answers the following question: without knowing the ground truth, how can one decide whether a label is corrupted or not? During training, the only information one can rely on is a noisy classifier, i.e., one that is trained on the corrupted labels. Data-recalibrating methods use the noisy classifier to decide whether a datum is clean-labeled. However, these methods lack a theoretical justification.
We establish the relationship between a noisy classifier and the purity of a label. We prove that if the classifier has low confidence on a datum with regard to its current label, then this label is likely corrupted. This result provides the first theoretical explanation of why noisy classifiers can be used to determine the purity of labels in previous methods.
This section is organized as follows. We start by providing basic notations and assumptions. Next, we state the main theorem for binary classification and then extend it to the multiclass setting. We also use experiments on synthetic data and CIFAR10 to validate the tightness of our bound.
2.1 Preliminaries and Assumptions
We first focus on binary classification. Later the result will be extended to the multiclass setting. Let $\mathcal{X}$ be the feature space and $\mathcal{Y} = \{0, 1\}$ be the label space. The joint probability distribution, $D$, can be factored as $D(x, y) = \Pr(y \mid x)\Pr(x)$. We denote by $\eta(x) = \Pr(y = 1 \mid x)$ the true conditional probability. The risk of a binary classifier $h$ is $R(h) = \Pr_{(x, y) \sim D}[h(x) \neq y]$. A Bayes optimal classifier is the minimizer of the risk over all possible hypotheses, i.e., $h^* = \arg\min_h R(h)$. It can be calculated using the true conditional probability, $\eta$:
$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1/2, \\ 0 & \text{otherwise.} \end{cases}$$
We assume $\eta$ satisfies the Tsybakov condition (Tsybakov, 2004). This condition, also called the margin assumption, stipulates that the uncertainty of $\eta$ is bounded. In other words, the margin region close to the decision boundary, $\{x : |\eta(x) - 1/2| \leq t\}$, has a bounded volume.
Assumption 1 (Tsybakov Condition).
There exist constants $C, \lambda > 0$, and $t_0 \in (0, 1/2]$, such that for all $t \leq t_0$,
$$\Pr\left[\,|\eta(x) - 1/2| \leq t\,\right] \leq C t^{\lambda}.$$
This assumption is adopted in previous works such as (Chaudhuri and Dasgupta, 2014; Belkin et al., 2018; Qiao et al., 2019). However, we have not seen any empirical verification of the condition in real datasets. In this paper, we conduct experiments to verify this condition and provide empirical estimation of the constants $C$ and $\lambda$. Our experiments indicate that this condition holds with moderate values of the constants $C$ and $\lambda$.
The noisy label setting. Instead of samples from $D$, we are given a sample set with noisy labels $\{(x, \tilde{y})\}$, where $\tilde{y}$ is the possibly corrupted label based on the true label $y$. We assume a transition probability $\tau_{jk} = \Pr(\tilde{y} = k \mid y = j)$, i.e., the chance a true label $y$ is flipped from class $j$ to class $k$. For simplicity, we denote . The transition probabilities $\tau_{01}$ and $\tau_{10}$ are independent of the true joint distribution $D$ and the feature $x$. We denote the conditional probability of the noisy labels as $\tilde{\eta}(x) = \Pr(\tilde{y} = 1 \mid x)$. We call $\tilde{\eta}$ the noisy conditional probability. It is easy to verify that $\tilde{\eta}(x)$ is linear in the true conditional probability, $\eta(x)$:
$$\tilde{\eta}(x) = (1 - \tau_{10} - \tau_{01})\,\eta(x) + \tau_{01}.$$
We intend to learn a classifier whose prediction is consistent with the Bayes optimal classifier $h^*$. Therefore, we call the prediction of $h^*$ the correct label.
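The linear relationship between the noisy and the true conditional probability can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the transition probabilities are taken to be `tau01` (a true label 0 observed as 1) and `tau10` (a true label 1 observed as 0), independent of the feature.

```python
import random

def noisy_eta(eta, tau01, tau10):
    """Closed form: Pr(noisy label = 1 | x) under class-conditional flipping."""
    # (1 - tau10) * eta + tau01 * (1 - eta), rearranged as a linear map of eta.
    return (1.0 - tau10 - tau01) * eta + tau01

def simulate_noisy_rate(eta, tau01, tau10, n=200_000, seed=0):
    """Monte Carlo estimate of the same quantity for a single fixed x."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        y = 1 if rng.random() < eta else 0          # draw the true label
        flip_prob = tau10 if y == 1 else tau01      # class-conditional flip
        y_noisy = 1 - y if rng.random() < flip_prob else y
        ones += y_noisy
    return ones / n

if __name__ == "__main__":
    print(noisy_eta(0.9, 0.2, 0.3))
    print(simulate_noisy_rate(0.9, 0.2, 0.3))
```

The simulated frequency should agree with the closed form up to sampling error, which is the sense in which the noisy conditional probability is a linear transformation of the true one.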
Definition 1 (Correct Label).
Given $x$, its correct label is the Bayes optimal classifier prediction $h^*(x)$.
The correct label, $h^*(x)$, is subtly different from the true label, $y$. In particular, $h^*(x)$ is uniquely decided by $x$, whereas $y$ is a sample from $\Pr(y \mid x)$. Since $h^*$ is our final goal, we focus on recovering the correct label, $h^*(x)$, instead of $y$.
2.2 The Main Theorem
Our main theorem connects a noisy classifier with the chance of a noisy label being correct. We assume is trained on the noisy labels and is trained well enough, i.e., close to the noisy conditional probability, . For convenience, we denote by the classifier prediction of label being , formally, if , and otherwise. Define the estimation error .
Theorem 1.
Assume satisfies the Tsybakov condition with constants , and . Assume For , we have:
Implication of the theorem. Intuitively, the theorem states that a noisy label has bounded probability to be correct if it has a low vote of confidence by . The upper bound of the probability is controlled by , the approximation error of . In other words, the better approximates , the tighter the bound is. This justifies the usage of a good-quality to determine if is trustworthy. Later we will show is reasonably small in the deep learning setting and the bound is tight in practice.
We remark that the constant and the constant hidden inside the big-O in the theorem depend on ’s, which are unknown in practice. Based on this theorem, we will propose a new label-correction algorithm that determines robustly in practice without knowing ’s.
2.2.1 Proof of Theorem 1
Preliminary Lemmata. To prove this theorem, we need to first prove two lemmata. Lemma 1 will show that if a classifier is a linear transformation of , when the value is below a certain threshold, is unlikely to be consistent with the true Bayes optimal decision, . Next, Lemma 2 states that since is a linear transformation of , Lemma 1 will apply to and can be set accordingly. Finally, based on the conclusion of Lemma 2 and the Tsybakov condition, we can upper-bound if is close to .
Lemma 1.
If a classifier depends linearly on , i.e., with . Set . We have
(1) 
Proof.
To calculate , we enumerate two cases:
Case 1: . Observe iff ; iff . We have:
(2) 
We next show that this probability is 0 for the chosen . If , the probability is zero as . Otherwise, . We know that . Therefore, . In this case,
Thus we have .
Case 2: . Observe that iff ; iff , we have:
Similar to Case 1, by checking when and when , we can verify that .
This proves Equation (1) and completes the proof. ∎
Lemma 2.
Let . Let and .
(3) 
Proof.
Recall , in which and are transition probabilities. We can directly prove this lemma using Lemma 1 by setting with and . ∎
Figure 1: Synthetic experiment using CIFAR10 at noise level 20%. (a) Check of the Tsybakov condition using linear regression; the y-axis is the proportion of data points at a given distance from the decision boundary. (b) Proportion of labels that are not correct (not consistent with the Bayes optimal decision rule) and the proposed upper bound. (c) Same as (b), but labels are corrupted with asymmetric noise.
Proof of Theorem 1 using the Lemmata.
Proof.
Remark 1.
Indeed, we can also prove a bound for the opposite case: when is highly confident, is correct with high probability. In this paper, we only focus on the bound in Theorem 1 as we only want to identify incorrect labels and fix them.
2.3 Multiclass Setting
Theorem 1 can be generalized to a multiclass setting. Let be the observed (possibly) corrupted label, and . Recall is the classifier’s prediction on label . Define to be the number of total classes and .
First we extend the Tsybakov condition to multiclass scenario (Chen and Sun, 2006). Denote by the Bayes optimal classifier prediction, or say the class predicted by , formally . Denote by the second best prediction, . The difference between their corresponding true conditional probability is a nonnegative function, whose zero level set is the decision boundary of . We assume the Tsybakov condition around the margin of this decision boundary: and , such that for all ,
(4) 
For any pair of labels , we have the linear relationship . Define . Define the estimation error .
Theorem 2.
Assume fulfills multiclass Tsybakov condition for constants and . Assume that . For :
The proof of Theorem 2 will be provided in supplementary material.
2.4 Empirical Validation of the Bound
To better understand the Tsybakov condition assumption and the bound in our theorem, we conduct the following experiment. On the CIFAR10 dataset, we train deep neural networks to approximate relevant functions. We use these functions to estimate the constants and in the Tsybakov condition. Using these constants, we calculate the bound in Theorem 2 as a function of and check if it is tight.
Figure 2: Illustration of the algorithm in a binary classification case. (a) Noisy labels. (b) Corrected labels.
To estimate $C$ and $\lambda$, we approximate the true conditional probability using a deep neural network trained on the original clean-labeled CIFAR10 data. We densely sample $t$ between 0 and 0.9. For each $t$, we empirically evaluate the left-hand side (LHS) probability of Equation (4) and then use these values to estimate $C$ and $\lambda$ via regression. In particular, for each $t$ we calculate the LHS of Equation (4) using the frequency , in which is the number of data. If the RHS bound is tight, we can use to approximate . As shown in Figure 1(a), we plot all pairs as blue dots and estimate $C$ and $\lambda$ via linear regression (red line). We observe that the samples are quite close to linear. Indeed, we could get the ordinary least squares (OLS) estimator of the constants $C$ and $\lambda$ with high confidence (coefficient of determination , p-value ). The estimated $C$ and $\lambda$ are and respectively.
Next, we verify our bound in Theorem 2. Using the estimated $C$ and $\lambda$, we can calculate the bound (RHS of Equation (4)) as a function of (the constant in the big-O is provided in the supplemental material). In Figure 1(b), we plot the bound function as a green curve. We compare this bound with the LHS of Equation (4), which we can empirically evaluate. In particular, we train a noisy classifier by training a neural network on noisy labels (symmetric noise level 20%, see Section 4 for details). Using , we can count the number of data points that have and meanwhile is equal to (calculated using : the clean-label-trained neural network). This gives us the LHS of Equation (4), which is the probability of a label being correct when has low confidence (blue line in Figure 1(b)). Similarly, we can calculate the probability of a label being correct when has high confidence (orange line in Figure 1(b)). We also carry out the same experiment on a different noise setting (asymmetric noise level 20%, see Section 4 for details).
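The regression used to estimate the Tsybakov constants can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the fit is carried out in log-log space (a power law is a straight line there), that `margins` already holds the per-sample margin (for example the gap between the two largest estimated class probabilities), and the function name is hypothetical.

```python
import numpy as np

def estimate_tsybakov_constants(margins, ts):
    """Fit log Pr[margin <= t] = log C + lambda * log t by ordinary least squares.

    margins: per-sample distance to the decision boundary (assumed precomputed).
    ts:      strictly positive thresholds at which the frequency is evaluated.
    Returns (C, lam).
    """
    margins = np.asarray(margins, dtype=float)
    ts = np.asarray(ts, dtype=float)
    # Empirical frequency of samples falling inside the margin region for each t.
    freqs = np.array([(margins <= t).mean() for t in ts])
    keep = freqs > 0  # the logarithm is undefined for empty regions
    slope, intercept = np.polyfit(np.log(ts[keep]), np.log(freqs[keep]), deg=1)
    return float(np.exp(intercept)), float(slope)

if __name__ == "__main__":
    # Toy check: margins uniform on [0, 0.5] give Pr[margin <= t] = 2 t.
    toy_margins = np.linspace(0.0, 0.5, 100001)
    print(estimate_tsybakov_constants(toy_margins, [0.05, 0.1, 0.2, 0.4]))
```

On the toy input the fitted exponent is close to 1 and the fitted constant close to 2, matching the known power law; on real data the quality of the fit depends on how well the network approximates the true conditional probability.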
Discussion. On the CIFAR10 dataset, we estimated the constants of the Tsybakov condition to be and with high confidence. This means our bound (Equation (4)) is almost linear. As observed in Figure 1(b) and (c), the bound is rather small (only up to 0.2 when the approximation error of the classifier, , is below 0.4). Furthermore, the empirically evaluated chance of being correct when has low confidence (blue lines in Figure 1) is almost zero, well below the curve of the bound. In Figure 1(b), the fact that the blue and green lines intersect at implies that can be as small as 0.06. Similarly, Figure 1(c) implies can be as small as 0.12. Finally, we note that the orange lines are well above the blue ones. This means when has high confidence on , there is a high chance is correct. In other words, by comparing with a properly chosen constant , we can identify most data with corrupted labels.
We also conduct experiments on synthetic data (generated using a multivariate normal distribution). In such a case, we can calculate and exactly. The estimated $C$ and $\lambda$ are and respectively. More details about the synthetic experiments can be found in the supplemental material.
In conclusion, experiments on synthetic and on CIFAR10 datasets show that the constants in the Tsybakov condition are rather small and the bound in our theorem is almost linear in . We also note the bound is generally small/tight even in the deep learning setting. Thresholding ’s confidence does detect corrupted labels accurately.
3 The Algorithm: Likelihood Ratio Test for Label Correction
Our theoretical insight inspires a new algorithm for label correction. We propose to directly test the confidence level of the noisy classifier to determine whether a label is correct. One additional requirement is that if we decide that a label is incorrect, we also need to decide what the correct label is. Therefore, instead of checking the confidence level, we check the likelihood ratio between ’s confidence on and its confidence on its own label prediction, i.e., . Specifically, we check the likelihood ratio
We compare this likelihood ratio with a predetermined threshold . The value of is given in the next theorem. This is essentially a hypothesis test on the null hypothesis . If , we reject the null hypothesis and flip the label . Otherwise, the label remains unchanged, . If then the likelihood ratio is 1, . The detailed algorithm is provided in Procedure 1. See Figure 2 for an illustration of the algorithm in a binary classification case.
We will show in the following theorem that the LRT correction algorithm is guaranteed to make proper corrections and clean most of the corrupted labels. In particular, we show that in practice if we have a reasonable approximation to the theoretically optimal , the algorithm flips to the correct label (the Bayes optimal prediction, ) with a good chance. Recall the approximation error of the classifier is .
We consider two cases: (1) the label being flipped ; and (2) the label remaining the same . Each case has its own ideal . We bound the probability of obtaining a correct label with and . Here is the difference between the chosen and the ideal . We also introduce an additional term, , denoting the probability that the true label is neither nor , formally, .
Theorem 3.
, assume fulfills multiclass Tsybakov condition for constants , , .
Case 1 (Label flipped by LRTCorr(,,)): let and . Assume and . Then: is at least .
Case 2 (Label preserved by LRTCorr(,,)):
let and . Assume and .
Then: is at least .
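The likelihood ratio test described at the start of this section can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' Procedure 1: the function name `lrt_correct`, the argument name `delta` for the threshold, and the strict inequality are hypothetical choices; the ratio is taken as the classifier's confidence on the given label divided by its confidence on its own top prediction.

```python
import numpy as np

def lrt_correct(probs, noisy_labels, delta):
    """Flip a label to the classifier's top prediction when the ratio falls below delta.

    probs:        (n, k) array of class probabilities, e.g. softmax outputs.
    noisy_labels: (n,) integer array of current (possibly corrupted) labels.
    delta:        threshold in (0, 1]; an assumed name for the test threshold.
    Returns (corrected_labels, ratios).
    """
    probs = np.asarray(probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels, dtype=int)
    rows = np.arange(len(noisy_labels))
    predicted = probs.argmax(axis=1)
    # Confidence on the given label relative to confidence on the top prediction.
    ratios = probs[rows, noisy_labels] / probs[rows, predicted]
    # Reject the null hypothesis (label is correct) when the ratio is too small.
    corrected = np.where(ratios < delta, predicted, noisy_labels)
    return corrected, ratios

if __name__ == "__main__":
    p = [[0.9, 0.1], [0.45, 0.55], [0.2, 0.8]]
    print(lrt_correct(p, [1, 0, 1], delta=0.5))
```

When the given label already equals the top prediction the ratio is exactly 1, so such labels are never flipped for any threshold at most 1.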
3.1 Training Deep Nets with LRTCorrection
We incorporate the proposed label correction into the training of deep neural networks. Similar to other data-recalibrating methods, our training algorithm continuously trains a deep neural network while correcting the noisy labels. Procedure 2 is the pseudocode of the training method, called AdaCorr. It trains a neural network model iteratively. Each iteration includes both label correction and model training steps. In the label correction step, the prediction of the current neural network, , is used to run the LRT on all training data, and to correct their labels according to the test result. Since is used to approximate the conditional probability , we use the softmax layer output of the neural network as . After the labels of all training data are updated, we use them to train the neural network incrementally. We continue this iterative procedure until the training converges.
We also have a burn-in stage in which we train the network using the original noisy labels for epochs. During the burn-in stage, we use the original cross-entropy loss, . Afterwards, we add an additional retroactive loss, with the intention of stabilizing the network and avoiding overfitting.
After the burn-in stage, we want to avoid overfitting of the neural network, so that its output better approximates . To achieve this goal, we introduce a retroactive loss term . The idea is to enforce the consistency between and the prediction of the model at a previous epoch, . It has been observed that a neural network at an earlier training stage tends to learn the true pattern rather than to overfit the noise (Devansh et al., 2017). Formally, the loss can be written as , in which is the number of possible label classes. The training loss is the sum of the retroactive loss and the cross-entropy loss:
In the experiment we evaluate our method on 4 public datasets: CIFAR10, CIFAR100, MNIST and ModelNet40 (see Section 4 for more details). Based on previous observations (Devansh et al., 2017), on CIFAR10 and CIFAR100 datasets, a neural network takes about 30 epochs to fit the true pattern before overfitting the noise. We use this number as the burn-in stage length . For easier datasets like MNIST and ModelNet40, we set to be slightly smaller (25). As for , setting to be slightly smaller than 1 seems sufficient. Our Theorem 3 guarantees that the bound depends almost linearly (as per Section 2.4) on the error of the manually picked relative to the optimal one.
4 Experiments
In this section we empirically evaluate our proposed method with several datasets, where noisy labels are injected according to specified noise transition matrices.
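Injecting label noise from a transition matrix can be sketched as below. The exact matrices are not spelled out in this section, so the definitions here are assumptions following common practice: under uniform flipping a label moves to each of the other classes with equal probability, and under pair flipping it moves to the next class cyclically. All function names are hypothetical.

```python
import numpy as np

def uniform_flip_matrix(num_classes, noise):
    """Row j: keep class j with prob 1 - noise, spread `noise` evenly over the others."""
    T = np.full((num_classes, num_classes), noise / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise)
    return T

def pair_flip_matrix(num_classes, noise):
    """Row j: keep class j with prob 1 - noise, move to class (j + 1) mod K otherwise."""
    T = np.eye(num_classes) * (1.0 - noise)
    for j in range(num_classes):
        T[j, (j + 1) % num_classes] = noise
    return T

def inject_noise(labels, T, seed=0):
    """Sample a noisy label for each clean label from the matching row of T."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])

if __name__ == "__main__":
    clean = np.arange(10).repeat(100)
    noisy = inject_noise(clean, uniform_flip_matrix(10, 0.4))
    print((noisy != clean).mean())  # roughly the requested noise level
```

Each row of a transition matrix is a probability distribution over observed labels, so rows must sum to one; the noise level then equals the off-diagonal mass of each row.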
Datasets. We use the following datasets: MNIST (LeCun et al. 1998), CIFAR10 (Krizhevsky et al. 2009), CIFAR100 (Krizhevsky et al. 2009) and ModelNet40 (Wu et al. 2015). MNIST consists of grayscale images with 10 categories. It contains 60,000 images, and we use 45,000 for training, 5,000 for validation and 10,000 for testing. CIFAR10 and CIFAR100 consist of the same 60,000 images whose size is $32 \times 32$. CIFAR10 has 10 classes while CIFAR100 has 100 fine-grained classes. Similar to MNIST, we split 90% and 10% of the data from the official training set for training and validation, respectively, and use the official test set for testing. ModelNet40 contains 12,311 CAD models from 40 categories, where 8,859 are used for training, 984 for validation and the remaining 2,468 for testing. We follow the protocol of Qi et al. (2017) to convert the CAD models into point clouds by uniformly sampling 1,024 points from the triangular mesh and normalizing them within a unit ball. In all experiments, we use early stopping on the validation set to tune hyperparameters and report the performance on the test set.
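The conversion of a triangular mesh into a normalized point cloud can be sketched as follows. This is an illustration under assumptions rather than the exact protocol of the cited work: triangles are chosen with probability proportional to their area, points inside a triangle are drawn with the square-root barycentric trick, and normalization centers the cloud at its mean before scaling by the largest distance. The function name is hypothetical.

```python
import numpy as np

def sample_point_cloud(vertices, faces, num_points=1024, seed=0):
    """Uniformly sample points on a triangle mesh and scale them into the unit ball."""
    rng = np.random.default_rng(seed)
    vertices = np.asarray(vertices, dtype=float)
    tri = vertices[np.asarray(faces)]  # shape (F, 3, 3): corners of every triangle
    # Area-proportional weights make the sampling uniform over the surface.
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1
    )
    chosen = rng.choice(len(tri), size=num_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each chosen triangle.
    r1 = np.sqrt(rng.random(num_points))[:, None]
    r2 = rng.random(num_points)[:, None]
    a, b, c = tri[chosen, 0], tri[chosen, 1], tri[chosen, 2]
    points = (1 - r1) * a + r1 * (1 - r2) * b + r1 * r2 * c
    # Center the cloud, then scale so the farthest point lies on the unit sphere.
    points -= points.mean(axis=0)
    points /= np.linalg.norm(points, axis=1).max()
    return points

if __name__ == "__main__":
    square_v = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
    square_f = [[0, 1, 2], [0, 2, 3]]
    print(sample_point_cloud(square_v, square_f).shape)
```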
Baselines. We compare the proposed method with the following methods: (1) Standard, which trains the network in a standard manner, without any label-noise resistance technique; (2) Forward Correction (Patrini et al. 2017), which explicitly estimates the noise transition matrix to correct the training loss; (3) Decoupling (Malach and Shalev-Shwartz 2017), which trains two networks simultaneously and updates the parameters on selected data whose labels are possibly clean; (4) Co-teaching (Han et al. 2018), which also trains two networks but exchanges their error information for network updating; (5) MentorNet (Jiang et al. 2018), which learns a curriculum to filter out noisy data; (6) Forgetting (Devansh et al., 2017), which uses dropout to help deep models resist label noise; (7) Abstention (Thulasidasan et al. 2019), which regularizes the network with an abstention loss to ensure model robustness under label noise.
Experimental setup. For the classification of MNIST, CIFAR10 and CIFAR100, we use pre-activation ResNet34 (He et al. 2016) as the backbone for all the methods. On ModelNet40, we use PointNet. We train the models for 180 epochs to ensure that all the methods have converged. We utilize RAdam (Liu et al. 2019) for the network optimization, and adopt batch size 128 for all the datasets. We use an initial learning rate of 0.001, which is decayed by 0.5 every 60 epochs. We also update to once at epoch to reflect the better predictive power of the network after several epochs. The experimental results are listed in Table 2. As is shown, our method overall achieves the best performance across the datasets under different noise settings.
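The learning-rate schedule stated above amounts to a step decay. A minimal sketch, assuming epochs are counted from zero and that the decay is applied at the start of epochs 60 and 120 (the function name is hypothetical):

```python
def learning_rate(epoch, base_lr=0.001, decay=0.5, step=60):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

if __name__ == "__main__":
    for e in (0, 59, 60, 120, 179):
        print(e, learning_rate(e))
```

Over the 180 training epochs this gives three plateaus: 0.001, 0.0005 and 0.00025.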
Clothing 1M. We also evaluate our method on the large-scale Clothing 1M dataset (Xiao et al., 2015), which consists of 1M images with real-world noisy labels. We use a pretrained ResNet50 and train the model using SGD for 20 epochs. Our method achieves accuracy 71.47%. It outperforms Standard (68.94%), Forward Correction (69.84%) and Backward Correction (Patrini et al., 2017) (69.13%), for which we take the numbers directly from the original paper. Note that other baselines (Forgetting, Decoupling, MentorNet, Co-teaching and Abstention) did not report results on this dataset.
Method    Accuracy (%)
Standard  68.94
Forward   69.84
Backward  69.13
AdaCorr   71.74 ± 0.12
Discussion. Our method outperforms state-of-the-art methods over a broad spectrum of noise patterns and levels. This is due to the relatively simple procedure of our theoretically guaranteed algorithm. Looking closely, in Figure 3, we draw convergence curves on CIFAR10 with 0.4 uniform noise. On the left, we show the curves of our proposed AdaCorr method. The model continues to flip labels to correct ones. Meanwhile, it fits the corrected labels and the test accuracy on clean labels does not drop. This shows that the model and the label correction are improving in a harmonious fashion and do not collapse. On the right, we show the curves of the Standard method. Without label correction, the model overfits the noisy labels and the performance on test data degrades catastrophically.
Data Set    Method      Uniform 0.2  Uniform 0.4  Uniform 0.6  Uniform 0.8  Pair 0.2     Pair 0.3     Pair 0.4
MNIST       Standard    99.0 ± 0.2   98.7 ± 0.4   98.1 ± 0.3   91.3 ± 0.9   99.3 ± 0.1   99.2 ± 0.1   98.8 ± 0.1
MNIST       Forgetting  99.0 ± 0.1   98.8 ± 0.1   97.7 ± 0.2   62.6 ± 8.9   99.3 ± 0.1   96.5 ± 2.0   89.7 ± 1.9
MNIST       Forward     99.1 ± 0.1   98.7 ± 0.2   98.0 ± 0.4   89.6 ± 4.8   99.4 ± 0.0   99.2 ± 0.2   96.5 ± 4.4
MNIST       Decouple    99.3 ± 0.1   99.0 ± 0.1   98.5 ± 0.2   94.6 ± 0.2   99.4 ± 0.0   99.3 ± 0.1   99.1 ± 0.2
MNIST       MentorNet   99.2 ± 0.2   98.7 ± 0.1   98.1 ± 0.1   87.5 ± 5.2   98.6 ± 0.4   99.1 ± 0.1   98.9 ± 0.1
MNIST       Coteach     99.1 ± 0.2   98.7 ± 0.3   98.2 ± 0.3   95.7 ± 0.7   99.1 ± 0.1   99.0 ± 0.2   98.9 ± 0.2
MNIST       Abstention  94.0 ± 0.3   76.8 ± 0.3   49.6 ± 0.1   21.2 ± 0.5   94.3 ± 0.3   88.5 ± 0.3   81.4 ± 0.2
MNIST       AdaCorr     99.5 ± 0.0   99.4 ± 0.0   99.1 ± 0.0   97.7 ± 0.2   99.5 ± 0.0   99.6 ± 0.0   99.4 ± 0.0
CIFAR10     Standard    87.5 ± 0.2   83.1 ± 0.4   76.4 ± 0.4   47.6 ± 2.0   88.8 ± 0.2   88.4 ± 0.3   84.5 ± 0.3
CIFAR10     Forgetting  87.1 ± 0.2   83.4 ± 0.2   76.5 ± 0.7   33.0 ± 1.6   89.6 ± 0.1   83.7 ± 0.1   86.4 ± 0.5
CIFAR10     Forward     87.4 ± 0.8   83.1 ± 0.8   74.7 ± 1.7   38.3 ± 3.0   89.0 ± 0.5   87.4 ± 1.1   84.7 ± 0.5
CIFAR10     Decouple    87.6 ± 0.4   84.2 ± 0.5   77.6 ± 0.1   48.5 ± 0.9   90.6 ± 0.3   89.1 ± 0.3   86.3 ± 0.5
CIFAR10     MentorNet   90.3 ± 0.3   83.2 ± 0.5   75.5 ± 0.7   34.1 ± 2.5   90.4 ± 0.2   88.9 ± 0.1   83.3 ± 1.0
CIFAR10     Coteach     90.1 ± 0.4   87.3 ± 0.5   80.9 ± 0.5   25.0 ± 3.6   91.8 ± 0.1   89.9 ± 0.2   80.1 ± 0.7
CIFAR10     Abstention  85.3 ± 0.4   82.0 ± 0.7   68.8 ± 0.4   33.8 ± 7.7   88.5 ± 0.0   83.1 ± 0.5   77.4 ± 0.4
CIFAR10     AdaCorr     91.0 ± 0.3   88.7 ± 0.5   81.2 ± 0.4   49.2 ± 2.4   92.2 ± 0.1   91.3 ± 0.3   89.2 ± 0.4
CIFAR100    Standard    58.9 ± 0.8   52.1 ± 1.0   42.1 ± 0.7   20.8 ± 1.0   59.5 ± 0.4   52.9 ± 0.6   44.7 ± 1.3
CIFAR100    Forgetting  59.3 ± 0.8   53.0 ± 0.2   40.9 ± 0.5    7.7 ± 1.1   61.4 ± 0.9   54.6 ± 0.6   37.7 ± 4.6
CIFAR100    Forward     58.4 ± 0.5   52.2 ± 0.3   41.1 ± 0.5   20.6 ± 0.6   58.3 ± 0.7   53.2 ± 0.6   44.4 ± 2.8
CIFAR100    Decouple    59.0 ± 0.7   52.2 ± 0.7   40.2 ± 0.4   18.5 ± 0.8   60.8 ± 0.7   56.1 ± 0.7   48.4 ± 1.0
CIFAR100    MentorNet   63.6 ± 0.5   51.4 ± 1.4   38.7 ± 0.8   17.4 ± 0.9   64.7 ± 0.2   57.4 ± 0.8   47.4 ± 1.7
CIFAR100    Coteach     66.1 ± 0.5   60.0 ± 0.6   48.3 ± 0.1   16.1 ± 1.1   63.4 ± 0.9   57.6 ± 0.3   49.2 ± 0.3
CIFAR100    Abstention  75.1 ± 5.4   60.0 ± 0.8   51.1 ± 0.8   10.3 ± 0.5   65.4 ± 0.5   56.8 ± 0.5   47.3 ± 0.3
CIFAR100    AdaCorr     67.8 ± 0.1   60.2 ± 0.8   46.5 ± 1.2   24.6 ± 1.1   68.3 ± 0.2   61.1 ± 0.5   49.8 ± 0.7
ModelNet40  Standard    79.1 ± 2.6   75.3 ± 3.3   70.0 ± 3.0   57.9 ± 2.3   84.4 ± 1.2   82.3 ± 1.3   78.9 ± 0.7
ModelNet40  Forgetting  80.1 ± 1.8   73.9 ± 0.6   69.0 ± 0.7   26.2 ± 4.8   83.3 ± 1.1   62.0 ± 3.0   59.5 ± 2.9
ModelNet40  Forward     52.3 ± 5.1   49.4 ± 6.8   43.5 ± 5.2   28.2 ± 5.5   48.1 ± 6.8   48.0 ± 3.7   49.1 ± 4.4
ModelNet40  Decouple    82.5 ± 2.2   80.7 ± 0.7   72.9 ± 1.0   55.4 ± 2.7   85.7 ± 1.4   84.3 ± 1.0   80.5 ± 2.4
ModelNet40  MentorNet   86.5 ± 0.5   75.4 ± 1.8   70.9 ± 1.9   52.7 ± 3.1   83.7 ± 1.8   81.0 ± 1.5   79.3 ± 2.1
ModelNet40  Coteach     85.6 ± 0.9   84.2 ± 0.8   81.8 ± 1.1   68.9 ± 2.8   85.7 ± 0.8   79.1 ± 3.0   69.1 ± 2.4
ModelNet40  Abstention  78.1 ± 0.6   65.6 ± 0.5   45.6 ± 1.5   23.5 ± 0.5   82.3 ± 0.5   80.4 ± 0.6   65.6 ± 0.5
ModelNet40  AdaCorr     86.9 ± 0.3   85.1 ± 0.6   78.6 ± 1.4   72.1 ± 1.1   87.6 ± 0.4   84.6 ± 0.5   83.7 ± 0.5
5 Conclusion
We prove theoretical guarantees for data-recalibrating methods for noisy labels. Based on the result, we propose a label correction algorithm to combat label noise. Our method can produce models robust to different noise patterns. Experiments on various datasets show that our method outperforms many recently proposed methods.
Acknowledgements
Mayank Goswami is supported by National Science Foundation grants CRII-1755791 and CCF-1910873. The research of Songzhu Zheng and Chao Chen is partially supported by NSF IIS-1855759, CCF-1855760 and IIS-1909038. The research of Pengxiang Wu and Dimitris Metaxas is partially supported by NSF CCF-1733843. We thank the anonymous referees for their constructive comments and suggestions.