1 Introduction
While deep neural networks (DNNs) have recently achieved remarkable success in various applications (Krizhevsky et al., 2012; He et al., 2016), their performance largely relies on pre-collected large-scale datasets with high-quality human annotations. In real-world applications, however, such data are notoriously expensive to obtain in both time and money. In practice, labels are instead often collected from coarse annotation sources, such as crowdsourcing systems (Bi et al., 2014) or search engines (Xiao et al., 2015; Liang et al., 2016), which naturally introduces noisy (incorrect) labels into the training data. Learning from such biased training data easily leads to overfitting, thus hampering the generalization performance of the learned model (Zhang et al., 2017a).
The commonly used approach to this robust learning problem is to select confident examples and remove suspect ones (Chang et al., 2017), or to correct noisy labels to their more probable true labels (Arazo et al., 2019). These methods, however, implicitly assume that a sample belongs to exactly one class, and neglect the fact that real-world sample categories are intrinsically ambiguous. Such “noisy labels” carry useful knowledge about the inter-class transition principle that naturally exists in data annotation. Simply removing noisy samples, or replacing one noisy label with another, ignores this clue about how label noise is generated, and thus leaves room for further performance improvement.
This label ambiguity can be easily understood from Fig. 1, where the samples are from Clothing1M (Xiao et al., 2015), a large-scale clothing dataset built by crawling images from several online shopping websites. It represents a typical real-world label corruption scenario: an unknown noise transition matrix flips the more probable true label to other, less probable ones with certain probabilities, thereby producing noisy labels. When a DNN classifier is trained directly by treating the given sample labels as deterministic, its top-1 predictions tend to be consistent with the noisy labels, which naturally leads to overfitting, as clearly shown in the third column of Fig. 1. Estimating the underlying noise transition matrix is thus expected to alleviate this robustness issue by extracting the real noisy label distribution and improving the quality of the trained classifier (as depicted in the fourth column of the figure).
Previous methods for noise transition matrix estimation can be roughly divided into two categories. One estimates this matrix in advance on pre-assumed anchor points, i.e., samples that certainly belong to each class, and subsequently fixes it to train the classifier. However, such prior knowledge (Scott et al., 2013; Patrini et al., 2017) is generally unavailable in practice. The other jointly estimates the noise transition matrix and the classifier parameters in a unified framework (Sukhbaatar et al., 2015; Goldberger and Ben-Reuven, 2017). Although it avoids the anchor point assumption, it often yields inaccurate estimates because it is misguided by wrong annotations, especially in large noise cases, as shown in our experiments.
To address the above issues, this paper proposes a new meta-transition-learning strategy for learning with noisy labels. The main idea is to leverage a small set of meta data with clean labels to guide the estimation of the noise transition matrix. In summary, this study makes three main contributions.

We propose a new learning strategy to estimate the noise transition matrix in a meta-learning manner. Under the guidance of a small set of meta data with clean labels, the noise transition matrix and the classifier parameters can be mutually improved to avoid being trapped by noisy training samples, without the need for any anchor point assumption.

We show that our method can finely estimate the desired transition matrix under the guidance of the meta data, with a statistical consistency guarantee. Comprehensive synthetic and real-data experiments validate that our method extracts the underlying transition matrix more accurately than previous SOTA methods, which naturally leads to more robust performance.

We discuss the essential relationship between our method and label distribution learning, which explains its good performance even in no-noise scenarios. Experiments on out-of-training-distribution behavior and adversarial attacks show that our method gives the model better generalization and robustness.
The paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed meta-learning method, together with some of its statistical properties. Section 4 presents experimental results. Section 5 discusses the relationship between our method and label distribution learning, followed by the conclusion.
2 Related Work
Learning with Noise Transition. The transition matrix reflects the probabilities with which the most probable true labels flip into other “noisy” ones, and has previously been employed to modify loss functions to improve training performance (Natarajan et al., 2013; Scott, 2015). There are mainly two approaches to estimating the noise transition matrix. One is a two-step solution that pre-estimates the noise transition under the anchor point assumption and then uses it to train the classifier. For example, Patrini et al. (2017) proposed a theoretically sound loss correction method that uses pre-calculated noise transition knowledge, obtained on anchor points heuristically collected from the unsupervised dataset. Afterwards, GLC (Hendrycks et al., 2018) used a small set of pre-assumed clean-label samples to estimate the noise transition and further improve estimation stability. These methods, however, require pre-specifying instances that belong to a specific class with probability exactly, or at least very close to, one, which is often infeasible in practice. Approximate anchor points lead to inaccurate estimation of the matrix, and thus hamper the subsequent training accuracy.
The other approach is to jointly estimate the noise transition matrix and the classifier parameters in a unified framework without employing anchor points. Sukhbaatar et al. (2015) first learned a linear layer with a trace constraint, which encourages the linear layer to be interpreted as the transition matrix between the true and noisy labels. Jindal et al. (2016) further improved the result with additional dropout regularization. Subsequently, S-Model (Goldberger and Ben-Reuven, 2017) modeled the noise transition with a Softmax layer instead of a linear one. Recently, T-Revision (Xia et al., 2019) introduced a slack variable to revise the pre-estimated matrix and validated the revision on a noisy validation set. Although computationally concise, these methods tend to lose accuracy because they are misguided by noisy labels, especially at heavy noise rates, as shown in our experiments.
Other methods of learning with noisy labels. We also briefly introduce two typical strategies for handling noisy labels: label correction and reliable example selection. The former aims to correct noisy labels to their true ones via an inference step, such as directed graphical models (Xiao et al., 2015), conditional random fields (Vahdat, 2017), or the approach of Li et al. (2017). Tanaka et al. (2018) used the network outputs to predict hard or soft labels. Among reliable example selection methods, Decouple (Malach and Shalev-Shwartz, 2017) selected the samples on which two networks give different label predictions, while Co-teaching (Han et al., 2018) selected small-loss samples as clean samples for each network. INCV (Chen et al., 2019) randomly divided the noisy data and then used cross-validation to identify clean samples by removing large-loss samples at each iteration. The other reliable example selection approach mainly adopts sample reweighting schemes, imposing weights on samples based on their reliability for training. Typical methods include SPL (Kumar et al., 2010) and its extensions (Jiang et al., 2014a, b; Meng et al., 2017), which reduce the effect of examples with large losses and pay more attention to easy samples with smaller losses. Other methods along this line include an iterative reweighting strategy (Zhang and Sabuncu, 2018) and Bayesian latent variable inference (Wang et al., 2017).
Recently, some works have tried to combine the advantages of the above two approaches. For example, SELFIE (Song et al., 2019) trained the network on selectively refurbished false-labeled samples that can be corrected with high precision, together with small-loss ones. Arazo et al. (2019) used a two-component mixture model to characterize the loss distributions of clean and noisy samples in an unsupervised way, and used mixup data augmentation to achieve noisy label correction. Shen and Sanghavi (2019) proposed to iteratively minimize the trimmed loss, selecting the samples with the lowest current loss and retraining a model on only these samples, which provably recovers the ground truth in generalized linear models.
Meta learning methods. Inspired by meta-learning developments (Schmidhuber, 1992; Thrun and Pratt, 1998; Finn et al., 2017; Shu et al., 2018, 2019), some methods have recently been proposed to make DNNs robust to label noise. However, existing methods focus on learning an adaptive weighting scheme imposed on data to make learning more automatic and reliable. Typical methods along this line include MentorNet (Jiang et al., 2018), L2RW (Ren et al., 2018) and Meta-Weight-Net (Shu et al., 2019). This paper can be seen as the first exploration of meta learning for fitting noise transition information.
3 Meta Transition Adaptation Method
3.1 Preliminaries
We consider the problem of $c$-class classification. Let $\mathcal{X}$ be the feature space, $\mathcal{Y}=\{1,\dots,c\}$ be the label space, and $D$ and $\bar{D}$ denote the underlying data distributions with true and noisy labels, respectively. In practice, we assume that the labels of the collected training examples are independently corrupted from the true label distribution. Thus what we can obtain are the noisy training samples $\bar{S}=\{(x_i,\bar{y}_i)\}_{i=1}^{N}$, corresponding to the latent true data samples $S=\{(x_i,y_i)\}_{i=1}^{N}$. The two datasets are drawn i.i.d. from the noisy and true data distributions $\bar{D}$ and $D$, respectively.
Assume our classifier model is a DNN architecture with $d$ layers comprising a transformation $h:\mathcal{X}\to\mathbb{R}^{c}$, where $h=h_d\circ\cdots\circ h_1$ is the composition of a series of intermediate transformation layers $h_k$. Each $h_k$ is defined as:
$$h_k(z)=\sigma(W_k z),$$
where $W_k$ denote the classifier parameters to be estimated (here, we omit the bias vector in each layer), and $\sigma$ is the activation function such as ReLU (Glorot et al., 2011). We assume that the output layer is a Softmax layer, so that the output is $f(x)$, and the predicted label is thus given by $\arg\max_{j} f_j(x)$. The Softmax output can be interpreted as a $c$-dimensional vector approximating the class-conditional probabilities $P(y\mid x)$. We denote it by $\hat{P}(y\mid x)$, also written as $f(x;w)$, where $w$ collects the classifier parameters. The expected risk on clean data is defined as (Bartlett et al., 2006):
$$R_{\ell}(f)=\mathbb{E}_{(x,y)\sim D}\big[\ell(f(x;w),y)\big], \tag{1}$$
where $\ell$ is the loss function.
Since the distribution $D$ is usually unknown, we use the empirical risk over dataset $S$ to approximate $R_{\ell}(f)$,
$$\hat{R}_{\ell}(f)=\frac{1}{N}\sum_{i=1}^{N}\ell\big(f(x_i;w),y_i\big). \tag{2}$$
In this study, we assume there are label transition probabilities between different classes, as commonly adopted in previous works (Natarajan et al., 2013; Sukhbaatar et al., 2015; Patrini et al., 2017; Goldberger and Ben-Reuven, 2017). The probability of each label $i$ in the training set flipping to $j$ is expressed as $P(\bar{y}=j\mid y=i)$. We utilize a noise transition matrix $T\in[0,1]^{c\times c}$ (Van Rooyen and Williamson, 2017) to represent these probabilities, so that $T_{ij}=P(\bar{y}=j\mid y=i)$. The matrix is row-stochastic, i.e., $\sum_{j=1}^{c}T_{ij}=1$ for every $i$, and not necessarily symmetric across the classes.
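As a concrete illustration of such a matrix, the following minimal Python sketch builds a symmetric and a pair-flip (asymmetric) noise transition matrix and checks that both are row-stochastic. The function names and the 4-class example are hypothetical and chosen only for illustration; they are not taken from the original implementation.

```python
def symmetric_transition(num_classes, noise_rate):
    """Keep a label with probability 1 - noise_rate; otherwise flip it
    uniformly to one of the other classes."""
    off_diag = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off_diag
             for j in range(num_classes)]
            for i in range(num_classes)]


def pair_flip_transition(num_classes, noise_rate, flips):
    """Start from the identity; class src flips to flips[src] with
    probability noise_rate. Classes not listed in flips stay clean."""
    T = [[1.0 if i == j else 0.0 for j in range(num_classes)]
         for i in range(num_classes)]
    for src, dst in flips.items():
        T[src][src] = 1.0 - noise_rate
        T[src][dst] = noise_rate
    return T


def is_row_stochastic(T, tol=1e-9):
    """Every entry is non-negative and every row sums to one."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in T)


# Hypothetical 4-class example with a 40% noise rate.
T_sym = symmetric_transition(4, 0.4)
T_asym = pair_flip_transition(4, 0.4, {0: 1, 2: 3})
print(is_row_stochastic(T_sym), is_row_stochastic(T_asym))
```

Row i of each matrix lists the probabilities with which true label i is observed as each noisy label, so `T_asym[0][1]` is the probability that class 0 is flipped to class 1, while `T_asym[1][0]` stays zero.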
If we directly learn the classifier on the noisy data, we would obtain a class posterior predictor for noisy labels, $P(\bar{y}\mid x)$. The noise transition matrix $T$ bridges $P(\bar{y}\mid x)$ and the class posterior predictor for clean labels $P(y\mid x)$ as follows:
$$P(\bar{y}=j\mid x)=\sum_{i=1}^{c}T_{ij}\,P(y=i\mid x), \tag{3}$$
and the corresponding matrix form can be written as $P(\bar{y}\mid x)=T^{\top}P(y\mid x)$. It is easy to observe that once the noise transition matrix $T$ is obtained, we can recover the desired estimator of the clean class posterior through the softmax output $f(x;w)$, by training the classifier on noisy data with the output modified to $T^{\top}f(x;w)$. Thus the expected risk with respect to noisy data is
$$\bar{R}_{\ell}(f)=\mathbb{E}_{(x,\bar{y})\sim\bar{D}}\big[\ell(T^{\top}f(x;w),\bar{y})\big], \tag{4}$$
and the empirical risk over the noisy dataset $\bar{S}$ is
$$\hat{\bar{R}}_{\ell}(f)=\frac{1}{N}\sum_{i=1}^{N}\ell\big(T^{\top}f(x_i;w),\bar{y}_i\big). \tag{5}$$
This has been exploited to build classifier-consistent algorithms (Patrini et al., 2017; Xia et al., 2019): once the noise transition is obtained, as the number of noisy examples increases, the classifier learned via Eq. (5) converges to the optimal classifier learned from clean examples via Eq. (14).
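To make the role of the transition matrix in Eqs. (3) and (5) concrete, the sketch below transforms a predicted clean posterior into a noisy posterior and evaluates the cross-entropy against a noisy label. The 3-class matrix and the predicted posterior are made-up numbers, and the helper names are hypothetical.

```python
import math


def noisy_posterior(T, clean_posterior):
    """Matrix form of Eq. (3): entry j is sum_i T[i][j] * P(y = i | x)."""
    c = len(T)
    return [sum(T[i][j] * clean_posterior[i] for i in range(c))
            for j in range(c)]


def corrected_ce(T, clean_posterior, noisy_label, eps=1e-12):
    """Cross-entropy between the transformed prediction and a noisy label,
    i.e. one summand of the empirical risk in Eq. (5) with the CE loss."""
    q = noisy_posterior(T, clean_posterior)
    return -math.log(q[noisy_label] + eps)


# Assumed example: class 0 is mislabeled as class 1 with probability 0.2.
T = [[0.8, 0.2, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
p_clean = [0.9, 0.05, 0.05]      # classifier output for a class-0 sample

q = noisy_posterior(T, p_clean)
loss_corrected = corrected_ce(T, p_clean, noisy_label=1)
loss_plain = -math.log(p_clean[1])
print(q, loss_corrected, loss_plain)
```

With the correction, a confident and correct clean prediction is penalized less when the observed label is a plausible corruption of it, so the classifier is under less pressure to fit the noisy label.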
3.2 Existing Estimation Methods
The success of classifier-consistent algorithms depends on accurate estimation of the transition matrix. There exist two strategies to learn the matrix. One is a two-stage regime that uses the anchor point assumption (Patrini et al., 2017) to pre-estimate the noise transition and then uses it to train the classifier. An instance $x^{i}$ is assumed to be an anchor point for class $i$ if $P(y=i\mid x^{i})=1$, and it then holds that
$$P(\bar{y}=j\mid x^{i})=\sum_{k=1}^{c}T_{kj}\,P(y=k\mid x^{i})=T_{ij}, \tag{6}$$
since $P(y=k\mid x^{i})=0$ for all $k\neq i$. Thus if $P(\bar{y}\mid x)$ can be approximated by the softmax output (i.e., $\hat{P}(\bar{y}\mid x)$), $T$ can be obtained by estimating the noisy class posterior probabilities at the anchor points. To obtain such anchor points in advance, Patrini et al. (2017) designed a heuristic strategy on unsupervised samples, and Hendrycks et al. (2018) used a small set of clean samples to simulate anchor points. Once $T$ is obtained, the clean class posterior can be recovered by optimizing Eq. (5) according to classifier-consistent algorithms. However, the prior on anchor points is often hard to obtain in practice, which makes these methods difficult to use.
The other is a one-stage strategy that jointly estimates the noise transition matrix and the classifier parameters in a unified framework, where the noise transition can be modeled as a constrained linear layer (Sukhbaatar et al., 2015) or a Softmax layer (Goldberger and Ben-Reuven, 2017). For example, S-Model (Goldberger and Ben-Reuven, 2017) modeled the matrix by adding another Softmax layer to the network, whose parameters can be learned using standard techniques for neural network training. They thus trained the classifier and the Softmax layer simultaneously, directly on the noisy data. At test time, they removed the added Softmax layer and used the classifier to predict the true labels. Recently, Xia et al. (2019) proposed the T-Revision method to approximate $T$ by gradually refining a slack variable imposed on it, together with updating the classifier parameters. The main limitation of these methods is that they are easily misguided by the noisy annotations, especially in large noise cases, since they are trained directly on them.
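The two-stage idea behind Eq. (6) can be sketched as follows. This is a minimal illustration rather than the exact procedure of any cited method: it assumes one common heuristic, namely that the sample with the largest estimated noisy posterior for class i is taken as the anchor point of class i, and the function name and the toy posteriors are hypothetical.

```python
def estimate_T_from_anchors(noisy_posteriors):
    """noisy_posteriors[n][j] approximates P(noisy label = j | x_n), e.g.
    the softmax output of a network trained on the noisy data.
    For each class i, the sample with the largest noisy posterior for i is
    treated as the anchor point, and its posterior vector becomes row i."""
    num_classes = len(noisy_posteriors[0])
    T_hat = []
    for i in range(num_classes):
        anchor = max(noisy_posteriors, key=lambda p: p[i])
        T_hat.append(list(anchor))
    return T_hat


# Toy noisy posteriors for four samples and three classes (made-up values).
posteriors = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
T_hat = estimate_T_from_anchors(posteriors)
for row in T_hat:
    print(row)
```

If the selected samples are not true anchor points, their posterior vectors already mix several classes, which is exactly how inexact anchors translate into an inexact transition matrix.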
3.3 Meta Transition Adaptation Method
To alleviate the aforementioned issues of current methods, we propose a new learning strategy that uses a small set of meta data with clean labels to guide the estimation of the noise transition matrix. Specifically, we leverage a small meta data set $\{(x_i^{meta},y_i^{meta})\}_{i=1}^{M}$ with clean labels, representing the meta-knowledge of the underlying label distribution of clean samples, where $M$ is the number of meta samples and $M\ll N$. Note that such data are usually attainable in practice, in contrast to infeasible anchor point priors and the large collections of clean samples required by traditional DL methods. We then formulate the following bilevel minimization problem to jointly estimate the noise transition matrix and learn the classifier parameters:
$$T^{*}=\arg\min_{T\in\mathcal{T}}\ \frac{1}{M}\sum_{i=1}^{M}\ell^{meta}\big(f(x_i^{meta};w^{*}(T)),\,y_i^{meta}\big), \tag{7}$$
$$\text{s.t.}\quad w^{*}(T)=\arg\min_{w}\ \frac{1}{N}\sum_{i=1}^{N}\ell\big(T^{\top}f(x_i;w),\,\bar{y}_i\big), \tag{8}$$
where $\mathcal{T}$ and $\ell^{meta}$ denote the hypothesis space of $T$ and the loss function imposed on meta data, respectively. $w^{*}(T)$ represents the optimal classifier that minimizes Eq. (8) on the noisy dataset and thus depends on $T$ ($w^{*}(\cdot)$ is a functional operator with parameter $T$). We use the cross-entropy (CE) loss as both the training and the meta loss in all our experiments. Note that we treat $T$ as a training hyper-parameter, whose estimate should minimize the loss on meta data in a meta-learning manner (Finn et al., 2017; Shu et al., 2019).
We have further proved that our method can recover the ground-truth noise transition matrix with meta loss in probability under some mild conditions, and our method thus enjoys a statistical consistency property. All theoretical results and proof details are given in the supplementary material.
3.4 Generalization Error
We then show an upper bound for the estimation error, supposing that we have obtained the ground-truth noise transition matrix, by using Rademacher complexity (Mohri et al., 2018).
Theorem 1
Let be the class of real-valued networks of depth over the domain , where each parameter matrix has Frobenius norm at most , and the activation function is 1-Lipschitz, positive-homogeneous and applied element-wise (such as the ReLU). Suppose the loss function is the CE loss; then for any , with probability at least , it holds that:
The proof is presented in the supplementary file. As can be seen, although we append an extra noise transition adaptation element to the traditional CE loss, the derived generalization error bound is no larger than those derived for algorithms employing the CE loss alone, implying that learning with the transition matrix does not require more training samples to achieve good generalization.
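3.5 Algorithm for Estimating $T$ and $w$
Estimating the optimal $T^{*}$ and $w^{*}$ requires two nested loops of optimization (Eqs. (7) and (8)), for which the exact solution is expensive to obtain (Franceschi et al., 2018). We thus employ SGD, as in conventional DNN implementations, to approximately solve our problem in a mini-batch updating manner (Finn et al., 2017; Shu et al., 2019), jointly improving the noise transition $T$ and the classifier parameters $w$ of the DNN classifier $f(x;w)$.
Estimating $T$. At iteration step $t$, we first adjust the noise transition matrix according to the classifier parameters $w^{(t)}$ and the noise transition matrix $T^{(t)}$ obtained in the last step, by minimizing the meta loss defined in Eq. (7). SGD is employed to optimize the meta loss on a mini-batch containing $m$ meta samples, i.e.,
$$T^{(t+1)}=T^{(t)}-\beta\,\nabla_{T}\,\frac{1}{m}\sum_{i=1}^{m}\ell^{meta}\big(f(x_i^{meta};\hat{w}^{(t)}(T)),\,y_i^{meta}\big)\Big|_{T^{(t)}}, \tag{9}$$
where the following equation is used to formulate $\hat{w}^{(t)}(T)$ on a mini-batch containing $n$ training samples,
$$\hat{w}^{(t)}(T)=w^{(t)}-\alpha\,\nabla_{w}\,\frac{1}{n}\sum_{i=1}^{n}\ell\big(T^{\top}f(x_i;w),\,\bar{y}_i\big)\Big|_{w^{(t)}}. \tag{10}$$
The above learning process is inspired by MAML (Finn et al., 2017), and $\alpha$, $\beta$ represent the step sizes.
Updating $w$. Once the noise transition matrix $T^{(t+1)}$ is obtained, the classifier parameters can then be updated by:
$$w^{(t+1)}=w^{(t)}-\alpha\,\nabla_{w}\,\frac{1}{n}\sum_{i=1}^{n}\ell\big((T^{(t+1)})^{\top}f(x_i;w),\,\bar{y}_i\big)\Big|_{w^{(t)}}. \tag{11}$$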
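The alternating updates of Eqs. (9)-(11) can be sketched on a toy problem as follows. This is only an illustration under several assumptions that are not taken from the original implementation: the classifier is a linear softmax model, the transition matrix is parameterized by a row-wise softmax over free logits, central finite differences stand in for automatic differentiation, and all names and the toy data are hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(z - m) for z in v]
    s = sum(e)
    return [z / s for z in e]


def predict(w, x, c):
    """Softmax output of a linear classifier; w is a flat list of c*len(x) weights."""
    d = len(x)
    return softmax([sum(w[k * d + j] * x[j] for j in range(d)) for k in range(c)])


def transition(a, c):
    """Row-stochastic matrix from a row-wise softmax over the flat logits a."""
    return [softmax(a[i * c:(i + 1) * c]) for i in range(c)]


def train_loss(w, a, batch, c):
    """Mean CE between the transformed prediction and the noisy labels (Eq. (8))."""
    T = transition(a, c)
    total = 0.0
    for x, y_noisy in batch:
        p = predict(w, x, c)
        total -= math.log(sum(T[i][y_noisy] * p[i] for i in range(c)) + 1e-12)
    return total / len(batch)


def meta_loss(w, meta_batch, c):
    """Mean CE between the prediction and the clean meta labels (Eq. (7))."""
    return -sum(math.log(predict(w, x, c)[y] + 1e-12)
                for x, y in meta_batch) / len(meta_batch)


def num_grad(fn, v, eps=1e-4):
    """Central finite-difference gradient (stand-in for autodiff)."""
    grad = []
    for k in range(len(v)):
        up = v[:k] + [v[k] + eps] + v[k + 1:]
        down = v[:k] + [v[k] - eps] + v[k + 1:]
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def sgd(v, grad, lr):
    return [z - lr * g for z, g in zip(v, grad)]


def meta_step(w, a, batch, meta_batch, c, alpha=0.5, beta=0.5):
    # Eq. (10): virtual classifier step as a function of the transition logits.
    def look_ahead(a_var):
        return sgd(w, num_grad(lambda u: train_loss(u, a_var, batch, c), w), alpha)
    # Eq. (9): update the transition logits to reduce the meta loss.
    a_new = sgd(a, num_grad(lambda u: meta_loss(look_ahead(u), meta_batch, c), a), beta)
    # Eq. (11): real classifier step with the updated transition matrix.
    w_new = sgd(w, num_grad(lambda u: train_loss(u, a_new, batch, c), w), alpha)
    return w_new, a_new


# Toy data: class 0 lies at x0 = -1 and class 1 at x0 = +1;
# 40% of the class-0 training labels are flipped to class 1.
train = []
for k in range(20):
    train.append(([-1.0, 0.1 * (k % 4)], 1 if k % 5 < 2 else 0))
    train.append(([1.0, 0.1 * (k % 4)], 1))
meta = [([-1.0, 0.1 * k], 0) for k in range(4)] + [([1.0, 0.1 * k], 1) for k in range(4)]

c = 2
w = [0.0] * 4                      # classifier weights
a = [2.0, 0.0, 0.0, 2.0]           # transition logits, biased towards identity
initial_meta = meta_loss(w, meta, c)
for _ in range(30):
    w, a = meta_step(w, a, train, meta, c)
final_meta = meta_loss(w, meta, c)
T_learned = transition(a, c)
print(initial_meta, final_meta, T_learned)
```

In a real implementation the two finite-difference gradients would be replaced by automatic differentiation through the look-ahead step, which is what makes the second-order term in Eq. (9) affordable for deep networks.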
The Meta Transition Adaptation learning algorithm is summarized in Algorithm 1. All gradient computations can be efficiently implemented by automatic differentiation and easily generalized to any deep learning architecture. The algorithm can be easily implemented using popular deep learning frameworks like PyTorch (Paszke et al., 2019). Both the classifier and the noise transition matrix are gradually improved during learning based on their values from the previous step, and the noise transition matrix can thus be updated in a stable manner.
Table 1: Test accuracy (%, mean ± std over 5 random runs) on CIFAR-10 and CIFAR-100 under symmetric (Sym.) and asymmetric (Asym.) noise at different noise rates.
Dataset | Method | Sym. 0 | Sym. 0.2 | Sym. 0.4 | Sym. 0.6 | Sym. 0.8 | Asym. 0.2 | Asym. 0.4 | Asym. 0.6 | Asym. 0.8
CIFAR-10 | CE | 94.16±0.25 | 86.38±0.99 | 77.52±0.41 | 73.63±0.85 | 50.31±2.14 | 83.60±0.24 | 77.85±0.98 | 69.69±0.72 | 55.20±0.28
CIFAR-10 | Fine-tuning | 94.40±0.14 | 87.47±0.80 | 82.23±0.44 | 78.10±0.59 | 51.44±3.86 | 92.09±0.14 | 89.96±0.24 | 75.61±2.91 | 60.29±1.46
CIFAR-10 | GCE | 91.73±0.14 | 89.99±0.16 | 87.31±0.53 | 82.15±0.47 | 57.36±2.08 | 89.75±1.53 | 87.75±0.36 | 67.21±3.64 | 57.46±0.31
CIFAR-10 | Forward | 94.33±0.31 | 88.26±0.22 | 83.23±0.56 | 78.19±1.12 | 61.66±3.54 | 91.34±0.28 | 89.87±0.61 | 87.24±0.96 | 81.07±1.92
CIFAR-10 | GLC | 94.43±0.27 | 90.06±0.30 | 86.78±0.45 | 82.52±0.76 | 62.40±0.14 | 92.87±0.16 | 91.80±0.24 | 90.95±0.06 | 90.02±0.60
CIFAR-10 | S-Model | 94.39±0.46 | 90.21±0.14 | 87.92±2.01 | 81.99±0.21 | 57.08±0.23 | 90.86±0.15 | 84.87±0.27 | 67.89±0.46 | 56.17±1.24
CIFAR-10 | T-Revision | 93.86±0.11 | 90.66±0.12 | 87.88±0.23 | 83.45±0.68 | 57.94±1.56 | 92.48±0.28 | 91.76±0.12 | 89.20±0.69 | 84.04±1.13
CIFAR-10 | MW-Net | 93.90±0.15 | 90.90±0.66 | 87.02±0.86 | 82.98±0.30 | 65.43±1.51 | 92.69±0.24 | 90.17±0.11 | 68.55±0.76 | 58.29±1.33
CIFAR-10 | Ours | 94.65±0.03 | 92.54±0.17 | 89.73±0.41 | 85.97±0.10 | 72.41±0.32 | 93.65±0.05 | 93.17±0.13 | 92.57±0.18 | 91.57±0.28
CIFAR-100 | CE | 76.10±0.24 | 60.38±0.75 | 46.92±0.51 | 31.82±1.16 | 8.29±3.24 | 61.05±0.11 | 50.30±1.11 | 37.34±1.80 | 12.46±0.43
CIFAR-100 | Fine-tuning | 76.74±0.26 | 64.45±0.43 | 52.69±1.35 | 38.52±1.05 | 18.95±0.44 | 65.35±0.80 | 53.11±0.64 | 41.40±0.43 | 19.63±0.30
CIFAR-100 | GCE | 71.97±0.45 | 68.02±1.05 | 64.18±0.30 | 54.46±0.31 | 15.61±0.97 | 66.15±0.44 | 56.85±0.72 | 40.58±0.47 | 15.82±0.63
CIFAR-100 | Forward | 76.45±0.03 | 63.71±0.49 | 49.34±0.60 | 37.90±0.76 | 9.57±1.01 | 64.97±0.47 | 52.37±0.71 | 44.58±0.60 | 15.84±0.62
CIFAR-100 | GLC | 76.55±0.07 | 66.30±0.62 | 59.25±0.69 | 50.86±0.57 | 15.07±0.78 | 70.83±0.25 | 66.47±0.58 | 54.82±0.99 | 28.18±1.88
CIFAR-100 | S-Model | 73.69±0.18 | 64.61±0.95 | 60.36±0.45 | 35.88±4.47 | 7.61±0.82 | 66.64±0.44 | 52.26±0.17 | 42.96±0.18 | 14.95±0.60
CIFAR-100 | T-Revision | 76.12±0.26 | 68.52±0.52 | 61.56±0.37 | 42.48±0.13 | 7.66±0.25 | 69.57±0.12 | 61.80±0.41 | 44.54±1.62 | 17.10±0.22
CIFAR-100 | MW-Net | 74.93±0.42 | 69.95±0.40 | 65.45±0.45 | 55.42±1.36 | 21.37±0.56 | 66.73±0.34 | 59.53±0.40 | 52.24±0.95 | 17.41±0.52
CIFAR-100 | Ours | 76.75±0.09 | 72.58±0.13 | 68.77±0.17 | 57.85±0.51 | 21.78±0.42 | 74.74±0.08 | 71.58±0.15 | 61.16±0.43 | 33.31±0.78
4 Experimental Results
To evaluate the capability of the proposed algorithm, we conduct simulated experiments on CIFAR-10, CIFAR-100 and Tiny-ImageNet, as well as on the large-scale real-world noisy dataset Clothing1M.
4.1 Experimental Setup
Datasets. We first verify the effectiveness of our method on two benchmark datasets, CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), consisting of 32×32 color images arranged in 10 and 100 classes, respectively. Both datasets contain 50,000 training and 10,000 test images. We randomly select 1,000 clean images in the validation set as meta data. We also verify our method on a larger and harder dataset, Tiny-ImageNet (T-ImageNet for short), containing 200 classes with 100K training, 10K validation and 10K test images of size 64×64. We randomly sample 10 clean images per class as meta data. These datasets are widely used for evaluating learning with noisy labels in the previous literature (Patrini et al., 2017; Goldberger and Ben-Reuven, 2017).
Noise setting. We test two types of label noise: symmetric and asymmetric (class-dependent). Symmetric label noise is generated by flipping the labels of a given proportion of training samples uniformly to one of the other class labels (Zhang et al., 2017a). For asymmetric noise on CIFAR-10, we use the setting in (Yao et al., 2019). Concretely, we set a probability of flipping a label to that of a similar class, i.e., truck → automobile, bird → airplane, deer → horse, cat ↔ dog. For CIFAR-100, a similar flip probability is set, but label flips only happen within each super-class, as described in (Hendrycks et al., 2018). For T-ImageNet, we adopt the noise setting in (Yu et al., 2019), where labelers also make mistakes only within very similar classes. Graphical illustrations of the asymmetric noise for CIFAR-10 and T-ImageNet can be found in the supplementary file.
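The two noise types can be simulated with a few lines of Python. The sketch below is an illustration rather than the script used in the experiments: the function names are hypothetical, and treating the cat/dog pair as a two-way flip is an assumption that follows the common CIFAR-10 asymmetric setting.

```python
import random

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

# Assumed class-dependent flips between visually similar classes.
CIFAR10_FLIPS = {"truck": "automobile", "bird": "airplane",
                 "deer": "horse", "cat": "dog", "dog": "cat"}


def corrupt_symmetric(labels, noise_rate, classes, rng):
    """With probability noise_rate, replace a label by one of the
    other class labels chosen uniformly at random."""
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([k for k in classes if k != y]))
        else:
            noisy.append(y)
    return noisy


def corrupt_asymmetric(labels, noise_rate, flips, rng):
    """With probability noise_rate, flip a label to its similar class;
    classes without an entry in flips are left untouched."""
    return [flips[y] if y in flips and rng.random() < noise_rate else y
            for y in labels]


rng = random.Random(0)
clean = ["truck"] * 1000 + ["frog"] * 1000
noisy = corrupt_asymmetric(clean, 0.4, CIFAR10_FLIPS, rng)
flipped = sum(1 for y in noisy[:1000] if y == "automobile")
print(flipped / 1000)   # empirical flip rate for "truck", close to 0.4
```

Under the asymmetric setting only the listed classes are ever corrupted, which is why a single weighting or correction rule shared by all classes fits this noise poorly.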
Table 2: Test accuracy (%) on T-ImageNet under symmetric (Sym.) and asymmetric (Asym.) noise at different noise rates.
Method | Sym. 0 | Sym. 0.2 | Sym. 0.4 | Sym. 0.6 | Asym. 0.2 | Asym. 0.4 | Asym. 0.6
CE | 54.10 | 43.94 | 35.14 | 16.45 | 45.83 | 34.95 | 16.24
Fine-tuning | 54.52 | 45.69 | 38.06 | 16.60 | 48.57 | 37.17 | 18.79
GCE | 50.20 | 46.77 | 41.27 | 19.38 | 47.05 | 34.24 | 14.85
Forward | 54.17 | 46.40 | 37.11 | 24.98 | 49.08 | 37.71 | 19.90
GLC | 54.28 | 48.71 | 42.46 | 25.50 | 49.66 | 40.57 | 31.19
S-Model | 54.32 | 46.88 | 37.12 | 22.81 | 47.01 | 32.94 | 16.70
T-Revision | 51.79 | 41.70 | 37.04 | 26.44 | 49.63 | 35.02 | 18.87
MW-Net | 53.58 | 48.31 | 43.33 | 32.23 | 50.14 | 35.68 | 18.97
Ours | 54.54 | 49.85 | 43.35 | 29.22 | 51.12 | 43.51 | 36.32
Baselines. The compared methods include: 1) CE, which uses the CE loss to train the DNNs on noisy datasets; 2) Fine-tuning, which fine-tunes the result of CE on the meta data to further enhance its performance; 3) GCE (Zhang and Sabuncu, 2018), which employs a robust loss combining the benefits of the CE loss and the mean absolute error loss against label noise; 4) Forward (Patrini et al., 2017), which estimates the noise transition matrix in an unsupervised manner; 5) GLC (Hendrycks et al., 2018), which estimates the noise transition matrix using a small set of clean-label data; 6) S-Model (Goldberger and Ben-Reuven, 2017), which uses a Softmax layer to model the noise transition matrix; 7) T-Revision (Xia et al., 2019), which learns the noise transition matrix by adding a slack variable to adjust the initialized matrix; 8) MW-Net (Shu et al., 2019), which uses an MLP to learn the weighting function in a data-driven fashion. The meta data are used as a validation set in these methods, except for Fine-tuning and MW-Net. Note that methods 4&5, 6&7 and 8 above represent the SOTA two-stage and one-stage noise transition estimation methods, and the SOTA meta-learning method for robust DL on noisy samples, respectively.
Network structure. We use ResNet-34 (He et al., 2016) as our classifier network for the CIFAR-10 and CIFAR-100 datasets, following (Patrini et al., 2017; Xia et al., 2019), and an 18-layer Pre-act ResNet (He et al., 2016) for T-ImageNet.
Experimental setup. We train the models using SGD with momentum 0.9, weight decay, and a mini-batch size of 128. The initial learning rate is decayed by a factor of 0.1 at epochs 80 and 100, for a total of 120 epochs. We initialize the softmax parameters of our algorithm with the estimation results of GLC.
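The step decay of the learning rate can be written as a small helper. This is only a sketch: the initial learning rate is not given here, so `base_lr` is left as a parameter, and the convention that the decay takes effect at the start of the (0-indexed) epochs 80 and 100 is an assumption of this example.

```python
def step_lr(epoch, base_lr, milestones=(80, 100), gamma=0.1):
    """Learning rate used during `epoch` (0-indexed): base_lr multiplied
    by gamma once for every milestone that has been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Relative learning rate (base_lr = 1.0) at a few epochs of a 120-epoch run.
for epoch in (0, 79, 80, 99, 100, 119):
    print(epoch, step_lr(epoch, base_lr=1.0))
```

Passing `base_lr=1.0` returns the decay factor itself, which can then be multiplied by whatever initial learning rate is used.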
4.2 Evaluation on Robustness Performance
Results on CIFAR-10 and CIFAR-100. The classification accuracies on CIFAR-10 and CIFAR-100 under symmetric and asymmetric noise, averaged over 5 random runs, are reported in Table 1. As can be seen, our proposed algorithm achieves the best performance in all cases, although for CIFAR-100 with 80% symmetric noise its margin over MW-Net (21.78% vs. 21.37%) is within one standard deviation. Even at large noise ratios, our algorithm remains competitive. For example, under 80% symmetric noise on CIFAR-10 and 60% asymmetric noise on CIFAR-100, our algorithm reaches 72.41% and 61.16%, outperforming the best baseline results by about 7 and 6 percentage points, respectively. This demonstrates the robustness of our method to different types and levels of noise.
From Table 1 it can be found that: 1) Our algorithm evidently improves on Forward and GLC, especially in large noise cases. Their weaker performance is possibly caused by inaccurate pre-assumed anchor points, since exact anchor points are hardly available in real cases. By comparison, our algorithm dynamically adjusts the transition matrix so that its estimate is gradually improved under the guidance of meta data, even though it is initialized with the result of GLC. 2) S-Model behaves well when the noise ratio is small but degrades quickly as the noise ratio becomes large, as does T-Revision. This can be explained by the fact that large noise makes it easy to fall into a wrong estimation, as illustrated in Section 4.3 and Table 3. Though sharing the same initialization with them, our method avoids falling into a wrong estimation and still performs well, because the meta data keep it from being trapped by noisy samples. In particular, under 80% symmetric noise on CIFAR-100, both of them underperform CE, while our method achieves a clear improvement. 3) MW-Net produces competitive results under symmetric noise compared with our algorithm. However, its performance degrades quickly under asymmetric noise, since in this method all classes share one weighting function, which is unreasonable when the noise is asymmetric. Instead, our method adaptively fits different noise types and noise rates and gradually improves the estimation. 4) It is interesting to see that our method performs better than CE and Fine-tuning even in the no-noise scenario. We discuss this phenomenon in Section 5.
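Table 3: Estimation error of the transition matrix for the compared methods and ours on CIFAR-10 and CIFAR-100 at different noise rates.
Method | CIFAR-10 0.2 | CIFAR-10 0.4 | CIFAR-10 0.6 | CIFAR-10 0.8 | CIFAR-100 0.2 | CIFAR-100 0.4 | CIFAR-100 0.6 | CIFAR-100 0.8
Forward | 0.163 | 0.197 | 0.209 | 0.342 | 0.446 | 0.701 | 0.727 | 1.691
GLC | 0.051 | 0.093 | 0.163 | 0.206 | 0.251 | 0.515 | 0.563 | 0.676
S-Model | 0.233 | 0.278 | 0.297 | 0.363 | 1.071 | 1.355 | 1.539 | 1.806
T-Revision | 0.081 | 0.120 | 0.195 | 0.265 | 0.346 | 0.795 | 1.257 | 1.699
Ours | 0.046 | 0.058 | 0.068 | 0.097 | 0.188 | 0.273 | 0.297 | 0.323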
Results on T-ImageNet. To verify our method in a more complex scenario, Table 2 summarizes the test accuracy on T-ImageNet under different noise settings. Similar to the CIFAR experiments, for both noise types and all noise rates, our algorithm outperforms all baselines except under 60% symmetric noise, where MW-Net beats our algorithm but all methods have effectively lost efficacy. When MW-Net is applied to the more complicated asymmetric noise case at the same noise level, however, it degrades substantially, whereas our method still performs consistently well. This further substantiates the robustness of our method.
4.3 How the Noise Transition Matrix Adapts
To understand how our algorithm automatically adjusts the noise transition matrix under the guidance of meta data, Table 3 summarizes the estimation error of the transition matrix for the compared methods and ours. It can be observed that our method estimates the transition matrix more accurately. Specifically, the matrices learned by Forward and GLC are worse than ours, since the anchor points they find are likely to be inexact, whereas our method improves the inexact estimation of GLC towards the ground-truth solution under the guidance of the meta data. On the other hand, although sharing the same initial values with ours, the matrices learned by S-Model fall into a bad estimation more easily as the noise ratio increases, leading to poor performance compared with ours. T-Revision also moves in a bad direction, although its deterioration is slowed down by the control of the revision. Besides, T-Revision deteriorates faster on CIFAR-100 than on CIFAR-10. Therefore, the matrices estimated by our method are more accurate, which naturally leads to more robust performance than the compared methods.
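The exact definition of the estimation error is not stated in this section. As a hedged illustration, the sketch below assumes a relative entry-wise L1 error, i.e., the sum of absolute differences between the estimated and true matrices divided by the entry-wise L1 norm of the true matrix; for row-stochastic matrices this quantity lies between 0 and 2. The function name is hypothetical.

```python
def transition_error(T_true, T_hat):
    """Relative entry-wise L1 error between two transition matrices
    (assumed metric): sum|T_true - T_hat| / sum|T_true|."""
    num = sum(abs(t - h)
              for row_t, row_h in zip(T_true, T_hat)
              for t, h in zip(row_t, row_h))
    den = sum(abs(t) for row in T_true for t in row)
    return num / den


identity = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
close = [[0.9, 0.1], [0.2, 0.8]]

err_exact = transition_error(identity, identity)   # perfect estimate
err_worst = transition_error(identity, swapped)    # every label mass misplaced
err_close = transition_error(identity, close)
print(err_exact, err_worst, err_close)
```

Under this assumed metric an error above 1 means that, on average, more than half of each row's probability mass is assigned to the wrong entries.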
Table 4: Top-1 test accuracy (%) on Clothing1M.
Method | CE | GCE | Forward | GLC | S-Model | T-Revision | MW-Net | Ours
Accuracy | 68.94 | 69.75 | 70.83 | 74.26 | 70.36 | 74.18 | 73.72 | 75.59
4.4 Experiments on a Real-world Noisy Dataset
We then verify the applicability of our algorithm on a real-world large-scale noisy dataset, Clothing1M (Xiao et al., 2015), which contains 1 million images of clothing from online shopping websites in 14 classes, e.g., T-shirt, Shirt, Knitwear. The labels are generated from the text surrounding the images and are thus extremely noisy. The dataset also provides 50k, 14k and 10k manually refined clean samples for training, validation and testing, respectively; we do not use the 50k clean training data, and use the validation set as the meta data set. Following previous works (Patrini et al., 2017; Tanaka et al., 2018), we use ResNet-50 pre-trained on ImageNet. For preprocessing, we resize each image, crop its center as input, and perform normalization. We train the model using SGD with momentum 0.9, weight decay, an initial learning rate of 0.0001, and batch size 100. The learning rate is divided by 10 after 5 epochs (for a total of 10 epochs).
The results are summarized in Table 4 in terms of top-1 accuracy. Our method outperforms all baselines. Fig. 1 shows some examples of the top-5 predictions produced by CE and by our method. The top-1 prediction of the CE method overfits to the noisy annotations (red labels), while its second prediction indicates the latent clean labels (green labels), reflecting the ambiguity of the sample labels in this dataset. By comparison, our method recovers the true labels well by taking advantage of the learned noise transition matrix. For example, the label of the image in the first row of Fig. 1 should be “T-shirt”, while the annotated label is “underwear”. The CE method assigns 94.2% confidence to “underwear”, being completely trapped by the noisy sample, whereas our method outputs the label “T-shirt” with high confidence and suppresses the noisy label “underwear”, benefiting from the learned noise transition matrix.
5 Relation to Label Distribution Learning
It can be observed that our method outperforms CE and Fine-tuning in Tables 1 and 2 even in the no-noise cases, which might be attributed to its intrinsic label distribution learning (LDL) capability (Geng, 2016; Peterson et al., 2019). LDL was first proposed by Geng et al. (2013) and extends single-label and multi-label annotation to a label distribution. Hinton et al. (2015) used knowledge distillation to provide smoothed softmax probabilities that enhance the performance of the student network. Label smoothing (Szegedy et al., 2016) and mixup (Zhang et al., 2017b) have also been proposed to replace one-hot hard labels with soft labels. Recently, Peterson et al. (2019) presented CIFAR10H, a dataset with a full distribution of human labels, and used it to improve the accuracy and robustness of a model compared with training on hard labels.
When there are no noisy labels, our method can be interpreted as approximating the ground-truth label distribution. Specifically, a hard label corresponds to the most probable label but loses the full label distribution, i.e., the probabilities that humans allocate to the other classes. Eq. (8) can therefore be interpreted as follows: the observed data distribution with hard labels is obtained by transforming the underlying data distribution with the full label distribution (soft labels) through the transition matrix . The underlying conditional data distribution should be robust on unseen data, i.e., it should minimize the CE loss over unobserved data (meta data) and thus bring better generalization and robustness, as validated in (Peterson et al., 2019). Minimizing Eq. (7) can thus be regarded as searching for the transition matrix that helps the classifier recover the underlying conditional data distribution. It is therefore reasonable that our method outperforms CE and Fine-tuning even with fewer training samples.
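The difference between one-hot hard labels and soft labels can be made concrete with a small sketch. It uses the standard label-smoothing rule (mixing the one-hot vector with a uniform distribution) and the cross-entropy between a target distribution and a prediction. The smoothing strength `eps = 0.1` and the two example predictions are illustrative values, not settings from the paper.

```python
import math


def smooth_labels(label, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) * one_hot(label) + eps / num_classes."""
    return [
        (1.0 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


def cross_entropy(target, prediction):
    """Cross-entropy -sum_k target[k] * log(prediction[k])."""
    return -sum(t * math.log(p) for t, p in zip(target, prediction) if t > 0.0)


hard = smooth_labels(0, 3, eps=0.0)  # one-hot target [1, 0, 0]
soft = smooth_labels(0, 3, eps=0.1)  # smoothed target
overconfident = [0.998, 0.001, 0.001]
calibrated = [0.9, 0.05, 0.05]

for name, pred in [("overconfident", overconfident), ("calibrated", calibrated)]:
    print(name, cross_entropy(hard, pred), cross_entropy(soft, pred))
```

Under the hard target the overconfident prediction has the lower loss, while under the soft target the calibrated prediction is preferred, which is the intuition behind training with a label distribution instead of a single label.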
Furthermore, to verify that our method can deliver the knowledge of the latent label distribution, we follow the generalization and robustness experiments in (Peterson et al., 2019) and compare with Soft and Hard, i.e., models trained with human-uncertainty soft labels and one-hot hard labels, respectively. The results are shown in Fig. 2 and Table 5. For the generalization experiment (Section 5 in (Peterson et al., 2019)), we train ResNet-110 on 9,900 test images, treat the remaining 100 images (10 randomly chosen images per class) as meta data, and evaluate on the 50,000-image CIFAR-10 training set, the CIFAR-10.1 v6 and v4 datasets (Recht et al., 2018), and the CINIC-10 dataset (Darlow et al., 2018). The accuracy of our method is very close to that of Soft labels, as seen in Fig. 2(a), and its CE metric (which evaluates how confident the top prediction of a model is and whether its distribution over alternative categories is sensible) is evidently better than that of Hard labels, as seen in Fig. 2(b). These results show that, compared with Hard labels, our method improves the generalization of the learned classifier as the test datasets become increasingly out-of-distribution.
For the robustness experiment, we pretrain ResNet-110 on 49,900 CIFAR-10 training images, treating the remaining 100 images (10 randomly chosen images per class) as meta data, and then fine-tune the pretrained model on the 10,000 CIFAR-10 test images. The FGSM attack results (Kurakin et al., 2016) are reported in Table 5, averaged over all 10,000 images in the CIFAR-10 test set. Note that our method obtains higher accuracy and lower CE loss than Hard labels. Fig. 2(c) plots the increase in CE loss for each training scheme under PGD attacks (Madry et al., 2018). The accuracy was driven to 0% for Hard labels and ours, and to 1% for Soft labels. However, the loss for Hard labels is driven up more rapidly than ours. These results show that our method can also improve the robustness of the model compared with Hard labels.
Table 5: FGSM attack results (accuracy and cross-entropy).
         Accuracy               Cross-entropy
Hard     Soft     Ours     Hard     Soft     Ours
26.97    31.43    32.84    4.03     2.68     3.72
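For readers unfamiliar with the FGSM attack used in the robustness experiment, the following minimal sketch shows the one-step update x_adv = x + eps * sign(grad_x loss). It uses a tiny linear logistic model as a stand-in for the ResNet in the paper; the weights, input, label, and `eps` are hypothetical values chosen for illustration.

```python
import math


def logistic_loss(w, x, y):
    """Loss -log(sigmoid(y * <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-margin))


def input_gradient(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))  # equals -y * sigmoid(-margin)
    return [scale * wi for wi in w]


def sign(v):
    """Sign of a number as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def fgsm(w, x, y, eps):
    """One-step FGSM: move each input coordinate by eps in the loss-increasing direction."""
    grad = input_gradient(w, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


w = [1.5, -2.0, 0.5]   # hypothetical linear classifier
x = [0.2, -0.1, 0.4]   # hypothetical input
y = 1                  # hypothetical label
x_adv = fgsm(w, x, y, eps=0.1)
print("clean loss:", logistic_loss(w, x, y))
print("adversarial loss:", logistic_loss(w, x_adv, y))
```

The perturbation is bounded by `eps` in every coordinate, yet the loss increases. PGD, also used in the experiment, repeats such steps with a projection back onto the allowed perturbation set.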
6 Conclusion
We have proposed a novel meta-learning method that adaptively extracts the transition matrix for robust deep learning in the presence of noisy labels. Previous methods either require a strong anchor-point prior assumption or yield inaccurate estimates misguided by wrong annotation information; in contrast, the new method yields a more robust and efficient estimate guided by a small set of meta data. A statistical consistency guarantee for correctly estimating the transition matrix can also be proved. Our empirical results show that the proposed method is more robust than the SOTA methods. In addition, we discuss its essential relationship with label distribution learning: owing to the inter-class ambiguity that generally exists in real data, our learning strategy has the potential to improve the generalization and robustness of the model over standard training on hard labels, even in real no-noise scenarios. In future work, we will try to incorporate priors on the noise structure into the transition matrix to further enhance estimation stability, e.g., by assuming a sparse transition in which corruption only happens within superclasses.
References
Unsupervised label noise modeling and loss correction. In ICML.
Stronger generalization bounds for deep nets via a compression approach. In ICML.
Spectrally-normalized margin bounds for neural networks. In NeurIPS.
Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 (473), pp. 138–156.
Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482.
Learning to predict from crowdsourced data. In UAI.
Active bias: training more accurate neural networks by emphasizing high variance samples. In NeurIPS.
Understanding and utilizing deep neural networks trained with noisy labels. In ICML.
CINIC-10 is not ImageNet or CIFAR-10. arXiv preprint arXiv:1810.03505.
Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
Bilevel programming for hyperparameter optimization and meta-learning. In ICML.
Facial age estimation by learning from label distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (10), pp. 2401–2412.
Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28 (7), pp. 1734–1748.
Deep sparse rectifier neural networks. In AISTATS.
Training deep neural-networks using a noise adaptation layer. In ICLR.
Size-independent sample complexity of neural networks. In COLT.
Co-teaching: robust training of deep neural networks with extremely noisy labels. In NeurIPS.
Deep residual learning for image recognition. In CVPR.
Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS.
Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Easy samples first: self-paced reranking for zero-example multimedia search. In ACM MM.
Self-paced learning with diversity. In NeurIPS.
MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML.
Learning deep networks from noisy labels with dropout regularization. In ICDM.
ImageNet classification with deep convolutional neural networks. In NeurIPS.
Learning multiple layers of features from tiny images. Technical report.
Self-paced learning for latent variable models. In NeurIPS.
Adversarial examples in the physical world. In ICLR.
Probability in Banach spaces: isoperimetry and processes. Vol. 23, Springer Science & Business Media.
Learning from noisy labels with distillation. In ICCV.
Learning to detect concepts from webly-labeled video data. In IJCAI.
Towards deep learning models resistant to adversarial attacks. In ICLR.
Decoupling "when to update" from "how to update". In NeurIPS.
A theoretical understanding of self-paced learning. Information Sciences 414, pp. 319–328.
Foundations of machine learning. MIT Press.
Learning with noisy labels. In NeurIPS.
A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR.
PyTorch: an imperative style, high-performance deep learning library. In NeurIPS.
Making deep neural networks robust to label noise: a loss correction approach. In CVPR.
Human uncertainty makes classification more robust. In ICCV.
Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451.
Learning to reweight examples for robust deep learning. In ICML.
Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139.
Classification with asymmetric label noise: consistency and maximal denoising. In Conference On Learning Theory, pp. 489–511.
A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS.
Learning with bad training data via iterative trimmed loss minimization. In ICML.
Meta-Weight-Net: learning an explicit mapping for sample weighting. In NeurIPS.
Small sample learning in big data era. arXiv preprint arXiv:1808.04572.
SELFIE: refurbishing unclean samples for robust deep learning. In ICML.
Training convolutional networks with noisy labels. In ICLR workshop.
Rethinking the inception architecture for computer vision. In CVPR.
Joint optimization framework for learning with noisy labels. In CVPR.
Learning to learn. Springer.
Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS.
A theory of learning with corrupted labels. Journal of Machine Learning Research 18, pp. 228–1.
Robust probabilistic modeling with Bayesian data reweighting. In ICML.
Are anchor points really indispensable in label-noise learning? In NeurIPS.
Learning from massive noisy labeled data for image classification. In CVPR.
Safeguarded dynamic label regression for noisy supervision. In AAAI.
Rademacher complexity for adversarially robust generalization. In ICML.
How does disagreement help generalization against label corruption? In ICML.
Understanding deep learning requires rethinking generalization. In ICLR.
Mixup: beyond empirical risk minimization. In ICLR.
Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS.
Appendix A Solution of Estimating Noise Transition
In our paper, we jointly learn the noise transition matrix and the classifier by minimizing the following bilevel optimization problem (Franceschi et al., 2018; Shu et al., 2019):
(12)  
(13) 
The empirical version of the above, which is used in the main paper, can be written as follows:
(14)  
(15) 
We now show that the theoretical solution of the above optimization problem recovers the solution we require.
Lemma 1
Suppose is the cross-entropy loss, and , i.e., . Then by minimizing the expected risk , the optimal mapping satisfies .
Proof Minimizing the expected risk can be written as
(16) 
Using the method of Lagrange multipliers, we have
(17) 
Taking the derivative of with respect to , we have . Thus, we have
(18) 
Since and , we can easily obtain . Therefore, we have
(19) 
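Since the displayed equations of this proof are not legible here, the standard Lagrangian argument behind a statement of this kind is sketched below. The notation is assumed for this sketch and may differ from the paper's: g_i(x) is the i-th output of the mapping, P(y = i | x) is the class posterior that defines the expected risk, c is the number of classes, and lambda is the multiplier. The same steps apply whether the posterior is taken over clean or noisy labels.

```latex
% Expected cross-entropy risk under a simplex constraint (assumed notation).
\min_{g}\; \mathbb{E}_{x}\Big[-\sum_{i=1}^{c} P(y=i\mid x)\,\log g_i(x)\Big]
\quad \text{s.t.}\quad \sum_{i=1}^{c} g_i(x) = 1 .

% Lagrangian for a fixed x, with multiplier \lambda:
\mathcal{L} = -\sum_{i=1}^{c} P(y=i\mid x)\,\log g_i(x)
            + \lambda\Big(\sum_{i=1}^{c} g_i(x) - 1\Big).

% Stationarity with respect to g_i(x):
\frac{\partial \mathcal{L}}{\partial g_i(x)}
  = -\frac{P(y=i\mid x)}{g_i(x)} + \lambda = 0
\;\Longrightarrow\;
g_i(x) = \frac{P(y=i\mid x)}{\lambda}.

% Both g(x) and the posterior sum to one, so \lambda = 1 and
g_i^{*}(x) = P(y=i\mid x), \qquad i = 1,\dots,c .
```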
Theorem 2
Proof The expected risk on clean data is defined as (Bartlett et al., 2006):
(20) 
and the empirical risk over meta dataset is defined as:
(21) 
Since the meta dataset can be seen as an i.i.d. sample from the clean data, Hoeffding’s inequality implies that, , the following holds for all with probability at least
(22) 
We denote and as the learned transition matrix by minimizing Eq. (12)(13) and the underlying transition matrix, respectively. We calculate to characterize the difference between and , since is unavailable. Since
the following holds for all with probability at least
(23) 
thus can control the degree of approximation of . Next, we show that minimizing Eq. (12)(13) makes as small as possible.
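The concentration step used above can be made tangible with a short numerical sketch of the two-sided Hoeffding bound. It is stated here for a single fixed hypothesis and a loss bounded in `[0, loss_range]`, both of which are simplifying assumptions of this sketch (the cross-entropy loss is not bounded in general, and a statement holding uniformly over many hypotheses needs an additional argument). The function name and `delta = 0.05` are illustrative; the sample sizes echo the meta-set sizes mentioned in the experiments.

```python
import math


def hoeffding_deviation(num_samples, delta, loss_range=1.0):
    """Two-sided Hoeffding bound for the mean of i.i.d. values in [0, loss_range].

    With probability at least 1 - delta,
    |empirical mean - expectation| <= loss_range * sqrt(log(2 / delta) / (2 * num_samples)).
    """
    return loss_range * math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))


# Deviation bound for meta sets of different sizes.
for m in (100, 1000, 14000):
    print(m, round(hoeffding_deviation(m, delta=0.05), 4))
```

The bound shrinks at the rate 1/sqrt(m), so even a small meta set gives a usable, though loose, handle on the clean risk.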
We give a proof by contradiction. Suppose that the optimal solution of Eq. (12) cannot recover the ground-truth noise transition matrix; we show that obtained by optimizing Eq. (13) then still overfits to the label noise. Otherwise, when recovers the clean classifier, we have . However, by Lemma 1, the minimization of Eq. (12)(13) pushes to hold. This means that cannot recover the classifier on the clean data . Thus cannot attain the best performance. Minimizing Eq. (12) then pushes to be as small as possible until approaches , i.e., pushes to be as small as possible.
Appendix B Generalization Error
The results in this paper focus on Rademacher complexity (Bartlett and Mendelson, 2002; Mohri et al., 2018; Yin et al., 2019), which is a standard tool to control the uniform convergence (and hence the sample complexity) of given classes of predictors. Here, we present its formal definition. For any function class , given a sample of size , the empirical Rademacher complexity is defined as
(24) 
where are i.i.d. Rademacher random variables with . In our learning problem, denote the training sample by clean dataset and noisy dataset . The expected and empirical risks are and . We then have the following theorem, which connects the expected and empirical risks via Rademacher complexity.
Theorem 3 (Rademacher Complexity)
Suppose that the range of the loss function is . Then for any , with probability at least , the following holds for all :
(25) 
where is the Rademacher complexity and are Rademacher variables uniformly distributed from .
In our paper, our goal is to minimize the following expected risk and empirical risk with respect to the noisy data to recover the unbiased classifier,
(26)  
(27) 
Therefore, the Rademacher complexity for our problems can be expressed as follows:
Corollary 1
For any , with probability at least , the following holds for all :
where .
Here, the argument in the Rademacher complexity indicates that is chosen from the function space , which is generally determined by the function space of due to the fact that . Thus, we have the following conclusion.
Proposition 1
, where denotes the hypothesis complexity of the classifier.
Proof Firstly, we provide the following two lemmas related to our proof.
Lemma 2
The loss function is 1-Lipschitz with respect to , where is the cross-entropy loss.
Proof Since , we have
Taking the derivative of with respect to , we have
Thus, we have
and
Therefore, the loss function is 1-Lipschitz with respect to .
Note that depends on the loss function rather than on the hypothesis space; Talagrand’s Contraction Lemma (Ledoux and Talagrand, 1991; Mohri et al., 2018) connects the two.
Lemma 3 (Talagrand’s Contraction Lemma)
Let be an Lipschitz function. Then, for any hypothesis set of real-valued functions, we have
(28) 
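The empirical Rademacher complexity and the contraction lemma can be checked numerically on a toy problem. The sketch below computes the empirical Rademacher complexity of a finite function class exactly, by enumerating all sign vectors, and compares a class H with its composition with tanh, which is 1-Lipschitz. The sample points, the class h_w(x) = w * x, and the weight grid are hypothetical choices made for illustration; they are unrelated to the networks used in the paper.

```python
import itertools
import math


def empirical_rademacher(outputs):
    """Exact empirical Rademacher complexity of a finite function class.

    outputs[k][i] is the value of the k-th function on the i-th sample point.
    The expectation over sign vectors is computed by full enumeration.
    """
    n = len(outputs[0])
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in outputs)
    return total / (n * 2 ** n)


# Hypothetical sample and hypothesis class h_w(x) = w * x.
sample = [-1.2, -0.4, 0.3, 0.9, 1.5, 2.0]
weights = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
H = [[w * x for x in sample] for w in weights]

# tanh is 1-Lipschitz, so composing with it should not increase the complexity.
phi_H = [[math.tanh(v) for v in row] for row in H]

rad_H = empirical_rademacher(H)
rad_phi_H = empirical_rademacher(phi_H)
print("complexity of H:", rad_H)
print("complexity of tanh(H):", rad_phi_H)
```

Full enumeration is only feasible for a handful of sample points. For larger samples the expectation over sign vectors would have to be estimated by random sampling, in which case the comparison holds only up to Monte Carlo error.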
Now, we can prove the conclusion.