While deep neural networks (DNNs) have recently achieved remarkable success in various applications (Krizhevsky et al., 2012; He et al., 2016), their performance largely relies on pre-collected large-scale datasets with high-quality human annotations. In real-world applications, however, obtaining such data is notoriously expensive in both time and money. In practice, data labels are instead often collected from coarse annotation sources, such as crowdsourcing systems (Bi et al., 2014) or search engines (Xiao et al., 2015; Liang et al., 2016), which naturally leads to noisy (incorrect) labels in the training data. Learning with such biased training data easily leads to overfitting, thus hampering the generalization performance of the learned model (Zhang et al., 2017a).
The commonly used approaches to this robust learning issue are to select confident examples and remove suspect ones (Chang et al., 2017), or to correct noisy labels to their more probable true labels (Arazo et al., 2019). These methods, however, implicitly assume that a sample belongs to only one class, and neglect the fact that real-world label noise often stems from essential ambiguities among sample categories. Such “noisy labels” carry useful knowledge about the inter-class transition patterns that naturally exist in data annotation. Coarsely removing noisy samples, or replacing one noisy label with another, ignores this clue about how label noise is generated, and thus leaves room for further performance improvement.
This label ambiguity can be easily understood from Fig. 1, where the samples are from Clothing1M (Xiao et al., 2015), a large-scale clothing dataset built by crawling images from several online shopping websites. It represents a typical real-world label corruption scenario: there exists an unknown noise transition matrix that flips the more probable true label to other, less probable ones with certain probabilities, thereby producing noisy labels. When a DNN classifier is trained directly by treating the given sample labels as deterministic, its top-1 predictions tend to be consistent with the noisy labels, which naturally leads to overfitting, as clearly shown in the third column of Fig. 1. Estimating the underlying noise transition matrix is thus expected to help alleviate this robustness issue by capturing the real noisy label distribution and improving the quality of the trained classifier (as depicted in the fourth column of the figure).
Previous methods for noise transition matrix estimation can be roughly divided into two groups. One is to estimate this matrix in advance on pre-assumed anchor points, i.e., sample(s) certainly belonging to each class, and subsequently fix it to train the classifier. However, such prior knowledge (Scott et al., 2013; Patrini et al., 2017) is generally infeasible to obtain in practice. The other is to jointly estimate the noise transition matrix and the classifier parameters in a unified framework (Sukhbaatar et al., 2015; Goldberger and Ben-Reuven, 2017). Although this avoids the anchor point assumption, it often yields inaccurate estimates because it is misguided by wrong annotations, especially in large noise cases, as clearly shown in our experiments.
To address the above issues, this paper proposes a new meta-transition-learning strategy for learning with noisy labels. The main idea is to leverage a small set of meta data with clean labels to guide the estimation of the noise transition. In summary, this study makes the following three contributions.
We propose a new learning strategy to estimate the noise transition matrix in a meta-learning manner. Under the guidance of a small set of meta data with clean labels, the noise transition matrix and the classifier parameters can be mutually ameliorated to avoid being trapped by noisy training samples, without the need for any anchor point assumption.
We show that our method can finely estimate the desired transition matrix under the guidance of the meta data, with a statistical consistency guarantee. Comprehensive synthetic and real experiments validate that our method extracts the transition matrix underlying the data more accurately than previous SOTA methods, which naturally leads to its more robust performance.
We discuss the essential relationship between our method and label distribution learning, which explains its good performance even in no-noise scenarios. Experiments on out-of-training-distribution behavior and adversarial attacks show that our method can bring better generalization and robustness to the model.
The paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed meta learning method, as well as some of its statistical properties. Section 4 presents experimental results. Section 5 discusses the relationship between our method and label distribution learning, followed by the conclusion.
2 Related Work
Learning with Noise Transition.
The transition matrix reflects the probabilities that the most probable true labels flip into other “noisy” ones, and has previously been employed to modify loss functions to improve training performance (Natarajan et al., 2013; Scott, 2015). There are mainly two approaches to estimating the noise transition matrix. One is a two-step solution that pre-estimates the noise transition under the anchor point prior assumption and then uses it to train the classifier. For example, Patrini et al. (2017) proposed a theoretically sound loss correction method that uses pre-calculated noise transition knowledge, obtained on anchor points heuristically collected from the unsupervised dataset. Afterwards, GLC (Hendrycks et al., 2018) used a small set of pre-assumed clean-label samples to estimate the noise transition and further improve estimation stability. These methods, however, require pre-specifying instances that belong to a specific class with probability exactly one, or at least very close to one, which is often infeasible in practice. The approximate anchor points used instead tend to lead to inaccurate estimation of the matrix, and thus hamper the subsequent training accuracy.
The other approach is to jointly estimate the noise transition matrix and the classifier parameters in a unified framework without employing anchor points. Sukhbaatar et al. (2015) first learned a linear layer with a trace constraint, which pushes the linear layer to be interpreted as the transition matrix between the true and noisy labels. Jindal et al. (2016) further ameliorated the result with additional dropout regularization. Subsequently, S-Model (Goldberger and Ben-Reuven, 2017) modelled the noise transition with a Softmax layer rather than a linear one. Recently, T-Revision (Xia et al., 2019) introduced a slack variable to revise the pre-estimated matrix and validated the revision on a noisy validation set. Although these methods have a concise calculation paradigm, their accuracy tends to be hampered by misleading noisy labels, especially at heavy noise rates, as clearly shown in our experiments.
Other methods of learning with noisy labels. We also briefly introduce two typical strategies for handling noisy labels: label correction and reliable example selection. The former aims to correct noisy labels to their true ones via an inference step, such as directed graphical models (Xiao et al., 2015) or conditional random fields (Vahdat, 2017). Tanaka et al. (2018) used the network outputs to predict hard or soft labels. For the latter, Decouple (Malach and Shalev-Shwartz, 2017) selected the samples on which two networks give different label predictions, while Co-teaching (Han et al., 2018) selected small-loss samples as clean samples for each network. INCV (Chen et al., 2019) randomly divided the noisy data and then utilized cross-validation to identify clean samples by removing large-loss samples at each iteration. The other reliable example selection approach mainly adopts sample re-weighting schemes, imposing weights on samples based on their reliability for training. Typical methods include SPL (Kumar et al., 2010) and its extensions (Jiang et al., 2014a, b; Meng et al., 2017), which reduce the effects of examples with large losses and pay more attention to easy samples with smaller losses. Other methods along this line include the iterative reweighting strategy (Zhang and Sabuncu, 2018), Bayesian latent variable inference (Wang et al., 2017), and so on.
Recently, some works have tried to combine the advantages of the above two approaches. For example, SELFIE (Song et al., 2019) trained the network on selectively refurbished false-labeled samples that can be corrected with high precision, together with small-loss ones. Arazo et al. (2019) used a two-component mixture model to characterize the loss distribution of clean and noisy samples in an unsupervised way, and used mixup data augmentation to achieve noisy label correction. Shen and Sanghavi (2019) proposed to iteratively minimize the trimmed loss by selecting the samples with the lowest current loss and retraining a model on only these samples, which provably recovers the ground truth in generalized linear models.
Meta learning methods. Inspired by meta-learning developments (Schmidhuber, 1992; Thrun and Pratt, 1998; Finn et al., 2017; Shu et al., 2018, 2019), recently some methods were proposed to make DNNs robust to label noise. However, existing methods focus on learning an adaptive weighting scheme imposed on data to make the learning more automatic and reliable. Typical methods along this line include MentorNet (Jiang et al., 2018), L2RW (Ren et al., 2018) and Meta-Weight-Net (Shu et al., 2019). This paper can be seen as the first exploration of meta learning on fitting noise transition information.
3 Meta Transition Adaptation Method
We consider the problem of $c$-class classification. Let $\mathcal{X}$ be the feature space, $\mathcal{Y} = \{1, \dots, c\}$ be the label space, and $D$ and $\bar{D}$ denote the underlying data distributions with true and noisy labels, respectively. In practice, we assume that the labels of the collected training examples are independently corrupted from the true label distribution. Thus what we can obtain are the noisy training samples $\bar{S} = \{(x_i, \bar{y}_i)\}_{i=1}^{N}$, corresponding to the latent true data samples $S = \{(x_i, y_i)\}_{i=1}^{N}$. The two datasets are drawn i.i.d. from the true and noisy data distributions $D$ and $\bar{D}$, respectively.
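Assume our classifier model is a DNN architecture with $L$ layers comprising a transformation $h$, where $h = h^{(L)} \circ h^{(L-1)} \circ \cdots \circ h^{(1)}$ is the composition of a series of intermediate transformation layers $h^{(i)}$. Each $h^{(i)}$ is defined as:
$$h^{(i)}(\mathbf{z}) = \sigma\left(W^{(i)} \mathbf{z}\right),$$
where $W^{(i)}$ denote the classifier parameters to be estimated (here, we omit the bias vector in each layer), and $\sigma$ is the activation function. The network output, after a softmax, is a $c$-dimensional vector approximating the class-conditional probabilities $P(y \mid x)$. We denote it by $f(x; \mathbf{w})$, also written as $f(x)$. The expected risk on clean data is defined as (Bartlett et al., 2006):
$$R(f) = \mathbb{E}_{(x, y) \sim D}\left[\ell\left(f(x), y\right)\right],$$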
where $\ell$ is the loss function.
Since the distribution $D$ is usually unknown, we use the empirical risk over the dataset $S$ to approximate $R(f)$,
$$\hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\left(f(x_i), y_i\right).$$
In this study, we assume there are label transition probabilities between different classes, as commonly adopted in previous works (Natarajan et al., 2013; Sukhbaatar et al., 2015; Patrini et al., 2017; Goldberger and Ben-Reuven, 2017). The probability of each label $y = i$ in the training set flipping to $\bar{y} = j$ is expressed as $P(\bar{y} = j \mid y = i)$. We utilize a noise transition matrix $T \in [0, 1]^{c \times c}$ (Van Rooyen and Williamson, 2017) to represent this probability, so that $T_{ij} = P(\bar{y} = j \mid y = i)$. The matrix is row-stochastic and not necessarily symmetric across the classes.
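If we directly learn the classifier on the noisy data, we would obtain a class posterior predictor for noisy labels, $P(\bar{y} \mid x)$. The noise transition matrix $T$ bridges $P(\bar{y} \mid x)$ and the class posterior predictor for clean labels $P(y \mid x)$ as follows:
$$P(\bar{y} = j \mid x) = \sum_{i=1}^{c} T_{ij} \, P(y = i \mid x),$$
and the corresponding matrix form can be written as $P(\bar{\mathbf{y}} \mid x) = T^{\top} P(\mathbf{y} \mid x)$. It is easy to observe that once the noise transition matrix $T$ is obtained, we can recover the desired estimator of the clean class posterior from the softmax output $f(x)$ by training the noise-adapted classifier $T^{\top} f(x)$, which is obtained by modifying $f$ with $T$. Thus the expected risk with respect to noisy data is
$$\bar{R}(f) = \mathbb{E}_{(x, \bar{y}) \sim \bar{D}}\left[\ell\left(T^{\top} f(x), \bar{y}\right)\right],$$
and the empirical risk over the noisy dataset $\bar{S}$ is
$$\hat{\bar{R}}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\left(T^{\top} f(x_i), \bar{y}_i\right).$$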
This has been exploited to build classifier-consistent algorithms (Patrini et al., 2017; Xia et al., 2019), i.e., once the noise transition is obtained, as the number of noisy examples increases, the learned classifier of Eq.(5) will converge to the optimal classifier learned from clean examples of Eq.(14).
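The noise-adapted loss described above can be illustrated with a few lines of code. The following is a minimal pure-Python sketch, not the authors' implementation: the function names and the toy 2-class matrix are assumptions made for this example, and the loss is taken to be cross-entropy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noisy_posterior(T, clean_probs):
    """Map clean-label posteriors p to noisy-label posteriors q = T^T p,
    i.e. q[j] = sum_i T[i][j] * p[i]."""
    c = len(T)
    return [sum(T[i][j] * clean_probs[i] for i in range(c)) for j in range(c)]

def forward_corrected_ce(T, logits, noisy_label):
    """Cross-entropy between the noise-adapted prediction and the noisy label."""
    q = noisy_posterior(T, softmax(logits))
    return -math.log(max(q[noisy_label], 1e-12))

# Hypothetical transition matrix (rows: true label, columns: noisy label).
T = [[0.8, 0.2],
     [0.3, 0.7]]
loss = forward_corrected_ce(T, [2.0, 0.0], noisy_label=1)
```

With the identity matrix in place of `T`, `forward_corrected_ce` reduces to the ordinary cross-entropy on the given label, which is a quick sanity check for any implementation of this correction.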
3.2 Existing Estimation Methods
The success of classifier-consistent algorithms depends on the accurate estimation of the transition matrix. There exist two strategies to learn the matrix. One is a two-stage regime that utilizes the anchor point assumption (Patrini et al., 2017) to pre-estimate the noise transition and then uses it to train the classifier. An instance $x^{i}$ is assumed to be an anchor point for class $i$ if $P(y = i \mid x^{i}) = 1$, and it then holds that
$$P(\bar{y} = j \mid x^{i}) = \sum_{k=1}^{c} T_{kj} \, P(y = k \mid x^{i}) = T_{ij},$$
since $P(y = k \mid x^{i}) = 0$ for all $k \neq i$. Thus if $P(\bar{y} \mid x)$ can be approximated by the softmax output (i.e., $\hat{P}(\bar{y} \mid x)$), $T$ can be obtained by estimating the noisy class posterior probabilities for anchor points. To pre-attain such anchor points, Patrini et al. (2017) designed a heuristic strategy on unsupervised samples, and Hendrycks et al. (2018) used a small set of clean samples to simulate anchor points. Once $T$ is obtained, $f$ can be recovered by optimizing Eq.(5) according to classifier-consistent algorithms. However, the prior on anchor points is usually hard to obtain in practice, which makes these methods difficult to use.
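The two anchor-based estimators mentioned above can be sketched as follows. This is a hedged illustration only: it assumes that the noisy posteriors have already been produced by a network trained on the noisy data, the function names are hypothetical, and the exact procedures in Patrini et al. (2017) and Hendrycks et al. (2018) may differ in detail (e.g., in how anchors are filtered).

```python
def estimate_T_from_anchors(noisy_probs, num_classes):
    """Heuristic anchors: for each class i, take the sample with the largest
    predicted noisy posterior for class i and use its posterior as row i of T."""
    T = []
    for i in range(num_classes):
        anchor = max(noisy_probs, key=lambda p: p[i])
        T.append(list(anchor))
    return T

def estimate_T_from_clean(noisy_probs, clean_labels, num_classes):
    """Clean-sample variant: average the noisy posteriors of the samples whose
    trusted clean label is i to obtain row i of T."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p, y in zip(noisy_probs, clean_labels):
        counts[y] += 1
        for j in range(num_classes):
            sums[y][j] += p[j]
    return [[s / max(counts[i], 1) for s in sums[i]] for i in range(num_classes)]

# Toy noisy posteriors for four samples and their (assumed) clean labels.
noisy_probs = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.4, 0.6]]
clean_labels = [0, 0, 1, 1]
T_anchor = estimate_T_from_anchors(noisy_probs, 2)
T_clean = estimate_T_from_clean(noisy_probs, clean_labels, 2)
```

Both estimators inherit any error in the noisy posteriors, which is one way to see why inexact anchor points translate directly into an inexact transition matrix.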
The other is a one-stage strategy that jointly estimates the noise transition matrix and the classifier parameters in a unified framework, where the noise transition can be modeled as a constrained linear layer (Sukhbaatar et al., 2015) or a Softmax layer (Goldberger and Ben-Reuven, 2017). For example, S-Model (Goldberger and Ben-Reuven, 2017) modeled the matrix by adding another Softmax layer to the network, whose parameters can be learned using standard techniques for neural network training. The classifier and the Softmax layer are thus trained simultaneously and directly on the noisy data. At test time, the added Softmax layer is removed and the classifier is used to predict the true labels. Recently, Xia et al. (2019) proposed the T-Revision method to approximate $T$ by gradually ameliorating a slack variable imposed on it, together with updating the classifier parameters. The main limitation of these methods is that they are easily misguided by the noisy annotations, especially in large noise cases, since they are trained directly on them.
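One simple way to keep a learned transition matrix row-stochastic, in the spirit of the Softmax-layer modeling mentioned above, is to store unconstrained logits and apply a row-wise softmax. The sketch below is an assumption-laden illustration rather than the S-Model architecture itself; `init_logits` (a hypothetical helper) shows how such logits could be initialized from a pre-estimated matrix.

```python
import math

def row_softmax(logits_matrix):
    """Turn each row of unconstrained logits into a probability vector."""
    out = []
    for row in logits_matrix:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def init_logits(T0, eps=1e-8):
    """Logits whose row-softmax reproduces T0 (zeros are clipped to eps)."""
    return [[math.log(max(t, eps)) for t in row] for row in T0]

# Hypothetical pre-estimated matrix used only as an initialization.
T0 = [[0.8, 0.2],
      [0.3, 0.7]]
T_param = row_softmax(init_logits(T0))
```

Because the softmax is invariant to adding a constant to a row, the logits are not unique; any gradient-based update of them still yields a valid row-stochastic matrix.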
3.3 Meta Transition Adaptation Method
To alleviate the aforementioned issues of the current methods, we propose a new learning strategy, which utilizes a small set of meta data with clean labels to guide the estimation of the noise transition matrix. Specifically, we leverage a small meta data set $S^{meta} = \{(x_i^{meta}, y_i^{meta})\}_{i=1}^{M}$ with clean labels, representing the meta-knowledge of the underlying label distribution of clean samples, where $M$ is the number of meta-samples, and $M \ll N$. Note that such data are usually attainable in practice, in contrast to the infeasible anchor point priors and the large collections of clean samples required by traditional DL methods. We then formulate the following bi-level minimization problem to jointly estimate the noise transition matrix and learn the classifier parameters:
$$T^{*} = \arg\min_{T \in \mathcal{T}} \frac{1}{M} \sum_{i=1}^{M} \ell^{meta}\left(f(x_i^{meta}; \mathbf{w}^{*}(T)), y_i^{meta}\right), \tag{7}$$
$$\mathbf{w}^{*}(T) = \arg\min_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^{N} \ell\left(T^{\top} f(x_i; \mathbf{w}), \bar{y}_i\right), \tag{8}$$
where $\mathcal{T}$ and $\ell^{meta}$ denote the hypothesis space of $T$ and the loss function imposed on meta data, respectively. $\mathbf{w}^{*}(T)$ represents the optimal classifier parameters that minimize Eq.(8) on the noisy dataset and depend on $T$ (i.e., $\mathbf{w}^{*}(\cdot)$ is a functional operator with parameter $T$). We use the cross-entropy (CE) loss as both the training and the meta loss in all our experiments. Note that we treat $T$ as a training hyper-parameter, whose estimate should minimize the loss on meta data in a meta-learning manner (Finn et al., 2017; Shu et al., 2019).
We have further proved that, under some mild conditions, our method can recover the ground-truth noise transition matrix with the meta loss in probability, and thus enjoys statistical consistency. All theoretical results and proof details are given in the supplementary material.
3.4 Generalization Error
Using Rademacher complexity (Mohri et al., 2018), we then show an upper bound for the estimation error, supposing that the ground-truth noise transition matrix has been obtained.
Let be the class of real-valued networks of depth over the domain , where each parameter matrix is with Frobenius norm at most , and the activation function is 1-Lipschitz, positive-homogeneous and applied element-wise (such as the ReLU). Suppose the loss function be the CE loss, and then for any , with the probability at least , it holds that:
The proof is presented in the supplementary file. As we can see, although our method appends an extra noise transition adaptation component compared with the traditional CE loss, the derived generalization error bound is not larger than those derived for algorithms employing the CE loss, implying that learning with the transition matrix does not require more training samples to achieve good generalization.
3.5 Algorithm for Estimating
Estimating the optimal $T^{*}$ and $\mathbf{w}^{*}$ requires two nested loops of optimization (Eqs. (7) and (8)), for which obtaining the exact solution is expensive (Franceschi et al., 2018). As in conventional DNN implementations, we thus employ SGD to approximately solve the problem in a mini-batch updating manner (Finn et al., 2017; Shu et al., 2019), jointly ameliorating the noise transition matrix and the classifier parameters of the DNN classifier.
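Estimating $T$. At iteration step $t$, we first adjust the noise transition matrix according to the classifier parameters $\mathbf{w}^{(t)}$ and the noise transition matrix $T^{(t)}$ obtained in the last step, by minimizing the meta loss defined in Eq.(7). SGD is employed to optimize the meta loss on a mini-batch containing $m$ meta samples, i.e.,
$$T^{(t+1)} = T^{(t)} - \beta \, \nabla_{T} \frac{1}{m} \sum_{i=1}^{m} \ell^{meta}\left(f(x_i^{meta}; \hat{\mathbf{w}}^{(t+1)}(T)), y_i^{meta}\right)\Big|_{T^{(t)}},$$
where the following equation is used to formulate $\hat{\mathbf{w}}^{(t+1)}(T)$ on a mini-batch containing $n$ training samples,
$$\hat{\mathbf{w}}^{(t+1)}(T) = \mathbf{w}^{(t)} - \alpha \, \nabla_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \ell\left(T^{\top} f(x_i; \mathbf{w}), \bar{y}_i\right)\Big|_{\mathbf{w}^{(t)}}.$$
The above learning process is inspired by MAML (Finn et al., 2017), and $\alpha$ and $\beta$ represent the step sizes.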
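Updating $\mathbf{w}$. Once the noise transition matrix $T^{(t+1)}$ is obtained, the classifier parameters can then be updated by:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \, \nabla_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} \ell\left(\left(T^{(t+1)}\right)^{\top} f(x_i; \mathbf{w}), \bar{y}_i\right)\Big|_{\mathbf{w}^{(t)}}.$$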
The Meta Transition Adaptation learning algorithm can then be summarized in Algorithm 1. All gradient computations can be efficiently implemented by automatic differentiation techniques and easily generalized to any deep learning architecture. The algorithm can be easily implemented using popular deep learning frameworks like PyTorch (Paszke et al., 2019). It can be seen that both the classifier and the noise transition matrix are gradually ameliorated during the learning process based on their values calculated in the last step, and the noise transition matrix can thus be updated in a stable manner.
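To make the alternating updates concrete, the following toy sketch runs one iteration on a linear softmax classifier in pure Python. It is an illustration under several assumptions that are not taken from the paper: finite differences stand in for automatic differentiation, the transition matrix is parameterized by row-softmax logits `A`, and all names (`mta_step`, `virtual_W`, and so on) are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def row_softmax(A):
    return [softmax(row) for row in A]

def predict(W, x):
    # Linear classifier followed by softmax.
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def train_loss(W, A, batch):
    # Noise-adapted CE on noisy labels, with T = row_softmax(A).
    T = row_softmax(A)
    total = 0.0
    for x, y_noisy in batch:
        p = predict(W, x)
        q = sum(T[i][y_noisy] * p[i] for i in range(len(p)))
        total -= math.log(max(q, 1e-12))
    return total / len(batch)

def meta_loss(W, batch):
    # Plain CE on clean meta labels.
    return -sum(math.log(max(predict(W, x)[y], 1e-12)) for x, y in batch) / len(batch)

def num_grad(fn, M, eps=1e-5):
    # Central finite differences on a copy of M (stand-in for autodiff).
    M = [row[:] for row in M]
    G = [[0.0] * len(row) for row in M]
    for i, row in enumerate(M):
        for j in range(len(row)):
            old = row[j]
            row[j] = old + eps
            up = fn(M)
            row[j] = old - eps
            down = fn(M)
            row[j] = old
            G[i][j] = (up - down) / (2 * eps)
    return G

def sgd_step(M, G, lr):
    return [[m - lr * g for m, g in zip(mr, gr)] for mr, gr in zip(M, G)]

def mta_step(W, A, train_batch, meta_batch, alpha=0.1, beta=0.1):
    def virtual_W(A_):
        # One look-ahead SGD step on the noisy batch, as a function of A_.
        g = num_grad(lambda W_: train_loss(W_, A_, train_batch), W)
        return sgd_step(W, g, alpha)
    # Meta update: differentiate the meta loss through the look-ahead step.
    g_A = num_grad(lambda A_: meta_loss(virtual_W(A_), meta_batch), A)
    A_new = sgd_step(A, g_A, beta)
    # Classifier update with the refreshed transition parameters.
    g_W = num_grad(lambda W_: train_loss(W_, A_new, train_batch), W)
    return sgd_step(W, g_W, alpha), A_new

# Toy data: (feature vector, label) pairs; values are made up.
train_batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.2], 1), ([0.1, 1.0], 1)]
meta_batch = [([1.0, 0.1], 0), ([0.0, 1.0], 1)]
W0 = [[0.0, 0.0], [0.0, 0.0]]
A0 = [[1.0, 0.0], [0.0, 1.0]]
W1, A1 = mta_step(W0, A0, train_batch, meta_batch)
T1 = row_softmax(A1)
```

In a real implementation the nested finite differences would be replaced by differentiating through the look-ahead step with the framework's automatic differentiation, which is what makes the procedure practical for deep networks.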
|Datasets||Methods||Symmetric Noise||Asymmetric Noise|
|Noise Rate||Noise Rate|
|Forward||94.33±0.31||88.26±0.22||83.23±0.56||78.19±1.12||61.66±3.54||91.34±0.28||89.87±0.61||87.24±0.96||81.07±1.92|
4 Experimental Results
To evaluate the capability of the proposed algorithm, we conduct simulated experiments on CIFAR-10, CIFAR-100 and Tiny-ImageNet, as well as on a large-scale real-world noisy dataset, Clothing1M.
4.1 Experimental Setup
Datasets. We first verify the effectiveness of our method on two benchmark datasets: CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), consisting of $32 \times 32$ color images arranged in 10 and 100 classes, respectively. Both datasets contain 50,000 training and 10,000 test images. We randomly select 1,000 clean images in the validation set as meta data. We also verify our method on a larger and harder dataset called Tiny-ImageNet (T-ImageNet for short), containing 200 classes with 100K training, 10K validation and 10K test images of size $64 \times 64$. We randomly sample 10 clean images per class as meta data. These datasets are popularly used for evaluating learning with noisy labels in the previous literature (Patrini et al., 2017; Goldberger and Ben-Reuven, 2017).
Noise setting. We test two types of label noise: symmetric and asymmetric (class-dependent) noise. Symmetric label noise is generated by flipping the labels of a given proportion of training samples to one of the other class labels uniformly (Zhang et al., 2017a). For asymmetric noise on CIFAR-10, we use the setting in (Yao et al., 2019). Concretely, we set a probability of disturbing a label to its similar class, for the class pairs (truck, automobile), (bird, airplane), (deer, horse) and (cat, dog). For CIFAR-100, a similar probability is set, but label flips only happen within each super-class, as described in (Hendrycks et al., 2018). For T-ImageNet, we adopt the noise setting in (Yu et al., 2019), where labelers also make mistakes only within very similar classes. Graph illustrations of the asymmetric noise for CIFAR-10 and T-ImageNet can be found in the supplementary file.
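The two synthetic noise types can be expressed as transition matrices and used to corrupt labels, as in the sketch below. This is only an illustration: it samples each noisy label independently from the corresponding row of the matrix (so the realized flip proportion matches the noise rate only in expectation), and the function names, class indices, and flip direction in `pair_flip_T` are assumptions rather than the exact protocol of the cited works.

```python
import random

def symmetric_T(c, rate):
    """Keep the label with probability 1 - rate, otherwise flip uniformly
    to one of the other c - 1 classes."""
    off = rate / (c - 1)
    return [[1.0 - rate if i == j else off for j in range(c)] for i in range(c)]

def pair_flip_T(c, rate, flips):
    """Class src flips to flips[src] with probability `rate`;
    classes not listed in `flips` stay clean."""
    T = [[1.0 if i == j else 0.0 for j in range(c)] for i in range(c)]
    for src, dst in flips.items():
        T[src][src] = 1.0 - rate
        T[src][dst] = rate
    return T

def corrupt(labels, T, seed=0):
    """Sample a noisy label for each clean label y from row T[y]."""
    rng = random.Random(seed)
    classes = list(range(len(T)))
    return [rng.choices(classes, weights=T[y])[0] for y in labels]

T_sym = symmetric_T(4, 0.3)
T_pair = pair_flip_T(3, 0.4, {0: 1})  # hypothetical: class 0 flips to class 1
noisy = corrupt([0, 1, 2, 3] * 5, T_sym, seed=1)
```

Both constructions produce row-stochastic matrices, which is the property the method relies on when adapting the classifier output.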
Baselines. The compared methods include: 1) CE, which uses the CE loss to train the DNNs on noisy datasets; 2) Fine-tuning, which fine-tunes the result of CE on the meta data to further enhance its performance; 3) GCE (Zhang and Sabuncu, 2018), which employs a robust loss combining the benefits of both the CE loss and the mean absolute error loss against label noise; 4) Forward (Patrini et al., 2017), which estimates the noise transition matrix in an unsupervised manner; 5) GLC (Hendrycks et al., 2018), which estimates the noise transition matrix by using a small set of clean-label data; 6) S-Model (Goldberger and Ben-Reuven, 2017), which uses a Softmax layer to model the noise transition matrix; 7) T-Revision (Xia et al., 2019), which learns the noise transition matrix by adding a slack variable to adjust the initialized matrix; 8) MW-Net (Shu et al., 2019), which uses an MLP to learn the weighting function in a data-driven fashion. The meta data are used as a validation set in these methods, except for Fine-tuning and MW-Net. Note that methods 4 and 5, methods 6 and 7, and method 8 represent the SOTA two-stage noise transition estimation methods, the SOTA one-stage noise transition estimation methods, and the SOTA meta-learning method for robust deep learning on noisy samples, respectively.
Network structure. We use ResNet-34 (He et al., 2016) as our classifier network for the CIFAR-10 and CIFAR-100 datasets, following (Patrini et al., 2017; Xia et al., 2019), and an 18-layer PreAct ResNet (He et al., 2016) for T-ImageNet.
Experimental setup. We train the models with SGD, at an initial learning rate and a momentum 0.9, a weight decay
with mini-batch size 128. The learning rate is multiplied by 0.1 at epochs 80 and 100, for a total of 120 epochs. We initialize the softmax parameters of our algorithm with the estimation results of GLC.
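The step schedule described above can be written as a small helper. This is a sketch under assumptions: the initial learning rate is left as a parameter `base_lr` because its value is not given here, and treating the decay as taking effect from epoch 80 and epoch 100 onwards (0-indexed) is one possible reading.

```python
def step_lr(base_lr, epoch, milestones=(80, 100), gamma=0.1):
    """Learning rate at `epoch`: multiplied by `gamma` once for every
    milestone that has been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Schedule over the 120 training epochs, for a placeholder base rate of 1.0.
schedule = [step_lr(1.0, epoch) for epoch in range(120)]
```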
4.2 Evaluation on Robustness Performance
Results on CIFAR-10 and CIFAR-100. The classification accuracies on CIFAR-10 and CIFAR-100 under symmetric and asymmetric noise, over 5 random runs, are reported in Table 1. As can be seen, our proposed algorithm achieves the best performance in all cases except for CIFAR-100 with 80% symmetric noise. Specifically, even with a large noise ratio, our algorithm still shows competitive classification accuracy. For example, in one symmetric noise setting on CIFAR-10 and one asymmetric noise setting on CIFAR-100, our algorithm reaches 72.41% and 61.16%, outperforming the best baseline results by about 10% and 6%, respectively. This demonstrates the robustness of our method to different types and proportions of noise.
From Table 1 it can be found that: 1) Our algorithm evidently improves on Forward and GLC, especially in large noise cases. Their degradation is possibly caused by inaccurate pre-assumed anchor points, which are hard to obtain in real cases. Comparatively, our algorithm can dynamically adjust the transition matrix so that its estimate is gradually ameliorated under the guidance of meta data, even though our method is initialized with the result of GLC. 2) S-Model behaves well when the noise ratio is small but degrades quickly when the noise ratio becomes large, as does T-Revision. This can be explained by the fact that large noise makes it easy to fall into a wrong estimation, as illustrated in Section 4.3 and Table 3. Though sharing the same initialization with them, our method avoids falling into a wrong estimation and still performs well, since it is guided by meta data and is thus not trapped by noisy samples. In particular, for a symmetric noise setting on CIFAR-100, both of them underperform the CE baseline, while our method achieves a clear improvement. 3) MW-Net produces competitive results under symmetric noise compared with our algorithm. However, its performance degrades quickly under asymmetric noise, since in this method all classes share one weighting function, which is unreasonable when the noise is asymmetric. Instead, our method can adaptively fit different noise types and noise rates and gradually ameliorate the estimation. 4) It is interesting to see that our method performs better than CE and Fine-tuning even in no-noise scenarios. We discuss this phenomenon in Section 5.
Results on T-ImageNet. To verify our method in a more complex scenario, we summarize in Table 2 the test accuracy on T-ImageNet with different noise settings. As we can see, similar to the CIFAR experiments, for both noise settings with different noise rates, our algorithm outperforms all other baselines except under 60% symmetric noise, where MW-Net beats our algorithm but all methods have actually lost efficacy. When MW-Net is used in the more complicated asymmetric noise case with the same noise extent, however, it degrades substantially, whereas our method still performs consistently well. The robustness of our method is thus further substantiated.
4.3 How the Noise Transition Matrix Adapts
To understand how our algorithm automatically adjusts the noise transition matrix under the guidance of the meta data, Table 3 summarizes the estimation error of the transition matrix for the compared methods and ours. It can be observed that our method estimates the transition matrix more accurately. Specifically, the matrices learned by Forward and GLC are worse than ours, since the anchor points they find are likely to be inexact, whereas our method can improve the inexact estimate of GLC towards the ground-truth solution under the guidance of the meta data. On the other hand, although they share the same initial values with ours, the matrices learned by S-Model fall into a bad estimation more easily when the noise ratio increases, leading to poor performance compared with ours. T-Revision also moves in a bad direction, although the deterioration is slowed down by the control of the revision. Besides, T-Revision deteriorates faster on CIFAR-100 than on CIFAR-10. Therefore, the matrices estimated by our method are more accurate, which naturally leads to more robust performance than the compared methods.
4.4 Experiments on Real-world Noisy Dataset
We then verify the applicability of our algorithm on a real-world large-scale noisy dataset: Clothing1M (Xiao et al., 2015), which contains 1 million images of clothing from online shopping websites with 14 classes, e.g., T-shirt, Shirt, Knitwear. The labels are generated from the surrounding text of the images and are thus extremely noisy. The dataset also provides 50k, 14k and 10k manually refined clean samples for training, validation and testing, respectively. We do not use the 50k clean training data, and we use the validation set as the meta dataset. Following previous works (Patrini et al., 2017; Tanaka et al., 2018), we use ResNet-50 pre-trained on ImageNet. For preprocessing, we resize each image, crop its center region as input, and perform normalization. We train the model using SGD with a momentum of 0.9, weight decay, an initial learning rate of 0.0001, and a batch size of 100. The learning rate is divided by 10 after 5 epochs (for a total of 10 epochs).
The results are summarized in Table 4 in terms of top-1 accuracy. Our method outperforms all baselines. Fig. 1 shows some examples of the top-5 predictions produced by CE and our method. It can be seen that the top-1 prediction of the CE method overfits to the noisy annotations (red labels), while its second prediction points to the latent clean labels (green labels), reflecting the ambiguity of the sample labels in this dataset. Comparatively, our method can finely recover the true labels by taking advantage of the learned noise transition matrix. For example, the label of the image in the first row of Fig. 1 should be “T-shirt”, while the annotated label is “underwear”. The CE method gives 94.2% confidence to “underwear”, and is thus completely trapped by the noisy sample. In contrast, benefiting from the learned noise transition matrix, our method predicts “T-shirt” with high confidence and suppresses the noisy label “underwear”.
5 Relation to Label Distribution Learning
It can be observed that our method outperforms CE and Fine-tuning in Tables 1 and 2 even in the no-noise cases, which might be attributed to its intrinsic label distribution learning (LDL) capability (Geng, 2016; Peterson et al., 2019). LDL was first proposed by Geng et al. (2013), and extends single-label and multi-label annotation to a distribution. Hinton et al. (2015) used knowledge distillation to provide smoothed softmax probabilities that enhance the performance of the student network. To employ soft labels in place of one-hot hard labels, label smoothing (Szegedy et al., 2016) and mixup (Zhang et al., 2017b) techniques have also been proposed. Recently, Peterson et al. (2019) presented CIFAR10H, a dataset with a full distribution of human labels, and utilized it to improve the accuracy and robustness of a model compared with hard labels.
When there are no noisy labels, our method can be interpreted as approximating the ground-truth label distribution. Specifically, hard labels correspond to the most probable label but lose the full label distribution, i.e., the human allocation of probabilities. Eq.(8) can therefore be interpreted as stating that the observed data distribution with hard labels is obtained by transforming the underlying data distribution with the full label distribution (soft labels) through the transition matrix $T$. The underlying conditional data distribution should behave robustly on unseen data, i.e., it should minimize the CE loss over unobserved data (meta data) to bring better generalization and robustness, as validated in (Peterson et al., 2019). Minimizing Eq.(7) can thus be considered as searching for the $T$ that helps the classifier recover the underlying conditional data distribution. It is therefore rational that our method outperforms CE and Fine-tuning even with fewer training samples.
Furthermore, to verify that our method can deliver the knowledge of the latent label distribution, we follow the generalization and robustness experiments in (Peterson et al., 2019) and compare with Soft and Hard, which are trained with human-uncertainty soft labels and one-hot hard labels, respectively. The results are shown in Fig. 2 and Table 5. For the generalization experiment (Section 5 in (Peterson et al., 2019)), we train ResNet-110 on 9,900 CIFAR-10 test images and treat the remaining 100 images (10 randomly chosen images per class) as meta data, and evaluate on the 50,000-image CIFAR-10 training set, the CIFAR10.1 v6 and v4 datasets (Recht et al., 2018) and the CINIC10 dataset (Darlow et al., 2018). The accuracy of our method is very close to that of Soft labels, as seen in Fig. 2(a), and its CE metric (which evaluates how confident the top prediction of a model is, and whether its distribution over alternative categories is sensible) is evidently better than that of Hard labels, as seen in Fig. 2(b). These results show that, compared with Hard labels, our method can improve the generalization of the learned classifier as the test datasets become increasingly out-of-distribution.
For the robustness experiment, we pretrain ResNet-110 on 49,900 CIFAR-10 training images, treating the remaining 100 images (10 randomly chosen images per class) as meta data, and then fine-tune the pretrained model using the 10,000 CIFAR-10 test images. The FGSM attack results (Kurakin et al., 2016) are reported in Table 5, averaged over all 10,000 images in the CIFAR-10 test set. Note that our method obtains higher accuracy and lower CE loss than Hard labels. Fig. 2(c) plots the increase in CE loss for each training scheme under PGD attacks (Madry et al., 2018). The accuracy was driven to 0% for Hard labels and ours, and to 1% for Soft labels. However, the loss for Hard labels is driven up more rapidly than ours. These results show that our method can also improve the robustness of the model compared with Hard labels.
We have proposed a novel meta-learning method that adaptively extracts the transition matrix for robust deep learning in the presence of noisy labels. Whereas previous methods either require a strong anchor-point prior assumption or suffer from inaccurate estimation misguided by wrong annotation information, the new method yields a more robust and efficient estimate guided by a small set of meta data. A statistical consistency guarantee for correctly estimating the transition matrix can also be proved. Our empirical results show that the proposed method is more robust than the SOTA methods. In addition, we discuss its essential relationship with label distribution learning: because inter-class ambiguity generally exists in real data, our learning strategy has the potential to improve the generalization and robustness of the model over standard training on hard labels, even in real scenarios without label noise. In future work, we will try to incorporate priors on the noise structure into the transition matrix to further enhance estimation stability, e.g., by assuming a sparse transition in which corruption only happens within super-classes.
- Unsupervised label noise modeling and loss correction. In ICML, Cited by: §1, §2.
- Stronger generalization bounds for deep nets via a compression approach. In ICML, Cited by: Appendix B.
- Spectrally-normalized margin bounds for neural networks. In NeurIPS, Cited by: Appendix B.
- Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 (473), pp. 138–156. Cited by: Appendix A, §3.1.
- Rademacher and gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: Appendix B.
- Learning to predict from crowdsourced data. In UAI, Cited by: §1.
- Active bias: training more accurate neural networks by emphasizing high variance samples. In NeurIPS, Cited by: §1.
- Understanding and utilizing deep neural networks trained with noisy labels. In ICML, Cited by: §2.
- CINIC-10 is not imagenet or cifar-10. arXiv preprint arXiv:1810.03505. Cited by: §5.
- Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §2, §3.3, §3.5, §3.5.
- Bilevel programming for hyperparameter optimization and meta-learning. In ICML, Cited by: Appendix A, §3.5.
- Facial age estimation by learning from label distributions. IEEE transactions on pattern analysis and machine intelligence 35 (10), pp. 2401–2412. Cited by: §5.
- Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28 (7), pp. 1734–1748. Cited by: §5.
- Deep sparse rectifier neural networks. In AISTATS, Cited by: §3.1.
- Training deep neural-networks using a noise adaptation layer. In ICLR, Cited by: Appendix C, §1, §2, §3.1, §3.2, §4.1, §4.1.
- Size-independent sample complexity of neural networks. In COLT, Cited by: Appendix B.
- Co-teaching: robust training of deep neural networks with extremely noisy labels. In NeurIPS, Cited by: §2.
- Deep residual learning for image recognition. In CVPR, Cited by: §1, §4.1.
- Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, Cited by: §2, §3.2, §4.1, §4.1.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §5.
- Easy samples first: self-paced reranking for zero-example multimedia search. In ACM MM, Cited by: §2.
- Self-paced learning with diversity. In NeurIPS, Cited by: §2.
- MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, Cited by: §2.
- Learning deep networks from noisy labels with dropout regularization. In ICDM, Cited by: §2.
- Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §1.
- Learning multiple layers of features from tiny images. Technical report Cited by: §4.1.
- Self-paced learning for latent variable models. In NeurIPS, Cited by: §2.
- Adversarial examples in the physical world. In ICLR, Cited by: §5.
- Probability in banach spaces: isoperimetry and processes. Vol. 23, Springer Science & Business Media. Cited by: Appendix B.
- Learning from noisy labels with distillation. In ICCV, Cited by: §2.
- Learning to detect concepts from webly-labeled video data. In IJCAI, Cited by: §1.
- Towards deep learning models resistant to adversarial attacks. In ICLR, Cited by: §5.
- Decoupling "when to update" from "how to update". In NeurIPS, Cited by: §2.
- A theoretical understanding of self-paced learning. Information Sciences 414, pp. 319–328. Cited by: §2.
- Foundations of machine learning. MIT Press. Cited by: Appendix B, Appendix B, §3.4.
- Learning with noisy labels. In NeurIPS, Cited by: §2, §3.1.
- A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, Cited by: Appendix B.
- PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §3.5.
- Making deep neural networks robust to label noise: a loss correction approach. In CVPR, Cited by: Appendix C, §1, §2, §3.1, §3.1, §3.2, §4.1, §4.1, §4.1, §4.4.
- Human uncertainty makes classification more robust. In ICCV, Cited by: §5, §5, §5.
- Do cifar-10 classifiers generalize to cifar-10?. arXiv preprint arXiv:1806.00451. Cited by: §5.
- Learning to reweight examples for robust deep learning. In ICML, Cited by: §2.
- Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139. Cited by: §2.
- Classification with asymmetric label noise: consistency and maximal denoising. In Conference On Learning Theory, pp. 489–511. Cited by: §1.
- A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, Cited by: §2.
- Learning with bad training data via iterative trimmed loss minimization. In ICML, Cited by: §2.
- Meta-weight-net: learning an explicit mapping for sample weighting. In NeurIPS, Cited by: Appendix A, §2, §3.3, §3.5, §4.1.
- Small sample learning in big data era. arXiv preprint arXiv:1808.04572. Cited by: §2.
- SELFIE: refurbishing unclean samples for robust deep learning. In ICML, Cited by: §2.
- Training convolutional networks with noisy labels. In ICLR workshop, Cited by: §1, §2, §3.1, §3.2.
- Rethinking the inception architecture for computer vision. In CVPR, Cited by: §5.
- Joint optimization framework for learning with noisy labels. In CVPR, Cited by: §2, §4.4.
- Learning to learn. Springer. Cited by: §2.
- Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, Cited by: §2.
- A theory of learning with corrupted labels. Journal of Machine Learning Research 18, pp. 228–1. Cited by: §3.1.
- Robust probabilistic modeling with bayesian data reweighting. In ICML, Cited by: §2.
- Are anchor points really indispensable in label-noise learning?. In NeurIPS, Cited by: Appendix C, §2, §3.1, §3.2, §4.1, §4.1.
- Learning from massive noisy labeled data for image classification. In CVPR, Cited by: §1, §1, §2, §4.4.
- Safeguarded dynamic label regression for noisy supervision. In AAAI, Cited by: §4.1.
- Rademacher complexity for adversarially robust generalization. In ICML, Cited by: Appendix B.
- How does disagreement help generalization against label corruption?. In ICML, Cited by: §4.1.
- Understanding deep learning requires rethinking generalization. In ICLR, Cited by: §1, §4.1.
- Mixup: beyond empirical risk minimization. In ICLR, Cited by: §5.
- Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, Cited by: §2, §4.1.
Appendix A Solution of Estimating Noise Transition
The empirical version of the above, as used in the main paper, can be written as follows:
We now show that the theoretical solution of the above optimization problems recovers the solution we require.
Suppose is the cross-entropy loss, and , i.e., . Then by minimizing the expected risk , the optimal mapping satisfies .
Proof Minimizing the expected risk can be written as
Using the method of Lagrange multipliers, we have
Taking the derivative of with respect to , we have . Thus, we have
Since and , we can easily obtain . Therefore, we have
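Since the displayed equations of this proof did not survive extraction, the following is a self-contained version of the Lagrange-multiplier argument in assumed notation: $p_j = P(y=j \mid x)$ denotes the target conditional distribution and $f_j(x)$ the $j$-th output of a classifier constrained to the probability simplex. These symbols are chosen for illustration and may differ from those of Lemma 1, which may also be stated for a different pair of distributions.

```latex
\begin{aligned}
&\min_{f(x)} \; -\sum_{j=1}^{c} p_j \log f_j(x)
  \quad \text{s.t.} \quad \sum_{j=1}^{c} f_j(x) = 1, \\[4pt]
&\mathcal{L}(f, \lambda) = -\sum_{j=1}^{c} p_j \log f_j(x)
  + \lambda \Bigl( \sum_{j=1}^{c} f_j(x) - 1 \Bigr), \\[4pt]
&\frac{\partial \mathcal{L}}{\partial f_j(x)} = -\frac{p_j}{f_j(x)} + \lambda = 0
  \;\Longrightarrow\; f_j(x) = \frac{p_j}{\lambda}, \\[4pt]
&\sum_{j=1}^{c} f_j(x) = 1 \ \text{and} \ \sum_{j=1}^{c} p_j = 1
  \;\Longrightarrow\; \lambda = 1
  \;\Longrightarrow\; f_j^{*}(x) = p_j .
\end{aligned}
```

In words, for each input the cross-entropy risk is minimized pointwise by the conditional distribution itself, which is the property the subsequent argument relies on.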
Proof The expected risk on clean data is defined as (Bartlett et al., 2006):
and the empirical risk over meta dataset is defined as:
Since the meta dataset can be seen as an i.i.d. sample from the clean data, Hoeffding’s inequality implies that, , the following holds for all with probability at least
We denote and as the learned transition matrix by minimizing Eq.(12)(13) and the underlying transition matrix, respectively. We calculate to characterize the difference between and , since is unavailable. Since
we have the following holds for all with probability at least
We provide the proof by contradiction. Suppose that the optimal solution of Eq.(12) cannot recover the ground-truth noise transition matrix; we can then show that obtained by optimizing Eq.(13) still overfits to the label noise. Otherwise, when recovers the clean classifier, we have . However, by Lemma 1, the minimization of Eq.(12)(13) enforces that holds. This means that cannot recover the classifier on the clean data . Thus cannot achieve the best performance. Minimizing Eq.(12) then pushes to be as small as possible until approaches , i.e., pushes to be as small as possible.
Based on Eq.(23), can be bounded by minimizing Eq.(12)(13). In other words, Eq.(23) holds for all with probability at least . Since can be very small, we can deduce that recovers with a certain probability. The proof is completed.
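To give a feel for the size of the Hoeffding term used in this proof, the sketch below evaluates the two-sided Hoeffding radius for a single fixed hypothesis. It assumes the loss is bounded in `[0, loss_bound]` (the cross-entropy loss is unbounded in general, so this is an additional assumption), and the function name `hoeffding_radius`, the confidence level, and the sample sizes other than the 100 meta images mentioned earlier are illustrative choices.

```python
import math


def hoeffding_radius(m, delta, loss_bound=1.0):
    """Radius t such that |empirical risk - expected risk| <= t holds with
    probability at least 1 - delta for one fixed hypothesis, given m i.i.d.
    samples and a loss taking values in [0, loss_bound]."""
    return loss_bound * math.sqrt(math.log(2.0 / delta) / (2.0 * m))


radius_100 = hoeffding_radius(m=100, delta=0.05)
radius_1000 = hoeffding_radius(m=1000, delta=0.05)

print("m = 100: ", round(radius_100, 4))
print("m = 1000:", round(radius_1000, 4))
```

The radius shrinks at the rate of one over the square root of `m`, which is why a larger meta dataset tightens the gap between the empirical and expected clean risks.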
Appendix B Generalization Error
The results in this paper focus on Rademacher complexity (Bartlett and Mendelson, 2002; Mohri et al., 2018; Yin et al., 2019), which is a standard tool to control the uniform convergence (and hence the sample complexity) of given classes of predictors. Here, we present its formal definition. For any function class , given a sample of size , the empirical Rademacher complexity is defined as
where are i.i.d. Rademacher random variables taking the values $+1$ and $-1$ with probability $1/2$ each. In our learning problem, denote the training sample by clean dataset and noisy dataset . The expected and empirical risks are and . We then have the following theorem, which connects the expected and empirical risks via Rademacher complexity.
Theorem 3 (Rademacher Complexity)
Suppose that the range of the loss function is . Then for any , with probability at least , the following holds for all :
where is the Rademacher complexity and are Rademacher variables uniformly distributed over $\{-1, +1\}$.
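The empirical Rademacher complexity that appears in this theorem can be computed exactly for a small finite function class by enumerating every sign vector. The sketch below does this under the standard definition (the expectation over sign vectors of the supremum of the sign-weighted average of function values); the function name `empirical_rademacher` and the toy function classes are assumptions for illustration only.

```python
import itertools


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite function class.

    function_values[k][i] is the value of the k-th function on the i-th sample.
    All 2**m sign vectors are enumerated, so keep the sample size m small.
    """
    m = len(function_values[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / m
            for values in function_values
        )
    return total / 2 ** m


# A single function cannot correlate with random signs on average.
single = [[1.0, 1.0, 1.0, 1.0]]

# Adding its negation lets the class fit either sign pattern.
symmetric_pair = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]

r_single = empirical_rademacher(single)
r_pair = empirical_rademacher(symmetric_pair)
print("single function:", r_single)
print("symmetric pair: ", r_pair)
```

A richer class can match more sign patterns and therefore has a larger complexity, which is what the bound penalizes.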
In our paper, our goal is to minimize the following expected risk and empirical risk with respect to the noisy data in order to recover the unbiased classifier,
Therefore, the Rademacher complexity for our problems can be expressed as follows:
For any , with probability at least , the following holds for all :
Here, the argument in the Rademacher complexity indicates that is chosen from the function space , which is generally determined by the function space of due to the fact that . Thus, we have the following conclusion.
, where denotes the hypothesis complexity of the classifier.
Proof Firstly, we provide the following two lemmas related to our proof.
The loss function is 1-Lipschitz with respect to , where is cross-entropy loss.
Proof Since , we have
Taking the derivative of with respect to , we have
Thus, we have
Therefore, we can demonstrate that the loss function is 1-Lipschitz with respect to .
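The Lipschitz claim can be sanity-checked numerically. The sketch below assumes that the lemma refers to the cross-entropy loss composed with a softmax and viewed as a function of each individual logit, which is an interpretation because the symbols are missing from this section. Under that assumption, every partial derivative equals a softmax probability minus a 0/1 indicator and therefore lies in $[-1, 1]$. All function names and the random test setup are illustrative.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def ce(z, y):
    return -math.log(softmax(z)[y])


def ce_gradient(z, y):
    """Partial derivatives of the CE loss w.r.t. the logits: softmax(z) - onehot(y)."""
    p = softmax(z)
    return [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]


random.seed(0)
max_abs_partial = 0.0
worst_ratio = 0.0
for _ in range(1000):
    z = [random.uniform(-5.0, 5.0) for _ in range(10)]
    y = random.randrange(10)
    max_abs_partial = max(max_abs_partial, max(abs(g) for g in ce_gradient(z, y)))

    # Perturb a single logit and compare the change in loss with the step size.
    k = random.randrange(10)
    t = random.choice((-0.5, 0.5))
    shifted = list(z)
    shifted[k] += t
    worst_ratio = max(worst_ratio, abs(ce(shifted, y) - ce(z, y)) / abs(t))

print("largest |partial derivative|:", max_abs_partial)
print("largest |loss change| / |step|:", worst_ratio)
```

Both quantities stay at or below one in this experiment, which is consistent with a Lipschitz constant of 1 in each logit coordinate.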
Lemma 3 (Talagrand’s Contraction Lemma)
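Let $\phi:\mathbb{R}\to\mathbb{R}$ be an $L$-Lipschitz function. Then, for any hypothesis set $\mathcal{H}$ of real-valued functions, we have
$$\hat{\mathfrak{R}}_S(\phi\circ\mathcal{H}) \le L\,\hat{\mathfrak{R}}_S(\mathcal{H}).$$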
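As an informal check of the contraction lemma, the sketch below computes the exact empirical Rademacher complexity of a small finite class and of the same class composed with a 1-Lipschitz function, and compares the two. The function class, the choice of `tanh` as the Lipschitz map, and all names are hypothetical; the check illustrates the inequality on one example and is not a proof.

```python
import itertools
import math


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite class by enumerating
    all sign vectors; function_values[k][i] is function k evaluated on sample i."""
    m = len(function_values[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        total += max(
            sum(s * v for s, v in zip(sigma, values)) / m
            for values in function_values
        )
    return total / 2 ** m


# Hypothetical class of three real-valued functions evaluated on five samples.
base_class = [
    [0.5, -1.0, 2.0, 0.3, -0.7],
    [-1.5, 0.4, 0.1, 1.2, 0.9],
    [0.0, 0.8, -0.6, -2.0, 1.1],
]

lipschitz_constant = 1.0  # tanh has derivative at most 1 everywhere
composed_class = [[math.tanh(v) for v in values] for values in base_class]

lhs = empirical_rademacher(composed_class)
rhs = lipschitz_constant * empirical_rademacher(base_class)
print("complexity of composed class:", lhs)
print("Lipschitz constant times complexity of base class:", rhs)
```

In the generalization argument, the same inequality is what allows the complexity of the loss class to be bounded by that of the underlying hypothesis class.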
Now, we can prove the conclusion.