An Ensemble Noise-Robust K-fold Cross-Validation Selection Method for Noisy Labels

07/06/2021 ∙ by Yong Wen, et al. ∙ HUAWEI Technologies Co., Ltd.

We consider the problem of training robust and accurate deep neural networks (DNNs) when subject to various proportions of noisy labels. Large-scale datasets tend to contain mislabeled samples that can be memorized by DNNs, impeding the performance. With appropriate handling, this degradation can be alleviated. There are two problems to consider: how to distinguish clean samples and how to deal with noisy samples. In this paper, we present Ensemble Noise-robust K-fold Cross-Validation Selection (E-NKCVS) to effectively select clean samples from noisy data, solving the first problem. For the second problem, we create a new pseudo label for any sample determined to have an uncertain or likely corrupt label. E-NKCVS obtains multiple predicted labels for each sample and the entropy of these labels is used to tune the weight given to the pseudo label and the given label. Theoretical analysis and extensive verification of the algorithms in the noisy label setting are provided. We evaluate our approach on various image and text classification tasks where the labels have been manually corrupted with different noise ratios. Additionally, two large real-world noisy datasets are also used, Clothing-1M and WebVision. E-NKCVS is empirically shown to be highly tolerant to considerable proportions of label noise and has a consistent improvement over state-of-the-art methods. Especially on more difficult datasets with higher noise ratios, we can achieve a significant improvement over the second-best model. Moreover, our proposed approach can easily be integrated into existing DNN methods to improve their robustness against label noise.




1 Introduction

Together with the resurgence and remarkable success of DNNs, large-scale datasets have become increasingly common. For supervised learning tasks, modern DNNs generally require the datasets to be annotated with accurate labels to achieve high performance. However, correctly labeling large amounts of data is costly and error-prone: even high-quality hand-labeled benchmark datasets such as ImageNet [deng2009imagenet] contain mislabeled samples [northcutt2019confident]. There exist alternative, low-cost methods, including large-scale annotation through crowd-sourcing [sheng2008get] and online web queries [divvala2014learning], but these inevitably yield a higher proportion of incorrect class labels.

DNNs are prone to overfitting to corrupted data samples, which increases the generalization error of the network [zhang2017understanding]. To address this issue, numerous algorithms have been proposed to train DNNs in a way that is robust to label noise [wang2019sl, Xu2019L_DMIAI]. The capability of DNNs to fit noisy data has been further studied by Chen et al. [chen2019understanding]. They showed that, for symmetric noise, the test accuracy is a quadratic function of the noise ratio, and claimed that generalization occurs in the sense of distribution. In this paper, we relax their assumptions and give a theoretical analysis of the impact of an imperfect classifier. Our findings demonstrate that, while the noise level has a significant impact, the performance of the classifier is key.

Based on our analysis, we propose E-NKCVS, a novel ensemble method based on $K$-fold cross-validation to increase the generalization performance. We empirically evaluate our solution and show that it outperforms state-of-the-art methods. In summary, our contributions are as follows.

  • We propose a novel method (E-NKCVS) based on a combination of $K$-fold cross-validation and ensemble learning. Samples are selected from the noisy data by keeping those where the predicted label matches the given (noisy) label. Any non-selected samples can then either be discarded or re-weighted to have a lower impact. Mixup [zhang2017mixup] is applied during training to augment the data.

  • We further propose a label re-weighting scheme for samples that are likely erroneous. For these uncertain samples, we consider both the given label and a generated pseudo label with the weight set using the entropy of the predicted labels given by E-NKCVS.

  • We empirically show that the proposed solution outperforms state-of-the-art noise-robust methods on image recognition and text classification tasks on multiple datasets. Moreover, our solution can easily be incorporated into existing network architectures to enhance their robustness to noisy labels.

2 Related Work

There have been numerous approaches proposed to deal with noisy labels. These can generally be categorized into three types. The most straightforward way is to improve the quality of a dataset by removing or correcting corrupted samples. Multiple strategies have been proposed to identify the most likely corrupted samples, including conditional random fields [vahdat2017cdf], knowledge graphs that distill knowledge from noisy data [li2017distill], and a label cleaning network to achieve noise-robust classification [veit2017learning]. However, auxiliary clean data are often required and not always obtainable.

Another approach is to reformulate the loss function. Theoretical studies by Ghosh et al. [ghosh2017robust] prove that the mean absolute error (MAE) is robust to label noise under certain assumptions. Inspired by this work, other robust losses have been proposed. Ma et al. [ma2018d2l] correct the loss to avoid overfitting to noisy labels. Wang et al. [wang2019sl] propose symmetric cross-entropy learning, which balances cross-entropy with a noise-tolerant reverse cross-entropy, while Zhang et al. [zhang2018gce] propose a set of noise-tolerant loss functions that generalize both the categorical cross-entropy and MAE. Xu et al. [Xu2019L_DMIAI] introduce the Determinant-based Mutual Information (DMI) loss, which is a generalized version of mutual information and provably insensitive to instance-independent label noise.

Refinement of the training process has also been explored to deal with noisy labels. MentorNet [jiang2017mentornet] is proposed to supervise the training of a student network and make it focus on samples with a higher probability of being labeled correctly. Following the same idea, Co-teaching [han2018coteach] trains a network with the most confident samples as output by a second network. Meanwhile, DivideMix [li2020dividemix] fits a two-component mixture model to obtain the per-sample label confidence, then uses this information to divide the training data into a labeled set and an unlabeled set. The semi-supervised technique MixMatch [berthelot2019mixmatch] is then applied for training. In a similar fashion, MentorMix [jiang2020beyond] also takes advantage of Mixup [zhang2017mixup] and merges it with MentorNet to minimize the empirical vicinal risk using curriculum learning.

Our proposed method refines the training process by adding $K$-fold cross-validation and an ensemble to deal with the noisy labels and adjust per-sample label weights during training.

3 Preliminaries

We consider the $Q$-class classification problem. We are given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ denotes the $i$-th sample in the $d$-dimensional space and $y_i \in \{1, \dots, Q\}$ is its observed label. The given label may be corrupt, and we thus denote by $y_i^*$ the ground-truth label of sample $x_i$. A sample is referred to as clean when it is labeled correctly, i.e., $y_i = y_i^*$. In this work, we examine two types of artificial noise, symmetric noise and asymmetric noise. We introduce a noise transition matrix $T \in [0, 1]^{Q \times Q}$, where $T_{ij} = P(y = j \mid y^* = i)$, to characterize the probability of samples in the $i$-th class being flipped to the $j$-th class label.

Definition 1.

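(symmetric noise) Given noise ratio $\varepsilon \in [0, 1]$, we define the noise transition matrix $T$ as $T_{ii} = 1 - \varepsilon$, and $T_{ij} = \frac{\varepsilon}{Q - 1}$ for $i \neq j$, where $Q$ is the number of classes.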

Definition 2.

(asymmetric noise) Given noise ratio $\varepsilon \in [0, 1]$, $T_{ii} = 1 - \varepsilon$, and $T_{ij} = \varepsilon$ for some $j \neq i$, $T_{ij} = 0$ otherwise.

Both noise types make the common assumption that the noise is data-independent given the true class label, i.e., $P(y \mid y^*, x) = P(y \mid y^*)$ for observed label $y$, true label $y^*$ and sample $x$. Asymmetric (class-dependent) noise is designed to imitate real-world label noise, which often arises when annotators mistake similar classes for one another. This is simulated by flipping a fraction of a class's labels to a similar class (e.g., truck → automobile, cat → dog).
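The two noise models can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the authors' code: the helper names (`symmetric_T`, `asymmetric_T`, `corrupt`) are hypothetical, rows of the transition matrix are indexed by the true class, and the truck-to-automobile flip uses the usual CIFAR-10 class indices (9 and 1).

```python
import random

def symmetric_T(num_classes, eps):
    """Row i holds P(observed = j | true = i): 1 - eps on the diagonal,
    eps spread uniformly over the other num_classes - 1 classes."""
    off = eps / (num_classes - 1)
    return [[1.0 - eps if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]

def asymmetric_T(num_classes, eps, flip_to):
    """flip_to maps a source class to the single similar class it may be
    flipped to; classes absent from flip_to keep their labels."""
    T = [[1.0 if i == j else 0.0 for j in range(num_classes)]
         for i in range(num_classes)]
    for src, dst in flip_to.items():
        T[src][src] = 1.0 - eps
        T[src][dst] = eps
    return T

def corrupt(labels, T, seed=0):
    """Sample one observed label per true label from the matching row of T."""
    rng = random.Random(seed)
    classes = list(range(len(T)))
    return [rng.choices(classes, weights=T[y])[0] for y in labels]

# Usage: 20% symmetric noise on ten classes, and a 40% truck -> automobile flip.
true_labels = [i % 10 for i in range(10000)]
noisy_sym = corrupt(true_labels, symmetric_T(10, 0.2))
noisy_asym = corrupt(true_labels, asymmetric_T(10, 0.4, {9: 1}))
```

Because the noise depends only on the true class, a single row lookup per sample is enough; no feature information enters the corruption step.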

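For simplicity and consistency, we denote the neural network classifier parameterized by $\theta$ as $f_\theta$, where $f_\theta$ is an element of a functional space $\mathcal{F}$ which maps the feature space $\mathcal{X}$ to the label space $\mathcal{Y}$. We further denote $p(x; \theta)$ and $\hat{y} = \arg\max_{q} p_q(x; \theta)$ as the class probability distribution and the predicted label, respectively. The loss is denoted as $\ell(f_\theta(x), y)$, or $\ell$ for short. Finally, the confusion matrix of classifier $f_\theta$ is denoted as $C$, where $C_{ij} = P(\hat{y} = j \mid y^* = i)$ and $y^*$ is the true label.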

4 Ensemble Noise-Robust K-fold Cross-Validation Selection

In this section, we present the details of our sample selection strategy to obtain clean samples and our re-weighting scheme for samples with uncertain or likely corrupted labels. Our goal is to select clean samples from the noisy dataset $D$, and subsequently train a deep learning model with the selected samples and the re-weighted non-selected samples. The optimal scenario would be to select from $D$ all samples whose given label equals the true label, and to re-weight all non-selected samples so that their correct label is used as the training label.

To effectively filter out noisy samples, we present a Noise-robust $K$-fold Cross-Validation Selection (NKCVS) method in Algorithm 1. Following the standard $K$-fold cross-validation scheme, the dataset $D$ is randomly partitioned into $K$ equal-sized subsets $D_k$ ($k = 1, \dots, K$) (line 2). The data is split into training data consisting of $K - 1$ subsets and a single held-out subset $D_k$ (line 4). We augment the training data and train a DNN model with the standard cross-entropy loss (lines 5-6). The model is then used to predict the labels of all samples in $D_k$ (lines 7-9), and we select any samples where the predicted label matches the given label (lines 10-11). This process is repeated $K$ times until all samples have been tested once.
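The selection loop described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: a nearest-centroid classifier (`fit_centroids` and `predict`, both hypothetical helpers) stands in for the DNN, the Mixup augmentation step is omitted, and the two-cluster toy data are made up for the example.

```python
import random

def fit_centroids(samples):
    """Stand-in for DNN training: per-class mean of the feature vectors."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Label of the closest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def nkcvs(data, k, seed=0):
    """data: list of (features, noisy_label) pairs.
    Returns the indices whose held-out prediction matches the given label,
    together with the predicted label of every sample."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    folds = [order[f::k] for f in range(k)]           # K roughly equal folds
    selected, predicted = set(), {}
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in order if i not in held]
        model = fit_centroids(train)                  # train on the other K-1 folds
        for i in held_out:
            x, y = data[i]
            predicted[i] = predict(model, x)
            if predicted[i] == y:                     # keep samples whose labels agree
                selected.add(i)
    return selected, predicted

# Toy usage: two well-separated clusters with 20% of the labels flipped.
rng = random.Random(1)
true = [i % 2 for i in range(200)]
feats = [[rng.gauss(5.0 * t, 0.5), rng.gauss(5.0 * t, 0.5)] for t in true]
noisy = [1 - t if rng.random() < 0.2 else t for t in true]
selected, predicted = nkcvs(list(zip(feats, noisy)), k=5)
clean_selected = [i for i in selected if noisy[i] == true[i]]
```

Each sample is predicted exactly once, by a model that never saw it during training, which is what lets the agreement test act as a filter for memorized noisy labels.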

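The training data augmentation is done following Mixup [zhang2017mixup]. Each sample $(x_i, y_i)$ is interpolated with another randomly chosen sample $(x_j, y_j)$ from the same mini-batch. For each such pair of samples, a mixed sample $(\tilde{x}, \tilde{y})$ is computed by:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha). \tag{1}$$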
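As a rough illustration of the Mixup interpolation, the sketch below mixes feature vectors and one-hot labels with a coefficient drawn from a Beta distribution, as in the Mixup paper. The function names are hypothetical, and `alpha = 1.0` in the usage lines is a placeholder value, not the setting used in the experiments.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, alpha, rng):
    """Convex combination of two samples; labels are one-hot (or soft) vectors."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

def mixup_batch(xs, ys, alpha, seed=0):
    """Pair every sample with a randomly chosen partner from the same mini-batch."""
    rng = random.Random(seed)
    partners = list(range(len(xs)))
    rng.shuffle(partners)
    return [mixup_pair(xs[i], ys[i], xs[j], ys[j], alpha, rng)[:2]
            for i, j in enumerate(partners)]

# Usage with a placeholder alpha: mix a class-0 sample with a class-1 sample.
rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], 1.0, rng)
batch = mixup_batch([[0.0], [1.0], [2.0]], [[1, 0], [0, 1], [1, 0]], 1.0)
```

The mixed label stays a valid probability vector, so the standard cross-entropy loss can be applied to it unchanged.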


We further propose an extended ensemble version in Algorithm 2 (E-NKCVS). In ensemble learning, multiple predictions are combined to obtain better predictive performance. Following this idea, we iterate NKCVS $M$ times, with each iteration $m$ yielding a separate set of selected samples $S_m$. We finally select all samples $(x_i, y_i)$ that fulfill the condition

$$\sum_{m=1}^{M} \mathbb{1}\left[(x_i, y_i) \in S_m\right] \geq \tau,$$

where $\tau$ is the threshold for retaining a sample and $\mathbb{1}[\cdot]$ is an indicator function returning 1 if $(x_i, y_i) \in S_m$, otherwise 0.

For each sample, we save all predicted labels in $P$. These are used to create pseudo labels for all non-selected samples and adjust the weight between the pseudo and given labels as described in Section 4.2.
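The ensemble step amounts to a vote over the per-iteration selections. The sketch below assumes the threshold is a count of iterations (a sample is kept if at least `threshold` of the runs selected it), which matches the behaviour discussed in Section 5.2; the function name `ensemble_select` and the toy selections are hypothetical.

```python
from collections import Counter

def ensemble_select(selections, threshold):
    """selections: one set of selected sample indices per NKCVS iteration.
    Keeps every index that was selected in at least `threshold` iterations."""
    votes = Counter(i for s in selections for i in s)
    return {i for i, v in votes.items() if v >= threshold}

# Three hypothetical iterations over samples 0..4.
runs = [{0, 1, 2}, {1, 2, 3}, {2, 3, 4}]
loose = ensemble_select(runs, threshold=1)   # union of all runs
medium = ensemble_select(runs, threshold=2)  # majority vote
strict = ensemble_select(runs, threshold=3)  # intersection of all runs
```

Raising the threshold can only shrink the selected set, which is why it trades recall for precision.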

4.1 Evaluation strategy and theoretical analysis

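Input: noisy dataset $D$, number of splits $K$.
1:  $S \leftarrow \emptyset$, $P \leftarrow \emptyset$
2:  Split $D$ into $D_1, \dots, D_K$
3:  for $k = 1$ to $K$ do
4:     $D_{train} \leftarrow D \setminus D_k$
5:     Augment $D_{train}$ with Mixup
6:     Train with $D_{train}$ to obtain $f_{\theta_k}$
7:     for $(x_i, y_i) \in D_k$ do
8:        Predict label $\hat{y}_i$ with $f_{\theta_k}$
9:        Append $\hat{y}_i$ to $P$
10:        if $\hat{y}_i = y_i$ then
11:           Add $(x_i, y_i)$ to $S$
12:        end if
13:     end for
14:  end for
Output: the selected sample set $S$, predicted label set $P$.
Algorithm 1 Noise-Robust K-fold Cross-Validation Selection (NKCVS)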
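Input: noisy dataset $D$, number of splits $K$, number of iterations $M$, threshold $\tau$.
1:  $S \leftarrow \emptyset$, $P \leftarrow \emptyset$.
2:  for $m = 1$ to $M$ do
3:     Set $S_m, P_m \leftarrow$ NKCVS($D$, $K$) and append $P_m$ to $P$
4:  end for
5:  for $(x_i, y_i) \in D$ do
6:     $c_i = \sum_{m=1}^{M} \mathbb{1}[(x_i, y_i) \in S_m]$
7:     if $c_i \geq \tau$ then
8:        Add $(x_i, y_i)$ to $S$
9:     end if
10:  end for
Output: the selected sample set $S$, predicted label set $P$.
Algorithm 2 Ensemble NKCVS (E-NKCVS)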

To evaluate the algorithms' ability to select clean samples, we adapt the standard definitions of precision and recall to our scenario while keeping the original intent behind the metrics intact. We denote the selected sample set as $S$, and define the clean samples and clean selected samples as follows:

$$D_c = \{(x_i, y_i) \in D : y_i = y_i^*\}, \qquad S_c = S \cap D_c,$$

where $y_i^*$ is the true label of sample $x_i$. We measure the ability to identify the clean samples using precision and recall, defined as:

$$\text{Precision} = \frac{|S_c|}{|S|}, \qquad \text{Recall} = \frac{|S_c|}{|D_c|}, \tag{7}$$

where $|\cdot|$ denotes the number of samples in a set. Thus, precision expresses the fraction of clean samples in $S$, while recall represents the fraction of clean samples in $S$ over all clean samples in $D$. The performance of Algorithm 2 is theoretically quantified in Theorem 1 with the full proof provided in Appendix A.
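As a quick numeric check of these definitions, the sketch below recomputes precision and recall from set sizes. The counts are taken from the row of Table 1 with noise ratio 0.2 (40000 clean samples, 37111 selected samples, 36958 clean selected samples); the function names are hypothetical.

```python
def precision_recall_from_counts(num_selected, num_clean_selected, num_clean):
    """Precision: clean fraction of the selected set.
    Recall: fraction of all clean samples that were selected."""
    return num_clean_selected / num_selected, num_clean_selected / num_clean

def precision_recall(selected, clean):
    """Same metrics computed from sets of sample indices."""
    return precision_recall_from_counts(len(selected), len(selected & clean), len(clean))

# Table 1, noise ratio 0.2.
p, r = precision_recall_from_counts(37111, 36958, 40000)
print(f"precision = {100 * p:.2f}%, recall = {100 * r:.2f}%")

# Set-based usage on a tiny made-up example.
p_toy, r_toy = precision_recall({0, 1, 2, 3}, {1, 2, 3, 4, 5})
```

The recomputed values agree with the 99.59% precision and 92.40% recall reported in that row.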

Theorem 1.

Denote . Assuming noise transition matrix and confusion matrix of a classifier, the expectations of precision and recall of the selected samples by Algorithm 2 are then:

Corollary 1.

For the special case where both the noise matrix and the confusion matrix are symmetric, with , , , the precision and recall can be simplified as follows:


The performance is thus dependent on both the classifier accuracy and the noise transition matrix. Furthermore, from Equation 9 we can see that the only way to improve both precision and recall is to improve the accuracy of the classifier in Algorithm 1.
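To make the dependence on classifier accuracy concrete, the following sketch works through a simplified single-pass case. It is an illustration under explicit assumptions and not a restatement of Theorem 1 or Corollary 1: labels follow symmetric noise with ratio `eps`, the classifier predicts the true class with probability `acc` and otherwise a uniformly random wrong class, and predictions are independent of the given label once the true class is fixed. Under these assumptions a sample is selected when prediction and given label agree, so the clean-and-selected mass is (1 - eps) * acc and the noisy-and-selected mass is eps * (1 - acc) / (Q - 1).

```python
import random

def expected_precision_recall(q, eps, acc):
    """Closed form for one selection pass under the assumptions above."""
    clean_sel = (1.0 - eps) * acc            # label correct and prediction correct
    noisy_sel = eps * (1.0 - acc) / (q - 1)  # both wrong and landing on the same class
    return clean_sel / (clean_sel + noisy_sel), acc

def simulate(q, eps, acc, n=100_000, seed=0):
    """Monte Carlo check of the closed form."""
    rng = random.Random(seed)
    selected = clean = clean_selected = 0
    for _ in range(n):
        true = rng.randrange(q)
        wrong = [c for c in range(q) if c != true]
        given = true if rng.random() >= eps else rng.choice(wrong)
        pred = true if rng.random() < acc else rng.choice(wrong)
        is_clean, is_selected = given == true, pred == given
        clean += int(is_clean)
        selected += int(is_selected)
        clean_selected += int(is_clean and is_selected)
    return clean_selected / selected, clean_selected / clean

p_formula, r_formula = expected_precision_recall(q=10, eps=0.4, acc=0.8)
p_sim, r_sim = simulate(q=10, eps=0.4, acc=0.8)
```

In this toy setting recall equals the classifier accuracy and precision increases with it, which mirrors the qualitative conclusion that a better classifier is the only lever that improves both metrics at once.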

The number of splits $K$ in Algorithm 1 can be tuned for this purpose. In general, a higher value for $K$ will give better results since each classifier is trained on more samples, giving more accurate predictions and thus a higher classifier accuracy. For E-NKCVS in Algorithm 2, we can further tune the number of iterations $M$ and the threshold $\tau$. This will not directly or consistently increase the classifier accuracy, but will act as a regularizer to enhance the precision at the cost of the recall, or vice versa. An empirical study of the impact of $K$, $M$ and $\tau$ is found in Section 5.2. Although Corollary 1 is a special case for symmetric noise, we experimentally show in Section 5 that for asymmetric and real-world noise, our algorithm achieves competitive results with state-of-the-art methods.

4.2 Label re-weighting based on predicted labels

To make use of all available information, we do not simply discard the samples not selected by E-NKCVS. Instead, we decrease the weight given to these samples during training of the final network. In Algorithm 2 (E-NKCVS), for each sample $x_i$, we obtain $M$ predicted labels. We denote these as $\hat{y}_i^{(1)}, \dots, \hat{y}_i^{(M)}$. We denote the label with the most occurrences as $\bar{y}_i$ and use it as a pseudo label. The distribution of the predicted labels is denoted as $p_i = (p_{i1}, \dots, p_{iQ})$, and the entropy as $H_i = -\sum_{q=1}^{Q} p_{iq} \log p_{iq}$, where $p_{iq} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}[\hat{y}_i^{(m)} = q]$.

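We only re-weight samples $(x_i, y_i) \notin S$. The losses using the original label $y_i$ and the pseudo label $\bar{y}_i$ are computed, and the weight between the two is determined by $w_i$ as follows:

$$\ell_i = w_i \, \ell(f_\theta(x_i), \bar{y}_i) + (1 - w_i) \, \ell(f_\theta(x_i), y_i).$$

The weight is based on the entropy $H_i$ of the predicted labels and set to be:

$$w_i = 1 - \frac{H_i}{H_{\max}},$$

where $H_{\max}$ is the maximum possible value of $H_i$ and hence used to normalize $w_i$ to [0,1]. Thus, the weight given to the pseudo label $\bar{y}_i$ will decrease with increased uncertainty in the predicted labels.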

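By including the selected samples and a tuning parameter $\gamma$, we obtain the complete loss function:

$$L = \sum_{(x_i, y_i) \in S} \ell(f_\theta(x_i), y_i) + \gamma \sum_{(x_i, y_i) \notin S} \ell_i,$$

where $\ell_i$ is the re-weighted loss of a non-selected sample and $\ell$ is the standard cross-entropy loss.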
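The re-weighting scheme can be sketched as follows. This is a hedged illustration, not the authors' code: the pseudo label is the majority vote over the predicted labels, the pseudo-label weight is assumed to be one minus the normalized entropy of those predictions, and the normalizer `max_entropy` is passed in explicitly because its exact value is an assumption (the usage lines take the log of the number of predictions). The function names are hypothetical.

```python
import math
from collections import Counter

def pseudo_label_and_weight(predicted_labels, max_entropy):
    """Majority-vote pseudo label and an entropy-based weight in [0, 1].
    A unanimous vote gives weight 1; a maximally split vote gives weight 0."""
    counts = Counter(predicted_labels)
    pseudo = counts.most_common(1)[0][0]
    m = len(predicted_labels)
    entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
    return pseudo, 1.0 - entropy / max_entropy

def reweighted_loss(probs, given, pseudo, weight):
    """Cross-entropy against the pseudo label and the given label, mixed by weight."""
    def cross_entropy(label):
        return -math.log(probs[label])
    return weight * cross_entropy(pseudo) + (1.0 - weight) * cross_entropy(given)

# Usage: five ensemble predictions for one non-selected sample.
votes = [2, 2, 2, 7, 7]
pseudo, weight = pseudo_label_and_weight(votes, max_entropy=math.log(len(votes)))
probs = [0.05, 0.05, 0.6, 0.05, 0.05, 0.05, 0.05, 0.1]  # model output over 8 classes
loss = reweighted_loss(probs, given=7, pseudo=pseudo, weight=weight)
```

When the ensemble agrees, the pseudo label dominates the loss; when it disagrees, the loss falls back towards the given label.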

5 Experiments

In this section, we demonstrate the validity and robustness of the proposed method when training on data with label noise. The section is divided into four parts. We begin by introducing the experimental setup. This is followed by a validation of the effectiveness of E-NKCVS in identifying clean samples and a parameter analysis of how different parameter settings can affect the results. Finally, we show that our method is robust and can outperform state-of-the-art methods, both on datasets with artificially added noise and datasets with real-world noisy labels.

5.1 Experimental setup

We extensively validate our method on multiple benchmark datasets, namely MNIST [lecun1998gradient], CIFAR-10 and CIFAR-100 [krizhevsky2009learning], TREC [li2002learning], Clothing-1M [xiao2015learning], and WebVision [li2017webvision]. We use symmetric and asymmetric noise as defined in Section 3 to manually corrupt the labels in the training data with different noise ratios. The labels in the testing data are kept clean. For Clothing-1M and WebVision, we do not introduce any artificial noise since the datasets are naturally noisy. To obtain the final test accuracy, the DNN is retrained using the selected samples and the re-weighted non-selected samples. The test evaluation is done with .

For the real-world datasets, we follow previous works [li2020dividemix, chen2019understanding] and use ResNet-50 with weights pre-trained on ImageNet for Clothing-1M and Inception-ResNet v2 [szegedy2016inception] for WebVision. We discard any labeled training images provided in the datasets. For WebVision, we use the first 50 classes of the Google image subset and evaluate the results on the provided validation dataset. A summary of the datasets and the full details on the experimental setup are provided in Appendix B.

The default parameters of E-NKCVS are set as follows, , and . For each of the $M$ iterations, the dataset $D$ is split into $K$ random folds and the network is randomly initialized.

5.2 Method effectiveness and parameter sensitivity

Noise ratio   Clean samples   Selected samples   Clean selected samples   Precision   Recall
0.0           50000           46659              46659                    100.0%      93.32%
0.2           40000           37111              36958                    99.59%      92.40%
0.4           30000           27505              27130                    98.64%      91.68%
0.6           20000           18127              17304                    95.46%      86.52%
0.8           10000           9070               6602                     72.79%      66.02%
Table 1: The performance of E-NKCVS on CIFAR-10 with different noise ratios . The parameters and are set to , and , respectively.
Figure 1: The sensitivity of precision and recall when adjusting and using from CIFAR-10. The accuracy is measured on using the model trained on the selected samples .

We conduct experiments on how effective E-NKCVS is in identifying the clean samples. In particular, we verify our claim that increasing the number of splits $K$ will have a significant positive impact on both precision and recall, as defined in Equation 7. Moreover, the dynamics between the number of iterations $M$ and the threshold $\tau$ are investigated in detail. Here, we do not re-weight the non-selected samples in order to examine the pure effect of the parameters on the selected samples. For this purpose, we use the training data set from CIFAR-10 with varying degrees of label noise and empirically show how the different parameters of E-NKCVS affect the result.

The parameter that has the largest influence on the results is the noise ratio. Table 1 shows the precision and recall as well as the sizes of the different sample sets in the presence of different noise ratios. Predictably, the accuracy of the sample selection drops with an increase in label noise, i.e., the baseline with a noise ratio of 0 achieves the best result. There is a significant drop in both precision and recall when increasing the noise ratio from 0.6 to 0.8 (precision falls from 95.46% to 72.79% and recall from 86.52% to 66.02%), showing that obtaining a good model in the case of severe label noise is particularly challenging.

Impact of increasing $K$ (the number of splits):

In Figure 1(b) we see the results of E-NKCVS for varying values of $K$, while the plain NKCVS results (i.e., a single iteration, $M = 1$) are shown in Figure 1(a). As illustrated in the figures, the generalization performance steadily increases with higher $K$. Moreover, the increase is more pronounced at higher noise ratios. The increase in both precision and recall with higher $K$ indicates that more of the clean samples in the dataset are found while the purity of the selected samples is increased. This improvement of the selected sample set is reflected in the final accuracy, which is also higher with larger $K$. However, increasing $K$ has diminishing returns, and any performance gain needs to be weighed against the increased computation cost.

Impact of increasing $\tau$ (the selection threshold):

We investigate the impact of increasing the threshold $\tau$ while fixing the number of iterations $M$ (for we use ); the results are in Figure 1(c). The threshold acts as a regularizer: a higher $\tau$ will increase precision and decrease recall. This indicates that the selected sample set contains fewer samples but has a higher proportion of clean samples. The resulting accuracy of the model trained on the selected samples generally increases slightly with higher $\tau$, emphasizing the relative importance of a higher precision.

Figure 1(a) and Figure 1(b) illustrate the difference between plain NKCVS (i.e., setting $M = 1$) and E-NKCVS. E-NKCVS outperforms NKCVS, especially for lower values of $K$, demonstrating the significance of $M$. The outcome of fixing $\tau$ while increasing $M$ can be observed in Figure 1(c), where $M$ changes from 1 to 2 while $\tau$ is kept at 1. This will have the opposite effect to increasing $\tau$ while keeping $M$ static, i.e., the selected set will have more but noisier samples. The behavior of $M$ and $\tau$ is thus very flexible and can be adjusted depending on the scenario. In addition, the label re-weighting scheme will also benefit from larger $M$.

Datasets   Methods      Symmetric Noise                                                 Asymmetric Noise
                        0.0          0.2          0.4          0.6          0.8          0.2          0.4
MNIST      CE           99.3 ± 0.1   98.6 ± 0.1   98.1 ± 0.2   97.0 ± 0.2   81.5 ± 0.5   93.1 ± 0.1   81.1 ± 0.5
           Co-teaching  99.2 ± 0.1   99.2 ± 0.1   99.1 ± 0.1   98.4 ± 0.1   88.2 ± 0.5   97.1 ± 0.2   88.8 ± 0.5
           SL           99.3 ± 0.1   99.2 ± 0.1   99.0 ± 0.1   98.3 ± 0.1   91.4 ± 0.1   99.1 ± 0.1   98.0 ± 0.1
           E-NKCVS      99.3 ± 0.1   99.3 ± 0.1   99.1 ± 0.1   98.5 ± 0.1   91.9 ± 0.1   99.1 ± 0.2   98.4 ± 0.2
CIFAR-10   CE           89.7 ± 0.1   83.5 ± 0.1   78.8 ± 0.2   69.9 ± 0.6   41.5 ± 0.5   85.9 ± 0.2   78.5 ± 0.6
           Co-teaching  89.4 ± 0.2   86.6 ± 0.3   84.1 ± 0.8   81.1 ± 0.6   22.5 ± 3.6   86.8 ± 0.4   75.5 ± 0.5
           SL           89.5 ± 0.1   87.6 ± 0.1   85.3 ± 0.1   80.1 ± 0.1   59.5 ± 0.5   88.2 ± 0.1   80.6 ± 0.4
           E-NKCVS      89.7 ± 0.1   89.0 ± 0.1   86.3 ± 0.2   83.1 ± 0.2   63.5 ± 0.4   88.9 ± 0.2   85.1 ± 0.3
CIFAR-100  CE           69.1 ± 0.6   61.1 ± 0.4   51.4 ± 0.6   27.6 ± 1.2    7.7 ± 1.5   63.0 ± 0.3   61.8 ± 0.4
           Co-teaching  66.2 ± 0.5   61.3 ± 0.5   52.3 ± 0.8   41.1 ± 1.6    5.5 ± 2.6   63.2 ± 0.3   62.2 ± 0.5
           SL           68.2 ± 0.1   62.1 ± 0.1   55.3 ± 0.1   43.4 ± 0.1   15.5 ± 0.1   66.1 ± 0.2   63.1 ± 0.4
           E-NKCVS      69.1 ± 0.5   64.8 ± 0.2   59.6 ± 0.4   47.9 ± 0.3   26.3 ± 0.3   67.9 ± 0.3   64.5 ± 0.5
TREC       CE           96.5 ± 0.5   93.5 ± 0.8   90.1 ± 1.5   77.6 ± 4.2   26.3 ± 7.5   93.0 ± 1.0   79.3 ± 4.5
           Co-teaching  95.2 ± 0.5   92.3 ± 0.8   90.3 ± 0.8   81.3 ± 1.6   25.5 ± 4.6   92.8 ± 0.6   77.3 ± 3.8
           SL           96.4 ± 0.4   93.7 ± 0.5   92.2 ± 0.6   83.5 ± 2.2   30.2 ± 4.9   94.0 ± 0.5   83.3 ± 4.2
           E-NKCVS      96.4 ± 0.2   95.0 ± 0.5   93.8 ± 0.5   89.9 ± 0.6   35.5 ± 2.5   95.0 ± 0.6   88.0 ± 1.2
Table 2: Test accuracy (%, average over 5 runs) with different label noise ratios. We train on the manually corrupted training data and test on the clean test data. The best results are highlighted in bold.

5.3 Comparison to the state-of-the-art

We compare our algorithm with the standard cross-entropy loss and multiple state-of-the-art methods made for dealing with noisy labels.

  • CE: Basic cross-entropy loss.

  • Co-teaching [han2018coteach]: Simultaneously trains two networks and lets them teach each other. Samples selected by one network in a mini-batch are used for back-propagation in the other.

  • SL [wang2019sl]: A symmetric cross-entropy learning approach that tries to balance between sufficient learning and robustness to noisy labels.

  • DMI [Xu2019L_DMIAI]: Uses a determinant-based mutual information robust loss to train the DNNs.

  • MentorNet [jiang2017mentornet]: Trains a teacher network to teach a student network by providing a sample weighting scheme.

  • MentorMix [jiang2020beyond]: Proposes a new robust loss by mixing curriculum learning from MentorNet [jiang2017mentornet] and vicinal risk minimization.

  • DivideMix [li2020dividemix]: Dynamically divides the data based on label confidence and trains two networks in a semi-supervised manner based on MixMatch.

The baselines originally evaluate on different datasets and in the following evaluation we keep these distinctions. Accuracy on real-world datasets is reported as in the original papers while those baselines that utilize synthetic data are rerun for a fair comparison.

The experimental results on the datasets with artificial noise are summarized in Table 2 with E-NKCVS using the default parameters as presented in Section 5.1. Our solution matches or outperforms the baseline methods at nearly all levels and types of label noise; the only exception is TREC without label noise, where CE is 0.1 percentage points higher. The advantage of our method grows as the noise ratio becomes more severe and the task more difficult. This is particularly discernible on CIFAR-100, which is a more challenging dataset.

In the scenario with no corrupted labels (i.e., a noise ratio of 0), CE is the best baseline since it focuses on fitting the data instead of dealing with noisy labels. For the other methods, the adjustments made to the loss or the network architecture to account for label noise are at best wasted, and in many cases detrimental. In contrast, our E-NKCVS algorithm achieves a test accuracy close to that of CE, implying that our method has a minimal negative impact when little or no label noise is present.

5.4 Experiments on real-world noisy datasets

We further assess the capabilities of E-NKCVS and its practical usage on two datasets with real-world noisy labels, Clothing-1M and (mini) WebVision. The results can be seen in Table 3.

As shown, E-NKCVS achieves an accuracy of 75.0% on Clothing-1M, improving on the prior state-of-the-art without the use of the auxiliary training labels. Running without any additional considerations for the label noise, i.e., using a basic CE loss, a test accuracy of 69.0% is obtained. Thus, taking steps to identify and rectify mislabeled samples can give substantial model improvements on real-world datasets.

For WebVision, the test accuracy of E-NKCVS is competitive with recently published work and improves slightly upon the prior state-of-the-art. The results imply that our relatively simple E-NKCVS method is reliable and effective on datasets containing real-world noisy labels. Here, the CE baseline is surprisingly strong with a test accuracy of 74.0%, only outperformed by our method and the two best baseline methods. WebVision has a relatively lower noise level of around 20% [li2017webvision] compared to Clothing-1M with close to 40% label noise [xiao2015learning], which could explain part of this discrepancy.

Method        Clothing-1M   WebVision
CE            69.0          74.0
MentorNet     -             63.0
SL            71.0          -
DMI           72.5          -
Co-teaching   -             63.6
MentorMix     74.3          76.0
DivideMix     74.8          77.3
E-NKCVS       75.0          77.6
Table 3: Comparison with state-of-the-art methods in test accuracy (%) on Clothing-1M and (mini) WebVision. Results for baselines are copied from the original papers or, if missing, from Li et al. [li2020dividemix].

6 Conclusion

In this paper, we propose Ensemble Noise-Robust $K$-fold Cross-Validation Selection (E-NKCVS) to deal with the noisy label problem by selecting likely clean samples to use for model training. For non-selected samples, we further propose an entropy-based label re-weighting scheme based on the given label and the predicted labels. The effectiveness of our solution is verified on multiple datasets that are manually corrupted with different levels of symmetric and asymmetric label noise. We show that E-NKCVS matches or outperforms existing methods at nearly all levels of label noise. Particularly on the more complex and challenging dataset, CIFAR-100, we achieve a significant improvement over the second-best approach at high noise ratios. Experiments on two large real-world datasets with natural label noise, Clothing-1M and WebVision, further emphasize the usefulness of our method, and we show that our method achieves state-of-the-art test accuracy on both datasets. An extensive empirical hyperparameter analysis is provided that demonstrates the versatility of our proposed method. Moreover, due to the method's relative simplicity, it can easily be incorporated into existing DNN architectures to enhance their robustness against label noise.


Appendix A Proof of Theorem 1


Inserting the above into the definitions of precision and recall in Equation 7, we obtain the desired results. ∎

Appendix B Dataset Summary and Experimental Setup

b.1 Dataset Summary

The datasets used in the experimental part of the paper are introduced one by one below. A brief overview can be found in Table 4.

MNIST [lecun1998gradient] is a popular but small dataset of handwritten digits. This is a balanced dataset with 10 classes, each of which has 6,000 training and 1,000 testing images.

CIFAR-10 [krizhevsky2009learning] contains images with human-annotated labels. There are 10 classes, airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is balanced, with 5,000 images in the training set and 1,000 in the test set for each class.

CIFAR-100 [krizhevsky2009learning] is similar to CIFAR-10 but contains 100 classes instead of 10. This dataset is also balanced, with 500 training images and 100 testing images per class. This dataset is more complex as compared to CIFAR-10 with both more classes as well as fewer images per class.

TREC [li2002learning] (Text REtrieval Conference Question Classification) is a text dataset for question classification. The idea behind the dataset is that if a question can be classified into a semantic class correctly, this will put constraints on potential answers. The dataset has 6 labels which can be further separated into 50 second-level labels. In this work, we only use the 6 main labels. These are abbreviation, entity, description, human, location, and numeric. The average question sentence length is 10 and the vocabulary size is 8,700.

Clothing-1M [xiao2015learning] is a real-world dataset containing 1 million images of clothing articles. There are 14 different classes: t-shirt, shirt, knitwear, chiffon, sweater, hoodie, windbreaker, jacket, down coat, suit, shawl, dress, vest, and underwear. The images have all been obtained from online shopping websites and the labels are created from the attached text. The accuracy of the labels is about 61.54%. Some classes are more often confused with each other (e.g., sweater and knitwear), indicating that the dataset may contain both symmetric and asymmetric noise. There are 50,000 manually cleaned images provided for training. The validation and test sets contain 14,000 and 10,000 clean images, respectively. The cleaned training and validation images are not used and only the 10,000 testing samples are used for evaluation.

WebVision [li2017webvision] is a large-scale dataset of images with real-world noisy labels. The whole dataset contains 2.4 million images collected from the web using the same classes as ImageNet [deng2009imagenet]. The noise level in WebVision is estimated to be around 20% [li2017webvision]. We follow previous works [chen2019understanding, li2020dividemix] and use the first 50 classes of the Google image subset. We evaluate the test accuracy on the provided validation dataset.

Dataset | Training samples | Test samples | Classes (Q) | Image size
MNIST 60,000 10,000 10
CIFAR-10 50,000 10,000 10
CIFAR-100 50,000 10,000 100
TREC 5,500 500 6 -
Clothing-1M 1,000,000 10,526 14
WebVision (mini) 69,544 2,500 50
Table 4: Dataset summary.

b.2 Experimental Setup

During training, in each iteration of in Algorithm 1, we use of the data in to be a validation set . Then, the optimal model parameter is obtained by


where is trained on , and is a metric function. Here, we set to be the accuracy. Note that will be different in each iteration since will change.

Asymmetric noise:

The experiments using asymmetric noise require pairs of labels to be flipped for some fraction of samples. These flips are done on pairs of similar classes in each dataset. Following the setting of asymmetric noise in [wang2019sl], in the MNIST dataset, we flip , , , and . For CIFAR-10, we flip labels within the class pairs (TRUCK, AUTOMOBILE), (BIRD, AIRPLANE), (DEER, HORSE), and (CAT, DOG). In CIFAR-100, there are 20 super-classes, each of which has 5 sub-classes. To generate asymmetric noise, we randomly flip the labels of two sub-classes within each super-class. For the TREC dataset, we flip labels within the class pairs (abbreviation, entity), (description, human), and (location, human).

We set the loss tuning parameter in Equation 4.2 to . For the mixup data augmentation, in Equation 1 is set to . The network architectures and all further dataset-specific parameters are given as follows.


MNIST: A simple 4-layer CNN network (two convolutional and two fully connected layers) is used. We train the network with stochastic gradient descent (SGD) with a momentum . The learning rate is initially set to with a weight decay of . The training is run for 50 epochs and the learning rate is divided by 10 after 10 and 30 epochs.

CIFAR-10: We use an 8-layer CNN with six convolutional and two fully connected layers. We train the network with SGD with a momentum . Similar to MNIST, we set the initial learning rate to with a weight decay of . We divide the learning rate by 10 after 40 and 80 epochs and run for a total of 120 epochs.

CIFAR-100: Due to the relatively larger and more complex dataset, we use a larger network, ResNet-44 [he2016deep]. We train the network with SGD with a momentum . The initial learning rate is set to and the weight decay to . The training is run for 150 epochs and we divide the learning rate by 10 after 80 and 120 epochs.
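The step schedules described for MNIST, CIFAR-10 and CIFAR-100 all divide the learning rate by 10 at fixed epochs. A minimal sketch follows; the function name `step_lr` is hypothetical, epochs are counted from zero, and the base learning rate of 0.1 in the usage lines is a placeholder value rather than the paper's setting.

```python
def step_lr(base_lr, epoch, milestones, factor=10.0):
    """Learning rate for a zero-based epoch index: divided by `factor`
    once for every milestone that has already been reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)

# Milestones taken from the text; 0.1 is a placeholder base learning rate.
schedules = {
    "MNIST": ([10, 30], 50),
    "CIFAR-10": ([40, 80], 120),
    "CIFAR-100": ([80, 120], 150),
}
for name, (milestones, total_epochs) in schedules.items():
    rates = [step_lr(0.1, e, milestones) for e in range(total_epochs)]
    print(name, rates[0], rates[-1])
```

The same behaviour is available in most deep learning frameworks as a multi-step scheduler; the plain function is shown only to make the milestone arithmetic explicit.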

TREC: Since TREC is a text-based dataset, we use the pre-trained BERT-Base as our network. We use the Adam optimization algorithm with a learning rate of and run for 5 epochs with a batch size of 200. Note that no mixup data augmentation is used for this dataset.

Clothing-1M: Following [xiao2015learning, wang2019sl], we use ResNet-50 with ImageNet pre-trained weights. The manually cleaned and labeled training images provided are not used and discarded. Evaluation is done on the provided clean testing set of images. For preprocessing, we resize the images to and subtract the mean for each pixel. The images are then cropped at the center to a size of . We train the network with SGD for a single epoch with a learning rate of and a batch size of 200. We set , , and .

WebVision: We follow the setup in [li2020dividemix, chen2019understanding, jiang2020beyond] and use Inception-ResNet v2 [szegedy2016inception]. The methods are evaluated on the provided validation set of the first 50 classes of the Google image subset. We resize the images to and then randomly crop them to . The network is trained with SGD with a momentum of and a weight decay of for a total of epochs. The learning rate starts at and is decreased to at the 40th epoch and to at the 80th epoch. We set , , and .

Furthermore, for the CIFAR-10 and CIFAR-100 datasets, we apply data augmentation to the images in the form of width and height shifts and random horizontal flips.