Training Deep Neural Networks (DNNs) often requires large data sets to perform well on challenging problems such as image classification (litjens2017survey). However, the larger the data set, the more likely it is to be contaminated with noisy labels, caused by low-quality data, human error, or ambiguous labelling tasks (frenay_survey). The main issue is that DNNs can easily fit noisy labels, particularly at high label noise rates, reducing their accuracy, as shown by Zhang et al. (zhang2016understanding).
In the literature, several methods have been proposed to deal with noisy labels (kim2019nlnl; wang2019symmetric; ren2018learning; nguyen2019self; li2020dividemix). The most successful methods follow a 2-stage process: an unsupervised learning method classifies training samples as clean or noisy, and a semi-supervised learning (SSL) method then minimises the empirical vicinal risk (EVR), with a labelled set formed by the samples classified as clean and an unlabelled set formed by the samples classified as noisy. The unsupervised learning stage is generally based on the small-loss strategy (yu2019does), where at every epoch, samples with small loss are classified as clean and samples with large loss as noisy. This strategy can lead to a low classification precision of clean samples, particularly in high noise rate scenarios, where loss values can be unstable across training epochs. The SSL stage (arazo2019unsupervised; nguyen2019self; li2020dividemix) is usually based on MixMatch (berthelot2019mixmatch), which minimises the EVR (zhang2017mixup), where a robust estimation of the vicinal distribution is critical for an effective optimisation that generalises well. Such a robust estimation depends on a large training set (berthelot2019mixmatch; zhang2018generalization), but high noise rates usually cause the unsupervised learning stage to build a small training set for this optimisation, hurting the generalisation of the SSL stage.
In this paper, we hypothesise that the generalisation of 2-stage noisy-label learning methods depends on the precision of the unsupervised learning stage to classify clean and noisy samples and a large training set to minimise the EVR at the SSL stage. We empirically validate these two hypotheses and propose a new 2-stage noisy-label training algorithm, called LongReMix. LongReMix is based on a theoretically sound unsupervised learning method to maximise the precision of the clean sample classification by considering the small-loss strategy over a range of epochs instead of a single one. Then, we artificially increase the training set size to improve the generalisation of MixMatch for the minimisation of the EVR during the SSL stage (berthelot2019mixmatch). We evaluate our approach on the noisy-label learning benchmarks of CIFAR-10 (krizhevsky2009learning), CIFAR-100 (krizhevsky2009learning), WebVision (li2017WebVision), Clothing1M (xiao2015learning), and Food101-N (lee2018cleannet), where LongReMix shows the best performance in the field in almost all of those data sets, particularly in problems with extremely large noise rates. We also show that LongReMix finds a set of clean samples with higher precision than the competing methods, and is robust to over-fitting in problems with high label noise.
2 Prior Work
Several methods have been proposed for the noisy-label problem, and they explore different strategies, such as robust loss functions (wang2019imae; wang2019symmetric), label cleansing (jaehwan2019photometric; yuan2018iterative), sample weighting (ren2018learning), meta-learning (han2018pumpout), ensemble learning (miao2015rboost), and others (yu2018learning; kim2019nlnl; zhang2019metacleaner). Below, we focus on the prior work that is close to our approach and that shows competitive results on the main benchmarks. It is important to mention that we do not consider methods that need a clean validation set, such as (zhang2020distilling), because we believe this forms a less general experimental setup.
Several approaches explore the sample noise characterisation. xue2019robust
present a probabilistic Local Outlier Factor algorithm (pLOF) to estimate the probability that a sample is an outlier, which is assumed to have label noise. The idea explored by pLOF is that the density around a noisy sample is significantly different from the density around its (clean) neighbours. However, in high noise rate problems, the effectiveness of pLOF is reduced because it cannot find significant differences between the densities of noisy and clean samples. wang2018iterative also use pLOF, combined with a Siamese network, to increase the dissimilarities between clean and noisy samples. Nevertheless, the incorrect classification of clean samples by pLOF can induce the learning of wrong feature representations. arazo2019unsupervised propose the use of a Beta Mixture Model (BMM) to separate the clean and noisy samples during training, based on the loss value of each sample. Similarly, li2020dividemix
use a Gaussian Mixture Model (GMM) for the same goal. Although BMM and GMM applied to loss values work well for low noise rates, they become less precise in high noise regimes. One issue affecting the precision of the classification of clean samples is that these methods estimate the clean and noisy sets using the loss from the latest training epoch only, without considering the stability of that classification over several epochs.
Another technique being studied for noisy-label learning is the use of multiple models to improve the robustness of sample noise characterisation. han2018co propose Co-teaching, which trains two models simultaneously, where each model estimates the clean sample set to be used by the other model. However, as the number of epochs increases, both networks converge to a consensus and show little difference between their estimated clean sets. Co-teaching+ (yu2019does) relies on small-loss samples on which the two models' predictions disagree to select the data for the other model. Although this multiple-model strategy shows better results for filtering clean samples, noisy samples are usually ignored during training, decreasing the effectiveness of the approach.
After distinguishing between clean and noisy samples, methods either disregard the noisy samples during training (thulasidasan2019combating; han2018co), or use both the clean and noisy samples in a semi-supervised learning (SSL) approach (li2020dividemix; arazo2019unsupervised; sachdeva2021evidentialmix), where SSL-based methods tend to show better results on benchmarks. One particularly successful SSL-based technique is DivideMix (li2020dividemix), which relies on MixMatch (berthelot2019mixmatch) to linearly combine training samples classified as clean or noisy for the EVR minimisation (zhang2017mixup). The generalisation of the EVR minimisation has been theoretically shown to depend on a large training set (zhang2018generalization). However, recent methods, such as DivideMix (li2020dividemix), constrain this training set to be of the same size as the clean set, which tends to be small in large noise rate scenarios. Our approach removes this constraint, allowing a better generalisation of the EVR minimisation.
3.1 Problem Definition
Consider the training set $\mathcal{D} = \{(\mathbf{x}_i, \tilde{\mathbf{y}}_i)\}_{i=1}^{|\mathcal{D}|}$, where $\mathbf{x}_i \in \mathcal{X}$ is the image and
$\tilde{\mathbf{y}}_i \in \{0,1\}^{|\mathcal{Y}|}$ is a one-hot vector representing the noisy label, with $\mathcal{Y}$ denoting the set of labels and $|\mathcal{Y}|$ the number of classes. The label $\tilde{\mathbf{y}}_i$ may differ from the unknown true label $\mathbf{y}_i$ as a result of a noise process represented by $\eta_{jc} = p(\tilde{y} = j \mid y = c)$, where $j, c \in \mathcal{Y}$ are the classes, $\eta_{jc}$ is the probability of flipping the class $c$ to $j$, and $\sum_{j \in \mathcal{Y}} \eta_{jc} = 1$. We assume that this noise process can be of three types, namely symmetric (kim2019nlnl), asymmetric (patrini2017making), and semantic (rog). The symmetric noise, also called uniform noise, refers to a noise type where the hidden label flips to a random class with a fixed probability $\eta$, with the true label included among the flipping options, which means that $\eta_{jc} = \eta / |\mathcal{Y}|$ for $j \neq c$, and $\eta_{cc} = 1 - \eta + \eta / |\mathcal{Y}|$. The asymmetric noise is based on flipping labels between similar classes (patrini2017making), where $\eta_{jc}$ depends only on the classes $j$ and $c$, but not on the image $\mathbf{x}$. For example, on the CIFAR-10 data set (krizhevsky2009learning), the asymmetric noise maps truck → automobile, bird → plane, deer → horse, as mapped by (zhang2018generalized). The semantic noise (rog) depends on both the classes and the image $\mathbf{x}$.
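As a concrete illustration of the symmetric and asymmetric noise types defined above, the sketch below injects label noise into an integer label array (the function names, seed handling, and example mapping are our own, not from the paper):

```python
import numpy as np

def add_symmetric_noise(labels, num_classes, eta, seed=0):
    """Symmetric noise: with probability eta, replace the label with a class
    drawn uniformly from all num_classes classes (true class included,
    matching the definition of uniform noise above)."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < eta          # which samples get re-drawn
    noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return noisy

def add_asymmetric_noise(labels, mapping, eta, seed=0):
    """Asymmetric noise: with probability eta, flip class c to mapping[c]
    (e.g. truck -> automobile); classes without an entry are kept as-is."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < eta
    noisy[flip] = np.array([mapping.get(int(c), int(c)) for c in noisy[flip]])
    return noisy
```

Note that under symmetric noise the effective fraction of changed labels is below $\eta$, since a re-drawn label can land on the true class.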
We only consider 2-stage noisy-label learning approaches (li2020dividemix; ding2018semi; kong2019recycling) that hold state-of-the-art (SOTA) results on all benchmarks – these approaches are based on: 1) an unsupervised learning classifier that characterises training samples as clean or noisy; and 2) a semi-supervised learning classifier that assumes that the training samples classified as clean are labelled, and the samples classified as noisy are unlabelled. The SOTA noise-robust classifier (li2020dividemix; nguyen2019self) is formed by an ensemble of two classifiers, each represented by $f_{\theta_m}: \mathcal{X} \to [0,1]^{|\mathcal{Y}|}$, where the classifier structure is the same for both, but their parameters are denoted by $\theta_1$ and $\theta_2$. The training of $f_{\theta_1}$ influences $f_{\theta_2}$ and vice-versa, which can be achieved by co-training (li2020dividemix) or student-teacher (nguyen2019self) approaches. Our training relies on co-training.
The unsupervised learning classifier predicts the clean and noisy samples based on their loss values (arazo2019unsupervised; li2020dividemix; rog; jiang2020beyond). Formally, assuming that the training is minimising the empirical risk $\frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|} \ell(\mathbf{x}_i, \tilde{\mathbf{y}}_i; \theta)$, the sets of clean and noisy samples are respectively defined by
$$\hat{\mathcal{D}}_{clean} = \{ (\mathbf{x}_i, \tilde{\mathbf{y}}_i) \in \mathcal{D} : \gamma(\ell_i; \zeta) > \tau \}, \qquad \hat{\mathcal{D}}_{noisy} = \mathcal{D} \setminus \hat{\mathcal{D}}_{clean}, \qquad (1)$$
where $\ell_i = \ell(\mathbf{x}_i, \tilde{\mathbf{y}}_i; \theta)$ represents a classification loss (e.g., cross entropy), $\gamma(\ell_i; \zeta)$ is a function that computes the probability that the training sample is clean based on its loss (jiang2020beyond; li2020dividemix; zhang2020distilling; nguyen2019self), parameterised by $\zeta$ (in this paper, this probability function computes the posterior of the smaller-mean component of a bi-modal GMM, where this smaller mean represents the clean GMM component (li2020dividemix)), and $\tau$ is a classification threshold. To learn $\theta_1$ and $\theta_2$, co-training uses the clean and noisy sets from model $f_{\theta_1}$ to train $f_{\theta_2}$, and vice-versa.
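The loss-based clean-sample classifier can be sketched as below: a two-component GMM is fit to the per-sample losses, and the posterior of the smaller-mean component serves as the clean probability (a minimal scikit-learn sketch; the hyper-parameter values are illustrative, not the paper's):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probability(losses, tau=0.5):
    """Fit a 2-component 1-D GMM to per-sample losses; the posterior of the
    smaller-mean (small-loss) component is the probability that each
    sample is clean. Returns the probabilities and a clean mask."""
    losses = np.asarray(losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, reg_covar=5e-4,
                          random_state=0).fit(losses)
    clean_comp = int(np.argmin(gmm.means_.ravel()))  # small-loss component
    p_clean = gmm.predict_proba(losses)[:, clean_comp]
    return p_clean, p_clean > tau
```

With a bi-modal loss distribution (confident fits on clean labels, high losses on noisy ones), the two components separate the training set without any supervision.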
The semi-supervised learning based on MixMatch (berthelot2019mixmatch) mixes the elements of $\hat{\mathcal{D}}_{clean}$ and $\hat{\mathcal{D}}_{noisy}$ to minimise the empirical vicinal risk (EVR) (zhang2017mixup):
$$\ell_{EVR} = \ell_{clean}(\hat{\mathcal{D}}_{clean}; \theta) + \lambda\,\ell_{noisy}(\hat{\mathcal{D}}_{noisy}; \theta), \qquad (2)$$
where $\lambda$ weights the noisy set loss, and $\ell_{clean}$ and $\ell_{noisy}$ denote the losses in the clean and noisy sets, respectively, computed over the mixed sets
$$\tilde{\mathcal{D}}_{clean}, \tilde{\mathcal{D}}_{noisy} = \text{MixUp}(\hat{\mathcal{D}}_{clean}, \hat{\mathcal{D}}_{noisy}), \qquad (3)$$
whose elements are formed as
$$\tilde{\mathbf{x}} = \beta\,\mathbf{x}_i + (1-\beta)\,\mathbf{x}_j, \qquad \tilde{\mathbf{y}} = \beta\,\tilde{\mathbf{y}}_i + (1-\beta)\,\tilde{\mathbf{y}}_j, \qquad \beta \sim \text{Beta}(\alpha, \alpha), \qquad (4)$$
where the vicinal distribution is represented by a Dirac mass centred at each mixed sample $(\tilde{\mathbf{x}}, \tilde{\mathbf{y}})$. In (li2020dividemix), the noisy set and clean set used for the mixing are constrained to be of equal size, which means that the number of MixUp operations per epoch is proportional to the predicted clean set size $|\hat{\mathcal{D}}_{clean}|$.
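The MixUp operation at the core of the vicinal estimation can be sketched as below (the `max(lam, 1 - lam)` step is the MixMatch variant, which keeps the mixed sample closer to the first input; names and the default Beta parameter are our own):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=4.0, seed=0):
    """MixUp (zhang2017mixup): convex combination of two samples and their
    one-hot / soft labels with a Beta(alpha, alpha) coefficient.
    MixMatch additionally takes lam = max(lam, 1 - lam) so the result
    stays closer to the first (labelled) input."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)        # MixMatch variant
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```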
3.3 Our Hypothesis
We hypothesise that the generalisation of 2-stage noisy-label learning methods depends on: 1) the precision of the classification of clean samples to be included in $\hat{\mathcal{D}}_{clean}$ in (1), and 2) the size of the clean set, denoted by $|\hat{\mathcal{D}}_{clean}|$. In particular, a large $\hat{\mathcal{D}}_{clean}$ with a high proportion of true positives will reduce the bound of the difference between the estimated and vicinal risks (zhang2018generalization), improving the semi-supervised classification accuracy.
Let us begin with our proposed method to increase the precision of the classification of clean samples in $\hat{\mathcal{D}}_{clean}$. Our idea is to classify as clean the samples that consistently show a small loss for $w$ consecutive epochs. Assume that $p$ denotes the probability of classifying a clean sample as clean, so $1 - p$ is the probability of classifying a clean sample as noisy. Similarly, $q$ represents the probability of classifying a noisy sample as noisy, so $1 - q$ is the probability of classifying a noisy sample as clean. Also, $\pi_c$ and $\pi_n$ denote the proportions of clean and noisy samples in the training set, with $\pi_c + \pi_n = 1$. The probability of a clean sample being in the clean set after $w$ epochs is $p^w$, and the probability of a noisy sample being in the clean set after $w$ epochs is $(1-q)^w$.
Assuming that $p < 1$ (so $p^w \to 0$) and $1 - q < p$ (so $((1-q)/p)^w \to 0$), the classification precision of clean samples in $\hat{\mathcal{D}}_{clean}$ tends to 1 and the recall tends to 0, as $w$ increases.
The precision and recall are calculated with:
$$\text{Precision} = \frac{\pi_c\, p^w}{\pi_c\, p^w + \pi_n (1-q)^w}, \qquad \text{Recall} = p^w. \qquad (5)$$
Given the assumptions that $1 - q < p$ and $\pi_c, \pi_n > 0$, and that $\lim_{w \to \infty} ((1-q)/p)^w = 0$, Precision tends to 1; similarly, given that $p < 1$, Recall tends to 0. ∎
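A quick numerical check of Eq. (5) under assumed per-epoch rates (the values $p = 0.9$, $q = 0.8$, $\pi_c = 0.2$ below are hypothetical, chosen to mimic a high noise rate regime):

```python
def clean_set_precision_recall(p, q, pi_c, w):
    """Eq. (5): precision and recall of the predicted clean set when a
    sample must be classified as clean for w consecutive epochs.
    p     : prob. a clean sample is classified as clean in one epoch
    1 - q : prob. a noisy sample is classified as clean in one epoch
    pi_c  : proportion of clean samples in the training set."""
    pi_n = 1.0 - pi_c
    tp = pi_c * p ** w            # clean samples surviving all w epochs
    fp = pi_n * (1.0 - q) ** w    # noisy samples surviving all w epochs
    return tp / (tp + fp), p ** w

# Assumed rates: precision rises and recall falls as w grows.
for w in (1, 5, 20):
    prec, rec = clean_set_precision_recall(0.9, 0.8, 0.2, w)
```

Since noisy samples are much less likely than clean ones to look clean $w$ times in a row, the false-positive term decays faster, which is exactly the precision/recall trade-off analysed above.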
In low noise rate problems, $\pi_c$ tends to be large and $\pi_n$ small, so even for small values of $w$, Precision will be close to one with a relatively high Recall in (5), allowing for a large $|\hat{\mathcal{D}}_{clean}|$. According to Theorem 8 in (zhang2018generalization), a large $|\hat{\mathcal{D}}_{clean}|$ will decrease the bound for vicinal risk minimisation. On the other hand, in high noise rate scenarios, $\pi_c$ tends to be small and $\pi_n$ large, which means that $w$ needs to increase to push the Precision close to one, but that can reduce the Recall to very low values, resulting in a potentially small $|\hat{\mathcal{D}}_{clean}|$, which will increase the vicinal risk minimisation bound (zhang2018generalization). Therefore, $w$ is a hyper-parameter that needs to be estimated to achieve a good trade-off between Precision and Recall. Nevertheless, for high noise rate scenarios, even with a careful estimation of $w$, $|\hat{\mathcal{D}}_{clean}|$ can still be small. Hence, we propose that $\hat{\mathcal{D}}_{clean}$ must be sampled with replacement when mixing up $\hat{\mathcal{D}}_{clean}$ and $\hat{\mathcal{D}}_{noisy}$ in (3), such that the number of MixUp operations per epoch is $|\mathcal{D}|$.
Our proposed LongReMix algorithm is divided into two stages (Figure 1). The first stage, comprising the High Confidence Training (HCT), trains the model to find a high confidence set of clean samples with high precision. Next, in the second stage, we build a core set of clean samples using the largest high confidence set obtained from the first stage, and retrain the model with this core set. Moreover, we propose a new way to build the data sets $\hat{\mathcal{D}}_{clean}$ and $\hat{\mathcal{D}}_{noisy}$ in (3), called LongMix, which enables the number of MixUp operations to be proportional to $|\mathcal{D}|$ instead of $|\hat{\mathcal{D}}_{clean}|$, as described in Sec. 3.3.
4.1 First Stage: High Confidence Training
The high confidence training (HCT) aims to increase the precision of the unsupervised classification of clean and noisy training samples. Following the idea presented in Sec. 3.3, we re-define how to form the sets of clean and noisy samples, originally defined in (1), as follows:
$$\hat{\mathcal{D}}^{(t)}_{clean} = \{ (\mathbf{x}_i, \tilde{\mathbf{y}}_i) \in \mathcal{D} : \gamma(\ell^{(e)}_i; \zeta) > \tau, \;\forall e \in \{t-w+1, \dots, t\} \}, \qquad \hat{\mathcal{D}}^{(t)}_{noisy} = \mathcal{D} \setminus \hat{\mathcal{D}}^{(t)}_{clean}, \qquad (6)$$
where $\ell^{(e)}_i$ represents the loss of sample $i$ at training epoch $e$, $\gamma(\ell^{(e)}_i; \zeta)$ is the probability that sample $i$ is clean given its loss (the GMM posterior), $\tau$ is the classification threshold, and $w$ denotes the confidence window comprising the current and the $w-1$ previous epochs – this is represented by the block "filter" that produces the high confidence samples in Fig. 1. Hence, for a sample to be in the clean set, it must be classified as clean for $w$ epochs in a row, resulting in a more consistent, but smaller, set of clean samples, containing fewer noisy samples than the set in (1).
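The consecutive-epoch filter can be sketched as a reduction over a per-epoch boolean history of clean classifications (a minimal sketch; in practice the history would come from thresholding the GMM posterior at every epoch):

```python
import numpy as np

def high_confidence_clean(clean_history, w):
    """clean_history: (epochs, samples) boolean array, where entry [t, i] is
    True when sample i was classified as clean at epoch t (e.g. clean
    probability above the threshold). A sample enters the high-confidence
    clean set only if it was classified as clean in all of the last
    w epochs."""
    clean_history = np.asarray(clean_history, dtype=bool)
    return clean_history[-w:].all(axis=0)
```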
4.2 Second Stage: Guided Training
The second stage of the training depends on the core set of clean samples estimated from the first training stage with
$$\mathcal{C} = \hat{\mathcal{D}}^{(t^*)}_{clean}, \qquad t^* = \arg\max_{t \in \{1, \dots, T\}} |\hat{\mathcal{D}}^{(t)}_{clean}|, \qquad (7)$$
where $T$ is the total number of training epochs for the first stage of training, i.e., $\mathcal{C}$ is the largest high confidence clean set found during the first stage. In the second stage of training, we define the labelled and unlabelled sets as in (1), but we use $\mathcal{C}$ to update these sets as follows:
$$\hat{\mathcal{D}}_{clean} \leftarrow \hat{\mathcal{D}}_{clean} \cup \mathcal{C}, \qquad \hat{\mathcal{D}}_{noisy} \leftarrow \hat{\mathcal{D}}_{noisy} \setminus \mathcal{C}. \qquad (8)$$
During the second stage of LongReMix, we retrain the model from scratch (we compared fine-tuning the model trained in the first stage against training from scratch, and the latter showed the best results), using the core set of clean samples from (7), which is always included in the predicted clean set and keeps the original labels from $\mathcal{D}$.
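The core-set construction and the guided update can be sketched over sample-index sets (a simplified sketch with our own names; in the actual algorithm the sets carry the images and their original training labels):

```python
def core_set(high_conf_sets):
    """Pick, over all first-stage epochs, the largest high-confidence
    clean set as the core set used to guide the second training stage."""
    return max(high_conf_sets, key=len)

def guided_clean_set(predicted_clean, core):
    """Second stage: force the core samples into the predicted clean set,
    so they are always treated as labelled data."""
    return set(predicted_clean) | set(core)
```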
As explained in Sec. 3.3, we hypothesise that by sampling the clean set with replacement, we increase the number of MixUp operations in the EVR loss in (2), resulting in a smaller bound on the difference between the estimated and vicinal risks (zhang2018generalization). Therefore, we propose LongMix, which increases the number of MixUp operations to be $|\mathcal{D}|$, instead of the number of predicted clean samples $|\hat{\mathcal{D}}_{clean}|$. A criticism that could be faced by LongMix is that adding more MixUp iterations per epoch may be equivalent to a simple increase in the number of epochs, but we show in the experiments that this is not true.
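LongMix's sampling with replacement can be sketched as below (our own naming; the partner pool and pairing policy are simplified relative to MixMatch's batch construction):

```python
import numpy as np

def longmix_pairs(clean_idx, noisy_idx, total, seed=0):
    """Draw `total` MixUp index pairs per epoch (total = |D|, the full
    training-set size) by sampling the predicted clean set with
    replacement, instead of the |D_clean| pairs of the baseline. The
    first element of each pair is a labelled (clean) sample; the partner
    is drawn from the clean and noisy samples combined."""
    rng = np.random.default_rng(seed)
    first = rng.choice(clean_idx, size=total, replace=True)
    pool = np.concatenate([clean_idx, noisy_idx])
    second = rng.choice(pool, size=total, replace=True)
    return first, second
```

Even when the predicted clean set is tiny, sampling with replacement keeps the number of MixUp operations per epoch at the full training-set size.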
4.3 Training Loss
The training loss for our proposed LongReMix is (li2020dividemix):
$$\ell = \ell_{EVR} + \lambda_{reg}\,\ell_{reg}, \qquad (9)$$
where $\ell_{EVR}$ denotes the empirical vicinal risk defined in (2), computed with the sets from (8), $\lambda_{reg}$ weights the regularisation loss, and
$$\ell_{reg} = \text{KL}\!\left[\,\boldsymbol{\pi} \;\Big\|\; \tfrac{1}{|\tilde{\mathcal{D}}|} \textstyle\sum_{(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) \in \tilde{\mathcal{D}}} f_{\theta}(\tilde{\mathbf{x}})\,\right], \qquad (10)$$
with $\boldsymbol{\pi}$ denoting a vector of $|\mathcal{Y}|$ dimensions with values equal to $\frac{1}{|\mathcal{Y}|}$, and $\text{KL}[\cdot \| \cdot]$
representing the Kullback-Leibler divergence between $\boldsymbol{\pi}$ and the mean model prediction. The pseudo-code for the training of LongReMix is shown in Algorithm 1 in the supplementary material.
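The regularisation term, the KL divergence between a uniform class prior and the mean prediction over a batch, can be sketched as (a minimal NumPy sketch of the term described above):

```python
import numpy as np

def uniform_prior_regulariser(probs):
    """KL(pi || p_bar) between a uniform prior pi over the classes and the
    mean model prediction p_bar across the batch; it penalises the model
    for collapsing all predictions onto a few classes."""
    probs = np.asarray(probs, dtype=float)      # (batch, classes)
    pi = np.full(probs.shape[1], 1.0 / probs.shape[1])
    p_bar = probs.mean(axis=0)
    return float(np.sum(pi * np.log(pi / p_bar)))
```

The term is zero when the average prediction is uniform and grows as predictions concentrate, which is useful in high noise regimes where the model can otherwise collapse onto a handful of classes.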
5 Experiments
We compare LongReMix with related approaches on five noisy-label learning benchmarks. We also analyse the performance of LongReMix in a number of ablation studies. All comparisons are performed with the same network architecture and trained for the same number of epochs as the compared methods.
5.1 Data Sets
We conduct our experiments on the data sets CIFAR-10, CIFAR-100 (krizhevsky2009learning), Clothing1M (xiao2015learning), WebVision (li2017WebVision) and Food101-N (lee2018cleannet). CIFAR-10 and CIFAR-100 have 50,000 training and 10,000 testing images of size 32×32 pixels, where CIFAR-10 has 10 classes and CIFAR-100 has 100 classes, and all training and testing sets have a perfectly balanced number of images per class. As CIFAR-10 and CIFAR-100 originally do not contain label noise, a common approach is to add synthetic noise to evaluate the models. For CIFAR-10/CIFAR-100 we investigate three noise types: symmetric, asymmetric and semantic, as defined in Sec. 3.1. The symmetric noise is generated using $\eta \in \{20\%, 50\%, 80\%, 90\%\}$, with $\eta$ defined in Sec. 3.1. The asymmetric noise is produced following the mapping used in (li2020dividemix; patrini2017making), with $\eta \in \{40\%, 49\%\}$ (note that we study $\eta = 49\%$ because it is close to the theoretical limit of 50% for this type of noise). We also evaluate the semantic noise scenario, where we follow the setup from (rog) to generate semantically noisy labels based on a trained VGG (vgg), DenseNet (DN), and ResNet (RN) on CIFAR-10 and CIFAR-100.
Clothing1M consists of 1 million training images acquired from online shopping websites, composed of 14 classes. As the images from the data set vary in size, we resized them to 256×256 pixels for training, as done in (li2020dividemix; han2019deep). The data set is heavily imbalanced and most of the noise is asymmetric (yi2019probabilistic), with a noise rate estimated to be around 40% (xiao2015learning). The data set provides additional clean sets for training, validation, and testing of 50k, 14k and 10k images, respectively. In our experiments we do not use the clean training or validation sets, but we use the clean test set for evaluation.
WebVision contains 2.4 million images collected from the internet, covering the same 1000 classes as ILSVRC12 (deng2009imagenet), with images resized to 227×227 pixels for training. It provides a clean test set of 50k images, with 50 images per class. We compare our model using the first 50 classes of the Google image subset, as done in (li2020dividemix; chen2019understanding).
Food101-N (lee2018cleannet) contains 310,009 training images of food recipes classified into 101 classes, and 25,000 images for the test set. The images from this data set were resized to 256×256 pixels. This data set is based on the Food101 data set (bossard2014food), but it has more images with noisy labels. The test set is the clean 25k-image test set provided by the original Food101 (bossard2014food).
The model is a 18-layer PreAct ResNet (PRN18) (he2016identity) for CIFAR-10 and CIFAR-100, InceptionV2 (szegedy2017inception) for WebVision (the model used by competing approaches), and ResNet-50 (he2016deep) for Clothing1M and Food-101N. The models are trained with stochastic gradient descent with momentum of 0.8, weight decay of 0.0005 and batch size of 64. The learning rate is 0.02, reduced to 0.002 in the middle of the training. The WarmUp and total number of epochs are set per data set, as defined in (li2020dividemix). For CIFAR-10 and CIFAR-100, PRN18 is trained with a WarmUp stage of 30 epochs and 300 epochs of total training. For WebVision, the InceptionV2 is trained for 100 epochs, with a WarmUp stage of 1 epoch. For Clothing1M, ResNet-50 is trained for 80 epochs with a WarmUp stage of 1 epoch. For Food-101N, we also use ResNet-50 and rely on the same training protocol as in (han2019deep), consisting of training for 30 epochs, a WarmUp stage of 1 epoch, and reducing the learning rate by a factor of 10 every 10 epochs. The MixMatch parameter $\alpha$ in (4) and the regularisation weight $\lambda_{reg}$ for the loss in (9), which takes different values for symmetric and asymmetric noise, are as defined in (li2020dividemix). The confidence window $w$ in (6) was defined empirically and kept fixed for all the experiments. In Table 1 of the supplementary material we show that, in general, Precision increases and Recall decreases with larger $w$ values. Also, classification accuracy reaches a peak at an intermediate $w$ for large noise rates (high symmetric and asymmetric noise), while for lower noise rates, accuracy does not change much with different values of $w$.
5.3 Precision and Recall of the Clean Set
We evaluate the precision and recall of the clean set from (6) in the last epoch of the first stage of training (HCT), compared to the clean set from (1), which relies on the small-loss result from the last epoch only (Baseline). We assess that by computing $\text{Precision} = \frac{TP}{TP + FP}$ and $\text{Recall} = \frac{TP}{TP + FN}$ of the sets from (6) and from (1), where $TP$ refers to the samples correctly predicted as clean, $FP$ denotes the noisy samples incorrectly predicted as clean, and $FN$ denotes the clean samples incorrectly predicted as noisy. Figure 5-(a) shows the Precision vs Recall of the predicted clean set for CIFAR-10 with 40% asymmetric noise, where results are obtained by varying the threshold $\tau$ applied to the clean probability to form the clean and noisy sets. We highlight the value $\tau = 0.5$, which is the default value (li2020dividemix) that we use to split the clean and noisy samples. Notice that in this highly asymmetric noise scenario, the curve from HCT shows a better trade-off than the Baseline. Figure 5-(b,c) shows that HCT trades a higher precision for a lower recall, compared with the Baseline, for several types of noise. As shown below, this has a large influence on the training efficacy of LongReMix.
5.4 LongMix Analysis
Figure 6 shows the test accuracy versus training steps (iterations) for LongMix compared to the baseline (li2020dividemix), for CIFAR-10 at 90% symmetric noise. This figure shows that adding more MixUp iterations per epoch, as in LongMix, is not equivalent to adding more epochs, as in the baseline (li2020dividemix), supporting the claim in Sec. 4.2. Table 1 provides further evidence by comparing LongMix and the baseline (li2020dividemix) using the same number of training iterations for different noise rates on CIFAR-10 and CIFAR-100 – the results show that LongMix is more accurate in most cases.
5.5 Comparison with the State-of-the-Art
| Method | CIFAR-10 DN (32%) | CIFAR-10 RN (38%) | CIFAR-10 VGG (34%) | CIFAR-100 DN (34%) | CIFAR-100 RN (37%) | CIFAR-100 VGG (37%) |
|---|---|---|---|---|---|---|
| CE + RoG | 68.33 | 64.15 | 70.04 | 61.14 | 53.09 | 53.64 |
| Bootstrap + RoG | 68.38 | 64.03 | 70.11 | 54.71 | 53.30 | 53.76 |
| Forward + RoG | 68.20 | 64.24 | 70.09 | 53.91 | 53.36 | 53.63 |
| Backward + RoG | 68.66 | 63.45 | 70.18 | 54.01 | 53.03 | 53.50 |
| D2L + RoG | 68.57 | 60.25 | 59.94 | 31.67 | 39.92 | 45.42 |
For CIFAR-10 and CIFAR-100, we evaluate our model using different levels of symmetric label noise, ranging from 20% to 90%. We also consider asymmetric noise, with noise rates of 40% and 49%. We report both the best test accuracy across all epochs and the test accuracy averaged over the last 10 epochs of training, similarly to (li2020dividemix). Table 2 shows that for the CIFAR-10 and CIFAR-100 data sets, our method obtains better results for all evaluated noise rates. LongReMix displays a larger improvement for the high symmetric noise and asymmetric noise scenarios, which can be considered the most challenging cases. We believe that the improvement at higher noise rates is due to the LongMix approach, which runs a large number of MixUp operations, proportional to the size of the training set. The retraining with high confidence samples also improves the results for asymmetric noise. The results for semantic noise (rog) in Table 4 again show the superiority of our approach compared to the related work.
We also evaluate our method on large-scale data sets. For WebVision, Table 5
shows the Top-1 and Top-5 accuracy, where LongReMix displays better results than competing methods. For the Clothing1M evaluation, the competing methods rely on a model pre-trained on ImageNet. In our experiments, we did not observe any improvement with pre-trained models, and therefore we trained from scratch with 128k images from Clothing1M. The results in Table 6 show that our model, trained from scratch and with a reduced training set, obtains comparable results to the competing approaches. Lastly, Table 7 summarises the results for Food-101N. For this problem, we evaluate our approach both with a pre-trained model and trained from scratch, and LongReMix outperforms all other approaches in both scenarios.
5.6 Ablation Study
We analyse the effect of the different components of our proposal in an ablation study, shown in Table 3. Below, "Retrain" denotes the high-confidence training explained in Sec. 4.1, which increases the accuracy of the classifier that distinguishes between clean and noisy samples; and "LongMix" represents the guided training from Sec. 4.2, which increases the number of MixUp operations. We first evaluate our approach without LongMix – this variant is referred to as "Retrain". Then we evaluate training only with LongMix, without the second stage of re-training; the whole model is denoted as LongReMix. In general, we observe that LongReMix is competitive in all noise scenarios (being best or second best in all cases), and it is generally better on the large-scale data sets. Considering the different data sets and noise rates, LongReMix shows the best average rank.
6 Conclusion
We presented LongReMix, a new 2-stage noisy-label learning algorithm based on an unsupervised learning stage to classify clean and noisy training samples, followed by an SSL stage to minimise the EVR using a labelled set formed by samples classified as clean, and an unlabelled set with samples classified as noisy. Our LongReMix improves the precision of the unsupervised learning stage and improves the generalisation of the EVR minimisation. We show that LongReMix reaches state-of-the-art performance on several benchmarks, and is robust to over-fitting in high label noise problems.