Over the last few years, deep learning, and Convolutional Neural Networks (CNNs) in particular, has become increasingly popular, being used in a variety of applications, especially in computer vision, where these models outperform others krizhevsky2012imagenet ; GOMEZRIOS2019315 ; olmos2018automatic . Generally speaking, the networks used in these applications have become deeper and deeper over the years due to their high performance.
A common problem when dealing with real-world data sets in the context of supervised classification is label noise. The term 'label noise' refers to some instances in the data set having erroneous labels, thus misleading the training of machine learning algorithms song2020learning . This type of noise can be present because the data set was labelled automatically using text labels from the Internet, or because not enough experts were available to label the entire data set. In either case, the rate of label noise can vary and can reach large values xiao2015learning ; lee2017cleannet . As a result, label noise has been extensively studied in the context of classical machine learning algorithms 6685834 . The two most used and studied types of label noise are symmetric and asymmetric noise. In symmetric noise, also called uniform noise, the labels are corrupted randomly and equally in all classes, independently of their true class. In asymmetric noise, the corruptions depend on the true class of the instances. This implies that the corruptions can be made so that one specific class is labelled as another specific one. Consequently, asymmetric noise is more realistic than symmetric noise. In 6685834 the authors characterised label noise in classification: symmetric noise is called Noisy Completely at Random (NCAR) and asymmetric noise is called Noisy at Random (NAR).
The use of increasingly deeper networks implies the need for larger data sets to adequately train them. This fact has led researchers to investigate ways to overcome the lack of data. One possible solution is to create larger data sets by labelling them automatically instead of relying on experts, which is usually the only labelling solution for very large data sets xiao2015learning ; li2017webvision ; song2019selfie . Another solution is to use techniques such as transfer learning and data augmentation. Transfer learning allows the network to start from a pre-trained state: instead of training from scratch for every problem, we use a network already pre-trained on another data set, usually larger and related to the new one in some way. As a consequence, transfer learning speeds up the training process. On the other hand, data augmentation artificially increases the size of the training data by introducing transformations of the original images, such as rotations, changes in lighting, cropping, flipping, etc. However, if the original training data set presents label noise, the use of data augmentation can aggravate it, thus becoming another source of label noise.
In the specific case of deep learning, label noise has been proven to have a negative effect on generalisation when training deep neural networks zhang2021understanding , and thus there has been an increasing number of studies trying to improve the behaviour of deep neural networks in the presence of label noise patrini2017making ; song2019selfie ; song2020learning ; ma2018dimensionality ; jiang2018mentornet . Label noise appears mostly in real-world data sets, where the noise rate is not known. However, to test the performance of the models proposed to overcome this problem, we need a controlled environment, where the noise is artificially introduced into noise-free data sets at different noise rates. Two of the most used data sets for this purpose are CIFAR10 and CIFAR100 wang2018iterative ; 10.5555/3326943.3327112 ; jindal2016learning ; patrini2017making . Though it is necessary to test the proposals that overcome label noise on large data sets like CIFAR or MNIST, it is also important to analyse them in other scenarios, such as with small data sets. Small data sets are also common in real-world problems where it is not possible to collect more data. The majority of current proposals do not consider this scenario.
To make the training of deep neural networks robust to noise, both symmetric and asymmetric, the usual way to proceed is to analyse incorrectly classified instances for filtering and relabelling. In this paper, we follow this approach to deal with the problem: we identify the instances considered noisy and handle them.
We propose an algorithm which, during the training process, relabels or filters the instances that it considers noisy using the predictions made by the backbone network. The backbone network is the deep network chosen to classify the data set (for instance, ResNet50), which is trained using backpropagation as usual. Thus, we call it the Relabelling And Filtering Noisy Instances (RAFNI) algorithm. As opposed to some previous proposals for this task, we do not assume that the noise rate is known, nor do we estimate it. We also do not use clean training or validation subsets. Instead, the RAFNI algorithm only uses the noisy training set and progressively cleans it during the training process, using the loss value of each instance and the probability of each instance belonging to each class. These values are given by the backbone network at each epoch of the training process. The algorithm has two filtering mechanisms to remove noisy instances from the training set, and one relabelling mechanism used to change the label of some instances back to their original (clean) class.
We evaluated our proposal on a variety of data sets, including small and large ones, and under different types of label noise. We also used CIFAR10 and CIFAR100 as benchmarks to compare our proposal with other state-of-the-art models, since these data sets are two of the most used in other studies.
The rest of the paper is organised as follows. In Section 2, we give a background on works that propose strategies to help neural networks learn with label noise, with special mention to the ones we compare our algorithm to. In Section 3, we provide a detailed description of the RAFNI algorithm. Section 4 details the experimental framework, including the data sets, the types and levels of noise and the network configurations we used. The complete results obtained for all data sets and the comparison with the state-of-the-art models are shown in Section 5 and Section 6, respectively. We give some final conclusions in Section 7. Finally, the best values for all the hyperparameters of RAFNI, along with all the values we tested for them, are shown in Appendices A and B, respectively.
In this section, we first provide some background on label noise and the types of noise we use in this paper in Subsection 2.1. Then, in Subsection 2.2, we present an overview of the most popular deep learning approaches to overcome the problem of label noise and describe the proposals we have selected to compare with RAFNI.
2.1 Definition and types of label noise
In supervised classification, we have a set of training instances $\{(x_i, y_i)\}_{i=1}^n$, where $y_i \in \{1, \dots, c\}$ and $c$ is the total number of classes. Label noise is present when some instances in the training set have erroneous labels. That is, an instance $x_i$ with true label $y_i$ actually appears in the training set with another label $\tilde{y}_i$, $\tilde{y}_i \neq y_i$. The percentage of instances that present label noise is called the noise rate or noise level.
Depending on whether the label noise appears dependent or independent of the class of the instances, we can distinguish between symmetric and asymmetric noise.
Symmetric noise (also called uniform noise or NCAR). The noise is independent of the original true class of the instances and the attributes of the instances. Thus, the labels of a percentage of the instances of the training set are randomly changed to another class following a uniform distribution, where all the classes have the same probability of being the noisy label. This implies that the percentage of noisy instances is the same in all classes. Usually, the true class is not taken into account when choosing the noisy label.
Asymmetric noise (also called NAR). The noise is dependent on the original true class of the instances and independent of the attributes of the instances. Therefore, the probability of each class to be the noisy label is different and depends on the original true class, but all the instances in the same class have the same probability of being noisy. This implies that the percentage of noisy instances in each class can be different.
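The symmetric-noise definition above can be sketched in a few lines of NumPy. The function name and interface are ours; as noted above, a variant that does not exclude the true class when drawing the noisy label is also common.

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, rng=None):
    """Flip a noise_rate fraction of labels, chosen uniformly at random,
    to a class drawn uniformly among the other classes."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels).copy()
    n = len(labels)
    noisy_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in noisy_idx:
        # draw the noisy label uniformly among the remaining classes
        candidates = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(candidates)
    return labels
```

Because the corrupted instances are chosen independently of their class, the expected fraction of noisy instances is the same in every class, matching the definition above.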
We also define a new type of noise, called pseudo-asymmetric noise, where the probability of an instance becoming noisy depends not only on the class it belongs to but also on the subclass it belongs to within that class. We define this type of noise because it is more realistic for a particular data set we used. The definition for that data set can be found in Subsection 4.2.
As in classical machine learning, the types of noise we used can be treated (either by filtering or relabelling) without a prior estimation of their probability distribution.
2.2 Label noise with deep learning
During the last few years, there has been an increase in the number of proposals that help deep neural networks, and CNNs in particular, learn in the presence of label noise in supervised classification. Most works fall into one or more of the following approaches:
Proposals that modify the loss function in some way, either to make the function robust to label noise NEURIPS2019_8cd7775f ; ghosh2017robust ; zhang2018generalized , or to correct its values so the noisy labels do not negatively impact the learning patrini2017making ; yi2019probabilistic ; ma2018dimensionality ; song2019selfie .
Some proposals assume that a subset of clean samples is available xiao2015learning , or, similarly, do not introduce noise in the validation set jiang2018mentornet , and others assume that the noise rate is known patrini2017making ; song2019selfie , which is not usual when dealing with real-world noisy data sets, though patrini2017making proposes a mechanism to approximate the noise rate. A more in-depth survey of the work that has been done on training deep neural networks in the presence of label noise can be found in song2020learning .
We have selected a subset of four of these proposals to compare with our algorithm: one that uses a robust loss function, two that propose loss correction approaches, and one that proposes a hybrid approach between loss correction and sample selection. All of them had official public implementations either on TensorFlow or Keras. In the following, we describe these four proposals.
Robust loss function approach that uses a generalisation of the softmax layer and the categorical cross-entropy loss NEURIPS2019_8cd7775f . Here, the authors make the loss function robust against label noise by modifying the loss function and the last softmax activation of the deep neural network with two temperatures, creating non-convex loss functions. These two temperatures can be tuned for each data set. This proposal has the advantage that, using the code provided by the authors, it can easily be used with any combination of deep network, data set and optimisation technique, including transfer learning.
Loss correction approaches patrini2017making . The authors propose two approaches to correct the loss values of the noisy instances, called backward correction and forward correction, for which it is necessary to know the noise matrix of the data set. They provide a mechanism to estimate the noise matrix, and when it is used, the approaches are called estimated backward correction and estimated forward correction. The first one uses the noise matrix to correct the loss values, so they are more similar to the loss values of the clean instances. The second one explicitly uses the noise matrix to correct the predictions of the model.
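As an illustration of the forward-correction idea (a NumPy sketch of the concept, not the authors' implementation; the function name is ours), the predicted clean-class probabilities can be pushed through a known noise matrix $T$, where $T_{ij}$ is the probability that true class $i$ is observed as class $j$, before taking the cross-entropy against the observed noisy labels:

```python
import numpy as np

def forward_corrected_ce(probs, noisy_labels, T, eps=1e-12):
    """Forward correction: map predicted clean-class probabilities to the
    noisy-label distribution via T, then take cross-entropy against the
    observed (possibly noisy) labels."""
    q = probs @ T                                        # noisy-label distribution
    picked = q[np.arange(len(noisy_labels)), noisy_labels]
    return -np.log(picked + eps).mean()
```

With $T$ equal to the identity (no noise), this reduces to the ordinary cross-entropy loss.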
Loss correction approach using the dimensionality of the training set ma2018dimensionality . The authors explain that, when dealing with noisy labels, learning can be separated into two phases. In the first phase, which occurs in the first epochs of training, the network learns the underlying distribution of the data. Then, in the second phase, the network learns to overfit the noisy instances. They use a measure called Local Intrinsic Dimensionality (LID) to detect the moment the training enters the second phase. They also use the LID to modify the loss function and reduce the effect of the noisy instances.
Loss and label correction approach song2019selfie . The authors propose a hybrid approach between sample selection and loss correction that relabels noisy instances when possible and discards them when not. For the noisy instances, they exploit the observation that if the network returns the same label with high probability during the first epochs of training, it is possible to correct that instance, and they change its label to the one the network predicts. In contrast, if the network changes its prediction for an instance inconsistently, they stop using that instance. They assume that the noise rate in the data set is known, and they do not provide a way to estimate it. This approach can be applied iteratively, so the training set is cleaned across several training processes.
3 Proposal: Filtering and relabelling instances based on prediction probabilities
In this section, we describe our proposal. First, in Subsection 3.1, we give an overall description of the algorithm and explain its basics. Then, in Subsection 3.2, we present a formal definition of the algorithm.
3.1 Base concepts
We propose the RAFNI algorithm, which filters and relabels instances based on the predictions, and their probabilities, made by the backbone neural network during the training process. In Figure 3, we show the difference between training the backbone network with and without the RAFNI algorithm and the moment it is applied. The backbone network used is independent of the algorithm, and it can be changed or modified, for example, to include transfer learning. In our case, this network is a convolutional neural network pre-trained on ImageNet, where we removed the last layer with a thousand neurons and added two fully connected layers. The first fully connected layer has 512 neurons and a ReLU activation. The last one has as many neurons as classes in the data set we are classifying and a softmax activation.
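In Keras, this backbone can be sketched as follows. The function name and input size are ours; `include_top=False` with global average pooling removes the original 1000-class head, producing the feature vector onto which the two new layers are stacked.

```python
import tensorflow as tf

def build_backbone(num_classes, input_shape=(224, 224, 3),
                   weights="imagenet", fine_tune=True):
    """ResNet50 pre-trained on ImageNet, with its classification head
    replaced by a 512-unit ReLU layer and a softmax output layer."""
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=weights,
        input_shape=input_shape, pooling="avg")      # 2048-dim features
    base.trainable = fine_tune                       # the fine_tune switch
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```

The `fine_tune` flag mirrors the option described in Section 3.2: when false, only the two added layers are trained.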
Generally speaking, we propose two mechanisms to filter an instance and one mechanism to relabel an instance, with some restrictions. These three mechanisms are the following:
First filtering mechanism. This mechanism only uses the loss value of the instances. The foundation is that the noisy instances tend to have higher loss values than the rest of them. As a result, this mechanism filters out instances that have a loss value above a certain threshold. This threshold is dynamic and will change during training.
Second filtering mechanism. This mechanism depends on how many times an instance has been relabelled. Here we assume that if the algorithm relabels an instance too many times, it is because the backbone network is unsure about its class, and it is better to remove that instance. Thus, this mechanism filters an instance if it has been relabelled more than a certain number of times.
Relabelling mechanism. This mechanism takes into account the probability predictions of the backbone network. We suppose that if the backbone network predicts another class with a high probability as the training progresses, it is probable that the instance is noisy and its class is indeed the one predicted by the backbone network. As a consequence, the relabelling mechanism changes the class of an instance if the backbone network predicts another class with a probability that is above a certain fixed threshold.
These mechanisms have restrictions related to the moment they are applied. Since we are using the backbone network to relabel and filter instances, we need to wait until it is trained enough for its predictions to be reliable. Therefore, the algorithm does not use any filtering or relabelling mechanism before a certain number of training epochs. Moreover, we want to prioritise the relabelling mechanism over the filtering mechanism based on the loss values of the instances. Because of this, the algorithm does not use the relabelling mechanism until a first threshold epoch, and it does not use the loss-based filtering mechanism until a later threshold epoch. The filtering mechanism that uses the number of times an instance changes class starts at the same epoch as the relabelling mechanism. We also establish a period of a certain number of epochs after an instance has been relabelled during which the algorithm cannot filter or relabel it again.
All these thresholds will depend on the specific data set that we want to learn, so they are hyperparameters of the algorithm in order to be tuned in each case.
3.2 Formal definition
Let $\{(x_i, y_i)\}_{i=1}^n$ be the set of training instances and their corresponding labels, where $y_i \in \{1, \dots, c\}$ and $c$ is the total number of classes. Let $t$ be an epoch of the training process, $t \in \{1, \dots, T\}$, where $T$ is the total number of epochs, and let $\ell_i^{(t)}$ be the loss of the training instance $x_i$ in epoch $t$. Finally, let $p_i^{(t)} = (p_{i1}^{(t)}, \dots, p_{ic}^{(t)})$ be the probabilities predicted by the neural network for the instance $x_i$ in epoch $t$, and let $\hat{y}_i^{(t)} = \arg\max_j p_{ij}^{(t)}$ be the prediction of the network for $x_i$ in epoch $t$. Then, we define the following:
A threshold, named epoch_threshold, so that the algorithm does not make any change in the labels before epoch_threshold epochs, nor does it filter any instance using its loss before epoch_threshold.
A threshold, named loss_threshold, so that if $\ell_i^{(t)} >$ loss_threshold, then the instance $x_i$ is filtered out of the training set for the following epochs $t' > t$.
The length of the record of each instance, record_length, so that the algorithm saves the last record_length predictions made by the neural network for each instance. Then, if the prediction for an instance changes record_length times in the last record_length epochs, the instance is filtered out for the following epochs $t' > t$.
A threshold prob_threshold, so that if $\max_j p_{ij}^{(t)} \geq$ prob_threshold and $\hat{y}_i^{(t)} \neq y_i$, then the label of $x_i$ is changed to $\hat{y}_i^{(t)}$ in the following epochs $t' > t$. If this happens, the algorithm clears the record of the instance $x_i$.
A number not_change_epochs, so that if the label of an instance has been changed, the algorithm cannot change it again nor filter that instance until not_change_epochs epochs have passed.
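A simplified sketch of how these rules combine in one epoch is given below. All names are ours; for brevity, the record of the last record_length predictions is collapsed into a simple change counter, a single epoch_threshold gates all mechanisms, and loss_threshold is assumed to be already computed for this epoch.

```python
import numpy as np

def rafni_epoch_step(state, losses, probs, epoch, hp):
    """Apply the relabelling and filtering rules to every active instance.
    `state` holds labels, per-instance counters and the current threshold;
    `hp` holds the hyperparameters named in the text."""
    keep = []
    for i in state["active"]:                       # indices still in training
        if epoch < hp["epoch_threshold"]:
            keep.append(i); continue                # network not reliable yet
        if epoch - state["last_change"][i] < hp["not_change_epochs"]:
            keep.append(i); continue                # recently relabelled: skip
        pred = int(np.argmax(probs[i]))
        # relabelling mechanism
        if probs[i][pred] >= hp["prob_threshold"] and pred != state["labels"][i]:
            state["labels"][i] = pred
            state["changes"][i] += 1
            state["last_change"][i] = epoch
            # second filtering mechanism: relabelled too many times
            if state["changes"][i] > hp["record_length"]:
                continue                            # drop the instance
            keep.append(i); continue
        # first filtering mechanism: loss above the dynamic threshold
        if losses[i] > state["loss_threshold"]:
            continue                                # drop the instance
        keep.append(i)
    state["active"] = keep
    return state
```

Relabelling is checked before loss-based filtering, reflecting the priority given to it in the text.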
In Figure 4 we show the flowchart of the RAFNI algorithm, detailing how and when each mechanism is applied to each instance during a specific epoch of the training process.
The numbers epoch_threshold, record_length, prob_threshold and not_change_epochs are hyperparameters of the algorithm that can be set by the user. On the other hand, loss_threshold is a parameter that changes dynamically every epoch using the losses of the instances in the previous epoch. Specifically, the loss_threshold for epoch $t$ is calculated as the quantile of order $q$ of the losses in the previous epoch $t-1$, where $q$ is a hyperparameter of the algorithm that can be set by the user. That way, the loss_threshold usually decreases as training progresses and the training instances are filtered out and relabelled. Because of that, we need to stop updating the loss_threshold parameter at some point so as not to filter too many instances. To decide when, we use the fact that noisy instances tend to have higher loss values than clean ones. The separation between the loss values of the noisy and the clean instances is largest at the beginning of the training process. Then, as training progresses and the instances with the highest loss values are filtered out and other instances are relabelled, it becomes more difficult to identify an instance as noisy from its loss value alone, because the hardest clean instances will also have high loss values. Thus, we approximate the loss values in a given epoch with a Gaussian mixture model and use it to detect the moment when we need to stop updating the loss_threshold parameter.
A mixture model is a model that can represent different subpopulations inside a population. These subpopulations, or components, follow a distribution that in a Gaussian mixture is assumed to be a Gaussian distribution. That way, if we have a Gaussian mixture model with two components, we are approximating two subpopulations, each with its own Gaussian distribution, so we obtain two means and two variances. In our case, we have two components, one for the clean instances, with mean $\mu_{clean}$, and one for the noisy instances, with mean $\mu_{noisy}$ and standard deviation $\sigma_{noisy}$, and we use the Gaussian mixture model to stop updating the loss_threshold parameter when the two components are sufficiently close. Once we stop updating the loss_threshold, it keeps its last value until the training is finished. In particular, we stop updating the loss_threshold parameter when any of the following happens:
The means of the two components of the Gaussian mixture model are sufficiently close, that is, $|\mu_{noisy} - \mu_{clean}|$ falls below a fixed tolerance.
The current loss_threshold in epoch $t$ falls outside the noisy component of the Gaussian mixture model.
The loss_threshold drops too sharply from one epoch to the next.
Besides these hyperparameters, the algorithm also allows the user to set the number of training epochs, epochs, the batch size, batch_size, and whether to fine-tune the backbone or not: if fine_tune is set to false, the layers of the backbone neural network are not retrained and only the newly added layers are trained; if fine_tune is set to true, all the layers are trained. The algorithm can be used with any CNN as the backbone network.
The code of the algorithm is available at https://github.com/ari-dasci/S-RAFNI.
4 Experimental framework
In this section, we describe the experimental framework we used to carry out the experiments. In Subsection 4.1, we describe the data sets we used. In Subsection 4.2, we detail the types of noise we used in each data set along with the noise levels we used in each of them. Finally, in Subsection 4.3, we provide the specific configuration, backbone neural networks and software we used for all the experiments.
4.1 Data sets
Table 1: Summary of the data sets used, including the number of classes, the total number of images and the number of images per class.
We now describe the data sets we used to analyse RAFNI under different types and levels of label noise. We used six data sets, each with a different number of classes, number of images per class and total number of images: RSMAS, StructureRSMAS, EILAT, COVIDGR1.0-SN, CIFAR10 and CIFAR100. Table 1 summarises their statistics.
RSMAS, StructureRSMAS and EILAT are small coral data sets. RSMAS and EILAT coralDataset ; GOMEZRIOS2019315 are texture data sets containing coral patches, that is, close-up patches extracted from larger images, and StructureRSMAS GOMEZRIOS2019104891 is a structure data set containing images of entire corals. The patches in EILAT have size 64×64 and come from images taken under similar underwater conditions, and the ones in RSMAS have size 256×256 and come from images taken under different conditions. StructureRSMAS is a data set collected from the Internet and therefore contains images of different sizes taken under different conditions.
COVIDGR1.0-SN is a modification of COVIDGR1.0 9254002 . COVIDGR1.0 contains chest X-rays of patients divided into two classes, positive for COVID-19 and negative for COVID-19, using RT-PCR as ground truth. All the images in the data set were taken using the same protocol and similar X-ray machines. The authors made the data set available along with a list containing the degree of severity of the positive X-rays: Normal-PCR+, Mild, Moderate and Severe. The X-rays with Normal-PCR+ severity come from patients that tested positive in the RT-PCR test but where expert radiologists could not find signs of the disease in the X-ray. The modification we use, COVIDGR1.0-SN, is the same data set as COVIDGR1.0, but we removed the 76 positive images with Normal-PCR+ severity. To keep the two classes balanced, as in the original data set, we also removed 76 randomly chosen negative images.
Finally, CIFAR10 and CIFAR100 krizhevsky2009learning are the 60,000 tiny 32×32 images proposed by Alex Krizhevsky. Compared with the other data sets used in this study, CIFAR10 and CIFAR100 are much larger. Both of them have a predefined test hold-out of 10,000 images, meaning they both have a training set of 50,000 images. Both data sets contain classes of common objects, such as 'Airplane' and 'Ship' in CIFAR10 or 'Bed' and 'Lion' in CIFAR100.
4.2 Types and levels of label noise
| Data set | Type of label noise | Levels of noise |
| --- | --- | --- |
| COVIDGR1.0-SN | Pseudo-asymmetric noise | 0%, 20%, 30%, 40% and 50% |
| CIFAR10 | Asymmetric noise | 0%, 20%, 30% and 40% |
| CIFAR10 | Symmetric noise | 0%, 20%, 40% and 60% |
We now state the types of noise we used for each data set and the noise rates we used for each of them; Table 2 shows a summary. In total, we used symmetric, asymmetric and pseudo-asymmetric noise. Symmetric noise is the most used type of noise and, since it does not require external information, we used it in all data sets except COVIDGR1.0-SN. However, to also use more realistic and challenging types of noise, we used asymmetric and pseudo-asymmetric noise when possible. For COVIDGR1.0-SN, we have the additional information of the severity degree of the images in the positive class, so we used it to introduce pseudo-asymmetric noise. For CIFAR10, we used the asymmetric noise introduced by patrini2017making , which has become a standard when evaluating deep learning in the presence of asymmetric label noise. This noise is introduced between classes that are alike, simulating real label noise that could occur naturally. For the coral data sets (RSMAS, EILAT and StructureRSMAS) and CIFAR100, we did not have the necessary information to introduce this type of noise, so we only used symmetric noise.
For COVIDGR1.0-SN, we introduced pseudo-asymmetric noise, where we change the labels of a percentage of the instances of the data set subject to some condition on the instances. COVIDGR1.0-SN has two classes, P (COVID-19 positive) and N (COVID-19 negative), and the instances from P have an associated severity (Mild, Moderate or Severe). In this scenario, it is more realistic that a positive image with mild severity has been mislabelled as negative than a positive image with moderate or severe severity. Equivalently, it is more realistic that a positive image with moderate severity has been mislabelled as negative than a positive image with severe severity. As a consequence, we define the probability of the instances in the groups N (to change class to P), Mild (to change class to N), Moderate (to change class to N) and Severe (to change class to N) as follows: 0.5 for N, 0.3 for Mild, 0.2 for Moderate and 0 for Severe. That way, we change the same number of instances from P to N and vice versa, but when we change the class from P to N, it is more probable to change a mild positive image than a moderate one. In addition, we make sure that no positive image with severe severity changes from class P to N.
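One plausible implementation of this scheme is sketched below, under our own reading: a noise_rate fraction of instances is drawn with per-group weights proportional to the probabilities in the text, and each drawn instance has its label flipped. The function name and interface are ours.

```python
import numpy as np

def add_pseudo_asymmetric_noise(labels, severities, noise_rate, rng=None):
    """Corrupt a noise_rate fraction of the instances, drawing each noisy
    instance with a weight given by its group: 0.5 for N, 0.3 for Mild,
    0.2 for Moderate and 0.0 for Severe (no Severe positive is flipped)."""
    group_prob = {"N": 0.5, "Mild": 0.3, "Moderate": 0.2, "Severe": 0.0}
    rng = np.random.default_rng(rng)
    labels = list(labels)
    weights = np.array([group_prob[s] for s in severities], dtype=float)
    weights /= weights.sum()                     # normalise to a distribution
    n_noisy = int(noise_rate * len(labels))
    for i in rng.choice(len(labels), size=n_noisy, replace=False, p=weights):
        labels[i] = "P" if labels[i] == "N" else "N"   # flip N<->P
    return labels
```

Because Severe instances have weight zero, they can never be selected, matching the guarantee above.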
For CIFAR10, we introduced asymmetric noise between the following classes: TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, CAT ↔ DOG, as defined in patrini2017making . Note that since we introduce noise at a rate of $r\%$ in five of the ten classes, we introduce $r/2\%$ of noise in the total data set.
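This mapping can be sketched as follows (class indices follow the standard CIFAR10 ordering; the function name is ours):

```python
import numpy as np

# CIFAR10 indices: airplane=0, automobile=1, bird=2, cat=3, deer=4,
# dog=5, frog=6, horse=7, ship=8, truck=9
ASYM_MAP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}   # truck->car, bird->plane,
                                             # deer->horse, cat<->dog

def add_cifar10_asymmetric_noise(labels, noise_rate, rng=None):
    """Flip a noise_rate fraction of each source class to its paired class.
    Flips are decided from the clean labels, so cat<->dog do not chain."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    noisy = labels.copy()
    for src, dst in ASYM_MAP.items():
        idx = np.where(labels == src)[0]
        flip = rng.choice(idx, size=int(noise_rate * len(idx)), replace=False)
        noisy[flip] = dst
    return noisy
```

On a balanced data set, flipping a fraction of five of the ten classes corrupts half that fraction of the whole set, which is the $r$ versus $r/2$ relation noted above.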
4.3 Network and experimental configuration
We provide the specific configuration we used in the experiments we carried out.
Table 3: Fixed hyperparameters (optimizer and batch size) used for each data set.
We used two backbone networks, ResNet50 and EfficientNetB0. Both of them were pre-trained on ImageNet, and in both cases we removed the last layer of the network and added two fully connected layers, the first one with 512 neurons and ReLU activation and the second one with as many neurons as classes in the data set and softmax activation. Once we removed the last layer of ResNet50 and EfficientNetB0, their outputs had 2048 and 1280 neurons, respectively. We chose 512 neurons for the first fully connected layer as an intermediate number between 2048 or 1280 and the number of classes in the data sets. The fixed hyperparameters we used in each data set can be seen in Table 3. We used Stochastic Gradient Descent (SGD) with a Nesterov momentum of 0.9 and a fixed learning rate and decay. We did not optimise these hyperparameters.
For the experimentation, we used TensorFlow 2.4 and an Nvidia Tesla V100. In all cases, we gave two or three values to each of the following hyperparameters of the algorithm: epoch_threshold, prob_threshold, record_length, not_change_epochs and quantile. Then, we performed a grid search to find the best configuration of these hyperparameters for each data set and level of noise. Each experiment in the grid search was done using five-fold cross-validation for the data sets RSMAS, StructureRSMAS, EILAT and COVIDGR1.0-SN, and a hold-out (their predefined hold-outs) for CIFAR10 and CIFAR100. The values we used in the grid search vary depending on the size of the data set, the level of noise and how hard the data set is to classify with the backbone network alone, considering the accuracy on the test set. As a general guide, the values of prob_threshold and quantile get lower as the level of noise gets higher, since it is more difficult to train the backbone network. The values of prob_threshold, quantile and epoch_threshold get lower if the data set is hard to classify with the backbone network, considering test accuracy, and higher if it is easy to obtain a good result with that data set. Finally, the values of epoch_threshold, record_length and not_change_epochs get lower if the data set is large and higher if it is small, since it takes more epochs to train the backbone network if the data set is small. It is important to note that if, for some data set, it is better not to filter or not to relabel instances, the hyperparameters can be tuned so that some or all of the filtering and relabelling mechanisms are disabled.
To compare performances between proposals and between RAFNI and the baseline model, which is the backbone CNN alone, without filtering nor relabelling instances, we use the accuracy measure, widely used for supervised classification. The accuracy is defined as the number of instances well classified in the test set divided by the total number of instances in the test set.
For every data set and level of noise, we compare the performance of RAFNI with a given backbone network against the performance of the backbone network alone, without filtering or relabelling instances. Since the CIFAR data sets and the rest of the data sets have very different sizes, the experimental framework we used for them was different. For the smaller data sets (RSMAS, StructureRSMAS, EILAT and COVIDGR1.0-SN), we used five-fold cross-validation for the experiments in the grid search. To ensure a more stable final result, we repeated the five-fold cross-validation with the best hyperparameter configuration five times (denoted 5x5fcv) and report the mean and standard deviation of the 5x5fcv. This scheme of reporting mean and standard deviation is one of the most used in the literature. Since we compare this result with the backbone network alone, we also repeated the same five-fold cross-validation five times with the backbone network without filtering or relabelling instances, and we also report the mean and standard deviation of the 5x5fcv. For CIFAR10 and CIFAR100, we used the hold-out sets provided with these data sets instead of cross-validation, similar to how they are used in other papers in the literature. We also repeated the hold-out five times in both cases (with the backbone network alone and with RAFNI using the best hyperparameter configuration) and report the mean and standard deviation.
5 RAFNI results and comparison with the baseline model
In this section, we present the results we obtained for each data set using our proposal and compare them with those of the backbone network alone as the baseline.
We present the results obtained with RSMAS and symmetric noise for each noise level.
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||97.81 ± 1.06||98.07 ± 1.13||95.56 ± 1.26||95.82 ± 1.37|
|20%||87.62 ± 4.37||88.80 ± 4.30||91.28 ± 2.24||91.98 ± 2.90|
|30%||86.14 ± 5.39||85.17 ± 4.60||87.85 ± 3.16||89.27 ± 2.92|
|40%||81.75 ± 4.94||82.48 ± 4.84||82.17 ± 2.11||85.82 ± 2.63|
|50%||78.93 ± 3.93||78.59 ± 5.78||74.30 ± 4.99||79.16 ± 5.37|
|60%||67.65 ± 5.03||69.61 ± 6.13||64.13 ± 2.05||70.18 ± 4.42|
|70%||53.78 ± 7.16||54.78 ± 7.07||49.27 ± 4.07||58.87 ± 4.03|
In Table 4, we can observe the results for the data set RSMAS using RAFNI with two different backbone networks, ResNet50 and EfficientNetB0, and the comparison with the two backbone networks alone as the baseline. Considering ResNet50, RAFNI and the baseline obtained similar results until the noise rate reaches 60%, where RAFNI begins to obtain better results. On the other hand, using EfficientNetB0 as the backbone network, RAFNI obtained better results at every noise level. At 30% noise, RAFNI obtains 1.42% more than the baseline, and the gain increases as the noise increases, reaching 9.6% at 70% noise.
It is also interesting that plain EfficientNetB0 obtains lower results than plain ResNet50 at 50%, 60% and 70% noise, while RAFNI with EfficientNetB0 as the backbone network outperforms RAFNI with ResNet50 at all noise levels except 0%. In both cases, using the RAFNI algorithm yields some gain in accuracy, even when no noise is introduced.
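For reference, the symmetric (uniform) noise used in these experiments can be injected with a procedure like the following. This is a minimal sketch assuming the common convention that a corrupted label is drawn uniformly from the other classes; the exact injection procedure used in the paper may differ in details:

```python
import random

def add_symmetric_noise(labels, n_classes, noise_rate, seed=0):
    """With probability `noise_rate`, replace a label with a class drawn
    uniformly from the remaining classes (symmetric/uniform noise,
    independent of the true class)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([c for c in range(n_classes) if c != y]))
        else:
            noisy.append(y)
    return noisy

clean = [0, 1, 2, 3, 4]
print(add_symmetric_noise(clean, n_classes=5, noise_rate=0.0))  # unchanged
```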
We present the results obtained with StructureRSMAS and symmetric noise for each noise level.
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||83.12 ± 3.84||82.45 ± 4.00||82.65 ± 2.07||83.26 ± 2.49|
|20%||68.50 ± 5.42||68.22 ± 4.68||78.71 ± 4.41||78.73 ± 2.83|
|30%||62.70 ± 5.78||62.63 ± 6.79||77.11 ± 3.18||77.56 ± 2.95|
|40%||59.26 ± 5.78||57.71 ± 4.54||69.63 ± 3.07||70.71 ± 5.75|
|50%||50.33 ± 7.46||51.57 ± 8.16||63.49 ± 6.63||67.06 ± 3.75|
|60%||43.43 ± 6.49||46.47 ± 7.24||57.92 ± 7.53||60.06 ± 7.05|
|70%||32.13 ± 6.35||32.63 ± 4.68||45.49 ± 6.81||43.38 ± 6.76|
The results using RAFNI with ResNet50 and EfficientNetB0 as the backbone network and the baselines can be seen in Table 5. In this case, we can observe that for both CNNs, the results using RAFNI and the baseline CNN are very similar. The algorithm only obtains a substantial gain at 60% noise using ResNet50 (3.02%) and at 50% (3.57%) and 60% (2.14%) when using EfficientNetB0. As happened with RSMAS, the results using EfficientNetB0 are better than the ones obtained using ResNet50 as the backbone network.
We present the results obtained with EILAT and symmetric noise for each level of noise.
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||97.53 ± 1.49||97.53 ± 1.64||96.89 ± 1.14||96.78 ± 1.22|
|20%||94.13 ± 2.71||93.28 ± 3.27||94.02 ± 1.02||93.93 ± 1.03|
|30%||92.99 ± 3.56||92.33 ± 3.83||93.26 ± 1.72||93.45 ± 1.23|
|40%||91.76 ± 2.56||91.56 ± 3.20||90.12 ± 2.56||91.53 ± 1.96|
|50%||88.91 ± 3.46||88.48 ± 3.74||85.03 ± 2.49||87.82 ± 1.69|
|60%||83.98 ± 2.74||86.09 ± 4.80||74.19 ± 3.59||83.32 ± 2.42|
|70%||67.11 ± 5.23||76.73 ± 5.93||61.00 ± 4.33||79.34 ± 3.72|
The results for EILAT using RAFNI with both CNNs as backbones and the baselines are shown in Table 6. Using ResNet50, we can observe a similar behaviour to what happened with RSMAS: ResNet50 can handle low and middle levels of noise, and up until 50% noise the results with and without RAFNI are very similar. At 60% and 70% noise, however, we have a significant gain in accuracy using RAFNI: 2.98% at 60% and 9.62% at 70% noise. Using EfficientNetB0, we obtain a similar behaviour in the sense that this network can handle low levels of noise for this data set, but in this case, the gain in accuracy using RAFNI starts at 40% with 1.41%, and it keeps increasing: 2.79% at 50%, 9.13% at 60% and 18.34% at 70% noise.
We present the results obtained with COVIDGR1.0-SN and pseudo-symmetric noise for each level of noise. Here we only used noise levels up to 50% because this data set has only two classes.
The results are shown in Table 7. This data set has the advantage that it is a real-world data set, and it is more difficult to train on (at the 0% noise level) than the other data sets: EfficientNetB0 obtains 78.86% accuracy at 0% noise. In addition, the noise we introduced in this data set is more realistic, so we can see how noise affects this scenario and how well the RAFNI algorithm behaves. The results obtained with ResNet50 and EfficientNetB0 are similar: at all noise levels, including 0%, the results are better using RAFNI, with gains that generally increase as the noise level rises. Using ResNet50, the gain from RAFNI reaches 10.5% at 40% noise and 8.06% at 50% noise; using EfficientNetB0, it reaches 9.4% at 40% noise and 9.1% at 50% noise.
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||76.51 ± 2.88||77.63 ± 3.04||78.86 ± 3.01||79.14 ± 2.78|
|20%||74.77 ± 4.81||75.49 ± 2.35||75.54 ± 4.06||77.97 ± 3.27|
|30%||71.83 ± 4.40||75.00 ± 2.82||72.26 ± 3.90||77.74 ± 1.71|
|40%||62.69 ± 5.81||72.94 ± 4.69||65.83 ± 2.18||75.23 ± 2.66|
|50%||55.97 ± 6.31||64.03 ± 5.86||57.29 ± 4.74||66.31 ± 7.16|
We show the results we have obtained using CIFAR10 with symmetric and asymmetric noise.
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||95.31 ± 0.06||95.59 ± 0.17||94.40 ± 0.72||95.20 ± 0.15|
|20%||84.72 ± 0.61||93.75 ± 0.26||93.00 ± 1.20||93.29 ± 0.45|
|40%||67.16 ± 0.59||90.02 ± 0.55||90.91 ± 0.65||90.97 ± 0.72|
|60%||46.08 ± 0.73||84.33 ± 0.57||84.98 ± 0.96||87.83 ± 0.29|
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||95.31 ± 0.06||95.59 ± 0.17||94.40 ± 0.72||95.20 ± 0.15|
|20%||89.10 ± 0.35||94.61 ± 0.15||93.03 ± 0.56||93.39 ± 0.67|
|30%||84.19 ± 0.45||93.43 ± 0.30||91.36 ± 0.45||92.11 ± 1.32|
|40%||77.86 ± 0.26||89.66 ± 0.81||86.21 ± 0.92||90.69 ± 0.62|
In Table 8 and Table 9 we can see the results for CIFAR10 using symmetric noise and asymmetric noise, respectively. RAFNI achieves better results independently of the type of noise, the noise level and the backbone network used, even when we do not introduce noise. The gain in accuracy, again, tends to increase as the noise level increases. For symmetric noise and using ResNet50 as the backbone network, RAFNI achieves a gain in accuracy of 9.03% at 20% noise, 22.86% at 40% noise and 38.25% at 60% noise. Using EfficientNetB0, it reaches a gain of 2.85% at 60% noise. The scenario for asymmetric noise is very similar, with the gains in accuracy being more evident using ResNet50: 5.51% at 20% noise, 9.24% at 30% and 11.80% at 40% noise, while using EfficientNetB0 the gains are 0.75% at 30% noise and 4.48% at 40% noise.
|Noise||ResNet50||RAFNI (ResNet50)||EfficientNetB0||RAFNI (EfficientNetB0)|
|0%||80.97 ± 0.16||81.38 ± 0.27||80.55 ± 0.70||81.10 ± 0.46|
|20%||70.07 ± 0.27||77.33 ± 0.32||78.69 ± 0.56||78.82 ± 0.29|
|40%||56.02 ± 0.38||71.27 ± 0.55||75.44 ± 0.73||75.38 ± 0.46|
|60%||37.18 ± 0.29||64.03 ± 0.65||67.93 ± 1.89||71.94 ± 0.64|
We present the results we have obtained for CIFAR100 using symmetric noise, which can be found in Table 10. The scenario here is similar to what happened with CIFAR10: EfficientNetB0 is less affected by the noise than ResNet50 and, in consequence, RAFNI has more gain in accuracy when using ResNet50, compared with the baseline in each case. Using ResNet50, RAFNI has a gain of 7.26% at 20% noise, 15.19% at 40% noise and 26.85% at 60%, while using EfficientNetB0 the gain at 60% is 4.01%.
6 Comparison with state-of-the-art models
In this section, we compare our proposal, RAFNI, using CIFAR10 and CIFAR100, which are the data sets used by most of the papers in the literature, with some state-of-the-art models: the loss correction approaches from patrini2017making , the robust loss function from NEURIPS2019_8cd7775f , the proposal from ma2018dimensionality and the one from song2019selfie , which we described in Section 2. First, we compare in Subsection 6.1 all the proposals without using data augmentation. Then, in Subsection 6.2 we compare RAFNI using data augmentation with the proposals that originally used data augmentation.
6.1 Comparison without using data augmentation
In order to make a fair comparison, we made some changes. Since data augmentation is specific to the data set used, we did not use it in the baseline models nor with RAFNI. For the sake of fairness, we also removed its use from the two proposals that use it (patrini2017making and ma2018dimensionality ), so that none of the compared models makes use of it.
Furthermore, we used transfer learning from ImageNet in our algorithm on all the data sets to speed up the training, so we could use 10 epochs for CIFAR10 and 15 epochs for CIFAR100. The code of most of the proposals did not allow changing the networks used and, specifically, did not allow introducing transfer learning. Thus, we adopted the following strategies: for the proposal that allowed using transfer learning (NEURIPS2019_8cd7775f ), we used transfer learning and the same number of epochs that we use, and for the ones that did not (patrini2017making , ma2018dimensionality and song2019selfie ) we used the number of epochs they reported as best for each data set (between 100 and 150), with one exception: ma2018dimensionality uses 200 epochs for CIFAR100, but due to time restrictions, we changed that to 150.
It is important to note that patrini2017making and song2019selfie assume the noise level is known. The authors of patrini2017making include a mechanism to estimate it if it is not known, so we used both of their approaches with this estimation, as the rest of the algorithms do not assume the noise level is known. However, the authors of song2019selfie do not provide such a mechanism, so we used their proposal as given, using the true noise level of the data set, but we take that into account when we make the comparison.
We left the rest of the hyperparameters of each of the proposals at their best values, as given in their respective papers. In the case of NEURIPS2019_8cd7775f , the authors only used CIFAR100 under symmetric noise, so we only had the best values for the two temperatures in this case. Since the temperatures are the same across noise levels, for the other two scenarios (CIFAR10 with symmetric noise and with asymmetric noise), we evaluated values in the range the authors give for each hyperparameter and selected the best ones for each scenario.
As for the network used as the backbone, we used ResNet50 for RAFNI and NEURIPS2019_8cd7775f , since most of the other proposals use some variation of ResNet (except for song2019selfie , which uses DenseNet). Finally, we used exactly the same data sets as with our algorithm where it was possible (with NEURIPS2019_8cd7775f and ma2018dimensionality ), and the ones provided in the rest. We argue that, since the noise level is the same, it is introduced randomly in all cases, and the test sets are also the same as they are predefined, we can safely compare the algorithms. For the SELFIE algorithm song2019selfie , we had to implement the asymmetric noise matrix as given in patrini2017making in order to use this type of noise in the same way as with the rest of the algorithms.
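As an illustration, injecting asymmetric noise through class-conditional flips can be sketched as follows. The CIFAR10 flips (truck→automobile, bird→airplane, deer→horse, cat↔dog) follow the description in patrini2017making ; the class indices and implementation details here are assumptions:

```python
import random

# CIFAR10 class-conditional flips as described in patrini2017making:
# truck -> automobile, bird -> airplane, deer -> horse, cat <-> dog.
# Class indices follow the standard CIFAR10 ordering (an assumption).
FLIP = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}

def add_asymmetric_noise(labels, noise_rate, seed=0):
    """Corrupt each label to its paired class with probability
    `noise_rate`; labels without a paired class are left untouched."""
    rng = random.Random(seed)
    return [FLIP[y] if y in FLIP and rng.random() < noise_rate else y
            for y in labels]
```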
|Algorithm||20% noise||30% noise||40% noise|
|Est. Forward patrini2017making ||82.82 ± 0.88||79.22 ± 2.74||77.64 ± 0.98|
|Est. Backward patrini2017making ||70.81 ± 4.61||56.09 ± 4.05||42.38 ± 13.10|
|D2L ma2018dimensionality ||80.08 ± 0.37||74.97 ± 0.23||68.61 ± 0.45|
|SELFIE song2019selfie ||88.25 ± 0.46||85.05 ± 0.38||71.22 ± 0.96|
|Bi-Tempered (RN50) NEURIPS2019_8cd7775f ||92.13 ± 0.24||87.59 ± 0.86||80.55 ± 0.16|
|RAFNI (RN50)||94.61 ± 0.15||93.43 ± 0.30||89.66 ± 0.81|
The results of the comparisons are shown in Table 11 and Table 12. We can see that both the Bi-Tempered proposal and RAFNI report better results than the rest of the algorithms, even SELFIE song2019selfie , which uses the known noise level and is trained for the largest number of epochs, since it is an iterative algorithm with three rounds, each of which trains the network for 100 epochs. Our algorithm reports better results in all cases except for CIFAR100 at the 40% and 60% noise levels, where the Bi-Tempered algorithm is better, but by less than 0.5% in both cases. The results of both algorithms are very similar in the symmetric noise scenario. However, for asymmetric noise, RAFNI outperforms the Bi-Tempered algorithm at all noise levels: it has a gain of 2.48% at 20%, 5.84% at 30% and 9.11% at 40% noise.
6.2 Comparison using data augmentation
We compare RAFNI with the proposals that originally used data augmentation with CIFAR10 and CIFAR100: the estimated forward and backward corrections from patrini2017making and D2L from ma2018dimensionality . That way, we compare these proposals in their original context. The only difference with RAFNI is that we also added data augmentation in all cases. Instead of using all noise levels again, we used the standard and most commonly used noise level for symmetric noise, 40%, and the intermediate level for asymmetric noise, 30%.
|Algorithm||CIFAR10 (symmetric noise, 40%)||CIFAR100 (symmetric noise, 40%)||CIFAR10 (asymmetric noise, 30%)|
|Est. Forward patrini2017making ||85.40 ± 0.37||49.16 ± 1.23||89.13 ± 0.61|
|Est. Backward patrini2017making ||78.79 ± 0.43||15.59 ± 6.09||83.36 ± 0.83|
|D2L ma2018dimensionality ||83.98 ± 0.12||23.94 ± 7.82||85.36 ± 0.26|
|RAFNI (RN50 and DA)||92.96 ± 0.26||73.42 ± 0.10||94.23 ± 0.18|
In Table 13 we show the results of the comparison using data augmentation. In this case, RAFNI performs better than the other algorithms for all data sets. For CIFAR10 with symmetric noise, RAFNI has a gain of 7.56% with respect to the next best performing algorithm. For CIFAR100, our algorithm obtains 24.33% more accuracy than the next best performing algorithm. And finally, for CIFAR10 with asymmetric noise, RAFNI outperforms the next best one by 5.1%.
7 Conclusions
In this paper, we proposed an algorithm, called RAFNI, that can filter and relabel noisy instances during the training process of any convolutional neural network, using the predictions and loss values the network gives to the instances of the training set. This progressive cleaning of the training set allows the network to improve its generalisation at the end of the training process, improving on the results the CNN obtains on its own. In addition, RAFNI has the advantage that it can be used with any CNN as the backbone network, and that transfer learning and data augmentation can be easily applied. It also does not use prior information that is usually unknown, like the noise matrix or the noise rate. Moreover, it works well even when there is no introduced noise in the data set, so it is safe to use when we do not know the noise rate of a data set. We also made the code available to facilitate its use.
Developing algorithms that allow deep neural networks to perform better under label noise is an important task, since label noise is a common problem in real-world scenarios and it negatively affects the performance of the networks. We believe that our proposal is a strong solution to this problem: it can be easily fine-tuned to each data set, it can be used with any CNN, and it allows the use of transfer learning and data augmentation. We demonstrated its potential using various data sets with different characteristics and three different types of label noise. Finally, we also compared it with several state-of-the-art algorithms, outperforming them.
Acknowledgements
This publication was supported by the project with reference SOMM17/6110/UGR, granted by the Andalusian Consejería de Conocimiento, Investigación y Universidades and European Regional Development Funds (ERDF). This work was also supported by project PID2020-119478GB-I00, granted by Ministerio de Ciencia, Innovación y Universidades, and project P18-FR-4961 of Proyectos I+D+i Junta de Andalucía 2018. Anabel Gómez-Ríos was supported by the FPU Programme FPU16/04765.
- (1) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems 25 (2012) 1097–1105.
- (2) A. Gómez-Ríos, S. Tabik, J. Luengo, A. Shihavuddin, B. Krawczyk, F. Herrera, Towards highly accurate coral texture images classification using deep convolutional neural networks and data augmentation, Expert Systems with Applications 118 (2019) 315–328. doi:https://doi.org/10.1016/j.eswa.2018.10.010.
- (3) R. Olmos, S. Tabik, F. Herrera, Automatic handgun detection alarm in videos using deep learning, Neurocomputing 275 (2018) 66–72.
- (4) H. Song, M. Kim, D. Park, Y. Shin, J.-G. Lee, Learning from noisy labels with deep neural networks: A survey, arXiv preprint arXiv:2007.08199.
- (5) T. Xiao, T. Xia, Y. Yang, C. Huang, X. Wang, Learning from massive noisy labeled data for image classification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2691–2699.
- (6) K.-H. Lee, X. He, L. Zhang, L. Yang, Cleannet: Transfer learning for scalable image classifier training with label noise, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- (7) B. Frenay, M. Verleysen, Classification in the presence of label noise: A survey, IEEE Transactions on Neural Networks and Learning Systems 25 (5) (2014) 845–869. doi:10.1109/TNNLS.2013.2292894.
- (8) W. Li, L. Wang, W. Li, E. Agustsson, L. Van Gool, Webvision database: Visual learning and understanding from web data, arXiv preprint arXiv:1708.02862.
- (9) H. Song, M. Kim, J.-G. Lee, Selfie: Refurbishing unclean samples for robust deep learning, in: International Conference on Machine Learning, PMLR, 2019, pp. 5907–5915.
- (10) C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (3) (2021) 107–115.
- (11) G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1944–1952.
- (12) X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. Erfani, S. Xia, S. Wijewickrema, J. Bailey, Dimensionality-driven learning with noisy labels, in: International Conference on Machine Learning, PMLR, 2018, pp. 3355–3364.
- (13) L. Jiang, Z. Zhou, T. Leung, L.-J. Li, L. Fei-Fei, Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels, in: International Conference on Machine Learning, PMLR, 2018, pp. 2304–2313.
- (14) Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, S.-T. Xia, Iterative learning with open-set noisy labels, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8688–8696.
- (15) G. Song, W. Chai, Collaborative learning for deep neural networks, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Curran Associates Inc., 2018, p. 1837–1846.
- (16) I. Jindal, M. Nokleby, X. Chen, Learning deep networks from noisy labels with dropout regularization, in: 2016 IEEE 16th International Conference on Data Mining (ICDM), IEEE, 2016, pp. 967–972.
- (17) E. Amid, M. K. K. Warmuth, R. Anil, T. Koren, Robust bi-tempered logistic loss based on bregman divergences, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
- (18) A. Ghosh, H. Kumar, P. Sastry, Robust loss functions under label noise for deep neural networks, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.
- (19) Z. Zhang, M. R. Sabuncu, Generalized cross entropy loss for training deep neural networks with noisy labels, in: 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.
- (20) K. Yi, J. Wu, Probabilistic end-to-end noise correction for learning with noisy labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7017–7025.
- (21) S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, R. Fergus, Training convolutional networks with noisy labels, 2015, 3rd International Conference on Learning Representations, ICLR 2015.
- (22) A. Shihavuddin, Coral reef dataset, mendeley data, v2, https://data.mendeley.com/datasets/86y667257h/2, accessed on: 06-04-2021. doi:10.17632/86y667257h.2.
- (23) A. Gómez-Ríos, S. Tabik, J. Luengo, A. Shihavuddin, F. Herrera, Coral species identification with texture or structure images using a two-level classifier based on convolutional neural networks, Knowledge-Based Systems 184 (2019) 104891. doi:https://doi.org/10.1016/j.knosys.2019.104891.
- (24) S. Tabik, A. Gómez-Ríos, J. L. Martín-Rodríguez, I. Sevillano-García, M. Rey-Area, D. Charte, E. Guirado, J. L. Suárez, J. Luengo, M. A. Valero-González, P. García-Villanova, E. Olmedo-Sánchez, F. Herrera, Covidgr dataset and covid-sdnet methodology for predicting covid-19 based on chest x-ray images, IEEE Journal of Biomedical and Health Informatics 24 (12) (2020) 3595–3605. doi:10.1109/JBHI.2020.3037127.
- (25) A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, Technical report, University of Toronto (2009).
- (26) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- (27) M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 6105–6114.
Appendix A Best hyperparameter values in each case
In Table 14 and Table 15 we show the best hyperparameter values we found for the small datasets: RSMAS, StructureRSMAS, EILAT and COVIDGR1.0-SN with ResNet50 and EfficientNetB0, respectively. Similarly, in Table 16 and Table 17 we show the best hyperparameter values we found for the CIFAR datasets with ResNet50 and EfficientNetB0, respectively.
Appendix B Hyperparameter values tested in each case
In Table 18 and Table 19 we show all the values we tested for the hyperparameters for the small datasets: RSMAS, StructureRSMAS, EILAT and COVIDGR1.0-SN with ResNet50 and EfficientNetB0, respectively. Similarly, in Table 20 and Table 21 we show all the values we tested for the hyperparameters for the CIFAR datasets with ResNet50 and EfficientNetB0, respectively.