Deep neural networks (DNNs) trained on large-scale datasets have achieved strong performance in image classification. Many large-scale datasets are collected from websites; however, they tend to contain inaccurate labels, termed noisy labels. Training on such noisy labeled datasets degrades performance because DNNs easily overfit to noisy labels. To overcome this problem, we propose a joint optimization framework for learning DNN parameters and estimating true labels. Our framework corrects labels during training by alternately updating the network parameters and the labels. We conduct experiments on the noisy CIFAR-10 datasets and the Clothing1M dataset. The results indicate that our approach significantly outperforms other state-of-the-art methods.
DNNs trained on large-scale datasets have achieved impressive results on many classification problems. Generally, accurate labels are necessary to effectively train DNNs. However, many datasets are constructed by crawling images and labels from websites and often contain incorrect noisy labels (e.g., YFCC100M , Clothing1M ). This study addresses the following question: how can we effectively train DNNs on noisy labeled datasets without manually cleaning the data?
The prominent issue in training DNNs on noisy labeled datasets is that DNNs can learn, or memorize, any training dataset, which implies that they can completely overfit noisy data.
To address this problem, commonly used regularization techniques including dropout and early stopping are helpful. However, these methods do not guarantee optimization because they prevent the networks from reducing the training loss. Another method involves using prior knowledge, such as the confusion matrix between clean and noisy labels, which typically cannot be used in real settings.
Consequently, we need a new framework of optimization. In this study, we propose an optimization framework for learning on a noisy labeled dataset. We propose optimizing the labels themselves as opposed to treating the noisy labels as fixed. The joint optimization of network parameters and the noisy labels corrects inaccurate labels and simultaneously improves the performance of the classifier. Fig.1 shows the concept of our proposal. The main contributions are as follows.
We propose a joint optimization framework for learning on noisy labeled datasets. Our optimization problem has two sets of variables, the network parameters and the class labels, which are optimized by an alternating strategy.
We observe that a DNN trained on noisy labeled datasets does not memorize noisy labels and maintains high performance on clean data under a high learning rate. This reinforces the findings of Arpit et al., which suggest that DNNs first learn simple patterns and subsequently memorize noisy data.
We evaluate the performance on synthetic and real noisy datasets. We demonstrate state-of-the-art performance on the noisy CIFAR-10 dataset and a comparable performance on the Clothing1M dataset .
Recently, the generalization and memorization abilities of neural networks have attracted increasing attention. Specifically, we focus on the ability to learn labels. Zhang et al. showed that DNNs can learn any training dataset even if the training labels are completely random. This leads to two problems. First, the performance of a DNN decreases when it is trained on a noisy dataset and completely learns the noisy labels. Second, it is difficult to determine which labels are noisy given this perfect learning ability. To the best of our knowledge, most studies on deep learning with noisy labels do not focus on these problems, which are caused by the memorization ability of DNNs. This study addresses these two problems to improve the classification accuracy by preventing the network from completely learning noisy labels.
We briefly review existing studies on learning on noisy labeled datasets.
Regularization: Regularization is an effective way to deal with the issue of DNNs easily fitting noisy labels, as described in Section 2.1. Arpit et al. evaluated DNNs trained on noisy labeled datasets with several regularization techniques, including weight decay, dropout, and adversarial training. Zhang et al. used mixup, which trains on linear combinations of images and labels.
These techniques improve performance on clean labels. However, they do not explicitly deal with noisy labels, and therefore long training leads to performance degradation: the performance at the last epoch is generally worse than that at the best epoch. Furthermore, the training loss on the noisy labeled dataset cannot be used as a measure of performance on clean labels. Therefore, early stopping based on the training loss does not work well.
Noise transition matrix: The noise transition matrix gives the probability of observing each noisy label given the true label. It can be used to modify the cross entropy loss, so that the predicted distribution over true labels is mapped to a distribution over noisy labels before the loss is computed.
This formulation was used in many studies [16, 11, 14]. In deep learning, some studies presuppose the ground-truth noise transition matrix [14, 19] and achieve state-of-the-art performance on the noisy CIFAR-10 dataset. Other studies estimate the noise transition matrix from noisy data. Specifically, the matrix is modeled by a fully connected layer and is trained in an end-to-end manner [16, 11]. However, these methods do not carefully consider the memorization ability of DNNs. Patrini et al. proposed an estimation method for the noise transition matrix; however, its performance is slightly worse than that obtained with the true matrix.
Robust loss function:
A few studies achieve noise-robust classification by using noise-tolerant loss functions, such as ramp loss and unhinged loss . For further details please refer to . In deep learning, Ghosh et al. used mean square error and mean absolute error as noise-tolerant loss functions. It should be noted that they do not consider the problem that DNNs can learn arbitrary labels.
Other approaches using deep learning: Reed et al. used a bootstrapping scheme to handle noisy labels. Our method is similar to this study. Xiao et al. constructed a noise model with multiple noise types and trained two networks: an image classifier and a noise type classifier. It should be noted that this method requires a small amount of accurately labeled data.
Pseudo-labeling is a type of self-training that is generally used in semi-supervised learning with few labeled data and many unlabeled data. In this technique, pseudo-labels are initially assigned to unlabeled data by the predictions of a model trained on a clean dataset. Subsequently, the algorithm repeatedly retrains the model on both labeled and unlabeled data and updates the pseudo-labels.
In semi-supervised learning, we know which data are labeled and need to assign pseudo-labels only to unlabeled data. However, when learning on noisy labeled data, it is necessary to treat all data equally because we do not know which data are noisy. Reed et al. proposed a self-training scheme for training a DNN on noisy labeled data. Their approach is similar to that proposed in this study. However, they use the original noisy labels for learning until the end of training, and thus the performance is degraded by the remaining effects of noisy labels at a high noise rate [15, 11]. Conversely, we completely replace all labels with pseudo-labels and use them for training.
In this study, column vectors and matrices are denoted in bold (e.g. ) and capitals (e.g. ), respectively. Specifically, is a vector of all-ones. We define hard-label spaces and soft-label spaces .
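In the supervised $c$-class classification setting, we have a set of training images $X = \{\mathbf{x}_i\}_{i=1}^{n}$ with ground-truth labels $Y = \{\mathbf{y}_i\}_{i=1}^{n}$, where $\mathbf{y}_i$ is a one-hot vector representation of the true class label. The objective function is an empirical risk, such as the cross entropy, as follows:

$$\mathcal{L}(\theta \mid X, Y) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} y_{ij} \log s_j(\theta, \mathbf{x}_i), \tag{2}$$

where $\theta$ denotes the set of network parameters and $\mathbf{s}(\theta, \mathbf{x}_i)$ denotes the output of the final layer, namely the $c$-class softmax layer, of the network.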
If a clean training dataset is present, then the network parameters are learned by optimizing Eq. (2) by using a gradient descent method. However, in this study, we consider the classification problem with noisy labels as follows: Let be the noisy label, and only the noisy training label set is given. The task involves training CNNs to predict true labels. In the next section, we describe the proposed method for training on noisy labels.
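The empirical risk described above can be sketched in plain Python. This is a minimal illustration and not the original Chainer implementation: the function name `cross_entropy`, the list-based data layout, and the small constant `eps` (added only for numerical safety) are assumptions made for the example.

```python
import math


def cross_entropy(labels, probs, eps=1e-12):
    """Mean cross entropy between label vectors and predicted class probabilities.

    labels: list of per-sample label vectors (one-hot or soft), each of length c.
    probs:  list of per-sample softmax outputs, each of length c.
    eps:    assumed small constant that avoids log(0); not part of the paper.
    """
    total = 0.0
    for y, s in zip(labels, probs):
        total -= sum(y_j * math.log(s_j + eps) for y_j, s_j in zip(y, s))
    return total / len(labels)


# A confident correct prediction has a lower risk than a confident wrong one.
clean = cross_entropy([[1.0, 0.0]], [[0.9, 0.1]])
wrong = cross_entropy([[0.0, 1.0]], [[0.9, 0.1]])
print(clean < wrong)
```

With noisy labels, the same function is evaluated against the noisy label set, which is why a network that reaches a very low value of this risk has necessarily fitted the incorrect labels as well.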
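In this section, we present our proposed training method with noisy labels. Generally, in supervised learning on clean labels, the optimization problem is formulated over the network parameters $\theta$ for images $X$ and labels $Y$ as follows:

$$\min_{\theta} \mathcal{L}(\theta \mid X, Y). \tag{3}$$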
As we will describe in Section 5.3, we experimentally found that a high learning rate suppresses the memorization ability of a DNN and prevents it from completely fitting the labels. Thus, we assume that a network trained with a high learning rate has more difficulty fitting noisy labels. In other words, the loss in Eq. (3) is high for noisy labels and low for clean labels. Given this assumption, we obtain clean labels by updating the labels in the direction that decreases Eq. (3). Therefore, we formulate the problem as the joint optimization of the network parameters $\theta$ and the labels $Y$ as follows:

$$\min_{\theta, Y} \mathcal{L}(\theta, Y \mid X). \tag{4}$$
The concept of our proposal is shown in Fig. 1.
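Our proposed loss function consists of three terms:

$$\mathcal{L}(\theta, Y \mid X) = \mathcal{L}_c(\theta, Y \mid X) + \alpha \mathcal{L}_p(\theta \mid X) + \beta \mathcal{L}_e(\theta \mid X), \tag{5}$$

where $\mathcal{L}_c$, $\mathcal{L}_p$, and $\mathcal{L}_e$ denote the classification loss and two regularization losses, respectively, and $\alpha$ and $\beta$ denote hyper parameters. In this study, we use the Kullback-Leibler (KL) divergence for $\mathcal{L}_c$ as follows:

$$\mathcal{L}_c(\theta, Y \mid X) = \frac{1}{n} \sum_{i=1}^{n} D_{KL}\left(\mathbf{y}_i \,\|\, \mathbf{s}(\theta, \mathbf{x}_i)\right), \tag{6}$$

$$D_{KL}\left(\mathbf{y}_i \,\|\, \mathbf{s}(\theta, \mathbf{x}_i)\right) = \sum_{j=1}^{c} y_{ij} \log \frac{y_{ij}}{s_j(\theta, \mathbf{x}_i)}.$$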
In the following subsections, we first describe an alternating optimization method to solve this problem, and we then define the two regularization losses.
In our proposed learning framework, the network parameters and class labels are alternately updated as shown in Algorithm 1. We describe the update rules for the network parameters and for the labels below.
Updating the network parameters with fixed labels: All terms in the optimization problem Eq. (5) are sub-differentiable with respect to the network parameters. Therefore, we update the network parameters by stochastic gradient descent (SGD) on the loss function Eq. (5).
Updating the labels with fixed network parameters: In contrast to other methods, we update and optimize the labels on which we train. To update the labels, it is only necessary to consider the classification loss in Eq. (5) with the network parameters fixed. The optimization problem Eq. (6) over the labels separates into an independent problem for each label.
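Two methods can be considered for optimizing the labels: the hard-label method and the soft-label method. In the hard-label method, each label $\mathbf{y}_i$ is set to the one-hot vector of the class with the highest predicted probability:

$$y_{ij} = \begin{cases} 1 & \text{if } j = \arg\max_{j'} s_{j'}(\theta, \mathbf{x}_i), \\ 0 & \text{otherwise.} \end{cases}$$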
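In the soft-label method, the KL divergence from $\mathbf{s}(\theta, \mathbf{x}_i)$ to $\mathbf{y}_i$ is minimized when $\mathbf{y}_i = \mathbf{s}(\theta, \mathbf{x}_i)$, and thus the update rule for $\mathbf{y}_i$ is as follows:

$$\mathbf{y}_i \leftarrow \mathbf{s}(\theta, \mathbf{x}_i). \tag{9}$$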
As we will describe in Section 5.4, we experimentally determined that the performance of the soft-label method exceeded that of the hard-label method. Thus, we applied soft-labels to all experiments if not otherwise specified.
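The two label-update rules and the KL-based classification loss can be illustrated with a short sketch. It assumes the network output for one image is available as a list of class probabilities; the function names and the constant `eps` are hypothetical and chosen only for this example.

```python
import math


def kl_divergence(y, s, eps=1e-12):
    """KL divergence between a label vector y and a prediction s.

    Terms with y_j = 0 contribute zero by convention. eps is an assumed
    numerical safeguard against division by zero.
    """
    return sum(y_j * math.log(y_j / (s_j + eps)) for y_j, s_j in zip(y, s) if y_j > 0)


def hard_label_update(s):
    """Return the one-hot vector of the most probable class."""
    best = max(range(len(s)), key=lambda j: s[j])
    return [1.0 if j == best else 0.0 for j in range(len(s))]


def soft_label_update(s):
    """Return a copy of the predicted distribution as the new label."""
    return list(s)


prediction = [0.1, 0.7, 0.2]
hard = hard_label_update(prediction)
soft = soft_label_update(prediction)
# The soft label drives the per-sample classification loss to (numerically) zero,
# whereas the hard label leaves a positive loss unless the prediction is one-hot.
print(kl_divergence(soft, prediction), kl_divergence(hard, prediction))
```

The sketch makes the trade-off visible: the hard label discards the confidence of the network, while the soft label keeps it.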
We describe the definitions and roles of the two regularization losses.
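Regularization loss $\mathcal{L}_p$: This regularization loss is required to prevent the assignment of all labels to a single class. If we minimize only Eq. (6), we obtain a trivial global optimum in which the network always predicts the same constant one-hot vector for any image and every label is equal to that vector. To overcome this problem, we introduce a prior probability distribution $\mathbf{p}$, which is the distribution of classes among all training data. If the prior distribution of classes is known, then the updated labels should follow it. Therefore, we introduce the KL divergence from the mean predicted distribution $\bar{\mathbf{s}}(\theta, X)$ to $\mathbf{p}$ as a cost function as follows:

$$\mathcal{L}_p(\theta \mid X) = \sum_{j=1}^{c} p_j \log \frac{p_j}{\bar{s}_j(\theta, X)}.$$

The mean probability is approximated by computing it over each mini-batch $\mathcal{B}$:

$$\bar{\mathbf{s}}(\theta, X) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{s}(\theta, \mathbf{x}_i) \approx \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{s}(\theta, \mathbf{x}).$$

This approximation cannot handle a large number of classes or extremely imbalanced classes; however, it works well in the experiments on the noisy CIFAR-10 dataset and the Clothing1M dataset.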
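The role of the prior term can be checked numerically with a small sketch. It assumes the loss is the KL divergence between the class prior and the mean prediction of a mini-batch, as described above; `mean_prediction`, `prior_loss`, and `eps` are names introduced only for this illustration.

```python
import math


def mean_prediction(probs):
    """Average the predicted class distributions over a mini-batch."""
    n, c = len(probs), len(probs[0])
    return [sum(p[j] for p in probs) / n for j in range(c)]


def prior_loss(prior, probs, eps=1e-12):
    """KL divergence between the class prior and the mini-batch mean prediction.

    eps is an assumed safeguard for classes that receive zero mean probability.
    """
    s_bar = mean_prediction(probs)
    return sum(p_j * math.log(p_j / (sb_j + eps)) for p_j, sb_j in zip(prior, s_bar) if p_j > 0)


uniform_prior = [0.5, 0.5]
collapsed_batch = [[1.0, 0.0], [1.0, 0.0]]  # every image assigned to class 0
balanced_batch = [[1.0, 0.0], [0.0, 1.0]]   # predictions follow the prior
# The collapsed solution is heavily penalized; the balanced one is not.
print(prior_loss(uniform_prior, collapsed_batch) > prior_loss(uniform_prior, balanced_batch))
```

Because the mean is taken over a mini-batch, a batch much smaller than the number of classes cannot represent the prior well, which is consistent with the limitation noted above.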
Regularization loss $\mathcal{L}_e$: This term is required in the training loss when we use soft-labels. Consider Eq. (5) without this term. In this case, when the labels are updated by Eq. (9), both the network parameters and the labels are stuck in local optima and the learning process does not proceed. To overcome this problem, we introduce an entropy term that concentrates the probability distribution of each soft-label on a single class as follows:

$$\mathcal{L}_e(\theta \mid X) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} s_j(\theta, \mathbf{x}_i) \log s_j(\theta, \mathbf{x}_i).$$
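As a rough illustration of the entropy term, the sketch below computes the mean entropy of the predicted distributions. It assumes the standard Shannon entropy averaged over samples; the function name `entropy_loss` and the constant `eps` are assumptions of this example.

```python
import math


def entropy_loss(probs, eps=1e-12):
    """Mean Shannon entropy of the predicted class distributions.

    eps is an assumed constant that keeps log() finite for zero probabilities.
    """
    total = 0.0
    for s in probs:
        total -= sum(s_j * math.log(s_j + eps) for s_j in s)
    return total / len(probs)


# A uniform prediction has maximal entropy; a one-hot prediction has none,
# so minimizing this term pushes each prediction toward a single class.
print(entropy_loss([[0.5, 0.5]]), entropy_loss([[1.0, 0.0]]))
```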
Our method has two steps for training on noisy labels. In the first step, we obtain clean labels by updating labels as described in Section 4.1. In the second step, we initialize the network parameters and train the network by usual supervised learning with the labels obtained in the first step.
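The two-step procedure can be outlined as a framework-agnostic skeleton. This is only a sketch under stated assumptions: `make_model`, `train_epoch`, and `predict` are hypothetical callables supplied by the user (they stand in for network construction, one epoch of SGD, and inference over the training set), epochs are counted from zero, and the label update shown is the soft-label rule.

```python
def run_two_step_training(noisy_labels, make_model, train_epoch, predict,
                          first_step_epochs, start_update_epoch, second_step_epochs):
    """Alternate parameter and label updates, then retrain on the obtained labels.

    make_model():              returns a freshly initialized model (hypothetical).
    train_epoch(model, labels): runs one epoch of SGD on the given labels (hypothetical).
    predict(model):            returns predicted class probabilities per training image.
    """
    # Step 1: joint optimization of network parameters and labels.
    model = make_model()
    labels = [list(y) for y in noisy_labels]
    for epoch in range(first_step_epochs):
        train_epoch(model, labels)           # update parameters with labels fixed
        if epoch >= start_update_epoch:
            labels = predict(model)          # update labels with parameters fixed

    # Step 2: re-initialize the network and train on the obtained labels.
    model = make_model()
    for _ in range(second_step_epochs):
        train_epoch(model, labels)
    return model, labels
```

Keeping the label update behind `start_update_epoch` reflects the idea that labels should only be rewritten once the network predictions are reasonably reliable.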
CIFAR-10: We use the CIFAR-10 dataset  and retain 10% of the training data for validation. Subsequently, we define three types of the training data, namely Symmetric Noise CIFAR-10 (SN-CIFAR), Asymmetric Noise CIFAR-10 (AN-CIFAR), and Pseudo Label CIFAR-10 (PL-CIFAR).
In SN-CIFAR, we inject symmetric label noise: with probability equal to the noise rate, the label of each training image is replaced by a randomly chosen class.
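A minimal sketch of symmetric label noise injection is given below. It assumes one common convention: with probability equal to the noise rate, a label is replaced by a class drawn uniformly from all classes, so the replacement may coincide with the true class. Whether this matches the exact convention of the original experiments is an assumption, as are the function name and the fixed seed.

```python
import random


def inject_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Replace each integer class label with a uniformly random class with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed so the noisy label set is reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.randrange(num_classes))  # may equal the true class
        else:
            noisy.append(y)
    return noisy


clean_labels = [i % 10 for i in range(1000)]
noisy_labels = inject_symmetric_noise(clean_labels, noise_rate=0.5, num_classes=10)
changed = sum(1 for a, b in zip(clean_labels, noisy_labels) if a != b)
print(changed / len(clean_labels))  # fraction of labels that actually changed
```

Under this convention the fraction of labels that actually change is somewhat smaller than the nominal noise rate, because a random replacement sometimes lands on the true class.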
In AN-CIFAR, we inject asymmetric label noise. The asymmetric label noise is discussed in . The rationale is to mimic part of the structure of real mistakes between similar classes: TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, and CAT ↔ DOG. Transitions are parameterized by the noise rate $r$ such that the probabilities of the ground-truth class and the inaccurate class are $1 - r$ and $r$, respectively.
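The asymmetric noise can be sketched as a class-dependent flip. The direction of each transition below (TRUCK to AUTOMOBILE, BIRD to AIRPLANE, DEER to HORSE, and CAT and DOG exchanged) is an assumption based on the class pairs listed above and on the later remark that CAT and DOG are exchanged; the `FLIP` table, function name, and seed are illustrative only.

```python
import random

# Assumed transition table for the similar-class mistakes (other classes are untouched).
FLIP = {
    "truck": "automobile",
    "bird": "airplane",
    "deer": "horse",
    "cat": "dog",
    "dog": "cat",
}


def inject_asymmetric_noise(class_names, noise_rate, seed=0):
    """Flip each label to its confusable class with probability noise_rate."""
    rng = random.Random(seed)
    noisy = []
    for name in class_names:
        target = FLIP.get(name)
        if target is not None and rng.random() < noise_rate:
            noisy.append(target)
        else:
            noisy.append(name)
    return noisy


print(inject_asymmetric_noise(["truck", "frog", "cat", "dog"], noise_rate=1.0))
# -> ['automobile', 'frog', 'dog', 'cat']
```

At a noise rate of 0.5, the CAT and DOG label sets become statistically indistinguishable under this table, which is why no method can separate those two classes in that setting.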
In PL-CIFAR, pseudo labels are assigned to unlabeled training data. Pseudo labels are generated by applying k-means++ to features that are the outputs of the pool5 layer of ResNet-50 pre-trained on ImageNet.
Clothing1M: We use the Clothing1M dataset  to examine the performance of our method in a real setting. The Clothing1M dataset contains 1 million images of clothing obtained from several online shopping websites that are classified into the following 14 classes: T-shirt, Shirt, Knitwear, Chiffon, Sweater, Hoodie, Windbreaker, Jacket, Down Coat, Suit, Shawl, Dress, Vest, and Underwear. The labels are generated by using surrounding texts of the images that are provided by the sellers, and therefore contain many errors. In , it is reported that the overall accuracy of the noisy labels is 61.54%, and some pairs of classes are often confused with each other (e.g., Knitwear and Sweater). The Clothing1M dataset also contains , and of clean data for training, validation, and testing, respectively although we do not use the clean training data.
We implemented all the models with the deep learning framework Chainer v2.1.0 .
CIFAR-10: For training on SN-CIFAR, AN-CIFAR, and PL-CIFAR, we used a network based on PreAct ResNet-32 as detailed in Appendix A. For preprocessing, we performed mean subtraction and data augmentation by horizontal random flips and 32×32 random crops after padding with 4 pixels on each side. We used SGD with a momentum of 0.9, a weight decay of , and a batch size of 128.
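The augmentation described above can be sketched without any image library. The example operates on a single-channel image stored as a list of rows and assumes zero padding; the padding value, the function name, and the use of a nested-list image are assumptions made to keep the sketch self-contained.

```python
import random


def random_crop_and_flip(image, pad=4, rng=None):
    """Pad the image on each side, take a random crop of the original size, and flip it at random.

    image: list of rows (H x W) of pixel values; zero padding is an assumption.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    blank_rows = [[0] * (w + 2 * pad) for _ in range(pad)]
    padded = (blank_rows
              + [[0] * pad + list(row) + [0] * pad for row in image]
              + [list(r) for r in blank_rows])
    top = rng.randint(0, 2 * pad)    # inclusive bounds: 2*pad + 1 possible offsets
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < 0.5:           # horizontal flip with probability 0.5
        crop = [row[::-1] for row in crop]
    return crop


image = [[r * 32 + c for c in range(32)] for r in range(32)]
augmented = random_crop_and_flip(image, pad=4, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```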
In the first step of our method, we trained the network for 200 epochs and began updating labels from the 70th epoch. We determined the learning rate and the hyper parameters in Eq. (5) for SN-CIFAR, AN-CIFAR, and PL-CIFAR separately, based on the validation accuracy. The details are described in each experimental section. As we will describe in Section 5.4, soft-labels performed better than hard-labels, and thus we applied soft-labels to all the experiments in Section 5.5, Section 5.6, and Section 5.7. In this case, the prior distribution is uniform because each class has the same number of images in the CIFAR-10 dataset. When updating the noisy labels with the predicted probabilities, we used the average output probability of the network over the past 10 epochs. We experimentally determined that this averaging technique is useful in preventing inaccurate updates, since its effect is similar to that of an ensemble.
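The averaging over recent epochs can be sketched with a fixed-size buffer. This is an illustrative outline rather than the original code: the class name `PredictionAverager` and its methods are hypothetical, and predictions are assumed to be stored as one list of class probabilities per training image.

```python
from collections import deque


class PredictionAverager:
    """Keep the predictions of the most recent epochs and return their mean."""

    def __init__(self, window=10):
        # A bounded deque silently drops the oldest epoch once the window is full.
        self.history = deque(maxlen=window)

    def add_epoch(self, probs):
        self.history.append([list(p) for p in probs])

    def average(self):
        n_epochs = len(self.history)
        n_samples = len(self.history[0])
        n_classes = len(self.history[0][0])
        return [[sum(epoch[i][j] for epoch in self.history) / n_epochs
                 for j in range(n_classes)]
                for i in range(n_samples)]


averager = PredictionAverager(window=10)
averager.add_epoch([[1.0, 0.0]])
averager.add_epoch([[0.0, 1.0]])
print(averager.average())  # -> [[0.5, 0.5]]
```

The averaged probabilities would then be used as the new soft-labels, which smooths out epoch-to-epoch fluctuations in the predictions.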
In the second step of our method, we trained the network for 120 epochs on the labels obtained in the first step. We began training with a learning rate of 0.2 and divided it by 10 after 40 and 80 epochs. We used only the classification loss as the training loss in this step.
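The step schedule of the second stage can be written as a small function. The sketch assumes zero-based epoch indices, so that the rate changes at the start of epochs 40 and 80; the function name and argument names are illustrative.

```python
def step_learning_rate(epoch, base_lr=0.2, milestones=(40, 80), factor=0.1):
    """Return the learning rate for a zero-based epoch index under a step schedule."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor  # divide by 10 at each milestone that has been passed
    return lr


for epoch in (0, 39, 40, 79, 80, 119):
    print(epoch, step_learning_rate(epoch))
```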
Clothing1M: Training on the Clothing1M dataset, we used ResNet-50 pre-trained on ImageNet to align experimental condition with . For preprocessing, we resized the images to , performed mean subtraction, and cropped the middle . We used SGD with a momentum of 0.9, a weight decay of , and batch size of 32.
In the first step of our method, we trained the network for 10 epochs and began updating labels from the 1st epoch. We used a learning rate of , and used 2.4 for and 0.8 for . While updating the noisy label by the probability , we used the average output probability of the network of all the past epochs as . We applied soft-labels to the experiment in Section 5.8.
In the second step of our method, we trained the network for 10 epochs on the labels obtained in the first step. We began training with a learning rate of and divided it by 10 after 5 epochs.
To examine the effect of the learning rate (lr) and the noise rate () on the training loss and the test accuracy, we trained the network on SN-CIFAR with only the cross entropy loss.
Fig. 2 shows the test accuracy curves for different learning rates. We trained the network for 120 epochs with a learning rate of 0.2 or 0.02. With the low learning rate (lr=0.02), the test accuracy was high in the early phase of training and then gradually decreased because the network fitted the noisy labels. This is the same result as reported in . Conversely, with the high learning rate (lr=0.2), the test accuracy remained high throughout training. This means that a high learning rate prevents the network from memorizing and fitting the noisy labels.
Fig. 3 shows how the training loss declines during training. We trained the network for 600 epochs. We began training with a learning rate of 0.2 and divided it by 10 after 200 and 400 epochs. At the end of training, our model fitted the noisy labels even when the noise rate was high (e.g., ). However, while training with a high learning rate, the training loss clearly increases as the noise rate increases. This indicates that it is possible to optimize the labels toward lowering the training loss when the learning rate is high.
To demonstrate the effectiveness of the soft-label method, we trained the network on SN-CIFAR (noise rate ) for 1500 epochs by using the first step of our method. We compare the hard-label methods and the soft-label method. For the hard-label methods, every epoch we update the top 50, 500, 5000, or all labels whose current labels differ most from the predicted classes to the predicted hard-labels. For the soft-label method, we update all labels to the predicted soft-labels every epoch. In Fig. 4, we show the recovery accuracy, which is defined as the accuracy of the reassigned labels, in the first step of our method. The soft-label method achieves faster convergence and better recovery accuracy than any of the hard-label methods.
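The recovery accuracy can be computed as sketched below. The sketch assumes that a soft-label counts as correct when its most probable class equals the ground-truth class; this argmax convention and the function name are assumptions of the example.

```python
def recovery_accuracy(reassigned_labels, true_classes):
    """Fraction of reassigned labels whose most probable class matches the ground truth.

    reassigned_labels: list of hard (one-hot) or soft label vectors.
    true_classes:      list of integer ground-truth class indices.
    """
    correct = 0
    for y, t in zip(reassigned_labels, true_classes):
        predicted = max(range(len(y)), key=lambda j: y[j])
        if predicted == t:
            correct += 1
    return correct / len(true_classes)


labels = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]
print(recovery_accuracy(labels, [0, 1, 1]))  # two of three labels recovered
```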
Subsequently, by using the second step of our method, we trained on the labels obtained in the first step. Among the hard-label methods, updating 500 labels every epoch is optimal and gives a test accuracy of 85.7%. Conversely, the test accuracy of the soft-label method is 86.0%. Thus, although the recovery accuracy of the soft-label method in the first step is 86.0%, which is approximately equal to the 85.9% of the hard-label method (updating 500 labels every epoch), the test accuracy is improved by 0.3%. We attribute the better performance of the soft-label method to the fact that soft-labels contain the probabilities of each class. Soft-labels reflect the confidence of the trained network, unlike hard-labels, which are assigned by ignoring confidence. Our results indicate that confidence is important when training on noisy labels.
| # | Method | Epoch | Metric, by noise rate (%) | 0 | 10 | 30 | 50 | 70 | 90 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Cross Entropy Loss | best | Test accuracy (%) | 93.5 | 91.0 | 88.4 | 85.0 | 78.4 | 41.1 |
| 1 | Cross Entropy Loss | best | Recovery accuracy (%) | 100.0 | 96.4 | 92.7 | 88.2 | 80.1 | 41.4 |
To evaluate the performance of our method on synthesized noisy labels, we trained the network on SN-CIFAR (the noise rate ) by using our method. In the first step of our method, we used the optimal learning rate and hyper parameters for each noise rate, chosen based on the validation accuracy as detailed in Appendix B. As a comparison, we also trained on the initial noisy labels in the same manner as the second step of our method.
The results are reported in Table 1. In Table 1, best denotes the scores at the epoch where the validation accuracy is optimal, and last denotes the scores at the end of training. The recovery accuracy for our method is defined as the accuracy of the reassigned labels. Conversely, the other methods do not reassign the noisy labels, and thus their recovery accuracy is reported as the prediction accuracy on the ground-truth labels of the noisy training data.
Our method achieves better overall test accuracy and recovery accuracy on SN-CIFAR. When training was performed on the initial noisy labels, the test accuracy decreased after approximately the 40th epoch (when we divided the learning rate by 10). This indicates that lowering the learning rate helps the network fit the noisy labels, as described in Section 5.3. Conversely, when we trained on the labels optimized by our method, the test accuracy remained high until the end of training. This is an important effect of our joint optimization.
| # | Method | Epoch | Metric, by noise rate (%) | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|---|---|
| 1 | Cross Entropy Loss | best | Test accuracy (%) | 91.8 | 90.8 | 90.0 | 87.1 | 77.3 |
| 1 | Cross Entropy Loss | best | Recovery accuracy (%) | 97.2 | 95.8 | 94.3 | 91.0 | 80.5 |
To evaluate the performance of our method in the settings in , we trained the network on AN-CIFAR (the noise rate ) by using our method. In the first step of our method, we used a learning rate of 0.03 and used 0.8 for and 0.4 for , respectively for all the noise rates. As a comparison, we also performed training on initial noisy labels in the same manner as the second step of our method with the cross entropy loss or the forward corrected loss .
The results of our experiments are shown in Table 2. The forward corrected loss  and the CNN-CRF model  require the ground-truth noise transition matrix. Conversely, we need only the prior distribution , and thus our condition is more general than that of [14, 19].
Our method achieves significantly better test accuracy and recovery accuracy on AN-CIFAR. The exception is the noise rate of 50%, at which there is no significant improvement in accuracy in contrast to the other noise rates. Since we generated label noise that exchanges the CAT and DOG classes, it is impossible to accurately determine the class for CAT and DOG when the noise rate is 50%.
In a manner similar to Section 5.5, when training is performed on initial noisy labels, the test accuracy decreases due to the network fitting noisy labels with a low learning rate. This trend is also observed if we use the forward corrected loss , while the test accuracy does not decrease and remains high in our method.
To evaluate the performance of our method in the settings of transfer learning, we trained the network on PL-CIFAR by using our method. In the first step of our method, we used a learning rate of 0.04 and used 1.2 for and 0.8 for . As a comparison, we also trained on initial pseudo-labels in the same manner as the second step of our method.
Fig. 5 shows the test accuracy curves for different labels, and Fig. 6 shows the decline in the training loss during training. In both figures, we show the results of training on SN-CIFAR (the noise rate ) because the noise rate of the pseudo labels is between 0.3 and 0.5. Additionally, we show the results of training on the ground-truth labels because the training loss curve for the optimized labels is close to the curve for the ground-truth labels.
Although the number of inaccurate labels in the pseudo labels exceeds that of the symmetric noise labels (), the training loss for the pseudo labels is lower than that for the symmetric noise labels. This seems to conflict with the observation described in Section 5.3 that the training loss increases when the noise rate increases. However, the conflict can be explained as follows: the training loss depends on the type of noise as well as on the noise rate. The pseudo labels are generated from the outputs of ResNet-50 pre-trained on ImageNet, and thus the network already regards them as "optimized labels". As a result, the pseudo labels were not updated adequately. The test accuracy of training on the labels recovered from the noisy labels is worse than that of training on the ground-truth labels, which indicates that the optimized labels are not necessarily optimal labels. This is a limitation of the proposed method.
Finally, we trained the network on the Clothing1M dataset  by using our method to evaluate the performance of our method in a real setting. As a comparison, we also trained on initial noisy labels in the same manner as the second step of our method. The results of our experiments are shown in Table 3. Additionally, we also show the scores (#1, #2) reported in .
In #2, Patrini et al. exploited the curated labels of clean data and their noisy versions in the noisy data to obtain the ground-truth noise transition matrix, which is rarely available in real-world settings. Conversely, we only used the distribution of the noisy labels, which is always available, for the prior distribution, and therefore our condition is more general than that of #2. Nevertheless, our method achieves better test accuracy than #2 on the Clothing1M dataset.
In Figs. 7 and 8, we show examples of images whose labels were reassigned by our method to classes different from the original ones. Additionally, we show the probability of the class to which each label is reassigned. When the probability is high, the label seems to be updated correctly. Conversely, when the probability is low, the label seems to be updated incorrectly. As opposed to hard-labels, soft-labels contain the probabilities of each class, and thus the network can treat the incorrectly updated labels as less important. This effect contributes to improving the test accuracy.
| # | Method | Epoch | Test accuracy (%) |
|---|---|---|---|
| 1 | Cross Entropy Loss | | 68.94 |
| 3 | Cross Entropy Loss | best | 69.15 |
We proposed a joint optimization framework for learning on noisy labeled datasets, which alternately updates network parameters and class labels. The framework is supported by our finding that training under a high learning rate prevents the network from memorizing noisy labels. We showed that our framework performed remarkably well on the noisy CIFAR-10 dataset and the Clothing1M dataset, outperforming the state-of-the-art methods [14, 19].
Acknowledgements. This research is partially supported by CREST (JPMJCR1686).
| Layer | Configuration |
|---|---|
| input | 32×32 RGB image |
| conv | 32 filters, 3×3, pad=1, stride=1 |
| unit1 | (pre-activation Residual Unit 32→32) ×5 |
| unit2a | pre-activation Residual Unit 32→64 |
| unit2b | (pre-activation Residual Unit 64→64) ×4 |
| unit3a | pre-activation Residual Unit 64→128 |
| unit3b | (pre-activation Residual Unit 128→128) ×4 |
| | Batch Normalization, ReLU, Global average pool (8×8→1×1 pixels) |
| dense | Fully connected 128→10 |
We show the hyper parameters used in the experiments on SN-CIFAR in Table 5. If the noise rate is high, the optimal learning rate also tends to be high.
|noise rate (%)||0||10||30||50||70||90|
The prediction accuracy is not very sensitive to the hyper parameters, and our method demonstrated good performance with different sets of hyper parameters, as shown in Tables 6, 7, 8, and 9. In addition, Tables 10 and 11 show the validation accuracy for different points at which label updating is started and stopped. When we train the network with a high learning rate, the prediction accuracy remains high, and thus we can start label updating once the validation accuracy reaches a high value. Label updating should be stopped after the training loss converges.
We show an analysis of the effect of soft-labeling on the noisy CIFAR-10 dataset in Tables 12 and 13. Soft-labels with a high probability are almost always correct. Conversely, when the probability is low, the label seems to be updated incorrectly. As opposed to hard-labels, soft-labels contain the probabilities of each class, and thus the network can treat the incorrectly updated labels as less important.