Convolutional Neural Networks (CNNs) have led to great improvements in many supervised tasks. However, CNNs’ performance relies heavily on the quality of labels, and accurately labeling a huge amount of data is expensive and time-consuming. Furthermore, accurate labeling is done by hand, which can eventually lead to mismatched labeling. Therefore, the robust training of CNNs with noisy data is of great practical importance. There are many approaches regarding this issue. For example, there are methods that design noise-robust loss [ghosh2015making, ghosh2017robust, wang2019symmetric, ma2020normalized], use two neural networks to select clean labels [han2018co, yu2019does, wei2020combating], and utilize label correction [patrini2017making, xiao2015learning]. These existing approaches commonly use the given labels in a direct manner, i.e., “input image belongs to this label” (Positive Learning; PL). This behavior carries the risk of providing faulty information to the CNNs when noisy labels are involved.
Motivated by this, Negative Learning for Noisy Labels (NLNL) [kim2019nlnl], an indirect learning method for training CNNs, has recently been proposed. Negative Learning (NL) uses randomly chosen complementary labels and trains the CNN that “input image does not belong to this complementary label,” reducing the risk of providing wrong information because the chance of selecting the true label as a complementary label is low. Additionally, NLNL proposed a three-stage pipeline for filtering noisy data from training data (Figure 1 (a)). The stages are, in order: NL; NL while discarding data of low confidence (Selective NL; SelNL); and PL while only retaining data of high confidence (Selective PL; SelPL), which enables further convergence after NL. However, the fundamental problem that the NL loss function causes underfitting to the overall training data still remains. This is why NL requires an additional sequential step, SelNL. Furthermore, the three-stage pipeline for filtering noisy data is quite inefficient, extending the time for training CNNs.
In this study, we propose a novel version of NLNL: Joint Negative Learning and Positive Learning; JNPL which has a unified single-stage pipeline for filtering noisy data (Figure 1 (b)). JNPL is composed of two losses to train CNN, NL+ and PL+ losses, dedicated to filtering noisy data from training data. Each is developed from NL and PL loss functions, respectively. Firstly, our paper focuses on analyzing the NL loss function to understand the cause for underfitting. Then we develop a new loss function NL+ that resolves the issue, which produces a gradient appropriate for convergence on a noisy training dataset. Our study demonstrates the effectiveness of NL+, showing improved convergence across various label noise types and noise rates. Secondly, while we utilize PL to aid in training with noisy data, PL+ loss function is also newly designed to enable faster training with expected-to-be-clean data. Our paper shows the effectiveness of the PL+ loss function compared to the previous PL loss function. Finally, as both loss functions of our method (NL+ and PL+) jointly train the model through a single stage, it is simple and easier to use than NLNL. Our experiments show that JNPL successfully filters noisy data in a single stage, thereby providing significantly faster training of CNN as well as better filtering compared to NLNL.
After filtering noisy data from the training data, we perform pseudo-labeling for noisy data classification. We achieve state-of-the-art accuracy across various settings on the CIFAR10, CIFAR100 [cifar-10], and Clothing1M [xiao2015learning] datasets, demonstrating the superior filtering ability of JNPL.
The main contributions of this paper are as follows:
We propose an improved version of NLNL, named “Joint Negative and Positive Learning (JNPL),” featuring a single-stage pipeline for filtering noisy data, therefore enabling easier usage compared to NLNL.
We design two novel loss functions, the NL+ loss and the PL+ loss. NL+ solves the underfitting problem of the NL loss and provides better convergence across various types and ratios of label noise in the training data. Moreover, PL+ enables faster training compared to the previous PL loss function.
Our method filters noisy data more robustly than NLNL across different types and ratios of noise. It also achieves state-of-the-art noisy data classification results when used along with pseudo-labeling.
Our method does not require prior knowledge of the type or amount of noisy data. It does not require any hyper-parameter tuning that depends on such prior knowledge, which makes it applicable in practice.
The remainder of this paper is organized as follows. Section 3 describes in depth the NLNL method, which our work builds on throughout the paper, and discusses the cause of its underfitting problem. Section 4 describes our proposed method, JNPL, and explains the NL+ and PL+ loss terms in detail. Section 5 presents the overall comparison between JNPL and NLNL, showing the distinct advantages of JNPL over NLNL. Section 6 discusses the evaluation of our method in comparison to baseline methods. Finally, we summarize and conclude in Section 7.
2 Related works
Several methods that aim to mitigate label noise have been proposed. Here, we summarize some of the recent approaches to noise-robust learning.
Designing noise-robust loss The commonly used cross-entropy (CE) loss is known to be prone to overfitting when there is noise in the labels. Therefore, a family of studies aims to design novel loss functions that are tolerant of label noise. Ghosh [ghosh2015making, ghosh2017robust] showed that the mean absolute error (MAE) loss is theoretically robust against label noise. Zhang [zhang2018generalized] proposed the Generalized Cross Entropy loss, a generalized function that interpolates between the forms of CE and MAE and can therefore adjust the trade-off between a robust loss and a non-robust loss.
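As a rough illustration of this interpolation, the sketch below uses the commonly cited form of the Generalized Cross Entropy loss, (1 - p_y^q) / q, where p_y is the predicted probability of the labeled class. The function names and the test value of p_y are illustrative assumptions, not taken from this paper.

```python
import math


def gce_loss(p_y: float, q: float) -> float:
    """Generalized Cross Entropy on the labeled-class probability p_y.

    Assumed form: (1 - p_y**q) / q with 0 < q <= 1.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return (1.0 - p_y ** q) / q


def ce_loss(p_y: float) -> float:
    """Cross-entropy on the labeled-class probability."""
    return -math.log(p_y)


def mae_like_loss(p_y: float) -> float:
    """MAE between a one-hot target and the prediction, up to a constant factor."""
    return 1.0 - p_y


p_y = 0.3
# As q approaches 0 the loss approaches CE; at q = 1 it equals the MAE-like term.
print(gce_loss(p_y, 1e-6), ce_loss(p_y))
print(gce_loss(p_y, 1.0), mae_like_loss(p_y))
```

Intermediate values of q trade the fast convergence of CE against the noise tolerance of MAE.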
However, in many cases, such noise-robust losses carry the problem of underfitting, which motivates the combination of a robust loss with a non-robust loss to improve convergence. Wang [wang2019symmetric] proposed Symmetric Cross Entropy loss, which combines CE loss with Reverse Cross Entropy loss. Recently, Ma [ma2020normalized] proposed a loss normalization technique that transforms a non-robust loss function into a robust loss function. They also showed that such normalized loss used in combination with another robust loss function improves convergence and coined the term Active Passive Loss (APL).
Weighting samples In some studies, each sample in the training set is weighted by the reliability of its label [jiang2017mentornet, ren2018learning, lee2017cleannet]. Moreover, other methods proposed meta-learning algorithms that predict the weight of each sample [jiang2017mentornet, ren2018learning]. However, these methods require a clean validation set, which is often difficult to guarantee in practice.
Correction methods Other studies used correction methods [patrini2016making, vahdat2017toward, hendrycks2018using, xiao2015learning, veit2017learning, li2017learning]. They assume that prior knowledge such as the noise rate or the noise transition matrix is known, or that some clean data is accessible. In practice, however, prior knowledge and clean data are usually hard to obtain. Some other works used a CNN with an additional layer [sukhbaatar2014training, jindal2016learning, goldberger2016training], in which the noise transition matrix is approximated to correct the loss. Many efforts gradually change the data label to the prediction value of the network [reed2014training, tanaka2018joint, ma2018dimensionality, yi2019probabilistic]. Arazo [arazo2019unsupervised] fits a mixture of beta distributions that models the loss of clean and noisy samples during training.
Selecting clean labels Some attempted to identify clean labels from a noisy dataset [han2018co, ding2018semi, northcutt2017learning]. Ding [ding2018semi] proposed a selection of clean examples based on predicted likelihoods. The labels of the remaining samples are discarded, and the network is trained by semi-supervision. Some of the successful approaches train two deep neural networks simultaneously and let them teach each other [han2018co, yu2019does, wei2020combating]. Each network selects possibly clean data and trains the other network with this data.
Use of complementary labels Kim [kim2019nlnl] proposed a noise-robust learning method where instead of maximizing the log-likelihood on the target position, it minimizes the log-likelihood on the complementary positions, termed Negative Learning (NL). They employ a three-stage pipeline based on NL that separates the clean data from the noisy data. Finally, the network is trained using standard CE loss with semi-supervision by treating the noisy set as unlabeled.
Other approaches Li [li2019learning] uses meta-learning to obtain weights that can be easily fine-tuned to a given noisy dataset. Zhang [zhang2019metacleaner] proposed to learn a confidence score for each sample from the relationship between noisy samples in the feature space, and then to use the confidence scores to generate cleaner representations. Harutyunyan [harutyunyan2020improving] proposed a training algorithm based on the mutual information between weights and labels to regularize the memorization of labels.
3 Negative Learning for Noisy Labels (NLNL)
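Throughout this paper, we consider the problem of $c$-class classification. Let $x \in \mathcal{X}$ be an input, $y, \bar{y} \in \mathcal{Y} = \{1, \ldots, c\}$ be its label and complementary label, respectively, and $\mathbf{y}, \bar{\mathbf{y}} \in \{0, 1\}^c$ be their one-hot vectors. Suppose the CNN $f(x; \theta)$ maps the input space to the $c$-dimensional score space, $f: \mathcal{X} \rightarrow \mathbb{R}^c$, where $\theta$ is the set of network parameters. If $f$ passes through the softmax function, the output can be interpreted as a probability vector $\mathbf{p} \in \Delta^{c-1}$, where $\Delta^{c-1}$ denotes the $c$-dimensional simplex.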
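NL [kim2019nlnl] is an indirect learning method for training CNNs with noisy data. Instead of using the given labels, it chooses a random complementary label and trains the CNN as in “input image does not belong to this complementary label.” The loss function following this definition is given below (Eq 2), along with the classic PL loss function (Eq 1) for comparison:

$$\mathcal{L}_{PL}(f, y) = -\sum_{k=1}^{c} \mathbf{y}_k \log \mathbf{p}_k \qquad (1)$$

$$\mathcal{L}_{NL}(f, \bar{y}) = -\sum_{k=1}^{c} \bar{\mathbf{y}}_k \log (1 - \mathbf{p}_k) \qquad (2)$$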
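As a minimal sketch of these two losses for a single sample, assuming the cross-entropy form for PL and the form -log(1 - p) on the complementary class for NL, the following pure-Python code may help. The helper names (`softmax`, `pl_loss`, `nl_loss`, `sample_complementary_label`) are illustrative and do not come from the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pl_loss(probs, label):
    """Positive Learning (cross-entropy): 'the input belongs to this label'."""
    return -math.log(probs[label])


def nl_loss(probs, comp_label):
    """Negative Learning: 'the input does not belong to this complementary label'."""
    return -math.log(1.0 - probs[comp_label])


def sample_complementary_label(label, num_classes, rng=random):
    """Draw a complementary label uniformly from all classes except `label`."""
    k = rng.randrange(num_classes - 1)
    return k if k < label else k + 1


probs = softmax([2.0, 0.5, -1.0])
comp = sample_complementary_label(label=0, num_classes=3)
print(pl_loss(probs, 0), nl_loss(probs, comp))
```

With c classes and a wrong given label, a uniformly drawn complementary label coincides with the true class only with probability 1 / (c - 1), which is why NL feeds the network wrong information less often than PL.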
To improve convergence after NL, SelNL is performed as a subsequent step. SelNL trains the CNNs only with the data having confidence over $1/c$. Since the data involved in training tend to be less noisy than before, CNNs converge better after SelNL. Furthermore, PL is considered a faster and more accurate method than NL, but only if the training data is assumed to be clean. After training with NL and SelNL, SelPL trains the CNNs only with data whose confidence is above a threshold $\gamma$, assuming that such data are clean. After filtering noisy data with these three steps (NL → SelNL → SelPL), semi-supervised learning (pseudo-labeling [lee2013pseudo]) is performed using the labeled expected-to-be-clean data and the unlabeled noisy data.
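The two selection rules can be sketched as simple masks over the per-sample confidence of the given label. This is only an illustration. It assumes the SelNL threshold is the uniform probability 1/c and the SelPL threshold is 0.5, and the function names are hypothetical.

```python
def selnl_mask(label_probs, num_classes):
    """Keep samples whose confidence on the given label exceeds 1/c (assumed SelNL rule)."""
    threshold = 1.0 / num_classes
    return [p > threshold for p in label_probs]


def selpl_mask(label_probs, gamma=0.5):
    """Keep samples whose confidence on the given label exceeds gamma (assumed SelPL rule)."""
    return [p > gamma for p in label_probs]


# Confidence of the given label for four samples in a 10-class problem.
label_probs = [0.05, 0.2, 0.6, 0.95]
print(selnl_mask(label_probs, num_classes=10))
print(selpl_mask(label_probs))
```

Samples excluded by a mask are simply skipped when the corresponding loss is accumulated for that stage.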
As mentioned in Section 1, the fundamental problem of underfitting with NL still remains. To analyze the root of this phenomenon, we observe the gradient resulting from the NL loss function (Eq 2) as follows:

$$\frac{\partial \mathcal{L}_{NL}(f, \bar{y})}{\partial f_i} = \begin{cases} \mathbf{p}_i & \text{if } i = \bar{y} \\ -\dfrac{\mathbf{p}_{\bar{y}}}{1 - \mathbf{p}_{\bar{y}}}\, \mathbf{p}_i & \text{if } i \neq \bar{y} \end{cases} \qquad (3)$$

Eq 3 states that every class $i$ other than $\bar{y}$ receives a gradient of $-\frac{\mathbf{p}_{\bar{y}}}{1 - \mathbf{p}_{\bar{y}}} \mathbf{p}_i$. Figure 2 (a) shows a 2D map of this gradient, and Figure 2 (b)-(d) show the distribution of training data after NL under diverse noise ratios. Each training sample is placed on the gradient map according to its $\mathbf{p}_i$ (when $i = y$) and $\mathbf{p}_{\bar{y}}$. As training with NL progresses, clean data tend to have high $\mathbf{p}_y$ and low $\mathbf{p}_{\bar{y}}$ (lower-right region in Figure 2 (a)), while noisy data tend to have low $\mathbf{p}_y$ and high $\mathbf{p}_{\bar{y}}$ (upper-left region in Figure 2 (a)). However, for noisy data, the ground-truth label may be chosen as $\bar{y}$. In this case, all classes except for the ground-truth label receive a high gradient because of the high $\mathbf{p}_{\bar{y}}$, which results in underfitting of that data as the confidence of classes other than the ground-truth label increases. In Section 4.1, we describe the improved NL loss function (NL+) that resolves this underfitting issue.
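The gradient behaviour discussed above can be checked numerically. The sketch below assumes the NL loss is -log(1 - p) on the complementary class, taken over softmax outputs, and compares the closed-form gradient with respect to the scores against central finite differences. All function names and the example scores are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nl_loss_from_scores(scores, comp_label):
    """NL loss for one sample, computed from raw scores (assumed form)."""
    return -math.log(1.0 - softmax(scores)[comp_label])


def nl_grad_analytic(scores, comp_label):
    """Closed-form gradient of the NL loss with respect to each score."""
    p = softmax(scores)
    pc = p[comp_label]
    return [pc if i == comp_label else -pc * p[i] / (1.0 - pc)
            for i in range(len(scores))]


def nl_grad_numeric(scores, comp_label, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for i in range(len(scores)):
        up = list(scores)
        down = list(scores)
        up[i] += eps
        down[i] -= eps
        diff = nl_loss_from_scores(up, comp_label) - nl_loss_from_scores(down, comp_label)
        grads.append(diff / (2.0 * eps))
    return grads


# The complementary label (index 3) has the highest score, mimicking a noisy
# sample whose complementary label happens to be the true class.
scores = [0.2, 1.5, -0.7, 3.0]
print(nl_grad_analytic(scores, 3))
print(nl_grad_numeric(scores, 3))
```

When the probability of the complementary class is high, the factor p / (1 - p) becomes large, so every other class is pushed up strongly. This is the situation described above for noisy data.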
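Figure 2: (a) Gradient map of the NL loss. (b)-(d) Distribution of training data after NL under different noise ratios. (e) Gradient map of the NL+ loss. (f)-(h) Distribution of training data after NL+ under different noise ratios.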
4 Joint Negative and Positive Learning (JNPL)
The loss function of the proposed method, JNPL, is composed of two loss functions:
Each is dedicated to filtering noisy data from the training data. The NL+ loss is the advanced version of NL, which resolves the underfitting issue. The PL+ loss is the other newly designed loss, for PL, which trains on expected-to-be-clean data and strengthens training on data of higher confidence. A scaling factor is applied to the overall magnitude of PL+ so that it does not overwhelm the magnitude of NL+, and this factor is kept fixed throughout the whole paper. These two losses enable successful filtering of noisy data. Finally, noisy data classification is done in a semi-supervised manner, using the confidence on the filtered noisy data as pseudo-labels. In the following sections, we introduce each loss function and describe its concept and implementation.
As discussed in Section 3, we argue that the underfitting problem of NL is caused by the nature of its gradient (Figure 2 (a)). This is more pronounced as the noise rate increases, as shown in Figure 2 (b)-(d). The problem occurs when noisy data receives a high gradient at classes other than $\bar{y}$ while the confidence of $\bar{y}$ is high, in which case $\bar{y}$ is most likely the ground-truth label. To solve this issue, we propose a modification to the NL loss function, named NL+ loss, as follows:
It should be noted that the weighting term in NL+ acts as a constant weighting factor. Intuitively, this factor decreases the loss for noisy data when the corresponding $\mathbf{p}_{\bar{y}}$ is high, that is, when $\bar{y}$ is most likely the ground-truth label. In this way, it reduces the risk of pressing down the confidence of the ground-truth label for noisy data, and thus the risk of underfitting. This is further analyzed by observing the gradient of NL+, given by Eq 5:
The gradient map of NL+ is shown in Figure 2 (e). Compared to Figure 2 (a), the gradient in the upper-left region is reduced. This implies that, as training progresses with NL+, noisy data gathers in the upper-left region. With NL+, the gradient received by noisy data with high $\mathbf{p}_{\bar{y}}$ is reduced, allowing noisy data to maintain a high $\mathbf{p}_{\bar{y}}$ value, where $\bar{y}$ is most likely the ground-truth label. Figure 2 (f)-(h) show the distribution of training data mixed with diverse ratios of noise. Compared to Figure 2 (b)-(d), NL+ results in better convergence. Especially at a high noise ratio (Figure 2 (d), (h)), NL+ successfully separates noisy data from the rest of the training data, sending noisy data to the upper-left region.
Figure 3: (a), (b): Cases for selecting data for PL+. A data point is a candidate for PL+ if the confidences at classes other than the given label (a) or the label of maximum probability (b) are under the uniform distribution ($1/c$). (c): Gradient of PL+ for different values of its hyper-parameter, compared to PL (cross-entropy loss). (d): Accuracy comparison between PL+ with different values of its hyper-parameter. This shows that the flatter version of PL+ generates better training results.
In this section, we introduce the second loss function in JNPL. As mentioned in Section 1, when training data is verified to have clean labels, PL is a faster and more accurate method than NL. Following this fact, we apply PL+ in our method for faster convergence. Unlike in NLNL, however, it is applied not as a sequential step but as part of a unified step.
First of all, a criterion for selecting the training data for PL+ is required. Previously, NLNL applied PL to data whose confidence exceeds the threshold $\gamma$. However, the criterion for selecting data for PL should be stricter. Even if a data point exceeds this threshold, the probability of another class may reach as much as 0.5, resulting in the risk of selecting noisy data as clean data. Hence, PL+ considers the probabilities of classes other than the given label. When the probabilities of all classes other than the given label are under the uniform distribution ($1/c$), the data point is a candidate for PL+ (Figure 3 (a)). Additionally, each candidate is selected through Bernoulli sampling with respect to its confidence $\mathbf{p}_y$. The higher $\mathbf{p}_y$ is, the more frequently the data point is trained with PL+. Furthermore, PL+ selects data not only from expected-to-be-clean data but also from noisy data. That is, when the probabilities of all classes other than the label of maximum probability are under the uniform distribution, the data point is also a candidate for PL+, using the maximum-probability class as its label (Figure 3 (b)). In this way, PL+ selects data for training more strictly, while the candidate area is also increased. The pseudocode for the PL+ process is shown in Algorithm 1.
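The selection rule described in this paragraph can be sketched for a single sample as follows. This is an illustrative reading of the text, not a transcription of Algorithm 1. The function name `select_for_pl_plus` is hypothetical, and the sketch assumes the Bernoulli parameter is the probability of the class used as the training target.

```python
import random


def select_for_pl_plus(probs, given_label, rng=random):
    """Return the label to train with PL+, or None if the sample is skipped.

    Assumed rule: a class qualifies as target when every other class has
    probability below 1/c. The given label is tried first, then the class of
    maximum probability.
    """
    num_classes = len(probs)
    uniform = 1.0 / num_classes

    def others_below_uniform(k):
        return all(p < uniform for j, p in enumerate(probs) if j != k)

    if others_below_uniform(given_label):
        target = given_label                  # case (a): given label is trusted
    else:
        top = max(range(num_classes), key=probs.__getitem__)
        if not others_below_uniform(top):
            return None                       # not a candidate at all
        target = top                          # case (b): use the most probable class

    # Assumption: Bernoulli sampling with the target-class probability as parameter.
    return target if rng.random() < probs[target] else None


probs = [0.9, 0.04, 0.03, 0.03]
print(select_for_pl_plus(probs, given_label=0))
print(select_for_pl_plus(probs, given_label=2))
```

Under this reading, a sample whose given label disagrees with a very confident prediction can still contribute to PL+, but with the predicted class as its target.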
PL is usually done using the cross-entropy (CE) loss (Eq 1). However, while CE may be tolerable when training on clean data, it is less so when training on noisy data. The reason for using PL in our method is to train faster on more confident data. However, the gradient of CE in Figure 3 (c) shows that a smaller gradient is provided to more confident data, while a higher gradient is provided to less confident data. Since the goal is to train faster on more confident data, not to train more on less confident data, we propose the PL+ loss function to resolve this issue as follows:
and the gradient of PL+ loss is as follows:
Similar to NL+, the weighting term in PL+ acts as a constant weighting factor. By applying this weight factor, the gradient of the PL+ loss function is modified as shown in Eq 8 and visualized in Figure 3 (c). It can be seen that a higher gradient is provided to data of high $\mathbf{p}_y$ as the hyper-parameter of PL+ increases. Figure 3 (d) shows faster convergence as this hyper-parameter increases. Its value is kept fixed throughout the whole paper.
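The observation about plain cross-entropy that motivates PL+ can be reproduced with a few lines. The sketch assumes softmax outputs, for which the derivative of -log p_y with respect to the score of the labeled class is p_y - 1. The function name is illustrative.

```python
def ce_grad_label_score(p_y: float) -> float:
    """Derivative of -log(p_y) with respect to the labeled-class score under softmax."""
    return p_y - 1.0


# The gradient magnitude shrinks as the labeled-class confidence grows.
for p_y in (0.1, 0.5, 0.9, 0.99):
    print(f"p_y={p_y:.2f}  |gradient|={abs(ce_grad_label_score(p_y)):.2f}")
```

Confident samples therefore contribute little under CE, which is the opposite of what is wanted when the confident samples are the ones expected to be clean.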
Since our method is the advanced version of NLNL, which is targeted throughout our whole paper, this section further demonstrates the distinct advantage of our method JNPL over NLNL.
First of all, our method JNPL is a unified single-step pipeline for filtering noisy data, compared to the 3-step pipeline of NLNL. JNPL is trained with two loss functions simultaneously, increasing the efficiency of training the CNN. Figure 4 shows the performance comparison between NLNL (NL → SelNL → SelPL), NL+, and JNPL (NL+ & PL+) when training with CIFAR10 mixed with 60% symm noise. Figure 4 clearly indicates that NL+ alone reaches the accuracy of NL → SelNL, demonstrating the better convergence of NL+ compared to NL. Furthermore, when PL+ is performed simultaneously with NL+, training becomes faster without the need for an additional subsequent step. The figure also shows that the overall accuracy of NL+ and JNPL surpasses the accuracy reached by NLNL while preventing overfitting to noisy data, demonstrating the superiority of our method over NLNL.
Secondly, NL+ can handle more diverse noise types than NL → SelNL, owing to the nature of the gradient produced by the NL+ loss. Although NL is followed by SelNL to compensate for the underfitting problem, we show that this is not an optimal solution for all types of noise. Consider training on CIFAR10 mixed with asymm noise, especially when class “dog” is mixed with “cat” in a bidirectional manner (DOG ↔ CAT). The overall probability values are shared between the classes “dog” and “cat,” resulting in the distribution of training data shown in Figure 6 (a), (d). In this case, SelNL has almost no effect because the noisy data is not under the uniform distribution (Figure 6 (b), (e)). For NL+, in contrast, the gradient in the affected region is reduced smoothly compared to NL, which eventually enables both classes to be separated, showing a distinct advantage of NL+ over SelNL (Figure 6 (c), (f)).
Finally, we show that our method JNPL filters noisy data from the training data more successfully than NLNL. Figure 5 compares the overall filtering ability of NLNL and JNPL in terms of average precision (AP). The comparison covers diverse settings: CIFAR10 and CIFAR100 mixed with different ratios of symm and asymm noise. Overall, our method outperforms NLNL in filtering noisy data. Furthermore, the gap in AP between NLNL and JNPL increases as the noise ratio increases, which implies that JNPL is more robust to the amount of noise mixed into the training data. JNPL being more robust to asymm noise than NLNL also supports the point made above. This is shown more clearly on the more difficult dataset, CIFAR100. The AP of NLNL decreases drastically as the noise rate gets higher, whereas JNPL remains robust across types and ratios of noise, similar to when training with CIFAR10. Figure 5 thus demonstrates that JNPL generalizes across the type and ratio of noise, and even across the number of classes in the dataset.
In this section, we describe the experiments performed to evaluate our method. Pseudo-labeling is done on a training dataset filtered by JNPL for noisy data classification and resulting accuracies are compared to those of other existing methods. We verify our method by comparing with other recent baseline methods, varying experimental settings in terms of dataset and type and ratio of noise in the training data.
6.1 Experiment settings
Baseline methods We compare our method against CE, along with recent state-of-the-art approaches including Co-teaching [han2018co], JoCoR [wei2020combating], APL [ma2020normalized], and NLNL [kim2019nlnl].
Dataset We conduct the experiments on CIFAR10 and CIFAR100 [cifar-10] mixed with two types of noise (symm, asymm), and on the Clothing1M [xiao2015learning] dataset. Clothing1M is a large-scale real-world dataset with noisy labels, containing 1 million images of clothing obtained from several online shopping websites. It is reported that the overall accuracy of the noisy labels in this dataset is 61.54%, and some pairs of classes are often confused with each other (e.g., Knitwear and Sweater). For preprocessing, we performed mean subtraction, horizontal flip, and random crops for CIFAR10 and CIFAR100. For Clothing1M, we resize the image to 256×256, crop 224×224 at the center, and perform mean subtraction and horizontal flip.
Label noise types We generated noisy CIFAR10 and CIFAR100 datasets according to the following procedures. In symmetric (symm) noise experiments, we flipped a portion of the labels by re-sampling each label uniformly from the remaining classes, excluding the ground-truth class. In asymmetric (asymm) noise experiments, we followed the same label transition rule used by Patrini [patrini2017making]. For CIFAR10, we mapped TRUCK → AUTOMOBILE, BIRD → PLANE, DEER → HORSE, and CAT ↔ DOG. For CIFAR100, the noise flipped each class into the next, circularly within super-classes.
For each noise type, we compared the methods under the symmetric noise rates of and asymmetric noise rates of .
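The two noise-generation procedures for CIFAR10 can be sketched as below. This is an illustration under stated assumptions. The class indices follow the standard CIFAR10 ordering, symmetric noise flips an exact fraction of randomly chosen samples, and asymmetric noise flips each sample of a source class independently with the given rate. The function names are hypothetical.

```python
import random

# Standard CIFAR10 class order: 0 airplane, 1 automobile, 2 bird, 3 cat,
# 4 deer, 5 dog, 6 frog, 7 horse, 8 ship, 9 truck.
CIFAR10_ASYMM = {9: 1, 2: 0, 4: 7, 3: 5, 5: 3}


def symmetric_noise(labels, num_classes, rate, rng):
    """Flip a fraction `rate` of labels uniformly to one of the other classes."""
    noisy = list(labels)
    n_flip = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        k = rng.randrange(num_classes - 1)
        noisy[i] = k if k < labels[i] else k + 1   # never the original class
    return noisy


def asymmetric_noise(labels, transition, rate, rng):
    """Flip labels of source classes to their mapped class with probability `rate`."""
    noisy = list(labels)
    for i, y in enumerate(labels):
        if y in transition and rng.random() < rate:
            noisy[i] = transition[y]
    return noisy


rng = random.Random(0)
clean = [i % 10 for i in range(1000)]
symm = symmetric_noise(clean, num_classes=10, rate=0.4, rng=rng)
asymm = asymmetric_noise(clean, CIFAR10_ASYMM, rate=0.4, rng=rng)
print(sum(a != b for a, b in zip(clean, symm)))
print(sum(a != b for a, b in zip(clean, asymm)))
```

Because symmetric noise excludes the ground-truth class when re-sampling, the nominal rate equals the actual fraction of wrong labels in this sketch.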
Models For CIFAR10 and CIFAR100 experiments, we used ResNet34. For Clothing1M, we used ResNet50 [he2016deep], pre-trained on ImageNet.
We used stochastic gradient descent (SGD) with momentum of 0.9, weight decay of
. For experiments with CIFAR10 and CIFAR100, batch size is set to 128. Moreover, JNPL trains CNN for 1000 epochs with initial learning rate of, and decay by a factor of 10 at 800 epochs. For pseudo labeling, initial learning rate is 0.1, decayed by a factor of 10 at 192, 288 epochs (480 epochs total). For experiments with Clothing1M, batch size is set to 64, and JNPL trains CNN for 40 epochs with initial learning rate of , and decay by a factor of 10 at 30 epochs. For pseudo labeling, initial learning rate is , decayed by a factor of 10 at 10 epochs (15 epochs total).
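The step schedules described here can be expressed with a small helper. The sketch below uses the pseudo-labeling schedule for CIFAR (initial learning rate 0.1, divided by 10 at epochs 192 and 288). The function name `step_lr` and the 0-based epoch convention are illustrative assumptions.

```python
def step_lr(epoch, initial_lr, milestones, factor=10.0):
    """Learning rate after dividing by `factor` at every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr / (factor ** drops)


# Pseudo-labeling schedule for CIFAR10/CIFAR100: 480 epochs in total.
for epoch in (0, 191, 192, 288, 479):
    print(epoch, step_lr(epoch, initial_lr=0.1, milestones=(192, 288)))
```

The same helper covers the other schedules in this section by changing the initial learning rate and the milestone list.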
For CIFAR100, we adopt the technique that NLNL proposed for generalizing to the number of classes in the training data: providing multiple complementary labels to each data point. We provide 110 complementary labels to each data point in order to match the training speed to that of training with CIFAR10 [kim2019nlnl].
Table 1 shows the results of our method and the baseline methods under various noise settings on the two datasets. Overall, our proposed method outperforms the other comparable baseline methods across noise types and ratios. The baseline methods achieve comparable results in the less noisy settings, but their performance decreases drastically as the noise ratio increases. This is even more visible on CIFAR100, which is the harder case for noisy data classification, and our method shows a distinct improvement there compared to all other methods. Section 5 showed that our method is robust to the amount of noise mixed into the training data, regardless of the noise type. Table 1 shows a similar result: the margin by which our method achieves the best accuracy widens as the noise rate gets higher. This is more pronounced on CIFAR100, where our method outperforms the others by as much as 6 to 7% for both symm and asymm noise. It is noteworthy that our method achieved an accuracy 7% higher than the previous state of the art in the most difficult setting in Table 1, the 100-class dataset mixed with 40% asymm noise. It is widely known that training generally becomes more challenging as the number of classes in the dataset increases. Furthermore, compared to symm noise, asymm noise more closely replicates the label noise that arises in real life. Achieving such high accuracy in this setting implies that our method generalizes better than the other baseline methods to the training data and to the various types and ratios of noise mixed within it.
Co-teaching and JoCoR [han2018co, wei2020combating] outperform our method in some cases. However, it should be noted that they assume prior knowledge of important statistics about the dataset, such as the amount of noise. In reality, this assumption often makes such methods impractical because the ratio of noise mixed into the training data is likely to be unknown. In contrast, our method does not assume any such prior knowledge and therefore does not require extensive tuning of hyper-parameters.
To demonstrate the generalization of our method JNPL to real-world noisy data, we conduct an experiment on the Clothing1M dataset (Table 2). For comparison, we include recent baseline methods that reported experiments on Clothing1M. Our method achieves comparable performance, outperforming other recent baseline methods. This result supports that JNPL can generalize to real-world training data mixed with various types and ratios of noise.
We propose Joint Negative and Positive Learning (JNPL), the next version of NLNL, which is a novel single-step pipeline for filtering noisy training data. Compared to the 3-step pipeline of NLNL, our method trains the CNN with two loss functions (NL+ and PL+) in one step. They are developed from the previous NL and PL loss functions to enhance convergence and training speed, resulting in better filtering performance than NLNL. We demonstrated that JNPL is stable and robust across various types and ratios of noise mixed into the training data. Our method achieves state-of-the-art performance in noisy data classification by applying pseudo-labeling to our filtered training data, demonstrating its excellent filtering ability without relying on any prior knowledge.