1 Introduction
Learning from industrial-level data is quite demanding, since the labels of such data are heavily noisy and their label-generation processes are usually agnostic [1, 2]. Essentially, noisy labels of such data are corrupted from ground-truth labels without any prior assumption (e.g., the class-conditional noise assumption [3]), which degrades the robustness of learning models. Note that industrial-level data emerges frequently in our daily life, such as social-network data [4], e-commerce data [1] and crowdsourcing data [5, 6].
Due to the large data volume, industrial-level data can be well handled by deep neural networks [1]. Thus, the key issue is how to train deep neural networks robustly on the noisy labels of such data, since deep neural networks have the capacity to eventually fit noisy labels [7].

To handle noisy labels, one common direction focuses on estimating the noise transition matrix [8, 9]. For instance, [10] first leveraged a two-step solution to estimate the noise transition matrix. Based on this estimated matrix, they conducted backward loss correction, which trains deep neural networks robustly. However, the noise transition matrix is not easy to estimate accurately, especially when the noise ratio is high and the number of classes is large.

Motivated by the memorization effect of deep neural networks [11], an emerging direction focuses on training only on selected instances [2, 12, 13], which does not require any prior assumption on noisy labels. Specifically, deep learning models are known to learn easy instances first, and then gradually adapt to hard instances as training epochs increase [11]. Therefore, in the first few iterations, we may train deep neural networks on the whole dataset and let them sufficiently learn the clean instances in the noisy dataset. Then, we may conduct early stopping [14], which tries to stop the training on noisy instances; or we may employ the small-loss trick [2, 13], which tries to perform the training selectively on clean (small-loss) instances.

However, when noisy labels indeed exist, no matter whether early stopping or the small-loss trick is used, deep learning models inevitably memorize some noisy labels [7], which leads to poor generalization performance. In this paper, we design a meta algorithm called Pumpout, which allows us to overcome the issue of memorizing noisy labels. The main idea of Pumpout is to actively squeeze out the negative effects of noisy labels from the training model, instead of passively forgetting these effects by further training. Specifically, on clean labels, Pumpout conducts stochastic gradient descent as usual; while on noisy labels, Pumpout conducts scaled stochastic gradient ascent, instead of stopping the gradient computation. This aggressive policy can erase the negative effects of noisy labels actively and effectively.

We leverage Pumpout to upgrade two representative but orthogonal approaches in the area of "deep learning with noisy labels": MentorNet [2] and Backward Correction [10, 15]. We conducted experiments on simulated noisy MNIST and CIFAR-10 datasets. Empirical results demonstrate that, under both extremely noisy labels (i.e., 45% and 50% noisy labels) and low-level noisy labels (i.e., 20% noisy labels), the robustness of the two upgraded approaches is clearly superior to that of the original approaches.
2 Pumpout Meets Noisy Supervision
Meta algorithm.
The original idea of Pumpout is to actively squeeze out the negative effects of noisy labels from the training model, instead of passively forgetting these effects. However, in the design of the meta algorithm, we should consider how it can simultaneously benefit multiple orthogonal approaches, such as training on selected instances [2], estimating the noise transition matrix [10] and designing regularization [16].
For this purpose, we generalize noisy labels into "not-fitting" labels, and generalize clean labels into "fitting" labels (details of the fitting condition will be discussed in Q1 below). At a high level, the meta algorithm Pumpout trains deep neural networks by stochastic gradient descent on "fitting" labels, and by scaled stochastic gradient ascent on "not-fitting" labels.
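As a concrete illustration, the following minimal PyTorch sketch (our own, not the paper's released code) implements one Pumpout update; the fitting predicate `is_fitting` and the scale `gamma` are placeholders for the approach-specific choices discussed below:

```python
import torch

def pumpout_step(model, loss_fn, optimizer, x, y, is_fitting, gamma=0.1):
    """One Pumpout update: stochastic gradient descent on "fitting" labels,
    scaled (gamma) stochastic gradient ascent on "not-fitting" labels."""
    optimizer.zero_grad()
    losses = loss_fn(model(x), y)            # per-example losses, shape (B,)
    mask = is_fitting(losses)                # boolean mask: True = "fitting"
    # Minimizing -gamma * loss is exactly scaled gradient ascent on the loss.
    signed = torch.where(mask, losses, -gamma * losses)
    signed.mean().backward()
    optimizer.step()
    return losses.detach()
```

For instance, `is_fitting` could mark the half of the mini-batch with the smallest losses; the concrete conditions are instantiated in Algorithms 2 and 3.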
At a low level, the proposed Algorithm 1 is named Pumpout. Specifically, we maintain a deep neural network with parameter θ. When a single data point is sequentially selected from the noisy set (step 3), we first check whether it satisfies the discriminative fitting condition. If yes, we conduct stochastic gradient descent as usual (step 4); otherwise, we conduct scaled (by γ) stochastic gradient ascent (step 5), which erases the negative effects of "not-fitting" labels. This abstract algorithm naturally raises three important questions.
Three important questions.

Q1. What is the fitting condition?

Q2. Why do we need gradient ascent on not-fitting data, in addition to gradient descent on fitting data?

Q3. Why do we need to scale the stochastic gradient ascent on not-fitting data?
To answer the first question, we need to emphasize that orthogonal approaches require different fitting conditions. Intuitively, if a single data point satisfies a discriminative fitting condition, it means that our training model regards this point as useful knowledge, and fitting on this point will benefit training a robust model. Conversely, if a single data point does not satisfy the discriminative fitting condition, it means that our training model regards this point as useless knowledge, and wants to erase its negative effects actively. To instantiate the fitting condition, we provide two concrete cases in Algorithm 2 and Algorithm 3, respectively.
The above answer motivates the second question: why can we not conduct only stochastic gradient descent on fitting data points (step 4)? In other words, can we remove the scaled stochastic gradient ascent (step 5) in Algorithm 1? In that case, our algorithm degenerates to training only on selected instances. However, once some of the selected instances are found to be false positives, our training model will fit them, and the negative effects will inevitably occur (i.e., degraded generalization). Instead of passively forgetting these negative effects, we hope to actively squeeze them out of the training model by using scaled stochastic gradient ascent (step 5).
Lastly, the third question closely connects with the second one: why do we need scaled rather than ordinary stochastic gradient ascent? The answer can be explained intuitively. Assume that we view stochastic gradient ascent as a correction to "not-fitting" labels, and view γ ∈ [0, 1] as a scale parameter. When γ = 1, Pumpout squeezes out the negative effects at the full rate; when γ = 0, Pumpout does not squeeze out any negative effects. Neither case is optimal. In the first case, the fast squeezing rate will negatively affect the convergence of our algorithm. In the second case, no squeezing will inevitably let deep neural networks memorize some "not-fitting" labels, which degrades their generalization.
3 Pumpout Benefits State-of-the-Art Algorithms
In this section, we apply the idea of Pumpout to MentorNet and Backward Correction as follows.
3.1 Upgraded MentorNet
Algorithm 2 represents the upgraded MentorNet using the Pumpout approach, where MentorNet uses the small-loss trick. Specifically, we maintain a deep neural network with parameter θ. When a mini-batch is formed (step 3), we first select the proportion of instances in this mini-batch that have small training losses (step 4); the number of selected instances is controlled by the ratio R(T), i.e., only R(T)% of the instances in the mini-batch are sampled. More importantly, we also select the proportion of instances in this mini-batch that have big training losses (step 5), again with the number of selected instances controlled via R(T). Then, we conduct stochastic gradient descent on the small-loss instances (step 6), while we conduct scaled stochastic gradient ascent on the big-loss instances (step 7), which actively erases their negative effects. The update of R(T) (step 8) follows [13], where it is discussed extensively.
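The small-loss/big-loss split of steps 4-5 can be sketched as follows (a minimal illustration of our own; the remember rate R(T) is passed in as a plain number):

```python
import torch

def split_by_loss(losses, remember_rate):
    """Return (small-loss indices, big-loss indices) for one mini-batch.
    Small-loss instances are treated as likely clean (gradient descent,
    step 6); big-loss instances as likely noisy (scaled ascent, step 7)."""
    num_small = int(remember_rate * losses.numel())
    order = torch.argsort(losses)            # ascending by training loss
    return order[:num_small], order[num_small:]
```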
Relations to MentorNet.
To handle noisy labels, an emerging direction focuses on training only on selected instances [2, 12, 13], which is free of estimating the noise transition matrix and also free of the class-conditional noise assumption. These works try to select clean instances out of the noisy ones, and then use them to update the network. Among those works, a representative method is MentorNet [2], which employs the small-loss trick. Specifically, MentorNet pre-trains an extra network, and then uses the extra network to select small-loss instances as clean instances to guide the training. However, the idea of MentorNet is similar to the self-training approach [17]; thus MentorNet inherits the same drawback of accumulated error caused by the sample-selection bias.
Note that, if we remove steps 5 and 7 in Algorithm 2, the algorithm reduces to the core version of MentorNet. This means our algorithm is essentially more aggressive than MentorNet: it conducts not only stochastic gradient descent on small-loss instances (like MentorNet), but also scaled stochastic gradient ascent on big-loss instances.
3.2 Upgraded Backward Correction
Algorithm 3 represents the upgraded Backward Correction using the Pumpout approach, where Backward Correction is defined in Theorem 1. If the model being trained is flexible (e.g., a deep neural network), Backward Correction will lead to negative risks [10], which subsequently yields an overfitting issue. To mitigate this issue, we maintain a deep neural network with parameter θ. When a single data point is sequentially selected from the current mini-batch (step 5), we first compute the temporary gradient at this point (step 6). If Backward Correction produces a positive risk at this point (see Theorem 1 for definitions), we accumulate the gradient by gradient descent (step 7); otherwise, we accumulate the gradient by scaled gradient ascent (step 8), which erases the negative effects of negative-risk instances. Lastly, we average the accumulated gradient (step 9) and update the parameter θ by stochastic optimization (step 10).
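Steps 6-8 can be sketched per mini-batch as follows (our illustration; it assumes cross-entropy as the base loss, and flips and scales the sign of negative-risk terms instead of dropping them):

```python
import torch

def pumpout_bc_loss(logits, noisy_labels, T_inv, gamma=0.1):
    """Backward-corrected loss with the Pumpout sign policy.
    T_inv is the inverse of the noise transition matrix T."""
    per_class = -torch.log_softmax(logits, dim=1)    # (B, C) loss per label
    corrected = per_class @ T_inv.T                  # backward correction
    ell = corrected.gather(1, noisy_labels.view(-1, 1)).squeeze(1)
    # Positive risk: keep the term (descent). Negative risk: minimize
    # -gamma * ell, i.e., scaled gradient ascent pushing the risk back up.
    signed = torch.where(ell >= 0, ell, -gamma * ell)
    return signed.mean()
```

With an identity transition matrix, this reduces to plain cross-entropy, since all per-example risks are then nonnegative.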
Relations to Backward Correction and its nonnegative version.
To handle noisy labels, the other popular direction focuses on estimating the noise transition matrix [8, 9, 10]. Among those works, a representative method is Backward Correction. Specifically, [10] leveraged a two-step solution to estimate the noise transition matrix heuristically. They then employed the estimated matrix to correct the original loss, and robustly trained a deep neural network based on the new loss function.
Theorem 1
(Backward Correction, Theorem 1 in [10]) Suppose that the noise transition matrix $T$ is nonsingular, where $T_{ij} = \Pr(\tilde{y} = e_j \mid y = e_i)$ given that noisy label $\tilde{y}$ is flipped from clean label $y$. Given loss $\ell$ and network parameter $\theta$, Backward Correction is defined as

$\ell^{\leftarrow}(f(x)) = T^{-1}\,\ell(f(x)),$  (1)

where $\ell(f(x))$ denotes the vector of losses over all class labels. Then, the corrected loss is unbiased, namely,

$\mathbb{E}_{\tilde{y}}\big[\ell^{\leftarrow}(f(x))\big] = \mathbb{E}_{y}\big[\ell(f(x))\big].$  (2)
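A quick numerical check of this unbiasedness (a toy example of our own, using a hypothetical 3-class pair-flipping matrix): averaging the backward-corrected loss over the noisy-label distribution of each clean class recovers the clean loss.

```python
import numpy as np

eps = 0.45
T = np.array([[1 - eps, eps,     0.0    ],
              [0.0,     1 - eps, eps    ],
              [eps,     0.0,     1 - eps]])   # T[i, j] = P(noisy j | clean i)

ell = np.array([0.2, 1.5, 2.3])      # losses of f(x) against each class label
ell_bc = np.linalg.inv(T) @ ell      # backward-corrected losses (Eq. 1)

# Unbiasedness (Eq. 2): the expectation of the corrected loss over noisy
# labels, given clean class i, equals the clean loss ell[i].
for i in range(3):
    assert np.isclose(T[i] @ ell_bc, ell[i])
```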
Remark 1
If the model being trained is flexible, such as a deep neural network, backward loss correction will lead to negative risks, whose hazard is to yield an overfitting issue. Motivated by [18], we should conduct a nonnegative correction based on the backward-corrected loss, because the risk should always be greater than or equal to zero.
Theorem 2
(Nonnegative Backward Correction) Suppose that the noise transition matrix $T$ is nonsingular, where $T_{ij} = \Pr(\tilde{y} = e_j \mid y = e_i)$ given that noisy label $\tilde{y}$ is flipped from clean label $y$. Given loss $\ell$ and network parameter $\theta$, Nonnegative Backward Correction is defined as

$\tilde{\ell}^{\leftarrow}(f(x)) = \max\big\{0,\; T^{-1}\,\ell(f(x))\big\},$  (3)

where $\max\{0, \cdot\}$ is taken element-wise. Then, the corrected loss is nonnegative, namely,

$\tilde{\ell}^{\leftarrow}(f(x)) \ge 0.$  (4)
Remark 2
The corrected loss in (3) is a nonnegative scalar for each instance. Our key claim is that this nonnegative correction overcomes the overfitting issue.
However, the above nonnegative correction is passive, since the $\max\{0, \cdot\}$ operator stops the gradient computation on negative-risk instances. This correction may not achieve the optimal performance: when the corrected risk is nonnegative, we conduct stochastic gradient descent; otherwise, we do not perform any stochastic gradient step. To propose an aggressive nonnegative correction, we reverse the gradient computation at negative-risk instances. Specifically, we use the Pumpout approach to improve Nonnegative Backward Correction: when the corrected risk is nonnegative, we conduct stochastic gradient descent; when it is negative, we conduct scaled stochastic gradient ascent. This yields our Algorithm 3.
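The difference between the two policies lies only in how a negative-risk term enters the loss; a minimal sketch in our notation:

```python
import torch

def nnbc_term(ell):
    # Passive: max{0, ell} -- the gradient on negative-risk instances is zero.
    return torch.clamp(ell, min=0.0)

def pumpout_term(ell, gamma=0.1):
    # Active: minimize -gamma * ell on negative-risk instances, i.e.,
    # scaled gradient ascent that squeezes the risk back up toward zero.
    return torch.where(ell >= 0, ell, -gamma * ell)
```

Descending on `pumpout_term` is therefore equivalent to ascending on the raw negative risk at rate gamma, while `nnbc_term` simply stops learning on those instances.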
Note that, if we remove line 8 in Algorithm 3, the algorithm reduces to Nonnegative Backward Correction. This means our algorithm is an aggressive version of Nonnegative Backward Correction: it conducts not only stochastic gradient descent on nonnegative-risk instances, but also scaled stochastic gradient ascent on negative-risk instances.
4 Experiments
Datasets.
We preliminarily verify the effectiveness of our Pumpout approach on two benchmark datasets, MNIST and CIFAR-10 (Table 1), as these datasets are popularly used for evaluating noisy labels in the literature [8, 10, 13, 19].
            # of training   # of testing   # of classes   image size
MNIST       60,000          10,000         10             28×28
CIFAR-10    50,000          10,000         10             32×32
Since all datasets are clean, following [19, 10], we need to corrupt them manually using the noise transition matrix $T$, where $T_{ij} = \Pr(\tilde{y} = j \mid y = i)$ given that the noisy label $\tilde{y}$ is flipped from the clean label $y$. We assume the matrix $T$ has two representative structures (Figure 1): (1) Pair flipping [9]: a real-world instance is fine-grained classification, where mistakes are made only within very similar classes in adjacent positions; (2) Symmetry flipping [20]. Their precise definitions are in Appendix 0.A.
This paper first verifies whether Pumpout can significantly improve the robustness of representative methods under extremely noisy supervision, where the noise rate ε is chosen from {0.45, 0.5}. Intuitively, this means almost half of the instances have noisy labels. Note that a noise rate above 50% for pair flipping would mean over half of the training data have wrong labels, which cannot be learned without additional assumptions. In addition to the extremely noisy settings, we also verify whether Pumpout can significantly improve the robustness of representative methods under low-level noisy supervision, where ε is set to 0.2. Note that the pair case is much harder than the symmetry case: in Figure 1(a), the true class has only 10% more correct instances than the wrong class, whereas the true class has far more correct instances than any single wrong class in Figure 1(b). Meanwhile, similarly to [19, 8, 2], we do not make any implicit assumption behind Pumpout.
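For concreteness, a small script of our own that corrupts clean labels with a Symmetry-20% transition matrix; the surviving fraction of clean labels is close to 0.80 by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, eps = 10, 0.2

# Symmetry flipping: keep the label w.p. 1 - eps, otherwise flip uniformly.
T = np.full((n_classes, n_classes), eps / (n_classes - 1))
np.fill_diagonal(T, 1 - eps)          # T[i, j] = P(noisy = j | clean = i)

clean = rng.integers(0, n_classes, size=20000)
noisy = np.array([rng.choice(n_classes, p=T[y]) for y in clean])
print((noisy == clean).mean())        # close to 0.80 by construction
```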
Baselines.
To verify the efficacy of Pumpout, we compare two orthogonal sets of approaches in deep learning with noisy labels. The first set (SET1) checks whether Pumpout can improve the robustness of MentorNet: (i) MentorNet [2]; (ii) the upgraded MentorNet (Algorithm 2). The second set (SET2) checks whether Pumpout can improve the robustness of Backward Correction: (i) Backward Correction [10] (denoted "BC", Theorem 1); (ii) Nonnegative Backward Correction (denoted "nnBC", Theorem 2); (iii) the upgraded Backward Correction (Algorithm 3). As a simple baseline, we also compare with a normal deep neural network that directly learns on the noisy training set (denoted "Normal").
For a fair comparison, we implement all methods with default parameters in PyTorch and conduct all experiments on an NVIDIA K80 GPU. We use standard networks with the LReLU activation function; the detailed architectures are in Appendix 0.B. Namely, we use the 9-layer CNN [16, 21] for MNIST and ResNet-32 [22] for CIFAR-10, since these network structures are standard test beds for weakly-supervised learning. For all datasets, we use the Adam optimizer (momentum = 0.9) with an initial learning rate of 0.001 and a batch size of 128, and train for 200 epochs. Besides, dropout and batch normalization are also used.
Experimental setup.
For SET1, the most important parameter of our upgraded MentorNet and of MentorNet is the ratio R(T). Here, we assume the noise level ε is known and set the drop rate accordingly. If ε is not known in advance, it can be inferred using validation sets [23]. The choices of R(T) and its schedule follow [13]. Note that R(T) only depends on the memorization effect of deep networks, not on any specific dataset. For SET2, the most important parameters of our upgraded Backward Correction and of nnBC are the scale γ and the tolerance β, respectively. Specifically, the degree of tolerance is controlled by β (β ≥ 0), and the scale of gradient ascent is controlled by γ (0 ≤ γ ≤ 1). The choices of β and γ follow [18].
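As an illustration of the schedule (our sketch, assuming the linear form used in [13]: keep all instances at first, then drop down to a 1 − τ fraction over the first T_k epochs):

```python
def remember_rate(epoch, tau, t_k=10):
    """R(T): fraction of small-loss instances kept at a given epoch.
    Starts at 1.0, decays linearly to 1 - tau by epoch t_k, then stays."""
    return 1.0 - tau * min(epoch / t_k, 1.0)

# With tau = 0.45 (Pair-45%), R(T) decays from 1.0 to 0.55 over 10 epochs.
```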
This paper provides two upgraded approaches to train deep neural networks robustly under noisy labels. Thus, our goal is to classify the clean instances as accurately as possible, and the measurement for both SET1 and SET2 is the test accuracy, i.e., test accuracy = (# of correct predictions) / (# of test instances). Besides, for SET1, we also use the label precision in each mini-batch, i.e., label precision = (# of clean labels) / (# of all selected labels). Specifically, we sample R(T)% of small-loss instances in each mini-batch, and then calculate the ratio of clean labels among them. Intuitively, higher label precision means fewer noisy instances in the mini-batch after sample selection, and an algorithm with higher label precision is also more robust to label noise. All experiments are repeated five times. In each figure, the error bar for the standard deviation is shown as a shaded region.
4.1 Results of the Upgraded MentorNet and MentorNet
MNIST.
In Figure 2, we show test accuracy (top) and label precision (bottom) vs. the number of epochs on the MNIST dataset. In all three plots, we can clearly see the memorization effect of networks: the test accuracy of Normal first reaches a very high level and then gradually decreases. Thus, a good robust training method should stop or alleviate this decrease. On this point, our upgraded MentorNet almost stops the decrease in the easier Symmetric-50% and Symmetric-20% cases. Meanwhile, compared to MentorNet, it alleviates the decrease in the hardest Pair-45% case. Thus, the upgraded MentorNet consistently achieves higher accuracy than MentorNet.
To explain this good performance, we plot the label precision (bottom). Compared to Normal, we can clearly see that both the upgraded MentorNet and MentorNet can successfully pick out clean instances. However, our approach achieves higher label precision not only in the easier Symmetric-50% and Symmetric-20% cases, but also in the hardest Pair-45% case. This shows our approach is better at finding clean instances, owing to the use of scaled stochastic gradient ascent.
CIFAR-10.
Figure 3 shows test accuracy and label precision vs. the number of epochs on the CIFAR-10 dataset. Again, on test accuracy, we can see that the upgraded MentorNet strongly stops the memorization effect of networks. More importantly, in the easier Symmetric-50% and Symmetric-20% cases, it improves steadily as training proceeds. On label precision, while Normal fails to find clean instances, both the upgraded MentorNet and MentorNet can do so. However, owing to the scaled stochastic gradient ascent, the upgraded version is stronger and finds more clean instances.
4.2 Results of the Upgraded Backward Correction and nnBC
MNIST.
Figure 4 shows test accuracy vs. the number of epochs on the MNIST dataset. In all three plots, we can see the memorization effect of networks: the test accuracy of Normal first reaches a very high level and then gradually decreases. However, our upgraded Backward Correction fully stops this decrease in the hardest Pair-45% case. Meanwhile, in the easier Symmetric-50% and Symmetric-20% cases, it improves steadily as training proceeds, though with some fluctuation. Moreover, it finally achieves higher accuracy than both BC and nnBC.
5 Conclusions
This paper presents a meta algorithm called Pumpout, which significantly improves the robustness of state-of-the-art methods under noisy labels. Our key idea is to actively squeeze out the negative effects of noisy labels from the training model, instead of passively forgetting these effects. Pumpout realizes this by training deep neural networks with stochastic gradient descent on "fitting" labels and with scaled stochastic gradient ascent on "not-fitting" labels. To demonstrate the efficacy of Pumpout, we design two upgraded versions based on MentorNet and Backward Correction, respectively. The experimental results show that both upgraded approaches train deep models more robustly than the originals. In the future, we can extend our work in the following aspects. First, we can leverage the Pumpout approach to train deep models under other weak supervisions, e.g., complementary labels [24]. Second, we should investigate theoretical guarantees for the Pumpout approach.
Acknowledgments.
MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study and JST CREST JPMJCR1403. IWT was supported by ARC FT130100746, DP180100106 and LP150100671. BH would like to thank the financial support from the Center for AIP, RIKEN.
References
 [1] Xiao, T. and Xia, T. and Yang, Y. and Huang, C. and Wang, X.: Learning from massive noisy labeled data for image classification. In: CVPR. (2015)
 [2] Jiang, L. and Zhou, Z. and Leung, T. and Li, L. and Fei-Fei, L.: MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML. (2018)
 [3] Natarajan, N. and Dhillon, I.S. and Ravikumar, P.K. and Tewari, A.: Learning with noisy labels. In: NIPS. (2013)
 [4] Cha, Y. and Cho, J.: Social-network analysis using topic models. In: SIGIR. (2012)
 [5] Welinder, P. and Branson, S. and Perona, P. and Belongie, S.: The multidimensional wisdom of crowds. In: NIPS. (2010)
 [6] Han, B. and Pan, Y. and Tsang, I.: Robust Plackett–Luce model for k-ary crowdsourced preferences. MLJ. (2018)
 [7] Zhang, C. and Bengio, S. and Hardt, M. and Recht, B. and Vinyals, O.: Understanding deep learning requires rethinking generalization. In: ICLR. (2017)
 [8] Goldberger, J. and Ben-Reuven, E.: Training deep neural-networks using a noise adaptation layer. In: ICLR. (2017)
 [9] Han, B. and Yao, J. and Niu, G. and Zhou, M. and Tsang, I. and Zhang, Y. and Sugiyama, M.: Masking: A new perspective of noisy supervision. In: NIPS. (2018)
 [10] Patrini, G. and Rozza, A. and Menon, A. and Nock, R. and Qu, L.: Making deep neural networks robust to label noise: a loss correction approach. In: CVPR. (2017)
 [11] Arpit, D. and Jastrzebski, S. and Ballas, N. and Krueger, D. and Bengio, E. and Kanwal, M.S. and Maharaj, T. and Fischer, A. and Courville, A. and Bengio, Y.: A closer look at memorization in deep networks. In: ICML. (2017)
 [12] Ren, M. and Zeng, W. and Yang, B. and Urtasun, R.: Learning to reweight examples for robust deep learning. In: ICML. (2018)
 [13] Han, B. and Yao, Q. and Yu, X. and Niu, G. and Xu, M. and Hu, W. and Tsang, I. and Sugiyama, M.: Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: NIPS. (2018)
 [14] Goodfellow, I. and Bengio, Y. and Courville, A.: Deep learning. MIT press Cambridge. (2016)
 [15] van Rooyen, B. and Williamson, B.: A theory of learning with corrupted labels. JMLR. (2018)
 [16] Miyato, T. and Dai, A. and Goodfellow, I.: Virtual adversarial training for semi-supervised text classification. In: ICLR. (2016)

 [17] Chapelle, O. and Scholkopf, B. and Zien, A.: Semi-supervised learning. IEEE TNN. (2009)
 [18] Kiryo, R. and Niu, G. and du Plessis, M.C. and Sugiyama, M.: Positive-unlabeled learning with non-negative risk estimator. In: NIPS. (2017)
 [19] Reed, S. and Lee, H. and Anguelov, D. and Szegedy, C. and Erhan, D. and Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. In: ICLR Workshop. (2015)
 [20] Van Rooyen, B. and Menon, A. and Williamson, R.C.: Learning with symmetric label noise: The importance of being unhinged. In: NIPS. (2015)
 [21] Laine, S. and Aila, T.: Temporal ensembling for semisupervised learning. In: ICLR. (2017)
 [22] He, K. and Zhang, X. and Ren, S. and Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
 [23] Liu, T. and Tao, D.: Classification with noisy labels by importance reweighting. IEEE TPAMI. (2016)
 [24] Ishida, T. and Niu, G. and Hu, W. and Sugiyama, M.: Learning from complementary labels. In: NIPS. (2017)
Appendix 0.A Definition of Noise
The definition of the transition matrix $T$ is as follows, where $\epsilon$ is the noise rate and $n$ is the number of classes.
Pair flipping:

$T = \begin{bmatrix} 1-\epsilon & \epsilon & 0 & \cdots & 0 \\ 0 & 1-\epsilon & \epsilon & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1-\epsilon & \epsilon \\ \epsilon & 0 & \cdots & 0 & 1-\epsilon \end{bmatrix}$

Symmetry flipping:

$T = \begin{bmatrix} 1-\epsilon & \frac{\epsilon}{n-1} & \cdots & \frac{\epsilon}{n-1} \\ \frac{\epsilon}{n-1} & 1-\epsilon & \frac{\epsilon}{n-1} & \cdots \\ \vdots & & \ddots & \vdots \\ \frac{\epsilon}{n-1} & \cdots & \frac{\epsilon}{n-1} & 1-\epsilon \end{bmatrix}$
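The two structures can be generated programmatically (our helper functions; for the pair case, class i flips to the adjacent class (i + 1) mod n):

```python
import numpy as np

def pair_flipping(n, eps):
    """T[i, i] = 1 - eps and T[i, (i+1) % n] = eps."""
    return (1 - eps) * np.eye(n) + eps * np.roll(np.eye(n), 1, axis=1)

def symmetry_flipping(n, eps):
    """T[i, i] = 1 - eps; the remaining eps is spread evenly elsewhere."""
    T = np.full((n, n), eps / (n - 1))
    np.fill_diagonal(T, 1 - eps)
    return T
```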
Appendix 0.B Network Structures
For MNIST, a 28×28 gray image, the structure is as follows. We also summarize it in Table 2.
CNN on MNIST

28×28 Gray Image
3×3 conv, 128 LReLU
3×3 conv, 128 LReLU
3×3 conv, 128 LReLU
2×2 max-pool, stride 2
dropout, p = 0.25
3×3 conv, 256 LReLU
3×3 conv, 256 LReLU
3×3 conv, 256 LReLU
2×2 max-pool, stride 2
dropout, p = 0.25
3×3 conv, 512 LReLU
3×3 conv, 256 LReLU
3×3 conv, 128 LReLU
avg-pool
dense 128→10
In compact notation: (1×28×28) → [C(3×3, 128)]×3 → max-pool(2×2, 2) → dropout(0.25) → [C(3×3, 256)]×3 → max-pool(2×2, 2) → dropout(0.25) → C(3×3, 512) → C(3×3, 256) → C(3×3, 128) → avg-pool → dense 128→10, where the input is a 28×28 image, C(3×3, 128) means 128 channels of 3×3 convolutions followed by LReLU (negative slope = 0.01), max-pool(2×2, 2) means max pooling (kernel size = 2, stride = 2), avg-pool means average pooling, and [·]×n means n such layers. Batch normalization is applied before the LReLU activations.
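The architecture above can be written in PyTorch roughly as follows (a sketch consistent with the table; padding and pooling details not fully specified in the text are our assumptions):

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 conv + batch norm + LReLU(0.01), as described above
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.01),
    )

class NineLayerCNN(nn.Module):
    """Sketch of the 9-layer CNN for 1x28x28 MNIST inputs."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 128), conv_block(128, 128), conv_block(128, 128),
            nn.MaxPool2d(2, stride=2), nn.Dropout(0.25),
            conv_block(128, 256), conv_block(256, 256), conv_block(256, 256),
            nn.MaxPool2d(2, stride=2), nn.Dropout(0.25),
            conv_block(256, 512), conv_block(512, 256), conv_block(256, 128),
            nn.AdaptiveAvgPool2d(1),      # global average pool to 1x1
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)   # (B, 128)
        return self.classifier(h)         # (B, num_classes)
```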
For CIFAR-10, a 32×32 RGB image, the structure is ResNet-32.