Learning from industrial-level data is quite demanding, since labels of such data are heavily noisy and their label-generation processes are usually unknown [1, 2]. Essentially, noisy labels of such data are corrupted from ground-truth labels without any prior assumption (e.g., class-conditional noise), which degrades the robustness of learning models. Note that industrial-level data is frequently emerging in daily life, such as social-network data, E-commerce data and crowdsourcing data [5, 6].
Due to the large data volume, industrial-level data can be well handled by deep neural networks. Thus, the key issue is how to train deep neural networks robustly on the noisy labels of such data, since deep neural networks have the high capacity to fit noisy labels eventually [7]. To handle noisy labels, one common direction focuses on estimating the noise transition matrix [8, 9]. For instance, Patrini et al. [10] first leveraged a two-step solution to estimate the noise transition matrix. Based on this estimated matrix, they conducted backward loss correction, which is used for training deep neural networks robustly. However, the noise transition matrix is not easy to estimate accurately, especially when the noise ratio is high and the number of classes is large.
Another direction exploits the memorization effect of deep neural networks [11], which does not require any prior assumptions on noisy labels. Specifically, deep learning models are known to learn easy instances first, and then gradually adapt to hard instances as the number of training epochs grows. Therefore, in the first few iterations, we may train deep neural networks on the whole dataset and let them sufficiently learn the clean instances in the noisy dataset. Then we may conduct early stopping, which tries to stop the training on noisy instances; or we may employ the small-loss trick [2, 13], which tries to perform the training selectively on clean (small-loss) instances.
However, when noisy labels indeed exist, whether we use early stopping or the small-loss trick, deep learning models inevitably memorize some noisy labels, which leads to poor generalization performance. In this paper, we design a meta algorithm called Pumpout, which allows us to overcome the issue of memorizing noisy labels. The main idea of Pumpout is to actively squeeze out the negative effects of noisy labels from the training model, instead of passively forgetting these effects through further training. Specifically, on clean labels, Pumpout conducts stochastic gradient descent as usual; on noisy labels, Pumpout conducts scaled stochastic gradient ascent, instead of stopping the gradient computation as is commonly done. This aggressive policy erases the negative effects of noisy labels actively and effectively.
We leverage Pumpout to upgrade two representative but orthogonal approaches in the area of “deep learning with noisy labels”: MentorNet [2] and Backward Correction [10, 15]. We conducted experiments on simulated noisy MNIST and CIFAR10 datasets. Empirical results demonstrate that, under both extremely noisy labels (i.e., 45% and 50% noisy labels) and low-level noisy labels (i.e., 20% noisy labels), the robustness of the two upgraded approaches is clearly superior to that of the original approaches.
2 Pumpout Meets Noisy Supervision
The core idea of Pumpout is to actively squeeze out the negative effects of noisy labels from the training model, instead of passively forgetting these effects. However, in designing a meta algorithm, we should consider how it can simultaneously benefit multiple orthogonal approaches, such as training on selected instances [2, 12, 13], estimating the noise transition matrix [8, 9] and designing regularization.
For this purpose, we generalize noisy labels into “not-fitting” labels, and clean labels into “fitting” labels (details of the fitting condition are discussed in Q1 below). At a high level, the meta algorithm Pumpout trains deep neural networks by stochastic gradient descent on “fitting” labels, and by scaled stochastic gradient ascent on “not-fitting” labels.
At a low level, Pumpout is realized as Algorithm 1. Specifically, we maintain a deep neural network. When a single data point is sequentially selected from the noisy set (step 3), we first check whether it satisfies the discriminative fitting condition. If yes, we conduct stochastic gradient descent as usual (step 4); otherwise, we conduct scaled stochastic gradient ascent (step 5), which erases the negative effects of the “not-fitting” label. This abstract algorithm naturally raises three important questions.
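As a concrete illustration, the two update rules can be sketched on a toy one-parameter model (the squared loss, the data stream, and all parameter values below are our own illustrative assumptions, not the paper's exact instantiation):

```python
def pumpout_step(theta, grad, fitting, lr=0.1, gamma=0.5):
    """One Pumpout update on a single data point:
    gradient descent on a "fitting" label (step 4),
    scaled gradient ascent on a "not-fitting" label (step 5)."""
    if fitting:
        return theta - lr * grad        # fit the point as usual
    return theta + gamma * lr * grad    # actively squeeze its effect out

def grad_sq(theta, x, y):
    """Gradient of the toy squared loss 0.5 * (theta * x - y)**2 w.r.t. theta."""
    return (theta * x - y) * x

theta = 0.0
# (x, y, fitting?) -- the middle point plays the role of a noisy label
stream = [(1.0, 1.0, True), (1.0, -1.0, False), (1.0, 1.0, True)]
for x, y, fitting in stream:
    theta = pumpout_step(theta, grad_sq(theta, x, y), fitting)
```

On the “not-fitting” point, the ascent step moves the parameter away from the noisy target, partially undoing its effect rather than merely skipping it.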
Three important questions.
What is the fitting condition?
Why do we need gradient ascent on “not-fitting” data, in addition to gradient descent on “fitting” data?
Why do we need to scale the stochastic gradient ascent on “not-fitting” data?
To answer the first question, we emphasize that orthogonal approaches require different fitting conditions. Intuitively, if a single data point satisfies a discriminative fitting condition, our training model regards it as useful knowledge, and fitting on this point benefits training a robust model. Conversely, if a single data point does not satisfy the fitting condition, our training model regards it as useless knowledge and wants to actively erase its negative effects. To instantiate the fitting condition, we provide two concrete cases in Algorithm 2 and Algorithm 3, respectively.
The above answer motivates the second question: why can we not only conduct stochastic gradient descent on fitting data points (step 4)? In other words, can we remove the scaled stochastic gradient ascent (step 5) in Algorithm 1? In this case, our algorithm degenerates to training only on selected instances. However, once some of the selected instances turn out to be false positives, our training model will fit on them, and the negative effects inevitably occur (i.e., degraded generalization). Instead of passively forgetting these negative effects, we hope to actively squeeze them out of the training model by scaled stochastic gradient ascent (step 5).
Lastly, the third question closely connects with the second one: why do we need scaled instead of ordinary stochastic gradient ascent? The answer can be explained intuitively. View stochastic gradient ascent as a correction to “not-fitting” labels, with a scale parameter $\gamma \in [0, 1]$. When $\gamma = 1$, Pumpout squeezes out the negative effects at the full rate; when $\gamma = 0$, Pumpout does not squeeze out any negative effects. Neither case is optimal: in the first case, the fast squeezing rate negatively affects the convergence of the algorithm; in the second case, no squeezing inevitably lets deep neural networks memorize some “not-fitting” labels, which degrades their generalization.
3 Pumpout Benefits State-of-the-Art Algorithms
In this section, we apply the idea of Pumpout to MentorNet and Backward Correction as follows.
3.1 Upgraded MentorNet
Algorithm 2 presents the upgraded MentorNet based on the Pumpout approach, where MentorNet uses the small-loss trick. Specifically, we maintain a deep neural network. When a mini-batch is formed (step 3), we first select a small proportion of instances in this mini-batch that have small training losses (step 4); the kept percentage of the mini-batch is controlled by a ratio schedule. More importantly, we also select a proportion of instances in this mini-batch that have big training losses (step 5), again controlled by a ratio schedule. Then, we conduct stochastic gradient descent on the small-loss instances (step 6), while we conduct scaled stochastic gradient ascent on the big-loss instances (step 7), which actively erases their negative effects. The update of the ratio schedule (step 8) follows prior work, in which extensive discussion has been conducted.
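Steps 4–7 amount to a loss-based split of each mini-batch, which can be sketched as follows (a minimal sketch: the ratios, loss values, and the name `split_by_loss` are illustrative assumptions, and the paper controls both proportions with schedules updated in step 8):

```python
def split_by_loss(losses, small_ratio, big_ratio):
    """Rank a mini-batch by training loss: the smallest-loss fraction is
    treated as clean (gradient descent, step 6), the biggest-loss fraction
    as noisy (scaled gradient ascent, step 7)."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    n_small = int(small_ratio * len(losses))
    n_big = int(big_ratio * len(losses))
    small = order[:n_small]               # candidates for gradient descent
    big = order[len(losses) - n_big:]     # candidates for gradient ascent
    return small, big

losses = [0.10, 2.30, 0.40, 1.70, 0.20, 0.05]
small, big = split_by_loss(losses, small_ratio=0.5, big_ratio=0.5)
```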
Relations to MentorNet.
To handle noisy labels, an emerging direction focuses on training only on selected instances [2, 12, 13], which is free of estimating the noise transition matrix and of the class-conditional noise assumption. These works try to select clean instances out of the noisy ones, and then use them to update the network. Among them, a representative method is MentorNet [2], which employs the small-loss trick. Specifically, MentorNet pre-trains an extra network, and then uses it to select small-loss instances as clean instances to guide the training. However, the idea of MentorNet is similar to the self-training approach, so MentorNet inherits the same drawback: accumulated error caused by the sample-selection bias.
Note that, if we remove step 5 and step 7 in Algorithm 2, the algorithm reduces to the core version of MentorNet. This means the upgraded algorithm is in essence more aggressive than MentorNet: it conducts not only stochastic gradient descent on small-loss instances (like MentorNet), but also scaled stochastic gradient ascent on big-loss instances.
3.2 Upgraded Backward Correction
Algorithm 3 presents the upgraded Backward Correction based on the Pumpout approach, where Backward Correction is defined in Theorem 1. If the model being trained is flexible (i.e., a deep neural network), Backward Correction leads to negative risks, which subsequently yields an overfitting issue. To mitigate this issue, we maintain a deep neural network. When a single data point is sequentially selected from the current mini-batch (step 5), we first compute the temporary gradient at this point (step 6). If Backward Correction produces a non-negative risk at this point (with the backward-corrected loss defined in Theorem 1), we accumulate the gradient by gradient descent (step 7); otherwise, we accumulate the gradient by scaled gradient ascent (step 8), which erases the negative effects of negative-risk instances. Lastly, we average the accumulated gradient (step 9) and update the parameters by stochastic optimization (step 10).
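A minimal sketch of steps 6–8 on a two-class problem (the transition matrix, loss values, and function names are illustrative assumptions; `gamma` is the ascent scale):

```python
def backward_corrected(losses, T_inv):
    """Backward Correction: corrected loss vector T^{-1} * losses, where
    losses[j] is the loss computed as if the observed label were class j."""
    n = len(losses)
    return [sum(T_inv[i][j] * losses[j] for j in range(n)) for i in range(n)]

def pumpout_bc_direction(corrected_loss, grad, gamma=0.5):
    """Steps 7-8 (sketch): accumulate the descent gradient when the
    corrected risk is non-negative; otherwise reverse and scale it."""
    return grad if corrected_loss >= 0 else -gamma * grad

# A 2-class transition matrix with 20% flips, and its explicit inverse
T = [[0.8, 0.2], [0.2, 0.8]]
det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
T_inv = [[ T[1][1] / det, -T[0][1] / det],
         [-T[1][0] / det,  T[0][0] / det]]

corrected = backward_corrected([1.0, 0.0], T_inv)  # one entry is negative
```

Note that one corrected entry is negative, which is exactly the negative-risk case that triggers the scaled gradient ascent (step 8).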
Relations to Backward Correction and its non-negative version.
Patrini et al. [10] leveraged a two-step solution to estimate the noise transition matrix heuristically. They then employed the estimated matrix to correct the original loss, and robustly trained a deep neural network based on the new loss function.
(Backward Correction, Theorem 1 in [10]) Suppose that the noise transition matrix $T$ is non-singular, where $T_{ij} = \Pr(\tilde{y} = j \mid y = i)$ given that the noisy label $\tilde{y}$ is flipped from the clean label $y$. Given loss $\ell$ and network parameter $\theta$, Backward Correction is defined as
$$\ell^{\leftarrow}\big(\hat{p}(y \mid x; \theta)\big) = T^{-1} \ell\big(\hat{p}(y \mid x; \theta)\big).$$
Then, the corrected loss $\ell^{\leftarrow}$ is unbiased, namely
$$\mathbb{E}_{\tilde{y} \mid x}\big[\ell^{\leftarrow}_{\tilde{y}}\big(\hat{p}(y \mid x; \theta)\big)\big] = \mathbb{E}_{y \mid x}\big[\ell_{y}\big(\hat{p}(y \mid x; \theta)\big)\big].$$
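The unbiasedness statement can be checked numerically on a two-class toy: averaging the backward-corrected loss over noisy labels drawn with probabilities from row $T[y]$ recovers the clean-label loss exactly, since $T T^{-1} = I$ (the loss values below are illustrative assumptions):

```python
losses = [0.3, 1.2]              # illustrative clean losses l_y
T = [[0.7, 0.3], [0.3, 0.7]]     # noise transition matrix

# explicit 2x2 inverse
det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
T_inv = [[ T[1][1] / det, -T[0][1] / det],
         [-T[1][0] / det,  T[0][0] / det]]

# backward-corrected loss vector (T^{-1} l)
corrected = [sum(T_inv[i][j] * losses[j] for j in range(2)) for i in range(2)]

def expected_corrected(y):
    """Expectation, over noisy labels drawn with probabilities T[y],
    of the backward-corrected loss."""
    return sum(T[y][j] * corrected[j] for j in range(2))
```

Note that one corrected entry comes out negative here, previewing the negative-risk issue discussed below.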
If the model being trained is flexible, such as a deep neural network, backward loss correction leads to negative risks, and the hazard is an overfitting issue. Motivated by [18], we conduct a non-negative correction on top of the backward-corrected loss, since the risk should always be greater than or equal to zero.
(Non-negative Backward Correction) Suppose that the noise transition matrix $T$ is non-singular, where $T_{ij} = \Pr(\tilde{y} = j \mid y = i)$ given that the noisy label $\tilde{y}$ is flipped from the clean label $y$. Given loss $\ell$ and network parameter $\theta$, Non-negative Backward Correction is defined as
$$\tilde{\ell}^{\leftarrow}\big(\hat{p}(y \mid x; \theta)\big) = \max\big\{0,\ \ell^{\leftarrow}\big(\hat{p}(y \mid x; \theta)\big)\big\},$$
where $\ell^{\leftarrow}$ is the backward-corrected loss in Theorem 1 and the $\max$ operator truncates it at zero, so the result is a non-negative scalar. Then, the corrected loss is non-negative, namely $\tilde{\ell}^{\leftarrow} \ge 0$. Our key claim is to overcome the overfitting issue by this non-negative correction.
However, the above non-negative correction is passive, since the $\max\{0, \cdot\}$ operator means stopping gradient computation on negative-risk instances, and this correction may not achieve optimal performance. Namely, when the backward-corrected loss is non-negative, we conduct stochastic gradient descent; otherwise, no gradient step is performed. To propose an aggressive non-negative correction, we reverse the gradient computation at negative-risk instances. Specifically, we use the Pumpout approach to improve Non-negative Backward Correction: when the backward-corrected loss is non-negative, we conduct stochastic gradient descent; when it is negative, we conduct scaled stochastic gradient ascent. This yields Algorithm 3.
Note that, if we remove step 8 in Algorithm 3, the algorithm reduces to Non-negative Backward Correction. This means the upgraded algorithm is an aggressive version of Non-negative Backward Correction: it conducts not only stochastic gradient descent on non-negative-risk instances, but also scaled stochastic gradient ascent on negative-risk instances.
4 Experiments

We preliminarily verify the effectiveness of the Pumpout approach on two benchmark datasets, MNIST and CIFAR10 (Table 1), which are popularly used for evaluating learning with noisy labels in the literature [8, 10, 13, 19].
| |# of training|# of testing|# of class|image size|
|MNIST|60,000|10,000|10|28×28|
|CIFAR10|50,000|10,000|10|32×32|
Since both datasets are clean, following [19, 10], we corrupt them manually using a noise transition matrix $T$, where $T_{ij} = \Pr(\tilde{y} = j \mid y = i)$ given that the noisy label $\tilde{y}$ is flipped from the clean label $y$. We assume the matrix $T$ has one of two representative structures (Figure 1): (1) pair flipping, whose real-world analogue is fine-grained classification, where mistakes are made only within very similar classes in adjacent positions; (2) symmetry flipping. Their precise definitions are in Appendix 0.A.
This paper first verifies whether Pumpout can significantly improve the robustness of representative methods under extremely noisy supervision, where the noise rate $\epsilon$ is chosen as 0.45 for pair flipping and 0.5 for symmetry flipping. Intuitively, this means almost half of the instances have noisy labels. Note that a pair-flipping noise rate above 0.5 would mean over half of the training data have wrong labels, which cannot be learned without additional assumptions. In addition to the extremely noisy settings, we also verify whether Pumpout can significantly improve the robustness of representative methods under low-level noisy supervision, where $\epsilon$ is set to 0.2. Note that the pair case is much harder than the symmetry case: in Figure 1(a), the true class has only slightly more correct instances than the single confusing class, whereas in Figure 1(b), the true class has far more correct instances than any wrong class. Meanwhile, similarly to [19, 8, 2], we do not make any implicit assumption behind Pumpout.
To verify the efficacy of Pumpout, we compare against two orthogonal approaches in deep learning with noisy labels. The first set of comparisons (SET1) checks whether Pumpout can improve the robustness of MentorNet: (i) MentorNet [2]; (ii) the upgraded MentorNet (Algorithm 2). The second set (SET2) checks whether Pumpout can improve the robustness of Backward Correction: (i) Backward Correction (denoted “BC”, Theorem 1); (ii) Non-negative Backward Correction (denoted “nnBC”, Theorem 2); (iii) the upgraded Backward Correction (Algorithm 3). As a simple baseline, we also compare with a normal deep neural network that directly learns on the noisy training set (denoted “Normal”).
For a fair comparison, we implement all methods with default parameters in PyTorch, and conduct all experiments on an NVIDIA K80 GPU. A standard CNN with the LReLU activation function is used; the detailed architecture is in Appendix 0.B. Namely, we use the 9-layer CNN [16, 21] for MNIST and ResNet-32 [22] for CIFAR10, since these network structures are standard test beds for weakly-supervised learning. For all datasets, we use the Adam optimizer (momentum = 0.9) with an initial learning rate of 0.001; the batch size is set to 128 and training runs for 200 epochs. Besides, dropout and batch normalization are also used.
For SET1, the most important parameter of the upgraded MentorNet and MentorNet is the small-loss selection ratio. Here, we assume the noise level $\epsilon$ is known and use it to set the ratio schedule; if $\epsilon$ is not known in advance, it can be inferred using validation sets. Note that the schedule depends only on the memorization effect of deep networks, not on any specific dataset. For SET2, the most important parameters of the upgraded Backward Correction and nnBC are the degree of tolerance, controlled by $\beta$, and the scale of gradient ascent, controlled by $\gamma$. The choices of $\beta$ and $\gamma$ follow [18].
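If the small-loss ratio schedule decays linearly (our assumption, in the style of the Co-teaching schedule [13]; the names `keep_ratio` and `t_k` are ours), it can be sketched as:

```python
def keep_ratio(epoch, noise_rate, t_k=10):
    """Fraction of small-loss instances kept at a given epoch: start by
    keeping everything, linearly decay to 1 - noise_rate over the first
    t_k epochs, then stay there. Depends only on the memorization effect,
    not on the dataset."""
    return 1.0 - noise_rate * min(epoch / t_k, 1.0)
```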
This paper provides two upgraded approaches to train deep neural networks robustly under noisy labels. Thus, our goal is to classify clean instances as accurately as possible, and the measurement for both SET1 and SET2 is the test accuracy, i.e., test accuracy = (# of correct predictions) / (# of test instances). Besides, for SET1, we also use the label precision in each mini-batch, i.e., label precision = (# of clean labels) / (# of all selected labels). Specifically, we sample a proportion of small-loss instances in each mini-batch, and then calculate the ratio of clean labels among them. Intuitively, a higher label precision means fewer noisy instances in the mini-batch after sample selection, so an algorithm with higher label precision is also more robust to label noise. All experiments are repeated five times. In each figure, the error bar for the standard deviation is shown as a shaded area.
4.1 Results of the Upgraded MentorNet vs. MentorNet
In Figure 2, we show test accuracy (top) and label precision (bottom) vs. number of epochs on the MNIST dataset. In all three plots, we can clearly see the memorization effect of networks: the test accuracy of Normal first reaches a very high level and then gradually decreases. A good robust training method should thus stop or alleviate this decrease. On this point, our upgraded MentorNet almost stops the decrease in the easier Symmetric-50% and Symmetric-20% cases, and, compared to MentorNet, alleviates it in the hardest Pair-45% case. Thus, it consistently achieves higher accuracy than MentorNet.
To explain this good performance, we plot label precision (bottom). Compared to Normal, both the upgraded MentorNet and MentorNet successfully pick out clean instances. However, ours achieves higher label precision not only in the easier Symmetric-50% and Symmetric-20% cases, but also in the hardest Pair-45% case. This shows our approach is better at finding clean instances, due to the use of scaled stochastic gradient ascent.
Figure 3 shows test accuracy and label precision vs. number of epochs on the CIFAR10 dataset. Again, on test accuracy, the upgraded MentorNet strongly stops the memorization effect of networks. More importantly, in the easier Symmetric-50% and Symmetric-20% cases, it improves steadily as training proceeds. On label precision, while Normal fails to find clean instances, both the upgraded MentorNet and MentorNet can do so. However, due to the use of scaled stochastic gradient ascent, ours is stronger and finds more clean instances.
4.2 Results of the Upgraded Backward Correction vs. nnBC
Figure 4 shows test accuracy vs. number of epochs on the MNIST dataset. In all three plots, we can see the memorization effect of networks: the test accuracy of Normal first reaches a very high level and then gradually decreases. However, our upgraded Backward Correction fully stops the decrease in the hardest Pair-45% case. Meanwhile, in the easier Symmetric-50% and Symmetric-20% cases, it improves steadily over training epochs, albeit with fluctuations. Moreover, it finally achieves higher accuracy than both BC and nnBC.
5 Conclusion

This paper presents a meta algorithm called Pumpout, which significantly improves the robustness of state-of-the-art methods under noisy labels. Our key idea is to actively squeeze out the negative effects of noisy labels from the training model, instead of passively forgetting these effects. Pumpout is realized by training deep neural networks with stochastic gradient descent on “fitting” labels, and with scaled stochastic gradient ascent on “not-fitting” labels. To demonstrate its efficacy, we design upgraded versions of MentorNet and Backward Correction, respectively. The experimental results show that both upgraded approaches train deep models more robustly than the originals. In the future, we can extend our work in the following aspects. First, we can leverage the Pumpout approach to train deep models under other types of weak supervision, e.g., complementary labels [24]. Second, we should investigate theoretical guarantees for the Pumpout approach.
MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study and JST CREST JPMJCR1403. IWT was supported by ARC FT130100746, DP180100106 and LP150100671. BH would like to thank the financial support from Center for AIP, RIKEN.
-  Xiao, T. and Xia, T. and Yang, Y. and Huang, C. and Wang, X.: Learning from massive noisy labeled data for image classification. In: CVPR. (2015)
-  Jiang, L. and Zhou, Z. and Leung, T. and Li, L. and Fei-Fei, L.: MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: ICML. (2018)
-  Natarajan, N. and Dhillon, I.S. and Ravikumar, P.K. and Tewari, A.: Learning with noisy labels. In: NIPS. (2013)
-  Cha, Y. and Cho, J.: Social-network analysis using topic models. In: SIGIR. (2012)
-  Welinder, P. and Branson, S. and Perona, P. and Belongie, S.: The multidimensional wisdom of crowds. In: NIPS. (2010)
-  Han, B. and Pan, Y. and Tsang, I.: Robust Plackett–Luce model for k-ary crowdsourced preferences. MLJ. (2018)
-  Zhang, C. and Bengio, S. and Hardt, M. and Recht, B. and Vinyals, O.: Understanding deep learning requires rethinking generalization. In: ICLR. (2017)
-  Goldberger, J. and Ben-Reuven, E.: Training deep neural-networks using a noise adaptation layer. In: ICLR. (2017)
-  Han, B. and Yao, J. and Niu, G. and Zhou, M. and Tsang, I. and Zhang, Y. and Sugiyama, M.: Masking: A new perspective of noisy supervision. In: NIPS. (2018)
-  Patrini, G. and Rozza, A. and Menon, A. and Nock, R. and Qu, L.: Making deep neural networks robust to label noise: a loss correction approach. In: CVPR. (2017)
-  Arpit, D. and Jastrzebski, S. and Ballas, N. and Krueger, D. and Bengio, E. and Kanwal, M.S. and Maharaj, T. and Fischer, A. and Courville, A. and Bengio, Y.: A closer look at memorization in deep networks. In: ICML. (2017)
-  Ren, M. and Zeng, W. and Yang, B. and Urtasun, R.: Learning to reweight examples for robust deep learning. In: ICML. (2018)
-  Han, B. and Yao, Q. and Yu, X. and Niu, G. and Xu, M. and Hu, W. and Tsang, I. and Sugiyama, M.: Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: NIPS. (2018)
-  Goodfellow, I. and Bengio, Y. and Courville, A.: Deep learning. MIT press Cambridge. (2016)
-  van Rooyen, B. and Williamson, B.: A theory of learning with corrupted labels. JMLR. (2018)
-  Miyato, T. and Dai, A. and Goodfellow, I.: Virtual adversarial training for semi-supervised text classification. In: ICLR. (2016)
-  Chapelle, O. and Scholkopf, B. and Zien, A.: Semi-supervised learning. IEEE TNN. (2009)
-  Kiryo, R. and Niu, G. and du Plessis, M. C and Sugiyama, M.: Positive-unlabeled learning with non-negative risk estimator. In: NIPS. (2017)
-  Reed, S. and Lee, H. and Anguelov, D. and Szegedy, C. and Erhan, D. and Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. In: ICLR Workshop. (2015)
-  Van Rooyen, B. and Menon, A. and Williamson, R.C.: Learning with symmetric label noise: The importance of being unhinged. In: NIPS. (2015)
-  Laine, S. and Aila, T.: Temporal ensembling for semi-supervised learning. In: ICLR. (2017)
-  He, K. and Zhang, X. and Ren, S. and Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
-  Liu, T. and Tao, D.: Classification with noisy labels by importance reweighting. IEEE TPAMI. (2016)
-  Ishida, T. and Niu, G. and Hu, W. and Sugiyama, M.: Learning from complementary labels. In: NIPS. (2017)
Appendix 0.A Definition of Noise
The definitions of the transition matrix $T$ are as follows, where $\epsilon$ is the noise rate and $n$ is the number of classes.

Pair flipping:
$$T = \begin{bmatrix} 1-\epsilon & \epsilon & 0 & \cdots & 0 \\ 0 & 1-\epsilon & \epsilon & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1-\epsilon & \epsilon \\ \epsilon & 0 & \cdots & 0 & 1-\epsilon \end{bmatrix}$$

Symmetry flipping:
$$T = \begin{bmatrix} 1-\epsilon & \frac{\epsilon}{n-1} & \cdots & \frac{\epsilon}{n-1} \\ \frac{\epsilon}{n-1} & 1-\epsilon & \cdots & \frac{\epsilon}{n-1} \\ \vdots & & \ddots & \vdots \\ \frac{\epsilon}{n-1} & \cdots & \frac{\epsilon}{n-1} & 1-\epsilon \end{bmatrix}$$
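The two structures can be constructed programmatically (a minimal sketch; function names are ours, and pair flipping wraps the last class around to the first):

```python
def pair_flip_matrix(n, eps):
    """Pair flipping: each class keeps probability 1 - eps and flips to
    the adjacent (next) class with probability eps."""
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = 1.0 - eps
        T[i][(i + 1) % n] = eps
    return T

def symmetric_flip_matrix(n, eps):
    """Symmetry flipping: keep probability 1 - eps, spread eps uniformly
    over the other n - 1 classes."""
    T = [[eps / (n - 1)] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = 1.0 - eps
    return T
```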
Appendix 0.B Network Structures
For MNIST (28×28 gray images), the structure is as follows. We also summarize it in Table 2.
|CNN on MNIST|
|28×28 Gray Image|
|3×3 conv, 128 LReLU|
|3×3 conv, 128 LReLU|
|3×3 conv, 128 LReLU|
|2×2 max-pool, stride 2|
|dropout, p = 0.25|
|3×3 conv, 256 LReLU|
|3×3 conv, 256 LReLU|
|3×3 conv, 256 LReLU|
|2×2 max-pool, stride 2|
|dropout, p = 0.25|
|3×3 conv, 512 LReLU|
|3×3 conv, 256 LReLU|
|3×3 conv, 128 LReLU|
|avg-pool|
|dense 128→10|
(1×28×28)-[C(3×3, 128)]×3-maxpool(2×2, 2)-dropout(0.25)-[C(3×3, 256)]×3-maxpool(2×2, 2)-dropout(0.25)-C(3×3, 512)-C(3×3, 256)-C(3×3, 128)-avgpool-128-10, where the input is a 28×28 image, C(3×3, 128) means 128 channels of 3×3 convolutions followed by LReLU (negative slope = 0.01), maxpool(2×2, 2) means max pooling (kernel size = 2, stride = 2), avgpool means average pooling, and [·]×n means n such layers. Batch normalization was applied before the LReLU activations.
For CIFAR10 (32×32 RGB images), the structure is ResNet-32.