1 Introduction
Learning from noisy labels dates back three decades angluin1988learning, and remains an active topic in recent years goldberger2016training ; patrini2017making. Essentially, noisy labels are corrupted from ground-truth labels, and thus they inevitably degrade the robustness of learned models, especially for deep neural networks arpit2017closer ; zhang2016understanding. Unfortunately, noisy labels are ubiquitous in the real world. For instance, both online queries blum2003noise and crowdsourcing yan2014learning ; yu2018learning yield a large number of noisy labels across the world every day.
As deep neural networks have a high capacity to fit noisy labels zhang2016understanding, it is challenging to train deep networks robustly with noisy labels. Current methods focus on estimating the noise transition matrix. For example, on top of the softmax layer, Goldberger et al. goldberger2016training added an additional softmax layer to model the noise transition matrix. Patrini et al. patrini2017making leveraged a two-step solution to estimate the noise transition matrix heuristically. However, the noise transition matrix is hard to estimate accurately, especially when the number of classes is large.
To avoid estimating the noise transition matrix, a promising direction focuses on training on selected samples jiang2017mentornet ; malach2017decoupling ; ren2018learning. These works try to select clean instances out of the noisy ones, and then use them to update the network. Intuitively, as the training data becomes less noisy, better performance can be obtained. Among those works, the representative methods are MentorNet jiang2017mentornet and Decoupling malach2017decoupling. Specifically, MentorNet pre-trains an extra network, and then uses the extra network to select clean instances to guide the training. When clean validation data is not available, MentorNet has to use a predefined curriculum (e.g., a self-paced curriculum). Nevertheless, the idea of self-paced MentorNet is similar to the self-training approach chapelle2009semi, and it inherits the same drawback of accumulated error caused by the sample-selection bias. Decoupling trains two networks simultaneously, and then updates the models using only the instances on which the two networks make different predictions. Nonetheless, noisy labels are evenly spread across the whole space of examples. Thus, the disagreement area includes a number of noisy labels, and the Decoupling approach cannot handle noisy labels explicitly. Although MentorNet and Decoupling are representative approaches in this promising direction, the issues discussed above remain, which naturally motivates us to improve on them in our research.
Meanwhile, an interesting observation for deep models is that they can memorize easy instances first, and gradually adapt to hard instances as the number of training epochs grows arpit2017closer. When noisy labels exist, deep learning models will eventually memorize these wrongly given labels zhang2016understanding, which leads to poor generalization performance. Besides, this phenomenon does not change with the choice of training optimizer (e.g., Adagrad duchi2011adaptive and Adam kingma2014adam) or network architecture (e.g., MLP goodfellow2016deep, Alexnet krizhevsky2012imagenet and Inception szegedy2016rethinking) jiang2017mentornet ; zhang2016understanding.
In this paper, we propose a simple but effective learning paradigm called “Coteaching”, which allows us to train deep networks robustly even with extremely noisy labels (e.g., 45% of noisy labels occur in fine-grained classification with multiple classes deng2013fine). Our idea stems from the Cotraining approach blum1998combining. Similarly to Decoupling, our Coteaching also maintains two networks simultaneously. However, in each mini-batch of data, each network views its small-loss instances (like self-paced MentorNet) as useful knowledge, and teaches these instances to its peer network for updating the parameters. The intuition why Coteaching can be more robust is briefly explained as follows. In Figure 1, assume that the error flow comes from the biased selection of training instances in the first mini-batch of data. In MentorNet or Decoupling, the error from one network is directly transferred back to itself in the second mini-batch of data, so the error accumulates. However, in Coteaching, since the two networks have different learning abilities, they can filter different types of error introduced by noisy labels. In this exchange procedure, the error flows can be reduced mutually by the peer networks. Moreover, we train deep networks using stochastic optimization with momentum, and nonlinear deep networks can memorize clean data first to become robust arpit2017closer. When the error from noisy data flows into the peer network, the peer network attenuates this error due to its robustness.
We conduct experiments on noisy versions of the MNIST, CIFAR10 and CIFAR100 datasets. Empirical results demonstrate that, under extremely noisy circumstances (i.e., 45% of noisy labels), the robustness of deep learning models trained by the Coteaching approach is much superior to state-of-the-art baselines. Under low-level noisy circumstances (i.e., 20% of noisy labels), the robustness of deep learning models trained by the Coteaching approach is still superior to most baselines.
2 Related literature
Statistical learning methods.
Statistical learning has contributed a lot to the problem of noisy labels, especially in theoretical aspects. The approaches can be categorized into three strands: surrogate losses, noise rate estimation and probabilistic modeling. For example, in the surrogate losses category, Natarajan et al. natarajan2013learning proposed an unbiased estimator to provide a noise-corrected loss. Masnadi-Shirazi et al. masnadi2009design presented a robust non-convex loss, which is a special case in a family of robust losses. In the noise rate estimation category, both Menon et al. menon2015learning and Liu et al. liu2016classification proposed a class-probability estimator using order statistics on the range of scores. Sanderson et al. sanderson2014class presented the same estimator using the slope of the ROC curve. In the probabilistic modeling category, Raykar et al. raykar2010learning proposed a two-coin model to handle noisy labels from multiple annotators. Yan et al. yan2014learning extended this two-coin model by setting a dynamic flipping probability associated with instances.
Other deep learning approaches.
In addition, there are some other deep learning solutions for dealing with noisy labels ma2018dimensionality ; wang2018iterative. For example, Li et al. li2017learning proposed a unified framework to distill knowledge from clean labels and a knowledge graph, which can be exploited to learn a better model from noisy labels. Veit et al. veit2017learning trained a label cleaning network on a small set of clean labels, and used this network to reduce the noise in large-scale noisy labels. Tanaka et al. tanaka2018joint presented a joint optimization framework to learn parameters and estimate true labels simultaneously. Ren et al. ren2018learning leveraged an additional validation set to adaptively assign weights to training examples in every iteration. Rodrigues et al. rodrigues2017deep added a crowd layer after the output layer for noisy labels from multiple annotators. However, all of these methods require either extra resources or more complex networks.
Learning to teach methods.
Learning-to-teach is also an active topic. Inspired by hinton2015distilling, these methods consist of teacher and student networks. The duty of the teacher network is to select more informative instances for better training of the student networks. Recently, this idea has been applied to learn a proper curriculum for the training data fan2017learning and to deal with multi-labels gong2016teaching. However, these works do not consider noisy labels; MentorNet jiang2017mentornet introduced this idea into the noisy-label setting.
3 Coteaching meets noisy supervision
Our idea is to train two deep networks simultaneously. As in Figure 1, in each mini-batch of data, each network selects its small-loss instances as the useful knowledge, and teaches such useful instances to its peer network for further training. Therefore, the proposed algorithm is named Coteaching (Algorithm 1). As all deep learning training methods are based on stochastic gradient descent, our Coteaching works in a mini-batch manner. Specifically, we maintain two networks $f$ (with parameter $w_f$) and $g$ (with parameter $w_g$). When a mini-batch $\bar{\mathcal{D}}$ is formed (step 3), we first let $f$ (resp. $g$) select a small proportion of instances in this mini-batch, $\bar{\mathcal{D}}_f$ (resp. $\bar{\mathcal{D}}_g$), that have small training loss (steps 4 and 5). The number of instances is controlled by $R(T)$, and $f$ (resp. $g$) only selects $R(T)$ percentage of small-loss instances out of the mini-batch. Then, the selected instances are fed into its peer network as the useful knowledge for parameter updates (steps 6 and 7).
There are two important questions for designing the above Algorithm 1:

Why can sampling small-loss instances based on dynamic $R(T)$ help us find clean instances?

Why do we need two networks and cross-update the parameters?
To answer the first question, we first need to clarify the connection between small losses and clean instances. Intuitively, when labels are correct, small-loss instances are more likely to be the ones which are correctly labeled. Thus, if we train our classifier using only the small-loss instances in each mini-batch of data, it should be resistant to noisy labels.
However, the above requires that the classifier is reliable enough so that the small-loss instances are indeed clean. The “memorization” effect of deep networks can exactly help us address this problem arpit2017closer. Namely, on noisy data sets, even with the existence of noisy labels, deep networks will learn clean and easy patterns in the initial epochs zhang2016understanding ; arpit2017closer. So, they have the ability to filter out noisy instances using their loss values at the beginning of training. Yet, the problem is that when the number of epochs becomes large, they will eventually overfit on noisy labels. To rectify this problem, we want to keep more instances in the mini-batch at the start, i.e., $R(T)$ is large. Then, we gradually increase the drop rate, i.e., $R(T)$ becomes smaller, so that we can keep clean instances and drop those noisy ones before our networks memorize them (details of $R(T)$ will be discussed in Section 4.2).
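As an illustration of the small-loss criterion discussed above, the following minimal Python sketch keeps a given fraction of a mini-batch with the smallest per-instance losses. The function name `select_small_loss`, the argument `keep_rate` (playing the role of the kept fraction) and the toy loss values are assumptions made for illustration; they are not taken from the released implementation.

```python
def select_small_loss(losses, keep_rate):
    """Return the (sorted) indices of the keep_rate fraction of
    instances that have the smallest loss in a mini-batch."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must lie in (0, 1]")
    # Keep at least one instance so that an update is always possible.
    num_keep = max(1, int(keep_rate * len(losses)))
    # Rank instance indices by increasing loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


# Hypothetical per-instance losses for a mini-batch of six instances.
batch_losses = [0.10, 2.30, 0.05, 1.70, 0.20, 0.15]
kept = select_small_loss(batch_losses, keep_rate=0.5)
print(kept)  # [0, 2, 5]
```

In this toy batch, the two instances with very large losses (indices 1 and 3) are the ones most likely to carry wrong labels, and both are dropped.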
Based on this idea, we can just use one network in Algorithm 1, and let the classifier evolve by itself. This process is similar to boosting freund1995desicion and active learning cohn1996active. However, it is commonly known that boosting and active learning are sensitive to outliers and noise, and a few wrongly selected instances can deteriorate the learning performance of the whole model freund1999short ; balcan2009agnostic. This connects with our second question, where two classifiers can help.
Intuitively, different classifiers can generate different decision boundaries and thus have different abilities to learn. Thus, when training on noisy labels, we also expect that they can have different abilities to filter out the label noise. This motivates us to exchange the selected small-loss instances, i.e., update parameters in $f$ (resp. $g$) using mini-batch instances selected from $g$ (resp. $f$). This process is similar to Cotraining blum1998combining, and these two networks will adaptively correct the training error by the peer network if the selected instances are not fully clean. Take “peer-review” as a supportive example. When students check their own exam papers, it is hard for them to find any error or bug because they have some personal bias for the answers. Luckily, they can ask peer classmates to review their papers. Then, it becomes much easier for them to find their potential faults. To sum up, as the error from one network will not be directly transferred back to itself, we can expect that our Coteaching method can deal with heavier noise compared with the self-evolving one.
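The exchange of selected instances between the two networks can be sketched as follows. This is only a schematic of the cross-update, assuming that the per-instance losses of both networks on the same mini-batch are already available; the names `coteaching_exchange`, `losses_f` and `losses_g` are hypothetical, and the actual gradient step is left out.

```python
def smallest_loss_indices(losses, num_keep):
    """Indices of the num_keep smallest losses, in increasing index order."""
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def coteaching_exchange(losses_f, losses_g, keep_rate):
    """Return (indices used to update f, indices used to update g).

    Each network picks its own small-loss instances, and the picks are
    then swapped: f is updated on g's picks and g on f's picks.
    """
    if len(losses_f) != len(losses_g):
        raise ValueError("both networks must score the same mini-batch")
    num_keep = max(1, int(keep_rate * len(losses_f)))
    picked_by_f = smallest_loss_indices(losses_f, num_keep)
    picked_by_g = smallest_loss_indices(losses_g, num_keep)
    return picked_by_g, picked_by_f


# Hypothetical losses of the two networks on a mini-batch of four instances.
losses_f = [0.2, 1.9, 0.1, 0.3]
losses_g = [0.4, 0.2, 0.1, 2.5]
update_f_on, update_g_on = coteaching_exchange(losses_f, losses_g, 0.5)
print(update_f_on)  # [1, 2]  (chosen by g)
print(update_g_on)  # [0, 2]  (chosen by f)
```

Because the two networks disagree on some instances (index 0 versus index 1 here), a biased pick made by one network is not fed straight back into that same network.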
Relations to Cotraining.
Although Coteaching is motivated by Cotraining, the only similarity is that two classifiers are trained. There are fundamental differences between them. (i) Cotraining needs two views (two independent sets of features), while Coteaching needs a single view. (ii) Cotraining does not exploit the memorization of deep neural networks, while Coteaching does. (iii) Cotraining is designed for semi-supervised learning (SSL), and Coteaching is for learning with noisy labels (LNL); as LNL is not a special case of SSL, we cannot simply translate Cotraining from one problem setting to the other.
4 Experiments
Datasets.
We verify the effectiveness of our approach on three benchmark datasets. MNIST, CIFAR10 and CIFAR100 are used here (Table 1), because these data sets are popularly used for evaluation of noisy labels in the literature goldberger2016training ; patrini2017making ; reed2014training .
Dataset  # of training  # of testing  # of class  image size
MNIST  60,000  10,000  10  28×28
CIFAR10  50,000  10,000  10  32×32
CIFAR100  50,000  10,000  100  32×32
Since all datasets are clean, following patrini2017making ; reed2014training, we need to corrupt these datasets manually by the noise transition matrix $Q$, where $Q_{ij} = \Pr(\tilde{y} = j \mid y = i)$ given that noisy $\tilde{y}$ is flipped from clean $y$. Assume that the matrix $Q$ has two representative structures (Figure 2): (1) Symmetry flipping van2015learning; (2) Pair flipping: a simulation of fine-grained classification with noisy labels, where labelers may make mistakes only within very similar classes. Their precise definition is in Appendix A.
Since this paper mainly focuses on the robustness of our Coteaching under extremely noisy supervision, the noise rate $\epsilon$ is chosen from $\{0.45, 0.5\}$. Intuitively, this means almost half of the instances have noisy labels. Note that a noise rate $\epsilon > 50\%$ for pair flipping means that over half of the training data have wrong labels, which cannot be learned without additional assumptions. As a side product, we also verify the robustness of Coteaching under low-level noisy supervision, where $\epsilon$ is set to $0.2$. Note that the pair case is much harder than the symmetry case. In Figure 2(a), the true class has only $10\%$ more correct instances than wrong ones. However, the true class has $37.5\%$ more correct instances in Figure 2(b).
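To make the two noise structures concrete, the sketch below builds both kinds of transition matrix and measures how far the true class stays ahead of the most likely wrong class. It assumes that pair flipping sends class $i$ to class $i+1$ (cyclically) and that symmetry flipping spreads the noise uniformly over the other classes; the five-class setting, the function names and the helper `margin` are illustrative assumptions.

```python
def pair_flip_matrix(num_classes, noise_rate):
    """Row i keeps class i with prob. 1 - noise_rate and flips it
    to class (i + 1) mod num_classes with prob. noise_rate."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    q = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        q[i][i] = 1.0 - noise_rate
        q[i][(i + 1) % num_classes] = noise_rate
    return q


def symmetry_flip_matrix(num_classes, noise_rate):
    """Row i keeps class i with prob. 1 - noise_rate and spreads
    noise_rate uniformly over the other num_classes - 1 classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_diagonal = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off_diagonal
             for j in range(num_classes)]
            for i in range(num_classes)]


def margin(q, i=0):
    """Prob. of the correct label minus the largest wrong-label prob. in row i."""
    largest_wrong = max(p for j, p in enumerate(q[i]) if j != i)
    return q[i][i] - largest_wrong


pair = pair_flip_matrix(5, 0.45)
symmetry = symmetry_flip_matrix(5, 0.5)
print(round(margin(pair), 3))      # 0.1
print(round(margin(symmetry), 3))  # 0.375
```

Under these assumptions the true class leads by only 0.1 in the pair case but by 0.375 in the symmetry case, which is one way to see why the pair case is the harder one.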
Baselines. We compare Coteaching (Algorithm 1) with the following state-of-the-art approaches: (i) Bootstrap reed2014training, which uses a weighted combination of predicted and original labels as the correct labels, and then does back propagation. Hard labels are used as they yield better performance; (ii) Smodel goldberger2016training, which uses an additional softmax layer to model the noise transition matrix; (iii) Fcorrection patrini2017making, which corrects the prediction by the noise transition matrix. As suggested by the authors, we first train a standard network to estimate the transition matrix; (iv) Decoupling malach2017decoupling, which updates the parameters using only the samples on which the two classifiers make different predictions; and (v) MentorNet jiang2017mentornet, in which an extra teacher network is pre-trained and then used to filter out noisy instances for its student network to learn robustly under noisy labels. Then, the student network is used for classification. We used self-paced MentorNet in this paper. (vi) As a simple baseline, we compare Coteaching with the standard deep network trained on noisy datasets (abbreviated as Standard). The above methods are systematically compared in Table 2. As can be seen, our Coteaching method does not rely on any specific network architecture, can deal with a large number of classes, and is more robust to noise. Besides, it can be trained from scratch. These properties make our Coteaching more appealing for practical usage. Our implementation of Coteaching is available at https://github.com/bhanML/Coteaching.
Property  Bootstrap  Smodel  Fcorrection  Decoupling  MentorNet  Coteaching
large class  ✗  ✗  ✗  ✓  ✓  ✓
heavy noise  ✗  ✗  ✗  ✗  ✓  ✓
flexibility  ✗  ✗  ✓  ✓  ✓  ✓
no pre-train  ✓  ✗  ✗  ✓  ✗  ✓
Network structure and optimizer.
For a fair comparison, we implement all methods with default parameters in PyTorch, and conduct all the experiments on an NVIDIA K80 GPU. A CNN is used with the Leaky-ReLU (LReLU) activation function maas2013rectifier, and the detailed architecture is in Table 3. Namely, the 9-layer CNN architecture in our paper follows “Temporal Ensembling” laine2016temporal and “Virtual Adversarial Training” miyato2016virtual, since the network structure we used here is a standard test bed for weakly-supervised learning. For all experiments, the Adam optimizer (momentum=0.9) is used with an initial learning rate of 0.001, the batch size is set to 128, and we run 200 epochs. Besides, dropout and batch-normalization are also used. As deep networks are highly non-convex, even with the same network and optimization method, different initializations can lead to different local optima. Thus, following malach2017decoupling, we also take two networks with the same architecture but different initializations as two classifiers.
CNN on MNIST  CNN on CIFAR10  CNN on CIFAR100
28×28 Gray Image  32×32 RGB Image  32×32 RGB Image
3×3 conv, 128 LReLU
3×3 conv, 128 LReLU
3×3 conv, 128 LReLU
2×2 max-pool, stride 2
dropout, p = 0.25
3×3 conv, 256 LReLU
3×3 conv, 256 LReLU
3×3 conv, 256 LReLU
2×2 max-pool, stride 2
dropout, p = 0.25
3×3 conv, 512 LReLU
3×3 conv, 256 LReLU
3×3 conv, 128 LReLU
avg-pool
dense 128→10  dense 128→10  dense 128→100
Experimental setup. Here, we assume the noise level $\epsilon$ is known and set $R(T) = 1 - \min\{\frac{T}{T_k}\tau, \tau\}$ with $T_k = 10$ and $\tau = \epsilon$. If $\epsilon$ is not known in advance, $\epsilon$ can be inferred using validation sets liu2016classification ; yu2018efficient. The choices of $R(T)$ and $\tau$ are analyzed in Section 4.2. Note that $R(T)$ only depends on the memorization effect of deep networks but not any specific datasets.
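For concreteness, the kept fraction can be computed per epoch as in the sketch below. It assumes the linear schedule $R(T) = 1 - \min\{\frac{T}{T_k}\tau, \tau\}$ with $T_k = 10$ and $\tau$ equal to the noise rate; the function and argument names are illustrative only.

```python
def keep_rate(epoch, tau, t_k=10):
    """Fraction of each mini-batch kept at a given epoch under the
    assumed schedule R(T) = 1 - min(T / T_k * tau, tau)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in (0, 1)")
    return 1.0 - min(epoch / t_k * tau, tau)


# With tau = 0.45 the kept fraction decays linearly over the first
# t_k epochs and then stays at 1 - tau for the rest of training.
for epoch in (0, 5, 10, 200):
    print(epoch, round(keep_rate(epoch, tau=0.45), 3))
```

The schedule drops nothing at the very start, drops more instances as the epochs go on, and never drops more than a fraction $\tau$ of a mini-batch.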
As for performance measurements, first, we use the test accuracy, i.e., test accuracy = (# of correct predictions) / (# of test instances). Besides, we also use the label precision in each mini-batch, i.e., label precision = (# of clean labels) / (# of all selected labels). Specifically, we sample $R(T)$ of small-loss instances in each mini-batch, and then calculate the ratio of clean labels in the small-loss instances. Intuitively, higher label precision means fewer noisy instances in the mini-batch after sample selection, and an algorithm with higher label precision is also more robust to the label noise. All experiments are repeated five times. The error bar for the standard deviation (STD) in each figure is shown as a shaded region. Besides, the full Y-axis versions of all figures are in Appendix B.
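The two measurements above can be written down directly. The sketch below is a plain restatement of the two ratios; the function names and the toy labels are assumptions for illustration, and the clean labels are of course only available in simulated experiments.

```python
def accuracy(predictions, labels):
    """(# of correct predictions) / (# of test instances)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def label_precision(selected, noisy_labels, clean_labels):
    """(# of clean labels) / (# of all selected labels), where a
    selected label is clean if it equals the ground-truth label."""
    if not selected:
        raise ValueError("no instance was selected")
    clean = sum(1 for i in selected if noisy_labels[i] == clean_labels[i])
    return clean / len(selected)


# Toy mini-batch: instances 1 and 3 carry corrupted labels.
noisy = [1, 0, 2, 2]
clean = [1, 1, 2, 0]
print(label_precision([0, 2, 3], noisy, clean))  # 2 of the 3 selected are clean
print(accuracy([1, 0, 2, 2], [1, 1, 2, 2]))      # 0.75
```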
4.1 Comparison with the State-of-the-Arts
Results on MNIST. Table 4 reports the accuracy on the testing set. As can be seen, in the symmetry case with a 20% noise rate, which is also the easiest case, all methods work well. Even Standard can achieve 94% test set accuracy. Then, when the noise rate rises to 50%, Standard, Bootstrap, Smodel and Fcorrection fail, and their accuracy drops below 80%. Methods based on “selected instances”, i.e., Decoupling, MentorNet and Coteaching, are better. Among them, Coteaching is the best. Finally, in the hardest case, i.e., the pair case with a 45% noise rate, Standard, Bootstrap and Smodel cannot learn anything. Their testing accuracy stays at the percentage of clean instances in the training dataset. Fcorrection fails totally, as it heavily relies on the correct estimation of the underlying transition matrix. Thus, when Standard works, Fcorrection can work better than Standard; when Standard fails, it works much worse than Standard. In this case, our Coteaching is again the best, and is also much better than the second-best method, i.e., 87.63% for Coteaching vs. 80.88% for MentorNet.
Flipping-Rate  Standard  Bootstrap  Smodel  Fcorrection  Decoupling  MentorNet  Coteaching
Pair45%  56.52% ± 0.55%  57.23% ± 0.73%  56.88% ± 0.32%  0.24% ± 0.03%  58.03% ± 0.07%  80.88% ± 4.45%  87.63% ± 0.21%
Symmetry50%  66.05% ± 0.61%  67.55% ± 0.53%  62.29% ± 0.46%  79.61% ± 1.96%  81.15% ± 0.03%  90.05% ± 0.30%  91.32% ± 0.06%
Symmetry20%  94.05% ± 0.16%  94.40% ± 0.26%  98.31% ± 0.11%  98.80% ± 0.12%  95.70% ± 0.02%  96.70% ± 0.22%  97.25% ± 0.03%
In Figure 3, we show test accuracy vs. number of epochs. In all three plots, we can clearly see the memorization effect of networks, i.e., the test accuracy of Standard first reaches a very high level and then gradually decreases. Thus, a good robust training method should stop or alleviate this decreasing process. On this point, all methods except Bootstrap work well in the easiest Symmetry20% case. However, only MentorNet and our Coteaching can cope with the other two harder cases, i.e., Pair45% and Symmetry50%. Besides, our Coteaching consistently achieves higher accuracy than MentorNet, and is the best method in these two cases.
To explain such good performance, we plot label precision vs. number of epochs in Figure 4. Only MentorNet, Decoupling and Coteaching are considered here, as they are the methods that perform instance selection during training. First, we can see that Decoupling fails to pick up clean instances, and its label precision is the same as that of Standard, which does not combat noisy labels at all. The reason is that Decoupling does not utilize the memorization effect during training. Then, we can see that Coteaching and MentorNet can successfully pick clean instances out. These two methods tie on the easier Symmetry50% and Symmetry20% cases, while our Coteaching achieves higher precision on the hardest Pair45% case. This shows our approach is better at finding clean instances.
Finally, note that while MentorNet and Coteaching tie in Figure 4(b) and (c), Coteaching still gets higher testing accuracy (Table 4). Recall that MentorNet is a self-evolving method, which only uses one classifier, while Coteaching uses two. The better accuracy comes from the fact that Coteaching further takes advantage of the different learning abilities of the two classifiers.
Results on CIFAR10. Test accuracy is shown in Table 5. As we can see, the observations here are consistent with those for the MNIST dataset. In the easiest Symmetry20% case, all methods work well. Fcorrection is the best, and our Coteaching is comparable with Fcorrection. Then, all methods except MentorNet and Coteaching fail on the harder cases, i.e., Pair45% and Symmetry50%. Between these two, Coteaching is the best. In the extreme Pair45% case, Coteaching is at least 14% higher than MentorNet in test accuracy.
Flipping-Rate  Standard  Bootstrap  Smodel  Fcorrection  Decoupling  MentorNet  Coteaching
Pair45%  49.50% ± 0.42%  50.05% ± 0.30%  48.21% ± 0.55%  6.61% ± 1.12%  48.80% ± 0.04%  58.14% ± 0.38%  72.62% ± 0.15%
Symmetry50%  48.87% ± 0.52%  50.66% ± 0.56%  46.15% ± 0.76%  59.83% ± 0.17%  51.49% ± 0.08%  71.10% ± 0.48%  74.02% ± 0.04%
Symmetry20%  76.25% ± 0.28%  77.01% ± 0.29%  76.84% ± 0.66%  84.55% ± 0.16%  80.44% ± 0.05%  80.76% ± 0.36%  82.32% ± 0.07%
Figure 5 shows test accuracy and label precision vs. number of epochs. Again, on test accuracy, we can see Coteaching strongly hinders neural networks from memorizing noisy labels. Thus, it works much better on the harder Pair45% and Symmetry50% cases. On label precision, while Decoupling fails to find clean instances, both MentorNet and Coteaching can do this. However, due to the usage of two classifiers, Coteaching is stronger.
Results on CIFAR100. Finally, we show our results on CIFAR100. The test accuracy is in Table 6. Test accuracy and label precision vs. number of epochs are in Figure 6. Note that there are only 10 classes in the MNIST and CIFAR10 datasets, whereas CIFAR100 has 100. Thus, overall the accuracy is much lower than in Tables 4 and 5. However, the observations are the same as for the previous datasets. We can clearly see that our Coteaching is the best on the harder and noisier cases.
Flipping-Rate  Standard  Bootstrap  Smodel  Fcorrection  Decoupling  MentorNet  Coteaching
Pair45%  31.99% ± 0.64%  32.07% ± 0.30%  21.79% ± 0.86%  1.60% ± 0.04%  26.05% ± 0.03%  31.60% ± 0.51%  34.81% ± 0.07%
Symmetry50%  25.21% ± 0.64%  21.98% ± 6.36%  18.93% ± 0.39%  41.04% ± 0.07%  25.80% ± 0.04%  39.00% ± 1.00%  41.37% ± 0.08%
Symmetry20%  47.55% ± 0.47%  47.00% ± 0.54%  41.51% ± 0.60%  61.87% ± 0.21%  44.52% ± 0.04%  52.13% ± 0.40%  54.23% ± 0.08%
4.2 Choices of $R(T)$ and $\tau$
Deep networks initially fit clean (easy) instances, and then fit noisy (hard) instances progressively. Thus, intuitively $R(T)$ should meet the following requirements: (i) $R(T)$ stays within a range determined by $\tau$, where $\tau$ depends on the noise rate $\epsilon$; (ii) $R(T) = 1$ at the beginning of training, which means we do not need to drop any instances at the beginning. At the initial learning epochs, we can safely update the parameters of deep neural networks using the entire noisy data, because the networks will not memorize the noisy data at the early stage arpit2017closer; (iii) $R(T)$ should be a non-increasing function of $T$, which means that we need to drop more instances when the number of epochs gets large. This is because as the learning proceeds, the networks will eventually try to fit noisy data (which tends to have larger losses compared to clean data). Thus, we need to ignore them by not updating the network parameters using large-loss instances arpit2017closer. The MNIST dataset is used in the sequel.
Based on the above principles, to show how the decay of $R(T)$ affects Coteaching, first, we let $R(T) = 1 - \tau \cdot \min\{T^c / T_k, 1\}$ with $\tau = \epsilon$, where three choices of $c$ are considered, i.e., $c \in \{0.5, 1, 2\}$. Then, three values of $T_k$ are considered, i.e., $T_k \in \{5, 10, 15\}$. Results are in Table 7. As can be seen, the test accuracy is stable across the choices of $c$ and $T_k$ here. The previous setup ($c = 1$ and $T_k = 10$) works well but does not lead to the best performance. To show the impact of $\tau$, we vary $\tau \in \{0.5, 0.75, 1, 1.25, 1.5\}\epsilon$. Note that $R(T)$ cannot be zero; in that case, no gradient would be back-propagated and the optimization would stop. Test accuracy is in Table 8. We can see that, with more dropped instances, the performance can be improved. However, if too many instances are dropped, networks may not get sufficient training data and the performance can deteriorate. We set $\tau = \epsilon$ in Section 4.1, which works well but does not necessarily lead to the best performance.
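Flipping-Rate  $c$  $T_k = 5$  $T_k = 10$  $T_k = 15$
Pair45%  $c = 0.5$  75.56% ± 0.33%  87.59% ± 0.26%  87.54% ± 0.23%
Pair45%  $c = 1$  88.43% ± 0.25%  87.56% ± 0.12%  87.93% ± 0.21%
Pair45%  $c = 2$  88.37% ± 0.09%  87.29% ± 0.15%  88.09% ± 0.17%
Symmetry50%  $c = 0.5$  91.75% ± 0.13%  91.75% ± 0.12%  92.20% ± 0.14%
Symmetry50%  $c = 1$  91.70% ± 0.21%  91.55% ± 0.08%  91.27% ± 0.13%
Symmetry50%  $c = 2$  91.74% ± 0.14%  91.20% ± 0.11%  91.38% ± 0.08%
Symmetry20%  $c = 0.5$  97.05% ± 0.06%  97.10% ± 0.06%  97.41% ± 0.08%
Symmetry20%  $c = 1$  97.33% ± 0.05%  96.97% ± 0.07%  97.48% ± 0.08%
Symmetry20%  $c = 2$  97.41% ± 0.06%  97.25% ± 0.09%  97.51% ± 0.05%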
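Flipping-Rate  $\tau = 0.5\epsilon$  $\tau = 0.75\epsilon$  $\tau = \epsilon$  $\tau = 1.25\epsilon$  $\tau = 1.5\epsilon$
Pair45%  66.74% ± 0.28%  77.86% ± 0.47%  87.63% ± 0.21%  97.89% ± 0.06%  69.47% ± 0.02%
Symmetry50%  75.89% ± 0.21%  82.00% ± 0.28%  91.32% ± 0.06%  98.62% ± 0.05%  79.43% ± 0.02%
Symmetry20%  94.94% ± 0.09%  96.25% ± 0.06%  97.25% ± 0.03%  98.90% ± 0.03%  99.39% ± 0.02%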
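The schedule variants studied in this subsection can be generated with a single helper, sketched below. It assumes the generalized form $R(T) = 1 - \tau \cdot \min\{T^c / T_k, 1\}$ with $\tau$ expressed as a multiple of the noise rate; the function name, the argument `tau_scale` and the default values are illustrative assumptions.

```python
def keep_rate_general(epoch, noise_rate, c=1.0, t_k=10, tau_scale=1.0):
    """Kept fraction under the assumed generalized schedule
    R(T) = 1 - tau * min(T**c / T_k, 1), with tau = tau_scale * noise_rate."""
    tau = tau_scale * noise_rate
    if not 0.0 < tau < 1.0:
        # tau >= 1 would eventually keep no instance at all.
        raise ValueError("tau must lie in (0, 1)")
    return 1.0 - tau * min(epoch ** c / t_k, 1.0)


# Decay shape: a larger exponent c delays the drop in early epochs
# but reaches the floor 1 - tau sooner.
for c in (0.5, 1.0, 2.0):
    print(c, [round(keep_rate_general(t, 0.45, c=c), 3) for t in (1, 3, 10)])

# Floor: a larger tau_scale drops a larger share of each mini-batch.
for scale in (0.5, 1.0, 1.5):
    print(scale, round(keep_rate_general(200, 0.45, tau_scale=scale), 3))
```

With `c=1` and `tau_scale=1` this reduces to the linear schedule used in the main experiments.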
5 Conclusion
This paper presents a simple but effective learning paradigm called Coteaching, which trains deep neural networks robustly under noisy supervision. Our key idea is to maintain two networks simultaneously, and cross-train them on instances screened by the “small loss” criterion. We conduct simulated experiments to demonstrate that our proposed Coteaching can train deep models robustly under extremely noisy supervision. In the future, we can extend our work in the following aspects. First, we can adapt the Coteaching paradigm to train deep models under other forms of weak supervision, e.g., positive and unlabeled data kiryo2017positive. Second, we would like to investigate theoretical guarantees for Coteaching. Previous theories for Cotraining are very hard to transfer to Coteaching, since our setting is fundamentally different. Besides, there is no analysis of generalization performance for deep learning with noisy labels. Thus, we leave the generalization analysis as future work.
Acknowledgments.
MS was supported by JST CREST JPMJCR1403. IWT was supported by ARC FT130100746, DP180100106 and LP150100671. BH would like to thank the financial support from RIKEN-AIP. XRY was supported by NSFC Project No. 61671481. QY would give special thanks to Weiwei Tu and Yuqiang Chen from 4Paradigm Inc. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
 [1] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
 [2] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. Kanwal, T. Maharaj, A. Fischer, A. Courville, and Y. Bengio. A closer look at memorization in deep networks. In ICML, 2017.
 [3] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
 [4] A. Blum, A. Kalai, and H. Wasserman. Noise-tolerant learning, the parity problem, and the statistical query model. Journal of the ACM, 50(4):506–519, 2003.
 [5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
 [6] O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
 [7] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
 [8] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In CVPR, 2013.
 [9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 [10] Y. Fan, F. Tian, T. Qin, J. Bian, and T. Liu. Learning to teach. In ICLR, 2018.
 [11] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European COLT, 1995.
 [12] Y. Freund, R. Schapire, and N. Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.
 [13] J. Goldberger and E. Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
 [14] C. Gong, D. Tao, J. Yang, and W. Liu. Teaching-to-learn and learning-to-teach for multi-label propagation. In AAAI, 2016.
 [15] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
 [16] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
 [17] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
 [18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [19] R. Kiryo, G. Niu, M. Du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NIPS, 2017.
 [20] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [21] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.
 [22] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and J. Li. Learning from noisy labels with distillation. In ICCV, 2017.
 [23] T. Liu and D. Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2016.
 [24] X. Ma, Y. Wang, M. Houle, S. Zhou, S. Erfani, S. Xia, S. Wijewickrema, and J. Bailey. Dimensionality-driven learning with noisy labels. In ICML, 2018.
 [25] A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
 [26] E. Malach and S. Shalev-Shwartz. Decoupling "when to update" from "how to update". In NIPS, 2017.
 [27] H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In NIPS, 2009.
 [28] A. Menon, B. Van Rooyen, C. Ong, and B. Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, 2015.
 [29] T. Miyato, A. Dai, and I. Goodfellow. Virtual adversarial training for semi-supervised text classification. In ICLR, 2016.
 [30] N. Natarajan, I. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS, 2013.
 [31] G. Patrini, A. Rozza, A. Menon, R. Nock, and L. Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
 [32] V. Raykar, S. Yu, L. Zhao, G. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
 [33] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.
 [34] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.
 [35] F. Rodrigues and F. Pereira. Deep learning from crowds. In AAAI, 2018.
 [36] T. Sanderson and C. Scott. Class proportion estimation with application to multiclass anomaly rejection. In AISTATS, 2014.
 [37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 [38] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
 [39] B. Van Rooyen, A. Menon, and B. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS, 2015.
 [40] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, 2017.
 [41] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia. Iterative learning with open-set noisy labels. In CVPR, 2018.
 [42] Y. Yan, R. Rosales, G. Fung, R. Subramanian, and J. Dy. Learning from multiple annotators with varying expertise. Machine Learning, 95(3):291–327, 2014.
 [43] X. Yu, T. Liu, M. Gong, K. Batmanghelich, and D. Tao. An efficient and provable approach for mixture proportion estimation using linear independence assumption. In CVPR, 2018.
 [44] X. Yu, T. Liu, M. Gong, and D. Tao. Learning with biased complementary labels. In ECCV, 2018.
 [45] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
Appendix A Definition of noise
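The definition of the transition matrix $Q$ is as follows, where $n$ is the number of classes.
Pair flipping:
$$Q = \begin{bmatrix} 1-\epsilon & \epsilon & 0 & \dots & 0 \\ 0 & 1-\epsilon & \epsilon & & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & & & 1-\epsilon & \epsilon \\ \epsilon & 0 & \dots & 0 & 1-\epsilon \end{bmatrix}$$
Symmetry flipping:
$$Q = \begin{bmatrix} 1-\epsilon & \frac{\epsilon}{n-1} & \dots & \frac{\epsilon}{n-1} & \frac{\epsilon}{n-1} \\ \frac{\epsilon}{n-1} & 1-\epsilon & \frac{\epsilon}{n-1} & \dots & \frac{\epsilon}{n-1} \\ \vdots & & \ddots & & \vdots \\ \frac{\epsilon}{n-1} & \dots & \frac{\epsilon}{n-1} & 1-\epsilon & \frac{\epsilon}{n-1} \\ \frac{\epsilon}{n-1} & \frac{\epsilon}{n-1} & \dots & \frac{\epsilon}{n-1} & 1-\epsilon \end{bmatrix}$$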