The recent success of deep neural networks has increased the need for high-quality labeled data. However, labelling data at this scale can be time-consuming and costly. A compromise is to resort to weakly-supervised annotations, using crowdsourcing platforms or trained classifiers that annotate the data automatically. These weakly-supervised annotations tend to be low-quality and noisy, which negatively affects the accuracy of high-capacity models due to memorization effects (Zhang et al., 2017). Thus, learning with noisy labels has drawn a lot of attention.
Early works on noisy labels studied random classification noise (RCN) for binary classification (Angluin & Laird, 1988; Kearns, 1993). In the RCN model, each instance has its label flipped with a fixed noise rate $\rho$. A natural extension of RCN is class-conditional noise (CCN) for multi-class classification (Stempfel & Ralaivola, 2009; Natarajan et al., 2013; Scott et al., 2013; Menon et al., 2015; van Rooyen & Williamson, 2015; Patrini et al., 2016) (Appendix A). In the CCN model, each instance from class $i$ has a fixed probability $T_{ij}$ of being assigned to class $j$. Thus, it is possible to encode some similarity information between classes. For example, we can expect that the image of a “dog” is more likely to be erroneously labelled as “cat” than “boat”.
To handle the CCN model, a common method is loss correction, which aims to correct the prediction or the loss of the classifier using an estimated noise transition matrix (Patrini et al., 2017; Sukhbaatar et al., 2015; Goldberger & Ben-Reuven, 2017; Ma et al., 2018). Another common approach is label correction, which aims to improve the label quality during training. For example, Reed et al. (2015) introduced a bootstrapping scheme. Similarly, Tanaka et al. (2018) proposed to update the weights of a classifier iteratively using noisy labels, and to use the updated classifier to yield higher-quality pseudo-labels for the training set. Although these methods have theoretical guarantees, they are unable to cope with real-world noise, e.g., instance-dependent noise (IDN).
The IDN model considers a more general noise (Manwani & Sastry, 2013; Ghosh et al., 2014; Menon et al., 2016; Cheng et al., 2017; Menon et al., 2018), where the probability that an instance is mislabeled depends on both its class and features. Intuitively, this noise is quite realistic, as poor-quality or ambiguous instances are more likely to be mislabeled in real-world datasets. However, it is much more complex to formulate the IDN model, since the probability of a mislabeled instance is a function of not only the label space but also the input space that can be very high dimensional.
|Du & Cai (2015)||✗||✗||✓|
|Menon et al. (2018)||✗||✓||✓|
|Bootkrajang & Chaijaruwanich (2018)||✗||✗||✓|
|Cheng et al. (2017)||✗||✓||✗|
As a result, several pioneering works have considered stronger assumptions on noise functions. However, stronger assumptions tend to restrict the utility of these works (Table 1). For instance, the boundary-consistent noise model considers stronger noise for samples closer to the decision boundary of the Bayesian optimal classifier (Du & Cai, 2015; Menon et al., 2018). However, such a model is restricted to binary classification and cannot estimate noise functions. Cheng et al. (2017) recently studied a particular case of the IDN model, where noise functions are upper-bounded. Nonetheless, their method is limited to binary classification and has only been tested on small datasets.
Instead of simplifying assumptions on noise functions, we propose to tackle the IDN model from the source, by considering confidence scores to be available for the label of each instance. We term this new setting confidence-scored instance-dependent noise (CSIDN, Figure 0(c)). The confidence scores denote how likely an instance is to be correctly labeled. Assuming that (i) confidence scores are available for each instance, (ii) transition probabilities to other classes are independent of the instance conditionally on the assigned label being erroneous, and (iii) a set of anchor points is available, we derive an instance-level forward correction algorithm which can fully estimate the transition probability for each instance, and subsequently train a robust classifier with a loss-correction method similar to that of Patrini et al. (2017).
Note that confidence scores can be easily and cheaply derived during the construction of the dataset. Namely, the class-posterior probabilities of the labels assigned by the classifier can be approximately seen as confidence scores, when the loss we use is classification-calibrated (Zhang et al., 2004; Bartlett et al., 2006) and proper composite (Reid & Williamson, 2010; Nock & Nielsen, 2009). For example, when training deep neural networks for multi-class classification, we commonly leverage the cross-entropy loss, which is classification-calibrated and proper composite (Gneiting & Raftery, 2007). Thus, the final-layer outputs of deep neural networks can be approximately seen as confidence scores.
To sum up, we first formulate instance-dependent noise in Section 2.1, and expose its robustness challenge in Section 2.2. Then, we explain our motivation to use confidence scores, and propose the confidence-scored instance-dependent noise (CSIDN) model in Section 2.3. Lastly, to handle this new noise model, we present the first practical algorithm termed instance-level forward correction in Section 3, and validate the proposed algorithm through extensive experiments in Section 4.
2 Tackling instance-dependent noise from the source
In this section, we present the IDN model along with the limitations of existing approaches, and introduce the CSIDN model as a tractable instance-dependent noise model.
2.1 Noise models: from class-conditional to instance-dependent noise
We formulate the problem of learning with noisy labels in this section. Let $D$ be the distribution of a pair of random variables $(X, Y) \in \mathcal{X} \times \{1, \dots, K\}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $K$ is the number of classes. In the classification task with noisy labels, we hope to train a classifier while having only access to samples from a noisy distribution $\bar{D}$ of random variables $(X, \bar{Y}) \in \mathcal{X} \times \{1, \dots, K\}$. Given a point $x$ sampled from $X$, $\bar{Y}$ is derived from the random variable $Y$ via a noise transition matrix $T(x) \in [0, 1]^{K \times K}$:
$$P(\bar{Y} = j \mid X = x) = \sum_{i=1}^{K} T_{ij}(x) \, P(Y = i \mid X = x). \tag{1}$$
Each noise function $T_{ij}$ is defined as $T_{ij}(x) = P(\bar{Y} = j \mid Y = i, X = x)$. In the class-conditional noise (CCN) model (Figure 0(a)), the transition matrix does not depend on the instance and the noise is entirely characterized by the $K^2$ constants $T_{ij}$. However, in the instance-dependent noise (IDN) model (Figure 0(b)), the transition matrix depends on the actual instance $x$. This tremendously complicates the problem, as the noise is now characterized by $K^2$ functions over the latent space $\mathcal{X}$, which can be very high dimensional (e.g., for an object recognition dataset).
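To make the role of the transition matrix concrete, the following minimal sketch computes the noisy class posterior as $P(\bar{Y} = j \mid x) = \sum_i T_{ij}(x) P(Y = i \mid x)$. The function name `noisy_posterior` and all numerical values are hypothetical and chosen for illustration only.

```python
import numpy as np

def noisy_posterior(clean_posterior, transition):
    """Return P(noisy label = j | x) = sum_i T[i, j] * P(true label = i | x)."""
    clean_posterior = np.asarray(clean_posterior, dtype=float)
    transition = np.asarray(transition, dtype=float)
    # Each row of a transition matrix is a distribution over noisy labels.
    assert np.allclose(transition.sum(axis=1), 1.0)
    return transition.T @ clean_posterior

# Hypothetical 3-class example (illustrative values, not from the paper).
p_clean = np.array([0.7, 0.2, 0.1])
T_ccn = np.array([[0.8, 0.2, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
p_noisy = noisy_posterior(p_clean, T_ccn)
print(p_noisy)
```

Under class-conditional noise the same matrix is reused for every instance; under instance-dependent noise, a different matrix would be passed for each instance, which is what makes estimation hard.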
2.2 Challenges from instance-dependent noise
Limitation of existing CCN methods.
Due to the complexity of the IDN model, most recent works in learning with noisy labels have focused on the CCN model (Figure 0(a)), which can be seen as a simplified IDN model (Figure 0(b)) free of feature information.
In addition to loss correction and label correction mentioned before, another method for the CCN model is sample selection, which aims to find reliable samples during training, such as the small-loss approaches (Jiang et al., 2018; Han et al., 2018). Inspired by the memorization in deep learning (Arpit et al., 2017), those methods first run a standard classifier on a noisy dataset, then select the small-loss samples for reliable training.
However, none of these approaches can handle the IDN model directly. Specifically, loss correction considers the noise model to be characterized by a fixed transition matrix, which does not include any instance-level information. Meanwhile, label correction is vulnerable to the IDN model, since the classifier will be much weaker on noisy regions and labels corrected by the current prediction would likely be erroneous. Similarly, sample selection is easily affected by the IDN model.
For example, in the small-loss approaches, instance-dependent noise functions can leave some regions of the input space clean and other regions very noisy (e.g., in an object recognition dataset, poor-quality pictures will tend to receive more noisy labels than high-quality ones). Since clean regions will tend to receive smaller losses than noisy regions, the small-loss approaches, which only train on the points with the smallest losses, will focus on clean regions and neglect harder noisy regions. Then, since the distribution of clean regions will subsequently be different from the global distribution, this will introduce a covariate shift (Shimodaira, 2000), which greatly degrades performance. Moreover, it is hard to use importance reweighting (Sugiyama et al., 2007) to alleviate the issue, since importance reweighting would require estimating the clean posterior probability, which is intractable for the IDN model.
Figure 1: Small-loss selection under instance-dependent noise. Small-loss instances are selected at each epoch based on the losses of the previous epoch, with a selection rate that decreases over epochs as described in Han et al. (2018). Figure 1(c) shows the density of the top small-loss instances selected after 10 epochs: since noisy regions are associated with higher losses, the network eventually tends to select instances from the clean region and neglect the noisy region. This leads to covariate shift, which is associated with decreased performance (Shimodaira, 2000).
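The selection bias described above can be illustrated with a toy sketch of small-loss selection. The helper `select_small_loss`, the `keep_rate` parameter, and the loss values are hypothetical; this is not the co-teaching implementation of Han et al. (2018), only the ranking step it relies on.

```python
import numpy as np

def select_small_loss(losses, keep_rate):
    """Return indices of the keep_rate fraction of samples with the smallest loss."""
    losses = np.asarray(losses, dtype=float)
    n_keep = max(1, int(round(keep_rate * len(losses))))
    return np.argsort(losses)[:n_keep]

# Toy per-sample losses: samples from the noisy region get larger losses.
losses = np.array([0.1, 2.3, 0.4, 1.9, 0.2, 1.5])
region = np.array(["clean", "noisy", "clean", "noisy", "clean", "noisy"])

kept = select_small_loss(losses, keep_rate=0.5)
print(region[kept])  # every retained sample comes from the clean region
```

In this toy case the retained subset contains no sample from the noisy region, so a classifier trained on it sees a different input distribution from the full dataset.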
Limitation of pioneer IDN methods.
The main challenge of the IDN model is the wide range of possible noise functions included in its formulation. Since each noise function $T_{ij}$ is a function over the high-dimensional input space $\mathcal{X}$, it is challenging for a model to be flexible enough to fit any real-world noise function while being trainable on corrupted datasets, let alone to derive theoretical results. Instead, various recent works have considered stronger assumptions on noise functions.
For instance, boundary-consistent noise (BCN), first introduced by Du & Cai (2015) and generalized in Menon et al. (2018), considers stronger noise for samples closer to the decision boundary of the Bayesian optimal classifier. This is a reasonable model for noise from human annotators, since “harder” instances (i.e., instances closer to the decision boundary) are more likely to be corrupted. Moreover, it is simple enough to derive some theoretical guarantees, as done in Menon et al. (2018). Additionally, an extension of the BCN model was studied in Bootkrajang & Chaijaruwanich (2018), where the noise function is a Gaussian mixture of the distance to the Bayesian optimal boundary. However, the BCN model and its extension are restricted to binary classification, and their geometry-based assumption becomes difficult to interpret for high-dimensional input spaces.
Furthermore, Cheng et al. (2017) recently studied a particular case of the IDN model, where the probabilities that the true labels of samples flip into corrupted ones have upper bounds. They proposed a method based on distilled samples, where noisy labels agree with the optimal Bayesian classifier on the clean distribution. However, their method is limited to binary classification and has only been tested on small UCI datasets. Table 1 summarizes the characteristics of those approaches.
2.3 Confidence-scored instance-dependent noise
Instead of simplifying assumptions on noise functions, we propose to tackle the IDN model from the source. Namely, we consider that, for each instance, we have access to a measure of confidence in the assigned label. As most noisy datasets arise from crowdsourcing or automatic annotation, such confidence scores can be easily derived during the dataset construction, often at no extra cost. This allows for a good approximation of noise functions with weaker assumptions.
Before introducing our proposed noise model, confidence-scored instance-dependent noise (CSIDN, Figure 0(c)), we first define confidence scores and explain why they are available in real-world applications.
Definition of confidence scores.
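For any data point $(x, \bar{y})$ sampled from the joint distribution of $(X, \bar{Y})$, we define the confidence score $r_x$ as follows.
$$r_x = P(Y = \bar{y} \mid \bar{Y} = \bar{y}, X = x). \tag{2}$$
Namely, $r_x$ is the probability that the assigned label is correct.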
Availability of confidence scores.
Our rationale is that in tasks involving instance-dependent noise, the confidence information can be easily derived at no extra cost. Specifically, the confidence information can be available in automatic annotation via a softmax output layer of deep neural networks. This layer outputs an estimate of the probability that each class is observed: when a calibrated model outputs a given class with probability 0.9, we expect the predicted class to be true 9 times out of 10 on average.
In theory, when a loss we use is classification-calibrated (Zhang et al., 2004; Bartlett et al., 2006) and proper composite (Reid & Williamson, 2010; Nock & Nielsen, 2009), the class-posterior probability of the assigned label can be approximately interpreted as a confidence measure that the label is correct. Therefore, for multi-class classification, when training deep neural networks via the cross-entropy loss, the final-layer outputs of deep neural networks can be approximately seen as confidence scores, since the cross-entropy loss is classification-calibrated and proper composite (Gneiting & Raftery, 2007).
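As a minimal sketch of how such scores could be read off an automatic annotator, the code below takes the softmax output of a classifier, assigns the most probable class as the label, and uses the corresponding probability as the confidence score. The function names and logits are hypothetical, and the scores are only meaningful to the extent that the classifier is calibrated.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def label_and_confidence(logits):
    """Assign the arg-max class as label and its probability as confidence."""
    probs = softmax(logits)
    return probs.argmax(axis=-1), probs.max(axis=-1)

# Hypothetical final-layer outputs for two instances and three classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.0]])
labels, confidences = label_and_confidence(logits)
print(labels, confidences)
```

The second instance has nearly uniform logits, so it receives a much lower confidence score than the first.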
CSIDN: a tractable instance-dependent noise model.
Recall the intrinsic difficulty of the IDN model: to fully characterize this noise, one would need to estimate $K^2$ functions over the input space $\mathcal{X}$. This is of course intractable with a finite noisy dataset, which is why pioneering solutions to the IDN model have so far been limited by very strong assumptions.
However, given additional confidence scores, one can wonder whether such information would make the IDN model tractable with less restrictive assumptions. Hence, we introduce a new and tractable instance-dependent noise model: confidence-scored instance-dependent noise (CSIDN, Figure 0(c)). In this noise model, the training data takes the form $S = \{(x_n, \bar{y}_n, r_{x_n})\}_{n=1}^{N}$, where $\bar{y}_n$ is the assigned label and $r_{x_n}$ is the previously defined confidence score in the assigned label of a given instance (Eq. (2)). The confidence information is decisive for robustness to instance-dependent noise, as it provides a proxy for the noise functions of the training data that are often intractable otherwise.
3 Benchmark solution for handling the CSIDN model
To tackle the CSIDN model, we propose a benchmark solution. Inspired by forward correction (Patrini et al., 2017) for the CCN model, we want to correct each prediction with the noise transition matrix $T(x)$. However, the transition matrix for the CSIDN model is instance-dependent, and has to be estimated for each instance $x$. We term our solution instance-level forward correction.
3.1 Estimating instance-dependent transition matrix
Using the confidence scores, we will first estimate the diagonal terms of the transition matrix, and then estimate the non-diagonal ones.
The diagonal terms of the transition matrix correspond to the probabilities that assigned labels are equal to true labels. However, the confidence scores available are only relevant to the class corresponding to the observed label. Therefore, we need to proceed differently whether the confidence scores are available for the considered class or not.
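First, note that for each sample $x \in S_i$, where $S_i$ denotes the set of training samples with assigned label $i$, the diagonal term $T_{ii}(x) = P(\bar{Y} = i \mid Y = i, X = x)$ can be derived for the most part from the confidence scores alone:
$$T_{ii}(x) = \frac{P(Y = i \mid \bar{Y} = i, X = x) \, P(\bar{Y} = i \mid X = x)}{P(Y = i \mid X = x)} = r_x \, \frac{P(\bar{Y} = i \mid X = x)}{P(Y = i \mid X = x)}.$$
In practice, we use an iterative procedure to estimate in turn $T_{ii}(x)$ and $P(Y = i \mid X = x)$ (see Section 3.2 for details). Then, for the rest of samples $x \in S \setminus S_i$, $r_x$ does not give any direct information on $T_{ii}(x)$. Hence, we simply set each function $T_{ii}$ as its empirical mean estimated using samples from $S_i$ at the current epoch:
$$T_{ii}(x) = \frac{1}{|S_i|} \sum_{x' \in S_i} T_{ii}(x'), \quad x \in S \setminus S_i,$$
where $|S_i|$ denotes the cardinality of $S_i$.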
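The two cases above (Bayes' rule on samples whose assigned label is the class of interest, and the empirical mean elsewhere) can be sketched as follows. This is a minimal illustration under stated assumptions: `estimate_diagonal` is a hypothetical name, the posteriors are given as arrays rather than produced by a network, the estimate is clipped to [0, 1], and an empty class falls back to 1.0. Neither the clipping nor the fallback is specified in the text.

```python
import numpy as np

def estimate_diagonal(conf, noisy_labels, noisy_post, clean_post, num_classes):
    """Return diag with diag[n, i] estimating the diagonal term for class i at sample n."""
    conf = np.asarray(conf, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    noisy_post = np.asarray(noisy_post, dtype=float)
    clean_post = np.asarray(clean_post, dtype=float)
    n = len(noisy_labels)
    idx = np.arange(n)
    # Bayes' rule for the assigned class: confidence * noisy posterior / clean posterior.
    ratio = noisy_post[idx, noisy_labels] / np.clip(clean_post[idx, noisy_labels], 1e-8, None)
    own = np.clip(conf * ratio, 0.0, 1.0)  # clipping is an assumption of this sketch
    diag = np.zeros((n, num_classes))
    for i in range(num_classes):
        in_class = noisy_labels == i
        # Samples with another assigned label receive the empirical mean over class i.
        diag[:, i] = own[in_class].mean() if in_class.any() else 1.0
        diag[in_class, i] = own[in_class]
    return diag

# Toy inputs: three samples, two classes (illustrative values only).
diag = estimate_diagonal(
    conf=[0.9, 0.6, 0.8],
    noisy_labels=[0, 0, 1],
    noisy_post=[[0.7, 0.3], [0.5, 0.5], [0.4, 0.6]],
    clean_post=[[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
    num_classes=2,
)
print(diag)
```

The third sample has assigned label 1, so its entry for class 0 is the mean of the two class-0 estimates.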
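For non-diagonal terms, we have, for $j \neq i$:
$$T_{ij}(x) = P(\bar{Y} = j \mid Y = i, X = x) = P(\bar{Y} = j \mid \bar{Y} \neq i, Y = i, X = x) \, P(\bar{Y} \neq i \mid Y = i, X = x),$$
where the last factor equals $1 - T_{ii}(x)$.
In Eq. (4), $P(\bar{Y} = j \mid \bar{Y} \neq i, Y = i, X = x)$ refers to the probability that an instance with true label $i$ has an observed label $j$, once we know that the observed label is different from the true one. Then, a reasonable assumption is that $P(\bar{Y} = j \mid \bar{Y} \neq i, Y = i, X = x) = P(\bar{Y} = j \mid \bar{Y} \neq i, Y = i)$: conditionally on the observed label being erroneous, the class transitions are not influenced by the instance $x$. In other words, the dependence on $x$ of the noise function only impacts the “magnitude” of the noise and not the class transitions.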
To illustrate this assumption, consider a crowdsourcing task of object recognition with adjacent classes that annotators can only differentiate through details that are more or less visible depending on the instance. For example, objects from a given class may have distinctive traits, but those can be more or less visible in the pictures. When those traits are present, the annotators can confidently predict the right class. Otherwise, they will make errors towards adjacent classes. In this case, the probability that the assigned label is wrong depends strongly on the instance (with distinctive traits being visible or not). Nonetheless, conditionally on the instance being corrupted, i.e., because those traits were not visible enough in the image, the transition probabilities to the adjacent classes are not influenced by the instance itself.
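With the previous assumption, we obtain $T_{ij}(x) = \alpha_{ij} \, (1 - T_{ii}(x))$ with $\alpha_{ij} = P(\bar{Y} = j \mid \bar{Y} \neq i, Y = i)$. This allows us to estimate the constants $\alpha_{ij}$ once, and derive the non-diagonal noise functions of $T(x)$ directly from our estimates of the diagonal noise functions (Eq. (3.1)).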
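Under this assumption, a full per-instance transition matrix follows from the diagonal terms and the instance-independent constants, with each off-diagonal entry equal to the constant for that class pair times one minus the diagonal term. A minimal sketch, where `build_transition` and the numerical values are hypothetical:

```python
import numpy as np

def build_transition(diag_x, alpha):
    """Assemble the transition matrix of one instance.

    diag_x[i] is the diagonal term for class i at this instance; alpha[i, j] is the
    probability of landing in class j given an error from class i (zero diagonal,
    rows summing to one).
    """
    diag_x = np.asarray(diag_x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # Off-diagonal entries: alpha_ij * (1 - T_ii(x)).
    transition = alpha * (1.0 - diag_x)[:, None]
    np.fill_diagonal(transition, diag_x)
    return transition

# Illustrative values for a 3-class problem.
alpha = np.array([[0.0, 0.75, 0.25],
                  [0.5, 0.0, 0.5],
                  [0.1, 0.9, 0.0]])
T_x = build_transition([0.9, 0.6, 0.8], alpha)
print(T_x)
```

Because each row of `alpha` sums to one, every row of the assembled matrix sums to one as well, so it remains a valid transition matrix.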
3.2 Overall algorithm: Instance-level forward correction
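Estimating the noisy and true posteriors.
To train a classifier with the instance-level forward correction method, we need to estimate both the noisy posterior $P(\bar{Y} = i \mid X = x)$ and the true posterior $P(Y = i \mid X = x)$ from Eq. (3.1), for all $x$. Firstly, the noisy posterior can be easily estimated by training a naive classifier on the noisy dataset. Secondly, the true posterior can be estimated using the output of the classifier at the previous epoch.
Therefore, we iteratively update the estimates of $T_{ii}(x)$ and $P(Y = i \mid X = x)$ with the following steps: 1) first, initialize the estimates and train a naive classifier on the noisy data to obtain an estimate of the noisy posterior $P(\bar{Y} \mid X)$. 2) Then, for each sample $x$, compute the transition matrix $T(x)$ and train the classifier for one epoch. 3) Next, for each sample $x \in S_i$, update $T_{ii}(x)$ using the new estimate of the true posterior. Then, we repeat steps 2) and 3) throughout training. In this way, for every epoch, each function $T_{ii}$ is estimated for the samples from $S_i$. Lastly, for the rest of samples with noisy label $j \neq i$, $T_{ii}(x)$ is estimated at each epoch using Eq. (4):
$$T_{ii}(x) = \frac{1}{|S_i|} \sum_{x' \in S_i} T_{ii}(x').$$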
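The computation of $\alpha_{ij}$ boils down to approximating non-diagonal terms of the transition matrix in the CCN model. As $\alpha_{ij} = P(\bar{Y} = j \mid \bar{Y} \neq i, Y = i)$ does not depend on the instance, we have, for $j \neq i$:
$$\alpha_{ij} = \frac{P(\bar{Y} = j \mid Y = i)}{1 - P(\bar{Y} = i \mid Y = i)}.$$
A simple and reliable way is to use anchor points, i.e., points for which we can know the true class almost surely. These points may be directly available when some training data has been curated, or they can be identified either theoretically as in Liu & Tao (2015) or heuristically as in Patrini et al. (2017). Having a set of class anchor points, with $x^i$ an anchor point of class $i$, we simply need to compute:
$$\alpha_{ij} = \frac{P(\bar{Y} = j \mid X = x^i)}{1 - P(\bar{Y} = i \mid X = x^i)}.$$
The two noisy posteriors can be estimated using the same classifier trained on the noisy distribution mentioned above. Thus, $\alpha_{ij}$ can be estimated as follows:
$$\hat{\alpha}_{ij} = \frac{\hat{P}(\bar{Y} = j \mid X = x^i)}{1 - \hat{P}(\bar{Y} = i \mid X = x^i)}.$$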
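The anchor-point estimate can be sketched as a row-wise normalization of the noisy posteriors evaluated at one anchor per class: the off-diagonal entries of each row are divided by one minus the diagonal entry. The function name `estimate_alpha` and the posterior values are hypothetical; using a single anchor per class and returning a zero row when an anchor shows no noise are assumptions of this sketch.

```python
import numpy as np

def estimate_alpha(noisy_post_at_anchors):
    """Estimate the error-conditional transition constants from anchor points.

    noisy_post_at_anchors[i, j] is the estimated probability of noisy label j
    at an anchor point whose true class is i.
    """
    posts = np.asarray(noisy_post_at_anchors, dtype=float)
    num_classes = posts.shape[0]
    alpha = np.zeros((num_classes, num_classes))
    for i in range(num_classes):
        error_mass = 1.0 - posts[i, i]
        if error_mass > 0.0:
            alpha[i] = posts[i] / error_mass
        alpha[i, i] = 0.0  # alpha is only defined for j != i
    return alpha

# Illustrative noisy posteriors at three anchor points (one per class).
anchors = np.array([[0.8, 0.15, 0.05],
                    [0.1, 0.7, 0.2],
                    [0.0, 0.1, 0.9]])
alpha_hat = estimate_alpha(anchors)
print(alpha_hat)
```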
Summary of the training procedure.
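At the core of such a training procedure is a forward-corrected loss in which each clean prediction is mapped to a noisy prediction through that instance's own transition matrix before being compared to the noisy label. The sketch below shows this loss with a cross-entropy criterion in NumPy. The function name and array layout are hypothetical, and the classifier outputs are passed in as arrays instead of being produced by a network.

```python
import numpy as np

def forward_corrected_nll(clean_probs, transitions, noisy_labels):
    """Mean negative log-likelihood of the noisy labels after forward correction.

    clean_probs: (N, K) predicted clean posteriors.
    transitions: (N, K, K) with transitions[n, i, j] the probability that true
                 class i is observed as class j for sample n.
    noisy_labels: (N,) observed labels.
    """
    clean_probs = np.asarray(clean_probs, dtype=float)
    transitions = np.asarray(transitions, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    # Predicted noisy posterior: sum_i p(i | x_n) * T_ij(x_n).
    noisy_probs = np.einsum("ni,nij->nj", clean_probs, transitions)
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return float(-np.log(np.clip(picked, 1e-12, None)).mean())

probs = np.array([[0.7, 0.3], [0.2, 0.8]])
labels = np.array([0, 1])
identity = np.tile(np.eye(2), (2, 1, 1))
symmetric = np.tile(np.array([[0.8, 0.2], [0.2, 0.8]]), (2, 1, 1))
print(forward_corrected_nll(probs, identity, labels))
print(forward_corrected_nll(probs, symmetric, labels))
```

With identity transition matrices the loss reduces to the usual cross-entropy, which is a convenient sanity check.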
We compare our instance-level forward correction (ILFC) method with four representative baselines: forward correction (FC) (Patrini et al., 2017), mean absolute error (MAE) (Ghosh et al., 2017), $L_q$-norm (LQ) (Zhang & Sabuncu, 2018) and co-teaching (CT) (Han et al., 2018). Details are shown in Appendix D. Note that the pioneering IDN methods do not apply to multi-class problems.
4.1 Synthetic dataset
Figure 2 shows the test accuracy of different methods on the synthetic dataset. Each experiment is repeated 5 times and we plot the confidence intervals of each curve. On low-level noise, all methods perform well (Figure 2(a)). On mild-level noise, both co-teaching and ILFC perform well and outperform other baselines (Figure 2(b)). On high-level noise, the performance of all the baselines collapses, whereas ILFC consistently maintains good performance (Figures 2(c) and 2(d)). More experiments are shown in Appendix B and F.
4.2 Real-world dataset
In order to corrupt labels from clean datasets such as SVHN and CIFAR10, we adopt the following procedure: (1) train a classifier on a small subset of the clean dataset; (2) using a small validation set, calibrate the classifier by selecting the temperature that minimizes the expected calibration error; (3) for each instance $x$, set the noisy label $\bar{y}$ and the confidence score $r_x$ from the softmax outputs of the calibrated classifier. With this process, we attempt to emulate the construction of a real-world dataset (Appendix G).
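Step (2) can be sketched as a grid search over temperatures that keeps the one with the lowest expected calibration error (ECE) on held-out data. The binning scheme, the grid, and the function names below are assumptions of this sketch; the text does not specify how the temperature search is carried out.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax over the last axis."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and confidence over confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (conf > low) & (conf <= high)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

def pick_temperature(logits, labels, grid):
    """Return the temperature in grid with the smallest ECE."""
    scores = [expected_calibration_error(softmax(logits, t), labels) for t in grid]
    return grid[int(np.argmin(scores))]

# Toy validation set: an over-confident classifier that is right 3 times out of 4.
val_logits = np.array([[4.0, 0.0]] * 4)
val_labels = np.array([0, 0, 0, 1])
best_t = pick_temperature(val_logits, val_labels, grid=[0.5, 1.0, 2.0, 4.0, 8.0])
print(best_t)
```

In this toy case the raw confidence is about 0.98 for an accuracy of 0.75, so a temperature above 1 is selected to soften the predictions.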
Figures 3(a) and 3(b) show the test accuracy on SVHN at a low and a high level of instance-dependent noise, respectively. At both noise levels, ILFC converges quickly and outperforms the other baselines. Figures 3(c) and 3(d) show the test accuracy on CIFAR10 at a low and a high level of instance-dependent noise, respectively. On low-level noise, all methods perform well. However, on high-level noise, ILFC converges quickly and outperforms the other baselines.
In this paper, we give an overview of label-noise learning from class-conditional noise (easier) to instance-dependent noise (harder). We explain why existing approaches cannot handle instance-dependent noise well, and try to address this challenge via confidence scores. Thus, we formally propose the confidence-scored instance-dependent noise (CSIDN) model. To tackle the CSIDN model, we design a practical algorithm termed instance-level forward correction (ILFC). In our experiments, ILFC outperforms existing methods, especially in the case of high-level noise. In future work, we would like to extend label correction and sample selection approaches with the confidence scores from the CSIDN model.
- Angluin & Laird (1988) Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
- Arpit et al. (2017) Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In ICML, 2017.
- Bartlett et al. (2006) Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Bootkrajang & Chaijaruwanich (2018) Jakramate Bootkrajang and Jeerayut Chaijaruwanich. Towards instance-dependent label noise-tolerant classification: a probabilistic approach. Pattern Analysis and Applications, pp. 1–17, 2018.
- Charoenphakdee et al. (2019) Nontawat Charoenphakdee, Jongyeong Lee, and Masashi Sugiyama. On Symmetric Losses for Learning from Corrupted Labels. ICML, 2019.
- Cheng et al. (2017) Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instance-and label-dependent label noise. stat, 1050:12, 2017.
- Du & Cai (2015) Jun Du and Zhihua Cai. Modelling class noise with symmetric and asymmetric distributions. In AAAI, 2015.
- Ghosh et al. (2014) Aritra Ghosh, Naresh Manwani, and P S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160, 2014.
- Ghosh et al. (2017) Aritra Ghosh, Himanshu Kumar, and P. S. Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, 2017.
- Gneiting & Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- Goldberger & Ben-Reuven (2017) Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
- Han et al. (2018) Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018.
- Ishida et al. (2018) Takashi Ishida, Gang Niu, and Masashi Sugiyama. Binary classification from positive-confidence data. In NeurIPS, 2018.
- Jiang et al. (2018) Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
- Kearns (1993) Michael Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing (STOC '93), 1993.
- Laine & Aila (2017) Samuli Laine and Timo Aila. Temporal Ensembling for Semi-Supervised Learning. In ICLR, 2017.
- Liu & Tao (2015) Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015.
- Ma et al. (2018) Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, 2018.
- Manwani & Sastry (2013) Naresh Manwani and P. S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43:1146–1151, 2013.
- Masnadi-shirazi & Vasconcelos (2009) Hamed Masnadi-shirazi and Nuno Vasconcelos. On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost. In NeurIPS, 2009.
- Menon et al. (2015) Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, pp. 125–134, 2015.
- Menon et al. (2016) Aditya Krishna Menon, Brendan Van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent corruption. arXiv preprint arXiv:1605.00751, 2016.
- Menon et al. (2018) Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent noise. Machine Learning, 107(8-10):1561–1595, September 2018.
- Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
- Natarajan et al. (2013) Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with Noisy Labels. In NeurIPS. 2013.
- Nock & Nielsen (2009) Richard Nock and Frank Nielsen. On the efficient minimization of classification calibrated surrogates. In NeurIPS, pp. 1201–1208, 2009.
- Patrini et al. (2016) Giorgio Patrini, Frank Nielsen, Richard Nock, and Marcello Carioni. Loss factorization, weakly supervised learning and label noise robustness. In ICML, 2016.
- Patrini et al. (2017) Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
- Raykar et al. (2009) Vikas C Raykar, Shipeng Yu, Linda H Zhao, Anna Jerebko, Charles Florin, Gerardo Hermosillo Valadez, Luca Bogoni, and Linda Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In ICML, 2009.
- Reed et al. (2015) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. ICLR, 2015.
- Reid & Williamson (2010) Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11(Sep):2387–2422, 2010.
- Scott et al. (2013) Clayton Scott, Gilles Blanchard, and Gregory Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pp. 489–511, 2013.
- Shen & Sanghavi (2019) Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In ICML, 2019.
- Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
- Snow et al. (2008) Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Ng. Cheap and fast – but is it good? evaluating non-expert annotations for natural language tasks. In EMNLP, 2008.
- Stempfel & Ralaivola (2009) Guillaume Stempfel and Liva Ralaivola. Learning SVMs from sloppily labeled data. In International Conference on Artificial Neural Networks, pp. 884–893, 2009.
- Sugiyama et al. (2007) Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. ICLR workshop, 2015.
- Tanaka et al. (2018) Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.
- Tarvainen & Valpola (2017) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, 2017.
- van Rooyen & Williamson (2015) Brendan van Rooyen and Robert C. Williamson. Learning in the Presence of Corruption. arXiv e-prints, art. arXiv:1504.00091, Mar 2015.
- Yan et al. (2010) Yan Yan, Rómer Rosales, Glenn Fung, Mark Schmidt, Gerardo Hermosillo, Luca Bogoni, Linda Moy, and Jennifer Dy. Modeling annotator expertise: Learning when everybody knows a bit of something. In AISTATS, 2010.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
- Zhang et al. (2018) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ICLR, 2018.
- Zhang et al. (2004) Tong Zhang et al. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004.
- Zhang & Sabuncu (2018) Zhilu Zhang and Mert Sabuncu. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In NeurIPS. 2018.
Appendix A Related works
Besides the works aforementioned, we survey other approaches to learning with noisy labels.
Various approaches propose to use a provably robust loss function in the learning process. In the case of class-dependent label noise, Natarajan et al. (2013) constructed an unbiased estimator of any loss function under the noisy distribution. Masnadi-shirazi & Vasconcelos (2009) introduced a robust non-convex loss. Recently, works on symmetric losses showed that such losses offer theoretical robustness results to various types of noise (Ghosh et al., 2017; Charoenphakdee et al., 2019). Motivated by the robustness to noise of the mean absolute error loss (MAE) shown in Ghosh et al. (2017), Zhang & Sabuncu (2018) introduced the generalized cross entropy loss, which allows for a trade-off between the efficient learning properties of the categorical cross-entropy (CCE) loss and the noise-robustness of MAE. Shen & Sanghavi (2019) introduced a trimmed loss with an iterative minimization process that allows for theoretical guarantees in the simpler setting of generalized linear models.
Learning with multiple noisy labels.
A closely related setting is learning from multiple noisy labels, where the aim is to predict an unknown ground-truth label from several labels, each referring to a noisy annotation. This setting can arise for example from crowdsourcing tasks; Snow et al. (2008) showed that using multiple non-expert annotators to train a classifier can be as effective as using gold standard annotations from experts. In Raykar et al. (2009), the authors derive a Bayesian approach to jointly learn the expertise of each annotator, the actual true label and the classifier. Yan et al. (2010) extend this Bayesian approach by considering that each annotator’s expertise varies across the input space. This setting differs from ours as it takes place before the aggregation of multiple annotations, which, for CSIDN, is only one way among others to obtain a confidence score for each noisy label.
Recently, several other regularization techniques have shown good robustness in weakly-supervised settings. The Temporal Ensembling (TE) method (Laine & Aila, 2017) labels some additional unlabeled instances using a consensus of predictions from models from previous epochs and with different regularizations and input augmentation conditions. Mean-teacher (MT) (Tarvainen & Valpola, 2017) instead uses predictions from a model obtained by averaging the weights of a set of models similar to TE, as using the prediction from a unique model is more efficient when a large amount of unlabeled data is available. Virtual Adversarial Training (Miyato et al., 2018) regularizes the network using a measure of local smoothness of the conditional label distribution given the input, defined as the robustness of the prediction to local adversarial perturbations in the input space. Introduced in Zhang et al. (2018), mixup trains a neural network on convex combinations of instance pairs and their respective labels, and has been shown to reduce the memorization of corrupted labels.
Appendix B Sensitivity analysis
In practice, the confidence scores obtained may not be accurate. Therefore, we run a sensitivity analysis to assess the robustness of ILFC: similarly to Ishida et al. (2018), we add zero-mean Gaussian noise with standard deviation $\sigma$ to each confidence score and clip the values between 0 and 1. Figure 5 shows the resulting performance on the synthetic dataset. ILFC shows good robustness to inaccurate confidence scores even with a high standard deviation on a highly noisy dataset.
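The perturbation used in this sensitivity analysis can be sketched in a few lines. The function name, the seed, and the score values are hypothetical.

```python
import numpy as np

def perturb_confidence(conf, sigma, rng):
    """Add zero-mean Gaussian noise to confidence scores and clip to [0, 1]."""
    conf = np.asarray(conf, dtype=float)
    noisy = conf + rng.normal(0.0, sigma, size=conf.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
scores = np.array([0.95, 0.60, 0.05, 1.00])
noisy_scores = perturb_confidence(scores, sigma=0.2, rng=rng)
print(noisy_scores)
```

Clipping keeps the perturbed values interpretable as probabilities, at the cost of piling up mass at 0 and 1 when the standard deviation is large.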
Appendix C Algorithm
Appendix D Baselines
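Introduced in Patrini et al. (2017), forward correction estimates a fixed transition matrix $T$ before training, and trains a classifier with the corrected loss $\ell(T^\top \hat{p}(y \mid x), \bar{y})$, where $\hat{p}(y \mid x)$ denotes the softmax output of the classifier.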
Mean absolute error loss.
Due to its symmetric property, the Mean Absolute Error (MAE) has been theoretically justified to be robust to label noise under assumptions (Ghosh et al., 2017). However, this loss is more difficult to train, especially on complex datasets.
Introduced in Zhang & Sabuncu (2018), the $L_q$ norm loss attempts to bring the best of both worlds between the CCE and the MAE loss: the CCE is easy to train, while the MAE is robust to label noise. The authors therefore define this loss using the negative Box-Cox transformation:
$$L_q(f(x), e_j) = \frac{1 - f_j(x)^q}{q},$$
so that the $L_q$ loss tends to the CCE when $q \to 0$ and to the MAE when $q \to 1$. In the following experiments, we set $q$ to the value suggested by the authors.
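The two limiting cases can be checked numerically with a small sketch of the loss, written here as $(1 - p^q)/q$ where $p$ is the predicted probability of the labeled class. The function name and the probability value are hypothetical.

```python
import numpy as np

def lq_loss(prob_labeled_class, q):
    """Negative Box-Cox transformation of the probability of the labeled class."""
    p = np.asarray(prob_labeled_class, dtype=float)
    return (1.0 - p ** q) / q

p = 0.3
near_cce = lq_loss(p, q=1e-6)  # approaches the cross-entropy -log(p) as q -> 0
at_mae = lq_loss(p, q=1.0)     # equals 1 - p, the MAE up to a constant factor
print(near_cce, -np.log(p), at_mae)
```

For a one-hot target, the MAE between the target and the prediction equals $2(1 - p)$, so the $q = 1$ case matches it up to a factor of two.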
Co-teaching (Han et al., 2018).
The co-teaching algorithm is a small-loss approach where two classifiers are trained in parallel. At each epoch, each classifier selects the instances with the smallest loss, and feeds them to the other network as a training set for the next iteration. This recent work has proved to be a leading benchmark in the field of noisy labels.
Appendix E Synthetic dataset
Figure 6 shows three synthetic datasets, which cover clean, IDN and CSIDN models.
Appendix F Decision boundaries
Figure 7 shows the decision boundaries of our approach versus the ones of a benchmark model, for different levels of noise. With high levels of noise, a model that does not include any instance-level modelling will degenerate around the most noisy region of the input space. On the other hand, our model successfully accounts for the high noise in this region and is able to keep consistent predictions.
Appendix G Examples of real-world datasets
For example, the method would be similar to constructing a dataset with images scraped from the web, and automatically labelling them from neighbouring text fields using a classifier such as a recurrent neural network. Then, a small subset of curated images could be used at the beginning of the process to calibrate the classifier, in order to make the predictions of the softmax output faithful to the confidence in each label. This way, we could construct a very large dataset at a very low cost that, while involving some instance-dependent noise, would be equipped with confidence information and could therefore be tackled with our proposed algorithm.