Consider a student with no knowledge of animals tasked with learning to classify whether a picture contains a dog. A teacher shows the student example pictures of lone four-legged animals, stating whether the image contains a dog or not. Unfortunately, the teacher may often make mistakes, asymmetrically, with a significantly large false positive rate and a significantly large false negative rate (the noise rates $\rho_1$ and $\rho_0$ defined in Section 2). The teacher may also include “white noise” images with a uniformly random label. This information is unknown to the student, who only knows of the images and corrupted labels, but suspects that the teacher may make mistakes. Can the student (1) estimate the mistake rates, $\rho_1$ and $\rho_0$, (2) learn to classify pictures with dogs accurately, and (3) do so efficiently (e.g. less than an hour for 50 images)? This allegory clarifies the challenges of $\tilde{P}\tilde{N}$ learning for any classifier trained with corrupted labels, perhaps with intermixed noise examples. We elect the notation $\tilde{P}\tilde{N}$ to emphasize that both the positive and negative sets may contain mislabeled examples, reserving $P$ and $N$ for uncorrupted sets.
This example illustrates a fundamental reliance of supervised learning on training labels (Michalski et al., 1986). Traditional learning performance degrades monotonically with label noise (Aha et al., 1991; Nettleton et al., 2010), necessitating semi-supervised approaches (Blanchard et al., 2010). Examples of noisy datasets are medical (Raviv & Intrator, 1996), human-labeled (Paolacci et al., 2010), and sensor (Lane et al., 2010) datasets. Our fundamental goal is to uncover the same classifications as if the data were not mislabeled.
Towards this goal, we introduce Rank Pruning (open-source and available at https://github.com/cgnorthcutt/rankpruning), an algorithm for $\tilde{P}\tilde{N}$ learning composed of two sequential parts: (1) estimation of the asymmetric noise rates $\rho_1$ and $\rho_0$ and (2) removal of mislabeled examples prior to training. The fundamental mantra of Rank Pruning is learning with confident examples, i.e. examples with a predicted probability of being positive near 1 when the label is positive or near 0 when the label is negative. If we imagine non-confident examples as a noise class, separate from the confident positive and negative classes, then their removal should unveil a subset of the uncorrupted data.
An ancillary mantra of Rank Pruning is removal by rank, which elegantly exploits ranking without sorting. Instead of pruning non-confident examples by predicted probability, we estimate the number of mislabeled examples in each class. We then remove the $k^{th}$-most or $k^{th}$-least examples, ranked by predicted probability, via the BFPRT algorithm (Blum et al., 1973) in $O(n)$ time, where $n$ is the number of training examples. Removal by rank mitigates sensitivity to probability estimation and exploits the reduced complexity of learning to rank over probability estimation (Menon et al., 2012). Together, learning with confident examples and removal by rank enable robustness, i.e. invariance to erroneous input deviation.
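To make "ranking without sorting" concrete, the following is a minimal sketch of linear-time selection in the spirit of BFPRT (median of medians), followed by a rank-based filter. It is a generic textbook version written for illustration, not the authors' implementation; the function names `median_of_medians_select` and `prune_lowest` are hypothetical.

```python
def median_of_medians_select(values, k):
    """Return the k-th smallest element (0-indexed) in worst-case linear time."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Median of each group of five, then the median of those medians as pivot.
        groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
        medians = [g[len(g) // 2] for g in groups]
        pivot = median_of_medians_select(medians, len(medians) // 2)
        lower = [v for v in items if v < pivot]
        equal_count = sum(1 for v in items if v == pivot)
        if k < len(lower):
            items = lower
        elif k < len(lower) + equal_count:
            return pivot
        else:
            k -= len(lower) + equal_count
            items = [v for v in items if v > pivot]


def prune_lowest(probs, num_to_remove):
    """Indices kept after dropping the num_to_remove lowest-probability examples.

    Only a threshold is computed; the probabilities are never fully sorted.
    Ties at the threshold are kept, so slightly fewer examples may be removed.
    """
    if num_to_remove <= 0:
        return list(range(len(probs)))
    if num_to_remove >= len(probs):
        return []
    threshold = median_of_medians_select(probs, num_to_remove)
    return [i for i, p in enumerate(probs) if p >= threshold]
```

Selecting a threshold and filtering against it touches each example a constant number of times, which is why pruning adds only linear cost on top of fitting the classifier.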
Beyond prediction, confident examples help estimate $\rho_1$ and $\rho_0$. Typical approaches require averaging predicted probabilities on a holdout set (Liu & Tao, 2016; Elkan & Noto, 2008), tying noise estimation to the accuracy of the predicted probabilities, which in practice may be confounded by added noise or poor model selection. Instead, we estimate $\rho_1$ and $\rho_0$ as a fraction of the predicted counts of confident examples in each class, encouraging robustness to variation in probability estimation.
1.1 Related Work
Rank Pruning bridges framework, nomenclature, and application across PU and $\tilde{P}\tilde{N}$ learning. In this section, we consider the contributions of Rank Pruning in both.
Positive-unlabeled (PU) learning is a binary classification task in which a subset of positive training examples are labeled, and the rest are unlabeled. For example, co-training (Blum & Mitchell, 1998; Nigam & Ghani, 2000) with labeled and unlabeled examples can be framed as a PU learning problem by assigning all unlabeled examples the label ‘0’. PU learning methods often assume corrupted negative labels for the unlabeled examples $U$ such that PU learning is $\tilde{P}\tilde{N}$ learning with no mislabeled examples in the labeled positive set $P$, hence their naming conventions.
Early approaches to PU learning used [...] (..., 2003) and biased SVM (Liu et al., 2003) to penalize more when positive examples are predicted incorrectly. Bagging SVM (Mordelet & Vert, 2014) and RESVM (Claesen et al., 2015) extended biased SVM to instead use an ensemble of classifiers trained by resampling $\tilde{N}$ (and $\tilde{P}$ for RESVM) to improve robustness (Breiman, 1996). RESVM claims state-of-the-art for PU learning, but is impractically inefficient for large datasets because it requires optimization of five parameters and suffers from the pitfalls of SVM model selection (Chapelle & Vapnik, 1999). Elkan & Noto (2008) introduce a formative time-efficient probabilistic approach (denoted Elk08) for PU learning that directly estimates $1 - \rho_1$ by averaging predicted probabilities of a holdout set and dividing all predicted probabilities by $1 - \rho_1$. On the SwissProt database, Elk08 was 621 times faster than biased SVM, which requires optimizing only two parameters. However, Elk08 noise rate estimation is sensitive to inexact probability estimation, and both RESVM and Elk08 assume that the labeled positive set contains no mislabeled examples and do not generalize to $\tilde{P}\tilde{N}$ learning. Rank Pruning leverages Elk08 to initialize $\rho_1$, but then re-estimates $\rho_1$ using confident examples for both robustness (RESVM) and efficiency (Elk08).
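Table 1: Variable definitions.
|Variable||Conditional||Description||Domain||Related Work|
|$\rho_0$||$P(s=1 \mid y=0)$||Fraction of $N$ examples mislabeled as positive||$\tilde{P}\tilde{N}$||Liu|
|$\rho_1$||$P(s=0 \mid y=1)$||Fraction of $P$ examples mislabeled as negative||$\tilde{P}\tilde{N}$, PU||Liu, Claesen|
|$\pi_0$||$P(y=1 \mid s=0)$||Fraction of mislabeled examples in $\tilde{N}$||$\tilde{P}\tilde{N}$||Scott|
|$\pi_1$||$P(y=0 \mid s=1)$||Fraction of mislabeled examples in $\tilde{P}$||$\tilde{P}\tilde{N}$||Scott|
|$c = 1 - \rho_1$||$P(s=1 \mid y=1)$||Fraction of correctly labeled $P$ if $P(y=1 \mid s=1) = 1$||PU||Elkan|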
|Related Work||Noise||Any Prob.||Prob Estim.||Time||Theory||Added|
Elkan & Noto (2008)
|Claesen et al. (2015)||✓||✓|
|Scott et al. (2013)||✓||✓||✓||✓|
|Natarajan et al. (2013)||✓||✓||✓||✓||✓||✓|
|Liu & Tao (2016)||✓||✓||✓||✓||✓|
Theoretical approaches for $\tilde{P}\tilde{N}$ learning often have two steps: (1) estimate the noise rates, $\rho_1$, $\rho_0$, and (2) use $\rho_1$, $\rho_0$ for prediction. To our knowledge, Rank Pruning is the only time-efficient solution for the open problem (Liu & Tao, 2016; Yang et al., 2012) of noise estimation.
We first consider relevant work in noise rate estimation. Scott et al. (2013) established a lower bound method for estimating the inversed noise rates $\pi_1$ and $\pi_0$ (defined in Table 1). However, the method can be intractable due to unbounded convergence and assumes that the positive and negative distributions are mutually irreducible. Under additional assumptions, Scott (2015) proposed a time-efficient method for noise rate estimation, but with reported poor performance (Liu & Tao, 2016). Liu & Tao (2016) used the minimum predicted probabilities as the noise rates, which often yields futile estimates of min $= 0$. Natarajan et al. (2013) provide no method for estimation and view the noise rates as parameters optimized with cross-validation, inducing a trade-off between accuracy and efficiency. In comparison, Rank Pruning noise rate estimation is time-efficient, consistent in ideal conditions, and robust to imperfect probability estimation.
Natarajan et al. (2013) developed two methods for prediction in the $\tilde{P}\tilde{N}$ setting which modify the loss function. The first method constructs an unbiased estimator of the loss function for the true distribution from the noisy distribution, but the estimator may be non-convex even if the original loss function is convex. If the classifier’s loss function cannot be modified directly, this method requires splitting each example in two with class-conditional weights and ensuring split examples are in the same batch during optimization. For these reasons, we instead compare Rank Pruning with their second method (Nat13), which constructs a label-dependent loss function such that for 0-1 loss, the minimizers of Nat13’s risk and the risk for the true distribution are equivalent.
Liu & Tao (2016) generalized Elk08 to the $\tilde{P}\tilde{N}$ learning setting by modifying the loss function with per-example importance reweighting (Liu16), but the reweighting terms are derived from predicted probabilities, which may be sensitive to inexact estimation. To mitigate sensitivity, Liu & Tao (2016) examine the use of density ratio estimation (Sugiyama et al., 2012). Instead, Rank Pruning mitigates sensitivity by learning from confident examples selected by rank order, not predicted probability. For fairness of comparison across methods, we compare Rank Pruning with their probability-based approach.
Assuming perfect estimation of $\rho_1$ and $\rho_0$, we, Natarajan et al. (2013), and Liu & Tao (2016) all prove that the expected risk for the modified loss function is equivalent to the expected risk for the perfectly labeled dataset. However, both Natarajan et al. (2013) and Liu & Tao (2016) effectively “flip” example labels in the construction of their loss function, providing no benefit for added random noise. In comparison, Rank Pruning will also remove added random noise because noise drawn from a third distribution is unlikely to appear confidently positive or negative. Table 2 summarizes our comparison of $\tilde{P}\tilde{N}$ and PU learning methods.
Procedural efforts have improved robustness to mislabeling in the context of machine vision (Xiao et al., 2015) and face recognition (Angelova et al., 2005). Though promising, these methods are restricted in theoretical justification and generality, motivating the need for Rank Pruning.
In this paper, we describe the Rank Pruning algorithm for binary classification with imperfectly labeled training data. In particular, we:
Develop a robust, time-efficient, general solution for both $\tilde{P}\tilde{N}$ learning, i.e. binary classification with noisy labels, and estimation of the fraction of mislabeling in both the positive and negative training sets.
Introduce the learning with confident examples mantra as a new way to think about robust classification and estimation with mislabeled training data.
Prove that under assumptions, Rank Pruning achieves perfect noise estimation and equivalent expected risk as learning with correct labels. We provide closed-form solutions when those assumptions are relaxed.
Demonstrate that Rank Pruning performance generalizes across the number of training examples, feature dimension, fraction of mislabeling, and fraction of added noise examples drawn from a third distribution.
Improve the state-of-the-art of $\tilde{P}\tilde{N}$ learning across F1 score, AUC-PR, and Error. In many cases, Rank Pruning achieves nearly the same F1 score as learning with correct labels when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples.
2 Framing the $\tilde{P}\tilde{N}$ Learning Problem
In this section, we formalize the foundational definitions, assumptions, and goals of the $\tilde{P}\tilde{N}$ learning problem illustrated by the student-teacher motivational example.
Given $n$ observed training examples $x \in \mathbb{R}^D$ with associated observed corrupted labels $s \in \{0, 1\}$ and unobserved true labels $y \in \{0, 1\}$, we seek a binary classifier $f$ that estimates the mapping $x \to y$. Unfortunately, if we fit the classifier using observed $(x, s)$ pairs, we estimate the mapping $x \to s$ and obtain $g(x) = P(\hat{s} = 1 \mid x)$.
We define the observed noisy positive and negative sets as $\tilde{P} = \{x \mid s = 1\}$, $\tilde{N} = \{x \mid s = 0\}$ and the unobserved true positive and negative sets as $P = \{x \mid y = 1\}$, $N = \{x \mid y = 0\}$. Define the hidden training data as $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, drawn i.i.d. from some true distribution $\mathcal{D}$. We assume that a class-conditional Classification Noise Process (CNP) (Angluin & Laird, 1988) maps true labels $y$ to observed labels $s$ such that each label in $P$ is flipped independently with probability $\rho_1$ and each label in $N$ is flipped independently with probability $\rho_0$ ($s \leftarrow CNP(y, \rho_1, \rho_0)$). The resulting observed, corrupted dataset is $D_\rho = \{(x_1, s_1), \dots, (x_n, s_n)\}$. Therefore, $(s \perp x) \mid y$ and $P(s \mid y, x) = P(s \mid y)$. In recent work, CNP is referred to as the random classification noise (RCN) model (Liu & Tao, 2016; Natarajan et al., 2013).
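The class-conditional noise process described above can be simulated in a few lines. This is a minimal sketch under the assumption that $\rho_1$ is the flip probability for true positives and $\rho_0$ the flip probability for true negatives; the function name `corrupt_labels` and its `seed` parameter are hypothetical and introduced only for illustration.

```python
import random


def corrupt_labels(true_labels, rho1, rho0, seed=0):
    """Flip each true label independently: positives w.p. rho1, negatives w.p. rho0."""
    if not (0 <= rho1 and 0 <= rho0 and rho1 + rho0 < 1):
        raise ValueError("noise rates must be non-negative with rho1 + rho0 < 1")
    rng = random.Random(seed)
    noisy = []
    for y in true_labels:
        flip_prob = rho1 if y == 1 else rho0
        # The flip depends only on the true label y, never on the features x.
        noisy.append(1 - y if rng.random() < flip_prob else y)
    return noisy
```

For example, `corrupt_labels([1] * 20000 + [0] * 20000, rho1=0.3, rho0=0.1)` flips roughly 30% of the positive labels and 10% of the negative labels.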
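The noise rate $\rho_1 = P(s = 0 \mid y = 1)$ is the fraction of $P$ examples mislabeled as negative and the noise rate $\rho_0 = P(s = 1 \mid y = 0)$ is the fraction of $N$ examples mislabeled as positive. Note that $\rho_1 + \rho_0 < 1$ is a necessary condition, otherwise more examples would be mislabeled than labeled correctly. Thus, $\rho_0 < 1 - \rho_1$. We elect a subscript of “0” to refer to the negative set and a subscript of “1” to refer to the positive set. Additionally, let $p_{s1} = P(s = 1)$ be the fraction of corrupted labels that are positive and $p_{y1} = P(y = 1)$ be the fraction of true labels that are positive. It follows by Bayes' rule that the inversed noise rates are $\pi_1 = P(y = 0 \mid s = 1) = \frac{\rho_0 (1 - p_{y1})}{p_{s1}}$ and $\pi_0 = P(y = 1 \mid s = 0) = \frac{\rho_1 p_{y1}}{1 - p_{s1}}$. Combining these relations, given any pair in $\{(\rho_0, \rho_1), (\rho_1, \pi_1), (\rho_0, \pi_0), (\pi_0, \pi_1)\}$, the remaining two and $p_{y1}$ are known.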
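As a worked illustration of how these quantities determine one another, the sketch below recovers the inversed noise rates and the true positive fraction from the noise rates and the observed positive fraction. It assumes the notation $\rho_1 = P(s=0 \mid y=1)$, $\rho_0 = P(s=1 \mid y=0)$, $\pi_1 = P(y=0 \mid s=1)$, $\pi_0 = P(y=1 \mid s=0)$, and uses the identity $p_{s1} = (1-\rho_1)\,p_{y1} + \rho_0\,(1 - p_{y1})$ followed by Bayes' rule. The function name `inverse_noise_rates` is hypothetical.

```python
def inverse_noise_rates(rho1, rho0, ps1):
    """Return (pi1, pi0, py1) given noise rates and the observed positive fraction ps1."""
    if not rho1 + rho0 < 1:
        raise ValueError("requires rho1 + rho0 < 1")
    if not 0 < ps1 < 1:
        raise ValueError("requires 0 < ps1 < 1")
    # Solve ps1 = (1 - rho1) * py1 + rho0 * (1 - py1) for the true positive fraction.
    py1 = (ps1 - rho0) / (1 - rho1 - rho0)
    # Bayes' rule: P(y=0 | s=1) and P(y=1 | s=0).
    pi1 = rho0 * (1 - py1) / ps1
    pi0 = rho1 * py1 / (1 - ps1)
    return pi1, pi0, py1
```

With $\rho_1 = 0.25$, $\rho_0 = 0.1$ and a true positive fraction of $0.4$, the observed positive fraction is $0.75 \cdot 0.4 + 0.1 \cdot 0.6 = 0.36$, and the function returns $\pi_1 = 1/6$, $\pi_0 = 0.15625$, $p_{y1} = 0.4$.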
We consider five levels of assumptions for $P$, $N$, and $g$:
Perfect Condition: $g$ is a “perfect” probability estimator iff $g(x) = g^*(x)$ where $g^*(x) = P(s = 1 \mid x)$. Equivalently, let $g(x) = P(s = 1 \mid x) + \Delta g(x)$. Then $g(x)$ is “perfect” when $\Delta g(x) = 0$ and “imperfect” when $\Delta g(x) \neq 0$. $g$ may be imperfect due to the method of estimation or due to added uniformly randomly labeled examples drawn from a third noise distribution.
Non-overlapping Condition: $P$ and $N$ have “non-overlapping support” if $P(y = 1 \mid x) = \mathbb{1}[[y = 1]]$, where the indicator function $\mathbb{1}[[a]]$ is 1 if $a$ is true, else 0.
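Ideal Condition: $g$ is “ideal” when both perfect and non-overlapping conditions hold and $(s \perp x) \mid y$ such that
$$g(x) = g^*(x) = P(s=1 \mid x) = P(s=1 \mid y=1, x)\,P(y=1 \mid x) + P(s=1 \mid y=0, x)\,P(y=0 \mid x) = (1 - \rho_1) \cdot \mathbb{1}[[y=1]] + \rho_0 \cdot \mathbb{1}[[y=0]] \tag{1}$$
Eq. (1) is first derived in (Elkan & Noto, 2008).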
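Range Separability Condition: $g$ range separates $P$ and $N$ iff $\forall x_1 \in P$ and $\forall x_2 \in N$, we have $g(x_1) > g(x_2)$.
Unassuming Condition: $g$ is “unassuming” when perfect and/or non-overlapping conditions may not be true.
Their relationship is: Unassuming $\supset$ Range Separability $\supset$ Ideal $=$ Perfect $\cap$ Non-overlapping.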
We can now state the two goals of Rank Pruning for $\tilde{P}\tilde{N}$ learning. Goal 1 is to perfectly estimate $\rho_1$ and $\rho_0$ when $g$ is ideal. When $g$ is not ideal, to our knowledge perfect estimation of $\rho_1$ and $\rho_0$ is impossible and at best Goal 1 is to provide exact expressions for $\hat{\rho}_1$ and $\hat{\rho}_0$ w.r.t. $\rho_1$ and $\rho_0$. Goal 2 is to use $\hat{\rho}_1$ and $\hat{\rho}_0$ to uncover the classifications of $f$ from $g$. Both tasks must be accomplished given only observed $(x, s)$ pairs. $y$, $\rho_1$, $\rho_0$, $\pi_1$, and $\pi_0$ are hidden.
3 Rank Pruning
We develop the Rank Pruning algorithm to address our two goals. In Section 3.1, we propose a method for noise rate estimation and prove consistency when $g$ is ideal. An estimator is “consistent” if it achieves perfect estimation in the expectation of infinite examples. In Section 3.2, we derive exact expressions for $\hat{\rho}_1$ and $\hat{\rho}_0$ when $g$ is unassuming. In Section 3.3, we provide the entire algorithm, and in Section 3.5, prove that Rank Pruning has equivalent expected risk as learning with uncorrupted labels for both ideal $g$ and non-ideal $g$ with weaker assumptions. Throughout, we assume $n \to \infty$ so that $P$ and $N$ are the hidden distributions, each with infinite examples. This is a necessary condition for Theorems 2, 4 and Lemmas 1, 3.
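3.1 Deriving Noise Rate Estimators $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$
We propose the confident counts estimators $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$ to estimate $\rho_1$ and $\rho_0$ as a fraction of the predicted counts of confident examples in each class, encouraging robustness to variation in probability estimation. To estimate $\rho_1 = P(s = 0 \mid y = 1)$, we count the number of examples with label $s = 0$ that we are “confident” have label $y = 1$ and divide it by the total number of examples that we are “confident” have label $y = 1$. More formally,
$$\hat{\rho}_1^{conf} := \frac{|\tilde{N}_{y=1}|}{|\tilde{N}_{y=1}| + |\tilde{P}_{y=1}|}, \qquad \hat{\rho}_0^{conf} := \frac{|\tilde{P}_{y=0}|}{|\tilde{P}_{y=0}| + |\tilde{N}_{y=0}|} \tag{2}$$
such that
$$\tilde{P}_{y=1} = \{x \in \tilde{P} \mid g(x) \geq LB_{y=1}\}, \quad \tilde{N}_{y=1} = \{x \in \tilde{N} \mid g(x) \geq LB_{y=1}\}, \quad \tilde{P}_{y=0} = \{x \in \tilde{P} \mid g(x) \leq UB_{y=0}\}, \quad \tilde{N}_{y=0} = \{x \in \tilde{N} \mid g(x) \leq UB_{y=0}\} \tag{3}$$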
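where $g$ is fit to the corrupted training set $D_\rho$ to obtain $g(x) = P(\hat{s} = 1 \mid x)$. The threshold $LB_{y=1}$ is the predicted probability in $g(x)$ above which we guess that an example has hidden label $y = 1$, and similarly for the upper bound $UB_{y=0}$. $LB_{y=1}$ and $UB_{y=0}$ partition $\tilde{P}$ and $\tilde{N}$ into four sets representing a best guess of a subset of examples having labels (1) $s = 1, y = 0$, (2) $s = 1, y = 1$, (3) $s = 0, y = 0$, (4) $s = 0, y = 1$. The threshold values are defined as
$$LB_{y=1} := P(\hat{s} = 1 \mid s = 1) = E_{x \in \tilde{P}}[g(x)], \qquad UB_{y=0} := P(\hat{s} = 1 \mid s = 0) = E_{x \in \tilde{N}}[g(x)]$$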
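where $\hat{s}$ is the predicted label from a classifier fit to the observed data. $|\tilde{P}_{y=1}|$ counts examples with label $s = 1$ that are most likely to be correctly labeled ($y = 1$) because $LB_{y=1} = P(\hat{s} = 1 \mid s = 1)$. The three other terms in Eq. (3) follow similar reasoning. Importantly, the four terms do not sum to $n$, i.e. $|\tilde{P}| + |\tilde{N}|$, but $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$ are valid estimates because mislabeling noise is assumed to be uniformly random. The choice of threshold values relies on the following two important equations:
$$P(\hat{s} = 1 \mid s = 1) = P(\hat{s} = 1 \mid y = 1)\,P(y = 1 \mid s = 1) + P(\hat{s} = 1 \mid y = 0)\,P(y = 0 \mid s = 1) = (1 - \rho_1)(1 - \pi_1) + \rho_0 \pi_1 \tag{4}$$
Similarly, we have
$$P(\hat{s} = 1 \mid s = 0) = (1 - \rho_1)\,\pi_0 + \rho_0\,(1 - \pi_0) \tag{5}$$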
To our knowledge, although simple, this is the first time that the relationship in Eq. (4) (5) has been published, linking the work of Elkan & Noto (2008), Liu & Tao (2016), Scott et al. (2013) and Natarajan et al. (2013). From Eq. (4) (5), we observe that $LB_{y=1}$ and $UB_{y=0}$ are linear interpolations of $1 - \rho_1$ and $\rho_0$ and since $\rho_0 < 1 - \rho_1$, we have that $\rho_0 < LB_{y=1} \leq 1 - \rho_1$ and $\rho_0 \leq UB_{y=0} < 1 - \rho_1$. When $g$ is ideal we have that $g(x) = 1 - \rho_1$ if $x \in P$ and $g(x) = \rho_0$ if $x \in N$. Thus when $g$ is ideal, the thresholds $LB_{y=1}$ and $UB_{y=0}$ in Eq. (3) will perfectly separate $P$ and $N$ examples within each of $\tilde{P}$ and $\tilde{N}$. Lemma 1 immediately follows.
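Lemma 1. When $g$ is ideal,
$$\tilde{P}_{y=1} = \{x \in P \mid s = 1\}, \quad \tilde{N}_{y=1} = \{x \in P \mid s = 0\}, \quad \tilde{P}_{y=0} = \{x \in N \mid s = 1\}, \quad \tilde{N}_{y=0} = \{x \in N \mid s = 0\} \tag{6}$$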
Thus, when $g$ is ideal, the thresholds in Eq. (3) partition the training set such that $\tilde{P}_{y=1}$ and $\tilde{N}_{y=0}$ contain the correctly labeled examples and $\tilde{P}_{y=0}$ and $\tilde{N}_{y=1}$ contain the mislabeled examples. Theorem 2 follows (for brevity, proofs of all theorems/lemmas are in Appendix A.1-A.5).
Theorem 2. When $g$ is ideal,
$$\hat{\rho}_1^{conf} = \rho_1, \qquad \hat{\rho}_0^{conf} = \rho_0 \tag{7}$$
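Thus, when $g$ is ideal, the confident counts estimators $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$ are consistent estimators for $\rho_1$ and $\rho_0$ and we set $\hat{\rho}_1 := \hat{\rho}_1^{conf}$, $\hat{\rho}_0 := \hat{\rho}_0^{conf}$. These steps comprise Rank Pruning noise rate estimation (see Alg. 1). There are two practical observations. First, for any $g$ with $T$ fitting time, computing $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$ is $O(T)$. Second, $\hat{\rho}_1$ and $\hat{\rho}_0$ should be estimated out-of-sample to avoid over-fitting, resulting in sample variations. In our experiments, we use 3-fold cross-validation, requiring at most $O(T)$ time.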
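The estimation procedure of this section can be sketched in plain Python. The sketch assumes that the thresholds are the mean predicted probability over the observed positives (lower bound for hidden label 1) and over the observed negatives (upper bound for hidden label 0), which is how the text above describes them; it takes already-computed, ideally out-of-sample, probabilities as input. The function name `confident_counts_noise_rates` is hypothetical and this is not the authors' released code.

```python
def confident_counts_noise_rates(probs, noisy_labels):
    """Estimate (rho1_hat, rho0_hat) from predicted probabilities and noisy labels."""
    pos = [p for p, s in zip(probs, noisy_labels) if s == 1]
    neg = [p for p, s in zip(probs, noisy_labels) if s == 0]
    if not pos or not neg:
        raise ValueError("both observed classes must be non-empty")
    lb_y1 = sum(pos) / len(pos)  # mean g(x) over observed positives
    ub_y0 = sum(neg) / len(neg)  # mean g(x) over observed negatives
    # Four confident sets: observed label (P~ or N~) crossed with guessed hidden label.
    p_y1 = sum(1 for p in pos if p >= lb_y1)
    n_y1 = sum(1 for p in neg if p >= lb_y1)
    p_y0 = sum(1 for p in pos if p <= ub_y0)
    n_y0 = sum(1 for p in neg if p <= ub_y0)
    # Fraction of confidently-positive examples labeled 0, and vice versa.
    rho1_hat = n_y1 / (n_y1 + p_y1)
    rho0_hat = p_y0 / (p_y0 + n_y0)
    return rho1_hat, rho0_hat
```

Under the ideal condition, every true positive has predicted probability $1 - \rho_1$ and every true negative has $\rho_0$, so both thresholds fall strictly between the two values and the estimates coincide with the true noise rates, as Theorem 2 states.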
3.2 Noise Estimation: Unassuming Case
Theorem 2 states that $\hat{\rho}_i^{conf} = \rho_i$, $i = 0, 1$, when $g$ is ideal. Though theoretically constructive, in practice this is unlikely. Next, we derive expressions for the estimators when $g$ is unassuming, i.e. $g$ may not be perfect and $P$ and $N$ may have overlapping support.
Define as the fraction of overlapping examples in and remember that . Denote . We have
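Lemma 3. When $g$ is unassuming, the confident counts estimators satisfy
$$\hat{\rho}_1^{conf} = \rho_1 + \frac{1 - \rho_1 - \rho_0}{|P| - |\Delta P_1| + |\Delta N_1|}\,|\Delta N_1|, \qquad \hat{\rho}_0^{conf} = \rho_0 + \frac{1 - \rho_1 - \rho_0}{|N| - |\Delta N_0| + |\Delta P_0|}\,|\Delta P_0| \tag{8}$$
where $\Delta P_1 = \{x \in P \mid g(x) < LB_{y=1}\}$, $\Delta N_1 = \{x \in N \mid g(x) \geq LB_{y=1}\}$, $\Delta P_0 = \{x \in P \mid g(x) \leq UB_{y=0}\}$, and $\Delta N_0 = \{x \in N \mid g(x) > UB_{y=0}\}$.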
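The second term on the R.H.S. of the $\hat{\rho}_i^{conf}$ expressions captures the deviation of $\hat{\rho}_i^{conf}$ from $\rho_i$, $i = 0, 1$. This term results from both imperfect $g(x)$ and overlapping support. Because the term is non-negative, $\hat{\rho}_i^{conf} \geq \rho_i$, $i = 0, 1$, in the limit of infinite examples. In other words, $\hat{\rho}_i^{conf}$ is an upper bound for the noise rates $\rho_i$, $i = 0, 1$. From Lemma 3, it also follows: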
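Theorem 4. Given the non-overlapping support condition,
If $\forall x \in N$, $\Delta g(x) < LB_{y=1} - \rho_0$, then $\hat{\rho}_1^{conf} = \rho_1$.
If $\forall x \in P$, $\Delta g(x) > -(1 - \rho_1 - UB_{y=0})$, then $\hat{\rho}_0^{conf} = \rho_0$.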
Theorem 4 shows that $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$ are robust to imperfect probability estimation. As long as $\Delta g(x)$ does not exceed the distance between the threshold in Eq. (3) and the perfect $g^*(x)$ value, $\hat{\rho}_1^{conf}$ and $\hat{\rho}_0^{conf}$ are consistent estimators for $\rho_1$ and $\rho_0$. Our numerical experiments in Section 4 suggest this is reasonable for $\Delta g(x)$. The average $|\Delta g(x)|$ for the MNIST training dataset across different ($\rho_1$, $\pi_1$) varies between 0.01 and 0.08 for a logistic regression classifier, between 0.01 and 0.03 for a CNN classifier, and between 0.05 and 0.10 for the CIFAR dataset with a CNN classifier. Thus, when $LB_{y=1} - \rho_0$ and $1 - \rho_1 - UB_{y=0}$ are above 0.1 for these datasets, from Theorem 4 we see that $\hat{\rho}_i^{conf}$ still accurately estimates $\rho_i$.
3.3 The Rank Pruning Algorithm
Using $\hat{\rho}_1$ and $\hat{\rho}_0$, we must uncover the classifications of $f$ from $g$. In this section, we describe how Rank Pruning selects confident examples, removes the rest, and trains on the pruned set using a reweighted loss function.
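First, we obtain the inverse noise rates $\hat{\pi}_1$, $\hat{\pi}_0$ from $\hat{\rho}_1$, $\hat{\rho}_0$:
$$\hat{\pi}_1 = \frac{\hat{\rho}_0}{p_{s1}} \cdot \frac{1 - p_{s1} - \hat{\rho}_1}{1 - \hat{\rho}_1 - \hat{\rho}_0}, \qquad \hat{\pi}_0 = \frac{\hat{\rho}_1}{1 - p_{s1}} \cdot \frac{p_{s1} - \hat{\rho}_0}{1 - \hat{\rho}_1 - \hat{\rho}_0} \tag{9}$$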
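Next, we prune the $\hat{\pi}_1 |\tilde{P}|$ examples in $\tilde{P}$ with smallest $g(x)$ and the $\hat{\pi}_0 |\tilde{N}|$ examples in $\tilde{N}$ with highest $g(x)$ and denote the pruned sets $\tilde{P}_{conf}$ and $\tilde{N}_{conf}$. To prune, we define $k_1$ as the $(\hat{\pi}_1 |\tilde{P}|)^{th}$ smallest $g(x)$ for $x \in \tilde{P}$ and $k_0$ as the $(\hat{\pi}_0 |\tilde{N}|)^{th}$ largest $g(x)$ for $x \in \tilde{N}$. BFPRT ($O(n)$) (Blum et al., 1973) is used to compute $k_1$ and $k_0$ and pruning is reduced to the following filter:
$$\tilde{P}_{conf} := \{x \in \tilde{P} \mid g(x) \geq k_1\}, \qquad \tilde{N}_{conf} := \{x \in \tilde{N} \mid g(x) \leq k_0\} \tag{10}$$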
Lastly, we refit the classifier to $X_{conf} = \tilde{P}_{conf} \cup \tilde{N}_{conf}$ by class-conditionally reweighting the loss function for examples in $\tilde{P}_{conf}$ with weight $\frac{1}{1 - \hat{\rho}_1}$ and examples in $\tilde{N}_{conf}$ with weight $\frac{1}{1 - \hat{\rho}_0}$ to recover the estimated balance of positive and negative examples. The entire Rank Pruning algorithm is presented in Alg. 1 and illustrated step-by-step on a synthetic dataset in Fig. 1.
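The pruning step described above can be sketched as follows. The sketch assumes that the number of examples removed from each observed class is the estimated inverse noise rate times the class size, rounded to the nearest integer, and it sorts for brevity where the paper uses linear-time BFPRT selection. The function name `rank_prune` is hypothetical.

```python
def rank_prune(probs, noisy_labels, pi1_hat, pi0_hat):
    """Return sorted indices of examples kept after removal by rank.

    Drops the lowest-probability share pi1_hat of observed positives and the
    highest-probability share pi0_hat of observed negatives.
    """
    pos_idx = [i for i, s in enumerate(noisy_labels) if s == 1]
    neg_idx = [i for i, s in enumerate(noisy_labels) if s == 0]
    n_drop_pos = round(pi1_hat * len(pos_idx))
    n_drop_neg = round(pi0_hat * len(neg_idx))
    # Sorting is used here only to keep the sketch short; a selection
    # algorithm would find the two cut-off values without a full sort.
    pos_ranked = sorted(pos_idx, key=lambda i: probs[i])                # ascending g(x)
    neg_ranked = sorted(neg_idx, key=lambda i: probs[i], reverse=True)  # descending g(x)
    kept = pos_ranked[n_drop_pos:] + neg_ranked[n_drop_neg:]
    return sorted(kept)
```

When the predicted probabilities of all true positives exceed those of all true negatives and the inverse noise rates are exact, the examples removed are exactly the mislabeled ones, which is the situation analyzed in Section 3.5.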
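We conclude this section with a formal discussion of the loss function and efficiency of Rank Pruning. Define $\hat{y}_i$ as the predicted label of example $i$ for the classifier fit to $X_{conf}$, $s_{conf}$ and let $l(\hat{y}_i, s_i)$ be the original loss function for $x_i \in D_\rho$. Then the loss function for Rank Pruning is simply the original loss function exerted on the pruned $X_{conf}$, with class-conditional weighting:
$$\tilde{l}(\hat{y}_i, s_i) = \frac{1}{1 - \hat{\rho}_1}\, l(\hat{y}_i, s_i) \cdot \mathbb{1}[[x_i \in \tilde{P}_{conf}]] + \frac{1}{1 - \hat{\rho}_0}\, l(\hat{y}_i, s_i) \cdot \mathbb{1}[[x_i \in \tilde{N}_{conf}]] \tag{11}$$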
Effectively this loss function uses a zero-weight for pruned examples. Other than potentially fewer examples, the only difference in the loss function for Rank Pruning and the original loss function is the class-conditional weights. These constant factors do not increase the complexity of the minimization of the original loss function. In other words, we can fairly report the running time of Rank Pruning in terms of the running time ($O(T)$) of the choice of probabilistic estimator. Combining noise estimation ($O(T)$), pruning ($O(n)$), and the final fitting ($O(T)$), Rank Pruning has a running time of $O(T) + O(n)$, which is $O(T)$ for typical classifiers.
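As an illustration of the class-conditionally weighted loss, the sketch below evaluates a weighted log loss in which pruned examples contribute zero. It assumes weights of $1/(1-\hat{\rho}_1)$ for kept observed positives and $1/(1-\hat{\rho}_0)$ for kept observed negatives, uses log loss as a stand-in for the original loss function, and averages over all $n$ examples; these choices and the name `rank_pruning_loss` are assumptions made for the example.

```python
import math


def rank_pruning_loss(pred_probs, noisy_labels, kept, rho1_hat, rho0_hat):
    """Average class-weighted log loss; examples not in `kept` get zero weight."""
    weight_pos = 1.0 / (1.0 - rho1_hat)
    weight_neg = 1.0 / (1.0 - rho0_hat)
    total = 0.0
    for i in kept:
        # Clip to avoid log(0) for over-confident predictions.
        p = min(max(pred_probs[i], 1e-12), 1.0 - 1e-12)
        if noisy_labels[i] == 1:
            total += weight_pos * -math.log(p)
        else:
            total += weight_neg * -math.log(1.0 - p)
    return total / len(noisy_labels)
```

Because the weights are constants per class, any learner that accepts per-example sample weights can minimize this loss without modification to its optimizer.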
3.4 Rank Pruning: A simple summary
Recognizing that formalization can create obfuscation, in this section we describe the entire algorithm in a few sentences. Rank Pruning takes as input training examples $X$, noisy labels $s$, and a probabilistic classifier, and finds a subset of $X, s$ that is likely to be correctly labeled, i.e. a subset of $X, y$. To do this, we first find two thresholds, $LB_{y=1}$ and $UB_{y=0}$, to confidently guess the correctly and incorrectly labeled examples in each of $\tilde{P}$ and $\tilde{N}$, forming four sets, then use the set sizes to estimate the noise rates $\rho_1 = P(s = 0 \mid y = 1)$ and $\rho_0 = P(s = 1 \mid y = 0)$. We then use the noise rates to estimate the number of examples with observed label $s = 1$ and hidden label $y = 0$ and remove that number of examples from $\tilde{P}$ by removing those with lowest predicted probability $g(x)$. We prune $\tilde{N}$ similarly. Finally, the classifier is fit to the pruned set, which is intended to represent a subset of the correctly labeled data.
3.5 Expected Risk Evaluation
In this section, we prove Rank Pruning exactly uncovers the classifier $f$ fit to hidden $y$ labels when $g$ range separates $P$ and $N$ and $\rho_1$ and $\rho_0$ are given.
Denote $f_\theta \in \mathcal{F}: x \to \hat{y}$ as a classifier’s prediction function belonging to some function space $\mathcal{F}$, where $\theta$ represents the classifier’s parameters. $f_\theta$ represents $f$, but without $\theta$ necessarily fit to the training data. $\hat{f}$ is the Rank Pruning estimate of $f$.
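Denote the empirical risk of $f_\theta$ w.r.t. the loss function $\tilde{l}$ and corrupted data $D_\rho$ as $\hat{R}_{\tilde{l}, D_\rho}(f_\theta) = \frac{1}{n} \sum_{i=1}^{n} \tilde{l}(f_\theta(x_i), s_i)$, and the expected risk of $f_\theta$ w.r.t. the corrupted distribution $\mathcal{D}_\rho$ as $R_{\tilde{l}, \mathcal{D}_\rho}(f_\theta) = E_{(x,s) \sim \mathcal{D}_\rho}[\hat{R}_{\tilde{l}, D_\rho}(f_\theta)]$. Similarly, denote $R_{l, \mathcal{D}}(f_\theta)$ as the expected risk of $f_\theta$ w.r.t. the hidden distribution $\mathcal{D}$ and loss function $l$. We show that using Rank Pruning, a classifier $\hat{f}$ can be learned for the hidden data $D$, given the corrupted data $D_\rho$, by minimizing the empirical risk:
$$\hat{f} = \arg\min_{f_\theta \in \mathcal{F}} \hat{R}_{\tilde{l}, D_\rho}(f_\theta) = \arg\min_{f_\theta \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \tilde{l}(f_\theta(x_i), s_i)$$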
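Under the range separability condition, we have
Theorem 5. If $g$ range separates $P$ and $N$ and $\hat{\rho}_i = \rho_i$, $i = 0, 1$, then for any classifier $f_\theta$ and any bounded loss function $l(\hat{y}_i, y_i)$, we have
$$R_{\tilde{l}, \mathcal{D}_\rho}(f_\theta) = R_{l, \mathcal{D}}(f_\theta)$$
where $\tilde{l}$ is Rank Pruning’s loss function (Eq. 11).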
The proof of Theorem 5 is in Appendix A.5. Intuitively, Theorem 5 tells us that if $g$ range separates $P$ and $N$, then given exact noise rate estimates, Rank Pruning will exactly prune out the positive examples in $\tilde{N}$ and negative examples in $\tilde{P}$, leading to the same expected risk as learning from uncorrupted labels. Thus, Rank Pruning can exactly uncover the classifications of $f$ (with infinite examples) because the expected risk is equivalent for any $f_\theta$. Note Theorem 5 also holds when $g$ is ideal, since ideal $\subset$ range separability. In practice, range separability encompasses a wide range of imperfect $g(x)$ scenarios, e.g. $g(x)$ can have large fluctuation in both $P$ and $N$ or have systematic drift w.r.t. $g^*(x)$ due to underfitting.
4 Experimental Results
In Section 3, we developed a theoretical framework for Rank Pruning, proved exact noise estimation and equivalent expected risk when conditions are ideal, and derived closed-form solutions when conditions are non-ideal. Our theory suggests that, in practice, Rank Pruning should (1) accurately estimate $\rho_1$ and $\rho_0$, (2) typically achieve as good or better F1, error and AUC-PR (Davis & Goadrich, 2006) as state-of-the-art methods, and (3) be robust to both mislabeling and added noise.
In this section, we support these claims with an evaluation of the comparative performance of Rank Pruning in non-ideal conditions across thousands of scenarios. These include less complex (MNIST) and more complex (CIFAR) datasets, simple (logistic regression) and complex (CNN) classifiers, the range of noise rates, added random noise, separability of $P$ and $N$, input dimension, and number of training examples to ensure that Rank Pruning is a general, agnostic solution for $\tilde{P}\tilde{N}$ learning.
In our experiments, we adjust $\pi_1$ instead of $\rho_0$ because binary noisy classification problems (e.g. detection and recognition tasks) often have that $|P| \ll |N|$. This choice allows us to adjust both noise rates with respect to $P$, i.e. the fraction of true positive examples that are mislabeled as negative ($\rho_1$) and the fraction of observed positive labels that are actually mislabeled negative examples ($\pi_1$). The learning algorithms are trained with corrupted labels $s$, and tested on an unseen test set by comparing predictions $\hat{y}$ with the true test labels $y$ using F1 score, error, and AUC-PR metrics. We include all three to emphasize our apathy toward tuning results to any single metric. We provide F1 scores in this section with error and AUC-PR scores in Appendix C.
4.1 Synthetic Dataset
The synthetic dataset is comprised of a Gaussian positive class and a Gaussian negative class such that negative examples ($y = 0$) obey an $m$-dimensional Gaussian distribution centered at the origin with unit variance, and positive examples obey a Gaussian distribution centered at $d\mathbf{1}$, where $d\mathbf{1} = (d, d, \dots, d)$ is an $m$-dimensional vector, and $d$ measures the separability of the positive and negative set.
We test Rank Pruning by varying 4 different settings of the environment: separability $d$, dimension, number of training examples $n$, and percent (of $n$) added random noise drawn from a uniform distribution. In each scenario, we test 5 different $(\pi_1, \rho_1)$ pairs. From Fig. 2, we observe that across these settings, the F1 score for Rank Pruning is fairly agnostic to the magnitude of mislabeling (noise rates). As a validation step, in Fig. 3 we measure how closely our empirical estimates match our theoretical solutions in Eq. (8) and find near equivalence except when the number of training examples approaches zero.
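A dataset of this kind can be generated with the standard library alone. The sketch below is an assumption-laden illustration: it draws negatives from a unit-variance Gaussian at the origin and positives from a Gaussian centered at $(d, \dots, d)$, uses unit variance for the positive class as a placeholder (the extracted text does not preserve the positive-class covariance), and the function name `make_gaussian_dataset` and its parameters are hypothetical.

```python
import random


def make_gaussian_dataset(n_pos, n_neg, dim, separation, seed=0):
    """Return (X, y): positives centered at (d, ..., d), negatives at the origin."""
    rng = random.Random(seed)
    # Positive-class variance of 1.0 is a placeholder assumption.
    positives = [[rng.gauss(separation, 1.0) for _ in range(dim)] for _ in range(n_pos)]
    negatives = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_neg)]
    X = positives + negatives
    y = [1] * n_pos + [0] * n_neg
    return X, y
```

Larger values of `separation` make the two classes easier to distinguish, which corresponds to the separability setting varied in the experiments.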
For significant mislabeling ($\rho_1 = 0.5$, $\pi_1 = 0.5$), Rank Pruning often outperforms other methods (Fig. 4). In the scenario of different separability $d$, it achieves nearly the same F1 score as the ground truth classifier. Remarkably, from Fig. 2 and Fig. 4, we observe that when added random noise comprises a large fraction of the total training examples, Rank Pruning still achieves F1 $> 0.85$, compared with F1 $< 0.5$ for all other methods. This emphasizes a unique feature of Rank Pruning: it will also remove added random noise because noise drawn from a third distribution is unlikely to appear confidently positive or negative.
4.2 MNIST and CIFAR Datasets
We consider the binary classification tasks of one-vs-rest for the MNIST (LeCun & Cortes, 2010) and CIFAR-10 (Krizhevsky et al.) datasets, e.g. the “car vs rest” task in CIFAR is to predict if an image is a “car” or “not”. $\rho_1$ and $\pi_1$ are given to all $\tilde{P}\tilde{N}$ learning methods for fair comparison, except for $RP_\rho$, which is Rank Pruning including noise rate estimation. $RP_\rho$ metrics measure our performance on the unadulterated $\tilde{P}\tilde{N}$ learning problem.
As evidence that Rank Pruning is dataset and classifier agnostic, we demonstrate its superiority with both (1) a linear logistic regression model with unit L2 regularization and (2) an AlexNet CNN variant with max pooling and dropout, modified to have a two-class output. The CNN structure is adapted from Chollet (2016b) for MNIST and Chollet (2016a) for CIFAR. CNN training ends when a 10% holdout set shows no loss decrease for 10 epochs (max 50 for MNIST and 150 for CIFAR).
We consider noise rates for both MNIST and CIFAR, with additional settings for MNIST in Table 3 to emphasize Rank Pruning performance is noise rate agnostic. The , case is omitted because when given