Semisupervised Classifier Evaluation and Recalibration

by Peter Welinder, et al.

How many labeled examples are needed to estimate a classifier's performance on a new dataset? We study the case where data is plentiful, but labels are expensive. We show that by making a few reasonable assumptions on the structure of the data, it is possible to estimate performance curves, with confidence bounds, using a small number of ground truth labels. Our approach, which we call Semisupervised Performance Evaluation (SPE), is based on a generative model for the classifier's confidence scores. In addition to estimating the performance of classifiers on new datasets, SPE can be used to recalibrate a classifier by re-estimating the class-conditional confidence distributions.








1 Introduction

Consider a biologist who downloads software for classifying the behavior of fruit flies. The classifier was laboriously trained by a different research group who labeled thousands of training examples to achieve satisfactory performance on a validation set collected in some particular setting (see e.g. [6]). The biologist would be ill-advised if she trusted the published performance figures; maybe small lighting changes in her experimental setting have changed the statistics of the data and rendered the classifier useless. However, if the biologist has to review all the labels assigned by the classifier to her dataset, just to be sure the classifier is performing up to expectation, then what is the point of obtaining a trained classifier in the first place? Is it possible at all to obtain a reliable evaluation of a classifier when unlabeled data is plentiful, but when the user is willing to provide only a small number of labeled examples?

We propose a method for achieving minimally supervised evaluation of classifiers, requiring as few as 10 labels to accurately estimate classifier performance. Our method is based on a generative Bayesian model for the confidence scores produced by the classifier, borrowing from the literature on semisupervised learning [16, 20, 21]. We show how to use the model to re-calibrate classifiers to new datasets by choosing thresholds to satisfy performance constraints with high likelihood. An additional contribution is a fast approximate method for inference in our model.

Figure 1: Estimating detector performance with all but 10 labels unknown. A: Histogram of classifier scores obtained by running the “ChnFtrs” detector [8] on the INRIA dataset [5]. The red and green curves show the Gamma-Normal mixture model fitting the histogrammed scores with highest likelihood. The scores are all unlabeled, apart from 10, selected at random, which have labels. The shaded bands indicate the 90% probability bands around the model. The red and green bars show the labels of the 10 randomly sampled scores (by chance, some of the sampled scores are close to each other, so only 6 bars are visible; the height of the bars has no meaning). B: Precision and recall curves computed from the mixture model in A. C: In black, the precision-recall curve computed after all items have been labeled. In red, the precision-recall curve estimated using SPE from only 10 labeled examples (with the 90% confidence interval shown as the magenta band). See Section 2 for a discussion.
Figure 2: Applying SPE to different datasets. A: Estimation error, as measured by the area between the true and predicted precision-recall curves, versus the number of labels sampled, for the ChnFtrs detector on the CPD dataset. The red curve is SPE and the green curve shows the median error of the naive method (RND); the green band shows the 90% quantiles of the naive method. B: The performance curve estimated using SPE (red) with 90% confidence intervals (magenta) with 20 known labels. The ground truth performance with all labels known is shown as a black curve (GT), and the performance curves computed on 20 labels using the naive method, for 5 random samples, are shown in green (RND). Notice that the curves obtained from different samples vary a lot (and most predict perfect performance). C–D: Same as A–B, but for the logres8 classifier on the DGT dataset (hand-picked as an example where SPE does not work well). E: Comparison of the estimation error (area between curves) of SPE and the naive method for 20 known labels and different datasets. The marker shapes denote the dataset (each dataset has multiple classifiers), and the lines indicate the standard error averaged over 10 trials. SPE almost always performs significantly better than the naive method.

2 Modeling the classifier score

Let us start with a set of N data items, {x_i}, drawn from some unknown distribution and indexed by i = 1, …, N. Suppose that a classifier, f_τ(x_i) = 1[s_i > τ], where τ is some scalar threshold, has been used to classify all data items into two classes, y_i ∈ {0, 1}. While the “ground truth” labels y = (y_1, …, y_N) are assumed to be unknown, initially, we do have access to all the “scores,” s = (s_1, …, s_N), computed by the classifier. From this point onwards, we forget about the data vectors x_i and concentrate solely on the scores and labels, (s, y).

The key assumption in this paper is that the list of scores s and the unknown labels y can be modeled by a two-component mixture model p(s, y | θ), parameterized by θ = (π, θ_0, θ_1), where the class-conditionals p_0(s | θ_0) and p_1(s | θ_1) are standard parametric distributions. We show in Section 4.2 that this is a reasonable assumption for many datasets.

Suppose that we can ask an expert (the “oracle”) to provide the true label y_i for any data item. This is an expensive operation and our goal is to ask the oracle for as few labels as possible. The set of items that have been labeled by the oracle at time t is denoted by L_t and its complement, the set of items for which the ground truth is unknown, is denoted U_t. This setting is similar to semisupervised learning [20, 21]. By estimating θ, we will improve our estimate of the performance of f_τ when |L_t| ≪ N.

Consider first the fully supervised case, i.e. where all labels are known. Let the scores be i.i.d. according to the two-component mixture model. If all the labels are known, and we assume independent observations, the likelihood of the data is given by,

p(s, y | θ) = ∏_{i=1}^{N} p(y_i | θ) p(s_i | y_i, θ)   (1)
            = ∏_{i=1}^{N} π^{y_i} (1 − π)^{1−y_i} p_{y_i}(s_i | θ_{y_i}),   (2)

where θ = (π, θ_0, θ_1), and π is the mixture weight, i.e. π = p(y_i = 1 | θ). The component densities p_0(s | θ_0) and p_1(s | θ_1) could be modeled parametrically by Normal distributions, Gamma distributions, or some other probability distributions appropriate for the given classifier (see Section 4.2 for a discussion about which class conditional distributions to choose). This approach of applying a generative model to score distributions, when all labels are known, has been used in the past to obtain error estimates on classifier performance [13, 10, 12], and for classifier calibration [1]. However, previous approaches require that all items used to estimate the performance have been labeled.

We suggest that it may be possible to estimate classifier performance even when only a fraction of the ground truth labels are known. In this case, the labels for the unlabeled items can be marginalized out,

p(s, y_L | θ) = ∏_{i ∈ L_t} π^{y_i} (1 − π)^{1−y_i} p_{y_i}(s_i | θ_{y_i}) × ∏_{i ∈ U_t} [π p_1(s_i | θ_1) + (1 − π) p_0(s_i | θ_0)],   (3)

where y_L = {y_i | i ∈ L_t}. This allows the model to make use of the scores of unlabeled items in addition to the labeled items, which enables accurate performance estimates with only a handful of labels. Once we have the likelihood, we can take a Bayesian approach to estimate the parameters θ. Starting from a prior on the parameters, p(θ), we can obtain a posterior by using Bayes’ rule,

p(θ | s, y_L) ∝ p(s, y_L | θ) p(θ).   (4)
Let us look at a real example. Figure 1a shows a histogram of the scores obtained from a classifier on a public dataset (see Section 4 for more information about the datasets we use). At first glance, it is difficult to guess the performance of the classifier unless the oracle provides a lot of labels. However, if we assume that the scores follow a two-component mixture model as in (3), with a Gamma distribution for one component and a Normal distribution for the other, then there is only a narrow choice of θ that can explain the scores with high likelihood; the red and green curves in Figure 1a show such a high probability hypothesis. As we will see in the next section, the posterior on θ can be used to estimate the performance of the classifier.
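To make the model concrete, the marginalized likelihood in (3) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name is ours, and the pairing of a Gamma density with y = 0 and a Normal density with y = 1 is an assumption mirroring the Gamma-Normal mixture of Figure 1.

```python
import numpy as np
from scipy import stats

def marginalized_loglik(theta, scores, labels):
    """Log-likelihood of a partially labeled two-component mixture, as in (3).
    Labeled items (labels 0 or 1) use their class-conditional directly;
    unlabeled items (label -1) are summed over both classes.
    theta = (pi, gamma_shape, gamma_scale, normal_mean, normal_std)."""
    pi, k, scale, mu, sigma = theta
    p0 = stats.gamma.pdf(scores, k, scale=scale)   # y = 0 class-conditional (assumed Gamma)
    p1 = stats.norm.pdf(scores, mu, sigma)         # y = 1 class-conditional (assumed Normal)
    per_item = np.where(labels == 0, (1 - pi) * p0,
                np.where(labels == 1, pi * p1,
                         pi * p1 + (1 - pi) * p0))  # marginalize unknown labels
    return float(np.sum(np.log(per_item)))
```

A θ whose components match the score histogram assigns markedly higher likelihood than a mismatched one, which is why even 10 labels suffice to narrow the plausible choices of θ.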

Figure 3: Modeling class-conditional score densities by standard parametric distributions. A–F: Standard parametric distributions (black solid curves) fitted to the class-conditional scores for a few example datasets and classifiers. The score distributions are shown as histograms. In all cases, we normalized the scores to lie in the unit interval, and truncated the truncated distributions at its endpoints. See Section 4.2 for more information. G: Comparison of the standard parametric distributions best representing the empirical class-conditional score distributions (for a subset of the 78 cases we tried). Each row shows the top-3 distributions, i.e. those explaining the class-conditional scores with highest likelihood, for different combinations of datasets, classifiers and class-labels (shown in brackets, y = 0 or y = 1). The distribution families we tried included (with abbreviations used in the last three columns in parentheses) the truncated Normal (n), truncated Student’s t (t), Gamma (g), log-normal (ln), left- and right-skewed Gumbel (g-l and g-r), Gompertz (gz), and Frechet right (f-r) distributions. The last and second to last columns show the relative log likelihood (r.l.l.) with respect to the best distribution. Two densities, truncated Normal and Gamma, are either at the top or indistinguishable from the top in all the datasets we tried.

3 Estimating performance

Most performance measures can be computed directly from the model parameters θ. For example, two often used performance measures are the precision and recall at a particular score threshold τ. We can define these quantities in terms of the class-conditional distributions p_y(s | θ_y). Recall is defined as the fraction of the positive, i.e. y = 1, examples that have scores above a given threshold,

R(τ; θ) = ∫_τ^∞ p_1(s | θ_1) ds.   (5)

Precision is defined to be the fraction of all examples with scores above a given threshold that are positive,

P(τ; θ) = π ∫_τ^∞ p_1(s | θ_1) ds / [π ∫_τ^∞ p_1(s | θ_1) ds + (1 − π) ∫_τ^∞ p_0(s | θ_0) ds].   (6)

We can also compute the precision at a given level of recall by inverting R, i.e. P(R^{−1}(r); θ) for some recall r. Other performance measures, such as the equal error rate, true positive rate, true negative rate, sensitivity, specificity, and the ROC can be computed from θ in a similar manner.
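Under the same assumed Gamma (y = 0) and Normal (y = 1) class-conditionals used earlier, recall and precision reduce to survival functions of the components. A hedged sketch (function names are ours):

```python
from scipy import stats

def recall(tau, theta):
    """Mass of the (assumed Normal) y = 1 component above threshold tau."""
    pi, k, scale, mu, sigma = theta
    return stats.norm.sf(tau, mu, sigma)

def precision(tau, theta):
    """Fraction of the mixture mass above tau contributed by the y = 1
    component, using the survival functions of both class-conditionals."""
    pi, k, scale, mu, sigma = theta
    tp = pi * stats.norm.sf(tau, mu, sigma)                # positives above tau
    fp = (1 - pi) * stats.gamma.sf(tau, k, scale=scale)    # negatives above tau
    return tp / (tp + fp)
```

Sweeping tau through the score range traces out a precision-recall curve for a given θ.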

The posterior on θ can also be used to obtain confidence bounds on the performance of the classifier. For example, for some choice of parameters θ, the precision and recall can be computed for a range of score thresholds τ to obtain a curve (see solid curves in Figure 1b). Similarly, given the posterior on θ, the distribution of R(τ; θ) and P(τ; θ) can be computed for a fixed τ to obtain confidence intervals (shown as colored bands in Figure 1b). The same reasoning can be applied to the precision-recall curve: for some recall r, the distribution of precisions, found using P(R^{−1}(r); θ), can be used to compute confidence intervals on the curve (see Figure 1c).

While the approach of estimating performance based purely on the estimate of θ works well in the limit when the number of data items N → ∞, it has some drawbacks when N is small and the dataset is unbalanced, in which case finite-sample effects come into play. This is especially the case when the number of positive examples is very small, say 10–100, in which case the performance curve will be very jagged. Since the previous approach views the scores (and the associated labels) as a finite sample from p(s, y | θ), there will always be uncertainty in the performance estimate. When all items have been labeled by the oracle, the remaining uncertainty in the performance represents the variability in sampling from p(s, y | θ). In practice, however, the question that is often asked is, “What is our best guess for the classifier performance on this particular test set?” In other words, we are interested in the sample performance rather than the population performance. Thus, when the oracle has labeled the whole test set, there should not be any uncertainty in the performance; it can simply be computed directly from (s, y).

To estimate the sample performance, we need to account for the uncertainty in the unlabeled items, i ∈ U_t. This uncertainty is captured by the distribution of the unobserved labels y_U = {y_i | i ∈ U_t}, found by marginalizing out the model parameters,

p(y_U | s, y_L) = ∫_Θ p(y_U, θ | s, y_L) dθ = ∫_Θ [∏_{i ∈ U_t} p(y_i | s_i, θ)] p(θ | s, y_L) dθ.   (7)

Here Θ is the space of all possible parameters. In the second equality we rely on the assumption of a mixture model to factor the joint probability distribution on y_U and θ.

One way to think of this approach is as follows: imagine that we sample y_U from p(y_U | s, y_L). We can then use all the labels and the scores to trace out a performance curve (e.g., a precision-recall curve). Now, as we repeat the sampling, each performance curve will look slightly different. Thus, the posterior distribution on y_U in effect gives us a distribution of performance curves. We can use this distribution to compute quantities such as the expected performance curve, the variance in the curves, and confidence intervals. The main difference between the sample and population performance estimates will be at the tails of the score distribution, where individual item labels can have a large impact on the performance curve.

3.1 Sampling from the posterior

In practice, we cannot compute p(y_U | s, y_L) in (7) analytically, so we must resort to approximate methods. For some choices of class-conditional densities, p_y(s | θ_y), such as when they are Normal distributions, it is possible to carry out the marginalization over θ in (7) analytically. In that case one could use collapsed Gibbs sampling to sample from the posterior on y_U, as is often done for models involving the Dirichlet process [14]. A more generally applicable method, which we will describe here, is to split the sampling into three steps: (a) sample θ^k from p(θ | s, y_L), (b) fix the mixture parameters to θ^k and sample the labels y_U^k given their associated scores, and (c) compute the performance, such as precision and recall, for all score thresholds τ. By repeating these three steps, we can obtain a sample from the distribution over the performance curves.

The first step, sampling from the posterior p(θ | s, y_L), can be carried out using importance sampling (IS). We experimented with Metropolis-Hastings and Hamiltonian Monte Carlo [15], but we found that IS worked well for this problem, required less parameter tuning, and was much faster. In IS, we sample from a proposal distribution q(θ) in order to estimate properties of the desired distribution p(θ | s, y_L). Suppose we draw K samples of θ from q(θ) to get {θ^1, …, θ^K}. Then, we can approximate expectations of some function g(θ) using the weighted function evaluations, i.e. E[g] ≈ ∑_{k=1}^{K} w^k g(θ^k). The weights w^k correct for the bias introduced by sampling from q(θ) and are defined as,

w^k ∝ p(θ^k | s, y_L) / q(θ^k),   normalized so that ∑_{k=1}^{K} w^k = 1.   (8)
For the datasets in this paper, we found that the state-space around the MAP estimate of θ,

θ* = argmax_θ p(θ | s, y_L),   (9)

was well approximated by a multivariate Normal distribution. (We used L-BFGS-B [4] to carry out the optimization; to avoid local maxima, we used multiple starting points.) Hence, for the proposal distribution we used,

q(θ) = N(θ; θ*, Σ).   (10)

To simplify things further, we used a diagonal covariance matrix, Σ = diag(σ_1^2, …, σ_d^2). The elements along the diagonal of Σ were found by fitting a univariate Normal locally to p(θ | s, y_L) along each dimension of θ while the other elements were fixed at their MAP estimates. The mean of the proposal distribution was set to the MAP estimate θ*.

We now have all the steps needed to estimate the performance of the classifier, given the scores and some labels obtained from the oracle:

  1. Find the MAP estimate θ* of θ using (9).

  2. Fit a proposal distribution q(θ) to p(θ | s, y_L) locally around θ*.

  3. Sample K instances of θ, {θ^1, …, θ^K}, from q(θ) and calculate the weights w^k.

  4. For each θ^k, sample the labels y_i for i ∈ U_t to get y_U^k.

  5. Estimate performance measures using the scores s, labels (y_L, y_U^k), and weights w^k.
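The steps above can be sketched end-to-end. This is an illustrative implementation under several simplifying assumptions, not the authors' code: it reuses the Gamma (y = 0) / Normal (y = 1) mixture from before, uses a flat prior, and replaces the per-dimension local curvature fit of the proposal with a fixed-width diagonal Normal.

```python
import numpy as np
from scipy import stats, optimize

def neg_log_posterior(theta, scores, labels):
    """Unnormalized negative log posterior under a flat prior (the paper
    uses informative priors; flat is a simplifying assumption here)."""
    pi, k, scale, mu, sigma = theta
    p0 = stats.gamma.pdf(scores, k, scale=scale)       # y = 0 component
    p1 = stats.norm.pdf(scores, mu, sigma)             # y = 1 component
    per_item = np.where(labels == 0, (1 - pi) * p0,
                np.where(labels == 1, pi * p1,
                         pi * p1 + (1 - pi) * p0))     # marginalize unlabeled
    return -float(np.sum(np.log(per_item + 1e-300)))

def spe_sample(scores, labels, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: MAP estimate via L-BFGS-B with box constraints.
    x0 = np.array([0.5, 2.0, 1.0, float(np.mean(scores)), float(np.std(scores))])
    bounds = [(1e-3, 1 - 1e-3), (1e-2, None), (1e-2, None), (None, None), (1e-2, None)]
    theta_map = optimize.minimize(neg_log_posterior, x0, args=(scores, labels),
                                  method="L-BFGS-B", bounds=bounds).x
    # Step 2: diagonal-Normal proposal centered at the MAP. A fixed relative
    # width stands in for the paper's per-dimension local curvature fit.
    sd = np.abs(theta_map) * 0.05 + 1e-3
    # Step 3: draw parameter samples and self-normalized importance weights.
    thetas = theta_map + rng.normal(size=(n_samples, 5)) * sd
    thetas[:, 0] = np.clip(thetas[:, 0], 1e-3, 1 - 1e-3)     # keep pi valid
    thetas[:, [1, 2, 4]] = np.maximum(thetas[:, [1, 2, 4]], 1e-2)
    log_w = np.array([-neg_log_posterior(t, scores, labels) for t in thetas])
    log_w -= np.sum(stats.norm.logpdf(thetas, theta_map, sd), axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Step 4: for each parameter sample, draw the unobserved labels from
    # their per-item posterior; observed labels (0 or 1) are kept fixed.
    label_samples = []
    for pi, k, scale, mu, sigma in thetas:
        p1 = pi * stats.norm.pdf(scores, mu, sigma)
        p0 = (1 - pi) * stats.gamma.pdf(scores, k, scale=scale)
        post1 = p1 / (p1 + p0 + 1e-300)
        y = np.where(labels >= 0, labels,
                     (rng.random(len(scores)) < post1).astype(int))
        label_samples.append(y)
    return thetas, w, np.array(label_samples)
```

Step 5 then averages any performance measure over the weighted label samples, e.g. the estimated precision at a threshold τ is the w-weighted mean of the precisions computed from each sampled labeling.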

4 Experiments

Figure 4: Recalibrating the classifier by estimating the probability that a condition is met. A: The conditions in panel B shown as colored “boxes,” e.g. the yellow curve shows a joint condition on the precision and recall. The blue curve and confidence band show SPE applied to the ChnFtrs detector on the CPD dataset with 100 observed labels (the black curve is ground truth). B: The probability that the conditions shown in A are satisfied for different score thresholds. Based on a curve like this, a practitioner can “recalibrate” a pre-trained classifier by picking a threshold for a new dataset such that some pre-defined criteria (e.g. in terms of precision and recall) are met.

4.1 Datasets

We surveyed the literature for published classifier scores with ground truth labels. One such dataset that we found was the Caltech Pedestrian Dataset (CPD), for which both detector scores and ground truth labels are available for a wide variety of detectors [8]. Moreover, the CPD website also has scores and labels available, using the same detectors, for other pedestrian detection datasets, such as the INRIA (abbr. INR) dataset [5].

We made use of the detections in the CPD and INR datasets as if they were classifier outputs. To some extent, these detectors are in fact classifiers, in that they use the sliding-window technique for object detection: windows are extracted at different locations and scales in the image, and each window is classified using a pedestrian classifier (with the caveat that there are often some extra post-processing steps, such as non-maximum suppression to reduce the number of false positive detections). For our experiments, we show results on a selection of detectors and datasets to highlight both the advantages and drawbacks of using SPE. To make experiments run faster, we sampled the datasets randomly to have between 800–2,000 items. See [8] for references to all detectors.

To complement the pedestrian datasets, we also used a basic linear SVM classifier and a logistic regression classifier on the “optdigits” (abbr. DGT) and “sat” (SAT) datasets from the UCI Machine Learning Repository [11]. Since both datasets are multiclass, but our method only handles binary classification, we chose one category as the y = 1 class and grouped the others into the y = 0 class. Thus, each multiclass dataset was turned into multiple binary datasets. Planned future work includes extending our approach to multiclass classifiers. In the figures, the naming convention is as follows: “svm3” means that the SVM classifier was used with category 3 in the dataset assigned to the y = 1 class, “logres9” denotes that the logistic regression classifier was used with category 9 as the y = 1 class, and so on. The datasets had 1,800–2,000 items each.

4.2 Choosing class conditionals

So far we have not discussed in detail which distribution families to use for the class-conditional distributions. To find out which parametric distributions are appropriate for modeling the score class-conditionals, we took the classifier scores and split them into two groups, one for y = 0 and one for y = 1. We used MLE to fit different families of probability distributions (see Figure 3 for a list of distributions) to 80% of the data (sampled randomly) in each group. We then ranked the distributions by the log likelihood of the remaining 20% of the data (given the MLE-fitted parameters). In total, we carried out this procedure on 78 class-conditionals from the different datasets and classifiers.

Figure 3G shows the top-3 distributions that explained the class-conditional scores with highest likelihood for a selection of the datasets and classifiers. We found that the truncated Normal distribution was in the top-3 list for 48/78 dataset class-conditionals, and that the Gamma distribution was in the top-3 list 53/78 times; at least one of the two distributions was always in the top-3 list. Figure 3A–F shows some examples of the fitted distributions. In some cases, like Figure 3C, a mixture model would have provided a better fit than the simple distributions we tried. That said, we found that truncated Normal and Gamma distributions were good choices for most of the datasets.
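The ranking procedure above can be sketched as follows. This is a simplified illustration: it uses plain rather than truncated distributions, a reduced candidate list, and function and variable names of our own choosing.

```python
import numpy as np
from scipy import stats

def rank_families(class_scores, seed=0):
    """Fit candidate families by MLE on 80% of one class's scores and
    rank them by held-out log-likelihood on the remaining 20%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(class_scores))
    split = int(0.8 * len(class_scores))
    train, test = class_scores[idx[:split]], class_scores[idx[split:]]
    # Candidate families; floc=0 pins the location of the positive-support
    # families so held-out points cannot fall below the fitted support.
    families = {
        "gamma": (stats.gamma, {"floc": 0.0}),
        "lognorm": (stats.lognorm, {"floc": 0.0}),
        "norm": (stats.norm, {}),
        "gumbel_r": (stats.gumbel_r, {}),
    }
    ranking = []
    for name, (dist, kw) in families.items():
        params = dist.fit(train, **kw)                 # MLE fit
        ll = float(np.sum(dist.logpdf(test, *params))) # held-out log-likelihood
        ranking.append((ll, name))
    return sorted(ranking, reverse=True)
```

The first element of the returned list is the best-supported family for that class-conditional.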

Since we use a Bayesian approach in equation (4), we must also define a prior on θ. The prior will vary depending on which distribution is chosen, and it should be chosen based on what we know about the data and the classifier. As an example, for the truncated Normal distribution, we use a Normal and a Gamma distribution as priors on the mean and standard deviation respectively (since we use sampling for inference, we are not limited to conjugate priors). As a prior on the mixture weight π, we use a Beta distribution.

In some situations, when little is known about the classifier, it makes sense to try different kinds of class-conditional distributions. One heuristic, which we found worked well in our experiments, is to try different combinations of distributions for p_0 and p_1, and then choose the combination achieving the highest maximum likelihood on the labeled and unlabeled data.

4.3 Applying SPE

Figure 2 shows SPE applied to different datasets. The left-most plots show the estimation error, as measured by the area between the true and predicted precision-recall curves, versus the number of labels sampled. The datasets in Figure 2A–B and C–D were chosen to highlight the strengths and weaknesses of SPE. Figure 2A shows SPE applied to the ChnFtrs detector on the CPD dataset. Already at 20 sampled labels, the estimate is very close to the ground truth (see Figure 2B). In a few cases, e.g. in Figure 2C–D (logres8 on the DGT dataset), SPE does not fare as well. While SPE performs as well as the naive method in terms of estimation error, the score distribution is not well explained by the assumptions of the model, so there is a bias in the prediction. That said, despite the fact that SPE is biased in Figure 2D, it is still far better than the naive method at 100 labels. Ultimately, the accuracy of SPE depends on how well the score data fit the assumptions in Section 2.

Figure 2E compares the estimation error of SPE to the naive method for different datasets, when only 20 labels are known. In almost all cases, SPE performs significantly better. Moreover, the variances in the SPE estimates are smaller than those of the naive method.

4.4 Classifier recalibration

Applying SPE to a test dataset allows us to “recalibrate” the classifier to that dataset. Unlike previous work on classifier calibration [1, 17], SPE does not require all items to be labeled. For each unlabeled data item, we can compute the probability that it belongs to the y = 1 class by calculating the empirical expectation over the weighted samples, i.e. p(y_i = 1 | s, y_L) ≈ ∑_{k=1}^{K} w^k y_i^k.

Similarly, we can choose a threshold τ to use with the classifier based on some pre-determined criteria. For example, the requirement might be that the classifier performs with recall R ≥ r_0 and precision P ≥ p_0. In that case, we define a condition C(τ) = {R(τ) ≥ r_0 ∧ P(τ) ≥ p_0}. Then, for each τ, we find the probability that the condition is satisfied by calculating the expectation over the unobserved labels y_U. Figure 4 shows the probability that C(τ) is satisfied at different values of τ. Thus, this approach can be used to choose new thresholds for different datasets.
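This recalibration check can be sketched directly from the weighted label samples of Section 3.1 (the function name and the default targets r0 and p0 are illustrative choices of ours):

```python
import numpy as np

def condition_probability(tau, scores, label_samples, weights, r0=0.9, p0=0.8):
    """Probability that recall >= r0 AND precision >= p0 at threshold tau,
    estimated over weighted samples of the unknown labels."""
    pred = scores > tau                         # classifier decisions at tau
    prob = 0.0
    for w, y in zip(weights, label_samples):
        tp = float(np.sum(pred & (y == 1)))     # true positives in this sample
        rec = tp / max(float(np.sum(y == 1)), 1.0)
        prec = tp / max(float(np.sum(pred)), 1.0)
        if rec >= r0 and prec >= p0:
            prob += w                           # weight of samples meeting C
    return prob
```

Sweeping tau and picking the value where this probability is highest (or exceeds a desired level) gives the recalibrated operating threshold.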

5 Related work

Previous approaches for estimating classifier performance with few labels fall into two categories: stratified sampling and active estimation using importance sampling. Bennett and Carvalho [2] suggested that the accuracy of classifiers can be estimated cost-effectively by dividing the data into disjoint strata based on the item scores, and proposed an online algorithm for sampling from the strata. This work has since been generalized to other classifier performance metrics, such as precision and recall [9]. Sawade et al. proposed instead to use importance sampling to focus labeling effort on data items with high classifier uncertainty, and applied it to standard loss functions [19] and F-measures [18]. While both of these approaches assume that the classifier threshold is fixed (see Section 2) and that a single scalar performance measure is desired, SPE can be applied to the tradeoff between different performance measures in the form of performance curves.

Fitting mixture models to the class-conditional score distributions has been studied in previous work with the goal of obtaining smooth performance curves. Gu et al. [12] and Hellmich et al. [13] showed how a two-component Gaussian mixture model can be used to obtain accurate ROC curves in different settings. Erkanli et al. [10] extended this work by fitting mixtures of Dirichlet process priors to the class-conditional distributions, which allowed them to provide smooth performance estimates even when the class-conditional distributions could not be explained by standard parametric distributions. Similarly, previous work on classifier calibration has involved fitting mixture models to score distributions [1, 17]. In contrast to previous work, which requires all data items to be labeled, SPE also makes use of the unlabeled data. This semisupervised approach allows SPE to estimate classifier performance with very few labels, or when the proportions of positive and negative examples are very unbalanced.

6 Discussion

We explored the problem of estimating classifier performance from few labeled items. We proposed using mixtures of two densities to model the scores of classifiers, which allows us to predict performance curves even when only a very small number (in the limit, none) of the samples are labeled. Using four public datasets, and multiple classifiers, we showed that classifier score distributions can often be well approximated by two-component mixture models with standard parametric component distributions, such as truncated Normal and Gamma distributions. We demonstrated how our model, Semisupervised Performance Evaluation (SPE), can be used to estimate classifier performance, with confidence intervals, using only a few labeled examples. We presented a sampling scheme based on importance sampling for efficient inference.

This line of research opens up many interesting avenues for future exploration. For example, is it possible to do unbiased active querying, so that the oracle is asked to label the most informative examples? One possibility in this direction would be to employ importance-weighted active sampling techniques [3, 7], similar in spirit to [19, 18] but for performance curves. Another future direction would be to extend SPE to multi-component mixture models and multiclass problems. That said, as shown by our experiments, SPE already works well for a broad range of classifiers and datasets, and can estimate classifier performance with as few as 10 labels (see Figure 1).


  • [1] Paul N. Bennett. Using asymmetric distributions to improve classifier probabilities: A comparison of new and standard parametric methods. Technical report, Carnegie Mellon University, 2002.
  • [2] Paul N. Bennett and Vitor R. Carvalho. Online stratified sampling: evaluating classifiers at web-scale. In CIKM, 2010.
  • [3] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In ICML, 2009.
  • [4] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
  • [5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [6] Heiko Dankert, Liming Wang, Eric D Hoopfer, David J Anderson, and Pietro Perona. Automated monitoring and analysis of social behavior in drosophila. Nat Meth, 6(4):297–303, 04 2009.
  • [7] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, 2008.
  • [8] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 99, 2011.
  • [9] Gregory Druck and Andrew McCallum. Toward interactive training and evaluation. In CIKM, 2011.
  • [10] Alaattin Erkanli, Minje Sung, E. Jane Costello, and Adrian Angold. Bayesian semi-parametric ROC analysis. Statist. Med., 25:3905–3928, 2006.
  • [11] A. Frank and Arthur Asuncion. UCI machine learning repository, 2010.
  • [12] Jiezhun Gu, Subhashis Ghosal, and Anindya Roy. Bayesian bootstrap estimation of ROC curve. Statist. Med., 27:5407–5420, 2008.
  • [13] Martin Hellmich, Keith R. Abrams, and Alex J. Sutton. Bayesian Approaches to Meta-analysis of ROC Curves. Med. Decis. Making, 19:252–264, 1999.
  • [14] Steven N. MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics B, 23(3):727–741, 1994.
  • [15] Radford M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, pages 113–162. Chapman & Hall / CRC Press, 2010.
  • [16] Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103–134, 2000.
  • [17] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
  • [18] Christoph Sawade, Niels Landwehr, Steffen Bickel, and Tobias Scheffer. Active risk estimation. In ICML, 2010.
  • [19] Christoph Sawade, Niels Landwehr, and Tobias Scheffer. Active Estimation of F-Measures. In NIPS, 2010.
  • [20] Matthias Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2002.
  • [21] Xiaojin Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin–Madison, 2008.