Consider a biologist who downloads software for classifying the behavior of fruit flies. The classifier was laboriously trained by a different research group who labeled thousands of training examples to achieve satisfactory performance on a validation set collected in some particular setting (see e.g. [6]). The biologist would be ill-advised to trust the published performance figures; small lighting changes in her experimental setting may have changed the statistics of the data and rendered the classifier useless. However, if the biologist has to review all the labels assigned by the classifier to her dataset, just to be sure the classifier is performing up to expectation, then what is the point of obtaining a trained classifier in the first place? Is it possible at all to obtain a reliable evaluation of a classifier when unlabeled data is plentiful, but the user is willing to provide only a small number of labeled examples?
We propose a method for minimally supervised evaluation of classifiers, requiring as few as 10 labels to accurately estimate classifier performance. Our method is based on a generative Bayesian model for the confidence scores produced by the classifier, borrowing from the literature on semisupervised learning [16, 20, 21]. We show how to use the model to recalibrate classifiers to new datasets by choosing thresholds that satisfy performance constraints with high probability. An additional contribution is a fast approximate inference method for our model.
2 Modeling the classifier score
Let us start with a set of $N$ data items, $(x_i, y_i)$, drawn from some unknown distribution $p(x, y)$ and indexed by $i \in \{1, \ldots, N\}$. Suppose that a classifier, $h(x_i; \tau)$, where $\tau$ is some scalar threshold, has been used to classify all data items into two classes, $\hat{y}_i \in \{0, 1\}$. While the “ground truth” labels $y_i$ are assumed to be unknown, initially, we do have access to all the “scores,” $s_i$, computed by the classifier. From this point onwards, we forget about the data vectors $x_i$ and concentrate solely on the scores and labels, $(s_i, y_i)$.
The key assumption in this paper is that the list of scores $S = (s_1, \ldots, s_N)$ and the unknown labels $Y = (y_1, \ldots, y_N)$ can be modeled by a two-component mixture model $p(S, Y \mid \theta)$, parameterized by $\theta$, where the class-conditionals are standard parametric distributions. We show in Section 4.2 that this is a reasonable assumption for many datasets.
Suppose that we can ask an expert (the “oracle”) to provide the true label for any data item. This is an expensive operation and our goal is to ask the oracle for as few labels as possible. The set of items that have been labeled by the oracle at time $t$ is denoted by $\mathcal{L}_t$ and its complement, the set of items for which the ground truth is unknown, is denoted $\mathcal{U}_t$. This setting is similar to semisupervised learning [20, 21]. By estimating $p(S, Y \mid \theta)$, we will improve our estimate of the performance of $h$ when $|\mathcal{L}_t| \ll N$.
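Consider first the fully supervised case, i.e. where all labels $y_i$ are known. Let the scores $s_i$ be i.i.d. according to the two-component mixture model. If all labels are known, the likelihood of the data is given by,
$$p(S, Y \mid \theta) = \prod_{i=1}^{N} p(s_i, y_i \mid \theta) = \prod_{i:\, y_i = 0} (1 - \pi)\, p_0(s_i \mid \theta_0) \prod_{i:\, y_i = 1} \pi\, p_1(s_i \mid \theta_1), \tag{1}$$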
where $\theta = \{\pi, \theta_0, \theta_1\}$, and $\pi \in [0, 1]$ is the mixture weight, i.e. $p(y_i = 1 \mid \theta) = \pi$. The component densities $p_0(s_i \mid \theta_0)$ and $p_1(s_i \mid \theta_1)$ are standard parametric distributions (see Section 4.2 for a discussion about which class-conditional distributions to choose). This approach of applying a generative model to score distributions, when all labels are known, has been used in the past to obtain error estimates on classifier performance [10, 12, 13], and for classifier calibration [1, 17]. However, previous approaches require that all items used to estimate the performance have been labeled.
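We suggest that it may be possible to estimate classifier performance even when only a fraction of the ground truth labels are known. In this case, the labels for the unlabeled items can be marginalized out,
$$p(S, Y_t \mid \theta) = \sum_{Y'_t} p(S, Y_t, Y'_t \mid \theta) \tag{2}$$
$$= \prod_{i \in \mathcal{L}_t:\, y_i = 0} (1 - \pi)\, p_0(s_i \mid \theta_0) \prod_{i \in \mathcal{L}_t:\, y_i = 1} \pi\, p_1(s_i \mid \theta_1) \prod_{i \in \mathcal{U}_t} \big[ (1 - \pi)\, p_0(s_i \mid \theta_0) + \pi\, p_1(s_i \mid \theta_1) \big], \tag{3}$$
where $Y_t = \{y_i \mid i \in \mathcal{L}_t\}$ are the labels provided by the oracle and $Y'_t = \{y_i \mid i \in \mathcal{U}_t\}$ are the unobserved labels. This allows the model to make use of the scores of unlabeled items in addition to the labeled items, which enables accurate performance estimates with only a handful of labels. Once we have the likelihood, we can take a Bayesian approach to estimate the parameters $\theta$. Starting from a prior on the parameters, $p(\theta)$, we can obtain a posterior $p(\theta \mid S, Y_t)$ by using Bayes’ rule,
$$p(\theta \mid S, Y_t) \propto p(S, Y_t \mid \theta)\, p(\theta). \tag{4}$$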
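As a rough illustration of the marginalized likelihood, the following Python sketch evaluates the log-likelihood of a list of scores when only some labels are known. The function name `log_likelihood`, the `-1` encoding for unlabeled items, and the particular Gamma and Normal components in the usage lines are assumptions made for this example, not details taken from the text.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def log_likelihood(scores, labels, pi, p0, p1):
    """Log-likelihood of scores under a two-component mixture.

    labels[i] is 0 or 1 for oracle-labeled items and -1 for unlabeled items
    (the -1 encoding is an assumption of this sketch). p0 and p1 are frozen
    scipy.stats distributions for the two class-conditionals.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Joint log-density log p(s_i, y_i = k | theta) for k = 0 and k = 1.
    log_joint0 = np.log1p(-pi) + p0.logpdf(scores)
    log_joint1 = np.log(pi) + p1.logpdf(scores)
    # Unlabeled items: marginalize the label out by summing both joint terms.
    log_marginal = logsumexp(np.stack([log_joint0, log_joint1]), axis=0)
    per_item = np.where(labels == 1, log_joint1,
                        np.where(labels == 0, log_joint0, log_marginal))
    return float(per_item.sum())

# Hypothetical example: Gamma negatives, Normal positives, two oracle labels.
p0 = stats.gamma(a=2.0, scale=1.0)
p1 = stats.norm(loc=5.0, scale=1.0)
scores = [0.5, 1.0, 4.0, 6.0]
labels = [0, -1, 1, -1]
ll = log_likelihood(scores, labels, 0.3, p0, p1)
```

Adding the log of a prior density on the parameters to this value gives the unnormalized log posterior, which is all that sampling-based inference needs.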
Let us look at a real example. Figure 1a shows a histogram of the scores obtained from a classifier on a public dataset (see Section 4 for more information about the datasets we use). At first glance, it is difficult to guess the performance of the classifier unless the oracle provides a lot of labels. However, if we assume that the scores follow a two-component mixture model as in (3), with a Gamma distribution for the $y_i = 0$ and a Normal distribution for the $y_i = 1$ component, then there is only a narrow choice of $\theta$ that can explain the scores with high likelihood; the red and green curves in Figure 1a show such a high-probability hypothesis. As we will see in the next section, the posterior on $\theta$ can be used to estimate the performance of the classifier.
3 Estimating performance
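Most performance measures can be computed directly from the model parameters $\theta$. For example, two often used performance measures are the precision $P(\tau; \theta)$ and recall $R(\tau; \theta)$ at a particular score threshold $\tau$. We can define these quantities in terms of the conditional distributions $p_y(s \mid \theta_y)$. Recall is defined as the fraction of the positive, i.e. $y_i = 1$, examples that have scores above a given threshold,
$$R(\tau; \theta) = \int_{\tau}^{\infty} p_1(s \mid \theta_1)\, ds. \tag{5}$$
Precision is defined to be the fraction of all examples with scores above a given threshold that are positive,
$$P(\tau; \theta) = \frac{\pi \int_{\tau}^{\infty} p_1(s \mid \theta_1)\, ds}{\pi \int_{\tau}^{\infty} p_1(s \mid \theta_1)\, ds + (1 - \pi) \int_{\tau}^{\infty} p_0(s \mid \theta_0)\, ds}. \tag{6}$$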
We can also compute the precision at a given level of recall by inverting $R(\tau; \theta)$, i.e. $P_r(r; \theta) = P(R^{-1}(r; \theta); \theta)$ for some recall $r$. Other performance measures, such as the equal error rate, true positive rate, true negative rate, sensitivity, specificity, and the ROC can be computed from $\theta$ in a similar manner.
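The definitions above translate directly into survival-function evaluations. The sketch below is one way to compute recall, precision, and precision at a given recall from a set of mixture parameters; the function names and the two Normal components used in the usage lines are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def recall(tau, p1):
    """Fraction of the positive-class score mass above the threshold tau."""
    return p1.sf(tau)

def precision(tau, pi, p0, p1):
    """Fraction of the total score mass above tau that belongs to the positive class."""
    tp = pi * p1.sf(tau)            # expected true-positive mass
    fp = (1.0 - pi) * p0.sf(tau)    # expected false-positive mass
    return tp / (tp + fp)

def precision_at_recall(r, pi, p0, p1):
    """Invert the recall to find the threshold, then evaluate precision there."""
    tau = p1.isf(r)                 # inverse survival function of the positive class
    return precision(tau, pi, p0, p1)

# Hypothetical parameters: balanced classes, unit-variance Normals at 0 and 2.
p0 = stats.norm(loc=0.0, scale=1.0)
p1 = stats.norm(loc=2.0, scale=1.0)
rec = recall(2.0, p1)
prec = precision(1.0, 0.5, p0, p1)
prec_at_half_recall = precision_at_recall(0.5, 0.5, p0, p1)
```

Using `sf` and `isf` rather than `1 - cdf` keeps the tail probabilities accurate for thresholds far out in the tails.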
The posterior on $\theta$ can also be used to obtain confidence bounds on the performance of the classifier. For example, for some choice of parameters $\theta$, the precision and recall can be computed for a range of score thresholds $\tau$ to obtain a curve (see solid curves in Figure 1b). Similarly, given the posterior on $\theta$, the distribution of $P(\tau; \theta)$ and $R(\tau; \theta)$ can be computed for a fixed $\tau$ to obtain confidence intervals (shown as colored bands in Figure 1b). The same reasoning can be applied to the precision-recall curve: for some recall $r$, the distribution of precisions at that recall, $P_r(r; \theta)$, can be used to compute confidence intervals on the curve (see Figure 1c).
While the approach of estimating performance based purely on the estimate of $\theta$ works well in the limit when the number of data items $N \to \infty$, it has some drawbacks when $N$ is small (on the order of ) and $\pi$ is unbalanced, in which case finite-sample effects come into play. This is especially the case when the number of positive examples is very small, say 10–100, in which case the performance curve will be very jagged. Since the previous approach views the scores (and the associated labels) as a finite sample from $p(S, Y \mid \theta)$, there will always be uncertainty in the performance estimate. When all items have been labeled by the oracle, the remaining uncertainty in the performance represents the variability in sampling from $p(S, Y \mid \theta)$. In practice, however, the question that is often asked is, “What is our best guess for the classifier performance on this particular test set?” In other words, we are interested in the sample performance rather than the population performance. Thus, when the oracle has labeled the whole test set, there should not be any uncertainty in the performance; it can simply be computed directly from $(S, Y)$.
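To estimate the sample performance, we need to account for uncertainty in the unlabeled items, $i \in \mathcal{U}_t$. This uncertainty is captured by the distribution of the unobserved labels $Y'_t = \{y_i \mid i \in \mathcal{U}_t\}$, found by marginalizing out the model parameters,
$$p(Y'_t \mid S, Y_t) = \int_{\Theta} p(Y'_t, \theta \mid S, Y_t)\, d\theta$$
$$= \int_{\Theta} p(\theta \mid S, Y_t) \prod_{i \in \mathcal{U}_t} p(y_i \mid s_i, \theta)\, d\theta. \tag{7}$$
Here $\Theta$ is the space of all possible parameters. On the second line we rely on the assumption of a mixture model to factor the joint probability distribution on $\theta$ and $Y'_t$.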
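For reference, the per-item label probability that appears once the parameters are fixed follows from Bayes' rule applied to the two-component mixture. The notation below ($\pi$ for the mixture weight, $p_0$ and $p_1$ for the class-conditional densities with parameters $\theta_0$ and $\theta_1$) is assumed for this worked step.

```latex
\begin{aligned}
p(y_i = 1 \mid s_i, \theta)
  &= \frac{p(s_i, y_i = 1 \mid \theta)}{p(s_i, y_i = 0 \mid \theta) + p(s_i, y_i = 1 \mid \theta)} \\
  &= \frac{\pi\, p_1(s_i \mid \theta_1)}{(1 - \pi)\, p_0(s_i \mid \theta_0) + \pi\, p_1(s_i \mid \theta_1)},
\qquad
p(y_i = 0 \mid s_i, \theta) = 1 - p(y_i = 1 \mid s_i, \theta).
\end{aligned}
```

Because the labels are conditionally independent given the parameters and the scores, each unlabeled item can be sampled separately from this Bernoulli distribution.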
One way to think of this approach is as follows: imagine that we sample $Y'_t$ from $p(Y'_t \mid S, Y_t)$. We can then use all the labels $Y_t \cup Y'_t$ and the scores $S$ to trace out a performance curve (e.g., a precision-recall curve). Now, as we repeat the sampling, each performance curve will look slightly different. Thus, the posterior distribution on $Y'_t$ in effect gives us a distribution of performance curves. We can use this distribution to compute quantities such as the expected performance curve, the variance in the curves, and confidence intervals. The main difference between the sample and population performance estimates will be at the tails of the score distribution, where individual item labels can have a large impact on the performance curve.
3.1 Sampling from the posterior
In general, we cannot compute $p(Y'_t \mid S, Y_t)$ in (7) analytically, so we must resort to approximate methods. For some choices of class-conditional densities, $p_y(s \mid \theta_y)$, such as when they are Normal distributions, it is possible to carry out the marginalization over $\theta$ in (7) analytically. In that case one could use collapsed Gibbs sampling to sample from the posterior on $Y'_t$, as is often done for models involving the Dirichlet process [14]. A more generally applicable method, which we will describe here, is to split the sampling into three steps: (a) sample $\bar{\theta}$ from $p(\theta \mid S, Y_t)$, (b) fix the mixture parameters to $\bar{\theta}$ and sample the labels $Y'_t$ given their associated scores, and (c) compute the performance, such as precision and recall, for all score thresholds $\tau$. By repeating these three steps, we can obtain a sample from the distribution over the performance curves.
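Steps (b) and (c) can be sketched in a few lines of Python once a parameter sample is available. The helper names, the `-1` marker for unlabeled items, and the choice to place thresholds at the observed scores are assumptions of this illustration rather than details given in the text.

```python
import numpy as np
from scipy import stats

def sample_labels(scores, labels, pi, p0, p1, rng):
    """Step (b): keep oracle labels and draw the rest from p(y_i | s_i, theta)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = pi * p1.pdf(scores)
    prob_pos = pos / (pos + (1.0 - pi) * p0.pdf(scores))
    drawn = (rng.random(scores.shape) < prob_pos).astype(int)
    # labels[i] == -1 marks an unlabeled item (an encoding assumed here).
    return np.where(labels >= 0, labels, drawn)

def precision_recall_curve(scores, labels):
    """Step (c): precision and recall when thresholding just below each score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    true_pos = np.cumsum(y)                    # positives among the top-k items
    k = np.arange(1, len(y) + 1)
    precision = true_pos / k
    recall = true_pos / max(int(y.sum()), 1)   # guard against no positives
    return precision, recall

# Hypothetical, well-separated components so the sampled labels are unambiguous.
rng = np.random.default_rng(0)
p0 = stats.norm(loc=0.0, scale=1.0)
p1 = stats.norm(loc=10.0, scale=1.0)
sampled = sample_labels([10.0, 10.0, 0.0, 0.0], [-1, -1, -1, -1], 0.5, p0, p1, rng)
prec, rec = precision_recall_curve([3.0, 2.0, 1.0, 0.0], [1, 0, 1, 0])
```

Repeating `sample_labels` for many parameter samples and recomputing the curve each time yields the distribution over performance curves described above.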
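The first step, sampling from the posterior $p(\theta \mid S, Y_t)$, can be carried out using importance sampling (IS). We experimented with Metropolis-Hastings and Hamiltonian Monte Carlo [15], but we found that IS worked well for this problem, required less parameter tuning, and was much faster. In IS, we sample from a proposal distribution $q(\theta)$ in order to estimate properties of the desired distribution $p(\theta \mid S, Y_t)$. Suppose we draw $M$ samples of $\theta$ from $q(\theta)$ to get $\theta_1, \ldots, \theta_M$. Then, we can approximate expectations of some function $g(\theta)$ using the weighted function evaluations, i.e. $\mathbb{E}[g(\theta)] \approx \sum_{m=1}^{M} w_m\, g(\theta_m)$. The weights $w_m$ correct for the bias introduced by sampling from $q(\theta)$ and are defined as,
$$w_m = \frac{p(\theta_m \mid S, Y_t) / q(\theta_m)}{\sum_{j=1}^{M} p(\theta_j \mid S, Y_t) / q(\theta_j)}. \tag{8}$$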
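The weighting scheme can be tried out on a toy problem. The sketch below uses self-normalized importance weights, which only require the target density up to a constant; the one-dimensional Normal target and proposal are stand-ins chosen for this example and are not the posterior or proposal used in the text.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def self_normalized_weights(log_target, log_proposal):
    """Importance weights that sum to one, computed stably in log space."""
    log_w = np.asarray(log_target) - np.asarray(log_proposal)
    return np.exp(log_w - logsumexp(log_w))

# Toy target: N(1, 0.5^2), known only up to a constant. Proposal: N(0, 2^2).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=2.0, size=20000)
log_target = -0.5 * ((samples - 1.0) / 0.5) ** 2
log_proposal = stats.norm.logpdf(samples, loc=0.0, scale=2.0)
weights = self_normalized_weights(log_target, log_proposal)

# Weighted function evaluations approximate expectations under the target.
mean_estimate = float(np.sum(weights * samples))
```

Working with log densities avoids underflow when the likelihood is a product over many items, which is the typical situation for a posterior over mixture parameters.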
For the datasets in this paper, we found that the state-space around the MAP estimate of $\theta$,
$$\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid S, Y_t), \tag{9}$$
was well approximated by a multivariate Normal distribution. (We used L-BFGS-B [4] to carry out the optimization. To avoid local maxima, we used multiple starting points.) Hence, for the proposal distribution we used,
$$q(\theta) = \mathcal{N}(\theta;\, \mu_q, \Sigma_q). \tag{10}$$
To simplify things further, we used a diagonal covariance matrix, $\Sigma_q$. The elements along the diagonal of $\Sigma_q$ were found by fitting a univariate Normal locally to $p(\theta \mid S, Y_t)$ along each dimension of $\theta$ while the other elements were fixed at their MAP estimates. The mean of the proposal distribution, $\mu_q$, was set to the MAP estimate of $\theta$.
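We now have all steps needed to estimate the performance of the classifier, given the scores $S$ and some labels $Y_t$ obtained from the oracle:
1. Find the MAP estimate $\hat{\theta}$ of $\theta$ using (9).
2. Fit a proposal distribution $q(\theta)$ to $p(\theta \mid S, Y_t)$ locally around $\hat{\theta}$.
3. Sample $M$ instances of $\theta$, $\theta_1, \ldots, \theta_M$, from $q(\theta)$ and calculate the weights $w_m$.
4. For each $\theta_m$, sample the labels for $i \in \mathcal{U}_t$ to get $Y'_{t,m}$.
5. Estimate performance measures using the scores $S$, labels $Y_t \cup Y'_{t,m}$ and weights $w_m$.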
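The five steps can be put together in a compact Python sketch. It is a simplified illustration under several assumptions that are not from the text: both class-conditionals are plain Normal distributions, the priors are arbitrary weak choices, a single optimizer start is used, the diagonal proposal variances come from a finite-difference curvature at the MAP estimate, and all function names are hypothetical.

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import logsumexp

def log_posterior(theta, s, y):
    """Unnormalized log posterior; y[i] = -1 marks an unlabeled item."""
    pi, mu0, sd0, mu1, sd1 = theta
    if not (0.0 < pi < 1.0 and sd0 > 0.0 and sd1 > 0.0):
        return -np.inf
    l0 = np.log1p(-pi) + stats.norm.logpdf(s, mu0, sd0)
    l1 = np.log(pi) + stats.norm.logpdf(s, mu1, sd1)
    loglik = np.where(y == 1, l1, np.where(y == 0, l0, np.logaddexp(l0, l1))).sum()
    # Illustrative priors: Beta on pi, Normal on means, Gamma on std. deviations.
    logprior = (stats.beta.logpdf(pi, 2, 2)
                + stats.norm.logpdf(mu0, 0, 10) + stats.norm.logpdf(mu1, 0, 10)
                + stats.gamma.logpdf(sd0, 2) + stats.gamma.logpdf(sd1, 2))
    return loglik + logprior

def spe_samples(s, y, n_samples=500, seed=0):
    s, y = np.asarray(s, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    neg = lambda th: -log_posterior(th, s, y)
    # Step 1: MAP estimate with L-BFGS-B (one start here; the text uses several).
    x0 = np.array([0.5, np.quantile(s, 0.25), s.std(), np.quantile(s, 0.75), s.std()])
    bounds = [(1e-3, 1 - 1e-3), (None, None), (1e-3, None), (None, None), (1e-3, None)]
    theta_map = optimize.minimize(neg, x0, method="L-BFGS-B", bounds=bounds).x
    # Step 2: diagonal Normal proposal from the 1-D curvature along each axis.
    sd_q = np.empty(5)
    for d in range(5):
        h = 1e-3 * max(1.0, abs(theta_map[d]))
        e = np.zeros(5)
        e[d] = h
        curv = (neg(theta_map + e) - 2 * neg(theta_map) + neg(theta_map - e)) / h**2
        sd_q[d] = 1.0 / np.sqrt(max(curv, 1e-6))
    # Step 3: sample theta_m from q and compute self-normalized weights.
    thetas = theta_map + sd_q * rng.standard_normal((n_samples, 5))
    log_q = stats.norm.logpdf(thetas, theta_map, sd_q).sum(axis=1)
    log_p = np.array([log_posterior(th, s, y) for th in thetas])
    w = np.exp(log_p - log_q - logsumexp(log_p - log_q))
    # Step 4: for each theta_m, sample labels for the unlabeled items.
    Y = np.empty((n_samples, len(s)), dtype=int)
    for m, (pi, mu0, sd0, mu1, sd1) in enumerate(thetas):
        if w[m] == 0.0:  # zero-weight (e.g. invalid) theta never contributes
            Y[m] = np.where(y >= 0, y, 0)
            continue
        l0 = np.log1p(-pi) + stats.norm.logpdf(s, mu0, sd0)
        l1 = np.log(pi) + stats.norm.logpdf(s, mu1, sd1)
        prob_pos = np.exp(l1 - np.logaddexp(l0, l1))
        Y[m] = np.where(y >= 0, y, rng.random(len(s)) < prob_pos)
    return thetas, w, Y

def expected_precision_recall(s, w, Y, tau):
    """Step 5: weighted average of sample precision and recall at threshold tau."""
    above = np.asarray(s) > tau
    tp = Y[:, above].sum(axis=1)
    prec = tp / max(int(above.sum()), 1)
    rec = tp / np.maximum(Y.sum(axis=1), 1)
    return float(w @ prec), float(w @ rec)
```

Calling `spe_samples(scores, labels)` with `-1` entries for unlabeled items returns the parameter samples, their weights, and one completed label vector per sample; `expected_precision_recall` then summarizes them at a chosen threshold. Percentiles of the per-sample values, weighted by `w`, would give confidence intervals in the same way.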
4 Experiments
4.1 Datasets
We surveyed the literature for published classifier scores with ground truth labels. One such dataset that we found was the Caltech Pedestrian Dataset (CPD; downloaded from http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/), for which both detector scores and ground truth labels are available for a wide variety of detectors [8]. Moreover, the CPD website also has scores and labels available, using the same detectors, for other pedestrian detection datasets, such as the INRIA (abbr. INR) dataset [5].
We made use of the detections in the CPD and INR datasets as if they were classifier outputs. To some extent, these detectors are in fact classifiers, in that they use the sliding-window technique for object detection. Here, windows are extracted at different locations and scales in the image, and each window is classified using a pedestrian classifier (with the caveat that there are often some extra post-processing steps carried out, such as non-maximum suppression to reduce the number of false positive detections). For our experiments, we show results on a selection of detectors and datasets to highlight both the advantages and drawbacks of using Semisupervised Performance Evaluation (SPE). To make experiments go faster, we sampled the datasets randomly to have between 800 and 2,000 items. See [8] for references to all detectors.
To complement the pedestrian datasets, we also used a basic linear SVM classifier and a logistic regression classifier on the “optdigits” (abbr. DGT) and “sat” (SAT) datasets from the UCI Machine Learning Repository [11]. Since both datasets are multiclass, but our method only handles binary classification, we chose one category for $y_i = 1$ and grouped the others into $y_i = 0$. Thus, each multiclass dataset was turned into multiple binary datasets. Planned future work includes extending our approach to multiclass classifiers. In the figures, the naming convention is as follows: “svm3” means that the SVM classifier was used with category 3 in the dataset being assigned to the $y_i = 1$ class, “logres9” denotes that the logistic regression classifier was used with category 9 being the $y_i = 1$ class, and so on. The datasets had 1,800–2,000 items each.
4.2 Choosing class conditionals
So far we have not discussed in detail which distribution families to use for the class-conditional distributions. To find out which parametric distributions are appropriate for modeling the score class-conditionals, we took the classifier scores and split them into two groups, one for $y_i = 0$ and one for $y_i = 1$. We used maximum likelihood estimation (MLE) to fit different families of probability distributions (see Figure 3 for a list of distributions) on 80% of the data (sampled randomly) in each group. We then ranked the distributions by the log likelihood of the remaining 20% of the data (given the MLE-fitted parameters). In total, we carried out this procedure on 78 class-conditionals from the different datasets and classifiers.
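The ranking procedure can be sketched with `scipy.stats`, whose continuous distributions provide `fit` and `logpdf`. The function name, the dictionary layout for the candidate families, and fixing the Gamma location at zero are choices made for this illustration; the synthetic data in the usage lines is not one of the datasets from the text.

```python
import numpy as np
from scipy import stats

def rank_families(scores, families, train_frac=0.8, seed=0):
    """Rank candidate families by held-out log-likelihood (best first).

    families maps a name to (scipy.stats distribution, dict of fit keyword
    arguments), for example {"gamma": (stats.gamma, {"floc": 0.0})}.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(scores, dtype=float))
    n_train = int(train_frac * len(shuffled))
    train, heldout = shuffled[:n_train], shuffled[n_train:]
    results = []
    for name, (dist, fit_kwargs) in families.items():
        params = dist.fit(train, **fit_kwargs)       # maximum-likelihood fit
        loglik = float(dist.logpdf(heldout, *params).sum())
        results.append((name, loglik))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Synthetic scores for one class, drawn from a Gamma distribution.
rng = np.random.default_rng(0)
scores = rng.gamma(shape=2.0, scale=1.0, size=5000)
candidates = {
    "normal": (stats.norm, {}),
    "gamma": (stats.gamma, {"floc": 0.0}),  # location fixed at 0 for a stable fit
}
ranking = rank_families(scores, candidates)
```

Scoring on held-out data rather than on the training split keeps families with more free parameters from being favored purely because they can overfit.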
Figure 3G shows the top-3 distributions that explained the class-conditional scores with highest likelihood for a selection of the datasets and classifiers. We found that the truncated Normal distribution was in the top-3 list for 48/78 dataset class-conditionals, and that the Gamma distribution was in the top-3 list 53/78 times; at least one of the two distributions was always in the top-3 list. Figure 3A–F show some examples of the fitted distributions. In some cases, like Figure 3C, a mixture model would have provided a better fit than the simple distributions we tried. That said, we found that truncated Normal and Gamma distributions were good choices for most of the datasets.
Since we use a Bayesian approach in equation (4), we must also define a prior on $\theta$. The prior will vary depending on which distribution is chosen, and it should be chosen based on what we know about the data and the classifier. As an example, for the truncated Normal distribution, we use a Normal and a Gamma distribution as priors on the mean and standard deviation respectively (since we use sampling for inference, we are not limited to conjugate priors). As a prior on the mixture weight $\pi$, we use a Beta distribution.
In some situations when little is known about the classifier, it makes sense to try different kinds of class-conditional distributions. One heuristic, which we found worked well in our experiments, is to try different combinations of distributions for $p_0$ and $p_1$, and then choose the combination achieving the highest maximum likelihood on the labeled and unlabeled data.
4.3 Applying SPE
Figure 2 shows SPE applied to different datasets. The left-most plots show the estimation error, as measured by the area between the true and predicted precision-recall curves, versus the number of labels sampled. The datasets in Figure 2A–B and C–D were chosen to highlight the strengths and weaknesses of using SPE. Figure 2A shows SPE applied to the ChnFtrs detector in the CPD dataset. Already at 20 sampled labels, the estimate is very close (see Figure 2B). In a few cases, e.g. in Figure 2C–D (logres8 on the DGT dataset), SPE does not fare as well. While SPE performs as well as the naive method in terms of estimation error, the score distribution is not well explained by the assumptions of the model, so there is a bias in the prediction. That said, despite the fact that SPE is biased in Figure 2D, it is still far better than the naive method for 100 labels. Ultimately, the accuracy of SPE depends on how well the score data fit the assumptions in Section 2.
Figure 2E compares the estimation error of SPE to the naive method for different datasets, when only 20 labels are known. In almost all cases, SPE performs significantly better. Moreover, the variances in the SPE estimates are smaller than those of the naive method.
4.4 Classifier recalibration
Applying SPE to a test dataset allows us to “recalibrate” the classifier to that dataset. Unlike previous work on classifier calibration [1, 17], SPE does not require all items to be labeled. For each unlabeled data item, we can compute the probability that it belongs to the $y_i = 1$ class by calculating the empirical expectation from the samples, i.e. $\hat{p}(y_i = 1) = \sum_{m=1}^{M} w_m\, y_i^{(m)}$, where $y_i^{(m)}$ is the label of item $i$ in the $m$-th sample and the weights $w_m$ sum to one.
Similarly, we can choose a threshold $\tau$ to use with the classifier based on some pre-determined criteria. For example, the requirement might be that the classifier performs with recall $R > r_{\min}$ and precision $P > p_{\min}$. In that case, we define a condition $C = \{R(\tau) > r_{\min} \wedge P(\tau) > p_{\min}\}$. Then, for each $\tau$, we find the probability that the condition is satisfied by calculating the expectation over the unlabeled items $Y'_t$. Figure 4 shows the probability that $C$ is satisfied at different values of $\tau$. Thus, this approach can be used to choose new thresholds for different datasets.
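Both recalibration quantities reduce to weighted averages over sampled label vectors. The sketch below assumes the samples are stored as a matrix with one row per sample and that the weights sum to one; the function names, argument names, and the tiny hand-made example are hypothetical.

```python
import numpy as np

def positive_probability(label_samples, weights):
    """Per-item probability of the positive class as a weighted label average."""
    return np.asarray(weights, dtype=float) @ np.asarray(label_samples)

def condition_probability(scores, label_samples, weights, taus, r_min, p_min):
    """Probability that recall > r_min and precision > p_min at each threshold."""
    scores = np.asarray(scores, dtype=float)
    Y = np.asarray(label_samples)
    w = np.asarray(weights, dtype=float)
    probs = []
    for tau in taus:
        above = scores > tau
        true_pos = Y[:, above].sum(axis=1)
        recall = true_pos / np.maximum(Y.sum(axis=1), 1)
        precision = true_pos / max(int(above.sum()), 1)
        satisfied = (recall > r_min) & (precision > p_min)
        probs.append(float(np.sum(w * satisfied)))
    return np.array(probs)

# Hand-made example: four items, two sampled label vectors with weights 0.75 / 0.25.
scores = [0.1, 0.4, 0.6, 0.9]
label_samples = [[0, 0, 1, 1],
                 [0, 1, 1, 1]]
weights = [0.75, 0.25]
item_probs = positive_probability(label_samples, weights)
cond_probs = condition_probability(scores, label_samples, weights,
                                   taus=[0.5, 0.3], r_min=0.9, p_min=0.9)
```

A threshold can then be chosen, for instance, as the one with the highest probability of satisfying the condition, or the lowest threshold whose probability exceeds a required level.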
5 Related work
Previous approaches for estimating classifier performance with few labels fall into two categories: stratified sampling and active estimation using importance sampling. Bennett and Carvalho [2] suggested that the accuracy of classifiers can be estimated cost-effectively by dividing the data into disjoint strata based on the item scores, and proposed an online algorithm for sampling from the strata. This work has since been generalized to other classifier performance metrics, such as precision and recall [9]. Sawade et al. proposed instead to use importance sampling to focus labeling effort on data items with high classifier uncertainty, and applied it to standard loss functions and F-measures [18, 19]. While both of these approaches assume that the classifier threshold is fixed (see Section 2) and that a single scalar performance measure is desired, SPE can be applied to the tradeoff between different performance measures in the form of performance curves.
Previous work [12, 13] showed how a two-component Gaussian mixture model can be used to obtain accurate ROC curves in different settings. Erkanli et al. [10] extended this work by fitting mixtures of Dirichlet process priors to the class-conditional distributions. This allowed them to provide smooth performance estimates even when the class-conditional distributions could not be explained by standard parametric distributions. Similarly, previous work on classifier calibration has involved fitting mixture models to score distributions [1, 17]. In contrast to previous work, which requires all data items to be labeled, SPE also makes use of the unlabeled data. This semisupervised approach allows SPE to estimate classifier performance with very few labels, or when the proportions of positive and negative examples are very unbalanced.
We explored the problem of estimating classifier performance from few labeled items. We proposed using mixtures of two densities to model the scores of classifiers. This allows us to predict performance curves even when a very small number (none in the limit) of the samples are labeled. Using four public datasets, and multiple classifiers, we showed that classifier score distributions can often be well approximated by two-component mixture models with standard parametric component distributions, such as truncated Normal and Gamma distributions. We demonstrated how our model, Semisupervised Performance Evaluation (SPE), can be used to estimate classifier performance, with confidence intervals, using only a few labeled examples. We presented a sampling scheme based on importance sampling for efficient inference.
This line of research opens up many interesting avenues for future exploration. For example, is it possible to do unbiased active querying, so that the oracle is asked to label the most informative examples? One possibility in this direction would be to employ importance weighted active sampling techniques [3, 7], similar in spirit to [18, 19] but for performance curves. Another future direction would be to extend SPE to multi-component mixture models and multiclass problems. That said, as shown by our experiments, SPE already works well for a broad range of classifiers and datasets, and can estimate classifier performance with as few as 10 labels (see Figure 1).
References
[1] Paul N. Bennett. Using asymmetric distributions to improve classifier probabilities: A comparison of new and standard parametric methods. Technical report, Carnegie Mellon University, 2002.
[2] Paul N. Bennett and Vitor R. Carvalho. Online stratified sampling: evaluating classifiers at web-scale. In CIKM, 2010.
[3] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In ICML, 2009.
[4] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5):1190–1208, 1995.
[5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In ICCV, 2005.
[6] Heiko Dankert, Liming Wang, Eric D. Hoopfer, David J. Anderson, and Pietro Perona. Automated monitoring and analysis of social behavior in drosophila. Nat Meth, 6(4):297–303, 04 2009.
[7] Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. In ICML, 2008.
[8] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 99, 2011.
[9] Gregory Druck and Andrew McCallum. Toward interactive training and evaluation. In CIKM, 2011.
[10] Alaattin Erkanli, Minje Sung, E. Jane Costello, and Adrian Angold. Bayesian semi-parametric ROC analysis. Statist. Med., 25:3905–3928, 2006.
[11] A. Frank and Arthur Asuncion. UCI machine learning repository, 2010.
[12] Jiezhun Gu, Subhashis Ghosal, and Anindya Roy. Bayesian bootstrap estimation of ROC curve. Statist. Med., 27:5407–5420, 2008.
[13] Martin Hellmich, Keith R. Abrams, and Alex J. Sutton. Bayesian Approaches to Meta-analysis of ROC Curves. Med. Decis. Making, 19:252–264, 1999.
[14] Steven N. MacEachern. Estimating normal means with a conjugate style dirichlet process prior. Communications in Statistics B, 23(3):727–741, 1994.
[15] Radford M. Neal. MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, editors, Handbook of Markov Chain Monte Carlo, pages 113–162. Chapman & Hall / CRC Press, 2010.
[16] Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103–134, 2000.
[17] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
[18] Christoph Sawade, Niels Landwehr, Steffen Bickel, and Tobias Scheffer. Active risk estimation. In ICML, 2010.
[19] Christoph Sawade, Niels Landwehr, and Tobias Scheffer. Active Estimation of F-Measures. In NIPS, 2010.
[20] Matthias Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, 2002.
[21] Xiaojin Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin–Madison, 2008.