Abstract
A set of probabilistic predictions is well calibrated if the events that are predicted to occur with probability $p$ do in fact occur about a fraction $p$ of the time. Well-calibrated predictions are particularly important when machine learning models are used in decision analysis. This paper presents two new nonparametric methods for calibrating the outputs of binary classification models: a method based on Bayes-optimal selection and a method based on Bayesian model averaging. The advantage of these methods is that they are independent of the algorithm used to learn a predictive model, and they can be applied in a postprocessing step, after the model is learned. This makes them applicable to a wide variety of machine learning models and methods. These calibration methods, as well as other methods, are tested on a variety of datasets in terms of both discrimination and calibration performance. The results show that the methods either outperform or are comparable in performance to the state-of-the-art calibration methods.
1 Introduction
This paper focuses on the development of probabilistic calibration methods for probabilistic prediction tasks. Traditionally, machine-learning research has focused on developing methods and models for improving discrimination, rather than on methods for improving calibration. However, both are very important. Well-calibrated predictions are particularly important in decision making and decision analysis [niculescu2005predicting; zadrozny2001obtaining; zadrozny2002transforming]. Miscalibrated models, which overestimate or underestimate the probability of outcomes, may lead to suboptimal decisions.
Since calibration is often not a priority, many prediction models learned by machine learning methods may be miscalibrated. The objective of this work is to develop general but effective methods that address the miscalibration problem. Our aim is to have methods that can be used independently of the prediction model and that can be applied in a postprocessing step, after the model is learned from the data. This approach frees the designer of a machine learning model from the need to add calibration measures and terms to the objective function used to learn the model. Moreover, all modeling methods make assumptions, and some of those assumptions may not hold in a given application, which can lead to miscalibration. In addition, limited training data can negatively affect a model's calibration performance.
Existing calibration methods can be divided into parametric and nonparametric methods. An example of a parametric method is Platt's method [platt1999probabilistic], which applies a sigmoidal transformation that maps the output of a model (e.g., a posterior probability) to a new probability that is intended to be better calibrated. The parameters of the sigmoidal transformation function are learned using the maximum likelihood estimation framework. A limitation of the sigmoidal function is that it is symmetric and does not work well for highly biased distributions [jiang2012calibrating]. The most common nonparametric methods are based either on binning [zadrozny2001obtaining] or on isotonic regression [ayer1955empirical]. Briefly, the binning approach divides the observed outcome predictions into k bins; each bin is associated with a new probability value that is derived from empirical estimates. The isotonic regression algorithm is a special adaptive binning approach that ensures the isotonicity (monotonicity) of the probability estimates.

In this paper we introduce two new Bayesian nonparametric calibration methods. The first one, Selection over Bayesian Binnings (SBB), uses dynamic programming to efficiently search over all possible binnings of the posterior probabilities within a training set in order to select the Bayes-optimal binning according to a scoring measure. The second method, Averaging over Bayesian Binnings (ABB), generalizes SBB by performing model averaging over all possible binnings. The advantage of these Bayesian methods over existing calibration methods is that they exhibit more stable, well-performing behavior under a variety of conditions.
Our probabilistic calibration methods can be applied in two prediction settings. First, they can be used to convert the outputs of discriminative classification models, which have no apparent probabilistic interpretation, into posterior class probabilities. An example is an SVM, which learns a discriminative model that has no direct probabilistic interpretation. Second, the calibration methods can be applied to improve the calibration of predictions of a probabilistic model that is miscalibrated. For example, a naïve Bayes (NB) model is a probabilistic model, but its class posteriors are often miscalibrated due to unrealistic independence assumptions [niculescu2005predicting]. The methods we describe are shown empirically to improve the calibration of NB models without reducing their discrimination. The methods can also work well on calibrating models that are less egregiously miscalibrated than NB models.
The remainder of this paper is organized as follows. Section 2 describes the methods that we applied to perform postprocessing calibration. Section 3 describes the experimental setup that we used in evaluating the calibration methods. The results of the experiments are presented in Section 4. Section 5 discusses the results and describes the advantages and disadvantages of the proposed methods in comparison to other calibration methods. Finally, Section 6 states conclusions and describes several areas for future work.
2 Methods
In this section we present two new Bayesian nonparametric methods for binary classifier calibration that generalize the histogram-binning calibration method [zadrozny2001obtaining] by considering all possible binnings of the training data. The first proposed method, a hard-binning classifier calibration method, is called Selection over Bayesian Binnings (SBB). We also introduce a new soft-binning method, called Averaging over Bayesian Binnings (ABB), which generalizes SBB by model averaging over all possible binnings. There are two main challenges here. One is how to score a binning; we use a Bayesian score. The other is how to search efficiently over the very large space of binnings; we use dynamic programming to address this issue.
2.1 Bayesian Calibration Score
Let $y_i$ and $z_i$ denote, respectively, an uncalibrated classifier prediction and the true class of the $i$'th instance. Also, let $D$ define the set of all $N$ training instances $(y_i, z_i)$. In addition, let $S$ be the sorted set of all uncalibrated classifier predictions, let $S_{i,j}$ be the sublist of $S$ starting at the $i$'th index and ending at the $j$'th index, and let $B$ denote a binning of $S$. A binning model $M$ induced by the training set is defined as:

$M \equiv \{k, B, \Theta\}$,  (1)

where $k$ is the number of bins over the set $S$ and $\Theta$ is the set of all the calibration model parameters $\theta_1, \ldots, \theta_k$, which are defined as follows. For a bin $b$, which is determined by $B$, the distribution of the class variable $z$ is modeled as a binomial distribution with parameter $\theta_b$. Thus, $\Theta$ specifies the binomial distributions of all the existing bins in $M$. We note that our binning model is motivated by the model introduced in [jonathan12application] for variable discretization, which is here customized to perform classifier calibration. We score a binning model as follows:

$\mathrm{Score}(M) = P(M)\, P(D \mid M)$.  (2)
The marginal likelihood $P(D \mid M)$ in Equation 2 is derived by marginalizing the joint probability of $D$ and the parameters over the entire parameter space, according to the following equation:

$P(D \mid M) = \int P(D \mid M, \Theta)\, P(\Theta \mid M)\, d\Theta$.  (3)
Equation 3 has a closed-form solution under the following assumptions: (1) all samples are i.i.d., and the class distribution for instances located in bin $b$ is modeled using a binomial distribution with parameter $\theta_b$; (2) the class distributions over different bins are independent of each other; and (3) the prior distribution over each binning model parameter $\theta_b$ is modeled using a $\mathrm{Beta}$ distribution. We also assume that the parameters $\alpha$ and $\beta$ of the $\mathrm{Beta}$ distribution are both equal to one, which corresponds to a uniform prior over each $\theta_b$. The closed-form solution to the marginal likelihood given the above assumptions is as follows [heckerman1995learning]:

$P(D \mid M) = \prod_{b=1}^{k} \dfrac{n_{b0}!\; n_{b1}!}{(N_b + 1)!}$,  (4)

where $N_b$ is the total number of training instances located in bin $b$. Also, $n_{b0}$ and $n_{b1}$ are, respectively, the number of class-zero and class-one instances among all training instances in bin $b$.
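For concreteness, the closed-form per-bin marginal likelihood can be computed in log space using the log-gamma function. The following is an illustrative sketch (not the authors' code); it keeps the Beta prior parameters explicit so the uniform-prior special case above is visible:

```python
from math import lgamma

def log_bin_marginal_likelihood(n_b0, n_b1, alpha=1.0, beta=1.0):
    """Log marginal likelihood of one bin holding n_b0 class-zero and
    n_b1 class-one instances, under a Beta(alpha, beta) prior on the
    bin's binomial parameter. With alpha = beta = 1 this reduces to
    log( n_b0! * n_b1! / (N_b + 1)! ), where N_b = n_b0 + n_b1."""
    return (lgamma(alpha + beta) - lgamma(n_b0 + n_b1 + alpha + beta)
            + lgamma(n_b0 + alpha) - lgamma(alpha)
            + lgamma(n_b1 + beta) - lgamma(beta))
```

Working in log space avoids underflow for large bins; the marginal likelihood of a whole binning is then the sum of these terms over its bins.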
The term $P(M)$ in Equation 2 specifies the prior probability of the binning given by calibration model $M$. It can be interpreted as a structure prior, which we define as follows. Let $\kappa$ be the prior probability of there being a partitioning boundary between $y_i$ and $y_{i+1}$ in the binning given by model $M$. Consider the prior probability for the presence of a bin $b$ that contains the sequence of training instances $y_l, \ldots, y_u$ according to model $M$. Assuming independence of the appearance of partitioning boundaries, we can calculate the prior of the boundaries defining bin $b$ as follows:

$\mathrm{Prior}(b) = \kappa \prod_{i=l}^{u-1} (1 - \kappa)$,  (5)

where the product is over all consecutive pairs of training instances from $l$ to $u$, inclusive: no boundary appears inside the bin, and a boundary appears at its upper end (for the last bin, the leading factor $\kappa$ is omitted).
Combining Equations 5 and 4 into Equation 2, we obtain the following Bayesian score for calibration model $M$:

$\mathrm{Score}(M) = \prod_{b=1}^{k} \mathrm{Prior}(b)\, \dfrac{n_{b0}!\; n_{b1}!}{(N_b + 1)!}$.  (6)
2.2 The SBB and ABB models
We can use the above Bayesian score to perform model selection or model averaging. Selection involves choosing the highest-scoring partitioning model $M^*$ and calibrating a prediction $y$ as $P(z = 1 \mid y, M^*)$. As mentioned, we call this approach Selection over Bayesian Binnings (SBB). Model averaging involves calibrating predictions over all possible binnings; we call this approach the Averaging over Bayesian Binnings (ABB) model. A calibrated prediction in ABB is derived as follows:

$P(z = 1 \mid y, D) = \sum_{j=1}^{2^{N-1}} \dfrac{\mathrm{Score}(M_j)}{\sum_{j'} \mathrm{Score}(M_{j'})}\; P(z = 1 \mid y, M_j)$,  (7)

where $N$ is the total number of predictions in $S$ (i.e., training instances) and the models $M_j$ range over all possible binnings.
Both SBB and ABB consider all possible binnings of the predictions in $S$, the number of which is exponential in $N$. Thus, in general, a brute-force approach is not computationally tractable. Therefore, we apply dynamic programming, as described in the next two sections.
2.3 Dynamic Programming Search in SBB
This section summarizes the dynamic programming method used in SBB. It is based on the dynamic-programming-based discretization method described in [jonathan12application]. Recall that $S$ is the sorted set of all uncalibrated classifier outputs in the training data set, and that $S_{1,j}$ denotes the prefix of $S$ containing its first $j$ uncalibrated estimates. Consider finding the optimal binnings corresponding to the prefixes $S_{1,1}, \ldots, S_{1,j}$. Assume we have already found the highest-scoring binnings $M_1^*, \ldots, M_{j-1}^*$ corresponding to the prefixes $S_{1,1}, \ldots, S_{1,j-1}$, and let $V_1, \ldots, V_{j-1}$ denote the respective scores of these optimal binnings. Let $\mathrm{Score}(S_{i,j})$ be the score of subsequence $S_{i,j}$ when it is considered as a single bin in the calibration model. For all $i$ from $1$ to $j$, SBB computes $V_{i-1}\, \mathrm{Score}(S_{i,j})$, which is the score of the highest-scoring binning of the set $S_{1,j}$ in which the subsequence $S_{i,j}$ is treated as a single bin. Since this binning score is derived from two other scores, we call it a composite score. The fact that the composite score is a product of two scores follows from the decomposition of the Bayesian scoring measure we are using, as given by Equation 6; in particular, both the prior and the marginal likelihood terms of the score are decomposable over bins.

In finding the best binning of $S_{1,j}$, SBB chooses the maximum composite score over all $i$, which corresponds to the optimal binning of the training data subset $S_{1,j}$; this score is stored in $V_j$. By repeating this process from $j = 1$ to $N$, SBB derives the optimal binning of the set $S$, which is the best binning over all possible binnings. As shown in [jonathan12application], the computational time complexity of the above dynamic programming procedure is $O(N^2)$.
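The prefix recurrence just described can be sketched as follows. This illustrative Python version (not the authors' implementation) scores each bin by its marginal likelihood alone, i.e., it assumes a flat structure prior for brevity; the composite score is the cached prefix score times the last-bin score, accumulated in log space:

```python
from math import lgamma

def bin_score(labels, i, j):
    """Log marginal likelihood of sorted instances i..j as a single bin,
    under a uniform Beta prior: log( n0! * n1! / (n0 + n1 + 1)! )."""
    n1 = sum(labels[i:j + 1])
    n0 = (j - i + 1) - n1
    return lgamma(n0 + 1) + lgamma(n1 + 1) - lgamma(n0 + n1 + 2)

def sbb_select(labels):
    """O(N^2) prefix dynamic program over 0/1 labels sorted by classifier
    score: best[j] holds the log score of the highest-scoring binning of
    the first j labels."""
    n = len(labels)
    best = [0.0] * (n + 1)  # best[0]: empty prefix, log(1) = 0
    cut = [0] * (n + 1)     # back-pointer: start index of the last bin
    for j in range(1, n + 1):
        # Composite score: cached prefix score plus (log) last-bin score.
        best[j], cut[j] = max(
            (best[i] + bin_score(labels, i, j - 1), i) for i in range(j))
    bins, j = [], n
    while j > 0:            # recover the bin boundaries
        bins.append((cut[j], j - 1))
        j = cut[j]
    return best[n], bins[::-1]
```

On a label sequence with a clean transition, the selected binning splits at the transition, since pure bins maximize the per-bin marginal likelihood.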
2.4 Dynamic Programming Search in ABB
The dynamic programming approach used in ABB is based on the above dynamic programming approach in SBB. It focuses on calibrating a particular instance $y$; thus, it is an instance-specific method. The algorithm exploits the decomposability of the Bayesian binning score in Equation 6. Assume we have already found, in one forward run of the method (from the lowest to the highest prediction), the highest-scoring binnings corresponding to each of the prefixes $S_{1,1}, \ldots, S_{1,N}$, respectively; let the values $V_1, \ldots, V_N$ denote the respective scores of the optimal binnings for these models, which we cache. We perform an analogous dynamic programming procedure in a backward manner (from the highest to the lowest prediction) and compute the highest-scoring binnings corresponding to each of the suffixes $S_{N,N}, \ldots, S_{1,N}$, respectively; let the values $W_N, \ldots, W_1$ denote the respective scores, which we also cache. Using the decomposability property of the binning score given by Equation 6 (with the conventions $V_0 = W_{N+1} = 1$), we can write the Bayesian model averaging estimate given by Equation 7 as follows:

$P(z = 1 \mid y, D) \propto \sum_{i \le \mathrm{idx}(y) \le j} V_{i-1}\; \mathrm{Score}(S_{i,j})\; W_{j+1}\; P(z = 1 \mid S_{i,j})$,  (8)

where $P(z = 1 \mid S_{i,j})$ is obtained using the frequency of the training instances in the bin containing the prediction $y$ (we actually use smoothing of these counts, which is consistent with the Bayesian priors in the scoring function). Remarkably, the dynamic programming implementation of ABB is also $O(N^2)$. However, since it is instance-specific, this time complexity holds for each prediction that is to be calibrated (e.g., each prediction in a test set). To address this problem, we can partition the interval [0, 1] into $R$ equally spaced bins and store the ABB output for each of those bins. The training time is therefore $O(R N^2)$. During testing, a given $y$ is mapped to one of the $R$ bins and the stored calibrated probability is retrieved, which can all be done in $O(1)$ time.
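A minimal sketch of the forward–backward idea behind Equation 8. This illustrative version is simplified in two labeled ways: it assumes a flat structure prior, and its cached prefix (F) and suffix (B) quantities are sums of scores over all binnings, which yields the exact model average; the bin frequencies are Laplace-smoothed, consistent with the uniform Beta prior:

```python
from math import lgamma, exp

def bin_ml(labels, i, j):
    """Marginal likelihood of sorted labels i..j treated as one bin
    (Beta(1,1) prior): n0! * n1! / (n0 + n1 + 1)!."""
    n1 = sum(labels[i:j + 1])
    n0 = (j - i + 1) - n1
    return exp(lgamma(n0 + 1) + lgamma(n1 + 1) - lgamma(n0 + n1 + 2))

def abb_calibrate(labels, pos):
    """Model-averaged P(z = 1) for the instance at sorted position pos,
    summing over all binnings (flat structure prior)."""
    n = len(labels)
    # F[j]: summed score of all binnings of the prefix of length j.
    F = [1.0] + [0.0] * n
    for j in range(1, n + 1):
        F[j] = sum(F[i] * bin_ml(labels, i, j - 1) for i in range(j))
    # B[j]: summed score of all binnings of the suffix starting at j.
    B = [0.0] * n + [1.0]
    for i in range(n - 1, -1, -1):
        B[i] = sum(bin_ml(labels, i, j) * B[j + 1] for j in range(i, n))
    num = den = 0.0
    # Every binning assigns pos to exactly one bin [i, j] with i <= pos <= j.
    for i in range(pos + 1):
        for j in range(pos, n):
            w = F[i] * bin_ml(labels, i, j) * B[j + 1]
            n1 = sum(labels[i:j + 1])
            theta = (n1 + 1) / ((j - i + 1) + 2)  # Laplace-smoothed bin rate
            num += w * theta
            den += w
    return num / den
```

This naive sketch recomputes bin counts from scratch and therefore runs in cubic time; maintaining running counts while extending bins recovers the quadratic behavior discussed above.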
3 Experimental Setup
This section describes the set of experiments that we performed to evaluate the calibration methods described above. To evaluate the calibration performance of each method, we ran experiments using both simulated data and real data. In our experiments on simulated data, we used logistic regression (LR) as the base classifier, whose predictions are to be calibrated. The choice of logistic regression lets us compare our results with the state-of-the-art ACP method [jiang2012calibrating], which as published is tailored to LR. For the simulated data, we used one dataset in which the outcomes were linearly separable and two other datasets in which they were not. Also, for the simulated data we used one set of randomly generated instances for training the LR model, a second set of random instances for learning calibration models, and a third set of random instances for testing the models. (Based on our experiments, the separation between the training set and the calibration set is not necessary; however, [zadrozny2001obtaining] state that for the histogram model it is better to use a separate set of instances for calibrating the output of the classifier, so we do so here.) The scatter plots of the two linearly nonseparable simulated datasets are shown in Figures 1 and 2.

We also performed experiments on three different sets of real binary classification data. The first set is the UCI Adult dataset. The prediction task is a binary classification problem: predicting whether a person makes over $50K a year using his or her demographic information. From the original Adult dataset, which includes both real-valued and categorical features, after removing the instances with missing values, we used separate subsets of instances for training classifiers, for calibration-model learning, and for testing.
We also used the UCI SPECT dataset, a small biomedical binary classification dataset. SPECT lets us examine how well each calibration method performs when the calibration dataset is small in a real application. The dataset involves the diagnosis of cardiac Single Proton Emission Computed Tomography (SPECT) images; each patient is classified into one of two categories, normal or abnormal. The dataset consists of a training set with an equal number of positive and negative instances and a test set that is predominantly positive; all features are binary. Due to the small number of instances, we used the original training data as both our training and calibration datasets, and we used the original test data as our test dataset.
For the experiments on the Adult and SPECT datasets, we used three different classifiers: LR, naïve Bayes, and SVM with polynomial kernels. The choice of the LR model allows us to include the ACP method in the comparison because, as mentioned, it is tailored to LR. Naïve Bayes is a well-known, simple, and practical classifier that often achieves good discrimination performance, although it is usually not well calibrated. We included SVM because it is a relatively modern classifier that is frequently applied.
The other real dataset that we used for evaluation contains clinical findings (e.g., symptoms, signs, laboratory results) and outcomes for patients with community-acquired pneumonia (CAP) [fine1997prediction]. The classification task we examined involves using patient findings to predict dire patient outcomes, such as mortality or serious medical complications. We divided the patient cases (instances) in the CAP dataset into subsets for training classifiers, for learning calibration models, and for testing the calibration models. The data include both discrete and continuous features. For our experiments on the naïve Bayes model, we used just the discrete features of the data, and for the experiments on SVM we used all discrete and continuous features. Also, before applying the LR model to this dataset, we first used a PCA feature transformation, because the high dimensionality of the data and the correlations among some features otherwise produced unstable results due to singularity issues.
4 Experimental Results
This section presents experimental results of the calibration methods when applied to the datasets described in the previous section. We show the performance of the methods in terms of both calibration and discrimination, since in general both are important. Due to a lack of space, we do not include here the results for the linearly separable data; however, we note that the results for each of the calibration methods and the base classifier were uniformly excellent across the different evaluation measures described below.
For the evaluation of the calibration methods, we used several measures. The first two measures are Accuracy (Acc) and the Area Under the ROC Curve (AUC), which measure discrimination. The three other measures are the Root Mean Square Error (RMSE), the Expected Calibration Error (ECE), and the Maximum Calibration Error (MCE); these measures evaluate calibration performance. The ECE and MCE are simple statistics that measure calibration relative to the ideal reliability diagram [degroot1983comparison; niculescu2005predicting]. In computing these measures, the predictions are partitioned into ten bins, and the predicted value of each test instance falls into one of them. The ECE calculates the expected calibration error over the bins, and the MCE calculates the maximum calibration error among the bins, using empirical estimates as follows:

$\mathrm{ECE} = \sum_{b=1}^{10} P(b)\, |o_b - e_b|, \qquad \mathrm{MCE} = \max_{b \in \{1,\ldots,10\}} |o_b - e_b|$,

where $o_b$ is the true fraction of positive instances in bin $b$, $e_b$ is the mean of the post-calibrated probabilities for the instances in bin $b$, and $P(b)$ is the empirical probability (fraction) of all instances that fall into bin $b$. The lower the values of ECE and MCE, the better calibrated a model is.
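These two statistics follow directly from their definitions; the following sketch computes them over ten equal-width probability bins:

```python
def ece_mce(probs, labels, n_bins=10):
    """Expected and Maximum Calibration Error over equal-width bins of
    the predicted probabilities."""
    bins = [[] for _ in range(n_bins)]
    for p, z in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, z))
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        o = sum(z for _, z in members) / len(members)  # observed positive rate
        e = sum(p for p, _ in members) / len(members)  # mean predicted prob.
        gap = abs(o - e)
        ece += (len(members) / len(probs)) * gap       # weighted by P(bin)
        mce = max(mce, gap)
    return ece, mce
```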
The tables show comparisons of the different methods with respect to the evaluation measures on the simulated and real datasets. In each row of these tables, we show in bold the two methods that achieved the best performance with respect to the specified measure.
As can be seen, there is no superior method that outperforms all the others in all data sets on all measures. However, SBB and ABB are superior to Platt and isotonic regression in all the simulation datasets. We discuss the reason why in Section 5. Also, SBB and ABB perform as well or better than isotonic regression and the Platt method on the real data sets.
In all of the experiments, on both simulated and real datasets, SBB and ABB generally retain or improve the discrimination performance of the base classifier, as measured by Acc and AUC. In addition, they often improve the calibration performance of the base classifier in terms of the RMSE, ECE, and MCE measures.
5 Discussion
Having a well-calibrated classifier can be important in practical machine learning problems. There are different calibration methods in the literature, and each has its own pros and cons. The Platt method uses a sigmoid as its mapping function. The main advantage of Platt scaling is its fast recall time. However, the shape of the sigmoid function can be restrictive, and it often cannot produce well-calibrated probabilities when the instances are distributed in feature space in a biased fashion (e.g., at the extremes, or all near the separating hyperplane) [jiang2012calibrating].
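For reference, the sigmoid family used by Platt scaling can be fit in a few lines. Platt's published method uses Newton iterations and smoothed targets; the plain gradient-descent version below is an illustrative stand-in that fits the same family:

```python
from math import exp

def fit_platt(scores, labels, lr=0.5, iters=2000):
    """Fit p = 1 / (1 + exp(-(a*y + b))) by gradient descent on the log
    loss over (score, 0/1 label) pairs; returns the sigmoid parameters."""
    a, b, n = 1.0, 0.0, len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for y, z in zip(scores, labels):
            p = 1.0 / (1.0 + exp(-(a * y + b)))
            grad_a += (p - z) * y   # d(log loss)/da
            grad_b += (p - z)       # d(log loss)/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

A calibrated probability for a new score y is then `1 / (1 + exp(-(a*y + b)))`; note how a single global slope and offset is all this family can express, which is the restrictiveness discussed above.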
Histogram binning is a nonparametric method that makes no special assumptions about the shape of the mapping function. However, it has several limitations, including the need to define the number of bins and the fact that the bins remain fixed over all predictions [zadrozny2002transforming].
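Equal-frequency histogram binning can be sketched in a few lines; note that the bin count is a user-supplied constant, which is exactly the limitation noted above (an illustrative sketch, not a reference implementation):

```python
def fit_histogram_binning(scores, labels, n_bins):
    """Equal-frequency binning: sort the training (score, label) pairs,
    cut them into n_bins groups, and map each group to its fraction of
    positive labels."""
    pairs = sorted(zip(scores, labels))
    size = len(pairs) / n_bins
    edges, rates = [], []
    for b in range(n_bins):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        edges.append(chunk[-1][0])  # upper score edge of this bin
        rates.append(sum(z for _, z in chunk) / len(chunk))
    return edges, rates

def calibrate(score, edges, rates):
    """Return the empirical positive rate of the bin the score falls into."""
    for upper, rate in zip(edges, rates):
        if score <= upper:
            return rate
    return rates[-1]
```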
Isotonic-regression-based calibration is another nonparametric calibration method; it requires that the mapping (from pre-calibrated predictions to post-calibrated ones) be chosen from the class of all isotonic (i.e., monotonically increasing) functions [niculescu2005predicting; zadrozny2002transforming]. Thus, it is less restrictive than the Platt calibration method. The pair-adjacent violators (PAV) algorithm is one instance of an isotonic regression algorithm [ayer1955empirical]. The PAV algorithm can be considered a binning algorithm in which the boundaries of the bins are chosen according to how well the classifier ranks the examples [zadrozny2002transforming]. Isotonic regression has been shown to perform very well in comparison to other calibration methods on real datasets [niculescu2005predicting; caruana2006empirical; zadrozny2002transforming]. Isotonic regression has some limitations, however. Its most significant limitation is its isotonicity (monotonicity) assumption. As seen in Tables 2(a) and 2(b) for the simulation data, when the isotonicity assumption is violated through the choice of classifier and the nonlinearity of the data, isotonic regression performs relatively poorly in terms of improving the discrimination and calibration capability of a base classifier. This assumption can also be violated in real data as a consequence of the choice of learning models and algorithms.
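The PAV algorithm itself is short: given the 0/1 outcomes sorted by classifier score, it repeatedly merges adjacent blocks that violate monotonicity, replacing each merged block by its weighted mean (an illustrative sketch):

```python
def pav_calibrate(labels):
    """Pool Adjacent Violators: given 0/1 outcomes sorted by classifier
    score, return the isotonic (non-decreasing) least-squares fit."""
    out = []  # stack of [value, weight] blocks, values non-decreasing
    for z in labels:
        out.append([float(z), 1])
        # Merge while the top block breaks monotonicity.
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            v2, w2 = out.pop()
            v1, w1 = out.pop()
            out.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    fit = []
    for v, w in out:
        fit.extend([v] * w)  # expand blocks back to per-instance values
    return fit
```

The merged-block values become the post-calibration probabilities; the block boundaries play the role of the adaptively chosen bin boundaries described above.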
A classifier calibration method called adaptive calibration of predictions (ACP) was recently introduced [jiang2012calibrating]. A given application of ACP is tied to a particular model M, such as a logistic regression model, that predicts a binary outcome z. ACP requires a confidence interval (CI) around a particular prediction y of M. ACP adjusts the CI and uses it to define a bin; it sets the calibrated estimate to the fraction of positive outcomes among all the predictions that fall within the bin. On both real and synthetic datasets, ACP achieved better calibration performance than a variety of other calibration methods, including simple histogram binning, Platt scaling, and isotonic regression [jiang2012calibrating]. The ACP post-calibration probabilities also achieved among the best levels of discrimination, according to the AUC. ACP has several limitations, however. First, it requires not only probabilistic predictions, but also a statistical confidence interval around each of those predictions, which makes it tailored to specific classifiers, such as logistic regression [jiang2012calibrating]. Second, based on the CI around a given prediction y, it commits to a single binning of the data around that prediction; it does not consider alternative binnings that might yield a better calibrated estimate. Third, the bin it selects is symmetric around y by construction, which may not optimize calibration. Finally, it does not use all of the training data, but rather only those predictions that fall within the confidence interval around y. As one can see from the tables, ACP performed well when logistic regression is the base classifier, in both simulated and real datasets. SBB and ABB performed as well as or better than ACP on both simulated and real datasets.
In general, the SBB and ABB algorithms appear promising, especially ABB, which overall outperformed SBB. Neither algorithm makes the restrictive (and potentially unrealistic) assumptions that Platt scaling and isotonic regression do. They also are not restricted in the type of classifier to which they can be applied, unlike ACP.
The main disadvantage of SBB and ABB is their running time. If N is the number of training instances, then SBB has a training time of $O(N^2)$, due to its dynamic programming algorithm that searches over every possible binning, whereas the training time of ACP and histogram binning is $O(N \log N)$, as it is for isotonic regression. Also, the cached version of ABB has a training time of $O(R N^2)$, where R reflects the number of bins being used. Nonetheless, it remains practical to use these algorithms to perform calibration on a desktop computer when using training datasets that contain thousands of instances. In addition, the testing time is only $O(\log b)$ for SBB, where b is the number of bins found by the algorithm, and $O(1)$ for the cached version of ABB. Table 1 shows the time complexity of each method for learning from N training instances and for recall on a single instance.
Table 1: Time complexity of the calibration methods (learning / recall).

Platt: O(NT) / O(1)
Hist: O(N log N) / O(log b)
IsoReg: O(N log N) / O(log b)
ACP: O(N log N) / O(log b)
SBB: O(N^2) / O(log b)
ABB (cached): O(R N^2) / O(1)

Note that N and b are the size of the training set and the number of bins found by the method, respectively; T is the number of iterations required for convergence in the Platt method, and R reflects the number of bins used by cached ABB.
6 Conclusion
In this paper we introduced two new Bayesian nonparametric methods for calibrating binary classifiers, called SBB and ABB. Experimental results on simulated and real data support that these methods perform as well as or better than the other calibration methods that we evaluated. While the new methods have a greater time complexity than the other calibration methods evaluated here, they are nonetheless efficient enough to be applied to training datasets with thousands of instances. Thus, we believe these new methods are promising for use in machine learning, particularly when calibrated probabilities are important, such as in decision analyses.
In future work, we plan to explore how the two new methods perform when using Bayesian model averaging over the prior hyperparameters. We also plan to extend them to perform multi-class calibration. Finally, we plan to investigate the use of calibration methods on posterior probabilities that are inferred from models that represent joint probability distributions, such as maximum-margin Markov network models [roller2004max; zhu2008laplace; zhu2009medlda].
References
[1] M. Ayer, H.D. Brunk, G.M. Ewing, W.T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, pages 641–647, 1955.
[2] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168, 2006.
[3] M.H. DeGroot and S.E. Fienberg. The comparison and evaluation of forecasters. The Statistician, pages 12–22, 1983.
[4] M.J. Fine, T.E. Auble, D.M. Yealy, B.H. Hanusa, L.A. Weissfeld, D.E. Singer, C.M. Coley, T.J. Marrie, and W.N. Kapoor. A prediction rule to identify low-risk patients with community-acquired pneumonia. New England Journal of Medicine, 336(4):243–250, 1997.
[5] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[6] X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
[7] J.L. Lustgarten, S. Visweswaran, V. Gopalakrishnan, and G.F. Cooper. Application of an efficient Bayesian discretization method to biomedical data. BMC Bioinformatics, 12, 2011.
[8] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
[9] J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[10] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16, 2004.
[11] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 609–616, 2001.
[12] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699, 2002.
[13] J. Zhu, A. Ahmed, and E.P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1257–1264, 2009.
[14] J. Zhu, E.P. Xing, and B. Zhang. Laplace maximum margin Markov networks. In Proceedings of the 25th International Conference on Machine Learning, pages 1256–1263, 2008.