Better Classifier Calibration for Small Data Sets

02/24/2020 ∙ Tuomo Alasalmi et al. ∙ University of Oulu

Classifier calibration does not always go hand in hand with the classifier's ability to separate the classes. There are applications where good classifier calibration, i.e. the ability to produce accurate probability estimates, is more important than class separation. When the amount of data for training is limited, the traditional approach to improve calibration starts to crumble. In this article we show how generating more data for calibration is able to improve calibration algorithm performance in many cases where a classifier is not naturally producing well-calibrated outputs and the traditional approach fails. The proposed approach adds computational cost but considering that the main use case is with small data sets this extra computational cost stays insignificant and is comparable to other methods in prediction time. From the tested classifiers the largest improvement was detected with the random forest and naive Bayes classifiers. Therefore, the proposed approach can be recommended at least for those classifiers when the amount of data available for training is limited and good calibration is essential.


1. Introduction

In many machine learning applications, e.g. in the medical domain (Connolly et al., 2017), the models need to be explainable, or they will not be very useful. Obviously, this means that the model needs to somehow communicate to the user what has led it to the given conclusion instead of just being a black box (Guidotti et al., 2018). Another important factor in model explainability is information about how reliable the given prediction is. This property is called classifier calibration. A well calibrated classifier prediction is such that the predicted probability of an event is close to the proportion of such events among a group of similar predictions (Dawid, 1982). However, the main design objective for classifiers tends to be good class separation, not accurate reliability estimation. Therefore, many classifiers are not well calibrated out of the box. To improve these probability estimates, accurate classifier calibration algorithms are needed. With accurate calibration, almost any model can output a good estimate of the probability that the decision it has made is indeed correct (Niculescu-Mizil and Caruana, 2005a). Accurate probability estimates are also important for cost sensitive decision making (Zadrozny and Elkan, 2001b).
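To make this notion concrete, the following R sketch (ours, for illustration only, with simulated inputs) compares mean predicted probabilities with observed event frequencies within probability bins, which is the idea behind a calibration or reliability curve.

```r
# Minimal reliability check: within each probability bin, a well calibrated
# model's mean predicted probability should be close to the observed
# fraction of positives. Illustrative sketch with simulated predictions.
set.seed(1)
p_hat <- runif(500)                      # predicted probabilities (simulated)
y     <- rbinom(500, 1, p_hat)           # outcomes drawn to be well calibrated

bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
data.frame(
  mean_predicted = tapply(p_hat, bins, mean),
  observed_rate  = tapply(y,     bins, mean),
  n              = as.vector(table(bins))
)
```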

For calibration algorithms to work well, a minimum of about 1000 to 2000 training samples is needed for the calibration data set, depending on the learning algorithm, to avoid overfitting. This is especially true for non-parametric calibration algorithms, and calibration seems to improve further with even bigger calibration data sets (Niculescu-Mizil and Caruana, 2005a, b). To avoid biasing the calibration model, a separate calibration data set is needed. This means that the total amount of training data needs to be large. For example, if 10 % of the training data set is used for calibration and the rest for modelling, a training data set with at least 10 000 samples is needed. In addition, a separate data set needs to be held out for testing. Figure 1 illustrates the data set partitioning. In many real world modelling tasks, however, relatively small data sets are quite common. As we will demonstrate in this article, traditional calibration algorithms fail to deliver on small data sets, but with our proposed data generation approach calibration can often be improved despite the data set being small.

Figure 1. Splitting of the available data into training, calibration, and test data sets. A part of the training data is reserved for calibration to avoid bias. The training data also contains the validation data used for hyperparameter tuning.
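A minimal R sketch of this partitioning, assuming a generic data frame d with one row per sample, could look like the following; the function is our own helper, not part of the paper's code, and the fractions follow the description above (stratification and cross validation folds are omitted for brevity).

```r
# Split data into training, calibration, and test subsets as in Figure 1.
split_data <- function(d, test_frac = 0.1, calib_frac = 0.1, seed = 1) {
  set.seed(seed)
  n         <- nrow(d)
  test_idx  <- sample(n, size = round(test_frac * n))
  train     <- d[-test_idx, ]
  test      <- d[test_idx, ]
  calib_idx <- sample(nrow(train), size = round(calib_frac * nrow(train)))
  list(train       = train[-calib_idx, ],
       calibration = train[calib_idx, ],
       test        = test)
}
```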

The rest of the article is structured as follows. The literature on calibration, with a focus on small data sets, is briefly reviewed in Section 2. The experiments are described in Section 3. The results of the experiments are summarized in Section 4 and presented in more detail in the Appendix. To conclude the article, the results are discussed in Section 5.

2. Classifier calibration

There are three main categories of calibration techniques. These are the parametric calibration algorithms such as Platt scaling (Platt, 1999) and the non-parametric histogram binning (Zadrozny and Elkan, 2001a) and isotonic regression (Zadrozny and Elkan, 2002) algorithms. In Platt scaling, a sigmoid function is fit to the prediction scores to transform them into probabilities. It was originally developed to improve the calibration of support vector machines (SVM) and might not be the right transformation for many other classifiers. In binning, the prediction scores of a classifier are sorted and divided into bins of equal size. When a test example is predicted, its prediction score can then be transformed into an estimated probability of belonging to a particular class by calculating the frequency of training samples of that class in the corresponding bin. As drawbacks of binning, the number of bins needs to be specified and the probability estimates are discontinuous at bin boundaries. Also, depending on the classifier, the prediction scores might not be uniformly distributed, causing some bins to have significantly fewer examples than others, or even none. Several methods have tried to overcome these problems, such as adaptive calibration of predictions (ACP) (Jiang et al., 2012), selection over Bayesian binnings (SBB) and averaging over Bayesian binnings (ABB) (Naeini et al., 2015a), as well as Bayesian binning into quantiles (BBQ) (Naeini et al., 2015b). In isotonic regression, a monotonically increasing function is used to map the prediction scores into probabilities. Isotonic regression is not continuous in general and can have undesirable jumps. To alleviate these problems, smoothing can be used (Jiang et al., 2011). In practice, however, the isotonicity assumption does not always hold (Naeini et al., 2015a). This makes isotonic regression suboptimal in these cases, albeit still quite effective (Niculescu-Mizil and Caruana, 2005a).
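As a reference point, Platt scaling and histogram binning can be sketched in a few lines of R. This is an illustrative sketch under our own naming (score is a vector of classifier prediction scores and y the corresponding 0/1 labels of a held-out calibration set); it is not the implementation used in the experiments.

```r
# Platt scaling: fit a sigmoid (logistic regression) from prediction
# scores to 0/1 labels on the calibration set, then map new scores
# through the fitted curve.
platt_fit <- function(score, y) {
  glm(y ~ score, family = binomial)
}
platt_predict <- function(fit, new_score) {
  predict(fit, newdata = data.frame(score = new_score), type = "response")
}

# Histogram binning: divide the sorted scores into equal-frequency bins
# and use the fraction of positives in a bin as the calibrated probability.
binning_fit <- function(score, y, n_bins = 10) {
  breaks <- unique(quantile(score, probs = seq(0, 1, length.out = n_bins + 1)))
  bin    <- cut(score, breaks, include.lowest = TRUE)
  list(breaks = breaks, prob = tapply(y, bin, mean))
}
binning_predict <- function(fit, new_score) {
  s   <- pmin(pmax(new_score, min(fit$breaks)), max(fit$breaks))  # clamp to seen range
  bin <- cut(s, fit$breaks, include.lowest = TRUE)
  as.numeric(fit$prob[as.character(bin)])
}
```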

For the isotonicity constraint to hold true, the ranking imposed by the classifier would need to be perfect, which is rarely the case with real-world data sets. An ensemble of near-isotonic regression (ENIR) (Naeini and Cooper, 2018) allows violations of the ranking order and uses regularization to penalize the violations. In ENIR, a modified pool adjacent violators algorithm is used to find the solution path of a near-isotonic regression problem (Tibshirani et al., 2011), and Bayesian information criterion (BIC) scoring is used to combine the generated models. This ensemble is then used to post-process the classifier prediction scores and map them into calibrated probabilities. In their experiments, ENIR was on average the best performing calibration algorithm when compared to isotonic regression and BBQ with naive Bayes (NB), logistic regression, and SVM classifiers. Similarly to what was accomplished with isotonic regression (Zadrozny and Elkan, 2002), ENIR can be extended to multi-class problems, whereas the Bayesian binning models cannot.
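ENIR itself is an ensemble of near-isotonic models, but its building block, isotonic regression from prediction scores to labels, can be illustrated with base R's isoreg. The sketch below is our simplification, shown only to clarify the underlying idea; it is not the ENIR implementation used in the experiments.

```r
# Isotonic regression calibration sketch: fit a monotone step function
# from sorted prediction scores to 0/1 labels, then evaluate it on new
# scores with constant (step) interpolation.
iso_fit <- function(score, y) {
  o   <- order(score)
  fit <- isoreg(score[o], y[o])
  list(x = fit$x, p = fit$yf)          # knots and fitted calibrated values
}
iso_predict <- function(fit, new_score) {
  approx(fit$x, fit$p, xout = new_score, method = "constant",
         rule = 2, ties = "ordered")$y
}
```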

2.1. Calibrating small data sets

As already stated, to avoid biasing the calibration algorithm, a separate calibration data set is needed, and it needs to be large enough to avoid overfitting. These constraints make the use of traditional calibration algorithms challenging with small data sets. For random forest (RF) classifiers, Out-of-Bag samples can be used so that the whole training data set can be utilized for both calibration and classifier training (Boström, 2008). An exact Bayesian model would not need calibration, but as the true data distribution is not known in practice, we cannot construct such a model. Instead, we can try to improve calibration by generating calibration data with Monte Carlo cross validation. The generation of calibration data can work, as we have previously shown, at least for isotonic regression calibration with the naive Bayes classifier (Alasalmi et al., 2018).

In our previous work (Alasalmi et al., 2018), two algorithms were suggested for calibration data generation. In the first stage, Monte Carlo cross validation is used to generate as many data points as desired. These value pairs, consisting of the true class labels and the prediction scores, can be used directly to tune the calibration algorithm. This is called the Data Generation (DG) model. Alternatively, the generated value pairs can be grouped, and the average prediction score along with the fraction of positive class labels in each group can be used for tuning the calibration algorithm. This model is called the Data Generation and Grouping (DGG) model. A detailed description of the process is not repeated here; the reader is referred to the original publication for details. In this work we test the proposed data generation approach with the newer, improved calibration algorithm ENIR and with more classifiers.
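The following R sketch illustrates the idea behind DG and DGG as summarized above. It is a simplified reconstruction rather than the authors' released code; naive Bayes from e1071 is used purely as an example classifier, the second factor level of y is assumed to be the positive class, and the number of rounds, holdout fraction, and group size are placeholder values.

```r
library(e1071)

# Data Generation (DG): repeated Monte Carlo cross validation collects
# (prediction score, true label) pairs for tuning a calibration model.
generate_calibration_data <- function(x, y, n_rounds = 100, holdout = 0.2) {
  scores <- c(); labels <- c()
  for (r in seq_len(n_rounds)) {
    idx    <- sample(nrow(x), size = round(holdout * nrow(x)))
    model  <- naiveBayes(x[-idx, , drop = FALSE], y[-idx])
    p      <- predict(model, x[idx, , drop = FALSE], type = "raw")[, 2]
    scores <- c(scores, p)
    labels <- c(labels, as.numeric(y[idx]) - 1)   # 0/1 labels
  }
  data.frame(score = scores, label = labels)
}

# Data Generation and Grouping (DGG): group the generated pairs and use
# the mean score and the fraction of positives per group.
group_calibration_data <- function(cal, group_size = 20) {
  cal <- cal[order(cal$score), ]
  grp <- ceiling(seq_len(nrow(cal)) / group_size)
  data.frame(score = tapply(cal$score, grp, mean),
             label = tapply(cal$label, grp, mean))
}
```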

3. Experiments

To test the effectiveness of using data generation for calibrating classifiers with small data sets, a set of experiments was set up. ENIR was used as the calibration algorithm because, as a non-parametric algorithm, it should work equally well with all classifiers. In addition, Platt scaling was used with SVM. Representatives from the top performing classifier groups were selected for the experiments, and their calibration performance under different calibration scenarios was compared with two Bayesian classifiers.

To serve as a control, we used the uncalibrated prediction scores of each classifier. This calibration scenario is referred to in the results as Raw. In this case, as there was no need for a separate calibration data set, all data points in the training data set were used for classifier training. To test whether the raw prediction scores could be improved by calibration to more closely resemble posterior probabilities, the ENIR calibration algorithm was used in four different settings. First, ENIR was used in the recommended way, i.e. a separate calibration data set was held out from the training data set and used only for tuning the calibration model, not for classifier training. The size of the calibration data set was set to 10 % of the training data, and the remaining 90 % was used for training the classifier. This scenario is called ENIR in the results. Second, ENIR was used as the algorithm’s creators did, i.e. the full training data set was used both for training the classifier and for tuning the calibration model. This scenario is called ENIR full. The DG and DGG algorithms were also used with ENIR calibration; these are called DG + ENIR and DGG + ENIR, respectively. With the SVM classifier, Platt scaling was used with either a separate calibration data set, as described above, or with the full training data set. These are called Platt and Platt full in the results. Finally, the Out-of-Bag sample was used with ENIR calibration in the case of RF. This is called ENIR OOB in the results. R and Matlab code for carrying out the experiments is available on GitHub (https://github.com/biovaan/Calibration).

There are literally hundreds of different classifiers available. Each of them has its place, but not all of them perform equally well when compared over a diverse set of problems (Cernadas and Amorim, 2014). For our experiments we chose a representative from each of the top performing classifier groups, namely a random forest, an SVM, and a feed forward neural network (NN) with a single hidden layer. In addition, a naive Bayes classifier was tested as it is computationally simple, easy to interpret, and surprisingly accurate despite the often unrealistic assumption of feature independence. Also, the prediction scores of naive Bayes are not well calibrated, which makes it a good candidate for this experiment (Domingos and Pazzani, 1997). In addition, two Bayesian classifiers were used that should produce well calibrated probabilities without separate calibration. These were Bayesian logistic regression (BLR) (Gelman et al., 2008), which is a parametric linear classifier, and the Gaussian process classifier (GPC) (Williams and Barber, 1998), which is non-parametric and nonlinear when a nonlinear covariance function such as the squared exponential is used. We tested the GPC implemented with the expectation propagation (EP) approximation. Markov chain Monte Carlo (MCMC) sampling can be considered the gold standard of GPC approximations, but it is computationally very demanding, whereas the EP approximation has been shown to agree very well with MCMC for both predictive probabilities and marginal likelihood estimates at a fraction of the computational cost (Kuss and Rasmussen, 2005).

RF was implemented using the R package randomForest. The default number of trees (500) was used, and the mtry hyperparameter was tuned by increments or decrements of two based on the Out-of-Bag error estimate. For SVM, the R package e1071 was used. A Gaussian kernel was used and the regularization parameter was tuned over a range of candidate values. Good values for the kernel spread hyperparameter were estimated from the training data using the kernlab R package, and the median value of the estimates was used (Caputo et al., 2002). The NN was implemented with the R package nnet. The hidden layer size was tuned in the range from 1 to 9 neurons in increments of two, and the weight decay hyperparameter was tuned over a range of candidate values. A logistic function was used as the activation function. For the Gaussian process classifier, the GPML Matlab toolbox (http://www.gaussianprocess.org/gpml/code/matlab/doc/) implementation was used. A logistic likelihood function and a zero mean function were chosen, and the covariance function was set to the isotropic squared exponential covariance function, which is in line with the SVM using a Gaussian kernel and a regularization parameter. The hyperparameters for length-scale and signal magnitude were tuned by minimizing the negative log marginal likelihood (i.e., type II maximum likelihood) on the training data set. With the non-Bayesian methods, in every case except RF, which used the Out-of-Bag error estimate, the tuning was done using 10-fold cross validation on the training data excluding the calibration data. Naive Bayes was implemented with the R package e1071. Bayesian logistic regression was implemented using the R package arm; the default hyperparameter values (i.e., a Cauchy prior with scale 2.5) were used, and the model was fitted with an approximate expectation maximization algorithm on the training data set.

3.1. Evaluating calibration performance

Classifier calibration performance can be evaluated visually using a calibration plot or more objectively with error metrics. With small data sets, the amount of data limits the usefulness of calibration plots, so they were not used for evaluating calibration performance in our experiments. Below we introduce two error metrics that are commonly used to evaluate classifier calibration; these metrics are used to compare the calibration performance of the different calibration scenarios in our experiments.

Logarithmic loss (logloss) is an error metric that gives the biggest penalty for being both confident and wrong about a prediction. It is therefore a good metric for evaluating classifier calibration, especially if cost sensitive decisions are made based on the classifier outcome. Logarithmic loss is defined in Equation (1), where $N$ stands for the number of observations, $M$ stands for the number of class labels, $\ln$ is the natural logarithm, $y_{ij}$ equals $1$ if observation $i$ belongs to class $j$ and $0$ otherwise, and $p_{ij}$ stands for the predicted probability that observation $i$ belongs to class $j$. A smaller value of logarithmic loss means better calibration.

Mean squared error (MSE) is another metric that is often used to evaluate classifier calibration. The smaller the MSE value of a classifier, the better the calibration. However, MSE puts less emphasis on single confident but wrong decisions made by the classifier. It is defined in Equation (2), where $N$ stands for the number of observations, $y_i$ is $1$ if observation $i$ belongs to the positive class and $0$ otherwise, and $p_i$ is the predicted probability that observation $i$ belongs to the positive class. As with logloss, a smaller value of MSE means better calibration.

(1) $\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \ln(p_{ij})$
(2) $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - p_i \right)^2$
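Both metrics are straightforward to compute. The following R sketch implements Equations (1) and (2) for the binary case; the small clipping constant in logloss is our addition to avoid taking the logarithm of zero.

```r
# Logarithmic loss and mean squared error for binary problems.
# y: 0/1 labels, p: predicted probabilities of the positive class.
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)          # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
mse <- function(y, p) mean((y - p)^2)
```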

To test the performance of each approach to calibration with each of the classifiers, the following test sequence was run. Features were standardized to have zero mean and unit variance, and near zero variance features were removed. Depending on the calibration scenario, the data set was divided into two or three parts as in Figure 1: training and test data sets and, in the ENIR and Platt scenarios, a separate calibration data set split off from the training data set. In the Raw scenario, logloss and MSE were calculated on the raw prediction scores obtained with each classifier on the separate test data set. In the ENIR calibration scenario, the slightly smaller training data set was used to train each classifier, and the prediction scores were calibrated using the ENIR algorithm tuned with the separate calibration data set. In the ENIR full scenario, the whole training data set was used both for training the classifiers and for tuning the ENIR algorithm. Finally, the prediction scores from predicting the test data points were calibrated and the error metrics calculated. In the DG + ENIR and DGG + ENIR scenarios, the corresponding algorithm was used to create a calibration data set that was then used to tune the ENIR algorithm. The whole training data set was used to train the classifiers, and the test data set prediction scores were calibrated and error metrics calculated. The threshold used for classification was selected using the calibrated training data so that it maximized the classification rate. In addition to measuring the error metrics, each calibration scenario’s computation time was also measured.
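For example, the threshold selection step could be implemented as a simple grid search over candidate thresholds on the calibrated training data; the sketch below uses our own function name and grid, not necessarily those of the actual implementation.

```r
# Pick the classification threshold that maximizes the classification
# rate (accuracy) on the calibrated training data. Sketch only.
# p_calibrated: calibrated probabilities, y: 0/1 labels.
select_threshold <- function(p_calibrated, y, grid = seq(0.01, 0.99, by = 0.01)) {
  acc <- sapply(grid, function(t) mean((p_calibrated >= t) == (y == 1)))
  grid[which.max(acc)]
}
```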

To be able to test the differences between calibration scenarios, a stratified 10-fold cross validation was used to create the data samples. A 5×2CV t-test (Dietterich, 1998) or a combined 5×2CV F-test (Alpaydin, 1999) has been suggested for detecting differences in classifier performance because of their lower Type I error. The lower Type I error, however, does not come without a compromise, namely a higher Type II error (i.e. lower power). The lower power seems to be highlighted in our own experiments with small data sets, as the inherent variance between the results on different folds is quite high. Therefore, cross validation was selected as the sampling method in our experiments, and a Student's paired t-test with an unequal variance assumption and the Welch modification to the degrees of freedom (Welch, 1947) was used to determine whether there was a difference between calibration scenarios.

3.2. Tests with synthetic data

A synthetic data set, where true posterior probabilities can be calculated, was used to verify that the proposed data generation algorithms can indeed help improve calibration on small data sets. MSE and logloss are proper measures of calibration performance (Kull and Flach, 2015), but in theory it is possible that, with discrete labels, improvements in these calibration error metrics do not equate to more accurate probabilities. Instead, they could indicate that a higher probability was assigned to positive predictions and a lower probability to negative predictions. However, this kind of change in the probabilities should increase logarithmic loss unless the classification error approaches zero. With synthetic data, the predicted probabilities can be compared to the true probabilities, so any improvement in the error metrics can only come from a real improvement in the predicted probabilities.

The data set was generated by sampling from normal distributions that represent the positive and negative classes, with 100 instances drawn from each class. The true probabilities were calculated as the ratio of the probability density functions of the distributions at the sample coordinates. Derived features were engineered from the original features, and the original features were not given to the classifiers. This was done to make the problem harder for the models so that estimating the probabilities was not trivial. The R code that was used to create the synthetic data set is available on GitHub with the rest of the code.
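The construction can be sketched as follows. The distribution parameters and the derived features are placeholders (the actual ones are defined in the repository), but the structure, 100 samples per class and true probabilities computed from the class densities, follows the description above; we read "ratio of the probability density functions" as the positive class density divided by the sum of the two densities, assuming equal class priors.

```r
# Synthetic two-class data: 100 samples per class from (here univariate,
# illustrative) normal distributions; true posterior from the density
# ratio assuming equal class priors. Parameter values are placeholders.
set.seed(42)
x_pos <- rnorm(100, mean =  1, sd = 1)
x_neg <- rnorm(100, mean = -1, sd = 1)
x     <- c(x_pos, x_neg)
y     <- rep(c(1, 0), each = 100)

d_pos  <- dnorm(x, mean =  1, sd = 1)
d_neg  <- dnorm(x, mean = -1, sd = 1)
p_true <- d_pos / (d_pos + d_neg)

# Derived features given to the classifiers instead of the original x
# (illustrative transformations only).
features <- data.frame(f1 = x^2, f2 = sin(x))
```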

3.3. Tests with real data

Table 1 presents the properties of the real data sets that were used in the experiments. If the problem was not already a binary classification, it was converted into one. With the QSAR biodegradation data set (Mansouri et al., 2013) (Biodegradation), the task is to predict whether chemicals are readily biodegradable based on molecular descriptors. In the Blood Transfusion Service Center data set (Yeh et al., 2009) (Blood donation), the task is to predict whether previous blood donors donated blood again in March 2007. The Contraceptive Method Choice data set (Contraceptive) is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey; the task is to predict the choice of the current contraceptive method. A combination of the short-term and long-term classes was used as the positive class and the no-use class as the negative class. The Letter Recognition data set (Letter) is a data set of predetermined image features for letter identification. A variation of the data set was created by reducing it to a binary problem of two similar letters: the letter Q was selected as the positive class and the letter O as the negative class. In the Mammographic mass data set (Elter et al., 2007), the prediction task is to discriminate benign and malignant mammographic masses based on BI-RADS attributes and the patient’s age. The malignant outcome served as the positive class and the benign outcome as the negative class. The Titanic data set is from a Kaggle competition where the task is to predict which of the passengers survived the accident. Passenger name, ticket number, and cabin number were excluded from the features, and only entries without missing values were used. All data sets used in the experiments are freely available from the UCI machine learning repository (Dua and Graff, 2019), except the Titanic data set, which is available from Kaggle.

Data set Samples Features Positive class Calibration samples
Biodegradation 1055 41 32 % 94
Blood donation 748 4 24 % 67
Contraceptive 1473 9 57 % 132
Letter 1536 16 51 % 138
Mammographic mass 831 4 48 % 74
Titanic 714 7 41 % 64
Table 1. Data set properties.

4. Results

The synthetic data set was used to verify that the proposed approach does indeed improve probability estimates and not just calibration error metrics with discrete labels. Mean squared errors with each classifier and calibration scenario are presented in Table 2. With the synthetic data, MSE was calculated using the true probabilities, not discrete labels.

Classifier No Cal. ENIR E.full DG DGG OOB Platt P.full
NB 0.072 0.129 0.082 0.071 0.072
SVM 0.039 0.096 0.064 0.040 0.039 0.074 0.053
RF 0.052 0.088 0.092 0.041 0.039 0.041
NN 0.047 0.106 0.053 0.039 0.039
BLR 0.063
GPC 0.041
Average results of 10-fold cross validation. A lower value of mean squared error indicates better calibration performance. Statistical significance of differences from No Cal. was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 2. Mean squared error of different classifiers and calibration scenarios on the synthetic data set.

Results of the experiments with real data sets are presented here summarized and the full results are attached as Appendix. The average logarithmic loss of each classifier and calibration scenario combination are depicted in Figure 2 and the average mean squared error in Figure 3. The training times of each classifier and calibration scenario were measured on a computational server (Intel Xeon E5-2650 v2 @ 2.60GHz, 196GB RAM) and the results are shown in Table 3.

Figure 2. Average logarithmic loss for different classifiers and calibration scenarios. Classifier specific (CS) means Out-of-Bag samples with ENIR calibration for RF and Platt scaling with the full training set for SVM. Lower value of logloss indicates better calibration performance.

Figure 3. Average mean squared error for different classifiers and calibration scenarios. Classifier specific (CS) means Out-of-Bag samples with ENIR calibration for RF and Platt scaling with the full training set for SVM. Lower value of MSE indicates better calibration performance.
Classifier No Cal. ENIR E.full DG DGG OOB Platt P.full
NB 0.01 0.06 0.02 4.06 3.87
SVM 8.13 6.40 8.14 10.38 10.21 6.34 8.14
RF 1.73 1.59 1.73 7.19 7.19 1.60
NN 172 154 172 174 174
BLR 0.17
GPC 281
Table 3. Average computation times of the different classifiers and calibration scenarios in seconds on the experiment data sets. The times include hyperparameter tuning, training the classifier, and steps needed for calibration. Prediction times are excluded as they can be considered negligible.

4.1. Interpretation of the results

With the synthetic data set, it can be seen that using ENIR calibration with either a separate calibration data set or with the full training data set led to poorer probability estimates than were achieved without calibration for all tested classifiers. With random forest and the neural network, improvements in the predicted probabilities were achieved by using either the DG or the DGG algorithm to generate the calibration data for ENIR calibration. In these cases the calibration error was lowered to approximately the same level as achieved with the Gaussian process classifier. SVM achieved a comparable error level without calibration; no further improvement was achieved with the proposed calibration approach, but the calibration error did not increase either. Using Platt scaling did increase the calibration error. The calibration error of naive Bayes was higher than that of the best of the pack and remained unchanged with the proposed approach. This suggests that naive Bayes, due to the model’s assumptions, was not flexible enough to capture the feature interactions in the data, and therefore no improvement was achievable with calibration, even with DG or DGG data generation.

On the real data sets, with only one exception (the Biodegradation data set with the naive Bayes classifier), using ENIR with a separate calibration data set split off from the training data fails to improve calibration and actually makes it worse, although the differences are not statistically significant in every case. This obviously results from the very small calibration data set, and similar results have been noted in the literature before. This observation was the main motivation behind this work. When using the same data for training both the classifier and the ENIR calibration algorithm (ENIR full), we get mixed results. With the naive Bayes classifier, calibration improves statistically significantly over the uncalibrated control on four of the six data sets, but the improvement does not reach statistical significance on the other two. With the other three classifiers, calibration tends to deteriorate compared to the uncalibrated control. The decrease in calibration performance is statistically significant on three data sets with SVM and RF, and on four data sets with the NN classifier. What is interesting, and supports our hypothesis, is that with the RF classifier ENIR full performs worse than ENIR with the tiny but separate calibration data set, which indicates overfitting. However, overfitting of the calibration model did not happen with the other classifiers.

Of the classifier specific calibration scenarios, Platt scaling performed about equally well whether the small but separate calibration data set or the whole training data set was used as the calibration data, on all but one data set, on which using the full training data lowered logloss. Platt scaling did better on average than ENIR full; however, it could not improve calibration over the uncalibrated control on any of the tested data sets. The premise of using Out-of-Bag samples for calibration with RF was that the whole training data could be used for calibration without biasing the calibration model. Our results do not fully support that notion, at least on these small data sets. When Out-of-Bag samples were used to tune the ENIR calibration algorithm, calibration performance was worse than ENIR calibration with a separate calibration data set or with the full training data set on four of the tested data sets and better on two, although one of the better performances was not statistically significant. More importantly, ENIR tuned with Out-of-Bag samples improved calibration over the uncalibrated control on only one of the data sets; in the four misbehaving cases mentioned above, calibration performance significantly decreased instead.

Our DG algorithm coupled with ENIR was able to improve calibration over the uncalibrated control with the naive Bayes classifier on five of the tested data sets. SVM calibration improved slightly on two of the data sets with DG + ENIR, while RF calibration performance decreased on three of the data sets and improved on one. With NN, DG + ENIR calibration performance was not statistically significantly different from the uncalibrated control; it did, however, perform equally well as or better than ENIR or ENIR full. DGG with ENIR calibration, on the other hand, improved calibration over the uncalibrated control on all data sets with the naive Bayes classifier and on five out of six data sets with the RF classifier, although one of the improvements did not reach statistical significance. Calibration of SVM was improved on one of the data sets with DGG + ENIR and unaffected on the others. With NN, DGG + ENIR improved performance on one data set and decreased it on another, while being neutral on the other three. DGG performed better than ENIR full on all data sets with all classifiers, although the differences were not statistically significant in every case.

As a comparison, Bayesian logistic regression and Gaussian process classifiers were tested on the same data sets, because these classifiers are supposed to be well calibrated without separate calibration. BLR calibration was better than that of the best non-Bayesian classifier with DGG + ENIR calibration on one of the data sets but worse on all the others, although one of the differences was not statistically significant. Also, the classification rate of BLR was on average slightly lower than that of the other classifiers except NB, although the difference was not statistically significant. Logloss for GPC was the lowest of all classifiers and calibration scenarios on five of the data sets by a clear margin, but higher on one. MSE, however, was higher on three and lower on one of the data sets than with the best of the calibrated non-Bayesian classifiers. This discrepancy indicates that a higher proportion of the mistakes made by GPC were genuinely uncertain predictions and that high confidence predictions were more often correct with GPC than with the other classifiers. Thus, it could be said that GPC is not as overconfident as the classifiers calibrated with the ENIR algorithm. This is definitely an advantage in applications where good calibration is needed.

Using ENIR calibration with a separate calibration data set led to a slightly lower classification rate with three classifiers, because the calibration data cannot be used for training the classifier, making the training data set smaller. NN, SVM, and GPC had the highest classification rates on these data sets. A slightly lower classification rate was observed with the RF and BLR classifiers. None of these small differences were, however, statistically significant. Naive Bayes could not compete with the other classifiers in accuracy.

Training and calibration of naive Bayes, SVM, RF, and BLR took only seconds on average. The NN and the EP approximation of GPC were clearly more computationally demanding but still acceptable on these small data sets, with training times of a few minutes.

4.2. Effect of class imbalance

To test how the class imbalance problem affects the proposed data generation methods, another experiment was set up as follows. The Letter data set was used so that one of the classes in turn was downsampled to either 100, 50, or 25 samples, resulting in six different data sets with the percentage of the positive class ranging from 3 % to 12 %. The classification rate was above the percentage of the larger class in every case, so the classifiers can be considered to have worked reasonably well despite the class imbalance (Kuhn and Johnson, 2013). The same experiments were run on these data sets as before with the other data sets. The results of these experiments are shown in Tables 4 and 5. SVM and NN were well calibrated on these data sets without calibration, so they are omitted from the tables. Using ENIR on a separate calibration data set or on the full training data set increased calibration error significantly, as did Platt scaling on SVM when a separate calibration data set was used. The other methods had no significant effect on calibration performance with these two classifiers.
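Creating the imbalanced variants amounts to downsampling one class of the Letter data. A minimal sketch with assumed data frame and column names is shown below; the actual preprocessing is in the linked repository.

```r
# Downsample one class of a data set to n_keep samples to create an
# imbalanced variant. Column and level names are assumptions.
downsample_class <- function(d, class_col, minority, n_keep, seed = 1) {
  set.seed(seed)
  keep_min <- sample(which(d[[class_col]] == minority), n_keep)
  rbind(d[d[[class_col]] != minority, ], d[keep_min, ])
}
# e.g. downsample_class(letter_data, "label", "Q", 25)
```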

Class imbalance did not have a noticeable effect on the effectiveness of DG or DGG paired with ENIR calibration. With the NB and RF classifiers, the calibration of the raw scores was not optimal, as can be seen from the difference compared to the Bayesian classifiers; therefore, DG and DGG with ENIR were able to improve their calibration. As was the case with the more balanced data sets, DG with ENIR calibration led to more overconfident probability estimates, i.e. low MSE but somewhat higher logloss, than DGG with ENIR calibration.

Classifier OQ QO OQ QO OQ QO
NB Raw 0.085 0.110 0.049 0.066 0.037 0.030
NB ENIR 0.080 0.061 0.043 0.045 0.035 0.028
NB ENIR full 0.068 0.050 0.040 0.033 0.025 0.024
NB DG + ENIR 0.069 0.051 0.040 0.035 0.025 0.023
NB DGG + ENIR 0.069 0.051 0.039 0.034 0.025 0.023
RF Raw 0.021 0.021 0.015 0.017 0.012 0.014
RF ENIR 0.022 0.021 0.013 0.026 0.017 0.022
RF ENIR full 0.016 0.017 0.012 0.016 0.010 0.014
RF OOB 0.017 0.015 0.014 0.015 0.010 0.012
RF DG + ENIR 0.016 0.015 0.012 0.014 0.010 0.013
RF DGG + ENIR 0.015 0.015 0.012 0.014 0.010 0.015
Bayesian logistic regression 0.019 0.015 0.014 0.010 0.009 0.009
Gaussian process 0.009 0.018 0.013 0.011 0.009 0.011
Average results of 10-fold cross validation. A lower value of MSE indicates better calibration performance. Statistical significance of differences from Raw was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 4. Effect of class imbalance on MSE on the subsampled Letter data sets.
Classifier OQ QO OQ QO OQ QO
NB Raw 0.954 0.924 0.629 0.452 0.585 0.244
NB ENIR 1.626 1.550 1.375 1.495 1.510 0.674
NB ENIR full 0.532 0.576 0.580 0.468 0.343 0.649
NB DG + ENIR 0.473 0.384 0.439 0.332 0.192 0.195
NB DGG + ENIR 0.474 0.383 0.357 0.327 0.193 0.193
RF Raw 0.171 0.160 0.111 0.126 0.095 0.105
RF ENIR 1.100 0.838 0.475 1.069 0.828 0.755
RF ENIR full 0.537 0.598 0.309 0.396 0.387 0.629
RF OOB 0.341 0.177 0.249 0.248 0.307 0.165
RF DG + ENIR 0.321 0.453 0.166 0.316 0.307 0.237
RF DGG + ENIR 0.183 0.115 0.083 0.171 0.080 0.117
Bayesian logistic regression 0.132 0.120 0.118 0.079 0.082 0.078
Gaussian process 0.038 0.079 0.054 0.044 0.040 0.048
Average results of 10-fold cross validation. A lower value of logloss indicates better calibration performance. Statistical significance of differences from Raw was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 5. Effect of class imbalance on logloss on the subsampled Letter data sets.

5. Discussion

The choice of a classifier depends on the problem at hand. Accuracy, computational complexity and memory requirements (e.g. wearable device vs. cloud server), and the need for explainability are some of the properties that need to be taken into account when choosing a classifier. One aspect of explainability is classifier calibration, i.e. whether the posterior probability estimates of the classifier can be trusted. Bayesian methods such as Bayesian logistic regression and Gaussian process classifiers should be fairly well calibrated out of the box but may not be the most accurate on average when tested on a wide array of problems. The top performing classifier groups have been shown to be random forests, support vector machines, and neural network variations. Our results indicate that SVM and NN calibration on the tested small data sets is fairly good, but sometimes it can be further improved by the DGG method coupled with ENIR calibration. RF, on the other hand, almost always benefits from the DGG method coupled with ENIR. The Gaussian process classifier held on to the premise of good calibration on most of the tested data sets, but RF with DGG coupled with ENIR calibration produced probabilities whose average MSE over all the data sets was actually lower than with GPC. The same is not true for logloss, which suggests that ENIR might produce overconfident probability estimates. This discrepancy between MSE and logloss performance is more pronounced with the DG algorithm, as it does not use label smoothing the way DGG implicitly does. This leads to clearly overconfident probability estimates with the DG approach coupled with ENIR calibration. The proposed methods are not adversely affected by even severe class imbalance, as demonstrated in the experiments. The improvements in the calibration error metrics do indicate a real improvement in the quality of the predicted probabilities, which was verified by the tests with a synthetic data set where the true probabilities are known.

A slight drawback of DGG is that the number of samples and the group size parameter need to be set. Also, the calibration data points generated with DG are not necessarily uniformly distributed, meaning that with a fixed number of points per bin the bin width in DGG can vary. This can potentially affect calibration resolution negatively for prediction scores that fall inside the widest bins. These cases are rare, however; otherwise the bins would be narrower. A possible drawback of the Gaussian process classifier is that full GPCs, unlike e.g. SVMs, are not sparse out of the box but need additional approximation approaches. This needs to be considered when training classifiers on large-scale problems but might not pose a problem on small data sets.

On these small data sets, ENIR on its own, either with a separate calibration data set or with the whole training data set, performs poorly with all classifiers except naive Bayes, which is known for its poor calibration. The extra computation time from DGG is negligible in the case of small data sets, where it is mostly needed, and therefore its use is recommended when better calibration is essential. This is especially true for classifiers such as random forest and naive Bayes.

Acknowledgements.
The authors would like to thank Infotech Oulu, Jenny and Antti Wihuri Foundation, Tauno Tönning Foundation, and Walter Ahlström Foundation for financial support of this work.

References

  • T. Alasalmi, H. Koskimäki, J. Suutala, and J. Röning (2018) Getting more out of small data sets - improving the calibration performance of isotonic regression by generating more data. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence (ICAART 2018), pp. 379–386. External Links: Document Cited by: §2.1.
  • E. Alpaydin (1999) Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms. Neural Computation 11 (8), pp. 1885–1892. External Links: Document, ISSN 0899-7667 Cited by: §3.1.
  • H. Boström (2008) Calibrating random forests. Proceedings - 7th International Conference on Machine Learning and Applications, ICMLA 2008, pp. 121–126. External Links: Document, ISBN 9780769534954 Cited by: §2.1.
  • B. Caputo, K. Sim, F. Furesjo, and A. Smola (2002) Appearance-based object recognition using SVMs: which kernel should I use?. In Proceedings of NIPS workshop on Statistical methods for computational experiments in visual processing and computer vision, Whistler. Cited by: §3.
  • E. Cernadas and D. Amorim (2014) Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?. Journal of Machine Learning Research 15, pp. 3133–3181. External Links: Link Cited by: §3.
  • B. Connolly, K. B. Cohen, D. Santel, U. Bayram, and J. Pestian (2017) A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support. BMC Bioinformatics 18 (1), pp. 361. External Links: Document, ISBN 14712105, ISSN 1471-2105 Cited by: §1.
  • A. P. Dawid (1982) The Well-Calibrated Bayesian. Journal of the American Statistical Association 77 (379), pp. 605–610. External Links: Document, ISSN 01621459 Cited by: §1.
  • T. G. Dietterich (1998) Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation 10 (7), pp. 1895–1923. External Links: Document, 1011.1669, ISBN 0899-7667, ISSN 08997667 Cited by: §3.1.
  • P. Domingos and M. Pazzani (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29, pp. 103–130. External Links: Document, ISBN 08856125 (ISSN), ISSN 08856125 Cited by: §3.
  • D. Dua and C. Graff (2019) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §3.3.
  • M. Elter, R. Schulz-Wendtland, and T. Wittenberg (2007) The prediction of breast cancer biopsy outcomes using two cad approaches that both emphasize an intelligible decision process. Medical physics 34 (11), pp. 4164–4172. External Links: Document Cited by: §3.3.
  • A. Gelman, A. Jakulin, M. G. Pittau, and Y. S. Su (2008) A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics 2 (4), pp. 1360–1383. External Links: Document, ISBN 19326157, ISSN 19326157 Cited by: §3.
  • R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, D. Pedreschi, and F. Giannotti (2018) A Survey Of Methods For Explaining Black Box Models. ACM Computing Surveys (CSUR) 51 (5). External Links: Document, ISSN 03600300 Cited by: §1.
  • X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado (2011) Smooth Isotonic Regression: A New Method to Calibrate Predictive Models. In AMIA Summits Transl Sci Proc, pp. 16–20. External Links: Document, ISBN 2153-4063 (Electronic), ISSN 2153-4063 Cited by: §2.
  • X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado (2012) Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association 19 (2), pp. 263–274. External Links: Document, ISBN 1527-974X, ISSN 10675027 Cited by: §2.
  • M. Kuhn and K. Johnson (2013) Applied predictive modeling. Vol. 26, Springer. Cited by: §4.2.
  • M. Kull and P. Flach (2015) Novel Decompositions of Proper Scoring Rules for Classification: Score Adjustment as Precursor to Calibration. In Machine Learning and Knowledge Discovery in Databases, A. Appice, P. P. Rodrigues, V. Santos Costa, C. Soares, J. Gama, and A. Jorge (Eds.), Lecture Notes in Computer Science, Vol. 9284, pp. 1–16. External Links: Document, 1412.7525, ISBN 978-3-319-23527-1 Cited by: §3.2.
  • M. Kuss and C. E. Rasmussen (2005) Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research 6, pp. 1679–1704. External Links: ISBN 1532-4435, ISSN 1533-7928 Cited by: §3.
  • K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, and V. Consonni (2013) Quantitative structure–activity relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling 53 (4), pp. 867–878. External Links: Document Cited by: §3.3.
  • M. P. Naeini, G. F. Cooper, and M. Hauskrecht (2015a) Binary Classifier Calibration: Bayesian Non-Parametric Approach. In Proceedings of SIAM International Conference on Data Mining, pp. 208–216. External Links: ISBN 9781611974010, Document Cited by: §2.
  • M. P. Naeini, G. F. Cooper, and M. Hauskrecht (2015b) Obtaining Well Calibrated Probabilities Using Bayesian Binning.. In AAAI Conference on Artificial Intelligence, pp. 2901–2907. External Links: Document, ISSN 2159-5399 Cited by: §2.
  • M. P. Naeini and G. F. Cooper (2018) Binary classifier calibration using an ensemble of near isotonic regression models. Knowledge and Information Systems 54, pp. 151–170. External Links: Document, ISBN 9781509054725, ISSN 15504786 Cited by: §2.
  • A. Niculescu-Mizil and R. Caruana (2005a) Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pp. 625–632. External Links: ISBN 1-59593-180-5, Document Cited by: §1, §1, §2.
  • A. Niculescu-Mizil and R. A. Caruana (2005b) Obtaining Calibrated Probabilities from Boosting. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pp. 413–420. External Links: ISBN 0-9749039-1-4 Cited by: §1.
  • J. C. Platt (1999) Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers. Cited by: §2.
  • R. J. Tibshirani, H. Hoefling, and R. Tibshirani (2011) Nearly-isotonic regression. Technometrics 53 (1), pp. 54–61. External Links: Document, ISSN 00401706 Cited by: §2.
  • B. L. Welch (1947) The generalization of ‘student’s’ problem when several different population variances are involved. Biometrika 34 (1-2), pp. 28–35. External Links: Document Cited by: §3.1.
  • C. K. I. Williams and D. Barber (1998) Bayesian Classification With Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (12), pp. 1342–1351. External Links: Document Cited by: §3.
  • I. Yeh, K. Yang, and T. Ting (2009) Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications 36 (3, Part 2), pp. 5866 – 5871. External Links: ISSN 0957-4174, Document Cited by: §3.3.
  • B. Zadrozny and C. Elkan (2001a) Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’01, pp. 204–213. External Links: Document, ISBN 158113391X, ISSN 158113391X Cited by: §2.
  • B. Zadrozny and C. Elkan (2001b) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 609–616. External Links: ISBN 1-55860-778-1, Link Cited by: §1.
  • B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pp. 694–699. External Links: ISBN 1-58113-567-X, Document Cited by: §2, §2.

Appendix

Full results

Full results of our experiments are presented in Tables 7–12. The results in the tables are averages and standard deviations of the results for each fold in 10-fold cross validation. The statistical tests to determine whether the differences between calibration conditions are statistically significant were done with Student’s paired t-tests with an unequal variance assumption. Bayesian logistic regression and Gaussian process classifiers were compared to the best performing of the other classifiers based on the logarithmic loss of that classifier after calibrating it with ENIR using DGG generated calibration data. Table 6 lists the abbreviations used in the result tables.

Abbreviation Description
CR Classification rate
DG Data Generation algorithm
DGG Data Generation and Grouping algorithm
ENIR Ensemble of near isotonic regressions
MSE Mean squared error
Logloss Logarithmic loss
NB Naive Bayes
NN Neural network
OOB Out-of-Bag
RF Random forest
SVM Support vector machine
Table 6. List of abbreviations used in the results.
Scenario CR (%) ± SD MSE ± SD Logloss ± SD
NB Raw 83.69 4.08 0.248 0.043 5.970 1.179
NB ENIR 83.31 4.53 0.135 0.023 2.261 1.492
NB ENIR full 82.65 3.68 0.127 0.023 1.099 0.444
NB DG + ENIR 83.78 3.96 0.128 0.025 1.046 0.464
NB DGG + ENIR 83.69 4.08 0.127 0.023 0.808 0.114
SVM Raw 87.20 3.57 0.101 0.019 0.678 0.112
SVM ENIR 85.31 2.87 0.113 0.022 1.843 1.161
SVM ENIR full 86.73 3.93 0.105 0.023 1.658 0.759
SVM Platt 86.26 3.83 0.106 0.023 0.707 0.119
SVM Platt full 87.20 3.57 0.105 0.025 0.761 0.192
SVM DG + ENIR 86.82 4.10 0.101 0.020 0.673 0.114
SVM DGG + ENIR 87.20 3.57 0.101 0.020 0.675 0.113
RF Raw 85.87 4.67 0.097 0.021 0.693 0.159
RF ENIR 85.02 3.75 0.114 0.026 2.833 1.502
RF ENIR full 85.87 4.67 0.125 0.039 6.667 2.349
RF ENIR OOB 85.87 4.96 0.097 0.022 0.732 0.152
RF DG + ENIR 85.59 4.66 0.100 0.024 1.046 0.494
RF DGG + ENIR 86.82 3.77 0.097 0.023 0.688 0.152
NN Raw 84.83 3.57 0.112 0.027 0.848 0.222
NN ENIR 85.12 3.00 0.118 0.020 1.875 1.057
NN ENIR full 84.55 3.25 0.120 0.028 2.165 1.542
NN DG + ENIR 84.93 4.20 0.109 0.022 0.767 0.172
NN DGG + ENIR 84.83 3.89 0.108 0.022 0.709 0.113
Bayesian logistic regression 85.78 3.99 0.106 0.024 0.699 0.132
Gaussian process 86.54 3.82 0.106 0.020 0.352 0.045
Average results (mean ± standard deviation) of 10-fold cross validation. Lower values of MSE and logloss indicate better calibration performance. Statistical significance of differences from Raw, from ENIR full, from the classifier specific calibration, and from RF DGG + ENIR was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 7. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration scenarios on the Biodegradation data set.
Scenario CR (%) ± SD MSE ± SD Logloss ± SD
NB Raw 76.07 1.82 0.186 0.028 1.440 0.438
NB ENIR 76.07 1.61 0.176 0.021 2.555 1.292
NB ENIR full 75.54 2.90 0.168 0.014 1.271 0.399
NB DG + ENIR 76.34 2.17 0.167 0.013 1.270 0.398
NB DGG + ENIR 76.60 2.69 0.166 0.012 1.011 0.066
SVM Raw 79.15 3.04 0.163 0.012 1.013 0.059
SVM ENIR 77.81 3.72 0.179 0.028 3.309 2.893
SVM ENIR full 78.07 2.82 0.162 0.018 1.935 0.711
SVM Platt 78.87 4.18 0.174 0.017 1.089 0.138
SVM Platt full 79.15 3.04 0.162 0.015 1.010 0.076
SVM DG + ENIR 79.01 3.03 0.160 0.016 0.994 0.077
SVM DGG + ENIR 79.01 2.98 0.161 0.015 0.997 0.072
RF Raw 76.60 5.26 0.169 0.023 2.794 1.539
RF ENIR 75.26 5.72 0.181 0.023 3.616 1.989
RF ENIR full 76.47 4.34 0.191 0.032 3.031 1.997
RF ENIR OOB 77.40 5.43 0.181 0.028 5.971 1.980
RF DG + ENIR 76.73 5.16 0.168 0.020 2.880 1.739
RF DGG + ENIR 77.40 5.43 0.161 0.016 0.997 0.083
NN Raw 80.21 2.96 0.148 0.016 0.934 0.081
NN ENIR 79.95 2.81 0.169 0.025 2.981 2.677
NN ENIR full 80.21 3.27 0.149 0.018 1.518 0.590
NN DG + ENIR 80.47 3.22 0.149 0.015 0.937 0.071
NN DGG + ENIR 79.81 2.87 0.148 0.015 0.931 0.074
Bayesian logistic regression 78.47 3.65 0.155 0.013 0.956 0.066
Gaussian process 79.14 2.43 0.152 0.015 0.473 0.036
Average results (mean ± standard deviation) of 10-fold cross validation. Lower values of MSE and logloss indicate better calibration performance. Statistical significance of differences from Raw, from ENIR full, from the classifier specific calibration, and from NN DGG + ENIR was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 8. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration scenarios on the Blood donation data set.
Scenario CR (%) ± SD MSE ± SD Logloss ± SD
NB Raw 63.00 4.22 0.258 0.028 1.802 0.290
NB ENIR 64.02 3.93 0.234 0.020 1.973 0.861
NB ENIR full 62.19 3.88 0.225 0.013 1.367 0.290
NB DG + ENIR 62.59 3.70 0.225 0.013 1.286 0.058
NB DGG + ENIR 63.00 4.28 0.226 0.013 1.287 0.058
SVM Raw 71.62 2.95 0.195 0.011 1.153 0.050
SVM ENIR 70.60 3.03 0.204 0.013 1.924 0.743
SVM ENIR full 70.81 3.09 0.197 0.014 1.469 0.380
SVM Platt 71.76 2.58 0.197 0.009 1.162 0.043
SVM Platt full 71.62 2.95 0.196 0.014 1.162 0.063
SVM DG + ENIR 71.56 3.46 0.194 0.012 1.194 0.164
SVM DGG + ENIR 71.35 3.32 0.194 0.012 1.147 0.053
RF Raw 70.06 4.02 0.196 0.014 1.228 0.153
RF ENIR 70.67 4.07 0.197 0.014 1.533 0.342
RF ENIR full 71.35 4.36 0.228 0.020 4.350 1.314
RF ENIR OOB 70.07 4.51 0.218 0.019 6.019 1.358
RF DG + ENIR 69.79 3.18 0.198 0.012 1.454 0.445
RF DGG + ENIR 69.93 3.98 0.191 0.011 1.123 0.056
NN Raw 71.22 2.70 0.189 0.015 1.120 0.070
NN ENIR 70.61 2.56 0.200 0.010 1.977 9.49
NN ENIR full 70.94 3.38 0.189 0.014 1.251 0.383
NN DG + ENIR 71.28 2.76 0.190 0.011 1.167 0.140
NN DGG + ENIR 71.49 2.69 0.191 0.012 1.129 0.053
Bayesian logistic regression 68.30 3.80 0.210 0.011 1.216 0.049
Gaussian process 71.49 3.61 0.192 0.011 0.570 0.028
Average results (mean ± standard deviation) of 10-fold cross validation. Lower values of MSE and logloss indicate better calibration performance. Statistical significance of differences from Raw, from ENIR full, from the classifier specific calibration, and from RF DGG + ENIR was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 9. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration scenarios on the Contraceptive use data set.
Scenario CR (%) ± SD MSE ± SD Logloss ± SD
NB Raw 84.24 2.98 0.135 0.023 1.060 0.190
NB ENIR 84.18 3.00 0.109 0.018 1.307 0.701
NB ENIR full 83.66 1.91 0.104 0.011 0.720 0.224
NB DG + ENIR 84.38 2.38 0.104 0.012 0.647 0.057
NB DGG + ENIR 84.05 2.41 0.104 0.012 0.648 0.056
SVM Raw 99.28 0.54 0.006 0.005 0.049 0.029
SVM ENIR 99.22 0.91 0.008 0.008 0.377 0.448
SVM ENIR full 99.28 0.54 0.007 0.006 0.171 0.394
SVM Platt 99.02 1.06 0.007 0.006 0.082 0.046
SVM Platt full 99.28 0.54 0.006 0.005 0.054 0.049
SVM DG + ENIR 99.15 0.71 0.006 0.005 0.043 0.033
SVM DGG + ENIR 99.28 0.54 0.006 0.005 0.044 0.032
RF Raw 97.53 1.30 0.024 0.007 0.210 0.045
RF ENIR 97.33 1.18 0.018 0.009 0.433 0.551
RF ENIR full 97.53 1.30 0.019 0.019 0.567 0.677
RF ENIR OOB 97.79 1.38 0.014 0.008 0.137 0.135
RF DG + ENIR 97.92 1.27 0.014 0.008 0.134 0.157
RF DGG + ENIR 97.98 1.07 0.013 0.007 0.094 0.047
NN Raw 98.96 0.83 0.008 0.006 0.057 0.039
NN ENIR 98.70 1.01 0.011 0.008 0.438 0.356
NN ENIR full 98.96 0.83 0.009 0.007 0.346 0.404
NN DG + ENIR 99.02 0.84 0.008 0.006 0.053 0.033
NN DGG + ENIR 98.96 0.83 0.008 0.006 0.054 0.034
Bayesian logistic regression 95.90 1.84 0.028 0.012 0.193 0.063
Gaussian process 98.11 1.14 0.023 0.006 0.107 0.015
Average results (mean ± standard deviation) of 10-fold cross validation. Lower values of MSE and logloss indicate better calibration performance. Statistical significance of differences from Raw, from ENIR full, and from SVM DGG + ENIR was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 10. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration scenarios on the Letter recognition data set.
Scenario CR (%) ± SD MSE ± SD Logloss ± SD
NB Raw 77.62 4.71 0.169 0.036 1.292 0.288
NB ENIR 78.47 4.19 0.165 0.027 2.116 0.866
NB ENIR full 77.98 4.87 0.153 0.026 1.105 0.396
NB DG + ENIR 78.34 4.28 0.154 0.027 1.036 0.339
NB DGG + ENIR 78.09 4.83 0.153 0.026 0.958 0.130
SVM Raw 80.14 4.25 0.150 0.025 0.944 0.119
SVM ENIR 78.94 4.45 0.165 0.025 2.203 1.139
SVM ENIR full 79.66 4.22 0.150 0.027 1.394 0.713
SVM Platt 79.78 5.14 0.152 0.024 0.964 0.121
SVM Platt full 80.14 4.25 0.151 0.026 0.949 0.127
SVM DG + ENIR 80.14 3.82 0.152 0.026 0.947 0.128
SVM DGG + ENIR 80.14 4.25 0.152 0.026 0.951 0.127
RF Raw 80.50 4.41 0.160 0.030 1.556 0.575
RF ENIR 80.02 3.83 0.165 0.029 3.729 1.721
RF ENIR full 80.50 4.04 0.159 0.033 5.219 1.757
RF ENIR OOB 80.50 4.31 0.181 0.040 10.12 2.450
RF DG + ENIR 80.14 4.25 0.157 0.025 2.697 0.819
RF DGG + ENIR 80.63 4.31 0.148 0.025 0.935 0.115
NN Raw 80.02 4.68 0.148 0.027 0.919 0.139
NN ENIR 79.54 4.38 0.156 0.028 2.369 1.320
NN ENIR full 79.78 5.05 0.151 0.030 1.384 0.878
NN DG + ENIR 79.66 5.30 0.151 0.028 0.942 0.143
NN DGG + ENIR 80.02 4.68 0.151 0.028 0.938 0.141
Bayesian logistic regression 80.02 4.74 0.145 0.025 0.904 0.122
Gaussian process 81.10 4.65 0.145 0.025 0.452 0.062
Average results (mean ± standard deviation) of 10-fold cross validation. Lower values of MSE and logloss indicate better calibration performance. Statistical significance of differences from Raw, from ENIR full, from the classifier specific calibration, and from RF DGG + ENIR was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 11. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration scenarios on the Mammographic mass data set.
Scenario CR (%) ± SD MSE ± SD Logloss ± SD
NB Raw 78.14 4.51 0.170 0.029 1.390 0.430
NB ENIR 76.88 4.01 0.176 0.030 2.559 1.592
NB ENIR full 77.58 3.88 0.160 0.024 1.163 0.371
NB DG + ENIR 77.31 3.60 0.160 0.024 0.989 0.116
NB DGG + ENIR 77.58 4.33 0.160 0.024 0.993 0.114
SVM Raw 81.65 2.86 0.140 0.015 0.900 0.072
SVM ENIR 80.52 3.32 0.153 0.024 2.194 1.903
SVM ENIR full 80.11 3.76 0.140 0.019 1.337 0.599
SVM Platt 82.35 3.18 0.144 0.019 0.928 0.103
SVM Platt full 81.65 2.86 0.140 0.018 0.899 0.091
SVM DG + ENIR 81.93 2.75 0.142 0.013 0.907 0.065
SVM DGG + ENIR 82.07 2.80 0.141 0.014 0.905 0.067
RF Raw 80.11 4.30 0.139 0.023 1.130 0.454
RF ENIR 78.16 3.97 0.158 0.024 3.344 1.819
RF ENIR full 80.81 4.45 0.152 0.029 4.090 2.268
RF ENIR OOB 80.11 3.71 0.149 0.026 4.653 2.869
RF DG + ENIR 81.09 2.99 0.138 0.017 2.019 1.112
RF DGG + ENIR 80.24 4.25 0.135 0.017 0.863 0.100
NN Raw 80.24 4.64 0.141 0.022 0.903 0.132
NN ENIR 78.98 1.85 0.163 0.019 3.375 1.866
NN ENIR full 80.11 4.39 0.144 0.025 1.605 0.684
NN DG + ENIR 80.25 4.61 0.142 0.019 0.900 0.101
NN DGG + ENIR 80.10 4.46 0.141 0.019 0.899 0.100
Bayesian logistic regression 80.67 3.34 0.144 0.020 0.906 0.108
Gaussian process 82.10 3.40 0.135 0.020 0.436 0.053
Average results (mean ± standard deviation) of 10-fold cross validation. Lower values of MSE and logloss indicate better calibration performance. Statistical significance of differences from Raw, from ENIR full, from the classifier specific calibration, and from RF DGG + ENIR was determined with Student’s paired t-test on the 10-fold cross validation results.
Table 12. Classification rate, mean squared error, and logarithmic loss of different classifiers and calibration scenarios on the Titanic data set.