## 1 Introduction

In classification, an object of interest is predicted to belong to one of a set of discrete, predefined categories called classes. An example of a classification problem is recognizing handwritten digits. In many applications it is also important to quantify the uncertainty of these predictions. In the handwritten digits example, how certain can we be that the digit is one and not seven or any other digit? If the results of a classifier are used as input for making decisions, or if there are costs involved in the classification decision, then it is important, in addition to good classification accuracy, that the probabilities predicted by a classifier are accurate. A classifier is said to be well calibrated if the predicted probability of an event is close to the proportion of these events among a group of similar predictions [Dawid1982]. However, the main objective for classifier design is often good class separation and not accurate probability estimation. Therefore, many commonly used classifiers are not well calibrated. The process of improving a classifier's probability estimates by post-processing them is called calibration. Most commonly used calibration algorithms only work on binary problems and need a fair amount of data that is separate from both the training and the testing data to avoid bias, which severely restricts their application in real-world problems.

To tackle these two limitations, we demonstrate two ways to generalize a binary calibration method, previously shown to work on small data sets, to multi-class problems. Using the proposed calibration approach led to statistically significant improvements in calibration error metrics. The rest of this article is structured as follows. Section 2 briefly reviews relevant literature on the topic, Section 3 explains the experiments that were used for testing the proposed approaches, and results from those experiments are presented in Section 4. The results are discussed in Section 5, and Section 6 concludes the article.

## 2 Background

Calibration algorithms need training data, and to avoid bias this data needs to be separate from the data that is used for training the classifier. Depending on the learning algorithm, a minimum of about 1000 to 2000 training samples is needed for the calibration data set to avoid overfitting. Non-parametric calibration algorithms are particularly prone to overfitting on small data sets, and their performance seems to keep improving as the calibration data set grows even larger [NiculescuMizil2005ICML]. This means that the total amount of training data needs to be large so that enough data can be set aside for calibration. In addition, a separate data set needs to be held out for testing. However, relatively small data sets are quite common in many real-world modelling tasks.

It has been previously shown that calibrating binary classifiers with the traditional calibration approach does not work very well when the available data is limited. However, it is possible to solve the problem, at least partially, by generating more calibration data with a Monte Carlo cross validation approach [Alasalmi2018, Alasalmi2020] using isotonic regression (IR) [NiculescuMizil2005ICML] or ensemble of near isotonic regression models (ENIR) [Naeini2018] calibration algorithms. Many classification problems are not binary; instead, the task is often to classify the data into multiple classes ($K > 2$), but most calibration algorithms work on binary ($K = 2$) classification problems only. This is also true for the above-mentioned solution that uses the Data Generation and Grouping (DGG) algorithm [Alasalmi2020], which works with binary calibration algorithms only.

A solution to this problem is to break the multi-class problem into several binary problems, solve each binary classification and calibration problem independently, and combine the results into multi-class probability estimates [Zadrozny2002]. The premise is that better calibrated binary probabilities result in better calibrated multi-class probabilities. The question then becomes how to divide the problem into binary problems and how to combine the results. Two intuitive ways to break the multi-class problem into binary problems are one-vs-rest and all pairs.

In the one-vs-rest approach, each binary problem treats one of the classes as the positive class and the remaining classes collectively as the negative class, and this is repeated for each class. The number of binary problems is then the same as the number of classes $K$. Probability estimates from the one-vs-rest approach can be combined by simply linearly normalizing the binary probabilities for each class so that they sum to one. This results in error rates comparable to combining the probabilities using least squares or coupling algorithms [Zadrozny2002]. Using one class as the positive class and the rest of the data as the negative class leads to class imbalance, which becomes more pronounced as the number of classes grows. However, the number of binary problems in this approach remains reasonable.
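The linear normalization step described above can be illustrated with a minimal sketch. The function name `normalize_one_vs_rest` and the uniform fallback for an all-zero input are assumptions made for this illustration, not part of the cited method.

```python
def normalize_one_vs_rest(binary_probs):
    """Rescale K one-vs-rest probabilities so that they sum to one."""
    total = sum(binary_probs)
    if total == 0:
        # Assumption: fall back to a uniform distribution when every
        # binary classifier outputs zero for this observation.
        return [1.0 / len(binary_probs)] * len(binary_probs)
    return [p / total for p in binary_probs]


# Hypothetical positive-class outputs of three one-vs-rest classifiers
# for a single observation.
print(normalize_one_vs_rest([0.6, 0.3, 0.3]))
```

Because each binary classifier is trained independently, the raw outputs need not sum to one, which is why the rescaling is needed at all.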

In the all pairs approach, all possible pairs of classes are enumerated, and one class in each pair is selected as the positive class while the other class serves as the negative class. There are $K(K-1)/2$ possible pairs of classes in this approach, meaning that the number of binary problems is larger than with the one-vs-rest approach when $K > 3$, as can be seen from Table 1. However, the binary problems are faster to learn in the all pairs approach as only instances from the two classes are included in each. The binary problems are also more balanced in the all pairs approach. After learning and calibrating the binary classifiers, the probabilities for the multi-class problem can be combined with pairwise coupling, which was originally developed by Hastie and Tibshirani [Hastie1998] and later improved by Wu et al. [Wu2004].
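As an illustration of how pairwise estimates can be combined, the following sketch implements the iterative scheme of the original Hastie and Tibshirani coupling, not the improved method of Wu et al. It assumes equal weights for all pairs, and the function name, the clipping constant `eps`, and the stopping rule are choices made for this example only.

```python
def pairwise_coupling(r, max_iter=1000, tol=1e-12, eps=1e-9):
    """Couple pairwise estimates into class probabilities.

    r[i][j] estimates P(class i | class i or class j) for i != j,
    so r[j][i] should equal 1 - r[i][j]. The diagonal is ignored.
    """
    k = len(r)
    p = [1.0 / k] * k  # start from a uniform distribution
    for _ in range(max_iter):
        previous = list(p)
        for i in range(k):
            others = [j for j in range(k) if j != i]
            # Clip the pairwise estimates so that no class is driven to exactly zero.
            observed = sum(min(max(r[i][j], eps), 1.0 - eps) for j in others)
            implied = sum(p[i] / (p[i] + p[j]) for j in others)
            p[i] *= observed / implied
            total = sum(p)
            p = [value / total for value in p]  # renormalize after each update
        if max(abs(a - b) for a, b in zip(p, previous)) < tol:
            break
    return p


# Pairwise probabilities that are exactly consistent with a known distribution.
true_p = [0.5, 0.3, 0.2]
r = [[true_p[i] / (true_p[i] + true_p[j]) if i != j else 0.0
      for j in range(3)] for i in range(3)]
print(pairwise_coupling(r))
```

When the pairwise estimates are mutually consistent, as in this usage example, the iteration should recover the underlying distribution. With real classifiers the estimates usually conflict, and the result is a compromise among them.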

K | One-vs-rest | All Pairs |
---|---|---|
3 | 3 | 3 |
4 | 4 | 6 |
5 | 5 | 10 |
6 | 6 | 15 |
… | … | … |
10 | 10 | 45 |
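The counts in the table follow directly from the two decompositions: one binary problem per class for one-vs-rest, and one per unordered pair of classes for all pairs. A short sketch, with a hypothetical helper name, reproduces them:

```python
from math import comb


def binary_problem_counts(k):
    """Return (one-vs-rest, all pairs) binary problem counts for k classes."""
    return k, comb(k, 2)  # comb(k, 2) == k * (k - 1) / 2


for k in (3, 4, 5, 6, 10):
    one_vs_rest, all_pairs = binary_problem_counts(k)
    print(k, one_vs_rest, all_pairs)
```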

The two intuitive ways of breaking up the multi-class problem mentioned above are special cases of a more general idea that uses so-called error correcting output coding (ECOC) matrices [Allwein2000]. ECOC matrices can be either complete or sparse. However, with complete ECOC matrices the number of binary problems grows exponentially with the number of classes, and there are computational problems with sparse ECOC matrices, making both infeasible in practice [Gebel2009].
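To make the connection to coding matrices concrete, the sketch below builds the matrices that correspond to one-vs-rest and all pairs. It assumes the common convention that rows are classes, columns are binary problems, and entries are +1 (positive class), -1 (negative class), or 0 (class not used in that problem). The function names are hypothetical. The column count for a complete code assumes one column for every distinct split of the classes into two non-empty groups.

```python
from itertools import combinations


def one_vs_rest_matrix(k):
    """k x k matrix: column c separates class c (+1) from all others (-1)."""
    return [[1 if row == col else -1 for col in range(k)] for row in range(k)]


def all_pairs_matrix(k):
    """k x k(k-1)/2 matrix: each column compares one pair, others are 0."""
    pairs = list(combinations(range(k), 2))
    matrix = [[0] * len(pairs) for _ in range(k)]
    for col, (positive, negative) in enumerate(pairs):
        matrix[positive][col] = 1
        matrix[negative][col] = -1
    return matrix


def complete_code_columns(k):
    """Number of distinct two-way splits of k classes into non-empty groups."""
    return 2 ** (k - 1) - 1


for row in all_pairs_matrix(4):
    print(row)
print(complete_code_columns(10))  # compare with 10 and 45 binary problems
```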

## 3 Experiments

In this study, the feasibility of the DGG data generation algorithm for calibrating multi-class classification problems was tested. The one-vs-rest approach with normalization and the all pairs approach with pairwise coupling were compared when using the DGG algorithm along with ENIR calibration. The procedure in the context of binary calibration is described more thoroughly in [Alasalmi2020]. Calibration error was quantified with logarithmic loss (LL) and mean squared error (MSE). LL is defined in Equation 1 and MSE in Equation 2. In the equations, $N$ stands for the number of observations, $K$ stands for the number of class labels, $\log$ is the natural logarithm, $y_{i,j}$ equals $1$ if observation $i$ belongs to class $j$, otherwise it is $0$, and $p_{i,j}$ stands for the predicted probability that observation $i$ belongs to class $j$. A smaller value of each metric indicates better calibration.

$$\mathrm{LL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} y_{i,j}\log(p_{i,j}) \tag{1}$$

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K}\left(y_{i,j}-p_{i,j}\right)^{2} \tag{2}$$
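The two metrics can be sketched in a few lines. This is an illustration under stated assumptions. LL is taken as the negative mean log probability assigned to the true class, and MSE as the squared differences between one-hot labels and predicted probabilities, summed over classes and averaged over observations. Dividing by the number of classes as well would only rescale MSE by a constant. The clipping constant `eps`, which avoids taking the logarithm of zero, is also an assumption.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Negative mean log probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        p = min(max(row[label], eps), 1.0)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


def mean_squared_error(y_true, probs):
    """Squared error against one-hot labels, averaged over observations."""
    total = 0.0
    for label, row in zip(y_true, probs):
        for j, p in enumerate(row):
            target = 1.0 if j == label else 0.0
            total += (target - p) ** 2
    return total / len(y_true)


# Two hypothetical observations with integer class labels 0 and 1.
y_true = [0, 1]
probs = [[0.5, 0.5], [0.0, 1.0]]
print(log_loss(y_true, probs), mean_squared_error(y_true, probs))
```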

A stratified 10-fold cross validation was used to create data samples, and Student's paired t-test with unequal variance assumption and the Welch modification to the degrees of freedom [welch1947] was used to determine if there was a statistically significant difference between calibration scenarios.

Properties of the data sets that were used in the experiments are presented in Table 2. With the Abalone data set the task is to predict the age groups of abalones based on some physical measurements [nash1994population]. Many of the classes had only a handful of samples, some just one, so classes 1 to 5 were grouped together, as were classes 14 and 15 and all classes over 16. The Contraceptive Method Choice data set (Contraceptive) is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The task with the Contraceptive data is to predict the choice of current contraceptive method: long term, short term, or no use. The development index (development) data set describes the development status of countries based on demographic data, and the task is to predict the development index of each country. The ecoli data set describes protein localization sites in Escherichia coli bacteria [nakai1992knowledge]. Due to the small number of samples, classes were grouped so that subclasses of inner membrane were grouped together, as were subclasses of outer membrane. The forest type mapping (forest) data set describes forested areas in Japan based on multi-temporal remote sensing data [johnson2012using], and the task is to discriminate different forest types. The Heart disease Cleveland data set (Heart) contains clinical and noninvasive test results of patients undergoing angiography at the Cleveland Clinic [Detrano1989]. Six patients with missing values were discarded from the analysis. The goal with the Heart data is to predict the severity of heart disease based on the patient data. The optical recognition of handwritten digits (pendigits) data set [kaynak1995methods] contains preprocessed features that describe handwritten digits, and the classification task is to recognize the digits.
To facilitate the comparison of algorithm performance, the original division into training and test data sets was not used; instead, the data sets were combined and cross validation was used as with the rest of the data sets. The seeds data set describes different varieties of wheat seeds based on a soft X-ray technique [charytanowicz2010complete], and the task is to classify the seeds into the correct class. The steel plates faults (steel) data set (provided by Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy, www.semeion.it) describes faults in steel plates, and the task is to classify the plate faults into the correct categories based on measurement data. The Waveform data set consists of artificially generated data on three classes of waves described by noisy attributes, and the classification task is to separate the wave classes [Breiman1984]. The wholesale customers (wholesale) data set contains data on customers of a wholesale distributor in Portugal [abreu2011analise]. The task is to classify each customer into the correct region. The yeast data set describes protein localization sites with results from analysis techniques [horton1996probabilistic], and the task is to classify each protein to the correct location based on the analysis results. The development index data set is available from Kaggle data sets. The rest of the data sets used in the experiments are freely available from the UCI machine learning repository [Lichman:2013].

Data set | Samples | Classes | Smallest class | Largest class |
---|---|---|---|---|
Abalone | 4177 | 11 | 189 | 689 |
Contraceptive | 1473 | 3 | 333 | 629 |
Development | 212 | 4 | 13 | 89 |
Ecoli | 336 | 4 | 25 | 143 |
Forest | 523 | 4 | 83 | 195 |
Heart | 297 | 5 | 13 | 160 |
Pendigits | 10992 | 10 | 1055 | 1144 |
Seeds | 210 | 3 | 70 | 70 |
Steel | 1941 | 7 | 55 | 673 |
Waveform | 5000 | 3 | 1647 | 1696 |
Wholesale | 440 | 3 | 47 | 316 |
Yeast | 1479 | 9 | 20 | 463 |
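Because several of the data sets above have very small classes, stratification matters when forming the ten cross validation folds. The following is a minimal sketch of one way to assign stratified folds. The function name, the fixed seed, and the round-robin assignment are assumptions for illustration and are not taken from the experiments.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each observation a fold index so class proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            fold_of[index] = (offset + position) % n_folds
        offset += len(indices)
    return fold_of


# Class sizes matching the Seeds row of the table: three classes of 70 samples.
labels = [0] * 70 + [1] * 70 + [2] * 70
folds = stratified_folds(labels)
print(sum(1 for fold in folds if fold == 0))  # size of the first fold
```

With a class as small as 13 samples, as in the Development and Heart data sets, some folds necessarily contain only one sample of that class.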

DGG data generation with ENIR calibration has been shown to work well especially with naive Bayes (NB) and random forest (RF) classifiers on binary problems and they are both capable of producing multi-class probability estimates without modification so they were selected as the base classifiers for our experiments. A total of five different calibration scenarios were compared in this study: multi-class uncalibrated probabilities (Multi-class Raw), one-vs-rest with either uncalibrated (One-vs-rest Raw) or calibrated (One-vs-rest DGG + ENIR) probabilities, and all pairs with either uncalibrated (All pairs Raw) or calibrated (All pairs DGG + ENIR) probabilities. In addition to calibration error metrics, computation times were recorded on a computational server (Intel Xeon E5-2650 v2 @ 2.60GHz, 196GB RAM) for each calibration scenario.
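The significance test described earlier in this section can be sketched with the standard library. The sketch computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for two sets of per-fold scores. Whether this matches the exact procedure used in the experiments is an assumption, and the fold scores below are made up for illustration. Turning the statistic into a p-value requires the Student t distribution, which is available for example through `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
from statistics import mean, variance


def welch_statistic(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    var_a = variance(a) / len(a)  # squared standard error of each mean
    var_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


# Hypothetical per-fold MSE values for two calibration scenarios.
raw = [0.110, 0.105, 0.112, 0.108, 0.109, 0.111, 0.107, 0.110, 0.106, 0.112]
calibrated = [0.071, 0.078, 0.074, 0.076, 0.073, 0.077, 0.075, 0.072, 0.079, 0.075]
print(welch_statistic(raw, calibrated))
```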

## 4 Results

Results of the experiments are summarized in Table 3 which shows how many of the data sets had statistically significant changes in calibration performance after our calibration treatment on the data sets grouped by the classifier used, the approach to form the binary problems, and by the number of classes. Full results, MSEs and LLs, for each of the tested data sets in each calibration scenario are reported in Tables 4 and 5 for naive Bayes and random forest, respectively.

Classifier | Approach | Low number of classes | High number of classes |
---|---|---|---|
NB | One-vs-rest | | |
NB | All Pairs | | |
RF | One-vs-rest | | |
RF | All Pairs | | |

Each cell reports the number of data sets with improved calibration, neutral effect, and impaired calibration. The cell values and their marker symbols are missing from this copy of the table.

Data set | Multi-class Raw MSE | Multi-class Raw LL | One-vs-rest Raw MSE | One-vs-rest Raw LL | One-vs-rest DGG+ENIR MSE | One-vs-rest DGG+ENIR LL | All pairs Raw MSE | All pairs Raw LL | All pairs DGG+ENIR MSE | All pairs DGG+ENIR LL |
---|---|---|---|---|---|---|---|---|---|---|
Abalone | 0.089 | 4.676 | 0.079 | 3.634 | 0.074 | 2.833 | 0.169 | 7.791 | 0.079 | 3.087 |
Contraceptive | 0.233 | 2.354 | 0.221 | 2.160 | 0.200 | 1.752 | 0.233 | 2.354 | 0.199 | 1.750 |
Development | 0.081 | 3.039 | 0.082 | 2.773 | 0.065 | 1.038 | 0.090 | 3.104 | 0.060 | 0.876 |
Ecoli | 0.029 | 0.692 | 0.032 | 0.692 | 0.033 | 0.688 | 0.036 | 0.858 | 0.032 | 0.594 |
Forest | 0.065 | 3.384 | 0.076 | 2.066 | 0.059 | 0.996 | 0.128 | 3.572 | 0.117 | 1.529 |
Heart | 0.130 | 2.549 | 0.108 | 2.096 | 0.099 | 1.635 | 0.123 | 2.487 | 0.098 | 1.598 |
Pendigits | 0.027 | 2.077 | 0.035 | 1.597 | 0.024 | 0.918 | 0.059 | 2.303 | 0.056 | 1.909 |
Seeds | 0.054 | 0.848 | 0.047 | 0.610 | 0.046 | 0.450 | 0.054 | 0.848 | 0.050 | 0.456 |
Steel | 0.102 | 6.101 | 0.084 | 3.085 | 0.066 | 1.515 | - | - | 0.088 | 2.224 |
Waveform | 0.109 | 1.545 | 0.079 | 0.717 | 0.075 | 0.725 | 0.109 | 1.545 | 0.070 | 0.710 |
Wholesale | 0.207 | 2.612 | 0.199 | 2.294 | 0.148 | 1.405 | 0.207 | 2.612 | 0.148 | 1.405 |
Yeast | 0.064 | 2.118 | 0.064 | 2.039 | 0.063 | 1.935 | 0.105 | 4.510 | 0.116 | 4.248 |

Average results of 10-fold cross validation. Significantly different from Multi-class Raw is indicated with underlining. Best performing scenario with each classifier is indicated with boldface font.

Mean squared error and logarithmic loss of naive Bayes classifier on different calibration scenarios.

Data set | Multi-class Raw MSE | Multi-class Raw LL | One-vs-rest Raw MSE | One-vs-rest Raw LL | One-vs-rest DGG+ENIR MSE | One-vs-rest DGG+ENIR LL | All pairs Raw MSE | All pairs Raw LL | All pairs DGG+ENIR MSE | All pairs DGG+ENIR LL |
---|---|---|---|---|---|---|---|---|---|---|
Abalone | 0.073 | 2.859 | 0.073 | 2.981 | 0.072 | 2.919 | 0.080 | 3.534 | 0.078 | 3.158 |
Contraceptive | 0.186 | 1.682 | 0.190 | 1.752 | 0.184 | 1.661 | 0.185 | 1.652 | 0.181 | 1.613 |
Development | 0.004 | 0.087 | 0.009 | 0.203 | 0.003 | 0.057 | 0.022 | 0.386 | 0.011 | 0.730 |
Ecoli | 0.031 | 0.685 | 0.029 | 0.559 | 0.028 | 0.442 | 0.038 | 0.584 | 0.029 | 0.743 |
Forest | 0.044 | 0.640 | 0.044 | 0.840 | 0.043 | 0.808 | 0.106 | 1.356 | 0.108 | 1.371 |
Heart | 0.101 | 1.607 | 0.102 | 1.640 | 0.097 | 1.573 | 0.100 | 1.602 | 0.099 | 1.776 |
Pendigits | 0.003 | 0.157 | 0.003 | 0.174 | 0.002 | 0.105 | 0.055 | 1.957 | 0.053 | 1.899 |
Seeds | 0.033 | 0.359 | 0.035 | 0.358 | 0.037 | 0.371 | 0.035 | 0.360 | 0.037 | 0.678 |
Steel | 0.041 | 1.005 | 0.041 | 1.002 | 0.039 | 1.034 | - | - | 0.094 | 2.065 |
Waveform | 0.076 | 0.759 | 0.075 | 0.749 | 0.066 | 0.640 | 0.076 | 0.755 | 0.068 | 0.680 |
Wholesale | 0.157 | 1.602 | 0.157 | 1.528 | 0.148 | 1.402 | 0.157 | 1.528 | 0.148 | 1.402 |
Yeast | 0.059 | 1.951 | 0.059 | 1.938 | 0.059 | 1.973 | 0.091 | 2.802 | 0.112 | 4.726 |

Average results of 10-fold cross validation. Significantly different from Multi-class Raw is indicated with underlining. Best performing scenario with each classifier is indicated with boldface font.

Mean squared error and logarithmic loss of random forest classifier on different calibration scenarios.

Breaking up the multi-class problem into one-vs-rest binary problems and combining the results by normalization improved the calibration of naive Bayes on almost all data sets, even without calibrating the binary classifier probabilities. The same was not true for the all pairs approach, which performed worse than uncalibrated multi-class classification on some data sets and achieved approximately the same level of performance on others. Calibrating the binary naive Bayes classifiers in the one-vs-rest approach improved the error metrics on ten of the twelve data sets compared to both the uncalibrated multi-class and the uncalibrated one-vs-rest scenarios. One exception was the Waveform data set, where LL was not significantly different from the uncalibrated one-vs-rest scenario even though MSE was. Calibration did, however, improve both MSE and LL on that data set compared to uncalibrated multi-class classification.

Calibrating the binary naive Bayes classifiers in the all pairs approach improved calibration on seven of the twelve data sets compared to uncalibrated multi-class classification. On two data sets MSE increased while LL decreased and on one of the data sets the treatment increased both MSE and LL.

Overall the one-vs-rest approach with DGG + ENIR calibration coupled with normalization was the best performing calibration scenario for naive Bayes. One-vs-rest calibration performed better than all pairs on five data sets, there was no statistically significant difference on six data sets, and all pairs was better on one data set.

With the random forest classifier, breaking up the multi-class problem into binary problems increased calibration error metrics on four data sets with the one-vs-rest approach and on eight data sets with the all pairs approach. After calibrating the binary problems, calibration improved on six data sets with the one-vs-rest approach and on five data sets with the all pairs approach compared to the corresponding uncalibrated scenario. Compared to the uncalibrated multi-class scenario, calibration performance with the one-vs-rest approach improved on four data sets while being similar on the other eight data sets. The calibrated all pairs was able to improve calibration only on three data sets, was neutral on two data sets, and decreased calibration performance on seven data sets compared to the uncalibrated multi-class scenario.

As with naive Bayes, the one-vs-rest approach fared better than the all pairs approach overall. On seven data sets the one-vs-rest approach did better than the all pairs approach, on four data sets there was no difference, and on one data set the all pairs approach resulted in lower calibration error.

Average computation times for training and calibrating the classifiers were recorded and the results are shown in Table 6. For the one-vs-rest and the all pairs approaches the calibration times are presented as time consumed for each binary problem to make the numbers comparable when taking into account the number of binary problems on each data set. Naive Bayes was extremely fast to train and although breaking up the classification problem into several binary problems increased the computation times this increase was negligible in practice.

Scenario | Model training | Calibration (per binary problem) |
---|---|---|
NB Multi-class | 0.009 s | - |
NB One-vs-rest | 0.052 s | 4.36 s |
NB All Pairs | 0.112 s | 4.62 s |
RF Multi-class | 4.15 s | - |
RF One-vs-rest | 29.5 s | 5.13 s |
RF All Pairs | 18.2 s | 4.67 s |

For random forest, too, the multi-class classifier was clearly faster to train than either the all pairs or the one-vs-rest. The all pairs classifier was, however, clearly faster to train than the one-vs-rest classifier but with such small data sets this difference is still not very meaningful in practice.

DGG data generation and ENIR calibration took approximately the same time for each binary problem for both the one-vs-rest and the all pairs approaches as the number of generated calibration data points is the same in both approaches. What was a bit surprising was that there was no difference in calibration times, per binary problem, between the classifiers. The overall calibration time then depends mostly on the number of binary problems.

## 5 Discussion

Naive Bayes is known to be poorly calibrated because its assumptions about feature independence rarely hold. It is not a big surprise that calibration improves its performance but it is surprising that using the one-vs-rest approach can improve its calibration even without calibrating the binary classifiers. Calibrating the binary naive Bayes classifiers works for both one-vs-rest and all pairs approaches. The calibrated one-vs-rest approach seems to be better suited for naive Bayes than the all pairs and the difference is often statistically significant.

Random forest classifier is not as poorly calibrated as naive Bayes but has still been shown to improve with calibration on some binary problems even with small data sets by using DGG for generating the calibration data set. It is clear from our experiments that the one-vs-rest approach works better with random forest than the all pairs approach does. As the all pairs approach actually decreases calibration performance on some data sets, especially if the number of classes is high, the one-vs-rest is the recommended approach for random forest.

Computation time grew linearly as a function of the number of binary problems because the complexity of DGG data generation depends mainly on the amount of data to be generated, which was held constant for each scenario. This indicates that calibration time grows with the number of classes, which might become more of an issue with the all pairs approach than with the one-vs-rest approach. However, the training times for calibration were only a few seconds per binary problem, while the prediction times are negligible. In addition, a parallel implementation would be trivial and would decrease computation time considerably.
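The linear relationship can be turned into a rough back-of-the-envelope estimate. The sketch below multiplies the average per-problem calibration times reported for naive Bayes in Table 6 by the number of binary problems. It assumes sequential execution and that the per-problem averages carry over to other class counts, so the output is an extrapolation rather than a measurement.

```python
from math import comb

# Average calibration time per binary problem in seconds (NB rows of Table 6).
CALIBRATION_SECONDS = {"one-vs-rest": 4.36, "all pairs": 4.62}


def estimated_calibration_time(k, approach):
    """Estimate total sequential calibration time for k classes."""
    n_problems = k if approach == "one-vs-rest" else comb(k, 2)
    return n_problems * CALIBRATION_SECONDS[approach]


for k in (3, 5, 10):
    print(
        k,
        round(estimated_calibration_time(k, "one-vs-rest"), 1),
        round(estimated_calibration_time(k, "all pairs"), 1),
    )
```

Since the binary problems are independent of each other, distributing them over several workers would reduce the wall-clock time roughly in proportion to the number of workers.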

Comparison of the proposed method with calibration approaches that can directly calibrate multi-class probabilities is left for future work.

## 6 Conclusions

Data Generation and Grouping with IR or ENIR calibration can be generalized to multi-class problems, as we have shown in this work using ENIR calibration. Using our proposed approach, calibration error can be decreased on many classification problems, as demonstrated by our experiments. This is an important finding as traditional calibration algorithms perform poorly on small data sets and not all classification problems are binary. DGG data generation adds computational complexity which grows linearly as a function of the number of binary problems. As the number of binary problems grows more rapidly with the all pairs approach, the one-vs-rest approach has an advantage as the number of classes grows. More importantly, the one-vs-rest approach performs better than the all pairs approach in many cases and did not increase calibration error on any of the tested data sets, whereas the all pairs approach did on some of them. The computation times for training the calibration algorithm were merely seconds per binary problem on the tested data sets, which should not discourage the use of this algorithm if good calibration is needed, especially with a parallel implementation.
