Despite their success, modern neural networks have been shown to be poorly calibrated [guo2017calibration], which has led to growing interest in the calibration of neural networks over the past few years [kull2019beyond; kumar2019verified; kumar2018trainable; muller2019does]. In classification problems, a classifier is said to be calibrated if the probability values it associates with the class labels match the true probabilities of correct class assignments. For instance, if an image classifier outputs a probability of 0.2 for the “horse” label on 100 test images, then approximately 20 of those images should actually be horses. Calibration is important when classifiers are used in safety-critical applications such as medical image analysis and autonomous driving, where downstream decision making depends on the predicted probabilities.
An important aspect of machine learning research is the measure used to evaluate the performance of a model. In the context of calibration, this amounts to measuring the difference between two empirical probability distributions. To this end, the popular metric, Expected Calibration Error (ECE) [naeini2015obtaining], approximates the classwise probability distributions using histograms and takes an expected difference. This histogram approximation has the weakness that the resulting calibration error depends on the binning scheme, namely the number of bins and the bin boundaries. Even though the drawbacks of ECE have been pointed out and some improvements have been proposed [kumar2019verified; nixon2019measuring], the histogram approximation has not been eliminated. (Footnote 1: We consider metrics that measure classwise (top-$r$) calibration error [kull2019beyond]. Refer to Section 2 for details.)
In this paper, we first introduce a simple, binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test [kolmogorov1933sulla; smirnov1939estimation], which also provides an effective visualization of the degree of miscalibration, similar to the reliability diagram [niculescu2005predicting]. The main idea of the KS test is to compare the respective classwise cumulative (empirical) distributions. Furthermore, by approximating the empirical cumulative distribution with a differentiable function via splines [Mckinley_cubicspline], we obtain an analytical recalibration function which maps the given network outputs to the actual class assignment probabilities. Such a direct mapping was previously unavailable, and the problem has been approached indirectly via learning, for example by optimizing the (modified) cross-entropy loss [guo2017calibration; mukhoti2020calibrating; muller2019does]. As in existing methods [guo2017calibration; kull2019beyond], the spline fitting is performed using a held-out calibration set and the resulting recalibration function is evaluated on an unseen test set.
We evaluate our method against existing calibration approaches on various image classification datasets, and our spline-based recalibration approach consistently outperforms existing methods on KS error, ECE, and other commonly used calibration measures. Our approach to calibration does not update the model parameters, which allows it to be applied to any trained network, and it retains the original classification accuracy in all the tested cases.
2 Notation and Preliminaries
We abstract the network as a function $f_\theta : \mathcal{D} \to [0,1]^K$, where $\mathcal{D} \subset \mathbb{R}^d$, and write $f_\theta(x) = z$. Here, $x$ may be an image, or other input datum, and $z$ is a vector, sometimes known as the vector of logits. In this paper, the parameters $\theta$ will not be relevant, and we write simply $f$ to represent the network function. Moreover, a function of this type will be referred to as a classifier, which may be of some other kind than a neural network.
In a classification problem, $K$ is the number of classes to be distinguished, and we call the value $z_k$ (the $k$-th component of vector $z$) the score for the class $k$. If the final layer of a network is a softmax layer, then the values $z_k$ satisfy $\sum_{k=1}^{K} z_k = 1$ and $z_k \ge 0$. Hence, the $z_k$ are pseudo-probabilities, though they do not necessarily have anything to do with real probabilities of correct class assignments. Typically, the value $y^* = \arg\max_k z_k$ is taken as the (top-1) prediction of the network, and the corresponding score, $\max_k z_k$, is called the confidence of the prediction. However, the term confidence does not have any mathematical meaning in this context and we deprecate its use.
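As a minimal sketch of these definitions, the following Python snippet turns a vector of raw (pre-softmax) network outputs into scores and reads off the top-1 prediction together with its score. The function names `softmax` and `top1`, the example values, and the 0-based class indices are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(raw):
    """Map raw network outputs to scores z_k with z_k >= 0 and sum_k z_k = 1."""
    m = max(raw)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in raw]
    total = sum(exps)
    return [e / total for e in exps]

def top1(z):
    """Return (index of the top-scoring class, its score) for a score vector z."""
    k_star = max(range(len(z)), key=lambda k: z[k])
    return k_star, z[k_star]

# Hypothetical raw outputs for a 3-class problem (classes indexed from 0).
z = softmax([1.0, 2.0, 3.0])
k_star, score = top1(z)
```

Nothing in this computation ties `score` to the actual frequency of correct predictions, which is why the scores are only pseudo-probabilities until calibration is checked.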
We assume we are given a set of training data $(x_i, y_i)_{i=1}^{n}$, where $x_i \in \mathcal{D}$ is an input data element, which for simplicity we call an image, and $y_i \in \mathcal{K} = \{1, \ldots, K\}$ is the so-called ground-truth label. Our method also uses two other sets of data, called calibration data and test data.
It would be desirable if the numbers $z_k$ output by a network represented true probabilities. For this to make sense, we posit the existence of joint random variables $(X, Y)$, where $X$ takes values in a domain $\mathcal{D} \subset \mathbb{R}^d$, and $Y$ takes values in $\mathcal{K}$. Further, let $Z = f(X)$, another random variable, and let $Z_k = f_k(X)$ be its $k$-th component. Note that in this formulation $X$ and $Y$ are joint random variables, and the probability $P(Y \mid X)$ is not assumed to be $1$ for a single class and $0$ for the others.
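A network is said to be calibrated if for every class $k$,
$$P(Y = k \mid Z = z) = z_k. \tag{1}$$
This can be written briefly as $P(k \mid f(x)) = f_k(x) = z_k$. Thus, if the network takes input $x$ and outputs $z = f(x)$, then $z_k$ represents the probability (given $f(x)$) that image $x$ belongs to class $k$.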
The probability $P(k \mid z)$ is difficult to evaluate, even empirically, and most metrics (such as ECE) use or measure a different notion called classwise calibration [kull2019beyond; zadrozny2002transforming], defined as
$$P(Y = k \mid Z_k = z_k) = z_k. \tag{2}$$
This paper uses definition (2) of calibration in the proposed KS metric.
Calibration and accuracy of a network are different concepts. For instance, one may consider a classifier that simply outputs the class probabilities for the data, ignoring the input $x$. Thus, if $f_k(x) = z_k = P(Y = k)$, this classifier $f$ is calibrated but its accuracy is no better than that of a random predictor. Therefore, when calibrating a classifier, it is important not to sacrifice classification accuracy (for instance, top-1 accuracy).
The top-$r$ prediction.
The classifier $f$ being calibrated means that $f_k(x)$ is calibrated for each class $k$, not only for the top class. This means that the scores $z_k$ for all classes $k$ give a meaningful estimate of the probability of the sample belonging to class $k$. This is particularly important in medical diagnosis, where one may wish to have a reliable estimate of the probability of certain unlikely yet possible diagnoses.
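Frequently, however, one is most interested in the probability of the top scoring class, the top-1 prediction, or in general the top-$r$ prediction. Suppose a classifier $f$ is given with values in $[0,1]^K$ and let $y$ be the ground-truth label. Let us use $f^{(-r)}$ to denote the $r$-th top score (so $f^{(-1)}$ would denote the top score; the notation follows Python semantics, in which $A[-1]$ represents the last element in array $A$). Similarly, we define $\max^{(-r)}$ for the $r$-th largest value. Let $f^{(-r)} : \mathcal{D} \to [0,1]$ and $y^{(-r)}$ be defined as
$$f^{(-r)}(x) = \max_{k}{}^{(-r)} f_k(x), \qquad y^{(-r)} = \begin{cases} 1 & \text{if } y = \arg\max_{k}{}^{(-r)} f_k(x), \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
In words, $y^{(-r)}$ is $1$ if the $r$-th top predicted class is the correct (ground-truth) choice. The network is calibrated for the top-$r$ predictor if for all scores $\sigma$,
$$P\big(y^{(-r)} = 1 \mid f^{(-r)}(x) = \sigma\big) = \sigma. \tag{4}$$
In words, the conditional probability that the $r$-th top choice of the network is the correct choice is equal to the $r$-th top score.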
Similarly, one may consider probabilities that a datum belongs to one of the top-$r$ scoring classes. The classifier is calibrated for being within-the-top-$r$ classes if
$$P\Big(\sum_{s=1}^{r} y^{(-s)} = 1 \;\Big|\; \sum_{s=1}^{r} f^{(-s)}(x) = \sigma\Big) = \sigma. \tag{5}$$
Here, the sum on the left is $1$ if the ground-truth label is among the top $r$ choices and $0$ otherwise, and the sum on the right is the sum of the top $r$ scores.
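The top-$r$ and within-top-$r$ quantities can be sketched for a single sample as follows. This is an illustrative Python sketch rather than the authors' code: the helper names `ranked_classes`, `top_r`, and `within_top_r` are hypothetical, classes are indexed from 0, and ties between scores are broken by class index (a choice the text does not specify).

```python
def ranked_classes(z):
    """Class indices sorted by decreasing score (ties broken by class index)."""
    return sorted(range(len(z)), key=lambda k: -z[k])

def top_r(z, y, r):
    """Return (r-th top score, 1 if the r-th top class equals the label y else 0)."""
    k_r = ranked_classes(z)[r - 1]
    return z[k_r], int(y == k_r)

def within_top_r(z, y, r):
    """Return (sum of the top r scores, 1 if y is among the top r classes else 0)."""
    top = ranked_classes(z)[:r]
    return sum(z[k] for k in top), int(y in top)

# Hypothetical 3-class score vector with ground-truth class 2.
z = [0.1, 0.6, 0.3]
y = 2
first = top_r(z, y, 1)          # top score 0.6, label is not the top class
second = top_r(z, y, 2)         # second score 0.3, label is the second class
within = within_top_r(z, y, 2)  # summed score about 0.9, label is in the top 2
```

Applying these helpers to every sample yields pairs (score, binary label), so top-$r$ and within-top-$r$ calibration reduce to the same binary problem as classwise calibration.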
3 Kolmogorov-Smirnov Calibration Error
We now consider a way to measure whether a classifier is classwise calibrated, including top-$r$ and within-top-$r$ calibration. This test is closely related to the Kolmogorov-Smirnov test [kolmogorov1933sulla; smirnov1939estimation] for the equality of two probability distributions, which may be applied when the probability distributions are represented by samples.
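We start with the definition of classwise calibration:
$$\begin{aligned} P(Y = k \mid f_k(X) = z_k) &= z_k, \\ P(Y = k,\ f_k(X) = z_k) &= z_k\, P(f_k(X) = z_k) \quad \text{(Bayes' rule)}. \end{aligned} \tag{6}$$
This may be written more simply but with a less precise notation as
$$P(z_k, k) = z_k\, P(z_k).$$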
Motivation of the KS test.
One is motivated to test the equality of (or difference between) two distributions, defined on the interval $[0,1]$. However, instead of having a functional form of these distributions, one has only samples from them. Given samples $(x_i, y_i)$, it is not straightforward to estimate $P(z_k)$ or $P(z_k, k)$, since the sample set is finite and a given value of $z_k$ is likely to occur only once, or not at all. One possibility is to use histograms of these distributions. However, this requires selecting the bin size and the divisions between bins, and the result depends on these parameters. For this reason, we avoid this solution.
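The dependence on the binning scheme can be illustrated with a small Python sketch. It assumes the commonly used equal-width-bin form of ECE (a weighted average over bins of the absolute gap between mean label and mean score); the function name `binned_ece` and the four-sample data set are hypothetical and chosen only for illustration.

```python
def binned_ece(scores, labels, n_bins):
    """Equal-width-bin calibration error for one class.

    scores[i] is the score for the class, labels[i] is 1 if sample i
    belongs to the class and 0 otherwise.
    """
    n = len(scores)
    score_sum = [0.0] * n_bins
    label_sum = [0.0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        score_sum[b] += s
        label_sum[b] += y
    # (count_b / n) * |mean label - mean score| simplifies to |label_sum - score_sum| / n;
    # empty bins contribute zero.
    return sum(abs(l - s) for s, l in zip(score_sum, label_sum)) / n

scores = [0.1, 0.4, 0.6, 0.9]
labels = [0, 1, 0, 1]
ece_by_bins = {b: binned_ece(scores, labels, b) for b in (1, 2, 4)}
```

On this toy data the same predictions give an error of about 0, 0.25, and 0.35 with 1, 2, and 4 bins respectively, so the reported value is driven as much by the binning as by the classifier.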
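The approach suggested by the Kolmogorov-Smirnov test is to compare the cumulative distributions. Thus, with $k$ given, one tests the equality
$$\int_0^{\sigma} P(z_k, k)\, dz_k = \int_0^{\sigma} z_k\, P(z_k)\, dz_k. \tag{7}$$
Writing $\phi_1(\sigma)$ and $\phi_2(\sigma)$ for the two sides of this equation, the KS-distance between these two distributions is $\mathrm{KS} = \max_{\sigma} |\phi_1(\sigma) - \phi_2(\sigma)|$.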
The fact that simply the maximum is used here may suggest a lack of robustness, but this is a maximum difference between two integrals, so it reflects an accumulated difference between the two distributions. In fact, if $z_k$ consistently over- or under-estimates $P(k \mid z_k)$ (which is usually the case, at least for top-1 classification), then $P(k \mid z_k) - z_k$ has constant sign for all values of $z_k$.
It follows that $P(z_k, k) - z_k P(z_k)$ has constant sign, and so the maximum value in the KS-distance is achieved when $\sigma = 1$. In this case,
$$\mathrm{KS} = \int_0^1 \big| P(z_k, k) - z_k\, P(z_k) \big|\, dz_k = \int_0^1 \big| P(k \mid z_k) - z_k \big|\, P(z_k)\, dz_k, \tag{8}$$
which is the expected difference between $z_k$ and $P(k \mid z_k)$. This can be equivalently referred to as the expected calibration error for the class $k$.
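Given samples $(x_i, y_i)_{i=1}^{N}$ and a fixed $k$, one can estimate these cumulative distributions by
$$\int_0^{\sigma} P(z_k, k)\, dz_k \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big(f_k(x_i) \le \sigma\big) \times \mathbf{1}(y_i = k), \tag{9}$$
where $\mathbf{1} : \mathcal{B} \to \{0, 1\}$ is the function that returns $1$ if the Boolean expression is true and $0$ otherwise. Thus, the sum is simply a count of the number of samples for which $y_i = k$ and $f_k(x_i) \le \sigma$, and so the integral represents the proportion of the data satisfying this condition. Similarly,
$$\int_0^{\sigma} z_k\, P(z_k)\, dz_k \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big(f_k(x_i) \le \sigma\big)\, f_k(x_i). \tag{10}$$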
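These sums can be computed quickly by sorting the data according to the values $f_k(x_i)$, then defining two sequences over the sorted samples as follows:
$$\tilde{h}_0 = h_0 = 0, \qquad h_i = h_{i-1} + \mathbf{1}(y_i = k)/N, \qquad \tilde{h}_i = \tilde{h}_{i-1} + f_k(x_i)/N. \tag{11}$$
For a calibrated classifier the two sequences should be the same, and the metric
$$\mathrm{KS}(f_k) = \max_i\, |h_i - \tilde{h}_i| \tag{12}$$
gives a numerical estimate of the similarity, and hence a measure of the degree of calibration of $f_k$. This is essentially a version of the Kolmogorov-Smirnov test for equality of two distributions.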
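The sort-and-accumulate procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ks_error` is hypothetical, `scores[i]` stands for the score of the chosen class on sample `i`, and `labels[i]` is 1 if sample `i` belongs to that class and 0 otherwise.

```python
def ks_error(scores, labels):
    """Binning-free KS calibration error for one class.

    Sorts samples by score, accumulates the empirical label mass and the
    score mass (each divided by N), and returns the largest absolute gap
    between the two running sums.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    h = 0.0        # running sum of label indicators / N
    h_tilde = 0.0  # running sum of scores / N
    ks = 0.0
    for i in order:
        h += labels[i] / n
        h_tilde += scores[i] / n
        ks = max(ks, abs(h - h_tilde))
    return ks

# Hypothetical toy data: mixed errors, then a consistently overconfident case.
ks_mixed = ks_error([0.1, 0.4, 0.6, 0.9], [0, 1, 0, 1])
ks_over = ks_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
```

In the overconfident case the gap keeps the same sign throughout, so the maximum is attained at the last sample and equals the difference between the mean score (0.9) and the fraction of positives (0.5). No bin count or bin boundary enters the computation.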
4 Recalibration using Splines
The function $h_i$ defined in (11) computes an empirical approximation
$$h_i \approx P\big(Y = k,\ f_k(X) \le f_k(x_i)\big). \tag{13}$$
For convenience, the value of $f_k$ will be referred to as the score. We now define a continuous function $h(t)$ for $t \in [0,1]$ by
$$h(t) = P\big(Y = k,\ f_k(X) \le s(t)\big), \tag{14}$$
where $s(t)$ is the $t$-th fractile score, namely the value that a proportion $t$ of the scores $f_k(X)$ lie below. For instance, $s(0.5)$ is the median score. So, $h_i$ is an empirical approximation to $h(t)$ where $t = i/N$. We now provide the basic observation that allows us to compute probabilities given the scores.
If $h(t) = P(Y = k,\ f_k(X) \le s(t))$ as in (14), where $s(t)$ is the $t$-th fractile score, then $h'(t) = P(Y = k \mid f_k(X) = s(t))$, where $h'(t) = dh/dt$.
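A short derivation sketch of this observation follows. It assumes, beyond what is stated above, that the score $f_k(X)$ has a continuous density $p$ on $[0,1]$ and that the fractile function $s$ is differentiable; the symbol $p$ is introduced here only for illustration.

```latex
\begin{align*}
t &= P\big(f_k(X) \le s(t)\big) = \int_0^{s(t)} p(z)\, dz
  &&\Longrightarrow\quad 1 = p\big(s(t)\big)\, s'(t), \\
h(t) &= P\big(Y = k,\ f_k(X) \le s(t)\big)
   = \int_0^{s(t)} P\big(Y = k \mid f_k(X) = z\big)\, p(z)\, dz
  &&\Longrightarrow\quad h'(t) = P\big(Y = k \mid f_k(X) = s(t)\big)\, p\big(s(t)\big)\, s'(t).
\end{align*}
```

Substituting $p(s(t))\, s'(t) = 1$ from the first line into the second gives $h'(t) = P(Y = k \mid f_k(X) = s(t))$. Working in the fractile variable $t$ rather than in the raw score is what removes the density factor, so the slope of the cumulative curve can be read directly as a class probability.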