1 Introduction
In this paper we consider the problem of calibration of neural networks, or of classification functions in general. This problem has been considered in the context of Support Vector Machines (platt1999probabilistic), and more recently in the context of Convolutional Neural Networks (CNNs) (guo2017calibration). In this case, a CNN used for classification takes an input $x$, belonging to one of $n$ classes, and outputs a vector $f(x)$ in $\mathbb{R}^n$, where the $y$-th component, $f_y(x)$, is often interpreted as the probability that input $x$ belongs to class $y$. If this value is to represent probabilities accurately, then we require that $f_y(x) = P(y \mid f(x))$. In this case, the classifier $f$ is said to be calibrated, or multiclass calibrated. (Footnote 1: In many papers, e.g. kull2019beyond, and calibration metrics, e.g. ECE (naeini2015obtaining), a slightly different condition known as classwise calibration is preferred: $f_y(x) = P(y \mid f_y(x))$.)

A well-known condition (bishop1994mixture) for a classifier to be calibrated is that it minimizes the cross-entropy cost function over all functions $f : \mathcal{D}_x \to \Delta^{n-1}$, where $\Delta^{n-1}$ is the standard probability simplex (see definition below). If the absolute minimum is attained, it is true that $f_y(x) = P(y \mid x)$. However, this condition is rarely satisfied, since $\mathcal{D}_x$ may be a very large space (for instance a set of images, of very high dimension) and the task of finding the absolute (or even a local) minimum of the loss is difficult: it requires the network to have sufficient capacity, and also that the network manages to find the optimal value through training. Moreover, if calibration required attaining the absolute minimum, two networks that reach different minima of the loss function could not both be calibrated. However, the requirement that a network be calibrated is separate from that of finding the optimal classifier.
In this paper, it is shown that a far less stringent condition is sufficient for the network to be calibrated; we say that the network is optimal with respect to calibration, or calibration-optimal, provided no adjustment of the output of the network in the output space can improve the calibration (see the formal definition of calibration-optimality). This is a far simpler problem, since it only requires that a function between far lower-dimensional spaces be optimal.
Optimality with respect to calibration can be achieved in either of two ways:

1. By adding extra layers at the end of the network and training them post hoc on a holdout calibration set to minimize the cross-entropy cost function. The extra layers (which we call $g$-layers) can be added before or after the usual softmax layer. Since the output space of the network is of small dimension (compared to the input of the whole network), optimizing the loss by training the $g$-layers is a far easier task.
2. By training the network with the small $g$-layers in place from the beginning, on the training set. Since the $g$-layers are small, and near the end of the network, the expectation is that they will be trained far more easily and more quickly than the rest of the network. A final training of the $g$-layers alone on a small holdout calibration set should also be applied.
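The first, post-hoc option can be illustrated with a minimal sketch. It is not the architecture used in the paper: it assumes, purely for illustration, that the extra layers reduce to a single affine map $W z + b$ placed before the softmax, initialized to the identity and trained by full-batch gradient descent on synthetic, overconfident logits. The names `g_layer`, `fit_g_layer` and `make_logits`, the data model, the learning rate and the step count are all hypothetical choices.

```python
import math
import random


def log_softmax(z):
    """Numerically stable log of the softmax of z."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def g_layer(W, b, z):
    """Hypothetical affine extra layer applied to the logits: W z + b."""
    n = len(b)
    return [sum(W[i][j] * z[j] for j in range(len(z))) + b[i] for i in range(n)]


def nll(W, b, logits, labels):
    """Empirical cross-entropy of softmax(g(z)) on (logits, labels)."""
    total = 0.0
    for z, y in zip(logits, labels):
        total -= log_softmax(g_layer(W, b, z))[y]
    return total / len(labels)


def fit_g_layer(logits, labels, n, lr=0.02, steps=200):
    """Train W and b on a calibration set by full-batch gradient descent."""
    W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    b = [0.0] * n
    count = len(labels)
    for _ in range(steps):
        grad_W = [[0.0] * n for _ in range(n)]
        grad_b = [0.0] * n
        for z, y in zip(logits, labels):
            p = [math.exp(v) for v in log_softmax(g_layer(W, b, z))]
            for i in range(n):
                # Gradient of the loss w.r.t. the i-th output of the affine map.
                d = (p[i] - (1.0 if i == y else 0.0)) / count
                grad_b[i] += d
                for j in range(n):
                    grad_W[i][j] += d * z[j]
        for i in range(n):
            b[i] -= lr * grad_b[i]
            for j in range(n):
                W[i][j] -= lr * grad_W[i][j]
    return W, b


def make_logits(num, n, rng, scale=3.0):
    """Synthetic data (assumption): unit-variance Gaussian class scores with
    mean 2 on the true class, multiplied by `scale`. Under this model the
    calibrated logits would be (2 / scale) * z, so scale=3 is overconfident."""
    logits, labels = [], []
    for _ in range(num):
        y = rng.randrange(n)
        z = [scale * ((2.0 if k == y else 0.0) + rng.gauss(0.0, 1.0)) for k in range(n)]
        logits.append(z)
        labels.append(y)
    return logits, labels


rng = random.Random(0)
n_classes = 3
cal_logits, cal_labels = make_logits(200, n_classes, rng)    # holdout calibration set
test_logits, test_labels = make_logits(200, n_classes, rng)  # unseen test set

identity = [[1.0 if i == j else 0.0 for j in range(n_classes)] for i in range(n_classes)]
zero = [0.0] * n_classes
W, b = fit_g_layer(cal_logits, cal_labels, n_classes)

nll_cal_before = nll(identity, zero, cal_logits, cal_labels)
nll_cal_after = nll(W, b, cal_logits, cal_labels)
print("calibration-set loss:", nll_cal_before, "->", nll_cal_after)
print("test-set loss:", nll(identity, zero, test_logits, test_labels),
      "->", nll(W, b, test_logits, test_labels))
```

Because only the $n \times n$ matrix and the bias are trained, the optimization takes place entirely in the low-dimensional output space, which is the point made in the list above. Unlike a single temperature parameter, an affine map can change the predicted class, so accuracy should be checked separately.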
We conduct experiments on various image classification datasets by learning a small fully connected network for the $g$-layers on a holdout calibration set and evaluating on an unseen test set. Our experiments confirm the theory: if the calibration set and the test set are statistically similar, our method outperforms existing post-hoc calibration methods while retaining the original accuracy.
2 Preliminaries
We consider a pair of joint random variables, $(X, Y)$. Random variable $X$ takes values in some domain $\mathcal{D}_x$, for instance a set of images, and $Y$ takes values in a finite class set $\mathcal{Y} = \{1, 2, \ldots, n\}$. The variable $n$ will always refer to the number of labels, and $y$ denotes an element of the class set.

We shall be concerned with a (measurable) function $f : \mathcal{D}_x \to \mathcal{D}_z$, and the random variable $Z$ defined by $Z = f(X)$. Note that $\mathcal{D}_z$ is the same as $\mathbb{R}^m$, but we shall usually use the notation $\mathcal{D}_z$ to remind us that it is the range of the function $f$. The distribution of the random variable $X$ induces the distribution of the random variable $Z = f(X)$. The symbol $z$ will always represent $f(x)$, where $x$ is a value of the random variable $X$. The notation $x \sim X$ means that $x$ is a value sampled from the random variable $X$. The situation we have in mind is that $f$ is the function implemented by a (convolutional) neural network. A notation $P(\cdot)$ (with upper-case $P$) always refers to probability, whereas a lower-case $p$ represents a probability distribution. We use the notation $P(y \mid z)$ for brevity to mean $P(Y = y \mid Z = z)$.

A common way of doing classification, given $n$ classes, is to terminate the neural net with a layer represented by a function $\sigma : \mathcal{D}_z \to \mathbb{R}^n$ (where typically $m = n$, but this is not required), taking values in $\mathbb{R}^n$, and satisfying $\sigma_i(z) > 0$ and $\sum_{i=1}^n \sigma_i(z) = 1$. The set of vectors satisfying these conditions is called the standard probability simplex, $\Delta^{n-1}$, or simply the standard (open) simplex. This is an $(n-1)$-dimensional subset of $\mathbb{R}^n$. An example of such a function $\sigma$ is the softmax function, defined by $\sigma_i(z) = \exp(z_i) / \sum_{j=1}^n \exp(z_j)$.
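As a small illustration of the softmax activation just described, the following sketch maps a vector of real scores to a vector with positive entries that sum to one. The function name `softmax` and the example scores are illustrative assumptions; subtracting the maximum before exponentiating is a standard numerical-stability device that does not change the result.

```python
import math


def softmax(z):
    """Map a list of real scores to a point of the probability simplex."""
    m = max(z)  # shifting by the maximum leaves the output unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, -1.0])
shifted = softmax([102.0, 101.0, 99.0])  # same scores plus a constant
print(probs)
```

Adding a constant to every score leaves the output unchanged, which is why the output has only $n - 1$ degrees of freedom.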
Thus, the function implemented by a neural net is $\sigma \circ f$, where $f : \mathcal{D}_x \to \mathcal{D}_z$ and $\sigma : \mathcal{D}_z \to \Delta^{n-1}$. The function $\sigma$ will be called the activation in this paper. The notation $\sigma \circ f$ represents the composition of the two functions $f$ and $\sigma$. One is tempted to declare (or hope) that $\sigma_y(z) = P(y \mid z)$, in other words that the neural network outputs the correct conditional class probabilities given the network output. At the least, it is assumed that the most probable class assignment is equal to $\arg\max_y \sigma_y(z)$. We will investigate how justified these assumptions are. Clearly, since $f$ can be any function, this is not going to be true in general.
2.1 Loss
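Given a pair $(x, y)$, the negative-log loss is given by $L(x, y) = -\log \sigma_y(f(x))$. The expected loss over the distribution given by the random variables $(X, Y)$ is

$$L(\sigma \circ f, X, Y) = -\mathbb{E}_{(x, y) \sim (X, Y)} \log \sigma_y(f(x)). \tag{1}$$

We cannot know the complete distribution of the random variables $(X, Y)$ in a real situation. However, if the distribution is represented by data pairs $\{(x_i, y_i)\}_{i=1}^N$ sampled from the distribution of $(X, Y)$, then the expected loss is approximated by the empirical loss

$$L(\sigma \circ f, X, Y) \approx -\frac{1}{N} \sum_{i=1}^N \log \sigma_{y_i}(f(x_i)). \tag{2}$$

The training process of the neural network is intended to find the function $f^*$ that minimizes the loss. Thus $f^* = \arg\min_f L(\sigma \circ f, X, Y)$.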
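The empirical loss (2) is simply the average negative log-probability that the network assigns to the true labels. A minimal sketch follows, assuming the network outputs are already available as probability vectors and that labels are 0-based indices (a convention of this example only); the name `empirical_nll` and the sample values are illustrative.

```python
import math


def empirical_nll(probs, labels):
    """Average of -log(probability assigned to the true class)."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(p[y])  # y is a 0-based class index in this sketch
    return total / len(labels)


# Two samples, two classes: the true classes receive 0.5 and 0.75.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = empirical_nll(probs, labels)
print(loss)
```

Here the loss equals $-(\log 0.5 + \log 0.75)/2$; it is zero only when every true class is assigned probability one.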
3 Calibration
According to the theory (see bishop1994mixture), if the trained network is $f^* = \arg\min_f L(\sigma \circ f, X, Y)$, where $L$ is the loss given in (1), then the network (function $\sigma \circ f^*$) is calibrated, in the sense that $\sigma_y(f^*(x)) = P(y \mid x)$, as stated in the following theorem.

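Theorem. Consider joint random variables $(X, Y)$, taking values in $\mathcal{D}_x$ and $\mathcal{Y}$ respectively, where $\mathcal{D}_x$ is some Cartesian space. Let $f : \mathcal{D}_x \to \Delta^{n-1}$ be a function. Define the loss $L(f, X, Y) = -\mathbb{E}_{(x, y) \sim (X, Y)} \log f_y(x)$. If $f = \arg\min_{\hat{f} : \mathcal{D}_x \to \Delta^{n-1}} L(\hat{f}, X, Y)$, then $P(y \mid x) = f_y(x)$.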
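A brief sketch of the standard argument behind this statement may be helpful. It is the usual Gibbs-inequality argument and is not necessarily the proof given in the supplementary material. The notation is assumed here: $f_y(x)$ is the $y$-th output of the classifier, $P(y \mid x)$ is the true conditional class probability, $H$ is the entropy and $\mathrm{KL}$ is the Kullback-Leibler divergence. The sketch also assumes that the minimization ranges over all measurable functions into the simplex, so that the loss can be minimized separately for each $x$.

```latex
\begin{align*}
L(f, X, Y)
  &= -\mathbb{E}_{x \sim X} \sum_{y \in \mathcal{Y}} P(y \mid x) \log f_y(x) \\
  &= \mathbb{E}_{x \sim X} \Big[ H\big(P(\cdot \mid x)\big)
     + \mathrm{KL}\big(P(\cdot \mid x) \,\big\|\, f(x)\big) \Big].
\end{align*}
```

The entropy term does not depend on $f$, and the divergence term is non-negative and vanishes only when $f(x) = P(\cdot \mid x)$. The loss is therefore minimized exactly when $f_y(x) = P(y \mid x)$ for almost every $x$.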
This theorem is a simple corollary of the following slight generalization, which is proved in the supplementary material. (The theorem is stated in terms of a function , rather than for convenience later.)