Post-hoc Calibration of Neural Networks

by Amir Rahimi et al.

Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems, where the confidence of a decision is as important as the decision itself. In recent years there has been a surge of research on neural network calibration, and the majority of the works can be categorized as post-hoc calibration methods, defined as methods that learn an additional function to calibrate an already trained base network. In this work, we intend to understand post-hoc calibration methods from a theoretical point of view. In particular, it is known that minimizing the Negative Log-Likelihood (NLL) will lead to a calibrated network on the training set if the global optimum is attained (Bishop, 1994). Nevertheless, it is not clear whether learning an additional function in a post-hoc manner leads to calibration in this theoretical sense. To this end, we prove that even if the base network (f) does not attain the global optimum of the NLL, by adding additional layers (g) and minimizing the NLL over the parameters of g alone, one can obtain a calibrated network g ∘ f. This not only provides a less stringent condition for obtaining a calibrated network but also offers a theoretical justification of post-hoc calibration methods. Our experiments on various image classification benchmarks confirm the theory.








1 Introduction

In this paper we consider the problem of calibration of neural networks, or classification functions in general. This problem has been considered in the context of Support Vector Machines, but has more recently been considered in the context of Convolutional Neural Networks (CNNs) guo2017calibration. In this case, a CNN used for classification takes an input $x$, belonging to one of $K$ classes, and outputs a vector $f(x)$ in $[0,1]^K$, where the $y$-th component, $f_y(x)$, is often interpreted as a probability that input $x$ belongs to class $y$. If this value is to represent probabilities accurately, then we require that

$$P(y \mid f(x)) = f_y(x).$$

In this case, the classifier $f$ is said to be calibrated, or multi-class calibrated.¹

¹ In many papers, e.g. kull2019beyond, and calibration metrics, e.g. ECE naeini2015obtaining, a slightly different condition known as classwise-calibration is preferred: $P(y \mid f_y(x)) = f_y(x)$.
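As an illustration, the calibration condition can be checked empirically by binning predicted confidences and comparing them with observed accuracies, as in the ECE metric (naeini2015obtaining) mentioned above. A minimal numpy sketch (the binning scheme and function names are our own, not from the paper):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: average |accuracy - confidence| over confidence bins,
    weighted by the fraction of samples falling in each bin."""
    conf = probs.max(axis=1)            # top-class confidence
    pred = probs.argmax(axis=1)         # predicted class
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# A fully confident, always-correct classifier has ECE 0.
probs = np.eye(3)[[0, 1, 2, 0]]         # one-hot predictions
labels = np.array([0, 1, 2, 0])
print(expected_calibration_error(probs, labels))  # → 0.0
```

Note this measures only top-label calibration; the multi-class condition above is strictly stronger.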

A well-known condition (bishop1994mixture) for a classifier to be calibrated is that it minimizes the cross-entropy cost function over all functions $f: \mathcal{D} \to \Delta^{K-1}$, where $\Delta^{K-1}$ is the standard probability simplex (see definition below). If the absolute minimum is attained, it is true that $f_y(x) = P(y \mid x)$. However, this condition is rarely satisfied, since $\mathcal{D}$ may be a very large space (for instance a set of images, of very high dimension) and the task of finding the absolute (or even a local) minimum of the loss is difficult: it requires the network to have sufficient capacity, and also that the network manages to find the optimal value through training. According to this requirement, two networks that reach different minima of the loss function cannot both be calibrated. However, the requirement that a network be calibrated is separate from that of finding the optimal classifier.

In this paper, it is shown that a far less stringent condition is sufficient for the network to be calibrated; we say that the network is optimal with respect to calibration, or calibration-optimal, provided no adjustment of the output of the network in the output space can improve the calibration (see definition LABEL:def:calibration-optimal). This is a much simpler problem, since it only requires a function between much smaller-dimensional spaces to be optimal.

Optimality with respect to calibration can be achieved in either of two ways:

  1. by addition of extra layers at the end of the network and post-hoc training on a hold-out calibration set to minimize the cross-entropy cost function. The extra layers (which we call $g$-layers) can be added before or after the usual softmax layer. Since the output space of the network is of small dimension (compared to the input of the whole network), optimization of the loss by training the $g$-layers is a far easier task;

  2. by training the network with the small $g$-layers in place from the beginning, on the training set. Since the $g$-layers are small, and near the end of the network, the expectation is that they will become trained far more easily and more quickly than the rest of the network. A final training of the $g$-layers alone on a small hold-out calibration set should also be applied.
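To make the first option concrete, here is a minimal sketch in which $g$ is a single temperature parameter, $g(z) = z / T$ (the simplest possible post-hoc layer, as in temperature scaling, guo2017calibration), fitted by minimizing NLL on hold-out logits. This is our own illustration, not the paper's $g$-layer architecture; a numerical gradient keeps it dependency-free:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def fit_temperature(logits, labels, lr=0.01, steps=500):
    """Fit g(z) = z / T on a hold-out set by gradient descent on NLL,
    using a central-difference numerical gradient for simplicity."""
    T, eps = 1.0, 1e-4
    for _ in range(steps):
        grad = (nll(logits / (T + eps), labels) -
                nll(logits / (T - eps), labels)) / (2 * eps)
        T -= lr * grad
    return T

# Artificially overconfident logits (scaled up 3x) should yield T > 1.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
clean = rng.normal(size=(500, 3))
clean[np.arange(500), labels] += 1.0    # make the true class likelier
T = fit_temperature(3.0 * clean, labels)
print(T > 1.0)  # → True
```

Dividing logits by $T$ leaves the argmax, and hence the accuracy, unchanged, which is why such post-hoc layers can calibrate without sacrificing the base network's predictions.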

We conduct experiments on various image classification datasets by learning a small fully-connected network for the $g$-layers on a hold-out calibration set and evaluating on an unseen test set. Our experiments confirm the theory: if the calibration set and the test set are statistically similar, our method outperforms existing post-hoc calibration methods while retaining the original accuracy.

2 Preliminaries

We consider a pair of joint random variables, $(X, Y)$. Random variable $X$ takes values in some domain $\mathcal{D}$, for instance a set of images, and $Y$ takes values in a finite class set $\mathcal{Y} = \{1, \ldots, K\}$. The symbol $K$ will always refer to the number of labels, and $y$ denotes an element of the class set.

We shall be concerned with a (measurable) function $f: \mathcal{D} \to \mathcal{Z}$, and the random variable $Z$ defined by $Z = f(X)$. Note that $\mathcal{Z}$ is the same as $\mathbb{R}^n$, but we shall usually use the notation $\mathcal{Z}$ to remind us that it is the range of function $f$. The distribution of the random variable $X$ induces the distribution of the random variable $Z$. The symbol $z$ will always represent $f(x)$, where $x$ is a value of random variable $X$. The notation $x \sim X$ means that $x$ is a value sampled from the random variable $X$. The situation we have in mind is that $f$ is the function implemented by a (convolutional) neural network. The symbol $P$ (upper case) always refers to probability, whereas a lower-case $p$ represents a probability distribution. We use the notation $P(y \mid x)$ for brevity to mean $P(Y = y \mid X = x)$.

A common way of doing classification, given $K$ classes, is that the neural net is terminated with a layer represented by a function $\sigma: \mathbb{R}^n \to \Delta^{K-1}$ (where typically $n = K$, but this is not required), taking values in $\mathbb{R}^K$, and satisfying $\sigma_k(z) > 0$ for all $k$ and $\sum_{k=1}^{K} \sigma_k(z) = 1$. The set of vectors satisfying these conditions is called the standard probability simplex, $\Delta^{K-1}$, or simply the standard (open) simplex. This is a $(K-1)$-dimensional subset of $\mathbb{R}^K$. An example of such a function is the softmax function defined by $\mathrm{softmax}_k(z) = e^{z_k} / \sum_{j=1}^{n} e^{z_j}$.
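The two simplex conditions (strictly positive components summing to one) can be verified directly for softmax; a short numpy check (our own illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to all logits.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
p = softmax(z)
print(np.isclose(p.sum(), 1.0))  # → True: p lies on the simplex
print((p > 0).all())             # → True: all components strictly positive
```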

Thus, the function implemented by a neural net is $\sigma \circ f$, where $f: \mathcal{D} \to \mathbb{R}^n$ and $\sigma: \mathbb{R}^n \to \Delta^{K-1}$. The function $\sigma$ will be called the activation in this paper. The notation $\sigma \circ f$ represents the composition of the two functions $f$ and $\sigma$. One is tempted to declare (or hope) that $\sigma_y(f(x)) = P(y \mid x)$, in other words that the neural network outputs the correct conditional class probabilities given the input. At least it is assumed that the most probable class assignment $\operatorname{argmax}_y \sigma_y(f(x))$ is equal to $\operatorname{argmax}_y P(y \mid x)$. It will be investigated how justified these assumptions are. Clearly, since $f$ can be any function, this is not going to be true in general.

2.1 Loss

Given a pair $(x, y)$, the negative-log loss is given by $L(f(x), y) = -\log \sigma_y(f(x))$. The expected loss over the distribution given by the random variables $(X, Y)$ is

$$L(f) = \mathbb{E}_{(x, y) \sim (X, Y)} \left[ -\log \sigma_y(f(x)) \right]. \tag{1}$$

We cannot know the complete distribution of the random variables in a real situation; however, if the distributions are represented by data pairs $\{(x_i, y_i)\}_{i=1}^{N}$ sampled from the distribution of $(X, Y)$, then the expected loss is approximated by the empirical loss

$$\hat{L}(f) = -\frac{1}{N} \sum_{i=1}^{N} \log \sigma_{y_i}(f(x_i)).$$

The training process of the neural network is intended to find the function that minimizes the loss. Thus $f^* = \operatorname{argmin}_f \hat{L}(f)$.
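The empirical loss translates directly into code; a short numpy sketch (function name and example values are our own):

```python
import numpy as np

def empirical_nll(logits, labels):
    """Empirical negative log-likelihood:
    -(1/N) * sum_i log softmax(f(x_i))[y_i], computed stably in log space."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Logits that strongly favor the true class give a small loss.
logits = np.array([[5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])
labels = np.array([0, 1])
print(empirical_nll(logits, labels) < 0.05)  # → True
```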

3 Calibration

According to the theory (see bishop1994mixture), if the trained network is $f^* = \operatorname{argmin}_f L(f)$, where $L$ is the loss given in (1), then the network (function $f^*$) is calibrated, in the sense that $\sigma_y(f^*(x)) = P(y \mid x)$, as stated in the following theorem.

  • Consider joint random variables $(X, Y)$, taking values in $\mathcal{D}$ and $\mathcal{Y} = \{1, \ldots, K\}$ respectively, where $\mathcal{D}$ is some Cartesian space. Let $f: \mathcal{D} \to \mathbb{R}^K$ be a function. Define the loss $L(f) = \mathbb{E}_{(x, y) \sim (X, Y)} \left[ -\log \sigma_y(f(x)) \right]$. If $f^* = \operatorname{argmin}_f L(f)$, then $\sigma_y(f^*(x)) = P(y \mid x)$.

    This theorem is a simple corollary of the following slight generalization, which is proved in the supplementary material. (The theorem is stated in terms of a function $g$, rather than $f$, for convenience later.)
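For intuition, the classical argument behind this result (bishop1994mixture) can be sketched as follows; this is our paraphrase, not the paper's supplementary proof. Conditioning on $x$, the expected loss decomposes pointwise, so it suffices to minimize over the predicted distribution $q \in \Delta^{K-1}$ at each $x$:

```latex
J(q) = -\sum_{y=1}^{K} P(y \mid x)\,\log q_y .
\qquad
\frac{\partial}{\partial q_y}\Big[ J(q) + \lambda \Big( \sum_{k=1}^{K} q_k - 1 \Big) \Big]
  = -\frac{P(y \mid x)}{q_y} + \lambda = 0
  \;\Rightarrow\; q_y \propto P(y \mid x).
```

Normalizing the multiplier $\lambda$ so that the components sum to one gives $q_y^* = P(y \mid x)$; hence the global minimizer of the NLL outputs the true conditional class probabilities, i.e., it is calibrated.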