
Heterogeneous Calibration: A post-hoc model-agnostic framework for improved generalization

by David Durfee, et al.

We introduce the notion of heterogeneous calibration that applies a post-hoc model-agnostic transformation to model outputs for improving AUC performance on binary classification tasks. We consider overconfident models, whose performance is significantly better on training vs test data, and give intuition as to why they might under-utilize moderately effective simple patterns in the data. We refer to these simple patterns as heterogeneous partitions of the feature space and show theoretically that perfectly calibrating each partition separately optimizes AUC. This gives a general paradigm of heterogeneous calibration as a post-hoc procedure by which heterogeneous partitions of the feature space are identified through tree-based algorithms and post-hoc calibration techniques are applied to each partition to improve AUC. While the theoretical optimality of this framework holds for any model, we focus on deep neural networks (DNNs) and test the simplest instantiation of this paradigm on a variety of open-source datasets. Experiments demonstrate the effectiveness of this framework and the future potential for applying higher-performing partitioning schemes along with more effective calibration techniques.





1 Introduction

Deep neural networks (DNNs) have become ubiquitous in decision-making pipelines across industries due to an extensive line of work on improving accuracy, and are particularly applicable to settings where massive datasets are common He et al. (2016); Devlin et al. (2019); Vaswani et al. (2017); Naumov et al. (2019). The large number of parameters in DNNs affords greater flexibility in the modeling process for improving generalization performance and has recently been shown to be necessary for smoothly interpolating the data Bubeck and Sellke (2021).

However, this over-parameterization (where the number of parameters exceeds the number of training samples), along with other factors, can lead to over-confidence, where model performance is substantially better on training data compared to test data. For classification tasks, over-confidence is more specifically characterized by the model's output probability for the predicted class being generally higher than the true probability. Guo et al. (2017) found that over-confidence increased with model depth and width, even when accuracy improves. Additional recent work proves that over-confidence is also inherent for under-parameterized logistic regression Bai et al. (2021). Accordingly, there is extensive work in the area of calibration, whose primary goal is to improve the accuracy of probability estimates, which is essential for many use cases.

Some common calibration techniques that apply a post-hoc model-agnostic transformation to properly adjust the model output include Platt scaling Platt and others (1999), isotonic regression Zadrozny and Elkan (2002), histogram binning Zadrozny and Elkan (2001), Bayesian binning into quantiles Naeini et al. (2015), scaling-binning Kumar et al. (2019), and Dirichlet calibration Kull et al. (2019). There is also work on calibration through ensemble-type methods Lakshminarayanan et al. (2017); Gal and Ghahramani (2016) and recent work on using focal loss to train models that are already well-calibrated while maintaining accuracy Mukhoti et al. (2020). For a more in-depth exposition of different calibration methods, please see Bai et al. (2021).

In this paper, we deviate from this traditional aim of calibration. Instead of trying to improve the accuracy of probability estimates, we aim to improve model generalization. Our key insight is that over-confident models not only show mis-calibration but also tend to under-utilize heterogeneity in the data in a specific and intuitive manner. We develop this intuition through concrete examples, and focus on mitigating this under-utilization of data heterogeneity through efficient post-hoc calibration. We specifically aim to improve model accuracy, characterized through the area under the receiver operating characteristic (ROC) curve (commonly called AUC) and other metrics for binary classification.

Accordingly, we develop a new theoretical foundation for post-hoc calibration to improve AUC and prove that the transformation for optimizing output probability estimates will also optimize AUC. To the best of our knowledge, this is the first paper that provably shows how calibration techniques can improve model generalization. We further extend this theoretical optimality to separately calibrating different partitions of the heterogeneous feature space, and give concrete intuition for how partitioning through standard tree-based algorithms and separately calibrating the partitions will improve AUC. This gives a natural and rigorous connection between tree-based algorithms and DNNs through the use of standard calibration techniques, and provides an efficient post-hoc transformation framework to improve accuracy.

In order to best show that the underlying theory of our general framework holds in practice, we test the simplest instantiation, whereby a decision tree classifier identifies the partitioning and logistic regression is used for calibration. We test on open-source datasets and focus on tabular data due to its inherent heterogeneity, but also discuss how this can be extended to image classification or natural language processing tasks. Across the different datasets we see a notable increase in performance from our heterogeneous calibration technique on the top-performing DNN models. In addition, we see a much more substantial increase in performance, and more stable results, when considering the average-performing DNNs from hyper-parameter tuning. Our experiments also confirm the intuition that more over-confident models will see a greater increase in performance from our heterogeneous calibration framework.

We summarize our contributions as the following:

  1. We use concrete examples to give intuition on how over-confident models, particularly DNNs, under-utilize heterogeneous partitions of the feature space.

  2. We provide theoretical justification for correcting this under-utilization through standard calibration on each partition to maximize AUC.

  3. We leverage this intuition and theoretical optimality to introduce the general paradigm of heterogeneous calibration, which applies a post-hoc model-agnostic transformation to model outputs for improving AUC performance on binary classification tasks. This framework also easily generalizes to multi-class classification.

  4. We test the simplest instantiation of heterogeneous calibration on open-source datasets to show the effectiveness of the framework.

The rest of the paper is organized as follows. We begin with the detailed problem setup in Section 2. In Section 3 we give intuition as to how over-confident models tend to under-utilize heterogeneity. In Section 4 we give a provably optimal post-hoc transformation for mitigating under-utilized heterogeneity. In Section 5 we detail the framework of heterogeneous calibration. In Section 6 we give the experimental results before ending with a discussion in Section 7. All proofs are deferred to the Appendix.

2 Methodology

In practice, calibration is primarily used to get better probability estimates, which is especially useful for use cases where uncertainty estimation is important Bai et al. (2021). At first glance, the notion of over-confidence being corrected by post-hoc calibration should only contract the range of probability estimates but not affect the ordering, and thereby the AUC. Moreover, much of the literature attempts to maintain accuracy within the calibration. However, if we consider over-confidence at a more granular level, it can negatively affect the relative ordering between heterogeneous partitions. We primarily consider a heterogeneous partitioning to be a splitting of the feature space such that each partition has a disproportionately higher ratio of positive or negative labels for the binary classification setting. Intuitively, the more intricately and accurately a model has fit the data, the less it needs to utilize simpler patterns, such as heterogeneous partitions, but over-confident models will over-estimate their ability to fit the data and thus under-utilize simple patterns.

Ideally, the relative ordering of heterogeneous partitions could be corrected for over-confident models as a post-hoc procedure. One approach would be to add a separate bias term to the output of each partition, but this may not fully capture the extent to which the relative ordering can be improved. We give a more rigorous examination of the AUC metric, which measures the quality of our output ordering, and prove that perfectly calibrating the probability estimates will also optimize the AUC and several other accuracy metrics for the given model. Furthermore, we show that this extends to any partitioning of the feature space, such that perfectly calibrating each partition separately will maximally improve AUC and other related metrics. The concept of separately calibrating partitions of the feature space has also appeared in the fairness literature Hebert-Johnson et al. (2018), but there the partitions are predefined based upon fairness considerations and the considered metrics are geared toward ensuring fair models.

Combining our theoretical result with the intuition that over-confident models will improperly account for heterogeneous partitions gives a general framework of heterogeneous calibration as a post-hoc model-agnostic transformation that: (1) identifies heterogeneous partitions of the feature space through tree-based algorithms; (2) calibrates each partition separately using a known technique from the extensive line of calibration literature.

The heterogeneous partitioning can be done through a variety of tree-based methods, and we view this as a natural, efficient, and rigorous incorporation of tree-based techniques into DNNs through the use of calibration. In fact, our theoretical optimality results also imply that heterogeneous calibration gives the optimal ensemble of a separately trained DNN and decision tree, combining the strengths of each into one model to maximize AUC.

Additionally, the advantage of this post-hoc framework, as opposed to applying techniques that fix over-confidence during training itself, is that over-confident models are not inherently undesirable with respect to accuracy Guo et al. (2017). The flexibility of over-parameterization allows the model training to simultaneously learn generalizable patterns and also memorize small portions of the training data. Validation data is often used to identify the point at which increased memorization outweighs the additional generalization, but decoupling these prior to this point while still achieving a similar level of performance is incredibly challenging. The post-hoc nature of our framework then allows us to avoid this difficulty and enjoy the additional generalization from over-confident models while also correcting the under-utilization of simpler patterns in the data.

2.1 Notation

To more rigorously set up the problem, we let $\mathcal{D} = \mathcal{X} \times \mathcal{Y}$ be the data universe, and we consider the classical binary classification setting where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$, and $(x, y) \in \mathcal{D}$ is a feature vector and label from the data universe. Let $\mathcal{P}$ be the probability distribution over the data universe with density function $p$, where our data is random samples $(x_1, y_1), \dots, (x_n, y_n) \sim \mathcal{P}$. Let $\mathcal{P}_0$ and $\mathcal{P}_1$ be the probability distributions over $\mathcal{X}$ where we condition on the label being 0 and 1 respectively, which is to say that their respective density functions $p_0$ and $p_1$ are such that $p_0(x) \propto p(x, 0)$ and $p_1(x) \propto p(x, 1)$.


Let $f : \mathcal{X} \to \mathbb{R}$ be the score function of a binary classification model. We consider this to be the output of the final neuron of the DNN prior to applying the sigmoid function, but our theoretical results hold for any score function.

We will be considering splits of the feature space, where we let $\mathcal{S} = \{S_1, \dots, S_k\}$ be a partitioning of $\mathcal{X}$ such that each $S_i \subseteq \mathcal{X}$, they cover $\mathcal{X}$, which is to say $\bigcup_{i=1}^k S_i = \mathcal{X}$, and they are all disjoint, so for any $i \neq j$ we have $S_i \cap S_j = \emptyset$.

We will also refer to heterogeneous partitions in the feature space, by which we most often mean that either $\Pr[y = 1 \mid x \in S_i] \gg \Pr[y = 1]$ or $\Pr[y = 1 \mid x \in S_i] \ll \Pr[y = 1]$.

For a more rigorous definition of over-confidence we borrow the definitions of Bai et al. (2021), where the predicted probability for a given class is generally higher than the true probability. This also leads to the notion of a well-calibrated model, whereby the predicted probability matches the true conditional probability of the label, and we give a more rigorous definition in Section 11.1 for completeness. For the most part, we will be considering post-hoc calibration (which we often shorten to calibration) where a post-hoc transformation $h$ is applied to the classifier score function to achieve $h(f(x)) = \Pr[y = 1 \mid f(x)]$ for all $x$. Note that this cannot be equivalently defined as requiring $\Pr[y = 1 \mid h(f(x))] = h(f(x))$, because this could be perfectly achieved by setting $h(f(x)) = \Pr[y = 1]$ for all $x$, which loses all value of the classifier.

We focus our rigorous examination of accuracy on the area under the curve (AUC) metric, which we precisely define here. Generally, AUC is considered in terms of the receiver operating characteristic (ROC) curve, which is plotted based upon the True Positive Rate (TPR) vs False Positive Rate (FPR) at different thresholds. This definition is known to be equivalent to randomly drawing a positive- and a negative-labeled example and determining the probability that the model ranks the positive example higher. We also show this equivalence in the appendix for completeness.
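This pairwise view is easy to verify numerically. The sketch below (using synthetic Gaussian scores purely as an illustrative assumption) computes AUC both from the ROC curve via scikit-learn and from the pairwise-comparison definition, counting ties as 1/2:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic scores: positives shifted above negatives.
s_neg = rng.normal(0.0, 1.0, 2000)
s_pos = rng.normal(1.0, 1.0, 2000)
y = np.concatenate([np.zeros(2000), np.ones(2000)])
s = np.concatenate([s_neg, s_pos])

# ROC-based AUC.
auc_roc = roc_auc_score(y, s)

# Pairwise definition: P(score of random positive > score of random
# negative), counting ties as 1/2.
diff = s_pos[:, None] - s_neg[None, :]
auc_pair = (diff > 0).mean() + 0.5 * (diff == 0).mean()
```

The two quantities agree up to floating-point error, since the area under the empirical ROC curve equals the Mann-Whitney U statistic.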

Definition 2.1.

[AUC] For a given classifier score function $f$, along with distributions $\mathcal{P}_0$ and $\mathcal{P}_1$, AUC can be defined as
$$\mathrm{AUC}(f) = \Pr_{x \sim \mathcal{P}_1, \, x' \sim \mathcal{P}_0}\left[f(x) > f(x')\right].$$

Note that we will often omit the $\mathcal{P}_0$ and $\mathcal{P}_1$ terms for notational simplicity. We also give definitions of related metrics such as TPR, FPR, log-loss, Precision/Recall, and expected calibration error in the Appendix.

In this paper, we will first develop the intuition as to how over-confident models tend to under-utilize heterogeneous partitions of the feature space. Based on this intuition, the main focus of this paper is to develop a framework that can leverage this heterogeneity to improve model generalization. Specifically, given a heterogeneous partition $\mathcal{S}$ of the feature space, how can we transform the score function to optimize AUC for binary classification tasks?

3 Intuition for over-confident models under-utilizing heterogeneity

In this section we give intuition on why over-confidence due to over-parameterization can negatively impact model performance when there is heterogeneity in the data. Note that we will consider binary classification for ease of visualization, but the same ideas generalize to multi-class classification, where the output score is then a vector. We will set up this intuition by visualizing the distribution of scores for the positive and negative labels. First we will give an example of what these distributions might look like on training vs test data and how they often differ due to over-parameterization. Then we will consider independently adding a feature with heterogeneity and show how over-confidence leads to the model not properly accounting for that heterogeneity.

3.1 Over-confident model example

In order to visualize model performance it is common to look at the distributions of the score function with respect to the label. Specifically, we want to empirically plot the score densities conditioned on the positive and negative labels, which is often done by constructing a histogram of the scores with respect to their label. For our toy example, suppose our data is such that labels are balanced, so $\Pr[y = 0] = \Pr[y = 1] = 1/2$. Further, we will let $\mathcal{N}(\mu, \sigma^2)$ denote the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
Generally, the over-parameterization of neural networks leads to training data performing significantly better than test data because the model performs some memorization of the training data. Most often this memorization will occur on the harder-to-classify data points, which the model separates better than in the test data. Visually this tends to lead to a steeper decline in the respective score distributions for the training data on the harder-to-classify data points. Meanwhile, for the test data the score distributions are much more symmetric because the model has not performed nearly as well on the hard-to-classify data points, leading to more overlap. An example visualization of over-confidence is in Figure 1.



Figure 1: Training and test score distributions for over-confident models

In this example we assume that our classifier score function produces Gaussian score distributions for the negative and positive labels. Further, let the training data sample be split into its negative- and positive-labeled examples, with the corresponding training score distributions separated more sharply than the test distributions, as in Figure 1.

This type of over-confidence on training data tends to be the root cause of mis-calibration. The model often does inherently attempt to optimize calibration, for instance with a log-loss function, but it is doing so on the training data, where it is over-confident in how well it has separated positive and negative labels, and thus it scales up the scores substantially, pushing the associated probabilities closer to 0 or 1. In order to optimize the log-loss of the test data in our example, we would need to divide the score function by a factor of about 2, which would also give approximately optimal calibration. Note that the training data is also not optimized, as we assume some sort of regularization such as soft labels, because the log-loss would otherwise be optimized on the training data by scaling the score function up by a factor of about 2. Regardless of how much it is scaled up or down, the ordering of the score function, and all associated ordering metrics such as AUC or accuracy, will remain unchanged.
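This invariance is easy to check numerically. The sketch below uses synthetic Gaussian scores with parameters chosen, as an illustrative assumption, so that halving the score is approximately the well-calibrated transformation; down-scaling the over-confident score improves log-loss while leaving AUC unchanged:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(1)
n = 10000
y = rng.integers(0, 2, n)
# Over-confident scores: with means +/-2 and variance 8, the true
# log-odds is s / 2, but the model reports s, pushing probabilities
# toward 0 and 1.
s = np.where(y == 1, rng.normal(2.0, np.sqrt(8.0), n),
             rng.normal(-2.0, np.sqrt(8.0), n))

auc_raw = roc_auc_score(y, s)
auc_scaled = roc_auc_score(y, 0.5 * s)   # temperature-style down-scaling
ll_raw = log_loss(y, sigmoid(s))
ll_scaled = log_loss(y, sigmoid(0.5 * s))
```

Here `ll_scaled` comes out below `ll_raw`, while the two AUC values coincide, since halving the score is a monotone transformation.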

3.2 Under-utilized heterogeneity example

While the over-confidence in our example above only affects the output probability and not the ordering, this over-confidence can be detrimental to ordering if we add heterogeneity to the data set. Suppose we add a binary feature $z$ to our feature space that is uncorrelated with the other features but is well correlated with the label, so it is heterogeneous. Specifically, if we previously had feature space $\mathcal{X}$, then we now consider $\mathcal{X} \times \{0, 1\}$ with a distribution whose density marginalizes to the original density over $\mathcal{X}$. Further, we assume that $z$ is conditionally independent of the other features given the label, but it does predict the label well; in particular, $z = 1$ is substantially more likely for positively labeled examples and $z = 0$ is substantially more likely for negatively labeled examples.

Assume that we use the same training and test data but with this new heterogeneous feature added to the dataset. Due to the new feature being conditionally independent of the other features, it is reasonable to assume that the score function the model learns on the training set would (at least approximately) be $f(x) + bz + c$ for some optimized $b$ and $c$.

The choice of $b$ determines the relative ordering of the score function when $z = 1$ vs when $z = 0$, and so the extent to which the model utilizes the heterogeneity of this binary feature ($c$ then simply re-centers the score function appropriately). The better the model is performing, the less it will need to use this additional heterogeneity to improve its prediction. It is then important to note that $b$ and $c$ are optimized on the training data, where the model is over-confident in its performance, and as such it will not set $b$ nearly as high as it should for the true distribution.

In particular, for the training data the model will set $b$ and $c$ to optimize cross-entropy, which also maximizes AUC on the training data, and on the true data distribution this gives an AUC of about 0.83. Due to the over-confidence on the training data, the model actually set $b$ lower than it should have; had it instead set $b$ optimally for the true distribution, we could have increased the AUC to about 0.85 on the true data distribution and also improved the log-loss along with other accuracy metrics.
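The effect of how strongly the model utilizes the binary feature can be simulated directly. The sketch below sweeps a coefficient $b$ on the feature under hypothetical parameters (a unit-variance base score carrying log-odds $s$, and $\Pr[z = 1 \mid y = 1] = 0.7$, $\Pr[z = 1 \mid y = 0] = 0.3$; these are illustrative assumptions, not the exact values of the paper's example) and shows AUC on the true distribution increasing as $b$ grows toward its calibrated value:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 40000
y = rng.integers(0, 2, n)
# Base score from the original features: carries log-odds s.
s = np.where(y == 1, rng.normal(0.5, 1.0, n), rng.normal(-0.5, 1.0, n))
# Heterogeneous binary feature, conditionally independent of s given y.
z = np.where(y == 1, rng.random(n) < 0.7, rng.random(n) < 0.3).astype(float)

# Sweep the utilization coefficient b; the calibrated value here is
# b* = 2 * log(0.7 / 0.3), roughly 1.69.
aucs = {b: roc_auc_score(y, s + b * z) for b in (0.0, 0.5, 1.0, 1.5)}
```

An under-confident choice such as `b = 0.5` already beats ignoring the feature, but AUC keeps improving as `b` approaches the calibrated value, mirroring the gap described above.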


Figure 2: Comparison of AUC with different choices of $b$

3.3 General Discussion

Our example above illustrates the more general concept of how neural networks can under-utilize simple patterns in the data because they are over-confident in their ability to fit the data. This point is generally understood as a potential pitfall of neural networks, but we focus specifically on how they fail to appropriately utilize heterogeneity.

In particular, the bias term in the output layer can be viewed as a centering term for the score function to optimally account for the balance of positive vs negative labels. This bias term will not affect the overall ordering, but the neural network can also make these centering decisions at a more fine-grained level, where in our example above we considered simply splitting the data once. Especially if the internal nodes use a ReLU activation function, it would be quite simple for a neural network to construct internal variables that represent simple partitions of the data, reminiscent of partitions that are similarly defined by decision trees. This could then lead to relative orderings between partitions that are inappropriate because the model centered the partitions according to the training data, on which it was over-confident.

In our example we assumed that the new feature was conditionally independent and thus the appropriate fix was simply shifting each partition. With more intricate dependence we would expect the score distributions on each side of the split to differ more significantly than being identical up to a bias term. Therefore ordering different partitions correctly with respect to the others will be a more complex task. In Section 4 we show that the optimal way of ordering these partitions relative to each other is actually equivalent to optimally calibrating each partition.

4 Calibration of partitions to optimize AUC

In Section 3 we provided intuition regarding over-confident models under-utilizing heterogeneity. In this section we assume that such a heterogeneous partitioning has been identified and provide the theoretical framework for optimally applying a post-hoc transformation to maximize AUC.

4.1 Optimal AUC calibration

We first consider applying a post-hoc transformation to the classifier score function, in the same way as standard calibration, and define the corresponding AUC measurement.

Definition 4.1.

[Calibrated AUC] For a given classifier score function $f$, along with distributions $\mathcal{P}_0$ and $\mathcal{P}_1$, and a transformation function $h$, we define calibrated AUC as
$$\mathrm{cAUC}(f, h) = \Pr_{x \sim \mathcal{P}_1, \, x' \sim \mathcal{P}_0}\left[h(f(x)) > h(f(x'))\right].$$

Note that when $h$ is the identity function or any isotonic function, this is equivalent to standard AUC. Further note that this is equivalent to $\mathrm{AUC}(h \circ f)$, but this notation will be easier to work with in our proofs, for which we give intuition here and prove claims in the appendix.

It is then natural to consider the optimal transformation function to maximize AUC conditioned on our classifier score function and data distribution.

Lemma 4.1 (Informal).

Given a classifier score function $f$ and any distributions $\mathcal{P}_0, \mathcal{P}_1$, the calibrated AUC is maximized with respect to $h$ by using the likelihood ratio $h(s) = p_1^f(s) / p_0^f(s)$ as our transformation function, where $p_0^f$ and $p_1^f$ denote the densities of the score $f(x)$ under $\mathcal{P}_0$ and $\mathcal{P}_1$ respectively.

For the purposes of maximizing AUC only the ordering imposed by is relevant, and intuitively the likelihood ratio will give the highest ordering to outputs that maximize True Positive Rate (TPR) and minimize False Positive Rate (FPR) thereby maximizing AUC. Furthermore, we also show that for any FPR the corresponding TPR is maximized by the likelihood ratio transformation, which implies that the ROC curve of any other transformation is contained within the ROC curve of the likelihood ratio transformation. As a corollary this implies that for any Recall the corresponding Precision is maximized and furthermore the PR-AUC is maximized by the likelihood ratio transformation. While these claims are intuitively reasonable, they will require more involved proofs in the appendix.
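As a concrete numerical check of this claim, the sketch below constructs score distributions with unequal variances (an illustrative assumption chosen so that the likelihood ratio is deliberately non-monotone in the raw score) and verifies that the likelihood-ratio transformation increases AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
n = 20000
# Unequal variances make the likelihood ratio U-shaped rather than
# monotone: very negative raw scores are actually evidence for y = 1.
s0 = rng.normal(0.0, 0.5, n)   # scores under P_0
s1 = rng.normal(1.0, 2.0, n)   # scores under P_1
s = np.concatenate([s0, s1])
y = np.concatenate([np.zeros(n), np.ones(n)])

lr = gauss_pdf(s, 1.0, 2.0) / gauss_pdf(s, 0.0, 0.5)  # likelihood ratio

auc_raw = roc_auc_score(y, s)
auc_lr = roc_auc_score(y, lr)
```

Any isotonic transformation would leave AUC unchanged; the gain here comes entirely from re-ordering the region where the likelihood ratio disagrees with the raw score.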

We then show that the ordering from this likelihood ratio is equivalent to the ordering from the optimal calibration, which by definition sets $h(f(x)) = \Pr[y = 1 \mid f(x)]$.

Lemma 4.2 (Informal).

The likelihood ratio and optimal calibration give an equivalent ordering. For any scores $s, s'$, we have $p_1^f(s) / p_0^f(s) > p_1^f(s') / p_0^f(s')$ if and only if $\Pr[y = 1 \mid f(x) = s] > \Pr[y = 1 \mid f(x) = s']$.

Due to the fact that AUC is invariant under equivalent orderings, calibration on the full dataset will also optimize AUC and other associated metrics. This connection allows us to simply apply standard techniques from the calibration literature to optimize AUC. However, we expect this effect to be minimal even when the model is over-confident, because such calibration mostly corrects the over-confident probability estimates without changing the ordering.

4.2 Optimal partitioned AUC calibration

While calibration on the full dataset may not generally affect ordering and thus AUC, recall that Section 3 identified the issue of over-confidence negatively affecting the relative ordering between heterogeneous partitions of the data. In order to re-order these partitions appropriately, we then want to extend our optimal post-hoc transformation separately to each partition such that it provably maximizes overall AUC.

Definition 4.2.

[Partition Calibrated AUC] For a given classifier score function $f$, distributions $\mathcal{P}_0, \mathcal{P}_1$, and a partition $\mathcal{S} = \{S_1, \dots, S_k\}$ of $\mathcal{X}$, along with transformation functions $h_1, \dots, h_k$, we define partition calibrated AUC as
$$\Pr_{x \sim \mathcal{P}_1, \, x' \sim \mathcal{P}_0}\left[h_{i(x)}(f(x)) > h_{i(x')}(f(x'))\right],$$
where $i(x)$ denotes the index of the partition containing $x$.

Once again, if every $h_i$ is the identity (or the same isotonic function), then this is equivalent to standard AUC. Furthermore, we could have equivalently defined this as the AUC of the piecewise transformation $x \mapsto h_{i(x)}(f(x))$, but this form will be easier to work with in our proofs. For this definition we will also show that AUC is maximized by using the likelihood ratio.

Lemma 4.3 (Informal).

Given a classifier score function $f$, distributions $\mathcal{P}_0$ and $\mathcal{P}_1$, and a partition $\mathcal{S}$ of $\mathcal{X}$, the partition calibrated AUC is maximized by using the likelihood ratio within each partition, $h_i(s) = p_{1,i}^f(s) / p_{0,i}^f(s)$, where $p_{0,i}^f$ and $p_{1,i}^f$ denote the (unnormalized) densities of the score restricted to points in partition $S_i$ under $\mathcal{P}_0$ and $\mathcal{P}_1$ respectively.

Note that we could set every $h_i$ to be the same transformation, and thus we can only improve (or keep equal) AUC by partitioning, and this holds for any arbitrary partition. Additionally, this likelihood ratio will give the same ordering as the optimal calibration for each partition, which for a given $x \in S_i$ would set $h_i(f(x)) = \Pr[y = 1 \mid f(x), x \in S_i]$.

Lemma 4.4 (Informal).

The likelihood ratio and the optimal calibration probability give an equivalent ordering. For any $x \in S_i$ and $x' \in S_j$, we have $p_{1,i}^f(f(x)) / p_{0,i}^f(f(x)) > p_{1,j}^f(f(x')) / p_{0,j}^f(f(x'))$ if and only if $\Pr[y = 1 \mid f(x), x \in S_i] > \Pr[y = 1 \mid f(x'), x' \in S_j]$.

Therefore, by optimally calibrating each partition we can equivalently maximize overall AUC. If this partitioning is taken to the extreme, then this calibration intuitively just recovers the optimal model.

Corollary 4.1 (Informal).

If $\mathcal{S}$ is the full partitioning of $\mathcal{X}$, which is to say each point is its own partition, then the optimal partition calibrated score is equivalent to the Bayes-optimal predictor, where each $x$ is mapped to $\Pr[y = 1 \mid x]$.

However, running post-processing to accomplish the same task as the model training is both redundant and infeasible to do accurately in this way. It is then necessary to balance partitioning the feature space against maintaining enough data to accurately calibrate each partition. Furthermore, from the intuition we gave before, we are not doing this partitioning merely in hopes of improvement because of the mathematical guarantee, but because over-confidence means that our model may not have accounted for specific partitions appropriately.
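The contrast between calibrating globally and calibrating per partition can be checked on synthetic data. The sketch below (a hypothetical setup with a known heterogeneous binary partition and a score that is blind to it) shows that global Platt scaling, being monotone, leaves AUC untouched, while per-partition Platt scaling improves it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def make_data(n):
    # z indexes a known heterogeneous partition: z = 1 is skewed
    # positive, z = 0 is skewed negative.
    z = rng.integers(0, 2, n)
    y = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)
    s = rng.normal(0.0, 1.0, n) + y   # informative score, blind to z
    return z, y, s

z_cal, y_cal, s_cal = make_data(5000)   # held-out calibration data
z_te, y_te, s_te = make_data(5000)      # test data

# Global Platt scaling is monotone in s, so AUC cannot change.
glob = LogisticRegression().fit(s_cal[:, None], y_cal)
auc_global = roc_auc_score(y_te, glob.predict_proba(s_te[:, None])[:, 1])

# Per-partition Platt scaling re-orders the partitions.
p_te = np.empty_like(s_te)
for part in (0, 1):
    lr = LogisticRegression().fit(s_cal[z_cal == part][:, None],
                                  y_cal[z_cal == part])
    m = z_te == part
    p_te[m] = lr.predict_proba(s_te[m][:, None])[:, 1]
auc_part = roc_auc_score(y_te, p_te)
```

The partition-specific intercepts effectively shift the skewed-positive partition above the skewed-negative one, which is exactly the re-ordering that a single monotone transformation cannot achieve.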

The granularity to which the feature space can be partitioned while still maintaining accuracy has also been studied in the fairness literature, giving bounds on the sample complexity for multicalibration Shabat et al. (2020), and there has also been work on estimating calibration of higher moments for multicalibration Jung et al. (2021). The sample complexity results are agnostic to the calibration technique, but for a more practical application the extent of the partitioning should depend on which calibration technique is applied. For example, histogram binning essentially estimates the full score distributions and will require more samples to keep empirical error low. In contrast, Platt scaling is just logistic regression on one variable and thus requires fewer samples to get accurate parameters for the calibration.
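Since Platt scaling fits only a slope and an intercept, it is quite sample-efficient. A minimal sketch (with an assumed over-confident score whose true log-odds is a down-scaled version of the reported score, and a simple equal-width-bin estimate of expected calibration error) showing Platt scaling reducing calibration error:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ece(y, p, n_bins=10):
    # Expected calibration error with equal-width probability bins.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            total += m.mean() * abs(y[m].mean() - p[m].mean())
    return total

rng = np.random.default_rng(5)
n = 20000
y = rng.integers(0, 2, n)
# Over-confident score: the true log-odds is s / 3, the model reports s.
s = np.where(y == 1, rng.normal(1.5, 3.0, n), rng.normal(-1.5, 3.0, n))

half = n // 2   # fit on the first half, evaluate on the second
platt = LogisticRegression().fit(s[:half, None], y[:half])
p_raw = sigmoid(s[half:])
p_cal = platt.predict_proba(s[half:, None])[:, 1]
```

The fitted slope comes out near the true 1/3, and the calibrated probabilities have a markedly lower calibration error on the held-out half.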

Additionally, the extent of the partitioning also depends upon whether we use a random forest for our partitioning scheme and take the average calibration over all the trees. In the same way that the trees can have a greater depth because of the ensemble nature of a random forest, we could take advantage of the same ensemble-type structure to partition more finely. We could also apply tree pruning techniques via cross-validation Kohavi and others (1995) to determine the ideal level of partitioning.

5 Heterogeneous Calibration Framework

Combining the intuition in Section 3, whereby over-confident models under-utilize heterogeneous partitions, with the theoretical optimality in Section 4 of calibrating each partition separately to maximize AUC, immediately yields the general heterogeneous calibration framework:

  1. Partition the feature space for maximal heterogeneity

  2. Calibrate each partition of the feature space separately

We give an explicit implementation of this framework in Section 5.1, but the flexibility of this paradigm allows for many possible implementations. In particular, there is a multitude of post-hoc calibration techniques from the literature that could be applied Platt and others (1999); Zadrozny and Elkan (2002, 2001); Naeini et al. (2015); Kumar et al. (2019); Kull et al. (2019). Furthermore, splitting the feature space with a decision tree, which greedily maximizes heterogeneity, is the most obvious choice, but we could also use a random forest here by repeating the partitioning and calibration multiple times and outputting the average across the trees. We could also utilize boosted trees, which give a sequence of partitions, and then sequentially apply calibration such that the final transformation is a nested composition of calibrations for each partitioning. We further sketch out the details of how this could work for boosted trees in the appendix (Section 13), but leave a more thorough examination to future work. Additionally, we could construct decision trees that greedily split the feature space to more directly optimize AUC, which we discuss in Section 12.

We note that this framework can easily be applied to multi-class classification, with many tree-based partitioning schemes and calibration techniques being extendable to multi-class classification. We also focus on tabular data and recommender systems because heterogeneity is much more common in these settings, but this framework could be extended to image classification and natural language processing. In particular, the partitioning of the feature space could be identified by applying a decision tree to the neurons of an internal layer in the neural network, which are often considered to represent more general patterns and thus have more heterogeneity.

In order to best show that the underlying theory of our general framework holds in practice, we focus on the simplest instantiation and leave the application of higher-performing tree-based partitioning schemes and more effective calibration techniques to future work.

5.1 Example Implementation

To exemplify our framework we give a simple instantiation here, which will also be used in our experiments: a decision tree classifier identifies the partitioning and logistic regression is used for calibration, i.e., Platt scaling.

We assume that the model is a DNN trained on the training data and that the model with the highest accuracy on the validation data is chosen, but this assumption is not necessary to apply this algorithm. Our heterogeneous calibration procedure can use the same training and validation data. However, by choosing the model with peak accuracy on the validation data, the model is likely to be slightly over-confident on the validation data (though much less so than on the training data), so using fresh data for the calibration would be preferable.

0:  Input: training data, validation data, and classifier score function from a trained model
1:  Build a low-depth classification tree on the training data whose leaves generate a partitioning of the feature space
2:  for each partition do
3:     Collect the label and score pairs from the validation data that fall in the partition
4:     Run Platt scaling (logistic regression) on these pairs to obtain the partition's calibration transformation
5:  end for
6:  For a new feature vector, use the classification tree to find its partition
7:  Return the probability prediction obtained by applying that partition's calibration transformation to the classifier score
Algorithm 1 Heterogeneous Calibration
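A minimal end-to-end sketch of Algorithm 1, assuming scikit-learn in place of the paper's TensorFlow setup. The synthetic data, the `MLPClassifier` standing in for the DNN, the tree depth, and the fallbacks for degenerate leaves are all illustrative choices, not the paper's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Base model standing in for the trained DNN.
dnn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=200,
                    random_state=0).fit(X_train, y_train)
score_fn = lambda Z: dnn.predict_proba(Z)[:, 1]

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# Step 1: low-depth tree on the training data; its leaves define the partition.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
leaves_cal = tree.apply(X_cal)

# Steps 2-5: Platt scaling (logistic regression on the logit of the score)
# fit separately on the held-out data of each leaf.
calibrators = {}
for leaf in np.unique(leaves_cal):
    mask = leaves_cal == leaf
    if len(np.unique(y_cal[mask])) < 2:
        # Degenerate single-class leaf: use a constant prediction instead.
        calibrators[leaf] = float(y_cal[mask].mean())
        continue
    lr = LogisticRegression().fit(
        logit(score_fn(X_cal[mask]))[:, None], y_cal[mask])
    calibrators[leaf] = lr

# Steps 6-7: route each point to its leaf and apply that leaf's calibrator.
def heterogeneous_calibration(X_new):
    leaves = tree.apply(X_new)
    z = logit(score_fn(X_new))[:, None]
    out = np.empty(len(X_new))
    for leaf in np.unique(leaves):
        m = leaves == leaf
        cal = calibrators.get(leaf)
        if cal is None:
            out[m] = score_fn(X_new[m])      # unseen leaf: raw score
        elif isinstance(cal, float):
            out[m] = cal                     # constant leaf prediction
        else:
            out[m] = cal.predict_proba(z[m])[:, 1]
    return out

preds = heterogeneous_calibration(X_cal)
```

Scaling on the logit of the score, rather than the raw probability, matches the standard form of Platt scaling applied to probabilistic outputs.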

This framework will be most effectively applied to real-world use cases under three general conditions:

  1. The model should have some degree of over-confidence, in the same way that post-hoc calibration techniques give little additional value to well-calibrated models

  2. There should be an algorithmically identifiable partitioning of the feature space with a reasonable amount of heterogeneity

  3. There should be sufficient data outside of the training data to accurately perform calibration on each partition
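The first condition is easy to check empirically: compare performance on training data against held-out data. This sketch uses an intentionally unregularized decision tree as a stand-in for an over-confident DNN; the data and the choice of AUC as the metric are illustrative:

```python
# Quick diagnostic for condition 1: a large train-vs-holdout AUC gap
# suggests an over-confident (overfit) model that may benefit from
# heterogeneous calibration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.5, random_state=0)

# Unregularized (fully grown) tree as a stand-in for an over-confident DNN.
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_hold = roc_auc_score(y_ho, model.predict_proba(X_ho)[:, 1])
gap = auc_train - auc_hold  # large gap -> candidate for the framework
```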

5.2 Interpolation between DNNs and tree-based algorithms

In this section we further discuss how our heterogeneous calibration framework gives a natural interpolation between DNNs and tree-based algorithms through the use of calibration. In particular, we show how this framework can equivalently be viewed as an optimal ensemble of any given DNN and decision tree through the use of calibration. Furthermore, we discuss how this can extend to any tree-based algorithm.

We begin by re-considering Algorithm 1 whereby we could equivalently assume that we have learned a score classifier function from a DNN, and also independently have learned a partitioning through a decision tree classifier on the training data. Therefore, we have two separate binary classification prediction models for a given feature vector . Our DNN will give the probability prediction . Our decision tree classifier will identify the partition such that and return the probability prediction .

Next we consider the logistic regression from Algorithm 1 which is done on each and learns a function over . Our heterogeneous calibration will combine the DNN and the partitioning from the decision tree, , such that for any feature vector it will output the probability prediction where . Note that if our logistic regression learns and for all partitions, then the new model is identical to the original DNN. Similarly, if the logistic regression learns and for all partitions, then this new model is equivalent to the original decision tree. Accordingly, the calibration can then be seen as an interpolation between the DNN and decision tree model.
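The interpolation can be checked concretely. Assuming, as in standard Platt scaling, that each partition's calibrator has the form g_k(s) = sigmoid(a_k · logit(s) + b_k), the two extreme parameter settings recover the DNN and the tree respectively; the scores and leaf rate below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

dnn_scores = np.array([0.2, 0.7, 0.9])  # hypothetical DNN outputs in a leaf
leaf_rate = 0.4                         # hypothetical positive rate of the leaf

# a_k = 1, b_k = 0: the calibrated model reproduces the DNN on this partition.
identity = sigmoid(1.0 * logit(dnn_scores) + 0.0)

# a_k = 0, b_k = logit(leaf_rate): the calibrated model ignores the DNN score
# and reproduces the tree's constant leaf prediction.
constant = sigmoid(0.0 * logit(dnn_scores) + logit(leaf_rate))
```

Intermediate values of a_k and b_k give the interpolation described above, with the logistic regression on each partition choosing the mix that best fits the held-out labels.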

From our optimality results in Section 4, we further know that perfect calibration will actually optimize the ensemble of these two models. Essentially, the calibration will implicitly pick and choose which strengths of each model to use in order to combine them to maximize AUC. The natural interpolation of calibration between models equivalently extends to other tree-based algorithms such as random forests and boosted trees (further detail in Section 13). Extending the optimality to these settings should also follow similarly, and we leave this to future work. Accordingly, our heterogeneous calibration framework can equivalently be viewed as a way to optimally combine independently trained DNNs and tree-based algorithms in a post-hoc manner. While this may theoretically guarantee an optimal combination, it's again important to note that the extent of partitioning and intricacy of calibration must be balanced against the corresponding empirical error for our framework to be effectively applied in practice.

6 Experiments

| Size | Model | Bank Marketing | Census data | Credit Default | Higgs | Diabetes |
| --- | --- | --- | --- | --- | --- | --- |
| S | Top 3 DNN | 0.7758 | 0.8976 | 0.7784 | 0.7801 | 0.6915 |
| S | Top 3 HC | 0.7816 (+0.76%) | 0.9021 (+0.50%) | 0.7798 (+0.18%) | 0.7816 (+0.19%) | 0.6937 (+0.32%) |
| S | Top 50% DNN | 0.7736 | 0.8892 | 0.7771 | 0.7650 | 0.6799 |
| S | Top 50% HC | 0.7810 (+0.96%) | 0.9004 (+1.27%) | 0.7789 (+0.23%) | 0.7692 (+0.54%) | 0.6879 (+1.18%) |
| M | Top 3 DNN | 0.7712 | 0.8978 | 0.7787 | 0.7773 | 0.6744 |
| M | Top 3 HC | 0.7800 (+1.14%) | 0.9027 (+0.55%) | 0.7794 (+0.09%) | 0.7799 (+0.33%) | 0.6856 (+1.66%) |
| M | Top 50% DNN | 0.7690 | 0.8858 | 0.7775 | 0.7617 | 0.6683 |
| M | Top 50% HC | 0.7793 (+1.34%) | 0.9009 (+1.70%) | 0.7790 (+0.20%) | 0.7680 (+0.83%) | 0.6841 (+2.37%) |
| L | Top 3 DNN | 0.7716 | 0.9007 | 0.7783 | 0.7747 | 0.6679 |
| L | Top 3 HC | 0.7814 (+1.27%) | 0.9027 (+0.22%) | 0.7794 (+0.14%) | 0.7775 (+0.36%) | 0.6824 (+2.17%) |
| L | Top 50% DNN | 0.7663 | 0.8800 | 0.7772 | 0.7596 | 0.6637 |
| L | Top 50% HC | 0.7779 (+1.52%) | 0.9010 (+2.38%) | 0.7789 (+0.23%) | 0.7666 (+0.92%) | 0.6824 (+2.82%) |
Table 1: Test AUC-ROC (mean of 5 runs) on different datasets before and after calibration. DNN = Deep neural network, HC = Heterogeneous calibration. We report model performance on the top 3 variants as well as the top 50% variants for each model, where top 3 and top 50% are determined by DNN performance prior to HC.
| Model | Bank Marketing | Census data | Credit Default | Higgs data | Diabetes |
| --- | --- | --- | --- | --- | --- |
| Top 3 Reg DNN | 0.7758 | 0.8976 | 0.7781 | 0.7801 | 0.6693 |
| Top 3 Reg HC | 0.7816 (+0.76%) | 0.9021 (+0.50%) | 0.7793 (+0.16%) | 0.7816 (+0.19%) | 0.6829 (+2.04%) |
| Top 3 Unreg DNN | 0.7735 | 0.8773 | 0.7768 | 0.7498 | 0.6915 |
| Top 3 Unreg HC | 0.7804 (+0.88%) | 0.8985 (+2.42%) | 0.7787 (+0.25%) | 0.7588 (+1.20%) | 0.6937 (+0.32%) |
Table 2: Test effect of regularization on AUC-ROC (mean of 5 runs) on different datasets before and after calibration for the small MLPs. DNN = Deep neural network, HC = Heterogeneous calibration, Reg = regularized model, Unreg = unregularized model. Top 3 variants are chosen using the procedure mentioned in Table 1. Table 3 in Appendix 14 contains more results for medium and large networks.

We evaluate the efficacy of heterogeneous calibration on the task of binary classification using deep neural networks on a variety of datasets. We make observations about the effect of model size, regularization and training set size on the effectiveness of the technique. All experiments were conducted using TensorFlow Abadi et al. (2015).

Datasets: We use datasets containing a varying number of data points and types of features. For each dataset, we create training, validation (for tuning neural networks), calibration (for training post-hoc calibration models) and test splits. Specifically, we use the following 5 datasets:

  • Bank marketing Moro et al. (2014) - Marketing campaign data to predict client subscriptions. datapoints.

  • Census Income Kohavi and others (1996) - Data to predict whether income exceeds a threshold or not. datapoints.

  • Credit Default Yeh and Lien (2009) - Data to predict credit card default. datapoints.

  • Higgs Baldi et al. (2014) - Data to distinguish between a Higgs boson producing signal process and a background process. We chose datapoints out of the entire set.

  • Diabetes Strack et al. (2014) - Data about readmission outcomes for diabetic patients. datapoints.

Further details about the datasets, including features, splits and pre-processing information, can be found in Appendix 14.1.

Modeling details:

We use multilayer perceptrons with 3 feed-forward layers. To understand the effect of model size and model regularization on calibration performance, we vary the number of neurons in each MLP layer and also toggle regularization techniques like batch normalization Ioffe and Szegedy (2015) and dropout Srivastava et al. (2014). Specifically, we choose 3 MLP sizes based on the number of parameters in each. We use the Adam optimizer Kingma and Ba (2014) and extensively tune the learning rate on a log-scaled grid for each variant, since even adaptive optimizers can benefit from learning rate tuning Loshchilov and Hutter (2017). Complete details about the MLP variants, regularization techniques and tuning of the learning rate can be found in Appendix 14.2.

For heterogeneous calibration, we train a decision tree classifier on the training set to partition the feature space and subsequently use Platt scaling Platt and others (1999) on the calibration dataset for each partition. We lightly tune the tree hyperparameters; details are in Appendix 14.3. Note that extensive tuning, or using a different partitioning or calibration algorithm, could have led to further improvements for our method.
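The light tuning described above can be sketched generically. The log-scaled learning-rate grid, the depth grid, and the validation-AUC selection criterion here are illustrative assumptions; the paper's exact grids are in Appendices 14.2 and 14.3:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A log-scaled grid like the one used for the learning rate (illustrative).
learning_rates = np.logspace(-5, -1, num=9)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Light tuning of the partitioning tree's depth by held-out AUC.
best_depth, best_auc = None, -np.inf
for depth in [2, 3, 4, 5]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, tree.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_depth, best_auc = depth, auc
```

Deeper trees yield finer partitions but fewer calibration points per leaf, which is the partitioning-vs-empirical-error trade-off discussed in Section 5.2.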

6.1 Main results

Table 1 displays the main results. We choose 3 MLP sizes based on the number of parameters and label them small, medium, and large. For each MLP size, we choose the top 3 and top 50% variants after extensive tuning of the learning rate and regularization, and report the mean of 5 runs.

We note that our method provides a consistent lift in AUC across all model sizes and datasets, despite the use of a simple calibration model. This meshes well with our hypothesis that modern neural networks, despite regularization, are overconfident and calibration can be used as a simple post-hoc technique to improve generalization performance.


Figure 3: Box plots of test AUC lifts of our method for various runs of the top 3 models. We note a consistent lift in AUC across runs and hyperparameter settings.

Figure 3 contains box plots of the test AUC lift provided by our method for 2 datasets. The plots contain lifts from 5 different runs of the top 3 models for each setting. We observe a consistent lift in AUC across various runs and hyperparameter settings, demonstrating the consistency of our method. We include box plots for other datasets in Appendix 14.4.

Effect of model size. From Table 1, we note that as we go to larger models, the lift in performance for our method consistently increases for all datasets. This corroborates our hypothesis and intuition that larger models can be more overconfident, and hence may benefit more from our method.

6.2 Effect of model regularization

Table 2 shows the effect of heterogeneous calibration on regularized (using dropout or batch normalization) and unregularized MLPs of small size. Unsurprisingly, our method provides a larger relative lift in performance for unregularized DNNs as compared to regularized DNNs. This fits well with our hypothesis that unregularized networks are highly overconfident, and may benefit from methods such as ours.

6.3 Computational efficiency

We note that hyperparameter tuning is critical for generalization performance, which varies widely with the choice of hyperparameters; for our experiments, we tuned the learning rate. Interestingly, our method exhibits a much smaller variance in AUC across a large range of learning rates when compared to an uncalibrated network. This was particularly notable for the Census data, where our technique maintained high performance even when the uncalibrated network's performance dipped. This may reduce the need for extensive hyperparameter tuning.

7 Discussions

In this paper we developed the framework of heterogeneous calibration that utilizes data heterogeneity and post-hoc calibration techniques for improving model generalization for over-confident models. We theoretically proved that the calibration transformation is optimal in improving AUC. To show its efficacy in practice, we focus on the simplest instantiation, but this framework can naturally apply combinations of known higher-performing techniques for both the partitioning and calibration. We believe further investigation into these applications of the framework would be an interesting and fruitful future direction now that we have established the efficacy of our heterogeneous calibration paradigm.

We further showed that our framework equivalently uses calibration to optimally combine a DNN and decision tree as a post-hoc ensemble method. This should extend to other tree-based algorithms in the same manner, but a more rigorous examination would be an interesting future direction. This investigation could also include a more thorough characterization of when the AUC increases most for this optimal combination of DNNs and tree-based algorithms. This would potentially be used in determining how to train DNNs to focus on learning patterns that are not identifiable through tree-based algorithms and then utilize the heterogeneous calibration framework to achieve a higher-performing combination.

Our experiments also showed much more consistent high performance of the model with heterogeneous calibration applied as we searched through the hyper-parameters. We think another interesting future direction would be to further investigate the extent to which heterogeneous calibration can serve as a replacement for hyper-parameter tuning.

8 Acknowledgements

We thank our colleagues Joojay Huyn, Varun Mithal, Preetam Nandy, Jun Shi, and Ye Tu for their helpful feedback and illuminating discussions.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: software available from tensorflow.org. Cited by: §6.
  • Y. Bai, S. Mei, H. Wang, and C. Xiong (2021) Don’t just blame over-parametrization for over-confidence: theoretical analysis of calibration in binary classification. International Conference on Machine Learning. Cited by: §1, §1, §2.1, §2.
  • P. Baldi, P. Sadowski, and D. Whiteson (2014) Searching for exotic particles in high-energy physics with deep learning. Nature communications 5 (1), pp. 1–9. Cited by: 4th item.
  • S. Bubeck and M. Sellke (2021) A universal law of robustness via isoperimetry. International Conference on Machine Learning. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pp. 1050–1059. Cited by: §1.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1321–1330. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document Cited by: §1.
  • U. Hebert-Johnson, M. Kim, O. Reingold, and G. Rothblum (2018) Multicalibration: calibration for the (Computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1939–1948. External Links: Link Cited by: §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §6.
  • C. Jung, C. Lee, M. Pai, A. Roth, and R. Vohra (2021) Moment multicalibration for uncertainty estimation. In Proceedings of Thirty Fourth Conference on Learning Theory, M. Belkin and S. Kpotufe (Eds.), Proceedings of Machine Learning Research, Vol. 134, pp. 2634–2678. External Links: Link Cited by: §4.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.
  • R. Kohavi et al. (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, Vol. 14, pp. 1137–1145. Cited by: §4.2.
  • R. Kohavi et al. (1996) Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In Kdd, Vol. 96, pp. 202–207. Cited by: 2nd item.
  • M. Kull, M. Perello-Nieto, M. Kängsepp, H. Song, P. Flach, et al. (2019) Beyond temperature scaling: obtaining well-calibrated multiclass probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS’19). Cited by: §1, §5.
  • A. Kumar, P. Liang, and T. Ma (2019) Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS’19). Cited by: §1, §11.1, §5.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems. Cited by: §1.
  • I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §6.
  • S. Moro, P. Cortez, and P. Rita (2014) A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62, pp. 22–31. Cited by: 1st item.
  • J. Mukhoti, V. Kulharia, A. Sanyal, S. Golodetz, P. H. Torr, and P. K. Dokania (2020) Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems. Cited by: §1.
  • M. P. Naeini, G. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Cited by: §1, §5.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Icml. Cited by: §14.2.
  • M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy (2019) Deep learning recommendation model for personalization and recommendation systems. CoRR abs/1906.00091. External Links: Link, 1906.00091 Cited by: §1.
  • J. Platt et al. (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3), pp. 61–74. Cited by: §1, §5, §6.
  • E. Shabat, L. Cohen, and Y. Mansour (2020) Sample complexity of uniform convergence for multicalibration. In Advances in Neural Information Processing Systems. Cited by: §4.2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §6.
  • B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore (2014) Impact of hba1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed research international 2014. Cited by: 5th item.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. NIPS’17, Red Hook, NY, USA, pp. 6000–6010. External Links: ISBN 9781510860964 Cited by: §1.
  • I. Yeh and C. Lien (2009) The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36 (2), pp. 2473–2480. Cited by: 3rd item.
  • B. Zadrozny and C. Elkan (2001) Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Icml, Vol. 1, pp. 609–616. Cited by: §1, §5.
  • B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. Cited by: §1, §5.

9 Partition Calibrated AUC Proofs

In this section we provide the missing proofs of the informal lemmas from Section 4. Note that Lemmas 4.3 and 4.4 are the more general cases of the former two, respectively, so we only prove these two.

We will copy the definition of partition calibrated AUC here for reference.

Definition 9.1.

[Partition Calibrated AUC] For a given classifier score function , and distributions and , along with a partition of , and a transformation function , we define partition calibrated AUC as

Our formalized version Lemma 4.3 can then be stated as such.

Lemma 9.1.

For a given classifier score function , and distributions and , along with a partition of , let be the transformation

and if then we let .

For any function we have

Note that we slightly deviate from the likelihood ratio in Lemma 4.3 to avoid divide-by-zero concerns, but the optimal transformation in the lemma gives an essentially equivalent ordering to the likelihood ratio.
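The equivalence of the two orderings follows from a one-line monotonicity argument. The symbols $p$ and $n$ below are our own, introduced because the original notation was lost in extraction; they stand for the positive- and negative-class densities at a given score-partition pair:

```latex
% For p > 0 and n > 0,
\[
  \frac{p}{p+n} \;=\; \frac{p/n}{1 + p/n},
\]
% and since $x \mapsto x/(1+x)$ is strictly increasing on $[0,\infty)$,
% ordering score-partition pairs by $p/(p+n)$ is equivalent to ordering
% them by the likelihood ratio $p/n$.
```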

The notation will become too onerous here so we will let and similarly for other subscripts. We will further let


We re-arrange the integrals to pair together an and which allows us to re-write the AUC as

We could then further consider each pair and and re-write the AUC as

The first integral does not change regardless of the choice of . From Lemma 9.3 we have that the inequality is tight in Lemma 9.2 for all and when , and therefore the second integral is maximized with .

We utilize the following helper lemma that considers two pairs of scores and partitions and gives an upper bound on their sum for both possible orderings.

Lemma 9.2.

For any and we have


This follows immediately from the fact that

and both terms are non-negative.

We also utilize another helper lemma that shows an equivalent ordering for our considered optimal transformation function with respect to pairs of scores and partitions.

Lemma 9.3.

For any and , if then


If then we must have and because otherwise contradicting our assumed inequality.

By adding to both sides, our assumed inequality can then be equivalently written

and dividing each side gives the desired inequality.

9.1 Ordering equivalence

We further show that the ordering given by the optimal transformation is equivalent to the ordering obtained when each partition is perfectly calibrated.

We keep the notation for the lemma statement the same but will switch to shorthand for the proof where we let , , and .

Lemma 9.4.

Given distributions and our classifier score function and a partition of . For any and where and , then

if and only if


By the definition of conditional probability we have

Plugging this in to the first inequality in our if and only if statement, we then cross multiply and cancel like terms to get

By our definitions we have which then implies and applying this and cancelling gives

Furthermore by taking the second inequality in our desired if and only if statement, then cross multiplying and cancelling like terms we equivalently get

10 Calibrated FPR and TPR Proofs

In this section we show that the same transformation function that optimizes calibrated AUC will also optimize TPR with respect to FPR. In particular, we show that the ROC curve for the optimal transformation function will always contain the ROC curve for any other transformation function. As a corollary we obtain that the Precision with respect to Recall is also optimized and thus the PR-AUC is maximized. We begin by defining calibrated TPR and FPR.

Definition 10.1.

[Calibrated TPR] For a given classifier score function , and distribution , along with a transformation function , and some and , we define calibrated TPR as

Definition 10.2.

[Calibrated FPR] Defined identically to TPR but using distribution

The value of is necessary here because if is not continuous over then may not be defined for all and we want our statements to generalize over all probability distributions and classifier score functions. Note that when is the identity function, this is just the standard definition of TPR and FPR. As is well known, we can equivalently define AUC using TPR and FPR, and we prove the calibrated version here as well, which follows equivalently.
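One plausible formalization of the definitions above, with notation introduced here because the extraction dropped the original symbols (and omitting the $\epsilon$ tie-breaking refinement): let $f$ be the classifier score function, $h$ the transformation, and $\mathcal{D}_+$, $\mathcal{D}_-$ the positive- and negative-class distributions. Then

```latex
\[
  \mathrm{TPR}_h(\tau) \;=\; \Pr_{x \sim \mathcal{D}_+}\!\bigl[\, h(f(x)) \ge \tau \,\bigr],
  \qquad
  \mathrm{FPR}_h(\tau) \;=\; \Pr_{x \sim \mathcal{D}_-}\!\bigl[\, h(f(x)) \ge \tau \,\bigr],
\]
% and the standard identity relating these quantities to AUC is
\[
  \mathrm{AUC}(h) \;=\; \int_0^1 \mathrm{TPR}_h \; d\,\mathrm{FPR}_h .
\]
```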

Lemma 10.1.

For a given classifier score function , along with distributions and , and a transformation function