1 Introduction
Deep neural networks (DNNs) have become ubiquitous in decision-making pipelines across industries due to an extensive line of work on improving accuracy, and are particularly applicable to settings where massive datasets are common He et al. (2016); Devlin et al. (2019); Vaswani et al. (2017); Naumov et al. (2019). The large number of parameters in DNNs affords greater flexibility in the modeling process for improving generalization performance and has recently been shown to be necessary for smoothly interpolating the data Bubeck and Sellke (2021). However, this overparameterization (where the number of parameters exceeds the number of training samples), along with other factors, can lead to overconfidence, where model performance is substantially better on training data than on test data. For classification tasks, overconfidence is more specifically characterized by the model's output probability for the predicted class being generally higher than the true probability.
Guo et al. (2017) found that overconfidence increases with model depth and width, even when accuracy improves. Additional recent work proves that overconfidence is also inherent for underparameterized logistic regression Bai et al. (2021). Accordingly, there is extensive work in the area of calibration, whose primary goal is to improve the accuracy of probability estimates, which is essential for many use cases.
Some of the common calibration techniques that apply a post-hoc model-agnostic transformation to properly adjust the model output include Platt scaling Platt and others (1999), isotonic regression Zadrozny and Elkan (2002), histogram binning Zadrozny and Elkan (2001), Bayesian binning into quantiles Naeini et al. (2015), scaling-binning Kumar et al. (2019), and Dirichlet calibration Kull et al. (2019). There is also work on calibration through ensemble-type methods Lakshminarayanan et al. (2017); Gal and Ghahramani (2016) and recent work on using focal loss to train models that are already well-calibrated while maintaining accuracy Mukhoti et al. (2020). For a more in-depth exposure to different calibration methods, please see Bai et al. (2021).

In this paper, we deviate from this traditional aim of calibration. Instead of trying to improve the accuracy of probability estimates, we aim to improve model generalization. Our key insight is that overconfident models not only show miscalibration but also tend to underutilize heterogeneity in the data in a specific and intuitive manner. We develop this intuition through concrete examples, and focus on mitigating this underutilization of data heterogeneity through efficient post-hoc calibration. We specifically aim to improve model accuracy, characterized through the area under the receiver operating characteristic (ROC) curve (commonly called AUC) and other metrics for binary classification.
Accordingly, we develop a new theoretical foundation for post-hoc calibration to improve AUC and prove that the transformation that optimizes output probability estimates will also optimize AUC. To the best of our knowledge, this is the first paper that provably shows how calibration techniques can improve model generalization. We further extend this theoretical optimality to separately calibrating different partitions of the heterogeneous feature space, and give concrete intuition for how partitioning through standard tree-based algorithms and separately calibrating the partitions will improve AUC. This gives a natural and rigorous connection between tree-based algorithms and DNNs through the use of standard calibration techniques, and provides an efficient post-hoc transformation framework to improve accuracy.
In order to best show that the underlying theory of our general framework holds in practice, we test the simplest instantiation, whereby a decision tree classifier identifies the partitioning and logistic regression is used for calibration. We test on open-source datasets and focus upon tabular data due to its inherent heterogeneity, but also discuss how this can be extended to image classification or natural language processing tasks. Across the different datasets we see a notable increase in performance from our heterogeneous calibration technique on the top-performing DNN models. In addition, we see a much more substantial increase in performance and more stable results when considering the average-performing DNNs from hyperparameter tuning. Our experiments also confirm the intuition that more overconfident models will see a greater increase in performance from our heterogeneous calibration framework.
We summarize our contributions as the following:


We use concrete examples to give intuition on how overconfident models, particularly DNNs, underutilize heterogeneous partitions of the feature space.

We provide theoretical justification for correcting this underutilization through standard calibration on each partition to maximize AUC.

We leverage this intuition and theoretical optimality to introduce the general paradigm of heterogeneous calibration, which applies a post-hoc model-agnostic transformation to model outputs to improve AUC performance on binary classification tasks. This framework also easily generalizes to multi-class classification.

We test the simplest instantiation of heterogeneous calibration on open-source datasets to show the effectiveness of the framework.
The rest of the paper is organized as follows. We begin with the detailed problem setup in Section 2. In Section 3 we give intuition as to how overconfident models tend to underutilize heterogeneity. In Section 4 we give a provably optimal post-hoc transformation for mitigating underutilized heterogeneity. In Section 5 we detail the framework of heterogeneous calibration. In Section 6 we give the experimental results before ending with a discussion in Section 7. All proofs are deferred to the Appendix.
2 Methodology
In practice, calibration is primarily used to get better probability estimates, which is especially useful for use cases where uncertainty estimation is important Bai et al. (2021). At first glance, correcting overconfidence by post-hoc calibration should only contract the range of probability estimates but not affect the ordering, and thereby not the AUC. Moreover, much of the literature attempts to maintain accuracy within the calibration. However, if we consider overconfidence at a more granular level, it can negatively affect the relative ordering between heterogeneous partitions. We primarily consider a heterogeneous partitioning to be a splitting of the feature space such that each partition has a disproportionately higher ratio of positive or negative labels for the binary classification setting. Intuitively, the more accurately a model has intricately fit the data, the less it needs to utilize simpler patterns such as heterogeneous partitions; but overconfident models will overestimate their ability to fit the data and thus underutilize simple patterns.
Ideally, the relative ordering of heterogeneous partitions could be corrected for overconfident models as a post-hoc procedure. One approach would be to add a separate bias term to the output of each partition, but this may not fully capture the extent to which the relative ordering can be improved. We give a more rigorous examination of the AUC metric, which measures the quality of our output ordering, and prove that perfectly calibrating the probability estimates will also optimize the AUC and several other accuracy metrics for the given model. Furthermore, we show that this extends to any partitioning of the feature space, such that perfectly calibrating each partition separately will maximally improve AUC and other related metrics. The concept of separately calibrating partitions of the feature space has also been seen in the fairness literature Hebert-Johnson et al. (2018), but their partitions are predefined based upon fairness considerations and the considered metrics are geared towards ensuring fair models.
Combining our theoretical result with the intuition that overconfident models will improperly account for heterogeneous partitions gives a general framework of heterogeneous calibration as a post-hoc model-agnostic transformation that: (1) identifies heterogeneous partitions of the feature space through tree-based algorithms; (2) calibrates each partition separately using a known technique from the extensive line of calibration literature.
The heterogeneous partitioning can be done through a variety of tree-based methods, and we view this as a natural, efficient, and rigorous incorporation of tree-based techniques into DNNs through the use of calibration. In fact, our theoretical optimality results also imply that heterogeneous calibration gives the optimal ensemble of a separately trained DNN and decision tree, combining the strengths of each into one model to maximize AUC.
Additionally, the advantage of this post-hoc framework, as opposed to applying techniques that fix overconfidence within the training itself, is that overconfident models are not inherently undesirable with respect to accuracy Guo et al. (2017). The flexibility of overparameterization allows the model training to simultaneously learn generalizable patterns and memorize small portions of the training data. Validation data is often used to identify the point at which increased memorization outweighs the additional generalization, but decoupling these two effects before that point while still achieving a similar level of performance is incredibly challenging. The post-hoc nature of our framework allows us to avoid this difficulty and enjoy the additional generalization from overconfident models while also correcting the underutilization of simpler patterns in the data.
2.1 Notation
To more rigorously set up the problem, we let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ be the data universe, and we consider the classical binary classification setting where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{0, 1\}$, and $(x, y) \in \mathcal{Z}$ is a feature vector and label from the data universe. Let $\mathcal{D}$ be the probability distribution over the data universe with density function $p$, where our data consists of random samples $(x, y) \sim \mathcal{D}$. Let $\mathcal{D}_0$ and $\mathcal{D}_1$ be the probability distributions over $\mathcal{X}$ where we condition on the label being 0 and 1 respectively, which is to say that their respective density functions $p_0$ and $p_1$ are such that $p_0(x) \propto p(x, 0)$ and $p_1(x) \propto p(x, 1)$.

Let $f : \mathcal{X} \to \mathbb{R}$ be the score function of a binary classification model. We consider this to be the output of the final neuron of the DNN prior to applying the sigmoid function, but our theoretical results hold for any score function.

We will be considering splits of the feature space, where we let $\mathcal{P} = \{\mathcal{X}_1, \ldots, \mathcal{X}_k\}$ be a partitioning of $\mathcal{X}$ such that each $\mathcal{X}_i \subseteq \mathcal{X}$, they cover $\mathcal{X}$, which is to say $\bigcup_i \mathcal{X}_i = \mathcal{X}$, and they are all disjoint, so for any $i \neq j$ we have $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$.
We will also refer to heterogeneous partitions in the feature space, by which we most often mean partitions $\mathcal{X}_i$ such that either $\Pr_{(x,y) \sim \mathcal{D}}(y = 1 \mid x \in \mathcal{X}_i) \gg \Pr(y = 1)$ or $\Pr(y = 1 \mid x \in \mathcal{X}_i) \ll \Pr(y = 1)$.
For a more rigorous definition of overconfidence we borrow the definitions of Bai et al. (2021), where the predicted probability for a given class is generally higher than the true probability. This also leads to the notion of a well-calibrated model, and we give a more rigorous definition in Section 11.1 for completeness. For the most part, we will be considering post-hoc calibration (which we often shorten to calibration), where a post-hoc transformation $\tau$ is applied to the classifier score function to achieve $\tau(v) = \Pr(y = 1 \mid f(x) = v)$ for all $v$. Note that this cannot be equivalently defined as requiring only $\Pr(y = 1 \mid \tau(f(x)) = v) = v$, because that condition could be perfectly achieved by setting $\tau(f(x)) = \Pr(y = 1)$ for all $x$, which would lose all value of the classifier.
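The degenerate constant transformation above can be checked numerically. A minimal sketch (using numpy and scikit-learn on a simulated balanced label vector) shows that a predictor that always outputs the base rate satisfies the weaker conditional-calibration property yet carries no ranking information:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100_000)

# Constant predictor: always output the empirical base rate P(y = 1).
base_rate = y.mean()
preds = np.full(y.shape, base_rate)

# The weak conditional property holds trivially: among examples receiving the
# prediction v = base_rate (i.e. all of them), the fraction of positives is v.
assert abs(y[preds == base_rate].mean() - base_rate) < 1e-12

# ...but the classifier is worthless for ranking: every pair is tied, AUC = 0.5.
assert roc_auc_score(y, preds) == 0.5
```

This is why calibration must be defined relative to the score $f(x)$ rather than as a marginal property of the transformed output alone.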
We focus our rigorous examination of accuracy on the area under the curve (AUC) metric, which we precisely define here. Generally, AUC is considered in terms of the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds. This definition is known to be equivalent to drawing a random positive-label and a random negative-label example and determining the probability that the model ranks the positive example higher. We also show this equivalence in the appendix for completeness.
Definition 2.1 (AUC). For a given classifier score function $f$, along with distributions $\mathcal{D}_0$ and $\mathcal{D}_1$, the AUC can be defined as
$$\mathrm{AUC}(f) = \Pr_{x_1 \sim \mathcal{D}_1,\, x_0 \sim \mathcal{D}_0}\left(f(x_1) > f(x_0)\right).$$
Note that we will often omit the distributions from this notation for simplicity. We also give definitions of related metrics such as TPR, FPR, log-loss, precision/recall, and expected calibration error in the Appendix.
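The pairwise-probability characterization of AUC in Definition 2.1 is easy to verify empirically. The following sketch (hypothetical simulated scores, using numpy and scikit-learn) checks that the pairwise estimate matches the ROC-curve computation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(scores, labels):
    """Empirical AUC: probability a random positive outranks a random
    negative, with ties counted as 1/2 (the Mann-Whitney statistic)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
scores = labels + rng.normal(0, 1.5, size=500)  # noisy label-correlated scores

assert np.isclose(pairwise_auc(scores, labels), roc_auc_score(labels, scores))
```

The quadratic pairwise form is only practical for small samples; `roc_auc_score` computes the same quantity via ranks.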
In this paper, we first develop the intuition as to how overconfident models tend to underutilize heterogeneous partitions of the feature space. Based on this intuition, the main focus of this paper is to develop a framework that can leverage this heterogeneity to improve model generalization. Specifically, given a heterogeneous partition $\mathcal{P}$, how can we transform the score function so as to optimize AUC for binary classification tasks?
3 Intuition for overconfident models underutilizing heterogeneity
In this section we give intuition on why overconfidence due to overparameterization can negatively impact model performance when there is heterogeneity in the data. Note that we consider binary classification for ease of visualization, but the same ideas generalize to multi-class classification, where the output score is a vector. We will set up this intuition by visualizing the distribution of scores for the positive and negative labels. First we give an example of what these distributions might look like on training vs. test data and how they often differ due to overparameterization. Then we consider independently adding a feature with heterogeneity and show how overconfidence leads the model to not properly account for that heterogeneity.
3.1 Overconfident model example
In order to visualize model performance it is common to look at the distributions of the score function with respect to the label. Specifically, we want to empirically plot $\Pr(f(x) = v \mid y = 0)$ and $\Pr(f(x) = v \mid y = 1)$ for all $v$, which is often done by constructing a histogram of the scores with respect to their label. For our toy example, suppose our data is such that labels are balanced, so $\Pr(y = 1) = 1/2$. Further, we will let $N(\mu, \sigma^2)$ denote the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

Generally, the overparameterization of neural networks leads to training data performing significantly better than test data because the model performs some memorization of the training data. Most often this memorization occurs on the data points that are harder to classify, better separating these examples compared to the test data. Visually, this tends to lead to a steeper decline in the respective score distributions for the training data on the hard-to-classify data points. Meanwhile, for the test data the score distributions are much more symmetric because the model has not performed nearly as well on the hard-to-classify data points, leading to more overlap. An example visualization of overconfidence is in Figure 1.

In this example we assume that our classifier score function induces Gaussian score distributions for the negative and positive labels. Further, let $S$ be the training data sample, split into $S_0$ and $S_1$ by negative and positive labels, with the specific training and test score distributions shown in Figure 1.
This type of overconfidence on training data tends to be the root cause of miscalibration. The model does often inherently attempt to optimize calibration, for instance with a log-loss objective, but it is doing so on the training data, where it is overconfident in how well it has separated positive and negative labels, and thus scales up the scores substantially, pushing the associated probabilities closer to 0 or 1. In order to optimize the log-loss on the test data we would need to divide the score function by a factor of about 2, which would also give approximately optimal calibration. Note that the log-loss on the training data is also not optimized, as we assume some form of regularization such as soft labels; otherwise the log-loss would actually be optimized on the training data by scaling the score function up by a factor of about 2. Regardless of how much the score function is scaled up or down, its ordering, and all associated ordering metrics such as AUC or accuracy, will remain unchanged.
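The scale-invariance of AUC in this example can be simulated directly. In the sketch below (illustrative parameters: class-conditional scores $N(\pm 2, 8)$, chosen so that sigmoid(s/2) is the well-calibrated probability while sigmoid(s) is overconfident), dividing the scores by 2 improves log-loss but leaves AUC unchanged:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(1)
n = 50_000
y = rng.integers(0, 2, size=n)
# Overconfident scores: for N(+/-2, 8) class conditionals the true
# log-likelihood ratio is s/2, so sigmoid(s) overstates confidence.
s = np.where(y == 1, rng.normal(2, np.sqrt(8), n), rng.normal(-2, np.sqrt(8), n))

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

# Dividing the score by 2 is a monotone transformation: AUC is unchanged...
assert np.isclose(roc_auc_score(y, s), roc_auc_score(y, s / 2))
# ...but the probabilities change, and log-loss improves substantially.
assert log_loss(y, sigmoid(s / 2)) < log_loss(y, sigmoid(s))
```

This is exactly the "divide by a factor of about 2" correction described above: a temperature-scaling fix to calibration that cannot, by itself, change any ordering metric.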
3.2 Underutilized heterogeneity example
While the overconfidence in our example above only affects the output probabilities and not the ordering, this overconfidence can be detrimental to ordering if we add heterogeneity to the data set. Suppose we add a binary feature to our feature space that is uncorrelated with the other features but is well correlated with the label, so it is heterogeneous. Specifically, if we previously had feature vectors $x \in \mathcal{X}$, we now consider augmented feature vectors $(x, b)$ with $b \in \{0, 1\}$. Further, we assume that $b$ is conditionally independent of the other features given the label, but it does predict the label well: $b = 1$ is substantially more likely when $y = 1$, and $b = 0$ is substantially more likely when $y = 0$.
Assume that we use the same training and test data but with this new heterogeneous feature added to the dataset. Due to the new feature being conditionally independent of the other features, it is reasonable to assume that the score function the model would learn on the training set would (at least approximately) be $f'(x, b) = f(x) + wb + c$ for some optimized $w$ and $c$.
The choice of $w$ determines the relative ordering of the score function when $b = 1$ vs. when $b = 0$, and so the extent to which the model utilizes the heterogeneity of this binary feature ($c$ then simply recenters the score function appropriately). The better the model is performing, the less it needs to use this additional heterogeneity to improve its prediction. It is then important to note that $w$ and $c$ are optimized on the training data, where the model is overconfident in its performance, and as such it will not set $w$ nearly as high as it should for the true distribution.
In particular, on the training data the model will set $w$ and $c$ to optimize cross-entropy, which also maximizes AUC on the training data; on the true data distribution this gives an AUC of about 0.83. Due to the overconfidence on the training data, the model actually set $w$ lower than it should have; if instead it had set $w$ and $c$ optimally for the true distribution, then the AUC would increase to about 0.85 on the true data distribution, and the log-loss along with other accuracy metrics would also improve.
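This effect can be simulated with illustrative parameters (not the example's exact numbers): Gaussian base scores plus a binary feature with $\Pr(b = 1 \mid y = 1) = 0.8$ and $\Pr(b = 1 \mid y = 0) = 0.2$. Underweighting the feature costs AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 100_000
y = rng.integers(0, 2, size=n)
# Base score from the original features, as in the Gaussian example.
base = np.where(y == 1, rng.normal(2, np.sqrt(8), n), rng.normal(-2, np.sqrt(8), n))
# Heterogeneous binary feature: conditionally independent of the base score
# given the label, but predictive of it (parameters are illustrative).
b = rng.binomial(1, np.where(y == 1, 0.8, 0.2))

def auc_with_weight(w):
    """AUC of the combined score f'(x, b) = base + w * b."""
    return roc_auc_score(y, base + w * b)

# An overconfident model underweights the feature (small w), leaving AUC
# on the table relative to a properly weighted score.
assert auc_with_weight(3.0) > auc_with_weight(0.5) > auc_with_weight(0.0)
```

For these parameters the weight that the true distribution warrants is near $2 \log 4 \approx 2.8$ (the feature's log-likelihood-ratio contribution in score units), so a model that learned a much smaller $w$ on overconfident training data ranks the two sides of the split too similarly.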
3.3 General Discussion
Our example above illustrates the more general concept of how neural networks can underutilize simple patterns in the data because they are overconfident in their ability to fit the data. This point is generally understood as a potential pitfall of neural networks, but we focus specifically on how they fail to appropriately utilize heterogeneity.
In particular, the bias term in the output layer can be viewed as a centering term for the score function that optimally accounts for the balance of positive vs. negative labels. This bias term will not affect the overall ordering, but the neural network can also make these centering decisions at a more fine-grained level; in our example above we considered splitting the data just once. Especially if the internal nodes use a ReLU activation function, it would be quite simple for a neural network to construct internal variables that represent simple partitions of the data, reminiscent of the partitions similarly defined by decision trees. This can then lead to relative orderings between partitions that are inappropriate, because the model centered the partitions according to the training data on which it was overconfident.
In our example we assumed that the new feature was conditionally independent, and thus the appropriate fix was simply shifting each partition. With more intricate dependence we would expect the score distributions on each side of the split to differ more significantly than being identical up to a bias term, so ordering the partitions correctly relative to one another becomes a more complex task. In Section 4 we show that the optimal way of ordering these partitions relative to each other is actually equivalent to optimally calibrating each partition.
4 Calibration of partitions to optimize AUC
In Section 3 we provided intuition regarding overconfident models underutilizing heterogeneity. In this section we assume that such a heterogeneous partitioning has been identified and provide the theoretical framework for optimally applying a post-hoc transformation to maximize AUC.
4.1 Optimal AUC calibration
We first consider applying a post-hoc transformation to the classifier score function, in the same way as standard calibration, and define the corresponding AUC measurement.
Definition 4.1 (Calibrated AUC). For a given classifier score function $f$, along with distributions $\mathcal{D}_0$ and $\mathcal{D}_1$, and a transformation function $\tau : \mathbb{R} \to \mathbb{R}$, we define calibrated AUC as
$$\mathrm{AUC}(f, \tau) = \Pr_{x_1 \sim \mathcal{D}_1,\, x_0 \sim \mathcal{D}_0}\left(\tau(f(x_1)) > \tau(f(x_0))\right).$$
Note that when $\tau$ is the identity function, or any isotonic function, this is equivalent to standard AUC. Further note that $\mathrm{AUC}(f, \tau)$ is equivalent to $\mathrm{AUC}(\tau \circ f)$, but this notation will be easier to work with in our proofs, for which we give intuition here and full proofs in the appendix.
It is then natural to consider the optimal transformation function $\tau$ to maximize AUC given our classifier score function and data distribution.
Lemma 4.1 (Informal). Given a classifier score function $f$ and any distributions $\mathcal{D}_0, \mathcal{D}_1$, we can maximize $\mathrm{AUC}(f, \tau)$ with respect to $\tau$ by using the likelihood ratio $\tau^*(v) = q_1(v) / q_0(v)$ as our transformation function, where $q_0$ and $q_1$ denote the densities of the score $f(x)$ under $\mathcal{D}_0$ and $\mathcal{D}_1$ respectively.
For the purposes of maximizing AUC, only the ordering imposed by $\tau$ is relevant, and intuitively the likelihood ratio gives the highest ordering to outputs that maximize the true positive rate (TPR) and minimize the false positive rate (FPR), thereby maximizing AUC. Furthermore, we also show that for any FPR the corresponding TPR is maximized by the likelihood ratio transformation, which implies that the ROC curve of any other transformation is contained within the ROC curve of the likelihood ratio transformation. As a corollary, this implies that for any recall the corresponding precision is maximized, and furthermore that the PR-AUC is maximized by the likelihood ratio transformation. While these claims are intuitively reasonable, they require more involved proofs, which we give in the appendix.
We then show that the ordering from this likelihood ratio is equivalent to the ordering from the optimal calibration, which by definition sets $\tau(v) = \Pr(y = 1 \mid f(x) = v)$.
Lemma 4.2 (Informal). The likelihood ratio and the optimal calibration give an equivalent ordering: for any $v, v'$ we have $\tau^*(v) > \tau^*(v')$ if and only if $\Pr(y = 1 \mid f(x) = v) > \Pr(y = 1 \mid f(x) = v')$.
Due to the fact that AUC is invariant under equivalent orderings, calibration on the full dataset will also optimize AUC and other associated metrics. This connection allows us to simply apply standard techniques from the calibration literature to optimize AUC. However, we expect this effect to be minimal even when the model is overconfident, because full-dataset calibration mostly corrects the overconfident probability estimates without changing the ordering.
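As a concrete check of Lemmas 4.1 and 4.2: for assumed class-conditional score distributions $N(\pm 2, 8)$ with balanced classes, the likelihood ratio works out to $\exp(s/2)$ and the optimal calibration to sigmoid$(s/2)$. Both are monotone in $s$, so full-data calibration leaves AUC unchanged:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 50_000
y = rng.integers(0, 2, size=n)
s = np.where(y == 1, rng.normal(2, np.sqrt(8), n), rng.normal(-2, np.sqrt(8), n))

# For scores distributed N(+/-2, 8) by class, the likelihood ratio of the
# score densities is exp(s/2), and with balanced classes the perfectly
# calibrated probability is sigmoid(s/2). Both are strictly increasing in s,
# so all three score orderings agree and AUC is identical.
lr = np.exp(s / 2)
calibrated = 1 / (1 + np.exp(-s / 2))

auc = roc_auc_score(y, s)
assert np.isclose(auc, roc_auc_score(y, lr))
assert np.isclose(auc, roc_auc_score(y, calibrated))
```

The interesting gains therefore only appear once we calibrate partitions separately, where the transformation is no longer globally monotone in the raw score.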
4.2 Optimal partitioned AUC calibration
While calibration on the full dataset may not generally affect ordering and thus AUC, recall that Section 3 identified the issue of overconfidence negatively affecting the relative ordering between heterogeneous partitions of the data. In order to reorder these partitions appropriately, we extend our optimal post-hoc transformation to apply separately to each partition such that it provably maximizes overall AUC.
Definition 4.2 (Partition Calibrated AUC). For a given classifier score function $f$, distributions $\mathcal{D}_0$ and $\mathcal{D}_1$, and a partition $\mathcal{P} = \{\mathcal{X}_1, \ldots, \mathcal{X}_k\}$ of $\mathcal{X}$, along with transformation functions $\tau_1, \ldots, \tau_k$, we define partition calibrated AUC as
$$\mathrm{AUC}(f, \mathcal{P}, \{\tau_i\}) = \Pr_{x_1 \sim \mathcal{D}_1,\, x_0 \sim \mathcal{D}_0}\left(\tau_{i(x_1)}(f(x_1)) > \tau_{i(x_0)}(f(x_0))\right),$$
where $i(x)$ denotes the index of the partition containing $x$.
Once again, if $\tau_i = \tau$ for all $i$ then this is equivalent to $\mathrm{AUC}(f, \tau)$. Furthermore, we could have equivalently defined this as $\mathrm{AUC}(g)$ for the composed score function $g(x) = \tau_{i(x)}(f(x))$, but this notation will be easier to work with in our proofs. For this definition we will also show that AUC is maximized by using the likelihood ratio.
Lemma 4.3 (Informal). Given a classifier score function $f$, distributions $\mathcal{D}_0$ and $\mathcal{D}_1$, and a partition $\mathcal{P}$ of $\mathcal{X}$, we can maximize $\mathrm{AUC}(f, \mathcal{P}, \{\tau_i\})$ by using the per-partition likelihood ratios $\tau_i^*(v) = q_1^{(i)}(v) / q_0^{(i)}(v)$, where $q_0^{(i)}$ and $q_1^{(i)}$ denote the class-conditional densities of the score restricted to $\mathcal{X}_i$.
Note that we could set $\tau_i = \tau^*$ for all $i$, so partitioning can only improve (or keep equal) the AUC, and this holds for any arbitrary partition. Additionally, this likelihood ratio gives the same ordering as the optimal calibration for each partition, which for a given $v$ would set $\tau_i(v) = \Pr(y = 1 \mid f(x) = v, x \in \mathcal{X}_i)$.
Lemma 4.4 (Informal). The likelihood ratio and the optimal calibration probability give an equivalent ordering: for any $v, v'$ and $i, j$ we have $\tau_i^*(v) > \tau_j^*(v')$ if and only if $\Pr(y = 1 \mid f(x) = v, x \in \mathcal{X}_i) > \Pr(y = 1 \mid f(x) = v', x \in \mathcal{X}_j)$.
Therefore, by optimally calibrating each partition we can equivalently maximize overall AUC. If this partitioning is taken to the extreme, then this calibration intuitively just recovers the optimal model.
Corollary 4.1 (Informal). If $\mathcal{P}$ is the full partitioning of $\mathcal{X}$, which is to say $\mathcal{X}_i = \{x_i\}$ for all $i$, then the optimal partition calibrated transformation is equivalent to the Bayes-optimal prediction, where $\tau_i(v) = \Pr(y = 1 \mid x = x_i)$ for every $v$.
However, running post-processing to accomplish the same task as the model training is both redundant and infeasible to do accurately in this way. It is then necessary to balance partitioning the feature space against maintaining enough data to accurately calibrate each partition. Furthermore, from the intuition we gave before, we are not simply doing this partitioning in hope of improvement because of the mathematical guarantee, but because overconfidence means that our model may not have accounted for specific partitions appropriately.
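A small simulation illustrates why per-partition calibration changes the ordering where full-data calibration cannot. Below (with illustrative parameters), an overconfident model shifts a heterogeneous partition by only +0.5 where a much larger shift is warranted; Platt-scaling each partition separately on held-out data recovers most of the lost AUC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 60_000
y = rng.integers(0, 2, size=n)
# Heterogeneous partition indicator: positives concentrate in partition 1.
part = rng.binomial(1, np.where(y == 1, 0.8, 0.2))
# The model's score underweights the partition (+0.5 instead of ~+2.8,
# the optimal shift for these parameters).
s = np.where(y == 1, rng.normal(1, 2, n), rng.normal(-1, 2, n)) + 0.5 * part

half = n // 2  # first half calibrates, second half evaluates
probs = np.empty(n - half)
for p in (0, 1):
    fit_mask, eval_mask = part[:half] == p, part[half:] == p
    # Platt scaling (logistic regression on the score) within partition p.
    platt = LogisticRegression().fit(
        s[:half][fit_mask].reshape(-1, 1), y[:half][fit_mask])
    probs[eval_mask] = platt.predict_proba(
        s[half:][eval_mask].reshape(-1, 1))[:, 1]

# Separately calibrating each partition improves held-out AUC: the learned
# intercepts re-center the partitions relative to each other.
assert roc_auc_score(y[half:], probs) > roc_auc_score(y[half:], s[half:])
```

The per-partition intercepts are what do the work here: a single global Platt scaling would apply one monotone map to all scores and leave the AUC exactly where it started.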
The granularity to which the feature space can be partitioned while still maintaining accuracy has also been studied in the fairness literature, giving bounds on the sample complexity for multicalibration Shabat et al. (2020), and there has also been work on estimating calibration of higher moments for multicalibration Jung et al. (2021). The sample complexity results are agnostic to the calibration technique, but for a more practical application the extent of the partitioning should depend on which calibration technique is applied. For example, histogram binning essentially estimates the full score distributions and will require more samples to keep empirical error low. In contrast, Platt scaling is just logistic regression on one variable and thus requires fewer samples to get accurate parameters for the calibration.

Additionally, the extent of the partitioning also depends upon whether we use a random forest for our partitioning scheme and take the average calibration over all the trees. In the same way that the trees can have greater depth because of the ensemble nature of a random forest, we could take advantage of the same ensemble structure to partition more finely. We could also apply tree pruning techniques via cross-validation Kohavi and others (1995) to determine the ideal level of partitioning.

5 Heterogeneous Calibration Framework
Combining the intuition in Section 3, whereby overconfident models underutilize heterogeneous partitions, with the theoretical optimality in Section 4 of calibrating each partition separately to maximize AUC, immediately yields the general heterogeneous calibration framework:


Partition the feature space for maximal heterogeneity

Calibrate each partition of the feature space separately
We give an explicit implementation of this framework in Section 5.1, but the flexibility of this paradigm allows for many possible implementations. In particular, there is a multitude of post-hoc calibration techniques from the literature that could be applied Platt and others (1999); Zadrozny and Elkan (2002, 2001); Naeini et al. (2015); Kumar et al. (2019); Kull et al. (2019). Furthermore, splitting the feature space with a decision tree, which greedily maximizes heterogeneity, is the most obvious choice, but we could also use a random forest here by repeating the partitioning and calibration multiple times and outputting the average across the trees. We could also utilize boosted trees, which give a sequence of partitions, and then sequentially apply calibration such that the final transformation is a nested composition of calibrations, one for each partitioning. We further sketch out how this could work for boosted trees in the appendix (Section 13), but leave a more thorough examination to future work. Additionally, we could construct decision trees that greedily split the feature space to more directly optimize AUC, which we discuss in Section 12.
We note that this framework can easily be applied to multi-class classification, with many tree-based partitioning schemes and calibration techniques being extendable to the multi-class setting. We also focus upon tabular data and recommender systems because heterogeneity is much more common in these settings, but this framework could be extended to image classification and natural language processing. In particular, the partitioning of the feature space could be identified by applying a decision tree to the neurons of an internal layer of the neural network, which are often considered to represent more general patterns and thus have more heterogeneity.
In order to best show that the underlying theory of our general framework holds in practice, we focus on the simplest instantiation and leave the application of higher-performing tree-based partitioning schemes and more effective calibration techniques to future work.
5.1 Example Implementation
To exemplify our framework, we give a simple instantiation here, which will also be used in our experiments: a decision tree classifier identifies the partitioning, and logistic regression is used for calibration, which is Platt scaling.
We assume that the model is a DNN trained on the training data, and that the model with the highest accuracy on the validation data is chosen, but this assumption is not necessary to apply the algorithm. Our heterogeneous calibration procedure can use the same training and validation data. However, by choosing the model with peak accuracy on the validation data, the model is likely slightly overconfident on the validation data (although much less so than on the training data), and using fresh data for the calibration would be preferable.
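A minimal sketch of this instantiation is below (scikit-learn names; the tree depth, minimum leaf size, and pure-leaf fallback are illustrative choices, not prescribed by the framework):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression


class HeterogeneousCalibrator:
    """Sketch: a decision tree identifies heterogeneous partitions (its
    leaves), then each partition's scores are Platt-scaled separately."""

    def __init__(self, max_leaf_nodes=8, min_samples_leaf=500):
        self.tree = DecisionTreeClassifier(
            max_leaf_nodes=max_leaf_nodes, min_samples_leaf=min_samples_leaf)
        self.calibrators = {}

    def fit(self, X, scores, y):
        # Greedy splits on label impurity find heterogeneous partitions.
        self.tree.fit(X, y)
        leaves = self.tree.apply(X)
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            if len(np.unique(y[mask])) < 2:
                continue  # pure leaf: fall back to the raw score at predict time
            # Platt scaling: logistic regression on the single score variable.
            self.calibrators[leaf] = LogisticRegression().fit(
                scores[mask].reshape(-1, 1), y[mask])
        return self

    def predict_proba(self, X, scores):
        out = 1 / (1 + np.exp(-scores))  # fallback: the DNN's own probability
        leaves = self.tree.apply(X)
        for leaf, platt in self.calibrators.items():
            mask = leaves == leaf
            if mask.any():
                out[mask] = platt.predict_proba(
                    scores[mask].reshape(-1, 1))[:, 1]
        return out
```

`fit` should be run on the calibration split (not the data the DNN may have memorized), with `scores` being the frozen DNN's pre-sigmoid outputs; `predict_proba` then combines the DNN scores with the per-leaf calibrators at inference time.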
This framework will be most effectively applied to realworld use cases under three general conditions:


The model should have some degree of overconfidence, in the same way that post-hoc calibration techniques give little additional value to well-calibrated models

There should be an algorithmically identifiable partitioning of the feature space with a reasonable amount of heterogeneity

There should be sufficient data outside of the training data to accurately perform calibration on each partition
5.2 Interpolation between DNNs and treebased algorithms
In this section we further discuss how our heterogeneous calibration framework gives a natural interpolation between DNNs and tree-based algorithms through the use of calibration. In particular, we show how this framework can equivalently be viewed as an optimal ensemble of any given DNN and decision tree. Furthermore, we discuss how this extends to any tree-based algorithm.
We begin by reconsidering Algorithm 1, whereby we could equivalently assume that we have learned a classifier score function $f$ from a DNN, and also independently learned a partitioning $\mathcal{P}$ through a decision tree classifier on the training data. Therefore, we have two separate binary classification prediction models for a given feature vector $x$. Our DNN will give the probability prediction $\sigma(f(x))$, where $\sigma$ is the sigmoid function. Our decision tree classifier will identify the partition $\mathcal{X}_i$ such that $x \in \mathcal{X}_i$ and return the probability prediction $\Pr(y = 1 \mid x \in \mathcal{X}_i)$.
Next we consider the logistic regression from Algorithm 1, which is performed on each partition $\mathcal{X}_i$ and learns a function over the DNN score. Our heterogeneous calibration combines the DNN and the partitioning from the decision tree, $\mathcal{P}$, such that for any feature vector $x \in \mathcal{X}_i$ it outputs the probability prediction $\sigma(a_i f(x) + c_i)$, where $a_i$ and $c_i$ are the Platt scaling parameters learned for partition $\mathcal{X}_i$. Note that if the logistic regression learns $a_i = 1$ and $c_i = 0$ for all partitions, then the new model is identical to the original DNN. Similarly, if the logistic regression learns $a_i = 0$ and $c_i$ equal to the logit of the leaf's positive rate for all partitions, then the new model is equivalent to the original decision tree. Accordingly, the calibration can be seen as an interpolation between the DNN and decision tree models.
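Assuming the per-partition calibrator has the Platt form $\sigma(a \cdot s + c)$ (as in our instantiation), the two endpoints of this interpolation can be checked directly on hypothetical scores:

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def logit(p):
    return np.log(p / (1 - p))

s = np.array([-1.5, 0.2, 3.0])  # hypothetical DNN scores in one tree leaf
leaf_rate = 0.7                 # positive-label rate of that leaf

def platt(a, c):
    return sigmoid(a * s + c)   # per-partition calibrator

# a = 1, c = 0 recovers the DNN's probabilities exactly...
assert np.allclose(platt(1.0, 0.0), sigmoid(s))
# ...while a = 0, c = logit(leaf rate) recovers the tree's leaf prediction.
assert np.allclose(platt(0.0, logit(leaf_rate)), leaf_rate)
```

Intermediate $(a_i, c_i)$ values mix the two models, and the fitted values are exactly what the per-partition logistic regression chooses from the calibration data.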
From our optimality results in Section 4, we further know that perfect calibration will actually optimize the ensemble of these two models. Essentially, the calibration implicitly picks and chooses which strengths of each model to use in order to combine them to maximize AUC. The natural interpolation of calibration between models extends equivalently to other tree-based algorithms such as random forests and boosted trees (further detail in Section 13). Extending the optimality to these settings should also follow similarly, and we leave it to future work. Accordingly, our heterogeneous calibration framework can be equivalently viewed as a way to optimally combine independently trained DNNs and tree-based algorithms in a post-hoc manner. While this may theoretically guarantee an optimal combination, it is again important to note that the extent of partitioning and the intricacy of calibration must be balanced against the corresponding empirical error for our framework to be effectively applied in practice.
6 Experiments
Table 1: Test AUC of the best DNN variants and of heterogeneous calibration (HC) across model sizes and datasets, with relative lift in parentheses.

Size  Model         Bank Marketing    Census data       Credit Default    Higgs             Diabetes
S     Top 3 DNN     0.7758            0.8976            0.7784            0.7801            0.6915
S     Top 3 HC      0.7816 (+0.76%)   0.9021 (+0.50%)   0.7798 (+0.18%)   0.7816 (+0.19%)   0.6937 (+0.32%)
S     Top 50% DNN   0.7736            0.8892            0.7771            0.7650            0.6799
S     Top 50% HC    0.7810 (+0.96%)   0.9004 (+1.27%)   0.7789 (+0.23%)   0.7692 (+0.54%)   0.6879 (+1.18%)
M     Top 3 DNN     0.7712            0.8978            0.7787            0.7773            0.6744
M     Top 3 HC      0.7800 (+1.14%)   0.9027 (+0.55%)   0.7794 (+0.09%)   0.7799 (+0.33%)   0.6856 (+1.66%)
M     Top 50% DNN   0.7690            0.8858            0.7775            0.7617            0.6683
M     Top 50% HC    0.7793 (+1.34%)   0.9009 (+1.70%)   0.7790 (+0.20%)   0.7680 (+0.83%)   0.6841 (+2.37%)
L     Top 3 DNN     0.7716            0.9007            0.7783            0.7747            0.6679
L     Top 3 HC      0.7814 (+1.27%)   0.9027 (+0.22%)   0.7794 (+0.14%)   0.7775 (+0.36%)   0.6824 (+2.17%)
L     Top 50% DNN   0.7663            0.8800            0.7772            0.7596            0.6637
L     Top 50% HC    0.7779 (+1.52%)   0.9010 (+2.38%)   0.7789 (+0.23%)   0.7666 (+0.92%)   0.6824 (+2.82%)
Table 2: Test AUC of regularized and unregularized small MLPs, with and without heterogeneous calibration (HC); relative lift in parentheses.

Model             Bank Marketing    Census data       Credit Default    Higgs data        Diabetes
Top 3 Reg DNN     0.7758            0.8976            0.7781            0.7801            0.6693
Top 3 Reg HC      0.7816 (+0.76%)   0.9021 (+0.50%)   0.7793 (+0.16%)   0.7816 (+0.19%)   0.6829 (+2.04%)
Top 3 Unreg DNN   0.7735            0.8773            0.7768            0.7498            0.6915
Top 3 Unreg HC    0.7804 (+0.88%)   0.8985 (+2.42%)   0.7787 (+0.25%)   0.7588 (+1.20%)   0.6937 (+0.32%)
We evaluate the efficacy of heterogeneous calibration on the task of binary classification with deep neural networks on a variety of datasets, and make observations about the effect of model size, regularization, and training set size on the effectiveness of the technique. All experiments were conducted using TensorFlow Abadi et al. (2015).

Datasets: We use datasets containing a varying number of data points and types of features. For each dataset, we create training, validation (for tuning neural networks), calibration (for training post-hoc calibration models), and test splits. Specifically, we use the following 5 datasets:

Bank marketing Moro et al. (2014): marketing campaign data to predict client subscriptions.

Census Income Kohavi and others (1996): data to predict whether income exceeds a threshold.

Credit Default Yeh and Lien (2009): data to predict credit card default.

Higgs Baldi et al. (2014): data to distinguish between a Higgs-boson-producing signal process and a background process; we chose a subset of the entire set.

Diabetes Strack et al. (2014): data about readmission outcomes for diabetic patients.
Further details about the datasets, including features, splits and preprocessing information, can be found in Appendix 14.1.
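The four-way split above can be sketched as follows. This is a minimal illustration: the 60/10/15/15 proportions and the helper name `four_way_split` are our own assumptions, not the paper's exact splits (those are described in its Appendix 14.1).

```python
import numpy as np

def four_way_split(n, fractions=(0.6, 0.1, 0.15, 0.15), seed=0):
    """Return index arrays for train / validation / calibration / test
    splits of a dataset with n rows (illustrative proportions)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    # Cumulative cut points for the first three splits; the remainder
    # becomes the test set.
    cuts = np.cumsum([int(round(f * n)) for f in fractions[:-1]])
    return np.split(idx, cuts)
```

The calibration split is held out from both model training and hyperparameter tuning, so the post-hoc calibration model never sees data used to fit the DNN.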
Modeling details:
We use multilayer perceptrons with 3 feedforward layers. To understand the effect of model size and model regularization on calibration performance, we vary the number of neurons in each MLP layer and also toggle regularization techniques like batch normalization
Ioffe and Szegedy (2015) and dropout Srivastava et al. (2014). Specifically, we choose 3 MLP sizes based on the number of parameters in each. We use the Adam optimizer Kingma and Ba (2014) and extensively tune the learning rate on a log-scaled grid for each variant, since even adaptive optimizers can benefit from learning rate tuning Loshchilov and Hutter (2017). Complete details about the MLP variants, regularization techniques, and tuning of the learning rate can be found in Appendix 14.2.

For heterogeneous calibration, we train a decision tree classifier on the training set to partition the feature space and subsequently use Platt scaling Platt and others (1999) on the calibration dataset for each partition. We lightly tune the tree hyperparameters; details are in Appendix 14.3. Note that more extensive tuning, or a different partitioning or calibration algorithm, could lead to further improvements for our method.

6.1 Main results
Table 1 displays the main results. We choose 3 MLP sizes based on the number of parameters and label them small (S), medium (M), and large (L). For each MLP size, we choose the top 3 and top 50% variants after extensive tuning of the learning rate and regularization, and report the mean of 5 runs.
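The relative lifts reported in parentheses in Table 1 are straightforward to reproduce; a one-line sketch (the function name is ours):

```python
def relative_lift(base_auc, hc_auc):
    """Relative AUC lift of heterogeneous calibration over the base
    DNN, in percent, as reported in parentheses in Tables 1 and 2."""
    return 100.0 * (hc_auc - base_auc) / base_auc

# e.g. the small-model Bank Marketing entry: 0.7758 -> 0.7816 is a lift
# of roughly 0.75% (the table reports +0.76%, computed from unrounded AUCs).
lift = relative_lift(0.7758, 0.7816)
```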
We note that our method provides a consistent lift in AUC across all model sizes and datasets, despite the use of a simple calibration model. This meshes well with our hypothesis that modern neural networks, despite regularization, are overconfident and calibration can be used as a simple posthoc technique to improve generalization performance.
Figure 3 contains box plots of the test AUC lift provided by our method for 2 datasets. The plots contain lifts from 5 different runs of the top 3 models for each setting. We observe a consistent lift in AUC across various runs and hyperparameter settings, demonstrating the consistency of our method. We include box plots for other datasets in Appendix 14.4.
Effect of model size. From Table 1, we note that as we go to larger models, the lift in performance for our method consistently increases for all datasets. This corroborates our hypothesis and intuition that larger models can be more overconfident, and hence may benefit more from our method.
6.2 Effect of model regularization
Table 2 contains the results of the effect of heterogeneous calibration on regularized (using dropout or batch normalization) and unregularized MLPs of small size. Unsurprisingly, our method provides a larger relative lift in performance for unregularized DNNs compared to regularized DNNs. This fits well with our hypothesis that unregularized networks are highly overconfident and may benefit from methods such as ours.
6.3 Computational efficiency
We note that hyperparameter tuning is critical for improving generalization performance, and model performance varies widely with the choice of hyperparameters. For our experiments, we tuned the learning rate. Interestingly, our method exhibits much tighter variance in AUC across a large range of learning rates than an uncalibrated network. This was particularly notable on the Census data, where our technique maintained high performance even when the uncalibrated network's performance dipped. This robustness may reduce the need for extensive hyperparameter tuning.
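A log-scaled learning-rate grid of the kind used for this tuning, and the spread statistic behind the robustness observation, can be sketched as follows; the grid bounds and the `auc_spread` helper are illustrative assumptions (the paper's exact grid is in its Appendix 14.2).

```python
import numpy as np

# Illustrative log-scaled learning-rate grid: 1e-5, ..., 1e-1.
learning_rates = np.logspace(-5, -1, num=9)

def auc_spread(aucs):
    """Max-minus-min test AUC across the grid; a smaller spread means
    the method is less sensitive to the learning rate choice."""
    aucs = np.asarray(aucs)
    return float(aucs.max() - aucs.min())
```

Comparing `auc_spread` for the calibrated and uncalibrated runs over the same grid is one simple way to quantify the tighter variance reported above.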
7 Discussions
In this paper we developed the framework of heterogeneous calibration, which utilizes data heterogeneity and post-hoc calibration techniques to improve model generalization for overconfident models. We theoretically proved that the calibration transformation is optimal in improving AUC. To show its efficacy in practice, we focused on the simplest instantiation, but the framework can naturally accommodate combinations of known higher-performing techniques for both the partitioning and the calibration. We believe further investigation into these applications of the framework would be an interesting and fruitful future direction now that we have established the efficacy of our heterogeneous calibration paradigm.
We further showed that our framework equivalently uses calibration to optimally combine a DNN and a decision tree as a post-hoc ensemble method. This should extend to other tree-based algorithms in the same manner, but a more rigorous examination would be an interesting future direction. This investigation could also include a more thorough characterization of when the AUC increases most for this optimal combination of DNNs and tree-based algorithms, which could in turn be used to determine how to train DNNs to focus on learning patterns that are not identifiable through tree-based algorithms, and then to utilize the heterogeneous calibration framework to achieve a higher-performing combination.
Our experiments also showed much more consistent high performance of the model with heterogeneous calibration applied as we searched through the hyperparameters. We think another interesting future direction would be to further investigate the extent to which heterogeneous calibration can serve as a replacement for hyperparameter tuning.
8 Acknowledgements
We thank our colleagues Joojay Huyn, Varun Mithal, Preetam Nandy, Jun Shi, and Ye Tu for their helpful feedback and illuminating discussions.
References

Abadi et al. (2015). TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Bai et al. (2021). Don't just blame over-parametrization for over-confidence: theoretical analysis of calibration in binary classification. International Conference on Machine Learning.

Baldi et al. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5(1), pp. 1–9.

Bubeck and Sellke (2021). A universal law of robustness via isoperimetry. International Conference on Machine Learning.

Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.

Gal and Ghahramani (2016). Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML'16, pp. 1050–1059.

Guo et al. (2017). On calibration of modern neural networks. In ICML'17, pp. 1321–1330.

He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.

Hébert-Johnson et al. (2018). Multicalibration: calibration for the (computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 1939–1948.

Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.

Jung et al. (2021). Moment multicalibration for uncertainty estimation. In Proceedings of the Thirty-Fourth Conference on Learning Theory, PMLR 134, pp. 2634–2678.

Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kohavi (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, Vol. 14, pp. 1137–1145.

Kohavi (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In KDD, Vol. 96, pp. 202–207.

Kull et al. (2019). Beyond temperature scaling: obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems (NeurIPS'19).

Kumar et al. (2019). Verified uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS'19).

Lakshminarayanan et al. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems.

Loshchilov and Hutter (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Moro et al. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62, pp. 22–31.

Mukhoti et al. (2020). Calibrating deep neural networks using focal loss. In Advances in Neural Information Processing Systems.

Naeini et al. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Nair and Hinton (2010). Rectified linear units improve restricted Boltzmann machines. In ICML.

Naumov et al. (2019). Deep learning recommendation model for personalization and recommendation systems. CoRR abs/1906.00091.

Platt et al. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10(3), pp. 61–74.

Shabat et al. (2020). Sample complexity of uniform convergence for multicalibration. In Advances in Neural Information Processing Systems.

Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1), pp. 1929–1958.

Strack et al. (2014). Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International 2014.

Vaswani et al. (2017). Attention is all you need. In NIPS'17, Red Hook, NY, USA, pp. 6000–6010.

Yeh and Lien (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36(2), pp. 2473–2480.

Zadrozny and Elkan (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, Vol. 1, pp. 609–616.

Zadrozny and Elkan (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 694–699.
9 Partition Calibrated AUC Proofs
In this section we provide the missing proofs of the informal Lemmas 4.1, 4.2, 4.3, and 4.4. Note that Lemmas 4.3 and 4.4 are the more general cases of the former two, respectively, so we will only prove these.
We will copy the definition of partition calibrated AUC here for reference.
Definition 9.1.
[Partition Calibrated AUC] For a given classifier score function , and distributions and , along with a partition of , and a transformation function , we define partition calibrated AUC as
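The displayed formula for this definition did not survive extraction. One plausible form, consistent with the standard reading of AUC as the probability that a positive example outscores a negative one, is the following; the symbols here (score function $f$, class-conditional distributions $\mu^{+}$ and $\mu^{-}$, partition index $\pi(x)$, and per-partition transformation $h = (h_1, \dots, h_k)$) are our own notation, not necessarily the paper's:

```latex
\mathrm{AUC}_{h}(f) \;=\;
\Pr_{x^{+} \sim \mu^{+},\; x^{-} \sim \mu^{-}}
\Big[\, h_{\pi(x^{+})}\big(f(x^{+})\big) \;>\; h_{\pi(x^{-})}\big(f(x^{-})\big) \,\Big]
```

Setting every $h_i$ to the identity recovers the ordinary AUC of the score function.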
Our formalized version Lemma 4.3 can then be stated as such.
Lemma 9.1.
For a given classifier score function , and distributions and , along with a partition of , let be the transformation
and if then we let .
For any function we have
Note that we slightly deviate from the likelihood ratio in Lemma 4.3 to avoid divide-by-zero concerns, but the optimal transformation in the lemma gives an essentially equivalent ordering to the likelihood ratio.
The notation will become too onerous here so we will let and similarly for other subscripts. We will further let
Proof.
We rearrange the integrals to pair together an and which allows us to rewrite the AUC as
We could then further consider each pair and and rewrite the AUC as
The first integral does not change regardless of the choice of . From Lemma 9.3 we have that the inequality is tight in Lemma 9.2 for all and when , and therefore the second integral is maximized with .
∎
We utilize the following helper lemma, which considers two pairs of scores and partitions and gives an upper bound on their sum for both possible orderings.
Lemma 9.2.
For any and we have
Proof.
This follows immediately from the fact that
and both terms are nonnegative.
∎
We also utilize another helper lemma that shows an equivalent ordering for our considered optimal transformation function with respect to pairs of scores and partitions.
Lemma 9.3.
For any and , if then
Proof.
If then we must have and , because otherwise we would contradict our assumed inequality.
By adding to both sides, our assumed inequality can then be equivalently written as
and dividing each side gives the desired inequality.
∎
9.1 Ordering equivalence
We further show that the ordering induced by the optimal transformation is equivalent to the ordering obtained when each partition is perfectly calibrated.
We keep the notation for the lemma statement the same but will switch to shorthand for the proof where we let , , and .
Lemma 9.4.
Given distributions and our classifier score function and a partition of . For any and where and , then
if and only if
Proof.
By the definition of conditional probability we have
Plugging this in to the first inequality in our if and only if statement, we then cross multiply and cancel like terms to get
By our definitions we have which then implies and applying this and cancelling gives
Furthermore by taking the second inequality in our desired if and only if statement, then cross multiplying and cancelling like terms we equivalently get
∎
10 Calibrated FPR and TPR Proofs
In this section we show that the same transformation function that optimizes calibrated AUC also optimizes TPR with respect to FPR. In particular, we show that the ROC curve for the optimal transformation function always contains the ROC curve of any other transformation function. As a corollary, we obtain that precision with respect to recall is also optimized and thus the PR-AUC is maximized. We begin by defining calibrated TPR and FPR.
Definition 10.1.
[Calibrated TPR] For a given classifier score function , and distribution , along with a transformation function , and some and , we define calibrated TPR as
Definition 10.2.
[Calibrated FPR] Defined identically to calibrated TPR, but using the negative-class distribution.
The extra tie-breaking value is necessary here because if the transformed score distribution is not continuous, a given TPR or FPR level may not be attained exactly at any threshold, and we want our statements to generalize over all probability distributions and classifier score functions. Note that when the transformation is the identity function, these reduce to the standard definitions of TPR and FPR. As is well known, AUC can be equivalently defined using TPR and FPR; we prove the calibrated version here as well, which follows equivalently.
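The displayed formulas for Definitions 10.1 and 10.2 were lost in extraction. One plausible reconstruction, using the same hypothetical notation as above ($f$, $\mu^{+}$, $\mu^{-}$, $\pi$, $h$) plus a tie-breaking weight $\gamma$ corresponding to the extra value the text discusses, is:

```latex
\mathrm{TPR}_{h,\gamma}(t) \;=\;
\Pr_{x \sim \mu^{+}}\big[\, h_{\pi(x)}(f(x)) > t \,\big]
\;+\; \gamma \cdot \Pr_{x \sim \mu^{+}}\big[\, h_{\pi(x)}(f(x)) = t \,\big],
\qquad \gamma \in [0,1],
```

with calibrated FPR defined identically under $\mu^{-}$. The $\gamma$ term handles point masses of the transformed score, so every point on the ROC curve is attainable even for discontinuous distributions.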
Lemma 10.1.
For a given classifier score function , along with distributions and , and a transformation function