1 Introduction
Current decision-making systems are often supported by machine learning agents that are utility maximizers. To make reliable decisions, it is essential that machine learning models make accurate probabilistic predictions. Unfortunately, it has been observed that many existing machine learning methods, especially deep learning methods, can yield poorly calibrated probabilistic predictions
[9, 2, 6], which hurts the reliability of decision-making systems. For a binary classification problem, a machine learning model is said to be well calibrated if it makes probabilistic predictions that agree with the actual outcomes [14, 6]. That is, when the model makes predictions on an unseen data set, in any subset of the data, if the averaged prediction is $p$, the actual outcome should occur around a fraction $p$ of the times.
It has been reported in recent studies [6, 3, 5]
that, in the fields of computer vision and information retrieval, deep neural networks can make poorly calibrated probabilistic predictions. It has also been observed that, on several general machine learning and data mining tasks, miscalibration not only affects the overall utility but also undermines fairness on certain groups of data. In particular, some common deep models can make predictions that look desirable with respect to non-calibration performance measures, yet suffer from poor results on calibration-related measures.
Let us look at an example, the details of which will be shown in Section 3.2. We train a multi-layer perceptron (MLP) over a public dataset (the Lending Club loan data: https://www.kaggle.com/wendykan/lending-club-loan-data) to predict whether an issued loan will be in default. The trained neural network achieves a high AUC (area under the curve) score on the test set. Unfortunately, we found the predicted probabilities unfair among borrowers in different states in the U.S. In this case, the trained model seems "powerful" with respect to certain utility functions such as the AUC score, but it is unreliable because of its large miscalibration error and unfairness to specific subsets of data. Therefore, it is crucial to calibrate the predictions so as not to mislead the decision maker.
For this purpose, we first raise the following question.
Q1. Given probabilistic predictions on a test dataset, how can we measure the calibration error?
Perhaps surprisingly, common metrics are insufficient for evaluating miscalibration errors. In particular, they cannot be used to report biases over subsets of data, such as the aforementioned example of "unfairness" over borrowers in different U.S. states. For example, the Negative Log-Likelihood and the Brier score, arguably the two most popular metrics, can only evaluate the error at the instance level, which is too fine-grained to measure miscalibration on subsets. On the other hand, the reliability diagram [4] can only visualize the averaged error over probability intervals, and is thus too coarse-grained at the subset level.
To answer Q1, we put forward a new class of evaluation metrics, coined the Field-level Calibration Error. It evaluates the averaged bias of predictions over specific input fields, which is especially useful on categorical data. Taking the loan defaulter prediction task as an example, the new metric can measure the unfairness on the field "address state".
We observe that the proposed field-level calibration error indeed measures error ignored by previous metrics: a set of predictions can simultaneously achieve a high AUC score, a low Log-loss, and a large field-level calibration error.
Various calibration methods have been proposed to fix miscalibration, e.g., [1, 17, 18, 16, 14, 6]. A standard calibration pipeline builds a mapping function on a validation (development) dataset that transforms raw model outputs into calibrated probabilities. By using the mapping function, the error on the held-out data can then be reduced. However, such methods might be insufficient in practice: when directly training a model over the union of the training set and the validation set, we observe that it can reach a much higher AUC score than conventional calibration methods. Therefore, a practical question arises:
Q2. Can we simultaneously reduce the calibration error and improve other non-calibration metrics such as the AUC score?
Our answer is "Yes". To achieve this, we propose a neural-network-based method, coined Neural Calibration. Rather than learning a mapping from the raw model output to a calibrated probability as in previous work, Neural Calibration trains a neural network over the validation set, taking both the raw model output and all the other features as inputs. The method is simple yet powerful.
It naturally follows a learning pipeline for general machine learning and data mining tasks: first train a base model on the training set, then train a Neural Calibration model over the validation set. We conducted experiments on five large-scale real-world datasets to verify its effectiveness. We show that by using our learning pipeline, the resulting predictions not only achieve lower calibration error than previous calibration methods, but also reach comparable or better performance on non-calibration metrics than the joint training pipeline.
Our contributions can be summarized as follows:
- We put forward the Field-level Calibration Error, a new type of metric to measure miscalibration. It focuses on detecting the bias on specific subsets of data. We observe that the new metric indeed reports errors that are overlooked by existing metrics.
- We propose Neural Calibration, which takes the uncalibrated model output along with all the other input features as input, and outputs calibrated probabilistic predictions.
- We present a learning pipeline for practitioners in machine learning and data mining, which achieves strong results in both calibration and non-calibration metrics.
What we do not study: This paper is not related to the literature on fairness-aware classification [7, 19, 13], which aims to give absolutely fair predictions with respect to sensitive features, e.g., to predict the same acceptance rate for female and male applicants. Also, we do not study the reason why miscalibration occurs, as in [4, 6]; we study metrics to evaluate it and a practical method to fix it.
2 Background
This paper focuses on calibrating probabilistic predictions for binary classification, i.e., to predict $\Pr(y = 1 \mid x)$, where $x$ is the input and $y \in \{0, 1\}$ is the binary outcome. Suppose we obtain the probabilistic prediction $\hat{p} = \sigma(l(x))$ by a base model, where $\sigma(t) = 1/(1 + e^{-t})$ is the sigmoid function and $l(x)$ is the non-probabilistic output (also known as the logit) of the discriminative model. We denote a labeled dataset of $N$ instances as $D = \{(x_i, y_i)\}_{i=1}^{N}$. The training / validation / test sets are denoted by $D_{\text{train}}$, $D_{\text{valid}}$, and $D_{\text{test}}$, respectively. For simplicity of notation, we will use $\hat{p}_i$ to denote the prediction for $x_i$.
2.1 Existing metrics for probabilistic predictions
Instancelevel calibration error
A straightforward way to measure miscalibration is to average the error over every single instance. For example, the Negative Log-Likelihood (NLL), also known as the Log-loss, is formulated as
$$\text{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]. \quad (1)$$
Similarly, the Brier score is the mean squared error over instances:
$$\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{p}_i)^2. \quad (2)$$
A drawback of these two metrics is that they cannot measure the bias on groups of instances. Thus, by optimizing these objectives, a model can still give unfair predictions.
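As a concrete reference, Eqs. (1) and (2) can be computed in a few lines. This is a minimal NumPy sketch; the function names and the clipping constant are our own choices:

```python
import numpy as np

def nll(y, p, eps=1e-12):
    # Negative Log-Likelihood (Log-loss), Eq. (1); clip to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier(y, p):
    # Brier score: mean squared error over instances, Eq. (2)
    return np.mean((y - p) ** 2)
```

Both scores average over single instances, which is exactly why they cannot expose a bias that only shows up on a subset of the data.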
Probabilitylevel calibration error
In many previous studies [4, 14, 6], the calibration error is formulated by partitioning the predictions into bins and summing up the errors over the bins. Formally, if we partition the interval $[0, 1]$ into $K$ partitions where the $k$-th interval is $\left( \frac{k-1}{K}, \frac{k}{K} \right]$, and $B_k$ denotes the set of instances whose predictions fall into the $k$-th interval, the error is
$$\text{Prob-ECE} = \sum_{k=1}^{K} \frac{|B_k|}{N} \left| \frac{1}{|B_k|} \sum_{i \in B_k} (y_i - \hat{p}_i) \right|. \quad (3)$$
By minimizing this objective, the goal can be understood as: "for every subset of data where the prediction is around $p$, the actual averaged outcome should be around $p$".
However, we argue that this metric is too coarse for evaluating predictions and can be misleading for real-world applications. For example, one can get zero Prob-ECE by predicting a constant (the base rate of positives) for all the samples. Therefore, this paper does not include Prob-ECE as an evaluation metric.
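The degenerate case above is easy to reproduce. Below is a sketch of Eq. (3) in NumPy (equal-width bins; the binning-by-truncation trick is our implementation choice): predicting the base rate for every sample puts all predictions in one bin whose average outcome matches the constant, so the reported error is exactly zero.

```python
import numpy as np

def prob_ece(y, p, n_bins=10):
    # Probability-level ECE, Eq. (3): bin instances by their predicted
    # probability, then sum |mean outcome - mean prediction| per bin,
    # weighted by the bin's share of the data.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            err += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return err
```

With `y` having a 20% positive rate, the constant prediction `p = 0.2` yields `prob_ece == 0` even though the predictor is useless for ranking.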
Noncalibration metrics
A number of non-calibration metrics can evaluate probabilistic predictions, such as classification accuracy, the F-scores, and the Area Under the Curve (AUC). We will use AUC as the main non-calibration metric in this paper.
2.2 Existing calibration methods
Generally, data scientists observe miscalibration when testing the model on the validation set. Therefore, it is necessary to fix the error by learning a calibration function using the validation set.
Existing calibration methods can be categorized into non-parametric and parametric methods based on the mapping function. Non-parametric methods include binning methods [18, 14] and Isotonic Regression [1, 16]. The idea is to partition the raw predictions into bins; each bin is then assigned the averaged outcome of the instances falling into it on the validation set. If the partitions are predefined, the method is known as Histogram Binning [18]. However, the resulting mapping function does not preserve the order of predictions, and thus cannot be used for applications such as advertising. Another common non-parametric method is Isotonic Regression [1], which requires the mapping function to be non-decreasing and is widely used in industry [12, 3]. Parametric methods, on the other hand, use parameterized functions as the mapping function. The most common choice, Platt Scaling [17, 16], is equivalent to a univariate logistic regression that transforms the model output (the logit) into calibrated probabilities. Because of its simple form, Platt scaling can be extended to multi-class settings for image and text classification [6]. However, the over-simplified mapping tends to underfit the data and can be suboptimal.
Other related work includes calibration with more detailed mapping functions or in different problem settings. To name some, [14] extends Histogram Binning to a Bayesian ensemble; [6] extends Platt scaling to Temperature scaling; [15] extends Isotonic Regression to a Bayesian setting; [10] generalizes it to an online setting; [3] uses it for calibrating click models; and [11] uses model ensembles to reduce the bias of deep learning predictions.
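For reference, Platt scaling fits only two scalars on the validation logits. A minimal NumPy sketch, assuming a plain gradient-descent fit of the univariate logistic regression (Platt's original procedure uses a different optimizer and label smoothing):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def platt_fit(logits, y, lr=0.1, steps=2000):
    # Platt scaling: fit q = sigmoid(a * logit + b) on validation data
    # by gradient descent on the Log-loss.
    a, b = 1.0, 0.0
    for _ in range(steps):
        err = sigmoid(a * logits + b) - y  # d(Log-loss)/d(pre-sigmoid)
        a -= lr * np.mean(err * logits)
        b -= lr * np.mean(err)
    return a, b
```

With only two degrees of freedom, this mapping is order-preserving but can underfit badly, which is the drawback discussed above.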
3 Measuring unfairness with field-level calibration error
We put forward the field-level calibration error as a new metric to measure the bias of probabilistic predictions on different subsets of the dataset, which reflects the "unfairness" of the model.
3.1 Formulation of field-level calibration errors
Suppose that the model input $x$ is a $d$-dimensional vector that includes one specific categorical field $z$ that the decision-maker especially cares about. Given that $z$ is a categorical feature with $M$ possible values, we can partition the input space into $M$ disjoint subsets, denoted $D_1, \dots, D_M$. For example, in the loan defaulter prediction task mentioned previously, this particular field is the "address state" feature with 51 levels, i.e., $M = 51$, so the data can be partitioned into 51 disjoint subsets. Now we use these subsets to formulate field-level calibration errors. In particular, we formulate the field-level expected calibration error (Field-ECE) as
$$\text{Field-ECE} = \sum_{m=1}^{M} \frac{N_m}{N} \left| \frac{1}{N_m} \sum_{i \in D_m} (y_i - \hat{p}_i) \right|, \quad (4)$$
where $N_m = |D_m|$ is the number of instances in subset $D_m$. The formula is straightforward to understand: "for every subset of data categorized by the field $z$, the averaged prediction should agree with the averaged outcome". Therefore, if a set of predictions gets a large Field-ECE, it indicates that the model is biased on some part of the data.
Although this formulation has a similar form to Prob-ECE, there is a key difference. In Prob-ECE, the partition is determined by the predictions themselves, so the result can be misleading; e.g., one can get zero Prob-ECE by predicting a constant. In Field-ECE, the partition is determined by the input feature, so the metric stays consistent and is not affected by the predictions.
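A NumPy sketch of Eq. (4) makes the difference tangible (the function name is ours). In the toy data below, the two field values carry opposite biases that cancel globally, so the overall average prediction matches the overall outcome while Field-ECE exposes the per-field bias:

```python
import numpy as np

def field_ece(y, p, z):
    # Field-level expected calibration error, Eq. (4): partition the data
    # by the categorical field z and average the absolute bias per subset,
    # weighted by the subset's share of the data.
    err = 0.0
    for v in np.unique(z):
        mask = z == v
        err += mask.mean() * abs(np.sum(y[mask] - p[mask])) / mask.sum()
    return err
```

Since the weights $N_m/N$ cancel against the $1/N_m$ averages, this equals the sum of absolute per-subset biases divided by $N$.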
Further, we can formulate the field-level relative calibration error as the averaged rate of error relative to the true outcomes,
$$\text{Field-RCE} = \sum_{m=1}^{M} \frac{N_m}{N} \cdot \frac{\left| \sum_{i \in D_m} (y_i - \hat{p}_i) \right|}{\sum_{i \in D_m} y_i + \epsilon}, \quad (5)$$
where $N_m$ is the number of instances in each subset, i.e., $N_m = |D_m|$, and $\epsilon$ is a small positive number to prevent division by zero.
Note that although the field-level calibration errors are formulated upon a categorical input field, they can be easily extended to non-categorical fields by discretizing such fields into disjoint subsets.
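The relative variant of Eq. (5) can be sketched the same way (the default for the small constant is our assumption; the paper leaves it unspecified here):

```python
import numpy as np

def field_rce(y, p, z, eps=1e-6):
    # Field-level relative calibration error, Eq. (5): like Field-ECE,
    # but each subset's absolute bias is divided by the subset's total
    # positive outcomes; eps (assumed value) guards against division by zero.
    err = 0.0
    n = len(y)
    for v in np.unique(z):
        mask = z == v
        bias = abs(np.sum(y[mask] - p[mask]))
        err += (mask.sum() / n) * bias / (np.sum(y[mask]) + eps)
    return err
```

Dividing by the subset's positive count makes the error comparable across fields with very different base rates, which is why it reads as a relative error.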
3.2 Observing miscalibration
Here we show some observations that demonstrate the issue of miscalibration, especially field-level miscalibration. For all the tested tasks, we split the dataset into three parts: 60% for training, 20% for validation, and the remaining 20% for testing. We trained the base model, a two-layer MLP, in two versions: Model1 is trained on the training set $D_{\text{train}}$ only, and Model2 is Model1 updated incrementally over the validation set. For Model1, since we can observe miscalibration on the validation set, we also tested two existing calibration methods, Isotonic Regression and Platt Scaling. These calibration pipelines are denoted by the name of the respective calibration method in the tables.
Observation 1: Neither a larger training set nor a higher AUC indicates a smaller calibration error. Table 1 shows results on the loan defaulter prediction task. We see that Model2 outperforms Model1 in AUC significantly, which is easy to understand because it is trained on more data. Meanwhile, however, Model2 suffers from higher calibration errors than Model1 and is thus less reliable to use.
Observation 2: A lower instance-level calibration error does not indicate a lower field-level calibration error. Table 2 shows another example: Model2 gets a lower Log-loss and Brier score than the baseline calibration methods, Isotonic Regression and Platt scaling. However, the Field-ECE and Field-RCE of Model2 are larger, which means it is more biased than the calibrated models.
Observation 3: Previous calibration methods are sometimes better than Model2 in calibration metrics, but they are always worse than Model2 in AUC. This can be observed from Tables 1, 2, and 3. In particular, Isotonic Regression and Platt scaling did help calibration on the first two datasets, but failed on the third one. Moreover, these calibration methods cannot improve the AUC score over the base model, so they are always worse than Model2, which learns from more data.
Observation 4: Improvement in field-level calibration errors is easier to observe and more interpretable than in instance-level metrics. From the results, we see that the calibration methods often significantly reduce the field-level errors of Model1; e.g., in Table 2, the relative reductions in Field-ECE and Field-RCE are around 16%-18%. However, in the same table, the relative reductions in Log-loss and Brier score given by the calibration methods are no more than 0.2%, which is not significant. The field-level metrics are therefore easier to use and to interpret.
Method  Training data  Logloss  Brier score  FieldECE  FieldRCE  AUC 

Base (Model1)  2.250  0.111  0.114  73.1%  0.821  
Base (Model2)  3.704  0.183  0.187  122.1%  0.936  
Isotonic Reg.  0.302  0.086  0.025  16.1%  0.821  
Platt Scaling  0.324  0.093  0.041  26.3%  0.821  
Neural Calibration  0.059  0.013  0.025  16.0%  0.993 
Method  Training data  Logloss  Brier score  FieldECE  FieldRCE  AUC 

Base (Model1)  0.4547  0.1474  0.0160  7.46%  0.7967  
Base (Model2)  0.4516  0.1464  0.0167  7.08%  0.8001  
Isotonic Reg.  0.4539  0.1472  0.0134  6.09%  0.7967  
Platt Scaling  0.4539  0.1472  0.0135  6.11%  0.7967  
Neural Calibration  0.4513  0.1463  0.0094  4.59%  0.7996 
Method  Training data  Logloss  Brier score  FieldECE  FieldRCE  AUC 

Base (Model1)  0.3920  0.1215  0.0139  12.88%  0.7442  
Base (Model2)  0.3875  0.1204  0.0120  11.17%  0.7496  
Isotonic Reg.  0.3917  0.1216  0.0199  18.56%  0.7442  
Platt Scaling  0.3921  0.1215  0.0165  15.18%  0.7442  
Neural Calibration  0.3866  0.1202  0.0121  10.91%  0.7520 
4 Neural Calibration
In light of these observations, we are motivated to design a new calibration method that can improve both calibration and non-calibration metrics. Our proposed solution is named Neural Calibration. It consists of two modules: a parametric probabilistic calibration module that transforms the original model output into a calibrated one, and an auxiliary neural network that fully exploits the validation set. The basic formulation is
$$q(x) = \sigma\big( \eta(l(x)) + g(x) \big), \quad (6)$$
which is a sigmoid function of the sum of two terms: $\eta(\cdot)$ transforms the logit $l(x)$ given by the original model into a calibrated one, and $g(\cdot)$ is an auxiliary neural network over all the other features.
Therefore, there are two functions, $\eta(\cdot)$ and $g(\cdot)$, to learn, each with its own trainable parameters. They are trained simultaneously over the validation set.
The objective for training Neural Calibration is to minimize the Log-loss over the validation set, i.e.,
$$\min_{\eta,\, g} \; -\frac{1}{|D_{\text{valid}}|} \sum_{(x_i, y_i) \in D_{\text{valid}}} \left[ y_i \log q(x_i) + (1 - y_i) \log \big(1 - q(x_i)\big) \right]. \quad (7)$$
Therefore, Neural Calibration can be trained by stochastic gradient descent just like any other deep learning model. It can also be extended to multi-class classification without difficulty.
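The training loop behind Eqs. (6)-(7) can be sketched as follows. This is a deliberately simplified NumPy illustration: here $\eta$ is an affine map and $g$ a linear model over the features, whereas the paper's modules are the ILPS mapping (Section 4.1) and a neural network (Section 4.2).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(logit, x, params):
    # Eq. (6): q(x) = sigmoid(eta(l(x)) + g(x)); eta(l) = a*l + c and
    # g(x) = x @ w are simple stand-ins for the paper's modules.
    return sigmoid(params["a"] * logit + params["c"] + x @ params["w"])

def train(logit, x, y, lr=0.5, steps=500):
    # Minimize the Log-loss of Eq. (7) over the validation data by
    # gradient descent; d(Log-loss)/d(pre-sigmoid) = q - y.
    params = {"a": 1.0, "c": 0.0, "w": np.zeros(x.shape[1])}
    for _ in range(steps):
        err = forward(logit, x, params) - y
        params["a"] -= lr * np.mean(err * logit)
        params["c"] -= lr * np.mean(err)
        params["w"] -= lr * x.T @ err / len(y)
    return params
```

Note the initialization ($a = 1$, $c = 0$, $w = 0$) reproduces the uncalibrated prediction $\sigma(l(x))$, so training can only move away from it if doing so lowers the validation Log-loss.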
We now introduce the detailed structures of the two modules $\eta(\cdot)$ and $g(\cdot)$.
4.1 Isotonic Line-Plot Scaling (ILPS)
We are interested in finding a stronger parametric function $\eta(\cdot)$ that can achieve high fitting power. To this end, we borrow the spirit of binning from non-parametric methods. We first partition the real axis into $K$ intervals with fixed splits $b_0 < b_1 < \dots < b_K$, where $K$ is a large number, and design the function as
$$\eta(l) = \sum_{k=1}^{K} \mathbb{1}[b_{k-1} < l \le b_k] \left( a_{k-1} + \frac{a_k - a_{k-1}}{b_k - b_{k-1}} (l - b_{k-1}) \right), \quad (8)$$
where $a_0, \dots, a_K$ are the coefficients to learn. To make it easier to optimize, we re-parameterize it with the increments $w_k = a_k - a_{k-1}$:
$$\eta(l) = \sum_{k=1}^{K} \mathbb{1}[b_{k-1} < l \le b_k] \left( a_0 + \sum_{j < k} w_j + w_k \, \frac{l - b_{k-1}}{b_k - b_{k-1}} \right), \quad (9)$$
where $a_0, w_1, \dots, w_K$ are the parameters. This mapping function looks just like a line plot, as it connects the points $(b_k, a_k)$ one by one. In practice, we fix $K$ and the splits $\{b_k\}$ in advance.
Further, we would like to restrict the function $\eta(\cdot)$ to be isotonic (non-decreasing). To achieve this, we put a constraint on the parameters, i.e., $w_k \ge 0$ for $k = 1, \dots, K$. In the actual implementation, the constraint is realized by the Lagrange method, i.e., adding a penalty term on constraint violations to the loss function, so the overall optimization problem can still be solved by gradient descent.
This gives a novel parametric calibration mapping, which we name Isotonic Line-Plot Scaling (ILPS). We will show in the ablation study (Section 6.2) that this mapping significantly outperforms Platt scaling and is comparable to or better than common non-parametric methods.
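To make the construction concrete, here is a minimal NumPy sketch of the mapping in Eq. (9). The split placement, the value of $K$, and the identity initialization are our assumptions; learning $a_0$ and $w$ would follow the gradient-descent setup of Section 4, with the non-negativity penalty on $w$.

```python
import numpy as np

class ILPS:
    # Isotonic Line-Plot Scaling: a piecewise-linear map over fixed splits
    # b_0 < ... < b_K, parameterized by the value a_0 at b_0 and the
    # increments w_k = a_k - a_{k-1}; w_k >= 0 makes it non-decreasing.
    def __init__(self, lo=-6.0, hi=6.0, K=20):
        self.b = np.linspace(lo, hi, K + 1)   # fixed splits (assumed)
        self.a0 = lo
        self.w = np.full(K, (hi - lo) / K)    # identity initialization

    def __call__(self, l):
        l = np.clip(l, self.b[0], self.b[-1])  # clamp out-of-range logits
        k = np.clip(np.searchsorted(self.b, l, side="right") - 1,
                    0, len(self.w) - 1)
        a = self.a0 + np.concatenate(([0.0], np.cumsum(self.w)))  # knot values
        frac = (l - self.b[k]) / (self.b[k + 1] - self.b[k])
        return a[k] + self.w[k] * frac         # Eq. (9) on interval k
```

With the identity initialization the mapping starts as $\eta(l) = l$, so training only bends the line plot where the validation data demands it, and any non-negative `w` keeps the mapping monotone.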
4.2 The auxiliary neural network for calibration
The previous part designed a univariate mapping from the original model output to a calibrated one. We further use the whole validation set to train an auxiliary neural network $g(\cdot)$.
The neural network learns to fix the "unfairness" by using all the necessary features in the validation set. Our intuition is simple: if we observe a field-level calibration error over the validation set on a specific field, say $z$, then we should find a way to fix it by learning a function of both the model output and the field $z$. Since in many cases $z$ is part of the input $x$, we can directly learn a function of $x$. We do not restrict the form of the neural network for $g(\cdot)$: for example, one can use a multi-layer perceptron for general data mining tasks, a convolutional neural network for image inputs, or a recurrent neural network for sequential data.
5 Learning pipeline
In this section, we describe the actual learning pipeline for real-world applications that uses Neural Calibration to improve performance. Suppose we want to train a model with labeled binary classification data, and the target is to make reliable and fair probabilistic predictions on unseen data. We put forward the following pipeline:
Step 1. Split the labeled dataset at hand into a training set $D_{\text{train}}$ and a validation set $D_{\text{valid}}$.
Step 2. Train a base model $l(\cdot)$ over the training set $D_{\text{train}}$. If necessary, select the model and tune the hyper-parameters by testing on the validation set.
Step 3. Train a Neural Calibration model $q(x) = \sigma(\eta(l(x)) + g(x))$ over the validation set $D_{\text{valid}}$.
Step 4. Test on the hold-out data by predicting $q(x)$.
Here we give a brief explanation. Step 1 and Step 2 are the common processes in machine learning. After obtaining the base model, one might observe instance-level, probability-level, or field-level miscalibration by examining the predictions on the validation set. We learn to calibrate the predictions in Step 3, which can be viewed as training a model with inputs $(l(x), x)$ to fit the label $y$ by minimizing the Log-loss, as shown in Eq. (7). Finally, during inference on an unseen sample with input $x$, the final prediction is obtained in two steps: first compute the logit $l(x)$ with the original model, and then compute the calibrated prediction $q(x)$ with Neural Calibration.
Comparison with existing learning pipelines
Generally, machine learning or data mining pipelines do not consider the miscalibration problem. That is, they make inferences directly after Step 1 and Step 2, which results in Model1 as mentioned in the previous section. In this case, the validation set is merely used for model selection.
Often, it is preferable to make full use of the labeled data at hand, including the validation set. So after training on the training set, one can further update the model on the validation set, which results in Model2. Such a training pipeline is useful especially when the data is arranged by time, because the validation set contains samples that are closer to the present. However, this pipeline does not consider the calibration error.
The conventional calibration pipeline has the same procedure as ours. However, conventional calibration methods solely learn a mapping from uncalibrated outputs to calibrated predictions at Step 3. Neural Calibration is more flexible and powerful because it can fully exploit the validation data.
6 Experiments
6.1 Experimental setup
Datasets:
We tested the methods on five large-scale real-world binary classification datasets.
1. Lending Club loan data (see Section 1), to predict whether an issued loan will be in default, with 2.26 million samples. The data is split by index. The field is set as the "address state", with 51 levels.
2. Criteo display advertising data (https://www.kaggle.com/c/criteo-display-ad-challenge), to predict the probability that a user clicks on a given ad. It consists of 45.8 million samples over 10 days and is split by index. The field is set as an anonymous feature "C11", with 5683 levels.
3. Avazu click-through rate prediction data (https://www.kaggle.com/c/avazu-ctr-prediction). We used the data of the first 10 days, with 40.4 million samples, split by date. The field is set as the "site ID", with 4737 levels.
4. Porto Seguro's safe driver prediction data (https://www.kaggle.com/c/porto-seguro-safe-driver-prediction), to predict whether a driver will file an insurance claim next year. It has 0.6 million samples and is split by index. The field is "ps_ind_03", with 12 levels.
5. Tencent click-through rate prediction data, subsampled directly from Tencent's online advertising stream. It consists of 100 million samples across 10 days and is split by date. The field is set as the advertisement ID, with 0.1 million levels.
Tested models and training details:
For all the datasets, the base model and the calibration network $g(\cdot)$ are neural networks with the same structure: the input fields are first transformed into 256-dimensional dense embeddings, which are concatenated together and followed by a multi-layer perceptron with two 200-dimensional ReLU layers. For each task, the base model is trained with the Adam optimizer [8] to minimize the Log-loss with a fixed learning rate of 0.001 for one epoch, yielding Model1. Next, we train the calibration methods on the validation set, or incrementally update Model1 on the validation set for one epoch to get Model2. Neural Calibration is trained on the validation set with the same number of training steps and the same learning rate.
Compared baselines:
We tested the base model trained on the training set (Model1) and on the union of the training and validation sets (Model2). We also tested common calibration methods, including Histogram Binning, Isotonic Regression, and Platt scaling.
6.2 Experimental results
The results are shown in Tables 1-5; the results of Histogram Binning are deferred to Table 6 due to space limitations. From the four middle columns of the tables, we find Neural Calibration the best in all calibration metrics, significantly better than the tested baselines. From the rightmost column, we see that Neural Calibration achieves significantly higher AUC than conventional calibration methods, and is better than or comparable to Model2.
Method  Training data  Logloss  Brier score  FieldECE  FieldRCE  AUC 

Base (Model1)  0.1552  0.0351  0.0133  28.55%  0.6244  
Base (Model2)  0.1538  0.0349  0.0064  13.90%  0.6245  
Isotonic Reg.  0.1544  0.0349  0.0021  4.47%  0.6244  
Platt Scaling  0.1532  0.0349  0.0020  4.30%  0.6244  
Neural Calibration  0.1531  0.0349  0.0018  3.66%  0.6269 
Method  Training data  Logloss  Brier score  FieldECE  FieldRCE  AUC 

Base (Model1)  0.1960  0.0522  0.0145  27.12%  0.7885  
Base (Model2)  0.1953  0.0521  0.0128  24.58%  0.7908  
Isotonic Reg.  0.1958  0.0522  0.0141  25.45%  0.7884  
Platt Scaling  0.1958  0.0522  0.0142  25.72%  0.7885  
Neural Calibration  0.1952  0.0521  0.0124  22.87%  0.7907 
Method  Data1  Data2  Data3  Data4  Data5 

Histogram Bin.  0.026  0.0134  0.0146  0.0019  0.0142 
Isotonic Reg.  0.025  0.0134  0.0199  0.0021  0.0141 
Platt Scaling  0.041  0.0135  0.0165  0.0020  0.0142 
ILPS  0.028  0.0133  0.0146  0.0018  0.0141 
Ablation study:
We tested Isotonic Line-Plot Scaling alone to see whether it is stronger than previous calibration methods that learn univariate mapping functions. Table 6 shows that it consistently outperformed Platt scaling on all datasets, and is comparable to or better than the non-parametric methods on datasets 2-5.
7 Conclusion
This paper studied the issue of miscalibration in probabilistic predictions for binary classification. We first put forward the Field-level Calibration Error, a new class of metrics to measure miscalibration. It can report the bias and "unfairness" on specific subsets of data, which is often overlooked by common metrics. We then observed that existing calibration methods cannot make full use of the labeled data, and proposed a neural-network-based method, Neural Calibration, to address this issue. It consists of a novel parametric calibration mapping, Isotonic Line-Plot Scaling, and an auxiliary neural network. We tested our method on five large-scale datasets. Using the Neural Calibration pipeline, we achieved significant improvements over conventional methods on calibration and non-calibration metrics simultaneously.
References
 [1] Richard E Barlow, David J Bartholomew, James M Bremner, and H Daniel Brunk. Statistical inference under order restrictions: The theory and application of isotonic regression. Technical report, Wiley New York, 1972.
 [2] Antonio Bella, Cèsar Ferri, José Hernández-Orallo, and María José Ramírez-Quintana. Calibration of machine learning models. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 128–146. IGI Global, 2010.
 [3] Alexey Borisov, Julia Kiseleva, Ilya Markov, and Maarten de Rijke. Calibration: A simple way to improve click models. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1503–1506. ACM, 2018.
 [4] Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.

 [5] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. 2018.
 [6] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org, 2017.
 [7] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pages 643–650. IEEE, 2011.
 [8] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

 [9] Kevin B Korb. Calibration and the evaluation of predictive learners. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 73–77, 1999.
 [10] Wojciech Kotłowski, Wouter M Koolen, and Alan Malek. Online isotonic regression. In Conference on Learning Theory, pages 1165–1189, 2016.
 [11] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
 [12] H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230. ACM, 2013.
 [13] Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107–118, 2018.
 [14] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 [15] Brian Neelon and David B Dunson. Bayesian isotonic regression and trend analysis. Biometrics, 60(2):398–406, 2004.

 [16] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.
 [17] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
 [18] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699. ACM, 2002.
 [19] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.