Group calibration is a byproduct of unconstrained learning

08/29/2018 ∙ by Lydia T. Liu, et al.

Much recent work on fairness in machine learning has focused on how well a score function is calibrated in different groups within a given population, where each group is defined by restricting one or more sensitive attributes. We investigate to what extent group calibration follows from unconstrained empirical risk minimization on its own, without the need for any explicit intervention. We show that under reasonable conditions, the deviation from satisfying group calibration is bounded by the excess loss of the empirical risk minimizer relative to the Bayes optimal score function. As a corollary, it follows that empirical risk minimization can simultaneously achieve calibration for many groups, a task that prior work deferred to highly complex algorithms. We complement our results with a lower bound, and a range of experimental findings. Our results challenge the view that group calibration necessitates an active intervention, suggesting that often we ought to think of it as a byproduct of unconstrained machine learning.




1 Introduction

Although many fairness-promoting interventions have been proposed in the machine learning literature, unconstrained learning remains the dominant paradigm among practitioners for learning risk scores from data. Given a prespecified class of models, unconstrained learning simply seeks to minimize the average prediction loss over a labeled dataset, without explicitly correcting for disparity with respect to sensitive attributes, such as race or gender. Many criticize the practice of unconstrained machine learning for propagating harmful biases (Crawford, 2013; Barocas and Selbst, 2016; Crawford, 2017). Others see merit in unconstrained learning for reducing bias in consequential decisions (Corbett-Davies et al., 2017b, a; Kleinberg et al., 2018).

In this work, we show that defaulting to unconstrained learning does not neglect fairness considerations entirely. Instead, it prioritizes one notion of “fairness” over others: unconstrained learning achieves calibration with respect to one or more sensitive attributes, as well as a related criterion called sufficiency (e.g., Barocas et al., 2018), at the cost of violating other widely used fairness criteria, separation and independence (see Section 1.2 for references).

A risk score is calibrated for a group if the risk score obviates the need to solicit group membership for the purpose of predicting an outcome variable of interest. The concept of calibration has a venerable history in statistics and machine learning (Cox, 1958; Murphy and Winkler, 1977; Dawid, 1982; DeGroot and Fienberg, 1983; Platt, 1999; Zadrozny and Elkan, 2001; Niculescu-Mizil and Caruana, 2005). The appearance of calibration as a widely adopted and discussed “fairness criterion” largely resulted from a recent debate around fairness in recidivism prediction and pre-trial detention. After journalists at ProPublica pointed out that a popular recidivism risk score had a disparity in false positive rates between white defendants and black defendants (Angwin et al., 2016), the organization that produced these scores countered that this disparity was a consequence of the fact that their scores were calibrated by race (Dieterich et al., 2016). Formal trade-offs dating back to the 1970s confirm the observed tension between calibration and other classification criteria, including the aforementioned criterion of separation, which is related to the disparity in false positive rates (Darlington, 1971; Chouldechova, 2017; Kleinberg et al., 2017; Barocas et al., 2018).

Implicit in this debate is the view that calibration is a constraint that needs to be actively enforced as a means of promoting fairness. Consequently, recent literature has proposed new learning algorithms which ensure approximate calibration in different settings (Hebert-Johnson et al., 2018; Kearns et al., 2017).

The goal of this work is to understand when approximate calibration can in fact be achieved by unconstrained machine learning alone. We define several relaxations of the exact calibration criterion, and show that approximate group calibration is often a routine consequence of unconstrained learning. Such guarantees apply even when the sensitive attributes in question are not available to the learning algorithm. On the other hand, we demonstrate that under similar conditions, unconstrained learning strongly violates the separation and independence criteria. We also prove novel lower bounds which demonstrate that in the worst case, no other algorithm can produce score functions that are substantially better-calibrated than unconstrained learning. Finally, we verify our theoretical findings with experiments on two well-known datasets, demonstrating the effectiveness of unconstrained learning in achieving approximate calibration with respect to multiple group attributes simultaneously.

1.1 Our results

We begin with a simplified presentation of our results. As is common in supervised learning, consider a pair of random variables $(X, Y)$, where $X$ models available features and $Y$ is a binary target variable that we try to predict from $X$. We choose a discrete random variable $A$ in the same probability space to model group membership. For example, $A$ could represent gender or race. In particular, our results do not require that $X$ perfectly encodes the attribute $A$.

A score function $f$ maps the random variable $X$ to a real number. We say that the score function $f$ is sufficient with respect to attribute $A$ if we have $\mathbb{E}[Y \mid f(X)] = \mathbb{E}[Y \mid f(X), A]$ almost surely. In words, conditioning on $A$ provides no additional information about $Y$ beyond what was revealed by $f(X)$. This definition leads to a natural notion of the sufficiency gap:

$$\mathrm{suf}_f(A) = \mathbb{E}\big[\,\big|\mathbb{E}[Y \mid f(X)] - \mathbb{E}[Y \mid f(X), A]\big|\,\big], \tag{1}$$

which measures the expected deviation from satisfying sufficiency over a random draw of $(X, A)$.

Footnote 1: This notion has also been referred to as “calibration” in previous work (e.g., Chouldechova, 2017). In this work we refer to it as “sufficiency”, hence distinguishing it from $\mathbb{E}[Y \mid f(X), A] = f(X)$, which has also been called “calibration” in previous work (e.g., Pleiss et al., 2017). These two notions are not identical, but closely related; we present analogous theoretical results for both.

We say that the score function $f$ is calibrated with respect to group $A$ if we have $\mathbb{E}[Y \mid f(X), A] = f(X)$. Note that calibration implies sufficiency. We define the calibration gap (see also Pleiss et al., 2017) as

$$\mathrm{cal}_f(A) = \mathbb{E}\big[\,\big|f(X) - \mathbb{E}[Y \mid f(X), A]\big|\,\big]. \tag{2}$$
Denote by $\mathcal{L}(f) = \mathbb{E}[\ell(f(X), Y)]$ the population risk (risk, for short) of the score function $f$. Think of the loss function $\ell$ as either the square loss or the logistic loss, although our results apply more generally. Our first result relates the sufficiency and calibration gaps of a score to its risk.

Theorem 1.1 (Informal).

For a broad class of loss functions that includes the square loss and logistic loss, we have

$$\max\{\mathrm{suf}_f(A),\ \mathrm{cal}_f(A)\} \le O\Big(\sqrt{\mathcal{L}(f) - \mathcal{L}^*}\Big).$$

Here, $\mathcal{L}^*$ is the calibrated Bayes risk, i.e., the risk of the score function $f^B(x, a) = \mathbb{E}[Y \mid X = x, A = a]$.

The theorem shows that if we manage to find a score function with small excess risk over the calibrated Bayes risk, then the score function will also be reasonably sufficient and well-calibrated with respect to the group attribute $A$. We also provide analogous results for the calibration error restricted to a particular group $A = a$.

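In particular, the above theorem suggests that computing the unconstrained empirical risk minimizer (Vapnik, 1992), or ERM, is a natural strategy for achieving group calibration and sufficiency. For a given loss $\ell$, finite set of examples $S_n = \{(X_i, Y_i)\}_{i=1}^{n}$, and class of possible scores $\mathcal{F}$, the ERM is the score function

$$\hat{f}_n \in \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(X_i), Y_i).$$

It is well known that, under very general conditions, $\mathcal{L}(\hat{f}_n) \to \min_{f \in \mathcal{F}} \mathcal{L}(f)$ in probability; that is, the risk of $\hat{f}_n$ converges in probability to the least expected loss of any score function $f \in \mathcal{F}$.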
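As a rough illustration of the ERM definition above, the following sketch fits a linear-logistic score by plain gradient descent on the average logistic loss. The synthetic data, the function names (`fit_erm`, `empirical_risk`), the step size, and the iteration count are all assumptions made for this example; it is not the procedure used in the experiments.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(p, y, eps=1e-12):
    """Logistic (cross-entropy) loss of predicted probability p on label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def empirical_risk(weights, xs, ys):
    """Average logistic loss of the score sigmoid(w . x) over the sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        total += logistic_loss(p, y)
    return total / len(xs)


def fit_erm(xs, ys, steps=300, lr=0.5):
    """Approximate the ERM over linear-logistic scores by gradient descent."""
    n, d = len(xs), len(xs[0])
    weights = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, ys):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical synthetic sample: an intercept plus one informative feature.
rng = random.Random(0)
xs, ys = [], []
for _ in range(400):
    x1 = rng.gauss(0.0, 1.0)
    xs.append([1.0, x1])
    ys.append(1 if rng.random() < sigmoid(1.5 * x1 - 0.5) else 0)

w_hat = fit_erm(xs, ys)
risk_at_zero = empirical_risk([0.0, 0.0], xs, ys)  # constant score 1/2
risk_at_erm = empirical_risk(w_hat, xs, ys)
```

The group attribute never appears in this objective: the fit only minimizes average loss, which is what "unconstrained" means here.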

In general, the ERM may not achieve small excess risk, $\mathcal{L}(\hat{f}_n) - \mathcal{L}^*$. Indeed, we have defined the calibrated Bayes score $f^B$ as one that has access to both $X$ and $A$. In cases where the available features $X$ do not encode $A$ but $A$ is relevant to the prediction task, the excess risk may be large. In other cases, the excess risk may be large simply because the function class over which we can feasibly optimize provides only poor approximations to the calibrated Bayes score. In Example 2.1, we provide scenarios in which the excess risk is indeed small.

The constant in front of the square root in our theorem depends on properties of the loss function, and is typically small, e.g., bounded by  for both the squared loss and the logistic loss. The more significant question is if the square root is necessary. We answer this question in the affirmative.

Theorem 1.2 (Informal).

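There is a triple of random variables $(X, A, Y)$ such that the empirical risk minimizer $\hat{f}_n$ trained on $n$ samples drawn i.i.d. from their joint distribution satisfies $\min\{\mathrm{cal}_{\hat{f}_n}(A),\ \mathrm{suf}_{\hat{f}_n}(A)\} \ge \Omega(1/\sqrt{n})$ and $\mathcal{L}(\hat{f}_n) - \mathcal{L}^* \le O(1/n)$ with probability $\Omega(1)$.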

In other words, our upper bound sharply characterizes the worst-case relationship between excess risk, sufficiency and calibration. Moreover, our lower bound applies not only to the empirical risk minimizer $\hat{f}_n$, but to any score learned from data which is a linear function of the features $X$. Although group calibration and sufficiency are a natural consequence of unconstrained learning, it is in general untrue that they imply a good predictor. For example, predicting the group average, $f = \mathbb{E}[Y \mid A]$, is a pathological score function that nevertheless satisfies calibration and sufficiency.
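The group-average example can be checked on a toy sample. In the sketch below, the data and the helper names (`group_average_score`, `empirical_calibration_gap`) are hypothetical; the score assigns every individual the base rate of their group and ignores all other features.

```python
from collections import defaultdict


def group_average_score(groups, labels):
    """Return a dict mapping each group to its empirical base rate."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for a, y in zip(groups, labels):
        counts[a] += 1
        positives[a] += y
    return {a: positives[a] / counts[a] for a in counts}


def empirical_calibration_gap(scores, groups, labels):
    """Average |score - observed positive rate| over (score, group) cells."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for s, a, y in zip(scores, groups, labels):
        counts[(s, a)] += 1
        positives[(s, a)] += y
    n = len(labels)
    return sum(
        counts[cell] / n * abs(cell[0] - positives[cell] / counts[cell])
        for cell in counts
    )


# Toy sample: group "a" has base rate 3/4, group "b" has base rate 1/4.
groups = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]

base_rate = group_average_score(groups, labels)
scores = [base_rate[a] for a in groups]
gap = empirical_calibration_gap(scores, groups, labels)
# Fraction of labels matched when the score is thresholded at 1/2.
accuracy = sum((s > 0.5) == bool(y) for s, y in zip(scores, labels)) / len(labels)
```

On this sample the calibration gap is exactly zero, yet the score cannot distinguish individuals within a group, so it is calibrated without being an informative predictor.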

Although unconstrained learning leads to well-calibrated scores, it violates other notions of group fairness. We show that the ERM typically violates independence (the criterion that scores are independent of the group attribute $A$) as long as the base rate differs by group. Moreover, we show that the ERM violates separation, which asks for scores to be conditionally independent of the attribute $A$ given the target $Y$ (see Barocas et al., 2018, Chapter 2). In this work, we define the separation gap:

$$\mathrm{sep}_f(A) := \mathbb{E}_{Y, A}\big[\,\big|\mathbb{E}[f(X) \mid Y, A] - \mathbb{E}[f(X) \mid Y]\big|\,\big],$$

and show that any score with small excess risk must in general have a large separation gap. Similarly, we show that unconstrained learning incurs a large independence gap $\mathrm{ind}_f(A) := \mathbb{E}_{A}\big[\,\big|\mathbb{E}[f(X) \mid A] - \mathbb{E}[f(X)]\big|\,\big]$, a quantitative version of the independence criterion (see Barocas et al., 2018, Chapter 2).

Theorem 1.3 (Informal).

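For a broad class of loss functions that includes the square loss and logistic loss, we have

$$\mathrm{sep}_f(A) \ge C_{f^B} \cdot Q_A - O\Big(\sqrt{\mathcal{L}(f) - \mathcal{L}^*}\Big),$$

where $C_{f^B}$ and $Q_A$ are problem-specific constants independent of $f$. $C_{f^B}$ represents the inherent noise level of the prediction task, and $Q_A$ is the variation in group base rates. Moreover, $\mathrm{ind}_f(A) \ge Q_A - O\big(\sqrt{\mathcal{L}(f) - \mathcal{L}^*}\big)$ for the same constant $Q_A$.

The lower bound for $\mathrm{sep}_f$ is explained in Section 2.2; the lower bound for $\mathrm{ind}_f$ is deferred to Appendix F.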

Experimental evaluation.

We explore the extent to which the result of empirical risk minimization satisfies sufficiency, calibration and separation, via comprehensive experiments on the UCI Adult dataset (Dua and Karra Taniskidou, 2017) and a pretrial defendants dataset from Broward County, Florida (Angwin et al., 2016; Dressel and Farid, 2018). For various choices of group attributes, including those defined using arbitrary combinations of features, we observe that the empirical risk minimizing score is fairly close to being calibrated and sufficient. Notably, this holds even when the score is not a function of the group attribute in question.

1.2 Related work

Calibration was first introduced as a fairness criterion by the education testing literature in the 1960s. It was formalized by the Cleary criterion (Cleary, 1966; Cleary, 1968), which compares the slope of regression lines between the test score and the outcome in different groups. More recently, machine learning and data mining communities have rediscovered calibration, and examined the inherent tradeoffs between calibration and other fairness constraints. Chouldechova (2017) and Kleinberg et al. (2017) independently demonstrate that exact group calibration is incompatible with separation (equal true positive and false positive rates), except under highly restrictive situations such as perfect prediction or equal group base rates. Such impossibility results have been further generalized by Pleiss et al. (2017).

There are multiple post-processing procedures which achieve calibration (see, e.g., Niculescu-Mizil and Caruana, 2005, and references therein). Notably, Platt scaling (Platt, 1999) learns calibrated probabilities for a given score function by logistic regression. Recently, Hebert-Johnson et al. (2018) proposed a polynomial time agnostic learning algorithm that achieves both low prediction error, and multi-calibration, or simultaneous calibration with respect to all, possibly overlapping, groups that can be described by a concept class of a given complexity. Complementary to this finding, our work shows that low prediction error often implies calibration with no additional computational cost, under very general conditions. Unlike Hebert-Johnson et al. (2018), we do not aim to guarantee calibration with respect to arbitrarily complex group structure; instead we study when usual empirical risk minimization already achieves calibration with respect to a given group attribute $A$.

A variety of other fairness criteria have been proposed to address concerns of fairness with respect to a sensitive attribute. These are typically group parity constraints on the score function, including, among others, demographic parity (also known as independence and statistical parity), equalized odds (also known as error-rate balance and separation), as well as calibration and sufficiency (see e.g. Feldman et al., 2015; Hardt et al., 2016; Chouldechova, 2017; Kleinberg et al., 2017; Pleiss et al., 2017; Barocas et al., 2018). Beyond parity constraints, recent works have also studied dynamic aspects of fairness, such as the impact of model predictions on future welfare (Liu et al., 2018) and demographics (Hashimoto et al., 2018).

2 Formal setup and results

We consider the problem of finding a score function $\hat{f}$ which encodes the probability of a binary outcome $Y \in \{0, 1\}$, given access to features $X \in \mathcal{X}$. We consider functions $f : \mathcal{X} \to [0, 1]$ which lie in a prespecified function class $\mathcal{F}$. We assume that individuals’ features and outcomes $(X, Y)$ are random variables whose law is governed by a probability measure $\mathcal{D}$ over a space $\Omega$, and will view functions $f$ as maps $\Omega \to [0, 1]$ via $f = f(X)$. We use $\Pr_{\mathcal{D}}[\cdot]$ to denote the probability of events under $\mathcal{D}$, and $\mathbb{E}_{\mathcal{D}}[\cdot]$ to denote expectation taken with respect to $\mathcal{D}$.

We also consider a $\mathcal{D}$-measurable protected attribute $A : \Omega \to \mathcal{A}$, with respect to which we would like to ensure sufficiency or calibration, as defined in Section 1.1 above. While we assume that $f = f(X)$ for all $f \in \mathcal{F}$, we compare the performance of $f$ to the benchmark that we call the calibrated Bayes score

$$f^B(x, a) := \mathbb{E}[Y \mid X = x, A = a],$$

which is a function of both the feature $x$ and the attribute $a$. As a consequence, $f^B \notin \mathcal{F}$, except possibly whenever $Y$ is conditionally independent of $A$ given $X$. Nevertheless, $f^B$ is well defined as a map $\Omega \to [0, 1]$ and it always satisfies sufficiency and calibration:

Footnote 2: Note that this is not the perfect predictor unless $Y$ is deterministic given $X$ and $A$.

Proposition 2.1.

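$f^B$ is sufficient and calibrated, that is, $\mathbb{E}[Y \mid f^B(X, A)] = \mathbb{E}[Y \mid f^B(X, A), A]$ and $\mathbb{E}[Y \mid f^B(X, A), A] = f^B(X, A)$, almost surely. Moreover, if $\Phi$ is any map, then the classifier $f_\Phi(X, A) := \mathbb{E}[Y \mid \Phi(X), A]$ is sufficient and calibrated.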
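For intuition, here is a short sketch of the tower-property argument for the calibrated Bayes score. It writes $f^B(X, A) = \mathbb{E}[Y \mid X, A]$, a notation assumed for this sketch, and uses the fact that the pair $(f^B(X, A), A)$ is a function of $(X, A)$.

```latex
% Notation assumed here: f^B(X, A) = E[Y | X, A] is the calibrated Bayes score.
\begin{align*}
\mathbb{E}[Y \mid f^B(X,A), A]
  &= \mathbb{E}\big[\,\mathbb{E}[Y \mid X, A] \,\big|\, f^B(X,A), A\big]
     && \text{(tower property)} \\
  &= \mathbb{E}\big[\, f^B(X,A) \,\big|\, f^B(X,A), A\big]
   = f^B(X,A), \\
\mathbb{E}[Y \mid f^B(X,A)]
  &= \mathbb{E}\big[\,\mathbb{E}[Y \mid X, A] \,\big|\, f^B(X,A)\big]
   = f^B(X,A).
\end{align*}
```

The first chain is calibration. Since both conditional expectations equal $f^B(X, A)$, they equal each other, which is sufficiency.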

Proposition 2.1 is a direct consequence of the tower property (proof in Appendix A.1). In general, there are many challenges to learning perfectly calibrated scores. As mentioned above, $f^B$ depends on information about $A$ which is not necessarily accessible to scores $f \in \mathcal{F}$. Moreover, even in the setting where $A = A(X)$, it may still be the case that $\mathcal{F}$ is a restricted class of scores, and $f^B \notin \mathcal{F}$. Lastly, if $f$ is estimated from data, it may require infinitely many samples to achieve perfect calibration. To this end, we introduce the following approximate notion of sufficiency and calibration:

Definition 1.

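Given a $\mathcal{D}$-measurable attribute $A$ and value $a \in \mathcal{A}$, we define the sufficiency gap of $f$ with respect to $A$ for group $a$ as

$$\mathrm{suf}_f(a; A) := \mathbb{E}\big[\,\big|\mathbb{E}[Y \mid f(X)] - \mathbb{E}[Y \mid f(X), A]\big| \,\big|\, A = a\big],$$

and the calibration gap for group $a$ as

$$\mathrm{cal}_f(a; A) := \mathbb{E}\big[\,\big|f(X) - \mathbb{E}[Y \mid f(X), A]\big| \,\big|\, A = a\big].$$

We shall let $\mathrm{suf}_f(A)$ and $\mathrm{cal}_f(A)$ be as defined above in (1) and (2), respectively.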
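To make the two gaps concrete, the sketch below computes plug-in estimates from a sample of (score, group, label) triples. It assumes scores lie in [0, 1] and replaces conditioning on the exact score by conditioning on an equal-width score bin; the binning scheme and the name `gap_estimates` are choices made for this illustration, not the estimator of Appendix G.1.

```python
from collections import defaultdict


def bin_index(score, n_bins):
    """Map a score in [0, 1] to one of n_bins equal-width bins."""
    return min(int(score * n_bins), n_bins - 1)


def gap_estimates(scores, groups, labels, n_bins=10):
    """Plug-in estimates of the (sufficiency gap, calibration gap).

    Conditional expectations given the score are replaced by averages
    within a score bin, or within a (bin, group) cell.
    """
    bin_n = defaultdict(int)         # examples per bin
    bin_pos = defaultdict(int)       # positive labels per bin
    cell_n = defaultdict(int)        # examples per (bin, group)
    cell_pos = defaultdict(int)      # positive labels per (bin, group)
    cell_score = defaultdict(float)  # summed scores per (bin, group)
    for s, a, y in zip(scores, groups, labels):
        b = bin_index(s, n_bins)
        bin_n[b] += 1
        bin_pos[b] += y
        cell_n[(b, a)] += 1
        cell_pos[(b, a)] += y
        cell_score[(b, a)] += s

    n = len(labels)
    suf = cal = 0.0
    for (b, a), count in cell_n.items():
        weight = count / n
        rate_in_cell = cell_pos[(b, a)] / count
        rate_in_bin = bin_pos[b] / bin_n[b]
        mean_score = cell_score[(b, a)] / count
        suf += weight * abs(rate_in_bin - rate_in_cell)
        cal += weight * abs(mean_score - rate_in_cell)
    return suf, cal


# Toy sample: group "a" has base rate 3/4, group "b" has base rate 1/4.
groups = ["a"] * 4 + ["b"] * 4
labels = [1, 1, 1, 0, 1, 0, 0, 0]

# A group-blind constant score versus a score equal to the group base rate.
blind = gap_estimates([0.5] * 8, groups, labels)
aware = gap_estimates([0.75] * 4 + [0.25] * 4, groups, labels)
```

On this toy sample the constant score has both gaps equal to 0.25, while the group-aware score has both gaps equal to zero.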

2.1 Sufficiency and calibration

We now state our main results, which show that the sufficiency and calibration gaps of a function can be controlled by its loss, relative to the calibrated Bayes score . All proofs are deferred to the supplementary material. Throughout, we let denote a class of score functions . For a loss function and any -measurable , recall the population risk . Note that for , , whereas for the calibrated Bayes score , we denote its population risk as . We further assume that our losses satisfy the following regularity condition:

Assumption 1.

Given a probability measure , we assume that is (a) -strongly convex: , (b) there exists a differentiable map such that (that is, is a Bregman Divergence), and (c) the calibrated Bayes score is a critical point of the population risk, that is

Assumption 1 is satisfied by common choices for the loss function, such as the square loss with , and the logistic loss, as shown by the following lemma, proved in Appendix A.2.

Lemma 2.2 (Logistic Loss).

The logistic loss satisfies Assumption 1 with .

We are now ready to state our main theorem (proved in Appendix B), which provides a simple bound on the sufficiency and calibration gaps, and , in terms of the excess risk :

Theorem 2.3 (Sufficiency and Calibration are Upper Bounded by Excess Risk).

Suppose the loss function satisfies Assumption 1 with parameter . Then, for any score and any attribute ,


Moreover, it holds that for ,


Theorem 2.3 applies to any , regardless of how is obtained. As a consequence of Theorem 2.3, we immediately conclude the following corollary for the empirical risk minimizer:

Corollary 2.4 (Calibration of the ERM).

Let be the output of any learning algorithm (e.g. ERM) trained on a sample , and let be as in Theorem 2.3. Then, if satisfies the guarantee

and if satisfies Assumption 1 with parameter , then with probability at least over , it holds that

The above corollary states that if there exists a score in the function class whose population risk is close to that of the calibrated Bayes optimal , then empirical risk minimization succeeds in finding a well-calibrated score.

In order to apply Corollary 2.4, one must know when the gap between the best-in-class risk and calibrated Bayes risk, , is small. In the full information setting where (that is, the group attribute is available to the score function), corresponds to the approximation error for the class  (Bartlett et al., 2006). When may not contain all the information about , depends not only on the class but also on how well can be encoded by given the class , and possibly additional regularity conditions. We now present a guiding example under which one can meaningfully bound the excess risk in the incomplete information setting. In Appendix B.3, we provide two further examples to guide the readers’ intuition. For our present example, we introduce as a benchmark the uncalibrated Bayes optimal score

which minimizes empirical risk over all measurable functions, and is necessarily in . Our first example gives a decomposition of when is the square loss.

Example 2.1.

Let denote the squared loss. Then,



denotes the conditional variance of

given .

The decomposition in Example 2.1 follows immediately from the fact that the excess risk of over , , is precisely when is the square loss. Examining (9), (i) represents the excess risk of over the best score in , which tends to zero if is the ERM. Term (ii) captures the richness of the function class, for as contains a close approximation to . If is obtained by a consistent non-parametric learning procedure, and has small complexity, then both (i) and (ii) tend to zero in the limit of infinite samples. Lastly, (iii) captures the additional information about contained in . Note that in the full information setting, this term is zero.
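The square-loss identity behind this decomposition can be verified exactly on a small finite distribution. For the square loss, the excess risk of any score over the calibrated Bayes score equals the expected squared distance between the two. In the sketch below the probability tables are made up for illustration, and the comparison score is the one that sees only the feature x.

```python
# Hypothetical joint distribution of (x, a) and values of P(Y = 1 | x, a).
p_xa = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
bayes = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}


def square_risk(score):
    """Exact E[(score(x, a) - Y)^2] under the distribution above."""
    total = 0.0
    for (x, a), p in p_xa.items():
        s, q = score(x, a), bayes[(x, a)]
        total += p * (q * (s - 1.0) ** 2 + (1.0 - q) * s ** 2)
    return total


def calibrated_bayes(x, a):
    """E[Y | X = x, A = a]: uses both the feature and the group."""
    return bayes[(x, a)]


def uncalibrated_bayes(x, a):
    """E[Y | X = x]: the best score that sees only x (a is ignored)."""
    mass = sum(p for (xx, _), p in p_xa.items() if xx == x)
    pos = sum(p * bayes[(xx, aa)] for (xx, aa), p in p_xa.items() if xx == x)
    return pos / mass


excess = square_risk(uncalibrated_bayes) - square_risk(calibrated_bayes)
# Expected squared distance between the two scores.
sq_dist = sum(
    p * (uncalibrated_bayes(x, a) - bayes[(x, a)]) ** 2
    for (x, a), p in p_xa.items()
)
```

Here `excess` and `sq_dist` agree up to floating-point error. Because the comparison score is the conditional mean of the calibrated Bayes score given x, this common value is also the expected conditional variance of the calibrated Bayes score given x. It would be zero if the group carried no information beyond x.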

2.2 Lower bounds for separation

In this section, we show that empirical risk minimization robustly violates the separation criterion that scores are conditionally independent of the group $A$ given the outcome $Y$. For a classifier $f$ that exactly satisfies separation, we have $\mathbb{E}[f(X) \mid Y] = \mathbb{E}[f(X) \mid Y, A]$ for any group $A$ and outcome $Y$. We define the separation gap as the average margin by which this equality is violated:

Definition 2 (Separation gap).

The separation gap is

$$\mathrm{sep}_f(A) := \mathbb{E}_{Y, A}\big[\,\big|\mathbb{E}[f(X) \mid Y, A] - \mathbb{E}[f(X) \mid Y]\big|\,\big].$$

Our first result states that the calibrated Bayes score $f^B$ has a non-trivial separation gap. The following lower bound is proved in Appendix F:

Proposition 2.5 (Lower bound on separation gap).

Denote , and for a group attribute . Let denote variance, and denote conditional variance given a random variable . Then, , where

Intuitively, the above bound says that the separation gap of the calibrated Bayes score is lower bounded by the product of two quantities: $Q_A$ corresponds to the $\ell_1$-variation in base rates among groups, and $C_{f^B}$ corresponds to the intrinsic noise level of the prediction problem. For example, consider the case where perfect prediction is possible (that is, $Y$ is deterministic given $(X, A)$). Then, the lower bound is vacuous because $C_{f^B} = 0$, and indeed $f^B$ has zero separation gap.
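The two regimes described above can be illustrated with an exact computation on a finite distribution. The sketch takes the separation gap to be the average absolute difference between the mean score given (outcome, group) and the mean score given the outcome alone, which is our reading of Definition 2. The probability tables and the function name `separation_gap` are assumptions for this example.

```python
def separation_gap(p_xa, bayes, score):
    """Exact E_{Y,A} |E[f | Y, A] - E[f | Y]| on a finite (x, a) space."""
    # Each entry: (group, outcome, joint mass of (x, a, y), score value).
    joint = []
    for (x, a), p in p_xa.items():
        q, s = bayes[(x, a)], score(x, a)
        joint.append((a, 1, p * q, s))
        joint.append((a, 0, p * (1.0 - q), s))

    gap = 0.0
    for y in (0, 1):
        mass_y = sum(m for _, yy, m, _ in joint if yy == y)
        if mass_y == 0.0:
            continue
        mean_y = sum(m * s for _, yy, m, s in joint if yy == y) / mass_y
        for a in {a for (_, a) in p_xa}:
            mass_ya = sum(m for aa, yy, m, _ in joint if yy == y and aa == a)
            if mass_ya == 0.0:
                continue
            mean_ya = sum(
                m * s for aa, yy, m, s in joint if yy == y and aa == a
            ) / mass_ya
            gap += mass_ya * abs(mean_ya - mean_y)
    return gap


p_xa = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}

# Noisy outcomes with group base rates 0.3 and 0.7.
noisy = {(0, 0): 0.2, (1, 0): 0.4, (0, 1): 0.6, (1, 1): 0.8}
noisy_gap = separation_gap(p_xa, noisy, lambda x, a: noisy[(x, a)])

# Deterministic outcomes (perfect prediction), base rates 0.5 and 1.0.
exact = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}
exact_gap = separation_gap(p_xa, exact, lambda x, a: exact[(x, a)])
```

With noisy outcomes and unequal base rates, the calibrated Bayes score has a separation gap of about 0.16 in this example. With deterministic outcomes the gap is zero even though base rates differ, matching the vacuous case described above.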

Proposition 2.5 readily implies that any score which has small risk with respect to also necessarily violates the separation criterion:

Corollary 2.6 (Separation of the ERM).

Let be the risk associated with a loss function satisfying Assumption 1 with parameter . Then, for any score , possibly the ERM, and any attribute ,

In prior work, Kleinberg et al. (2017)’s impossibility result (Theorem 1.1, 1.2), as well as subsequent generalizations in Pleiss et al. (2017), states that a score that satisfies both calibration and separation must be either a perfect predictor or the problem must have equal base rates across groups, that is, . In contrast, Proposition 2.5 provides a quantitative lower bound on the separation gap of a calibrated score, for arbitrary configurations of base rates and closeness to perfect prediction. This is crucial for approximating the separation gap of the ERM in Corollary 2.6.

2.3 Lower bounds for sufficiency and calibration

We now present two lower bounds which demonstrate that the behavior depicted in Theorem 2.3 is sharp in the worst case. In Appendix C, we construct a family of distributions over pairs , and a family of attributes which are measurable functions of . We choose the distribution parameter and attribute parameter to be drawn from specified priors and . We also consider a class of score functions mapping , which contains the calibrated Bayes classifier for any and (this is possible because the attributes are -measurable). We choose to be the risk associated with the square loss, and consider classifiers trained on a sample of i.i.d. draws from . In this setting, we have the following:

Theorem 2.7.

Let denote the output of any learning algorithm trained on a sample , and let denote the empirical risk minimizer of trained on . Then, with constant probability over , , and , and .

In particular, taking , we see that for any sample size , we have that

with constant probability. In addition, Theorem 2.7 shows that in the worst case, the calibration and sufficiency gaps decay as with samples.

We can further modify the construction to lower bound the per-group sufficiency and calibration gaps in terms of . Specifically, for each , we construct in Appendix D a family of distributions and -measurable attributes such that, for all , , for all and . The construction also entails modifying the class ; in this setting, our construction is as follows:

Theorem 2.8.

Fix . For any score trained on , and the empirical risk minimizer , it holds that and , with constant probability over , , and .

3 Experiments

In this section, we present numerical experiments on two datasets to corroborate our theoretical findings. These are the Adult dataset from the UCI Machine Learning Repository (Dua and Karra Taniskidou, 2017) and a dataset of pretrial defendants from Broward County, Florida (Angwin et al., 2016; Dressel and Farid, 2018) (henceforth referred to as the Broward dataset).

The Adult dataset contains 14 demographic features for 48842 individuals, for predicting whether one’s annual income is greater than $50,000. The Broward dataset contains 7 features of 7214 individuals arrested in Broward County, Florida between 2013 and 2014, with the goal of predicting recidivism within two years. It is derived by Dressel and Farid (2018) from the original dataset used by Angwin et al. (2016) to evaluate a widely used criminal risk assessment tool. We present results for the Adult dataset in the current section, and those for the Broward dataset in Appendix G.2.

Score functions are obtained by logistic regression on a training set that is 80% of the original dataset, using all available features, unless otherwise stated.

We first examine the sufficiency of the score with respect to two sensitive attributes, gender and race in Section 3.1. Then, in Section 3.2 we show that the score obtained from empirical risk minimization is sufficient and calibrated with respect to multiple sensitive attributes simultaneously. Section 3.3 explores how sufficiency and separation are affected differently by the amount of training data, as well as the model class.

We use two descriptions of sufficiency. In Sections 3.1 and 3.2, we present so-called calibration plots (e.g., Figure 1), which plot observed positive outcome rates against score deciles for different groups. The shaded regions indicate 95% confidence intervals for the rate of positive outcomes under a binomial model. In Section 3.3, we report empirical estimates of the sufficiency gap, $\mathrm{suf}_f(A)$, using a test set that is 20% of the original dataset. More details on this estimator can be found in Appendix G.1. In general, models that are more sufficient and calibrated have smaller $\mathrm{suf}_f(A)$, and their calibration plots show overlapping confidence intervals for different groups.
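The data behind a calibration plot of this kind can be assembled as in the sketch below. It bins examples by the rank of their score (deciles when `n_bins=10`) and reports, for each group and bin, the observed positive rate with a 95% interval. The normal approximation to the binomial interval, the rank-based handling of ties, and the function name are assumptions of this sketch; the paper does not specify these details here.

```python
import math
from collections import defaultdict


def calibration_curve_by_group(scores, groups, labels, n_bins=10, z=1.96):
    """Per-group positive rate in each score-quantile bin, with an interval.

    Returns {group: [(bin, rate, lower, upper, count), ...]}. Bins are
    quantile bins of the pooled scores, assigned by rank, and the interval
    is the normal approximation to a binomial proportion.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bin_of = [0] * len(scores)
    for rank, i in enumerate(order):
        bin_of[i] = min(rank * n_bins // len(scores), n_bins - 1)

    count = defaultdict(int)
    pos = defaultdict(int)
    for b, a, y in zip(bin_of, groups, labels):
        count[(a, b)] += 1
        pos[(a, b)] += y

    curves = defaultdict(list)
    for (a, b) in sorted(count):
        n = count[(a, b)]
        rate = pos[(a, b)] / n
        half = z * math.sqrt(rate * (1.0 - rate) / n)
        curves[a].append((b, rate, max(0.0, rate - half), min(1.0, rate + half), n))
    return dict(curves)


# Toy usage with two bins and two alternating groups.
scores = [i / 20 for i in range(20)]
groups = ["f" if i % 2 == 0 else "m" for i in range(20)]
labels = [1 if s >= 0.5 else 0 for s in scores]
curves = calibration_curve_by_group(scores, groups, labels, n_bins=2)
```

Plotting `rate` against `bin` for each group, with the interval as a shaded band, gives a picture of the kind described above. Overlapping bands across groups at a given bin suggest approximate sufficiency at that score level.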

3.1 Training with group information has modest effects on sufficiency

Figure 1: Calibration plot for score using group attribute
Figure 2: Calibration plot for score not using group attribute

In this section, we examine the sufficiency of ERM scores, with respect to gender and race. When all available features were used in the regression, including sensitive attributes, the empirical risk minimizer of the logistic loss is sufficient and calibrated with respect to both gender and race, as seen in Figure 1. However, sufficiency can hold approximately even when the score is not a function of the group attribute. Figure 2 shows that without the group variable, the ERM score is only slightly less calibrated; the confidence intervals for both groups still overlap at every score decile.

3.2 Simultaneous sufficiency with respect to multiple group attributes

Figure 3: Calibration plot with respect to other group attributes
Figure 4: Calibration plot with respect to combinations of features: linear combination (left), intersectional combination (right)

Furthermore, we observe that empirical risk minimization with logistic regression also achieves approximate sufficiency with respect to any other group attribute defined on the basis of the given features, not only gender and race. In Figure 3, we show the calibration plot for the ERM score with respect to Age, Education-Num, Workclass, and Hours per week; Figure 4 considers combinations of two features. In each case, the confidence intervals for the rate of positive outcomes for all groups overlap at most, if not all, score deciles. In particular, Figure 4 (right) shows that the ERM score is close to sufficient and calibrated even for a newly defined group attribute that is the intersectional combination of race and gender. The calibration plots for other features, as well as implementation details, can be found in Appendix G.3.

3.3 Sufficiency improves with model accuracy and model flexibility

Figure 5: Sufficiency, Separation, and Logistic Loss vs. Number of training examples
Figure 6: Sufficiency for models trained with different L1 regularization parameters (left) and with different number of features (right)

Our theoretical results suggest that the sufficiency gap of a score function is tightly related to its excess risk. In general, it is impossible to determine the excess risk of a given classifier with respect to the Bayes risk from experimental data. Instead we shall examine how the sufficiency gap of a score trained by logistic regression varies with the number of samples and the model class, both of which were chosen because of their impact on the excess risk of the score.

Specifically, we explore the effects of decreased risk on sufficiency gap due to (a) increased number of training examples (Figure 5) and (b) increased expressiveness of the class of score functions (Figure 6). As the number of training samples increases, the gap between the ERM and least-risk score function in a given class $\mathcal{F}$, $\mathcal{L}(\hat{f}_n) - \min_{f \in \mathcal{F}} \mathcal{L}(f)$, decreases. On the other hand, as the number of model parameters grows, the class $\mathcal{F}$ becomes more expressive, and $\min_{f \in \mathcal{F}} \mathcal{L}(f)$ may become closer to the Bayes risk $\mathcal{L}^*$.

Figures 5 and 6 display, for each experiment, the sufficiency gap and logistic loss on a test set averaged over 10 random trials, each using a randomly chosen training set. The shaded region in the figures indicates two standard deviations from the average value. In Figure 5, as the number of training examples increases, the logistic loss of the score decreases, and so does the sufficiency gap. For the race group attribute, we even observe that the sufficiency gap is going to zero; this is predicted by Theorem 2.3 as the risk of the score approaches the Bayes risk. Figure 5 also displays the separation gap of the scores. Indeed, the separation gap is bounded away from zero, as predicted by Corollary 2.6, and does not decrease with the number of training examples. This corroborates our finding that unconstrained machine learning cannot achieve the separation notion of fairness even with infinite data samples.

In Figure 6 (right), we gradually restrict the model class by reducing the number of features used in logistic regression. As the number of features decreases, the logistic loss increases and so does the sufficiency gap. In Figure 6 (left), we implicitly restrict the model class by varying the regularization parameter. In this case, a smaller regularization parameter corresponds to more severe regularization, which constrains the learned weights to be inside a smaller L1 ball. As we increase regularization, the logistic loss increases and so does the sufficiency gap. Both experiments show that the sufficiency gap is reduced when the model class is enlarged, again demonstrating its tight connection to the excess risk.


In summary, our results show that group calibration follows from closeness to the risk of the calibrated Bayes optimal score function. Consequently, empirical risk minimization is a simple and efficient recipe for achieving group calibration, provided that (1) the function class is sufficiently rich, (2) there are enough training samples, and (3) the group attribute can be approximately predicted from the available features. On the other hand, we show that group calibration does not and cannot solve fairness concerns that pertain to the Bayes optimal score function, such as the violation of separation and independence.

More broadly, our findings suggest that group calibration is an appropriate notion of fairness only when we expect unconstrained machine learning to be fair, given sufficient data. Stated otherwise, focusing on calibration alone is likely insufficient to mitigate the negative impacts of unconstrained machine learning.


  • Ahlswede [2007] Rudolf Ahlswede. The final form of Tao’s inequality relating conditional expectation and conditional mutual information. Advances in Mathematics of Communications, 1:239, 2007.
  • Angwin et al. [2016] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, May 2016. URL
  • Barocas and Selbst [2016] Solon Barocas and Andrew D Selbst. Big data’s disparate impact. UCLA Law Review, 2016.
  • Barocas et al. [2018] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning., 2018.
  • Bartlett et al. [2006] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Chouldechova [2017] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5, 2017.
  • Cleary [1966] T. Anne Cleary. Test bias: Validity of the scholastic aptitude test for negro and white students in integrated colleges. ETS Research Bulletin Series, 1966(2):i–23, 1966.
  • Cleary [1968] T. Anne Cleary. Test bias: Prediction of grades of negro and white students in integrated colleges. Journal of Educational Measurement, 5(2):115–124, 1968.
  • Corbett-Davies et al. [2017a] Sam Corbett-Davies, Sharad Goel, and Sandra González-Bailón. Thoughts on machine learning accuracy. New York Times, July 2017a. URL
  • Corbett-Davies et al. [2017b] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 797–806, New York, NY, USA, 2017b. ACM.
  • Cox [1958] David R. Cox. Two further applications of a model for binary regression. Biometrika, 45(3-4):562–565, 1958.
  • Crawford [2013] Kate Crawford. The hidden biases in big data. Harvard Business Review, 1, 2013.
  • Crawford [2017] Kate Crawford. The trouble with bias. NIPS Keynote, 2017.
  • Darlington [1971] Richard B Darlington. Another look at “cultural fairness”. Journal of Educational Measurement, 8(2):71–82, 1971.
  • Dawid [1982] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
  • DeGroot and Fienberg [1983] Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32(1/2):12–22, 1983.
  • Dieterich et al. [2016] William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity, 2016. URL
  • Dressel and Farid [2018] Julia Dressel and Hany Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1), 2018. doi: 10.1126/sciadv.aao5580.
  • Dua and Karra Taniskidou [2017] Dheeru Dua and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL
  • Feldman et al. [2015] M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 259–268, 2015.
  • Hardt et al. [2016] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 3315–3323, 2016.
  • Hashimoto et al. [2018] T. B. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning (ICML), 2018.
  • Hebert-Johnson et al. [2018] Ursula Hebert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (Computationally-identifiable) masses. In Proceedings of the 35th International Conference on Machine Learning, pages 1944–1953, Stockholm, Sweden, 2018.
  • Hsu et al. [2011] Daniel Hsu, Sham M Kakade, and Tong Zhang. An analysis of random design linear regression. Citeseer, 2011.
  • Kearns et al. [2017] Michael J. Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. CoRR, abs/1711.05144, 2017.
  • Kleinberg et al. [2018] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan. Algorithmic fairness. AEA Papers and Proceedings, 108:22–27, 2018.
  • Kleinberg et al. [2017] Jon M. Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. Proc. 8th ITCS, 2017.
  • Liu et al. [2018] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3156–3164, Stockholm, Sweden, 2018.
  • Murphy and Winkler [1977] Allan H. Murphy and Robert L. Winkler. Reliability of subjective probability forecasts of precipitation and temperature. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26(1):41–47, 1977.
  • Niculescu-Mizil and Caruana [2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 625–632, 2005. ISBN 1-59593-180-5.
  • Platt [1999] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
  • Pleiss et al. [2017] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems 30, pages 5684–5693, 2017.
  • Simchowitz et al. [2016] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. Best-of-K-bandits. In Conference on Learning Theory, pages 1440–1489, 2016.
  • Tao [2006] Terence Tao. Szemerédi’s regularity lemma revisited. Contributions to Discrete Mathematics, 1, 2006.
  • Tropp [2015] Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
  • Tsybakov [2008] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519, 9780387790510.
  • Vapnik [1992] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
  • Zadrozny and Elkan [2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 609–616, 2001. ISBN 1-55860-778-1.

Appendix A Additional Proofs for Section 2.1

In this section we prove Proposition 2.1 and Lemma 2.2.

A.1 Proof of Proposition 2.1

Recall that .

By the tower rule for conditional expectation,


Therefore, the calibrated Bayes score  is sufficient and calibrated.

More generally, conditional expectations of the form are calibrated, where can be any transformation of the features. This follows similarly from the tower rule.
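As a sketch of the tower-rule argument, suppose the calibrated Bayes score is written $f^B(X, A) = \mathbb{E}[Y \mid X, A]$; this notation, with $X$ the features, $A$ the group attribute, and $Y$ the outcome, is an assumption made here for illustration. Since $f^B(X, A)$ is a function of $(X, A)$, conditioning first on $(X, A)$ gives:

```latex
\begin{align*}
\mathbb{E}\left[Y \mid f^B(X,A),\, A\right]
  &= \mathbb{E}\left[\, \mathbb{E}[Y \mid X, A] \;\middle|\; f^B(X,A),\, A \right] \\
  &= \mathbb{E}\left[\, f^B(X,A) \;\middle|\; f^B(X,A),\, A \right]
   = f^B(X,A), \\
\mathbb{E}\left[Y \mid f^B(X,A)\right]
  &= \mathbb{E}\left[\, \mathbb{E}[Y \mid X, A] \;\middle|\; f^B(X,A) \right]
   = f^B(X,A).
\end{align*}
```

The first chain states that the score equals the conditional outcome rate within each group (calibration); comparing the two chains shows that conditioning additionally on $A$ does not change the conditional expectation of $Y$ given the score (sufficiency).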

A.2 Proof of Lemma 2.2

To see that this is true, first note that is a Bregman divergence. We can easily check that . Finally, -strong convexity follows from Pinsker’s inequality for Bernoulli random variables:
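For reference, the textbook form of Pinsker's inequality for Bernoulli distributions with means $p$ and $q$ is $\mathrm{KL}(\mathrm{Ber}(p) \,\|\, \mathrm{Ber}(q)) \ge 2(p - q)^2$, with the divergence measured in nats. The short numerical check below illustrates it on a grid. The helper name `bernoulli_kl` and the grid are choices made here, and the strong-convexity constant in the lemma depends on how the loss is normalized.

```python
import math


def bernoulli_kl(p, q):
    """KL(Ber(p) || Ber(q)) in nats, for p and q strictly between 0 and 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Evaluate the slack KL - 2 (p - q)^2 on a grid of interior points.
grid = [i / 100.0 for i in range(1, 100)]
worst_slack = min(
    bernoulli_kl(p, q) - 2.0 * (p - q) ** 2 for p in grid for q in grid
)
print(f"smallest slack on the grid: {worst_slack:.3e}")
```

The slack is zero when $p = q$ and nonnegative elsewhere, so the smallest value on the grid should be zero up to floating-point error.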

Appendix B Proof of Theorem 2.3

Throughout, we consider a fixed distribution and attribute . We shall also use the shorthand and . We begin by proving the following lemma, which establishes Theorem 2.3 in the case where is the squared loss:

Lemma B.1.

Let be the Bayes classifier, and let denote any function. Then,


To conclude the proof of Theorem 2.3, we first show that . Since is a Bregman divergence and calibrated at (Assumption 1), we have

Moreover, by strong convexity, we have that . Thus,

Applying Lemma B.1 concludes the proof.

B.1 Proof of Lemma B.1 (10) and (11)

First we bound the difference of the conditional expectations. Note that since ,


Moreover, by the definition of


and thus, by (14) and (15), we have


where (16) follows from Jensen’s inequality. Similarly, one has


We then find that


where we’ve applied Jensen’s inequality in (19), and the inequality in (20) uses (17) and (18).
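The Jensen steps used in arguments of this kind are standard. In generic notation, where the random variable $Z$ and the conditioning information $\mathcal{G}$ are placeholders introduced here rather than symbols from the proof, they take the following two forms:

```latex
\begin{align*}
\mathbb{E}\,|Z| &\le \sqrt{\mathbb{E}\left[Z^2\right]}, \\
\mathbb{E}\left[\left(\mathbb{E}[Z \mid \mathcal{G}]\right)^2\right]
  &\le \mathbb{E}\left[\,\mathbb{E}\left[Z^2 \mid \mathcal{G}\right]\right]
   = \mathbb{E}\left[Z^2\right].
\end{align*}
```

The first follows from concavity of the square root, and the second from convexity of the square applied conditionally, followed by the tower property.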

Similarly, for a fixed group , we have

B.2 Proof of Lemma B.1 (12) and (13)

By (14), we have


where the inequality follows from Jensen’s inequality and the tower property. We then find that, by Jensen’s inequality,

Similarly, for a fixed group , we have

B.3 Further Examples for Calibration and Sufficiency Bounds

We present two further examples under which one can meaningfully bound the excess risk, and consequently the sufficiency and calibration gaps, in the incomplete information setting. In the next example, we examine sufficiency when is precisely the uncalibrated Bayes score . The following lemma establishes an upper bound on the sufficiency gap of the uncalibrated Bayes score in terms of the conditional mutual information between and , conditioning on . It is a simple consequence of Tao’s inequality.

Example B.1 (Sufficiency for uncalibrated Bayes score).

Suppose and are discrete -measurable random variables, and is the set of all functions . Denote . Then, under Assumption 1, and


For , we have the following identity for by the tower rule:

By applying Tao’s inequality [Tao, 2006, Ahlswede, 2007], we have that

Note that and the result follows.
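To make the role of conditional mutual information concrete, the sketch below computes $I(Y; A \mid X)$ for a small discrete joint distribution and compares it with the average gap $\mathbb{E}\,|\mathbb{E}[Y \mid X, A] - \mathbb{E}[Y \mid X]|$. The variable names $X$, $A$, $Y$, the joint table, and the function names are assumptions made for illustration. The bound checked here, gap $\le \sqrt{I(Y; A \mid X)/2}$ in nats for binary $Y$, follows from Pinsker's and Jensen's inequalities; it is used as a stand-in and is not claimed to match the constants obtained from Tao's inequality.

```python
import math

# Hypothetical joint pmf over binary (x, a, y); the numbers are illustrative only.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.15,
    (1, 1, 0): 0.05, (1, 1, 1): 0.25,
}


def marginal(pmf, keep):
    """Marginal pmf over the coordinates whose indices are listed in keep."""
    out = {}
    for key, prob in pmf.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out


p_x = marginal(joint, (0,))
p_xa = marginal(joint, (0, 1))
p_xy = marginal(joint, (0, 2))

# Conditional mutual information I(Y; A | X), in nats.
cmi = sum(
    prob * math.log(prob * p_x[(x,)] / (p_xa[(x, a)] * p_xy[(x, y)]))
    for (x, a, y), prob in joint.items()
    if prob > 0.0
)

# Average gap between P(Y = 1 | X, A) and P(Y = 1 | X).
gap = sum(
    p_xa[(x, a)]
    * abs(joint[(x, a, 1)] / p_xa[(x, a)] - p_xy[(x, 1)] / p_x[(x,)])
    for (x, a) in p_xa
)

bound = math.sqrt(cmi / 2.0)
print(f"I(Y; A | X) = {cmi:.4f} nats, gap = {gap:.4f}, bound = {bound:.4f}")
```

When the group attribute carries little information about the outcome beyond the features, the conditional mutual information is small and so is the gap.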

Lastly, we consider an example when the attribute is continuous, and there exists a function which approximately predicts from .

Lemma B.2 (Calibration and sufficiency for continuous group attribute ).

Suppose (1) is the logistic loss, and (2) there exists such that . Let . Denote and . Further suppose (3) is -Lipschitz in its second argument, that is and (4) for some . Then,