Tracking and Improving Information in the Service of Fairness

04/22/2019 · by Sumegha Garg, et al. · Princeton University, Stanford University

As algorithmic prediction systems have become widespread, fears that these systems may inadvertently discriminate against members of underrepresented populations have grown. With the goal of understanding fundamental principles that underpin the growing number of approaches to mitigating algorithmic discrimination, we investigate the role of information in fair prediction. A common strategy for decision-making uses a predictor to assign individuals a risk score; then, individuals are selected or rejected on the basis of this score. In this work, we formalize a framework for measuring the information content of predictors. Central to this framework is the notion of a refinement; intuitively, a refinement of a predictor z increases the overall informativeness of the predictions without losing the information already contained in z. We show that increasing information content through refinements improves the downstream selection rules across a wide range of fairness measures (e.g. true positive rates, false positive rates, selection rates). In turn, refinements provide a simple but effective tool for reducing disparity in treatment and impact without sacrificing the utility of the predictions. Our results suggest that in many applications, the perceived "cost of fairness" results from an information disparity across populations, and thus, may be avoided with improved information.


1 Introduction

As algorithmic predictions are increasingly employed as parts of systems that classify people, concerns that such classifiers may be biased or discriminatory have increased correspondingly. These concerns are far from hypothetical; disparate treatment on the basis of sensitive features, like race and gender, has been well-documented in diverse algorithmic application domains [BG18, BCZ16, FK18]. As such, researchers across fields like computer science, machine learning, and economics have responded with many works aiming to address the serious issues of fairness and unfairness that arise in automated decision-making systems.¹

¹Moritz Hardt’s lecture communicates this trend quite succinctly. https://fairmlclass.github.io/1.html#/4

While most researchers studying algorithmic fairness can agree on the high-level objectives of the field (e.g., to ensure individuals are not mistreated on the basis of protected attributes; to promote social well-being and justice across populations), there is much debate about how to translate these normative aspirations into a concrete, formal definition of what it means for a prediction system to be fair. Indeed, as this nascent field has progressed, the efforts to promote “fair prediction” have grown increasingly divided, rather than coordinated. Exacerbating the problem, [MPB18] identifies that each new approach to fairness makes its own set of assumptions, often implicitly, leading to contradictory notions about the right way to approach fairness [Cho17, KMR17]; these inconsistencies add to the “serious challenge for cataloguing and comparing definitions” [MPB18]. Complicating matters further, recent works [LDR18, CG18] have identified shortcomings of many well-established notions of fairness. At the extreme, these works argue that blindly requiring certain statistical fairness conditions may in fact harm the communities they are meant to protect.

The state of the literature makes clear that choosing an appropriate notion of fairness for prediction tasks is a challenging affair. Increasingly, fairness is viewed as a context-dependent notion [SbF19, HM19], where the “right” notion for a given task should be informed by conversations between computational and social scientists. In the hopes of unifying some of the many directions of research in the area, we take a step back and ask whether there are guiding principles that broadly serve the high-level goals of “fair” prediction, without relying too strongly on any specific notion of fairness. The present work argues that understanding the “informativeness” of predictions needs to be part of any sociotechnical conversation about the fairness of a prediction system. Our main contribution is to provide a technical language with strong theoretical backing to discuss informational issues in the context of fair prediction.

Our contributions.

Towards the goal of understanding common themes across algorithmic fairness, we investigate the role of information in fair prediction. We formalize the notion of informativeness in predictions and demonstrate that it provides an effective tool for understanding and improving the utility, fairness, and impact of downstream decisions. In short, we identify that many “failures” of requiring fairness in prediction systems can be explained by an information disparity across subpopulations. Further, we provide algorithmic tools that aim to counteract these failures by improving informativeness. Importantly, the framework is not wedded to any specific fairness desideratum and can be applied broadly to prediction settings where discrimination may be a concern. Our main contributions can be summarized as follows:

  • First and foremost, we identify informativeness as a key fairness desideratum. We develop information-theoretic and algorithmic tools for reasoning about how much an individual’s prediction reveals about their eventual outcome. Our formulation clarifies the intuition that more informative predictions should enable fairer outcomes, across a wide array of interpretations of what it means to be “fair.” Notably, calibration plays a key technical role in this reasoning; the information-theoretic definitions and connections we draw rely intimately on the assumption that the underlying predictors are calibrated. Indeed, our results demonstrate a surprising application of these calibration-based methods towards improving parity-based fairness criteria, running counter to the conventional wisdom that calibration and parity are completely at odds with one another.

  • In Section 3, we provide an intuitive framework for studying the information content of a predictor. Information content formally quantifies the uncertainty over individuals’ outcomes given their predictions. Leveraging properties of calibrated predictors, we show that the information content of a predictor is directly related to the information loss between the true risk distribution and the predicted risk distribution. Therefore, in many cases, information content – a measurable characteristic of the predicted risk distribution – can serve as a proxy for reasoning about the information disparity across groups. To compare the information content of multiple predictors, we introduce a key concept we call a refinement; informally, a refinement of a predictor increases the overall information content, without losing any of the original information. Refinements provide the technical tool for reasoning about how to improve prediction quality.

  • In Section 4, we revisit the question of finding an optimal fair selection rule. For prominent parity-based fairness desiderata, we show that the optimal selection rule can be characterized as the solution to a certain linear program, based on the given predictor. We prove that improving the information content of these predictions via refinements results in a Pareto improvement of the resulting program in terms of utility, disparity, and long-term impact. As one concrete example, if we hold the selection rule’s utility constant, then refining the underlying predictions causes the disparity between groups to decrease. Additionally, we prove that at times, the cost associated with requiring fairness should be blamed on a lack of information about important subpopulations, not on the fairness desideratum itself.

  • In Section 5, we describe a simple algorithm, merge, for incorporating disparate sources of information into a single calibrated predictor. The merge operation can be implemented efficiently, both in terms of time and sample complexity. Along the way in our analysis, we introduce the concept of refinement distance – a measure of how much two predictors’ knowledge “overlaps” – that may be of independent interest.

Finally, a high-level contribution of the present work is to clarify challenges in achieving fairness in prediction tasks. Our framework for tracking and improving information is particularly compelling because it does not require significant technical background in information theory or algorithms to understand. We hope the framework will facilitate interactions between computational and social scientists to further unify the literature on fair prediction, and ultimately, effect change in the fairness of real-world prediction systems.

1.1 Why information?

We motivate the study of information content in fair prediction by giving an intuitive overview of how information disparities can lead to unfair treatment and impact. We scaffold our discussion around examples from two recent works [LDR18, CG18] that raise concerns about using broad-strokes statistical tests as the definition of fairness. This overview will be informal, prioritizing intuition over technicality; see Section 2 for the formal preliminaries.

We consider a standard prediction setting where a decision maker, who we call the lender, has access to a predictor $\tilde{p} : \mathcal{X} \to [0,1]$; from the predicted risk score $\tilde{p}(x)$, the decision maker must choose whether to accept or reject the individual $x$, i.e. whether to give a loan or not. For each individual $x$, we assume there is an associated outcome $y \in \{0,1\}$ representing if they would default or repay a given loan ($y = 0$ and $y = 1$, respectively). Throughout, we will be focused on calibrated predictors. Intuitively, calibration requires that a predicted score of $v$ corresponds to the same level of risk, regardless of which subpopulation $x$ belongs to. More technically, this means that we can think of $v$ as a conditional probability; that is, amongst the individuals who receive score $v$, a $v$-fraction of them end up having $y = 1$. For simplicity, we assume there are two disjoint subpopulations $A, B \subseteq \mathcal{X}$. The works of [LDR18, CG18] mainly focus on settings where there are material differences between the distribution of true risk in the populations $A$ and $B$, arguing that these differences can lead to undesirable outcomes. We argue that even if the true risks of individuals from $A$ and $B$ are identically distributed, differences in the distribution of predicted risk scores give rise to the same pitfalls.

A caution against parity.

[LDR18] focuses on notions of fairness that require parity between groups. One notion they study is demographic parity, which requires that the selection rates in groups $A$ and $B$ be equal; that is, $\Pr_{x \sim A}[x \text{ is selected}] = \Pr_{x \sim B}[x \text{ is selected}]$. Suppose that the majority of applicants come from group $A$ and that, on average, members of $A$ tend to have higher predictions according to $\tilde{p}$. In such a setting, an unconstrained utility-maximizing lender would give out loans at a higher rate in $A$ than in $B$. The argument in [LDR18] against requiring demographic parity goes as follows: a lender who is constrained to satisfy demographic parity must either give out fewer loans in $A$ or more in $B$; because the lender does not want to give up utility from loaning to $A$, the constrained lender will give out more loans in $B$. [LDR18] argue that in many reasonable settings, the lender will end up loaning to underqualified individuals in $B$ who are unlikely to repay; thus, the default rate in $B$ will increase significantly. In their model, this increased default rate translates into negative impact on the population $B$, whose members may go into debt and become even less creditworthy.

A caution against calibration.

In general, [CG18] advocates for the use of calibrated score functions paired with threshold selection policies, where an individual $x$ is selected if $\tilde{p}(x) \geq \tau$ for some fixed, group-independent threshold $\tau$. Still, they caution that threshold policies paired with calibrated predictors are not sufficient to guarantee equitable treatment. In particular, suppose that the lender is willing to accept individuals if they have at least probability $\tau > 1/2$ of returning the loan. But now consider a set of calibrated risk scores where the scores are much more confident about $A$ than about $B$; at the extreme, suppose that for $x \in A$, $\tilde{p}(x) \in \{0,1\}$ (i.e. perfect predictions) and for $x \in B$, $\tilde{p}(x) = 1/2$ (i.e. uniform predictions). In this case, using a fixed threshold of $\tau > 1/2$ will select every qualified individual in $A$ and none of the individuals from $B$, even though, by the fact that $\tilde{p}$ is calibrated, half of them were qualified. Worse yet, even if we try to select more members of $B$, every member of $B$ has a $1/2$ probability of defaulting. Indeed, in this example, we cannot distinguish between the individuals in $B$ because they all receive the same score $1/2$. In other words, we have no information within the population $B$ even though the predictor was calibrated.
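To make this vignette concrete, the following sketch simulates the extreme case above with assumed, illustrative numbers (the group sizes, the score values, and the threshold of $0.7$ are our choices, not the paper's): group $A$ receives perfectly informative scores, group $B$ receives the uninformative score $1/2$, and a group-independent threshold selects only members of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy populations (sizes and scores are illustrative assumptions, not from the paper).
# Group A: the predictor is perfectly informative -- half the group scores 1.0, half 0.0.
scores_A = np.array([1.0] * 500 + [0.0] * 500)
# Group B: the predictor is uninformative -- everyone scores 0.5,
# even though half of B would repay.
scores_B = np.full(1000, 0.5)

# Outcomes drawn consistently with calibration: y ~ Bernoulli(score).
y_A = rng.binomial(1, scores_A)
y_B = rng.binomial(1, scores_B)

tau = 0.7  # group-independent acceptance threshold (illustrative)

selected_A = scores_A >= tau
selected_B = scores_B >= tau

print("selection rate in A:", selected_A.mean())                  # ~0.5: every qualified member of A
print("selection rate in B:", selected_B.mean())                  # 0.0: nobody in B is selected
print("repayment rate among selected A:", y_A[selected_A].mean()) # ~1.0
print("base rate in B:", y_B.mean())                               # ~0.5: half of B was qualified
```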

These examples make clear that when there are actual differences in the risk score distributions between populations and , seemingly-natural approaches to ensuring fairness – enforcing parity amongst groups or setting a group-independent threshold – may result in a disservice to the underrepresented population. These works echo a perspective raised by [DHP12] that emphasizes the distinction between requiring broad-strokes statistical parity as a constraint versus stating parity as a desideratum. Even if we believe that groups should ideally be treated similarly, defining fairness as satisfying a set of hard constraints may have unintended consequences.

Note that the arguments above relied on differences in the predicted risk scores for $A$ and $B$, but not the true underlying risk. This observation has two immediate corollaries. On the one hand, in both of these vignettes, if the predicted score distributions are different between populations $A$ and $B$, then such approaches to fairness could still cause harm, even if the true risks are identically distributed. On the other hand, just because $A$ and $B$ look different according to the predicted scores, they may not actually be different. Intuitively, the difference between the true risk distribution and the observed risk distribution represents a certain “information loss.” Optimistically, if we could somehow improve the informativeness of the predicted scores to reflect the underlying populations more accurately, then the resulting selection rule might exhibit less disparity between $A$ and $B$ in both treatment and impact.

Concretely, suppose we’re given a set of predicted risk scores where the scores in $A$ tend to be much more extreme (towards $0$ and $1$) than those of $B$. Differences in the risk score distributions such as these can arise for one of two reasons: either individuals from $B$ are inherently more stochastic and unpredictable than those in $A$; or somewhere along the risk estimation pipeline, more information was lost about $B$ than about $A$. Understanding which story is true can be challenging, if not impossible. Still, in cases where we can reject the hypothesis that certain individuals are inherently less predictable than others, the fundamental question to ask is how to recover the lost information in our predictions. In this work, we provide tools to answer this question.

Refinements.

Here, we give a technical highlight of the notion of a refinement and the role refinements serve in improving fairness. In Section 3, we introduce the concept of information content, which gives a global measure of how informative a calibrated predictor $\tilde{p}$ is over the population of individuals; intuitively, as the information content $I(\tilde{p})$ increases, the uncertainty in a typical individual’s outcome decreases. The idea that more information in predictions could lead to better utility or better fairness is not particularly surprising. Still, this intuition on its own presents some challenges. For example, suppose we’re concerned about minimizing the false positive rate (the fraction of the population with $y = 0$ that is selected). Because $I(\tilde{p})$ is a global measure of uncertainty over all of $\mathcal{X}$, $I(\tilde{p})$ could be very high due to confidence about a subpopulation that is very likely to have $y = 0$ ($\tilde{p}(x)$ close to $0$), even though $\tilde{p}$ gives very little information ($\tilde{p}(x)$ far from $0$ or $1$) about the rest of $\mathcal{X}$, which consists of a mix of $y = 0$ and $y = 1$. In this case, a predictor with less information (lower $I$), but better certainty about even a tiny part of the population where $y = 1$ ($\tilde{p}(x)$ close to $1$), would enable lower false positive rates (with nontrivial selection rate).

As such, we need another way to reason about what it means for one set of predicted risk scores to have “better information” than the other. Refinements provide the key tool for comparing the information of predictors. Intuitively, a calibrated predictor $\tilde{p}'$ is a refinement of $\tilde{p}$ if $\tilde{p}'$ hasn’t forgotten any of the information contained in $\tilde{p}$. Formally, we say that $\tilde{p}'$ refines $\tilde{p}$ if $\mathbb{E}[\tilde{p}'(x) \mid \tilde{p}(x) = v] = v$ for every score $v$ in the support of $\tilde{p}$; this definition is closely related to the idea of calibration in a sense that we make formal in Proposition 3.1.

Refinements allow us to reason about how information influences a broad range of quantities of interest in the context of fair prediction. To give a sense of this, consider the following lemma, which we use to prove our main result in Section 4, but is also independently interesting. The lemma shows that under any fixed selection rate, the true positive rate, false positive rate, and positive predictive value all improve with a nontrivial refinement. If $\tilde{p}'$ is a refinement of $\tilde{p}$, then for all selection rates $\beta \in [0,1]$,
$$\mathrm{TPR}^{\tilde{p}'}(\beta) \geq \mathrm{TPR}^{\tilde{p}}(\beta), \qquad \mathrm{FPR}^{\tilde{p}'}(\beta) \leq \mathrm{FPR}^{\tilde{p}}(\beta), \qquad \mathrm{PPV}^{\tilde{p}'}(\beta) \geq \mathrm{PPV}^{\tilde{p}}(\beta).$$

Intuitively, the lemma shows that by improving information through refinements, multiple key fairness quantities improve simultaneously. Leveraging this lemma and other properties of refinements and calibration, we show that for many different ways a decision-maker might choose their “optimal” selection rule, the “quality” of the selection rule improves under refinements. We highlight this lemma as one example of the broad applicability of the refinement concept in the context of fair prediction.

Organization.

The manuscript is structured as follows: Section 2 establishes notation and covers the necessary preliminaries; Section 3 provides the technical framework for measuring information in predictors; Section 4 demonstrates how improving information content improves the resulting fair selection rules; and Section 5 describes the merge algorithm for combining and refining multiple predictors. We conclude with a brief discussion of the context of this work and some directions for future investigation.

1.2 Related Works

The influential work of [DHP12] provided two observations that are of particular relevance to the present work. First, [DHP12] emphasized the pitfalls of hoping to achieve “fairness through blindness” by censoring sensitive information during prediction. Second, the work highlighted how enforcing broad-strokes statistical parity conditions – even if desired or expected in fair outcomes – is insufficient to imply fairness. Our results can be viewed as providing further evidence for these perspectives. As discussed earlier, understanding the ways in which fair treatment can fail to provide fair outcomes [CG18, LDR18] provided much of the motivation for this work. For a comprehensive overview of the literature on the growing list of approaches to fairness in prediction systems, we recommend the recent encyclopedic survey of [MPB18].

A few recent works have (implicitly or explicitly) touched on the relationship between information and fairness. [CJS18] argues that discrimination may arise in prediction systems due to disparity in predictive power; they advocate for addressing discrimination through data collection. Arguably, much of the work on fairness in online prediction [JKMR16] can be seen as a way to gather information while maintaining fairness. Recently, issues of information and fairness were also studied in unsupervised learning tasks [STM18]. From the computational economics literature, [KLMR18] presents a simple planning model that draws similar qualitative conclusions to this work, demonstrating the significance of trustworthy information as a key factor in algorithmic fairness.

The idea that better information improves the lender’s ability to make useful and fair predictions may seem intuitive under our framing. Interestingly, different framings of prediction and informativeness can lead to qualitatively different conclusions. Specifically, the original work on delayed impact [LDR18] suggests that some forms of misestimation (i.e. loss of information) may reduce the potential for harm from applying parity-based fairness notions. In particular, if the lender’s predictor is miscalibrated in a way that underestimates the quality of a group, then increasing the selection rate beyond the global utility-maximizing threshold may be warranted. In our setting, because we assume that the lender’s predictions are calibrated, this type of systematic bias in predictions cannot occur, and more information always improves the resulting selection rule. This discrepancy further demonstrates the importance of group calibration to our notion of information content.

Other works [KRZ19, ILZ19] have investigated the role of hiding information through strategic signaling. In such settings, it may be strategic for a group to hide information about individuals in order to increase the overall selection rate for the group. These distinctions highlight the fact that understanding exactly the role of information in “fair” prediction is subtle and also depends on the exact environment of decision-making. We further discuss how to interpret our theorems as well as the importance of faithfully translating fairness desiderata into mathematical constraints/objectives in Section 6.

The present work can also be viewed as further investigating the tradeoffs between calibration and parity. Inspired by investigative reporting on the “biases” of the COMPAS recidivism prediction system [ALMK16], the incompatibility of calibration and parity-based notions of fairness has received lots of attention in recent years [Cho17, KMR17, PRW17]. Perhaps counterintuitively, our work shows how to leverage properties of calibrated predictors to improve the disparity of the eventual decisions. At a technical level, our techniques are similar in flavor to those of [HKRR18], which investigates how to strengthen group-level calibration as a notion of fairness; we discuss further connections to [HKRR18] in Section 6.

Outside the literature on fair prediction, our notions of information content and refinements are related to other notions from the fields of online forecasting and information theory. In particular, the idea of increasing the information content of calibrated predictions is similar in spirit to the idea of sharpness, studied in the forecasting literature [GBR07, GR07]. The concept of a refinement of a calibrated predictor is closely related to Blackwell’s informativeness criterion [Bla53, Cré82].

2 Preliminaries

Basic notation.

Let $\mathcal{X}$ denote the domain of individuals and $\{0,1\}$ denote the binary outcome space. We assume that individuals and their outcomes are jointly distributed according to $\mathcal{D}$, supported on $\mathcal{X} \times \{0,1\}$. Let $(x, y) \sim \mathcal{D}$ denote an independent random draw from $\mathcal{D}$. For a subpopulation $S \subseteq \mathcal{X}$, we use the shorthand $\mathcal{D}_S$ to denote the data distribution conditioned on $x \in S$, and $x \sim S$ to denote a random sample from the marginal distribution over individuals conditioned on membership in $S$.

Predictors.

A basic goal in learning is to find a classifier that, given $x$ drawn from the marginal distribution over individuals, accurately predicts the outcome $y$. One common strategy for binary classification first maps individuals to a real-valued score using a predictor $\tilde{p} : \mathcal{X} \to [0,1]$ and then selects individuals on the basis of this score. We denote by $\mathrm{supp}(\tilde{p})$ the support of $\tilde{p}$. We denote by $p^* : \mathcal{X} \to [0,1]$ the Bayes optimal predictor, where $p^*(x) = \Pr[y = 1 \mid x]$ represents the inherent uncertainty in the outcome given the individual. Equivalently, for each individual $x$, their outcome $y$ is drawn independently from $\mathrm{Ber}(p^*(x))$, the Bernoulli distribution with expectation $p^*(x)$. While we use $[0,1]$ to denote the codomain of predictors, throughout this work, we assume that the set of individuals $\mathcal{X}$ is finite and hence, the support of any predictor is a discrete, finite subset of the $[0,1]$ interval.

Risk score distributions.

Note that there is a natural bijection between predictors and score distributions. A predictor $\tilde{p}$, paired with the marginal distribution over $\mathcal{X}$, induces a score distribution, which we denote $\tilde{\mathcal{D}}$, supported on $\mathrm{supp}(\tilde{p})$, where the probability density function is given as $\tilde{\mathcal{D}}(v) = \Pr_{x}[\tilde{p}(x) = v]$. For a subpopulation $S \subseteq \mathcal{X}$, we denote by $\tilde{\mathcal{D}}_S$ the score distribution conditioned on $x \in S$.

Calibration.

A useful property of predictors is called calibration, which implies that the scores can be interpreted meaningfully as the probability that an individual will result in a positive outcome. Calibration has been studied extensively in varied contexts, notably in forecasting and online prediction (e.g. [FV98]), and recently as a fairness desideratum [KMR17, PRW17, HKRR18, CG18]; the definition we use is adapted from the fairness literature. [Calibration] A predictor $\tilde{p}$ is calibrated on a subpopulation $S \subseteq \mathcal{X}$ if for all $v \in \mathrm{supp}(\tilde{p})$,
$$\Pr_{(x,y) \sim \mathcal{D}_S}\big[y = 1 \mid \tilde{p}(x) = v\big] = v.$$

For convenience when discussing calibration, we use the notation $S_v = \{x \in S : \tilde{p}(x) = v\}$ for the level sets of $\tilde{p}$ within $S$. Note that we can equivalently define calibration with respect to the Bayes optimal predictor, where $\tilde{p}$ is calibrated on $S$ if for all $v \in \mathrm{supp}(\tilde{p})$, $\mathbb{E}_{x \sim S}[p^*(x) \mid \tilde{p}(x) = v] = v$. Operationally in proofs, we end up using this definition of calibration. This formulation also makes clear that $p^*$ is calibrated on every subpopulation.
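As a quick empirical sanity check of this definition, the following sketch (ours, not the authors' code; the helper name calibration_table and the synthetic numbers are assumptions) estimates, for each score value, the empirical rate of positive outcomes among individuals receiving that score; for a calibrated predictor the two should match up to sampling error.

```python
import numpy as np

def calibration_table(scores, outcomes):
    """For each distinct score value v, compare v to the empirical
    rate of positive outcomes among individuals receiving score v."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    rows = []
    for v in np.unique(scores):
        mask = scores == v
        rows.append((v, outcomes[mask].mean(), int(mask.sum())))
    return rows  # list of (score, empirical positive rate, count)

# Illustrative usage on synthetic data where scores are calibrated by construction.
rng = np.random.default_rng(1)
scores = rng.choice([0.2, 0.5, 0.8], size=5000)
outcomes = rng.binomial(1, scores)
for v, rate, n in calibration_table(scores, outcomes):
    print(f"score {v:.1f}: empirical rate {rate:.3f} over {n} individuals")
```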

Parity-based fairness.

As a notion of fairness, calibration aims to ensure similarity between predictions and the true outcome distribution. Other fairness desiderata concern disparity in prediction between subpopulations on the basis of a sensitive attribute. For simplicity, we will imagine individuals are partitioned into two subpopulations $A, B \subseteq \mathcal{X}$; we will overload notation and use $A$ and $B$ to denote the names of the attributes as well. We let $\mathcal{A} : \mathcal{X} \to \{A, B\}$ map individuals to their associated attribute. The most basic notion of parity is demographic parity (also called statistical parity), which states that the selection rate of individuals should be independent of the sensitive attribute. [Demographic parity [DHP12]] A selection rule satisfies demographic parity if
$$\Pr_{x \sim A}[x \text{ is selected}] = \Pr_{x \sim B}[x \text{ is selected}].$$

One critique of demographic parity is that the notion does not take into account the actual qualifications of groups (i.e. no dependence on the outcome $y$). Another popular parity-based notion, called equalized opportunity, addresses this criticism by enforcing parity of false negative rates across groups. [Equalized opportunity [HPS16]] A selection rule satisfies equalized opportunity if
$$\Pr_{x \sim A}[x \text{ is selected} \mid y = 1] = \Pr_{x \sim B}[x \text{ is selected} \mid y = 1].$$

In addition to these fairness concepts, the following properties of a selection rule will be useful to track: the true positive rate ($\mathrm{TPR}$), the fraction of individuals with $y = 1$ that are selected; the false positive rate ($\mathrm{FPR}$), the fraction of individuals with $y = 0$ that are selected; and the positive predictive value ($\mathrm{PPV}$), the fraction of selected individuals with $y = 1$.

3 Measuring information in binary prediction

In this section, we formalize a notion of information content in calibrated predictors. In the context of binary prediction, a natural way to measure the “informativeness” of a predictor is by the uncertainty in an individual’s outcome given their score. We quantify this uncertainty using variance.² For $v \in [0,1]$, the variance of a Bernoulli random variable with expected value $v$ is given as $\sigma^2(v) = v(1 - v)$. Note that variance is a strictly concave function in $v$ that is maximized at $v = 1/2$ and minimized at $v \in \{0, 1\}$; that is, a Bernoulli trial with $v = 1/2$ is maximally uncertain whereas a trial with $v = 0$ or $v = 1$ is perfectly certain. Consider a random draw $(x, y) \sim \mathcal{D}_S$. If $\tilde{p}$ is a calibrated predictor, then given $x$ and $\tilde{p}(x) = v$, the conditional distribution over $y$ follows a Bernoulli distribution with expectation $v$. This observation suggests the following definition. [Information content] Suppose for $S \subseteq \mathcal{X}$, $\tilde{p}$ is calibrated on $S$. The information content of $\tilde{p}$ on $S$ is given as
$$I_S(\tilde{p}) = 1 - 4 \cdot \mathbb{E}_{x \sim S}\big[\sigma^2(\tilde{p}(x))\big].$$

²Alternatively, we could measure uncertainty through Shannon entropy. In Appendix A, we show that notions of information that arise from Shannon entropy are effectively interchangeable with those that arise from variance. We elect to work with variance in the main body primarily because it simplifies the analysis in Section 5.

For a calibrated $\tilde{p}$, we use $I(\tilde{p})$ to denote the “information content of $\tilde{p}$” over the entire domain. The factor of $4$ in the definition of information content acts as a normalization factor such that $I_S(\tilde{p}) \in [0, 1]$. At the extremes, a perfectly informative predictor (with scores in $\{0, 1\}$) has information content $1$, whereas a calibrated predictor that always outputs $1/2$ has $0$ information.
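The definition above can be computed directly from a calibrated predictor's scores. A minimal sketch (ours, not the authors' code), assuming the normalization $I = 1 - 4\,\mathbb{E}[v(1-v)]$ described above:

```python
import numpy as np

def information_content(scores):
    """Information content of a calibrated predictor, computed from its scores.
    Uses the normalization assumed above: I = 1 - 4 * E[ v * (1 - v) ],
    so a perfect predictor (scores in {0, 1}) gets 1 and a constant 1/2 gets 0."""
    v = np.asarray(scores, dtype=float)
    return 1.0 - 4.0 * np.mean(v * (1.0 - v))

print(information_content([0.0, 1.0, 1.0, 0.0]))  # 1.0: perfectly informative
print(information_content([0.5, 0.5, 0.5, 0.5]))  # 0.0: no information
print(information_content([0.1, 0.9, 0.5, 0.5]))  # 0.32: somewhere in between
```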

This formulation of information content as uncertainty in a binary outcome is intuitive in the context of binary classification. In some settings, however, it may be more instructive to reason about risk score distributions directly. A conceptually different approach to measuring informativeness of a risk score distribution might track the uncertainty in the true (Bayes optimal) risk, given the predicted risk score.

Consider a random variable that takes value $p^*(x)$ for $x$ sampled from the individuals with score $v$; that is, the random variable equals the true risk for an individual sampled amongst those receiving predicted risk score $v$. Again, we could measure the uncertainty in this random variable by tracking its variance given $v$; the higher the variance, the less information the predicted risk score distribution provides about the true risk score distribution. Recall, for a predictor $\tilde{p}$ that is calibrated on $S$ and a score $v \in \mathrm{supp}(\tilde{p})$, we let $S_v = \{x \in S : \tilde{p}(x) = v\}$. Consider the variance in $p^*(x)$ for $x \sim S_v$, given as
$$\mathrm{Var}_{x \sim S_v}\big[p^*(x)\big] = \mathbb{E}_{x \sim S_v}\big[(p^*(x) - v)^2\big],$$
where the equality uses the fact that $\mathbb{E}_{x \sim S_v}[p^*(x)] = v$ by calibration.

We define the information loss by taking an expectation of this conditional variance over the score distribution induced by $\tilde{p}$. [Information loss] For $S \subseteq \mathcal{X}$, suppose a predictor $\tilde{p}$ is calibrated on $S$. The information loss of $\tilde{p}$ on $S$ is given as
$$L_S(\tilde{p}) = 4 \cdot \mathbb{E}_{v \sim \tilde{\mathcal{D}}_S}\Big[\mathrm{Var}_{x \sim S_v}\big[p^*(x)\big]\Big].$$

Again, the factor of $4$ is simply to normalize the information loss into the range $[0, 1]$. This loss is maximized when $S$ is a $50{:}50$ mix of individuals with $p^*(x) = 0$ and $p^*(x) = 1$ but $\tilde{p}$ always predicts $1/2$; the information loss is minimized for $\tilde{p} = p^*$. We observe that this notion of information loss is actually proportional to the expected squared error of $\tilde{p}$ with respect to $p^*$; that is,
$$L_S(\tilde{p}) = 4 \cdot \mathbb{E}_{x \sim S}\big[(p^*(x) - \tilde{p}(x))^2\big].$$

Thus, for calibrated predictors, we can interpret the familiar squared loss between a predictor and the Bayes optimal predictor as a notion of information loss.

Connecting information content and information loss.

As we defined them, information content and information loss seem like conceptually different ways to measure uncertainty in a predictor. Here, we show that they actually capture the same notion. In particular, we can express the information loss of $\tilde{p}$ as the difference between the information content of $p^*$ and that of $\tilde{p}$. Let $p^*$ denote the Bayes optimal predictor. Suppose for $S \subseteq \mathcal{X}$, $\tilde{p}$ is calibrated on $S$. Then
$$L_S(\tilde{p}) = I_S(p^*) - I_S(\tilde{p}).$$

Proof.

The proof follows by expanding the information loss and rearranging so that, assuming $\tilde{p}$ is calibrated, terms cancel. All expectations below are taken over $x \sim S$. We can rewrite the information loss as follows:
$$\begin{aligned}
L_S(\tilde{p}) &= 4\,\mathbb{E}\big[(p^*(x) - \tilde{p}(x))^2\big] \\
&= 4\,\mathbb{E}\big[p^*(x)^2\big] - 8\,\mathbb{E}\big[p^*(x)\,\tilde{p}(x)\big] + 4\,\mathbb{E}\big[\tilde{p}(x)^2\big] \\
&= 4\,\mathbb{E}\big[p^*(x)^2\big] - 4\,\mathbb{E}\big[\tilde{p}(x)^2\big] \\
&= \big(1 - 4\,\mathbb{E}[\sigma^2(p^*(x))]\big) - \big(1 - 4\,\mathbb{E}[\sigma^2(\tilde{p}(x))]\big) \\
&= I_S(p^*) - I_S(\tilde{p}),
\end{aligned}$$

where the third equality follows because $\mathbb{E}[p^*(x)\tilde{p}(x)] = \mathbb{E}[\tilde{p}(x)^2]$ under calibration, and the fourth uses $\mathbb{E}[p^*(x)] = \mathbb{E}[\tilde{p}(x)]$, again by calibration. ∎

Because the information loss is a nonnegative quantity, Proposition 3 also formalizes the intuition that the Bayes optimal predictor is the most informative predictor; for any calibrated $\tilde{p}$, $I_S(\tilde{p}) \leq I_S(p^*)$. Ideally, in order to evaluate the information disparity across groups, we would compare the information lost from $p^*$ to $\tilde{p}$ across $A$ and $B$. But because the definition of information loss depends on the true risk $p^*$, in general, it’s impossible to directly compare the loss. Still, if we believe that $p^*$ is similarly distributed across $A$ and $B$, then measuring the information contents $I_A(\tilde{p})$ and $I_B(\tilde{p})$ – properties of the observed risk scores – allows us to directly compare the loss.
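For intuition, here is a tiny numerical check of the identity in Proposition 3 on an assumed four-person population (the numbers and the helper info_content are ours, chosen only to make the cancellation visible):

```python
import numpy as np

# A toy check of the identity  L_S(p~) = I_S(p*) - I_S(p~)  under calibration.
# Notation and the factor of 4 follow the (reconstructed) definitions above.
def info_content(v):
    v = np.asarray(v, dtype=float)
    return 1.0 - 4.0 * np.mean(v * (1.0 - v))

# Bayes optimal risk p* over a population of 4 individuals.
p_star = np.array([0.1, 0.9, 0.1, 0.9])
# A coarser predictor that is calibrated: it predicts the average risk, 0.5, for everyone.
p_tilde = np.array([0.5, 0.5, 0.5, 0.5])

info_loss = 4.0 * np.mean((p_star - p_tilde) ** 2)    # 4 * E[(p* - p~)^2]
print(info_loss)                                       # 0.64
print(info_content(p_star) - info_content(p_tilde))    # 0.64 as well
```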

3.1 Incorporating information via refinements

We have motivated the study of informativeness in prediction with the intuition that as information content improves, so too will the resulting fairness and utility of the decisions derived from the predictor. Without further assumptions, however, this line of reasoning turns out to be overly optimistic. For instance, consider a setting where the expected utility of lending to an individual is positive if their true risk is at least some fixed threshold (and negative otherwise). In this case, information about individuals whose risk is significantly below the threshold is not especially useful. Figure 1 gives an example of two predictors, each calibrated to the same $p^*$, where the first has higher information content, but the second is preferable. At a high level, the example exploits the fact that information content is a global property of the score distribution, whereas the quantities that affect the utility and fairness directly, like $\mathrm{TPR}$ or $\mathrm{FPR}$, are conditional properties. Even if one predictor has higher information content, it could be that it has lost information about an important subpopulation compared to the other, compensating with lots of information about the unqualified individuals.

Figure 1: Comparing information content directly is insufficient to compare predictors’ utility. Two predictors are each calibrated to the same $p^*$; the predictor with higher information content nevertheless achieves worse utility for suitable thresholds.

Still, we would like to characterize ways in which more information is definitively “better.” Intuitively, more information is better when we don’t have to give up on the information in the current predictor, but rather refine the information contained in the predictions further. The following definition formalizes this idea. [Refinement] For $S \subseteq \mathcal{X}$, suppose $\tilde{p}, \tilde{p}'$ are calibrated on $S$. $\tilde{p}'$ is a refinement of $\tilde{p}$ on $S$ if for all $v \in \mathrm{supp}(\tilde{p})$,
$$\mathbb{E}_{x \sim S}\big[\tilde{p}'(x) \mid \tilde{p}(x) = v\big] = v.$$

That is, we say that $\tilde{p}'$ refines $\tilde{p}$ if $\tilde{p}'$ maintains the same expectation over the level sets $S_v$. To understand why this property makes sense in the context of maintaining information from $\tilde{p}$ to $\tilde{p}'$, suppose the property was violated: that is, there is some $v \in \mathrm{supp}(\tilde{p})$ such that $\mathbb{E}_{x \sim S}[\tilde{p}'(x) \mid \tilde{p}(x) = v] \neq v$. This disagreement provides evidence that $\tilde{p}$ has some consistency with the true risk that $\tilde{p}'$ is lacking; because $\tilde{p}$ is calibrated, $\mathbb{E}_{x \sim S}[p^*(x) \mid \tilde{p}(x) = v] = v$. In other words, even if $\tilde{p}'$ has greater information content, it may not be consistent with the content of $\tilde{p}$.

Another useful perspective on refinements is through measuring the information on each of the level sets $S_v$. Restricted to $S_v$, $\tilde{p}$ has minimal information content – its predictions are constant – whereas $\tilde{p}'$ may vary. Because $\tilde{p}'$ is calibrated and maintains the expectation over $S_v$, we can conclude that $I_{S_v}(\tilde{p}') \geq I_{S_v}(\tilde{p})$ for each of the partitions.

This perspective highlights the importance of requiring calibration in the definition of refinements. Indeed, because a refinement is a calibrated predictor, refinements cannot make arbitrary distinctions in predictions, so any additional distinctions on the level sets of the original predictor must represent true variability in $p^*$. We draw attention to the similarity between the definition of a refinement and the definition of calibration. In particular, if $\tilde{p}'$ is a refinement of $\tilde{p}$, then $\tilde{p}$ is calibrated not only with respect to $p^*$, but also with respect to $\tilde{p}'$; stated differently, $p^*$ is a refinement of every calibrated predictor. Indeed, one way to interpret a refinement is as a “candidate” Bayes optimal predictor. Carrying this intuition through, we note that the only property of $p^*$ we used in the proof of Proposition 3 is that it is a refinement of a calibrated $\tilde{p}$. Thus, we can immediately restate the proposition in terms of generic refinements. Suppose for $S \subseteq \mathcal{X}$, $\tilde{p}, \tilde{p}'$ are calibrated on $S$. If $\tilde{p}'$ refines $\tilde{p}$ on $S$, then
$$I_S(\tilde{p}') - I_S(\tilde{p}) = 4 \cdot \mathbb{E}_{x \sim S}\big[(\tilde{p}'(x) - \tilde{p}(x))^2\big].$$
This characterization further illustrates the notion that a refinement could plausibly be the true risk given the information in the current predictions $\tilde{p}$. In particular, because the right-hand side is nonnegative, we get $I_S(\tilde{p}') \geq I_S(\tilde{p})$ for any refinement $\tilde{p}'$.
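The refinement condition can be checked directly on a finite population. A small sketch (ours; the function name is_refinement and the example scores are assumptions for illustration):

```python
import numpy as np

def is_refinement(p_coarse, p_fine, tol=1e-9):
    """Check the (reconstructed) refinement condition: on every level set of the
    coarse predictor, the fine predictor has the same average prediction.
    Both arguments are arrays of scores over the same finite population."""
    p_coarse = np.asarray(p_coarse, dtype=float)
    p_fine = np.asarray(p_fine, dtype=float)
    for v in np.unique(p_coarse):
        mask = p_coarse == v
        if abs(p_fine[mask].mean() - v) > tol:
            return False
    return True

# p_fine splits the 0.5-level set of p_coarse into 0.2 and 0.8 while preserving its mean.
p_coarse = np.array([0.5, 0.5, 0.5, 0.5, 0.9, 0.9])
p_fine   = np.array([0.2, 0.8, 0.2, 0.8, 0.9, 0.9])
print(is_refinement(p_coarse, p_fine))                                     # True
print(is_refinement(p_coarse, np.array([0.2, 0.2, 0.8, 0.9, 0.9, 0.9])))   # False: the mean shifted
```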

In the context of fair prediction, we want to ensure that the information content on specific protected subpopulations does not decrease. Indeed, in this case, it may be important to ensure that the predictions are refined, not just overall, but also on the sensitive subpopulations. In Figure 2, we illustrate this point by showing two predictors that are each calibrated on two subpopulations $A$ and $B$; one refines the other on the overall population, but loses information about the subpopulation $B$. This negative example highlights the importance of incorporating all the information available (e.g. group membership), not only at the time of decision-making, but also along the way when developing predictors; it serves as yet another rebuke of the approach of “fairness through blindness” [DHP12].

Figure 2: Per-group refinement is necessary to maintain information for each group. Two predictors are each calibrated on the subpopulations $A$ and $B$; one refines the other on the overall population, but has lost all information about $B$.

4 The value of information in fair prediction

In this section, we argue that reasoning about the information content of calibrated predictors provides a lens into understanding how to improve the utility and fairness of predictors, even when the eventual fairness desideratum is based on parity. We discuss a prediction setting based on that of [LDR18] where a lender selects individuals to give loans from a pool of applicants. While we use the language of predicting creditworthiness, the setup is generic and can be applied to diverse prediction tasks. [LDR18] introduced a notion of “delayed impact” of selection policies, which models the potential negative impact on communities of enforcing parity-based fairness as a constraint. We revisit the question of delayed impact as part of a broader investigation of the role of information in fair prediction. We begin with an overview of the prediction setup. Then, we prove our main result: refining the underlying predictions used to choose a selection policy results in an improvement in utility, parity, or impact (or a combination of the three).

4.1 Fair prediction setup

When deciding how to select qualified individuals, the lender’s goal is to maximize some expected utility. Specifically, the utility function $u$ specifies the lender’s expected utility from an individual based on their score $v$ and a fixed threshold $\tau_u$,³ as given in (1). When considering delayed impact, we will measure the expected impact per subpopulation. The impact function $\Delta$ specifies the expected benefit to an individual from receiving a loan based on their score $v$ and a fixed threshold $\tau_\Delta$, also given in (1).

³Assuming such an affine utility function is equivalent to assuming that the lender receives a fixed gain from repayments, a fixed loss from defaults, and nothing from individuals that do not receive loans. In this case, the expected utility for score $v$ is proportional to $v - \tau_u$. A similar rationale applies to the individuals’ impact function.

$$u(v) = c_u \cdot (v - \tau_u), \qquad \Delta(v) = c_\Delta \cdot (v - \tau_\Delta), \qquad c_u, c_\Delta > 0. \tag{1}$$

[LDR18] models risk-aversion of the lender by assuming that $\tau_\Delta \leq \tau_u$; that is, by accepting individuals with scores between $\tau_\Delta$ and $\tau_u$, the impact on subpopulations may improve beyond the lender’s utility-maximizing policy.

In this setup, we allow the lender to pick a (randomized) group-sensitive selection policy $\pi$ that selects individuals on the basis of a predicted score and their sensitive attribute. That is, the selection policy makes decisions about individuals via their score according to some calibrated predictor $\tilde{p}$ and their sensitive attribute $\mathcal{A}(x)$; for every individual $x$, the probability that $x$ is selected is given as $\pi(\tilde{p}(x), \mathcal{A}(x))$.

We will restrict our attention to threshold policies; that is, for sensitive attribute $A$ (resp., $B$), there is some threshold $\tau_A$ (resp., $\tau_B$) such that $\pi(v, A)$ is given as $1$ if $v > \tau_A$, $0$ if $v < \tau_A$, and $\gamma_A$ for $v = \tau_A$, where $\gamma_A \in [0,1]$ is a probability used to randomly break ties at the threshold. The motivation for focusing on threshold policies is their intuitiveness, widespread use, and computational efficiency.⁴ The restriction to threshold policies is justified formally in [LDR18] by the fact that the optimal decision rule in our setting can be specified as a threshold policy under both demographic parity and equalized opportunity.

⁴Indeed, without the restriction to threshold policies, many of the information-theoretic arguments become easier at the expense of computational cost. As a refinement retains all of the information in the original predictions, we can always simulate decisions derived from the original predictor given the refinement, but in general, we cannot do this efficiently.

Given this setup, we can write the expected utility of a policy $\pi$ based on a predictor $\tilde{p}$, calibrated on both subpopulations $A$ and $B$, as follows:
$$U(\pi) = \sum_{g \in \{A, B\}} \Pr[x \in g] \cdot \mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[\pi(v, g) \cdot u(v)\big]. \tag{2}$$
Recall, $\tilde{\mathcal{D}}_g$ denotes the score distribution of $\tilde{p}$ conditioned on membership in the group $g$.

Similarly, the expected impact on subpopulation $g \in \{A, B\}$ is given as
$$\Delta_g(\pi) = \mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[\pi(v, g) \cdot \Delta(v)\big]. \tag{3}$$
Often, it may make sense to constrain the net impact to each group as defined in (3) to be positive, ensuring that the selection policies do not do harm as in [LDR18].

The following quantities will be of interest to the lender when choosing a selection policy as a function of $\tilde{p}$. First, the lender’s overall utility is given as $U(\pi)$ in (2). In the name of fairness, the lender may also be concerned about the disparity of a number of quantities. We will show below that these quantities can be written as linear functions of the selection rule. In particular, demographic parity, which serves as our running example, compares the selection rates
$$\mathrm{SR}_g(\pi) = \mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[\pi(v, g)\big], \qquad g \in \{A, B\}. \tag{4}$$

We may also be concerned about comparing the true positive rates (equalized opportunity) and false positive rates. Recall, $\mathrm{TPR}_g$ is the fraction of individuals in $g$ with $y = 1$ that are selected and $\mathrm{FPR}_g$ is the fraction of individuals in $g$ with $y = 0$ that are selected; in this context, we can rewrite these quantities as follows:
$$\mathrm{TPR}_g(\pi) = \frac{1}{\rho_g} \cdot \mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[v \cdot \pi(v, g)\big], \tag{5}$$
$$\mathrm{FPR}_g(\pi) = \frac{1}{1 - \rho_g} \cdot \mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[(1 - v) \cdot \pi(v, g)\big], \tag{6}$$
where $\rho_g$ represents the base rate of the subpopulation $g$; that is, $\rho_g = \Pr_{(x,y) \sim \mathcal{D}_g}[y = 1] = \mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}[v]$. Another quantity we will track is the positive predictive value,
$$\mathrm{PPV}_g(\pi) = \frac{\mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[v \cdot \pi(v, g)\big]}{\mathbb{E}_{v \sim \tilde{\mathcal{D}}_g}\big[\pi(v, g)\big]}. \tag{7}$$

Note that $\mathrm{PPV}_g$ is not a linear function of the $\pi(v, g)$ values, but as we never use positive predictive values directly in the optimizations for choosing a selection policy (or in a parity-based fairness definition), the optimizations are still linear programs. For notational convenience, we may write $\mathrm{TPR}^{\tilde{p}}_g$, $\mathrm{FPR}^{\tilde{p}}_g$, and $\mathrm{PPV}^{\tilde{p}}_g$ when the predictor matters, and drop the superscript when $\tilde{p}$ is clear from the context.
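To make the bookkeeping in (2)–(7) concrete, here is a minimal sketch (ours) that computes these quantities for one group from a calibrated predictor's scores and a threshold policy. The function names, the sample numbers, and the affine utility/impact parameters are assumptions for illustration:

```python
import numpy as np

def policy_stats(scores, select_prob, c_u=1.0, tau_u=0.7, c_d=1.0, tau_d=0.6):
    """Selection rate, TPR, FPR, PPV, expected utility and impact for one group.
    `scores` are calibrated scores for the group; `select_prob[i]` is the probability
    the policy selects an individual with scores[i]. The affine utility/impact
    u(v) = c_u (v - tau_u) and Delta(v) = c_d (v - tau_d) follow the form sketched in (1)."""
    v = np.asarray(scores, dtype=float)
    pi = np.asarray(select_prob, dtype=float)
    rho = v.mean()                        # base rate, using calibration
    sr = pi.mean()                        # selection rate
    tpr = np.mean(v * pi) / rho
    fpr = np.mean((1 - v) * pi) / (1 - rho)
    ppv = np.mean(v * pi) / sr if sr > 0 else float("nan")
    utility = np.mean(pi * c_u * (v - tau_u))
    impact = np.mean(pi * c_d * (v - tau_d))
    return dict(sr=sr, tpr=tpr, fpr=fpr, ppv=ppv, utility=utility, impact=impact)

def threshold_policy(scores, tau, tie_prob=1.0):
    """Threshold policy with optional randomization exactly at the threshold."""
    v = np.asarray(scores, dtype=float)
    return np.where(v > tau, 1.0, np.where(v == tau, tie_prob, 0.0))

scores_B = np.array([0.3, 0.5, 0.5, 0.7, 0.9])
print(policy_stats(scores_B, threshold_policy(scores_B, tau=0.5, tie_prob=0.5)))
```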

4.2 Refinements in the service of fair prediction

Note that all of the quantities described in Section 4.1 can be written as linear functions of the selection policy. Given a fixed predictor $\tilde{p}$, we can expand the quantities of interest; in particular, the linear functions over individuals can be rewritten as linear functions over the variables $\pi(v, g)$ for $v \in \mathrm{supp}(\tilde{p})$ and $g \in \{A, B\}$, where the coefficients depend on the underlying distribution only through the predictor $\tilde{p}$. In this section, we show how refining the predictor used for determining the selection rule can improve the utility, parity, and impact of the optimal selection rule. By the observations above, we can formulate a generic policy-selection problem as a linear program where $\tilde{p}$ controls many coefficients in the program. When we refine $\tilde{p}$, we show that the value of the program improves. Recalling that different contexts may call for different notions of fairness, we consider a number of different linear programs the lender (or regulator) might choose to optimize. At a high level, the lender can choose to maximize utility, minimize disparity, or maximize positive impact on groups, while also maintaining some guarantees over the other quantities.

We will consider selection policies given a fixed predictor $\tilde{p}$. Note that the parity-based fairness desiderata we consider are of the form $Q_A(\pi) = Q_B(\pi)$ for some group statistic $Q$ (e.g., the selection rate or the true positive rate); rather than requiring equality, we will consider the disparity $|Q_A(\pi) - Q_B(\pi)|$ and, in some cases, constrain it to be at most some constant $\varepsilon \geq 0$. We also use $\delta$ and $u_0$ to denote lower bounds on the desired impact and utility, respectively. For simplicity’s sake, we assume that $B$ is the “protected” group, so we only enforce the positive impact constraint for this group; more generally, we could include an impact constraint for each group. Formally, we consider the following constrained optimizations.

Optimization (U) (Utility Maximization):
$$\max_{\pi} \; U(\pi) \quad \text{s.t.} \quad |Q_A(\pi) - Q_B(\pi)| \leq \varepsilon, \quad \Delta_B(\pi) \geq \delta$$

Optimization (D) (Disparity Minimization):
$$\min_{\pi} \; |Q_A(\pi) - Q_B(\pi)| \quad \text{s.t.} \quad U(\pi) \geq u_0, \quad \Delta_B(\pi) \geq \delta$$

Optimization (I) (Impact Maximization):
$$\max_{\pi} \; \Delta_B(\pi) \quad \text{s.t.} \quad |Q_A(\pi) - Q_B(\pi)| \leq \varepsilon, \quad U(\pi) \geq u_0$$

Given a predictor $\tilde{p}$ that is calibrated on $A$ and $B$, Optimizations (U), (D), and (I) are linear programs in the variables $\pi(v, g)$ for $v \in \mathrm{supp}(\tilde{p})$ and $g \in \{A, B\}$. Further, for each program, there is an optimal solution that is a threshold policy. We sketch the proof of the lemma. The fact that the optimizations are linear programs follows immediately from the observation that each quantity of interest is a linear function in the variables $\pi(v, g)$. The proof that there is a threshold policy that achieves the optimal value in each program is similar to the proof of Theorem 4.2 given below. Consider an arbitrary (non-threshold) selection policy $\pi$; let $\beta_g$ denote its selection rate in group $g$. The key observation is that, for the fixed selection rates $\beta_A$ and $\beta_B$, there is some other threshold policy whose utility, impact, and parity statistics are no worse than those of $\pi$. Leveraging this observation, given any non-threshold optimal selection policy, we can construct a threshold policy, which is also optimal.
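As one concrete instance of the linear-programming view, the sketch below sets up Optimization (U) for demographic parity with scipy.optimize.linprog. Everything here — the score supports, group weights, the affine utility and impact parameters, and the bounds ε and δ — is an illustrative assumption, not data or code from the paper:

```python
import numpy as np
from scipy.optimize import linprog

# Finite score supports and group-conditional score distributions (illustrative numbers).
support = {"A": np.array([0.2, 0.6, 0.9]), "B": np.array([0.4, 0.5, 0.8])}
weight  = {"A": np.array([0.3, 0.4, 0.3]), "B": np.array([0.5, 0.3, 0.2])}  # D~_g(v)
group_mass = {"A": 0.6, "B": 0.4}          # Pr[x in g]
c_u, tau_u = 1.0, 0.7                      # utility  u(v) = c_u (v - tau_u)
c_d, tau_d = 1.0, 0.6                      # impact   Delta(v) = c_d (v - tau_d)
eps, delta = 0.05, 0.0                     # max disparity, min impact for group B

# Decision variables: x = [pi(v, A) for v in support A] + [pi(v, B) for v in support B].
nA, nB = len(support["A"]), len(support["B"])
util_coef = np.concatenate([
    group_mass["A"] * weight["A"] * c_u * (support["A"] - tau_u),
    group_mass["B"] * weight["B"] * c_u * (support["B"] - tau_u),
])
sr_A = np.concatenate([weight["A"], np.zeros(nB)])       # selection rate of A
sr_B = np.concatenate([np.zeros(nA), weight["B"]])       # selection rate of B
impact_B = np.concatenate([np.zeros(nA), weight["B"] * c_d * (support["B"] - tau_d)])

# linprog minimizes, so negate the utility; inequality rows are A_ub @ x <= b_ub.
A_ub = np.vstack([sr_A - sr_B, sr_B - sr_A, -impact_B])
b_ub = np.array([eps, eps, -delta])
res = linprog(-util_coef, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, 1.0)] * (nA + nB))

print("optimal utility:", -res.fun)
print("selection probabilities:", np.round(res.x, 3))
```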

We remark that our analysis applies even if considering the more general linear maximization:

Optimization (G):
$$\max_{\pi} \; \lambda_1 \cdot U(\pi) \;-\; \lambda_2 \cdot |Q_A(\pi) - Q_B(\pi)| \;+\; \lambda_3 \cdot \Delta_B(\pi)$$

for any fixed weights $\lambda_1, \lambda_2, \lambda_3 \geq 0$.⁵ In other words, the arguments hold no matter the relative weighting of the value of utility, disparity, and impact.

⁵In particular, each of Optimizations (U), (D), and (I) can be expressed as an instance of Optimization (G) by choosing the weights to be the optimal dual multipliers for each program. We note that the dual formulation actually gives an alternate way to derive results from [LDR18]. Their main result can be restated as saying that there exist distributions of scores such that the dual multiplier on the positive impact constraint in Optimization (U) is positive; that is, without this constraint, the utility-maximizing policy will do negative impact to group $B$.

Improving the cost of fairness.

We argue that in all of these optimizations, increasing information through refinements of the current predictor on both subpopulations $A$ and $B$ improves the value of the program. We emphasize that this conclusion is true for all of the notions of parity-based fairness we mentioned above. Thus, independent of the exact formulation of fair selection that policy-makers deem appropriate, information content is a key factor in determining the properties of the resulting selection rule. We formalize this statement in the following theorem.

Let $\tilde{p}, \tilde{p}'$ be two predictors that are calibrated on the disjoint subpopulations $A$ and $B$. For any of the Optimizations (U), (D), (I), (G) and their corresponding fixed parameters, let $\mathrm{OPT}(\tilde{p})$ denote the optimal value under predictor $\tilde{p}$. If $\tilde{p}'$ refines $\tilde{p}$ on $A$ and $B$, then $\mathrm{OPT}(\tilde{p}') \geq \mathrm{OPT}(\tilde{p})$ for Optimizations (U), (I), and (G), and $\mathrm{OPT}(\tilde{p}') \leq \mathrm{OPT}(\tilde{p})$ for Optimization (D). One way to understand Theorem 4.2 is through a “cost of fairness” analysis. Focusing on the utility maximization setting, let $U^*$ be the maximum unconstrained utility achievable by the lender given the optimal predictions $p^*$. Let $U_F(\tilde{p})$ be the optimal value of Optimization (U) using predictions $\tilde{p}$; that is, the best utility a lender can achieve under a parity-based fairness constraint ($\varepsilon$) and positive impact constraint ($\delta$). If we take the cost of fairness to be the difference between these optimal utilities, $U^* - U_F(\tilde{p})$, then Theorem 4.2 says that by refining $\tilde{p}$ to $\tilde{p}'$, the cost of fairness decreases with increasing informativeness; that is, $U^* - U_F(\tilde{p}') \leq U^* - U_F(\tilde{p})$. This corollary of Theorem 4.2 corroborates the idea that in some cases the high perceived cost associated with requiring fairness might actually be due to the low informativeness of the predictions in minority populations. No matter what the true $p^*$ is, this cost will decrease as we increase information content by refining the predictions on subpopulations.

For $\beta \in [0,1]$, we use $\mathrm{TPR}^{\tilde{p}}_g(\beta)$ to denote the true positive rate of the threshold policy with selection rate $\beta$ for the subpopulation $g$ while using the predictor $\tilde{p}$.⁶ $\mathrm{FPR}^{\tilde{p}}_g(\beta)$ and $\mathrm{PPV}^{\tilde{p}}_g(\beta)$ are defined similarly. The following lemma, which plays a key role in each proof, shows that refinements broadly improve selection policies across these three statistics of interest. If $\tilde{p}'$ is a refinement of $\tilde{p}$ on the subpopulations $A$ and $B$, then for $g \in \{A, B\}$ and all $\beta \in [0, 1]$,
$$\mathrm{TPR}^{\tilde{p}'}_g(\beta) \geq \mathrm{TPR}^{\tilde{p}}_g(\beta), \qquad \mathrm{FPR}^{\tilde{p}'}_g(\beta) \leq \mathrm{FPR}^{\tilde{p}}_g(\beta), \qquad \mathrm{PPV}^{\tilde{p}'}_g(\beta) \geq \mathrm{PPV}^{\tilde{p}}_g(\beta).$$

⁶Given a predictor, there is a bijection between selection rates and threshold policies.

In particular, the proof of Theorem 4.2 crucially uses the fact that the positive predictive values, true positive rates, and false positive rates improve for all selection rates. Leveraging properties of refinements, the improvement across all selection rates guarantees improvement for any fixed objective. As we’ll see, the proof actually tells us more: for any selection policy using the predictor $\tilde{p}$, there exists a threshold selection policy that uses the refined predictor $\tilde{p}'$ and simultaneously has utility, disparity, and impact that are no worse than under $\tilde{p}$. In this sense, increasing the informativeness of predictors through refinements is an effective strategy for improving selection rules across a wide array of criteria. Still, we emphasize the importance of identifying fairness desiderata and specifying them clearly when optimizing for a selection rule.
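A quick way to build intuition for the lemma is to compare a coarse calibrated predictor with a refinement of it at fixed selection rates; using the (reconstructed) formula $\mathrm{TPR}_\beta = \mathbb{E}[v \cdot \pi(v)] / \mathbb{E}[v]$, the refined scores dominate. The populations below are illustrative assumptions, not data from the paper:

```python
import numpy as np

def tpr_at_rate(scores, beta):
    """True positive rate of the threshold policy with selection rate beta,
    computed from calibrated scores (TPR = E[v * pi(v)] / E[v])."""
    v = np.sort(np.asarray(scores, dtype=float))[::-1]   # select highest scores first
    n = len(v)
    k = beta * n
    pi = np.clip(k - np.arange(n), 0.0, 1.0)              # fractional selection at the boundary
    return np.sum(v * pi) / np.sum(v)

# A coarse calibrated predictor and a refinement of it: the 0.5-level set is split
# into 0.2 and 0.8 while preserving its average (both predictors are illustrative).
p_coarse = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
p_fine   = np.array([0.2, 0.8, 0.2, 0.8, 0.2, 0.8])

for beta in [1/6, 2/6, 3/6]:
    print(f"beta={beta:.2f}  coarse TPR={tpr_at_rate(p_coarse, beta):.3f}  "
          f"refined TPR={tpr_at_rate(p_fine, beta):.3f}")
```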

For instance, suppose the selection rule is selected by constrained utility maximization with a predictor and with a refined predictor . It is possible that the optimal selection policy under the refinement will have a lower quantitative impact than the optimal policy under the original predictor (while still satisfying the impact constraint). If maintaining the impact above a certain threshold is desired, then this should be specified clearly in the optimization used for determining the selection rule. We defer further discussion of these issues to Section 6.

Proofs.

Next, we prove Lemma 6 and Theorem 4.2.

Proof of Lemma 6. Note that for a fixed selection rate $\beta$, the true positive rate is maximized by picking the top-most $\beta$-fraction of the individuals ranked according to $p^*$, i.e. a threshold policy that selects a $\beta$-fraction of the individuals using the Bayes optimal predictor $p^*$. Similarly, for a fixed selection rate, the $\mathrm{FPR}$ and $\mathrm{PPV}$ values are also optimized under a threshold selection policy that uses the Bayes optimal predictor $p^*$.

Recall, we can interpret a refinement as a “candidate” Bayes optimal predictor. In particular, because $\tilde{p}'$ refines $\tilde{p}$ over $A$ and $B$, we know that $\tilde{p}$ is calibrated not only with respect to the true Bayes optimal predictor $p^*$, but also with respect to the refinement $\tilde{p}'$ on both subpopulations. Imagining a world in which $\tilde{p}'$ is the Bayes optimal predictor, the $\mathrm{TPR}$, $\mathrm{FPR}$, and $\mathrm{PPV}$ must be no worse under a threshold policy derived from $\tilde{p}'$ compared to that of $\tilde{p}$, by the initial observation. Thus, the lemma follows.

Using Lemma 6, we are ready to prove Theorem 4.2.

Proof of Theorem 4.2. Let $\pi$ be any threshold selection policy under the predictor $\tilde{p}$. Using $\pi$, we will construct a selection policy $\pi'$ that uses the refined score distribution such that $U(\pi') \geq U(\pi)$, $\Delta_g(\pi') \geq \Delta_g(\pi)$ for each group $g$, and $Q_A(\pi') = Q_A(\pi)$ and $Q_B(\pi') = Q_B(\pi)$. Here, $Q$ specifies the parity-based fairness definition being used. Thus, taking $\pi$ to be the optimal solution to any of the Optimizations (U), (D), (I), or (G), we see that $\pi'$ is a feasible solution to the same optimization and has the same or a better objective value compared to $\pi$. Therefore, after optimization, objective values can only get better.

In words, we are saying that refined predictors allow us to achieve utility and impact at least as good as under the original predictor while keeping the parity values the same (e.g., while keeping the selection rates the same in both subpopulations).

We separately construct $\pi'$ for each fairness definition ($Q$) as follows:

  1. (Statistical Parity):

    For $g \in \{A, B\}$, let $\beta_g$ be the selection rate of $\pi$ in the population $g$. Let $\pi'$ be the threshold policy that uses the predictor $\tilde{p}'$ and achieves selection rates $\beta_A$ and $\beta_B$ in the subpopulations $A$ and $B$, respectively. By Lemma 6, $\mathrm{TPR}^{\tilde{p}'}_g(\beta_g) \geq \mathrm{TPR}^{\tilde{p}}_g(\beta_g)$ for $g \in \{A, B\}$. The utility of the policy $\pi'$ can be written as
    $$U(\pi') = \sum_{g \in \{A, B\}} \Pr[x \in g] \cdot c_u \cdot \big(\rho_g \cdot \mathrm{TPR}^{\tilde{p}'}_g(\beta_g) - \tau_u \cdot \beta_g\big) \;\geq\; \sum_{g \in \{A, B\}} \Pr[x \in g] \cdot c_u \cdot \big(\rho_g \cdot \mathrm{TPR}^{\tilde{p}}_g(\beta_g) - \tau_u \cdot \beta_g\big) = U(\pi).$$

    Similarly, we can show that the impact on the subpopulation $B$ under $\pi'$ is at least as good as under $\pi$.

  2. (Equalized Opportunity):

    Let $\beta_A, \beta_B$ be the selection rates of policy $\pi$ on the subpopulations $A$ and $B$. We know that $\mathrm{TPR}^{\tilde{p}'}_g(\beta_g) \geq \mathrm{TPR}^{\tilde{p}}_g(\beta_g)$ ($g \in \{A, B\}$) through Lemma 6. Let $\pi'$ be the threshold selection policy corresponding to selection rates $\beta'_A, \beta'_B$ such that $\mathrm{TPR}^{\tilde{p}'}_g(\beta'_g) = \mathrm{TPR}^{\tilde{p}}_g(\beta_g)$ ($g \in \{A, B\}$). As the true positive rates increase with increasing selection rate, $\beta'_g \leq \beta_g$. The utility of the policy $\pi'$ can be written as