From Soft Classifiers to Hard Decisions: How fair can we be?

10/03/2018 ∙ by Ran Canetti, et al. ∙ Boston University

A popular methodology for building binary decision-making classifiers in the presence of imperfect information is to first construct a non-binary "scoring" classifier that is calibrated over all protected groups, and then to post-process this score to obtain a binary decision. We study the feasibility of achieving various fairness properties by post-processing calibrated scores, and then show that deferring post-processors allow for more fairness conditions to hold on the final decision. Specifically, we show: 1. There does not exist a general way to post-process a calibrated classifier to equalize protected groups' positive or negative predictive value (PPV or NPV). For certain "nice" calibrated classifiers, either PPV or NPV can be equalized when the post-processor uses different thresholds across protected groups, though there exist distributions of calibrated scores for which the two measures cannot both be equalized. When the post-processing consists of a single global threshold across all groups, natural fairness properties, such as equalizing PPV in a nontrivial way, do not hold even for "nice" classifiers. 2. When the post-processing is allowed to `defer' on some decisions (that is, to avoid making a decision by handing off some examples to a separate process), then for the non-deferred decisions, the resulting classifier can be made to equalize PPV, NPV, false positive rate (FPR) and false negative rate (FNR) across the protected groups. This suggests a way to partially evade the impossibility results of Chouldechova and Kleinberg et al., which preclude equalizing all of these measures simultaneously. We also present different deferring strategies and show how they affect the fairness properties of the overall system. We evaluate our post-processing techniques using the COMPAS data set from 2016.


1 Introduction

The concept of fairness is deeply ingrained in our psyche as a fundamental, essential ingredient of human existence. Indeed the perception of fairness, broadly construed as accepting each other's equal right to well-being, is arguably one of the most basic tenets of cooperative societies of individuals in general, and of human existence in particular.

However, as fundamental as this concept may be, it is also extremely elusive: different societies have developed very different notions of fairness and equality among individuals, subject to different religious, ethical, and social beliefs; in particular, the intricate interplay between fairness and justice, which is yet another somewhat elusive concept, is often not well-defined and a matter of subjective interpretation.

The concept is further complicated by the fact that human decisions are often made with incomplete information and limited resources. These two factors must be taken into account when evaluating whether decision-making processes are “fair.” Indeed, these two aspects of the problem have become increasingly prominent as societies grow and decision processes become more complex and algorithmic.

One way that researchers are responding to these growing concerns is by attempting to formulate precise notions of fairness for decision processes, e.g. [dwork, Kleinberg]. While these notions do not intend to capture the complexities of the ethical, socio-economic, or religious aspects of fairness, they do consider the fairness aspects of statistical decision-making processes with incomplete information. Essentially, these notions accept the fact that a decision process with incomplete information will inevitably make errors relative to the desired full-information notion (which is treated as a given), and provide guidelines on how to "distribute the errors fairly" across individuals, or alternatively across groups of individuals. These definitions have proven to be meaningful and eye-opening; in particular, it has been demonstrated that some very natural notions of "fair distribution of errors" are mutually inconsistent: No decision mechanism with incomplete information can satisfy all, except in trivial cases [Cho17, Kleinberg].

Faced with this basic impossibility, we would like to better understand the process of decision making with incomplete information, and use this understanding to propose ways to circumvent this impossibility.

Specifically, we concentrate on the task of post-processing a calibrated soft classifier under group fairness constraints. We suppose that individuals belong to one of two or more disjoint protected groups. Our overall task is to decide whether a given individual has some hidden binary property in a way that ensures “fair balancing of errors” across the groups.

For that purpose, we consider the following two-stage mechanism. The first stage consists of constructing a soft classifier S that outputs for each individual x a score S(x) that is related to the chance that x has the property. The only requirement we make of S is group-wise calibration: for both groups, and for each score s, the fraction of individuals in the group that get score s and have the property, out of all individuals in the group that get score s, is exactly s. The second stage takes as input the output S(x) of the first stage and the group to which x belongs, and outputs a binary decision: its best guess at whether x has the property.

An attractive aspect of this two-stage mechanism is that each stage can be viewed as aimed at a different goal: The first stage is aimed at gathering information and providing the best accuracy possible, with only minimal regard to fairness (i.e., only group-wise calibration). The second stage is aimed at extracting a decision from the information collected in the first stage, while making sure that the errors are distributed "fairly."

To further focus our study, we take the first stage as a given and concentrate on the second. That is, we consider the problem of post-processing the scores given by the calibrated soft classifier into binary predictions. A representative example is a judge making a bail decision based on a score provided by a software package. Following [Cho17, EqualOpp] we consider the following four performance measures for the resulting binary classifier: the positive predictive value (PPV), namely the fraction of individuals that have the property among all individuals that the classifier predicted to have the property; the false positive rate (FPR), namely the fraction of individuals that were predicted to have the property among all individuals that don't have the property; the negative predictive value (NPV) and false negative rate (FNR), which are defined analogously. Ideally, we would like to equalize each one of the four measures across the groups, i.e. the measure will have the same value when restricted to samples from each group. Unfortunately, however, we know that this is impossible in general [Cho17, Kleinberg]. This leads us to a broad question that motivates our work:

Under what conditions can we post-process a calibrated soft classifier's outputs so that the resulting hard classifier equalizes a subset of {PPV, NPV, FPR, FNR} across a set of protected groups? How can we balance these conflicting goals?

Results: Post-Processing With Thresholds

In a first set of results we consider the properties obtained by post-processing via a "threshold" mechanism. Naively, a threshold post-processing mechanism would return 1 for individual x whenever S(x) is above some fixed threshold, and return 0 otherwise. We somewhat extend this mechanism by allowing the post-processor to "fine-tune" its decision by choosing the output probabilistically whenever the result of the soft classifier is exactly the threshold.

We show that no post-processing mechanism that uses a single threshold across all groups can guarantee equality of PPV (or NPV) across protected groups. This indicates that the task of post-processing a calibrated soft classifier to obtain a "fair" binary classifier with any fairness property is non-trivial. We then show that, using different thresholds for the different groups, one can equalize either PPV or NPV (but not both) across the two groups, assuming the distribution of scores has some non-degeneracy property.

The combination of the impossibility of single threshold and the possibility of per-group threshold also stands in contrast to the belief that a soft classifier that is calibrated across both groups allows “ignoring” group-membership information in any post-processing decision [MP17]. Indeed, the conversion to a binary decision “loses information” in different ways for the two groups, and so group membership becomes relevant again after post-processing.

Results: Adding deferrals.

For the second set of results we consider post-processing strategies that do not always output a decision. Rather, with some probability the output is ⊥, or "I don't know", which means that the decision is deferred to another (hopefully higher quality, even if more expensive) process. Let us first present our technical results and then discuss potential interpretations and context.

The first strategy is a natural extension of the per-group threshold: we use two thresholds per group, returning 1 above the right threshold, 0 below the left threshold, and ⊥ between the thresholds. We show that there always exists a way to choose the thresholds such that, conditioned on the decision not being ⊥, both the PPV and NPV are equal across groups.
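To make the two-threshold rule concrete, here is a minimal Python sketch (our illustration, not code from the paper); the function name, group labels, and threshold values are placeholders, and in practice the thresholds would be chosen, as in Section 5, so that PPV and NPV match across groups conditioned on not deferring.

def two_threshold_decide(score, group, lo, hi):
    """Return 1 above the group's upper threshold, 0 below its lower threshold,
    and None (defer) in between."""
    if score > hi[group]:
        return 1
    if score < lo[group]:
        return 0
    return None  # defer to a separate downstream process

# Hypothetical per-group thresholds for two groups "A" and "B".
lo = {"A": 0.35, "B": 0.30}
hi = {"A": 0.65, "B": 0.70}
print(two_threshold_decide(0.80, "A", lo, hi))  # 1
print(two_threshold_decide(0.50, "B", lo, hi))  # None (deferred)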

Next we show a family of post-processing strategies where, conditioned on the decision not being ⊥, all four quantities (PPV, NPV, FPR, FNR) are equal across groups.

All strategies in this family have the following structure: Given an individual x, the strategy first makes a randomized decision whether to defer on x, where the probability depends on the score S(x) and the group membership of x. If not deferred, then the decision is made via another post-processing technique.

One method for determining the probabilities of deferrals is to make sure that the distribution of scores returned by the calibrated soft classifier, conditioned on not deferring, is equal for the two groups. (That is, let p_g(s) denote the probability, restricted to group g, that an element gets score s conditioned on not deferring. Then for every score s, we choose deferral probabilities so that p_{G_1}(s) = p_{G_2}(s).) The resulting classifier can then be post-processed in any group-blind way (say, via a single threshold mechanism as described above).
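One concrete way to realize this, sketched below in Python (our own illustration, not the paper's code), keeps each score s with probability proportional to the smaller of the two groups' masses at s. Conditioned on not deferring, both groups then have the same score distribution, and each group defers a fraction equal to the total variation distance between the two DOCS, consistent with the third deferral method used in the experiments. The function name and toy PMFs are hypothetical.

def deferral_probabilities(pmf_a, pmf_b):
    """For each group, return a dict mapping each score to its deferral probability."""
    scores = set(pmf_a) | set(pmf_b)
    keep = {s: min(pmf_a.get(s, 0.0), pmf_b.get(s, 0.0)) for s in scores}
    defer_a = {s: 1.0 - keep[s] / pmf_a[s] for s in pmf_a if pmf_a[s] > 0}
    defer_b = {s: 1.0 - keep[s] / pmf_b[s] for s in pmf_b if pmf_b[s] > 0}
    return defer_a, defer_b

# Toy DOCS for two groups over the scores {0.2, 0.8}.
pmf_a = {0.2: 0.5, 0.8: 0.5}
pmf_b = {0.2: 0.3, 0.8: 0.7}
defer_a, defer_b = deferral_probabilities(pmf_a, pmf_b)
tv = 0.5 * sum(abs(pmf_a.get(s, 0.0) - pmf_b.get(s, 0.0)) for s in set(pmf_a) | set(pmf_b))
print(defer_a)          # group A defers ~40% of its 0.2-scores, none of its 0.8-scores
print(defer_b, tv)      # group B defers ~29% of its 0.8-scores; each group defers tv = 0.2 overall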

Of course, the fact that all four quantities are equalized conditioned on not deferring does not, in and of itself, provide any guarantees regarding the fairness properties of the overall decision process — which includes also the downstream decision mechanism. For one, it would be naive to simply assume that fairness “composes” [DI18]. Furthermore, the impossibility of [Cho17, Kleinberg] says that the overall decision-making process cannot possibly equalize all four measures.

However, in some cases one can provide alternative (non-statistical) justification for the fairness of the overall process: For instance, if the downstream decision process never errs, the overall process might be considered “procedurally fair.” We present more detailed reflections on our deferral-based approach in Section 4.

We note that deferring was considered in machine learning in a number of contexts, including the context of fairness-preservation [MPZ17]. In these works, the classifier typically punts only when its confidence regarding some decision is low. By contrast, we use deferrals in order to "equalize" the probability mass functions of the soft classifier over the two groups, which may involve deferring on individuals for whom there is higher confidence. Indeed, deferring on some higher-confidence individuals seems inherent to our goal of equalizing PPV, NPV, FPR, and FNR while keeping the deferral rate low. Furthermore, our framework allows for a wide range of deferral strategies which might be used to promote additional goals. Pursuing alternate strategies for deferral is an interesting direction for future work.

Experimental results.

We demonstrate the validity of our methodology on the Broward county dataset with COMPAS scores made public by ProPublica [angwin2016machine]. Indeed, it has been shown that the COMPAS scoring mechanism is an approximately calibrated soft classifier. We first ran our two-threshold post-processing mechanism and obtained a binary decision algorithm which equalizes both PPV and NPV across Caucasians and African-Americans.

We then ran our post-processing mechanism with deferrals to equalize all four of PPV, NPV, FPR, FNR across the two groups, with three different methods for deciding how to defer: In the first method, decisions are deferred only for Caucasians; in the second, decisions are deferred only for African Americans; in the third method, decisions are deferred for an equal fraction of Caucasians and of African Americans. This fraction is precisely equal to the statistical (total variation) distance between the distributions of scores produced by the soft classifier on the two groups. More details about the results along with figures are given in Section 6.

Extensions and open problems.

As just mentioned, a natural question is to find alternative ways for deciding when to defer, along with ways to argue fairness properties for the overall combined process.

We also leave open the setting where individuals belong to multiple, potentially intersecting groups as in [Multicalibration, gerrymandering].

Yet another question is to consider additional (or alternative) properties of soft classifiers that will make for more efficient or effective post-processing.

1.1 Related work

We briefly describe the works most closely related to ours, though both the list of works and their summaries are inevitably too short. Our work fits in a research program on group fairness notions following the work of Chouldechova [Cho17] and Kleinberg et al. [Kleinberg]. Those works demonstrate the inherent infeasibility of simultaneously equalizing a collection of measures of group accuracy. Our work considers the notions of calibration as formalized in [Pleiss] and those of PPV, NPV, FPR, and FNR from [Cho17] and [Kleinberg].

The power of post-processing calibrated scores into decisions using threshold classifiers in the context of fairness has been previously studied by Corbett-Davies, Pierson, Feller, Goel, and Huq [Corbett]. As in our work, they show that it is feasible to equalize certain statistical fairness notions across groups using (possibly different) thresholds. They additionally show that these thresholds are in some sense optimal. Whereas [Corbett] focuses on statistical parity, conditional statistical parity, and false positive rate, our most comparable results consider PPV. In our work, we further show that in some cases thresholds fail to equalize both PPV and NPV (called predictive parity by [Cho17]), unless we also allow our post-processor to defer on some inputs. Our work also studies methods of post-processing that are much more powerful than thresholding, especially when allowing deferrals. On the technical side, [Corbett] assumes that their soft classifiers are supported on the continuous interval [0, 1], simplifying the analyses. We instead study classifiers with finite support as it is closer to true practice in many settings (e.g., COMPAS risk scores).

Using deferrals to promote fairness has been considered also in Madras, Pitassi, and Zemel [MPZ17]. Specifically, they consider how deferring on some inputs may promote a combination of accuracy and fairness, especially when taking explicit account of the downstream decision maker. They make use of two-threshold deferring post-processors like those discussed in Section 5. While it helped inform our work, [MPZ17] takes a more experimental approach and focuses on minimizing the "disparate impact," a measure of total difference in classification error between groups, while maximizing accuracy. One important difference between our works is that Madras et al. distinguish between "rejecting" and "deferring." Rejecting is oblivious as to properties of the downstream decision maker, while deferring tries to counteract the biases of the decision maker. Our work considers only the former notion, but uses the term "defer" instead of "reject."

2 Preliminaries

We study the problem of binary classification. An instance is an element, usually denoted x, of a universe X. We restrict our attention to instances sampled uniformly at random from the universe, denoted x ~ X. Our theory extends directly to any other distribution on X; that distribution does not need to be known to the classifiers. Each instance is associated with a true type Y(x) ∈ {0, 1}. Each instance is also associated with a group g(x) ∈ G, where G is the set of groups. We restrict our attention to sets G that form a partition of the universe X. We denote by X_g the set of instances in group g, and by x ~ X_g the random variable distributed uniformly over X_g. Note that for any events A and B, Pr_{x ~ X_g}[A | B] = Pr_{x ~ X}[A | B, x ∈ X_g].

Definition 2.1 (Base rate (BR)).

The base rate of a group g ∈ G is

BR(g) = Pr_{x ~ X_g}[Y(x) = 1].   (1)

When X_g is finite, BR(g) is simply the fraction of individuals in the group for whom Y(x) = 1.

A classifier is a randomized function with domain X. (As the focus of this paper is on the post-processing of classifiers, we set aside questions such as the origin of the given classifier, including the randomness used in training, the origin or quality of the training data, and societal factors affecting the classifier.) In particular, the classifiers we consider in this work are memoryless: they do not remember inputs or random choices from previous invocations. That is, we assume that if x_1 and x_2 are two independent random variables drawn from X, then the classifier's outputs on x_1 and x_2 are also independent random variables. (Alternatively, the present formalism can be viewed as fixing the random choices made during the training phase of the classifier, and taking probabilities only over the draws from X and over the random choices made by the classifier during the scoring phase.) A hard classifier, denoted H, outputs a prediction in {0, 1}, interpreted as a guess of the true type Y(x). A soft classifier, denoted S, outputs a score S(x) ∈ [0, 1], interpreted as a measure of confidence that Y(x) = 1. We restrict our attention to soft classifiers with finite image. We call a classifier group blind if its output is independent of the group g(x) of its input. For all groups g, we call a hard classifier H non-trivial on g if Pr_{x ~ X_g}[H(x) = 1] > 0 and Pr_{x ~ X_g}[H(x) = 0] > 0. Hard classifiers are trivial on g if they are not non-trivial on g.

A post-processor is a randomized function with domain [0, 1] × G. As with classifiers, a post-processor can be hard or soft. A hard post-processor outputs a prediction in {0, 1}. A soft post-processor outputs a score in [0, 1]. Observe that for a soft classifier S and a post-processor P, the composition P(S(·), ·) is a hard classifier when P is hard, and a soft classifier when P is soft. As with classifiers, we call a post-processor group blind if its output is independent of the group g, and we restrict our attention to post-processors with finite image. The restriction to finite image is for mathematical convenience (and also because digital memory leads to discrete universes); our results generalize to infinite images as well.

In Section 4, we expand the definitions of both classifiers and post-processors to allow an additional input or output: the special symbol ⊥.

Figure 1: We call a classifier that returns results in [0, 1] a soft classifier to differentiate it from those which return results in {0, 1}, which we call hard classifiers. We refer to classifiers that take as input the output of a soft classifier as post-processors.

2.1 Calibration

Several works concerning algorithmic fairness focus on various notions of calibration. The following calibration notions are defined only over soft classifiers:

Definition 2.2 (Calibration (Soft)).

We say a soft classifier S is calibrated if for every score s ∈ [0, 1] for which Pr_{x ~ X}[S(x) = s] > 0,

Pr_{x ~ X}[Y(x) = 1 | S(x) = s] = s.

The probability above is taken over the sampling of x ~ X, as well as random choices made by S at classification time.

Definition 2.3 (Groupwise Calibration (Soft)).

We say that a soft classifier S is groupwise calibrated if it is calibrated within all groups. That is, for every group g ∈ G and every score s ∈ [0, 1] for which Pr_{x ~ X_g}[S(x) = s] > 0, we have that

Pr_{x ~ X_g}[Y(x) = 1 | S(x) = s] = s.

Groupwise calibration is essentially the same notion as multicalibration [Multicalibration], with the difference that in their case the true types are values in [0, 1]. We use a different term to emphasize that we restrict our attention to collections of groups that form a partition of the universe X.

The two definitions above are stated for soft classifiers whose output distribution is discrete, since we must be able to condition on the event S(x) = s. That said, they extend naturally to classifiers with continuously-distributed outputs provided that the conditional probabilities are well defined.

2.2 Distributions on Calibrated Scores

Throughout this work, we make repeated reference to the probability mass function of the random variable S(x) for a calibrated soft classifier S acting on a randomly distributed input x. We call this probability mass function the distribution on calibrated scores (DOCS).

Definition 2.4 (Distribution on Calibrated Scores (DOCS)).

The distribution on calibrated scores (DOCS) of a calibrated soft classifier S for a group g, denoted by D_g, is the PMF of S(x) for x ~ X_g. That is, for every s ∈ [0, 1],

D_g(s) = Pr_{x ~ X_g}[S(x) = s].

Abusing notation, we denote by D the collection {D_g}_{g ∈ G}, and call it the DOCS of S. We denote by supp(D_g) the support of the DOCS D_g, namely the set {s : D_g(s) > 0}.

A DOCS is a distribution of scores for a classifier that happens to be calibrated. Because S is calibrated, the DOCS conveys information about the performance of S, and is constrained by properties of the underlying population. For example, the DOCS' expectation is exactly the base rate for the population:

Proposition 2.1.

For any groupwise calibrated soft classifier S and for all groups g ∈ G: E_{s ~ D_g}[s] = BR(g).

Proof of Proposition 2.1.

E_{s ~ D_g}[s] = Σ_s s · D_g(s)
 = Σ_s s · Pr_{x ~ X_g}[S(x) = s]
 = Σ_s Pr_{x ~ X_g}[Y(x) = 1 | S(x) = s] · Pr_{x ~ X_g}[S(x) = s]
 = Pr_{x ~ X_g}[Y(x) = 1] = BR(g),

where the third line follows from the definition of a groupwise calibrated classifier (Definition 2.3). ∎
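As a quick numerical illustration of Proposition 2.1 (a toy example of our own, not from the paper), the mean of a calibrated DOCS is the group's base rate:

# Toy calibrated DOCS for one group: score -> probability mass.
docs_g = {0.2: 0.25, 0.5: 0.25, 0.9: 0.5}
# By calibration, Pr[Y(x) = 1 | S(x) = s] = s, so summing s * D_g(s) gives Pr[Y(x) = 1].
base_rate = sum(s * mass for s, mass in docs_g.items())
print(base_rate)  # approximately 0.625 = BR(g)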

DOCS also provide useful geometric intuition for reasoning about the effects of post-processing calibrated scores. We elaborate on this in Section 3.1 (see Figure 2).

2.3 Group Fairness Measures

Several well-studied measures of statistical "fairness" (e.g., [EqualOpp, Cho17, Kleinberg, Pleiss, Multicalibration, gerrymandering]) look at how the following key performance measures of a classifier differ across groups. The false positive rate (FPR) of a hard classifier H for a group g is the rate at which H gives a positive classification among instances with true type 0. The false negative rate (FNR) is defined analogously: the rate of negative classifications among instances with true type 1. Positive predictive value (PPV) and negative predictive value (NPV) track the rate of mistakes within instances that share a predicted type. Informally, positive predictive value captures how much meaning can be given to a predicted 1, and negative predictive value is similar for a predicted 0. We now define these statistics formally.

Definition 2.5.

Given a hard classifier H and a group g ∈ G, we define
the false positive rate of H for g: FPR_g(H) = Pr_{x ~ X_g}[H(x) = 1 | Y(x) = 0];
the false negative rate of H for g: FNR_g(H) = Pr_{x ~ X_g}[H(x) = 0 | Y(x) = 1];
the positive predictive value of H for g: PPV_g(H) = Pr_{x ~ X_g}[Y(x) = 1 | H(x) = 1];
the negative predictive value of H for g: NPV_g(H) = Pr_{x ~ X_g}[Y(x) = 0 | H(x) = 0].

The probability statements in the definitions above reflect two sources of randomness: the sampling of x from the group X_g and any random choices made by the classifier H.
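For readers who prefer code, the following short Python sketch (ours, with hypothetical inputs) computes the four measures of Definition 2.5 empirically from one group's true types and predictions; it assumes the classifier is non-trivial on the group so that all four denominators are positive.

def group_measures(y, yhat):
    """y, yhat: lists of 0/1 true types and predictions for the instances of a single group."""
    tp = sum(1 for a, b in zip(y, yhat) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(y, yhat) if a == 0 and b == 1)
    tn = sum(1 for a, b in zip(y, yhat) if a == 0 and b == 0)
    fn = sum(1 for a, b in zip(y, yhat) if a == 1 and b == 0)
    return {
        "FPR": fp / (fp + tn),   # Pr[H = 1 | Y = 0]
        "FNR": fn / (fn + tp),   # Pr[H = 0 | Y = 1]
        "PPV": tp / (tp + fp),   # Pr[Y = 1 | H = 1]
        "NPV": tn / (tn + fn),   # Pr[Y = 0 | H = 0]
    }

print(group_measures([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))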

Among previous works, some [EqualOpp, Kleinberg] focus on equalizing only one or both of the false positive rates and false negative rates across groups, called balance for the negative and positive classes, respectively. Equalizing positive and negative predictive value across groups is often combined into one condition called predictive parity [Cho17]. We split predictive parity into separate conditions for the positive and negative predictive values. Predictive parity appears to be a hard-classifier analogue of calibration: both can be interpreted as saying that the output of the classifier (hard or soft) contains all the information contained in group membership. Our results highlight that the relationship between these notions is more subtle than it first appears; see Section 3 for further discussion.

3 The Limits of Post-Processing

Suppose throughout this section that S is a groupwise calibrated soft classifier. Our goal in this section is to make binary predictions based on S(x), and possibly the group g, subject to equalizing PPV and/or NPV among groups. That is, we wish to make a prediction using a hard post-processor P such that P(S(·), ·) equalizes PPV and/or NPV among groups. We chose to concentrate first on (the limitations of) equalizing PPV and NPV rather than FPR and FNR due to the conceptual similarity of PPV and NPV to calibration. Also, the case of equalizing false positive rates with thresholds is addressed in [Corbett].

3.1 Fairness Conditions for Post-Processors

We begin by making a simple observation about post-processing that provides some geometric intuition for the rest of this section. Just as in Proposition 2.1, we can express PPV_g and NPV_g succinctly in terms of conditional expectations over the DOCS D_g.

Proposition 3.1.

Let H = P(S(·), ·) be a hard classifier, obtained by post-processing a soft classifier S that is groupwise calibrated with respect to G, such that H is non-trivial for all g ∈ G. For any g ∈ G we have:

PPV_g(H) = E_{x ~ X_g}[S(x) | H(x) = 1]   and   NPV_g(H) = 1 - E_{x ~ X_g}[S(x) | H(x) = 0].

Proof of Proposition 3.1.

We first observe that the output of a post-processor is conditionally independent of the true type, conditioned on the output of the soft classifier it is post-processing and the group membership:

Fact 3.1.

Consider any randomized function P with inputs S(x) and g(x). We have that

Pr[P(S(x), g(x)) = 1 | S(x) = s, g(x) = g, Y(x) = 1] = Pr[P(S(x), g(x)) = 1 | S(x) = s, g(x) = g],   (2)

or in other words, the output of P is conditionally independent of the true type Y(x), since fixing the inputs to P makes its output purely a function of its random string.

Now recall that PPV_g(H) and NPV_g(H) are well-defined for all groups because H is non-trivial on all groups. We then have

PPV_g(H) = Pr_{x ~ X_g}[Y(x) = 1 | H(x) = 1]
 = Σ_s Pr_{x ~ X_g}[Y(x) = 1, S(x) = s | H(x) = 1]
 = Σ_s Pr_{x ~ X_g}[Y(x) = 1 | S(x) = s, H(x) = 1] · Pr_{x ~ X_g}[S(x) = s | H(x) = 1]
 = Σ_s s · Pr_{x ~ X_g}[S(x) = s | H(x) = 1]
 = E_{x ~ X_g}[S(x) | H(x) = 1],

where the fourth line follows from the fact that the group is fixed within X_g, which lets us apply Fact 3.1, and the fact that S is calibrated on g. Similar simplifications give us that NPV_g(H) = 1 - E_{x ~ X_g}[S(x) | H(x) = 0]. ∎

Using Proposition 3.1, we can geometrically see how certain post-processing decision rules will interact with the DOCS for a group g. For example, using a threshold, the expected true positives, true negatives, false positives, and false negatives can be estimated, as shown in Figure 2.

Figure 2: Distributions on Calibrated Scores (DOCS, definition 2.4) yield useful geometric intuitions, which come from the calibration property (definition 2.2). With a threshold, the expected PPV, NPV, FPR, and FNR can be seen visually.

Proposition 3.2 below gives a characterization of the false positive and false negative rates in a manner analogous to how Proposition 3.1 describes PPV and NPV:

Proposition 3.2.

Let H and S be hard and soft classifiers as in Proposition 3.1. Then for any g ∈ G,

FPR_g(H) = Pr_{x ~ X_g}[H(x) = 1] · (1 - E_{x ~ X_g}[S(x) | H(x) = 1]) / (1 - E_{s ~ D_g}[s]),
FNR_g(H) = Pr_{x ~ X_g}[H(x) = 0] · E_{x ~ X_g}[S(x) | H(x) = 0] / E_{s ~ D_g}[s].

Assume that BR(g) > 0 and BR(g) < 1 (that is, assume 0 < BR(g) < 1) so that FPR and FNR are well-defined.

Proof of Proposition 3.2.

We give the proof for FPR, and the proof for FNR is similar. By applying Bayes' rule, we can write

FPR_g(H) = Pr_{x ~ X_g}[H(x) = 1 | Y(x) = 0] = Pr_{x ~ X_g}[Y(x) = 0 | H(x) = 1] · Pr_{x ~ X_g}[H(x) = 1] / Pr_{x ~ X_g}[Y(x) = 0].   (3)

Noting that Pr_{x ~ X_g}[Y(x) = 0 | H(x) = 1] = 1 - PPV_g(H), we can apply Proposition 3.1 and rearrange to write the RHS of Equation 3 as follows.

FPR_g(H) = Pr_{x ~ X_g}[H(x) = 1] · (1 - E_{x ~ X_g}[S(x) | H(x) = 1]) / Pr_{x ~ X_g}[Y(x) = 0].   (4)

We note that Pr_{x ~ X_g}[Y(x) = 0] = 1 - E_{s ~ D_g}[s] (Proposition 2.1). Substituting this into the RHS of Equation 4, we conclude the result. ∎

3.2 General impossibility of equalizing PPV, NPV

It is not always possible to directly post-process a soft groupwise calibrated classifier into a hard one with equalized PPV (or NPV) for all groups, as we demonstrate by counterexample in Proposition 3.3. Before proceeding, we note that our counterexample is somewhat contrived; in particular, the DOCS induced by the soft classifier in the proof of Proposition 3.3 takes only one value on each group. When the DOCS of S is more nicely structured on each group, we will see that there are general methods to equalize PPV (or NPV).

Proposition 3.3.

Fix two disjoint groups G_1 and G_2 with respective base rates BR(G_1) and BR(G_2) such that BR(G_1) ≠ BR(G_2) and BR(G_i) ≠ 0 for i = 1, 2. Then there exists a soft classifier S that is groupwise calibrated, but for which there is no post-processor P such that P(S(·), ·) equalizes PPV, unless P(S(·), ·) is trivial on G_i for i = 1 or 2.

Proof of Proposition 3.3.

Consider the classifier S such that S(x) = BR(G_1) if x ∈ G_1 and S(x) = BR(G_2) if x ∈ G_2. This classifier is trivially groupwise calibrated. Since BR(G_i) ≠ 0 for i = 1 and 2, we conclude that PPV_{G_i}(P(S(·), ·)) is well-defined for i = 1 and 2 whenever P(S(·), ·) is non-trivial on both groups. The proof now follows from the characterization of PPV in Proposition 3.1. This is because PPV_{G_i}(P(S(·), ·)) is equal to the expectation of a score drawn from a distribution with support contained in {BR(G_i)}, and hence it is equal to BR(G_i), and BR(G_1) ≠ BR(G_2). ∎

The analogous statement regarding impossibility of equalizing NPV is formulated as Proposition A.1 in Appendix A.1.

3.3 A niceness Condition for DOCS

We now give a non-degeneracy condition on DOCS motivated by the impossibility result for post-processing given by Proposition 3.3.

Definition 3.1 (Niceness of DOCS).

Let G be a set of groups. A distribution on calibrated scores D is nice if supp(D_g) is the same for all g ∈ G.

Note that this condition rules out the counterexample given by Proposition 3.3, since the DOCS in the counterexample had different (in fact, disjoint) supports for different groups. Hence, we can hope to successfully post-process soft classifiers with nice DOCS.

3.4 Equalizing PPV or NPV by Thresholding

We pay special attention to thresholds because they are simple to understand and therefore very widely used. We use one slight modification to deterministic thresholds that adds an element of randomness: if a score is at the threshold, we randomly determine which side of the threshold it falls on, according to a distribution defined below.

Definition 3.2 (Threshold Post-Processor).

A threshold post-processor T is a function of a score s and a group g, parameterized by per-group thresholds t = (t_g)_{g ∈ G} and probabilities r = (r_g)_{g ∈ G}. The threshold parameter t_g specifies the threshold for the group g, and r_g is the probability of returning 1 when the input score is on the threshold t_g. It returns the following outputs:

T(s, g) = 1 if s > t_g;  T(s, g) = 1 with probability r_g (and 0 with probability 1 - r_g) if s = t_g;  T(s, g) = 0 if s < t_g.

In the setting of an infinite number of scores and a continuous domain (i.e., scores are represented by a probability density function instead of a probability mass function), we can use purely deterministic threshold functions, in which the value of r_g is immaterial, and achieve very similar results for the rest of this section.

If both t_g and r_g do not vary across groups g, then the post-processor is the same across groups. In this case, we will call the post-processor a group blind threshold post-processor, and will overload t and r to denote the corresponding constants.
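A minimal Python sketch of Definition 3.2 follows (our illustration; the group labels and parameter values are hypothetical):

import random

def threshold_post_processor(score, group, t, r, rng=random):
    """t and r map each group to its threshold and its on-threshold probability of returning 1."""
    if score > t[group]:
        return 1
    if score < t[group]:
        return 0
    return 1 if rng.random() < r[group] else 0

# Hypothetical per-group parameters; a group blind version would use the same
# threshold and probability for every group.
t = {"A": 0.6, "B": 0.5}
r = {"A": 0.25, "B": 1.0}
print(threshold_post_processor(0.6, "A", t, r))  # 1 with probability 0.25, else 0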

We now study the effectiveness of thresholds for post-processing soft classifiers with nice DOCS. The main takeaways are:

  1. If the DOCS are nice, then threshold post-processors can equalize PPV (Propositions 3.4 and 3.6).

  2. However, group blind threshold post-processors are rather limited in their ability to equalize PPV (Proposition 3.5).

  3. Furthermore, equalizing PPV with thresholds (group blind or otherwise) may have undesirable social consequences.

  4. Thresholds cannot always equalize PPV and NPV simultaneously, even for nice DOCS (Proposition 3.7).

Results 1-3 also apply to NPV (see Proposition A.3).

3.4.1 Group Blind Thresholds

We begin by classifying which group-blind threshold post-processors can equalize PPVs across all groups (Propositions 3.4 and 3.5). By symmetry, our arguments give a similar characterization for equalizing NPVs.

Proposition 3.4.

For every nice groupwise calibrated soft classifier S and for every group-blind threshold post-processor T whose threshold t equals the largest score in supp(D_g) for all g ∈ G (which is the same for all groups, by niceness) and whose probability r is positive, the composed classifier T(S(·), ·) equalizes PPVs across all groups for which it is non-trivial.

The existence of the threshold post-processors in Proposition 3.4 follows from the assumed finiteness of the range of the soft classifier. In the case where the range of the soft classifier is infinite, such post-processors may not exist.

Proof of Proposition 3.4.

Any of the given post-processors only ever maps the largest score in the support of D_g to 1, for all groups g. Hence, by Proposition 3.1, PPV_g(T(S(·), ·)) is exactly the largest score in supp(D_g). By the assumption that D is nice, supp(D_g) is the same for all groups g, and hence the PPV is equalized across groups. ∎

We prove the analogous statement for NPV in Proposition A.2 in the Appendix. We proceed to show that the post-processors described in Proposition 3.4 are the only non-trivial, group blind post-processors that equalize PPV across groups in general, as we prove in Proposition 3.5.

Proposition 3.5.

There exists a groupwise-calibrated soft classifier with a nice DOCS for which no non-trivial group blind threshold post-processor, other than the ones in Proposition 3.4, can equalize PPV across groups.

At a high level, the proof of Proposition 3.5 works as follows: We can make the DOCS on one group uniform, and the DOCS of another group strictly increasing. Then, threshold post-processors naturally favor the latter group, as the DOCS for that group gives more weight to higher scores than lower ones when compared to the former DOCS. Our characterization of PPV (Proposition 3.1) features prominently in the proof.

In preparation for proving Proposition 3.5, we first prove the following lemma:

Lemma 3.1.

Let G_1 and G_2 be two different groups, and fix a group-blind threshold post-processor T. Let D'_1 be the expected conditional DOCS on scores that results from starting with the DOCS D_{G_1} over scores in group G_1 and conditioning on the scores that T sends to 1, and similarly let D'_2 denote the same type of conditional DOCS when starting with the DOCS D_{G_2} over scores in group G_2.
If D'_2 strictly stochastically dominates D'_1, then PPV_{G_2}(T(S(·), ·)) > PPV_{G_1}(T(S(·), ·)).

Proof.

We use the characterization of PPV given in Proposition 3.1, for the special case where the post-processor thresholds as described above. We can write the PPV for group G_1 as follows:

PPV_{G_1}(T(S(·), ·)) = E_{x ~ X_{G_1}}[S(x) | T(S(x), G_1) = 1]
 = E_{s ~ D'_1}[s],   (5)

where the second line follows from the definition of D'_1.

Similarly, we have that

PPV_{G_2}(T(S(·), ·)) = E_{s ~ D'_2}[s].   (6)

Since D'_2 strictly stochastically dominates D'_1, the expectation on the RHS of Equation 6 is larger than the expectation on the RHS of Equation 5, yielding the result. ∎

Proof of Proposition 3.5.

Fix two groups G_1 and G_2 and a finite set of points V ⊂ (0, 1) such that the PMFs of the soft classifier on G_1 and G_2 have support equal to V, that is, supp(D_{G_1}) = supp(D_{G_2}) = V. For concreteness, write the points of V as v_1 < v_2 < ... < v_n.

Let the PMFs of the soft classifier on these two groups respectively be given by D_{G_1}(v_i) = 1/n and D_{G_2}(v_i) = c · v_i for all i, where c is a normalizing constant chosen so that D_{G_2} sums to 1. Fix a group blind threshold post-processor T that is not one of the ones mentioned in Proposition 3.4. Since T is group blind, its threshold is a constant which we name t, and its probability of returning 1 at the threshold is a constant which we name r.

Let D'_1 be the expected conditional DOCS on scores that results from starting with the DOCS D_{G_1} over scores in group G_1 and conditioning on the scores that T sends to 1. We can get this conditional PMF by removing scores below t, multiplying the mass at t by r, and re-normalizing the remaining values to get a distribution. Let D'_2 be defined similarly.

We claim that D'_2 strictly stochastically dominates D'_1, which allows us to invoke Lemma 3.1 to conclude that the PPVs on the two groups are unequal. We now show that D'_2 strictly stochastically dominates D'_1. This is clearly true by design if r = 0 or r = 1: in this case, the post-processor is simply a deterministic threshold function, and we know by design that D_{G_1} is uniform while D_{G_2} is strictly increasing. If r = p for some p ∈ (0, 1), then we can write D'_1 as a convex combination of the conditional distributions obtained when r = 1 and when r = 0 (with some weight on the distribution where r = 1, and the remaining weight on the distribution where r = 0). We can write D'_2 as the same convex combination of the conditional distributions over D_{G_2} where r = 1 and r = 0. Since we already established stochastic domination for the cases where r = 0 and r = 1, this establishes stochastic domination for the case where r = p. ∎

We achieve the same result for NPV in Proposition A.3. In the setting where the range of the soft classifier is infinite and continuous, we show in Proposition A.5 that a similar negative result holds, but without the existence of the classifiers in Proposition 3.4.

Propositions 3.4 and 3.5 demonstrate the limitations of group blind thresholds on calibrated scores. Though this method of post-processing has social appeal, it does not actually preserve the fairness properties that one would expect. In the next section we repeat our analysis but relax our group blindness requirement.

3.4.2 Group-Aware Thresholds

If we allow the different groups to have different thresholds, then we grant ourselves more degrees of freedom to be able to satisfy binary fairness constraints. In particular, we can equalize PPV across groups in a more meaningful way than done in Proposition 3.4.

Recall that the group blind threshold post-processors in Proposition 3.4 are the only group blind threshold post-processors that work on certain nice DOCS (shown in Proposition 3.5). However, these post-processors have the property that the only score they map to 1 is the largest score in the support, which can be undesirable for many applications.

In particular, the classifiers from Proposition 3.4 (the only ones that succeed in the setting of Proposition 3.5) make the PPV on each group equal to the maximum score in the support of D. However, the (not-necessarily-group-blind) threshold post-processors in Proposition 3.6 below can make the PPV on each group equal to any fixed value between the maximum base rate among the groups and the maximum score in supp(D).

Proposition 3.6.

Let G be a set of groups. For any soft classifier S with a nice DOCS such that S is groupwise calibrated over G and BR(g) is strictly smaller than the maximum score in supp(D_g) for all g ∈ G, there exists a non group blind, non-trivial threshold post-processor T that is not one of the ones from Proposition 3.4 such that the hard classifier T(S(·), ·) equalizes PPV across G.

This holds even if we require that the PPV of all the groups is equal to an arbitrary value p ∈ (BR_max, v_max], where BR_max is the maximum base rate among the groups and v_max is the maximum score in the support of D. (For the case where the support of D is infinite, v_max should be the supremum of scores.)

Moreover, since this post-processor is not group blind, it is not one of the post-processors described in Proposition 3.4.

In preparation for proving Proposition 3.6, we first prove the following claim:

Claim 3.1 (Monotonicity of PPV and NPV).

Fix a soft classifier S and corresponding DOCS D, as well as a group g. Fix group blind threshold post-processors T = (t, r) and T' = (t', r') such that either t < t', or t = t' and r > r'. Then:
(a) PPV_g(T(S(·), ·)) ≤ PPV_g(T'(S(·), ·))
(b) NPV_g(T(S(·), ·)) ≥ NPV_g(T'(S(·), ·))

Proof.

We show conclusion (a); conclusion (b) is shown analogously. Define D' to be the conditional PMF on scores that results from starting with the DOCS D_g over scores in group g and conditioning on the scores that T sends to 1, and let D'' be defined similarly (but for the threshold post-processor T').

We claim that D'' stochastically dominates D', which yields the desired result by the characterization of PPV given in Proposition 3.1 (and more explicitly written in Equations 5 and 6). ∎

Proof of Proposition 3.6.

Fix a soft classifier S with a nice DOCS that is group-wise calibrated over G, and fix a desired value p ∈ (BR_max, v_max]. We will show that we can design a threshold post-processor T such that PPV_g(T(S(·), ·)) = p for all groups g ∈ G.

Fix an arbitrary group g ∈ G. We proceed via a continuity argument to show that we can tune the threshold on g to achieve PPV equal to p. The maximum possible value for PPV_g is v_max (achieved when the threshold t_g equals v_max and r_g > 0, by Claim 3.1), where v_max is the largest score in the support, as defined in the proposition statement. (We ignore the trivial post-processor that never maps anything to 1, and hence leaves the PPV undefined.)

Furthermore, note that, for any group, a lower bound on the PPV of a hard classifier on that group is the base rate of the group, where the lower bound is matched by the trivial post-processor that sends every score to 1. This follows from Claim 3.1.

We now claim that there is a setting of t_g and r_g that achieves PPV_g = p. We accomplish this by showing that there is a way to change (t_g, r_g) such that the PPV decreases continuously. We first show:

Claim 3.2 (Continuity of PPV).

Fix a soft classifier S and corresponding DOCS D, as well as a group g. Suppose we have two post-processing algorithms, P and P'. Let D' be the expected conditional DOCS that results from starting with the DOCS D_g over scores in group g and conditioning on the scores that P sends to 1, and define D'' similarly for P'. If the total variation distance between D' and D'' is at most ε, then |PPV_g(P(S(·), ·)) - PPV_g(P'(S(·), ·))| ≤ ε. Or in words, if the distance between the conditional DOCS is small, then the difference in PPV is small.

Proof.

Recall the characterization of PPV given in Proposition 3.1 (and more explicitly written in Equation 5). This tells us that the PPV of group g for the classifier P(S(·), ·) is exactly the expectation of a random variable distributed according to D'. Similarly, the PPV of group g for the classifier P'(S(·), ·) is the expectation of a r.v. distributed according to D''. Since both D' and D'' have support bounded between 0 and 1, their expectations can differ by at most ε, from which the claim follows. For completeness, we prove this below.

Suppose wlog that D' has the larger expectation. Then:

PPV_g(P(S(·), ·)) - PPV_g(P'(S(·), ·)) = E_{s ~ D'}[s] - E_{s ~ D''}[s]
 = Σ_s s · (D'(s) - D''(s))
 ≤ Σ_{s : D'(s) > D''(s)} (D'(s) - D''(s))
 ≤ ε,

where in the second line we use the fact that the PPV is the expectation of the corresponding conditional DOCS, and in the last line we use the fact that s ≤ 1 and that the TV-distance between the two distributions is at most ε. ∎

Now, consider the following way to change T. Fix δ > 0, and an initial setting for (t_g, r_g) such that t_g is not the smallest item in the support or r_g < 1. Reduce the probability 1 - r_g of returning 0 at the threshold by δ, wrapping it around on the interval [0, 1] and decreasing t_g to the next largest item in the support below it when this would otherwise make 1 - r_g negative. (Note that the setting (t_g, r_g = 1) and the setting with threshold at the next support item below t_g and r_g = 0 describe the same post-processor, so the deformation passes through the wrap without a jump.)

This very minor transformation to the threshold changes the DOCS conditional on outputting 1 very slightly: so slightly that the TV distance between the old conditional DOCS and the new conditional DOCS is at most some ε which is a function of δ. This lets us apply Claim 3.2 to show that the PPV changes by at most a function of δ. So as we take δ towards 0, this shows that the PPV changes by an amount going towards 0. This establishes that the PPV changes "continuously" with respect to this deforming procedure.

By Claim 3.1, we have that the above deforming procedure can only decrease the PPV. Therefore, we can continuously decrease the PPV, starting from v_max, by continuously deforming the threshold post-processor with the method above. Note that BR(g) < p ≤ v_max. By the Intermediate Value Theorem, there must be a setting of (t_g, r_g) such that PPV_g(T(S(·), ·)) = p. ∎
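The continuity argument above also suggests a simple computational procedure, sketched below in Python (our illustration, not the paper's code): walk the threshold down through the finite support of a group's DOCS and, at the first support point where fully including it would overshoot the target, solve for the on-threshold probability exactly. The function name and the toy DOCS are hypothetical, and the sketch assumes BR(g) < p ≤ v_max.

def tune_threshold_for_ppv(docs, p):
    """docs: dict score -> mass (a calibrated DOCS for one group); p: target PPV.
    Returns (t, r) so that the threshold post-processor with threshold t and
    on-threshold probability r has PPV exactly p on this group."""
    mass_above, weighted_above = 0.0, 0.0  # mass and score-weighted mass strictly above t
    for t in sorted(docs, reverse=True):
        # With threshold t and on-threshold probability r, Proposition 3.1 gives
        #   PPV = (weighted_above + r * t * docs[t]) / (mass_above + r * docs[t]),
        # which decreases as r grows (Claim 3.1).
        ppv_at_full = (weighted_above + t * docs[t]) / (mass_above + docs[t])  # r = 1
        if ppv_at_full <= p:
            if t == p:  # corner case: the target equals this support point itself
                return t, 1.0
            return t, (p * mass_above - weighted_above) / (docs[t] * (t - p))
        mass_above += docs[t]
        weighted_above += t * docs[t]
    return None  # p is at or below the base rate: not reachable this way

docs_g = {0.2: 0.25, 0.5: 0.25, 0.9: 0.5}  # toy calibrated DOCS with BR(g) = 0.625
print(tune_threshold_for_ppv(docs_g, 0.8))  # (0.5, r) with r chosen so the PPV is exactly 0.8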

We assert the analogous statement for the case of NPV in Claim A.1. The corresponding statement for the case of soft classifiers with infinite range is asserted in Proposition A.6.

3.4.3 The Limitations of Thresholding

While Proposition 3.6 shows that a threshold post-processor can equalize the PPV across groups, this threshold post-processor can be unsatisfying from a social justice standpoint. Consider an example with two groups G_1 and G_2, where group G_2 is "privileged" by having a higher base rate of, say, credit worthiness. Suppose that we have a DOCS that is decreasing with respect to score on group G_1, and increasing with respect to score on group G_2. This is illustrated in Example 3.1 and Figure 3. This means that a group blind threshold post-processor yields larger PPV on G_2, since large scores are given more weight in D_{G_2}. So, to equalize the PPV between the two groups, we will classify more low scores as positive in G_2 than in G_1. This effectively means that our threshold on group G_2 is more lenient than our threshold on G_1, which seems blatantly unfair, since G_2 is the privileged group in the first place!

Example 3.1 (Socially Unsatisfying Example).

Fix groups G_1 and G_2, and fix the DOCS of the soft classifier as follows. Let the common support be a finite set V ⊂ (0, 1), let D_{G_1} be decreasing over V, and let D_{G_2}(v) = c · v for an appropriately selected normalizing constant c, so that D_{G_2} is increasing over V. Group G_2 has a higher base rate and may have social advantages over group G_1.

Let T be a non-trivial threshold post-processor. If T were group blind, then by Lemma 3.1, since the conditional DOCS of G_2 stochastically dominates that of G_1, the PPV on G_2 must be larger than the PPV on G_1.