Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions

A recent flurry of research activity has attempted to quantitatively define "fairness" for decisions based on statistical and machine learning (ML) predictions. The rapid growth of this new field has led to wildly inconsistent terminology and notation, presenting a serious challenge for cataloguing and comparing definitions. This paper attempts to bring much-needed order. First, we explicate the various choices and assumptions made---often implicitly---to justify the use of prediction-based decisions. Next, we show how such choices and assumptions can raise concerns about fairness and we present a notationally consistent catalogue of fairness definitions from the ML literature. In doing so, we offer a concise reference for thinking through the choices, assumptions, and fairness considerations of prediction-based decision systems.


1 Introduction

Prediction-based decision-making has swept through industry and is quickly making its way into government. These techniques are already common in lending [53, 82, 44], hiring [88, 89, 57], and online advertising [119], and increasingly figure into decisions regarding pretrial detention [6, 78, 38, 31], immigration detention [76], child maltreatment screening [122, 37, 18], public health [97, 109], and welfare eligibility [37]. Across these domains, decisions are based on predictions of an outcome deemed relevant to the decision.

We outline the choices and assumptions commonly made (often implicitly) to justify such a decision system. (Much of our discussion applies to non-prediction-based decisions, e.g. decisions made by humans, but we will not draw any explicit comparisons.) These choices and assumptions affect how a system behaves, with potentially serious implications for disadvantaged groups along social axes such as race, gender, and class. This has motivated the field of research we will call "Fairness in Machine Learning (ML)".

Fairness in ML has been explored in popular books [101, 37] and a White House report [99], surveyed in technical review papers [13, 41, 19], tutorials [9, 94, 92, 23], and an in-progress textbook [10], and inspired a number of software packages [129, 39, 45].

Though the ML fairness conversation is somewhat new, it resembles older work. For example, since the 1960s, psychometricians have studied the fairness of educational tests based on their ability to predict performance (at school or work) [20, 21, 121, 28, 36, 106, 60, 79, 61]. More recently, Dorans and Cook review broader notions of fairness, including in test design and administration [33].

Our hope with this paper is to contribute a concise, cautious, and reasonably comprehensive catalogue of choices, assumptions, and fairness considerations. While we regard no notion as the axiomatic definition of fairness (nor the guarantor of any particular social goal), our catalogue enables practitioners to systematically consider these concepts.

The paper proceeds as follows: in Section 2, we introduce setup and notation. Section 3 outlines common choices and assumptions made to justify prediction-based decision systems. Concern about these can motivate various flavors of fairness constraints, which we present in Section 4. Though the literature often collapses the two, we separate the flavors of fairness (Section 4) from the groups to which they can be applied (Section 5). Section 6 presents tensions among fairness definitions. We highlight some ways forward in Section 7. Section 8 concludes.

2 Setup and notation

Consider a population about whom we want to make decisions. We index people by $i = 1, \dots, n$. Each person has variables (i.e. features, covariates, or inputs) $V_i$ that are known at decision time. In some cases, we can separate these into sensitive variable(s) $A_i$ (e.g. race, gender, or class) and other variables $X_i$, writing $V_i = (A_i, X_i)$. A binary decision $D_i \in \{0, 1\}$ is made for each person. We use bold font to denote a vector of values for everyone in the population, e.g. $\mathbf{D} = (D_1, \dots, D_n)$. We restrict decisions to be functions of variables known at decision time, $D_i = d(V_i)$, where $d : \mathcal{V} \to \{0, 1\}$. In prediction-based decisions, each person has a binary outcome (i.e. label or target) $Y_i \in \{0, 1\}$ that is unknown at decision time. We define random variables $V$, $A$, $X$, $D$, $Y$ as the values of a person randomly drawn from the population. (We can also imagine the values as randomly generated from some "superpopulation" [48].)

Prediction-based decisions can be made by first estimating the conditional probability

$p(v) = P(Y = 1 \mid V = v).$

The decision system does not know the true conditional probability; instead it can compute a score $s(v)$ that is intended to estimate $p(v)$. (We assume scores have been calibrated to be interpreted as probabilities, though this is non-trivial in practice [95]. More generally, decisions can be based on scores from methods such as SVM and random forest that do not estimate conditional probabilities [54].) Let $S = s(V)$ be a random score from the population. Both $s$ and $d$ are functions of training data, a sample of $(V_i, Y_i)$ pairs from the population of interest.
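To make the notation concrete, here is a minimal sketch in Python, using a synthetic population and hypothetical variable names, of the objects defined above: a sensitive variable $A$, other variables $X$, an outcome $Y$, a score standing in for an estimate of $p(v)$, and threshold decisions. It is an illustration only, not a recommended pipeline.

```python
import numpy as np

# Minimal sketch of the Section 2 setup (synthetic data, hypothetical names).
rng = np.random.default_rng(0)
n = 10_000

A = rng.integers(0, 2, size=n)          # sensitive variable A_i (two groups)
X = rng.normal(loc=A * 0.5, size=n)     # other variables X_i, here one-dimensional
p = 1 / (1 + np.exp(-(X - 0.25)))       # true conditional probability p(v) = P(Y=1 | V=v)
Y = rng.binomial(1, p)                  # binary outcome Y_i, unknown at decision time

# A score s(v) intended to estimate p(v); here we pretend it is a noisy estimate.
S = np.clip(p + rng.normal(scale=0.05, size=n), 0, 1)

t = 0.5                                 # decision threshold
D = (S > t).astype(int)                 # decisions D_i = 1[s(V_i) > t]

print("decision rate:", D.mean(), "base rate:", Y.mean())
```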

3 Choices and assumptions made to justify prediction-based decision systems

Several recent papers have raised concerns about prediction-based decision systems, describing “biases” or “traps” beyond the focus of most quantitative literature [49, 50, 96, 32, 115]. It may be difficult for a practitioner to see where these important points connect to their work. To bridge the gap, we link these socio-technical concerns to quantitative choices and assumptions.

Below are some common choices and assumptions made (often implicitly) to justify a prediction-based decision system. We are motivated by a few examples prevalent in the literature: pretrial detention, child maltreatment screening, and lending. We question some of the most salient choices and assumptions on both scientific and political grounds. (These choices might not be made in any specific, sequential order. Each can influence the others, e.g. Green et al. [49, 50] note that 3.4 can influence 3.1 insofar as measuring crime puts emphasis on crime prevention as opposed to other indicators of social well-being. See Section 7.2 for a sketch of how to address some of these issues.)

3.1 Choose a goal

Choose a goal that the decisions are meant to achieve. For a benevolent social planner, this may be justice or some notion of social welfare [58]. For a bank, this may be profits. Often there are genuinely different and conflicting goals, which are not resolved by more data [37, 102]. Furthermore, make the (substantial) assumption that progress towards this goal can be summarized by a number (denote this by $U$).

3.2 Choose a population

Choose to make decisions at the individual level (as opposed to a policy that acts at the community level) about members of a particular population. In pretrial decisions, the population is defendants. In lending decisions, the population is loan applicants. These populations are sampled from the world by some mechanism, e.g. people become defendants by being arrested, applicants by applying for loans. The mechanism of entry into this population may reflect unjust social structures, e.g. racist policing that targets black people for arrest [2, 84] and discrimination in loan preapplication screening [25].

3.3 Choose a decision space

Assume decisions are made at the individual level and are binary. In this simplified setup, pretrial decisions are usually restricted to detain or not, while lending decisions to lend or not. Both these decision spaces leave out many less harmful and more supportive interventions. In pretrial decisions, the space might instead include offering funds to help defendants with transportation to court, paying for childcare, or locating job opportunities. In lending, a broader decision space could include providing longer-term, lower-interest loans.

3.4 Choose an outcome

Assume the utility of decisions depends in part on a binary outcome that can be measured at the individual level. Suppose decisions about a family intervention program are based on maltreatment of children. Let $Y_i = 1$ if there is maltreatment, and $Y_i = 0$ otherwise. Consider the following scenario: suppose in Family A there is maltreatment, with or without this intervention program. Suppose in Family B there is maltreatment without the intervention, but the intervention helps. Then it is rational to enroll Family B in this program, but Family A may need an alternative program. This example considers the potential outcomes under both decisions [114, 62, 56]: $Y_i(0)$ without the intervention, $Y_i(1)$ under the intervention. The benefit or harm from a decision can depend on both potential outcomes.

More generally, let $Y_i(\mathbf{d})$ be the potential outcome under the whole decision system (the bold $\mathbf{d}$ denotes that the decision for person $j$ may affect the outcome for person $i$, where $j \neq i$). Assume now that the utility of decisions can be expressed as a function of these and no other outcomes. In mathematical notation, the utility under decisions $\mathbf{d}$ is a function of the potential outcomes for all people under all possible decision policies: $U(\mathbf{d}) = f\big(\mathbf{d}, \{\mathbf{Y}(\mathbf{d}')\}_{\mathbf{d}'}\big)$. The bold font denotes vectors of outcomes and decisions for the whole population. In the pretrial detention example (with $d_i = 1$ denoting detention): $\mathbf{Y}(\mathbf{0})$ are the outcomes for everyone if everyone is released pretrial, $\mathbf{Y}(\mathbf{1})$ are the outcomes for everyone if everyone is detained pretrial, with all other possible decision regimes in between.

Prediction-based decision systems often only consider one or two outcomes, e.g. in pretrial decisions, outcomes of interest are crime (measured as arrest, more on this later) and non-appearance in court. In contrast, human decision-makers may consider several outcomes, including impacts on a defendant’s well-being or caretaker status [15]. Narrowing focus to a chosen, measured outcome can fall short of larger goals. For example, in making decisions about college admissions, it may be tempting to narrow the larger goals of higher education to simply the future GPAs of admitted students [73].

3.5 Assume decisions can be evaluated separately, symmetrically, and simultaneously

Assume decisions can be evaluated as the sum of separately evaluated individual decisions. This eliminates consideration of aggregate decisions (e.g. the number of detainees, the number of loans, etc.) within groups. This assumption resembles beliefs from utilitarianism, which represents social welfare as a sum of individual utilities, allowing for the uneven distribution of benefits to different individuals (and different groups) as long as overall utility is maximized [112, 116]. The "separately" assumption includes assuming that outcomes are not affected by the decisions for others, an assumption known as no interference [62]: $Y_i(\mathbf{d}) = Y_i(d_i)$.

Assuming decisions can be evaluated symmetrically (i.e. identically) requires, for example, that the harm of denying a loan to someone who could repay is equal across people. In reality, this error could be especially harmful to lower-income applicants.

Finally, assuming decisions can be made simultaneously (“batch classification” [19]), as opposed to serially, eliminates consideration of dynamics that could be important. For example, Harcourt shows that if the population changes their behavior in response to changing decision probabilities, prediction-based decisions can backfire and diminish social welfare [52]. While it may be possible to make predictions robust to strategic behavior, this may also have social costs, especially for disadvantaged groups [59, 90]. Alternatively, dynamics can be considered in a way that is socially beneficial [57, 75].

Mathematically, we write the separate, symmetric, and simultaneous (sss) utility as:

$U_{sss}(\mathbf{d}) = \sum_{i=1}^{n} u\big(d_i, Y_i(0), Y_i(1)\big).$

3.6 Assume away one potential outcome

Prediction-based decision systems usually only focus on one potential outcome (e.g. crime if released) and assume the other is known (e.g. no crime if detained). But studies have found criminogenic effects of imprisonment [125, 30, 27, 91]. Even if pretrial tools only predict events during the possible detention period (as some recommend [42]), it should not be assumed that detention prevents crime during that period. A person may be driven to break the law while in jail.

In child maltreatment screening, prediction is aimed at maltreatment in the absence of intervention, implicitly assuming intervention is helpful. Interventions may not be helpful, though, and impacts of interventions may vary by group. There is evidence of racial disparity in foster care placement [110] and access to mental health services [46] among youth reported to the child welfare system.

3.7 Choose the prediction setup

For notational simplicity, let $Y_i$ denote the potential outcome of interest (e.g. $Y_i(0)$, crime if released). Then $(D_i, Y_i)$ takes 4 possible values. Without loss of generality, the utility is positive (good) when $D_i = Y_i$ and negative (bad) when $D_i \neq Y_i$. Adopting confusion matrix terminology, express utility in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN):

$U_{sss}(\mathbf{d}) = u_{TP}\,\mathrm{TP} + u_{FP}\,\mathrm{FP} + u_{FN}\,\mathrm{FN} + u_{TN}\,\mathrm{TN},$

where $u_{TP}, u_{FP}, u_{FN}, u_{TN}$ are scalars.

Assuming $\mathbf{Y}$ is fixed, choose $\mathbf{d}$ to maximize this. Rearranging terms, dropping those without $d_i$, and renormalizing (dividing by a positive number), the quantity to maximize can be written as:

$U_t(\mathbf{d}) = \frac{1}{n} \sum_i d_i (Y_i - t),$

with only one parameter, $t = \frac{u_{TN} - u_{FP}}{u_{TP} - u_{FN} + u_{TN} - u_{FP}}$. Corbett-Davies et al. call this immediate utility [24]. Lipton et al. [80] note that adding a term that does not depend on $\mathbf{d}$ and negating gives the cost-sensitive risk that Menon and Williamson aim to minimize [86]:

$CS_t(\mathbf{d}) = \frac{1}{n} \sum_i \big[\, t\, d_i (1 - Y_i) + (1 - t)(1 - d_i)\, Y_i \,\big].$

Giving false positives and false negatives equal cost, $t = 1/2$. Thus, minimizing $CS_{1/2}$, maximizing $U_{1/2}$, and maximizing accuracy ($\frac{1}{n}\sum_i \mathbb{1}[d_i = Y_i]$) are all equivalent.

$U_t$ is maximized with perfect prediction, that is $d_i = Y_i$ for all $i$. But only the variables $V_i$ are known at decision time. So decisions are restricted to be functions of these variables, $D_i = d(V_i)$, where $d : \mathcal{V} \to \{0, 1\}$. It has been shown [70, 12, 24] that the expected utility $E[U_t(\mathbf{D})]$ is maximized at

$d(v) = \mathbb{1}\big[\, p(v) > t \,\big],$

so decisions are made by thresholding (at $t$) the conditional probabilities of the outcome, given covariates [24]. This can be called a single-threshold rule [23].
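As a small illustration of the derivation above, the following sketch (with assumed, purely illustrative utility values) computes the implied threshold $t$ from the four scalars and applies the single-threshold rule to hypothetical conditional probabilities.

```python
import numpy as np

# Sketch: the threshold t implied by four (hypothetical) utilities u_TP, u_FP, u_FN, u_TN,
# and the resulting single-threshold rule d(v) = 1[p(v) > t].
u_TP, u_FP, u_FN, u_TN = 1.0, -2.0, -1.0, 0.5    # assumed, illustrative values

t = (u_TN - u_FP) / (u_TP - u_FN + u_TN - u_FP)  # renormalized cost of a positive decision
print("threshold t =", t)                        # 2.5 / 4.5, about 0.56

def decide(p_v, t):
    """Single-threshold rule: act (d=1) when the conditional probability exceeds t."""
    return (np.asarray(p_v) > t).astype(int)

print(decide([0.2, 0.5, 0.7, 0.9], t))
```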

In implementing such a rule, the decision maker chooses variables $V$ and computes a score $s(v)$ that is intended to estimate $p(v) = P(Y = 1 \mid V = v)$. Below we first discuss variable selection. We then describe four reasons the estimate $s(v)$ differs from the estimand $p(v)$: non-representative training data, measurement error, an incorrect model, and statistical uncertainty.

3.7.1 Variable selection

The variables $V$ are chosen by a process commonly called variable or feature selection. Conditional probabilities change depending on what we condition on. ("Conditioning is the soul of statistics" [14].) Person A can have a higher conditional probability than Person B with one choice of variables $V$, but a lower conditional probability with another choice of variables $V'$. This can be obscured by describing these probabilities as the "true risk" (of crime or loan default), which we explore further in Section 7.1.

3.7.2 Sampling

For some subset of the population of interest, $V_i$ and $Y_i$ are measured, creating the training dataset. The training data may not be representative of the population for several reasons. For example, in pretrial decisions, if training data only include $Y_i$ (crime if released), then they exclude people who were detained [23]. Defendants released in the past may not be representative of current defendants. Incorrectly assuming the data are representative can lead to biased estimation of conditional probabilities (e.g. probability of crime given input variables).

Under a missing at random assumption (whether $Y$ is observed is independent of $Y$, given $V$), modeling could hope to avoid this selection bias [48]. But there can be regions of the input variables where no defendants are released, and hence $Y$ is unobserved [19]. One option is extrapolation: fit the model to released defendants, then use the model for all defendants, even if this includes new regions of the input variables. However, ML models can perform poorly with such a shift in input variables [111]. Another option is to trust previous decisions: assume that regions of the input variables where no defendants were released are regions where all defendants would have committed crimes if released ($Y_i = 1$) [29].
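The selection issue can be illustrated with a toy simulation. The data-generating process, the past release rule, and the model below are all assumptions made for illustration: a model fit only to released defendants is extrapolated into a region of the input variables with no training data, and its estimates there can be far from the true conditional probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration (assumed data-generating process) of the extrapolation option:
# fit on released defendants only, then apply the model in regions of X with no releases.
rng = np.random.default_rng(1)
n = 50_000
X = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 * X + 1.5 * X**2 - 1)))   # nonlinear true P(Y=1 | X) if released
Y = rng.binomial(1, p_true)                               # observed only when released

# Past decisions: defendants with X > 1 were never released.
released = X < 1

model = LogisticRegression().fit(X[released].reshape(-1, 1), Y[released])
p_hat = model.predict_proba(X.reshape(-1, 1))[:, 1]

unseen = X > 1.5                                          # region with no training data
print("true mean P(Y=1 | X) where X > 1.5:", round(float(p_true[unseen].mean()), 3))
print("extrapolated estimate:             ", round(float(p_hat[unseen].mean()), 3))
```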

3.7.3 Measurement

Similarly compromising are issues of measurement. In the pretrial detention setting, the outcome is often defined as crime, but measured as arrest. In doing so, we would need to take seriously not only the difference between arrests and crime, but also the fact that this difference may be greater for some groups (known as differential measurement error [123]). For example, black people are arrested more often for crimes they did not commit [2, 83]. (The pretrial risk assessment literature inadequately addresses error in using arrests as a measure of crime. A full analysis is beyond the scope of this paper, but we note that the evidence cited is far from conclusive. Skeem and Lowenkamp write "official records of arrest — particularly for violent offenses — are a valid criterion" [118]. But they cite Piquero, who writes "self-reported offending data reveal a much more similar set of estimates regarding offending across race" [107]. They also cite Piquero and Brame, who found little racial difference in either self-report or arrest record data in a sample of "serious adolescent offenders" in Philadelphia and Phoenix [108]. These findings say little about cases with racial differences in arrests.)

What about measurement error of the observed variables $V$? If $V$ is mismeasured, our ability to predict $Y$ could be diminished. However, mismeasurement of $V$ is somewhat less serious than mismeasurement of $Y$, because prediction models are designed to use $V$ only to the extent that it predicts $Y$.

If $V$ is mismeasured differentially across groups, this could lead to different prediction error rates across those groups. For example, accessing drug treatment (a component of $V$) is used to predict child maltreatment ($Y$). If it is measured as accessing public treatment, its measurement will differ between poorer families and wealthier families, who may instead access private (i.e. unobserved) treatment [37].

A model that is aware of group membership may be able to account for this, but if the relationship between $V$ and $Y$ is weaker for a particular group, prediction error rates for that group will be higher.
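A toy simulation (all quantities assumed for illustration) can make the point about differential measurement of an input variable: when the predictor is recorded with error in only one group, a single model fit to everyone produces a higher prediction error rate for that group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration (assumed model): the predictor X is observed with error only in group 1
# (e.g. only public treatment is recorded), weakening the X-Y relationship for that group.
rng = np.random.default_rng(5)
n = 50_000
A = rng.integers(0, 2, n)
X_true = rng.normal(size=n)
Y = rng.binomial(1, 1 / (1 + np.exp(-2 * X_true)))
X_obs = np.where(A == 1, X_true * (rng.random(n) < 0.5), X_true)  # group 1: half the signal is missed

model = LogisticRegression().fit(np.c_[X_obs], Y)
D = (model.predict_proba(np.c_[X_obs])[:, 1] > 0.5).astype(int)

for g in (0, 1):
    err = np.mean(D[A == g] != Y[A == g])
    print(f"group {g}: error rate = {err:.3f}")
```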

3.7.4 Model selection

We have a choice of how to model $Y$ as a function of $V$, that is, model selection. (Methods that do not first estimate conditional probabilities, e.g. SVM and random forest, also involve several researcher degrees of freedom.)

Estimation of $p(v)$ can be highly model-dependent, particularly if $V$ is high-dimensional. Though existing tools may simplify $V$ to only consider low-dimensional representations (e.g. number of past arrests), the observable variables may be richer (e.g. times and locations of past arrests). Therefore model choice cannot be easily dismissed, even in situations of apparently low dimension.

3.7.5 Statistical uncertainty

In addition to bias, there is uncertainty (variance) in estimating $p(v)$. If the model has group-specific parameters, uncertainty may be larger for groups with less training data.

4 Flavors of fairness definitions

Under the choices and assumptions of Section 3, we saw that a single-threshold rule is considered optimal [24, 23, 73, 80]. If any of the choices and assumptions are not justified, then optimality is not guaranteed.

An even stronger claim about the single-threshold rule is that it “maximizes a natural notion of social welfare for all groups” [23]. Here we explore this notion of fairness (with the explicit assumption that scores are well-estimated):

  • Single-threshold fairness: $D_i = \mathbb{1}\big[\, s(V_i) > t \,\big]$ for all $i$, where the threshold $t$ is the same for everyone.

To see how this plays out at the group level, assume for simplicity two groups: $A = a$ (advantaged) and $A = b$ (disadvantaged). We can write the overall utility of decisions as a weighted average of the utilities from the decisions in each group:

$U_t(\mathbf{d}) = \frac{n_a}{n} U_t^{a}(\mathbf{d}) + \frac{n_b}{n} U_t^{b}(\mathbf{d}),$

where $n_a$ and $n_b$ are the group sizes and $U_t^{a}$, $U_t^{b}$ are the immediate utilities computed within each group.

Under the assumptions, a single-threshold rule maximizes both $U_t^{a}$ and $U_t^{b}$. It may be tempting to conclude that this is fair to both groups. However, the impact of decisions may not be contained within groups. Furthermore, the disadvantaged group may have a lower maximum.

For example, in lending, a possible goal for applicants is continued access to credit over time and across multiple applications. Liu et al. consider a lending policy designed to increase applicants’ credit scores over successive applications, rather than a bank’s profit [82]. Their optimal policy maximizes credit scores in both groups, but these maximums are not equal. Starting out with an advantage, the wealthier group enjoys larger increases in credit scores.

It may also be tempting to conclude that a single-threshold rule is fair on the basis that people with the same scores are treated the same, regardless of group membership. But as we saw above, conditional probabilities change with variable selection. Thus, this notion of fairness relies on scores being based on enough predictive variables to be meaningful. In the limit, the scores would be perfect predictions, i.e. the outcomes themselves.

This motivates fairness definitions enforcing equal decisions across groups, conditioned on the outcome. We explore these in 4.1, as well as three other flavors of definitions in 4.2-4.4 that compare decisions without explicit consideration of an outcome to predict.

4.1 Equal prediction measures

The previous section focused on $U_t$, an expression in terms of combinations of $D$ and $Y$ (true positives, false positives, false negatives, and true negatives). When the utilities of these combinations are such that $t = 1/2$ (equivalently, $u_{TP} - u_{FN} = u_{TN} - u_{FP}$), maximizing $U_{1/2}$ is equivalent to maximizing accuracy. If we believe this corresponds to a notion of well-being within groups, this motivates a definition of fairness:

  • Equal accuracy: $P(D = Y \mid A = a) = P(D = Y \mid A = b)$

Instead of comparing overall accuracies, we could restrict the comparison to subgroups. For example, to reflect the perspective of "innocent" defendants, we could compare accuracies only for defendants who would not get rearrested ($Y = 0$) [94]. Or we could compare how accurately we predict repayment among those granted loans ($D = 1$). The confusion matrix illustrates match and mismatch between $D$ and $Y$, with margins expressing conditioning on subgroups based on $Y$ or $D$; see Table 1.



                        Y = 1                   Y = 0                   Margins (conditioning on D)
  D = 1                 True Positive (TP)      False Positive (FP)     PPV = P(Y=1 | D=1) (Precision);  FDR = P(Y=0 | D=1)
  D = 0                 False Negative (FN)     True Negative (TN)      FOR = P(Y=1 | D=0);  NPV = P(Y=0 | D=0)
  Margins               TPR = P(D=1 | Y=1)      FPR = P(D=1 | Y=0)      Accuracy = P(D = Y)
  (conditioning on Y)   (Recall, Sensitivity);  TNR = P(D=0 | Y=0)
                        FNR = P(D=0 | Y=1)      (Specificity)
  Base rate = P(Y = 1)

Table 1: Confusion matrix, illustrating match and mismatch between outcome $Y$ and decision $D$.
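The quantities in Table 1 can be computed within each group directly from decisions and outcomes. The sketch below (hypothetical helper names, toy data) computes the margins used by the definitions in the next subsection.

```python
import numpy as np

def confusion_rates(y, d):
    """Margins of the confusion matrix (Table 1) for binary outcome y and decision d."""
    y, d = np.asarray(y), np.asarray(d)
    tp = np.sum((d == 1) & (y == 1))
    fp = np.sum((d == 1) & (y == 0))
    fn = np.sum((d == 0) & (y == 1))
    tn = np.sum((d == 0) & (y == 0))
    return {
        "base rate": (tp + fn) / len(y),
        "accuracy": (tp + tn) / len(y),
        "TPR": tp / (tp + fn),   # = 1 - FNR
        "FPR": fp / (fp + tn),   # = 1 - TNR
        "PPV": tp / (tp + fp),   # = 1 - FDR
        "FOR": fn / (fn + tn),   # = 1 - NPV
    }

def by_group(y, d, a):
    """Compare prediction measures across groups defined by the sensitive variable a."""
    return {g: confusion_rates(y[a == g], d[a == g]) for g in np.unique(a)}

# Tiny example with hypothetical data:
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
d = np.array([1, 0, 0, 1, 1, 0, 1, 0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(by_group(y, d, a))
```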

4.1.1 Definitions from the margins of the confusion matrix

For any box in the confusion matrix involving the decision $D$, we can require equality across groups. We list definitions from the margins of the confusion matrix, grouped by equivalence (pairs that sum to 1) and by their expression as conditional independence. While this induces some repetition, we hope that being explicit can address some of the confusion in the literature:

  • Equal FPRs: $P(D = 1 \mid Y = 0, A = a) = P(D = 1 \mid Y = 0, A = b)$
    Equal TNRs: $P(D = 0 \mid Y = 0, A = a) = P(D = 0 \mid Y = 0, A = b)$
    (equivalently, $D \perp A \mid Y = 0$)

  • Equal TPRs: $P(D = 1 \mid Y = 1, A = a) = P(D = 1 \mid Y = 1, A = b)$
    Equal FNRs: $P(D = 0 \mid Y = 1, A = a) = P(D = 0 \mid Y = 1, A = b)$
    (equivalently, $D \perp A \mid Y = 1$; also known as equal opportunity [53])

  • Equal FORs: $P(Y = 1 \mid D = 0, A = a) = P(Y = 1 \mid D = 0, A = b)$
    Equal NPVs: $P(Y = 0 \mid D = 0, A = a) = P(Y = 0 \mid D = 0, A = b)$
    (equivalently, $Y \perp A \mid D = 0$)

  • Equal PPVs: $P(Y = 1 \mid D = 1, A = a) = P(Y = 1 \mid D = 1, A = b)$
    Equal FDRs: $P(Y = 0 \mid D = 1, A = a) = P(Y = 0 \mid D = 1, A = b)$
    (equivalently, $Y \perp A \mid D = 1$; also called predictive parity [17], assessed by an outcome test [117])

Chouldechova calls the first two bullet points error rate balance [17] (also called separation [10] or equalized odds [53]; see Section 4.1.3). These reflect a fairness notion that people with the same outcome are treated the same, regardless of group.

The second two bullet points are called sufficiency [10], see Section 4.1.3. They reflect a fairness notion that people with the same decision would have had similar outcomes, regardless of group.

4.1.2 Scores

Prediction-based decision systems often compute a score $s(v)$ that is intended to estimate $p(v)$, where $S = s(V)$ is a random score from the population. We can consider definitions based on scores:

  • Balance for the negative class: $E[S \mid Y = 0, A = a] = E[S \mid Y = 0, A = b]$

  • Balance for the positive class: $E[S \mid Y = 1, A = a] = E[S \mid Y = 1, A = b]$

  • Calibration within groups: $P(Y = 1 \mid S = s, A = a) = s$ for all scores $s$ and groups $a$

  • AUC parity: the area under the receiver operating characteristic (ROC) curve is the same across groups.

Barocas et al. point out that calibration within groups is satisfied without a fairness-specific effort [10]. With enough (representative, well-measured) data and model flexibility, a score can be very close to $P(Y = 1 \mid X, A)$. So by a lemma (for any random variables $Y$, $W$, and $A$, if $S = E[Y \mid W, A]$ then $E[Y \mid S, A] = S$; proof: by the law of total expectation, $E[Y \mid S, A] = E[\,E[Y \mid W, A] \mid S, A\,] = E[S \mid S, A] = S$), we have $P(Y = 1 \mid S, A) = S$, i.e. calibration within groups. With many variables, $A$ may be well-predicted by them, i.e. there is a function $g(X)$ that is approximately $A$. Then we can get calibration within groups even without using $A$, because $P(Y = 1 \mid X) \approx P(Y = 1 \mid X, A)$.

Corbett-Davies et al. point out that calibration within groups does not prevent problematic practices [24]. The above-mentioned lemma holds using any variables (or none at all). As we have noted, inclusion or exclusion of variables changes scores, and therefore the fraction of each group that ends up above a threshold. Intentional manipulations of this could mimic the racist practice of redlining, justifying loan denials by neighborhood characteristics [85].
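A minimal sketch of how calibration within groups can be checked empirically: bin the scores, and within each group compare the mean score in a bin to the observed outcome rate. The binning scheme and synthetic data below are assumptions for illustration.

```python
import numpy as np

def calibration_by_group(s, y, a, bins=10):
    """Within each group, compare mean score to observed outcome rate in score bins.
    Calibration within groups requires P(Y=1 | S=s, A=a) to be close to s in every bin."""
    s, y, a = map(np.asarray, (s, y, a))
    edges = np.linspace(0, 1, bins + 1)
    out = {}
    for g in np.unique(a):
        rows = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (a == g) & (s >= lo) & (s < hi)
            if mask.sum() > 0:
                rows.append((s[mask].mean(), y[mask].mean(), int(mask.sum())))
        out[g] = rows  # (mean score, outcome rate, count) per non-empty bin
    return out

# Example with synthetic, roughly calibrated scores:
rng = np.random.default_rng(2)
a = rng.integers(0, 2, 20_000)
s = rng.uniform(size=20_000)
y = rng.binomial(1, s)          # outcomes drawn with probability equal to the score
for g, rows in calibration_by_group(s, y, a).items():
    print("group", g, [(round(m, 2), round(r, 2)) for m, r, _ in rows[:3]], "...")
```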

4.1.3 Separation and sufficiency

The two margins of the confusion matrix are distinguished by Barocas et al. as [10]:

  • Separation: $D \perp A \mid Y$ or $S \perp A \mid Y$
    (also known as error rate balance [17], equalized odds [53], or conditional procedure accuracy equality [13])

  • Sufficiency: $Y \perp A \mid D$ or $Y \perp A \mid S$
    (With $D$, this is also known as conditional use accuracy equality [13]. Somewhat confusingly, with $S$, this is also sometimes called calibration [17, 23] or matching conditional frequencies [53].)

In terms of $D$, these are equivalent to pairs of definitions from the margins of the confusion matrix. In terms of $S$, calibration within groups implies sufficiency. Conversely, if $S$ satisfies sufficiency then there exists a function $\ell$ such that $\ell(S)$ satisfies calibration within groups (see Proposition 1 in [10]).

4.2 Equal decisions across groups (stratified)

We now turn to fairness notions that focus on decisions $D$, without the outcome $Y$. These can be motivated in a few ways.

Suppose that from the perspective of the population about whom we make decisions, one decision is always preferable to another, regardless of $Y$ (e.g. non-detention, or admission into college; in contrast, lending to someone unable to repay could hurt their credit score [82], though the ability to repay may strongly depend on the terms of the loan) [94]. In other words, the allocation of benefits and harms across groups can be examined by looking at the decision ($D$) alone.

Furthermore, while the decisions (e.g. detentions) are observed, the outcome being predicted (e.g. crime if released) may be unobserved, making error rates unknown. Therefore, disparity in decisions (e.g. racial disparity in detention rates) may be more publicly visible than disparity in error rates (e.g. racial disparity in detention rates among those who would not have committed a crime).

Yet another motivation to consider fairness constraints without the outcome is measurement error (see Section 3.7.3). For example, if arrests are a poor measure of crime, fairness constraints based on arrests may be unappealing [66]. One might believe that all group differences in $Y$ are a result of measurement error, and that the true outcomes on which we want to base decisions are actually similar across groups [40].

These considerations can all motivate requiring demographic parity: equal decision rates across groups (e.g. equal lending rates across races). A related definition considers parity within strata:

  • Demographic parity: $P(D = 1 \mid A = a) = P(D = 1 \mid A = b)$, i.e. $D \perp A$
    (also known as statistical parity, or group fairness [35])

  • Conditional demographic parity: $D \perp A \mid W$ for some chosen conditioning variables $W$

When the conditioning variable $W$ is the outcome $Y$, conditional demographic parity is separation. When $W = X$ (the insensitive variables), it is equivalent to:

  • Unawareness: $D_i = D_j$ if $X_i = X_j$
    (also known as blindness, fairness through unawareness [77], anti-classification [23], or treatment parity [80])

In other words, people with the same $X$ are treated the same. A related idea requires people who are similar in $X$ to be treated similarly. More generally, we could define a similarity metric between people that is aware of the sensitive variables, motivating the next flavor of fairness definitions [35].
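A short sketch of how demographic parity and conditional demographic parity can be inspected in tabular data; the column names (including the conditioning variable "stratum") are hypothetical.

```python
import pandas as pd

# Sketch: demographic parity compares decision rates across groups; the conditional
# version compares them within strata of chosen conditioning variables.
df = pd.DataFrame({
    "A": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "stratum": [0, 1, 0, 1, 0, 0, 1, 1],
    "D": [1, 0, 1, 0, 1, 0, 1, 1],
})

# Demographic parity: P(D=1 | A=a) vs P(D=1 | A=b)
print(df.groupby("A")["D"].mean())

# Conditional demographic parity: the same comparison within each stratum
print(df.groupby(["stratum", "A"])["D"].mean().unstack("A"))
```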

4.3 Metric fairness

Assume there is a metric $\delta$ that defines the similarity of individuals based on all of their variables, $\delta(v_i, v_j)$.

  • Metric fairness: for every pair of people $i$ and $j$, their closeness implies closeness in decisions, e.g. $|E[D_i] - E[D_j]| \le \delta(v_i, v_j)$ for randomized decisions
    (also known as individual fairness, the Lipschitz property [35], and perfect metric fairness [113])

4.3.1 The metric

Dwork et al. say "the metric should (ideally) capture ground truth" [35]. This inspired Friedler et al. to define the construct space, the variables on which we want to base decisions [40]. For example, suppose we want to base decisions on the outcome, or any variables that can predict it, i.e. on a conditional probability given everything in the world. (Here the probabilities are over some superpopulation model [48].) But we do not observe the entire world at decision time, only the variables $V$, and we calculate scores $s(v)$ intended to estimate $p(v) = P(Y = 1 \mid V = v)$.

Friedler et al. introduce an assumption they call WYSIWYG (what you see is what you get), i.e. that we can define a metric $\delta$ in the observed space that approximates a metric in the construct space. To satisfy WYSIWYG, $\delta$ may need to be aware of the sensitive variables [35]. One reason is that the insensitive variables may predict differently for different groups. For example, suppose we want to predict who likes math so we can recruit them to the school's math team. Let $Y$ be liking math and $X$ be choice of major. Suppose in one group, students who like math are steered towards economics, and in the other group towards engineering. To predict liking math, we should use group membership in addition to $X$.

Friedler et al. also introduce an alternate assumption called WAE (we’re all equal), i.e. that the groups have small distance in the construct space [40]. On this basis, we could adjust a metric in the observed space so that the groups have small distance [35]. Relatedly, Johndrow and Lum describe adjusting the insensitive variables so that they are independent of group [66].

4.3.2 Connections to unawareness and conditional demographic parity

Consider metric fairness where the metric only considers the insensitive variables, $\delta(v_i, v_j) = \delta_X(x_i, x_j)$. Then if $x_i = x_j$, we require the decisions to be identical, $D_i = D_j$. This implies unawareness, or equivalently, $D \perp A \mid X$ (a version of conditional demographic parity).

4.3.3 Relaxations

Rothblum and Yona relax the metric fairness condition [113]:

  • Approximate metric fairness: $P_{i,j}\big(\, |E[D_i] - E[D_j]| > \delta(v_i, v_j) + \gamma \,\big) \le \alpha$, where the probability is over pairs of people drawn from the population

Informally, they say a person is discriminated against if the fraction of people treated too differently from them (relative to the metric plus the slack $\gamma$) is too large (relative to $\alpha$), and any sufficiently large group cannot consist entirely of people discriminated against in this sense. Taking $\alpha = 0$ and $\gamma = 0$ gives Dwork et al.'s condition.
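As a sketch of how (approximate) metric fairness might be audited empirically, the following estimates, over randomly sampled pairs, the fraction of pairs whose decision probabilities differ by more than the metric plus a slack $\gamma$. The Euclidean metric and the decision rule below are assumptions for illustration only.

```python
import numpy as np

def metric_fairness_violations(V, d_prob, delta, gamma=0.0, n_pairs=50_000, seed=0):
    """Estimate the fraction of pairs (i, j) with |E[D_i] - E[D_j]| > delta(v_i, v_j) + gamma.
    Exact metric fairness corresponds to this fraction being 0 with gamma = 0."""
    rng = np.random.default_rng(seed)
    n = len(d_prob)
    i = rng.integers(0, n, n_pairs)
    j = rng.integers(0, n, n_pairs)
    gaps = np.abs(d_prob[i] - d_prob[j])
    dists = np.array([delta(V[a], V[b]) for a, b in zip(i, j)])
    return float(np.mean(gaps > dists + gamma))

# Example with a hypothetical Euclidean metric on the observed variables:
rng = np.random.default_rng(3)
V = rng.normal(size=(2_000, 3))
d_prob = 1 / (1 + np.exp(-V[:, 0]))          # a randomized decision rule's P(D=1 | v)
delta = lambda u, v: np.linalg.norm(u - v)   # assumed similarity metric
print(metric_fairness_violations(V, d_prob, delta, gamma=0.05))
```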

4.4 Causal definitions

We have already introduced causal notions, considering the potential or counterfactual values under different decisions [114, 62, 56]. (We use the terms "potential outcomes" and "counterfactuals" interchangeably [56].) Causal fairness definitions consider instead counterfactuals under different settings of a sensitive variable. Let $V_i(a)$ be the input variables if the individual had their sensitive variable set to $a$. We write $D_i(a) = d(V_i(a))$ for the corresponding decision, e.g. what would the hiring decision be if they had been white? There is debate over whether this is a well-defined notion. Pearl allows counterfactuals under conditions without specifying how those conditions are established, e.g. "if they had been white". In contrast, Hernán and Robins introduce counterfactuals only under well-defined interventions, e.g. the intervention studied by Greiner and Rubin: "if the name on their resume were set to be atypical among black people" [51, 56].

Putting these issues to the side, we can proceed to define fairness in terms of counterfactual decisions under different settings of a sensitive variable. We order definitions from strongest to weakest. (Conditional counterfactual parity implies counterfactual parity, by averaging over the distribution of Data in the population. However, conditional demographic parity ($D \perp A \mid \mathrm{Data}$) does not in general imply demographic parity ($D \perp A$), because the distribution of Data may differ by group; this idea underlies Simpson's "paradox" [14]. Results 4 and 5 give results about the relationship between conditional and unconditional independence.)

  • Individual Counterfactual Fairness: $D_i(a) = D_i(a')$ for every individual $i$ and all values $a, a'$

  • Conditional Counterfactual Parity: $E[D(a) \mid \mathrm{Data}] = E[D(a') \mid \mathrm{Data}]$ for all $a, a'$

  • Counterfactual Parity: $E[D(a)] = E[D(a')]$ for all $a, a'$

Kusner et al.'s "counterfactual fairness" is a form of conditional counterfactual parity [77]. Conditioned on a lot of data, this approaches individual counterfactual fairness. If $D(a) \perp A \mid \mathrm{Data}$ (a criterion called ignorability or unconfoundedness [103, 48, 62, 56]), then conditional counterfactual parity is equivalent to conditional demographic parity.
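A toy structural simulation (the causal model below is assumed purely for illustration) shows how counterfactual comparisons are computed: the same individuals, i.e. the same exogenous noise, are re-run with the sensitive variable set to different values. Note that even an "unaware" rule can violate counterfactual parity when the sensitive variable causally affects the inputs.

```python
import numpy as np

# Toy structural model (assumed for illustration): A -> X -> D, with exogenous noise U.
# Counterfactual decisions are computed by re-running the same individuals (same U)
# with the sensitive variable set to a different value.
rng = np.random.default_rng(4)
n = 100_000
U = rng.normal(size=n)                            # exogenous noise, shared across counterfactuals

def structural(a, U):
    X = 0.8 * a + U                               # A affects the input variable X
    D = (1 / (1 + np.exp(-X)) > 0.5).astype(int)  # decision rule uses X only ("unaware")
    return D

D_a1 = structural(1, U)                           # decisions had everyone had A = 1
D_a0 = structural(0, U)                           # decisions had everyone had A = 0

# Counterfactual parity compares E[D(a)] across settings of A:
print("E[D(1)] =", D_a1.mean(), " E[D(0)] =", D_a0.mean())
print("fraction of individuals whose decision flips:", (D_a1 != D_a0).mean())
```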

These first three causal definitions consider the total effect of $A$ (e.g. race) on $D$ (e.g. hiring). However, it is possible to consider some causal pathways from the sensitive variable to be fair. For example, suppose race affects education obtained. If hiring decisions are based on the applicant's education, then race affects hiring decisions. It often helps to visualize causal relationships graphically; see Figure 1. (See Pearl [103], Section 1.3.1, for a definition of causal graphs, which encode conditional independence statements for counterfactuals.)

Figure 1: A causal graph showing direct (red), indirect (blue), and back-door (green) paths from race to hiring.

In this cartoon version of the world, a complex historical process creates an individual's race and socioeconomic status at birth [124, 64]. These both affect the hiring decision, including through the mediating variable education. We can define effects along paths in this graph by defining fancier counterfactuals. Let $D(a', E(a))$ be the decision if the applicant had not been black (race set to $a'$), but education had remained at its value under race $a$. To disallow only the red path in Figure 1, we can define

  • No Direct Effect Fairness: $E[D(a', E(a))] = E[D(a)]$ for all $a, a'$

Beyond direct effects, one could consider other directed paths from race to be fair or unfair. Two definitions from the literature include:

  • No Unfair Path-Specific Effects [93]: no average effects along unfair (user-specified) directed paths from $A$ to $D$

  • No Unresolved Discrimination [72]: there is no directed path from $A$ to $D$ unless through a resolving variable (a variable that is influenced by $A$ in a manner that we accept as nondiscriminatory)

All of the above causal definitions consider only directed paths from race. In Figure 1, these include the red and blue paths. But what about the green paths? Known as back-door paths [103], these do not represent causal effects of race, and therefore are permitted under causal definitions of fairness. However, back-door paths can contribute to the association between race and hiring. Indeed, they are why we say "correlation is not causation." (If a set of variables $W$ satisfies the back-door criterion in a causal graph, meaning $W$ includes no descendants of $A$ and blocks all back-door paths between $A$ and $D$, then unconfoundedness ($D(a) \perp A \mid W$) holds [103, 105]. The converse is not true in general.) Zhang and Bareinboim decompose the total disparity into disparities from each type of path (direct, indirect, back-door) [130]. In contrast to the causal fairness definitions, health disparities are defined to include the contribution from back-door paths (e.g. through socioeconomic status at birth) [98, 34, 22, 7, 64].

Causal definitions of fairness focus our attention on how to compensate for causal influences at decision time. Causal reasoning can be used instead to design interventions (to reduce disparities and improve overall outcomes), rather than to define fairness. In particular, causal graphs can be used to develop interventions at earlier points, prior to decision-making [64, 8].

5 Intersectionality

Much of the ML fairness literature considers the simple case of two groups (advantaged and disadvantaged). However, the fairness flavors described in the previous section could each be applied to various groups. For clarity, we advocate separating two axes: the flavors of fairness (Section 4) from the groups to which they can be applied (this section). The latter is part of a much larger conversation about intersectionality.

Crenshaw analyzes failed employment discrimination lawsuits brought by black women, who could only seek recourse against discrimination as black women, and who were unable to establish that discrimination simply as sex discrimination (since it did not apply to white women) or as race discrimination (since it did not apply to black men) [26]. Crenshaw quotes the court in DeGraffenreid v. General Motors [1]:

The prospect of the creation of new classes of protected minorities, governed only by the mathematical principles of permutation and combination, clearly raises the prospect of opening the hackneyed Pandora’s box.

Intersectionality has been studied in various quantitative literatures, e.g. in epidemiology [65, 63]. In the ML fairness literature, Buolamwini and Gebru evaluated commercial gender classification systems and found that darker-skinned females are the most misclassified group [16]. Kearns et al. apply approximate statistical and false positive rate parity to groups of large enough size that are defined by the sensitive variables [71]. Hébert-Johnson et al. apply approximate calibration to groups defined by any of the input variables [55]. Rothblum and Yona apply approximate metric fairness to any group of large enough size [113]. At the extreme, one can take groups to be individuals. Some definitions already give parity across individuals: single-threshold fairness, unawareness, and metric fairness.

We could extend ideas about intersectionality to causal definitions of fairness. Instead of comparing across several groups of people, one could compare across several counterfactuals. For example, we could add gender to the causal graph in Figure 1.

6 Impossibilities

In this section we catalogue several results showing that it is typically impossible to simultaneously satisfy all flavors of fairness from section 4. Practitioners will therefore need to choose among them. To that end, we also discuss some of their mathematical and moral tensions.

6.1 Separation and sufficiency

Tension between margins of the confusion matrix is expressed in three very similar results. (We write Results 1, 4, and 5 in terms of scores, outcomes, and sensitive variables, but they are more general properties of random variables.)

Result 1 (Proposition 4 in Barocas et al. [10], Theorem 17.2 in Wasserman [126]).

Assume separation ($S \perp A \mid Y$) and sufficiency ($Y \perp A \mid S$). Then at least one of the following is true:

  • Independence: $A \perp (S, Y)$ (in particular, equal base rates, $A \perp Y$)

  • An event in the joint distribution has probability zero. (Formally, there exist sets $B_S$, $B_Y$, $B_A$ with $P(S \in B_S)$, $P(Y \in B_Y)$, $P(A \in B_A)$ all positive, but $P(S \in B_S, Y \in B_Y, A \in B_A) = 0$.)

Result 2 (Kleinberg et al. [74]).

Assume binary $Y$ and a score $S$. Assume also balance for the negative class, balance for the positive class, and calibration within groups. Then at least one of the following is true:

  • Equal base rates: $P(Y = 1 \mid A = a) = P(Y = 1 \mid A = b)$

  • Perfect prediction: $S_i = Y_i$ (so $S_i = 0$ or 1) for each person $i$

Equal base rates and perfect prediction can be called trivial, degenerate, or even utopian (representing two very different utopias).

Result 3 (Chouldechova [17]).

Assume binary $Y$ and $D$. Assume also that equal FPRs, equal FNRs, and equal PPVs hold. Then at least one of the following is true. (This result comes from the following relationship among FPR, FNR, PPV, and base rate $p = P(Y = 1 \mid A)$: $\mathrm{FPR} = \frac{p}{1 - p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot (1 - \mathrm{FNR})$.)

  • Equal base rates: $P(Y = 1 \mid A = a) = P(Y = 1 \mid A = b)$

  • FPR = 0, PPV = 1 for both groups

  • FPR = 0, FNR = 1 for both groups
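A quick numeric check of the relationship in Result 3, with assumed values of PPV and FNR held equal across two groups whose base rates differ, shows that the implied FPRs cannot also be equal unless a degenerate case holds.

```python
# Numeric check of the relation FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR) from Result 3,
# with assumed (illustrative) values shared across two groups except for the base rate p.
def fpr_from(p, ppv, fnr):
    return p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)

ppv, fnr = 0.7, 0.2          # held equal across groups
for p in (0.3, 0.5):         # unequal base rates
    print(f"base rate {p}: implied FPR = {fpr_from(p, ppv, fnr):.3f}")
# Different base rates with equal PPV and FNR imply different FPRs,
# unless FPR = 0 (requiring PPV = 1 or FNR = 1), matching the degenerate cases above.
```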

6.1.1 The COMPAS debate

Tension between margins of the confusion matrix factored into a debate about a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) and its estimates of risk for “recidivism” (one can object to the term as creating an impression of certainty that there was a first offense). ProPublica published a highly influential analysis based on data obtained through public records requests [6, 78]. They found that COMPAS does not satisfy equal FNRs: among defendants who got rearrested, white defendants were twice as likely to be misclassified as low risk. COMPAS also does not satisfy equal FPRs: among defendants who did not get rearrested, black defendants were twice as likely to be misclassified as high risk.

Northpointe (now Equivant), the developers of COMPAS, critiqued ProPublica’s work and pointed out that COMPAS satisfies equal PPVs: among those labeled higher risk, the proportion of defendants who got rearrested is approximately the same regardless of race [31]. COMPAS also satisfies calibration within groups [38]. ProPublica then responded to these critiques [4, 5, 3]. Much of the subsequent conversation consisted of either trying to harmonize these definitions of fairness or asserting that one or the other is correct.

The debate between definitions was particularly intense because the decision space (detain or not) is so narrow and harmful. Different definitions dictate how the harm of detention is allocated across groups. Instead of choosing among these, we can choose alternative, less harmful policies (see Section 3.3).

6.2 Separation and demographic parity

The next two results have not (yet) been as central to public debate as the previous results, but we include them in our catalogue for completeness.

Result 4 (Proposition 3 in Barocas et al. [10]).

Assume $Y$ is binary. Assume also that separation ($S \perp A \mid Y$) and demographic parity ($S \perp A$) hold. Then at least one of the following is true:

  • Equal base rates: $Y \perp A$

  • The score is independent of the outcome: $S \perp Y$

6.3 Sufficiency and demographic parity

Result 5 (Proposition 2 in Barocas et al. [10]).

Assume sufficiency ($Y \perp A \mid S$) and demographic parity ($S \perp A$). Then we must have equal base rates: $Y \perp A$.

6.4 Unawareness and demographic parity

Corbett-Davies et al. and Lipton et al. both note that a decision rule that maximizes $U_t$ under a demographic parity constraint (in general) uses the sensitive variables both in estimating the conditional probabilities and in determining their thresholds [24, 80]. Therefore, solutions such as disparate learning processes (DLPs), which allow the use of sensitive variables during training but not prediction, are either sub- or equi-optimal [104, 67, 68, 69, 128].

Lipton et al. present this result as a tension between unawareness and demographic parity [80]. It is, but the result is conditional on a choice of outcome $Y$, insensitive variables $X$, and optimization goal. Changing any of these could result in reduced demographic disparity under unawareness. For example, in lending, we could consider repayment under different loan conditions (e.g. a longer timeline). We can also change the shape of the distributions of conditional probabilities by considering new variables. In pretrial detention, we could consider a utility that penalizes false positives more strongly, bringing the threshold $t$ up to a region where the distributions of the conditional probabilities might look much more similar, lessening disparity in detention rates.

Unlike the other impossibility results, unawareness and demographic parity do not imply equal base rates, perfect prediction, or some other “degenerate” case.

7 Discussion

Here we discuss two recommendations. First, we highlight confusing terminology and suggest moving to more descriptive language. Next, we briefly point to processes and research that respond to the scientific and political concerns that we have raised about prediction-based decision-making.

7.1 Terminology

In the computer science literature it is common to conflate an individual with their variables, e.g. "we denote the set of individuals by $V$" [35]. Furthermore, $P(Y = 1 \mid V = v)$ is then called an individual's "true risk" [24, 23]. These terminologies allow us to assume we have measured and conditioned on all the relevant attributes of an individual. The statistics literature usually separates the notion of an individual (often indexed by $i$) from their measured variables $V_i$. We propose adopting this convention and describing quantities such as $P(Y = 1 \mid V = v)$ as conditional probabilities.

The term "biased data" (e.g. [67, 11, 17, 80]) collapses retrospective injustice (societal bias) with concerns about non-representative sampling and measurement error (statistical bias); see Figure 2. There is overlap between the two concepts, e.g. using arrests as a measure of crime can introduce statistical bias from measurement error that is differential by race because of a racist policing system [2, 84]. But suppose we could perfectly measure crime: does this make the data free from "bias"? In a statistical sense, yes. (In statistics, "bias" refers to properties of an estimator, not data; here we mean bias in the estimation of conditional probabilities or fairness metrics that could result from non-representative data, measurement error, or model misspecification.) In a societal sense, no, because crime rates reflect societal injustice (including how crime is defined).

Figure 2: A cartoon showing two components of “biased data”: societal bias and statistical bias.

The term "biased model/algorithm" is used to describe violations of parities, e.g. unequal FPRs [6]. Lipton and Steinhardt caution against collapsing statistical parities with legal or ethical concepts [81]. Adopting the word "fairness" to describe the above definitions leads to confusion, e.g. thinking that we should "applaud and encourage" the application of any of them because it "immediately increases the amount of fairness, by some metric" [120]. This "mythic equivalence between technical and social notions of fairness" precludes progress by stifling necessary debate about how to weigh competing normative commitments [50]. Similarly, a quantity labeled "utility" or "social welfare" may fail to reflect the goals of many.

7.2 Ways forward

In Section 3, we described concerning choices and assumptions often used to justify a prediction-based decision system. As we have seen, one way to address these is by examining relevant flavors of fairness definitions from Section 4. Another way forward is to address them directly. Here we sketch that approach, including pointing to some of the relevant statistics literature.

Starting with clearly articulated goals can improve accountability. To best serve those goals, consider whether interventions should be made at the individual or aggregate level. Carefully describe the eligible population to clarify who is impacted by the possible interventions. Expanding the decision space to include less harmful and more supportive interventions can benefit all groups and mitigate fairness concerns.

To build a decision system aligned with stated goals, choose outcomes carefully, considering data limitations. Using prior information [48] can help specify a realistic utility function. For example, instead of assuming benefits and harms are constant across decisions (the “symmetric” assumption), prior data can inform a more realistic distribution.

Combining all choices in one expanded model [127] can mitigate sensitivity of decisions to model selection. Instead of assuming one potential outcome is known, causal methods can be used to estimate effects of decisions. Furthermore, these effects may not be separate and constant across the population. As such, causal methods can be used to study interference [100, 87] and heterogeneous effects [47].

8 Conclusion

We have identified several pitfalls in the justification of prediction-based decision systems and offered a catalogue of notions of fairness. Neither maximization of a "utility function" (e.g. accuracy) nor satisfaction of a "fairness constraint" (e.g. demographic parity) guarantees social and political goals. Neither provides a complete, causal model of the world to prescribe interventions towards those goals. Both can narrow focus to the quantifiable, introduce harmful simplifications, and mislead us into thinking that the issues are purely technical [50, 96]. But while data and mathematical formalization are far from saviors, they are not doomed to be tools of oppression. Indeed, they can be designed to help disadvantaged groups [43, 109].

In the pursuit of that goal, we need explicit, clear communication. We attempted this in cataloguing the choices and assumptions made, often implicitly, to justify a prediction-based decision system. We presented several definitions of fairness from the literature in common notation to facilitate comparisons, regarding none as the axiomatic definition of fairness, justice, or nondiscrimination.

References

  • [1] DeGraffenreid v. General Motors Assembly Div., etc. https://law.justia.com/cases/federal/district-courts/FSupp/413/142/1660699/, 1976.
  • [2] Michelle Alexander. The New Jim Crow. New Press, 2012.
  • [3] Julia Angwin and Jeff Larson. Annotated responses to an academic paper by Flores et al. https://www.documentcloud.org/documents/3248777-Lowenkamp-Fedprobation-sept2016-0.html, 2016.
  • [4] Julia Angwin and Jeff Larson. Propublica responds to company’s critique of machine bias story. https://www.propublica.org/article/propublica-responds-to-companys-critique-of-machine-bias-story, 2016.
  • [5] Julia Angwin and Jeff Larson. Technical response to northpointe. https://www.propublica.org/article/technical-response-to-northpointe, 2016.
  • [6] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.
  • [7] Zinzi D. Bailey, Nancy Krieger, Madina Agénor, Jasmine Graves, Natalia Linos, and Mary T. Bassett. Structural racism and health inequities in the usa: evidence and interventions. The Lancet, 389(10077):1453–1463, 2017.
  • [8] Chelsea Barabas, Karthik Dinakar, Joichi Ito, Madars Virza, and Jonathan Zittrain. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment, 2018. Presented at FAT* 2018, February 2018, New York, NY USA.
  • [9] Solon Barocas and Moritz Hardt. Fairness in machine learning, 2017. Presented at the 31st Annual Conference on Neural Information Processing Systems (NIPS), December 2017, Long Beach, CA USA.
  • [10] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018.
  • [11] Solon Barocas and Andrew D. Selbst. Big data’s disparate impact. Cal. L. Rev., 104:671, 2016.
  • [12] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer New York, 1985.
  • [13] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: the state of the art. arXiv preprint arXiv:1703.09207, 2017.
  • [14] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, 2014.
  • [15] William N. Brownsberger. Bill s.770: An act providing community-based sentencing alternatives for primary caretakers of dependent children who have been convicted of non-violent crimes. https://malegislature.gov/Bills/190/S770, 2017.
  • [16] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (FAT*), pages 77–91, 2018.
  • [17] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
  • [18] Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency (FAT*), pages 134–148, 2018.
  • [19] Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning, 2018.
  • [20] Anne T. Cleary. Test bias: Validity of the scholastic aptitude test for negro and white students in integrated colleges. ETS Research Bulletin Series, 1966(2):i–23, 1966.
  • [21] T Anne Cleary. Test bias: Prediction of grades of negro and white students in integrated colleges. Journal of Educational Measurement, 5(2):115–124, 1968.
  • [22] Benjamin Lê Cook, Thomas G. McGuire, and Alan M. Zaslavsky. Measuring racial/ethnic disparities in health care: methods and practical issues. Health services research, 47(3pt2):1232–1254, 2012.
  • [23] Sam Corbett-Davies and Sharad Goel. Defining and designing fair algorithms. https://policylab.stanford.edu/projects/defining-and-designing-fair-algorithms.html, 2018. Presented at EC 2018 and ICML 2018.
  • [24] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.
  • [25] Marsha Courchane, David Nebhut, and David Nickerson. Lessons learned: Statistical techniques and fair lending. Journal of Housing Research, pages 277–295, 2000.
  • [26] Kimberle Crenshaw. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal F., page 139, 1989.
  • [27] Francis T Cullen, Cheryl Lero Jonson, and Daniel S Nagin. Prisons do not reduce recidivism: The high cost of ignoring science. The Prison Journal, 91(3_suppl):48S–65S, 2011.
  • [28] Richard B. Darlington. Another look at ‘cultural fairness’. Journal of Educational Measurement, 8(2):71–82, 1971.
  • [29] Maria De-Arteaga, Artur Dubrawski, and Alexandra Chouldechova. Learning under selective labels in the presence of expert consistency, 2018. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
  • [30] Robert DeFina and Lance Hannon. For incapacitation, there is no time like the present: The lagged effects of prisoner reentry on property and violent crime rates. Social Science Research, 39(6):1004–1014, 2010.
  • [31] William Dieterich, Christina Mendoza, and Tim Brennan. Compas risk scales: Demonstrating accuracy equity and predictive parity. Northpoint Inc, 2016.
  • [32] Roel Dobbe, Sarah Dean, Thomas Gilbert, and Nitin Kohli. A broader view on bias in automated decision-making: Reflecting on epistemology and dynamics, 2018. Appearing in the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
  • [33] Neil J. Dorans and Linda L. Cook. Fairness in Educational Assessment and Measurement. Taylor & Francis, 2016.
  • [34] Naihua Duan, Xiao-Li Meng, Julia Y. Lin, Chih nan Chen, and Margarita Alegria. Disparities in defining disparities: statistical conceptual frameworks. Statistics in medicine, 27(20):3941–3956, 2008.
  • [35] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
  • [36] Hillel J. Einhorn and Alan R. Bass. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 75(4):261, 1971.
  • [37] Virginia Eubanks. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press, 2018.
  • [38] Anthony W. Flores, Kristin Bechtel, and Christopher T. Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to “Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” Fed. Probation, 80:38, 2016.
  • [39] Center for Data Science and Public Policy, University of Chicago. Aequitas. https://dsapp.uchicago.edu/aequitas/, 2018.
  • [40] Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness, 2016.
  • [41] Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning, 2018.
  • [42] The Leadership Conference Education Fund. The use of pretrial “risk assessment” instruments: A shared statement of civil rights concerns, 2018.
  • [43] Sidney Fussell. The algorithm that could save vulnerable New Yorkers from being forced out of their homes. https://gizmodo.com/the-algorithm-that-could-save-vulnerable-new-yorkers-fr-1826807459, August 2018.
  • [44] Andreas Fuster, Paul Goldsmith-Pinkham, Tarun Ramadorai, and Ansgar Walther. Predictably unequal? The effects of machine learning on credit markets, 2018.
  • [45] Pratik Gajane and Mykola Pechenizkiy. On formalizing fairness in prediction with machine learning, 2017. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
  • [46] Antonio R. Garcia, Minseop Kim, and Christina DeNard. Context matters: The state of racial disparities in mental health services among youth reported to child welfare in 1999 and 2009. Children and Youth Services Review, 66:101–108, 2016.
  • [47] Andrew Gelman. The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective, 2015.
  • [48] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013.
  • [49] Ben Green. “Fair” risk assessments: A precarious approach for criminal justice reform. https://www.fatml.org/media/documents/fair_risk_assessments_criminal_justice.pdf, 2018. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
  • [50] Ben Green and Lily Hu. The myth in the methodology: Towards a recontextualization of fairness in machine learning. https://www.dropbox.com/s/4tf5qz3mgft9ro7/Hu%20Green%20-%20Myth%20in%20the%20Methodology.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
  • [51] James D. Greiner and Donald B. Rubin. Causal effects of perceived immutable characteristics. Review of Economics and Statistics, 93(3):775–785, 2011.
  • [52] Bernard E. Harcourt. Against prediction: Profiling, policing, and punishing in an actuarial age. University of Chicago Press, 2008.
  • [53] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning, 2016.
  • [54] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York, 2013.
  • [55] Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Calibration for the (computationally-identifiable) masses, 2017.
  • [56] Miguel A. Hernán and James M. Robins. Causal Inference. Chapman & Hall/CRC, 2018. Forthcoming.
  • [57] Lily Hu and Yiling Chen. A short-term intervention for long-term fairness in the labor market. 2018.
  • [58] Lily Hu and Yiling Chen. Welfare and distributional impacts of fair classification. 2018. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
  • [59] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic classification, 2018.
  • [60] John E. Hunter and Frank L. Schmidt. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 83(6):1053, 1976.
  • [61] Ben Hutchinson and Margaret Mitchell. 50 years of test (un)fairness: Lessons for machine learning, 2019. Conference on Fairness, Accountability and Transparency (FAT*) (preprint).
  • [62] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
  • [63] John W. Jackson. Explaining intersectionality through description, counterfactual thinking, and mediation analysis. Social Psychiatry and Psychiatric Epidemiology, 52(7):785–793, July 2017.
  • [64] John W. Jackson and Tyler J. VanderWeele. Decomposition analysis to identify intervention targets for reducing disparities, 2018.
  • [65] John W. Jackson, David R. Williams, and Tyler J. VanderWeele. Disparities at the intersection of marginalized groups. Social Psychiatry and Psychiatric Epidemiology, 51(10):1349–1359, Oct 2016.
  • [66] James E. Johndrow and Kristian Lum. An algorithm for removing sensitive information: application to race-independent recidivism prediction, 2017.
  • [67] Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication (IC4), pages 1–6. IEEE, 2009.
  • [68] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pages 869–874. IEEE, 2010.
  • [69] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops (ICDMW), pages 643–650. IEEE, 2011.
  • [70] Samuel Karlin and Herman Rubin. The theory of decision procedures for distributions with monotone likelihood ratio. The Annals of Mathematical Statistics, pages 272–299, 1956.
  • [71] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [72] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017.
  • [73] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan. Algorithmic fairness. In AEA Papers and Proceedings, volume 108, pages 22–27, 2018.
  • [74] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores, 2016.
  • [75] Jon Kleinberg and Manish Raghavan. How do classifiers induce agents to invest effort strategically?, 2018.
  • [76] Robert Koulish. Immigration detention in the risk classification assessment era. Connecticut Public Interest Law Journal, 16(1), November 2016.
  • [77] Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.
  • [78] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the COMPAS recidivism algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm, 2016.
  • [79] Mary A. Lewis. A comparison of three models for determining test fairness. Technical report, Federal Aviation Administration Washington DC Office of Aviation Medicine, 1978.
  • [80] Zachary C. Lipton, Alexandra Chouldechova, and Julian McAuley. Does mitigating ML’s impact disparity require treatment disparity?, 2018.
  • [81] Zachary C. Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. https://www.dropbox.com/s/ao7c090p8bg1hk3/Lipton%20and%20Steinhardt%20-%20Troubling%20Trends%20in%20Machine%20Learning%20Scholarship.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
  • [82] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018. Best Paper Award at ICML 2018.
  • [83] Kristian Lum. Limitations of mitigating judicial bias with machine learning. Nature Human Behaviour, 1(0141), 2017.
  • [84] Kristian Lum and William Isaac. To predict and serve? Significance, 13(5):14–19, 2016.
  • [85] Douglas S. Massey and Nancy A. Denton. American apartheid: Segregation and the making of the underclass. Harvard University Press, 1993.
  • [86] Aditya Krishna Menon and Robert C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency (FAT*), pages 107–118, 2018.
  • [87] Caleb H. Miles, Maya Petersen, and Mark J. van der Laan. Causal inference for a single group of causally-connected units under stratified interference, 2017.
  • [88] Claire Cain Miller. Can an algorithm hire better than a human? https://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html, June 2015.
  • [89] Claire Cain Miller. When algorithms discriminate. https://www.nytimes.com/2015/06/26/upshot/can-an-algorithm-hire-better-than-a-human.html, July 2015.
  • [90] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification, 2018.
  • [91] Ojmarrh Mitchell, Joshua C. Cochran, Daniel P. Mears, and William D. Bales. Examining prison effects on recidivism: A regression discontinuity approach. Justice Quarterly, 34(4):571–596, 2017.
  • [92] Shira Mitchell and Jackie Shadlen. Mirror mirror: Reflections on quantitative fairness. https://shiraamitchell.github.io/fairness/, 2018.
  • [93] Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 2018, page 1931. NIH Public Access, 2018.
  • [94] Arvind Narayanan. 21 fairness definitions and their politics, 2018. Presented at FAT* 2018, February 2018, New York, NY USA.
  • [95] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.
  • [96] Rodrigo Ochigame, Chelsea Barabas, Karthik Dinakar, Madars Virza, and Joichi Ito. Beyond legitimation: Rethinking fairness, interpretability, and accuracy in machine learning. https://www.dropbox.com/s/6ue5knrlvbxiavy/Ochigame%20et%20al%20-%20Beyond%20Legitimation.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
  • [97] The Mayor’s Office of Data Analytics. Legionnaires’ disease response: MODA assisted in a citywide response effort after an outbreak of Legionnaires’ disease. https://moda-nyc.github.io/Project-Library/projects/cooling-towers/, 2018.
  • [98] Institute of Medicine (IOM). Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. National Academies Press, Washington, DC, 2003.
  • [99] Executive Office of the President. Big data: A report on algorithmic systems, opportunity, and civil rights. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf, 2016.
  • [100] Elizabeth L. Ogburn and Tyler J. VanderWeele. Causal diagrams for interference. Statistical Science, 29(4):559–578, 2014.
  • [101] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown/Archetype, 2016.
  • [102] Cathy O’Neil and Hanna Gunn. Near term artificial intelligence and the ethical matrix, 2018. Book chapter, to appear.
  • [103] Judea Pearl. Causality. Cambridge University Press, 2009.
  • [104] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 560–568. ACM, 2008.
  • [105] Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H Maathuis. A complete generalized adjustment criterion. arXiv preprint arXiv:1507.01524, 2015.
  • [106] Nancy S. Petersen and Melvin R. Novick. An evaluation of some models for culture-fair selection. Journal of Educational Measurement, 13(1):3–29, 1976.
  • [107] Alex R. Piquero. Understanding race/ethnicity differences in offending across the life course: Gaps and opportunities. Journal of Developmental and Life-Course Criminology, 1(1):21–32, 2015.
  • [108] Alex R. Piquero and Robert W. Brame. Assessing the race–crime and ethnicity–crime relationship in a sample of serious adolescent delinquents. Crime & Delinquency, 54(3):390–422, 2008.
  • [109] Eric Potash, Joe Brew, Alexander Loewi, Subhabrata Majumdar, Andrew Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, and Rayid Ghani. Predictive modeling for public health: Preventing childhood lead poisoning. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2039–2047. ACM, 2015.
  • [110] Jessica Pryce, Wonhyung Lee, Elizabeth Crowe, Daejun Park, Mary McCarthy, and Greg Owens. A case study in public child welfare: County-level practices that address racial disparity in foster care placement. Journal of Public Child Welfare, pages 1–25, 2018.
  • [111] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. Failing loudly: An empirical study of methods for detecting dataset shift, 2018.
  • [112] John Rawls. A Theory of Justice. Harvard University Press, 1971.
  • [113] Guy N. Rothblum and Gal Yona. Probably approximately metric-fair learning. https://www.fatml.org/media/documents/probably_approximately_metric_fair_learning.pdf, 2018. Appearing in the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
  • [114] Donald B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
  • [115] Andrew D. Selbst, Danah Boyd, Sorelle Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness and abstraction in sociotechnical systems. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3265913, 2018. Conference on Fairness, Accountability and Transparency (FAT*) 2019.
  • [116] Amartya Sen. Utilitarianism and welfarism. The Journal of Philosophy, 76(9):463–489, 1979.
  • [117] Camelia Simoiu, Sam Corbett-Davies, and Sharad Goel. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics, 11(3):1193–1216, 2017.
  • [118] Jennifer L. Skeem and Christopher T. Lowenkamp. Risk, race, and recidivism: predictive bias and disparate impact. Criminology, 54(4):680–712, 2016.
  • [119] Latanya Sweeney. Discrimination in online ad delivery. Queue, 11(3):10, 2013.
  • [120] Jared Sylvester and Edward Raff. What about applied fairness? https://www.dropbox.com/s/p3c9514mw36qs5b/Sylvester%20Raff%20-%20What%20About%20Applied%20Fairness.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
  • [121] Robert L. Thorndike. Concepts of culture-fairness. Journal of Educational Measurement, 8(2):63–70, 1971.
  • [122] Rhema Vaithianathan, Tim Maloney, Emily Putnam-Hornstein, and Nan Jiang. Children in the public benefit system at risk of maltreatment: Identification via predictive modeling. American Journal of Preventive Medicine, 45(3):354–359, 2013.
  • [123] Tyler J. VanderWeele and Miguel A. Hernán. Results on differential and dependent measurement error of the exposure and the outcome using signed directed acyclic graphs. American Journal of Epidemiology, 175(12):1303–1310, 2012.
  • [124] Tyler J. VanderWeele and Whitney R. Robinson. On causal interpretation of race in regressions adjusting for confounding and mediating variables. Epidemiology (Cambridge, Mass.), 25(4):473, 2014.
  • [125] Lynne M. Vieraitis, Tomislav V. Kovandzic, and Thomas B. Marvell. The criminogenic effects of imprisonment: Evidence from state panel data, 1974–2002. Criminology & Public Policy, 6(3):589–622, 2007.
  • [126] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer New York, 2010.
  • [127] Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 2018.
  • [128] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics (AISTATS), 2017.
  • [129] Meike Zehlike, Carlos Castillo, Francesco Bonchi, Ricardo Baeza-Yates, Sara Hajian, and Mohamed Megahed. Fairness measures: A platform for data collection and benchmarking in discrimination-aware ML. http://fairness-measures.org, June 2017.
  • [130] Junzhe Zhang and Elias Bareinboim. Fairness in decision-making–the causal explanation formula. In 32nd AAAI Conference on Artificial Intelligence, 2018.