1 Introduction
Prediction-based decision-making has swept through industry and is quickly making its way into government. These techniques are already common in lending [53, 82, 44], hiring [88, 89, 57], and online advertising [119], and increasingly figure into decisions regarding pretrial detention [6, 78, 38, 31], immigration detention [76], child maltreatment screening [122, 37, 18], public health [97, 109], and welfare eligibility [37]. Across these domains, decisions are based on predictions of an outcome deemed relevant to the decision.
We outline the choices and assumptions commonly made (often implicitly) to justify such a decision system. (Much of our discussion applies to non-prediction-based decisions, e.g. decisions made by humans, but we will not draw any explicit comparisons.) These choices and assumptions affect how a system behaves, with potentially serious implications for disadvantaged groups along social axes such as race, gender, and class. This has motivated the field of research we will call “Fairness in Machine Learning (ML)”.
Fairness in ML has been explored in popular books [101, 37] and a White House report [99], surveyed in technical review papers [13, 41, 19] and tutorials [9, 94, 92, 23], covered in an in-progress textbook [10], and has inspired a number of software packages [129, 39, 45].
Though the ML fairness conversation is somewhat new, it resembles older work. For example, since the 1960s, psychometricians have studied the fairness of educational tests based on their ability to predict performance (at school or work) [20, 21, 121, 28, 36, 106, 60, 79, 61]. More recently, Dorans and Cook review broader notions of fairness, including in test design and administration [33].
Our hope with this paper is to contribute a concise, cautious, and reasonably comprehensive catalogue of choices, assumptions, and fairness considerations. While we regard no notion as the axiomatic definition of fairness (nor the guarantor of any particular social goal), our catalogue enables practitioners to systematically consider these concepts.
The paper proceeds as follows: in Section 2, we introduce setup and notation. Section 3 outlines common choices and assumptions made to justify predictionbased decision systems. Concern about these can motivate various flavors of fairness constraints, which we present in Section 4. Though the literature often collapses the two, we separate the flavors of fairness (Section 4) from the groups to which they can be applied (Section 5). Section 6 presents tensions among fairness definitions. We highlight some ways forward in Section 7. Section 8 concludes.
2 Setup and notation
Consider a population about whom we want to make decisions. We index people by $i = 1, \dots, n$. Each person has variables $V$ (i.e. features, covariates, or inputs) that are known at decision time. In some cases, we can separate these into sensitive variable(s) $A$ (e.g. race, gender, or class) and other variables $X$, writing $V = (X, A)$. A binary decision $D \in \{0, 1\}$ is made for each person. We use bold font to denote a vector of values for everyone in the population, e.g. $\mathbf{D} = (D_1, \dots, D_n)$. We restrict decisions to be functions of variables known at decision time, $D = d(V)$, where $d: \mathcal{V} \to \{0, 1\}$. In prediction-based decisions, each person has a binary outcome $Y \in \{0, 1\}$ (i.e. label or target) that is unknown at decision time. We define random variables $V$, $D$, $Y$ as the values of a person randomly drawn from the population. (We can also imagine the values as randomly generated from some “superpopulation” [48].) Prediction-based decisions can be made by first estimating the conditional probability

$p(v) = P(Y = 1 \mid V = v).$

The decision system does not know the true conditional probability; instead it can compute a score $S = s(V)$, where $s$ is intended to estimate $p$. (We assume scores have been calibrated to be interpreted as probabilities, though this is nontrivial in practice [95]. More generally, decisions can be based on scores from methods such as SVM and random forest that do not estimate conditional probabilities [54].) Let $S$ be a random score from the population. Both $s$ and $d$ are functions of training data, a sample of $(V, Y)$ values from the population of interest.

3 Choices and assumptions made to justify prediction-based decision systems
Several recent papers have raised concerns about prediction-based decision systems, describing “biases” or “traps” beyond the focus of most quantitative literature [49, 50, 96, 32, 115]. It may be difficult for a practitioner to see where these important points connect to their work. To bridge the gap, we link these sociotechnical concerns to quantitative choices and assumptions.
Below are some common choices and assumptions made (often implicitly) to justify a prediction-based decision system. We are motivated by a few examples prevalent in the literature: pretrial detention, child maltreatment screening, and lending. We question some of the most salient choices and assumptions on both scientific and political grounds. (These choices need not be made in any specific, sequential order. Each can influence the others, e.g. Green et al. [49, 50] note that the choice of outcome (3.4) can influence the choice of goal (3.1) insofar as measuring crime puts emphasis on crime prevention as opposed to other indicators of social well-being. See Section 7.2 for a sketch of how to address some of these issues.)
3.1 Choose a goal
Choose a goal that the decisions are meant to achieve. For a benevolent social planner, this may be justice or some notion of social welfare [58]. For a bank, this may be profits. Often there are genuinely different and conflicting goals, which are not resolved by more data [37, 102]. Furthermore, make the (substantial) assumption that progress towards this goal can be summarized by a number (denote this by $U$).
3.2 Choose a population
Choose to make decisions at the individual level (as opposed to a policy that acts at the community level) about members of a particular population. In pretrial decisions, the population is defendants. In lending decisions, the population is loan applicants. These populations are sampled from the world by some mechanism, e.g. people become defendants by being arrested, applicants by applying for loans. The mechanism of entry into this population may reflect unjust social structures, e.g. racist policing that targets black people for arrest [2, 84] and discrimination in loan pre-application screening [25].
3.3 Choose a decision space
Assume decisions are made at the individual level and are binary. In this simplified setup, pretrial decisions are usually restricted to detain or not, and lending decisions to lend or not. Both of these decision spaces leave out many less harmful and more supportive interventions. In pretrial decisions, the space might instead include offering funds to help defendants with transportation to court, paying for childcare, or locating job opportunities. In lending, a broader decision space could include providing longer-term, lower-interest loans.
3.4 Choose an outcome
Assume the utility of decisions depends in part on a binary outcome that can be measured at the individual level. Suppose decisions about a family intervention program are based on maltreatment of children. Let $Y = 1$ if there is maltreatment, and $Y = 0$ otherwise. Consider the following scenario: suppose in Family A there is maltreatment, with or without this intervention program. Suppose in Family B there is maltreatment without the intervention, but the intervention helps. Then it is rational to enroll Family B in this program, but Family A may need an alternative program. This example considers the potential outcomes under both decisions [114, 62, 56]: $Y^{0}$ without the intervention, $Y^{1}$ under the intervention (Family A has $Y^{0} = Y^{1} = 1$; Family B has $Y^{0} = 1$, $Y^{1} = 0$). The benefit or harm from a decision can depend on both potential outcomes.
More generally, let $Y_i^{\mathbf{d}}$ be the potential outcome for person $i$ under the whole decision system (the bold $\mathbf{d}$ denotes that the decision for person $j$ may affect the outcome for person $i$, where $j \neq i$). Assume now that the utility of decisions can be expressed as a function of these and no other outcomes. In mathematical notation, the utility under decisions $\mathbf{d}$ is a function of the potential outcomes for all people under all possible decision policies, $\{\mathbf{Y}^{\mathbf{d}'} : \mathbf{d}' \in \{0,1\}^n\}$. The bold font denotes vectors of outcomes and decisions for the whole population. In the pretrial detention example: $\mathbf{Y}^{\mathbf{0}}$ are the outcomes for everyone if everyone is released pretrial, $\mathbf{Y}^{\mathbf{1}}$ are the outcomes for everyone if everyone is detained pretrial, with all other possible decision regimes in between.
Prediction-based decision systems often only consider one or two outcomes, e.g. in pretrial decisions, outcomes of interest are crime (measured as arrest, more on this later) and nonappearance in court. In contrast, human decision-makers may consider several outcomes, including impacts on a defendant’s well-being or caretaker status [15]. Narrowing focus to a chosen, measured outcome can fall short of larger goals. For example, in making decisions about college admissions, it may be tempting to narrow the larger goals of higher education to simply the future GPAs of admitted students [73].
3.5 Assume decisions can be evaluated separately, symmetrically, and simultaneously
Assume decisions can be evaluated as the sum of separately evaluated individual decisions. This eliminates consideration of aggregate decisions (e.g. the number of detainees, the number of loans, etc.) within groups. This assumption resembles beliefs from utilitarianism, which represents social welfare as a sum of individual utilities, allowing for the uneven distribution of benefits to different individuals (and different groups) as long as overall utility is maximized [112, 116]. The “separately” assumption includes assuming that outcomes are not affected by the decisions for others, an assumption known as no interference [62]: $Y_i^{\mathbf{d}} = Y_i^{d_i}$.
Assuming decisions can be evaluated symmetrically (i.e. identically) requires, for example, that the harm of denying a loan to someone who could repay is equal across people. In reality, this error could be especially harmful to lower-income applicants.
Finally, assuming decisions can be made simultaneously (“batch classification” [19]), as opposed to serially, eliminates consideration of dynamics that could be important. For example, Harcourt shows that if the population changes their behavior in response to changing decision probabilities, prediction-based decisions can backfire and diminish social welfare [52]. While it may be possible to make predictions robust to strategic behavior, this may also have social costs, especially for disadvantaged groups [59, 90]. Alternatively, dynamics can be considered in a way that is socially beneficial [57, 75].
Mathematically, we write the separate, symmetric, and simultaneous (sss) utility as:

$U_{sss}(\mathbf{d}) = \sum_{i=1}^{n} u(d_i, y_i^{d_i}),$

where $u$ is a single individual-level utility function applied to each person's decision and resulting potential outcome.
3.6 Assume away one potential outcome
Prediction-based decision systems usually only focus on one potential outcome (e.g. $Y^{0}$, crime if released) and assume the other is known (e.g. no crime if detained, $Y^{1} = 0$). But studies have found criminogenic effects of imprisonment [125, 30, 27, 91]. Even if pretrial tools only predict events during the possible detention period (as some recommend [42]), it should not be assumed that detention prevents crime during that period. A person may be driven to break the law while in jail.
In child maltreatment screening, prediction is aimed at maltreatment in the absence of intervention, implicitly assuming intervention is helpful. Interventions may not be helpful, though, and impacts of interventions may vary by group. There is evidence of racial disparity in foster care placement [110] and access to mental health services [46] among youth reported to the child welfare system.
3.7 Choose the prediction setup
For notational simplicity, assume $Y = Y^{0}$ is the potential outcome of interest. Then the utility of a single decision takes 4 possible values, one for each combination of $D$ and $Y$. Without loss of generality, it is positive (good) when $D = Y$ and negative (bad) when $D \neq Y$. Adopting confusion matrix terminology, express utility in terms of true positives (TP: $D = 1, Y = 1$), false positives (FP: $D = 1, Y = 0$), false negatives (FN: $D = 0, Y = 1$), and true negatives (TN: $D = 0, Y = 0$):

$U = u_{TP} \cdot TP + u_{FP} \cdot FP + u_{FN} \cdot FN + u_{TN} \cdot TN,$

where $u_{TP}, u_{FP}, u_{FN}, u_{TN}$ are scalars.
Assuming $\mathbf{y}$ is fixed, choose $\mathbf{d}$ to maximize this. Rearranging terms, dropping those without $d_i$, and renormalizing (dividing by a positive number), the quantity to maximize can be written as:

$U_c(\mathbf{d}) = \frac{1}{n} \sum_{i=1}^{n} d_i (y_i - c),$

with only one parameter, $c = \dfrac{u_{TN} - u_{FP}}{(u_{TP} - u_{FN}) + (u_{TN} - u_{FP})}$. Corbett-Davies et al. call this immediate utility [24]. Lipton et al. [80] note that adding a constant and negating gives the cost-sensitive risk that Menon and Williamson aim to minimize [86]:

$CS(\mathbf{d}) = \frac{1}{n} \sum_{i=1}^{n} \left[ c \, d_i (1 - y_i) + (1 - c)(1 - d_i) y_i \right] = (1 - c)\,\bar{y} - U_c(\mathbf{d}).$

Giving false positives and false negatives equal cost, $c = 1/2$. Thus, minimizing $CS$, maximizing $U_c$, and maximizing accuracy ($P(D = Y)$) are all equivalent.
$U_c$ is maximized with perfect prediction, that is $\mathbf{d} = \mathbf{y}$. But only the variables $V$ are known at decision time. So decisions are restricted to be functions of these variables, $D = d(V)$, where $d: \mathcal{V} \to \{0, 1\}$. It has been shown [70, 12, 24] that the expected utility is maximized at

$d^{*}(v) = \mathbb{1}[\, p(v) > c \,],$

so decisions are made by thresholding (at $c$) the conditional probabilities of the outcome, given covariates [24]. This can be called a single-threshold rule [23].
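As a minimal numeric sketch of the threshold rule (the four-point distribution for $V$ below is hypothetical, and all variable names are ours), we can verify by enumerating every decision rule on this support that thresholding $p(v)$ at $c$ maximizes expected immediate utility:

```python
import numpy as np

# Hypothetical discrete population: four values of V, each with probability
# 0.25, and conditional outcome probabilities p(v) = P(Y=1 | V=v).
v_probs = np.array([0.25, 0.25, 0.25, 0.25])
p_v = np.array([0.10, 0.40, 0.60, 0.90])

def expected_utility(decision, c):
    """E[U_c] = E[d(V) * (p(V) - c)], computed with exact expectations."""
    return float(np.sum(v_probs * decision * (p_v - c)))

c = 0.5
d_star = (p_v > c).astype(int)  # single-threshold rule: d*(v) = 1[p(v) > c]

# Enumerate all 2^4 decision rules on this support: none beats the threshold.
rules = [np.array([(k >> j) & 1 for j in range(4)]) for k in range(16)]
best = max(expected_utility(d, c) for d in rules)
assert abs(expected_utility(d_star, c) - best) < 1e-12
```

With $c = 0.5$ the rule selects the two values of $v$ whose conditional probability exceeds one half; changing $c$ moves the cutoff, but the optimum remains a threshold on $p(v)$.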
In implementing such a rule, the decision maker chooses variables $V$ and computes a score $s(V)$ that is intended to estimate $p(V)$. Below we first discuss variable selection. We then describe four reasons the estimate $s$ differs from the estimand $p$: non-representative training data, measurement error, an incorrect model, and statistical uncertainty.
3.7.1 Variable selection
The variables $V$ are chosen by a process commonly called variable or feature selection. Conditional probabilities change depending on what we condition on. (“Conditioning is the soul of statistics” [14].) Person A can have a higher conditional probability than Person B with one choice of variables, but a lower conditional probability with another choice. This can be obscured by describing these probabilities as the “true risk” (of crime or loan default), which we explore further in Section 7.1.
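The ranking reversal can be shown with a small hypothetical joint distribution (all numbers invented): Person A is riskier than Person B when we condition on one variable, but less risky when we condition on two:

```python
# Hypothetical joint distribution over (x1, x2, Y).
# Rows: (x1, x2, P(X1=x1, X2=x2), P(Y=1 | x1, x2)).
table = [
    (1, 0, 0.25, 0.2),  # Person A's profile: x1=1, x2=0
    (1, 1, 0.25, 0.9),
    (0, 1, 0.25, 0.6),  # Person B's profile: x1=0, x2=1
    (0, 0, 0.25, 0.1),
]

def p_y_given(**fixed):
    """P(Y=1 | the fixed variables), marginalizing over the rest."""
    cells = [(share, py) for x1, x2, share, py in table
             if all({"x1": x1, "x2": x2}[k] == v for k, v in fixed.items())]
    total = sum(share for share, _ in cells)
    return sum(share * py for share, py in cells) / total

# Conditioning on x1 alone: Person A (x1=1) looks riskier than Person B (x1=0).
assert p_y_given(x1=1) > p_y_given(x1=0)              # 0.55 > 0.35
# Conditioning on both variables: the ranking reverses.
assert p_y_given(x1=1, x2=0) < p_y_given(x1=0, x2=1)  # 0.2 < 0.6
```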
3.7.2 Sampling
For some subset of the population of interest, $V$ and $Y$ are measured, creating the training dataset. The training data may not be representative of the population for several reasons. For example, in pretrial decisions, if training data only include $Y^{0}$ (crime if released), then they exclude people who were detained [23]. Defendants released in the past may not be representative of current defendants. Incorrectly assuming the data are representative can lead to biased estimation of conditional probabilities (e.g. probability of crime given input variables).
Under a missing at random assumption ($Y \perp D \mid V$), modeling could hope to avoid this selection bias [48]. But there can be regions of the input variables where no defendants are released, and hence $Y$ is unobserved [19]. One option is extrapolation: fit the model to released defendants, then use the model for all defendants, even if this includes new regions of the variables. However, ML models can perform poorly with such a shift in input variables [111]. Another option is to trust previous decisions: assume that regions of the input variables where no defendants were released are regions where all defendants would have committed crimes if released ($Y = 1$) [29].
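A small hypothetical example (numbers invented) of how selection can bias estimates: suppose past judges released defendants partly based on a variable $u$ that the training data do not record. Conditional probabilities estimated from released defendants then understate risk at every observed value of $v$:

```python
# Hypothetical population: observed variable v, unobserved variable u used by
# past judges. Y (crime if released) is observed only for released people.
# Rows: (v, u, population share, P(Y=1 | v, u), P(released | v, u)).
rows = [
    (0, 0, 0.25, 0.1, 0.9),
    (0, 1, 0.25, 0.5, 0.3),
    (1, 0, 0.25, 0.3, 0.9),
    (1, 1, 0.25, 0.8, 0.3),
]

def p_y(v, released_only):
    """P(Y=1 | V=v), in the whole population or among the released."""
    cells = [(share * (rel if released_only else 1.0), py)
             for vv, u, share, py, rel in rows if vv == v]
    total = sum(w for w, _ in cells)
    return sum(w * py for w, py in cells) / total

# Judges disproportionately released the low-risk (u=0) cases, so training on
# released defendants underestimates risk at every value of v.
for v in (0, 1):
    assert p_y(v, released_only=True) < p_y(v, released_only=False)
```

Note that this violates missingness at random given $V$, since release depended on $u$, which predicts $Y$; that is exactly why the released sample is unrepresentative.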
3.7.3 Measurement
Similarly compromising are issues of measurement. In the pretrial detention setting, the outcome is often defined as crime, but measured as arrest. In doing so, we would need to take seriously not only the difference between arrests and crime, but the fact that this difference may be greater for some groups (known as differential measurement error [123]). For example, black people are arrested more often for crimes they did not commit [2, 83]. (The pretrial risk assessment literature inadequately addresses error in using arrests as a measure of crime. A full analysis is beyond the scope of this paper, but we note that the evidence cited is far from conclusive. Skeem and Lowenkamp write “official records of arrest — particularly for violent offenses — are a valid criterion” [118]. But they cite Piquero, who writes “self-reported offending data reveal a much more similar set of estimates regarding offending across race” [107]. They also cite Piquero and Brame, who found little racial difference in either self-report or arrest record data in a sample of “serious adolescent offenders” in Philadelphia and Phoenix [108]. These findings say little about cases with racial differences in arrests.)
What about measurement error of the observed variables $V$? If $V$ is mismeasured, our ability to predict $Y$ could be diminished. However, mismeasurement of $V$ is somewhat less serious than mismeasurement of $Y$, because prediction models are designed to use $V$ only to the extent that it predicts $Y$.

If $V$ is mismeasured differentially across groups, this could lead to different prediction error rates across those groups. For example, accessing drug treatment (a component of $V$) is used to predict child maltreatment ($Y$). If it is measured as accessing public treatment, its measurement will differ between poorer families and wealthier families, who may instead access private (i.e. unobserved) treatment [37].

A model that is aware of group membership may be able to account for this, but if the relationship between $V$ and $Y$ is weaker for a particular group, prediction error rates for that group will be higher.
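A hypothetical numeric sketch of this mechanism (all quantities invented): true treatment $T$ predicts $Y$ identically in both groups, but the observed proxy records only public treatment, which the wealthier group partly bypasses. The best predictor built on the proxy then errs more often for that group:

```python
# P(treatment is private | T=1), by group: only the wealthier group uses
# private (hence unobserved) treatment.
P_PRIVATE = {"poorer": 0.0, "wealthier": 0.5}

def joint(group):
    """Joint distribution of (observed V, Y): yields (probability, v, y)."""
    for t in (0, 1):
        p_t = 0.5                       # P(T=t), same in both groups
        p_y1 = 0.9 if t == 1 else 0.1   # P(Y=1 | T=t), same in both groups
        for y in (0, 1):
            p = p_t * (p_y1 if y == 1 else 1 - p_y1)
            if t == 1:
                yield (p * (1 - P_PRIVATE[group]), 1, y)  # public: observed
                yield (p * P_PRIVATE[group], 0, y)        # private: missed
            else:
                yield (p, 0, y)

def best_error_rate(group):
    """Error rate of the best predictor of Y from observed V in the group."""
    err = 0.0
    for v in (0, 1):
        cells = [(p, y) for p, vv, y in joint(group) if vv == v]
        total = sum(p for p, _ in cells)
        if total > 0:
            p1 = sum(p for p, y in cells if y == 1) / total
            err += total * min(p1, 1 - p1)  # predict the majority class
    return err

# V is a weaker proxy for the wealthier group, so its error rate is higher.
assert best_error_rate("wealthier") > best_error_rate("poorer")
```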
3.7.4 Model selection
We have a choice of how to model $p(v) = P(Y = 1 \mid V = v)$ as a function of $v$, that is, model selection. (Methods that do not first estimate conditional probabilities, e.g. SVM and random forest, also involve several researcher degrees of freedom.) Estimation of $p$ can be highly model-dependent, particularly if $V$ is high-dimensional. Though existing tools may simplify $V$ to only consider low-dimensional representations (e.g. number of past arrests), the observable variables may be richer (e.g. times and locations of past arrests). Therefore model choice cannot be easily dismissed, even in situations of apparently low dimension.

3.7.5 Statistical uncertainty
In addition to bias, there is uncertainty (variance) in estimating $p$. If the model has group-specific parameters, uncertainty may be larger for groups with less training data.

4 Flavors of fairness definitions
Under the choices and assumptions of Section 3, we saw that a single-threshold rule is considered optimal [24, 23, 73, 80]. If any of the choices and assumptions are not justified, then optimality is not guaranteed.
An even stronger claim about the single-threshold rule is that it “maximizes a natural notion of social welfare for all groups” [23]. Here we explore this notion of fairness (with the explicit assumption that scores are well-estimated):

- Single-threshold fairness: $D = \mathbb{1}[\, S > c \,]$, where the score function $s$ and the threshold $c$ are the same for everyone, regardless of group.
To see how this plays out at the group level, assume for simplicity two groups: $A = a$ (advantaged) and $A = b$ (disadvantaged). We can write the overall utility of decisions as a weighted average from the decisions in each group:

$U = P(A = a)\, U_a + P(A = b)\, U_b.$

Under the assumptions, a single-threshold rule maximizes both $U_a$ and $U_b$. It may be tempting to conclude that this is fair to both groups. However, the impact of decisions may not be contained within groups. Furthermore, the disadvantaged group may have a lower maximum.
For example, in lending, a possible goal for applicants is continued access to credit over time and across multiple applications. Liu et al. consider a lending policy designed to increase applicants’ credit scores over successive applications, rather than a bank’s profit [82]. Their optimal policy maximizes credit scores in both groups, but these maximums are not equal. Starting out with an advantage, the wealthier group enjoys larger increases in credit scores.
It may also be tempting to conclude that a singlethreshold rule is fair on the basis that people with the same scores are treated the same, regardless of group membership. But as we saw above, conditional probabilities change with variable selection. Thus, this notion of fairness relies on scores being based on enough predictive variables to be meaningful. In the limit, the scores would be perfect predictions, i.e. the outcomes themselves.
This motivates fairness definitions enforcing equal decisions across groups, conditioned on the outcome. We explore these in Section 4.1, as well as three other flavors of definitions in Sections 4.2–4.4 that compare decisions without explicit consideration of an outcome to predict.
4.1 Equal prediction measures
The previous section focused on utilities expressed in terms of combinations of $D$ and $Y$ (true positives, false positives, false negatives, and true negatives). When the utilities of these combinations are such that $c = 1/2$, maximizing $U$ is equivalent to maximizing accuracy. If we believe this corresponds to a notion of well-being within groups, this motivates a definition of fairness:

- Equal accuracy: $P(D = Y \mid A = a) = P(D = Y \mid A = b)$
Instead of comparing overall accuracies, we could restrict the comparison to subgroups. For example, to reflect the perspective of “innocent” defendants, we could compare accuracies only for defendants who would not get rearrested ($Y = 0$) [94]. Or we could compare how accurately we predict repayment among those granted loans ($D = 1$). The confusion matrix illustrates match and mismatch between $D$ and $Y$, with margins expressing conditioning on subgroups based on $D$ or $Y$; see Table 1.
4.1.1 Definitions from the margins of the confusion matrix
For any box in the confusion matrix involving the decision $D$, we can require equality across groups. We list definitions from the margins of the confusion matrix, grouped by equivalence from pairs that sum to 1 and their expression as conditional independence. While this induces some repetition, we hope that being explicit can address some of the confusion in the literature:

- Equal FPRs: $P(D = 1 \mid Y = 0, A = a) = P(D = 1 \mid Y = 0, A = b)$
- Equal TNRs: $P(D = 0 \mid Y = 0, A = a) = P(D = 0 \mid Y = 0, A = b)$
  (equivalent to equal FPRs; both express $D \perp A \mid Y = 0$)
- Equal TPRs: $P(D = 1 \mid Y = 1, A = a) = P(D = 1 \mid Y = 1, A = b)$
- Equal FNRs: $P(D = 0 \mid Y = 1, A = a) = P(D = 0 \mid Y = 1, A = b)$
  (also known as equal opportunity [53]; equivalent to equal TPRs; both express $D \perp A \mid Y = 1$)
- Equal FORs: $P(Y = 1 \mid D = 0, A = a) = P(Y = 1 \mid D = 0, A = b)$
- Equal NPVs: $P(Y = 0 \mid D = 0, A = a) = P(Y = 0 \mid D = 0, A = b)$
  (equivalent to equal FORs; both express $Y \perp A \mid D = 0$)
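These margins are straightforward to audit from decision, outcome, and group arrays. A minimal sketch (the tiny dataset is invented for illustration):

```python
import numpy as np

def group_rates(d, y, a, group):
    """Confusion-matrix margin rates for one group; d, y, a are 0/1 arrays."""
    mask = a == group
    d, y = d[mask], y[mask]
    return {
        "FPR": d[y == 0].mean(),  # P(D=1 | Y=0, A=group)
        "TPR": d[y == 1].mean(),  # P(D=1 | Y=1, A=group)
        "FOR": y[d == 0].mean(),  # P(Y=1 | D=0, A=group)
        "acc": (d == y).mean(),   # P(D=Y | A=group)
    }

# Hypothetical audit data: decisions d, outcomes y, group labels a.
d = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y = np.array([1, 0, 0, 1, 1, 1, 0, 0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])

r0, r1 = group_rates(d, y, a, 0), group_rates(d, y, a, 1)
# Equal TPRs holds here (0.5 == 0.5) but equal FPRs does not (0.5 vs 0.0).
assert r0["TPR"] == r1["TPR"] and r0["FPR"] != r1["FPR"]
```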
4.1.2 Scores
Prediction-based decision systems often compute a score $S = s(V)$ that is intended to estimate $p(V)$, where $S$ is a random score from the population. We can consider definitions based on scores:

- Balance for the negative class: $E[S \mid Y = 0, A = a] = E[S \mid Y = 0, A = b]$
- Balance for the positive class: $E[S \mid Y = 1, A = a] = E[S \mid Y = 1, A = b]$
- Calibration within groups: $P(Y = 1 \mid S = s, A = a) = P(Y = 1 \mid S = s, A = b) = s$ for all $s$
- AUC parity: the area under the receiver operating characteristic (ROC) curve is the same across groups.
Barocas et al. point out that calibration within groups is satisfied without a fairness-specific effort [10]. With enough (representative, well-measured) data and model flexibility, a score $S$ can be very close to $p(V) = P(Y = 1 \mid V)$. So by a lemma, we have $P(Y = 1 \mid S, A) = S$, i.e. calibration within groups. (Lemma: for any random variables $Y$ and $V$, and any $A$ that is a function of $V$, the score $S = E[Y \mid V]$ satisfies $E[Y \mid S, A] = S$. Proof: by the law of total expectation, $E[Y \mid S, A] = E[E[Y \mid V] \mid S, A] = E[S \mid S, A] = S$.) With many variables, $A$ may be well-predicted by them, i.e. there is a function of $X$ that is approximately $A$. Then we can get calibration within groups even without using $A$, because $P(Y = 1 \mid X) \approx P(Y = 1 \mid X, A) = p(V)$.
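A small exact illustration of the lemma (the discrete population is hypothetical): when the score equals the true conditional probability $p(V)$, calibration within groups holds automatically:

```python
from collections import defaultdict

# Hypothetical discrete population: cells (a, x) with shares and
# p = P(Y=1 | A=a, X=x). The score of a cell is its true p.
cells = [  # (a, x, share, p)
    (0, 0, 0.2, 0.3),
    (0, 1, 0.3, 0.7),
    (1, 0, 0.3, 0.3),
    (1, 1, 0.2, 0.9),
]

# Group cells by (score value s, group a) and compute P(Y=1 | S=s, A=a).
buckets = defaultdict(lambda: [0.0, 0.0])  # (s, a) -> [mass, mass * P(Y=1)]
for a, x, share, p in cells:
    buckets[(p, a)][0] += share
    buckets[(p, a)][1] += share * p

for (s, a), (mass, mass_y) in buckets.items():
    # P(Y=1 | S=s, A=a) equals s in every group: calibration within groups.
    assert abs(mass_y / mass - s) < 1e-12
```

The check succeeds by construction, which is exactly the point: no fairness-specific effort was needed.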
Corbett-Davies et al. point out that calibration within groups does not prevent problematic practices [24]. The above-mentioned lemma holds using any variables (or none at all). As we have noted, inclusion or exclusion of variables changes scores, and therefore the fraction of each group that ends up above a threshold. Intentional manipulation of this could mimic the racist practice of redlining, justifying loan denials by neighborhood characteristics [85].
4.1.3 Separation and sufficiency
The two margins of the confusion matrix are distinguished by Barocas et al. as [10]:

- Separation: $D \perp A \mid Y$
- Sufficiency: $Y \perp A \mid D$

In terms of $D$, these are equivalent to pairs of definitions from the margins of the confusion matrix. In terms of scores ($S \perp A \mid Y$ and $Y \perp A \mid S$, respectively), calibration within groups implies sufficiency. Conversely, if $S$ satisfies sufficiency then there exists a function $\ell$ such that $\ell(S)$ satisfies calibration within groups (see Proposition 1 in [10]).
4.2 Equal decisions across groups (stratified)
We now turn to fairness notions that focus on the decisions $D$, without the outcome $Y$. These can be motivated in a few ways.
Suppose that from the perspective of the population about whom we make decisions, one decision is always preferable to another, regardless of $Y$ (e.g. non-detention, or admission into college) [94]. (In contrast, lending to someone unable to repay could hurt their credit score [82]. Of course, the ability to repay may strongly depend on the terms of the loan.) In other words, allocation of benefits and harms across groups can be examined by looking at the decision ($D$) alone.
Furthermore, while the decisions (e.g. detentions) are observed, the outcome being predicted (e.g. crime if released) may be unobserved, making error rates unknown. Therefore, disparity in decisions (e.g. racial disparity in detention rates) may be more publicly visible than disparity in error rates (e.g. racial disparity in detention rates among those who would not have committed a crime).
Yet another motivation to consider fairness constraints without the outcome is measurement error (see Section 3.7.3). For example, if arrests are a poor measure of crime, fairness constraints based on arrests may be unappealing [66]. One might believe that all group differences in $Y$ are a result of measurement error, and that the true outcomes on which we want to base decisions are actually similar across groups [40].
These considerations can all motivate requiring demographic parity: equal decision rates across groups (e.g. equal lending rates across races). A related definition considers parity within strata:

- Demographic parity: $P(D = 1 \mid A = a) = P(D = 1 \mid A = b)$, i.e. $D \perp A$
  (also known as statistical parity, or group fairness [35])
- Conditional demographic parity: $D \perp A \mid W$, for some set of conditioning variables $W$

When $W = Y$, conditional demographic parity is separation. When $W = X$ (the insensitive variables), it is equivalent to:

- Unawareness: $D = d(X)$, i.e. decisions do not use the sensitive variable $A$

In other words, people with the same $X$ are treated the same. A related idea requires people who are similar in $X$ to be treated similarly. More generally, we could define a similarity metric between people that is aware of the sensitive variables, motivating the next flavor of fairness definitions [35].
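Both parity notions are easy to check from decision, group, and stratum arrays; a minimal sketch with invented data:

```python
import numpy as np

d = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # decisions
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # groups
x = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # an insensitive stratifying variable

def rate(mask):
    """Decision rate P(D=1) within the subpopulation given by mask."""
    return d[mask].mean()

# Demographic parity: equal decision rates across groups.
assert rate(a == 0) == rate(a == 1)  # both 0.5
# Conditional demographic parity with W = x: equal rates within each stratum.
for stratum in (0, 1):
    assert rate((a == 0) & (x == stratum)) == rate((a == 1) & (x == stratum))
```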
4.3 Metric fairness
Assume there is a metric $\delta$ that defines similarity of individuals based on all variables, $\delta(v_i, v_j)$.

- Metric fairness: decisions for similar people are similar, e.g. $|P(D = 1 \mid V = v_i) - P(D = 1 \mid V = v_j)| \leq \delta(v_i, v_j)$ for all pairs $i, j$ (a Lipschitz condition [35])
4.3.1 The metric
Dwork et al. say “the metric should (ideally) capture ground truth” [35]. This inspired Friedler et al. to define the construct space, the variables on which we want to base decisions [40]. For example, suppose we want to base decisions on the outcome, or any variables that can predict it. (Here the probabilities are over some superpopulation model [48].) But we do not observe the entire world at decision time, instead only the variables $V$, and we calculate scores intended to estimate $p(V) = P(Y = 1 \mid V)$.
Friedler et al. introduce an assumption they call WYSIWYG (what you see is what you get), i.e. that we can define a metric in the observed space that approximates a metric in the construct space. To satisfy WYSIWYG, the metric may need to be aware of the sensitive variables [35]. One reason is that the insensitive variables may predict differently for different groups. For example, suppose we want to predict who likes math so we can recruit them to the school’s math team. Let $Y$ be liking math and $X$ be choice of major. Suppose in one group, students who like math are steered towards economics, and in the other group towards engineering. To predict liking math, we should use group membership $A$ in addition to $X$.
Friedler et al. also introduce an alternate assumption called WAE (we’re all equal), i.e. that the groups have small distance in the construct space [40]. On this basis, we could adjust a metric in the observed space so that the groups have small distance [35]. Relatedly, Johndrow and Lum describe adjusting the insensitive variables $X$ so that they are independent of group [66].
4.3.2 Connections to unawareness and conditional demographic parity
Consider metric fairness where the metric only considers the insensitive variables, $\delta(v_i, v_j) = \delta(x_i, x_j)$. Then if $x_i = x_j$, we require the decisions to be identically distributed. This implies unawareness, or equivalently, $D \perp A \mid X$ (a version of conditional demographic parity).
4.3.3 Relaxations
Rothblum and Yona relax the metric fairness condition [113]:

- Approximate metric fairness: for at least a $(1 - \alpha)$ fraction of pairs of people, $|P(D = 1 \mid V = v_i) - P(D = 1 \mid V = v_j)| \leq \delta(v_i, v_j) + \gamma$

Informally, they say a person is discriminated against if the fraction of people treated substantially differently from them exceeds $\alpha$. Any sufficiently large group (relative to $\alpha$) cannot all experience discrimination. Taking $\alpha = 0$ and $\gamma = 0$ gives Dwork et al.’s condition.
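Checking the exact ($\alpha = 0$, $\gamma = 0$) condition reduces to verifying a Lipschitz inequality over all pairs; a minimal sketch with an invented one-dimensional metric and decision probabilities:

```python
from itertools import combinations

v = [0.0, 0.2, 0.5, 0.9]               # each person's variables (1-D here)
decision_prob = [0.1, 0.25, 0.5, 0.8]  # P(D=1 | V=v_i) for each person

def metric(vi, vj):
    """Assumed (hypothetical) similarity metric on V."""
    return abs(vi - vj)

violations = [
    (i, j)
    for i, j in combinations(range(len(v)), 2)
    if abs(decision_prob[i] - decision_prob[j]) > metric(v[i], v[j])
]
# Every pair satisfies |P(D=1|v_i) - P(D=1|v_j)| <= metric(v_i, v_j).
assert violations == []
```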
4.4 Causal definitions
We have already introduced causal notions, considering the potential or counterfactual values under different decisions [114, 62, 56]. (We use the terms “potential outcomes” and “counterfactuals” interchangeably [56].) Causal fairness definitions consider instead counterfactuals under different settings of a sensitive variable. Let $V^{A \leftarrow a'}$ be the input variables if the individual had their sensitive variable set to $a'$. We write $D^{A \leftarrow a'} = d(V^{A \leftarrow a'})$ for the corresponding decision, e.g. what would the hiring decision be if they had been white? There is debate over whether this is a well-defined notion. Pearl allows counterfactuals under conditions without specifying how those conditions are established, e.g. “if they had been white”. In contrast, Hernán and Robins introduce counterfactuals only under well-defined interventions, e.g. the intervention studied by Greiner and Rubin: “if the name on their resume were set to be atypical among black people” [51, 56].
Putting these issues to the side, we can proceed to define fairness in terms of counterfactual decisions under different settings of a sensitive variable. We order definitions from strongest to weakest. (Conditional counterfactual parity implies counterfactual parity, by averaging over the distribution of Data in the population. However, conditional demographic parity ($D \perp A \mid W$) does not in general imply demographic parity ($D \perp A$), because the distribution of Data may differ by group; this idea underlies Simpson’s “paradox” [14]. Results 4 and 5 give results about the relationship between conditional and unconditional independence.)

- Individual Counterfactual Fairness: $D^{A \leftarrow a'} = D^{A \leftarrow a''}$ for each individual, for all values $a'$, $a''$
- Conditional Counterfactual Parity: $P(D^{A \leftarrow a'} = 1 \mid \mathrm{Data}) = P(D^{A \leftarrow a''} = 1 \mid \mathrm{Data})$ for all $a'$, $a''$, conditioned on some variables (Data)
- Counterfactual Parity: $P(D^{A \leftarrow a'} = 1) = P(D^{A \leftarrow a''} = 1)$ for all $a'$, $a''$

Kusner et al.’s “counterfactual fairness” is a form of conditional counterfactual parity [77]. Conditioned on a lot of data, this approaches individual counterfactual fairness. If $D^{A \leftarrow a'} \perp A \mid \mathrm{Data}$ (a criterion called ignorability or unconfoundedness [103, 48, 62, 56]), then conditional counterfactual parity is equivalent to conditional demographic parity.
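A toy structural model (entirely hypothetical) makes the counterfactual notation concrete, and shows that unawareness does not imply individual counterfactual fairness when $A$ affects the decision through $X$:

```python
# Structural equations: A -> X -> D, with exogenous noise u held fixed
# across counterfactual worlds.
def x_of(a, u):
    return 2 * a + u     # X depends on the sensitive variable A

def decision(x):
    return int(x > 1.5)  # the decision rule is unaware of A

# An individual observed with a=1 and noise u=0.2.
u = 0.2
d_factual = decision(x_of(1, u))          # factual: x=2.2 -> D=1
d_counterfactual = decision(x_of(0, u))   # "had A been 0", same u: x=0.2 -> D=0

# Individual counterfactual fairness fails: A changes D via the path
# A -> X -> D, even though the decision rule never reads A.
assert (d_factual, d_counterfactual) == (1, 0)
```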
These first three causal definitions consider the total effect of $A$ (e.g. race) on $D$ (e.g. hiring). However, it is possible to consider some causal pathways from the sensitive variable to be fair. For example, suppose race affects education obtained. If hiring decisions are based on the applicant’s education, then race affects hiring decisions. It often helps to visualize causal relationships graphically; see Figure 1. (See Pearl [103], Section 1.3.1, for a definition of causal graphs, which encode conditional independence statements for counterfactuals.)
In this cartoon version of the world, a complex historical process creates an individual’s race and socioeconomic status at birth [124, 64]. These both affect the hiring decision, including through the mediating variable education. We can define effects along paths in this graph by defining fancier counterfactuals. Let $D^{A \leftarrow a', E \leftarrow E}$ be the decision if the applicant had not been black, but education $E$ had remained the same. To disallow only the red path in Figure 1, we can define:

- No Direct Effect Fairness: $E[D^{A \leftarrow a', E \leftarrow E}] = E[D]$, i.e. changing race while holding education fixed does not change decisions on average
Beyond direct effects, one could consider other directed paths from race to be fair or unfair. Two definitions from the literature include:

- No Unfair Path-Specific Effects [93]: no average effects along unfair (user-specified) directed paths from $A$ to $D$
- No Unresolved Discrimination [72]: there is no directed path from $A$ to $D$ unless through a resolving variable (a variable that is influenced by $A$ in a manner that we accept as nondiscriminatory)
All of the above causal definitions consider only directed paths from race. In Figure 1, these include the red and blue paths. But what about the green paths? Known as backdoor paths [103], these do not represent causal effects of race, and therefore are permitted under causal definitions of fairness. However, backdoor paths can contribute to the association between race and hiring. Indeed, they are why we say “correlation is not causation.” (If $W$ satisfies the backdoor criterion ($W$ includes no descendants of $A$ and blocks all backdoor paths between $A$ and $D$) in a causal graph, then unconfoundedness ($D^{A \leftarrow a'} \perp A \mid W$) holds [103, 105]. The converse is not true in general.) Zhang and Bareinboim decompose the total disparity into disparities from each type of path (direct, indirect, backdoor) [130]. In contrast to the causal fairness definitions, health disparities are defined to include contributions from backdoor paths (e.g. through socioeconomic status at birth) [98, 34, 22, 7, 64].
Causal definitions of fairness focus our attention on how to compensate for causal influences at decision time. Causal reasoning can be used instead to design interventions (to reduce disparities and improve overall outcomes), rather than to define fairness. In particular, causal graphs can be used to develop interventions at earlier points, prior to decisionmaking [64, 8].
5 Intersectionality
Much of the ML fairness literature considers the simple case of two groups (advantaged and disadvantaged). However, the fairness flavors described in the previous section could each be applied to various groups. For clarity, we advocate separating two axes: the flavors of fairness (Section 4) from the groups to which they can be applied (this section). The latter is part of a much larger conversation about intersectionality.
Crenshaw analyzes failed employment discrimination lawsuits in which black women could seek recourse only against discrimination as black women, which they were unable to establish either as sex discrimination (since it did not apply to white women) or as race discrimination (since it did not apply to black men) [26]. Crenshaw quotes the court in DeGraffenreid v. General Motors [1]:
The prospect of the creation of new classes of protected minorities, governed only by the mathematical principles of permutation and combination, clearly raises the prospect of opening the hackneyed Pandora’s box.
Intersectionality has been studied in various quantitative literatures, e.g. in epidemiology [65, 63]. In the ML fairness literature, Buolamwini and Gebru evaluated commercial gender classification systems and found that darker-skinned females are the most misclassified group [16]. Kearns et al. apply approximate statistical and false positive rate parity to all sufficiently large groups defined by the sensitive variables [71]. Hébert-Johnson et al. apply approximate calibration to groups defined by any of the input variables [55]. Rothblum and Yona apply approximate metric fairness to any group of large enough size [113]. At the extreme, one can take groups to be individuals. Some definitions already give parity across individuals: single-threshold fairness, unawareness, and metric fairness.
We could extend ideas about intersectionality to causal definitions of fairness. Instead of comparing across several groups of people, one could compare across several counterfactuals. For example, we could add gender to the causal graph in Figure 1.
6 Impossibilities
In this section we catalogue several results showing that it is typically impossible to simultaneously satisfy all flavors of fairness from Section 4. Practitioners will therefore need to choose among them. To that end, we also discuss some of their mathematical and moral tensions.
6.1 Separation and sufficiency
Tension between margins of the confusion matrix is expressed in three very similar results. (We write Results 1, 4, and 5 in terms of scores, outcomes, and sensitive variables, but they are more general properties of random variables.)
Result 1 (Proposition 4 in Barocas et al. [10], Theorem 17.2 in Wasserman [126]).
Assume separation (the score is independent of the sensitive variable given the outcome) and sufficiency (the outcome is independent of the sensitive variable given the score). Then at least one of the following is true:

The sensitive variable is independent of the outcome.

An event in the joint distribution has probability zero (see [10, 126] for the formal statement in terms of Borel sets).
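A small numerical sketch of this tension (all probabilities invented): with unequal base rates, a score satisfying separation by construction fails sufficiency.

```python
# Binary sensitive variable A, score S, outcome Y; probabilities invented.
# Separation holds by construction: P(S | Y, A) = P(S | Y) for both groups.
base = {0: 0.3, 1: 0.6}           # P(Y = 1 | A = a): unequal base rates
p_s1_given_y = {0: 0.2, 1: 0.8}   # P(S = 1 | Y = y), shared across groups

def p_y1_given_s1(a):
    """P(Y = 1 | S = 1, A = a) by Bayes' rule."""
    num = base[a] * p_s1_given_y[1]
    return num / (num + (1 - base[a]) * p_s1_given_y[0])

# Sufficiency would require equality across groups; here it fails:
# group 0 gives about 0.63, group 1 about 0.86.
```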
Result 2 (Kleinberg et al. [74]).
Assume the outcome and the sensitive variable are binary. Assume also balance for the negative class, balance for the positive class, and calibration within groups. Then at least one of the following is true:

Equal base rates: the proportion of positive outcomes is the same in both groups.

Perfect prediction: the score equals 0 or 1 for every individual.

Equal base rates and perfect prediction can be called trivial, degenerate, or even utopian (representing two very different utopias).
Result 3 (Chouldechova [17]).
Assume the outcome and the sensitive variable are binary. Assume also that equal FPRs, equal FNRs, and equal PPVs hold. Then at least one of the following is true (the result follows from the relationship FPR = (p / (1 − p)) · ((1 − PPV) / PPV) · (1 − FNR), where p is the base rate):

Equal base rates: the base rate p is the same in both groups.

FPR = 0, PPV = 1 for both groups

FPR = 0, FNR = 1 for both groups
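The relationship behind Result 3 can be checked numerically. A minimal sketch with an invented confusion matrix:

```python
# Toy confusion matrix (counts invented for illustration).
TP, FP, FN, TN = 30, 10, 20, 40
n = TP + FP + FN + TN

p   = (TP + FN) / n        # base rate P(Y = 1)
fpr = FP / (FP + TN)       # false positive rate
fnr = FN / (TP + FN)       # false negative rate
ppv = TP / (TP + FP)       # positive predictive value

# The identity FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR):
rhs = (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)
# rhs equals fpr (up to floating point), for any confusion matrix.
```

Because the identity ties FPR to FNR, PPV, and the base rate, groups with different base rates cannot match on all three error quantities outside the degenerate cases listed above.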
6.1.1 The COMPAS debate
Tension between margins of the confusion matrix factored into a debate about a tool called COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) and its estimates of risk for “recidivism” (one can object to the term as creating an impression of certainty that there was a first offense). ProPublica published a highly influential analysis based on data obtained through public records requests [6, 78]. They found that COMPAS does not satisfy equal FNRs: among defendants who got rearrested, white defendants were twice as likely to be misclassified as low risk. COMPAS also does not satisfy equal FPRs: among defendants who did not get rearrested, black defendants were twice as likely to be misclassified as high risk.
Northpointe (now Equivant), the developers of COMPAS, critiqued ProPublica’s work and pointed out that COMPAS satisfies equal PPVs: among those labeled higher risk, the proportion of defendants who got rearrested is approximately the same regardless of race [31]. COMPAS also satisfies calibration within groups [38]. ProPublica then responded to these critiques [4, 5, 3]. Much of the subsequent conversation consisted of either trying to harmonize these definitions of fairness or asserting that one or the other is correct.
The debate between definitions was particularly intense because the decision space (detain or not) is so narrow and harmful. Different definitions dictate how the harm of detention is allocated across groups. Instead of choosing among these, we can choose alternative, less harmful policies (see Section 3.3).
6.2 Separation and demographic parity
The next two results have not (yet) been as central to public debate as the previous results, but we include them in our catalogue for completeness.
Result 4 (Proposition 3 in Barocas et al. [10]).
Assume the outcome is binary. Assume also that separation (the score is independent of the sensitive variable given the outcome) and demographic parity (the score is independent of the sensitive variable) hold. Then at least one of the following is true:

Equal base rates: the proportion of positive outcomes is the same in both groups.

The score is independent of the outcome.
6.3 Sufficiency and demographic parity
Result 5 (Proposition 2 in Barocas et al. [10]).
Assume sufficiency (the outcome is independent of the sensitive variable given the score) and demographic parity (the score is independent of the sensitive variable). Then we must have equal base rates: the proportion of positive outcomes is the same in both groups.
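The proof idea behind Result 5 is short enough to sketch numerically (the discrete score support and all probabilities below are invented): under demographic parity the score distribution is shared across groups, and under sufficiency so are the outcome probabilities given the score, so the base rates must agree.

```python
# Discrete score S in {0, 1, 2}; all numbers invented.
p_s = [0.5, 0.3, 0.2]           # P(S = s): shared across groups (demographic parity)
p_y1_given_s = [0.1, 0.5, 0.9]  # P(Y = 1 | S = s): shared across groups (sufficiency)

# Base rate in either group: sum_s P(S = s) * P(Y = 1 | S = s).
# Neither factor depends on the group, so the base rates are equal.
base_rate = sum(ps * py for ps, py in zip(p_s, p_y1_given_s))
```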
6.4 Unawareness and demographic parity
Corbett-Davies et al. and Lipton et al. both note that a decision rule that maximizes expected utility under a demographic parity constraint (in general) uses the sensitive variables both in estimating the conditional probabilities and in determining their thresholds [24, 80]. Therefore, solutions such as disparate learning processes (DLPs), which allow the use of sensitive variables during training but not prediction, are either suboptimal or at best equally optimal [104, 67, 68, 69, 128].
Lipton et al. present this result as a tension between unawareness and demographic parity [80]. It is, but the result is conditional on a choice of outcome, insensitive variables, and optimization goal. Changing any of these could reduce demographic disparity under unawareness. For example, in lending, we could consider repayment under different loan conditions (e.g. a longer timeline). We can also change the shape of the distributions of conditional probabilities by considering new variables. In pretrial detention, we could consider a utility function that penalizes false positives more strongly, raising the threshold to where the distributions of the conditional probabilities might look much more similar, lessening disparity in detention rates.
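To illustrate why demographic parity generally requires group-dependent thresholds at prediction time, here is a minimal sketch with invented score distributions (not the decision rules of [24, 80]):

```python
import random

random.seed(0)
# Invented score distributions for two hypothetical groups.
scores = {
    "a": [random.betavariate(2, 5) for _ in range(10000)],
    "b": [random.betavariate(5, 2) for _ in range(10000)],
}

def threshold_for_rate(group_scores, rate):
    """Cutoff selecting (approximately) the top `rate` fraction of a group."""
    ranked = sorted(group_scores, reverse=True)
    return ranked[int(rate * len(ranked)) - 1]

rate = 0.3
thresholds = {g: threshold_for_rate(s, rate) for g, s in scores.items()}
selected = {g: sum(x >= thresholds[g] for x in s) / len(s)
            for g, s in scores.items()}
# Selection rates match (demographic parity), but only because the cutoffs
# differ by group: the rule uses the sensitive variable at prediction time.
```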
Unlike the other impossibility results, unawareness and demographic parity do not imply equal base rates, perfect prediction, or some other “degenerate” case.
7 Discussion
Here we discuss two recommendations. First, we highlight confusing terminology and suggest moving to more descriptive language. Next, we briefly point to processes and research that respond to the scientific and political concerns that we have raised about predictionbased decisionmaking.
7.1 Terminology
In the computer science literature it is common to conflate an individual with their variables, e.g. “we denote the set of individuals by …” [35]. Furthermore, the conditional probability of the outcome given those variables is then called an individual’s “true risk” [24, 23]. These terminologies let us assume we have measured and conditioned on all the relevant attributes of an individual. The statistics literature usually separates the notion of an individual (often indexed by i) from their measured variables. We propose adopting this convention and describing these quantities as conditional probabilities.
The term “biased data” (e.g. [67, 11, 17, 80]) collapses retrospective injustice (societal bias) with concerns about nonrepresentative sampling and measurement error (statistical bias); see Figure 2. There is overlap between the two concepts: e.g., using arrests as a measure of crime can introduce statistical bias from measurement error that is differential by race because of a racist policing system [2, 84]. But suppose we could perfectly measure crime: would this make the data free from “bias”? In a statistical sense, yes. (In statistics, “bias” refers to properties of an estimator, not data. Here we mean bias in the estimation of conditional probabilities or fairness metrics that could result from nonrepresentative data, measurement error, or model misspecification.) In a societal sense, no, because crime rates reflect societal injustice (including how crime is defined).
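As a sketch of how differential measurement error alone can make data look “biased,” consider a toy simulation (all rates invented; two hypothetical groups with identical true offense rates but different arrest rates):

```python
import random

random.seed(1)
true_rate = 0.2                                # identical true offense rate
p_arrest_given_offense = {"a": 0.8, "b": 0.4}  # differential measurement

observed_rate = {}
for group, p_catch in p_arrest_given_offense.items():
    n, arrests = 100_000, 0
    for _ in range(n):
        offended = random.random() < true_rate
        arrests += offended and random.random() < p_catch
    observed_rate[group] = arrests / n
# Despite equal true rates, arrest-based data show roughly 0.16 vs 0.08:
# statistical bias from measurement error that differs by group.
```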
The term “biased model/algorithm” is used to describe violations of parities, e.g. unequal FPRs [6]. Lipton and Steinhardt caution against collapsing statistical parities with legal or ethical concepts [81]. Adopting the word “fairness” to describe the above definitions leads to confusion, e.g. thinking that we should “applaud and encourage” the application of any of them because it “immediately increases the amount of fairness, by some metric” [120]. This “mythic equivalence between technical and social notions of fairness” precludes progress by stifling necessary debate about how to weigh competing normative commitments [50]. Similarly, a quantity labeled “utility” or “social welfare” may fail to reflect the goals of many.
7.2 Ways forward
In Section 3, we described concerning choices and assumptions often used to justify a predictionbased decision system. As we have seen, one way to address these is by examining relevant flavors of fairness definitions from Section 4. Another way forward is to address them directly. Here we sketch that approach, including pointing to some of the relevant statistics literature.
Starting with clearly articulated goals can improve accountability. To best serve those goals, consider whether interventions should be made at the individual or aggregate level. Carefully describe the eligible population to clarify who is impacted by the possible interventions. Expanding the decision space to include less harmful and more supportive interventions can benefit all groups and mitigate fairness concerns.
To build a decision system aligned with stated goals, choose outcomes carefully, considering data limitations. Using prior information [48] can help specify a realistic utility function. For example, instead of assuming benefits and harms are constant across decisions (the “symmetric” assumption), prior data can inform a more realistic distribution.
Combining all choices in one expanded model [127] can mitigate sensitivity of decisions to model selection. Instead of assuming one potential outcome is known, causal methods can be used to estimate effects of decisions. Furthermore, these effects may not be separate and constant across the population. As such, causal methods can be used to study interference [100, 87] and heterogeneous effects [47].
8 Conclusion
We have identified several pitfalls in the justification of prediction-based decision systems and offered a catalogue of notions of fairness. Neither maximization of a “utility function” (e.g. accuracy) nor satisfaction of a “fairness constraint” (e.g. demographic parity) guarantees social and political goals. Neither provides a complete, causal model of the world to prescribe interventions toward those goals. Both can narrow focus to the quantifiable, introduce harmful simplifications, and mislead us into thinking that the issues are purely technical [50, 96]. But while data and mathematical formalization are far from saviors, they are not doomed to be tools of oppression. Indeed, they can be designed to help disadvantaged groups [43, 109].
In the pursuit of that goal, we need explicit, clear communication. We attempted this in cataloguing the choices and assumptions made, often implicitly, to justify a predictionbased decision system. We presented several definitions of fairness from the literature in common notation to facilitate comparisons, regarding none as the axiomatic definition of fairness, justice, or nondiscrimination.
References
 [1] DeGraffenreid v. General Motors Assembly Div., etc. https://law.justia.com/cases/federal/districtcourts/FSupp/413/142/1660699/, 1976.
 [2] Michelle Alexander. The New Jim Crow. New Press, 2012.
 [3] Julia Angwin and Jeff Larson. Annotated responses to an academic paper by Flores et al. https://www.documentcloud.org/documents/3248777LowenkampFedprobationsept20160.html, 2016.
 [4] Julia Angwin and Jeff Larson. ProPublica responds to company’s critique of machine bias story. https://www.propublica.org/article/propublicarespondstocompanyscritiqueofmachinebiasstory, 2016.
 [5] Julia Angwin and Jeff Larson. Technical response to northpointe. https://www.propublica.org/article/technicalresponsetonorthpointe, 2016.
 [6] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. https://www.propublica.org/article/machinebiasriskassessmentsincriminalsentencing, 2016.
 [7] Zinzi D. Bailey, Nancy Krieger, Madina Agénor, Jasmine Graves, Natalia Linos, and Mary T. Bassett. Structural racism and health inequities in the usa: evidence and interventions. The Lancet, 389(10077):1453–1463, 2017.
 [8] Chelsea Barabas, Karthik Dinakar, Joichi Ito, Madars Virza, and Jonathan Zittrain. Interventions over predictions: Reframing the ethical debate for actuarial risk assessment, 2018. Presented at FAT* 2018, February 2018, New York, NY USA.
 [9] Solon Barocas and Moritz Hardt. Fairness in machine learning, 2017. Presented at the 31st Annual Conference on Neural Information Processing Systems (NIPS), December 2017, Long Beach, CA USA.
 [10] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018.
 [11] Solon Barocas and Andrew D. Selbst. Big data’s disparate impact. Cal. L. Rev., 104:671, 2016.
 [12] James O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics. Springer New York, 1985.
 [13] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: the state of the art. arXiv preprint arXiv:1703.09207, 2017.
 [14] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, 2014.
 [15] William N. Brownsberger. Bill S.770: An act providing community-based sentencing alternatives for primary caretakers of dependent children who have been convicted of nonviolent crimes. https://malegislature.gov/Bills/190/S770, 2017.
 [16] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (FAT*), pages 77–91, 2018.
 [17] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
 [18] Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency (FAT*), pages 134–148, 2018.
 [19] Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning, 2018.
 [20] Anne T. Cleary. Test bias: Validity of the scholastic aptitude test for negro and white students in integrated colleges. ETS Research Bulletin Series, 1966(2):i–23, 1966.
 [21] T Anne Cleary. Test bias: Prediction of grades of negro and white students in integrated colleges. Journal of Educational Measurement, 5(2):115–124, 1968.
 [22] Benjamin Lê Cook, Thomas G. McGuire, and Alan M. Zaslavsky. Measuring racial/ethnic disparities in health care: methods and practical issues. Health services research, 47(3pt2):1232–1254, 2012.
 [23] Sam Corbett-Davies and Sharad Goel. Defining and designing fair algorithms. https://policylab.stanford.edu/projects/defininganddesigningfairalgorithms.html, 2018. Presented at EC 2018 and ICML 2018.
 [24] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM, 2017.
 [25] Marsha Courchane, David Nebhut, and David Nickerson. Lessons learned: Statistical techniques and fair lending. Journal of Housing Research, pages 277–295, 2000.
 [26] Kimberle Crenshaw. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal F., page 139, 1989.
 [27] Francis T Cullen, Cheryl Lero Jonson, and Daniel S Nagin. Prisons do not reduce recidivism: The high cost of ignoring science. The Prison Journal, 91(3_suppl):48S–65S, 2011.
 [28] Richard B. Darlington. Another look at ‘cultural fairness’. Journal of Educational Measurement, 8(2):71–82, 1971.
 [29] Maria De-Arteaga, Artur Dubrawski, and Alexandra Chouldechova. Learning under selective labels in the presence of expert consistency, 2018. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
 [30] Robert DeFina and Lance Hannon. For incapacitation, there is no time like the present: The lagged effects of prisoner reentry on property and violent crime rates. Social Science Research, 39(6):1004–1014, 2010.
 [31] William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc., 2016.
 [32] Roel Dobbe, Sarah Dean, Thomas Gilbert, and Nitin Kohli. A broader view on bias in automated decision-making: Reflecting on epistemology and dynamics, 2018. Appearing in the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
 [33] Neil J. Dorans and Linda L. Cook. Fairness in Educational Assessment and Measurement. Taylor & Francis, 2016.
 [34] Naihua Duan, XiaoLi Meng, Julia Y. Lin, Chih nan Chen, and Margarita Alegria. Disparities in defining disparities: statistical conceptual frameworks. Statistics in medicine, 27(20):3941–3956, 2008.
 [35] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
 [36] Hillel J. Einhorn and Alan R. Bass. Methodological considerations relevant to discrimination in employment testing. Psychological Bulletin, 75(4):261, 1971.
 [37] Virginia Eubanks. Automating Inequality: How HighTech Tools Profile, Police, and Punish the Poor. St. Martin’s Press, 2018.
 [38] Anthony W. Flores, Kristin Bechtel, and Christopher T. Lowenkamp. False positives, false negatives, and false analyses: A rejoinder to machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. Fed. Probation, 80:38, 2016.

 [39] Center for Data Science and Public Policy, University of Chicago. Aequitas. https://dsapp.uchicago.edu/aequitas/, 2018.
 [40] Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness, 2016.
 [41] Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning, 2018.
 [42] The Leadership Conference Education Fund. The use of pretrial ”risk assessment” instruments: a shared statement of civil rights concerns, 2018.
 [43] Sidney Fussell. The algorithm that could save vulnerable new yorkers from being forced out of their homes. https://gizmodo.com/thealgorithmthatcouldsavevulnerablenewyorkersfr1826807459, August 2018.
 [44] Andreas Fuster, Paul Goldsmith-Pinkham, Tarun Ramadorai, and Ansgar Walther. Predictably unequal? The effects of machine learning on credit markets, 2018.
 [45] Pratik Gajane and Mykola Pechenizkiy. On formalizing fairness in prediction with machine learning, 2017. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
 [46] Antonio R Garcia, Minseop Kim, and Christina DeNard. Context matters: The state of racial disparities in mental health services among youth reported to child welfare in 1999 and 2009. Children and youth services review, 66:101–108, 2016.
 [47] Andrew Gelman. The connection between varying treatment effects and the crisis of unreplicable research: A bayesian perspective, 2015.
 [48] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013.
 [49] Ben Green. “fair” risk assessments: A precarious approach for criminal justice reform. https://www.fatml.org/media/documents/fair_risk_assessments_criminal_justice.pdf, 2018. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
 [50] Ben Green and Lily Hu. The myth in the methodology: Towards a recontextualization of fairness in machine learning. https://www.dropbox.com/s/4tf5qz3mgft9ro7/Hu%20Green%20%20Myth%20in%20the%20Methodology.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
 [51] James D. Greiner and Donald B. Rubin. Causal effects of perceived immutable characteristics. Review of Economics and Statistics, 93(3):775–785, 2011.
 [52] Bernard E. Harcourt. Against prediction: Profiling, policing, and punishing in an actuarial age. University of Chicago Press, 2008.

 [53] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning, 2016.
 [54] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York, 2013.
 [55] Úrsula Hébert-Johnson, Michael P. Kim, Omer Reingold, and Guy N. Rothblum. Calibration for the (computationally-identifiable) masses, 2017.
 [56] Miguel A. Hernán and James M. Robins. Causal Inference. Chapman & Hall/CRC, 2018. forthcoming.
 [57] Lily Hu and Yiling Chen. A shortterm intervention for longterm fairness in the labor market. 2018.
 [58] Lily Hu and Yiling Chen. Welfare and distributional impacts of fair classification. 2018. Presented at the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
 [59] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic classification, 2018.
 [60] John E Hunter and Frank L Schmidt. Critical analysis of the statistical and ethical implications of various definitions of test bias. Psychological Bulletin, 83(6):1053, 1976.
 [61] Ben Hutchinson and Margaret Mitchell. 50 years of test (un)fairness: Lessons for machine learning, 2019. Conference on Fairness, Accountability and Transparency (FAT*) (preprint).
 [62] Guido W. Imbens and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, 2015.
 [63] John W. Jackson. Explaining intersectionality through description, counterfactual thinking, and mediation analysis. Social Psychiatry and Psychiatric Epidemiology, 52(7):785–793, July 2017.
 [64] John W. Jackson and Tyler J. VanderWeele. Decomposition analysis to identify intervention targets for reducing disparities, 2018.
 [65] John W. Jackson, David R. Williams, and Tyler J. VanderWeele. Disparities at the intersection of marginalized groups. Social Psychiatry and Psychiatric Epidemiology, 51(10):1349–1359, Oct 2016.
 [66] James E. Johndrow and Kristian Lum. An algorithm for removing sensitive information: application to race-independent recidivism prediction, 2017.
 [67] Faisal Kamiran and Toon Calders. Classifying without discriminating. In Computer, Control and Communication, 2009. IC4 2009. 2nd International Conference on, pages 1–6. IEEE, 2009.

 [68] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 869–874. IEEE, 2010.
 [69] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. Fairness-aware learning through regularization approach. In Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 643–650. IEEE, 2011.
 [70] Samuel Karlin and Herman Rubin. The theory of decision procedures for distributions with monotone likelihood ratio. The Annals of Mathematical Statistics, pages 272–299, 1956.
 [71] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 [72] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pages 656–666, 2017.
 [73] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan. Algorithmic fairness. In AEA Papers and Proceedings, volume 108, pages 22–27, 2018.
 [74] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent tradeoffs in the fair determination of risk scores, 2016.
 [75] Jon Kleinberg and Manish Raghavan. How do classifiers induce agents to invest effort strategically?, 2018.
 [76] Robert Koulish. Immigration detention in the risk classification assessment era. Connecticut Public Interest Law Journal, 16(1), November 2016.
 [77] Matt J. Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017.
 [78] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the compas recidivism algorithm. https://www.propublica.org/article/howweanalyzedthecompasrecidivismalgorithm, 2016.
 [79] Mary A. Lewis. A comparison of three models for determining test fairness. Technical report, Federal Aviation Administration Washington DC Office of Aviation Medicine, 1978.
 [80] Zachary C. Lipton, Alexandra Chouldechova, and Julian McAuley. Does mitigating ml’s impact disparity require treatment disparity?, 2018.
 [81] Zachary C. Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. https://www.dropbox.com/s/ao7c090p8bg1hk3/Lipton%20and%20Steinhardt%20%20Troubling%20Trends%20in%20Machine%20Learning%20Scholarship.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
 [82] Lydia T. Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. arXiv preprint arXiv:1803.04383, 2018. Best Paper Award at ICML 2018.
 [83] Kristian Lum. Limitations of mitigating judicial bias with machine learning. Nature Human Behaviour, 1(0141), 2017.
 [84] Kristian Lum and William Isaac. To predict and serve? Significance, 13(5):14–19, 2016.
 [85] Douglas S Massey and Nancy A Denton. American apartheid: Segregation and the making of the underclass. Harvard University Press, 1993.
 [86] Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107–118, 2018.
 [87] Caleb H. Miles, Maya Petersen, and Mark J. van der Laan. Causal inference for a single group of causallyconnected units under stratified interference, 2017.
 [88] Claire Cain Miller. Can an algorithm hire better than a human? https://www.nytimes.com/2015/06/26/upshot/cananalgorithmhirebetterthanahuman.html, June 2015.
 [89] Claire Cain Miller. When algorithms discriminate. https://www.nytimes.com/2015/06/26/upshot/cananalgorithmhirebetterthanahuman.html, July 2015.
 [90] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification, 2018.
 [91] Ojmarrh Mitchell, Joshua C Cochran, Daniel P Mears, and William D Bales. Examining prison effects on recidivism: A regression discontinuity approach. Justice Quarterly, 34(4):571–596, 2017.
 [92] Shira Mitchell and Jackie Shadlen. Mirror mirror: Reflections on quantitative fairness. https://shiraamitchell.github.io/fairness/, 2018.

 [93] Razieh Nabi and Ilya Shpitser. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 2018, page 1931. NIH Public Access, 2018.
 [94] Arvind Narayanan. 21 fairness definitions and their politics, 2018. Presented at FAT* 2018, February 2018, New York, NY USA.
 [95] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM, 2005.
 [96] Rodrigo Ochigame, Chelsea Barabas, Karthik Dinakar, Madars Virza, and Joichi Ito. Beyond legitimation: Rethinking fairness, interpretability, and accuracy in machine learning. https://www.dropbox.com/s/6ue5knrlvbxiavy/Ochigame%20et%20al%20%20Beyond%20Legitimation.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
 [97] The Mayor’s Office of Data Analytics. Legionnaires’ disease response: Moda assisted in a citywide response effort after an outbreak of legionnaires’ disease. https://modanyc.github.io/ProjectLibrary/projects/coolingtowers/, 2018.
 [98] Institute of Medicine (IOM). Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. Washington, DC: National Academies Press, 2003.
 [99] Executive Office of the President. Big data: A report on algorithmic systems, opportunity, and civil rights. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf, 2016.
 [100] Elizabeth L Ogburn, Tyler J VanderWeele, et al. Causal diagrams for interference. Statistical science, 29(4):559–578, 2014.
 [101] Cathy O’Neil. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown/Archetype, 2016.
 [102] Cathy O’Neil and Hanna Gunn. Near term artificial intelligence and the ethical matrix, 2018. Book chapter, to appear.
 [103] Judea Pearl. Causality. Cambridge University Press, 2009.
 [104] Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 560–568. ACM, 2008.
 [105] Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H Maathuis. A complete generalized adjustment criterion. arXiv preprint arXiv:1507.01524, 2015.
 [106] Nancy S. Petersen and Melvin R. Novick. An evaluation of some models for culturefair selection. Journal of Educational Measurement, 13(1):3–29, 1976.
 [107] Alex R. Piquero. Understanding race/ethnicity differences in offending across the life course: Gaps and opportunities. Journal of Developmental and Life-Course Criminology, 1(1):21–32, 2015.
 [108] Alex R. Piquero and Robert W. Brame. Assessing the race–crime and ethnicity–crime relationship in a sample of serious adolescent delinquents. Crime & Delinquency, 54(3):390–422, 2008.
 [109] Eric Potash, Joe Brew, Alexander Loewi, Subhabrata Majumdar, Andrew Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, and Rayid Ghani. Predictive modeling for public health: Preventing childhood lead poisoning. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2039–2047. ACM, 2015.
 [110] Jessica Pryce, Wonhyung Lee, Elizabeth Crowe, Daejun Park, Mary McCarthy, and Greg Owens. A case study in public child welfare: County-level practices that address racial disparity in foster care placement. Journal of Public Child Welfare, pages 1–25, 2018.
 [111] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton. Failing loudly: An empirical study of methods for detecting dataset shift, 2018.
 [112] John Rawls. A Theory of Justice. Harvard University Press, 1971.
 [113] Guy N. Rothblum and Gal Yona. Probably approximately metric-fair learning. https://www.fatml.org/media/documents/probably_approximately_metric_fair_learning.pdf, 2018. Appearing in the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2018), Stockholm, Sweden.
 [114] Donald B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
 [115] Andrew D. Selbst, Danah Boyd, Sorelle Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness and abstraction in sociotechnical systems. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3265913, 2018. Conference on Fairness, Accountability and Transparency (FAT*) 2019.
 [116] Amartya Sen. Utilitarianism and welfarism. The Journal of Philosophy, 76(9):463–489, 1979.
 [117] Camelia Simoiu, Sam Corbett-Davies, and Sharad Goel. The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics, 11(3):1193–1216, 2017.
 [118] Jennifer L. Skeem and Christopher T. Lowenkamp. Risk, race, and recidivism: predictive bias and disparate impact. Criminology, 54(4):680–712, 2016.
 [119] Latanya Sweeney. Discrimination in online ad delivery. Queue, 11(3):10, 2013.
 [120] Jared Sylvester and Edward Raff. What about applied fairness? https://www.dropbox.com/s/p3c9514mw36qs5b/Sylvester%20Raff%20%20What%20About%20Applied%20Fairness.pdf?dl=0, 2018. Presented at the Machine Learning: The Debates workshop at the 35th International Conference on Machine Learning.
 [121] Robert L. Thorndike. Concepts of culture-fairness. Journal of Educational Measurement, 8(2):63–70, 1971.
 [122] Rhema Vaithianathan, Tim Maloney, Emily Putnam-Hornstein, and Nan Jiang. Children in the public benefit system at risk of maltreatment: Identification via predictive modeling. American Journal of Preventive Medicine, 45(3):354–359, 2013.
 [123] Tyler J. VanderWeele and Miguel A. Hernán. Results on differential and dependent measurement error of the exposure and the outcome using signed directed acyclic graphs. American Journal of Epidemiology, 175(12):1303–1310, 2012.
 [124] Tyler J. VanderWeele and Whitney R. Robinson. On causal interpretation of race in regressions adjusting for confounding and mediating variables. Epidemiology (Cambridge, Mass.), 25(4):473, 2014.
 [125] Lynne M Vieraitis, Tomislav V Kovandzic, and Thomas B Marvell. The criminogenic effects of imprisonment: Evidence from state panel data, 1974–2002. Criminology & Public Policy, 6(3):589–622, 2007.
 [126] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer New York, 2010.
 [127] Yuling Yao, Aki Vehtari, Daniel Simpson, Andrew Gelman, et al. Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 2018.
 [128] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. AISTATS, 2017.
 [129] Meike Zehlike, Carlos Castillo, Francesco Bonchi, Ricardo Baeza-Yates, Sara Hajian, and Mohamed Megahed. Fairness measures: A platform for data collection and benchmarking in discrimination-aware ML. http://fairnessmeasures.org, June 2017.
 [130] Junzhe Zhang and Elias Bareinboim. Fairness in decision-making – the causal explanation formula. In 32nd AAAI Conference on Artificial Intelligence, 2018.