Replication materials for "Omitted and Included Variable Bias in Tests for Disparate Impact" (Jung et al., 2018)
Policymakers often seek to gauge discrimination against groups defined by race, gender, and other protected attributes. One popular strategy is to estimate disparities after controlling for observed covariates, typically with a regression model. This approach, however, suffers from two statistical challenges. First, omitted-variable bias can skew results if the model does not control for all relevant factors; second, and conversely, included-variable bias can skew results if the set of controls includes irrelevant factors. Here we introduce a simple three-step strategy, which we call risk-adjusted regression, that addresses both concerns in settings where decision makers have clearly measurable objectives. In the first step, we use all available covariates to estimate the utility of possible decisions. In the second step, we measure disparities after controlling for these utility estimates alone, mitigating the problem of included-variable bias. Finally, in the third step, we examine the sensitivity of results to unmeasured confounding, addressing concerns about omitted-variable bias. We demonstrate this method on a detailed dataset of 2.2 million police stops of pedestrians in New York City, and show that traditional statistical tests of discrimination can yield misleading results. We conclude by discussing implications of our statistical approach for questions of law and policy.
Studies of discrimination generally start by assessing whether certain groups, particularly those defined by race and gender, receive favorable decisions more often than others. For example, one might examine whether white loan applicants receive credit extensions more often than minorities, or whether male employees are promoted more often than women. Although observed disparities may be the result of bias, it is also possible that they stem from differences in group composition. In particular, if some groups contain disproportionately many qualified members, then one would also expect those groups to receive disproportionately many favorable decisions, even in the absence of discrimination.
To tease apart these two possibilities (group composition versus discrimination), the most popular statistical approach is ordinary linear or logistic regression. In the banking context, for example, one could examine race-contingent lending rates after controlling for relevant factors, such as income and credit history. Disparities that persist after accounting for such differences in group composition are often interpreted as evidence of discrimination. This basic statistical strategy has been used in hundreds of studies to test for bias in dozens of domains, including education (Espenshade et al., 2004), employment (Polachek et al., 2008), policing (Gelman et al., 2007), and medicine (Balsa et al., 2005).
Despite the ubiquity of such regression-based tests for discrimination, the approach suffers from two serious statistical limitations. First, the well-known problem of omitted-variable bias arises when decisions are based in part on factors that correlate with group membership, but which are omitted from the regression (Angrist and Pischke, 2008). For example, if lending officers consider an applicant’s payment history, and if payment history correlates with race but is not recorded in the data (and thus cannot be included in the regression), the results of the regression can suggest discrimination where there is none. Unfortunately, omitted-variable bias is the rule rather than the exception. It is generally prohibitive to measure every variable relevant to a decision, and it is likely that most unmeasured variables at least partially correlate with protected attributes, skewing results.
The second problem with regression-based tests is what Ayres (2005, 2010) calls included-variable bias, an issue as important as omitted-variable bias in studies of discrimination but one that receives far less attention. To take an extreme example, it is problematic to include control variables in a regression that are obvious proxies for protected attributes—such as vocal register as a proxy for gender—when examining the extent to which observed disparities stem from group differences in qualification. Including such proxies will typically lead one to underestimate the true effect of discrimination on decisions. But what counts as a “proxy” is not always clear. For example, given existing patterns of residential segregation, one might argue that zip codes are a proxy for race, and thus should be excluded when testing for racial bias. But one could also argue that zip code provides legitimate information relevant to a decision, and so excluding it would lead to omitted-variable bias. Ayres (2010) proposes a middle ground, suggesting that potential proxies should be included, but their coefficients capped to a justifiable level; in practice, however, it is difficult to determine and defend specific constraints on regression coefficients.
In this paper, we develop a statistically principled and straightforward method for measuring discrimination that addresses both omitted- and included-variable bias. Our method, which we call risk-adjusted regression, proceeds in three steps. In the first step, we use all available information, including protected attributes and their potential proxies, to estimate the expected utility of taking a particular action. For example, in the lending context, we might estimate an applicant’s risk of default if granted a loan (or equivalently, likelihood of repayment if granted a loan) conditional on all available covariates. In the second step, we assess disparities via a regression model where we control only for individual-level risk scores, and accordingly measure the extent to which similarly qualified individuals are treated similarly. This strategy can be seen as formalizing the coefficient-capping procedure of Ayres (2010)—with covariates used only to the extent that they are statistically justified by risk—and thus circumvents the problem of included-variable bias. Finally, we adapt the classical method of Rosenbaum and Rubin (1983) to estimate the sensitivity of our estimates to unmeasured confounding, addressing the problem of omitted-variable bias.
To demonstrate this technique, we examine 2.2 million stops of pedestrians conducted by the New York Police Department between 2008 and 2011. Controlling for a stopped individual’s statistical risk of carrying a weapon—based in part on detailed behavioral indicators recorded by officers—we find that blacks and Hispanics are searched for weapons more often than whites. These risk-adjusted disparities are substantially larger than disparities suggested by a standard regression that controls for all available covariates, underscoring the importance of accounting for included-variable bias. Our results persist after allowing for search policies that may differ by location, and appear robust to the possibility of substantial unmeasured confounding.
There are two main legal doctrines of discrimination in the United States: disparate treatment and disparate impact. The first, disparate treatment, derives force from the equal protection clause of the U.S. Constitution’s Fourteenth Amendment, and it prohibits government agents from acting with “discriminatory purpose” (Washington v. Davis, 1976). Although equal protection law bars policies undertaken with animus, it allows for the limited use of protected attributes to further a compelling government interest. For example, certain affirmative action programs for college admissions are legally permissible to further the government’s interest in promoting diversity (Fisher v. University of Texas, 2016).
The most widespread statistical test of such intentional discrimination is ordinary linear or logistic regression, in which one estimates the likelihood of favorable (or unfavorable) decisions across groups defined by race, gender, or other protected traits. In this approach, the investigator controls for all potentially relevant risk factors, excluding only clear proxies for the protected attributes. Barring omitted-variable bias, non-zero coefficients on the protected traits suggest those factors influenced the decision maker’s actions; and in the absence of a compelling justification, such evidence is suggestive of a discriminatory purpose. [Footnote 2: The regression-based approach to discrimination is closely related to benchmark analysis. In benchmark analysis, one compares the demographic composition of those receiving favorable decisions to the composition of a “qualified pool” (Ridgeway and MacDonald, 2010). For example, one might compare the proportion of loan recipients who are minorities to the proportion of minority applicants having a certain minimum credit rating. Deciding which individuals are “qualified” is analogous to deciding which variables to control for in a regression model.] We note that it is difficult, and perhaps impossible, to rigorously define the influence, or causal effect, of largely immutable traits like race and gender on decisions (VanderWeele and Robinson, 2014; Greiner and Rubin, 2011). A regression of this type is nevertheless considered a reasonable first step to identify discriminatory motive, both by criminologists and by legal scholars (Fagan, 2010). However, for an equal protection claim to succeed in court, one typically needs additional documentary evidence (e.g., acknowledgement of an illegitimate motive) to bolster the statistical evidence.
In contrast to disparate treatment, the disparate impact doctrine is concerned with the effects of a policy, not a decision maker’s intentions, and it is the primary form of discrimination we study in this paper. Under the disparate impact standard, a practice may be deemed discriminatory if it has an unjustified adverse effect on protected groups, even in the absence of explicit categorization or animus. The doctrine stems from statutory rules, rather than constitutional law, and applies only in certain contexts, such as employment (via Title VII of the 1964 Civil Rights Act) and housing (via the Fair Housing Act of 1968). Apart from federal statutes, some states have passed more expansive disparate impact laws, including Illinois and California.
The disparate impact doctrine was formalized in the landmark U.S. Supreme Court case Griggs v. Duke Power Co. (1971). In 1955, the Duke Power Company instituted a policy that mandated employees have a high school diploma to be considered for promotion, which had the effect of drastically limiting the eligibility of black employees. The Court found that this requirement had little relation to job performance, and thus deemed it to have an unjustified disparate impact. Importantly, the employer’s motivation for instituting the policy was irrelevant to the Court’s decision; even if enacted without discriminatory purpose, the policy was deemed discriminatory in its effects and hence illegal.
As discussed above, the standard statistical test for disparate treatment is a “kitchen sink” regression, where one examines the residual predictive power of protected group status after including all other available covariates as controls. That approach, however, is ill-suited to assess whether practices are rationally justified, which is the relevant standard in disparate impact claims. Ayres (2005) makes the point persuasively in the context of the original Griggs decision:
“One could imagine running a regression to test whether an employer was less likely to hire African American applicants than white applicants. It would be possible to control in this regression for whether the applicant had received a high-school diploma. Under the facts of Griggs, such a control would likely have reduced the racial disparity in the hiring rates. But including in the regression a variable controlling for applicants’ education would be inappropriate. The central point of Griggs was to determine whether the employer’s diploma requirement had a disparate racial impact. The possibility that including a diploma variable would reduce the estimated race effect in the regression would in no way be inconsistent with a theory that the employer’s diploma requirement disparately excluded African Americans from employment.”
By including educational status in the regression, one would mask the policy’s unjustified disparate impact. We note that such included-variable bias can also conceal discriminatory intent, as a decision maker may have purposely adopted a sub-optimal policy to further a prejudicial goal. [Footnote 3: One example is the historical practice of redlining, where borrowers from minority neighborhoods were denied mortgages and federal loan insurance. A disparate treatment analysis of redlining that controlled for the neighborhood of the borrower would find no evidence of racial discrimination, concealing the fact that this is exactly the mechanism by which an intentionally discriminatory outcome is being achieved.] (In reality, the Court found no evidence that the Duke Power Company’s policy was enacted with discriminatory purpose; it held only that the education requirement was unrelated to job performance and so had an illegal disparate impact.)
In general, to assess claims of disparate impact, one would ideally compare decision rates for similarly qualified groups of applicants (e.g., similarly qualified white and black candidates). Unfortunately, if one does not (or cannot) control for sufficiently many covariates, omitted-variable bias may skew results; conversely, if one does control for a rich set of covariates, included-variable bias may skew results. Such problems have prompted a search for alternatives to regression-based approaches. Most prominently, Becker (1993) proposed the outcome test, which is based not on the rate at which decisions are made, but on the success rate of those decisions. In the context of banking, Becker argued that even if minorities are less creditworthy than whites, minorities who are granted loans should still be found to repay their loans at the same rate as whites who are granted loans. If loans to minorities have a higher repayment rate than loans to whites, it suggests that lenders are effectively applying a double standard (intentionally or not), granting loans only to exceptionally qualified minorities. Such a finding would be evidence of a disparate impact violation. The outcome test has been applied almost as broadly as simple regression, to study discrimination in policing (Goel et al., 2016b, 2017; Ayres, 2002; Knowles et al., 2001), lending (Berkovec et al., 1996), and scientific publication (Smart and Waldfogel, 1996).
Outcome tests, however, have their own statistical shortcomings. To see this, suppose that there are two, easily distinguishable types of white loan applicants: those who have a 95% chance of repayment, and those who have a 50% chance of repayment. Similarly assume that black loan applicants have either a 99% or 50% chance of repayment. If bank officers, in a race-neutral manner, approve loans to all applicants at least 90% likely to repay their loans, then loans to whites will be repaid 95% of the time whereas loans to blacks will be repaid 99% of the time. In this stylized example, the outcome test would (incorrectly) suggest a double standard, with only exceptionally qualified minorities granted loans. This limitation of outcome tests is known as the problem of infra-marginality, and it stems from the fact that a group’s aggregate repayment rate is an average over individuals with different risk levels (Ayres, 2002; Simoiu et al., 2017; Corbett-Davies et al., 2017; Corbett-Davies and Goel, 2018).
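The arithmetic of this stylized example can be checked with a short sketch. The equal split between high- and low-risk applicants within each group is an assumption made only for illustration (the text does not specify group shares), and the function name is hypothetical.

```python
# Stylized applicant pools: (repayment probability, share of the group).
# The 50/50 shares are an illustrative assumption, not taken from the text.
groups = {
    "white": [(0.95, 0.5), (0.50, 0.5)],
    "black": [(0.99, 0.5), (0.50, 0.5)],
}
THRESHOLD = 0.90  # race-neutral rule: approve anyone at least 90% likely to repay


def repayment_rate_of_approved(pool, threshold):
    """Average repayment probability among applicants who clear the threshold."""
    approved = [(p, share) for p, share in pool if p >= threshold]
    total_share = sum(share for _, share in approved)
    return sum(p * share for p, share in approved) / total_share


rates = {g: repayment_rate_of_approved(pool, THRESHOLD) for g, pool in groups.items()}
for group, rate in rates.items():
    print(f"{group}: repayment rate among approved = {rate:.2f}")
```

Both groups face the same threshold, yet the approved pools repay at different rates, which is the pattern an outcome test would misread as a double standard.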
The problem of infra-marginality in outcome tests is more than a hypothetical possibility. Analyzing policing patterns in North Carolina, Simoiu et al. (2017) found that infra-marginality likely caused the outcome test to yield misleading results. To address this issue, they introduced the threshold test, which uses a Bayesian strategy to jointly infer group-specific risk distributions and decision thresholds. The test has since been applied in several studies, including a large-scale analysis of traffic stops across the country (Pierson et al., 2018a, b). While a significant step forward, the threshold test has two notable limitations. First, the test is identified in part by the prior distributions and by the specific structural form of the model—not by the data alone. The authors suggest several diagnostics to assess the robustness of conclusions, but they also note this non-identifiability as an important caveat of the approach. Second, and perhaps more importantly, the threshold test requires considerable statistical knowledge to understand and to appropriately carry out, hindering adoption by practitioners with less technical training.
Our proposed test of disparate impact, which we call risk-adjusted regression, measures the extent to which similarly risky (or equivalently, similarly qualified) individuals from different groups are treated similarly. For example, in the banking context, the test can assess whether black and white loan applicants with similar risk of default are approved at similar rates; and in the policing context, the test can assess whether black and white individuals with similar likelihood of carrying a weapon are searched for weapons at similar rates. By estimating decision rates conditional on risk—rather than on a specific set of covariates—the test avoids the problem of included-variable bias. The test also avoids the problem of infra-marginality, since it computes decision rates conditional on a specific risk level rather than computing an average over a distribution of risk levels. The test cannot completely circumvent the problem of omitted-variable bias, but we present an approach to assess the sensitivity of estimates to potential unmeasured confounding.
The first two steps of our risk-adjusted regression address potential included-variable bias. First, for each individual, we estimate risk as a function of all available covariates, including membership in protected classes such as race and gender. [Footnote 4: In some cases, protected covariates can add substantial predictive power, and thus excluding them can lead to poor estimates of risk. For example, in statistical estimates of recidivism, women have been found to reoffend less often than men with similar criminal profiles (Skeem et al., 2016; Corbett-Davies and Goel, 2018). Consequently, by excluding gender from such models, one would over-estimate the recidivism risk of women, potentially skewing estimates of risk-adjusted disparities. This phenomenon is closely related to what Arrow et al. (1973) call statistical discrimination.] We then control for this risk estimate in a regression of the (binary) decision on the protected class to compute risk-adjusted disparities, the difference in average decision rates between groups after adjusting for individual-level risk. (The third step of our strategy, in which we account for omitted-variable bias, is described in Section 3.2.)
To carry out this procedure, suppose we have data of the form $\Omega = \{(c_i, x_i, a_i, r_i)\}$, where for each observation $i$, $c_i$ indicates membership in a protected class, $x_i$ denotes all other information available to the decision maker prior to selecting a course of action, $a_i \in \{0, 1\}$ is the selected action, and $r_i \in \{0, 1\}$ indicates the outcome of interest. As a concrete example, consider estimating risk-adjusted racial disparities in police searches of pedestrians for weapons (the application we discuss in Section 4). In this case, $c_i$, $x_i$, $a_i$, and $r_i$ would, respectively, indicate a stopped pedestrian’s race, all information available prior to the search, whether the individual was searched, and whether a weapon was found. Now, let $r_i(0)$ and $r_i(1)$ denote potential outcomes under the two possible actions one could take, where only $r_i = r_i(a_i)$ is observed for each case. For example, $r_i(1)$ might indicate whether a weapon would be discovered if pedestrian $i$ were searched, and $r_i(0)$ whether a weapon would be discovered if that pedestrian were not searched. (In this particular example, $r_i(0) = 0$, since a weapon cannot be found if a search is not conducted, though that need not be the case in all applications.) [Footnote 5: Suppose, for example, that we seek to estimate risk-adjusted disparities in judicial bail decisions, where the outcome $r_i$ indicates whether a defendant fails to show up at required court proceedings (Jung et al., 2017, 2018). Then $r_i(1)$ might indicate failure to appear if the judge requires bail as a condition of release, and $r_i(0)$ might indicate failure to appear if the judge releases a defendant on his or her own recognizance. In this case, both $r_i(0)$ and $r_i(1)$ could be positive, since requiring bail decreases, but does not eliminate, flight risk.]
Next, for each individual, we define the ex-ante risk $\mu_i = \Pr(r_i(1) = 1 \mid c_i, x_i)$. In our policing example, $\mu_i$ is the probability of finding a weapon if individual $i$ is searched. Finally, we define a risk-adjusted regression to be a model of the form,

$$\Pr(a_i = 1 \mid c_i, \mu_i) = \operatorname{logit}^{-1}\left(\phi_{c_i} + f(\mu_i)\right), \tag{3.1}$$

where $\phi_c$ is the coefficient for membership in group $c$, and $f$ is an appropriately selected transformation of risk. [Footnote 6: For ease of exposition, we define risk only in terms of the potential outcome $r_i(1)$, and our risk-adjusted regression accordingly controls only for that risk. This setup is sufficient in situations where $r_i(0)$ is naturally zero, including many employment, lending, and policing applications. But in some settings, as described in Footnote 5, we require more generality. It is straightforward to extend the approach we outline here, including the sensitivity analysis of Section 3.2, to such scenarios.] So that the model is identified, we set $\phi_c = 0$ for the base group (typically the majority group), and incorporate an intercept term into $f$. Under this model, positive values of $\phi_c$ indicate that members of group $c$ are more likely to receive action $a = 1$ than similarly risky members of the base group. In our policing example, this means that groups with positive $\phi_c$ are searched more often than similarly risky members of the base group; we would accordingly say that such elevated search rates are unjustified by risk. Importantly, positive values of $\phi_c$ do not imply intentional discrimination: as in Griggs, unjustified disparate impact is possible even under a facially neutral policy undertaken without animus.
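The first two steps can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the authors' implementation: the data-generating process, the numeric constants, the logit-linear form of the risk control, and the helper `fit_logistic` (a plain Newton-Raphson fit written with NumPy) are all hypothetical. Decisions in the simulation depend only on group and true risk, so ignorability holds by construction.

```python
import numpy as np


def expit(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logistic(X, y, n_iter=50):
    """Logistic regression by Newton-Raphson; returns the coefficient vector."""
    coef = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ coef)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return coef


# Simulated data: every constant below is an illustrative assumption.
rng = np.random.default_rng(0)
n = 100_000
c = rng.integers(0, 2, n).astype(float)   # 1 = protected group, 0 = base group
x = rng.normal(size=n) + 0.5 * c          # covariate correlated with group
true_risk_logit = -1.0 + x                # log-odds of the outcome if action is taken
TRUE_DISPARITY = 0.5                      # extra log-odds of action for group 1
a = rng.random(n) < expit(TRUE_DISPARITY * c - 0.5 + true_risk_logit)
r1 = rng.random(n) < expit(true_risk_logit)  # outcome under action; used only where a is True

# Step 1: estimate risk from all covariates (including group), using only
# cases where the action was taken, since the outcome is observed there.
X_risk = np.column_stack([np.ones(n), c, x])
coef_risk = fit_logistic(X_risk[a], r1[a].astype(float))
risk_logit_hat = X_risk @ coef_risk       # estimated log-odds of risk for everyone

# Step 2: regress the decision on group membership, controlling only for risk.
X_dec = np.column_stack([np.ones(n), c, risk_logit_hat])
coef_dec = fit_logistic(X_dec, a.astype(float))
disparity_hat = coef_dec[1]               # risk-adjusted disparity (log-odds scale)
print(f"estimated risk-adjusted disparity: {disparity_hat:.2f}")
```

The group coefficient from the second regression should land near the simulated disparity, up to sampling noise in both fitting steps.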
In such a risk-adjusted regression, we must choose the transformation $f$ to suit the application. In some cases, we may believe the log-odds of taking action $a = 1$ is approximately proportional to the log-odds of risk, suggesting a logit-linear model:

$$\Pr(a_i = 1 \mid c_i, \mu_i) = \operatorname{logit}^{-1}\left(\phi_{c_i} + \lambda_0 + \lambda_1 \operatorname{logit}(\mu_i)\right). \tag{3.2}$$
If, however, there is reason to believe that decision makers are applying a threshold rule (where the probability of taking a certain action rapidly increases from near-zero to near-one at some risk threshold), a linear model would not be appropriate. In that case, binning the risk (into fixed bins $B_1, \dots, B_m$) may better capture the decision-making process:

$$\Pr(a_i = 1 \mid c_i, \mu_i) = \operatorname{logit}^{-1}\Big(\phi_{c_i} + \sum_{j=1}^{m} \lambda_j \, \mathbb{1}[\mu_i \in B_j]\Big). \tag{3.3}$$

We might alternatively use splines, or other functional forms, to express more complicated relationships between risk and decisions. No matter what form of $f$ is ultimately chosen, one must ensure that any risk-adjusted disparities are robust to reasonable perturbations, as we illustrate in Section 4.
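The two transformations of risk discussed above differ only in the columns passed to the regression. The sketch below builds both sets of columns with NumPy; the function names and the bin edges are hypothetical choices for illustration.

```python
import numpy as np


def logit_linear_features(mu):
    """Intercept plus log-odds of risk, for a logit-linear risk control."""
    mu = np.asarray(mu, dtype=float)
    return np.column_stack([np.ones_like(mu), np.log(mu / (1.0 - mu))])


def binned_features(mu, edges):
    """One indicator column per risk bin; the indicators absorb the intercept.

    `edges` lists the bin boundaries from 0 to 1, so bin j covers
    [edges[j], edges[j + 1]).
    """
    mu = np.asarray(mu, dtype=float)
    interior = np.asarray(edges, dtype=float)[1:-1]
    bin_index = np.digitize(mu, interior)      # values in 0 .. len(edges) - 2
    return np.eye(len(edges) - 1)[bin_index]


risk = np.array([0.05, 0.30, 0.70])
edges = [0.0, 0.1, 0.5, 1.0]                   # illustrative bin boundaries
print(logit_linear_features(risk))
print(binned_features(risk, edges))
```

Either matrix can be placed alongside the group indicators in a logistic regression of the decision; comparing the fitted group coefficients across the two is one simple robustness check.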
Our formulation implicitly assumes that the utility of a decision is proportional to the risk $\mu_i$. While reasonable in many cases, this may not hold in general. For example, if confiscating a weapon in a school zone has higher value than recovering one elsewhere, then probability of weapon recovery would be an imperfect proxy for utility. To address this concern, one could directly quantify (and control for) the utility of each context-specific action, though that approach can be difficult in practice. Alternatively, one could control for the factors by which utility varies, such as proximity to a school. Specifically, we can fit the model:

$$\Pr(a_i = 1 \mid c_i, \mu_i, z_i) = \operatorname{logit}^{-1}\left(\phi_{c_i} + f(\mu_i) + \theta^\top z_i\right), \tag{3.4}$$

where $z_i$ are the relevant non-risk factors, and $\theta$ is a vector of coefficients for the covariates $z_i$. As before, the fitted coefficients $\hat{\phi}_c$ provide a measure of unjustified disparities. While easier to carry out, this strategy can re-introduce included-variable bias, since the non-risk factors are only partially related to utility; as such, one might underestimate the actual unjustified disparities. These difficulties highlight the inherent challenges of measuring disparities when utilities are imprecisely specified.
Finally, we note that $\mu_i$ cannot typically be computed without additional assumptions, as it involves the potentially unobserved outcome $r_i(1)$. We thus further define the risk function $\mu(c, x)$:

$$\mu(c, x) = \Pr(r(1) = 1 \mid c, x, a = 1). \tag{3.5}$$

Because the probability in Eq. (3.5) is conditioned on $a = 1$, $\mu(c, x)$ can be estimated from the observable features and decisions. For example, an accurate estimate $\hat{\mu}(c, x)$ of $\mu(c, x)$ can be constructed by first limiting to cases with $a_i = 1$, since $r_i(1)$ is observed in those instances. If we assume that $a$ is ignorable given $c$ and $x$, then $\hat{\mu}(c_i, x_i)$ is an accurate estimate of $\mu_i$ for all individuals, including those for whom action $a_i = 0$ was taken. We address concerns regarding possible confounding in the next section.
We estimated risk above by assuming that decisions were ignorable given the observed covariates $c$ and $x$. Formally, this ignorability assumption means that

$$r(1) \perp a \mid c, x. \tag{3.6}$$
In practice, however, it is likely that decision makers observe unrecorded information which is predictive of the outcome and is therefore used to inform their actions, violating ignorability. As an example, a stopped pedestrian’s response to police questioning may legitimately alter an officer’s estimate of risk, and thus the officer’s decision to search. In this case, searched individuals would be riskier on average than those with the same observed covariates who were not searched, since the former group is more likely to have provided suspicious answers to an officer’s questions. As such, estimates fit on searched pedestrians would systematically overstate risk for those who were not searched, in turn corrupting estimates of risk-adjusted disparities. Unfortunately, it is typically impossible to directly account for every factor that may plausibly affect risk—unlike legitimate non-risk variables which often can be explicitly enumerated.
We address this issue by adapting the method of Rosenbaum and Rubin (1983), as recently extended by Jung et al. (2017), for assessing the sensitivity of estimated causal effects to an unobserved binary covariate. This methodology works well in our setting, even though estimating a causal effect is not the goal. To start, we assume there exists an unobserved covariate $u \in \{0, 1\}$ that affects both the decision $a$ (e.g., whether or not to carry out a search) and also the potential outcome $r(1)$ (e.g., recovery of a weapon if a search were conducted). Our key assumption is that the observed action $a$ is ignorable given the observed covariates $c$, $x$, and the unobserved covariate $u$:

$$r(1) \perp a \mid c, x, u. \tag{3.7}$$
This model of confounding has three important parameters, each of which may depend on the observed covariates $c$ and $x$: (1) the effect of $u$ on the action $a$; (2) the effect of $u$ on the potential outcome $r(1)$; and (3) the probability that $u = 1$. As we show below, once these parameters are specified, one can compute risk-adjusted disparities that account for the unmeasured confounding. Because the confounding is, by definition, unobserved, we cannot infer these parameters from the data. We can, however, specify plausible ranges for them and thus gauge the sensitivity of estimated disparities, as described in Section 4.2.
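To begin, we note that without loss of generality, we can write

$$\Pr(a = 1 \mid c, x, u) = \operatorname{logit}^{-1}\left(\gamma_{c,x} + u \, \alpha_{c,x}\right) \tag{3.8}$$

for appropriately chosen parameters $\gamma_{c,x}$ and $\alpha_{c,x}$ that depend on the observed covariates $c$ and $x$. Here $\alpha_{c,x}$ is the change in log-odds of taking action $a = 1$ when $u = 1$ versus when $u = 0$. We can similarly write

$$\Pr(r(1) = 1 \mid c, x, u) = \operatorname{logit}^{-1}\left(\beta_{c,x} + u \, \delta_{c,x}\right) \tag{3.9}$$

for parameters $\beta_{c,x}$ and $\delta_{c,x}$. In the case of police searches, $\delta_{c,x}$ is the change in log-odds of recovering a weapon if searched when $u = 1$ versus when $u = 0$.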
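For any posited values of the three parameters $q_{c,x} = \Pr(u = 1 \mid c, x)$, $\alpha_{c,x}$, and $\delta_{c,x}$, we can use the observed data to estimate $\gamma_{c,x}$ and $\beta_{c,x}$, as described in Rosenbaum and Rubin (1983) and Jung et al. (2017). First note that $\Pr(a = 1 \mid c, x)$ can be decomposed into two components conditioned on $u = 0$ and $u = 1$, respectively:

$$\Pr(a = 1 \mid c, x) = (1 - q_{c,x}) \operatorname{logit}^{-1}(\gamma_{c,x}) + q_{c,x} \operatorname{logit}^{-1}(\gamma_{c,x} + \alpha_{c,x}). \tag{3.10}$$

The left-hand side of Eq. (3.10) depends only on the observed quantities $a$, $c$, and $x$, and so can be directly estimated from the data (e.g., via a logistic regression). The right-hand side is a continuous, increasing function of $\gamma_{c,x}$, which takes values from 0 to 1 as $\gamma_{c,x}$ goes from $-\infty$ to $\infty$. Thus, there is a unique value of $\gamma_{c,x}$ that ensures the equality in Eq. (3.10) is satisfied.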
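Because the right-hand side of Eq. (3.10) is continuous and increasing in the intercept, the unique solution can be found by bisection. The sketch below assumes the logistic form of Eq. (3.8); the function name `solve_gamma`, the search bracket, and the example inputs are hypothetical.

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def solve_gamma(p_action, q, alpha, lo=-30.0, hi=30.0, tol=1e-10):
    """Find the intercept g with (1 - q) * expit(g) + q * expit(g + alpha) = p_action.

    p_action: estimated probability of the action given the observed covariates
    q:        posited probability that the unobserved covariate equals 1
    alpha:    posited change in log-odds of the action when it equals 1
    """
    def mixture(g):
        return (1.0 - q) * expit(g) + q * expit(g + alpha)

    # The mixture is increasing in g, so bisection converges to the unique root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture(mid) < p_action:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative inputs: a 30% observed action rate, q = 0.2, alpha = 1.
gamma_hat = solve_gamma(p_action=0.3, q=0.2, alpha=1.0)
implied = 0.8 * expit(gamma_hat) + 0.2 * expit(gamma_hat + 1.0)
print(f"gamma = {gamma_hat:.4f}, implied action rate = {implied:.4f}")
```

In an application this solve would be repeated for each covariate profile, with `p_action` taken from a fitted model of the decision.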
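Given the fitted values of $\gamma_{c,x}$, we can now estimate the distribution of $u$ given the observed covariates and decisions. By Bayes’ rule, writing $q_{c,x} = \Pr(u = 1 \mid c, x)$:

$$\Pr(u = 1 \mid c, x, a) = \frac{\Pr(a \mid c, x, u = 1) \, q_{c,x}}{\Pr(a \mid c, x, u = 1) \, q_{c,x} + \Pr(a \mid c, x, u = 0) \, (1 - q_{c,x})}.$$

The right-hand side can be estimated using Eq. (3.8) and our estimate of $\gamma_{c,x}$ computed above, yielding an approximation of the left-hand side.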
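Next, to estimate $\beta_{c,x}$, we write:

$$\begin{aligned}
\Pr(r(1) = 1 \mid c, x, a = 1) &= \sum_{u \in \{0, 1\}} \Pr(r(1) = 1 \mid c, x, a = 1, u) \, \Pr(u \mid c, x, a = 1) \\
&= \sum_{u \in \{0, 1\}} \Pr(r(1) = 1 \mid c, x, u) \, \Pr(u \mid c, x, a = 1) \\
&= \sum_{u \in \{0, 1\}} \operatorname{logit}^{-1}\left(\beta_{c,x} + u \, \delta_{c,x}\right) \Pr(u \mid c, x, a = 1).
\end{aligned}$$

The second equality above follows from the ignorability assumption stated in Eq. (3.7), and the third equality follows from Eq. (3.9). The left-hand side is the risk function $\mu(c, x)$ defined in Eq. (3.5), which can be estimated from the observed data. Given the estimate of $\Pr(u \mid c, x, a = 1)$ from above, and our assumed value of $\delta_{c,x}$, the only unknown on the right-hand side is $\beta_{c,x}$. As before, there is a unique value of $\beta_{c,x}$ that ensures the equality is satisfied.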
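The Bayes' rule step and the solve for the outcome intercept can be sketched together for a single covariate profile. As before, this assumes the logistic forms of Eqs. (3.8) and (3.9); the function names and all numeric inputs are hypothetical, and the action intercept is simply given here rather than fitted.

```python
import math


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def posterior_u1_given_action(gamma, alpha, q):
    """Probability that the unobserved covariate is 1 among cases where the action was taken."""
    taken_u1 = q * expit(gamma + alpha)
    taken_u0 = (1.0 - q) * expit(gamma)
    return taken_u1 / (taken_u1 + taken_u0)


def solve_beta(mu_obs, post_u1, delta, lo=-30.0, hi=30.0, tol=1e-10):
    """Find b with (1 - post_u1) * expit(b) + post_u1 * expit(b + delta) = mu_obs."""
    def mixture(b):
        return (1.0 - post_u1) * expit(b) + post_u1 * expit(b + delta)

    # Increasing in b, so bisection finds the unique solution.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture(mid) < mu_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative values for one covariate profile.
gamma, alpha, delta, q = -1.0, 1.0, 1.0, 0.2
mu_obs = 0.10   # estimated outcome rate among cases where the action was taken

post_u1 = posterior_u1_given_action(gamma, alpha, q)
beta_hat = solve_beta(mu_obs, post_u1, delta)
risk_u0 = expit(beta_hat)           # implied risk when the unobserved covariate is 0
risk_u1 = expit(beta_hat + delta)   # implied risk when it is 1
print(f"posterior = {post_u1:.3f}, risk(u=0) = {risk_u0:.3f}, risk(u=1) = {risk_u1:.3f}")
```

With a positive effect on the action, cases where the action was taken are enriched for the unobserved covariate, so the implied risk for the remaining cases falls below the observed rate.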
We now depart from past work, and make use of the fact that we can estimate the full joint distribution of $(r(1), a, u, c, x)$:

$$\Pr(r(1), a, u, c, x) = \Pr(r(1) \mid c, x, u) \, \Pr(a \mid c, x, u) \, \Pr(u \mid c, x) \, \Pr(c, x),$$

where we have used ignorability to write the first two terms in the product. Note that $\Pr(r(1) \mid c, x, u)$ and $\Pr(a \mid c, x, u)$ can be estimated from Eqs. (3.9) and (3.8), respectively, $\Pr(u \mid c, x)$ is posited at the start of the analysis, and $\Pr(c, x)$ can be estimated from the empirical joint distribution observed in the data.
Imagine drawing a new, synthetic dataset $\Omega_n$ of $n$ samples according to this (estimated) joint distribution. For each datapoint, we can use Eq. (3.9) to compute the ex-ante risk $\mu_i = \Pr(r_i(1) = 1 \mid c_i, x_i, u_i)$. This in turn allows us to regress the decision on risk (and optional legitimate covariates $z_i$):

$$\Pr(a_i = 1 \mid c_i, \mu_i, z_i) = \operatorname{logit}^{-1}\left(\phi_{c_i} + f(\mu_i) + \theta^\top z_i\right),$$

to compute $\hat{\phi}_c^{(n)}$, the risk-adjusted disparities for group $c$ in the sampled dataset.
The fitted coefficient $\hat{\phi}_c^{(n)}$ is a random variable that depends on the sample drawn, and we are interested in its limiting value as $n$ goes to infinity, which is our estimate of risk-adjusted disparities, taking potential confounding into account.
One could approximate this limiting value by sampling a large number of datapoints and fitting the regression above.
But we can more efficiently compute the limit as follows.
First, we construct an expanded dataset by combining two copies of the observed data: one with the unobserved confounder set to 0, and another with it set to 1 (recall that the confounder is not present in the original dataset).
We then fit a fractional-response logistic regression (Papke and Wooldridge, 1996) on the doubled dataset, weighting each datapoint by the posited probability that the confounder takes the value assigned in that copy, and using the decision probability (computed from Eq. (3.8)) as the response variable.
To see that this approach indeed yields the limiting value of , we show that the log-likelihood functions of the sampling process converge to the log-likelihood of the fractional-response regression. We start with some notation. First, let
where . Then, define the random variables,
Now, we can write the (normalized) log-likelihood of :
In the limit, we have:
Finally, letting denote the size of the original dataset , note that the limit of is equivalent to,
which is the log-likelihood that is optimized when fitting the weighted fractional-response regression to with response and weights , establishing the result.
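The doubled-dataset construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each individual is summarized by a group indicator and two hypothetical log-odds intercepts (`gamma` for the decision, `beta` for the outcome), the confounder shifts these by `alpha` and `delta`, the prevalence `q` is constant, and risk enters the regression through a logit-linear term. The fit uses a hand-written Newton iteration rather than a statistics library.

```python
import math

def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_fractional_logit(X, y, w, iters=50):
    """Weighted logistic quasi-likelihood fit; responses y may lie in [0, 1]."""
    k = len(X[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi, wi in zip(X, y, w):
            mu = inv_logit(sum(c * v for c, v in zip(coef, xi)))
            for a in range(k):
                grad[a] += wi * (yi - mu) * xi[a]
                for b in range(k):
                    hess[a][b] += wi * mu * (1.0 - mu) * xi[a] * xi[b]
        step = solve_linear(hess, grad)  # Newton step
        coef = [c + s for c, s in zip(coef, step)]
    return coef

def doubled_dataset(rows, alpha, delta, q):
    """Two copies of each row (confounder u = 0 and u = 1) with weights."""
    X, y, w = [], [], []
    for group, gamma, beta in rows:
        for u in (0, 1):
            # Columns: intercept, group indicator, logit of ex-ante risk.
            X.append([1.0, float(group), beta + delta * u])
            y.append(inv_logit(gamma + alpha * u))  # decision probability
            w.append(q if u == 1 else 1.0 - q)      # weight for this copy
    return X, y, w

# Made-up rows: (group, decision intercept, outcome intercept).
rows = [(0, -0.5, -3.0), (0, 0.2, -2.0), (1, 0.4, -3.5),
        (1, 0.9, -2.5), (0, -0.1, -4.0), (1, 0.3, -4.5)]
X, y, w = doubled_dataset(rows, alpha=math.log(3), delta=math.log(3), q=0.3)
coef = fit_fractional_logit(X, y, w)
group_odds_ratio = math.exp(coef[1])
```

With `alpha = delta = 0` the two copies of each row are identical and their weights sum to one, so the fit reduces to an ordinary fractional regression on the original rows, which gives a quick sanity check on the construction.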
We now apply our risk-adjusted regression developed above to investigate the New York Police Department’s (NYPD) “stop-and-frisk” practices. Police officers in the United States may stop and question pedestrians if they have “reasonable and articulable” suspicion of criminal activity; and officers may additionally conduct a “frisk” (i.e., a brief pat-down of one’s outer garments) if they believe the stopped individual is carrying a weapon. Though a policy of stopping and frisking individuals is not inherently illegal, a federal district court ruled that the NYPD carried out such stops with racial animus, violating the equal protection clause of the Fourteenth Amendment (Floyd v. City of New York, 2013).
The court in Floyd was interested in assessing claims of disparate treatment; here we re-analyze the data to test directly for disparate impact. We specifically focus on frisk decisions, as they have a clear goal of recovering weapons and a well-measured outcome: whether a weapon was in fact found. Frisk decisions are thus particularly amenable to our statistical approach. We study 2.2 million pedestrian stops that occurred between 2008 and 2011. For each stop, we have detailed information on the date, time, and location of the stop; the demographics of the stopped individual (e.g., age, gender, and race); the suspected crime; the reasons prompting the stop (e.g., “furtive movements” or “suspicious bulge”); and additional circumstances surrounding the stop (e.g., evasive responses to questioning, witness reports, or evidence of criminal activity in the vicinity). [Footnote 7: This information is recorded in a standardized way on UF-250 forms that officers are required to complete after each stop. A copy of the form can be found online at: https://www.prisonlegalnews.org/news/publications/blank-uf-250-form-stop-question-and-frisk-report-worksheet-nypd-2016/.]
To start, we note that white pedestrians are frisked during 44% of police stops, whereas black and Hispanic pedestrians are frisked in 57% and 58% of stops, respectively. This corresponds to stopped minority pedestrians having about 1.7 times greater odds of being frisked than whites. These raw disparities are computed without controlling for any potentially explanatory variables, and so represent an extreme case of omitted-variable bias. At the other extreme is the “kitchen sink regression,” which controls for all pre-frisk covariates in a standard logistic regression model. In this case, stopped blacks and Hispanics have about 1.2 times the odds of being frisked relative to whites. Though these kitchen-sink disparities are suggestive of disparate treatment (and similar evidence was indeed presented to the court in Floyd to support such an allegation), they may understate the extent to which the policy imposes an unjustified disparate impact on minorities, due to included-variable bias.
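As a quick check on the arithmetic, the raw odds ratios follow directly from the frisk rates quoted above. The helper names below are illustrative.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def odds_ratio(p_group, p_reference):
    """Odds of the event in one group relative to a reference group."""
    return odds(p_group) / odds(p_reference)

# Frisk rates reported in the text.
frisk_rate = {"white": 0.44, "black": 0.57, "Hispanic": 0.58}

raw_odds_ratios = {
    group: odds_ratio(rate, frisk_rate["white"])
    for group, rate in frisk_rate.items()
    if group != "white"
}
```

The resulting ratios are roughly 1.69 for black pedestrians and 1.76 for Hispanic pedestrians, consistent with the "about 1.7" figure in the text.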
We now analyze the stop-and-frisk data via a risk-adjusted regression. We begin by accounting for included-variable bias, and we then assess the sensitivity of estimated risk-adjusted disparities to unmeasured confounding in Section 4.2.
We first randomly divide the stops into two approximately equal-sized training and test sets.
Using only stops in the training set that resulted in a frisk, we fit a model that estimates the probability a weapon is found conditional on all observable pre-frisk variables, including race. To estimate this probability, we use gradient-boosted decision trees, a machine-learning algorithm known to perform well in a variety of such classification tasks. [Footnote 8: The gradient-boosted decision tree model was fit using the gbm package in R (Ridgeway et al., 2006).] The AUC of this model on the test set is 81%, and model checks presented in Appendix A indicate that it yields predictions that are well-calibrated across groups. By assuming ignorability, we then use the fitted model to infer ex-ante risk for every pedestrian stopped in the test set, including those who were not frisked.
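The workflow described above (split the stops, fit only on frisked stops in the training half, then score every stop in the test half) can be sketched as follows. To keep the example dependency-free, it swaps in a per-stratum hit rate for the gradient-boosted trees used in the paper; the field names and toy data are made up for illustration.

```python
import random
from collections import defaultdict

def split_half(stops, seed=0):
    """Randomly split stops into two roughly equal halves (train, test)."""
    rng = random.Random(seed)
    shuffled = list(stops)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def fit_hit_rates(train, key):
    """Weapon recovery rate per stratum, using frisked training stops only."""
    found, total = defaultdict(int), defaultdict(int)
    for stop in train:
        if stop["frisked"]:
            total[key(stop)] += 1
            found[key(stop)] += int(stop["weapon_found"])
    return {k: found[k] / total[k] for k in total}

def infer_risk(stops, rates, key):
    """Score every stop, frisked or not; this step relies on ignorability."""
    return [rates[key(stop)] for stop in stops]

# Toy data with made-up values; weapon_found is only observable when frisked.
stops = []
for i in range(400):
    frisked = i % 3 != 0
    bulge = i % 2
    stops.append({
        "bulge": bulge,
        "frisked": frisked,
        "weapon_found": frisked and bulge == 1 and i % 5 == 0,
    })

train, test = split_half(stops)
rates = fit_hit_rates(train, key=lambda s: s["bulge"])
risk = infer_risk(test, rates, key=lambda s: s["bulge"])
```

The key design point is that the outcome is observed only for frisked stops, so the model is trained on that subset, while the inferred risk is assigned to all test stops.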
The distribution of these inferred risks, disaggregated by race, is shown in Figure 1, which illustrates several interesting patterns. First, the absolute level of risk is quite low, with even the riskiest pedestrians estimated, based on the recorded evidence, to be carrying weapons only about 10% of the time. This observation is consistent with the court’s ruling in Floyd that stops were often carried out without sufficient legal justification, in violation of the Fourth Amendment requirement that stops be based on “reasonable suspicion” (Goel et al., 2017). [Footnote 9: The court ruled that the NYPD violated both the Fourth Amendment demand for reasonable suspicion when carrying out stops, and the Fourteenth Amendment prohibition against racial animus.] Further, stopped white pedestrians are, on average, riskier than stopped minority pedestrians. In particular, 2.7% of stopped whites are estimated to carry weapons, compared to 1.5% of stopped blacks and 1.7% of stopped Hispanics. Following the logic of Becker’s outcome test, these differences are evidence of discrimination in the stop decision, since they suggest that officers stopped minorities on the basis of less evidence than whites. [Footnote 10: In and of themselves, these differences provide only weak evidence of bias, as outcome tests suffer from the problem of infra-marginality (as discussed in Section 2), and officers may stop individuals for legitimate reasons other than suspected weapon possession, potentially producing the observed patterns in the absence of discrimination. We note, however, that more thorough investigations of the NYPD’s practices have indeed found compelling statistical evidence of bias in stop decisions (Pierson et al., 2018a; Goel et al., 2016b; Gelman et al., 2007).]
Now, to quantify risk-adjusted disparities, we fit the risk-adjusted regression in Eq. (3.1) on the test set.
We separately compute the black-white disparity by fitting a model only on stops of white and black pedestrians,
and likewise compute the Hispanic-white disparity by fitting a model only on stops of Hispanic and white pedestrians.
Because estimated risk is a data-dependent covariate, we estimate error bars for the risk-adjusted disparities via bootstrapping. Specifically, we estimate the standard error of estimates by repeating our entire inference procedure, including estimation of risk, on 100 bootstrap samples of the original data; 95% confidence intervals are computed around the point estimate by adding and subtracting twice the bootstrapped standard error.
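The bootstrap procedure described above can be sketched generically. In this illustration the `estimator` argument stands in for the entire inference pipeline (risk estimation followed by the risk-adjusted regression); the toy estimator below is only a raw log odds ratio, and the data are made up.

```python
import math
import random

def bootstrap_se(data, estimator, n_boot=100, seed=0):
    """Standard error of estimator(data) from n_boot resamples with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # The full procedure is re-run on every resample.
        estimates.append(estimator(resample))
    mean = sum(estimates) / n_boot
    variance = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return math.sqrt(variance)

def interval(point, se):
    """Approximate 95% interval: point estimate plus or minus twice the SE."""
    return point - 2.0 * se, point + 2.0 * se

def log_odds_ratio(sample):
    """Toy estimator: log odds ratio of being frisked, group 1 vs. group 0."""
    def rate(g):
        outcomes = [frisked for group, frisked in sample if group == g]
        return sum(outcomes) / len(outcomes)
    p0, p1 = rate(0), rate(1)
    return math.log(p1 / (1.0 - p1)) - math.log(p0 / (1.0 - p0))

# Made-up data: 200 stops per group, frisked at 44% and 57% respectively.
data = ([(0, 1)] * 88 + [(0, 0)] * 112 +
        [(1, 1)] * 114 + [(1, 0)] * 86)

point = log_odds_ratio(data)
se = bootstrap_se(data, log_odds_ratio)
lower, upper = interval(point, se)
```

Passing the whole pipeline as `estimator` is what lets the interval reflect uncertainty in the estimated risk as well as in the regression coefficients.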
Figure 2 shows the results, together with the raw disparities and those estimated from a “kitchen sink” model.
We find that stopped black and Hispanic pedestrians have about twice the odds of being frisked as equally risky whites. We also find that these estimated disparities are robust to several different specifications of the risk transformation: logit-linear (as in Eq. (3.2)), binned by decile (as in Eq. (3.3)), and thin-plate spline (Wood, 2003). [Footnote 11: The thin-plate regression spline was fit using the mgcv package in R.] Further, the risk-adjusted disparities are in fact greater than the raw disparities in frisk rates. This finding is consistent with the fact that stopped whites are riskier on average than stopped minorities, as discussed above. Finally, we see that the kitchen-sink regression dramatically underestimates the extent of disparate impact faced by minorities. In this case, the kitchen-sink model controls for a variety of features (including whether the suspect made “furtive movements”) that are correlated with race but are poor predictors of whether a pedestrian is carrying a weapon, skewing estimates of disparate impact. [Footnote 12: We exclude hair color and eye color from the kitchen-sink model, since these are obvious proxies for race that are effectively unrelated to risk, and which would thus be excluded in most traditional legal and statistical analyses of discrimination. As we would expect, including these variables as controls exacerbates the problem of included-variable bias, but our results show that such bias can occur even if obviously problematic variables are excluded.]
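Two of the risk transformations mentioned above can be illustrated in a few lines. The exact forms of Eqs. (3.2) and (3.3) are not visible in this text, so the sketch shows only the generic ideas: a logit transform of estimated risk, and a decile index assigned by rank within the sample. Function names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def decile_bins(risks):
    """Decile index (0 to 9) for each risk, assigned by rank within the sample."""
    n = len(risks)
    order = sorted(range(n), key=lambda i: risks[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * 10 // n
    return bins

# Made-up risk estimates between 0.1% and 10%.
risks = [((37 * i) % 100 + 1) / 1000 for i in range(100)]

logit_risk = [logit(r) for r in risks]  # one continuous regressor
risk_decile = decile_bins(risks)        # would enter the model as indicators
```

The logit-linear form uses a single coefficient on risk, while the binned form lets the relationship between risk and the decision vary freely across deciles at the cost of more parameters.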
Our analysis above assumes that risk of weapon possession is the only legitimate consideration for carrying out a frisk. It is possible, however, that there is a justifiable reason for frisking certain low-risk pedestrians more often than other, higher-risk individuals; and such a policy might in turn justify the risk-adjusted disparities we find between race groups. One possible policy justification is that different neighborhoods benefit from different enforcement standards. For example, frisks of low-risk pedestrians might be particularly effective at deterring criminal activity in high-crime neighborhoods by raising the perceived chance of getting caught; additionally, greater police presence in such areas might lower the effective costs of carrying out stops, making it feasible to frisk lower-risk individuals. While the merits of such policies are debated, we can still examine whether risk-adjusted disparities persist after controlling for location via Eq. (3.4). We specifically control for the police precinct in which the stop occurred, and the location type (e.g., public housing or public transit). The right-most panel in Figure 2 shows that the disparities cannot be explained by policing practices that may differ by location. We find the estimated disparity for Hispanics decreases (from 1.9 to 1.6) but is still large, and the disparity for blacks is approximately unchanged (at 2.0).
The disparities computed above account for included-variable bias by adjusting for each individual’s estimated risk. But, as discussed earlier, our estimates of risk may be skewed if officers observe factors that are predictive of risk but are not recorded in our data. We now apply the method of Section 3.2 to gauge the sensitivity of our estimated risk-adjusted disparities to such potential omitted-variable bias.
First we estimate the left-hand side of Eq. (3.10), the probability of being frisked given the observed covariates. As when estimating our risk model, we use gradient-boosted decision trees, fit on the training set. On the test set, the fitted model has 83% AUC, and model checks in Appendix A indicate well-calibrated predictions across various subgroups of the data. Now recall that our sensitivity analysis is based on three key parameters: (1) the effect of the confounder on being frisked; (2) the effect of the confounder on recovering a weapon if a frisk is carried out; and (3) the prevalence of the confounder. In principle, these parameters could vary arbitrarily with an individual’s observed characteristics. However, for simplicity, and following standard practice (Rosenbaum and Rubin, 1983; Jung et al., 2017), we limit the dependence of these parameters on the covariates. We specifically assume that the two effect parameters are constant across all individuals. We further assume that the prevalence of the confounder depends only on an individual’s race. This means that while the confounder may be present at different rates for different groups, its effects on risk and frisk rates are the same for all those individuals for whom it is present. [Footnote 13: In preliminary numerical experiments, we found that our estimates did not change substantially if we allowed the effect parameters to vary by group. Thus, for computational efficiency, we assume constant values for these parameters.]
To assess the sensitivity of estimates, we now posit ranges for the parameters above, and then compute the minimum and maximum values of our disparate impact estimate across this range (via grid search). In constructing these ranges, we assume that the effect parameters are non-negative, which ensures that decision makers respond to the confounder rationally; that is, we assume a confounder cannot increase an individual’s risk while also decreasing that individual’s probability of being frisked. In particular, we assume that both effect parameters lie between zero and an upper bound that we select below, and we assume that the prevalence of the confounder is unconstrained. What remains is to determine an appropriate upper bound on the effect sizes. There are several reasonable approaches outlined in the sensitivity literature. In some problems, there are “known unknowns,” variables that are not present in the data but whose effects can be estimated from theory or from additional investigation. Another approach is to fit a linear model on the observed covariates, and assume that any confounder will have a marginal effect no bigger than the largest effect associated with a binary feature in that model. We take a third approach, which is loosely related to cross-validation, as described next.
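The grid search over sensitivity parameters can be sketched as below. Here `estimate_fn` is a hypothetical stand-in for the full procedure that maps one parameter setting to a disparity estimate, and `toy_estimate` is an arbitrary placeholder with no empirical meaning. For brevity the sketch uses a single prevalence value, whereas the analysis in the text lets prevalence vary by race.

```python
import itertools
import math

def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def sensitivity_band(estimate_fn, max_effect, n_grid=5):
    """Minimum and maximum of estimate_fn over a grid of parameter values.

    Both effect parameters range over [0, max_effect] (non-negative effects),
    and the prevalence ranges over [0, 1] (unconstrained).
    """
    effects = linspace(0.0, max_effect, n_grid)
    prevalences = linspace(0.0, 1.0, n_grid)
    values = [
        estimate_fn(alpha, delta, q)
        for alpha, delta, q in itertools.product(effects, effects, prevalences)
    ]
    return min(values), max(values)

def toy_estimate(alpha, delta, q):
    """Placeholder only: an arbitrary smooth function of the parameters."""
    return 0.7 - 0.2 * alpha * delta * q * (1.0 - q)

band_low, band_high = sensitivity_band(toy_estimate, max_effect=math.log(3))
```

A finer grid gives tighter approximations of the true extremes, at a cost that grows with the cube of `n_grid` because each setting requires re-running the estimate.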
First, we produce a synthetic dataset in which the observed covariates are identical to the original, but where frisk decisions and potential outcomes are randomly generated according to our frisk and risk models fit above. In reality, our risk model estimates the probability of recovering a weapon among stops that resulted in a frisk, but we assume it describes every stop for the purposes of creating this synthetic dataset. Ignorability thus holds in the synthetic dataset, and so we can fit the risk-adjusted regression in Eq. (3.1) to compute the true racial disparities in this synthetic dataset.
Now, to simulate confounding, we censor a set of pre-frisk covariates in the synthetic dataset. Specifically, we remove variables listed in two sections of the UF-250 stop forms that describe the “circumstances” prompting the encounter. These sections consist of 20 binary variables (including, for example, “fits description”, “actions indicative of casing”, and “changing direction at sight of officer”) that are crucial for establishing the legal basis of the stop. Thus, by censoring these variables, we are simulating severe unobserved confounding.
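The synthetic-data check described above can be sketched in two steps: draw decisions and potential outcomes from the fitted models while keeping covariates fixed, then hide some covariates to mimic unobserved confounding. The column names, toy models, and probabilities below are made up for illustration; in the paper the models are the fitted gradient-boosted trees.

```python
import random

def make_synthetic(rows, frisk_model, risk_model, seed=0):
    """Keep covariates; draw frisk decisions and potential outcomes.

    Both draws depend only on the row's covariates and are independent
    given them, so ignorability holds by construction.
    """
    rng = random.Random(seed)
    synthetic = []
    for x in rows:
        frisked = rng.random() < frisk_model(x)
        weapon_if_frisked = rng.random() < risk_model(x)
        synthetic.append({**x, "frisked": frisked,
                          "weapon_if_frisked": weapon_if_frisked})
    return synthetic

def censor(rows, hidden):
    """Drop the named covariates to simulate unobserved confounders."""
    return [{k: v for k, v in row.items() if k not in hidden} for row in rows]

def frisk_model(x):
    # Hypothetical fitted frisk probability.
    return 0.4 + 0.3 * x["casing"]

def risk_model(x):
    # Hypothetical fitted probability of recovering a weapon.
    return 0.02 + 0.05 * x["fits_description"]

rows = [{"group": i % 2, "fits_description": (i // 2) % 2, "casing": (i // 4) % 2}
        for i in range(200)]

synthetic = make_synthetic(rows, frisk_model, risk_model)
censored = censor(synthetic, hidden={"fits_description", "casing"})
```

Because the true disparities can be computed from `synthetic` and the sensitivity analysis is run on `censored`, the comparison shows whether a posited bound on confounding is wide enough to cover the truth.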
We now run the sensitivity analysis described above on the censored dataset with an upper bound on the effect sizes corresponding to tripling the odds of frisking and of finding weapons on stopped individuals. By assuming this degree of confounding, the inferred sensitivity bands around our estimates easily cover the true risk-adjusted disparities in the synthetic dataset, as shown in Figure 3. [Footnote 14: These results are based on a logit-linear risk transformation. As shown in Figure 2, our results are largely robust to the precise transformation applied, and so for simplicity and speed we default to a logit-linear model. As before, confidence intervals are computed via bootstrapping. Specifically, we compute the bootstrap standard error of estimates assuming no confounding, as in Figure 2, and then add and subtract twice that computed quantity to the endpoints of the sensitivity bands.] Accordingly, it appears that this value is a reasonable upper bound on the effect of any confounders in the real data. It might seem surprising that the censored estimates (in white) are so close to the true values (in red), even with 20 key variables removed. But we note that for a confounder to substantially change our estimates, it has to be relatively prevalent in the data, and also correlated with risk, frisk rate, and race. Some of the censored variables (such as “furtive movements”) have little relation to risk, others (such as “suspicious object”) are predictive but rare, and most are only moderately correlated with race. As a result, censoring has only a limited effect on our disparity estimates.
Finally, we compute sensitivity bands for our estimates of risk-adjusted disparities in the real NYPD stop-and-frisk data, assuming the same upper bound on the effect of confounders. After accounting for such possible confounding, Figure 4 shows that black and Hispanic pedestrians were still more likely to be frisked than equally risky whites. The figure also shows that this finding holds if we allow frisk policies to vary by location.
With our risk-adjusted regression, we have sought to develop a simple, intuitive test that addresses the most serious concerns of omitted- and included-variable bias in disparate impact studies. On a detailed dataset of police stops, we found that these concerns are more than hypothetical possibilities. In particular, regressions that control for all available covariates—in line with common legal and statistical convention—can substantially skew estimates of discrimination.
Throughout our analysis, we have defined “disparate impact” in terms of a regression coefficient on protected-group identity in a model that controls for estimated risk. This choice is consistent with current practice, where we simply replace the usual set of control variables with a single variable capturing risk. Implicitly, our formalization means that we are measuring a particular weighted average of differences in decision rates across similarly risky individuals. While intuitively reasonable, this definition raises subtle questions of law and policy.
Consider, for example, Figure 5, where we plot frisk rates as a function of risk, disaggregated by race, estimated by separate logistic regression curves. Stopped black and Hispanic pedestrians are frisked more often than stopped whites at every level of risk. As a result, one would find that minorities face disparate impact regardless of how one averages across risk levels; the precise number might change, but the qualitative conclusion would remain the same. However, comparing blacks to Hispanics, the direction of the disparity depends on risk. [Footnote 15: Such a comparison between minority groups is unusual in disparate impact cases, but it illustrates the underlying theoretical issue.] Low-risk Hispanics are frisked more often than similarly risky blacks, but high-risk blacks are frisked more often than their Hispanic counterparts. Consequently, a conclusion of disparate impact between blacks and Hispanics would depend heavily on the precise definition applied. The analysis is further complicated if the risk distributions differ substantially between groups. If, hypothetically, Hispanics were mostly low-risk and blacks mostly high-risk, majorities of both groups could argue that they were treated more harshly than equally risky members of the other group. [Footnote 16: Some scholars have similarly investigated interactions between race and other decision-making criteria. For example, Espenshade et al. (2004) find that preferences for underrepresented minorities in college admissions are greatest for applicants with SAT scores in the 1200–1300 range, and the effect is attenuated at lower scores. That analysis, however, found no score ranges where minority applicants faced an absolute disadvantage relative to whites with equal scores. We do not know of any research that has found a change in the direction of the disparities like we see between blacks and Hispanics in Figure 5.]
The crossing of risk curves that we see in Figure 5 is a potentially widespread phenomenon, and, to our knowledge, disparate impact law has not yet resolved the underlying conceptual ambiguity it invokes. Many discussions of disparate impact tacitly assume that policies either consistently harm or help groups defined by protected traits. Such thinking can be seen in the original Griggs ruling, where the Supreme Court aimed to proscribe policies that acted as “built-in headwinds” for minorities. But, formally, disparate impact law concerns facially race-neutral policies, not intentional discrimination, and there is no theoretical or empirical guarantee that such policies will adversely impact all members of a particular group.
A related issue is the extent to which concern for unjustified disparities compels decision makers to act optimally. For example, Figure 5 suggests that officers are only marginally responsive to risk, with the lowest-risk individuals still frisked more than 40% of the time. If, instead, officers frisked only the riskiest people, they could frisk far fewer individuals—and, in particular, far fewer minorities—while recovering the same number of weapons (Goel et al., 2016a). A more efficient frisk strategy could thus reduce the burdens of policing on minorities while still maintaining public safety. Such efficiency is indeed one of the aims of statistical risk assessment tools that are now used in the criminal justice system and beyond to guide high-stakes decisions (Monahan and Skeem, 2016; Corbett-Davies and Goel, 2018; Chouldechova et al., 2018; Shroff, 2017). If these tools are shown to reduce racial disparities, are policymakers obliged—legally or ethically—to adopt them? The role of efficiency in disparate impact claims has largely gone unanswered by the courts, and adds yet another subtlety to defining and measuring disparities.
By foregrounding the role of risk in understanding disparities, we have aimed to clarify some of the thorny conceptual issues at the heart of disparate impact analysis. Though there are still important unresolved questions, we believe that our general statistical approach provides practitioners with a tractable way to assess disparities in many domains while avoiding some of the pitfalls with traditional methods. Looking forward, we hope this work spurs further theoretical and empirical research at the intersection of statistics, law, and public policy.
The 21st International Conference on Artificial Intelligence and Statistics (AISTATS), 2018a.