# Fairness Evaluation in Presence of Biased Noisy Labels

Risk assessment tools are widely used around the country to inform decision making within the criminal justice system. Recently, considerable attention has been devoted to the question of whether such tools may suffer from racial bias. In this type of assessment, a fundamental issue is that the training and evaluation of the model is based on a variable (arrest) that may represent a noisy version of an unobserved outcome of more central interest (offense). We propose a sensitivity analysis framework for assessing how assumptions on the noise across groups affect the predictive bias properties of the risk assessment model as a predictor of reoffense. Our experimental results on two real world criminal justice data sets demonstrate how even small biases in the observed labels may call into question the conclusions of an analysis based on the noisy outcome.

## 1 Introduction

The goal of recidivism risk assessment instruments (RAI’s) is to estimate the likelihood that an individual will reoffend at some future point in time, such as while on release pending trial, on probation, or on parole (desmarais2013risk). Risk assessment tools have long been used in the criminal justice system to guide interventions aimed at reducing recidivism risk (james2015risk). More recently they have received considerable attention as major components of broader pretrial reform efforts seeking to reduce unnecessary pretrial detention without compromising public safety. From a public safety standpoint, society incurs a cost when a crime is committed, irrespective of whether the crime results in an arrest. The relevant fairness question in this context is thus whether a tool provides an “unbiased” prediction of who goes on to commit future crimes. However, because offending is not directly observed, risk assessment models are trained and evaluated on data where the target variable is rearrest, reconviction, or reincarceration.

While these observed proxies for offending may be of interest in their own right, they are problematic as a basis for predictive bias assessment, particularly with respect to race. Racial disparities in rearrest rates may stem from two separate causes: differential involvement in crime, and differential law enforcement practices, also known as differential selection (piquero2008assessing). Rearrest is a result of not only an individual’s actions, but also of law enforcement practices affecting the likelihood of getting arrested for crimes committed (or even for crimes not committed). The limited evidence that exists suggests that differential law enforcement is not a major factor in arrests for violent crimes (piquero2015understanding). Problematically, though, for lower level offenses, which form the majority of arrests in existing data, there is reason to believe that the likelihood of getting arrested for a committed offense does differ across racial groups. Evidence of differential selection is strongest in the case of drug crimes, where surveys suggest that whites are at least as likely as blacks to sell or use drugs; yet blacks are more than twice as likely to be arrested for drug-related offenses (rothwell2014war). This racially differential discrepancy between the unobservable outcome (reoffense) and the noisy observed variable (rearrest) poses a critical challenge when evaluating RAI’s for racial predictive bias. In this paper, we will refer to such differential discrepancy as target variable bias (TVB). As we show, in the presence of TVB, a model that appears to be fair with respect to rearrest could be an unfair predictor of reoffense.

We develop a statistical sensitivity analysis framework for evaluating RAI’s according to several of the most common fairness metrics, including calibration, predictive parity, and error rate balance. Our approach is conceptually inspired by sensitivity analysis approaches widely used in causal inference studies (rosenbaum2014sensitivity). When presenting analytic results it is common to report not only point estimates and confidence intervals, but also a parameter reflecting the magnitude of unobserved confounding that would be sufficient to nullify the observed results. In this work we introduce a similar parameter, $\alpha$, that governs the level of label bias in the observed data. Our methods characterize how the fairness properties of a model vary with $\alpha$, and can be used to determine the level of label noise sufficient to contradict the observed findings about those properties. We illustrate our approach through a reanalysis of the fairness properties of the COMPAS RAI used in the ProPublica debate, and a risk assessment tool developed on data provided by the Pennsylvania Commission on Sentencing.

### 1.1 Related work

What we call target variable bias is often referred to as differential outcome measurement bias or differential outcome misclassification bias in the statistics and epidemiology literature on measurement error (carroll2006measurement; grace2016statistical). Most of the measurement error literature is concerned with the problem of non-differentially mismeasured exposure (treatment), covariates, and outcomes. That is, while this form of data bias has a name, it has received little attention relative to other measurement issues. The work of imai2010causal is a notable exception. They do consider the setting of differential measurement error, but their goal is different from ours in that they are seeking to estimate a causal effect parameter.

In the machine learning literature, our setting is known as censoring positive and unlabeled (PU) learning (menon2015learning). This literature differs from the current work in two key ways. First, while the case of feature-independent noise has been widely studied (elkan2008learning; scott2009novelty; du2014analysis; liu2016classification; menon2015learning), our work contributes to the nascent literature on feature-dependent noise (menon2016learning; bekker2018learning; scott2018generalized; bootkrajang2018towards; cannings2018classification; he2018instance). Second, we believe our paper is among the first to consider issues of fairness in the context of PU learning.

There are also connections between the goal of our work and causal approaches to algorithmic bias that have recently been proposed in the fairness literature (kusner2017counterfactual; loftus2018causal; kilbertus2017avoiding; nabi2018fair). These works provide an approach to addressing biases in the observed data by attempting to directly model the causal structure governing the data generating process. Problematically, the underlying assumptions are often not empirically testable, and when violated may result in incorrect inference.

Lastly, label noise has been briefly mentioned in prior work as a potential concern in the training and evaluation of RAI’s (johndrow2016fairness; Corbett-Davies:2017:ADM:3097983.3098095; corbett2018measure). However, none of these works undertake a formal analysis of how label noise affects training or evaluation.

## 2 Problem setup

We denote the observed noisy outcome (e.g., rearrest) by $Y$, the true unobserved outcome (e.g., reoffense) by $Y^*$, the set of covariates (e.g., age, criminal history) by $X$, the group indicator (race) by $A$, and the risk score (our RAI) by $S$. The risk score can be thought of as an empirical estimate of the probability of the observed outcome given the available features. When discussing binary classification metrics, we will set a risk threshold applied to $S$ to obtain the classifier $\hat{Y}$. The discrepancy between the observed and true outcome is captured in a noise rate function. A central aim of this work is to characterize what can be learned about the predictive bias properties of $S$ as a predictor of the true unobserved outcome $Y^*$ under assumptions on the magnitude but not the structure of the noise.

We make two simplifying assumptions that, while implausible in practice, greatly simplify exposition in the main manuscript and reduce the notational overhead. First, we assume that the noise is one-sided, which rules out the case of “false arrests.”

###### Assumption 1.

for all and .

This allows us to simplify the notation for the noise rate function. That is, the discrepancy between $Y$ and $Y^*$ is due to the presence of “hidden recidivists”: individuals with $Y = 0$ but $Y^* = 1$. Table 1 describes the general setup for this setting. The left table represents the observed confusion matrix expressed in terms of the cell frequencies $p_{ij}$; the right table introduces the parameters $\alpha_0$ and $\alpha_1$. Large values of $\alpha_1$ indicate that hidden recidivists are more likely to be classified as high risk, while large values of $\alpha_0$ indicate that hidden recidivists are less likely to be classified as high risk. We also define $\alpha = \alpha_0 + \alpha_1$, which corresponds to the overall proportion of “hidden recidivists” in the observed data.

Second, in the main paper we suppose that one of the groups is being observed without bias.

###### Assumption 2.

for all .

That is, for $A = b$ we assume that $Y = Y^*$. In the running COMPAS example, this amounts to operating as though we observed the true offenses for the black population. One could also think of $\alpha$ as capturing the additional degree of hidden recidivism in the white population relative to the black population. Again, this assumption is made solely to simplify exposition, and it does not qualitatively affect the presented results. In Supplement §B.3 we show how all results are readily extensible to the case where this assumption is removed.

As we show in Section 3, most of the bounds in our sensitivity analysis are attained when the hidden recidivists are the highest- or lowest-scoring defendants for whom we observed $Y = 0$. While these extreme cases may seem unlikely in practice, they generally cannot be ruled out on the basis of the observed data alone without further assumptions. In such settings, existing methods typically (1) assume some data generating mechanism to conduct sensitivity analysis (heckman1979sample; little2019statistical; robins2000sensitivity; molenberghs2014handbook), (2) assume parametric models and estimate the noise by EM algorithms (rubin1976inference; bekker2018learning), or (3) impose stronger conditions on the noise processes. For instance, the noise rate may be assumed to depend only on a subset of the covariates (bekker2018learning) or to be monotonic (menon2016learning; scott2018generalized).

In this paper we are primarily interested in what can be said about the predictive bias properties of an RAI without untestable structural assumptions on the noise process. We note, however, that our results can be adapted to incorporate structural assumptions when reasonable ones are available. For instance, an assumption tailored to our setting might be that the noise process is constant within groups. (This is a slight modification of label-dependent noise, or noise at random. In the PU learning and missing data literature, the latter is known as selected at random (SAR) (bekker2018learning) and missing not at random (MNAR) (rubin1976inference) respectively.) Such an assumption probabilistically rules out extreme cases for $\alpha_0$ and $\alpha_1$, and, as we show in Supplement §A.2.2, it allows us to obtain tighter estimation results. There we also demonstrate how a range of results from the label-dependent noise literature can be easily adapted to our setting.

### 2.1 Data and background

In May 2016 an investigative journalism team at ProPublica released a report on a proprietary risk assessment instrument called COMPAS, developed by Northpointe Inc. (now Equivant) (propublica2016). The investigation found that the COMPAS instrument had significantly higher false positive rates and lower false negative rates for black defendants than for white defendants. This evidence led the authors to conclude that COMPAS is biased against black defendants. The report was met with a critical response challenging its central conclusion (floresfalse; dieterich2016compas; corbett-davies-2016). Error rate imbalance, critics argued, is not an indication of racial bias. Instead, RAI’s should be assessed for properties such as predictive parity (dieterich2016compas) and calibration (floresfalse), which COMPAS was shown to satisfy. A series of papers reflecting on the debate showed that when recidivism prevalence varies across groups, as is observed to be the case in ProPublica’s Broward County data, a tool cannot simultaneously satisfy both predictive parity (calibration) and error rate balance (resp. balance for the positive and negative class) (kleinberg2016inherent; chouldechova2017fairlong; berk2017fairness).

One popular interpretation of such “impossibility results” is that error rate imbalance is a (perhaps inconsequential) artifact of differences in recidivism (rearrest) prevalence across groups. That is, if one were to assess the instrument on a population where prevalence was equal, the RAI could (might be expected to) achieve parity on all of the metrics simultaneously. Applying our framework to reanalyse the data in the setting where true offense rates are assumed to be the same across groups, we show that disparities with respect to $Y^*$ (reoffense) may in fact be greater than those observed for $Y$ (rearrest).

We also analyze a second private data set provided by the Pennsylvania Commission on Sentencing for the purpose of research. This dataset contains information on all offenders sentenced in the state’s criminal courts between 2004 and 2006. In its published reports, the Commission observes that the risk assessment tool it constructed appeared to overestimate risk for white offenders. While we do not have access to that tool, the tool we construct by applying regularized logistic regression to the same data evidences the same miscalibration issues. Our empirical results are based on applying this score to a held out set of offenders, of whom 65.4% are white.

## 3 Sensitivity analysis under target variable bias

This section presents our main technical results, coupled with experiments that demonstrate how the results may be used in practice. All proofs are contained in Supplement §B.1. Given the observed data and a classification threshold, we want to understand how the relationship between the observed ($Y$) and unobserved ($Y^*$) performance metrics depends on the noise level $\alpha$ in the problem setup outlined in Section 2. Superscripts $b$ and $w$ denote within-race group estimates. We present sensitivity analysis results for predictive parity, error rate balance (also known as equalized odds (hardt2016equality)), accuracy parity, and two tests of differential calibration. Supplement §C presents experiments on the COMPAS data set for two fairness-promoting algorithms. All code is available at https://github.com/ricfog/Fairness-tvb.

### 3.1 Error rate balance and predictive parity

We begin by presenting results for the false positive rate ($\mathrm{FPR}$), the false negative rate ($\mathrm{FNR}$), and the positive predictive value ($\mathrm{PPV}$). Our first result shows that the observed values of $\mathrm{FPR}$ and $\mathrm{FNR}$ impose constraints on the true error rates even if no assumptions are made on the magnitude of the noise.

###### Proposition 3.1.

Suppose that . Then and cannot both hold. If , then the opposite inequalities can not both hold.

Proposition 3.1 permits us to rule out one of the possible relations between observed and true error rates based solely on observed quantities. Example: COMPAS. In ProPublica’s COMPAS analysis, we observe that and . We are thus in the case where , and therefore either or , or both.

The next set of results directly relate the observed metrics to the target quantities based on the noise level . Table 1 summarizes the relationship between the observed and target confusion tables used to derive these relationships. While a version of the results was previously reported in (claesen2015assessing), the case of and are novel.

###### Theorem 3.2.

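Under the setup of Table 1, the target values $\mathrm{FPR}^*$, $\mathrm{FNR}^*$, and $\mathrm{PPV}^*$ can be sharply related to observed quantities as follows:

$$\frac{p_{01}-\alpha}{p_{00}+p_{01}-\alpha} \;\le\; \mathrm{FPR}^*(\alpha_0,\alpha_1) \;\le\; \frac{p_{01}}{p_{00}+p_{01}-\alpha} \tag{1}$$

$$\frac{p_{10}}{p_{10}+p_{11}+\alpha} \;\le\; \mathrm{FNR}^*(\alpha_0,\alpha_1) \;\le\; \frac{p_{10}+\alpha}{p_{10}+p_{11}+\alpha} \tag{2}$$

$$\mathrm{PPV} \;\le\; \mathrm{PPV}^*(\alpha_0,\alpha_1) \;\le\; \frac{p_{11}+\alpha}{p_{01}+p_{11}} \tag{3}$$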
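The bounds in (1) to (3) are simple to evaluate numerically. The following is a minimal Python sketch, not the authors' released code. It assumes that the first index of $p_{ij}$ is the observed outcome $Y$ and the second is the prediction $\hat{Y}$ (the reading consistent with the formulas above), that moving $\alpha_1$ hidden recidivists out of the cell $p_{01}$ and $\alpha_0$ out of $p_{00}$ yields the target metrics, and that $\alpha \le \min(p_{00}, p_{01})$. The function names and the cell frequencies used for illustration are hypothetical and are not taken from the COMPAS data.

```python
def target_metrics(p00, p01, p10, p11, alpha0, alpha1):
    """Target FPR*, FNR*, PPV* when alpha0 (resp. alpha1) of the population are
    hidden recidivists classified low risk (resp. high risk).
    Assumed convention: p_{y, yhat}, with y the observed outcome."""
    alpha = alpha0 + alpha1
    fpr = (p01 - alpha1) / (p00 + p01 - alpha)
    fnr = (p10 + alpha0) / (p10 + p11 + alpha)
    ppv = (p11 + alpha1) / (p01 + p11)
    return fpr, fnr, ppv


def target_metric_bounds(p00, p01, p10, p11, alpha):
    """Bounds (1)-(3): the extremes put all hidden recidivists in one cell."""
    if not 0 <= alpha <= min(p00, p01):
        raise ValueError("this sketch assumes 0 <= alpha <= min(p00, p01)")
    # All hidden recidivists classified high risk: lowest FPR*, lowest FNR*, highest PPV*.
    fpr_lo, fnr_lo, ppv_hi = target_metrics(p00, p01, p10, p11, 0.0, alpha)
    # All hidden recidivists classified low risk: highest FPR*, highest FNR*, lowest PPV*.
    fpr_hi, fnr_hi, ppv_lo = target_metrics(p00, p01, p10, p11, alpha, 0.0)
    return {"FPR": (fpr_lo, fpr_hi), "FNR": (fnr_lo, fnr_hi), "PPV": (ppv_lo, ppv_hi)}


# Hypothetical cell frequencies for one group, with 10% hidden recidivists.
bounds = target_metric_bounds(p00=0.5, p01=0.2, p10=0.1, p11=0.2, alpha=0.1)
for name, (lo, hi) in bounds.items():
    print(f"{name}*: [{lo:.3f}, {hi:.3f}]")
```

Sweeping `alpha1` from 0 to `alpha` in `target_metrics` traces out every value between the two endpoints, which is one way to reproduce a curve of the kind described for the COMPAS example.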

Example: COMPAS. This result allows us to reanalyse ProPublica’s COMPAS data to answer the question: If the reoffense rate was equal across races, would disparities disappear? Figure 1(a) shows the possible values of $\mathrm{FPR}^*$, $\mathrm{FNR}^*$, and $\mathrm{PPV}^*$ for fixed $\alpha$. At this choice of $\alpha$, the true reoffense rate among white defendants is assumed equal to the rate observed for black defendants. Since $\alpha$ is fixed, $\alpha_0 = \alpha - \alpha_1$, and hence the metrics are a function of just $\alpha_1$. We see that for most values of $\alpha_1$ disparities are even greater than what is observed. Furthermore, while there exist values of $\alpha_1$ under which the true metric for white defendants would equal the observed (and assumed true) metric for black defendants, the equalizing value of $\alpha_1$ differs across the metrics.

Figure 1(b) shows the theoretical bounds (orange lines) provided by Theorem 3.2 as functions of $\alpha$ for the white population, and the observed metrics for the black population (grey lines) on the COMPAS data. The regions highlighted in red indicate areas where the true disparity in metrics could be of a different sign than what is observed. This plot also shows that parity on two of the true metrics is infeasible in this data at the given choice of classification threshold.

As a corollary of this result we can also study the question: Under what level of label noise could we expect disparities on a given metric to be smaller in truth than what was observed? First, note that when the observed recidivism rate is greater in group $b$ than in group $w$, as in the case of the COMPAS example, we will generally observe $\mathrm{FPR}^b > \mathrm{FPR}^w$ and $\mathrm{FNR}^b < \mathrm{FNR}^w$. A necessary condition for the disparity between the true error rates to be no larger than that for the observed rates is thus that $\mathrm{FPR}^w \le \mathrm{FPR}^{*w}$ and $\mathrm{FNR}^w \ge \mathrm{FNR}^{*w}$. The following corollary characterizes when this occurs.

###### Corollary 3.2.1.

In the notation of Theorem 3.2,

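$$\mathrm{FPR} \ge \frac{\alpha_1}{\alpha} \iff \mathrm{FPR} \le \mathrm{FPR}^*(\alpha,\alpha_1) \tag{4}$$

$$\mathrm{FNR} \ge \frac{\alpha_0}{\alpha} \iff \mathrm{FNR} \ge \mathrm{FNR}^*(\alpha,\alpha_1), \tag{5}$$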

with equality on LHS iff there is equality on RHS.
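The equivalence in (4) can be checked in a few lines. The sketch below assumes, consistent with the bounds in Theorem 3.2, that $\mathrm{FPR} = p_{01}/(p_{00}+p_{01})$ and $\mathrm{FPR}^*(\alpha,\alpha_1) = (p_{01}-\alpha_1)/(p_{00}+p_{01}-\alpha)$ with $0 < \alpha < p_{00}+p_{01}$; the proof in Supplement §B.1 may proceed differently.

```latex
\begin{align*}
\mathrm{FPR} \le \mathrm{FPR}^*(\alpha,\alpha_1)
  &\iff \frac{p_{01}}{p_{00}+p_{01}} \le \frac{p_{01}-\alpha_1}{p_{00}+p_{01}-\alpha} \\
  &\iff p_{01}\,(p_{00}+p_{01}-\alpha) \le (p_{01}-\alpha_1)\,(p_{00}+p_{01}) \\
  &\iff \alpha_1\,(p_{00}+p_{01}) \le \alpha\, p_{01} \\
  &\iff \frac{\alpha_1}{\alpha} \le \mathrm{FPR}.
\end{align*}
```

The argument for (5) is analogous, under the assumed forms $\mathrm{FNR} = p_{10}/(p_{10}+p_{11})$ and $\mathrm{FNR}^* = (p_{10}+\alpha_0)/(p_{10}+p_{11}+\alpha)$. Adding the two left-hand conditions and using $\alpha_0 + \alpha_1 = \alpha$ gives $\mathrm{FPR} + \mathrm{FNR} \ge 1$, so both conditions can hold together only when the observed error rates sum to at least one.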

The condition in (5) turns out to be equivalent to the following odds ratio condition:

$$\frac{P(\hat{Y}=1 \mid Y^*=1, Y=0)\,/\,P(\hat{Y}=0 \mid Y^*=1, Y=0)}{P(\hat{Y}=1 \mid Y=1)\,/\,P(\hat{Y}=0 \mid Y=1)} \ge 1. \tag{6}$$

This condition tells us that (5) holds precisely when the odds of correctly classifying a hidden recidivist to $\hat{Y}=1$ are greater than the odds of correctly classifying an observed recidivist, which seems unlikely to hold in practice. A similar interpretation can be derived for $\mathrm{FPR}$: condition (4) holds when the odds of misclassifying a hidden recidivist to $\hat{Y}=0$ are higher than those of correctly classifying an observed non-recidivist.

(kallus2018residual obtain similar expressions in their study of “residual unfairness” in the context of a related data bias problem. They consider the setting where we fail to observe outcomes entirely for a fraction of the population, e.g., defendants who are not released on bail, and thus do not have the opportunity to recidivate. When viewed as functions of the underlying classification threshold, these odds ratios are interpreted in kallus2018residual as a type of stochastic dominance condition.)

Example: COMPAS. Conditions (4) and (5) in Corollary 3.2 require and respectively. Note, however, that both conditions cannot simultaneously hold, as formally shown in Proposition 3.1.

In practice, if the predicted risk for hidden recidivists was generally low, condition (6) would likely not hold. We would then have $\mathrm{FNR} < \mathrm{FNR}^*$ for the white population, which says that the true disparity between groups would be greater than the observed disparity.

### 3.2 Accuracy equity

In their response to the ProPublica investigation, dieterich2016compas demonstrated that COMPAS satisfies predictive parity (equality of predictive values across groups), and what they term accuracy equity (equality of $\mathrm{AUC}$). menon2015learning and jain2017recovering previously considered estimation of the AUC under label noise, but in the simpler setting of label-dependent noise. Here we obtain bounds for the true AUC in the general instance-dependent noise setting through its relation to the Mann-Whitney U-statistic.

Let $n_y$ denote the number of observations with outcome $Y = y$, and let $n = n_0 + n_1$. We will assume that there are $k$ hidden recidivists present in the observed data. Let $r_i$ denote the adjusted rank of observation $i$ when ordered in ascending order of the score $S$. (In the case of ties among the scores, the U-statistic is calculated using fractional ranks.) Lastly, let $R_y$ denote the sum of the ranks for observations in class $Y = y$. In this notation, the observed $\mathrm{AUC}$ of $S$ is given by

$$\mathrm{AUC} = \frac{R_1}{n_1 (n - n_1)} - \frac{n_1 + 1}{2 (n - n_1)} \tag{7}$$

Let $L_{0,k}$ denote the indexes of the $k$ lowest-ranked (i.e., lowest-scoring) observations in class $Y = 0$. Likewise, let $H_{0,k}$ denote the indexes of the $k$ highest-ranked (i.e., highest-scoring) observations in class $Y = 0$.

###### Proposition 3.3.

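In the presence of $k$ hidden recidivists, the target value $\mathrm{AUC}^*$ is bounded as follows:

$$\frac{R_1 + \sum_{i \in L_{0,k}} r_i - \beta_k}{(n_0 - k)(n_1 + k)} \;\le\; \mathrm{AUC}^* \;\le\; \frac{R_1 + \sum_{i \in H_{0,k}} r_i - \beta_k}{(n_0 - k)(n_1 + k)} \tag{8}$$

where $\beta_k = (n_1 + k)(n_1 + k + 1)/2$, the same rank-sum correction as in (7) applied to $n_1 + k$ recidivists.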

It is easy to see that the upper and lower bounds correspond to the settings where the hidden recidivists are, respectively, the highest and lowest scoring defendants with $Y = 0$. This result tells us, for instance, that if the hidden recidivists are more likely to have high scores, then the true $\mathrm{AUC}^*$ will be greater than the observed $\mathrm{AUC}$. One key difference between the AUC result and the previous analysis of error metrics is that now the impact of label noise depends on the ranks of the hidden recidivists, and not only on the dichotomized version of the risk score.
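The rank-based bounds can be reproduced by relabeling the extreme-scoring observed non-recidivists and recomputing the rank-sum AUC. The following is a minimal pure-Python sketch under that reading, not the authors' implementation. The function names and the toy scores are hypothetical, and the sketch assumes $0 \le k < n_0$ so that both classes remain non-empty.

```python
def average_ranks(scores):
    """Ascending ranks starting at 1; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_sum_auc(scores, labels):
    """AUC from the Mann-Whitney rank sum: (R1 - n1(n1+1)/2) / (n0 * n1)."""
    ranks = average_ranks(scores)
    n1 = sum(labels)
    n0 = len(labels) - n1
    r1 = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (r1 - n1 * (n1 + 1) / 2) / (n0 * n1)


def auc_bounds(scores, labels, k):
    """Lower/upper AUC* when the k lowest/highest-scoring observed negatives
    are assumed to be hidden recidivists."""
    negatives = sorted((i for i, y in enumerate(labels) if y == 0),
                       key=lambda i: scores[i])
    if not 0 <= k < len(negatives):
        raise ValueError("this sketch assumes 0 <= k < n0")

    def auc_after_flipping(indexes):
        relabeled = list(labels)
        for i in indexes:
            relabeled[i] = 1
        return rank_sum_auc(scores, relabeled)

    lower = auc_after_flipping(negatives[:k])
    upper = auc_after_flipping(negatives[len(negatives) - k:])
    return lower, upper


# Toy data (hypothetical): six defendants, three observed recidivists.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 0, 1, 0, 1, 1]
print(rank_sum_auc(scores, labels), auc_bounds(scores, labels, k=1))
```

Because the ranks depend only on the scores, relabeling changes only the rank sum and the class sizes, which is why the two extreme allocations give the endpoints.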

Example: COMPAS. The observed AUC for both the black and white defendant population is around . Evaluating the bounds from the proposition for the white population, we find that for and , the is bounded between and , respectively. These bounds are very wide, but they can be narrowed if we are willing to make further assumptions on the likely ranks of the hidden recidivists.

### 3.3 Calibration testing via logistic regression

One of the most common metrics for assessing predictive bias of RAI’s is a test of calibration or differential prediction (skeem2015risk). Formally, we say that a risk score $S$ is well-calibrated with respect to $A$ if

$$E[Y \mid S=s, A=w] = E[Y \mid S=s, A=b] \tag{9}$$

for all values of $s$. Since $Y$ is binary, this is equivalent to requiring that $Y \perp A \mid S$. Typically calibration is assessed by running a logistic regression of $Y$ on $S$ and $A$ and testing for statistical significance of the terms involving $A$ using a Wald or likelihood ratio test. Other covariates are occasionally also included in the regression. When the coefficients of $A$ are not statistically significant, $S$ is deemed to be well-calibrated with respect to $A$. This approach was taken by floresfalse to confirm racial calibration for the COMPAS RAI. Note that in the presence of TVB, such tests provide evidence that $S$ is well-calibrated as a predictor of $Y$ (rearrest). We wish to understand what this means about $S$ as a predictor of the true outcome $Y^*$ (reoffense). Our main result is as follows.

###### Proposition 3.4.

Under a mild technical assumption on the design matrix, for a logistic regression model in $S$ and $A$, for fixed $\alpha$, the bounds for the coefficients of $S$ and $A$ are achieved when the white defendants with the highest and lowest values of $S$ are hidden recidivists. (The explanation of the assumption is deferred to the proof in the Supplement. While the assumption needs to be empirically verified case by case, in the COMPAS dataset it holds at every level of $\alpha$ that we considered.)

This result allows us to answer the question: What level of label noise is sufficient to contradict the observed findings that an RAI is (or is not) well-calibrated across groups? We provide two illustrative examples, one where the RAI is observed to be well-calibrated as a predictor of arrest, and the other where it is not.

Example: COMPAS. Figure 2 (a) shows the feasible values for the coefficient of in the COMPAS data for . The green and red areas correspond, respectively, to regions where the race coefficient is not and is statistically significant. Recall that non-significance of the race coefficient indicates that the model is well-calibrated. In this analysis, we find that a TVB level as low as might be sufficient for COMPAS to fail the calibration test across all possible noise realizations of that magnitude. At a noise level of only , calibration might also fail for some noise realizations of this magnitude. Note that our analytic results present bounds not just on the race coefficient but also on the score coefficient in the model. We present the two-dimensional bounds for the COMPAS tool in Supplement §B.4.

Example: Sentencing commission. Figure 2 (b) shows the results of the same experiment on the sentencing commission (SC) data described in Section 2.1. In the absence of TVB, Figure 2 (b) shows that this tool, unlike the COMPAS RAI, is not observed to be well-calibrated across groups. Indeed, the coefficient for race is statistically significantly negative, indicating that the RAI overestimates risk for white offenders. Our analysis shows that TVB as low as is sufficient to admit calibration. More generally, we see that for calibration might be possible for some realizations of the noise process. For a larger magnitude of TVB, the coefficient might be significant and positive; in other words, it would be possible for the instrument to underestimate the reoffense risk for the white population.

### 3.4 Calibration testing via chi-squared test

We also consider the general test of conditional independence $Y \perp A \mid S$ in the setting where $S$ is either assumed to be discrete, or has been binned for the purpose of analysis. When $S$ is categorical, testing the saturated logistic model against the model that omits the race terms is precisely testing this conditional independence. This section thus extends the analysis from the previous section beyond the (likely misspecified) simple shift-alternative considered therein. There are several asymptotically equivalent tests that can be applied to test this hypothesis (hinkley1979theoretical). We use the Pearson chi-squared test, as it is the most straightforward to analyse.

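The general setup for assessing the sensitivity of the chi-squared conditional independence test to TVB is described by Table 2. Our goal is to understand the behavior of the chi-squared test statistic,

$$T(h) = \sum_{k=1}^{|S|} \sum_{a,y} \frac{\left(O^{(k)}_{ay} - E^{(k)}_{ay}\right)^2}{E^{(k)}_{ay}} \tag{10}$$

as a function of the hidden recidivist counts $h = (h_1, \dots, h_{|S|})$, where $h_k$ is the number of hidden recidivists in score level $k$. The notations $O^{(k)}_{ay}$ and $E^{(k)}_{ay}$ denote the “observed” and “expected” cell counts for calculating the chi-squared statistic. Expected counts are estimated from the data assuming the null hypothesis of conditional independence is true. These quantities evaluate to

$$O^{(k)}_{ay} = n^{(k)}_{ay} + h_k \mathbb{1}_{a=w}(2y-1), \qquad E^{(k)}_{ay} = \frac{\left(n^{(k)}_{wy} + n^{(k)}_{by} + (2y-1)h_k\right)\left(n^{(k)}_{a0} + n^{(k)}_{a1}\right)}{n^{(k)}}.$$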

The key observation is that, when viewed as a function of $h_k$, the numerator terms are convex quadratics in $h_k$, and the denominator terms are linear functions in $h_k$, constrained to be positive.

We address two basic questions: (1) When $S$ appears racially well-calibrated for the observed $Y$, how large would $\sum_k h_k$, the number of hidden recidivists, have to be for $S$ to fail the calibration test for $Y^*$? (2) When $S$ appears to misestimate risk for one racial group, how large would $\sum_k h_k$ have to be for $S$ to appear racially well-calibrated for $Y^*$? Answering (1) entails maximizing the test statistic over $h$ subject to a constraint on the total number of hidden recidivists. Answering (2) entails minimizing the test statistic. Note that each inner summand of equation (10) is a quadratic-over-linear function, which is strongly convex (boyd2004convex). The test statistic as a function of $h$ thus has the form $T(h) = \sum_k T_k(h_k)$, where each $T_k$ is a strongly convex function. Since $T$ is a strongly convex separable function of the $h_k$’s, the minimization can be performed with a numerical convex solver. Note that it is also straightforward to incorporate convex constraints into the optimization. The maximization task is a case of a separable nonlinear optimization problem, for which general tools exist. For our analysis we instead present a practical greedy algorithm in Supplement §B.1.4.
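To make the dependence of the test statistic on the hidden recidivist counts concrete, the following minimal Python sketch evaluates $T(h)$ from per-bin counts. It is an illustration under stated assumptions, not the greedy algorithm of Supplement §B.1.4: hidden recidivists are taken to occur only among white defendants with $Y = 0$ (Assumptions 1 and 2), every expected count is assumed positive, and the function name, the tuple layout, and the counts are hypothetical.

```python
def chi2_statistic(bins, h):
    """Pearson chi-squared statistic T(h) summed over score bins.

    bins: one tuple (n_w0, n_w1, n_b0, n_b1) of observed counts per score bin,
          indexed by group (w/b) and observed outcome (0/1).
    h:    number of white defendants per bin relabeled from Y = 0 to Y* = 1.
    """
    if len(bins) != len(h):
        raise ValueError("need one hidden-recidivist count per score bin")
    total = 0.0
    for (n_w0, n_w1, n_b0, n_b1), h_k in zip(bins, h):
        if not 0 <= h_k <= n_w0:
            raise ValueError("h_k must lie between 0 and n_w0")
        observed = {
            ("w", 0): n_w0 - h_k,
            ("w", 1): n_w1 + h_k,
            ("b", 0): n_b0,
            ("b", 1): n_b1,
        }
        n_k = n_w0 + n_w1 + n_b0 + n_b1
        for a in ("w", "b"):
            group_total = observed[(a, 0)] + observed[(a, 1)]  # unchanged by h_k
            for y in (0, 1):
                outcome_total = observed[("w", y)] + observed[("b", y)]
                expected = outcome_total * group_total / n_k
                total += (observed[(a, y)] - expected) ** 2 / expected
    return total


# Hypothetical single score bin in which white defendants are rearrested less often.
bins = [(30, 10, 20, 20)]
for h_k in (0, 5, 10):
    print(h_k, round(chi2_statistic(bins, [h_k]), 4))
```

In this toy bin the statistic shrinks as `h_k` grows toward the value that equalizes the two groups' rates, which mirrors question (2) above; question (1) corresponds to searching for allocations that push the statistic up instead.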

Example: COMPAS. Figure 3 (a) shows the observed recidivism rates for black and white defendants across the range of the COMPAS decile score. When we apply the chi-squared test to test for calibration, we find that the COMPAS instrument appears well-calibrated with respect to race. However, applying our method to maximize the test statistic, we find that the presence of just hidden recidivists is sufficient to break calibration. This is achieved when all hidden recidivists are located in score level 8. Looking at the data, this is unsurprising. Score level 8 already has the largest observed discrepancy with the black defendant recidivism rate. Pushing this discrepancy further will rapidly cause the test to reject. Figure 3(b) shows the minimal shift necessary to break calibration when we impose a proportionality constraint that prohibits allocations that concentrate too much on a single bin. Specifically, we require that . This ensures that the proportion of true recidivists that are hidden in any score bin is no greater than . For our experiment we take . Under this constraint, we find that are sufficient to break calibration. These are allocated as .

Example: Sentencing commission. The right panel of Figure 3 shows the observed recidivism rates for black and white defendants across the range of the decile score we constructed based on the sentencing commission data. Unlike in the COMPAS example, we find that the SC score shows clear evidence of poor calibration. The RAI overestimates risk of rearrest for white offenders relative to black offenders across the range of score levels. This effect is especially pronounced in the highest scores. Applying our method to minimize the test statistic, we find that just hidden recidivists are sufficient to achieve calibration. While this may seem like a large number, there are white offenders in the data, of which are observed to reoffend. Thus the minimizing allocation requires only that of all true recidivists go unobserved. The minimizing allocation, represented in the left panel of Figure 3, is .

## 4 Conclusion

When target variable bias is a concern, the sensitivity analysis framework presented in this paper can be used to quantify the level of bias sufficient to call into question conclusions about the fairness of a model obtained from biased observed data. In the sentencing commission example, for instance, we find that a small gap in the likelihood of arrest could fully account for the observed miscalibration. Such observations may help inform deliberations of whether to correct for observed predictive bias when doing so would further increase outcome disparities. Furthermore, as our reanalysis of the ProPublica COMPAS data shows, the racial disparity story goes deeper than an imbalance in observed recidivism rates. Even if offense rates are equal across groups, the disparities could be worse with respect to offense than what is observed for arrest.

The sensitivity analysis approach outlined in this work has generally avoided making assumptions about how the likelihood of getting caught might depend on observable features, at a cost of producing fairly wide bounds. Existing work on self-report studies, wrongful arrests, and wrongful convictions may provide some insight into reasonable structural assumptions that may be incorporated to further refine the analysis (huizinga1986reassessing; hindelang1979correlates; gilman2014understanding).

## Organization of the Supplement

• Section A (extension for Section 2):

• motivating examples;

• estimators for the noise under conditions stronger than assumption 1.

• Section B (extension for Section 3):

• omitted proofs for section 3;

• extension of results under conditions stronger than assumption 1;

• extension of results under relaxation of assumption 2;

• further experiments.

• Section C:

• experiments on error rate balance with fairness-promoting algorithms.

## Appendix A Extension for section 2

In this section, we use and to indicate and respectively. We also drop the dependency of on and, if assumption 1 is used, on .

### A.1 Who are the likely hidden recidivists?

In Section 3 we argued that the worst-case bounds in our sensitivity analysis occur when the hidden recidivists are either all in the low-risk bin () or all in the high-risk bin (). Here we present two thought experiments reflecting on assumption 1. We show that, generally speaking, one cannot rule out the “extreme” settings. Indeed, under assumption 1, the case is still possible.

Example 1 Suppose for instance that is a single binary covariate, , and . This gives and . If we set the classification threshold at , we would classify everyone with as high-risk and everyone with as low-risk. By construction, we have , meaning that all recidivists with are observed, whereas some fraction of recidivists with are hidden. This in turn means that all hidden recidivists are classified as high-risk (). A similar construction can be used to produce a case where , which corresponds to all hidden recidivists being classified as low-risk.

Example 2 The first example is admittedly highly contrived and unlikely to reflect any real world scenario. To model a more plausible scenario, we consider a setup in which we have a single feature , , and two forms for the likelihood of getting caught function:

$$\gamma_{\mathrm{Inc},b}(x) = 1 - \frac{(b+1)\,x}{1+bx}, \qquad \gamma_{\mathrm{Dec},b}(x) = 1 - \frac{(b+1)(1-x)}{1+b(1-x)}.$$

The “Increasing” setting is one where the likelihood of getting caught increases with the likelihood of reoffense , with the functional form of the relationship governed by the parameter . The “Decreasing” setting has the likelihood of getting caught decreasing with the likelihood of reoffense. We equalize the proportion of high-risk and low-risk cases by thresholding at its median value in each simulation. Figure 4 shows a plot of how the fraction of hidden recidivists that get classified as high-risk varies with . Values larger than on this plot can be interpreted as settings where ; a value of , though never achieved, would correspond to the case . This suggests that, in general, the hidden recidivists are likely to be scattered across the range of the score , and are thus unlikely to concentrate entirely in the extremes of . In other words, the worst-case bounds presented in Section 3 are, unsurprisingly, likely to be overly conservative.
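The simulation described in Example 2 can be sketched as follows. This is only an illustrative sketch: the distribution of the feature and the form of the reoffense probability are not recoverable from the text above, so the code assumes a feature uniform on [0, 1], a reoffense probability equal to the feature itself, and it reads γ(x) as the probability that a true recidivist with feature x goes unobserved (consistent with the decomposition m(x) = (1 − γ) m*(x) used in Section A.2). The function names and defaults are hypothetical.

```python
import numpy as np

def gamma_inc(x, b):
    """gamma_{Inc,b}(x) = 1 - (b+1)x / (1+bx)."""
    return 1.0 - (b + 1.0) * x / (1.0 + b * x)

def gamma_dec(x, b):
    """gamma_{Dec,b}(x) = 1 - (b+1)(1-x) / (1+b(1-x))."""
    return 1.0 - (b + 1.0) * (1.0 - x) / (1.0 + b * (1.0 - x))

def hidden_high_risk_fraction(gamma, b, n=200_000, seed=0):
    """Fraction of hidden recidivists that land in the high-risk half."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)       # ASSUMPTION: X ~ Uniform(0, 1)
    y_star = rng.uniform(size=n) < x        # ASSUMPTION: P(Y* = 1 | X = x) = x
    # ASSUMPTION: gamma(x, b) is the probability that a true recidivist is hidden
    hidden = y_star & (rng.uniform(size=n) < gamma(x, b))
    high_risk = x > np.median(x)            # threshold the score at its median
    return float(high_risk[hidden].mean())

frac_inc = hidden_high_risk_fraction(gamma_inc, b=0.0)
frac_dec = hidden_high_risk_fraction(gamma_dec, b=0.0)
```

Under these assumptions and b = 0, the density of hidden recidivists is proportional to x(1 − x) in the Increasing setting and to x² in the Decreasing setting, so the simulated fractions should be close to 1/2 and 7/8 respectively. Neither setting pushes the fraction to 0 or 1.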

### A.2 Estimation of noise

In Section 2 we argued that the assumption of constant noise is unrealistic in our setting. Indeed, in the introduction we cite the case of drug crimes (low-level offenses), where there appears to be an inconsistency in the number of arrests and users between the black and white populations; this fact might be attributed to differential policing. For other types of crimes we can imagine the effect of policing to be more similar across races. Although we suggest accounting for more complex forms of the noise, one may wish to perform a sensitivity analysis under stronger assumptions on the noise process, e.g., by assuming the noise to be independent of the features conditionally on the observed labels. The case of constant noise has been intensively studied during the past two decades and is fairly well understood. In this subsection we present a simple extension of this framework to account for noise that is constant within groups.

#### A.2.1 Estimation of one-sided label-dependent noise.

In the paper we work under the setup of assumption 1, that is of one-sided feature-dependent noise. Now, consider the following assumption.

###### Assumption 3.

.

Under assumptions 1 and 3 we refer to the noise as one-sided label-dependent. Since the noise rate is now constant, we drop the dependency on and rewrite .

We briefly describe three of the estimators for the noise rates commonly used in the literature. These estimators can be used for estimation of the noise rate in the setting of assumptions 1 and 3.

Estimator 1 The estimator proposed by (elkan2008learning) relies on the following assumption.

###### Assumption 4.

(strong separability) .

Then we have the following proposition.

###### Proposition A.1.

Under assumptions 1, 3, and 4, the following equality holds. For every ,

$$\gamma = 1 - m(x). \tag{11}$$

Proof of proposition A.1. Thanks to assumption 4, implies . Consequently we have for every . ∎

Estimators 2 and 3 rely on the following assumption.

###### Assumption 5.

(weak separability) .

Estimator 2 The following is also described in (elkan2008learning; liu2016classification; menon2015learning).

###### Proposition A.2.

Under assumptions 1, 3, and 5, the following equality holds.

$$\gamma = 1 - \sup_x m(x) \tag{12}$$

Proof of proposition A.2. Recall the decomposition . Then, thanks to assumption 5, we have

$$\sup_x m(x) = (1-\gamma)\,\sup_x m^*(x) = 1-\gamma \;\Longrightarrow\; \gamma = 1 - \sup_x m(x).$$

Consequently the rate of convergence for the estimation of coincides with the one for .
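A plug-in version of (12) can be illustrated on synthetic data. The sketch below is not the estimator used in the cited papers; it assumes a discrete feature with ten levels, a hypothetical m*(x) that reaches 1 at the top level (so that the separability condition holds), and a constant noise rate of 0.3. All names and values are made up for illustration.

```python
import numpy as np

def estimate_gamma(m_hat):
    """Plug-in version of (12): gamma_hat = 1 - max_x m_hat(x)."""
    return 1.0 - float(np.max(m_hat))

rng = np.random.default_rng(0)
n, n_levels, gamma_true = 200_000, 10, 0.3       # hypothetical settings

x = rng.integers(0, n_levels, size=n)            # discrete feature, levels 0..9
m_star = np.linspace(0.1, 1.0, n_levels)         # ASSUMPTION: m*(x), equal to 1 at the top level
y_star = rng.uniform(size=n) < m_star[x]         # true outcome Y*
# One-sided noise: a true positive is observed with probability 1 - gamma
y = y_star & (rng.uniform(size=n) >= gamma_true)

# Estimate m(x) = P(Y = 1 | X = x) by the observed rate at each level
m_hat = np.array([y[x == level].mean() for level in range(n_levels)])
gamma_hat = estimate_gamma(m_hat)
```

Because the estimate takes a maximum over noisy quantities, it tends to overshoot sup m(x) slightly in finite samples, which pushes the estimated noise rate downward; with the sample size used here the effect is small.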

Estimator 3 We define , the inverse noise rate, as

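$$\rho := \mathbb{E}[Y^* \mid Y=0] = \frac{\alpha}{1-\mathbb{E}[Y]} = \frac{\gamma}{1-\mathbb{E}[Y]}\,\mathbb{E}[Y^*] = \frac{\gamma}{1-\mathbb{E}[Y]}\cdot\frac{\mathbb{E}[Y]}{1-\gamma} = \frac{\gamma/(1-\gamma)}{(1-\mathbb{E}[Y])/\mathbb{E}[Y]}. \tag{13}$$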

Note that identifies , and vice versa. An estimator for has been proposed by (scott2009novelty; scott2013classification).
Let and denote the densities of conditional on and respectively. Under assumptions 1, 3, and 5,

$$\rho = \frac{\nu(q_0, q_1^*)\,\bigl(1 - \nu(q_1^*, q_0)\bigr)}{1 - \nu(q_1^*, q_0)\,\nu(q_0, q_1^*)} \tag{14}$$

where and . corresponds to the left-derivative of the optimal ROC curve (scott2009novelty). The optimal ROC curve is given by any scorer that is a strictly monotone transformation of (clemenccon2008ranking). In (scott2009novelty) the estimator is recovered under an assumption slightly weaker than assumption 5, which the authors call irreducibility; however, under this assumption, the convergence rate of the estimator is shown to be arbitrarily slow. (scott2015rate) introduces an assumption equivalent to assumption 5 that guarantees faster convergence rates.

It is clear that if assumption 5 does not hold, then the estimated noise rate is only upper bounded by , and consequently .
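The one-to-one relationship in (13) between the noise rate and the inverse noise rate can be made concrete with a small conversion helper. This is a sketch under the constant-noise setting of this subsection; the function names are hypothetical, and `mean_y` stands for the observed base rate E[Y].

```python
def rho_from_gamma(gamma, mean_y):
    """Inverse noise rate from (13): odds of gamma divided by the odds of Y = 0."""
    return (gamma / (1.0 - gamma)) / ((1.0 - mean_y) / mean_y)

def gamma_from_rho(rho, mean_y):
    """Solve (13) for gamma given rho and the observed base rate E[Y]."""
    return rho * (1.0 - mean_y) / (mean_y + rho * (1.0 - mean_y))

# Hypothetical values: 20% of true positives hidden, 40% observed base rate
rho = rho_from_gamma(0.2, 0.4)
gamma_back = gamma_from_rho(rho, 0.4)
```

The round trip returns the original noise rate, which is the sense in which either quantity identifies the other once E[Y] is known.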

#### A.2.2 Estimation of one-sided race- and one-sided label-dependent noise.

In our setting it is more reasonable to consider a noise process that depends on the race membership; indeed, the original motivation of our work was a concern regarding differential policing across races. To simplify notation, let ; similarly, . We formulate the following assumption.

###### Assumption 6.

.

The unconditional version of assumption 6 is clearly assumption 5. The following proposition can be interpreted as a generalization of proposition A.2.

###### Proposition A.3.

Under assumptions 1, 3, and 6, the following equality holds.

$$\gamma_a = 1 - \sup_x m_a(x) \quad \forall a \in \{b, w\}. \tag{15}$$

Again, the convergence rate of the estimator of is identical to the one of the estimator of .
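In practice, (15) amounts to applying the plug-in estimator separately within each group. The helper below is a minimal sketch: `m_hat` is assumed to hold fitted probabilities of the observed label from some classifier, `group` holds the group labels, and the numbers are made up purely to show the mechanics.

```python
import numpy as np

def groupwise_gamma(m_hat, group):
    """Plug-in version of (15): gamma_a = 1 - max over group a of m_hat."""
    m_hat = np.asarray(m_hat, dtype=float)
    group = np.asarray(group)
    return {str(a): 1.0 - float(m_hat[group == a].max()) for a in np.unique(group)}

# Hypothetical fitted probabilities for six individuals in two groups
m_hat = [0.20, 0.55, 0.80, 0.10, 0.40, 0.65]
group = ["b", "b", "b", "w", "w", "w"]
estimates = groupwise_gamma(m_hat, group)
```

The estimate for each group depends only on the largest fitted probability in that group, so it is sensitive to how well the classifier is calibrated at the top of its range.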

If race-specific classifiers are trained, then this framework inherits all the results from the label-dependent noise literature. If instead a single classifier is trained, with race included in the feature set, then some of the results for model training and label correction can be adapted to this setting.

We now estimate the values of on COMPAS data in the setting of assumptions 1, 3, and 6. We fit one classifier for each race group and tune the parameters via cross-validation on the training set. We use extreme gradient boosted trees (xgboost) (chen2016xgboost), logistic regression (glmnet), k-nearest neighbors (knn), and support vector machines (svm). The resulting scores are thresholded at according to the Bayes decision rule, and the accuracy on the test set is approximately 66% for all models and both races. The results of the estimation for estimators 1 and 2, with corresponding standard deviations, are reported in Table 3. Not surprisingly, the noise parameter for the white population is higher than that for the black population across all models. This result is a consequence of violations of the assumptions, which are unlikely to hold in practice, and of the poor performance of the models.

## Appendix B Extension for Section 3

### B.1 Omitted proofs

#### B.1.1 Error rates and predictive parity

Proof of proposition 3.1. Assume that . We now show by contradiction that and cannot hold together. Indeed, the following two equivalences hold

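$$\mathrm{FNR} \le \mathrm{FNR}^* \iff \mathrm{FNR} \le \alpha_0/\alpha, \qquad \mathrm{FPR} \ge \mathrm{FPR}^* \iff 1 - \mathrm{FPR} \ge \alpha_0/\alpha$$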

thanks to corollary 3.2.1. It follows that , which is a contradiction.
The proof for the other case is analogous. ∎
Figure 5 provides a visual interpretation of the result.

Proof of theorem 3.2.1. Recall the following notation: .

• Proof of inequality (1). can be rewritten as

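$$\mathbb{E}[\hat{Y} \mid Y^*=0]\,P(Y^*=0 \mid Y=0) + \mathbb{E}[\hat{Y} \mid Y=0, Y^*=1]\,P(Y^*=1 \mid Y=0) = \mathrm{FPR}^*\bigl(1 - \mathbb{E}[Y^* \mid Y=0]\bigr) + \frac{\alpha_1}{\alpha}\,\mathbb{E}[Y^* \mid Y=0]$$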

thanks to the law of total probability, Bayes' theorem, and assumption 1, in sequence. Therefore is a convex combination of and . Rearranging the terms we obtain

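$$\mathrm{FPR}^* = \frac{\mathrm{FPR} - \frac{\alpha_1}{\alpha}\,\mathbb{E}[Y^* \mid Y=0]}{1 - \mathbb{E}[Y^* \mid Y=0]} = \frac{p_{01} - \alpha_1}{p_{00} + p_{01} - \alpha}$$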

For fixed , we obtain

$$\frac{p_{01} - \alpha}{p_{00} + p_{01} - \alpha} \le \mathrm{FPR}^* \le \frac{p_{01}}{p_{00} + p_{01} - \alpha}.$$

• Proof of inequality (2). can be rewritten as

$$P(\hat{Y}=0 \mid Y=1)\,P(Y=1 \mid Y^*=1) + P(\hat{Y}=0 \mid Y=0, Y^*=1)\,P(Y=0 \mid Y^*=1).$$

Then we have

$$\mathrm{FNR}^* = \mathrm{FNR}\,\mathbb{E}[Y \mid Y^*=1] + \frac{\alpha_0}{\alpha}\bigl(1 - \mathbb{E}[Y \mid Y^*=1]\bigr)$$

which is derived as above. The bounds then follow by the same strategy as in the proof of inequality (1).

• Proof of inequality (3). can be rewritten as

$$\mathrm{PPV} + \mathbb{E}[Y^*(1-Y) \mid \hat{Y}=1] = \mathrm{PPV} + \frac{\alpha_1}{p_{01} + p_{11}}.$$

Then, since the second term on the RHS is greater than or equal to zero, the lower and upper bounds for will be given by and respectively. ∎
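The three sets of bounds derived in this proof can be computed directly from the observed confusion matrix. The sketch below assumes that p_{yŷ} denotes the joint probability P(Y = y, Ŷ = ŷ), that α is the overall proportion of hidden recidivists, and that α does not exceed either p00 or p01, so that all hidden recidivists can sit in a single predicted class. The bounds for the false negative rate and the positive predictive value are obtained by letting α0 and α1 range over [0, α] in the expressions above; the function name and the example numbers are hypothetical.

```python
def true_rate_bounds(p00, p01, p10, p11, alpha):
    """Bounds on FPR*, FNR*, PPV* for a fixed proportion alpha of hidden recidivists.

    p_yk = P(Y = y, Yhat = k) for the observed label Y and prediction Yhat.
    ASSUMPTION: 0 <= alpha <= min(p00, p01).
    """
    if not 0.0 <= alpha <= min(p00, p01):
        raise ValueError("alpha must lie in [0, min(p00, p01)]")
    neg_star = p00 + p01 - alpha   # P(Y* = 0)
    pos_star = p10 + p11 + alpha   # P(Y* = 1)
    pred_pos = p01 + p11           # P(Yhat = 1)
    return {
        # alpha_1 = alpha gives the lower bound, alpha_1 = 0 the upper bound
        "FPR*": ((p01 - alpha) / neg_star, p01 / neg_star),
        # alpha_0 = 0 gives the lower bound, alpha_0 = alpha the upper bound
        "FNR*": (p10 / pos_star, (p10 + alpha) / pos_star),
        # alpha_1 = 0 gives the lower bound, alpha_1 = alpha the upper bound
        "PPV*": (p11 / pred_pos, (p11 + alpha) / pred_pos),
    }

# Hypothetical confusion matrix with 5% hidden recidivists
bounds = true_rate_bounds(p00=0.50, p01=0.10, p10=0.15, p11=0.25, alpha=0.05)
```

In this made-up example the observed false positive rate is 0.1/0.6 ≈ 0.167, while the rate with respect to the true outcome can lie anywhere between roughly 0.091 and 0.182 depending on where the hidden recidivists fall.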

Proof of corollary (3.2.1). Let us first prove equivalence (5).

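$$\mathrm{FNR} \ge \mathrm{FNR}^* \iff \frac{p_{10}}{p_{10} + p_{11}} \ge \frac{p_{10} + \alpha_0}{p_{10} + p_{11} + \alpha} \iff \mathrm{FNR} \ge \frac{\alpha_0}{\alpha}.$$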

The proof of equivalence (4) for is similar.

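$$\mathrm{FPR} \le \mathrm{FPR}^* \iff \frac{p_{01}}{p_{00} + p_{01}} \le \frac{p_{01} - \alpha_1}{p_{00} + p_{01} - \alpha} \iff \mathrm{FPR} \ge \frac{\alpha_1}{\alpha}.$$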

Derivation of (6). Let us start with the case of . If the condition in (5) holds, then we have

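$$\frac{p_{10}}{p_{10} + p_{11}} \ge \frac{\alpha_0}{\alpha_0 + \alpha_1} \iff \alpha_1 p_{10} \ge \alpha_0 p_{11} \iff \frac{\alpha_1/\alpha_0}{p_{11}/p_{10}} \ge 1$$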

where we used Bayes theorem and law of total probability in sequence.
The odds ratio for can be derived in a similar manner. For the equivalence in (4) to hold we need

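$$\frac{p_{01}}{p_{00} + p_{01}} \ge \frac{\alpha_1}{\alpha_0 + \alpha_1} \iff \alpha_0 p_{01} \ge \alpha_1 p_{00} \iff \frac{\alpha_0/\alpha_1}{p_{00}/p_{01}} \ge 1$$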

where we used, again, Bayes theorem and law of total probability. ∎
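The middle step of the equivalence for the false negative rate is a cross-multiplication. It can be spelled out with the shorthand a = p10 and b = p10 + p11, introduced here only for this display, and assuming that α > 0 and that all denominators are positive:

```latex
\begin{align*}
\frac{a}{b} \ge \frac{a + \alpha_0}{b + \alpha}
  &\iff a\,(b + \alpha) \ge b\,(a + \alpha_0) \\
  &\iff a\,\alpha \ge b\,\alpha_0 \\
  &\iff \frac{a}{b} \ge \frac{\alpha_0}{\alpha}.
\end{align*}
```

The equivalence for the false positive rate follows from the same manipulation with a = p01, b = p00 + p01, and the signs of the added terms reversed.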

#### B.1.2 Accuracy Equity

Proof of proposition  3.3. The Mann-Whitney U statistic can be computed according to

$$U_1 := R_1 - \frac{n_1(n_1+1)}{2} = \sum_{i:\,Y_i=1} r_i - \frac{n_1(n_1+1)}{2}$$

where are the adjusted ranks. We can calculate the AUC of as a classifier of from through the expression:

$$\mathrm{AUC}^* = \frac{U_1}{n_1 n_0} = \frac{R_1}{n_1(n - n_1)} - \frac{n_1 + 1}{2(n - n_1)}$$

Now suppose that observations are unobserved recidivists. It is clear that the lower (upper) bound can be found by assuming observations corresponding to the lowest (highest) ranks such that to be recidivists; this provides the sharp bound in the proposition. This is in turn lower (upper) bounded by the case where the lowest (highest) ranks overall correspond to unobserved recidivists: for the lower bound, , while for the upper bound, . ∎
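The rank-based construction in this proof can be sketched numerically. The code below computes the AUC from the Mann-Whitney statistic with midranks, then relabels the k lowest-scored (respectively highest-scored) observations with observed label 0 as recidivists. Reading the relabeled set as the observations with Y = 0 is an interpretation of the argument above, and the function names and the toy data are hypothetical.

```python
import numpy as np

def midranks(scores):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    s = np.asarray(scores, dtype=float)
    less = (s[:, None] > s[None, :]).sum(axis=1)
    equal = (s[:, None] == s[None, :]).sum(axis=1)  # includes the point itself
    return less + (equal + 1) / 2.0

def auc_mann_whitney(scores, labels):
    """AUC = U1 / (n1 * n0), with U1 = R1 - n1 (n1 + 1) / 2."""
    y = np.asarray(labels).astype(bool)
    n1 = int(y.sum())
    n0 = y.size - n1
    r1 = midranks(scores)[y].sum()
    return float((r1 - n1 * (n1 + 1) / 2.0) / (n1 * n0))

def auc_bounds(scores, labels, k):
    """Relabel the k lowest / highest scored observed negatives as positives."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels).astype(int)
    neg = np.flatnonzero(y == 0)
    if not 0 <= k < neg.size:
        raise ValueError("k must be smaller than the number of observed negatives")
    order = neg[np.argsort(s[neg], kind="stable")]  # observed negatives, by score
    low, high = y.copy(), y.copy()
    low[order[:k]] = 1                 # hidden recidivists at the lowest scores
    high[order[order.size - k:]] = 1   # hidden recidivists at the highest scores
    return auc_mann_whitney(s, low), auc_mann_whitney(s, high)

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
observed = auc_mann_whitney(scores, labels)
lower, upper = auc_bounds(scores, labels, k=1)
```

On this toy example a single hidden recidivist is enough to move the AUC from 13/16 down to 9/15 or up to 14/15, depending on where that individual sits in the score ranking.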

#### B.1.3 Calibration via logistic regression

Proof of proposition 3.4. For a fixed set proportion of hidden recidivists , we aim to prove that the bounds for the coefficient of race are achieved in the settings and .

Consider the random variables

where , , and with . Let . Consider the observations such that for , that is the observations are ordered increasingly according to the realizations of . Let be the MLE of the log-likelihood

$$\ell(\beta \mid y) = \sum_{i=1}^{n} y_i \log \sigma_\beta(x_i) + (1 - y_i)\log\bigl(1 - \sigma_\beta(x_i)\bigr) \tag{16}$$

where

$$\sigma_\beta(x) := \frac{1}{1 + e^{-\beta^T x}}.$$
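For reference, the log-likelihood in (16) and the logistic function can be evaluated as in the sketch below. It uses the algebraically equivalent form y·z − log(1 + e^z) with z = βᵀx, which avoids taking the logarithm of a probability that has underflowed to zero. The design matrix, labels, and coefficients are made-up values.

```python
import numpy as np

def sigma(beta, X):
    """Logistic function sigma_beta(x) = 1 / (1 + exp(-beta^T x)), row-wise."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def log_likelihood(beta, X, y):
    """Log-likelihood (16), written as sum_i [ y_i z_i - log(1 + exp(z_i)) ]."""
    z = X @ beta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

# Hypothetical data: an intercept column and one feature
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
beta = np.array([0.3, -0.7])

p = sigma(beta, X)
direct = float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
stable = log_likelihood(beta, X, y)
log_odds = np.log(p / (1.0 - p))   # equals X @ beta
```

The two evaluations agree on this example, and the log-odds of the fitted probabilities recover the linear predictor βᵀx.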

Logistic regression aims at minimizing the negative log-likelihood in (16).
Consider two indices , , such that but . Now let be such that ; and . We are interested in the MLE for . Consider a second-order Taylor expansion of around :

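$$\ell(\beta \mid y^*) \approx \ell(\beta^\dagger \mid y^*) + (\beta - \beta^\dagger)^T \nabla \ell(\beta^\dagger \mid y^*) + \frac{1}{2}(\beta - \beta^\dagger)^T \nabla^2 \ell(\beta^\dagger \mid y^*)\,(\beta - \beta^\dagger).$$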

Note that since can be rewritten as

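$$\ell(\beta \mid y^*) = \ell(\beta \mid y) + \log\frac{\sigma_\beta(x_h)}{1 - \sigma_\beta(x_h)} - \log\frac{\sigma_\beta(x_l)}{1 - \sigma_\beta(x_l)}$$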

where

$$\log\frac{\sigma_\beta(x)}{1 - \sigma_\beta(x)} = \log \exp\{\beta^T x\} = \beta^T x$$

and thanks to the fact that the score evaluated at the MLE is zero. If we consider the problem of minimizing the negative log-likelihood, the Hessian is positive definite, and consequently its determinant is positive. We are interested in the direction of the search for . The minimizer of the Taylor expansion above for the negative log-likelihood with respect to is ). The Hessian is given by where for . Therefore we have

 ∇2(−ℓ(β†|y∗))=XTDX=⎡⎢ ⎢⎣∑i∈Psi∑i∈Psixi,2∑i∈Wsi∑i∈Pxi,2si∑i∈Psix2i,2∑i∈Wsixi,2∑i∈Ws