1 Introduction
Machine learning algorithms are increasingly used by employers, judges, lenders, and other experts to guide high-stakes decisions. These algorithms, for example, can be used to help determine which job candidates are interviewed, which defendants are required to pay bail, and which loan applicants are granted credit. At a high level, creating such algorithmic policies is straightforward. First, based on detailed individual-level data on past decisions and outcomes, one trains a statistical model to estimate the likelihood of key events under the possible actions one could take. For example, based on credit history, one might estimate an applicant’s risk of default if a loan is granted; or based on criminal history, one might estimate a defendant’s likelihood of appearing at trial if bail is set or, alternatively, if the defendant is released without requiring payment. Second, with these individual-level estimates in hand, one codifies a decision rule that balances competing interests. For example, one might grant loans only to those deemed most creditworthy, subject to limits on available funds; or one might require bail only from defendants at greatest risk of skipping trial if released, setting the risk threshold to balance flight risk against the social and financial burdens of bail on defendants.
This basic strategy rests on the ability to accurately estimate counterfactual outcomes. In reality, however, such predictions can be difficult. For example, in the judicial context we only observe whether or not a particular defendant failed to appear at trial given the action the judge actually took (i.e., requiring bail or not); we do not observe what would have happened under the alternative judicial action. If judges have information about defendants that is not recorded in the data, and if this information correlates both with a judge’s decision and a defendant’s flight risk, estimates of risk will in general be biased. In statistical terms, there may be unmeasured confounding, and, as a result, we cannot ignore the process by which individuals are assigned to treatment.
This fundamental limitation has two immediate consequences. First, without accurate estimates of potential outcomes, one cannot in general construct optimal algorithmic policies. Second, and perhaps even more importantly, one cannot precisely forecast the effects of deploying a given policy. In particular, poor counterfactual estimates may lead lenders to take on unexpectedly high levels of credit risk, or courts to take on unexpectedly high levels of flight risk. For organizations to accurately evaluate the implications of deploying decision algorithms, it is thus important to gauge the sensitivity of forecasted effects to unmeasured confounding.
We make three key contributions in this article. First, we develop a flexible Bayesian method to evaluate decision algorithms in the presence of unmeasured confounding. Second, we show that the policy evaluation problem we consider here is a generalization of estimating heterogeneous treatment effects in observational studies (a problem of independent interest), and we adapt our sensitivity analysis approach to that question. Third, we show that in certain common scenarios, one can construct near-optimal decision algorithms even if there is unmeasured confounding. In particular, it is often possible to accurately rank individuals by risk (for example, risk of default or risk of skipping trial) even if one cannot perfectly estimate the true risk level itself.
To illustrate our methods, we consider judicial bail decisions, starting with a detailed dataset of over 150,000 judgments in a large, urban jurisdiction. Based on this information, we create realistic synthetic datasets for which potential outcomes and selection mechanisms are, by construction, fully known. We apply our methods to these synthetic datasets, selectively masking potential outcomes and the data-generation process to mimic real-world applications. Under a variety of conditions, we find that our approach accurately recovers true policy effects and subgroup-level treatment effects, despite the presence of unmeasured confounding. In these simulations, our approach outperforms classical, non-Bayesian sensitivity analysis techniques that recently have been adapted to the algorithmic decision-making context (Rosenbaum and Rubin, 1983; Jung et al., 2017). When applied to the special case of estimating average treatment effects, our general method compares favorably to existing, state-of-the-art approaches that are tailored to that specific problem. Finally, we show that on these synthetic datasets, we can construct near-optimal decision algorithms in the presence of unmeasured confounding.
1.1 Related work
Our work touches on three related but distinct lines of research (algorithmic decision making, offline policy evaluation, and sensitivity analysis), which we briefly discuss here.
In field settings, decision makers often choose a course of action based on experience and intuition rather than on statistical analysis (Klein, 2017). This includes doctors classifying patients based on their symptoms (McDonald, 1996), judges setting bail amounts (Dhami, 2003) or making parole decisions (Danziger et al., 2011), and managers determining which ventures will succeed (Åstebro and Elhedhli, 2006) or which customers to target (Wübben and Wangenheim, 2008). Despite the prevalence of this approach, a large body of work shows that in many domains intuitive inferences are inferior to those based on statistical models (Meehl, 1954; Dawes, 1979; Dawes et al., 1989; Camerer and Johnson, 1997; Tetlock, 2005; Kleinberg et al., 2017). As such, algorithmic decision aids are now being used in diverse settings, including law enforcement, education, employment, and medicine (Barocas and Selbst, 2016; Berk, 2012; Chouldechova et al., 2018; Goel et al., 2016a, b; Zeng et al., 2017). In particular, applications of risk assessment algorithms have a long history in criminal justice, dating back to parole decisions in the 1920s (Burgess, 1928). Several empirical studies have measured the effects of adopting such decision aids. For example, in a randomized controlled trial, the Philadelphia Adult Probation and Parole Department evaluated the effectiveness of a risk assessment tool developed by Berk et al. (2009), and found the tool reduced the burden on parolees without significantly increasing rates of reoffense (Ahlman and Kurtz, 2009).
When determining whether to adopt a decision algorithm, it is important to anticipate its likely effects using only historical data, as ex post evaluation may be expensive or otherwise impractical. This general problem is known as offline policy evaluation in the machine learning community. One approach to this problem is to first assume that treatment assignment is ignorable given the observed covariates, after which a variety of modeling techniques are theoretically justified, including regression, matching, and doubly robust estimation (Dudík et al., 2011; Zhang et al., 2012, 2015; Athey and Wager, 2017). A related issue, sometimes referred to as optimal treatment assignment, is to select an optimal policy for the task at hand, given a method of evaluation and a set of candidate policies (Gail and Simon, 1985; Manski, 2004; Dehejia, 2005; Chamberlain, 2011).
As the assumption of ignorability is often unrealistic in practice, sensitivity analysis methods seek to gauge the effects of unmeasured confounding on predicted outcomes. The literature on sensitivity analysis dates back at least to the work of Cornfield et al. (1959) on the link between smoking and lung cancer. In a seminal paper, Rosenbaum and Rubin (1983) propose a framework for sensitivity analysis in which a binary unmeasured confounder affects both a binary response variable and a binary treatment decision. By sweeping over various plausible values for the strength of the assumed relationships, one obtains a corresponding range of estimated effects. Some extensions of the Rosenbaum and Rubin approach to sensitivity analysis include allowing for non-binary response variables, and incorporating machine learning methods into the estimation process (Imbens, 2003; Carnegie et al., 2016; Dorie et al., 2016). A complementary line of work extends classical sensitivity analysis by taking a Bayesian perspective and averaging, rather than sweeping, over values of the sensitivity parameters (McCandless et al., 2007; McCandless and Gustafson, 2017). In this setup, even a weakly informative prior distribution over the sensitivity parameters can exclude estimates that are not supported by the data. Jung et al. (2017) have recently extended the Rosenbaum and Rubin sensitivity framework to offline policy evaluation, though we do not know of any existing Bayesian approaches to the problem.
2 Bayesian sensitivity analysis for policy evaluation
2.1 Preliminaries and problem definition
A common goal of causal inference is to estimate the average effect of a binary treatment $T \in \{0, 1\}$ on a response $Y$. That is, one often seeks to estimate $\mathbb{E}[Y(1) - Y(0)]$, where $Y(0)$ and $Y(1)$ denote the potential outcomes under the two possible treatment values. Here we consider a variant of this problem that arises in many applied settings. Namely, given a policy $\pi$ that assigns treatment $\pi(X) \in \{0, 1\}$ based on individual characteristics $X$, we seek to estimate the average response $V^{\pi} = \mathbb{E}[Y(\pi(X))]$. In our judicial bail application (described in detail in Section 3), $\pi$ is an algorithmic rule that determines which defendants are released on their own recognizance (RoR) and which are required to pay bail, $Y$ indicates whether a defendant fails to appear at trial, $X$ is a vector of observed covariates, and $V^{\pi}$ is the expected proportion of defendants who fail to appear in court when the rule is followed.
Although $V^{\pi}$ is not a classical causal effect, it is closely related to several standard estimands in causal inference. To see this, first define the subgroup average treatment effect for a group $G$ to be $\tau_G = \mathbb{E}[Y(1) - Y(0) \mid X \in G]$. Next, define the policy $\pi_G$ as
$$\pi_G(x) = \begin{cases} 1 & \text{if } x \in G, \\ 0 & \text{otherwise.} \end{cases}$$
In particular, $\pi_{\emptyset}(x) = 0$ for all $x$. Let $\mathbb{1}(\cdot)$ be an indicator function evaluating to 1 if its argument is true and to 0 otherwise. Then,
$$V^{\pi_G} - V^{\pi_{\emptyset}} = \mathbb{E}\big[(Y(1) - Y(0))\,\mathbb{1}(X \in G)\big] = \tau_G \Pr(X \in G), \quad \text{so that} \quad \tau_G = \frac{V^{\pi_G} - V^{\pi_{\emptyset}}}{\Pr(X \in G)}.$$
Hence, if one can estimate $V^{\pi}$ for arbitrary policies $\pi$, then one can estimate subgroup average treatment effects $\tau_G$ for arbitrary groups $G$. The case $G = \{x\}$ corresponds to the conditional average treatment effect, and the case where $G$ is the full covariate space corresponds to the average treatment effect. We can thus think of policy evaluation as a generalization of estimating heterogeneous treatment effects.
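The relationship between policy values and subgroup effects can be checked numerically. The following is a minimal sketch on a toy dataset in which both potential outcomes are known; the data, the function names `policy_value` and `subgroup_effect`, and the "under 30" group are illustrative assumptions, not part of our analysis.

```python
def policy_value(policy, xs, y0, y1):
    """Average response when each individual receives the action policy(x).

    y0[i] and y1[i] are the (known, toy) potential outcomes of individual i.
    """
    total = 0
    for x, out0, out1 in zip(xs, y0, y1):
        total += out1 if policy(x) == 1 else out0
    return total / len(xs)


def subgroup_effect(in_group, xs, y0, y1):
    """Subgroup average treatment effect recovered from two policy values."""
    treat_group_only = lambda x: 1 if in_group(x) else 0  # policy pi_G
    treat_nobody = lambda x: 0                            # policy pi_empty
    share = sum(1 for x in xs if in_group(x)) / len(xs)   # Pr(X in G)
    diff = policy_value(treat_group_only, xs, y0, y1) - policy_value(
        treat_nobody, xs, y0, y1
    )
    return diff / share


# Hypothetical toy data: x is age, outcomes are binary.
ages = [20, 25, 40, 50]
y0 = [1, 1, 0, 1]
y1 = [0, 1, 0, 0]

under_30 = lambda age: age < 30
tau_from_policies = subgroup_effect(under_30, ages, y0, y1)

# Direct computation of the same subgroup effect, for comparison.
members = [i for i, age in enumerate(ages) if under_30(age)]
tau_direct = sum(y1[i] - y0[i] for i in members) / len(members)
```

On this toy example the two quantities agree, which is the identity stated above applied to an empirical distribution.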
2.2 Policy evaluation without unmeasured confounding
To motivate the standard approach for estimating $V^{\pi}$, we decompose it into the sum of two terms:
$$V^{\pi} = \mathbb{E}\big[Y(T)\,\mathbb{1}(\pi(X) = T)\big] + \mathbb{E}\big[Y(1 - T)\,\mathbb{1}(\pi(X) \neq T)\big].$$
The proposed and observed policies are the same for the first term but differ for the second term. Thus estimating the first term is straightforward, since we directly observe the outcomes for this group, $Y(T)$. The key statistical difficulty is estimating the second term, for which we must impute the missing potential outcomes, $Y(1 - T)$. That is, in these cases we must estimate what would have happened had the recommendation been followed, rather than the observed action. This challenge is known as the fundamental problem of causal inference (Holland, 1986).
To estimate $V^{\pi}$, it is common to start with a sample of historical data $\{(x_i, t_i, y_i)\}_{i=1}^{n}$ on individual covariates, treatment decisions, and observed outcomes. We can then estimate $V^{\pi}$ via:
$$\hat{V}^{\pi} = \frac{1}{n} \sum_{i=1}^{n} \Big[ y_i\,\mathbb{1}(\pi(x_i) = t_i) + \hat{y}_i(1 - t_i)\,\mathbb{1}(\pi(x_i) \neq t_i) \Big], \tag{1}$$
where $y_i = y_i(t_i)$ is the observed outcome for individual $i$ when the proposed and observed policies agree, and $\hat{y}_i(1 - t_i)$ is an estimate of the unobserved potential outcome when they differ. A simple approach to estimating the unobserved potential outcomes in the second term is to directly model the outcomes conditional on treatment and observed covariates, sometimes referred to as response surface modeling (e.g., Hill, 2012). In the case of pretrial detention, we can fit logistic regressions that estimate a defendant’s likelihood of failing to appear at court when either released (RoR) or detained (set bail), conditional on the available information. The unobserved potential outcome is then imputed via the corresponding fitted model.
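The direct modeling estimator of Eq. (1) can be sketched in a few lines. This is an illustration under stated assumptions rather than our implementation: the fitted response models are passed in as plain functions (`mu0_hat` for release, `mu1_hat` for bail), and the toy covariate, decisions, outcomes, and model forms below are hypothetical.

```python
def estimate_policy_value(policy, xs, ts, ys, mu0_hat, mu1_hat):
    """Plug-in estimate of the average outcome under `policy`.

    Where the policy agrees with the historical decision t, the observed
    outcome y is used; where it disagrees, the missing potential outcome is
    imputed from the fitted response model for the recommended action.
    """
    total = 0.0
    for x, t, y in zip(xs, ts, ys):
        action = policy(x)
        if action == t:
            total += y              # outcome under the policy was observed
        elif action == 1:
            total += mu1_hat(x)     # impute outcome under treatment (bail)
        else:
            total += mu0_hat(x)     # impute outcome under control (release)
    return total / len(xs)


# Hypothetical toy inputs: x is a one-dimensional risk-like covariate.
xs = [0.05, 0.20, 0.30, 0.08]
ts = [0, 1, 0, 1]                   # historical decisions (1 = bail set)
ys = [0, 0, 1, 0]                   # observed outcomes (1 = failed to appear)
mu0_hat = lambda x: x               # assumed fitted model: risk if released
mu1_hat = lambda x: 0.5 * x         # assumed fitted model: risk if bail set
policy = lambda x: 1 if x >= 0.10 else 0

v_hat = estimate_policy_value(policy, xs, ts, ys, mu0_hat, mu1_hat)
```

Note that this estimator inherits any bias in `mu0_hat` and `mu1_hat`, which is exactly the concern when treatment assignment depends on unrecorded information.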
This direct modeling approach implicitly assumes that the treatment is ignorable given the observed covariates (i.e., that there is no unmeasured confounding). Formally, ignorability means that
$$\big(Y(0), Y(1)\big) \perp\!\!\!\perp T \mid X. \tag{2}$$
In other words, conditional on the observed covariates, those who receive treatment are similar to those who do not. In the pretrial context, ignorability means that conditional on the observed covariates, those who are RoR’d are similar to those who are not. In particular, this assumption excludes the possibility of judges having access to information not recorded in the data—such as a defendant’s demeanor—that affects both the judge’s decision and the defendant’s likelihood of appearing at trial. In many situations, ignorability is a strong assumption. The main contribution of this paper is a Bayesian approach to assessing the sensitivity of policy outcomes to violations of ignorability, which we present next.
2.3 Policy evaluation with unmeasured confounding
When ignorability does not hold, the resulting estimates of the policy value $V^{\pi}$ can be strongly biased. To address this issue, the sensitivity literature typically starts by assuming that there is an unmeasured confounder $U$ for each individual such that ignorability holds given both $X$ and $U$:
$$\big(Y(0), Y(1)\big) \perp\!\!\!\perp T \mid X, U. \tag{3}$$
In the pretrial setting, $U$ is meant to summarize all relevant information observed by a judge when making a decision but not recorded in the available data. One then generally imposes additional structural assumptions on the form of $U$ and its relationship to decisions and outcomes. We follow this basic template to obtain sensitivity estimates for policy outcomes.
At a high level, we model the observed data as draws from parametric distributions depending on the measured covariates $x_i$ and the unmeasured, latent covariates $u_i$:
$$y_i(0) \sim f(x_i, u_i; \alpha), \qquad y_i(1) \sim g(x_i, u_i; \beta), \qquad t_i \sim h(x_i, u_i; \gamma),$$
where $\alpha$, $\beta$, $\gamma$, and $u$ are latent parameters with weakly informative priors. The inferred joint posterior distribution on these parameters then yields estimates of $y_i(0)$ and $y_i(1)$ for each individual $i$. Finally, the posterior estimates of the potential outcomes yield a posterior estimate of $V^{\pi}$ via Eq. (1) that accounts for unmeasured confounding.
Our strategy, while straightforward, differs in three important ways from classical sensitivity methods. First, we directly model the treatment and potential outcomes, allowing for flexibility in the functional forms of the outcome distributions $f$ and $g$ and the treatment distribution $h$. This flexibility is enabled by recent computational advances in black-box inference for generative models (Carpenter et al., 2016); in the past, it was often necessary to analytically derive and maximize the corresponding likelihood function, significantly constraining model structure. Second, our Bayesian approach automatically excludes values of the latent variables that are not supported by the data (McCandless and Gustafson, 2017). In contrast, previous methods generally require more extensive parameter tuning to obtain reasonable estimates. Finally, as noted earlier, our focus is on the more general problem of estimating policy outcomes rather than the average treatment effects considered in past work.
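We now describe the specific forms of the outcome distributions $f$ and $g$ and the treatment distribution $h$ that we use throughout our analysis. First, we reduce the dimensionality of the observed covariate vectors $x$ down to three key quantities:
$\mu_0(x) = \mathbb{E}[Y(0) \mid X = x]$, the conditional average response if $T = 0$;
$\mu_1(x) = \mathbb{E}[Y(1) \mid X = x]$, the conditional average response if $T = 1$; and
$e(x) = \Pr(T = 1 \mid X = x)$, the propensity score.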
In practice, these three quantities can be estimated via standard prediction models, like regularized regression.
Next, we divide the data into $K$ approximately equally sized groups, ranking and binning by the estimated outcome $\hat{\mu}_0(x)$. (One might alternatively group the data by $\hat{\mu}_1(x)$, or even jointly by $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$.) Denote the group membership of observation $i$ by $k[i]$. Then we model the observed data as follows:
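One specification consistent with the description in this section is sketched here. It is an assumed form, not necessarily the exact model: the coefficient names $\alpha$, $\beta$, $\gamma$, and the choice of which reduced covariate enters each equation, are illustrative. Here $\hat{\mu}_0(x_i)$ and $\hat{\mu}_1(x_i)$ denote the estimated conditional average responses, $\hat{e}(x_i)$ the estimated propensity score, $k[i]$ the group of observation $i$, and $u_i$ the unmeasured confounder with a standard normal prior.

```latex
\begin{aligned}
y_i(0) &\sim \mathrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big(\alpha_{0,k[i]} + \alpha_{1,k[i]}\,\mathrm{logit}\,\hat{\mu}_0(x_i) + \alpha_{2,k[i]}\,u_i\big)\Big), \\
y_i(1) &\sim \mathrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big(\beta_{0,k[i]} + \beta_{1,k[i]}\,\mathrm{logit}\,\hat{\mu}_1(x_i) + \beta_{2,k[i]}\,u_i\big)\Big), \\
t_i &\sim \mathrm{Bernoulli}\Big(\mathrm{logit}^{-1}\big(\gamma_{0,k[i]} + \gamma_{1,k[i]}\,\mathrm{logit}\,\hat{e}(x_i) + \gamma_{2,k[i]}\,u_i\big)\Big), \\
u_i &\sim \mathrm{N}(0, 1).
\end{aligned}
```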
In each of the first three equations above, outcomes are modeled as draws from a Bernoulli distribution whose mean depends on both the reduced covariates (the estimated conditional average responses $\hat{\mu}_0$ and $\hat{\mu}_1$ and the estimated propensity score $\hat{e}$) and the unmeasured confounder $u$. To support complex relationships between the predictors and outcomes, we allow the coefficients to vary by group $k$. That is, we model mean response as piecewise linear on the logit scale. Finally, to complete the Bayesian model specification, we must describe the prior distribution on the parameters. We provide these details in the Appendix, where we also show that our results are largely invariant to the exact choice of priors.
3 An application to judicial decision making
To demonstrate our general approach to policy evaluation, we now consider in detail the case of algorithms designed to aid judicial decisions (Jung et al., 2017; Kleinberg et al., 2017). In the U.S. court system, pretrial release determinations are among the most common and consequential decisions for criminal defendants. After a defendant is arrested, he is usually arraigned in court, where a prosecutor presents a written list of charges. If the case proceeds to trial, a judge must decide whether the defendant should be released on his own recognizance (RoR) or subject to money bail, where release is conditional on providing collateral meant to ensure appearance at trial. Defendants who are not RoR’d and who cannot post bail themselves may await trial in jail, or pay steep fees to a bail bondsman to post bail on their behalf. Judges must therefore balance the burden that setting bail places on a defendant against the risk that the defendant may fail to appear (FTA) for his trial. (In many jurisdictions, judges may also consider the risk that a defendant will commit a new crime if released when deciding whether or not to set bail, but not in the jurisdiction we consider.)
Here we consider algorithmic policies for assisting these judicial decisions, recommending either RoR or bail based on recorded characteristics of a defendant and his case. The policy evaluation problem is to estimate, based only on historical data, the proportion of defendants who would fail to appear if the algorithmic recommendations were followed. As discussed above, this is statistically challenging because one does not always observe what would have occurred had the algorithmic policy been followed. In particular, if the policy recommends releasing a defendant who was in reality detained, or recommends detaining a defendant who was in reality released, the relevant counterfactual is not observed. Further, since judges may (and likely do) base their decisions in part on information that is not recorded in the data, direct outcome models ignoring unmeasured confounding may be badly biased for counterfactual outcomes. We thus allow for a real-valued unobserved covariate $u$ that affects both a judge’s decision (RoR or bail) and also the outcome (FTA or not) conditional on that decision. For example, $u$ might correspond to a defendant’s perceived demeanor, with seemingly responsible defendants more likely to be RoR’d and also more likely to appear at their court proceedings. Our goal is to assess the sensitivity of flight risk estimates to such unmeasured confounding.
3.1 The effects of unmeasured confounding
Our analysis is based on 165,055 adult cases involving nonviolent offenses charged by a large urban prosecutor’s office and arraigned in criminal court between 2010 and 2015. These cases do not include instances where defendants accepted a plea deal at arraignment, where no pretrial release decision is necessary. For each case, we have 49 features describing characteristics of the current charges (e.g., theft, gun-related), and 15 features describing characteristics of the defendant (e.g., gender, age, prior arrests). We also observe whether the defendant was RoR’d, and whether he failed to appear at any of his subsequent court dates. Applying the notation introduced in Section 2 to this dataset, $x_i$ refers to a vector of all observed characteristics of the $i$th defendant, $t_i = 1$ if bail was set, and $y_i = 1$ if the defendant failed to appear at court. Overall, 69% of defendants in our dataset are RoR’d, and 15% of RoR’d defendants fail to appear. Of the remaining 31% of defendants for whom bail is set, 9% fail to appear. As a result, the overall FTA rate is 13%. To carry out our analysis, we first randomly select 10,000 cases from the full dataset, which we set aside as our final test data. (We use a relatively small number of cases for the test set to mitigate issues of computational complexity.) The remaining cases are split into two training data folds of equal size.
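As a check on the rates reported above, the overall FTA rate is the average of the two group-level rates, weighted by the share of defendants in each group:

```latex
\[
0.69 \times 0.15 + 0.31 \times 0.09 = 0.1035 + 0.0279 = 0.1314 \approx 13\%.
\]
```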
We begin by constructing a family of algorithmic decision rules, following the procedure of Jung et al. (2017). On the first training fold of about 77,500 cases, we fit an $L^1$-regularized (lasso) logistic regression model to estimate the probability of FTA given release,
$$\hat{\mu}_0^{(1)}(x) = \widehat{\Pr}\big(Y(0) = 1 \mid X = x\big).$$
That is, we fit a logistic regression with the left-hand side indicating whether a defendant failed to appear at any of his court dates, and the right-hand side consisting of all available covariates, for the subset of defendants that the judge released. We use the superscript $(1)$ to indicate that these estimates are computed on the first fold of data, and are used exclusively to define the policies, not to evaluate them. With these estimates in hand, we construct a family of policies $\pi_c$, indexed by a risk threshold $c$ for releasing individuals:
$$\pi_c(x) = \begin{cases} 1 & \text{if } \hat{\mu}_0^{(1)}(x) \geq c, \\ 0 & \text{otherwise.} \end{cases}$$
For example, the policy $\pi_{0.1}$ recommends that bail be set for a defendant if and only if his estimated flight risk if released, $\hat{\mu}_0^{(1)}(x)$, is at least 10%. These policies, which are based on a ranking of defendants by risk, are similar to pretrial algorithms used in practice (Corbett-Davies et al., 2017).
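A threshold rule of this kind can be expressed compactly. The sketch below is illustrative only: `risk_if_released` stands in for the fitted first-fold risk model, and the risk scores and thresholds are hypothetical values rather than estimates from our data.

```python
def make_threshold_policy(risk_if_released, threshold):
    """Return a policy that sets bail (1) iff estimated risk >= threshold."""
    def policy(x):
        return 1 if risk_if_released(x) >= threshold else 0
    return policy


def share_released(policy, xs):
    """Proportion of individuals the policy releases (action 0)."""
    return sum(1 for x in xs if policy(x) == 0) / len(xs)


# Hypothetical inputs: each x is already an estimated flight risk if released.
risk_if_released = lambda x: x
risks = [0.02, 0.07, 0.10, 0.15, 0.30]

# A family of policies indexed by the risk threshold.
thresholds = [0.05, 0.10, 0.20]
policies = {c: make_threshold_policy(risk_if_released, c) for c in thresholds}
released = {c: share_released(policies[c], risks) for c in thresholds}
```

Raising the threshold releases a larger share of defendants, which is how the family traces out the horizontal axis when policies are compared by proportion released.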
Given the family of policies defined above, we next estimate the proportion of defendants who would fail to appear under each policy, accounting for unmeasured confounding. To do so, on the second training fold of about 77,500 cases, we fit three regularized logistic regression models to estimate each individual’s likelihood of failing to appear if he were RoR’d or were required to pay bail ($\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$, respectively), as well as each individual’s likelihood of having bail set, $\hat{e}(x)$. Finally, on the test fold of data, consisting of 10,000 cases, we fit our Bayesian sensitivity model of Section 2.3, using the estimated quantities $\hat{\mu}_0$, $\hat{\mu}_1$, and $\hat{e}$, and setting the number of groups $K$ to 10. Our use of a random walk prior on the coefficients helps to ensure that our results are largely independent of the specific value of $K$ (see the Appendix for details).
The results of this analysis are plotted in Figure 1. The blue dashed line shows, for each threshold policy $\pi_c$, the proportion RoR’d and the estimated proportion that FTA under the policy, $\hat{V}^{\pi_c}$, estimated under no unobserved confounding. For reference, the black point shows the status quo: judges in our dataset release 69% of defendants, resulting in an overall FTA rate of 13%. The light and dark gray bands show, respectively, the 95% and 50% credible intervals that result from the sensitivity analysis procedure under our model of unobserved confounding.
Figure 1 illustrates three key points. First, for policies that RoR almost all defendants (toward the right-hand side of the plot), the blue line lies below our sensitivity bands, in line with expectations. If there is unmeasured confounding, we would expect those who were in reality detained to be riskier than they seem from the observed data alone; as a result, the direct outcome model underestimates the proportion of defendants that would fail to appear if all (or almost all) such defendants are RoR’d. Conversely, for policies that recommend bail for almost all defendants (toward the left-hand side of the plot), the blue line lies above the sensitivity bands, as we would expect, because those who were in reality released are likely less risky than they appear from the observed data alone. Second, as the policies move further from the status quo, toward the left- and right-hand extremes of the plot, the sensitivity bands grow in width, indicating greater uncertainty. This pattern reflects the fact that the data provide less direct evidence of flight risk as policies diverge from the status quo, heightening the potential impact of unmeasured confounding. Finally, even after accounting for unmeasured confounding, there are algorithmic policies that do substantially better than the status quo, in the sense that they RoR more defendants and simultaneously achieve a lower overall FTA rate.
As discussed above, policy evaluation is a generalization of estimating heterogeneous treatment effects. As such, our approach immediately yields estimates of conditional average treatment effects under our model of unmeasured confounding. For groups defined by age, gender, and number of previously missed court appearances, Figure 2 displays estimated effects of setting bail versus releasing all defendants in each group. In each case, the blue X is the estimated difference in FTA rate from the direct outcome model without unmeasured confounding. The thick and thin gray bars indicate 50% and 95% credible intervals resulting from our sensitivity analysis, and the point shows the median posterior subgroup treatment effect. Across all subgroups, we can see that the estimates ignoring unmeasured confounding underestimate the magnitude of the effect (i.e., are closer to zero), compared to estimates under our model of unobserved confounding. The plot also shows that there is considerable heterogeneity across defendants, particularly when stratified by number of prior FTAs, highlighting the importance of considering conditional average treatment effects. For example, for defendants with three or more previously missed court appearances, the average treatment effect is almost 20 percentage points lower than for defendants with no such lapses.
3.2 A simulation study
Our results above quantify the sensitivity of policy outcomes under one particular model of unmeasured confounding. Without further analysis, however, it is difficult to gauge whether that approach indeed provides accurate estimates of the true outcomes under such policies. To address this question, we next evaluate our method on a series of realistic, synthetic datasets for which the true outcomes are, by construction, known.
We begin with the dataset introduced in Section 3.1, a collection of 165,055 cases $\{(x_i, t_i, y_i)\}$, where $x_i$ is a vector of covariates describing the defendant and characteristics of the charge, $t_i = 1$ if the defendant was required to pay bail, and $y_i = 1$ if the defendant failed to appear for his court appearance. To assess the performance of our sensitivity analysis procedure, we create a synthetic dataset where both potential outcomes for each defendant, $y_i(0)$ and $y_i(1)$, are known. We do so by first estimating the conditional average responses $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$ and the propensity score $\hat{e}(x)$ using regularized logistic regression, as above. Then, for each individual in the original dataset, we create synthetic potential outcomes and treatment assignments via independent Bernoulli draws based on these estimated probabilities:
$$\tilde{y}_i(0) \sim \mathrm{Bernoulli}\big(\hat{\mu}_0(x_i)\big), \qquad \tilde{y}_i(1) \sim \mathrm{Bernoulli}\big(\hat{\mu}_1(x_i)\big), \qquad \tilde{t}_i \sim \mathrm{Bernoulli}\big(\hat{e}(x_i)\big).$$
We denote the resulting synthetic, uncensored dataset by $\mathcal{D}^{*} = \{(x_i, \tilde{t}_i, \tilde{y}_i(0), \tilde{y}_i(1))\}$. Because both potential outcomes are listed, for any policy $\pi$ applied to $\mathcal{D}^{*}$ we can exactly calculate $V^{\pi}$, the proportion of defendants that fail to appear for their trial. We note that our synthetic dataset, by construction, satisfies ignorability. That is, conditional on the covariates $x$, treatment assignment is independent of the potential outcomes.
We next censor the synthetic dataset $\mathcal{D}^{*}$ in two ways. First, we restrict the observed covariates to a subset $\tilde{x}$ of the full covariates $x$. In our simulation study, we consider three sets of restricted covariates: age; age and gender; and age, gender, and prior number of missed court appearances. Second, for each individual we remove the potential outcome for the action not taken, keeping only $\tilde{y}_i(\tilde{t}_i)$. Thus, for each of our three choices of $\tilde{x}$, we have a dataset of the form $\{(\tilde{x}_i, \tilde{t}_i, \tilde{y}_i(\tilde{t}_i))\}$. The covariates not included in $\tilde{x}$ correspond to unmeasured confounding. Starting from these three synthetic datasets, we carry out the policy construction and sensitivity analysis procedure described in Section 3.1. For example, in the case where $\tilde{x}$ consists of age and gender, we first find release policies based on the available covariates (i.e., age and gender), and then run our Bayesian sensitivity analysis to estimate the effects of unmeasured confounding.
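The construction and censoring steps can be sketched as follows. This is a minimal illustration, not our pipeline: the fitted probability models are passed in as plain functions, and the covariate names (`age`, `male`, `prior_fta`), the toy rows, and the model forms are hypothetical.

```python
import random


def make_synthetic(rows, mu0_hat, mu1_hat, e_hat, keep, seed=0):
    """Draw synthetic potential outcomes and treatments, then censor them.

    Returns (uncensored, censored). The uncensored records hold both
    potential outcomes and all covariates; the censored records hold only
    the covariates named in `keep` and the outcome for the action taken.
    """
    rng = random.Random(seed)
    uncensored, censored = [], []
    for x in rows:
        y0 = int(rng.random() < mu0_hat(x))   # Bernoulli draw: FTA if released
        y1 = int(rng.random() < mu1_hat(x))   # Bernoulli draw: FTA if bail set
        t = int(rng.random() < e_hat(x))      # Bernoulli draw: bail set or not
        uncensored.append({"x": dict(x), "t": t, "y0": y0, "y1": y1})
        censored.append({
            "x": {name: x[name] for name in keep},  # restricted covariates
            "t": t,
            "y": y1 if t == 1 else y0,              # only the realized outcome
        })
    return uncensored, censored


# Hypothetical rows and assumed probability models, for illustration only.
rows = [
    {"age": 22, "male": 1, "prior_fta": 3},
    {"age": 45, "male": 0, "prior_fta": 0},
    {"age": 31, "male": 1, "prior_fta": 1},
]
mu0_hat = lambda x: min(0.9, 0.05 + 0.10 * x["prior_fta"])
mu1_hat = lambda x: 0.5 * mu0_hat(x)
e_hat = lambda x: min(0.9, 0.20 + 0.15 * x["prior_fta"])

uncensored, censored = make_synthetic(
    rows, mu0_hat, mu1_hat, e_hat, keep=("age", "male")
)
```

Because `prior_fta` drives both the treatment and outcome probabilities but is dropped from the censored records, it plays the role of an unmeasured confounder in the censored data.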
The results of our simulation study are shown in Figure 3, with each panel corresponding to a different choice of the restricted covariates $\tilde{x}$ and thus a different degree of unmeasured confounding. As in Figure 1, the blue lines show estimates based on direct outcome modeling, ignoring unmeasured confounding, and the gray bands represent 50% and 95% credible intervals based on our sensitivity analysis. Importantly, because these results are derived from synthetic datasets, we can also compute the true policy outcomes, which are indicated in the plot by the red lines. Across the three levels of unmeasured confounding that we consider, our sensitivity bands cover the ground truth line. Moreover, the bands accurately reflect the true level of confounding, with wider bands when $\tilde{x}$ consists of age alone, in which case there is substantial confounding, and narrower bands when $\tilde{x}$ consists of age, gender, and prior number of missed court appearances, in which case there is relatively little confounding.
Finally, in Figure 4, we evaluate the performance of our sensitivity analysis method at estimating subgroup treatment effects in the synthetic data. As in our analysis of the real data (Figure 2), we consider subgroups defined by age, gender, and number of previously missed court appearances. Each row in the plot corresponds to a different level of censoring, and hence of unmeasured confounding. In each panel, the direct outcome modeling estimates (which assume there is no unmeasured confounding) are marked with X’s; the thick and thin gray bars indicate 50% and 95% credible intervals, estimated using our sensitivity analysis approach; and the red points show the true subgroup treatment effects, calculated using the uncensored synthetic dataset $\mathcal{D}^{*}$. Across conditions, the sensitivity bars capture the ground truth treatment effects for almost all subgroups.
3.3 Comparison with previous methods for sensitivity analysis
Sensitivity analysis for offline policy evaluation is relatively new—we know of only one previous paper, Jung et al. (2017). There are, however, several methods for assessing the sensitivity of average treatment effects to unmeasured confounding, which, as noted above, is a specific case of policy evaluation. Here we compare our approach to Jung et al.’s sensitivity method for offline policy evaluation, as well as to two recent methods designed for average treatment effects (McCandless and Gustafson, 2017; Dorie et al., 2016).
We start by comparing to the approach recently proposed by Jung et al. (2017), which adapts the classical method of Rosenbaum and Rubin (1983) to the setting of offline policy evaluation. As discussed in Section 1.1, the Rosenbaum and Rubin framework (and by extension the method of Jung et al.) assumes that there is an unobserved binary covariate $u \in \{0, 1\}$ that affects both treatment assignment and expected response conditional on treatment. For example, $u$ might indicate whether a defendant is particularly risky after accounting for factors recorded in the data. There are four key parameters in this setup, which for concreteness we describe in the judicial context: (1) $\Pr(u = 1)$, the prevalence of the unmeasured confounder; (2) the effect of $u$ on a judge’s decision to set bail or RoR a defendant; (3) the effect of $u$ on the defendant’s likelihood to miss court if RoR’d; and (4) the effect of $u$ on the defendant’s likelihood to miss court if bail is set. By positing the magnitude of these effects (e.g., based on prior domain knowledge), one can estimate a plausible range of policy outcomes that account for unmeasured confounding.
Figure 5 shows the results of the Jung et al. procedure applied to the three synthetic datasets described in Section 3.2. In particular, we compute the minimum and maximum policy estimates obtained by sweeping over two parameter regimes suggested in their paper. In the first regime (top), we allow the prevalence $\Pr(u = 1)$ to vary from 0.1 to 0.9, assume that a defendant with $u = 1$ has up to twice the odds of being detained as one with $u = 0$, and that $u = 1$ can double the odds a defendant fails to appear, whether RoR’d or required to pay bail. We also consider a more extreme situation (bottom), where a defendant with $u = 1$ has up to three times the odds of being detained as one with $u = 0$, and where $u = 1$ can triple the odds a defendant fails to appear. In each scenario, the red lines indicate the true outcomes of each policy, computed from the uncensored synthetic dataset, the blue lines show estimates based on direct outcome models ignoring unmeasured confounding, and the gray bands indicate the minimum and maximum values of each policy over all parameter settings in the corresponding regime.
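To give a sense of scale for these regimes, the sketch below shows how doubling or tripling the odds changes a baseline probability. It illustrates only the odds arithmetic, not the Jung et al. procedure itself; the function name `scale_odds` is an assumption, and the 15% baseline is the FTA rate among RoR’d defendants reported in Section 3.1.

```python
def scale_odds(p, odds_ratio):
    """Return the probability whose odds are odds_ratio times the odds of p."""
    odds = odds_ratio * p / (1.0 - p)
    return odds / (1.0 + odds)


baseline = 0.15  # FTA rate among released defendants in the data
shifted = {ratio: scale_odds(baseline, ratio) for ratio in (1.0, 2.0, 3.0)}

for ratio, prob in shifted.items():
    print(f"odds ratio {ratio:.0f}: probability {prob:.3f}")
```

Under these assumptions, doubling the odds moves a 15% risk to roughly 26%, and tripling moves it to roughly 35%, so even the milder regime posits a confounder with a sizable effect on individual risk.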
In contrast to our own sensitivity analysis results in Figure 3, the sensitivity bands from the Jung et al. approach in Figure 5 often fail to capture the ground-truth policy outcomes. Further, and more importantly, the sensitivity bands do not appropriately adapt to the differing levels of unmeasured confounding across datasets, as indicated by their relatively constant width across settings. As a result, one must manually tune the sensitivity parameters for each dataset to achieve satisfactory performance. While such tuning is not impossible (indeed, such calibration is the norm in classical sensitivity methods), the need for manual adjustment is a significant limitation of non-Bayesian approaches to sensitivity analysis.
Aside from the work of Jung et al. (2017) discussed above, we know of no other off-the-shelf approaches for assessing the sensitivity of policy outcomes to unmeasured confounding. However, as shown in Section 2.1, policy evaluation is a generalization of estimating average treatment effects, and so we can compare our approach to methods designed for that problem. In our judicial application, the average treatment effect is the difference in the proportion of defendants who fail to appear at court when all are required to pay bail versus when all are RoR’d. We note that it is a particularly strong test of our method to compare to approaches designed specifically to estimate the average treatment effect, as our method is intended to address the much more general problem of policy evaluation. In particular, one might expect that there is some cost to generalization, with our method exhibiting relatively worse performance on the narrow problem of estimating the average treatment effect in exchange for broader applicability.
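To make the relationship concrete, the following minimal sketch treats a policy as a function from covariates to an action and computes the average treatment effect as the difference in value between two constant policies. The toy data and all names are hypothetical, and the sketch assumes both potential outcomes are known, which is only the case in synthetic data.

```python
def policy_value(policy, covariates, y_ror, y_bail):
    """Proportion failing to appear if `policy` were applied to everyone.

    policy(x) returns 1 to require bail and 0 to release (RoR).
    y_ror and y_bail hold each defendant's potential outcomes (1 = fails to appear).
    """
    outcomes = [
        fta_bail if policy(x) == 1 else fta_ror
        for x, fta_ror, fta_bail in zip(covariates, y_ror, y_bail)
    ]
    return sum(outcomes) / len(outcomes)


# Hypothetical toy data: age plus both potential outcomes for six defendants.
ages = [19, 23, 35, 41, 52, 60]
y_ror = [1, 1, 0, 1, 0, 0]
y_bail = [1, 0, 0, 0, 0, 0]


def all_bail(age):
    return 1


def all_ror(age):
    return 0


def bail_if_young(age):
    return 1 if age < 30 else 0


# The average treatment effect is the gap between the two constant policies.
ate = policy_value(all_bail, ages, y_ror, y_bail) - policy_value(all_ror, ages, y_ror, y_bail)
mixed = policy_value(bail_if_young, ages, y_ror, y_bail)
print(ate, mixed)
```

Any other policy, such as the hypothetical age-based rule above, is evaluated by the same function, which is the sense in which policy evaluation generalizes the average treatment effect.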
We specifically compare our approach to two recently proposed methods for Bayesian sensitivity analysis of average treatment effects: the method of McCandless and Gustafson (2017), which we refer to as BSA, and the TreatSens method of Dorie et al. (2016). To save space, we omit the details of these methods, but note that we implement the two sensitivity analysis procedures exactly as they were originally described; in the case of TreatSens, we use the authors’ publicly available R package to carry out our analysis. Figure 6 shows the results of estimating average treatment effects on our three synthetic datasets described above, where the true answer is indicated by the dashed red line, and our approach is labeled BSAP. For TreatSens, since we can specify different priors on the unobservable confounder, we compute results with a standard normal prior, as used in our own method, and with uniform priors, as suggested by Dorie et al. The thick and thin lines show the 50% and 95% credible intervals. In all three synthetic datasets, representing different levels of unmeasured confounding, our approach is competitive with, and arguably even better than, the two methods we compare to. In particular, whereas the true answer is at the periphery of the 95% credible intervals generated by TreatSens, the ground truth lies near the posterior median of our approach. Further, our credible intervals are substantially narrower than those resulting from BSA, indicating that our method can simultaneously achieve accuracy and precision.
4 Constructing optimal policies
In our analysis of judicial decisions, we constructed a family of policies by first estimating each defendant’s flight risk and then, for any threshold, defining the policy to recommend release if and only if the estimated risk falls below that threshold. This family of threshold policies is, necessarily, based on a defendant’s estimated risk, not the true risk, and as such may be affected by unmeasured confounding. Here we explore the effects of such unmeasured confounding on policy construction. We first show, by example, that unmeasured confounding can in theory lead to pathological outcomes, with the riskiest defendants released and the least risky detained. We then argue that such pathologies may not pose a substantial problem in practice.
To illustrate the potentially extreme effects of unmeasured confounding on constructing policies, we consider a simple, stylized example in Table 1. Suppose we only observe a defendant’s age, which for simplicity we assume is either ‘young’ or ‘old’. In this example, and as indicated in Table 1, young defendants are, on average, riskier than old defendants; i.e., the true flight risk is 0.17 for young defendants and 0.1 for old defendants. Based on this true risk ranking, one might reasonably release old defendants while demanding bail from young ones. Assume, however, that judges have access to additional information, and instead base their decisions on both a defendant’s age and gender, which we assume judges observe but we do not. In our example, young men are the highest-risk group, and so we further assume that judges demand bail from young men while releasing everyone else. Now, because the observed flight risk for released young defendants is relatively low, as described in Table 1, the observed risk among released defendants is 0.05 for the young and 0.1 for the old, the opposite of the true ranking. Among released defendants, young defendants are indeed lower risk than old defendants, but, importantly, this pattern does not hold in the full population. Akin to Simpson’s paradox, unmeasured confounding can cause one to infer a risk ranking of defendants that inverts the true order. As such, one might mistakenly think it is reasonable to release young defendants (who, in the example, are actually high risk) while demanding bail from old defendants (who are low risk).
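The inversion can be checked by direct arithmetic. The sketch below takes the group shares, bail decisions, and group-level flight risks from Table 1 and computes the age-level risk twice: once over everyone and once over released defendants only. The variable and function names are illustrative.

```python
# Rows: (age, gender, population share, bail demanded, flight risk if released).
rows = [
    ("young", "M", 0.4, 1, 0.20),
    ("young", "W", 0.1, 0, 0.05),
    ("old", "M", 0.4, 0, 0.10),
    ("old", "W", 0.1, 0, 0.10),
]


def risk_by_age(table, released_only):
    """Share-weighted average flight risk for each age group."""
    totals = {}
    for age, _gender, share, bail, risk in table:
        if released_only and bail == 1:
            continue  # outcomes under release are never observed for this group
        weight, weighted_risk = totals.get(age, (0.0, 0.0))
        totals[age] = (weight + share, weighted_risk + share * risk)
    return {age: weighted_risk / weight for age, (weight, weighted_risk) in totals.items()}


true_risk = risk_by_age(rows, released_only=False)
observed_risk = risk_by_age(rows, released_only=True)
print(true_risk)      # young is riskier than old in the full population
print(observed_risk)  # young appears less risky than old among the released
```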
Despite this extreme hypothetical, in practice, unobserved confounding may not significantly impact the policies one constructs. Suppose, for example, that the estimated risk systematically underestimates each individual’s true risk by a constant factor. In this case, even though the risk estimates are biased, the policy that recommends detaining the 10% most risky defendants is unaffected. Particularly when faced with capacity constraints in allocating resources, it often makes sense to parameterize policies in terms of risk quantiles (e.g., the proportion of defendants who are detained), rather than absolute risk thresholds (e.g., detaining those with estimated risk exceeding a fixed value). Fortunately, as we show next, risk rankings, and hence quantile-based policies, are relatively robust to unmeasured confounding.

We start by describing a relatively general setting in which unmeasured confounding has no effect on risk rankings. Though only an approximation of real-world conditions, this example provides insight into the problem. Suppose that defendants are partitioned into discrete, observable groups indexed by $g$. These groups might, for example, correspond to subsets defined by age, gender, criminal history, and other factors. For illustration, we assume that the true distribution of risk (i.e., a defendant’s log-odds of failing to appear if released) within each group $g$ is $N(\mu_g, \sigma^2)$. Importantly, as we discuss in more detail below, we assume that the within-group variance $\sigma^2$ is constant across groups, though we allow the group means $\mu_g$ to differ. We further assume that judges observe this true risk, and release defendants if and only if their (true) risk is below some fixed threshold $t$. By definition, group $g$ is, on average, riskier than group $g'$ if and only if $\mu_g > \mu_{g'}$. The true, group-level risk ranking thus orders groups by $\mu_g$. The problem is that we do not observe outcomes for a representative sample of defendants in each group (we only observe outcomes for those defendants who happened to be released), and so we cannot directly estimate $\mu_g$. In this setup, however, we can still recover the true risk ranking. As we show in Proposition 1 below, the mean of a right-truncated normal distribution is increasing in the mean of the underlying normal distribution. As a result, even though risk estimates based on released defendants are biased, these estimates preserve the true, group-level risk ranking.
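Returning to the quantile-based policies discussed above, a small example shows why a constant-factor bias leaves them unchanged while shifting threshold-based policies. The risk values, the bias factor of 0.5, and the function names are hypothetical.

```python
def detain_top_fraction(risks, fraction):
    """Indices of the given fraction of defendants with the highest estimated risk."""
    n_detain = round(fraction * len(risks))
    ranked = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    return set(ranked[:n_detain])


def detain_above_threshold(risks, threshold):
    """Indices of defendants whose estimated risk exceeds an absolute threshold."""
    return {i for i, risk in enumerate(risks) if risk > threshold}


true_risk = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.45, 0.50, 0.60, 0.80]
# Hypothetical biased estimates: every risk is understated by the same factor.
biased_risk = [0.5 * risk for risk in true_risk]

same_quantile_policy = (
    detain_top_fraction(true_risk, 0.1) == detain_top_fraction(biased_risk, 0.1)
)
same_threshold_policy = (
    detain_above_threshold(true_risk, 0.35) == detain_above_threshold(biased_risk, 0.35)
)
print(same_quantile_policy, same_threshold_policy)
```

The quantile rule depends only on the ordering of the estimates, so any bias that preserves the ordering leaves the set of detained defendants unchanged.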
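Table 1: Stylized example described in the text.

| Age | Gender | Proportion | Bail demanded | True risk (age and gender) | Observed risk among released | True risk (age) |
|---|---|---|---|---|---|---|
| Young | M | 0.4 | 1 | 0.2 | | 0.17 |
| Young | W | 0.1 | 0 | 0.05 | 0.05 | 0.17 |
| Old | M | 0.4 | 0 | 0.1 | 0.1 | 0.1 |
| Old | W | 0.1 | 0 | 0.1 | 0.1 | 0.1 |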
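Proposition 1.
Let $X_i \sim N(\mu_i, \sigma^2)$ for $i = 1, 2$. Then if $\mu_1 < \mu_2$, we have $E[X_1 \mid X_1 < t] < E[X_2 \mid X_2 < t]$. That is, for fixed $\sigma$ and $t$, the mean of the right-truncated normal distribution is an increasing function of the mean of the underlying normal distribution.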
A proof of Proposition 1 is given in the Appendix. We present a visual representation of the result in the left panel of Figure 7. Based only on the means of the truncated distributions, one can recover the true group ranking. The key assumption in Proposition 1 is that the within-group variance is constant across groups. Without this assumption, one cannot always recover risk rankings, as illustrated in the right panel of Figure 7. In that example, the medium-risk group (shown in blue) has higher variance than the low-risk group (shown in green), distorting the ranking inferred from the truncated distributions. A similar phenomenon is at play in the example described in Table 1. In that case, the within-group variance of young defendants is substantially higher than for old defendants, corrupting the risk ranking. In both examples, one requires quite large differences in within-group variance to alter the risk ranking, pointing to the robustness of such rankings in the presence of unmeasured confounding.
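The role of the constant-variance assumption can be illustrated numerically with the standard closed form for the mean of a right-truncated normal. The group means, standard deviations, and threshold below are hypothetical and are not taken from Figure 7.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_mean(mu, sigma, t):
    """Mean of X given X < t, where X ~ N(mu, sigma^2)."""
    alpha = (t - mu) / sigma
    return mu - sigma * STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)


t = 0.0  # hypothetical release threshold on the log-odds scale

# Equal within-group variance: truncated means keep the order of the true means.
equal_variance = [truncated_mean(mu, 1.0, t) for mu in (-1.0, 0.0, 1.0)]
ranking_preserved = equal_variance == sorted(equal_variance)

# Unequal variance: a higher-mean group with a much larger spread can fall
# below a lower-mean group after truncation.
low_risk_group = truncated_mean(-1.0, 1.0, t)
medium_risk_wide_group = truncated_mean(0.0, 3.0, t)
ranking_inverted = medium_risk_wide_group < low_risk_group

print(ranking_preserved, ranking_inverted)
```

In this hypothetical case the inversion needs a threefold difference in standard deviation, in line with the observation that large variance differences are needed to alter the ranking.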
Proposition 1 establishes one particular set of assumptions under which risk rankings can be recovered in the presence of unmeasured confounding. We now present empirical evidence that unmeasured confounding does not substantially impact the policies one learns in our bail application. As in Section 3.2, we construct a synthetic dataset in which we have access to both potential outcomes for each defendant: failure to appear if RoR’d, and failure to appear if bail is required. As before, we censor the data by restricting to three different subsets of covariates: (1) age; (2) age and gender; and (3) age, gender, and number of previously missed court dates. For each subset of covariates, we derive two families of policies by ranking defendants along the observed dimensions. The first set of policies is based only on outcomes for released defendants. That is, as we would do in practice, we restrict to released defendants and fit a model that estimates flight risk conditional on the observed covariates. This fitted model is then used to estimate the flight risk of all defendants, and subsequently to rank them, both those who were released and those who were not. The second set of policies is based on information for all defendants, where we use the counterfactual outcome for those defendants who were not released. In this case, the derived ranking is the best one can do with the available covariates. Of course, such an optimal policy cannot generally be learned in practice, but it is a natural benchmark to consider when assessing the effects of unmeasured confounding on policy construction.
In Figure 8, we evaluate both families of policies described above, comparing the proportion who are released to the proportion who fail to appear under each. The proportion who fail to appear is computed using the ground-truth potential outcomes under release and under bail. The red lines show results for the optimal policies, derived based on complete information about the potential outcomes. The black dashed lines show results for the policies one would learn in practice, by considering only the outcomes of those who were released. In all three conditions, the black and red lines are almost identical. Thus, even though unmeasured confounding can in theory lead to suboptimal risk rankings, it appears, at least in our case, that policy construction is robust to such hidden heterogeneity.
5 Discussion
As algorithms are increasingly used to guide high-stakes decisions, it has become ever more important to accurately assess their likely effects. Here we have addressed this problem of offline policy evaluation by coupling classical statistical ideas on sensitivity analysis with modern methods of machine learning and large-scale Bayesian inference. The result is a conceptually simple yet powerful technique for evaluating algorithmic decision rules.
One of the key strengths of our general strategy is its flexibility. In our bail application, both the treatment (RoR or bail) and the response (FTA or not) are binary. In principle, however, our approach can handle much more complicated treatment choices and outcomes simply by specifying alternative model forms. For example, instead of modeling the bail decision as binary, one might model treatment as a multinomial or continuous function corresponding to the amount of bail required of the defendant. Such modeling changes create relatively little overhead for practitioners, since one can use black-box Bayesian inference to compute the posterior, circumventing the difficult analytic calculations that have hindered past methods. More complicated models typically come with longer inference times, though advances in Bayesian computation, including black-box variational inference (Ranganath et al., 2014), have produced algorithms capable of handling substantial complexity.
By definition, it is impossible to precisely quantify unmeasured confounding, and so all methods of sensitivity analysis require assumptions that are inherently untestable. Traditional methods handle this situation by requiring practitioners to specify parameters describing the structure and scale of the assumed confounding, informed, for example, by domain expertise. In contrast to that traditional approach, our Bayesian framework largely obviates the need to explicitly set sensitivity parameters. But there is no free lunch. In our case, the necessary side information enters through both the assumed form of the data generating process and the priors. By adopting an expressive model form and setting weakly informative priors, our approach balances the need to provide at least some information about the structure of the potential confounding with the impossibility of specifying it exactly. This middle ground appears to work well in practice, but it is useful to remember the conceptual underpinnings of our strategy when applying it to new domains.
Over the last two decades, sophisticated methods of machine learning have emerged and gained widespread adoption. More recently, these methods have been applied to traditional problems of causal inference and their modern incarnations, like offline policy evaluation. Machine learning and causal inference are two sides of the same coin, but the links between the two are still underdeveloped. Here we have bridged one such gap by porting ideas from classical sensitivity analysis to algorithmic decision making. Looking forward, we hope that sensitivity analysis is more tightly integrated into machine learning pipelines and, more generally, that our work spurs further connections between methods of causal inference and prediction.
Acknowledgements
We thank Alexander D’Amour and Andrew Gelman for helpful comments.
References
 Ahlman and Kurtz (2009) Ahlman, L. C. and Kurtz, E. M. (2009). The APPD randomized controlled trial in low risk supervision: The effects on low risk supervision on rearrest. Philadelphia Adult Probation and Parole Department.
 Åstebro and Elhedhli (2006) Åstebro, T. and Elhedhli, S. (2006). The effectiveness of simple decision heuristics: Forecasting commercial success for early-stage ventures. Management Science, 52(3):395–409.
 Athey and Wager (2017) Athey, S. and Wager, S. (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
 Barocas and Selbst (2016) Barocas, S. and Selbst, A. D. (2016). Big data’s disparate impact. California Law Review, 104.
 Berk (2012) Berk, R. (2012). Criminal justice forecasts of risk: a machine learning approach. Springer Science & Business Media.
 Berk et al. (2009) Berk, R., Sherman, L., Barnes, G., Kurtz, E., and Ahlman, L. (2009). Forecasting murder within a population of probationers and parolees: a high stakes application of statistical learning. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1):191–211.
 Burgess (1928) Burgess, E. W. (1928). Factors determining success or failure on parole. The workings of the indeterminate sentence law and the parole system in Illinois, pages 221–234.
 Camerer and Johnson (1997) Camerer, C. F. and Johnson, E. J. (1997). The process-performance paradox in expert judgment. In Goldstein, W. M. and Hogarth, R. M., editors, Research on judgment and decision making: Currents, connections, and controversies. Cambridge University Press.
 Carnegie et al. (2016) Carnegie, N. B., Harada, M., and Hill, J. L. (2016). Assessing sensitivity to unmeasured confounding using a simulated potential confounder. Journal of Research on Educational Effectiveness, 9(3):395–420.
 Carpenter et al. (2016) Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., and Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 20:1–37.
 Chamberlain (2011) Chamberlain, G. (2011). Bayesian aspects of treatment choice. In The Oxford Handbook of Bayesian Econometrics. Oxford University Press.
 Chouldechova et al. (2018) Chouldechova, A., Benavides-Prado, D., Fialko, O., and Vaithianathan, R. (2018). A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Friedler, S. A. and Wilson, C., editors, Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 134–148, New York, NY, USA. PMLR.
 Corbett-Davies et al. (2017) Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806. ACM.
 Cornfield et al. (1959) Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., and Wynder, E. L. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. J. Nat. Cancer Inst, 22:173–203.
 Danziger et al. (2011) Danziger, S., Levav, J., and Avnaim-Pesso, L. (2011). Extraneous factors in judicial decisions. Proceedings of the National Academy of Sciences, 108(17):6889–6892.
 Dawes (1979) Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571.
 Dawes et al. (1989) Dawes, R. M., Faust, D., and Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243(4899):1668–1674.
 Dehejia (2005) Dehejia, R. H. (2005). Program evaluation as a decision problem. Journal of Econometrics, 125(1):141–173.
 Dhami (2003) Dhami, M. K. (2003). Psychological models of professional decision making. Psychological Science, 14(2):175–180.
 Dorie et al. (2016) Dorie, V., Harada, M., Carnegie, N. B., and Hill, J. (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in medicine, 35(20):3453–3470.
 Dudík et al. (2011) Dudík, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. ICML.
 Gail and Simon (1985) Gail, M. and Simon, R. (1985). Testing for qualitative interactions between treatment effects and patient subsets. Biometrics, pages 361–372.
 Goel et al. (2016a) Goel, S., Rao, J., and Shroff, R. (2016a). Personalized risk assessments in the criminal justice system. The American Economic Review, 106(5):119–123.
 Goel et al. (2016b) Goel, S., Rao, J. M., and Shroff, R. (2016b). Precinct or prejudice? Understanding racial disparities in New York City’s stop-and-frisk policy. Annals of Applied Statistics, 10(1):365–394.
 Hill (2012) Hill, J. L. (2012). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics.
 Holland (1986) Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960.
 Imbens (2003) Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132.
 Jung et al. (2017) Jung, J., Concannon, C., Shroff, R., Goel, S., and Goldstein, D. G. (2017). Simple rules for complex decisions. ArXiv e-prints.
 Klein (2017) Klein, G. A. (2017). Sources of power: How people make decisions. MIT press.
 Kleinberg et al. (2017) Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., and Mullainathan, S. (2017). Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293.
 Manski (2004) Manski, C. F. (2004). Statistical treatment rules for heterogeneous populations. Econometrica, 72(4):1221–1246.
 McCandless and Gustafson (2017) McCandless, L. C. and Gustafson, P. (2017). A comparison of Bayesian and Monte Carlo sensitivity analysis for unmeasured confounding. Statistics in Medicine.
 McCandless et al. (2007) McCandless, L. C., Gustafson, P., and Levy, A. (2007). Bayesian sensitivity analysis for unmeasured confounding in observational studies. Statistics in medicine, 26(11):2331–2347.
 McDonald (1996) McDonald, C. J. (1996). Medical heuristics: The silent adjudicators of clinical practice. Annals of Internal Medicine, 124(1 Part 1):56–62.
 Meehl (1954) Meehl, P. E. (1954). Clinical vs. statistical prediction. Minneapolis: University of Minnesota Press.
 Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.
 Rosenbaum and Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society. Series B (Methodological), pages 212–218.
 Tetlock (2005) Tetlock, P. (2005). Expert political judgment: How good is it? How can we know? Princeton University Press.
 Wübben and Wangenheim (2008) Wübben, M. and Wangenheim, F. V. (2008). Instant customer base analysis: Managerial heuristics often get it right. Journal of Marketing, 72(3):82–93.
 Zeng et al. (2017) Zeng, J., Ustun, B., and Rudin, C. (2017). Interpretable classification models for recidivism prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(3):689–722.
 Zhang et al. (2012) Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010–1018.
 Zhang et al. (2015) Zhang, Y., Laber, E. B., Tsiatis, A., and Davidian, M. (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics, 71(4):895–904.
Appendix A Additional model details
In the main text, we described the likelihood function for our Bayesian model of unmeasured confounding. Here we complete the model specification by describing the prior distribution on the parameters.
On each of the unmeasured confounders $u$, we set a standard normal prior. On the group-level coefficients, we use a random-walk prior. Intuitively, random-walk priors ensure that adjacent groups have similar coefficient values, mitigating the dependence of results on the exact number of groups. Formally, under the random-walk prior each coefficient is centered on the coefficient of the adjacent group, with a scale parameter that is itself given a half-normal prior with fixed standard deviation. We analogously set priors for the remaining group-level parameters.

For the coefficients on the unobserved confounder $u$, we set random-walk priors with an additional constraint to ensure positivity, so that the modeled probabilities of treatment and of each potential outcome all increase with $u$. In the judicial context, for example, one can imagine that $u$ corresponds to unobserved risk of failing to appear, with judges more likely to demand bail from riskier defendants. This constraint, while not strictly necessary, facilitates estimation of the posterior distribution. We set sign-constrained random-walk priors of this form on the confounder coefficient in each model. For the main results in this paper, we fix the number of groups and the standard deviation of the half-normal prior. However, Figure 9 shows that the results are not substantially affected by the choice of the number of groups or the prior distribution.
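For intuition about how such priors behave, the following sketch draws from a generic first-order random-walk prior and from a positivity-constrained variant. The Gaussian increments, the half-normal step scale, the rejection step used to enforce positivity, and all function and parameter names are assumptions made for illustration; the exact parameterization used in the model may differ.

```python
import random


def sample_random_walk_prior(n_groups, scale_sd, rng):
    """Draw coefficients for adjacent groups from a first-order random walk."""
    # Half-normal draw for the step size: absolute value of a zero-mean normal.
    step_sd = abs(rng.gauss(0.0, scale_sd))
    coefficients = [rng.gauss(0.0, step_sd)]
    for _ in range(n_groups - 1):
        # Each coefficient is centered on the coefficient of the previous group.
        coefficients.append(rng.gauss(coefficients[-1], step_sd))
    return coefficients


def sample_positive_random_walk_prior(n_groups, scale_sd, rng):
    """Same walk, with every coefficient constrained to be positive."""
    step_sd = abs(rng.gauss(0.0, scale_sd))
    coefficients = [abs(rng.gauss(0.0, step_sd))]
    for _ in range(n_groups - 1):
        # Rejection sampling from a normal centered on the previous value,
        # truncated below at zero.
        while True:
            draw = rng.gauss(coefficients[-1], step_sd)
            if draw > 0.0:
                break
        coefficients.append(draw)
    return coefficients


rng = random.Random(0)
unconstrained = sample_random_walk_prior(10, 1.0, rng)
positive = sample_positive_random_walk_prior(10, 1.0, rng)
print(unconstrained)
print(positive)
```

Because the step scale is shared across groups within a draw, a small scale yields nearly constant coefficients and a large scale lets adjacent groups differ more.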
Appendix B Proof of Proposition 1
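As is well known, if a random variable $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$, the mean and variance of the truncated variable $X \mid X < t$ are given by

$$E[X \mid X < t] = \mu - \sigma \frac{\phi(\alpha)}{\Phi(\alpha)}$$

and

$$\mathrm{Var}[X \mid X < t] = \sigma^2 \left[ 1 - \alpha \frac{\phi(\alpha)}{\Phi(\alpha)} - \left( \frac{\phi(\alpha)}{\Phi(\alpha)} \right)^2 \right],$$

where $\phi$ and $\Phi$ are the PDF and CDF of the standard normal distribution, respectively, and $\alpha = (t - \mu)/\sigma$. Differentiating with respect to $\mu$, and using the identities $\phi'(x) = -x\phi(x)$ and $\Phi'(x) = \phi(x)$, we see that

$$\frac{d}{d\mu} E[X \mid X < t] = 1 - \alpha \frac{\phi(\alpha)}{\Phi(\alpha)} - \left( \frac{\phi(\alpha)}{\Phi(\alpha)} \right)^2 = \frac{\mathrm{Var}[X \mid X < t]}{\mathrm{Var}[X]}.$$

Since this derivative is the ratio of two variances, it is positive, hence $E[X \mid X < t]$ is increasing in $\mu$, as desired.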
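The derivative identity used in the proof can be checked numerically by comparing a central finite difference of the truncated mean with the ratio of the truncated variance to the untruncated variance. The parameter values and the step size below are arbitrary choices for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def truncated_mean(mu, sigma, t):
    """Mean of X given X < t, where X ~ N(mu, sigma^2)."""
    alpha = (t - mu) / sigma
    return mu - sigma * STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)


def truncated_variance(mu, sigma, t):
    """Variance of X given X < t, where X ~ N(mu, sigma^2)."""
    alpha = (t - mu) / sigma
    ratio = STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)
    return sigma ** 2 * (1.0 - alpha * ratio - ratio ** 2)


def mean_derivative(mu, sigma, t, step=1e-5):
    """Central finite-difference derivative of the truncated mean in mu."""
    upper = truncated_mean(mu + step, sigma, t)
    lower = truncated_mean(mu - step, sigma, t)
    return (upper - lower) / (2.0 * step)


# Arbitrary (mu, sigma, t) settings at which to compare both sides.
settings = [(-1.0, 1.0, 0.0), (0.0, 2.0, 1.0), (1.0, 1.0, 0.0), (0.5, 0.7, -0.2)]
for mu, sigma, t in settings:
    numeric = mean_derivative(mu, sigma, t)
    variance_ratio = truncated_variance(mu, sigma, t) / sigma ** 2
    print(mu, sigma, t, numeric, variance_ratio)
```

The two printed columns should agree to within finite-difference error, and the ratio lies strictly between 0 and 1 because truncation reduces the variance.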