1 Introduction
Causal inference generally requires two distinct elements: modeling observed potential outcomes and making assumptions about missing potential outcomes. While the first element can be investigated with standard modelchecking techniques, the second element is uninformed by the data and can only be probed by sensitivity analysis. Ideally, a sensitivity analysis should quantify how the results change under different assumptions about unobserved potential outcomes without perturbing the assumptions used to model the observed data (Linero and Daniels, 2017; Gustafson et al., 2018)
. Unfortunately, under popular sensitivity analysis frameworks, such “clean” sensitivity analyses are often difficult to construct, leaving investigators with an inconvenient choice: either employ a sensitivity analysis that conflates model checking with probing untestable assumptions, or restrict the data analysis to a simple class of models for which a clean sensitivity analysis is possible. This tension puts sensitivity analysis at odds with modern causal inference approaches, especially those that use nonparametric techniques, such as Bayesian nonparametric regression or other machine learning methods, to build highly flexible models of the observed data
(Hill, 2012; Athey and Wager, 2017; Hahn et al., 2017).In this paper, we propose a sensitivity analysis framework that resolves this tension in the context of assessing sensitivity to unobserved confounding in observational studies. Specifically, our framework allows for incorporating flexible, nonparametric methods into modelbased sensitivity analysis, where the investigator parameterizes sensitivity in terms of a joint model for the observed and unobserved potential outcomes. Our framework employs a particular “extrapolation” factorization of this joint potential outcome distribution that separates the observed data density, which is nonparametrically identified, from the missing data density, which is completely unidentified. Further, this factorization parameterizes the missing data density in terms of selection functions, which are easily interpreted, so that for any setting of the sensitivity parameters, we can interpret the implied distribution on missing potential outcomes as an extrapolation from the observed data distribution.
The extrapolation factorization has a long history in the missing data literature (Linero and Daniels, 2017). First introduced by John Tukey (recorded in Holland, 1986), the approach has been known by a number of names including exponential tilting (Birmingham et al., 2003; Scharfstein et al., 1999) and Tukey’s factorization (Franks et al., 2016). Our main contribution is to apply this approach to observational studies; to the best of our knowledge, we are the first thorough demonstration of its use for this application. This leads to several gains in practice. First, this approach enables separate quantification of the sensitivity of the results to both changes in the observed data model (testable) and sensitivity parameters (untestable). By separating observable and unobservable implications, we resolve common problems that arise from conflating sensitivity analysis and model checking. An important consequence of this separation is that the framework can easily be used for any probabilistic predictive model, allowing investigators to add sensitivity analysis to a causal investigation regardless of the statistical model. We demonstrate this flexibility by conducting sensitivity analysis with a range of modern statistical models including Bayesian Adapative Regression Trees (BART) for response surface estimation (Dorie et al., 2016) and Dirichlet processes models for nonparametric inference of residual distributions. We also demonstrate the use of the factorization with semicontinuous data using zeroinflated models.
Second, the sensitivity parameters defined in our framework are easily interpreted. Interpretatiblity of parameters is a central requirement in sensitivity analysis because the analysis probes a set of unidentified assumptions that must be calibrated by applying domain expertise or some other external knowledge. The parameters in our framework can be interpreted intuitively in terms of sampling bias, facilitating model specification and parameter calibration. This is a major advantage over traditional patternmixture models which do not explicitly parameterize the selection function.
Third, in this factorization, observed data inference is orthogonal to the specification of the selection function, which yields a straightforward framework for conducting post facto sensitivity analysis in any data set for which unconfoundedness was previously assumed. This fact implies substantial computational savings because we only need to fit a model to the observed data once. In contrast, in many sensitivity analysis frameworks, such as latent confounder models (Section 3), we typically need to fit the model separately for each set of sensitivity parameters.
The paper proceeds as follows. Section 2 introduces the formal setup for sensitivity analysis and highlights key issues. Section 3 discusses difficulties with latent confounder models, a popular sensitivity analysis framework in causal inference, which we use to motivate our method. Section 4 introduces the extrapolation factorization for causal inference and provides theoretical justification for applying the approach. Section 5 defines a particular model specification, the logisticexponential family model, and explores implementation details. Section 6 applies the extrapolation factorization to two examples, demonstrating the flexibility of the approach on a range of estimands with several Bayesian nonparametric estimation approaches. Finally, Section 7 discusses open issues and possible extensions. The appendix contains some additional technical material.
2 Setup and Overview of Sensitivity Analysis
2.1 Setup and Notation
We describe our approach using the potential outcomes framework (Neyman, 1923; Rubin, 1974). For outcome for unit , let and denote that unit’s potential outcomes if assigned to control or treatment, respectively. We let denote a binary treatment indicator and denote observed covariates. For compactness, we often write to denote the outcome for treatment level . Assuming SUTVA (Rubin, 1980), we can then write the observed outcome as . Finally, we write the propensity score as We assume an infinite population of iid units, from which we sample triplets and observe triplets . Given this, we suppress the subscript unless otherwise noted.
We focus on two classes of population estimands. First, we consider average treatment effects on the whole population as well as on the treated and control populations:
Secondly, we consider quantile treatment effects,
the difference in the th treatment quantile, , and control quantile, .
Each of these estimands is a contrast between the marginal completedata outcome distributions and . For each , the completedata distribution for each potential outcome can be written as a mixture of the distribution of observed and missing outcomes:
All factors in this expression are identified except for , which is completely uninformed by the data. Identifying these estimands thus requires untestable assumptions that characterize . Sensitivity analysis probes the robustness of estimates to these assumptions.
We consider sensitivity analyses for observational studies where the investigator wishes to test robustness to violations of the unconfoundedness assumption.
Assumption 1 (Unconfoundedness).
.
Unconfoundedness implies that , and is thus sufficient to identify our estimands of interest.
There is an extensive literature on assessing sensitivity to departures from unconfoundedness, dating back at least to the foundational work of Cornfield et al. (1959) on the link between smoking and lung cancer. Broadly, sensitivity analysis proceeds by parameterizing the conditional dependence between partiallyobservable potential outcomes and given covariates . The parameters of this dependence are called sensitivity parameters
. Investigators can then report how causal effect estimates change when the sensitivity parameters are allowed to vary within a selected range of plausible values. The plausibility of the range of sensitivity parameters is ultimately determined externally to the data analysis, e.g., by domain expertise. In this paper, we focus on modelbased sensitivity analysis, which specifies conditional dependence in terms of a family of models for the joint distribution of potential outcomes and treatment given covariates. For distinct approaches to sensitivity analysis, see
Rosenbaum and Silber (2009), Ding and VanderWeele (2016), and Zhao et al. (2017).2.2 Summary of Approach
In this paper, we contrast two approaches to modelbased sensitivity analysis, complete data modeling and observed data modeling. Both have been thoroughly explored in the missing data literature (van Buuren, 2015) but are perhaps less studied in causal inference. Complete data modeling is the standard approach for modelbased sensitivity analysis in causal inference (Rosenbaum and Rubin, 1983; Dorie et al., 2016). These models specify the joint distribution of outcomes and treatment assignment in terms of completedata outcome distributions that are parameterized by sensitivity parameters . This specification defines the observed and missing distributions and implicitly. Often, the conditional dependence of potential outcomes and treatment assignment is specified in terms of a latent confounder . We discuss these latent confounder models in greater detail in Section 3.
By contrast, we propose an observed data modeling approach, in which we specify the joint distribution in terms of the observed outcome distributions . Specifically, we parameterize the missing outcome distributions as an extrapolation of the observed outcome distribution
(1) 
This specification defines the completedata distributions implicitly. In (1), the fraction can be interpreted as importance weights that transform the observed outcome distribution into the missing outcome distribution parameterized by . We discuss the derivation of (1) in more detail in Section 4. Notably, the observed data distributions in this approach are free of the sensitivity parameter , which implies a clean separation of model checking and sensitivity analysis without the need for reparameterization. We discuss the importance of this separation in the next subsection.
Sensitivity analyses often involve two key design choices that are easily addressed by our approach. First, the analyst faces many choices in modeling the conditional dependence between outcomes and treatment. Our approach follows recent proposals to enable sensitivity analysis with flexible outcome models, especially Dorie et al. (2016), who use Bayesian Additive Regression Trees (BART) models for flexible outcome modeling (see also Carnegie et al., 2016). We contrast their approach with ours in Section 6.1. Similarly, Jung et al. (2018) combine flexible, blackbox modeling with sensitivity in the setting of algorithmic decision making. As we discuss in Section 3, our main critique has limited applicability to their approach, since Jung et al. (2018) focus exclusively on the setting with a binary outcome and sensitivity parameters that vary across covariates. Finally, in an approach that aligns well with ours, Hahn et al. (2016) propose flexible, nonparametric models that separate the observed data model from sensitivity parameters in the context of missing data.
Second, the analyst must summarize the results of the sensitivity analysis. Setting the range of sensitivity parameters is a critical first step; it can be useful to calibrate this range against similar measures of dependence between fully observed variables, e.g., and (Imbens, 2003; Ichino et al., 2008; Dorie et al., 2016). In some literatures, the range of causal estimates obtained by allowing sensitivity parameters to vary within the specified range is called an ignorance region, and we adopt this terminology in this paper (Vansteelandt et al., 2006). For ignorance region reporting, we focus on what is sometimes called the “tabular method”, where we evaluate causal effect estimates across a grid of sensitivity parameter values (Gustafson et al., 2018), and report the entire table of estimates. Our methodology is not restricted to this type of reporting, but we find it useful to demonstrate several important properties of our sensitivity analysis framework, including the nonidentification of sensitivity parameters. Alternatives to the tabular method include socalled Bayesian sensitivity analysis, where the investigator specifies a prior over sensitivity parameters and reports a posterior on treatment effects given the full model (McCandless et al., 2007; McCandless and Gustafson, 2017), and bound reporting, where the investigator simply reports the most extreme causal effect estimates obtained from sensitivity parameters within the specified range (Rosenbaum, 2017). One disadvantage of the tabular method is that it is only feasible to do this reporting with lowdimensional sensitivity parameters.
2.3 Separating Sensitivity Analysis from Model Checking
A prime motivation for our sensitivity analysis framework is the importance of separating sensitivity analysis from model checking. As opposed to sensitivity analysis, model checking probes observable parts of the model and explores a range of assumptions that induce distinct models for the observed data. In practice, many popular approaches to sensitivity analysis blur the line between sensitivity analysis and model checking by introducing sensitivity parameters that are informed by the observed data. Sensitivity analyses of this type are difficult to interpret because they deal with two interacting sets of assumptions at once. Specifically, they relax the unconfoundedness assumption while introducing parametric assumptions that fundamentally change the relationship between confounding and identification. This creates new types of sensitivity that go unexamined by the analysis itself. For example, under some parametric models for complete data, the unconfoundedness assumption, which has no observable implications on its own, becomes testable. In the context of missing data,
Linero and Daniels (2017) describe a detailed case, due to Kenward (1998), where slight changes in the tail thickness of a parametric completedata model specification result in different testbased conclusions about ignorability.Conflating model checking with sensitivity analysis also has practical implications. First, sensitivity analysis with identified sensitivity parameters is computationally expensive because each setting of the sensitivity parameters requires that the model be refit (Hahn et al., 2016). This is particularly onerous when a sensitivity analysis strategy is employed with modern nonparametric strategies that are relatively expensive to fit. Second, exploring a range of distinct model fits in a post hoc sensitivity analysis raises the spectre of data snooping. As opposed to a valid sensitivity analysis, where the investigator could tune an observed data model in a heldout sample, in a sensitivity analysis with identified sensitivity parameters, the investigator could choose the model fit that is the most favorable to their conclusions and declare it to be the “main” model whose robustness is being checked (Rosenbaum, 2017, p. 172, “What Is Not a Sensitivity Analysis?”). Finally, Gustafson et al. (2018) note that identified sensitivity parameters often introduce “funny business” when the prior information used to calibrate the sensitivity parameters ends up in competition with the observed data.
Our sensitivity analysis framework naturally enforces a separation between sensitivity analysis and model checking. This represents a distinct advantage over some completedata modeling approaches, where this separation is often impossible to achieve (Gustafson, 2015).
3 Difficulties with CompleteData Models
In this section, we discuss some difficulties with latent confounder models, a popular completedata modeling approach to sensitivity analysis that explicitly models unobserved confounders as latent variables. Despite their popularity, this setup remain poorly understood; to quote Gustafson (2015, Chapter 7) in his discussion of these models, “The crux of the situation is that we lack theoretical insight into even quite basic questions about what is going on.” We provide some theoretical insight here, showing that, in general, it is difficult to specify latent confounder models so that the sensitivity parameters are uninformed by the observed data. This difficulty motivates our distinct approach to sensitivity analysis, which we discuss in the next section.
3.1 Latent Confounder Models
Latent confounder modeling is a common approach for sensitivity analysis in causal inference and have a long history (Rosenbaum and Rubin, 1983). Generally, latent confounder models assert that unconfoundedness would hold if only an additional latent confounder were observed; that is, for some latent variable , . The model then simultaneously specifies the conditional distributions of treatment and potential outcomes given and . This marks a departure from selection and patternmixture factorizations, which model treatment and potential outcomes sequentially. In latent variable models, dependence on the latent variable
is indexed by a vector of sensitivity parameters
. The full model specification iswhere is the density of the latent confounder; is the potential outcome distribution for treatment given the observed and latent variables; and
is the probability of receiving treatment given both
and , written with invoke a parallel to the propensity score. Here, the sensitivity parameters encode how the treatment and potential outcomes depend on the unobserved confounder. A common assumption is that , which means that the effect of the latent confounder on the potential outcome is the same in each treatment arm. For sensitivity analysis, latent confounder models are usually parameterized so that some specific values of the sensitivity parameters indicate the “no unobserved confounding” case. This occurs whenever does not depend on or when is free of for both .Despite their intuitive appeal, latent confounder models often imply observed outcome distributions that depend on sensitivity parameters in ways that are not fully transparent. The observed outcome distribution in each arm has the following form:
(2) 
The distribution of observed outcomes is a mixture over the mixing measure . Depending on the specification, the sensitivity parameters are often partially, if not fully, identified.
3.2 A Simple Example of Sensitivity Parameter Identification
Before we turn to a general treatment of sensitivity parameter identification in latent confounder models, we give a simple example of a case where identification occurs, and the complications that arise as a result.
Example 1 (Normal Outcome, Binary Confounder).
Suppose we have a study with a continuous outcome and no covariates. We make the assumption that treatment was randomly assigned according to a Bernoulli design, but it is plausible that there exists a latent class that confounds the study. To test the robustness of our conclusions to the presence of such a latent class, we propose a sensitivity analysis by introducing a binary latent confounder. The model is parameterized as follows:
When , the model reduces to random assignment to treatment. The observed data distribution depends on the sensitivity parameters . To see this, let . By Equation 2, the distribution of observed outcomes is a twocomponent mixture of normals for :
(3) 
The observed and missing potential outcome distributions, and , respectively, have the same mixture components but different mixture weights. determines the difference between the component means. The mixture weights, and component means are identifiable under relatively weak assumptions (see Everitt, 1985).
In Figure 1, we demonstrate a range of data fits obtained from this model^{1}^{1}1In this simulation we set and .. As is common practice, we fit the model multiple times for a range of sensitivity parameters. In Figure 0(a) we plot the true (blue) and inferred (red) observed potential outcome densities for the control potential outcomes. Most settings of the sensitivity parameters result in severe misfit of the observed data, and would likely be rejected by a competent analyst if they were considered the “main” model for the observed data. In this case, the sensitivity analysis operates as a model checking exercise more than as an exploration of relaxed identification assumptions. In fact, based on this parametric specification, the investigator could reject the hypothesis that the study satisfies Assumption 1 based on model fit, but could still estimate the causal effect because all parameters of the mixture model are identified.
Illustrations of sensitivity parameter identification in two sensitivity models (Example 1 and Example 2), via variation in posterior predictive distributions at different sensitivity parameter settings. (a) True (blue) and inferred (red) observed control outcome densities,
, implied by the normal outcomebinary latent confounder model with sensitivity parameters and (Example 1). and represent the difference between the true and assumed sensitivity parameters. The observed data densities are only correctly inferred when we assume the true values of the sensitivity parameters ). Note that the true potential outcome distributionsare a mixture of normals with the same component means and different variances. (b) Posterior predictive expectation functions for observed control outcomes in the binary outcomenormal confounder model with a covariate
(Example 2). Solid line shows posterior expectation; shading shows a pointwise 50% posterior predictive interval. When either sensitivity parameter or is equal to zero, the surface is monotonic, but when their signs differ, the surface shows nonmonotonicity.More elaborate forms of the model in Example 1 appear throughout the literature, including extensions that incorporate covariates. Imbens (2003) first proposed a linear model for the mean of the potential outcomes, ; Dorie et al. (2016) extended Imbens’ approach and instead model using Bayesian Adapative Regression Trees. Under these specifications, the residuals of the observed outcomes are distributed as a mixture of normals, and the identification issue noted in Example 1 remains.
3.3 Sensitivity Parameter Identification in General
To close this section, we discuss conditions under which sensitivity parameters in latent confounder models are at least partially identified. The purpose of this discussion is to emphasize that, unless strong restrictions are put on on the sensitivity model, latent confounder models will generally define sensitivity parameters that are informed by the data. In some cases, these restrictions may encourage investigators to use lessthanideal models for their observed data so that they can perform a valid sensitivity analysis, or the restrictions may discourage investigators from performing sensitivity analysis at all.
We study the observed outcome model in (2) as a mixture model to show that identification occurs under rather weak conditions. Identification in mixture models is a wellstudied problem. Hennig (2000) contains a thorough reference list, with general treatments of the problem given by Teicher (1960), Chandra (1977), and Rao (1992). Everitt (1985) gives an accessible review of identification in the case of finite mixtures.
For our discussion in this section, we focus on partial identification, where the observed data distribution contains information that narrows down possible values for the sensitivity parameters. We formalize this information in terms of incompatibility between the observed data distribution and the complete data distribution induced by certain settings of the sensitivity parameters.
Definition 1 (Incompatibility).
We say the observed data distribution is incompatible with a sensitivity parameter setting iff there exists no complete data model with sensitivity parameters that implies .
We say that a sensitivity parameter is partially identified if the model family can induce observed data distributions that are incompatible with some values of the sensitivity parameters. For example, in the context of Example 1, the latent confounder model can induce multimodal observed outcome densities, but such densities are incompatible with the setting . Thus, is partially identified; one could, in principle, reject this sensitivity parameter value on the basis of a test of unimodality for the observed data densities.
Here, we show that partial identification occurs under fairly weak conditions on the latent confounder model. Stronger results regarding point identification under more restrictive parametric assumptions are available in the references cited above. Following Chandra (1977), we state our conditions in terms of model families. Let be the family of observed data distributions and let be the family of conditional outcome distributions . Recall that, by (2), the members of are mixtures over the members of .
Proposition 1 (Sufficient Condition for Partial Identification of Sensitivity Parameters).
Suppose the following two conditions hold:

, so that the mixture outcome distribution has a strictly larger model family than the conditional outcome distribution, and

There exist sensitivity parameter settings such that does not depend on , or such that is degenerate.
Then the sensitivity parameters are partially identified.
Proof.
For any setting that satisfies the second condition, the corresponding mixture distribution must lie within . Thus, any mixture distribution is incompatible with these sensitivity parameter settings, so the sensitivity parameter is partially identified. ∎
Proposition 1 is important because it outlines cases in which the parametric specification of the sensitivity model makes it possible to reject unconfoundedness using the observed data. The conditions in Proposition 1 are weak, but not universal. The second condition is almost always satisfied by design because it corresponds to having settings of the sensitivity parameters that represent the unconfounded case. The first condition, however, depends on the sample space of and the parametric specification, and does not hold in some common specifications. For example, in the sensitivity analysis of Rosenbaum and Rubin (1983), the outcomes and latent confounder are binary and sensitivity parameters are not shared across covariate strata (see also Jung et al., 2018)
. In this case, within each covariate stratum, mixtures of Bernoulli random variables are themselves Bernoulli so
, and the sensitivity parameters are unidentified. However, in the common case where sensitivity parameters are shared across covariate strata, the condition holds, even when outcomes are binary. We demonstrate this in the following example.Example 2 (Binary Outcome with Covariates).
Consider a latent confounder model with binary outcomes. We posit the existence of a normally distributed latent confounder:
(4)  
(5)  
(6)  
(7) 
Here, the observed outcome distribution is Bernoulli, and characterized by the function
In this example, is a smaller family than . In particular, the family only contains distributions whose mean function is monotone in , given the linear dependence in and the monotonicity of . Meanwhile, the family contains distributions whose mean function can be nonmonotone in because of the indirect dependence on introduced by the mixing measure . This nonmonotonicity is particularly pronounced when and have opposite sign for at least one of . We demonstrate how the posterior predictive distribution of observed data varies as a function of the sensitivity parameters in this model in Figure 1. Gustafson (2015, Chapter 7) discusses more general versions of this example, noting that there is no general strategy for separating information about parameters and the sensitivity parameters in a transparent way.
Finally, several authors have proposed completedata sensitivity models with unidentified sensitivity parameters, most of which require binary or categorical outcomes (Rosenbaum and Rubin, 1983; Robins, 1997; Vansteelandt et al., 2006; Daniels and Hogan, 2008). However, as we have demonstrated in this section, this nonidentification is difficult to achieve in general, may require nontrivial reparameterization, and is fragile when attempting to generalize these sensitivity analysis methods. This motivates the sensitivity analysis framework we present below, whose guiding principle is clean separation of observable and unobservable model implications.
4 The Extrapolation Factorization for Causal Inference
In this section, we present an approach to sensitivity analysis built around a specific factorization of the joint density of potential outcomes called the extrapolation factorization. This factorization splits the joint density of potential outcomes and treatment into factors that are either completely identified or completely unidentified by the observed data. Based on this factorization, it is straightforward to define a sensitivity analysis by parameterizing the distribution of missing potential outcomes in terms of an extrapolation from the observed data distribution. We begin my demonstrating the extrapolation factorization on only one arm of an observational study. Specifically, we examine the joint distribution of treated outcomes and the treatment indicator . This case is analogous to a missing data problem, where the extrapolation factorization has has been applied previously (Birmingham et al., 2003; Scharfstein et al., 1999; Franks et al., 2016; Linero and Daniels, 2017). The extrapolation factorization of their joint distribution is
(8) 
Here, the first two factors constitute the observed data density, which is nonparametrically identified, while the final factor is determined by the selection function, which is unidentified by easily interpreted. In our approach, we parameterize the selection function with sensitivity parameters . This factorization implies the missing data model in (1) that is explicitly a function of both of these factors. Thus, the unidentified selection function fully determines the relationship between observed outcome distribution and the distribution of missing outcomes.
In this section, we demonstrate how this factorization can be extended to causal inference. The key result can be summarized as: “causal inference is missing data twice”. Specifically, we show that for most estimands, under mild conditions, the extrapolation factorization can be applied separately in each arm of an observational study to specify a welldefined sensitivity analysis. Throughout this section we suppress conditioning on covariates for notational clarity and return to models that include covariates in Section 5.
4.1 The Extrapolation Factorization with Both Potential Outcomes
The extrapolation factorization can be extended to the twosided missingness setting of causal inference. In particular, we show here that the joint distribution of potential outcomes and treatment can be uniquely specified by supplementing the distribution of the observed data with three unidentified models: a model for treatment given alone; a model for treatment given alone; and a copula that specifies the dependence between ) and given .
This factorization we give below is valid under the following overlap assumption about the joint distribution of potential outcomes and treatment . This condition ensures that all factors are finite with probability 1.
Assumption 2 (Outcome Overlap Condition).
For each treatment level the support of the missing potential outcomes is a subset of the support of the observed potential outcomes. That is,
for all sets in the outcome sample space .
Under Assumption 2, the joint density can be decomposed into two univariate missing data densities and a copula. The derivation follows as a consequence of applying the extrapolation factorization to the marginal densities and :
(9) 
where we use
to represent a cumulative distribution function and where
is the copula density that characterizes the residual dependence between potential outcomes conditional on the assigned treatment. We refer to this factor as the conditional copula.The only unidentifiable factors in (4.1) are the marginal selection factors and , which specify the nonignorable selection mechanism in each arm, as well as the conditional copula. In this work, we focus on conducting sensitivity analyses by specifying the marginal selection mechanism, which implies the following joint treatment mechanism:
(10) 
Further, by (4.1), the distribution of the missing potential outcomes, conditional on the observed potential outcomes is
(11) 
The extrapolation factorization for causal inference is closely related to several existing approaches in the causal and missing data literature. Specifically, (11) highlights the relationship between the extrapolation factorization and importance weight methodology, since can be viewed as an importance weight and as the proposal distribution (Riddles et al., 2016). Our approach is also closely related to inverse probability weighting models for sensitivity analysis (e.g. Zhao et al., 2017), which are specified in terms of a joint selection function .
4.2 Sufficient Specifications
For a sensitivity analysis to be welldefined, the estimand of interest must be uniquely determined when the sensitivity parameters are fixed (Daniels and Hogan, 2008; Linero and Daniels, 2017). We now establish a key result: for a large class of estimands, including the ATE and QTE, to specify a welldefined sensitivity analysis, it is sufficient to specify the marginal selection functions, and without specifying the conditional copula. This is useful because the dependence between potential outcomes is unobservable, even in experiments, and thus assumptions about this dependence are difficult to calibrate. We refer to the general class of estimands for which this invariance holds as marginal contrasts.
Definition 2.
A causal estimand is a marginal contrast estimand iff it can be completely characterized as a function of the marginal distributions of the potential outcomes and . Specifically, writing as a functional of the joint distribution , and with a slight abuse of notation,
Theorem 1.
A marginal contrast estimand is uniquely defined by the marginal selection functions.
Proof.
See appendix. ∎
Stated differently, marginal contrast estimands, like and , are invariant to the conditional copula . In the specific case of Bayesian inference, the invariance of the estimand translates to invariance in estimation. In particular, the following corollary establishes when the posterior distribution for marginal contrast estimands exhibits invariance to the specification of the copula.
Corollary 1.1.
If the parameters of are distinct from the parameters of the treatment selection function and the observed data density, then the posterior distributions for marginal contrast estimands are invariant to the specification of the conditional copula in the model likelihood.
This establishes that the selection factors in each treatment arm are the only unidentifiable factors that need to be specified when estimating marginal contrast estimands. In Section 5 we propose useful and interpretable model specifications for assessing sensitivity to unmeasured confounding using the extrapolation factorization. In Section 6 we demonstrate our approach on both real and simulated data.
5 Logistic Model Specifications
In the previous section, we showed that, to conduct sensitivity analysis for many common causal estimands like the ATE and QTE, it is sufficient to supplement the observed data model with a specification of unidentified marginal selection functions . In this section, we propose a simple class of marginal selection function using a logistic specification, borrowing a common specification for treatment assignment in latent confounder models. We discuss parameter interpretation and calibration, and examine analytically how extrapolations under this specification behave when the observed data model is composed of exponential family distributions. In the appendix, Section B, we propose an alternative extrapolation approach based on latent class models, and compare their utility to the logistic models discussed below.
5.1 Specification and Interpretation
Under a logistic sensitivity specification, we posit that the logodds of receiving treatment are linear in some sufficient statistics of the potential outcomes, :
(12) 
where . This specification has sensitivity parameters , which describe how treatment assignment depends marginally on each potential outcome. Logistic selection is commonly used in latent confounder approaches to model the probability of treatment given the unobserved confounder. Here, we take a similar approach but instead assume that the treatment probabilities are logistic in (sufficient statistics of) the potential outcomes.
In this specification, the sensitivity parameters have a relatively straightforward interpretation: they specify how sufficient statistics of the potential outcomes are over or underrepresented among observed control and treated units. Consider the case where ; that is, the sufficient statistic for the logistic selection model is simply itself. In this setting, and are scalars and we can interpret their signs. For example, implies that units with large are overrepresented among treated units, and the observed treated unit average is too large. Likewise, implies that units with large values of are more likely to be assigned to treatment, so the observed control unit average is too small. We give several examples to illustrate interpretation in practice.
Example 3 (Same Sign).
Consider a study of a medical treatment with health outcome and unmeasured income variable . Suppose that larger values of correspond to good outcomes, higher income induces better outcomes (for example, with ) and that treatment is expensive, so is increasing in . In this study, both and are positively correlated with , so large values of are overrepresented among the observed treated units, and large values of are underrepresented among control units, corresponding to positive and . When and have the same sign, the treatment and control group means are biased in opposite directions, so the ATE changes rapidly when the sensitivity parameters are moved in these directions.
Example 4 (Opposite Signs).
Consider the canonical “perfect doctor” example (Rubin, 2003), where a particular medical treatment is prescribed by a doctor who is able to perfectly predict and for each patient, and assigns the patient to whichever arm gives them the higher outcome. Suppose that in this population is independent of ; that is, the individual treatment effect for each unit is unrelated to each unit’s expected outcome under random assignment to treatment. For example, suppose that are independently normal. Then the control arm and the treated arm both overrepresent high outcomes, so and . When and have opposite signs, the treated and control group means are biased in the same direction, so the ATE changes slowly when the sensitivity parameters are moved in these directions.
Example 5 (SingleArm Confounding).
Consider a study evaluating a job training program with wage outcome where, prior to the study, a subset of units is randomly given access to an alternative program that they can attend if they do not enroll in the treatment, indicated by . Suppose that the alternative program is beneficial on average (for example, suppose the potential outcomes are given by with and ), and units with access to the alternative program are less likely to enroll in treatment, so is decreasing in . Then, the observed outcomes under treatment are representative of the distribution of , so , but, under control, units with higher wage outcomes are overrepresented, so . When only one of the sensitivity parameters is nonzero, only one of the group means is biased, so the ATE changes moderately when the nonzero sensitivity parameter is moved.
5.2 Logistic Selection with Exponential Family Models
In this section, we explore some of the properties of the extrapolations implied by logistic selection functions when the observed data distribution is a member of the exponential family, or is a mixture of exponential family distributions. Following Franks et al. (2016), we show that with observed data distributions of this form, logistic selection functions induce analytical extrapolations. While logistic selection functions can be applied more broadly than to this class of models, we note that mixtures of exponential families are a rather flexible model class, and include nonparametric methods like Dirichlet process mixtures. Thus, beyond being convenient analytical examples, the results in this section can be used directly to efficiently implement sensitivity analysis with many flexible observed data models.
Single exponential family models.
We start by describing the extrapolations of observed data densities that belong to a single exponential family distribution and later describe the extension to mixtures. Here, we model the observed data as an exponential family distribution with natural parameter , possibly depending on ,
(13) 
where is the partition function, is the base measure and is the natural sufficient statistic, possibly multivariate, for the exponential family. When the selection function is assumed to have form (12), Franks et al. (2016) show that the missing data distribution belongs to the same exponential family as the observed data, with natural parameter ,
(14) 
The implied complete data density can then be expressed as a simple mixture of the observed and missing components. In this specification are sensitivity parameters that independently parameterize sensitivity for the control and treated groups. Importantly, the sensitivity parameters are independent of the covariates. We return to this point in the discussion.
To provide some intuition for how extrapolation operates under this specification, we present two simple examples.
Example 6 (Binary observed outcomes).
Let and
By Equation 14 the unobserved potential outcome distribution is:
Thus, the extrapolation factorization with logistic treatment assignment applied to Bernoulli data implies an additive shift in the logodds of the unobserved potential outcomes (relative to the observed potential outcome distribution).
Example 7 (Gaussian observed outcomes).
Because the normal distribution is a two parameter exponential family distribution with sufficient statistics , we consider a treatment assignment mechanism that is quadratic in the potential outcomes.
By Equation 14 the unobserved potential outcome distribution is:
(15) 
This model has two sensitivity parameters for each treatment arm. Assuming that the logistic function is linear in , i.e.
, implies that standard deviations of observed and missing potential outcome distributions are identical.
^{2}^{2}2Note that in the quadratic model we require to ensure propriety of the missing potential outcome distribution. In Section 6.1, we estimate from the observed data using BART, which allows us to conduct a sensitivity analysis on potential outcomes with flexibly estimated response surfaces.Mixtures of exponential families.
Logistic extrapolations of exponential family models are easily extended to mixtures of exponential family distributions. Let be a mixture of exponential family distributions with mixture weights . Then, as shown in Franks et al. (2016),
(16) 
where , as above, and
(17) 
Consider the following simple example.
Example 8 (Gaussian mixture observed outcomes).
We extend the normal model above to a finite mixture of normal distributions: . For simplicity we assume . Then, by Equations 16 and 17,
(18)  
In this extrapolation model, the sensitivity parameters affect both the mixture weights and the component means: the mixture weight for component increases as approaches .
In general, we can apply the extrapolation factorization with logistic selection to observed data densities modeled with nonparametric Bayesian methods like Dirichlet process mixtures (DPM) of any exponential family distribution, and can can easily adapt the mixture results to model structured semicontinuous data as well. In Section 6.2, we demonstrate both of these features by modeling zeroinflated income data, and use a DPM model for the continuous component.
5.3 Calibrating Sensitivity Parameters
Since sensitivity parameters are inherently unidentified, calibrating the magnitude of the parameters requires reasoning about plausible values using prior knowledge and domain expertise. Our proposed calibration approach is tied to the additional proportion of variation explained by in a regression of the treatment assignment indicator on and . This approach is consistent with corresponding proposals for some latent confounder models (Imbens, 2003; Cinelli and Hazlett, 2018), with the important difference that our parameterization involves coefficients for the potential outcomes rather than for a latent confounder. Other approaches that calibrate using inferred regression coefficients for observed covariates are also possible (Dorie et al., 2016; Blackwell, 2014; Middleton et al., 2016) but can be harder to interpret because of collinearity (see also Oster, 2017).
Here, we will again focus on the simple case where and and are scalars, so the probability of assignment to treatment has the following form:
(19) 
We suggest bounding so that explains a plausible amount of variation in the treatment assignment, above and beyond the covariates . To quantify this idea, we adopt a version of the “implicit ” measure from Imbens (2003)
, which generalizes varianceexplained measures to the case of logistic regression. In the discussion that follows, we work on the logit scale, so let
be the logit of the propensity score given variables . We define the variance explained by , and the partial variance explained by given , respectively, as(20)  
(21) 
Here, corresponds to the variance explained on the latent scale in the latent variable representation of logistic regression. The term appears because this is the variance of a standard logistic random variable, which is the distribution of the residual on the latent scale. Similarly, represents the proportional change in latent variance explained when is added to the set of predictors .
Because enters logodds of linearly, there is a onetoone, monotonic relationship between the magnitude of and .
Proposition 2 (Calibration Identity).
Suppose that is linear in , with form (19). Then
(22) 
where is the residual standard deviation of the potential outcome . Equivalently,
(23) 
Proof.
See appendix. ∎
We can use (23) to set bounds on based on a plausible upper bound on partial variance explained . Although the decision ultimately falls to domain expertise, we suggest using observed predictors to set a plausible target by analogy. For example, for each covariate , we can compute a partial variance explained by given all other covariates , and, based on expert knowledge, set the target for an appropriate covariate. We interpret this to mean that the information gained by adding to as a predictor of treatment assignment is comparable to the information gained by adding to . In some cases, alternative calibration schemes may be more appropriate. For example, one could calibrate for some core subset of covariates using (23), replacing with . In Section 6.2, we take this approach and apply a calibration scheme setting to be the empty set. See Cinelli and Hazlett (2018) for a discussion of related alternatives.
After choosing , we set using empirical estimates of and . Estimating is trivial using the propensity score model from the sensitivity analysis. Estimating the residual standard deviation is slightly more involved because it is a property of the complete distribution of potential outcomes . However, the observed data model and together determine , so we can treat as a function of and obtain a solution to (23). With the exponential family models considered in Section 5.2, is usually available in analytic form. In many observeddata model specifications, the residual standard deviation in both the observed and missing data densities are identical, so this recursive approach is not necessary (e.g. Equation 15 for ). Residual standard deviations may also differ for the treated and control potential outcomes; in this case, calibration can be done separately for and .
6 Applications of the Logistic Selection Model
In this section, we demonstrate the application of the logistic selection model to two examples. In the first example, we conduct sensitivity analysis for Bayesian estimation of the effects of blood pressure medication. In this example, we use a nonparametric estimate of the response surface but assume normally distributed residuals. In the second example, we conduct sensitivity analysis on the effect of a job training program on income and employment. In this example, the outcome is zeroinflated, and we focus on quantile treatment effects rather than average treatment effects. In addition to demonstrating the flexibility in modeling structured data, we also show how the extrapolation factorization can be applied to nonparametric residual models, by modeling the continuous component of the observed data densities using a Dirichlet Process Mixture of normal distributions. In both examples, we demonstrate the flexibility of our approach by estimating the posterior distribution of quantile treatment effects for a range of quantiles.
6.1 Analysis of NHANES data
In this section, we consider a study aimed at estimating the effect of ‘taking two or more antihypertensives’ on average diastolic blood pressure using data from the Third National Health and Nutrition Examination Survey (NHANES III) (Centers for Disease Control and Prevention (CDC), 1997), a comprehensive survey of Americans’ health and nutritional status. We follow the same set up as Dorie et al. (2016), and utilize pretreatment covariates like race, gender, age, income, body mass index (BMI), and whether the patient was insured, among others. We let be the average diastolic blood pressure for a subject in treatment arm , where means the subject was taking two or more antihypertensive medications and means the subject was not.
In this application, we showcase several properties of our sensitivity analysis method. First, we show that our method works well with nonparametric models for the observed data. Second, we highlight the clean separation of sensitivity analysis and model checking by examining how estimates change under both different model specifications and different sensitivity parameter specifications within each model. Finally, we show that our method allows simultaneous sensitivity analysis for the ATE and QTEs.
First, we assume the following data generating model for the potential outcomes:
(24)  
(25)  
(26) 
As shown in Section 5.2, the missing data distribution is therefore:
(27) 
Again following Dorie et al. (2016) we use Bayesian Adaptive Regression Trees (BART) (Chipman et al., 2010) to flexibly model and . In contrast to their approach, we focus on using BART to estimate only the observed potential outcomes and do not incorporate latent confounders into the BART model. We consider two variations of the observed potential outcome model. First, we posit a pooled model for the mean surface and residual variance with and . We also consider a model in which there is an interaction between the treatment assignment and the covariates. In this unpooled model, we independently estimate the mean surface, , for each treatment level, and similarly, infer distinct residual variances. Concretely, we assume . In the unpooled model, no information is shared about the response surfaces in the observed outcome models for the two treatment arms, whereas in the pooled model the coefficients are assumed to be identical. Although not explored here, a partial pooling approach is also possible.
The ATE and QTE can be estimated as functions of the estimated marginal complete data distributions for the potential outcomes (averaged over covariates) where:
(28) 
Calibration.
We calibrate the magnitude of the sensitivity parameters using the approach outlined in Section 5.3. The two most important predictors, in terms of partial coefficient of determination, , are age (in months) and body mass index (BMI). In this analysis we choose to bound the sensitivity parameters based on BMI, which has a partial varianceexplained value of approximately (See calibration plot in Figure 5). If we assume , applying the formula in (23) implies . Further, under this model for the observed data and logistic selection, the residual standard deviation of , , is the same as the residual standard deviation from the observed data regression . Thus, in the pooled model, we estimate and bound both and with . In the unpooled model, we estimate and and bound the sensitivity parameter magnitudes at and .