Covariate Balancing Sensitivity Analysis for Extrapolating Randomized Trials across Locations

by Xinkun Nie, et al.
Stanford University

The ability to generalize experimental results from randomized control trials (RCTs) across locations is crucial for informing policy decisions in targeted regions. Such generalization is often hindered by the lack of identifiability due to unmeasured effect modifiers that compromise direct transport of treatment effect estimates from one location to another. We build upon sensitivity analysis in observational studies and propose an optimization procedure that allows us to get bounds on the treatment effects in targeted regions. Furthermore, we construct more informative bounds by balancing on the moments of covariates. In simulation experiments, we show that the covariate balancing approach is promising in getting sharper identification intervals.




1 Introduction

Randomized control trials (RCTs) have long been the gold standard in evaluating public policy. When we leverage insights learned from RCTs conducted in one location to drive policy change in other locations, however, external validity remains a concern: the target population affected by the considered policy change may respond to the intervention differently than the population sample included in the RCT study (Cook et al., 2002; Olsen et al., 2013). For example, perhaps the randomized study was run in Texas, but we want to use results from the study to inform policy across all 50 states. Or perhaps participation in the randomized study was voluntary. As a concrete example, Attanasio et al. (2011) examine the effect of vocational training on labor market outcomes using a randomized study, but participants needed to apply to be part of the study, and so we may be concerned that the effect of vocational training for study participants differs from the effect of training on the general population.

In cases where any cross-location divergences in populations can be explained away using observed covariates, the generalization problem has been addressed by several authors, including Dehejia et al. (2019), Hirshberg et al. (2019), Hotz et al. (2005), and Stuart et al. (2011), using ideas that generalize propensity score methods going back to Rosenbaum and Rubin (1983). In this paper, however, we are most interested in the case where locations differ along unobserved features, and so controlling for observed covariates doesn’t solve the cross-location generalization problem.

As discussed further below, our proposal builds upon the literature on sensitivity analysis in observational studies (Rosenbaum, 2002b), and in particular on recently proposed methods for sensitivity analysis via linear programming (Aronow and Lee, 2012; Miratrix et al., 2017; Zhao et al., 2019). This literature is focused on cases where a randomized or observational study is confounded by unobserved features, and so the resulting treatment effect estimates do not have internal validity. Here, in contrast, we are not worried about the internal validity of our study; rather, we are concerned about how unobserved features may affect external validity when we aim to extrapolate results from RCT studies to other locations.

Our main finding is that in the cross-study generalization problem we have access to richer information than in the single-study sensitivity analysis problem, and that we can use this information to substantially improve power. At a high level, our approach is driven by the insight that we can collect data on observable features in both the location where the RCT was conducted and the target location we wish to extrapolate to, and that any selection weights that transport treatment effect estimates from one location to the other must respect moments of these observed covariates. As such, our motivation is qualitatively related to recent work on covariate-balancing estimation in causal inference (Athey et al., 2018; Graham et al., 2016; Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015).

1.1 Related Work

Generalizing results from randomized trials run in one location to other locations has been a central question in economic and healthcare policymaking (e.g., Cole and Stuart, 2010; Gechter, 2015; Hotz et al., 2005). Several authors have discussed generalizability bias (or transportability bias) that arises when there is either selection bias in an RCT or we try to generalize the results of one RCT to a different location that has a different covariate distribution (Bareinboim and Pearl, 2016; Olsen et al., 2013; Pressler and Kaizar, 2013). To address covariate shift in the case of no measured confounding, many existing works use the generalized propensity scores for either matching or weighting (Cole and Stuart, 2010; Dehejia et al., 2019; Hirshberg and Wager, 2021; Stuart et al., 2011; Westreich et al., 2017), or covariate matching (e.g., Gretton et al., 2009). We also note the recent advances in the literature on transportability in causal graphical models that focuses on designing graphical algorithms for identifying whether transportability is feasible from graphs (e.g., Bareinboim and Pearl, 2013, 2014, 2016; Pearl and Bareinboim, 2014). Our work is complementary to theirs in the sense that we don’t assume transportability is necessarily feasible due to unmeasured effect modifiers, and instead we compute identification intervals whose length varies with the distributional imbalance of these unobserved factors across locations.

We build upon the literature on sensitivity analysis in observational studies (Cornfield et al., 1959; Fogarty and Small, 2016; Imbens, 2003; Rosenbaum, 1987, 2002a, 2002b, 2010, 2011, 2014; Shen et al., 2011; VanderWeele and Ding, 2017) and in particular, the recently developed works on using linear programming (Aronow and Lee, 2012; Miratrix et al., 2017; Yadlowsky et al., 2018; Zhao et al., 2019). This line of work focuses on using sensitivity analysis to address the internal validity of an observational study. In contrast, our work is focused on addressing the external validity of an RCT. The sensitivity model we employ is most closely connected to the marginal sensitivity model advocated in Tan (2006) and Zhao et al. (2019).

Given that violations of transport assumptions are untestable when generalizing results from an RCT to other locations, a few authors have proposed methods that perform a sensitivity analysis on how much a violation of these assumptions can lead to generalizability bias. Andrews and Oster (2017) consider the limit of the role of private information, and their approach only applies to the special case where the randomized study population is a subset of the target population we wish to apply a policy intervention to. Gechter (2015) considers the range of distributions of the treated potential outcomes conditional on the control potential outcomes. Nguyen et al. (2017) consider the range of the mean of the unobserved covariates. We view these approaches as complementary to our proposal, as we consider the limit on the ratio of the probability density function of the unobserved covariates across locations, and further exploit covariate balance for improved power.

The core of our proposal is that by leveraging covariate information in the locations of interest, the set of weights for weighting outcomes can be used to balance moments on the covariates between locations. This would shorten identification intervals for the target population compared to prior work. In particular, the idea of covariate balancing has become popular in other contexts such as estimating average treatment effect (Athey et al., 2018; Bennett et al., 2018; Graham et al., 2016; Hainmueller, 2012; Hirshberg and Wager, 2021; Kallus, 2018; Wang and Zubizarreta, 2017; Zhao, 2019; Zubizarreta, 2015).

We note that there is a growing interest in combining observational data with RCTs (Athey et al., 2016; Kaizar, 2011; Kallus et al., 2018; Rosenman et al., 2018) to improve power. In our work, we focus on only using the RCT data, and leave it to future work to incorporate observational data in our framework. We also note the interesting and promising direction of using Bayesian hierarchical models for combining RCT results from multiple studies (e.g., Meager, 2016; Vivalt, 2016). They work in the setting of using aggregated experimental data on the study level, whereas our work leverages individual-level data. There is also a growing literature on transfer learning for domain shifts (see, e.g., Bastani (2020) and references therein for a review); in our case, we focus on transferring results from a randomized trial from one location to another.

2 Extrapolating on Observables and Unobservables

2.1 Formal Model and Assumptions

We assume there are two locations (e.g., two cities). In the first location, an RCT has been conducted to evaluate the impact of an intervention on some outcome of interest, e.g., whether a job training program improved participants’ earnings. In the second location, without running an additional RCT tailored for this new population, we want to ask to what extent the previous RCT result is applicable to the new location. In both locations, we observe a set of covariates such as each citizen’s age and marital status, but we might not have access to other important covariates such as their education level.

Formally, we denote the two locations by Location 0 and Location 1. Location 0 is where the RCT is conducted, and we wish to generalize the RCT results to Location 1. Throughout the paper, we use the subscripts 0 and 1 to denote respective quantities for the two locations. We assume the data are generated i.i.d. in each location: for where the subscript corresponds to the two locations respectively, is the observed covariates, is the unmeasured covariates, is the randomized treatment assignment indicator where indicates that the treatment is assigned, and indicates otherwise, and for is the potential outcome corresponding to having received treatment or lack thereof. In Location 0, we observe data with , and the propensity score is a known constant. On the other hand, in Location 1, we assume we only observe the covariates . The causal estimand is , denoting the average treatment effect in the target Location 1. For convenience, we use to denote the location indicator for each unit.

We start with a few assumptions.

Assumption 1 (Conditional location independence).

For all and for all , .

This assumption implies that the distribution of the potential outcomes conditioning on the observed variables and any unmeasured effect modifiers is the same across the two locations. Most existing works in the literature assume transportability only conditioning on the observed covariates (e.g., Hotz et al., 2005). Given that the unmeasured effect modifier can have any association with the potential outcomes and conditionally on , we note that Assumption 1 does not impose any meaningful restrictions on its own.

Assumption 2 (Support inclusion).

. In particular, is absolutely continuous with respect to , where the covariate density ratio and the observed covariate density ratio exist.

Proposition 1.

Under Assumptions 1 and 2, we have


The above proposition shows that if the unmeasured effect modifiers were known and measured in both locations, then the causal quantity of interest is identified. In order to turn the above into an estimator, we could use a standard inverse weighted estimator:


such that . Since is unobserved, the above estimator is not feasible. Instead, we further allow sensitivity models detailed in the next subsections which assume bounds on the ratio of the conditional probability density of unmeasured effect modifiers . This would in turn allow us to derive bounds for via a linear programming optimization.
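To make the weighting idea concrete, the following is a minimal Python sketch of a self-normalized (Hajek-style) weighted estimator, written under the assumption that the full density ratio were available for every RCT unit. The names `y`, `w`, and `ratio` are illustrative and are not notation taken from the paper. In practice `ratio` depends on the unmeasured effect modifiers and cannot be computed, which is the reason a sensitivity model is needed.

```python
def transported_effect(y, w, ratio):
    """Self-normalized weighted difference in means over RCT units.

    y:     observed outcomes in the RCT location
    w:     treatment indicators (1 = treated, 0 = control)
    ratio: hypothetical density ratio (target over RCT location) per unit;
           this is an oracle quantity, assumed known only for illustration
    """
    num_t = sum(r * yi for yi, wi, r in zip(y, w, ratio) if wi == 1)
    den_t = sum(r for wi, r in zip(w, ratio) if wi == 1)
    num_c = sum(r * yi for yi, wi, r in zip(y, w, ratio) if wi == 0)
    den_c = sum(r for wi, r in zip(w, ratio) if wi == 0)
    # Normalizing each arm by its own weight total cancels the constant
    # propensity score, so it does not need to be passed in.
    return num_t / den_t - num_c / den_c


# With equal weights this reduces to the plain difference in means.
plain = transported_effect([1.0, 3.0, 0.0, 2.0], [1, 1, 0, 0], [1, 1, 1, 1])
# Up-weighting the second treated unit shifts the treated mean upward.
shifted = transported_effect([1.0, 3.0, 0.0, 2.0], [1, 1, 0, 0], [1, 3, 1, 1])
```

Self-normalization is a design choice in this sketch; an unnormalized version that divides by the known propensity score would be an equally reasonable illustration.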

3 A sensitivity model for domain extrapolation

To relate and , we define the unobserved distribution shift ratio as

and use the shorthand . These capture the amount by which the unobserved effect modifiers affect the oracle estimator (2), which we can re-write as


Given we don’t know the density ratio terms in (2) due to the unmeasured effect modifier , we instead aim to get bounds on by estimating which does not depend on and by assuming a bound on . In particular, we assume the following sensitivity model that directly implies a bound on .

Assumption 3 (Transport sensitivity model).

There exists such that for all and all .

By Bayes rule, we immediately have the following:

Proposition 2.

By Assumption 3, for all and .

Analogous sensitivity models are common in the sensitivity analysis literature for observational studies (e.g. Rosenbaum, 2002b; Zhao et al., 2019) to assess robustness of findings to unmeasured confounding.

4 Bounds via linear programming

With a bound on and an estimated density ratio , we can derive bounds on from (3). Since we don’t know the ground truth value of , we supply a value and solve the following optimization problem to get the upper bound .


To get the lower bound , we instead take the infimum in the optimization above. This key idea of employing a linear program to bound the target estimand when the density ratio is unknown but can vary within some range is also seen in previous literature. For example, Aronow and Lee (2012) and Miratrix et al. (2017) employ similar ideas for identification of a population mean when the sampling selection weight is unknown.
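As a rough illustration of how such bounds can be computed, the sketch below bounds a self-normalized weighted mean when each weight may vary within a multiplicative band around an estimated density ratio. It uses threshold enumeration, a swapped-in technique that solves this particular linear-fractional problem exactly, rather than a call to an LP solver. The names `ratio` and `gamma`, and the choice to normalize the treated and control arms separately, are assumptions made for this example and may differ from the exact formulation in the paper.

```python
def weighted_mean_bounds(y, ratio, gamma):
    """Range of sum(w*y)/sum(w) over w_i in [ratio_i/gamma, ratio_i*gamma]."""
    def upper(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        best = float("-inf")
        # k = number of largest outcomes that receive the upper weight;
        # the maximizer always has this threshold structure.
        for k in range(len(vals) + 1):
            cut = len(vals) - k
            num = den = 0.0
            for pos, i in enumerate(order):
                wi = ratio[i] * gamma if pos >= cut else ratio[i] / gamma
                num += wi * vals[i]
                den += wi
            best = max(best, num / den)
        return best

    # The lower bound is minus the upper bound of the negated outcomes.
    return -upper([-v for v in y]), upper(list(y))


def effect_bounds(y, w, ratio, gamma):
    """Bounds on the weighted difference in means, arm by arm."""
    yt = [yi for yi, wi in zip(y, w) if wi == 1]
    rt = [ri for ri, wi in zip(ratio, w) if wi == 1]
    yc = [yi for yi, wi in zip(y, w) if wi == 0]
    rc = [ri for ri, wi in zip(ratio, w) if wi == 0]
    lo_t, hi_t = weighted_mean_bounds(yt, rt, gamma)
    lo_c, hi_c = weighted_mean_bounds(yc, rc, gamma)
    return lo_t - hi_c, hi_t - lo_c


lo, hi = weighted_mean_bounds([0.0, 1.0], [1.0, 1.0], 2.0)
point = effect_bounds([1.0, 3.0, 0.0, 2.0], [1, 1, 0, 0], [1.0] * 4, 1.0)
```

Setting `gamma` to 1 collapses the band, so both bounds coincide with the point estimate; larger values widen the interval.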

Remark 1.

In practice, we suggest varying across a wide range and reporting the resulting upper and lower bounds, with domain experts advising on realistic choices of . We also note that in the context of potential violations of exogeneity in observational studies, Imbens (2003) proposes leveraging observed covariates to judge the plausibility of sensitivity parameters. We leave it to future work to investigate similar heuristics for guiding the choice of the sensitivity parameter.


The roadmap for the rest of this section is as follows: We first show how to estimate in a way that allows us to achieve sharper bounds. Next, we use the estimate to get the upper bound by (4) (and respectively, the lower bound by taking the infimum of the same optimization) for . Finally, we conclude by showing that the bounds give consistent coverage for the ground truth treatment effect in Location 1.

First, we estimate . While (4) takes in from any density ratio estimators, we suggest estimating directly by moments matching (Bradic et al., 2019; Imai and Ratkovic, 2014; Ning et al., 2017; Qin, 1998; Sugiyama et al., 2012). Doing so would allow us to effectively shorten the estimation bounds as detailed later in Section 5. To fix ideas, we denote and then assume the following:

Assumption 4.

We have access to basis function such that for some , and for some .

When the basis function is the identity, this simply implies that the density ratio follows a logistic form. On the other hand, we can take to be any bounded basis expansion, which makes the above assumption not as restrictive. In Section 6, we further relax this assumption and incorporate the model misspecification error. For the rest of the paper, we assume contains an intercept term.

By Assumption 2 and the fact that an RCT is conducted in Location , we have the following covariates balancing moment condition via a change of measure: for ,


We exploit the empirical counterpart of the above moment equation to estimate the . Given Assumption 4, we only need to estimate . Concretely, we solve the following optimization problem:


and then set


The reason we fit separately on the treated and control units is that it enables us to exactly match the empirical version of the moment condition in (5) for both treated and control groups (this follows immediately from the first-order condition in (6)):


We note that, if is unique, then both will eventually converge to in large samples.
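The balancing step can be sketched in a few lines of Python. The example below assumes a single scalar covariate, the basis (1, x), and an exponential-tilting loss whose first-order condition is exactly the empirical balance condition; this loss is an assumed form chosen for illustration and is not necessarily the objective used in (6). All function and variable names are hypothetical. In the procedure described above, a fit of this kind would be run once on the treated units and once on the control units.

```python
import math


def fit_balancing_ratio(x_source, x_target, iters=50):
    """Weights r_i = exp(a + b*x_i) on source units that satisfy
    mean(r) = 1 and mean(r*x) = mean(x_target) at the optimum."""
    n = len(x_source)
    target_mean = sum(x_target) / len(x_target)

    def loss(a, b):
        # Convex loss; its gradient is the pair of balance residuals.
        tilt = sum(math.exp(a + b * x) for x in x_source) / n
        return tilt - a - b * target_mean

    a = b = 0.0
    for _ in range(iters):
        r = [math.exp(a + b * x) for x in x_source]
        s0 = sum(r) / n
        s1 = sum(ri * x for ri, x in zip(r, x_source)) / n
        s2 = sum(ri * x * x for ri, x in zip(r, x_source)) / n
        g0, g1 = s0 - 1.0, s1 - target_mean
        det = s0 * s2 - s1 * s1  # positive when x_source is not constant
        da = (s2 * g0 - s1 * g1) / det  # Newton direction (2x2 solve)
        db = (s0 * g1 - s1 * g0) / det
        step = 1.0
        # Backtracking keeps the Newton update from overshooting.
        while loss(a - step * da, b - step * db) > loss(a, b) and step > 1e-8:
            step /= 2.0
        a, b = a - step * da, b - step * db
    return [math.exp(a + b * x) for x in x_source]


x_src = [0.0, 1.0, 2.0, 3.0, 4.0]
x_tgt = [2.0, 3.0, 3.0, 4.0]
r_hat = fit_balancing_ratio(x_src, x_tgt)
```

Including the intercept in the basis is what forces the fitted weights to average to one, which matches the requirement stated earlier that the basis contain an intercept term.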

By Assumption 4, the density ratio is identified, and the estimated density ratio is consistent, i.e. (Qin, 1998; Sugiyama et al., 2012). There is an exact mapping between estimating density ratios and estimating location propensities when locations are treated as random variables. We use a covariate balancing estimator for the density ratio, which corresponds to the covariate balancing estimator for propensity scores if we treated the locations as random variables. Within the literature on propensity score estimation, our covariate balancing approach is closely related to Imai and Ratkovic (2014); Tan (2017, 2018); Zhao (2019). We can then substitute in for in (4) to compute the corresponding upper and lower bounds and .

Next, we show the resulting bounds give consistent coverage:

Theorem 3.

Under Assumptions 1, 2, 3 and 4 with , then with any in (4), for any ,


where .

5 Shorter bounds via covariate balancing

Although the optimization formulation in (4) gives consistent coverage on , the resulting bounds can often be too wide in practice to provide meaningful conclusions about the treatment effect . As a concrete example, assume the age distribution is the same for the location that we have conducted the RCT in and the location we wish to extrapolate findings to. We may find that from the RCT, the treatment effect is largest among the young population. The optimization procedure detailed in the previous section does not leverage information on age distributions in these two locations, and could construct weights that overweight or underweight the young population leading to wide bounds.

The key contribution of this work is to sharpen the estimation bounds by further leveraging covariate information available in both locations. In particular, similar to (5), we can also balance the following moments using a change of measure that conditions on both the observed and the unmeasured effect modifiers . Let . Then the following moment equation holds: for ,


We then add the empirical counterpart of the moment equality constraints in (10) to (4) as the last two equalities (a) and (b) below.

Similarly, we can estimate by taking infimum of the above optimization with the same set of constraints. Compared to (4), (5) makes a few modifications. First, it substitutes in the estimate for using (7). Second, it adds two equality constraints (a) and (b). By setting the derivative of (a) and (b) to 0 in expectation, we arrive at the moment condition in (10) for respectively.

Covariate balancing via (5) shortens estimation intervals compared to solving (4) due to the added equality constraints which limit the plausible range of ’s. To provide more intuition, we include in the Appendix a more technical comparison in the simple context where the covariates are discrete and the link function is the identity function.
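The shortening effect can be illustrated in the discrete-covariate case with a small Python sketch. It assumes that the balance constraints are imposed on stratum indicators, so that the multipliers on the estimated density ratio must average to one within each stratum and the problem separates across strata; this is a simplification made for illustration, not the general optimization described above. The function names, the `gamma` parameter, and the toy data are all hypothetical, and only a single outcome mean is bounded rather than a treatment effect.

```python
def ratio_upper(y, ratio, gamma):
    """No balance: upper bound on sum(w*y)/sum(w) with
    w_i in [ratio_i/gamma, ratio_i*gamma], by threshold enumeration."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    best = float("-inf")
    for k in range(len(y) + 1):  # k largest outcomes get the upper weight
        cut = len(y) - k
        num = den = 0.0
        for pos, i in enumerate(order):
            wi = ratio[i] * gamma if pos >= cut else ratio[i] / gamma
            num += wi * y[i]
            den += wi
        best = max(best, num / den)
    return best


def balanced_stratum_upper(y, gamma):
    """Max of mean(g*y) with g_i in [1/gamma, gamma] and mean(g) = 1."""
    n = len(y)
    low, high = 1.0 / gamma, gamma
    g = [low] * n
    budget = n * (1.0 - low)  # weight left to hand out above the floor
    # Greedy fill: largest outcomes are raised first (fractional knapsack).
    for i in sorted(range(n), key=lambda k: y[k], reverse=True):
        add = min(high - low, budget)
        g[i] += add
        budget -= add
    return sum(gi * yi for gi, yi in zip(g, y)) / n


def balanced_upper(strata, target_share, gamma):
    """Combine stratum bounds using the target location's stratum shares."""
    return sum(target_share[s] * balanced_stratum_upper(ys, gamma)
               for s, ys in strata.items())


# Toy data: two age strata with identical shares in both locations.
strata = {"young": [3.0, 5.0], "old": [0.0, 1.0]}
share = {"young": 0.5, "old": 0.5}
with_balance = balanced_upper(strata, share, 2.0)
no_balance = ratio_upper([3.0, 5.0, 0.0, 1.0], [1.0] * 4, 2.0)
```

In this toy example the unconstrained weights can shift mass toward the stratum with larger outcomes, whereas the balanced version holds each stratum at its known share, so `with_balance` cannot exceed `no_balance`.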

The constructed intervals from the above optimization (5) give consistent coverage of the underlying treatment effect parameter of interest.

Theorem 4.

Consider a sequence of data generating processes for both of the two locations with population size such that and for some constant . Under Assumptions 1, 2, 3 and 4 with and estimated with covariate balancing from (7), then with any in (5), for any ,


where .

This implies that as the sample size increases, if we employ a value in the optimization that is no less than what is needed for Assumption 3 to hold, then the constructed interval from the balancing estimator eventually covers the true treatment effect in the target location that we wish to apply policy interventions to.

Remark 2.

Because we learn the density ratio via covariate balancing, the optimization (5) is always feasible by taking . This is a key proof component for Theorem 4, as Slater’s conditions would hold, which allows us to leverage the zero duality gap to complete the proof.

6 Bounds under model misspecification

So far, we have assumed that the density ratio follows a logistic model with a basis expansion by Assumption 4. In this section, we relax this assumption and develop bounds that take model misspecification error into account.

Given any basis function , let be the population minimizer, i.e. for ,


and let


be the logistic approximation to the ground truth density ratio logit function

. For the rest of this section, instead of Assumption 3, we build upon the following sensitivity model instead.

Assumption 5 (Sensitivity model with model misspecification).

For a given basis function , for all and all .

Then immediately by Bayes rule, we have

Proposition 5.

By Assumption 5, for all and .

In practice, we don’t know the magnitude of . We supply an additional sensitivity parameter such that based on how much we believe the density ratio is misspecified with a chosen logistic model. We proceed to get bounds on by estimating via (7). We solve for by (5) but we substitute in for , and similarly we take the infimum to derive . Compared to the previous sections, we relax the constraint that is well specified with a logistic form, and we conclude with the same consistency result if we assume Assumption 5 instead of Assumption 3.

Theorem 6 (Consistency under model misspecification).

Consider a sequence of data generating processes for both of the two locations with population size such that and for some constant . Assume for some basis expansion , is estimated with covariate balancing from (7), and that Assumptions 1, 2, and 5 hold with sensitivity parameter , then with any and in the optimization described above, for any ,

where .

7 California GAIN Program

We apply our proposed estimator to the California Greater Avenues for Independence (GAIN) dataset (Hotz et al., 2006). A policy analyst may run an RCT in one location and wish to know the treatment effect in another location without running additional RCTs. By using our proposal, they could get bounds on the estimated treatment effects in the second location directly. We validate this proposal on the GAIN dataset, which includes data from independent RCTs conducted in several selected counties in California to evaluate the impact of welfare-to-work programs on an individual’s future income. We focus on two counties: Los Angeles and Riverside. Suppose we had only run the RCT in Los Angeles and would like to extrapolate the results to Riverside. If there are no unmeasured effect modifiers that would affect both the outcome and location likelihood, we can simply use a weighted Hajek-style estimator. The goal is to quantify the uncertainty of the extrapolated results in Riverside under varying strengths of the unmeasured covariate shift. Since the GAIN dataset includes the RCT data from Riverside, it allows us to validate the estimated bounds from our proposal against the ground truth treatment effect estimates from the RCT that had been conducted in Riverside.

We run the proposed estimator with covariate balancing to generalize treatment effect bounds from Los Angeles to Riverside and vice versa (referred to as “covariate balance” in Figure 1). For comparison, we also run the proposed estimator without leveraging covariate information (referred to as “no balance” in Figure 1). We use the mean quarterly income over a follow-up period of 9 years post experiment as the outcome. For both counties, we use a simple difference-in-means between the treated and the control groups to compute their ground truth treatment estimates (referred to as “ground truth” in the legend of Figure 1). Let . We quantify the strength of the unmeasured covariate shift by assuming different values in (4) and (5) for the “no balance” and “covariate balance” approaches respectively.

Figure 1: The left plot shows generalizing the RCT estimate from Los Angeles to Riverside, and the right plot shows generalizing the RCT estimate from Riverside to Los Angeles. The green line in both plots shows the difference-in-means estimate in the target location, which we refer to as the “ground truth” treatment effect. The x-axis quantifies the assumed bound on unobserved distributional shift. The red and purple lines show our proposed estimator with and without covariate balancing respectively. We compute the 95% confidence intervals via a percentile bootstrap with 1000 bootstrap samples (shown as dashed lines in the plots).

The left plot in Figure 1 shows generalizing the RCT estimate from Los Angeles to Riverside and the right plot shows the generalization results the other way. By varying along the x-axis, we vary the assumed bound on the distributional shift of the unmeasured effect modifiers. We see that the “covariate balance” approach meaningfully shortens the estimated bounds compared to the “no balance” approach, and the bounds give coverage to the ground truth estimates as the assumed bound increases. To estimate the variance, we generate 1000 bootstrap samples in both locations simultaneously to account for stochastic fluctuations in the data. The dashed lines show the 95% confidence intervals using the percentile bootstrap, following Zhao et al. (2019).
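The resampling scheme described here can be sketched as follows. This is a generic percentile bootstrap that resamples both locations in each replicate; the function name, its arguments, and the index convention used for the percentiles are assumptions made for this example. To obtain an interval for the bounds themselves, one plausible use is to apply it separately to the lower-bound and upper-bound estimators and keep the outer endpoints.

```python
import random


def percentile_bootstrap(source, target, estimator,
                         n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for estimator(source_sample, target_sample)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample units with replacement in both locations at once.
        s = [rng.choice(source) for _ in source]
        t = [rng.choice(target) for _ in target]
        stats.append(estimator(s, t))
    stats.sort()
    lo_idx = int(alpha / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lo_idx], stats[hi_idx]


def mean_gap(s, t):
    """Toy estimator: difference between the two sample means."""
    return sum(s) / len(s) - sum(t) / len(t)


# Constant data gives a degenerate interval, a quick sanity check.
flat = percentile_bootstrap([2.0] * 20, [1.0] * 20, mean_gap, n_boot=200)
noisy = percentile_bootstrap([0.0, 1.0, 2.0, 3.0], [1.0, 1.5, 2.0], mean_gap,
                             n_boot=200)
```

Fixing the seed makes the interval reproducible across runs, which is convenient when comparing the balanced and unbalanced estimators on the same resamples.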

8 Additional Simulations

We consider the following setups adapted from the simulation study in Yadlowsky et al. (2018). For some covariate distribution ,

where denotes the location indicator. We consider the following two setups:

For Setup A, we let for and such that highly depends on the location . We vary and let and . We let be the true sensitivity parameter, but we assume it’s unknown to us. For Setup B, we let , , , and vary the sensitivity parameter in the optimization procedure. We let in Setup B and let in both setups. [Footnote 1: Both parameters are taken from the simulation in Yadlowsky et al. (2018).]

The form of is chosen such that Assumption 3 holds with , with , and . We let the size of the total combined population across the two locations be 1000. [Footnote 2: We note that for the purpose of this simulation setup, it is natural to define location as an additional random variable in the data generating process to ensure that the ground truth sensitivity bound falls within .] We use the mosek package for optimization, and we compare the percentile bootstrap confidence intervals obtained through 1000 bootstrap samples across the difference-in-means estimator in location and the proposed estimator with covariate balancing. We see that the covariate balancing approach significantly shortens the estimation interval, while Setup A in Figure 2 also shows that our balancing estimator (in red) is not conservative, as its confidence interval just covers the ground truth (in green) once the parameter is increased to in this case.

Figure 2: The two plots above show generalizing the RCT estimate in Setup A and B respectively for a randomly chosen single realization of the data generating distribution. The x-axis quantifies the assumed bound on unobserved distributional shift. We assume we have access to data in both locations, but not the ground truth treatment effects. The green line in both plots shows the difference-in-means estimate of the ground truth in the target location. The red and purple lines are our proposed estimator with and without covariate balancing respectively. Given the single realization of the data generating distribution, we take 1000 bootstrap samples in both locations simultaneously, and compute the 95% percentile bootstrap confidence intervals (shown as colored dashed lines in the plots). The vertical dotted black line shows the true sensitivity parameter in Setup A and in Setup B.


  • Andrews and Oster [2017] Isaiah Andrews and Emily Oster. Weighting for external validity. Technical report, National Bureau of Economic Research, 2017.
  • Aronow and Lee [2012] Peter M Aronow and Donald KK Lee. Interval estimation of population means under unknown but bounded probabilities of sample selection. Biometrika, 100(1):235–240, 2012.
  • Athey et al. [2016] Susan Athey, Raj Chetty, Guido Imbens, and Hyunseung Kang. Estimating treatment effects using multiple surrogates: The role of the surrogate score and the surrogate index. arXiv preprint arXiv:1603.09326, 2016.
  • Athey et al. [2018] Susan Athey, Guido W Imbens, and Stefan Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.
  • Attanasio et al. [2011] Orazio Attanasio, Adriana Kugler, and Costas Meghir. Subsidizing vocational training for disadvantaged youth in colombia: evidence from a randomized trial. American Economic Journal: Applied Economics, 3(3):188–220, 2011.
  • Bareinboim and Pearl [2013] Elias Bareinboim and Judea Pearl. Meta-transportability of causal effects: A formal approach. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143, 2013.
  • Bareinboim and Pearl [2014] Elias Bareinboim and Judea Pearl. Transportability from multiple environments with limited experiments: Completeness results. In Advances in neural information processing systems, pages 280–288, 2014.
  • Bareinboim and Pearl [2016] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
  • Bastani [2020] Hamsa Bastani. Predicting with proxies: Transfer learning in high dimension. Management Science, 2020.
  • Bennett et al. [2018] Magdalena Bennett, Juan Pablo Vielma, and Jose R Zubizarreta. Building representative matched samples with multi-valued treatments in large observational studies: Analysis of the impact of an earthquake on educational attainment. arXiv preprint arXiv:1810.06707, 2018.
  • Bradic et al. [2019] Jelena Bradic, Stefan Wager, and Yinchu Zhu. Sparsity double robust inference of average treatment effects. arXiv preprint arXiv:1905.00744, 2019.
  • Cole and Stuart [2010] Stephen R Cole and Elizabeth A Stuart. Generalizing evidence from randomized clinical trials to target populations: The actg 320 trial. American journal of epidemiology, 172(1):107–115, 2010.
  • Cook et al. [2002] Thomas D Cook, Donald Thomas Campbell, and William Shadish. Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin Boston, MA, 2002.
  • Cornfield et al. [1959] Jerome Cornfield, William Haenszel, E Cuyler Hammond, Abraham M Lilienfeld, Michael B Shimkin, and Ernst L Wynder. Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer institute, 22(1):173–203, 1959.
  • Dehejia et al. [2019] Rajeev Dehejia, Cristian Pop-Eleches, and Cyrus Samii. From local to global: External validity in a fertility natural experiment. Journal of Business & Economic Statistics, pages 1–27, 2019.
  • Fogarty and Small [2016] Colin B Fogarty and Dylan S Small. Sensitivity analysis for multiple comparisons in matched observational studies through quadratically constrained linear programming. Journal of the American Statistical Association, 111(516):1820–1830, 2016.
  • Gechter [2015] Michael Gechter. Generalizing the results from social experiments: Theory and evidence from Mexico and India. Manuscript, Pennsylvania State University, 2015.
  • Graham et al. [2016] Bryan S Graham, Cristine Campos de Xavier Pinto, and Daniel Egel. Efficient estimation of data combination models by the method of auxiliary-to-study tilting (AST). Journal of Business & Economic Statistics, 34(2):288–301, 2016.
  • Gretton et al. [2009] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
  • Hainmueller [2012] Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.
  • Hirshberg and Wager [2021] David A Hirshberg and Stefan Wager. Augmented minimax linear estimation. The Annals of Statistics, forthcoming, 2021.
  • Hirshberg et al. [2019] David A Hirshberg, Arian Maleki, and Jose Zubizarreta. Minimax linear estimation of the retargeted mean. arXiv preprint arXiv:1901.10296, 2019.
  • Hotz et al. [2005] V Joseph Hotz, Guido W Imbens, and Julie H Mortimer. Predicting the efficacy of future training programs using past experiences at other locations. Journal of Econometrics, 125(1-2):241–270, 2005.
  • Hotz et al. [2006] V Joseph Hotz, Guido W Imbens, and Jacob A Klerman. Evaluating the differential effects of alternative welfare-to-work training components: A reanalysis of the California GAIN program. Journal of Labor Economics, 24(3):521–566, 2006.
  • Imai and Ratkovic [2014] Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
  • Imbens [2003] Guido W Imbens. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126–132, 2003.
  • Kaizar [2011] Eloise E Kaizar. Estimating treatment effect via simple cross design synthesis. Statistics in Medicine, 30(25):2986–3009, 2011.
  • Kallus [2018] Nathan Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895–8906, 2018.
  • Kallus et al. [2018] Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. Removing hidden confounding by experimental grounding. In Advances in Neural Information Processing Systems, pages 10888–10897, 2018.
  • Meager [2016] Rachael Meager. Aggregating distributional treatment effects: A Bayesian hierarchical analysis of the microcredit literature. Manuscript: MIT, 2016.
  • Miratrix et al. [2017] Luke W Miratrix, Stefan Wager, and Jose R Zubizarreta. Shape-constrained partial identification of a population mean under unknown probabilities of sample selection. Biometrika, 105(1):103–114, 2017.
  • Nguyen et al. [2017] Trang Quynh Nguyen, Cyrus Ebnesajjad, Stephen R Cole, and Elizabeth A Stuart. Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects. The Annals of Applied Statistics, 11(1):225–247, 2017.
  • Ning et al. [2017] Yang Ning, Sida Peng, and Kosuke Imai. High dimensional propensity score estimation via covariate balancing, 2017.
  • Olsen et al. [2013] Robert B Olsen, Larry L Orr, Stephen H Bell, and Elizabeth A Stuart. External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32(1):107–121, 2013.
  • Pearl and Bareinboim [2014] Judea Pearl and Elias Bareinboim. External validity: From do-calculus to transportability across populations. Statistical Science, pages 579–595, 2014.
  • Pressler and Kaizar [2013] Taylor R Pressler and Eloise E Kaizar. The use of propensity scores and observational data to estimate randomized controlled trial generalizability bias. Statistics in Medicine, 32(20):3552–3568, 2013.
  • Qin [1998] Jing Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–630, 1998.
  • Rosenbaum [1987] Paul R Rosenbaum. Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika, 74(1):13–26, 1987.
  • Rosenbaum [2002a] Paul R Rosenbaum. Attributing effects to treatment in matched observational studies. Journal of the American Statistical Association, 97(457):183–192, 2002a.
  • Rosenbaum [2002b] Paul R Rosenbaum. Observational studies, volume 10. Springer, 2002b.
  • Rosenbaum [2010] Paul R Rosenbaum. Design of observational studies, volume 10. Springer, 2010.
  • Rosenbaum [2011] Paul R Rosenbaum. A new U-statistic with superior design sensitivity in matched observational studies. Biometrics, 67(3):1017–1027, 2011.
  • Rosenbaum [2014] Paul R Rosenbaum. Weighted M-statistics with superior design sensitivity in matched observational studies with multiple controls. Journal of the American Statistical Association, 109(507):1145–1158, 2014.
  • Rosenbaum and Rubin [1983] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Rosenman et al. [2018] Evan Rosenman, Art B Owen, Michael Baiocchi, and Hailey Banack. Propensity score methods for merging observational and experimental datasets. arXiv preprint arXiv:1804.07863, 2018.
  • Shen et al. [2011] Changyu Shen, Xiaochun Li, Lingling Li, and Martin C Were. Sensitivity analysis for causal inference using inverse probability weighting. Biometrical Journal, 53(5):822–837, 2011.
  • Stuart et al. [2011] Elizabeth A Stuart, Stephen R Cole, Catherine P Bradshaw, and Philip J Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369–386, 2011.
  • Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • Tan [2006] Zhiqiang Tan. A distributional approach for causal inference using propensity scores. Journal of the American Statistical Association, 101(476):1619–1637, 2006.
  • Tan [2017] Zhiqiang Tan. Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. arXiv preprint arXiv:1710.08074, 2017.
  • Tan [2018] Zhiqiang Tan. Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. arXiv preprint arXiv:1801.09817, 2018.
  • VanderWeele and Ding [2017] Tyler J VanderWeele and Peng Ding. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine, 2017.
  • Vivalt [2016] Eva Vivalt. How much can we generalize from impact evaluations? 2016.
  • Wang and Zubizarreta [2017] Yixin Wang and José R Zubizarreta. Minimal approximately balancing weights: asymptotic properties and practical considerations. arXiv preprint arXiv:1705.00998, 2017.
  • Westreich et al. [2017] Daniel Westreich, Jessie K Edwards, Catherine R Lesko, Elizabeth Stuart, and Stephen R Cole. Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology, 186(8):1010–1014, 2017.
  • Yadlowsky et al. [2018] Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, and Lu Tian. Bounds on the conditional and average treatment effect in the presence of unobserved confounders. arXiv preprint arXiv:1808.09521, 2018.
  • Zhao [2019] Qingyuan Zhao. Covariate balancing propensity score by tailored loss functions. The Annals of Statistics, 47(2):965–993, 2019.
  • Zhao et al. [2019] Qingyuan Zhao, Dylan S Small, and Bhaswar B Bhattacharya. Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2019.
  • Zubizarreta [2015] José R Zubizarreta. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922, 2015.

Appendix A Proof of Proposition 1

where the last two equalities used Assumptions 1 and 2 respectively.

Appendix B Proof of Theorem 3


Assume , let , then . By Proposition 1,

We can then show that

Since , the conclusion then follows.

Appendix C Proof of Theorem 4


Recall is the oracle weight. We note that by Proposition 2 and covariate balancing with the density ratio, taking is always a feasible solution to the optimization formulation (5). Thus, Slater’s conditions hold. Define

Define . Then we have

By Assumption 4, (see e.g. Sugiyama et al. [2012]). By the law of large numbers, we have as


and as , by (5),


where .

Consider a joint sequence of data generating setups for both locations with where , given , there must exist such that for all , . From the derivation above, we know that for any and for any , there exists such that for all , . Similarly, there exists such that for all , . Then, take , we have for , .

Thus, taking this joint sequence where and , we have

We then conclude that .

For any , we then have . Similarly, for any , we have , , and . We conclude that for any , .

Appendix D An example to show how covariate balancing shortens estimation intervals

To provide more intuition for why covariate balancing shortens estimation intervals in Section 5, we consider the simple context where the covariates are discrete and the link function is the identity function. Consider the following:


and define . Then we have

Now let be identity, and let be binary, then

Compared to the approach without covariate balancing,

Essentially we want to maximize , but with covariate balancing, we limit the sum of certain weights such that .
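The intuition above can be illustrated numerically with a minimal sketch. It is a simplified stand-in and not the optimization formulation (5) itself: it assumes normalized weights that sum to one, box constraints of the form 1/(Γn) ≤ w_i ≤ Γ/n, a single binary covariate, and a known target proportion for that covariate. The function names, the parameter `gamma`, and the toy data are all hypothetical. With a binary covariate, exact balancing fixes the total weight within each covariate level, so the problem splits into one small linear program per level, each solved exactly by a greedy allocation.

```python
def max_weighted_sum(y, lo, hi, total):
    """Maximize sum(w_i * y_i) subject to lo <= w_i <= hi and sum(w_i) == total.

    Greedy solution of this box-constrained linear program: start every
    weight at the lower bound, then give the leftover mass to the largest
    outcomes first, capping each weight at the upper bound.
    """
    n = len(y)
    if not (n * lo - 1e-12 <= total <= n * hi + 1e-12):
        raise ValueError("weight constraints are infeasible")
    w = [lo] * n
    remaining = total - n * lo
    for i in sorted(range(n), key=lambda k: y[k], reverse=True):
        add = min(hi - lo, remaining)
        w[i] += add
        remaining -= add
        if remaining <= 0:
            break
    return sum(wi * yi for wi, yi in zip(w, y))


def bounds_unbalanced(y, gamma):
    """Lower and upper bounds on the weighted mean without balancing."""
    n = len(y)
    lo, hi = 1.0 / (gamma * n), gamma / n  # assumed box constraints
    upper = max_weighted_sum(y, lo, hi, 1.0)
    lower = -max_weighted_sum([-v for v in y], lo, hi, 1.0)
    return lower, upper


def bounds_balanced(y, x, gamma, target_p1):
    """Bounds when the weighted share of x == 1 must equal target_p1.

    For binary x this pins the total weight in each covariate level,
    so each level is optimized separately.
    """
    n = len(y)
    lo, hi = 1.0 / (gamma * n), gamma / n  # same assumed box constraints
    lower = upper = 0.0
    for level, mass in ((1, target_p1), (0, 1.0 - target_p1)):
        ys = [yi for yi, xi in zip(y, x) if xi == level]
        upper += max_weighted_sum(ys, lo, hi, mass)
        lower -= max_weighted_sum([-v for v in ys], lo, hi, mass)
    return lower, upper


if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0]  # toy outcomes (hypothetical)
    x = [0, 0, 1, 1]          # toy binary covariate (hypothetical)
    print("unbalanced:", bounds_unbalanced(y, gamma=2.0))
    print("balanced:  ", bounds_balanced(y, x, gamma=2.0, target_p1=0.5))
```

On this toy data the unbalanced interval is [1.875, 3.125] and the balanced interval is [2.25, 2.75]. Without balancing, the optimizer can push as much weight as the box constraints allow onto the largest outcomes overall. With balancing, the weight available within each covariate level is capped, which is the restriction on sums of weights described above.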

Appendix E Proof of Theorem 6