In domains such as healthcare, education, and marketing there is growing interest in using observational data to draw causal conclusions about individual-level effects; for example, using electronic healthcare records to determine which patients should get what treatments, using school records to optimize educational policy interventions, or using past advertising campaign data to refine targeting and maximize lift. Observational datasets, due to their often very large number of samples and exhaustive scope (many measured covariates) in comparison to experimental datasets, offer a unique opportunity to uncover fine-grained effects that may apply to many target populations.
However, a significant obstacle when attempting to draw causal conclusions from observational data is the problem of hidden confounders: factors that affect both treatment assignment and outcome, but are unmeasured in the observational data. Example cases where hidden confounders arise include physicians prescribing medication based on indicators not present in the health record, or classes being assigned a teacher's aide because of special efforts by a competent school principal. Hidden confounding can lead to non-vanishing bias in causal estimates even in the limit of infinite samples pearl2009causality.
In an observational study, one can never prove that there is no hidden confounding pearl2009causality. However, a possible fix can be found if there exists a Randomized Controlled Trial (RCT) testing the effect of the intervention in question. For example, if a Health Management Organization (HMO) is considering the effect of a medication on its patient population, it might look at an RCT which tested this medication. The problem with using RCTs is that often their participants do not fully reflect the target population. As an example, an HMO in California might have to use an RCT from Switzerland, conducted perhaps several years ago, on a much smaller population. The problem of generalizing conclusions from an RCT to a different target population is known as the problem of external validity rothwell2005external; andrews2017weighting, or more specifically, transportability bareinboim2013general; bareinboim2014external.
In this paper, we are interested in the case where fine-grained causal inference is sought, in the form of Conditional Average Treatment Effects (CATE), where we consider a large set of covariates, enough to identify each unit. We aim at using a large observational sample and a possibly much smaller experimental sample. The typical use case we have in mind is of a user who wishes to estimate CATE and has a relatively large observational sample that covers their population of interest. This observational sample might suffer from hidden confounding, as all observational data will to some extent, but the user also has a smaller sample from an experiment, albeit one that might not directly reflect their population of interest. For example, consider the Women's Health Initiative rossouw2002writing, where hormone replacement therapy was studied both in a large observational study and in a smaller RCT. The studies ended up with opposite results, and there is intense discussion about confounding and external validity: the RCT was limited in that it covered a fundamentally different (healthier and younger) population than the observational study hernan2008observational; vandenbroucke2009hrt.
Differently from previous work on estimating CATE from observational data, our approach does not assume that all confounders have been measured; we only assume that the support of the experimental study has some overlap with the support of the observational study. The major assumption we do make is that we can learn the structure of the hidden confounding by comparing the observational and experimental samples. Specifically, rather than assuming that the effects themselves have a parametric structure – a questionable assumption that is bound to lead to dangerous extrapolation from small experiments – we only assume that the hidden confounding function has a parametric structure that we can extrapolate. Thus we limit ourselves to a parametric correction of a possibly complex effect function learned on the observational data. We discuss why this assumption is plausibly reasonable: as long as the parametric family includes the zero function, it is strictly weaker than assuming that all confounders in the observational study have been observed. One way to view our approach is that we bring together an unbiased but high-variance estimator from the RCT (possibly infinite-variance when the RCT has zero overlap with the target population) and a biased but low-variance estimator from the observational study, achieving a consistent CATE estimator with vanishing bias and variance. Finally, we run experiments on both simulated and real-world data and show that our method outperforms the standard approaches to this problem. In particular, we use data from a large-scale RCT measuring the effect of small classrooms and teacher's aides word1990state; krueger1999experimental to obtain ground-truth estimates of causal effects, which we then try to reproduce from a confounded observational study.
We focus on studying a binary treatment, which we interpret as the presence or absence of an intervention of interest. To study its fine-grained effects on individuals, we consider having treatment-outcome data from two sources: an observational study that may be subject to hidden confounding, and an unconfounded study, typically coming from an experiment. The observational data consists of baseline covariates, assigned treatments, and observed outcomes for each of its units. Similarly, the unconfounded data consists of covariate, treatment, and outcome triples for each of its units.
Conceptually, we focus on the setting where (1) the observational data is of much larger scale and/or (2) the support of the unconfounded data does not include the population about which we want to make causal conclusions and targeted interventions. This means that the observational data has both the scale and the scope we want, but the presence of confounding limits the study of causal effects, while the unconfounded experimental data has unconfoundedness but lacks the scale and/or scope necessary to study the individual-level effects of interest.
The unconfounded data usually comes from an RCT that was conducted on a smaller scale on a different population, as presented in the previous section. Alternatively, and equivalently for our formalism, it can arise from recognizing a latent unconfounded sub-experiment within the observational study. For example, we may have information from the data generation process indicating that treatment for certain units was assigned purely as a (possibly stochastic) function of the observed covariates. Two examples would be when certain prognoses dictate a strict rule-based treatment assignment, or situations of known equipoise after a certain prognosis, where there is no evidence guiding treatment one way or the other and its assignment is as good as random, depending only on the individual who ends up administering it. Regardless of whether the unconfounded data comes from a secondary RCT (the more common case) or from within the observational dataset, our mathematical setup remains the same.
Formally, we consider each dataset to be iid draws from two different super-populations, indicated by an event variable marking whether a unit belongs to the confounded or the unconfounded study: the observational data are iid draws from the super-population obtained by conditioning on membership in the confounded study, and likewise for the unconfounded data. Using potential outcome notation, and assuming the standard Stable Unit Treatment Value Assumption (SUTVA), which posits no interference and consistency between observed and potential outcomes, we consider the potential outcomes of administering each of the two treatments. The quantity we are interested in is the Conditional Average Treatment Effect (CATE):
Definition 1 (CATE).
The key assumption we make about the unconfounded data is its unconfoundedness:
This assumption holds if the unconfounded data was generated in a randomized controlled trial. More generally, it is functionally equivalent to assuming that the unconfounded data was generated by running a logging policy on a contextual bandit: first, covariates are drawn from the unconfounded population and revealed; then a treatment is chosen; the outcomes are drawn based on the covariates, but only the outcome corresponding to the chosen treatment is revealed. The second part of the assumption means that merely being in the unconfounded study does not affect the potential outcomes conditioned on the covariates. It implies that the functional relationship between the unobserved confounders and the potential outcomes is the same in both studies. This will fail if, for example, knowing you are part of a study causes you to react differently to the same treatment. We note that this assumption is strictly weaker than the standard ignorability assumption in observational studies. The assumption implies that for covariates within the domain of the experiment, we can identify the value of CATE using regression: for any covariate value in the support of the unconfounded data, the CATE equals the difference between the conditional mean outcomes under treatment and control, each of which can be identified by regressing observed outcomes on treatment and covariates in the unconfounded data. However, this identification of CATE is (i) limited to the restricted domain of the experiment and (ii) hindered by the limited amount of data available in the unconfounded sample. The hope is to overcome these obstacles using the observational data.
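As a concrete illustration of this identification strategy, the following sketch simulates a hypothetical RCT and recovers the CATE by fitting separate regressions for treated and control. The data-generating process and the linear outcome models are our own assumptions for the example, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical RCT: covariate x, randomized treatment t, outcome y.
# Assumed potential outcomes: Y(0) = x, Y(1) = x + 1 + 0.5*x,
# so the true CATE is CATE(x) = 1 + 0.5*x.
n = 20000
x = rng.uniform(-1, 1, n)
t = rng.integers(0, 2, n)                        # randomized => unconfounded
y = x + t * (1 + 0.5 * x) + rng.normal(0, 0.1, n)

def linfit(xs, ys):
    """Least-squares fit of intercept + slope * x."""
    A = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

# Separate regressions for control and treated (a "T-learner" style fit).
b0 = linfit(x[t == 0], y[t == 0])
b1 = linfit(x[t == 1], y[t == 1])

def cate_hat(xq):
    """Difference of the two fitted regressions at covariate value xq."""
    return (b1[0] - b0[0]) + (b1[1] - b0[1]) * xq

# Within the RCT's support, the regression difference recovers the CATE.
print(cate_hat(0.0))   # close to 1.0
print(cate_hat(0.5))   # close to 1.25
```

Note that this identification is trustworthy only on the covariate region the RCT covers; extrapolating the fitted difference outside that region is exactly what the paper cautions against.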
Importantly, however, the unconfoundedness assumption is not assumed to hold for the observational data, which may be subject to unmeasured confounding. That is, both selection into the observational study and the selection of the treatment may be confounded with the potential outcomes of any one treatment. Let us denote the difference in conditional average outcomes in the observational data by
Note that, due to confounding factors, this difference need not equal the CATE at any point, whether in the support of the observational study or not. The difference between these two quantities is precisely the confounding effect, which we denote
Another way to express this term is:
Note that if the observational study were unconfounded, this confounding function would be identically zero. Further note that the vast majority of the methodological literature assumes exactly this, even though it is widely acknowledged that the assumption is rarely realistic and is at best an approximation.
In order to better understand the confounding function, consider the following case. Assume there are two equally likely types of patients, "dutiful" and "negligent". Dutiful patients take care of their general health and are more likely to seek treatment, while negligent patients do not. Assume the treatment is a medical intervention that requires the patient to see a physician, do lab tests, and obtain a prescription if indeed needed, while the control condition means no treatment. Let the outcome be some measure of health, say blood pressure. In this scenario, where patients are (to a certain degree) self-selected into treatment, we would expect both potential outcomes to be greater for the treated than for the control: the treated group would have been healthier even without the treatment. It follows that the difference in conditional average outcomes in the observational data overstates the true effect, and hence the confounding function is negative, if we have not measured any covariates. This logic carries through in the plausible scenario where we have measured some covariates, but do not have access to all the variables that would allow us to tell apart "dutiful" from "negligent" patients. To sum up, this example shows that in cases where units are selected such that those more likely to be treated are those whose potential outcomes are higher (resp. lower) anyway, we can expect the confounding function to be negative (resp. positive).
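This story can be checked numerically. The following toy simulation uses assumed numbers (a constant unit treatment effect, with dutiful patients both healthier and more likely to be treated) and shows the confounded difference overstating the effect, making the confounding term negative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy version of the "dutiful vs negligent" example (numbers are assumed):
# dutiful patients (d = 1) have better outcomes under BOTH arms,
# and are more likely to seek treatment.
n = 500000
d = rng.binomial(1, 0.5, n)                    # 1 = dutiful, 0 = negligent
t = rng.binomial(1, 0.2 + 0.6 * d)             # dutiful more likely treated
y0 = 2.0 * d + rng.normal(0, 1, n)             # control potential outcome
y1 = y0 + 1.0                                  # true effect is +1 for everyone
y = np.where(t == 1, y1, y0)

tau = (y1 - y0).mean()                         # true effect = 1
omega = y[t == 1].mean() - y[t == 0].mean()    # confounded difference
eta = tau - omega                              # confounding term

print(omega)   # noticeably larger than 1: inflated by self-selection
print(eta)     # negative, as argued above
```

The confounded difference here is inflated because treatment status is a proxy for the unmeasured "dutiful" type, exactly the mechanism described in the text.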
Given data from both the unconfounded and confounded studies, we propose the following recipe for removing the hidden confounding. First, we learn an estimate of the difference in conditional average outcomes over the observational sample. This can be done using any CATE estimation method, such as learning two regression functions for the treated and control and taking their difference, or specially constructed methods such as Causal Forest wager2017estimation. Since we assume this sample has hidden confounding, this estimate does not recover the true CATE. We then learn a correction term by comparing the estimate, evaluated on the RCT samples, with the RCT outcomes. This correction term for hidden confounding is our estimate of the confounding function. The correction term allows us to extrapolate over the confounded sample, using the identity that the true CATE is the sum of the confounded difference and the confounding function.
Note that we could not have gone the other way around: if we were to start by estimating an effect function over the unconfounded sample and then estimate a correction using the samples from the confounded study, we would end up constructing an estimate of the confounded difference, which is not the quantity of interest. Moreover, doing so would be difficult, as the unconfounded sample is not expected to cover the confounded one.
Specifically, the way we use the RCT samples relies on a simple identity. Consider the propensity score on the unconfounded sample. If this sample is an RCT, then the propensity score is typically constant, often equal to one half.
Define the signed re-weighting function that multiplies each outcome by the inverse propensity of its assigned arm, with a positive sign for treated units and a negative sign for controls. We have:
What Lemma 1 shows us is that the signed re-weighted outcome is an unbiased estimate of the CATE at the corresponding covariate value. We now use this fact to learn the correction term as follows:
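A quick Monte Carlo sanity check of this identity, under an assumed data-generating process with constant propensity one half (the CATE here, 2 + x, is our own choice for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# With constant propensity e = 0.5, the signed re-weighted outcome
#   q = y * (t/e - (1 - t)/(1 - e))
# has conditional mean equal to the CATE.
n = 200000
x = rng.normal(0, 1, n)
e = 0.5
t = rng.binomial(1, e, n)
# Assumed outcomes: Y(1) - Y(0) = 2 + x, so CATE(x) = 2 + x.
y = x + t * (2 + x) + rng.normal(0, 1, n)

q = y * (t / e - (1 - t) / (1 - e))

# E[q] should match E[CATE(X)] = 2, and conditional averages of q
# should track conditional averages of the CATE.
print(q.mean())          # close to 2.0
print(q[x > 0].mean())   # close to 2 + E[x | x > 0] ~ 2.8
```

The estimate is unbiased but noisy: each q value is an inverse-propensity-scaled outcome, which is why the paper pairs it with a low-variance (if biased) observational estimate.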
The method is summarized in Algorithm 1.
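A minimal end-to-end sketch of the two-step recipe, under an assumed synthetic setup: a hidden confounder drives treatment in the observational sample, and the RCT covers only part of the covariate range. The simple linear fits here stand in for any base CATE method; none of the specific numbers come from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_linear(X, y):
    """Least-squares intercept and slope."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + coef[1] * X

# --- Observational data with a hidden confounder u (assumed setup) ---
n = 50000
x = rng.uniform(-2, 2, n)
u = rng.normal(0, 1, n)
t = (u + rng.normal(0, 1, n) > 0).astype(int)   # treatment depends on u
y = x + u + t * (1 + x)                         # true CATE(x) = 1 + x

# Step 1: difference of per-arm regressions on the confounded sample.
b1 = fit_linear(x[t == 1], y[t == 1])
b0 = fit_linear(x[t == 0], y[t == 0])

def omega_hat(xq):
    """Confounded difference: biased for the CATE because of u."""
    return predict_linear(b1, xq) - predict_linear(b0, xq)

# --- RCT on a restricted covariate region, propensity 0.5 (assumed) ---
m = 20000
xr = rng.uniform(-2, 0, m)                      # covers only part of the support
tr = rng.binomial(1, 0.5, m)
yr = xr + rng.normal(0, 1, m) + tr * (1 + xr)   # u acts as noise: t randomized
q = yr * (tr / 0.5 - (1 - tr) / 0.5)            # signed re-weighted outcomes

# Step 2: fit the linear correction on the RCT residuals.
theta = fit_linear(xr, q - omega_hat(xr))

def tau_hat(xq):
    """Corrected CATE estimate: confounded difference plus correction."""
    return omega_hat(xq) + predict_linear(theta, xq)

print(omega_hat(1.5))   # biased, well away from the true 2.5
print(tau_hat(1.5))     # corrected estimate, near 1 + 1.5 = 2.5
```

Note the point x = 1.5 lies outside the RCT's support: the correction, being linear, extrapolates there even though a nonparametric CATE fit on the RCT alone could not.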
Let us contrast our approach with two existing ones. The first is to simply learn the treatment effect function directly from the unconfounded data and extrapolate it to the observational sample. This is guaranteed to be unconfounded, and with a large enough unconfounded sample the CATE function can be learned crump2008nonparametric; pearl2015detecting. This approach is presented, for example, by bareinboim2013general for the ATE, as the transport formula. However, extending this approach to CATE in our case is not straightforward. The reason is that we assume the confounded study does not fully overlap with the unconfounded study, which requires extrapolating the estimated CATE function into a region of sample space outside the region where it was fit. This requires strong parametric assumptions about the CATE function. On the other hand, we do have samples from the target region; they are simply confounded. One way to view our approach is that we move the extrapolation a step back: instead of extrapolating the CATE function, we merely extrapolate a correction due to hidden confounding. In the case that the CATE function does actually extrapolate well, we do no harm: we simply learn a correction of zero.
The second alternative relies on re-weighting the RCT population so as to make it similar to the target, observational population stuart2011use; hartman2015sate; andrews2017weighting. These approaches suffer from two important drawbacks from our point of view: (i) they assume the observational study has no unmeasured confounders, which is often an unrealistic assumption; (ii) they assume that the support of the observational study is contained within the support of the experimental study, which again is unrealistic, as experimental studies are often smaller and conducted on somewhat different populations. If we were to apply these approaches to our case, we would be re-weighting by the inverse of weights that are close to, or even identical to, zero.
4 Theoretical guarantee
We prove that under conditions of parametric identification of the confounding function, Algorithm 1 recovers a consistent estimate of the CATE over the observational population, at a rate governed by the rate of estimating the confounded difference. For the sake of clarity, we focus on a linear specification of the confounding function. Other parametric specifications can easily be accommodated given that the appropriate identification criteria hold (for the linear case, this is the non-singularity of the design matrix). Note that this result is strictly stronger than results about CATE identification that rely on ignorability; what enables the improvement is, of course, the presence of the unconfounded sample. Also note that this result is strictly stronger than the transport formula bareinboim2013general and re-weighting approaches such as andrews2017weighting.
Theorem 1. Suppose the following conditions hold:
1. The estimated confounded difference is a consistent estimator on the observational data (on which it is trained).
2. The covariates in the confounded data cover those in the unconfounded data (strong one-way overlap).
3. The confounding function is linearly specified.
4. Identifiability of the confounding function's parameters: the design matrix in the experimental data is non-singular.
5. The covariates, the outcomes, and the estimated confounded difference have finite fourth moments in the experimental data.
6. Strong overlap between treatments in the unconfounded data: the propensity scores are bounded away from 0 and 1.
Then the estimated confounding function is consistent, and the resulting CATE estimator is consistent on its target population.
There are a few things to note about the result and its conditions. First, if the so-called confounded observational sample is in fact unconfounded, then the linear specification of the confounding function is immediately correct with all coefficients equal to zero, because the confounding function is identically zero. Therefore, our conditions are strictly weaker than imposing unconfoundedness on the observational data.
Condition 1 requires that our base method for learning the confounded difference is consistent simply as a regression method. There are a few ways to guarantee this. For example, if we fit by empirical risk minimization on weighted outcomes over a function class of finite capacity (such as a VC class), or if we fit the difference of two regression functions, each fit by empirical risk minimization on observed outcomes in its treatment group, then standard results in statistical learning bartlett2002rademacher ensure consistency in L2 risk and therefore the L2 convergence required in condition 1. Alternatively, any method for learning CATE that would have been consistent under unconfoundedness is still consistent for the confounded difference when applied here. Therefore we can also use base methods such as causal forests wager2017estimation and other methods that target CATE as inputs to our method, even though they do not actually learn CATE here due to confounding.
Condition 2 captures our understanding that the observational dataset has a larger scope than the experimental dataset. The condition essentially requires a strong form of absolute continuity between the two covariate distributions. It could potentially be relaxed, so long as there is enough intersection on which to learn the confounding function. For example, if there is a subset of the experiment that the observational data covers, that would be sufficient, so long as condition 4 remains valid on that subset, so that we can still learn the parameters of the confounding function.
Condition 3, the linear specification of the confounding function, can be replaced with another parametric specification so long as it has finitely many parameters and they can be identified on the experimental dataset, i.e., condition 4 above would change appropriately.
Since unconfoundedness implies a zero confounding function, whenever the parametric specification contains the zero function (as in the linear case above, since the zero coefficient vector is allowed), condition 3 is strictly weaker than assuming unconfoundedness. In that sense, our method can consistently estimate CATE on a population where no experimental data exists under weaker conditions than existing methods, which assume the observational data is unconfounded.
Condition 5 is trivially satisfied whenever outcomes and covariates are bounded. Similarly, we would expect that if the first two parts of condition 5 hold (about the covariates and outcomes), then the last one, about the estimated confounded difference, would also hold, as it is predicting those same outcomes. That is, the last part of condition 5 is essentially a requirement that our base method not do anything strange, like adding unnecessary noise that leaves the estimate with fewer moments. For all base methods that we consider, this comes for free because they only average outcomes. We also note that if we impose the existence of even higher moments, as well as pointwise asymptotic normality of the base estimator, one can easily transform the result into an asymptotic normality result. Standard error estimates will in turn require an estimate of the asymptotic variance.
Finally, we note that condition 6, which requires strong overlap, only needs to hold in the unconfounded sample. This is important as it would be a rather strong requirement in the confounded sample where treatment choices may depend on high dimensional variables d2017overlap, but it is a weak condition for the experimental data. Specifically, if the unconfounded sample arose from an RCT then propensities would be constant and the condition would hold trivially.
In order to illustrate the validity and usefulness of our proposed method we conduct simulation experiments and experiments with real-world data taken from the Tennessee STAR study: a large long-term school study where students were randomized to different types of classes word1990state; krueger1999experimental.
5.1 Simulation study
We generate data simulating a situation where there exists an un-confounded dataset and a confounded dataset, with only partial overlap. We have a measured covariate, a binary treatment assignment, an unmeasured confounder, and an outcome, and we are interested in the CATE as a function of the measured covariate.
We generate the unconfounded sample by drawing the covariate over a limited range, assigning treatment at random, and drawing the confounder independently. We generate the confounded sample as follows: we first sample the treatment, and then sample the covariate and confounder jointly from a bivariate Gaussian whose covariance depends on the treatment. This means that the covariate and confounder come from a Gaussian mixture model, where the treatment assignment denotes the mixture component and the components have equal means but different covariance structures. This also implies that the induced confounding function is linear.
For both datasets, the outcome is generated as a function of the covariate, treatment, and confounder, plus noise, yielding a known true CATE and a known true confounding function. We then apply our method (with a Causal Forest base) to learn the confounding function, and plot the true and recovered functions (see Figure 1). Even though the limited un-confounded set makes the full scope of the outcome function inaccessible, we are able to reasonably estimate the confounding function. Other methods would suffer under the strong unobserved confounding.
5.2 Real-world data
Validating causal-inference methods is hard because we almost never have access to true counterfactuals. We approach this challenge by using data from a randomized controlled trial, the Tennessee STAR study word1990state; krueger1999experimental; mcfowland2018efficient. When using an RCT, we have access to unbiased CATE estimates because we are guaranteed unconfoundedness. We then artificially introduce confounding by selectively removing a biased subset of samples.
The data: The Tennessee Student/Teacher Achievement Ratio (STAR) experiment is a randomized experiment started in 1985 to measure the effect of class size on student outcomes, measured by standardized test scores. The experiment started monitoring students in kindergarten and followed them until third grade. Students and teachers were randomly assigned to conditions during the first school year, with the intention that students continue in their class-size condition for the entirety of the experiment. We focus on two of the experiment conditions: small classes (13-17 pupils) and regular classes (22-25 pupils). Since many students only started the study at first grade, we take as treatment their class type at first grade. Overall we have 4509 students with treatment assignment at first grade. The outcome is the sum of the listening, reading, and math standardized tests at the end of first grade. After removing students with missing outcomes (the correlation between a missing outcome and treatment assignment is small), we remain with a randomized sample of 4218 students: 1805 assigned to treatment (small class) and 2413 to control (regular-size class). In addition to treatment and outcome, we use the following covariates for each student: gender, race, birth month, birth day, birth year, whether free lunch was given, and teacher id. Our goal is to compute the CATE conditioned on this set of covariates.
Computing ground-truth CATE: The STAR RCT allows us to obtain an unbiased estimate of the CATE. Specifically, we use the identity in Eq. (3) and the fact that in the study the propensity scores were constant. We define a ground-truth sample consisting of each student's covariates paired with their signed re-weighted outcome. By Eq. (3), this pseudo-outcome is unbiased for the CATE within the STAR study.
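The pseudo-outcome computation can be sketched as follows; the propensity value and outcome numbers here are placeholders for illustration, not the actual STAR quantities:

```python
import numpy as np

def ground_truth_pseudo_outcomes(y, t, e):
    """Signed re-weighted outcomes q_i = y_i * (t_i/e - (1 - t_i)/(1 - e)).

    y: outcomes, t: 0/1 treatments, e: constant propensity P(T = 1).
    Each q_i is an unbiased (if noisy) stand-in for the CATE at x_i.
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return y * (t / e - (1 - t) / (1 - e))

# Hypothetical numbers, just to show the shape of the computation:
q = ground_truth_pseudo_outcomes([10.0, 8.0], [1, 0], e=0.45)
print(q)   # [10/0.45, -8/0.55]
```

Regressing these pseudo-outcomes on the covariates then gives an unbiased (though high-variance) estimate of the CATE function on the RCT population.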
Introducing hidden confounding: Now that we have the ground-truth CATE, we wish to emulate the scenario which motivates our work. We split the entire dataset (ALL) into a small unconfounded subset (UNC), and a larger, confounded subset (CONF) over a somewhat different population. We do this by splitting the population over a variable which is known to be a strong determinant of outcome krueger1999experimental: rural or inner-city (2811 students) vs. urban or suburban (1407 students).
We generate UNC by randomly sampling a fraction of the rural or inner-city students, for a range of fraction values. Over this sample, we know that treatment assignment was at random.
When generating CONF, we wish to achieve two goals: (a) the support of CONF should have only partial overlap with the support of UNC, and (b) treatment assignment should be confounded, i.e., the treated and control populations should be systematically different in their potential outcomes. In order to achieve these goals, we generate CONF as follows. From the rural or inner-city students, we take the controls that were not sampled into UNC, and only those treated students whose outcomes were in the lower half of outcomes among treated rural or inner-city students. From the urban or suburban students, we take all of the controls, and only those treated students whose outcomes were in the lower half of outcomes among treated urban or suburban students.
This procedure results in UNC and CONF populations that do not fully overlap: UNC has only rural or inner-city students, while CONF has a substantial subset (roughly one half) of urban and suburban students. It also creates confounding, by selectively removing the students with higher scores from the treated population. This biases the naive treatment effect estimates downward. We further complicate matters by dropping the covariate indicating rural, inner-city, urban, or suburban from all subsequent analysis. Therefore, we have significant unmeasured confounding in the CONF population, while retaining the unconfounded ground truth in the original ALL population.
Metric: In our experiments, we assume we have access to samples from UNC and CONF. We use either UNC, CONF, or both to fit various models for predicting CATE. We then evaluate, in terms of RMSE, how well the CATE predictions match the ground truth on a held-out sample from the set ALL minus the set UNC. Note that we are not evaluating on CONF itself, but on the unconfounded version of the same population. The reason we don't evaluate on ALL is twofold: first, it would only make the task easier because of the nature of the UNC set; second, we are motivated by the scenario where we have a confounded observational study representing the target population of interest and wish to be aided by a separate unconfounded study (typically an RCT) available for a different population. We focus on a held-out set in order to avoid giving too much of an advantage to methods that can simply fit the observed outcomes well.
Baselines: As baselines, we fit CATE using standard methods on either the UNC set or the CONF set. Fitting on the UNC set is essentially a CATE version of applying the transport formula bareinboim2014external. Fitting on the CONF set amounts to (wrongly, in this case) assuming ignorability and using standard methods. The methods we use to estimate CATE are: (i) a regression method fit on the signed re-weighted outcomes over UNC; (ii) a regression method fit separately on treated and control in CONF; (iii) a regression method fit separately on treated and control in UNC. The regression methods we use in (i)-(iii) are Random Forest with 200 trees and Ridge Regression with cross-validation. In baselines (ii) and (iii), the CATE is estimated as the difference between the predictions of the model fit on the treated and the model fit on the control. We also experimented extensively with Causal Forest wager2017estimation, but found it to uniformly perform worse than the other methods, even when given unfair advantages such as access to the entire dataset (ALL).
Results: Our two-step method requires a method for fitting the confounded difference on the confounded dataset. We experiment with two choices, which parallel those used as baselines: a regression method fit separately on treated and control in CONF, where the regression method is either Random Forest with 200 trees or Ridge Regression with cross-validation. We see that our methods, 2-step RF and 2-step ridge, consistently produce more accurate estimates than the baselines, and that they are in particular able to make use of larger unconfounded sets to produce better estimates of the CATE function. See Figure 2 for the performance of our method vs. the various baselines.
In this paper we address a scenario that is becoming more and more common: users with large observational datasets who wish to extract causal insights using their data and help from unconfounded experiments on different populations. One direction for future work is combining the current work with work that looks explicitly into the causal graph connecting the covariates, including unmeasured ones triantafillou2015constraint; mooij2016joint. Another direction includes cases where the outcomes or interventions are not directly comparable, but where the difference can be modeled. For example, experimental studies often only study short-term outcomes, whereas the observational study might track long-term outcomes which are of more interest athey2016estimating.
We wish to thank the anonymous reviewers for their helpful suggestions and comments. (NK) This material is based upon work supported by the National Science Foundation under Grant No. 1656996.
Appendix A Proofs
Proof of Lemma 1.
We wish to prove the identity stated in Lemma 1. We have:
where equality (a) is by Assumption 1 and the definition of .
Proof of Thm. 1.
Let be the design matrix in the experimental data and let be the regression outcome and so that . Let and . Note that , which by condition 3 we can write as . Hence, we have
Next, consider the second term:
By Cauchy–Schwarz and condition 5,
And again by Cauchy–Schwarz,