Using Experimental Data to Evaluate Methods for Observational Causal Inference

10/06/2020 ∙ by Amanda Gentzel, et al. ∙ University of Massachusetts Amherst

Methods that infer causal dependence from observational data are central to many areas of science, including medicine, economics, and the social sciences. A variety of theoretical properties of these methods have been proven, but empirical evaluation remains a challenge, largely due to the lack of observational data sets for which treatment effect is known. We propose and analyze observational sampling from randomized controlled trials (OSRCT), a method for evaluating causal inference methods using data from randomized controlled trials (RCTs). This method can be used to create constructed observational data sets with corresponding unbiased estimates of treatment effect, substantially increasing the number of data sets available for evaluating causal inference methods. We show that, in expectation, OSRCT creates data sets that are equivalent to those produced by randomly sampling from empirical data sets in which all potential outcomes are available. We analyze several properties of OSRCT theoretically and empirically, and we demonstrate its use by comparing the performance of four causal inference methods using data from eleven RCTs.





Researchers in machine learning and statistics have become increasingly interested in methods that can estimate causal effects from observational data. Such interest is understandable, given the centrality of causal questions in fields such as medicine, economics, sociology, and political science (morgan2015counterfactuals). Causal inference has also emerged as an important class of methods for improving the explainability and fairness of machine learning systems, since causal models explicitly represent the underlying mechanisms of systems and their likely behavior under counterfactual conditions (kusner2017counterfactual; pearl2019seven).

However, evaluating causal inference methods is typically far more challenging than evaluating methods that construct purely associational models. Both types of methods can be analyzed theoretically. However, empirical analysis, long a driver of research progress in machine learning and statistics, has been increasingly recognized as vital for research progress in causal inference (e.g., Dorie2019; Gentzel2019), and empirical evaluation is substantially more challenging to perform in the case of causal inference. Many associational models (e.g., classifiers and conditional probability estimators) can be evaluated using cross-validation or held-out test sets. However, causal inference aims to estimate the value or distribution of an outcome variable under intervention, and evaluating such estimates requires an alternative route to estimating the effects of such intervention.

Most easily available data sets are either experimental (which can yield unbiased estimates of treatment effect) or observational (for which treatment effect is unknown). Since most causal inference methods are designed to infer causal dependence from observational data, accurate evaluation requires both observational data and corresponding unbiased estimates of treatment effect. Several recent efforts have attempted to address this problem (e.g., Dorie2019; Gentzel2019; Tu2019; Shimoni2018), most of which collect or modify data specifically for the purpose of evaluation. Some approaches induce dependence between variables in specially constructed or selected data, while others repurpose a simulator to produce data for evaluation. These approaches are promising and beneficial to the community, but creating individual, specialized new data sets is difficult and time-consuming, limiting the number of data sets available and thus limiting research progress.

We propose to exploit an additional source of data for evaluating causal inference methods: randomized controlled trials. Randomized controlled trials (RCTs) are designed and conducted for the express purpose of providing unbiased estimates of treatment effect. Many RCT data sets are publicly available, and more become available every day. Previous work has described how to sub-sample a specialized type of experimental data (one in which all potential outcomes are observed) to create constructed observational data sets. (The term “constructed observational data” denotes empirical data to which additional properties common in observational data, such as confounding, have been synthetically introduced. This term is distinct from constructed observational studies, which are studies that collect and compare both experimental and observational data from the same domain; see “Related Work”.) Surprisingly, this basic approach can be modified to produce constructed observational data from RCTs as well. Specifically, we: (1) Describe an algorithm to induce confounding bias in RCT data by sub-sampling, and prove that this approach is equivalent, in expectation, to the data generating process assumed by the potential-outcomes framework, a longstanding theoretical framework for causal inference; (2) Demonstrate the feasibility of this approach by applying multiple causal inference methods to observational data constructed from RCTs (pointers to the data sets used in this paper, and R code to perform observational sampling, will be released upon publication); and (3) Present a method for using the data rejected by the sub-sampling for evaluation, and show that it is equivalent to a held-out test set.

Creating Observational Data from Randomized Controlled Trials

Consider a data generating process that produces a binary treatment $T$, outcome $Y$, and multiple covariates $C$, each of which may be causal for outcome. (For ease of exposition, we describe the approach using binary treatment, but the approach is more general.) We define $Y_i(t)$ to be the outcome for unit $i$ under treatment $t$, referred to as a potential outcome. For each unit $i$, both treatment values $T = 0$ and $T = 1$ are set by intervention and both potential outcomes $Y_i(0)$ and $Y_i(1)$ are measured. We refer to this type of data, where all potential outcomes are observed, as all potential outcomes (APO) data, denoted $D_{APO}$. Note that, due to the use of explicit interventions, such a data generating process produces experimental, rather than observational, data.

Recently, some researchers (louizos2017causal; Gentzel2019) have proposed sampling from APO data to produce constructed observational data. Such data sets are produced by probabilistically sampling a treatment value (and its corresponding outcome value) for every unit based on the values of one or more covariates ($C^b \subseteq C$). We refer to $C^b$ as the biasing covariates. This procedure, shown in Algorithm 1, induces causal dependence between $C^b$ and $T$, creating a confounder when $C^b$ also causes $Y$. We refer to such a data generating process as observational sampling from all potential outcomes (OSAPO) and denote a given data set generated in this way as $D_{OSAPO}$. OSAPO is the data generating process assumed under the potential outcomes framework (rubin2005causal).
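The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a transcription of Algorithm 1: the logistic biasing function, the single biasing covariate `c`, the dictionary layout, and the synthetic APO data (true effect fixed at 2.0) are all assumptions made for the example.

```python
import math
import random


def logistic(x, slope=1.0, midpoint=0.0):
    """Assumed biasing function: maps a covariate value to P(treatment = 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def osapo_sample(apo_rows, rng, slope=1.0, midpoint=0.0):
    """Select one treatment per unit, biased by covariate 'c', and keep
    the matching potential outcome. Every unit is retained."""
    sample = []
    for row in apo_rows:
        p_treat = logistic(row["c"], slope, midpoint)
        t = 1 if rng.random() < p_treat else 0
        y = row["y1"] if t == 1 else row["y0"]
        sample.append({"c": row["c"], "t": t, "y": y})
    return sample


# Hypothetical APO data: the covariate raises the outcome, true effect is 2.0.
rng = random.Random(0)
apo = []
for _ in range(2000):
    c = rng.gauss(0.0, 1.0)
    y0 = c + rng.gauss(0.0, 1.0)
    apo.append({"c": c, "y0": y0, "y1": y0 + 2.0})

obs = osapo_sample(apo, rng, slope=2.0)
treated = [r["y"] for r in obs if r["t"] == 1]
control = [r["y"] for r in obs if r["t"] == 0]
naive_effect = sum(treated) / len(treated) - sum(control) / len(control)
print(len(obs), naive_effect)  # naive estimate is inflated by the induced confounding
```

Because the covariate drives both treatment selection and outcome, the naive difference in means overstates the true effect, which is the kind of bias a causal inference method would be asked to remove.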

Data sets produced by OSAPO are extremely useful for evaluating causal inference methods. Causal inference methods can estimate treatment effect in $D_{OSAPO}$, and these estimates can be compared to estimates derived from $D_{APO}$. Furthermore, the process of inducing bias by sub-sampling allows for a degree of control that can be exploited to evaluate a method’s resilience to confounding, by systematically varying the strength and form of dependence and whether variables in $C^b$ are observed. However, very few experimental data sets exist that record all potential outcomes for every unit, severely limiting the applicability of this approach.

Observational Sampling of RCTs

Now consider a slightly different data generating process, in which treatment is randomly assigned and only one potential outcome is measured for each unit $i$, producing either $Y_i(0)$ or $Y_i(1)$, but not both. This is the data generating process implemented by RCTs, in which every unit is randomly assigned a treatment value, and the outcome for that treatment is measured. Vast numbers of RCTs are conducted each year, and data sets from many of them are available publicly. In addition, growing efforts toward open science are continually increasing the number of publicly available RCT data sets.

This raises an intriguing research question: Can RCTs be sub-sampled to produce constructed observational data sets with the same properties as those produced by APO sampling?

Figure 1: Two procedures for sampling constructed observational data sets from experimental data. Left: From all potential outcomes (APO) data. Right: From randomized controlled trial (RCT) data. For some function

We describe one such sampling procedure in Algorithm 2, observational sampling from randomized controlled trials (OSRCT), which produces a data sample denoted $D_{OSRCT}$. As in APO sampling, covariates $C^b$ bias the selection of a single treatment value $t_s$ for every unit $i$. If unit $i$ actually received the selected treatment $t_s$, we add $i$ to $D_{OSRCT}$. Otherwise, that unit is ignored. As we show below, when treatment is binary and treatment and control groups are equal in size, the resulting constructed observational data set is, in expectation, half the size of the original, regardless of the form of the biasing. As discussed in “What Can OSRCT Evaluate?”, a causal inference method can then be applied to this data, and the results can be compared to the unbiased effect estimate from the original RCT data. This basic approach is shown in Figure 2.
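The accept-or-ignore rule just described can be sketched as follows. This is an illustrative reading of the procedure, not the paper's Algorithm 2 verbatim: the logistic biasing function, the field names `c`, `t`, `y`, and the synthetic RCT are assumptions for the example.

```python
import math
import random


def logistic(x, slope=1.0, midpoint=0.0):
    """Assumed biasing function: P(selected treatment = 1 | covariate)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))


def osrct_sample(rct_rows, rng, slope=1.0, midpoint=0.0):
    """Keep a unit only if its randomized treatment equals the treatment
    selected from its biasing covariate; return (accepted, rejected)."""
    accepted, rejected = [], []
    for row in rct_rows:
        p_treat = logistic(row["c"], slope, midpoint)
        t_selected = 1 if rng.random() < p_treat else 0
        if row["t"] == t_selected:
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected


# Hypothetical RCT: treatment is a fair coin flip, independent of the covariate.
rng = random.Random(1)
rct = []
for _ in range(4000):
    c = rng.gauss(0.0, 1.0)
    t = rng.randint(0, 1)
    y = c + 2.0 * t + rng.gauss(0.0, 1.0)
    rct.append({"c": c, "t": t, "y": y})

accepted, rejected = osrct_sample(rct, rng, slope=2.0)
fraction_kept = len(accepted) / len(rct)
print(fraction_kept)  # close to one half for a balanced trial
```

Returning the rejected units alongside the accepted ones is a design choice made here so that the rejected sample remains available for evaluation.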

An RCT can be thought of as a data set where one potential outcome for every unit is missing at random. Since OSRCT uses the biasing covariates to select treatment, and treatment was assigned randomly, the sub-sampling process only changes the dependence between the biasing covariates and treatment. This is the same as in APO sampling. The probability of a given unit-treatment pair being included in the sub-sample is proportional in APO and RCT sampling. That is, $D_{OSRCT}$ is equivalent to a random sample of $D_{OSAPO}$.

Theorem 1.

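For RCT data set $D_{RCT}$, APO data set $D_{APO}$, and binary treatment $T \in \{0, 1\}$ with $P(T = 1) = P(T = 0) = 0.5$ in $D_{RCT}$, and units $i$, $P((i, t) \in D_{OSRCT}) = 0.5 \, P((i, t) \in D_{OSAPO})$, for all units $i$.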

Proof 1.

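For every unit $i$ and any treatment $t$, the biasing covariates are used to probabilistically select a treatment, which we denote $t_s$, with probability $P(t_s \mid C_i^b)$.

In APO sampling, the selected treatment is always observed, so $P((i, t_s) \in D_{OSAPO}) = P(t_s \mid C_i^b)$. In RCT sampling, the unit is kept only if its randomly assigned treatment $T_i$ equals $t_s$, and random assignment makes $T_i$ independent of $C_i^b$, so $P((i, t_s) \in D_{OSRCT}) = P(T_i = t_s) \, P(t_s \mid C_i^b)$.

Sub-sampling uniformly at random is equivalent to multiplying by a scaling factor, $P(T_i = t_s)$. When $P(T = 1) = P(T = 0) = 0.5$, $P((i, t_s) \in D_{OSRCT}) = 0.5 \, P((i, t_s) \in D_{OSAPO})$.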

Intuitively, the procedure outlined in Algorithm 2 works because treatment is randomly assigned in RCTs. The data is sub-sampled based solely on the value of a probabilistic function of the biasing covariates, which selects a value of treatment $t_s$ for every unit $i$. Since the observed treatment is randomly assigned, it contains no information about any of unit $i$’s covariates. The only bias introduced by this sub-sampling procedure is the intended bias: a particular form of causal dependence from $C^b$ to $T$.
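A small simulation can make this intuition tangible. The sketch below assumes a binary biasing covariate and a hypothetical selection table `P_SELECT_TREATMENT` (neither comes from the paper); it checks that, in a balanced trial, the treated fraction among accepted units tracks the biasing function itself.

```python
import random

# Assumed biasing function on a binary covariate: P(selected treatment = 1 | c).
P_SELECT_TREATMENT = {0: 0.2, 1: 0.8}

rng = random.Random(2)
accepted = []
for _ in range(20000):
    c = rng.randint(0, 1)   # biasing covariate
    t = rng.randint(0, 1)   # randomized treatment, independent of c
    t_selected = 1 if rng.random() < P_SELECT_TREATMENT[c] else 0
    if t == t_selected:     # keep only units whose treatment matches
        accepted.append((c, t))


def treated_fraction(pairs, c_value):
    """Share of accepted units with covariate c_value that were treated."""
    group = [t for c, t in pairs if c == c_value]
    return sum(group) / len(group)


frac_c0 = treated_fraction(accepted, 0)
frac_c1 = treated_fraction(accepted, 1)
print(frac_c0, frac_c1)  # should sit near the assumed 0.2 and 0.8
```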

Note that while Theorem 1 assumes equal probability of treatment and control, the approach generally applies even when $P(T = 1) \neq P(T = 0)$. In this case, instead of sub-sampling by a factor of 0.5, the scaling factor is selected based on the treatment value. Since treatment is based solely on the value of the biasing covariates, this is equivalent to modifying the form of the biasing function.
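One way to see the effect of unequal treatment probabilities is to write out the treatment probability among accepted units. The symbols $p = P(T = 1)$ for the trial's treatment probability and $\pi(c) = P(t_s = 1 \mid C^b = c)$ for the biasing function are notation assumed here for illustration. Since a unit is kept only when its randomized treatment matches the selected one, and randomization makes the two independent given the covariates, Bayes' rule gives:

```latex
P(T = 1 \mid C^b = c,\ \text{accepted})
  = \frac{p\,\pi(c)}{p\,\pi(c) + (1 - p)\,\bigl(1 - \pi(c)\bigr)}
```

With $p = 0.5$ the right-hand side reduces to $\pi(c)$. For other values of $p$, the accepted sample behaves as if a different biasing function had been applied, which is one reading of the remark that unequal treatment probabilities amount to a modified biasing function.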

One potential disadvantage of this approach is that sub-sampling to induce bias necessarily reduces the size of the resulting sample. Somewhat surprisingly, however, the degree of this reduction does not depend on the intensity of the biasing.

Theorem 2.

For binary treatment $T$ and RCT data set $D_{RCT}$, if either $P(T = 1) = 0.5$, or $E[P(t_s = 1 \mid C^b)] = 0.5$, then $E[|D_{OSRCT}|] = 0.5 \, |D_{RCT}|$.

A proof is provided in the Supplementary Material.
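The size result can be checked with a short calculation, offered here as a sketch rather than as the paper's proof. Assume the notation $p = P(T = 1)$ for the trial's treatment probability and $q = E[P(t_s = 1 \mid C^b)]$ for the average probability that the biasing function selects treatment. A unit is accepted when its randomized treatment equals the selected treatment, and randomization makes the two independent:

```latex
\begin{aligned}
P(\text{accepted}) &= p\,q + (1 - p)(1 - q) \\
                   &= \tfrac{1}{2} + 2\left(p - \tfrac{1}{2}\right)\left(q - \tfrac{1}{2}\right)
\end{aligned}
```

The second term vanishes whenever $p = 1/2$, so a balanced trial retains half of its units in expectation no matter how strong the biasing is. The same holds when $q = 1/2$.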

Figure 2: The process of creating observational-style data from a randomized controlled trial.

What Can OSRCT Evaluate?

The constructed observational data created by OSRCT has a substantial benefit over purely observational data: Unbiased estimates of causal effect can be obtained from the original RCT data, which can be compared to effect estimates from causal inference methods. In a well-designed RCT, treatment assignment is randomized such that, in expectation, the treatment and control groups are equivalent. This enables the unbiased estimation of the sample average treatment effect (ATE) as $E[Y_i \mid T_i = 1] - E[Y_i \mid T_i = 0]$, where $T_i$ denotes the actual treatment received by unit $i$. This estimate can be compared to estimates made by causal inference methods applied to the constructed observational data.
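As a concrete illustration of the reference estimate, the sample ATE of a randomized trial is the difference between the mean outcome of the treated group and that of the control group. The function name `sample_ate` and the four-row data set below are hypothetical.

```python
from statistics import mean


def sample_ate(rows):
    """Difference in mean outcome between treated and control units."""
    treated = [r["y"] for r in rows if r["t"] == 1]
    control = [r["y"] for r in rows if r["t"] == 0]
    return mean(treated) - mean(control)


# Tiny made-up RCT: two treated units, two control units.
rct = [
    {"t": 1, "y": 3.0},
    {"t": 1, "y": 5.0},
    {"t": 0, "y": 1.0},
    {"t": 0, "y": 2.0},
]
ate_hat = sample_ate(rct)
print(ate_hat)  # 2.5
```

An evaluation would compute this value once on the full RCT and then measure how far each method's estimate, learned from the constructed observational data, falls from it.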

However, unlike APO data, RCT data only contains one treatment-outcome pair for every unit, limiting both the available effect estimates and how these data sets can be used. RCTs measure the effect of a single randomized intervention for every unit in the data set. Thus, we cannot estimate individual treatment effect (ITE) from RCT data, a measurement which is available when using APO data. However, OSRCT data can be used to evaluate a method’s ability to estimate the unit-level effects of interventions. Any causal inference method that can estimate the potential outcome $Y_i(t)$ can be evaluated by comparing those estimates against measurements in the RCT data.

Using the Complementary Sample for Evaluation

One challenge when evaluating causal inference methods on their ability to estimate unit-level effects of interventions is the need for a held-out test set. The constructed observational data is constructed by sub-sampling the original RCT data. This means that evaluating on all of the RCT data may produce a biased result by testing on a superset of the training data. One potential solution is to divide the RCT data into separate training and test sets. However, since OSRCT necessarily reduces the size of the training data by sub-sampling, the extra requirement of holding out a test set limits the number of RCTs that can be used, since not all randomized experiments will have enough data to learn effective models after two rounds of sub-sampling.

A more data-efficient approach is to use the data rejected by the biased sub-sampling. OSRCT sub-samples RCT data to create a probabilistic dependence between the biasing covariates and treatment. Based on the values of the biasing covariates, a treatment is selected for every unit. If that treatment is present in the data, the unit is included in the sample; otherwise the unit is rejected. This rejected sample (which we call the complementary sample) also has a causal dependence from the biasing covariates to treatment. The only difference is that the form of that dependence is the complement of that for the accepted sample, such that covariate values that lead to a high probability of treatment in the accepted sample would lead to a low probability of treatment in the complementary sample. Because we know the functional form of this induced bias, we can weight the data points in the complementary sample according to their probability of being included in the accepted sample. In aggregate, this type of weighting allows the complementary sample to approximate the distribution of the training data, and thus be used for testing. This is equivalent to inverse propensity score weighting (rosenbaum1983central).
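The weighting idea can be sketched as follows. The exact weights used in the paper are stated in Theorem 3. The ratio used here, $P(t_s = t_i \mid C_i^b) / (1 - P(t_s = t_i \mid C_i^b))$, is an assumption: it is the importance weight that maps the rejected sample's distribution onto the accepted sample's. The binary covariate, the selection table, and the stand-in statistic `c * t` are also hypothetical.

```python
import random

# Assumed biasing function on a binary covariate: P(selected treatment = 1 | c).
P_SELECT_TREATMENT = {0: 0.2, 1: 0.8}


def p_selected(t, c):
    """Probability that the biasing function selects treatment t given c."""
    p1 = P_SELECT_TREATMENT[c]
    return p1 if t == 1 else 1.0 - p1


def complementary_weight(t, c):
    """Assumed weight: odds that a unit with (t, c) is accepted, not rejected."""
    p = p_selected(t, c)
    return p / (1.0 - p)


rng = random.Random(3)
accepted, complementary = [], []
for _ in range(100000):
    c = rng.randint(0, 1)
    t = rng.randint(0, 1)  # randomized treatment
    t_selected = 1 if rng.random() < P_SELECT_TREATMENT[c] else 0
    (accepted if t == t_selected else complementary).append((c, t))


def statistic(c, t):
    return c * t  # stand-in for a per-unit prediction error


accepted_mean = sum(statistic(c, t) for c, t in accepted) / len(accepted)
unweighted_mean = sum(statistic(c, t) for c, t in complementary) / len(complementary)
weights = [complementary_weight(t, c) for c, t in complementary]
weighted_mean = (
    sum(w * statistic(c, t) for w, (c, t) in zip(weights, complementary))
    / sum(weights)
)
print(accepted_mean, unweighted_mean, weighted_mean)
```

In this setup the unweighted complementary mean is far from the accepted-sample mean, while the weighted one should land close to it, which is the property that lets the rejected units stand in for a held-out test set.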

Theorem 3.

For binary treatment , biasing covariates , outcome , estimated outcome , biased sample  and complementary sample , let . Then, for  = for .

A proof is provided in the Supplementary Material.

Assumptions, Limitations, and Opportunities

The validity of evaluation with OSRCT depends on several standard assumptions about the validity of the original RCT. Specifically, it assumes that treatment assignment is randomized and that all sampled units complete the study (no “drop-out”). Intriguingly, one standard assumption—that intent to treat does not differ from actual treatment—is not necessary. Even if this assumption is violated, the estimated treatment effect will correspond to the effect of intending to treat, and this estimand can still be used to evaluate the effectiveness of methods for observational causal inference.

Evaluation with OSRCT has some limitations. OSRCT can induce dependence between any covariate and treatment, but it cannot induce dependence between any covariate and outcome. In addition, while the original RCT data can yield an unbiased estimate of the effect of treatment on outcome, it cannot produce such estimates for any other pair of variables.

Constructing observational data also provides some unique opportunities. OSRCT produces data with non-random treatment assignment, and allows for variation in the level and form of that non-randomness. Additional factors of observational studies can also be simulated, such as measurement error, selection bias, and lack of positivity. While some of these may reduce the sample size of the constructed observational data due to additional sub-sampling, this can allow for the evaluation of a causal inference method’s robustness to many features of real-world data.

Related Work

The closest prior work (li2011unbiased) uses an identical idea for a subtly different task: estimating the reward of a contextual bandit policy without having to actually execute that policy. Specifically, they propose to evaluate a (non-random) contextual policy by sampling from the data produced by a randomized policy. They show that the resulting estimate is unbiased, despite its use of only a subsample of the data originally produced by the randomized policy. This method is widely employed to evaluate methods in fields such as computational advertising and recommender systems, and it has been extended with approaches such as bootstrapping (mary2014improving).

OSRCT exploits the same idea but in a different setting. In our setting, we have no interest in estimating the effect of a contextual policy that is known to the agent (which is somewhat analogous to what, in observational causal inference, is referred to as the “average treatment effect on the treated”). Instead, our goal is to determine how well a given method estimates the average treatment effect (which, in contextual bandits, would be formulated as the reward difference between two specific policies), even though the algorithm only has access to the actions and outcomes of a single unknown and non-randomized policy.

Despite the similarity of tasks, this approach—observational sampling from RCTs—is almost entirely unknown within the causal inference community. For example, two recent papers that contain reviews of existing evaluation methods for causal inference methods—Dorie et al. (Dorie2019) and Gentzel et al. (Gentzel2019)—do not even mention this approach, despite the fact that it overcomes many of the most serious threats to validity for evaluation studies (e.g., reproducibility, realistic data distributions and complexity of treatment effects, multiple possible levels of confounding). A handful of papers have applied it in a one-off manner to evaluate causal inference methods (e.g., kallus2018removing; kallus2018confounding), but it has not been explicitly formalized or its advantages clearly described. As a result, it is almost never used.

In addition to this prior work on sampling for evaluating contextual bandit policies, other prior work has explicitly focused on evaluation methods in causal inference. This work has applied a variety of approaches to creating observational data sets such that a derived estimated treatment effect can be compared to some objective standard. The ideal approach would score highly on at least three characteristics: data availability (many data sets with the required characteristics can be easily obtained); internal validity (differences between estimated treatment effect and the standard can only be attributed to bias in the estimator); and external validity (the performance of the estimator will generalize well to other settings). Of three broad classes of prior work, each suffers from some deficiencies and none clearly dominates the others.

The first class of prior work uses observational data sets with known treatment effect. One approach gathers observational data about phenomena that are so well-understood that the causal effect is obvious (e.g., mooij2016distinguishing). Unfortunately, such situations are relatively rare, limiting data availability. Another approach is to use data from matched pairs of observational and experimental studies (e.g., Dixit2016; sachs2005causal). In many ways, such data sets appear to represent a nearly ideal scenario for evaluating methods for inferring causal effect from observational data. However, pairs of directly comparable observational and experimental studies have low data availability, and using paired studies with different settings or variable definitions can greatly reduce internal validity. Some “constructed observational studies” intentionally create matched pairs of experimental and observational data sets (e.g., lalonde1986evaluating; Hill2004; Shadish2008can), but these studies also have low data availability.

Another class of prior work generates observational data from synthetic or highly controlled causal systems (e.g., Tu2019; Gentzel2019; louizos2017causal; Kallus2018). In this way, the treatment effect is either directly known or can be easily derived from experimentation. Observational data is typically obtained via some biased sampling of the experimental data, often a variety of APO sampling. In the case of entirely synthetic data, data availability and internal validity are both high, but external validity is low, and such studies are often criticized as little more than demonstrations. External validity typically increases somewhat for highly controlled causal systems, but data availability drops significantly.

The final and newest class of existing work augments an existing observational study with a synthetic outcome, replacing the original outcome measurement (e.g., Dorie2019; Shimoni2018). Given the synthetic nature of the outcome, the causal effect is known. This class of approach has relatively high data availability, and it trades some loss of external validity (because real outcome measurements are replaced with synthetic ones) to gain internal validity (because the true treatment effect is known). Note particularly that both the treatment effect and the confounding are synthetic, because the function that determines the synthetic outcome determines how both the treatment and potential confounders affect the value of outcome.

The approach proposed here—OSRCT—augments, rather than replaces, these existing approaches. It occupies a unique position because it simultaneously has fairly high data availability, internal validity, and external validity. OSRCT’s data availability is relatively high because it can be applied to data from any moderately large RCT. Only synthetic data generators and approaches that augment observational data with synthetic outcomes probably have higher data availability, but both suffer in terms of external validity. OSRCT’s internal validity is relatively high because there exist many well-designed RCTs. Using synthetic data generators or highly controlled causal systems will typically produce somewhat higher internal validity, as will observational data with synthetic outcomes, but this is done at the cost of external validity or data availability. Finally, OSRCT’s external validity is relatively high because the distributions of all variables and the outcome function occur naturally, while only the confounding is synthetic. Only observational studies with known treatment effect have higher external validity, and these suffer from severe limitations on data availability.

Are RCT Data Sets Available?

OSRCT has the benefit of leveraging existing empirical data rather than requiring the creation of new data sets specifically for evaluating causal inference methods, but it does require that data from RCTs be available and generally accessible to causality researchers. Fortunately, this is increasingly the case. While many repositories that host RCTs are restricted for reasons of privacy and security, many other repositories allow access with only minimal restrictions. In some cases, access requires only registering with the repository and agreeing not to re-distribute the data or attempt to de-anonymize it. As long as these data sharing agreements are adhered to, such data can be easily acquired by causality researchers. This includes repositories such as Dryad, the Yale Institution for Social and Policy Studies Repository, the NIH National Institute on Drug Abuse Data Share Website, the University of Michigan’s ICPSR repository, the UK Data Service, and the Knowledge Network for Biocomplexity. An even larger set of repositories restricts access but will make data available upon reasonable request. Additional information about these repositories can be found in the Supplementary Material.

In addition, funding agencies and journals are increasingly requiring that researchers make anonymized individual patient data available upon reasonable request (Godleee7888; Ohmanne018647). For example, the United States’ National Institutes of Health (NIH) recently requested public feedback on a proposed data sharing policy with the aim of improving data management and the sharing of data created by NIH-funded projects (NIH2019). There is also increasing awareness and discussion in the medical community about the importance of sharing individual patient data, to allow for greater transparency and re-analysis (Drazen2015; kuntz2019individual; Banzi2019; Suvarna2015). All of this suggests increasing availability of individual patient data from randomized controlled trials.

Figure 3: Demonstration of OSRCT on data from 11 RCTs, split by outcome type. Top left: ATE (risk-ratio) for binary outcome, Top right: ATE for continuous outcome, Bottom left: Outcome estimation for binary outcome, Bottom right: Outcome estimation for continuous outcome. Outcome estimation errors were normalized by the range of the outcome. OSRCT allows us to evaluate causal inference methods on a wide range of data sets for which unbiased effect estimates are available.


To demonstrate the feasibility of OSRCT, we examine the performance of four popular causal inference methods using 11 RCT data sets, many more than is typically possible. Specifically, we examine propensity-score matching (rosenbaum1983central), outcome regression (morgan2015counterfactuals), tree-based outcome modeling (Bayesian additive regression trees, BART; hill2011bayesian), and a structure learning method (the PC algorithm; spirtes2000causation). Details are provided in the Supplementary Material. We selected 11 RCTs from 6 repositories (ISPSd084; ISPSd037; ISPSd113; ICPSR23980; DryadB8KG77; Dryad6d4p06g; UKDataService854092; UKDataService853369; NIDAP1S1; KNB1596312; KNBF1QF8R5T). For this analysis, we only selected data sets that were publicly available for download, requiring, at most, registering for an account with the repository. These data sets all have binary treatment, at least one measured pre-treatment covariate, and either a binary or a continuous outcome. For simplicity, we chose a single biasing covariate for each data set and applied the bias as a logistic function. We selected biasing covariates that were correlated with outcome, so that sub-sampling would introduce confounding. Details about data sets are provided in the Supplementary Material.

For each data set, we applied OSRCT to create a constructed observational data set, learned a model using each causal inference method, and evaluated each model on its ability both to estimate ATE and to predict the effects of treatment on the complementary sample. For data sets with binary outcome, the risk ratio ($P(Y = 1 \mid T = 1) / P(Y = 1 \mid T = 0)$) was estimated as ATE. The ATE estimate was compared to the sample ATE estimated from the original RCT data. The errors in outcome estimation were weighted as described in Theorem 3, and the mean absolute error was estimated. We ran this procedure 30 times for each data set. For ease of comparison, results are divided between binary and continuous outcome. ATE and outcome estimation errors are shown in Figure 3.
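The two evaluation quantities mentioned above can be sketched as small helper functions. The names `risk_ratio` and `normalized_weighted_mae`, the data layout, and the toy numbers are hypothetical. The weights are taken as given, since their exact form is specified by Theorem 3, and the normalization by outcome range follows the note in the Figure 3 caption.

```python
def risk_ratio(rows):
    """Risk ratio for a binary outcome: P(y=1 | t=1) / P(y=1 | t=0).
    Assumes at least one positive outcome in the control group."""
    treated = [r["y"] for r in rows if r["t"] == 1]
    control = [r["y"] for r in rows if r["t"] == 0]
    risk_treated = sum(treated) / len(treated)
    risk_control = sum(control) / len(control)
    return risk_treated / risk_control


def normalized_weighted_mae(y_true, y_pred, weights, y_range):
    """Weighted mean absolute error, divided by the range of the outcome."""
    total = sum(w * abs(a - b) for w, a, b in zip(weights, y_true, y_pred))
    return total / sum(weights) / y_range


# Toy RCT with binary outcome: 3 of 4 treated and 1 of 4 control units respond.
rct = (
    [{"t": 1, "y": y} for y in (1, 1, 0, 1)]
    + [{"t": 0, "y": y} for y in (1, 0, 0, 0)]
)
rr = risk_ratio(rct)
err = normalized_weighted_mae([1.0, 3.0], [2.0, 3.0], [1.0, 3.0], y_range=2.0)
print(rr, err)  # 3.0 0.125
```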

Figure 4: APO vs RCT sampling on Postgres data. Left: Mean absolute error of ATE estimates. Right: Mean error of estimated outcomes. The similarity between the RCT and APO data sets suggests that OSRCT and OSAPO produce equivalent constructed observational data.

Performance varies significantly by data set. While some trends appear (such as structure learning performing worse with binary outcome), ultimately, further analysis is necessary to understand what features of these data sets lead to performance differences. It is clear, though, that data generated from OSRCT can lead to some interesting variations in performance, and a comprehensive evaluation of causal inference methods using RCT data is intended as future work.

Propensity score matching may have an advantage in this evaluation, since these methods explicitly model the functional form of treatment given covariates, which is the dependence that was induced. This demonstration uses a simple biasing function on a single covariate, likely making this an easier problem for propensity score methods to solve. The functional form of outcome given covariates and treatment arises naturally and is likely to be a much more complicated mechanism. Thus, when evaluating different classes of inference methods using OSRCT, the form of the biasing should be varied. Structure learning methods are at a disadvantage here. Structure learning methods first focus on learning the causal structure of the system, and then use parameters fit to that structure to estimate effects. Effect estimates are thus very sensitive to the learned structure. Across the 30 runs performed for each data set, the learned structure varies, producing higher variance in effect estimates.

Experimental Evaluation

To assess OSRCT’s effectiveness at approximating APO data, we performed an experiment using an APO data set provided by Gentzel2019, replicating their experimental setup. In this data, units are Postgres queries, interventions are Postgres settings (such as type of indexing), covariates are features of queries (such as the number of joins or the number of rows returned), and outcomes are measured results of running the query (such as runtime). If the Postgres database is queried in a recoverable manner, the same query can be run repeatedly while varying the treatment, creating APO data. For this analysis, consistent with Gentzel2019, we chose runtime as the outcome, indexing level as the treatment, and the number of rows returned by the query as the biasing covariate.

To compare RCT and APO data, we converted the APO Postgres data into RCT-style data by randomly sampling a single treatment for every unit. We then created constructed observational data from both the original APO data and the RCT-style data, creating $D_{OSAPO}$ and $D_{OSRCT}$. For $D_{OSRCT}$, as described in Theorem 3, outcome estimation was evaluated by weighting the errors in the complementary sample. However, in $D_{OSAPO}$, no complementary sample is created, since the selected treatment is guaranteed to be observed for every unit. Instead, we can divide $D_{OSAPO}$ into training and test sets. If the RCT-style data is created by sub-sampling treatments equally, by Theorem 2, splitting $D_{OSAPO}$ in half leads to a data set approximately the same size as $D_{OSRCT}$, allowing for comparison with equal training set size. We estimated errors over 100 trials. Results are shown in Figure 4.
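The conversion from APO data to RCT-style data amounts to keeping one uniformly chosen treatment per unit and discarding the other potential outcome. The sketch below assumes a binary treatment and a made-up data layout (`c`, `y0`, `y1`); the Postgres data itself is not reproduced here.

```python
import random


def apo_to_rct(apo_rows, rng):
    """For each unit, pick a treatment by fair coin flip and keep only
    the potential outcome observed under that treatment."""
    rct_rows = []
    for row in apo_rows:
        t = rng.randint(0, 1)
        y = row["y1"] if t == 1 else row["y0"]
        rct_rows.append({"c": row["c"], "t": t, "y": y})
    return rct_rows


# Hypothetical APO data with both potential outcomes recorded per unit.
rng = random.Random(4)
apo = [{"c": i % 5, "y0": float(i), "y1": float(i) + 1.0} for i in range(1000)]
rct_style = apo_to_rct(apo, rng)
treated_share = sum(r["t"] for r in rct_style) / len(rct_style)
print(treated_share)  # roughly one half under equal sub-sampling
```

Applying observational sampling to both `apo` and `rct_style` with the same biasing function would then yield the two constructed data sets being compared.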

Results are very similar for the APO data and the RCT-style data constructed from it. Consistent with Theorem 1, this suggests that evaluation with OSRCT data produces equivalent results to OSAPO data. In addition, consistent with Theorem 3, the similarity in outcome estimates suggests that weighting the complementary sample produces equivalent results to an unweighted held-out test set.


Research progress in machine learning has long depended on high-quality empirical evaluation. Until recently, research in causal inference has been hindered due to an almost complete lack of empirical data resources. The growth in such data resources is slow, and the breadth of such data is still limited, especially when compared to the wealth of evaluation data sets available for associational machine learning.

Data from RCTs provides a large and growing source of data that can be used to evaluate causal inference methods. RCT data sets have the benefit of being widely collected by researchers in many fields over many years, and are increasingly being made available for wider use. RCT data is available from a wide variety of domains, and unbiased estimates of causal effect can be obtained for evaluation. OSRCT can substantially increase the data available for evaluating causal inference methods.


Thanks to Purva Pruthi for help preparing data sets and to Kaleigh Clary, Sam Witty, and Kenta Takatsu for their helpful comments.

This research was sponsored by the Defense Advanced Research Projects Agency (DARPA), the Army Research Office (ARO), and the United States Air Force under Cooperative Agreement W911NF-20-2-0005 and contracts FA8750-17-C-0120 and HR001120C0031. Any opinions, findings and conclusions or recommendations expressed in this document are those of the authors and do not necessarily reflect the views of DARPA, ARO, the United States Air Force, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.