The increasing availability of large, observational datasets poses opportunities and challenges for statistical methodologists. These datasets often “contain detailed information about the target population of interest,” meaning they could have great utility for estimating the causal effects of a proposed intervention, such as a public health initiative (Hartman et al., 2013). Yet assignment to the treatment is non-random in these data, making standard methods prone to misstate the treatment effect.
Randomized control trials (RCTs) make causal inference substantially easier, because the researcher controls assignment of the intervention. Yet RCTs present their own challenges. A commonly raised concern is that “estimates from RCTs may lack external validity” (Hartman et al., 2013) due to the sampling scheme used to enroll participants. Moreover, in cases where the treatment effect varies by subpopulation, “experiments have to be very large and, in general, prohibitively costly” whereas “observational data is often available in much larger quantities” (Peysakhovich and Lada, 2016).
These issues motivate a hybrid approach, which makes use of the availability, size, and representativeness of observational data as well as the randomization inherent in RCT data. Implicitly, any such approach will also involve combining biased and unbiased estimators – a problem with substantial precedent in the literature(mundlak1978pooling; green1991james)
. The fundamental tool used in our approach is the propensity score: the estimated conditional probability of exposure to the intervention, given observed covariates.
Rosenbaum and Rubin (1984) showed that comparing treated individuals against control individuals for whom the propensity score is approximately equal yields a substantial reduction in bias. The propensity score is widely used in analysis of observational datasets, as comparing test and control units with similar propensity scores “tends to balance observed covariates that were used to construct the score” (Joffe and Rosenbaum, 1999). It has been less widely used in the context of RCTs. But propensity-based methods “have been shown to be useful even in randomized control settings, where the assignment mechanism is known and independent of the covariates” (Xu and Kalbfleisch, 2010), because they correct for chance imbalance in covariates.
In using the propensity score, we make several simplifying assumptions: we assume a strongly ignorable treatment assignment (SITA), as defined by Rosenbaum and Rubin (1984); and we allow for a heterogeneous treatment effect but assume that it varies only as a function of the propensity. Both are strong assumptions. The existence of a prospectively designed RCT in our setting may yield some insights into the factors affecting treatment assignment in the ODB, lending some additional plausibility to the SITA assumption. Yet it remains hard to defend in practice, and a natural extension of this work will involve engaging with unmeasured confounders.
The latter assumption is retained in its strictest form for simplicity, but can be greatly weakened by sub-stratifying on additional features assumed to be related to the heterogeneity of the treatment effect. A practical example is provided in Section 5.4. The treatment effect for subjects in the RCT is assumed to be the same function of the propensity that holds in the ODB. The distribution of covariates in the RCT can however be much different from the ODB due to factors such as varying enrollment criteria.
We assume that the covariate distribution in the ODB is the same as that of the target population. We use data from the RCT to provide much-needed control subjects in the strata with a high propensity in the ODB as well as test group subjects in the strata with a low propensity in the ODB. Note that this is a relatively simple case of causal transportability, as, per the discussion above, we assume the propensity score is a causal modifier and that other modifiers are known via subject matter expertise. We are also not engaging with the question of selection diagrams, and implicitly making strong invariance assumptions (pearl2014external).
Our two main estimators are a spiked-in estimator that simply merges the ODB and RCT within each stratum and a “dynamically weighted” estimator that mimics the best possible convex combination of estimates from the two populations, within each stratum.
The remainder of the paper is organized as follows. In Section 2, we define our notation, assumptions, estimand and estimators, including the spiked-in and dynamic weighting estimators that we propose. We work in the potential outcomes framework, in which treatment effects are ratio estimators due to the random numbers of subjects in each condition. For the large sample sizes of interest to us, delta method approximations to the mean and variance are accurate enough. Section 3 presents delta method estimates of the within-stratum bias and variance for our estimators. There we see theoretically that the spiked-in estimator can have an enormous bias if the covariates in the RCT do not follow the same distribution as those of the ODB.
Section 4 gives some numerical illustrations of our method for an ODB of size and an RCT of size . When the RCT covariates follow the ODB’s distribution, then the spiked-in estimator brings a large reduction in mean squared error over the ODB-only estimate. If, however, enrollment criteria bias the RCT (i.e. the treatment effect varies by subpopulation and the RCT covariate distribution differs substantively from that of the ODB), then the spiked-in estimator can be much worse than the ODB-only estimate. In either case, the dynamic weighted estimator brings an improvement over the ODB-only estimate. Section 5 introduces data from the Women’s Health Initiative (WHI). An overview of the components of the WHI and the data collection process is provided in Section 5, while the subsequent sections detail how a propensity model is fit to the WHI observational component and a “gold standard” estimate of the causal effect is derived. A variant of the spiked-in estimator is introduced in Section 5.4. That variant refines propensity strata via a prognostic score, leading to a “dual spiked” estimator. All the estimators are compared via bootstrap simulations in 5.5, and the dual spiked estimate is most accurate for the WHI data. Section 6 summarizes our conclusions and an Appendix contains two of the lengthier proofs.
2 Notation, assumptions and estimators
Some subjects belong to the randomized controlled trial (RCT) and others to the observational database (ODB). We assume that no subject is in both data sets. We write if subject is in the RCT and otherwise. Subject has an outcome
and some covariates that we encode in the vector. Subject receives either the test condition or the control condition.
The condition of subject is given by a treatment variable where if subject is in the test condition (and otherwise). Some formulas simplify when we can use parallel notation for both test and control settings. Accordingly we introduce and . Other formulas look better when focused on the test condition. For instance, letting and , the expression is immediately recognizable as a Bernoulli variance and is preferred to .
We adopt the potential outcomes framework of Neyman and Rubin. See Rubin (1974). Subject has two potential outcomes, and , corresponding to test and control conditions respectively. Then . The potential outcomes are non-random and we will assume that they are bounded. We work conditionally on the observed values of covariates and so are also non-random.
All of the randomness comes from the treatment variables . We use the notation
for Bernoulli random variables taking the valuewith probability and with probability . The ODB and RCT differ in how the are distributed.
Assumption 1 (ODB sampling).
If , then independently where with .
The function in Assumption 1 is a propensity. Because the propensity depends only on , and is never or , the ODB has a strongly ignorable treatment assignment (Rosenbaum and Rubin, 1984). Because the are independent, the outcome for subject is unaffected by the treatment for any subject . That is, our model for the ODB satisfies the stable unit treatment value assumption or SUTVA (Imbens and Rubin, 2015).
Assumption 2 (RCT sampling).
If , then independently for a common probability .
The RCT will commonly have but we do not assume this. Our RCT model also has a strongly ignorable treatment assignment and it too satisfies the SUTVA. We additionally assume that the ODB is independent of the RCT.
|, ,||Subject and stratum indices, stratum weights|
|, , ,||ODB and RCT subject sets and strata|
|,||Covariates and ODB propensities|
|, ,||Test, control and observed responses|
|, ,||Test, control and observed indicators.|
|Stratum treatment effects: ODB, RCT, merged|
|, ,||Stratum sample sizes: ODB, RCT, merged|
|,||ODB and RCT estimates of|
|, ,||Spiked, weighted, dynamic estimates of|
|,||Average potential responses by ODB stratum|
|,||Average potential responses by RCT stratum|
|,||Average propensities by ODB and RCT strata|
|,||One minus average propensities|
|,||Response-propensity covariances in the ODB|
Our comparison of treatment versus control is based on stratification by propensity as described by Imbens and Rubin (2015). This is one of several matching strategies mentioned in Stuart and Rubin (2007).
We use strata defined by propensity intervals. For the ODB these are
The RCT is similarly stratified via
Note that the RCT is stratified according to the propensity that those observations would have had, had they been in the ODB. Imbens and Rubin (2015) suggest strata containing approximately equal numbers of observations. We have instead given them equal sized propensity ranges. In practice, some small strata might have to be merged together. Sometimes we refer to strata as ‘bins’.
We work here as though the propensity function is known exactly. In practice, will be replaced by an estimated propensity fit to the ODB. We suppose that the ODB is large enough to obtain a good propensity estimate. Under Assumption 1, the true propensity is a function of the observed variables .
The sample sizes of the ODB and RCT are and respectively. Ordinarily . The ODB and RCT sample sizes within stratum are and . The within-stratum average treatment effects are
where means over empty strata are taken to be in (1).
For , if and then and we call their common value .
Assumption 3 leaves undefined when . If only one of and is positive then we take its treatment effect for . If both are , then we will not need .
For all , .
Assumption 4 is an idealization that simplifies some derivations, and we need it in one instance to estimate a quantity that depends on both potential outcomes of a single subject. In some of our simulations we will relax that assumption to make constant for all units at a fixed propensity score, rather than within a stratum. Xie et al. (2012) have argued for analyzing the pattern of treatment effects solely as a function of the propensity score, an approach taken by a number of social science researchers (Brand and Davis, 2011; DellaPosta, 2013). Because our strata are based on propensity, Assumption 4 is very nearly true under the model of Xie et al. (2012).
Our estimand is a global average treatment effect defined by
for weights with . The weights can be chosen to match population characteristics. Our choice is to take which is reasonable when the ODB represents the target population of interest. With this choice, whenever and we have a well defined for every stratum that contributes to . We may still have for some strata with .
We now introduce some estimators designed to make use of the advantages of both the RCT and the ODB data. Our estimators all take the form for different within-stratum estimates .
Our two simplest proposed estimators each use just one of the two populations. The ODB-only estimate of the treatment effect in stratum is
Then . A potential problem with is that small values of , corresponding to the left-most bins, have subjects with small propensity values. Then may contain very few observations with and may have high variance. Similarly for large , may contain very few observations with
which again leads to high variance. That is, the ‘edge bins’ can have very skewed sample sizes causing problems for.
An analogous RCT-only estimator is where
Because the RCT assigns treatments with constant probability, the edge bins have less skewed treatment outcomes. However, because the RCT is small, we may find that several of the strata have very small sample sizes .
Our first hybrid estimator is , where
The RCT data are ‘spiked’ into the ODB strata. This spiked-in estimator can improve upon the ODB estimator by increasing the number of treated units in the low propensity edge bins and increasing the number of control units in the high propensity edge bins. Even a small number of such balancing observations can be extremely valuable.
The spiked-in estimator is not a convex combination of and , because the pooling is first done among the test and control units. Our final two estimators are constructed as convex combinations of and .
The weighted average estimator uses
It weights and according to the number of data points involved in each estimate.
Our final estimator is a “dynamic weighted average” . It uses weights for and that are estimated from the data. Those weights are chosen to minimize an estimate of mean squared error (MSE) derived using the delta method in the following section. While the precise form of this estimator will be discussed next, we can observe its approximate optimality via the following result, recalling that the RCT estimator will in general be unbiased.
Let and be independent estimators of a common quantity , with bias, variance and mean squared errors, , , and for . For , let . Then
This linear combination has
Independence of the yields while linearity of expectation yields . Optimizing over yields the result. ∎
3 Delta method results
In this section we develop some delta method moment approximations. Letbe a random vector with mean and a finite covariance matrix. Let be a function of that is twice differentiable in an open set containing and let and be first and second order Taylor approximations to around . Then the delta method mean and variance of are
Sometimes, to combine estimates, we will need a delta method mean for a weighted sum of those estimates. We will also need a delta method variance for a weighted sum of independent random variables. We use the following natural expressions
without making recourse to Taylor approximations.
3.1 Population quantities
We will study our estimators in terms of some population quantities. These involve some unobserved values of or . For instance, the test and control stratum averages in the ODB are
and it is typical that both of these are unobserved. Corresponding values for the RCT are and .
When we merge ODB and RCT strata we will have to consider a kind of skew in which the within-stratum mean responses above differ between the two data sets. To this end, define
Under Assumption 3, . We will use .
Now we define several other population quantities. Let be a finite non-empty set of indices such as one of our strata or . For each , let be a pair of bounded potential outcomes and let be independent random variables and let . Some of our results add the condition that all for some .
For so equipped, we define average responses
For example, above is . We use average treatment probabilities
These become , , and in a natural notation when is or .
The above quantities are averages over uniformly distributed in as distinct from expectations with respect to random . We also need some covariances of this type between response and propensity values,
We will find that these quantities play an important role in bias. If for instance the larger values of tend to co-occur with higher propensities then averages are biased up.
The delta method variances of our estimators depend on the following weighted averages of squares and cross products
where and . The quantity is the lead term in and is similar. More details about these quantities are in the Appendix where Theorem 1 is proved.
Let be , or . Then under Assumption 4, .
3.2 Main theorem
We will compare the efficiency of our five estimators using their delta method approximations. We state two elementary propositions without proof and then give our main theorem. Results for our various estimators are mostly direct corollaries of that theorem.
Let and be jointly distributed random variables with
be jointly distributed random variables with meansand respectively, and finite variances. Let . Then
Let , , , be jointly distributed random variables with finite variances and means and respectively, for . Let . Then
See Section Proof of Theorem 1. ∎
The implied constant in for equation (16) holds for all .
3.3 Delta method means and variances
We define the delta method bias of an estimate via .
See Section Proof of Corollary 2. ∎
The RCT has a very tiny delta method bias which arises purely from the ratio estimator (random denominator) form of . Conditional on there being at least one treated and one control subject in the stratum, it can be shown that is exactly unbiased rather than just asymptotically unbiased. This follows from symmetry: at every value of , the estimator is drawn uniformly at random from all permutations of the labels of who is treated and who is not, and unbiasedness follows.
In our motivating scenarios we anticipate that so that for most . Then the first term in is only slightly smaller than for the ODB-only estimate, and at most a small variance reduction is to be expected from weighting.
The spiked-in estimator’s bias and variance cannot be computed as a corollary of Theorem 1, but they can be computed directly.
The spike-in estimates are computed by pooling and into their union. ∎
To relate the bias of to that of the other estimators, we write it in terms of the quantities computed using and . Denoting these quantities using an additional subscript of and ,
The bias for is zero. The bias for has terms analogous to the second and third (and error) terms above, but the first term is new to . This term is linear in . For large values of , this term will dominate, yielding biases that can easily exceed those of . This is the fundamental danger of the spiked-in estimator: if the mean potential outcomes differ substantially between ODB and RCT subjects with similar value of the propensity score function, then the estimation will be poor due to large bias.
3.4 The dynamic weighted estimator
The bias-variance tradeoffs are intrinsically different in each stratum. Using results from the prior section, we derive a dynamic weighted estimator that uses different weights in each stratum. Our dynamic weighted estimator is based on Assumption 4, though we will test it in settings where that assumption does not hold.
From Proposition 1, the MSE-optimal convex combination of and is where The dynamic weighted estimator is
for plug-in estimators of and . To obtain our MSE estimates we use