A meta-analytic framework to adjust for bias in external control studies

by   Devin Incerti, et al.

While randomized controlled trials (RCTs) are the gold standard for estimating treatment effects in medical research, there is increasing use of and interest in using real-world data for drug development. One such use case is the construction of external control arms for evaluation of efficacy in single-arm trials, particularly in cases where randomization is either infeasible or unethical. However, it is well known that treated patients in non-randomized studies may not be comparable to control patients – on either measured or unmeasured variables – and that the underlying population differences between the two groups may result in biased treatment effect estimates as well as increased variability in estimation. To address these challenges for analyses of time-to-event outcomes, we developed a meta-analytic framework that uses historical reference studies to adjust a log hazard ratio estimate in a new external control study for its additional bias and variability. The set of historical studies is formed by constructing external control arms for historical RCTs, and a meta-analysis compares the trial controls to the external control arms. Importantly, a prospective external control study can be performed independently of the meta-analysis using standard causal inference techniques for observational data. We illustrate our approach with a simulation study and an empirical example based on reference studies for advanced non-small cell lung cancer. In our empirical analysis, external control patients had lower survival than trial controls (hazard ratio: 0.907), but our methodology is able to correct for this bias. An implementation of our approach is available in the R package ecmeta.



There are no comments yet.


page 1

page 2

page 3

page 4


Augmenting control arms with Real-World Data for cancer trials: Hybrid control arm methods and considerations

Randomized controlled trials (RCTs) are the gold standard for assessing ...

A note on the amount of information borrowed from external data in hybrid controlled trials with time-to-event outcomes

In situations where it is difficult to enroll patients in randomized con...

Combining Real-World and Randomized Control Trial Data Using Data-Adaptive Weighting via the On-Trial Score

Clinical trials with a hybrid control arm (a control arm constructed fro...

Clarifying Selection Bias in Cluster Randomized Trials: Estimands and Estimation

In cluster randomized trials, patients are recruited after clusters are ...

A Clinical Trial Derived Reference Set for Evaluating Observational Study Methods

Informing clinical practice in a learning health system requires good ca...

Reducing survivorship bias due to heterogeneity when comparing treated and controls with a different start of follow up

In comparative effectiveness research, treated and control patients migh...

Informed Bayesian survival analysis

We overview Bayesian estimation, hypothesis testing, and model-averaging...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Randomized controlled trials (RCTs) are considered the gold standard for estimation of treatment effects in medical research. Randomization ensures that no variables—measured or unmeasured—are systematically related to treatment assignment and that, consequently, the resulting treatment effect estimates are unbiased. However, there are potential barriers that limit the use of RCTs in certain settings: for example, a randomized control may be deemed unethical for disease areas with high unmet need or infeasible for targeted therapies with low expected enrollment.[9, 42]. Thus, despite the scientific advantages of RCTs, single-arm trials are common during early drug development [38] and have even been used for regulatory approval.[24].

At the same time, interest in using real-world data (RWD) for drug development is increasing, as evidenced by the 21st Century Cures Act, which requires the U.S. Food and Drug Administration to evaluate the use of real-world evidence to support approval of new indications or to support post-approval study requirements. One such application of RWD is the construction of external control arms for single-arm trials so that efficacy can be assessed without a randomized control.[10, 40, 11] These estimates may not only be useful for regulators, but can guide internal decision making on drug development in biopharmaceutical companies as well.[7]

Unfortunately, it is well known that there are many potential sources of bias in non-randomized studies.[25]

In external control studies, potential biases are concerning because they may result in inflated type I error, which may be of particular importance for regulatory applications. When external controls are constructed using RWD, bias may be caused by systematic differences between trial and RWD patient populations that are difficult to control for. For instance, trial patients may tend to have better outcomes since they are more likely to be treated in specialized academic centers and given higher levels of attention. Furthermore, endpoints may be measured differently (e.g. progression endpoints in oncology) or assessed with different levels of quality in routine clinical practice than in clinical trials. Finally, an important consideration is the extent to which outcomes for a given disease area can be explained by the observable data. If there are unobserved and highly prognostic confounders that vary across studies, then between study variability will be high and estimates from an external control study will be less certain.


While external control analyses are relevant to a range of disease areas with different primary endpoints, we consider time-to-event outcomes here, although the proposed methodology could be readily adapted to other classes of endpoints. This is motivated, in part, by the oncology setting where overall survival (OS) is typically the primary endpoint in phase III studies. OS is attractive for external control studies because it is measured in the same way in trials and RWD, ensuring that differences between trial and external control patients are not due to measurement differences. A focus on oncology may also be warranted given that much of the external control work to date has focused on diseases with high unmet need such as oncology and rare diseases.[17, 47] Indeed, as precision medicine has become more common, the number of eligible patients in oncology trials has decreased and some oncology indications have paradoxically become “rare diseases”.[46]

It is therefore of interest to develop methods for analysis of time-to-event outcomes that can adjust for the bias and the additional variability caused by external controls. In a prospective single-arm study, causal inference methods for observational data such as propensity score techniques are used to estimate the hazard ratio (HR) of the experimental treatment arm relative to the external control. In this paper, we propose a meta-analytic approach that leverages a set of historical reference studies to adjust HR estimates from such a study. Each reference study includes the control arm of a clinical trial (i.e., the internal control) and an external control that is constructed based on the inclusion and exclusion criteria of the trial. To further increase comparability, propensity score models can be developed for each reference study to match the external control to the randomised treatment arm. A meta-analysis of the reference studies on the log HR scale is then performed to assess the compatibility of the internal and propensity score-matched external control arms. The treatment effects from the prospective single-arm study, are, in turn, adjusted based on the bias and between-study variability estimated from the meta-analysis. Importantly, the analysis for the single-arm study is conducted independently of the meta-analysis so that adjustment requires no modifications to the standard causal inference methods used in analyses with external controls.

We use a simulation study to assess the operating characteristics of our methodology under different scenarios and then apply our method to a hypothetical single-arm study for treatment of advanced non-small cell lung cancer (aNSCLC), where we were able to use a set of 14 reference studies in the meta-analysis. Using a prespecified analysis plan that utilized propensity score methods, we estimated a mean bias (internal control vs. external control) on the log HR scale across the studies of , or a HR of approximately ; that is, our analysis suggests that external control patients had shorter survival than trial controls, which would bias results in favor of the experimental treatment. The between study variability (on the log HR scale) was as the HR varied from a low of to a high of across studies. An adjustment for the mean bias and between study variability would consequently widen uncertainty intervals and increase point estimates of the HR (i.e., decrease the treatment benefit).

The remainder of this paper is organized as follows. Section 2 introduces the meta-analytic framework. Section 3 presents results from the simulation study and Section 4 reports results from the aNSCLC example. Section 5 puts our methodology in context, discusses challenges and key considerations for implementation, and notes other potential applications. Finally, Section 6 concludes.

2 Methodology

2.1 Model formulation

We consider a set of studies, reference studies and a new study (i.e., a prospective single-arm trial), each of which may be considered to have 3 arms:

  • : the randomized experimental treatment arm

  • : the randomized internal control arm

  • : the external control arm

Figure (a) represents the data generating process for survival outcomes in terms of a directed acyclic graph. We let be the “true” values of the parameters that generate survival outcomes (e.g., might represent the rate parameter in an exponential survival model or the shape and scale parameters in a Weibull model). The estimates of the parameters in the study datasets are denoted by . The light colored nodes are not observable while the moderately dark colored nodes are observable. Our aim is to estimate treatment effects by comparing survival outcomes generated by the darkest nodes, to , in the new study. The challenge is of course that we have no randomized control so is not observable in the new study and a standard approach that compares to

is likely to result in bias because the external control arm is not randomly assigned. As such, we model the degree of similarity between the outcomes in the internal and external control arms using a normal distribution with parameters for the mean,

, and variance,


It is of note that neither nor from the references studies impact survival outcomes in the new study. We can consequently disregard the outcomes for the treatment arms from the reference studies and reformulate our problem in terms of HRs as shown in Figure (b), where is the true log HR and is an estimate of the log HR in the study sample. In other words, we can focus solely on comparisons between the internal and external controls in the reference studies.

(a) Absolute outcomes
(b) Hazard ratios
Figure 1: Directed acyclic graph describing the data generating process for survival outcomes. contains parameters describing the generating process for survival outcomes and represents a log hazard ratio. The “hat” distinguishes an estimated parameter from the true value of a parameter. and model the degree of similarity between the internal and external controls. The light colored nodes are not observable while the moderately dark colored nodes are observable. The darkest colored nodes reflect the quantities of interest: our aim is to estimate treatment effects by comparing the experimental treatment arm to a hypothetical internal control arm, which is defined in terms of log HRs as .

In what follows, we describe a meta-analytic model for estimation of log HRs for comparisons between the internal and external controls, , and an approach for generating the posterior distribution of our quantity of interest, . While our primary analysis uses Bayesian techniques for parameter estimation, quantification of uncertainty, and prediction, Appendix A details an alternative simulation based approach based on maximum likelihood estimation. We discuss the pros and cons of each as well as considerations for priors in the Bayesian model in Section 5. The methodology can be implemented using the publicly available R package ecmeta.

2.2 Meta-analytic model

A comparison between the internal and external controls in terms of the log HR can be modeled as,


where indexes a study and is the sampling variance. We assume that the true log HR is distributed with mean and variance ,


so that is a measure of the average bias in the external control arms across studies and is a measure of between study variability. It follows that the estimated log HR in study is distributed as,


When a Bayesian approach is used for estimation, priors are placed on and . We use a normal distribution for with mean and variance large enough so that the prior is uninformative. The prior for is arguably a more important consideration since it usually has a larger influence on the posterior distribution, particularly when the number of reference studies is small. In our software implementation, we allow for three uninformative priors that have been suggested in the literature including uniform [20] and half-t [21, 32] distributions for and a distribution [43] for .

2.3 Prediction for a new study

In a RCT, the treatment effect is estimated in an unbiased manner by comparing the experimental treatment arm to the internal control arm. Since there is no internal control arm in a single-arm trial, we adjust the HR from the single-arm study to obtain a HR that we would have obtained had an internal control existed. Such an adjustment is made by leveraging the following relationship between log HRs,


Note that this relationship will hold exactly for the true log HRs under the assumption of proportional hazards. For estimated log HRs (i.e., ), the relationship is only an approximation and will have a small error attached to it.

A prediction for in a new study then proceeds in three steps. The first step draws posterior samples of after using standard causal inference methods for observational data (such as those from a propensity score adjusted Cox regression) to estimate the log HR, , and its variance, . Conveniently, no modifications to standard analyses with external controls are required. A model for the sampling distribution for (using as data and treating as fixed) is given by,


where a normal prior is used for with mean and large variance. The second step samples using posterior draws of and from the meta-analytic model,


Finally, in the third step, the posterior draws of and are combined to obtain a posterior sample of using Equation 4.

3 Simulations

3.1 Setup and scenarios

Six scenarios were simulated as described in Table 1. For each scenario, parameters for 10,000 studies were generated as a function of median survival time and the number of events in each of the three arms. The log of median survival and the log of the number of events across studies were modeled using normal distributions centered at the medians displayed in the table and with variability determined by the coefficient of variation; that is, median survival times and the number of events were drawn from lognormal distributions with location parameters equal to the medians and the scale parameters equal to the coefficients of variation.

Median survival time Number of events
[Median (CV)] [Median (CV)]
Description Treatment Internal External Treatment Internal External
control control control control
S1 There is no between study heterogeneity (i.e., each study has the same median survival) and the number of events does not vary between studies. 24 (0%) 15 (0%) 12 (0%) 100 (0%) 70 (0%) 50 (0%)
S2 There is no between study variability and the number of events in each arm varies between studies. 24 (0%) 24 (0%) 18 (0%) 250 (20%) 250 (20%) 250 (20%)
S3 There is independent between-study variability in all three arms and variability in the number of events in each arm between studies. 24 (40%) 24 (20%) 18 (20%) 250 (20%) 250 (20%) 250 (20%)

A null-hypothesis scenario where the true randomized hazard ratio is fixed at 1 for all studies and the external control is biased with the magnitude of bias varying between studies. The number of events in the randomized study and external control arm vary between studies and the two randomized arms are of equal size (i.e., a 1:1 randomization ratio).

24 (20%) - HR=1 18 (20%) 150 (20%)* 250 (20%)
S5 An alternative-hypothesis scenario where the true randomized hazard ratio is fixed at 0.5 for all studies and the external control is biased with the magnitude of bias varying between studies. The number of events in the randomized study and external control arm vary between studies and the two randomized arms are of equal size (i.e., a 1:1 randomization ratio). 48/24 (20%) - HR=0.5 18 (20%) 150 (20%)* 250 (20%)
S6 An alternative-hypothesis scenario where the randomized hazard ratio varies between studies and the external control is biased with the magnitude of bias varying between studies (this may simulate scenarios where the reference set of studies are drawn from studies examining a class of multiple different compounds). The number of events in the randomized study and external control arm vary between studies and the two randomized arms are of equal size (i.e., a 1:1 randomization ratio). 35 (40%) 24 (20%) 18 (20%) 250 (20%)* 250 (20%)

CV = coefficient of variation. Median survival times and the number of events were drawn from lognormal distributions with location parameters equal to the medians in the table and scale parameters equal to the coefficient of variation.

*The sizes of the two RCT arms were fixed to be equal for each study assuming a 1:1 randomization ratio.

The true hazard ratio between the two RCT arms was fixed.

Table 1: Simulation scenarios

Parameters were always simulated independently for the external control arms. Parameters for each arm of the RCT were simulated in different ways depending on the scenario. In some scenarios the parameters of the two randomized arms were simulated independently of each other (S1, S2, S3) and in other scenarios they were set according to a fixed HR or a fixed 1:1 randomization ratio (S4, S5, S6).

Patient level data was simulated separately for each arm and study using exponential distributions where the rate parameter was set according to the median survival times. A Cox model was then used to estimate three log HRs—

, , and —for each study. Note that we assumed, for simplicity, that there was no censoring, meaning that the sample size was equal to the number of events. Given that variability in the log HRs is driven by the number of events, this assumption should not be influential.

To evaluate the meta-analytic methodology and assess the impact of varying the number of references studies, the 10,000 simulated studies were split into replications where was the number of reference studies, which we varied from 4 to 9. In each replication, studies were used to estimate the meta-analytic model and the remaining study was used as the new single-arm study for which a prediction of was made.

In Section 3.2

, we report estimates from a Bayesian model and consequently specified priors for the hyperparameters. We used a diffuse

prior for and following the advice of Gelman [21]

, a half-Cauchy (a half-t with 1 degree of freedom) prior for

with location and a large scale parameter (). In our discussion (Section 5) we consider alternative approaches including uniform and inverse gamma priors as well as estimation via maximum likelihood. Plots summarizing evaluations of these alternative approaches are provided in the Appendix (Appendix C).

3.2 Results

Figure 2 plots estimates of bias by scenario and the number of reference studies used to fit the meta-analytic model. The point estimate for the log HR, , was estimated using the median of the posterior distribution and bias was computed for the th replication as where is the true value of the log HR from the simulation. The boxplot characterizes the distribution of bias across the simulation replications. There is, in general, no evidence of bias as the median value of bias across the replications was approximately 0 in each scenario. Similarly, there is no evidence that the number of reference studies has a discernible impact on bias.

Figure 2: Bias of the adjusted estimates of the log hazard ratio as a function of the simulation scenario and the number of reference studies. The meta-analysis was performed using a Bayesian model with a half-Cauchy prior for .

The coverage of nominal 95% confidence intervals are plotted in

Figure 3

. The results are conservative with some overcoverage since the 95% credible intervals (CrIs) from the model contain the true value of the log HR,

, more often than the nominal 95%. The extent of overcoverage is decreasing in the number of reference studies, which suggests that a larger set of reference studies can improve the precision of the estimates.

Figure 3: Coverage of 95% confidence intervals of the adjusted estimates of the log hazard ratio as a function of the simulation scenario and number of reference studies. The meta-analysis was performed a using Bayesian model with a half-Cauchy prior for .

One consequence of our methodology is that uncertainty intervals will increase after adjustment. We assessed the implications of this on the power of the Bayesian model in Figure 4. Power and type I error were computed from a one-sided test where a treatment effect was deemed significant if the upper limit of the 95% CrI was less than zero. In null-hypothesis scenarios—S2 and S4—where the true log HR was 1, the type I error was below (represented by the dotted line), suggesting that the methodology was able to prevent type I error inflation. In the remaining scenarios, power was generally increasing in the number of reference studies. Increasing the number of reference studies therefore seems beneficial since it reduces uncertainty (Figure 3) and increases power.

Figure 4: Power of adjusted estimates of the log hazard ratio by scenario and the number of reference studies. Power (S1, S3, S5, S6) and type I error (S2, S4) were computed from a one-sided test. The dotted horizontal line represent the expected type I error of 0.025 under the null. The meta-analysis was performed a using Bayesian model with a half-Cauchy prior for .

4 Example: Advanced non-small cell lung cancer

4.1 Reference studies

We constructed a set of 14 reference studies using phase II and phase III RCTs for treatment of patients with aNSCLC. To ensure that patient-level data was available, we restricted trials to those conducted by Roche (which we had access to). The set of trials is the same as in a study by Carrigan et al.[13], but includes three additional trials: NCT02367781, NCT02367794, and NCT02657434. In some cases (NCT01496742 and NCT01493843), multiple treatment arms (with corresponding control arms) were available for a single trial. We treated those as separate studies and refer to each unique treatment arm and control arm pair as a unique reference study in the remainder of this article. The full list of reference studies is shown in Table 2.

A unique external control cohort was constructed for each of the internal control arms from RWD. To increase comparability between the internal and external controls, the inclusion and exclusion criteria from the trials were applied to the external control cohorts. Following Carrigan et al., we used the Flatiron health database—which provides retrospective, longitudinal de-identified clinical information derived from electronic health records—as our data source. The data originated from academic and community cancer clinics across the United States and contains both structured and unstructured (via technology-enabled abstraction) elements.[8, 29] The data are de-identified and subject to obligations to prevent re-identification and protect patient confidentiality. Institutional review board approval of the study was obtained prior to study conduct, and included a waiver of informed consent.

Comparison Sample size
Clinical trial Treatment Control Treatment Internal External
control control
1 NCT02008227 Atezolizumab Docetaxel 425 425 526
2 NCT01903993 Atezolizumab Docetaxel 144 143 476
3 NCT02366143 Atezolizumab + Bevacizumab + Carboplatin Bevacizumab + Carboplatin 356 336 602
4 NCT01351415 Bevacizumab + SOC SOC 245 240 381
5 NCT01519804 MetMAb + Platinum + Paclitaxel Placebo + Platinum + Paclitaxel 55 54 1945
6 NCT01496742 MetMAb + Bevacizumab + Platinum + Paclitaxel Placebo + Bevacizumab + Platinum 69 70 956
7 NCT01496742 MetMAb + Platinum + Pemetrexed Placebo + Platinum + Pemetrexed 59 61 3244
8 NCT01366131 MEGF0444A + Bevacizumab + Carboplatin + Paclitaxel Placebo + Bevacizumab + Carboplatin + Paclitaxel 52 52 963
9 NCT01493843 Pictilisib (340 mg) + Carboplatin + Paclitaxel Placebo (340 mg) + Carboplatin + Paclitaxel 126 125 1274
10 NCT01493843 Pictilisib (340 mg) + Carboplatin + Paclitaxel + Bevacizumab Placebo (340 mg) + Carboplatin + Paclitaxel + Bevacizumab 79 79 953
11 NCT01493843 Pictilisib (260 mg) + Carboplatin + Paclitaxel + Bevacizumab Placebo (260 mg) + Carboplatin + Paclitaxel + Bevacizumab 62 30 953
12 NCT02367781 Atezolizumab + Nab-Paclitaxel + Carboplatin Nab-Paclitaxel + Carboplatin 451 228 87
13 NCT02367794 Atezolizumab + Nab-Paclitaxel + Carboplatin Nab-Paclitaxel + Carboplatin 343 340 497
14 NCT02657434 Atezolizumab + Carboplatin or Cisplatin + Pemetrexed Carboplatin or Cisplatin + Pemetrexed 292 286 1536

Note: All external control datasetets were built using the Flatiron Health database. Sample sizes in the external control datasets were based on the number of patients remaining in the data after applying the trial inclusion and exclusion criteria.

Table 2: Reference studies for meta-analysis

4.2 Propensity score methodology

We used propensity score methods to estimate log HRs (internal versus external control),

, and standard errors,

, in each study, . We considered the estimand of interest to be the average treatment effect on the treated (ATT), which is the log HR among the trial participants. The time-to-event outcome was overall survival (OS).

To mitigate the risk of underestimating bias, we prespecified our analysis. Our primary analysis used inverse probability of treatment weights (IPTW) permitting estimation of the ATT (i.e., IPTW-ATT weights) so that internal controls received a weight of 1 and external controls received a weight of

where is the propensity score.[2] To reduce the influence of extreme weights, external control patients with propensity scores below the 1st percentile and above the 99th percentile were removed.

Propensity scores were modeled using logistic regression with the same covariates used by Carrigan et al.

[13]—age, sex, race (White, Black, other), histology (non-squamous, squamous), smoking status (current/former, never), disease stage at initial diagnosis (Advanced - IIIB/IV, Early - IIIA or below), and time from initial diagnosis until either the start of treatment (RWD) or randomization (trial data). Nonlinear functions of the continuous covariates—age and time from initial diagnosis—were considered, but did not improve covariate balance so simpler linear terms were employed.

HRs were estimated using IPTW-ATT weighted Cox proportional hazards models. The models were fit without covariate adjustment (i.e., they only included a single covariate for treatment assignment) to facilitate estimation of a marginal HR,[14] which is typically the estimand of interest from a RCT. Confidence intervals were computed using a robust sandwich-type variance estimator.[5].

Missing covariates were imputed using multiple imputation with the

aregImpute function from the R package Hmisc.[23] The data source (trial or RWD) was interacted with the covariates so that there was effectively a separate imputation model for the trial and external control patients. Given recommendations from recent simulation studies, we estimated log HRs separately in each imputed dataset and then estimated pooled log HRs and confidence intervals using Rubin’s rules.[27, 22]

Additional details and sensitivity analyses are described in Appendix B.

4.3 Estimation of meta-analytic model

The IPTW-ATT weighted estimates of the log HRs and 95% confidence intervals are shown in Figure 5. The HRs tend to be below 1, implying that survival is, on average, shorter among Flatiron patients than the internal controls. The implication is that a standard analysis that compares an experimental treatment arm to an external control in aNSCLC will tend to be biased in favor of the treatment.

Figure 5: Log hazard ratios (HRs) for comparisons between internal and external control arms, by reference study. Error bars represent 95% confidence intervals. Survival times are, on average, shorter for the external controls if the and longer if the .

A formal assessment of bias and between study variability was performed using the meta-analysis model described in Section 2.2. We estimated the model using the same Bayesian approach employed in the simulations described in Section 3. In particular, we used a diffuse prior for and a half-Cauchy prior for with location and scale parameter .

The estimates were consistent with Figure 5: the posterior median for the bias, , was with a 95% CrI ranging from a lower limit (2.5th percentile) of to an upper limit (97.5th percentile) of , again implying that the external controls tended to have shorter survival. The estimates for (posterior median: ; 95% CrI: to

) suggest that there is significant between study variability, which is not unexpected for a scale hyperparameter in a hierarchical model with a relatively small number of observations. A non-negligible fraction of the between study variability was caused by reference study 5, which appears to be a significant outlier: after removal,

decreased considerably (posterior median: ; 95% CrI: to ) and declined slightly (posterior median: ; 95% CrI: to ).

4.4 Hypothetical analysis for a new study

To illustrate our approach, we considered a hypothetical single-arm trial with an external control arm in which the estimated HR, , was 0.70. To generate reasonable estimates of the standard error, we used the same propensity score methods described in Section 4.2 to compare the experimental treatment arms to the external controls and estimate log hazard ratios for each reference study . We then set the standard error of the log hazard ratio equal to the median of the standard errors across the reference studies, .

To adjust the log HRs, we used a meta-analysis of the estimates from the previous section to generate the posterior predictive distribution for

as detailed in Section 2.3. Figure 6 demonstrates the adjustment process.[19] The top figure is the posterior distribution of the log HR from the treatment and external control comparison, , which is centered around with dispersion determined by the standard error. The bottom figure is the posterior distribution of the log HR from a comparison between the treatment and a hypothetical internal control (i.e., the log HR after “adjustment”), . The estimated bias, , shifts the posterior distribution rightward so that it is centered (posterior median) at in the bottom figure (i.e., the treatment benefit from the naive estimate in the top figure was overly optimistic). Furthermore, consideration of the between study variability, , results in a posterior distribution that is more spread out, meaning that uncertainty intervals for are wider than those for . In this illustration the adjustment was important and could have changed a decision: the log HR was nominally significant at the 5% level in the naive analysis but was not significant after adjustment.

Figure 6: Posterior distributions for log hazard ratios. The top figure is the distribution of the log hazard ratio, , for a comparison of the experimental treatment arm to the external control arm (TRTvEC). The bottom figure is the distribution of the log hazard ratio, , for a comparison between the treatment arm and a hypothetical internal control arm (TRTvIC), generated after adjusting using estimates from the meta-analysis assessing the similarity of the internal and external control arms (). Shaded areas represent 95% credible intervals (2.5th percentile to 97.5th percentile) and the vertical line is the posterior median.

4.5 Model checking

To check the model, we split the reference studies into training and test sets. The model was “trained" by performing the meta-analysis on the training reference studies and “tested” by predicting a treatment effect for the test study, assuming it was a single-arm trial without an internal control. Leave-one-out cross-validation was used to generalize performance, whereby reference studies were partitioned into training and test sets 14 times (with 13 of the 14 reference studies used for training and the remaining reference study used for testing), resulting in 14 separate predictions. For sake of comparison, we made both adjusted and unadjusted predictions. The adjusted predictions, denoted as , were given by the posterior median of and the unadusted predictions were the estimated log HRs from comparisons between the treatment and external control, .

To assess the quality of the predictions, we computed residuals, , where was the actual log HR estimated from the trial,

. Residuals were then standardized by dividing by the standard deviation of

. A QQ-plot of the standardized residuals against a theoretical normal distribution is displayed in Figure 7.

Figure 7: QQ-plot assessing differences between observed and predicted log hazard ratios. The observed log hazard ratios are the estimates from the trial, . The adjusted predictions are given by the posterior median of and the unadjusted prediction is . The residuals (observed minus predicted) on the y-axis are standardized by dividing by the standard deviation of .

As expected, the majority of residuals from the unadjusted predictions (left panel) are above the 45 degree line; in other words, the distribution of residuals was centered to the right of and predictions were therefore biased in favor of the treatment arms. In contrast, predictions in the adjusted case (right panel) are close to the 45 degree line and are residuals are roughly centered around 0, suggesting that the methodology is able to reasonably correct for bias. Reference study 5 is again an outlier as it lies below the 45 degree line in both cases since predicted log HRs were far too high for this study.

5 Discussion

We present a new meta-analytic approach that leverages historical studies to adjust treatment effects in single-arm studies augmented by external controls for the additional bias and variability caused by their non-randomized design. Our example analysis in aNSCLC suggests that this additional bias and variability is important to adjust for and that propensity score methods alone may be unable to completely eliminate confounding. Across 14 studies, the posterior median of the HR for a comparison between the internal and external controls (after inverse probability weighting) was ; that is, survival times were, on average, shorter among external control patients, which would result in a bias in favor of the experimental treatment. Similarly, the between study variability (on the log HR scale) was significant (posterior median: ), suggesting that log HRs vary from study to study beyond what would be expected from sampling error alone.

Our approach was inspired by the surrogate marker literature that uses meta-analysis to estimate the association between treatment effects for a surrogate marker and treatment effects for a primary clinical outcome.[15, 12]. In this context, the treatment effect from a single-arm study, , is akin to a treatment effect for a surrogate marker and the treatment effect from an RCT, is analogous to the treatment effect for the primary clinical outcome. An important difference that we leverage in this paper though (Equation 4) is that there is a mechanical relationship between and determined by . Namely, and both contain the same “treated” subjects, which is not the case when assessing the suitability of a surrogate marker.

While our approach is appealing because it can adjust for unmeasured confounding, there are a number of challenges and questions that must be considered in practice. For one, it is not obvious how the reference studies should be selected or how similar they must be to the single-arm trial of interest. There are a large number of selection criteria that could be considered including the phase of the trial, characteristics of the trial population, drug indication or class, and line of therapy, among others. In some cases a single-arm study may be similar to candidate reference studies on some criteria but not others and analysts will need to determine whether a reference study can be reasonably generalized to the single-arm study. Further study spanning multiple indications and drug classes may be valuable to assess whether the relationships between internal and external controls are generalizable—potentially allowing for extrapolation—or whether they vary across setting and scenario-specific relationships need to be established.

A potential challenge is that there may not be a large set of suitable reference studies, especially for rare diseases or especially novel treatments. Our simulations provide some guidance on the required number of studies and suggest that reasonable estimates can be obtained even when the number of reference studies is smaller. For instance, in cases where there were as few as 4 reference studies, estimates were unbiased and coverage probabilities were too high, implying that conclusions would be conservative. Put differently, our approach helps control type I error even when the number of reference studies is small because it adjusts for bias and results in estimates of the standard error that are conservative. Of course, increasing the number of reference studies is beneficial since it reduces the uncertainty of the estimates and increases power.

A methodological consideration is the prior chosen for the parameter , representing between study variability. Indeed, when the sample size is small, the scale hyperparameter in a hierarchical model can be sensitive to its prior. In our aNSCLC example, we used a half-Cauchy prior as recommended by Gelman.[21] To assess the sensitivity of this choice, we also considered a gamma prior (scale = 0.001, rate = 0.001) for and a uniform prior for with a lower limit of 0 and a large upper limit of 100. As shown in Figure C.1, the posterior distributions for

are very similar when using a uniform or half-Cauchy distribution; however, as noted by Gelman, the gamma prior on the precision results in a posterior distribution that is peaked closer to 0 because the marginal likelihood is higher near 0. A similar pattern was observed in our simulations. For instance, coverages of nominal 95% confidence intervals were very similar when using half-Cauchy or uniform priors, whereas coverage probabilities were lower when using a gamma prior (

Figure C.2). Likewise, Figure C.3 shows that power was similar when using a uniform or half-Cauchy prior, but slightly higher with a gamma prior.

A related choice is whether to use a Bayesian or frequentist approach to estimate the meta-analytic model. In some cases, Bayesian methods may be required, particularly if the number of reference studies is too small and an informative prior is needed to estimate . Furthermore, our simulation results (Figure C.2) suggest that our frequentist approach is too optimistic and results in undercoverage; conversely, the Bayesian approach—especially with a half-Cauchy or uniform prior on —is conservative and results in overcoverage. One potential advantage of the frequentist approach is that it does not take as long to fit, which can be useful in simulation studies; however, in practice, we have found that the Bayesian model converges quickly and that sampling is fast. Taken together, we would generally recommend using a Bayesian approach.

One assumption that is worth revisiting is the relationship between HRs described in Equation 4, which will only hold exactly under proportional hazards. We believe this is a reasonable assumption in most clinical trials in oncology with the HR as the primary endpoint and may not be a large concern in practice. The aNSCLC example is a case in point as our methodology seemed to work well. Moreover, in cases where hazards are clearly not proportional, it is unclear whether the HR from a typical unweighted Cox regression model is the best way to measure treatment effects or whether it is even a meaningful quantity.[39, 28, 44] In such cases, it would likely be necessary to extend the meta-analysis to consider differences in survival curves between the internal and external controls, rather than just differences in the log HRs. An alternative would therefore be to consider a multi-dimensional treatment effect, such as the parameters of a parametric [31] or flexible parametric [26] survival distribution.

While the meta-analysis is straightforward to perform, estimation of the log HRs, , requires a considerable amount of effort, since a separate causal inference analysis must be performed for each reference study. Our approach will consequently make it more burdensome to create a prespecified analysis plan for a single-arm study. In our aNSCLC example, we needed to specify the imputation model for missing data, covariates to control for, functional form of the propensity score model, and propensity score methods (e.g., weighting, matching, etc). We opted to apply a relatively uniform methodology across the 14 studies, but performing a separate analysis for each reference study increases the complexity of the analysis and programming.

A more general question relevant to both the new single-arm study and the meta-analysis is whether propensity score methods should be fully prespecified or if some discretion should be left to the analyst. One advantage of propensity score methods is that the design of the study can be separated from the analysis so that the propensity score method can be refined without knowledge of the outcomes.[36, 37, 45] It may, for instance, be reasonable to tweak the specification of the propensity score model after assessing its impact on the balance and overlap of baseline covariates.[33, 3, 6]. We chose to do this in our analysis and found it useful as certain specifications such as inclusion of interaction terms resulted in worse balance and more extreme weights. However for use in a regulatory setting pre-specification may be required or at a minimum demonstration that any changes were made without any knowledge of the outcomes. Otherwise, there is considerable risk that estimates of bias will be underestimated.

When creating the set of reference studies, an additional question is whether specification of the propensity score model used to ultimately compare the internal and external controls should be refined using the combined trial data, the experimental arm alone, or the internal control arm alone. An argument in favor of using the internal control arm alone is that the meta-analysis does not require the experimental arm. Yet, we decided to assess the propensity score model based on an analysis that only included the experimental trial arm, since the internal control arm would not be available in a single-arm trial. That said, in practice, these arms should have similar characteristics because they are randomized and the decision should have little impact on the conclusions.

Our meta-analytic approach may also be useful outside of the prospective single-arm setting. Due to the increasing interest in the use of RWD for drug development, there is both a need and an interest in assessing the extent to which RWD can emulate arms from completed clinical trials. Current emulation efforts include comparative-effectiveness studies [18] in which RWD is used to construct both the treatment and control arms and external control studies with a trial experimental arm and RWD control arm [13]. However, it is unclear how to determine whether an emulation was successful. In the external control setting, the meta-analysis described in this paper comparing the internal and external control arms provides one such validation tool. The advantage of a formal meta-analysis is that it estimates both bias and between study variability as well the uncertainty of those estimates, while accounting for the sampling error in each of the validation studies. A similar approach based on the surrogate marker literature could be used for comparative effectiveness studies, with the RCT treatment effect serving as the primary endpoint and the RWD (emulated study) treatment effect as the surrogate marker.

6 Conclusion

External control arms can be used to estimate treatment efficacy in single-arm trials, but bias and potential unlimited inflation of type 1 error is a serious concern because the external control arm is not randomized. To address this problem, we have proposed a meta-analytic framework that uses a set of historical reference studies to adjust estimates in a new single-arm study for the additional bias and variability caused by the lack of randomization. The approach is straightforward to apply since existing causal inference methods typically used in observational studies do not need to be modified in any way and predictions in a new study can be implemented using our software. By minimizing the risk of false positives, we believe that our approach can help increase acceptance of external control studies in both internal—such as “go/no-go” decisions on whether to move a molecule forward for further study—and regulatory decision making.


  • [1] A. Abadie and G. W. Imbens (2006) Large sample properties of matching estimators for average treatment effects. econometrica 74 (1), pp. 235–267. Cited by: §B.2.2.
  • [2] P. C. Austin and E. A. Stuart (2015) Moving towards best practice when using inverse probability of treatment weighting (iptw) using the propensity score to estimate causal treatment effects in observational studies. Statistics in medicine 34 (28), pp. 3661–3679. Cited by: §4.2.
  • [3] P. C. Austin (2008) A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003. Statistics in medicine 27 (12), pp. 2037–2049. Cited by: §5.
  • [4] P. C. Austin (2010) Statistical criteria for selecting the optimal number of untreated subjects matched to each treated subject when using many-to-one matching on the propensity score. American journal of epidemiology 172 (9), pp. 1092–1097. Cited by: §B.2.2.
  • [5] P. C. Austin (2014) The use of propensity score methods with survival or time-to-event outcomes: reporting measures of effect similar to those used in randomized experiments. Statistics in medicine 33 (7), pp. 1242–1258. Cited by: §4.2.
  • [6] S. V. Belitser, E. P. Martens, W. R. Pestman, R. H. Groenwold, A. De Boer, and O. H. Klungel (2011) Measuring balance and model selection in propensity score methods. Pharmacoepidemiology and drug safety 20 (11), pp. 1115–1129. Cited by: §5.
  • [7] U. Beyer, D. Dejardin, M. Meller, K. Rufibach, and H. U. Burger (2020) A multistate model for early decision-making in oncology. Biometrical Journal 62 (3), pp. 550–567. Cited by: §1.
  • [8] B. Birnbaum, N. Nussbaum, K. Seidl-Rathkopf, M. Agrawal, M. Estevez, E. Estola, J. Haimson, L. He, P. Larson, and P. Richardson (2020) Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the ehr for oncology research. arXiv preprint arXiv:2001.09765. Cited by: §4.1.
  • [9] N. Black (1996) Why we need observational studies to evaluate the effectiveness of health care. Bmj 312 (7040), pp. 1215–1218. Cited by: §1.
  • [10] M. Burcu, N. A. Dreyer, J. M. Franklin, M. D. Blum, C. W. Critchlow, E. M. Perfetto, and W. Zhou (2020) Real-world evidence to support regulatory decision-making for medicines: considerations for external control arms. Pharmacoepidemiology and drug safety 29 (10), pp. 1228–1235. Cited by: §1.
  • [11] H. U. Burger, C. Gerlinger, C. Harbron, A. Koch, M. Posch, J. Rochon, and A. Schiel (2021) The use of external controls: to what extent can it currently be recommended?. Pharmaceutical Statistics. Cited by: §1, §1.
  • [12] T. Burzykowski, G. Molenberghs, and M. Buyse (2006) The evaluation of surrogate endpoints. Springer Science & Business Media. Cited by: §5.
  • [13] G. Carrigan, S. Whipple, W. B. Capra, M. D. Taylor, J. S. Brown, M. Lu, B. Arnieri, R. Copping, and K. J. Rothman (2020) Using electronic health records to derive control arms for early phase single-arm lung cancer trials: proof-of-concept in randomized controlled trials. Clinical Pharmacology & Therapeutics 107 (2), pp. 369–377. Cited by: §B.1, §4.1, §4.2, §5.
  • [14] R. Daniel, J. Zhang, and D. Farewell (2021) Making apples from oranges: comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets. Biometrical Journal 63 (3), pp. 528–557. Cited by: §4.2.
  • [15] M. J. Daniels and M. D. Hughes (1997) Meta-analysis for the evaluation of potential surrogate markers. Statistics in medicine 16 (17), pp. 1965–1982. Cited by: §5.
  • [16] A. Diamond and J. S. Sekhon (2013) Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics 95 (3), pp. 932–945. Cited by: §B.2.2.
  • [17] B. A. Feinberg, A. Gajra, M. E. Zettler, T. D. Phillips, E. G. Phillips Jr, and J. K. Kish (2020) Use of real-world evidence to support fda approval of oncology drugs. Value in Health 23 (10), pp. 1358–1365. Cited by: §1.
  • [18] J. M. Franklin, E. Patorno, R. J. Desai, R. J. Glynn, D. Martin, K. Quinto, A. Pawar, L. G. Bessette, H. Lee, E. M. Garry, et al. (2021) Emulating randomized clinical trials with nonrandomized real-world evidence studies: first results from the rct duplicate initiative. Circulation 143 (10), pp. 1002–1013. Cited by: §5.
  • [19] J. Gabry, D. Simpson, A. Vehtari, M. Betancourt, and A. Gelman (2019) Visualization in bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society) 182 (2), pp. 389–402. Cited by: §4.4.
  • [20] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013) Bayesian data analysis. CRC Press. Cited by: §A.2, §2.2.
  • [21] A. Gelman (2006) Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper). Bayesian analysis 1 (3), pp. 515–534. Cited by: §2.2, §3.1, §5.
  • [22] E. Granger, J. C. Sergeant, and M. Lunt (2019) Avoiding pitfalls when combining multiple imputation and propensity scores. Statistics in medicine 38 (26), pp. 5120–5132. Cited by: §4.2.
  • [23] F. E. Harrell Jr, with contributions from Charles Dupont, and many others. (2021) Hmisc: harrell miscellaneous. Note: R package version 4.5-0 External Links: Link Cited by: §4.2.
  • [24] A. J. Hatswell, G. Baio, J. A. Berlin, A. Irs, and N. Freemantle (2016) Regulatory approval of pharmaceuticals without a randomised controlled study: analysis of ema and fda approvals 1999–2014. BMJ open 6 (6), pp. e011666. Cited by: §1.
  • [25] M. A. Hernán and J. M. Robins (2020) Causal inference: what if. Boca Raton: Chapman & Hall/CRC. Cited by: §1.
  • [26] J. P. Jansen (2011) Network meta-analysis of survival data with fractional polynomials. BMC medical research methodology 11 (1), pp. 1–14. Cited by: §5.
  • [27] C. Leyrat, S. R. Seaman, I. R. White, I. Douglas, L. Smeeth, J. Kim, M. Resche-Rigon, J. R. Carpenter, and E. J. Williamson (2019) Propensity score analysis with partially observed covariates: how should multiple imputation be used?. Statistical methods in medical research 28 (1), pp. 3–19. Cited by: §4.2.
  • [28] R. S. Lin, J. Lin, S. Roychoudhury, K. M. Anderson, T. Hu, B. Huang, L. F. Leon, J. J. Liao, R. Liu, X. Luo, et al. (2020) Alternative analysis methods for time to event endpoints under nonproportional hazards: a comparative analysis. Statistics in Biopharmaceutical Research 12 (2), pp. 187–198. Cited by: §5.
  • [29] X. Ma, L. Long, S. Moon, B. J. Adamson, and S. S. Baxi (2020) Comparison of population characteristics in real-world clinical oncology databases in the us: flatiron health, seer, and npcr. medRxiv. Cited by: §4.1.
  • [30] T. Nguyen, G. S. Collins, J. Spence, J. Daurès, P. Devereaux, P. Landais, and Y. Le Manach (2017) Double-adjustment in propensity score matching analysis: choosing a threshold for considering residual imbalance. BMC medical research methodology 17 (1), pp. 1–8. Cited by: §B.2.3.
  • [31] M. J. Ouwens, Z. Philips, and J. P. Jansen (2010) Network meta-analysis of parametric survival curves. Research synthesis methods 1 (3-4), pp. 258–271. Cited by: §5.
  • [32] N. G. Polson and J. G. Scott (2012) On the half-cauchy prior for a global scale parameter. Bayesian Analysis 7 (4), pp. 887–902. Cited by: §2.2.
  • [33] P. R. Rosenbaum and D. B. Rubin (1984) Reducing bias in observational studies using subclassification on the propensity score. Journal of the American statistical Association 79 (387), pp. 516–524. Cited by: §5.
  • [34] P. R. Rosenbaum and D. B. Rubin (1985) Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39 (1), pp. 33–38. Cited by: §B.2.2.
  • [35] D. B. Rubin (1973) The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pp. 185–203. Cited by: §B.2.3.
  • [36] D. B. Rubin (2001) Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology 2 (3), pp. 169–188. Cited by: §5.
  • [37] D. B. Rubin (2008) For objective causal inference, design trumps analysis. The Annals of Applied Statistics 2 (3), pp. 808–840. Cited by: §5.
  • [38] V. Sambucini (2015) Comparison of single-arm vs. randomized phase ii clinical trials: a bayesian approach. Journal of biopharmaceutical statistics 25 (3), pp. 474–489. Cited by: §1.
  • [39] M. Schemper, S. Wakounig, and G. Heinze (2009) The estimation of average hazard ratios by weighted cox regression. Statistics in medicine 28 (19), pp. 2473–2489. Cited by: §5.
  • [40] H. Schmidli, D. A. Häring, M. Thomas, A. Cassidy, S. Weber, and F. Bretz (2020) Beyond randomized clinical trials: use of external controls. Clinical Pharmacology & Therapeutics 107 (4), pp. 806–816. Cited by: §1.
  • [41] J. S. Sekhon (2008) Multivariate and propensity score matching software with automated balance optimization: the matching package for r. Journal of Statistical Software, Forthcoming. Cited by: §B.2.2.
  • [42] R. Simon, G. Blumenthal, M. Rothenberg, J. Sommer, S. A. Roberts, D. K. Armstrong, L. LaVange, and R. Pazdur (2015) The role of nonrandomized trials in the evaluation of oncology drugs. Clinical Pharmacology & Therapeutics 97 (5), pp. 502–507. Cited by: §1.
  • [43] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles (2004) Bayesian approaches to clinical trials and health-care evaluation. Vol. 13, John Wiley & Sons. Cited by: §2.2.
  • [44] M. J. Stensrud and M. A. Hernán (2020) Why test for proportional hazards?. Jama 323 (14), pp. 1401–1402. Cited by: §5.
  • [45] E. A. Stuart (2010) Matching methods for causal inference: a review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics 25 (1), pp. 1. Cited by: §B.2.2, §5.
  • [46] K. Thorlund, L. Dron, J. J. Park, and E. J. Mills (2020) Synthetic and external controls in clinical trials–a primer for researchers. Clinical Epidemiology 12, pp. 457. Cited by: §1.
  • [47] J. Wu, C. Wang, S. Toh, F. E. Pisa, and L. Bauer (2020) Use of real-world evidence in regulatory decisions for rare diseases in the united states—current status and future directions. Pharmacoepidemiology and drug safety 29 (10), pp. 1213–1218. Cited by: §1.

Data availability statement

Data from Flatiron Health, Inc were used to construct the external control cohorts. These de-identified data may be made available upon request and are subject to a license agreement with Flatiron Health; interested researchers should contact Flatiron Health to determine licensing terms. For the internal control cohorts for clinical trials 1, 2, 3, 12 and 13 qualified researchers may request access to individual patient level clinical data for each separate study through a data request platform. At the time of writing this request platform is Vivli (https://vivli.org/ourmember/roche/). For the remaining clinical trials qualified researchers may submit an enquiry to Vivli to request access to individual patient level clinical data. For up to date details on Roche’s Global Policy on the Sharing of Clinical Information and how to request access to related clinical study documents, see here: https://go.roche.com/data_sharing.

The code used to produce the results reported in the paper is available at https://github.com/phcanalytics/ecmeta-manuscript. Documentation and installation instructions for the R package ecmeta can be found at https://github.com/phcanalytics/ecmeta.

Appendix A Maximum likelihood approach

a.1 Estimation

and can be estimated by optimization of the log-likelihood function,


a.2 Prediction for a new study

Following Gelman et al. (see p. 66) [20], we note that the posterior predictive distribution for a new observation follows a t distribution with location , scale , and degrees of freedom where is the mean and is the variance from the observations. In our problem it follows that follows a t distribution with mean and scale .

A simulation procedure for obtaining draws of then proceeds as follows:

  1. Draw

  2. Draw where is a student t distribution with degrees of freedom, location , and scale

  3. Use the draws from 1 and 2 to compute

Appendix B Propensity score methodology

This section expands on the propensity score methodology from Section 4.2 and describes the sensitivity analyses. All results and code are available online in the GitHub repository for the paper.

b.1 Coding of covariates

Although we aimed to code categorical covariates in a consistent way across the reference studies, this was not always possible. Race was typically coded as White versus Other since the number of observations within non-White race categories was usually very small. Two exceptions were NCT02367781 and NCT01351415. We used three categories (White, Black, Other) for the former since there were a significant number of Black patients in the RCT; for the latter, race was coded as Asian or Non-Asian since it was coded in that way in the trial. Similarly, histology was only included if it was not part of the inclusion and exclusion criteria for a trial and covariates were excluded if they were not collected for a particular trial.

We explored both linear terms and restricted cubic splines with 4 knots placed at equally spaced quantiles of the support for the continuous variables (age and time since initial diagnosis). In addition, an interaction between cancer stage and time since diagnosis was considered based on the models employed in Carrigan et al.

[13] A specification was chosen after evaluation of the balance of the covariates subsequent to population adjustment with the propensity score methods. We ultimately chose the model with linear terms and no interactions since it tended to result in the best balance and was less likely to generate extreme weights.

b.2 Sensitivity analyses

b.2.1 Trimming

Recall that the primary analysis used IPTW-ATT weights and removed external control patients with propensity scores below the 1st percentile and above the 99th percentile. A sensitivity analysis was also performed in which no patients were removed from the sample.

b.2.2 Propensity score matching

Propensity score matching was used as a sensitivity analysis. Two approaches were utilized. First, we matched patients on the linear propensity score with greedy 1:1 nearest neighbor matching. For a given treated subject, 1:1 nearest neighbor matching selects the control subject whose value of the linear propensity score is closest; that is, for a treated subject , the control subject with the minimum value of was chosen where is the propensity score for subject . In cases where multiple control subjects were equally close to a treated subject, a treated subject was chosen at random. We believe 1:1 matching is reasonable given simulation evidence showing that mean square error is typically minimized when matching either 1 or 2 controls to each treated subject.[4] Greedy matching meant that control subjects were chosen one at a time and the order in which they were chosen mattered. Matching was performed without replacement (except in cases where the number of control subjects was less than the number of treated subjects) to ensure that the matched controls were independent, although it is worth noting that there is some evidence that matching with replacement can reduce bias.[1, 45].

Our second approach used genetic matching, which can model the propensity score in a more flexible way.[41, 16] In particular, one challenge of nearest neighbor matching—and propensity core methods more generally—is that it is difficult to specify a model that achieves satisfactory covariate balance. Genetic matching can help since the weight given to each covariate is based on an evolutionary search algorithm that iteratively checks and improves balance. We implemented this algorithm using both the linear propensity score and all the covariates from the propensity score model, so that matching on the propensity score and the Mahalanobis distance were limiting cases.

Both matching algorithms were performed with and without a caliper. In the case of the former, a caliper on the linear propensity score of 0.25 standard deviation was used.[34].

b.2.3 Double adjustment

To facilitate simple estimation of marginal HRs, the Cox models in the primary analysis did not adjust for covariates. A sensitivity analysis was performed in which regression adjustment was used to estimate hazard ratios (i.e., by including all covariates used in the propensity score model in the Cox model). [35, 30].

Appendix C Supplementary figures

Figure C.1: Distribution of the between study variability parameter, , by analysis method. The half-t and uniform priors are on and the inverse gamma prior is on . The half-Cauchy prior has location 0 and scale 25; the inverse gamma prior has shape 0.001 and rate 0.001; and the uniform prior has a lower limit of 0 and upper limit of 100.
Figure C.2: Coverage of 95% confidence intervals by analysis method, scenario, and the number of reference studies.
Figure C.3: Power by analysis method, scenario, and the number of reference studies.