1 Introduction
Randomized controlled trials (RCTs) are considered the gold standard for estimation of treatment effects in medical research. Randomization ensures that no variables—measured or unmeasured—are systematically related to treatment assignment and that, consequently, the resulting treatment effect estimates are unbiased. However, there are potential barriers that limit the use of RCTs in certain settings: for example, a randomized control may be deemed unethical for disease areas with high unmet need or infeasible for targeted therapies with low expected enrollment.[9, 42]. Thus, despite the scientific advantages of RCTs, singlearm trials are common during early drug development [38] and have even been used for regulatory approval.[24].
At the same time, interest in using realworld data (RWD) for drug development is increasing, as evidenced by the 21st Century Cures Act, which requires the U.S. Food and Drug Administration to evaluate the use of realworld evidence to support approval of new indications or to support postapproval study requirements. One such application of RWD is the construction of external control arms for singlearm trials so that efficacy can be assessed without a randomized control.[10, 40, 11] These estimates may not only be useful for regulators, but can guide internal decision making on drug development in biopharmaceutical companies as well.[7]
Unfortunately, it is well known that there are many potential sources of bias in nonrandomized studies.[25]
In external control studies, potential biases are concerning because they may result in inflated type I error, which may be of particular importance for regulatory applications. When external controls are constructed using RWD, bias may be caused by systematic differences between trial and RWD patient populations that are difficult to control for. For instance, trial patients may tend to have better outcomes since they are more likely to be treated in specialized academic centers and given higher levels of attention. Furthermore, endpoints may be measured differently (e.g. progression endpoints in oncology) or assessed with different levels of quality in routine clinical practice than in clinical trials. Finally, an important consideration is the extent to which outcomes for a given disease area can be explained by the observable data. If there are unobserved and highly prognostic confounders that vary across studies, then between study variability will be high and estimates from an external control study will be less certain.
[11]While external control analyses are relevant to a range of disease areas with different primary endpoints, we consider timetoevent outcomes here, although the proposed methodology could be readily adapted to other classes of endpoints. This is motivated, in part, by the oncology setting where overall survival (OS) is typically the primary endpoint in phase III studies. OS is attractive for external control studies because it is measured in the same way in trials and RWD, ensuring that differences between trial and external control patients are not due to measurement differences. A focus on oncology may also be warranted given that much of the external control work to date has focused on diseases with high unmet need such as oncology and rare diseases.[17, 47] Indeed, as precision medicine has become more common, the number of eligible patients in oncology trials has decreased and some oncology indications have paradoxically become “rare diseases”.[46]
It is therefore of interest to develop methods for analysis of timetoevent outcomes that can adjust for the bias and the additional variability caused by external controls. In a prospective singlearm study, causal inference methods for observational data such as propensity score techniques are used to estimate the hazard ratio (HR) of the experimental treatment arm relative to the external control. In this paper, we propose a metaanalytic approach that leverages a set of historical reference studies to adjust HR estimates from such a study. Each reference study includes the control arm of a clinical trial (i.e., the internal control) and an external control that is constructed based on the inclusion and exclusion criteria of the trial. To further increase comparability, propensity score models can be developed for each reference study to match the external control to the randomised treatment arm. A metaanalysis of the reference studies on the log HR scale is then performed to assess the compatibility of the internal and propensity scorematched external control arms. The treatment effects from the prospective singlearm study, are, in turn, adjusted based on the bias and betweenstudy variability estimated from the metaanalysis. Importantly, the analysis for the singlearm study is conducted independently of the metaanalysis so that adjustment requires no modifications to the standard causal inference methods used in analyses with external controls.
We use a simulation study to assess the operating characteristics of our methodology under different scenarios and then apply our method to a hypothetical singlearm study for treatment of advanced nonsmall cell lung cancer (aNSCLC), where we were able to use a set of 14 reference studies in the metaanalysis. Using a prespecified analysis plan that utilized propensity score methods, we estimated a mean bias (internal control vs. external control) on the log HR scale across the studies of , or a HR of approximately ; that is, our analysis suggests that external control patients had shorter survival than trial controls, which would bias results in favor of the experimental treatment. The between study variability (on the log HR scale) was as the HR varied from a low of to a high of across studies. An adjustment for the mean bias and between study variability would consequently widen uncertainty intervals and increase point estimates of the HR (i.e., decrease the treatment benefit).
The remainder of this paper is organized as follows. Section 2 introduces the metaanalytic framework. Section 3 presents results from the simulation study and Section 4 reports results from the aNSCLC example. Section 5 puts our methodology in context, discusses challenges and key considerations for implementation, and notes other potential applications. Finally, Section 6 concludes.
2 Methodology
2.1 Model formulation
We consider a set of studies, reference studies and a new study (i.e., a prospective singlearm trial), each of which may be considered to have 3 arms:

: the randomized experimental treatment arm

: the randomized internal control arm

: the external control arm
Figure (a) represents the data generating process for survival outcomes in terms of a directed acyclic graph. We let be the “true” values of the parameters that generate survival outcomes (e.g., might represent the rate parameter in an exponential survival model or the shape and scale parameters in a Weibull model). The estimates of the parameters in the study datasets are denoted by . The light colored nodes are not observable while the moderately dark colored nodes are observable. Our aim is to estimate treatment effects by comparing survival outcomes generated by the darkest nodes, to , in the new study. The challenge is of course that we have no randomized control so is not observable in the new study and a standard approach that compares to
is likely to result in bias because the external control arm is not randomly assigned. As such, we model the degree of similarity between the outcomes in the internal and external control arms using a normal distribution with parameters for the mean,
, and variance,
.It is of note that neither nor from the references studies impact survival outcomes in the new study. We can consequently disregard the outcomes for the treatment arms from the reference studies and reformulate our problem in terms of HRs as shown in Figure (b), where is the true log HR and is an estimate of the log HR in the study sample. In other words, we can focus solely on comparisons between the internal and external controls in the reference studies.
In what follows, we describe a metaanalytic model for estimation of log HRs for comparisons between the internal and external controls, , and an approach for generating the posterior distribution of our quantity of interest, . While our primary analysis uses Bayesian techniques for parameter estimation, quantification of uncertainty, and prediction, Appendix A details an alternative simulation based approach based on maximum likelihood estimation. We discuss the pros and cons of each as well as considerations for priors in the Bayesian model in Section 5. The methodology can be implemented using the publicly available R package ecmeta.
2.2 Metaanalytic model
A comparison between the internal and external controls in terms of the log HR can be modeled as,
(1) 
where indexes a study and is the sampling variance. We assume that the true log HR is distributed with mean and variance ,
(2) 
so that is a measure of the average bias in the external control arms across studies and is a measure of between study variability. It follows that the estimated log HR in study is distributed as,
(3) 
When a Bayesian approach is used for estimation, priors are placed on and . We use a normal distribution for with mean and variance large enough so that the prior is uninformative. The prior for is arguably a more important consideration since it usually has a larger influence on the posterior distribution, particularly when the number of reference studies is small. In our software implementation, we allow for three uninformative priors that have been suggested in the literature including uniform [20] and halft [21, 32] distributions for and a distribution [43] for .
2.3 Prediction for a new study
In a RCT, the treatment effect is estimated in an unbiased manner by comparing the experimental treatment arm to the internal control arm. Since there is no internal control arm in a singlearm trial, we adjust the HR from the singlearm study to obtain a HR that we would have obtained had an internal control existed. Such an adjustment is made by leveraging the following relationship between log HRs,
(4) 
Note that this relationship will hold exactly for the true log HRs under the assumption of proportional hazards. For estimated log HRs (i.e., ), the relationship is only an approximation and will have a small error attached to it.
A prediction for in a new study then proceeds in three steps. The first step draws posterior samples of after using standard causal inference methods for observational data (such as those from a propensity score adjusted Cox regression) to estimate the log HR, , and its variance, . Conveniently, no modifications to standard analyses with external controls are required. A model for the sampling distribution for (using as data and treating as fixed) is given by,
(5) 
where a normal prior is used for with mean and large variance. The second step samples using posterior draws of and from the metaanalytic model,
(6) 
Finally, in the third step, the posterior draws of and are combined to obtain a posterior sample of using Equation 4.
3 Simulations
3.1 Setup and scenarios
Six scenarios were simulated as described in Table 1. For each scenario, parameters for 10,000 studies were generated as a function of median survival time and the number of events in each of the three arms. The log of median survival and the log of the number of events across studies were modeled using normal distributions centered at the medians displayed in the table and with variability determined by the coefficient of variation; that is, median survival times and the number of events were drawn from lognormal distributions with location parameters equal to the medians and the scale parameters equal to the coefficients of variation.
Median survival time  Number of events  
[Median (CV)]  [Median (CV)]  
Description  Treatment  Internal  External  Treatment  Internal  External  
control  control  control  control  
S1  There is no between study heterogeneity (i.e., each study has the same median survival) and the number of events does not vary between studies.  24 (0%)  15 (0%)  12 (0%)  100 (0%)  70 (0%)  50 (0%) 
S2  There is no between study variability and the number of events in each arm varies between studies.  24 (0%)  24 (0%)  18 (0%)  250 (20%)  250 (20%)  250 (20%) 
S3  There is independent betweenstudy variability in all three arms and variability in the number of events in each arm between studies.  24 (40%)  24 (20%)  18 (20%)  250 (20%)  250 (20%)  250 (20%) 
S4  A nullhypothesis scenario where the true randomized hazard ratio is fixed at 1 for all studies and the external control is biased with the magnitude of bias varying between studies. The number of events in the randomized study and external control arm vary between studies and the two randomized arms are of equal size (i.e., a 1:1 randomization ratio). 
24 (20%)  HR=1  18 (20%)  150 (20%)*  250 (20%)  
S5  An alternativehypothesis scenario where the true randomized hazard ratio is fixed at 0.5 for all studies and the external control is biased with the magnitude of bias varying between studies. The number of events in the randomized study and external control arm vary between studies and the two randomized arms are of equal size (i.e., a 1:1 randomization ratio).  48/24 (20%)  HR=0.5  18 (20%)  150 (20%)*  250 (20%)  
S6  An alternativehypothesis scenario where the randomized hazard ratio varies between studies and the external control is biased with the magnitude of bias varying between studies (this may simulate scenarios where the reference set of studies are drawn from studies examining a class of multiple different compounds). The number of events in the randomized study and external control arm vary between studies and the two randomized arms are of equal size (i.e., a 1:1 randomization ratio).  35 (40%)  24 (20%)  18 (20%)  250 (20%)*  250 (20%) 
CV = coefficient of variation. Median survival times and the number of events were drawn from lognormal distributions with location parameters equal to the medians in the table and scale parameters equal to the coefficient of variation.
*The sizes of the two RCT arms were fixed to be equal for each study assuming a 1:1 randomization ratio.
The true hazard ratio between the two RCT arms was fixed.
Parameters were always simulated independently for the external control arms. Parameters for each arm of the RCT were simulated in different ways depending on the scenario. In some scenarios the parameters of the two randomized arms were simulated independently of each other (S1, S2, S3) and in other scenarios they were set according to a fixed HR or a fixed 1:1 randomization ratio (S4, S5, S6).
Patient level data was simulated separately for each arm and study using exponential distributions where the rate parameter was set according to the median survival times. A Cox model was then used to estimate three log HRs—
, , and —for each study. Note that we assumed, for simplicity, that there was no censoring, meaning that the sample size was equal to the number of events. Given that variability in the log HRs is driven by the number of events, this assumption should not be influential.To evaluate the metaanalytic methodology and assess the impact of varying the number of references studies, the 10,000 simulated studies were split into replications where was the number of reference studies, which we varied from 4 to 9. In each replication, studies were used to estimate the metaanalytic model and the remaining study was used as the new singlearm study for which a prediction of was made.
In Section 3.2
, we report estimates from a Bayesian model and consequently specified priors for the hyperparameters. We used a diffuse
prior for and following the advice of Gelman [21], a halfCauchy (a halft with 1 degree of freedom) prior for
with location and a large scale parameter (). In our discussion (Section 5) we consider alternative approaches including uniform and inverse gamma priors as well as estimation via maximum likelihood. Plots summarizing evaluations of these alternative approaches are provided in the Appendix (Appendix C).3.2 Results
Figure 2 plots estimates of bias by scenario and the number of reference studies used to fit the metaanalytic model. The point estimate for the log HR, , was estimated using the median of the posterior distribution and bias was computed for the th replication as where is the true value of the log HR from the simulation. The boxplot characterizes the distribution of bias across the simulation replications. There is, in general, no evidence of bias as the median value of bias across the replications was approximately 0 in each scenario. Similarly, there is no evidence that the number of reference studies has a discernible impact on bias.
The coverage of nominal 95% confidence intervals are plotted in
Figure 3. The results are conservative with some overcoverage since the 95% credible intervals (CrIs) from the model contain the true value of the log HR,
, more often than the nominal 95%. The extent of overcoverage is decreasing in the number of reference studies, which suggests that a larger set of reference studies can improve the precision of the estimates.One consequence of our methodology is that uncertainty intervals will increase after adjustment. We assessed the implications of this on the power of the Bayesian model in Figure 4. Power and type I error were computed from a onesided test where a treatment effect was deemed significant if the upper limit of the 95% CrI was less than zero. In nullhypothesis scenarios—S2 and S4—where the true log HR was 1, the type I error was below (represented by the dotted line), suggesting that the methodology was able to prevent type I error inflation. In the remaining scenarios, power was generally increasing in the number of reference studies. Increasing the number of reference studies therefore seems beneficial since it reduces uncertainty (Figure 3) and increases power.
4 Example: Advanced nonsmall cell lung cancer
4.1 Reference studies
We constructed a set of 14 reference studies using phase II and phase III RCTs for treatment of patients with aNSCLC. To ensure that patientlevel data was available, we restricted trials to those conducted by Roche (which we had access to). The set of trials is the same as in a study by Carrigan et al.[13], but includes three additional trials: NCT02367781, NCT02367794, and NCT02657434. In some cases (NCT01496742 and NCT01493843), multiple treatment arms (with corresponding control arms) were available for a single trial. We treated those as separate studies and refer to each unique treatment arm and control arm pair as a unique reference study in the remainder of this article. The full list of reference studies is shown in Table 2.
A unique external control cohort was constructed for each of the internal control arms from RWD. To increase comparability between the internal and external controls, the inclusion and exclusion criteria from the trials were applied to the external control cohorts. Following Carrigan et al., we used the Flatiron health database—which provides retrospective, longitudinal deidentified clinical information derived from electronic health records—as our data source. The data originated from academic and community cancer clinics across the United States and contains both structured and unstructured (via technologyenabled abstraction) elements.[8, 29] The data are deidentified and subject to obligations to prevent reidentification and protect patient confidentiality. Institutional review board approval of the study was obtained prior to study conduct, and included a waiver of informed consent.
Comparison  Sample size  
Clinical trial  Treatment  Control  Treatment  Internal  External  
control  control  
1  NCT02008227  Atezolizumab  Docetaxel  425  425  526 
2  NCT01903993  Atezolizumab  Docetaxel  144  143  476 
3  NCT02366143  Atezolizumab + Bevacizumab + Carboplatin  Bevacizumab + Carboplatin  356  336  602 
4  NCT01351415  Bevacizumab + SOC  SOC  245  240  381 
5  NCT01519804  MetMAb + Platinum + Paclitaxel  Placebo + Platinum + Paclitaxel  55  54  1945 
6  NCT01496742  MetMAb + Bevacizumab + Platinum + Paclitaxel  Placebo + Bevacizumab + Platinum  69  70  956 
7  NCT01496742  MetMAb + Platinum + Pemetrexed  Placebo + Platinum + Pemetrexed  59  61  3244 
8  NCT01366131  MEGF0444A + Bevacizumab + Carboplatin + Paclitaxel  Placebo + Bevacizumab + Carboplatin + Paclitaxel  52  52  963 
9  NCT01493843  Pictilisib (340 mg) + Carboplatin + Paclitaxel  Placebo (340 mg) + Carboplatin + Paclitaxel  126  125  1274 
10  NCT01493843  Pictilisib (340 mg) + Carboplatin + Paclitaxel + Bevacizumab  Placebo (340 mg) + Carboplatin + Paclitaxel + Bevacizumab  79  79  953 
11  NCT01493843  Pictilisib (260 mg) + Carboplatin + Paclitaxel + Bevacizumab  Placebo (260 mg) + Carboplatin + Paclitaxel + Bevacizumab  62  30  953 
12  NCT02367781  Atezolizumab + NabPaclitaxel + Carboplatin  NabPaclitaxel + Carboplatin  451  228  87 
13  NCT02367794  Atezolizumab + NabPaclitaxel + Carboplatin  NabPaclitaxel + Carboplatin  343  340  497 
14  NCT02657434  Atezolizumab + Carboplatin or Cisplatin + Pemetrexed  Carboplatin or Cisplatin + Pemetrexed  292  286  1536 
Note: All external control datasetets were built using the Flatiron Health database. Sample sizes in the external control datasets were based on the number of patients remaining in the data after applying the trial inclusion and exclusion criteria.
4.2 Propensity score methodology
We used propensity score methods to estimate log HRs (internal versus external control),
, and standard errors,
, in each study, . We considered the estimand of interest to be the average treatment effect on the treated (ATT), which is the log HR among the trial participants. The timetoevent outcome was overall survival (OS).To mitigate the risk of underestimating bias, we prespecified our analysis. Our primary analysis used inverse probability of treatment weights (IPTW) permitting estimation of the ATT (i.e., IPTWATT weights) so that internal controls received a weight of 1 and external controls received a weight of
where is the propensity score.[2] To reduce the influence of extreme weights, external control patients with propensity scores below the 1st percentile and above the 99th percentile were removed.Propensity scores were modeled using logistic regression with the same covariates used by Carrigan et al.
[13]—age, sex, race (White, Black, other), histology (nonsquamous, squamous), smoking status (current/former, never), disease stage at initial diagnosis (Advanced  IIIB/IV, Early  IIIA or below), and time from initial diagnosis until either the start of treatment (RWD) or randomization (trial data). Nonlinear functions of the continuous covariates—age and time from initial diagnosis—were considered, but did not improve covariate balance so simpler linear terms were employed.HRs were estimated using IPTWATT weighted Cox proportional hazards models. The models were fit without covariate adjustment (i.e., they only included a single covariate for treatment assignment) to facilitate estimation of a marginal HR,[14] which is typically the estimand of interest from a RCT. Confidence intervals were computed using a robust sandwichtype variance estimator.[5].
Missing covariates were imputed using multiple imputation with the
aregImpute function from the R package Hmisc.[23] The data source (trial or RWD) was interacted with the covariates so that there was effectively a separate imputation model for the trial and external control patients. Given recommendations from recent simulation studies, we estimated log HRs separately in each imputed dataset and then estimated pooled log HRs and confidence intervals using Rubin’s rules.[27, 22]Additional details and sensitivity analyses are described in Appendix B.
4.3 Estimation of metaanalytic model
The IPTWATT weighted estimates of the log HRs and 95% confidence intervals are shown in Figure 5. The HRs tend to be below 1, implying that survival is, on average, shorter among Flatiron patients than the internal controls. The implication is that a standard analysis that compares an experimental treatment arm to an external control in aNSCLC will tend to be biased in favor of the treatment.
A formal assessment of bias and between study variability was performed using the metaanalysis model described in Section 2.2. We estimated the model using the same Bayesian approach employed in the simulations described in Section 3. In particular, we used a diffuse prior for and a halfCauchy prior for with location and scale parameter .
The estimates were consistent with Figure 5: the posterior median for the bias, , was with a 95% CrI ranging from a lower limit (2.5th percentile) of to an upper limit (97.5th percentile) of , again implying that the external controls tended to have shorter survival. The estimates for (posterior median: ; 95% CrI: to
) suggest that there is significant between study variability, which is not unexpected for a scale hyperparameter in a hierarchical model with a relatively small number of observations. A nonnegligible fraction of the between study variability was caused by reference study 5, which appears to be a significant outlier: after removal,
decreased considerably (posterior median: ; 95% CrI: to ) and declined slightly (posterior median: ; 95% CrI: to ).4.4 Hypothetical analysis for a new study
To illustrate our approach, we considered a hypothetical singlearm trial with an external control arm in which the estimated HR, , was 0.70. To generate reasonable estimates of the standard error, we used the same propensity score methods described in Section 4.2 to compare the experimental treatment arms to the external controls and estimate log hazard ratios for each reference study . We then set the standard error of the log hazard ratio equal to the median of the standard errors across the reference studies, .
To adjust the log HRs, we used a metaanalysis of the estimates from the previous section to generate the posterior predictive distribution for
as detailed in Section 2.3. Figure 6 demonstrates the adjustment process.[19] The top figure is the posterior distribution of the log HR from the treatment and external control comparison, , which is centered around with dispersion determined by the standard error. The bottom figure is the posterior distribution of the log HR from a comparison between the treatment and a hypothetical internal control (i.e., the log HR after “adjustment”), . The estimated bias, , shifts the posterior distribution rightward so that it is centered (posterior median) at in the bottom figure (i.e., the treatment benefit from the naive estimate in the top figure was overly optimistic). Furthermore, consideration of the between study variability, , results in a posterior distribution that is more spread out, meaning that uncertainty intervals for are wider than those for . In this illustration the adjustment was important and could have changed a decision: the log HR was nominally significant at the 5% level in the naive analysis but was not significant after adjustment.4.5 Model checking
To check the model, we split the reference studies into training and test sets. The model was “trained" by performing the metaanalysis on the training reference studies and “tested” by predicting a treatment effect for the test study, assuming it was a singlearm trial without an internal control. Leaveoneout crossvalidation was used to generalize performance, whereby reference studies were partitioned into training and test sets 14 times (with 13 of the 14 reference studies used for training and the remaining reference study used for testing), resulting in 14 separate predictions. For sake of comparison, we made both adjusted and unadjusted predictions. The adjusted predictions, denoted as , were given by the posterior median of and the unadusted predictions were the estimated log HRs from comparisons between the treatment and external control, .
To assess the quality of the predictions, we computed residuals, , where was the actual log HR estimated from the trial,
. Residuals were then standardized by dividing by the standard deviation of
. A QQplot of the standardized residuals against a theoretical normal distribution is displayed in Figure 7.As expected, the majority of residuals from the unadjusted predictions (left panel) are above the 45 degree line; in other words, the distribution of residuals was centered to the right of and predictions were therefore biased in favor of the treatment arms. In contrast, predictions in the adjusted case (right panel) are close to the 45 degree line and are residuals are roughly centered around 0, suggesting that the methodology is able to reasonably correct for bias. Reference study 5 is again an outlier as it lies below the 45 degree line in both cases since predicted log HRs were far too high for this study.
5 Discussion
We present a new metaanalytic approach that leverages historical studies to adjust treatment effects in singlearm studies augmented by external controls for the additional bias and variability caused by their nonrandomized design. Our example analysis in aNSCLC suggests that this additional bias and variability is important to adjust for and that propensity score methods alone may be unable to completely eliminate confounding. Across 14 studies, the posterior median of the HR for a comparison between the internal and external controls (after inverse probability weighting) was ; that is, survival times were, on average, shorter among external control patients, which would result in a bias in favor of the experimental treatment. Similarly, the between study variability (on the log HR scale) was significant (posterior median: ), suggesting that log HRs vary from study to study beyond what would be expected from sampling error alone.
Our approach was inspired by the surrogate marker literature that uses metaanalysis to estimate the association between treatment effects for a surrogate marker and treatment effects for a primary clinical outcome.[15, 12]. In this context, the treatment effect from a singlearm study, , is akin to a treatment effect for a surrogate marker and the treatment effect from an RCT, is analogous to the treatment effect for the primary clinical outcome. An important difference that we leverage in this paper though (Equation 4) is that there is a mechanical relationship between and determined by . Namely, and both contain the same “treated” subjects, which is not the case when assessing the suitability of a surrogate marker.
While our approach is appealing because it can adjust for unmeasured confounding, there are a number of challenges and questions that must be considered in practice. For one, it is not obvious how the reference studies should be selected or how similar they must be to the singlearm trial of interest. There are a large number of selection criteria that could be considered including the phase of the trial, characteristics of the trial population, drug indication or class, and line of therapy, among others. In some cases a singlearm study may be similar to candidate reference studies on some criteria but not others and analysts will need to determine whether a reference study can be reasonably generalized to the singlearm study. Further study spanning multiple indications and drug classes may be valuable to assess whether the relationships between internal and external controls are generalizable—potentially allowing for extrapolation—or whether they vary across setting and scenariospecific relationships need to be established.
A potential challenge is that there may not be a large set of suitable reference studies, especially for rare diseases or especially novel treatments. Our simulations provide some guidance on the required number of studies and suggest that reasonable estimates can be obtained even when the number of reference studies is smaller. For instance, in cases where there were as few as 4 reference studies, estimates were unbiased and coverage probabilities were too high, implying that conclusions would be conservative. Put differently, our approach helps control type I error even when the number of reference studies is small because it adjusts for bias and results in estimates of the standard error that are conservative. Of course, increasing the number of reference studies is beneficial since it reduces the uncertainty of the estimates and increases power.
A methodological consideration is the prior chosen for the parameter , representing between study variability. Indeed, when the sample size is small, the scale hyperparameter in a hierarchical model can be sensitive to its prior. In our aNSCLC example, we used a halfCauchy prior as recommended by Gelman.[21] To assess the sensitivity of this choice, we also considered a gamma prior (scale = 0.001, rate = 0.001) for and a uniform prior for with a lower limit of 0 and a large upper limit of 100. As shown in Figure C.1, the posterior distributions for
are very similar when using a uniform or halfCauchy distribution; however, as noted by Gelman, the gamma prior on the precision results in a posterior distribution that is peaked closer to 0 because the marginal likelihood is higher near 0. A similar pattern was observed in our simulations. For instance, coverages of nominal 95% confidence intervals were very similar when using halfCauchy or uniform priors, whereas coverage probabilities were lower when using a gamma prior (
Figure C.2). Likewise, Figure C.3 shows that power was similar when using a uniform or halfCauchy prior, but slightly higher with a gamma prior.A related choice is whether to use a Bayesian or frequentist approach to estimate the metaanalytic model. In some cases, Bayesian methods may be required, particularly if the number of reference studies is too small and an informative prior is needed to estimate . Furthermore, our simulation results (Figure C.2) suggest that our frequentist approach is too optimistic and results in undercoverage; conversely, the Bayesian approach—especially with a halfCauchy or uniform prior on —is conservative and results in overcoverage. One potential advantage of the frequentist approach is that it does not take as long to fit, which can be useful in simulation studies; however, in practice, we have found that the Bayesian model converges quickly and that sampling is fast. Taken together, we would generally recommend using a Bayesian approach.
One assumption that is worth revisiting is the relationship between HRs described in Equation 4, which will only hold exactly under proportional hazards. We believe this is a reasonable assumption in most clinical trials in oncology with the HR as the primary endpoint and may not be a large concern in practice. The aNSCLC example is a case in point as our methodology seemed to work well. Moreover, in cases where hazards are clearly not proportional, it is unclear whether the HR from a typical unweighted Cox regression model is the best way to measure treatment effects or whether it is even a meaningful quantity.[39, 28, 44] In such cases, it would likely be necessary to extend the metaanalysis to consider differences in survival curves between the internal and external controls, rather than just differences in the log HRs. An alternative would therefore be to consider a multidimensional treatment effect, such as the parameters of a parametric [31] or flexible parametric [26] survival distribution.
While the metaanalysis is straightforward to perform, estimation of the log HRs, , requires a considerable amount of effort, since a separate causal inference analysis must be performed for each reference study. Our approach will consequently make it more burdensome to create a prespecified analysis plan for a singlearm study. In our aNSCLC example, we needed to specify the imputation model for missing data, covariates to control for, functional form of the propensity score model, and propensity score methods (e.g., weighting, matching, etc). We opted to apply a relatively uniform methodology across the 14 studies, but performing a separate analysis for each reference study increases the complexity of the analysis and programming.
A more general question relevant to both the new singlearm study and the metaanalysis is whether propensity score methods should be fully prespecified or if some discretion should be left to the analyst. One advantage of propensity score methods is that the design of the study can be separated from the analysis so that the propensity score method can be refined without knowledge of the outcomes.[36, 37, 45] It may, for instance, be reasonable to tweak the specification of the propensity score model after assessing its impact on the balance and overlap of baseline covariates.[33, 3, 6]. We chose to do this in our analysis and found it useful as certain specifications such as inclusion of interaction terms resulted in worse balance and more extreme weights. However for use in a regulatory setting prespecification may be required or at a minimum demonstration that any changes were made without any knowledge of the outcomes. Otherwise, there is considerable risk that estimates of bias will be underestimated.
When creating the set of reference studies, an additional question is whether specification of the propensity score model used to ultimately compare the internal and external controls should be refined using the combined trial data, the experimental arm alone, or the internal control arm alone. An argument in favor of using the internal control arm alone is that the metaanalysis does not require the experimental arm. Yet, we decided to assess the propensity score model based on an analysis that only included the experimental trial arm, since the internal control arm would not be available in a singlearm trial. That said, in practice, these arms should have similar characteristics because they are randomized and the decision should have little impact on the conclusions.
Our metaanalytic approach may also be useful outside of the prospective singlearm setting. Due to the increasing interest in the use of RWD for drug development, there is both a need and an interest in assessing the extent to which RWD can emulate arms from completed clinical trials. Current emulation efforts include comparativeeffectiveness studies [18] in which RWD is used to construct both the treatment and control arms and external control studies with a trial experimental arm and RWD control arm [13]. However, it is unclear how to determine whether an emulation was successful. In the external control setting, the metaanalysis described in this paper comparing the internal and external control arms provides one such validation tool. The advantage of a formal metaanalysis is that it estimates both bias and between study variability as well the uncertainty of those estimates, while accounting for the sampling error in each of the validation studies. A similar approach based on the surrogate marker literature could be used for comparative effectiveness studies, with the RCT treatment effect serving as the primary endpoint and the RWD (emulated study) treatment effect as the surrogate marker.
6 Conclusion
External control arms can be used to estimate treatment efficacy in singlearm trials, but bias and potential unlimited inflation of type 1 error is a serious concern because the external control arm is not randomized. To address this problem, we have proposed a metaanalytic framework that uses a set of historical reference studies to adjust estimates in a new singlearm study for the additional bias and variability caused by the lack of randomization. The approach is straightforward to apply since existing causal inference methods typically used in observational studies do not need to be modified in any way and predictions in a new study can be implemented using our software. By minimizing the risk of false positives, we believe that our approach can help increase acceptance of external control studies in both internal—such as “go/nogo” decisions on whether to move a molecule forward for further study—and regulatory decision making.
References
 [1] (2006) Large sample properties of matching estimators for average treatment effects. econometrica 74 (1), pp. 235–267. Cited by: §B.2.2.
 [2] (2015) Moving towards best practice when using inverse probability of treatment weighting (iptw) using the propensity score to estimate causal treatment effects in observational studies. Statistics in medicine 34 (28), pp. 3661–3679. Cited by: §4.2.
 [3] (2008) A critical appraisal of propensityscore matching in the medical literature between 1996 and 2003. Statistics in medicine 27 (12), pp. 2037–2049. Cited by: §5.
 [4] (2010) Statistical criteria for selecting the optimal number of untreated subjects matched to each treated subject when using manytoone matching on the propensity score. American journal of epidemiology 172 (9), pp. 1092–1097. Cited by: §B.2.2.
 [5] (2014) The use of propensity score methods with survival or timetoevent outcomes: reporting measures of effect similar to those used in randomized experiments. Statistics in medicine 33 (7), pp. 1242–1258. Cited by: §4.2.
 [6] (2011) Measuring balance and model selection in propensity score methods. Pharmacoepidemiology and drug safety 20 (11), pp. 1115–1129. Cited by: §5.
 [7] (2020) A multistate model for early decisionmaking in oncology. Biometrical Journal 62 (3), pp. 550–567. Cited by: §1.
 [8] (2020) Modelassisted cohort selection with bias analysis for generating largescale cohorts from the ehr for oncology research. arXiv preprint arXiv:2001.09765. Cited by: §4.1.
 [9] (1996) Why we need observational studies to evaluate the effectiveness of health care. Bmj 312 (7040), pp. 1215–1218. Cited by: §1.
 [10] (2020) Realworld evidence to support regulatory decisionmaking for medicines: considerations for external control arms. Pharmacoepidemiology and drug safety 29 (10), pp. 1228–1235. Cited by: §1.
 [11] (2021) The use of external controls: to what extent can it currently be recommended?. Pharmaceutical Statistics. Cited by: §1, §1.
 [12] (2006) The evaluation of surrogate endpoints. Springer Science & Business Media. Cited by: §5.
 [13] (2020) Using electronic health records to derive control arms for early phase singlearm lung cancer trials: proofofconcept in randomized controlled trials. Clinical Pharmacology & Therapeutics 107 (2), pp. 369–377. Cited by: §B.1, §4.1, §4.2, §5.
 [14] (2021) Making apples from oranges: comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets. Biometrical Journal 63 (3), pp. 528–557. Cited by: §4.2.
 [15] (1997) Metaanalysis for the evaluation of potential surrogate markers. Statistics in medicine 16 (17), pp. 1965–1982. Cited by: §5.
 [16] (2013) Genetic matching for estimating causal effects: a general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics 95 (3), pp. 932–945. Cited by: §B.2.2.
 [17] (2020) Use of realworld evidence to support fda approval of oncology drugs. Value in Health 23 (10), pp. 1358–1365. Cited by: §1.
 [18] (2021) Emulating randomized clinical trials with nonrandomized realworld evidence studies: first results from the rct duplicate initiative. Circulation 143 (10), pp. 1002–1013. Cited by: §5.
 [19] (2019) Visualization in bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society) 182 (2), pp. 389–402. Cited by: §4.4.
 [20] (2013) Bayesian data analysis. CRC Press. Cited by: §A.2, §2.2.
 [21] (2006) Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper). Bayesian analysis 1 (3), pp. 515–534. Cited by: §2.2, §3.1, §5.
 [22] (2019) Avoiding pitfalls when combining multiple imputation and propensity scores. Statistics in medicine 38 (26), pp. 5120–5132. Cited by: §4.2.
 [23] (2021) Hmisc: harrell miscellaneous. Note: R package version 4.50 External Links: Link Cited by: §4.2.
 [24] (2016) Regulatory approval of pharmaceuticals without a randomised controlled study: analysis of ema and fda approvals 1999–2014. BMJ open 6 (6), pp. e011666. Cited by: §1.
 [25] (2020) Causal inference: what if. Boca Raton: Chapman & Hall/CRC. Cited by: §1.
 [26] (2011) Network metaanalysis of survival data with fractional polynomials. BMC medical research methodology 11 (1), pp. 1–14. Cited by: §5.
 [27] (2019) Propensity score analysis with partially observed covariates: how should multiple imputation be used?. Statistical methods in medical research 28 (1), pp. 3–19. Cited by: §4.2.
 [28] (2020) Alternative analysis methods for time to event endpoints under nonproportional hazards: a comparative analysis. Statistics in Biopharmaceutical Research 12 (2), pp. 187–198. Cited by: §5.
 [29] (2020) Comparison of population characteristics in realworld clinical oncology databases in the us: flatiron health, seer, and npcr. medRxiv. Cited by: §4.1.
 [30] (2017) Doubleadjustment in propensity score matching analysis: choosing a threshold for considering residual imbalance. BMC medical research methodology 17 (1), pp. 1–8. Cited by: §B.2.3.
 [31] (2010) Network metaanalysis of parametric survival curves. Research synthesis methods 1 (34), pp. 258–271. Cited by: §5.
 [32] (2012) On the halfcauchy prior for a global scale parameter. Bayesian Analysis 7 (4), pp. 887–902. Cited by: §2.2.
 [33] (1984) Reducing bias in observational studies using subclassification on the propensity score. Journal of the American statistical Association 79 (387), pp. 516–524. Cited by: §5.
 [34] (1985) Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician 39 (1), pp. 33–38. Cited by: §B.2.2.
 [35] (1973) The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, pp. 185–203. Cited by: §B.2.3.
 [36] (2001) Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology 2 (3), pp. 169–188. Cited by: §5.
 [37] (2008) For objective causal inference, design trumps analysis. The Annals of Applied Statistics 2 (3), pp. 808–840. Cited by: §5.
 [38] (2015) Comparison of singlearm vs. randomized phase ii clinical trials: a bayesian approach. Journal of biopharmaceutical statistics 25 (3), pp. 474–489. Cited by: §1.
 [39] (2009) The estimation of average hazard ratios by weighted cox regression. Statistics in medicine 28 (19), pp. 2473–2489. Cited by: §5.
 [40] (2020) Beyond randomized clinical trials: use of external controls. Clinical Pharmacology & Therapeutics 107 (4), pp. 806–816. Cited by: §1.
 [41] (2008) Multivariate and propensity score matching software with automated balance optimization: the matching package for r. Journal of Statistical Software, Forthcoming. Cited by: §B.2.2.
 [42] (2015) The role of nonrandomized trials in the evaluation of oncology drugs. Clinical Pharmacology & Therapeutics 97 (5), pp. 502–507. Cited by: §1.
 [43] (2004) Bayesian approaches to clinical trials and healthcare evaluation. Vol. 13, John Wiley & Sons. Cited by: §2.2.
 [44] (2020) Why test for proportional hazards?. Jama 323 (14), pp. 1401–1402. Cited by: §5.
 [45] (2010) Matching methods for causal inference: a review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics 25 (1), pp. 1. Cited by: §B.2.2, §5.
 [46] (2020) Synthetic and external controls in clinical trials–a primer for researchers. Clinical Epidemiology 12, pp. 457. Cited by: §1.
 [47] (2020) Use of realworld evidence in regulatory decisions for rare diseases in the united states—current status and future directions. Pharmacoepidemiology and drug safety 29 (10), pp. 1213–1218. Cited by: §1.
Data availability statement
Data from Flatiron Health, Inc were used to construct the external control cohorts. These deidentified data may be made available upon request and are subject to a license agreement with Flatiron Health; interested researchers should contact Flatiron Health to determine licensing terms. For the internal control cohorts for clinical trials 1, 2, 3, 12 and 13 qualified researchers may request access to individual patient level clinical data for each separate study through a data request platform. At the time of writing this request platform is Vivli (https://vivli.org/ourmember/roche/). For the remaining clinical trials qualified researchers may submit an enquiry to Vivli to request access to individual patient level clinical data. For up to date details on Roche’s Global Policy on the Sharing of Clinical Information and how to request access to related clinical study documents, see here: https://go.roche.com/data_sharing.
The code used to produce the results reported in the paper is available at https://github.com/phcanalytics/ecmetamanuscript. Documentation and installation instructions for the R package ecmeta can be found at https://github.com/phcanalytics/ecmeta.
Appendix A Maximum likelihood approach
a.1 Estimation
and can be estimated by optimization of the loglikelihood function,
(7) 
a.2 Prediction for a new study
Following Gelman et al. (see p. 66) [20], we note that the posterior predictive distribution for a new observation follows a t distribution with location , scale , and degrees of freedom where is the mean and is the variance from the observations. In our problem it follows that follows a t distribution with mean and scale .
A simulation procedure for obtaining draws of then proceeds as follows:

Draw

Draw where is a student t distribution with degrees of freedom, location , and scale

Use the draws from 1 and 2 to compute
Appendix B Propensity score methodology
This section expands on the propensity score methodology from Section 4.2 and describes the sensitivity analyses. All results and code are available online in the GitHub repository for the paper.
b.1 Coding of covariates
Although we aimed to code categorical covariates in a consistent way across the reference studies, this was not always possible. Race was typically coded as White versus Other since the number of observations within nonWhite race categories was usually very small. Two exceptions were NCT02367781 and NCT01351415. We used three categories (White, Black, Other) for the former since there were a significant number of Black patients in the RCT; for the latter, race was coded as Asian or NonAsian since it was coded in that way in the trial. Similarly, histology was only included if it was not part of the inclusion and exclusion criteria for a trial and covariates were excluded if they were not collected for a particular trial.
We explored both linear terms and restricted cubic splines with 4 knots placed at equally spaced quantiles of the support for the continuous variables (age and time since initial diagnosis). In addition, an interaction between cancer stage and time since diagnosis was considered based on the models employed in Carrigan et al.
[13] A specification was chosen after evaluation of the balance of the covariates subsequent to population adjustment with the propensity score methods. We ultimately chose the model with linear terms and no interactions since it tended to result in the best balance and was less likely to generate extreme weights.b.2 Sensitivity analyses
b.2.1 Trimming
Recall that the primary analysis used IPTWATT weights and removed external control patients with propensity scores below the 1st percentile and above the 99th percentile. A sensitivity analysis was also performed in which no patients were removed from the sample.
b.2.2 Propensity score matching
Propensity score matching was used as a sensitivity analysis. Two approaches were utilized. First, we matched patients on the linear propensity score with greedy 1:1 nearest neighbor matching. For a given treated subject, 1:1 nearest neighbor matching selects the control subject whose value of the linear propensity score is closest; that is, for a treated subject , the control subject with the minimum value of was chosen where is the propensity score for subject . In cases where multiple control subjects were equally close to a treated subject, a treated subject was chosen at random. We believe 1:1 matching is reasonable given simulation evidence showing that mean square error is typically minimized when matching either 1 or 2 controls to each treated subject.[4] Greedy matching meant that control subjects were chosen one at a time and the order in which they were chosen mattered. Matching was performed without replacement (except in cases where the number of control subjects was less than the number of treated subjects) to ensure that the matched controls were independent, although it is worth noting that there is some evidence that matching with replacement can reduce bias.[1, 45].
Our second approach used genetic matching, which can model the propensity score in a more flexible way.[41, 16] In particular, one challenge of nearest neighbor matching—and propensity core methods more generally—is that it is difficult to specify a model that achieves satisfactory covariate balance. Genetic matching can help since the weight given to each covariate is based on an evolutionary search algorithm that iteratively checks and improves balance. We implemented this algorithm using both the linear propensity score and all the covariates from the propensity score model, so that matching on the propensity score and the Mahalanobis distance were limiting cases.
Both matching algorithms were performed with and without a caliper. In the case of the former, a caliper on the linear propensity score of 0.25 standard deviation was used.[34].
b.2.3 Double adjustment
To facilitate simple estimation of marginal HRs, the Cox models in the primary analysis did not adjust for covariates. A sensitivity analysis was performed in which regression adjustment was used to estimate hazard ratios (i.e., by including all covariates used in the propensity score model in the Cox model). [35, 30].
Comments
There are no comments yet.