Abstract
Clinical trials with a hybrid control arm (a control arm constructed from a combination of randomized patients and real-world data on patients receiving usual care in standard clinical practice) have the potential to decrease the cost of randomized trials while increasing the proportion of trial patients given access to novel therapeutics. However, due to stringent trial inclusion criteria and differences in care and data quality between trials and community practice, trial patients may have systematically different outcomes compared to their real-world counterparts. We propose a new method for analyses of trials with a hybrid control arm that efficiently controls bias and type I error. Under our proposed approach, selected real-world patients are weighted by a function of the “on-trial score,” which reflects their similarity to trial patients. In contrast to previously developed hybrid control designs that assign the same weight to all real-world patients, our approach upweights real-world patients who more closely resemble randomized control patients and discounts dissimilar patients. Estimates of the treatment effect are obtained via Cox proportional hazards models. We compare our approach to existing approaches via simulations and apply these methods to a study using electronic health record (EHR) data. Our proposed method is able to control type I error, minimize bias, and decrease variance when compared to using only trial data in nearly all scenarios examined. Therefore, our new approach can be used when conducting clinical trials by augmenting the standard-of-care arm with weighted patients from the EHR to increase power without inducing bias.
1 Introduction
Randomized clinical trials are the gold standard for testing a new treatment, though running a traditional clinical trial has potential disadvantages. Clinical trials for rare diseases can take a long time to accrue patients, which lengthens the clinical trial and drug approval process. Additionally, if a prior Phase II trial has shown superiority of the new treatment over the standard treatment, it may not be ethical to randomize patients in a 1:1 ratio; rather, it may be preferred to use a 2:1 or 3:1 ratio favoring the intervention over the standard-of-care. This can result in lower power and the inability to detect an effect even if one truly exists.
It is therefore appealing to consider combining clinical trial data on patients receiving a novel treatment with data on patients receiving the control therapy derived from electronic health records (EHR). [21] The appeal of including external patients receiving the standard-of-care is that it reduces or even completely eliminates the need to randomize patients to a control arm in the current trial. [16] In a trial with an external control arm, all data on the standard-of-care are derived from EHR, while in a trial with a hybrid control arm, external patients receiving the standard-of-care from an EHR are combined with randomized trial patients in the control arm. [1]
Electronic health records (EHR) contain a vast amount of data that can be relatively easily leveraged for research. These data are by nature observational, and, as such, there are a few key features of EHR data that are worth noting. First, EHR were developed for clinical care and billing purposes. Therefore, some of the information that researchers may be interested in may not be collected or may be contained in narrative text notes, which are difficult to analyze. Furthermore, the healthcare system in the United States, as well as in many other countries, is fragmented such that a patient’s medical records may be spread across the databases of multiple healthcare systems, which can result in an incomplete or inadequate picture of a patient’s health when relying on data from an individual EHR database.
Though conducting research with EHR data has distinct challenges, approaches have been developed to make beneficial use of this data source.
In 1976, Pocock proposed a method to combine randomized patients receiving the standard-of-care from historical clinical trials with intervention arm patients from a new trial to address the fact that many studies of the day examining the efficacy of a new treatment did not contain a randomized control arm, which made it difficult to draw causal conclusions. [14] The approach represents a hybrid control arm combining patients randomized to the trial control arm with patients from the control arm of a historical trial. Pocock proposed six criteria for evaluating what constitutes an acceptable historical control arm as well as how much weight to assign to historical control patients relative to randomized control patients. [14]
Pocock’s six criteria can also be applied, with varying success, to the use of EHR data to construct the hybrid control arm of a trial. These six proposed guidelines, along with considerations for application to an EHR-derived hybrid control arm, are presented below:

1. “Such a group must have received a precisely defined standard treatment which must be the same as the treatment for the randomized controls.” Patients derived from EHR data can be selected such that they are receiving the same primary treatment as the randomized trial control patients. However, supportive care and the care environment may differ between the trial and routine clinical practice.

2. “The group must have been part of a recent clinical study which contained the same requirements for patient eligibility.” By definition the EHR patients are not part of a recent clinical trial, but efforts should be made to identify an EHR-derived cohort using eligibility criteria as similar to the trial as possible. Due to limitations of EHR data capture, it may be difficult to apply all clinical trial inclusion/exclusion criteria. [15]

3. “The methods of treatment evaluation must be the same.” This criterion may or may not be met, depending on the outcome and method of outcome ascertainment used in the trial. While it is imperative for the outcome measure to be the same, outcome ascertainment may differ between the trial and EHR data capture. For example, even an outcome such as death can have sensitivity between and , meaning that there are deaths that are not recorded in the EHR. [2]

4. “The distributions of important patient characteristics in the group should be comparable with those in the new trial.” Patients receiving the control treatment in routine care may differ in many ways from patients participating in a trial. The requirement of the same distribution of patient characteristics in the EHR patient pool as the trial patients is possible but highly unlikely to be exact; similarity, however, should be striven for.

5. “The previous study must have been performed in the same organization with largely the same clinical investigators.” This criterion is not able to be met by definition.

6. “There must be no other indications leading one to expect differing results between the randomized and historical controls.” This is unlikely to be met completely due to many differences between real-world and clinical trial care. However, if the EHR data is contemporaneous with the current trial, it is reasonable to assume that care received by the EHR patients is similar to the care received by the trial control patients.
Overall, if an EHR cohort can be constructed such that the patients are contemporaneous with the clinical trial, the same inclusion/exclusion criteria are applied to ensure comparable patient populations are under study, EHR control patients receive the same treatment as the clinical trial control arm patients, and the outcome and method of outcome ascertainment are as similar between EHR data capture and the trial as possible, it may be appropriate to use EHR data as part of a hybrid control arm.
There are several approaches to using external control patients when estimating treatment effects, including Pocock’s approach which was among the first. Although these methods were developed in the context of historical controls, where control patients are drawn from a previously conducted clinical trial, they can also be applied to the case where control patients are drawn from an EHR database. This does not affect the implementation of the methods, though interpretation must be made carefully. We assume throughout this paper that only patients in the trial receive the intervention therapy. Pocock’s method assumes that the parameter of interest is the difference between the mean outcome in the intervention arm patients and the mean outcome in the standard-of-care arm patients in the clinical trial. Pocock’s method also assumes that the true mean value for the trial standard-of-care patients follows a normal distribution centered at a weighted sum of the sample mean of the trial standard-of-care patients and the sample mean of the external standard-of-care patients, where weights are selected based on the extent to which external standard-of-care patients are believed to be representative of trial standard-of-care patients and with a standard deviation also dependent upon these factors. [14]
Chen et al. introduced a Bayesian approach to estimating the effect of a treatment when using hybrid control arms that relies on power priors. [3, 7] This approach incorporates external standard-of-care arm data with the current trial data by taking only a fraction of the information from each external standard-of-care patient. [3, 7]
The power prior can be used to estimate many different estimands, including differences in means or proportions, hazard ratios, or odds ratios. In this method, the pool of external standard-of-care patients is weighted as a whole, and each external standard-of-care patient is assigned anywhere from 0% to 100% of the weight that a current trial participant, whether intervention or standard-of-care, receives in the final model. [3, 7] When the weight assigned to external standard-of-care patients is 0, the power prior approach is equivalent to using no data from the external standard-of-care patients, and when the weight is 1, the power prior approach is the same as fully pooling the external standard-of-care patients with the current trial data. [3, 7] This method may be more interpretable than Pocock’s method as the amount of information incorporated is quantified directly through the weight rather than through a variance parameter. [3, 7] Similar to Pocock’s method, the amount of information incorporated from the external standard-of-care patients must be prespecified by the researcher, and sensitivity analyses are recommended to determine the robustness of results to the choice of weight. [3, 7, 14]
Duan and Ye in 2008 [5] and Neuenschwander et al. in 2009 [13] concurrently developed the normalized power prior approach, which extended the power prior model to estimate the weight from the data rather than using a prespecified value. [5, 13] The normalized power prior approach allows for more weight to be allocated to the external standard-of-care patients when the external standard-of-care patients are similar in terms of outcome to the current trial patients and less weight when they are dissimilar. [5, 13]
These earlier methods focused on weighting external standard-of-care patients as a group, whether the weight is prespecified or data-driven. More recent extensions have considered individual weighting of external standard-of-care patients based on the similarity of each individual to patients in the trial standard-of-care arm. One such adaptation of the power prior approach proposes dividing the external standard-of-care patients into subgroups based on their similarity to the trial patients and assigning a weight to each subgroup. [22] Another recently proposed method uses a modification of the propensity score, called the on-trial score, to create matches between the external standard-of-care patients and the trial standard-of-care patients in order to create a hybrid standard-of-care arm consisting of patients most similar to those in the clinical trial intervention arm. [9]
In this paper, we propose a new data-adaptive weighting method that addresses the limitation of assigning a single weight to the entire group of external standard-of-care patients by assigning weights to each individual in the external standard-of-care arm based on similarity to trial patients using the on-trial score. The use of individualized weights helps to account for the fact that patients included in EHR databases may be more heterogeneous than patients included in clinical trials. The proposed approach incorporates more information from standard-of-care patients who are more similar to trial participants than from those who are not.
The structure of the paper is as follows: in section 2 we define notation, outline the existing approaches used for hybrid standard-of-care arms, and introduce our proposed method for combining trial standard-of-care patients with external standard-of-care patients. In section 3, simulations are presented to assess the relative performance of our new method compared to existing methods. Section 4 applies all methods discussed to a clinical study for patients with metastatic castration-resistant prostate cancer comparing the standard-of-care treatment of prednisone with the new treatment of abiraterone acetate in conjunction with prednisone, with external standard-of-care patients from a pseudo EHR, and section 5 provides a summary and discussion.
2 Methods
2.1 Notation
We assume the existence of a trial of size and an external data source, e.g. an EHR database consisting of patients meeting comparable inclusion/exclusion criteria and receiving the same treatment as trial standard-of-care arm patients, of size . Let . Let each patient in each database have information on a set of covariates, . Additionally, let be an indicator such that takes the value of 1 if the patient is in the external data source and 0 if the patient is enrolled in the trial. Similarly, let be a treatment indicator such that takes a value of 0 if the patient receives the standard-of-care and 1 if the patient receives the intervention. In our numerical experiments below, we have a time-to-event outcome variable, , where is the time of the event of interest (failure time) and is the censoring time. Also, let be a status indicator where takes a value of 1 if and 0 otherwise. Extensions to outcomes of other variable types follow directly from the likelihood-based formulation below. We denote the data available for the set of external standard-of-care patients as , the set of randomized standard-of-care patients as , the set of randomized intervention arm patients as , and the set of all trial patients as . We also let denote a target parameter of interest that represents treatment efficacy, which could be parameterized as a difference in mean event times or a hazard ratio comparing intervention arm and standard-of-care patients.
Below we summarize existing and proposed approaches to incorporating data from external standard-of-care patients into an analysis of treatment efficacy.
2.2 Existing Approaches to Hybrid Control Trials
2.2.1 Naïve Approaches
Ignoring the external data and using only the current trial data serves as a positive control with regards to the minimum bias that can be attained when estimating the parameter of interest. In this method, only the data from the patients enrolled in the trial are analyzed and the patients from the external data source are left out.
Fully pooling the external data with the trial data serves as a negative control with regards to the amount of bias that is likely to occur when estimating the treatment effect. In this case, the external patients are given the same weight as the trial patients and all external patients are included in the analysis.
2.2.2 Bayesian Approaches
The power prior (PP) approach combines the external patients with the trial patients such that each external patient has a weight less than 1. [7] The power prior approach assigns the same weight, , such that , to all patients in the external data source and the value of is prespecified by the researcher. The power prior approach proposes the following prior distribution for : which yields the following posterior distribution for : .
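For reference, the power prior is commonly written as in the sketch below. The symbols are assumptions chosen to follow the usual convention in the power prior literature: $\theta$ is the target parameter, $D$ the current trial data, $D_0$ the external data, $a_0$ the prespecified weight, $L(\cdot)$ a likelihood, and $\pi_0(\theta)$ an initial prior.

```latex
\begin{align*}
\pi(\theta \mid D_0, a_0) &\propto L(\theta \mid D_0)^{a_0}\, \pi_0(\theta),
  \qquad 0 \le a_0 \le 1, \\
\pi(\theta \mid D, D_0, a_0) &\propto L(\theta \mid D)\, L(\theta \mid D_0)^{a_0}\, \pi_0(\theta).
\end{align*}
```

Under this form, $a_0 = 0$ recovers the trial only analysis and $a_0 = 1$ recovers full pooling of the external patients with the trial data.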
The normalized power prior approach (NPP) is similar to the power prior approach except that is estimated from the data rather than being prespecified by the researcher. [5] The normalized power prior approach specifies a conditional prior for given and a marginal distribution for . The normalized power prior approach has the form:
which results in the following posterior:
If is proper then the normalized power prior will also be proper.
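A sketch of the normalized power prior in the same assumed notation ($\theta$ the target parameter, $D$ the trial data, $D_0$ the external data, $a_0$ the weight, $\pi_0(\theta)$ an initial prior, and $\pi(a_0)$ a marginal prior on the weight) may be written as:

```latex
\begin{align*}
\pi(\theta, a_0 \mid D_0) &=
  \frac{L(\theta \mid D_0)^{a_0}\, \pi_0(\theta)}
       {\int L(\vartheta \mid D_0)^{a_0}\, \pi_0(\vartheta)\, d\vartheta}\; \pi(a_0), \\
\pi(\theta, a_0 \mid D, D_0) &\propto L(\theta \mid D)\, \pi(\theta, a_0 \mid D_0).
\end{align*}
```

The normalizing integral makes the conditional prior for $\theta$ given $a_0$ integrate to one, which is why, in this form and provided the integral is finite, the joint prior is proper whenever $\pi(a_0)$ is proper.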
2.2.3 Lin’s Method
Lin’s method uses an on-trial score, similar to a propensity score, where the outcome of interest is inclusion in the trial, to construct a matched set of external standard-of-care patients and weight their likelihood contribution. [9]
The on-trial score is estimated as the probability that a patient is in the clinical trial given their baseline covariates using a logistic regression model. Next, optimal pair matching is performed using the on-trial score so that each trial patient receiving the intervention is matched with an external standard-of-care patient. The selected external standard-of-care patients form a pool from which patients are randomly drawn so that the augmented trial has a 1:1 ratio of treated to standard-of-care patients. In the outcome model the external standard-of-care patients are weighted by their on-trial scores while the trial patients are given a weight of one. [9]
2.3 Proposed Approach: Data-Adaptive Weighting
In our proposed approach, we let the on-trial score be defined as the probability that the patient is included in the trial given the observed baseline covariates, . To maximize the similarity between external and randomized standard-of-care patients, we then limit the set of external standard-of-care patients to the subset with the highest on-trial scores, such that the number of external standard-of-care patients selected results in a hybrid control arm of the same size as the intervention arm. Let represent data for the subset of with the largest on-trial scores. The on-trial scores are then transformed to obtain values for such that and standardized such that . The inverse odds weight is used as we are interested in the average treatment effect on the treated, or in this case, the average treatment effect for those on-trial. This weighting method assigns all trial patients their full weight and only up- or downweights the selected external standard-of-care patients.
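The selection and weighting steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the on-trial scores are taken as already estimated, the function name `daw_weights` is hypothetical, and the standardization (rescaling the odds so the weights sum to the number of selected patients) is an assumed choice.

```python
def daw_weights(external_scores, n_select):
    """Select external patients with the highest on-trial scores and weight them.

    external_scores: estimated on-trial scores, each strictly between 0 and 1.
    n_select: number of external patients needed to fill the hybrid control arm.
    Returns (selected_indices, weights), ordered from highest to lowest score.
    """
    if not 0 < n_select <= len(external_scores):
        raise ValueError("n_select must be between 1 and the number of external patients")
    if any(not 0.0 < e < 1.0 for e in external_scores):
        raise ValueError("on-trial scores must lie strictly between 0 and 1")

    # Keep the n_select external patients who look most like trial patients.
    order = sorted(range(len(external_scores)),
                   key=lambda i: external_scores[i], reverse=True)
    selected = order[:n_select]

    # Odds of being on trial, e / (1 - e), for each selected patient.
    odds = [external_scores[i] / (1.0 - external_scores[i]) for i in selected]

    # ASSUMPTION: standardize so the weights sum to n_select (mean weight of 1),
    # which lets individual patients be up- or downweighted.
    scale = n_select / sum(odds)
    return selected, [o * scale for o in odds]


selected, weights = daw_weights([0.2, 0.5, 0.8, 0.6, 0.1], n_select=3)
```

Here `n_select` would be the intervention arm size minus the randomized standard-of-care arm size; for example, a hypothetical trial with 67 intervention and 33 randomized control patients would select 34 external patients.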
Estimation for data-adaptive weighting then uses a prior for of the form:
(1) 
which gives the following posterior:
(2) 
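By analogy with the power prior, one plausible way to write a prior and posterior with patient-specific weights is sketched below. The notation is assumed rather than taken from the source: $\theta$ is the target parameter, $D$ the trial data, $D_0^{*}$ the selected external patients, $D_{0i}$ the data for selected external patient $i$, $a_i$ that patient's weight, $e_i$ the on-trial score, and $\pi_0(\theta)$ an initial prior.

```latex
\begin{align*}
\pi(\theta \mid D_0^{*}, a) &\propto
  \Big[\prod_{i \in D_0^{*}} L(\theta \mid D_{0i})^{a_i}\Big]\, \pi_0(\theta),
  \qquad a_i \propto \frac{e_i}{1 - e_i}, \\
\pi(\theta \mid D, D_0^{*}, a) &\propto
  L(\theta \mid D)\, \Big[\prod_{i \in D_0^{*}} L(\theta \mid D_{0i})^{a_i}\Big]\, \pi_0(\theta).
\end{align*}
```

With a flat $\pi_0(\theta)$, the posterior in this sketch is proportional to a weighted likelihood in which trial patients have weight 1 and selected external patients have weight $a_i$.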
We note that all patients are used in the estimation of the on-trial score because, assuming trial patients are randomly assigned to the intervention arm, the distribution of baseline covariates is the same for standard-of-care arm and intervention arm patients. Here the on-trial score is estimated via a logistic regression, rather than being jointly estimated with the target parameter. However, the on-trial score may be estimated using more flexible modeling such as a random forest or ensemble machine learning if desired.
2.4 Estimation
The Bayesian estimation approach for the models presented above, under certain conditions, can be approximated using a frequentist analog. For example, in the case of the DAW method, when a noninformative prior is used for , the prior for DAW, , is equivalent to . In this case the posterior mode corresponds to the maximum likelihood estimator for a weighted parametric survival model. Similarly, estimates from a weighted Cox proportional hazards model with weights of 1 for trial patients and for selected external standard-of-care patients will provide similar estimates to the Bayesian estimates with flat priors. This approach may be preferable to a fully Bayesian estimation approach because of its relative computational efficiency and insensitivity to prior specification. In numerical experiments below, we investigate performance of estimation using this weighted Cox approach for all methods described above.
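To make the weighted Cox step concrete, the following is a minimal pure-Python sketch that maximizes a weighted partial likelihood for a single binary treatment covariate by Newton-Raphson, with ties handled in the Breslow manner. The function name and arguments are hypothetical, and in practice an established survival analysis library would be used instead.

```python
import math


def weighted_cox_binary(times, events, treat, weights, n_iter=25, tol=1e-10):
    """Estimate the log hazard ratio for a binary covariate from a weighted
    Cox partial likelihood (Breslow-type risk sets, Newton-Raphson updates).

    times: observed follow-up times; events: 1 if failure observed, else 0;
    treat: 1 for intervention, 0 for standard-of-care; weights: case weights
    (1 for trial patients, the assigned weight for external patients).
    """
    n = len(times)
    beta = 0.0
    for _ in range(n_iter):
        score = 0.0  # first derivative of the weighted log partial likelihood
        info = 0.0   # negative second derivative
        for i in range(n):
            if not events[i]:
                continue
            s0 = s1 = s2 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # patient j is still at risk
                    r = weights[j] * math.exp(beta * treat[j])
                    s0 += r
                    s1 += r * treat[j]
                    s2 += r * treat[j] ** 2
            mean = s1 / s0
            score += weights[i] * (treat[i] - mean)
            info += weights[i] * (s2 / s0 - mean ** 2)
        if info <= 0.0:
            break  # no information about the treatment effect
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Toy data: four patients, all with observed failures, unit weights.
log_hr = weighted_cox_binary([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 0, 1], [1, 1, 1, 1])
```

Multiplying every weight by the same constant leaves the point estimate unchanged in this sketch; only the relative weights of trial and external patients matter for the estimate.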
Our proposed method, data-adaptive weighting (DAW), builds on the power prior approach and Lin’s approach. While the power prior approach and normalized power prior approach both use the same value for all patients, DAW uses individual weights for the external patients depending on their similarity to the trial patients. Lin’s approach also uses individual weights for each of the external patients. However, in Lin’s approach the external patients are selected by matching on the on-trial score, and the selected patients are weighted directly by the on-trial score.
3 Simulation Study
We conducted a simulation study to investigate the bias, efficiency, effective sample size, and type I error of the existing methods outlined above and the data-adaptive weighting method. Data were simulated to resemble a real-world study using trial data and external standard-of-care data from an EHR database.
Data were simulated for four covariates, a real-world indicator, a treatment indicator, a failure time, and a censoring time. We simulated data for trials of two different sizes (100 and 1,000 patients), each with two different randomization ratios of the number of patients in the intervention arm to the number of patients in the standard-of-care arm (2:1 and 3:1). The number of external standard-of-care patients available was equal to the number of trial patients. These values were selected to mirror real-world scenarios for unbalanced clinical trials and provide enough potential EHR patients to distinguish between the performance of the various methods. The hazard ratio for failure for patients in the intervention arm versus randomized standard-of-care patients (treatment effect) was allowed to take values of 0.5, 0.75, 0.875, and 1. Two different strengths of confounding of the relationship between baseline covariates and failure were also explored: mild confounding and strong confounding. We note that these covariates confound the relationship between enrollment in the trial and the outcome because the baseline covariate distribution differs between trial and external standard-of-care patients. Analyses limited to the trial population are unconfounded because there is no relationship between trial arm and baseline covariates due to randomization. Censoring rate was held constant across all simulation scenarios. External standard-of-care patients had censoring times arising from an exponential distribution with rate 0.4 and trial patients had censoring times arising from an exponential distribution with rate 0.1.
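The censoring mechanism described above can be sketched as follows. The censoring rates (0.1 for trial patients, 0.4 for external patients) come from the text; the exponential failure-time distribution and its rate of 0.5 are assumptions made only for illustration, and `simulate_arm` is a hypothetical helper.

```python
import random


def simulate_arm(n, failure_rate, censor_rate, rng):
    """Return a list of (observed_time, event_indicator) pairs for n patients."""
    data = []
    for _ in range(n):
        failure = rng.expovariate(failure_rate)  # ASSUMPTION: exponential failure times
        censor = rng.expovariate(censor_rate)    # exponential censoring, as in the text
        data.append((min(failure, censor), 1 if failure <= censor else 0))
    return data


rng = random.Random(2024)
trial = simulate_arm(5000, failure_rate=0.5, censor_rate=0.1, rng=rng)
external = simulate_arm(5000, failure_rate=0.5, censor_rate=0.4, rng=rng)

# Proportion of patients whose failure time was censored in each data source.
censored_trial = 1 - sum(event for _, event in trial) / len(trial)
censored_external = 1 - sum(event for _, event in external) / len(external)
```

Under these assumed rates, external patients are censored considerably more often than trial patients, which illustrates how the two data sources can differ in follow-up even when the failure process is identical.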
Relationships among the simulated variables and the complete set of distributions and parameter values used to simulate data, along with examples of EHR-derived covariates used to motivate the simulation study, are provided in Table 1.
Variable  Trial Distribution  External Distribution  Analogous Variable  
Gender  
College Degree  
HDL Cholesterol  
BMI  
T  ,  0  Treatment Indicator  

Failure Time  



Censoring Time 
In numerical examples, the on-trial score was estimated using logistic regression including all baseline covariates as predictors. Weighted Cox proportional hazards models with a treatment indicator as the sole covariate were fit as the outcome model to estimate the treatment effect.
Each simulation scenario was repeated 1,000 times. We estimated bias relative to the true marginal treatment effect, empirical variance, 95% confidence interval coverage probabilities, power (for scenarios with nonnull treatment effects), and type I error (for scenarios with a null treatment effect). Power and type I error are estimated for hypothesis testing using a significance threshold of 0.05.
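The performance measures can be computed from the per-replicate estimates as in the sketch below. It assumes each replicate yields a log hazard ratio estimate and a standard error, and that tests and intervals are Wald-based; both the function name and these choices are illustrative assumptions.

```python
from statistics import mean, variance


def summarize_replicates(estimates, std_errors, true_log_hr, z=1.959964):
    """Summarize simulation replicates of a log hazard ratio estimator.

    Returns bias, empirical variance, 95% interval coverage, and the rejection
    rate of the two-sided Wald test of no treatment effect at the 0.05 level.
    The rejection rate is the type I error when true_log_hr is 0 and the power
    otherwise.
    """
    covered = [abs(est - true_log_hr) <= z * se
               for est, se in zip(estimates, std_errors)]
    rejected = [abs(est) > z * se for est, se in zip(estimates, std_errors)]
    return {
        "bias": mean(estimates) - true_log_hr,
        "empirical_variance": variance(estimates),
        "coverage": sum(covered) / len(covered),
        "rejection_rate": sum(rejected) / len(rejected),
    }


summary = summarize_replicates([0.1, -0.1, 0.3], [0.1, 0.1, 0.1], true_log_hr=0.0)
```

With 1,000 replicates per scenario, the Monte Carlo standard error of an estimated type I error near 0.05 is roughly 0.007, which is useful context when judging small departures from the nominal level.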
We first examined the performance of alternative methods as we varied the ratio of patients in the intervention arm to standard-of-care patients in the trial. In these analyses neither the proportion of trial patients in the intervention arm nor the number of EHR patients available as a function of the number of trial patients affected the pattern of the results. Therefore, only results for a 2:1 intervention arm to standard-of-care arm ratio with the same number of EHR patients available as trial patients are presented here. Sample Kaplan-Meier curves for each treatment hazard ratio and confounding level combination show the difference in survival over time across the three groups of patients (Supplemental Figure S1).
3.1 Simulation Results
The results of the simulation study show that fully pooling the data from the available EHR patients with the trial data results in large biases under all conditions examined (Figure 1). As expected, using only the trial patient data results in negligible bias. The power prior method was also substantially biased across all three weight values explored, with the smallest weight having the least bias and the largest weight having the most bias (Figure 1). The normalized power prior was biased when the trial size was 100 but displayed minimal bias when the trial size was 1,000. This is explained by the fact that the weight was estimated to be approximately 0.22 to 0.36 for the trial size of 100 and 0.01 to 0.03 for the trial size of 1,000 (Table 4). Therefore, although the normalized power prior approach performed well in terms of bias at the larger trial size, this is because little information from the EHR patients was incorporated. Both Lin’s method and DAW displayed low bias across all scenarios examined. DAW had consistently lower bias than Lin’s method (Figure 1).
The variances of the estimates reflect the extent to which data from external standard-of-care subjects were incorporated. The trial only approach had the largest variance and full pooling of all EHR patients with the trial patients had the smallest variance (Figure 2). The power prior approach had variance between the trial only and full pooling methods, with variance inversely related to the weight assigned to the external patients. The normalized power prior had substantially smaller variance than the trial only approach when the trial size was 100 due to the larger estimated weight (Figure 2, Table 4). DAW had smaller variance than Lin’s method under all conditions examined. This is due to the larger effective sample size for any given scenario (Figure 2, Table 3).
DAW was able to achieve the targeted 1:1 intervention to standard-of-care ratio, while Lin’s method resulted in substantially smaller effective sample sizes. DAW includes about twice as many external patients as Lin’s method does in the scenarios examined (Table 3).
The full pooling and power prior methods had high power due to their large effective sample sizes and treatment effect estimates that were biased away from the null; however, had the covariate effects been in the other direction, the bias would have been towards the null and these methods would have had lower power (Figures 3, 1). The normalized power prior approach had higher power relative to the trial only method when the trial size was 100 but not when the trial size was 1,000. Lin’s method and DAW had higher power than the trial only approach and were quite similar to one another when the trial size was 100; Lin’s method had slightly higher power than DAW when the trial size was 1,000 (Figure 3).
Clearly, unless the strength of confounding and the trial size are small, full pooling of all EHR patients or using the power prior with one of the values examined provides very poor coverage of the true HR (Figure 4). NPP only provides nominal coverage when the trial size is 1,000. Both Lin’s method and DAW had nominal coverage in all scenarios except under strong confounding when the trial size was 1,000; in that case Lin’s method had approximately coverage and DAW had coverage except when the marginal treatment ratio is nearly 1, in which case the coverage dropped to (Figures 4, 5).
Type I error was controlled at the 5% level when only the trial data were analyzed, and was poorly controlled under the full pooling and power prior methods (Table 2). As expected, NPP controlled type I error when the estimated weight was small (i.e., sample size of 1,000) and when there was mild confounding (Tables 2, 4). Both Lin’s method and DAW controlled type I error under all scenarios examined except when there was strong confounding and a trial size of 1,000; type I error was slightly inflated in this case to around 0.06 (Table 2).
Table 2. Type I error by method, strength of confounding, and trial size.
Method  Mild Confounding, Trial = 100  Mild Confounding, Trial = 1,000  Strong Confounding, Trial = 100  Strong Confounding, Trial = 1,000
Trial Only  0.05  0.051  0.052  0.046
Full Pooling  0.126  0.716  0.356  0.999
PP, weight = 0.25  0.053  0.177  0.098  0.619
PP, weight = 0.5  0.069  0.403  0.192  0.953
PP, weight = 0.75  0.097  0.583  0.287  0.996
NPP  0.061  0.061  0.117  0.047
Lin  0.049  0.046  0.044  0.060
DAW  0.052  0.048  0.050  0.059
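Table 3. Effective sample size by method, strength of confounding, and trial size.
Method  Mild Confounding, Trial = 100  Mild Confounding, Trial = 1,000  Strong Confounding, Trial = 100  Strong Confounding, Trial = 1,000
Trial Only  100  1000  100  1000
Full Pooling  200  2000  200  2000
PP, weight = 0.25  125  1250  125  1250
PP, weight = 0.5  150  1500  150  1500
PP, weight = 0.75  175  1750  175  1750
NPP  135  1033  123  1013
Lin  116  1166  116  1166
DAW  134  1340  134  1340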
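Table 4. Estimated weight under the normalized power prior by treatment hazard ratio, strength of confounding, and trial size.
Hazard Ratio  Mild Confounding, Trial = 100  Mild Confounding, Trial = 1,000  Strong Confounding, Trial = 100  Strong Confounding, Trial = 1,000
HR: 0.5  0.36  0.03  0.22  0.01
HR: 0.75  0.36  0.03  0.23  0.01
HR: 0.875  0.36  0.03  0.24  0.01
HR: 1  0.35  0.03  0.23  0.01
Note: Hazard ratios listed are the conditional treatment hazard ratios as opposed to the marginal treatment hazard ratios.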
4 Case Study: Metastatic Castration-Resistant Prostate Cancer
Table 5. Baseline characteristics by data source.
Characteristic  Pseudo EHR (N = 1000)  Clinical Trial, Standard-of-Care (N = 394)  Clinical Trial, Intervention (N = 791)
Age, median (IQR)  69 (63-76)  69 (63-75)  69 (64-75)
ECOG PS, N (%)
0  244 (24.4)  140 (35.5)  262 (33.1)
1  572 (57.2)  209 (53.0)  447 (56.5)
2  184 (18.4)  45 (11.4)  82 (10.4)
Gleason Score, N (%)
1  66 (6.6)  0 (0.0)  1 (0.1)
2  11 (1.1)  15 (3.8)  31 (3.9)
3  12 (1.2)  1 (0.3)  2 (0.3)
4  4 (0.4)  1 (0.3)  3 (0.4)
5  19 (1.9)  3 (0.8)  4 (0.5)
6  78 (7.8)  2 (0.5)  24 (3.0)
7  304 (30.4)  32 (8.1)  76 (9.6)
8  152 (15.2)  151 (38.3)  286 (36.2)
9  354 (35.4)  76 (19.3)  159 (20.1)
10  0 (0.0)  113 (28.7)  205 (25.9)
PSA, median (IQR)  399 (98-914)  139 (41-412)  120 (37-354)
LDH, median (IQR)  293 (225-397)  235 (188-321)  222 (187-308)
ALP, median (IQR)  246 (109-447)  126 (83-268)  125 (79-254)
Hb, median (IQR)  11.1 (10.1-12.5)  12.0 (10.8-12.8)  11.9 (0.9-12.9)
Testosterone, median (IQR)  11.5 (5.6-20.0)  12.0 (5.7-20.0)  13.0 (5.8-20.2)
The objective of this analysis was to compare the performance of the alternative methods described above in a real-world context in which data were available from a clinical trial and a pseudo EHR dataset, which was constructed from the standard-of-care arm of the clinical trial. We estimated the effect of abiraterone acetate (an androgen synthesis inhibitor) plus prednisone relative to prednisone alone on overall survival in patients with metastatic castration-resistant prostate cancer progressing after chemotherapy. Metastatic castration-resistant prostate cancer (MCRPC) is a type of advanced prostate cancer that no longer completely responds to treatments that lower testosterone. [20] The study sample included patients with MCRPC from a phase 3 randomized double-blind clinical trial (NCT00638690) conducted by Janssen Research & Development, LLC. Complete details of trial eligibility and treatment protocols have been previously published. [4]
The clinical trial population included 1,185 patients with MCRPC progressing after taxane chemotherapy. [19]
Patients were randomized in a 2:1 ratio to treatment with abiraterone acetate plus prednisone or treatment with prednisone alone. Patients were enrolled from 2008 to 2009 and followed for five years, or until death. Treatment arm was classified based on the arm to which a patient was randomized, regardless of whether they crossed over to open-label abiraterone acetate at any point (Table 5).
The pseudo EHR dataset was constructed by sampling patients from the standard-of-care arm in the clinical trial such that the baseline covariate distribution of the resultant sample differed between the EHR and clinical trial (Table 5, Supplemental Figure S2). We assume that all patients had the same set of covariates associated with poor performance for patients with MCRPC recorded at the baseline encounter. [18, 6, 8, 11, 17, 12]
Specifically, all standard-of-care arm patients were sampled with replacement to create a population of size 10,000 from which we could draw our pseudo EHR sample. Next, each patient was assigned a probability of sampling according to a nonlinear function of the baseline covariates. Because this sampling probability has a nonlinear functional form, the estimated on-trial score in the DAW approach will be misspecified. This reflects the real-world scenario, in which we are unlikely to be able to specify this model correctly. A pseudo EHR sample of size 1,000 was then drawn. The sampling probabilities were generated such that patients were more likely to be included in the pseudo EHR if they were younger, had a higher ECOG score, had a higher Gleason score, had a higher lab value for PSA, LDH, Hb, or ALP, or had a lower testosterone value. The logistic sigmoid function was used to relate the ECOG and Gleason scores to the sampling probability in order to separate patients with high scores from those with low scores, rather than imposing a linear additive effect as the score increased. The square root of the testosterone lab value was used to shrink the effect of a high testosterone value on inclusion in the sample.
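The sampling scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used for the case study: the coefficients, centering constants, log transforms of the lab values, and the names `sampling_weight` and `draw_pseudo_ehr` are all hypothetical. The weighted draw without replacement uses the Efraimidis-Spirakis key method as one convenient choice, since the text does not specify how the sample of 1,000 was drawn.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sampling_weight(p):
    """Hypothetical nonlinear sampling probability for one patient record.

    Every coefficient and centering constant here is an illustrative
    assumption; only the direction of each effect follows the text.
    """
    lp = (
        -5.0                                         # intercept (assumed)
        - 0.03 * (p["age"] - 70.0)                   # younger -> more likely
        + 1.0 * sigmoid(4.0 * (p["ecog"] - 1.0))     # high vs low ECOG
        + 1.0 * sigmoid(4.0 * (p["gleason"] - 7.5))  # high vs low Gleason
        + 0.2 * math.log1p(p["psa"])                 # higher labs -> more likely
        + 0.2 * math.log1p(p["ldh"])
        + 0.2 * math.log1p(p["alp"])
        + 0.1 * p["hb"]
        - 0.2 * math.sqrt(p["testosterone"])         # sqrt shrinks large values
    )
    return sigmoid(lp)


def draw_pseudo_ehr(control_arm, pool_size=10_000, sample_size=1_000, seed=0):
    """Bootstrap the control arm to a pool, then draw a biased sample from it."""
    rng = random.Random(seed)
    # Step 1: sample with replacement to build the pool.
    pool = rng.choices(control_arm, k=pool_size)
    # Step 2: weighted sampling without replacement (Efraimidis-Spirakis):
    # each record gets key u ** (1 / w); the largest keys are selected.
    keyed = [(rng.random() ** (1.0 / sampling_weight(p)), p) for p in pool]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in keyed[:sample_size]]
```

Because the selection model combines sigmoid, logarithm, and square-root terms, a main-effects logistic regression for the on-trial score fitted to the resulting data would be misspecified, which is the situation the case study is designed to mimic.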
Due to missingness in some variables, multiple imputation via predictive mean matching was used with 5 imputations. The median of the imputed covariates was calculated across imputations and included in the on-trial score, which is valid in the case where the covariates do not inform treatment assignment, such as in a clinical trial. [10]
Post-imputation covariate distributions stratified by data source were similar to pre-imputation covariate distributions.
4.1 Case Study Results
In order to appropriately interpret the results of the case study, we first must evaluate each of Pocock’s six criteria for the external standard-of-care data. [14] In this case, meeting most of the criteria was trivial because the pseudo EHR was created from the standard-of-care arm of the clinical trial, except for , since the pseudo EHR had patients selected in a biased fashion such that their covariate distribution was slightly different and their outcomes were somewhat worse than the clinical trial.
The case study of MCRPC patients had a high rate of death. Of the 791 intervention arm patients in the trial there were 645 deaths, of the 394 patients receiving the standard-of-care in the trial there were 331 deaths, and of the 1,000 patients in the pseudo EHR there were 937 deaths. The median survival time was 15.6 months (95% CI: 14.7–16.8) for the intervention arm in the trial, 11.2 months (95% CI: 10.4–13.3) for the standard-of-care arm of the trial, and 8.0 months (95% CI: 7.9–8.6) for the pseudo EHR. It is clear that the patients in the pseudo EHR had inferior survival relative to both arms of the clinical trial. This is likely to be true in reality, as patients often receive more supportive care in a clinical trial than in regular clinical practice and tend to have different covariate distributions due to restrictive inclusion/exclusion criteria.
The hazard ratio for death for patients on abiraterone acetate plus prednisone as compared to patients on prednisone alone was 0.86 (95% CI: 0.75–0.98) using only patients enrolled in the clinical trial. When all pseudo EHR patients were added to the analysis population, the hazard ratio for death was 0.60 (95% CI: 0.55–0.66) (Figure 7). The hazard ratios for death estimated by the power prior method with the three different alpha values were between the trial-only and full pooling estimates, as expected. The normalized power prior method estimated and therefore was virtually identical to the trial-only method, as it effectively borrowed information from only four registry patients. Lin’s method returned results somewhat similar to the trial-only method, with an estimated hazard ratio of 0.76 (95% CI: 0.68–0.86), adding just over 200 patients to the analysis (Figure 7). The estimated hazard ratio using DAW was 0.85 (95% CI: 0.76–0.94), which is almost identical to that obtained with the trial-only data and also had a narrower confidence interval because 397 patients were added so that the augmented trial had a 1:1 randomization ratio (Figure 7). With the exception of NPP, which only added 4 patients, DAW returned results most similar to the trial-only result while achieving improved efficiency (Figure 7).
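As a rough check on the efficiency comparison, the standard error of the log hazard ratio implied by each reported interval can be backed out, assuming the intervals are symmetric Wald intervals on the log scale (an assumption; the interval construction is not restated in this section). The helper name `log_hr_se` is hypothetical, and because the reported limits are rounded to two decimals the results are approximate.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

# Hazard ratios and 95% confidence limits as reported in the text above:
# method -> (estimate, lower limit, upper limit).
REPORTED = {
    "trial-only": (0.86, 0.75, 0.98),
    "full pooling": (0.60, 0.55, 0.66),
    "Lin": (0.76, 0.68, 0.86),
    "DAW": (0.85, 0.76, 0.94),
}


def log_hr_se(lower, upper, z=Z_975):
    """Standard error of log(HR) implied by a symmetric Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


implied_se = {name: log_hr_se(lo, hi) for name, (_, lo, hi) in REPORTED.items()}

# Variance of each estimate relative to the trial-only analysis;
# values below 1 correspond to a tighter interval than trial-only.
relative_variance = {
    name: (se / implied_se["trial-only"]) ** 2 for name, se in implied_se.items()
}

for name in REPORTED:
    print(f"{name:>12}: SE(log HR) = {implied_se[name]:.3f}, "
          f"relative variance = {relative_variance[name]:.2f}")
```

A small relative variance is only an advantage when the point estimate stays close to the trial-only estimate, which is why the narrow interval from full pooling does not indicate better performance.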
5 Discussion
Data-adaptive weighting allows for the external patients who are most similar to the clinical trial to be selected and weighted such that the augmented trial has a 1:1 randomization ratio, which results in minimal bias and tighter confidence intervals as compared to using only the trial data. We compared the performance of alternative methods for constructing and analyzing a hybrid control arm in terms of bias, variance, power, confidence interval coverage, and type I error across various scenarios for trial size, trial randomization ratio, strength of covariate effects, and treatment effect.
Based on the results from our simulation study, it is clear that fully pooling external patients with trial patients has the potential to produce highly biased results, have poor coverage, and substantially inflate type I error rates. Similarly, the power prior at all three alpha values examined exhibited poor performance, although results were attenuated towards the trial-only analyses. The normalized power prior either exhibited moderate bias and poor coverage and type I error rate or had little bias but failed to incorporate much information from the EHR database. The case study examined here shows how the methods perform on real data when Pocock’s criteria are met when creating a hybrid control trial. While observed covariates can be accounted for using on-trial scores as in Lin’s method and DAW, the other criteria are extremely important to avoid confounding by unobserved characteristics of patients or their care environment, and creation of a hybrid control arm is not recommended if they are not met.
Lin’s method has reduced bias and variance compared to the more traditional methods, though DAW was able to achieve lower bias and variance than Lin’s method. Both methods performed very similarly with regard to confidence interval coverage and type I error rates. While this may initially cause one to conclude that Lin’s method and DAW are both good methods to use for hybrid control arms, there are several points to be made regarding Lin’s method. First, Lin’s method becomes more difficult to implement with larger trial and/or external data source sizes, as optimal matching can be cumbersome or even impossible with large samples. Second, while the number of EHR patients selected by Lin’s method nominally produces a 1:1 ratio, weighting by the on-trial score results in an effective sample size that has fewer standard-of-care than intervention arm patients, reducing the efficiency of this approach. Finally, it is unclear what estimand Lin’s method estimates, as the external patients are weighted by their on-trial score and the trial patients are given a weight of 1. These issues are addressed by DAW, as the IOW are scaled to ensure that the 1:1 ratio is preserved and the use of IOW allows one to estimate the estimand of interest: the average treatment effect for those on-trial.
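The contrast between the two weighting schemes can be made concrete with a small numerical sketch. It assumes that IOW denotes inverse odds weights of the form e/(1 - e), where e is an external patient's on-trial score, and that scaling means multiplying the weights by a constant so that they sum to the number of control patients needed for a 1:1 ratio. Both readings are assumptions based on this section alone, the on-trial scores are simulated rather than estimated, and the sketch does not reproduce any patient selection step.

```python
import random


def scaled_inverse_odds_weights(on_trial_scores, target_total):
    """Inverse odds weights e / (1 - e), rescaled to sum to target_total."""
    odds = [e / (1.0 - e) for e in on_trial_scores]
    scale = target_total / sum(odds)
    return [scale * w for w in odds]


# Arm sizes reported in the case study.
n_intervention, n_control = 791, 394
target_total = n_intervention - n_control  # external patients needed for 1:1

# Hypothetical on-trial scores for the external patients (simulated).
rng = random.Random(0)
scores = [rng.uniform(0.05, 0.95) for _ in range(target_total)]

# Scaled inverse odds weights: total weight equals the 1:1 target.
daw_like_weights = scaled_inverse_odds_weights(scores, target_total)

# Weighting by the raw on-trial score instead: every weight is below 1,
# so the total weight falls short of the nominal number of patients added.
score_weight_total = sum(scores)

print(f"target total weight:        {target_total}")
print(f"scaled inverse odds total:  {sum(daw_like_weights):.1f}")
print(f"raw on-trial score total:   {score_weight_total:.1f}")
```

Under these assumptions the rescaled weights contribute exactly the target amount of control-arm weight, whereas raw score weights necessarily contribute less than one unit per selected patient.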
While our simulation study evaluated a large combination of possible characteristics of a trial and real-world data source, there are additional factors that could have an effect on the results that were not examined, including differential covariate effects on the outcome between EHR and trial patients and potential differential error in outcome ascertainment between EHR and trial patients. Furthermore, due to the computational demands of Bayesian estimation for these methods, we have evaluated performance of a frequentist analogue, which targets a different estimand and may produce different results from a Bayesian implementation, particularly for small sample sizes where the Bayesian central limit theorem has little effect. One must also consider the direction of the bias induced by covariates that differ between the trial and real-world populations in order to determine the effect of the differential covariate distributions on power and type I error.
Based on the results of these simulations and the real-world data example, when working with hybrid control arm data we recommend using the DAW method in order to minimize bias and variance while maximizing coverage and properly controlling the type I error rate. Additionally, DAW estimates the average treatment effect for those on-trial, which is the estimand of interest.
6 Acknowledgements
This study, carried out under YODA Project , used data obtained from the Yale University Open Data Access Project, which has an agreement with JANSSEN RESEARCH & DEVELOPMENT, L.L.C. The interpretation and reporting of research using this data are solely the responsibility of the authors and do not necessarily represent the official views of the Yale University Open Data Access Project or JANSSEN RESEARCH & DEVELOPMENT, L.L.C.
7 Data Availability
The data that support the findings of this study are available from JANSSEN RESEARCH & DEVELOPMENT, L.L.C. via the Yale University Open Data Access Project. Restrictions apply to the availability of these data, which were used under license for this study. Data are available at https://yoda.yale.edu/ with the permission of the Yale University Open Data Access Project.
8 Funding
Research reported in this publication was supported in part by NIH grant R21CA227613. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
9 Declaration of conflicting interests
The author(s) declared no other potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
[1] (2020-10) Real-world evidence to support regulatory decision-making for medicines: Considerations for external control arms. Pharmacoepidemiology and Drug Safety 29 (10), pp. 1228–1235.
[2] (2019-05) An evaluation of the impact of missing deaths on overall survival analyses of advanced non–small cell lung cancer patients conducted in an electronic health records database. Pharmacoepidemiology and Drug Safety 28 (5), pp. 572–581.
[3] (2000) Power prior distributions for generalized linear models. Journal of Statistical Planning and Inference 84 (1–2), pp. 121–137.
[4] (2011-05) Abiraterone and increased survival in metastatic prostate cancer. The New England Journal of Medicine 364 (21), pp. 1995–2005.
[5] (2008) Normalized power prior Bayesian analysis. The University of Texas at San Antonio, College of Business Working Paper Series.
[6] (2002) Prognostic value of the Gleason score in prostate cancer. BJU International 89 (6), pp. 538–542.
[7] (2000) Power prior distributions for regression models. Statistical Science 15 (1), pp. 46–60.
[8] (2017-06) Prognosis of prostate cancer with initial prostate-specific antigen >1,000 ng/mL at diagnosis. OncoTargets and Therapy 10, pp. 2943–2949.
[9] (2019-03) Propensity-score-based priors for Bayesian augmented control design. Pharmaceutical Statistics 18 (2), pp. 223–238.
[10] (2016-02) A comparison of two methods of estimating propensity scores after multiple imputation. Statistical Methods in Medical Research 25 (1), pp. 188–204.
[11] (2019) Prognostic value of lactate dehydrogenase in metastatic prostate cancer: A systematic review and meta-analysis. Clinical Genitourinary Cancer 17 (6), pp. 409–418.
[12] (2017) Effects of expanding the lookback period to all available data in the assessment of covariates: Effects of Expanding the Lookback Approach. Pharmacoepidemiology and Drug Safety 26.
[13] (2009) A note on the power prior. Statistics in Medicine 28 (28), pp. 3562–3566.
[14] (1976) The combination of randomized and historical controls in clinical trials. Journal of Chronic Diseases 29 (3), pp. 175–188.
[15] (2020) Using electronic health record data to identify comparator populations for comparative effectiveness research. Journal of Medical Economics, pp. 1–5.
[16] (2020-04) Beyond Randomized Clinical Trials: Use of External Controls. Clinical Pharmacology & Therapeutics 107 (4), pp. 806–816.
[17] (2014-07) Alkaline phosphatase: an overview. Indian Journal of Clinical Biochemistry 29 (3), pp. 269–278.
[18] (1993) Performance status assessment in cancer patients. An interobserver variability study. British Journal of Cancer 67 (4), pp. 773–775.
[19] The YODA Project.
[20] (2015-01) Metastatic castration-resistant prostate cancer: time for innovation. Future Oncology 11 (1), pp. 91–106.
[21] (2019-08) Design and Evaluation of an External Control Arm Using Prior Clinical Trials and Real-World Data. Clinical Cancer Research 25 (16), pp. 4993.
[22] (2019) Propensity score-integrated composite likelihood approach for incorporating real-world evidence in single-arm clinical studies. Journal of Biopharmaceutical Statistics, pp. 1–13.