A key success factor in clinical trials is the timely recruitment of trial participants. A 2020 study of 87 trials from the Tufts Center for the Study of Drug Development suggests that efforts to optimize the recruitment process in recent years have been successful: enrollment timelines were within target in 77% of the studies sampled, up from 48% in 2012 (tuftsCSDD_2020). That study, however, was limited in scope to a select few large pharmaceutical companies, and other recent studies suggest that more precise and granular prediction of patient recruitment at the trial design stage is needed to further reduce the risk of recruitment delays and failures (briel_2016; healy_2018).
Research in this field is often limited by the volume of clinical trial data available to generate meaningful enrollment rate estimates and to validate proposed methods. A recent review by gkioni_2019 described multiple statistical models that have been proposed over the years; however, 11 of the 13 reviewed papers required investigator-provided enrollment rate estimates to generate predictions at the design stage. The remaining two papers either did not validate their methods or validated on only a small sample of studies (bakhshi2013; Gajewski2008). Additionally, many of these past approaches assume that enrollment rates are constant over time or even across trial sites, which rarely holds in practice. While other approaches likely exist, methods and evaluation results for recruitment prediction in clinical trials are rarely reported.
In this work, we propose a data-driven, machine learning (ML) approach to predict site enrollment counts for every month over the course of a prospective clinical trial while accounting for variations in site initiation times, seasonality, and other site and study characteristics. The approach is trained on patient enrollment records from thousands of historical clinical trials and enables predictions to be made using only the characteristics of the prospective trial and a list of trial sites. To the best of our knowledge, this is the first reported application of ML to clinical trial enrollment prediction in which the performance of multiple methods is evaluated and reported on a large number of trials.
2.1 Data Sources
Medidata’s Electronic Data Capture (EDC) system integrates clinical trial operational data globally from various sponsors (the sponsor of a clinical trial is the pharmaceutical company that initiated and oversees the trial) and contract research organizations (a CRO is a company that is optionally contracted to manage clinical trials for a sponsor) across disease areas. To predict trial enrollment we leverage operational data at the trial (i.e., “study”), trial site (i.e., “study-site”) and patient levels, including: 1) trial eCRF (electronic case report form) finalization date, whose entry in the EDC system is used as a proxy for the trial initialization date; disease indication, which corresponds to one or more ICD-9 disease codes and rolls up hierarchically into indication groups and then therapeutic areas; phase; and sponsor and/or CRO; 2) study-site creation date (the EDC system date on which a healthcare facility record is added to a trial database, used as a proxy for the study-site activation date, i.e., the time from which a site can begin enrolling) and country; 3) patient enrollment status and enrollment date. All patient data has been removed or anonymized, and trial sponsors and/or CROs have granted Medidata authorization to use this combined operational data for research purposes.
Medidata Acorn AI has invested substantially in standardizing site metadata across trials. By the time the data was generated for this work, about 20K unique healthcare facilities had been identified across over 500K study-sites. This unique site identifier is used to generate features that track site enrollment performance in previous trials, as explained in detail in sec:feature.
Additionally, public data sources are used to augment the trial operational data, including age and gender inclusion criteria and number of trial arms from the AACT database (ctgov), and country population and population density data from The World Bank Group (worldbank).
2.2 Cohort Summary
Our initial cohort contains over 11K trials that were initiated between 2008 and 2020 with at least 1 patient enrolled. Trials with potentially erroneous or incomplete data are removed, such as those with abnormally short enrollment duration (e.g., less than 4 months), missing or misaligned key enrollment milestone dates, and/or unknown therapeutic area (TA). Trials with likely bulk-uploaded enrollment data are detected and excluded, such as those with over 90% of subjects enrolled in the same calendar month. Certain types of studies are excluded because of their distinct enrollment patterns or limited data availability, including observational, pediatric and device trials. tab:waterfall in the appendix summarizes the number of studies, study-sites and patients after each filtering step.
The final modeling dataset contains about 1.2 million enrolled patients from 7,542 trials and over 270K study-sites. The trials span over 400 sponsors, 80 CROs, 15 TAs, 600 indications and phases I through IV. Oncology is the largest TA (43% of studies), and phases II and III are the dominant phases (30% and 43% of studies, respectively). Over 18K unique healthcare facilities from over 90 countries are represented, with the U.S. being the most represented country (40% of study-sites). tab:datasummary in the appendix summarizes the enrollment related statistics.
3.1 Problem Definition
The aim of the present work is to predict the monthly enrollment count for a clinical trial site throughout the study-site enrollment duration, i.e., from the study-site creation date until the target enrollment for the trial has been reached. These monthly site enrollment count predictions can then be aggregated to estimate country and study level enrollment timelines. Of note, we use the observed study-site creation date for model training and evaluation in the present work for simplicity; additional information regarding the prediction of time to study-site creation is provided in sec:site enrollment start date.
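As a minimal illustration of this aggregation (the function and numbers below are hypothetical, not from our dataset), per-site monthly predictions can be summed across sites and accumulated until the trial's enrollment target is reached:

```python
from itertools import accumulate

def months_to_target(site_monthly_preds, target):
    """Roll per-site monthly enrollment predictions up to a study-level
    timeline; return the first month (1-indexed) in which the cumulative
    predicted enrollment reaches the target, or None within the horizon."""
    horizon = max(len(p) for p in site_monthly_preds)
    # Sum predictions across sites per calendar month, padding short sites with 0.
    study_monthly = [sum(p[m] if m < len(p) else 0.0 for p in site_monthly_preds)
                     for m in range(horizon)]
    for month, total in enumerate(accumulate(study_monthly), start=1):
        if total >= target:
            return month
    return None

# Two hypothetical sites; the second is created (activated) one month later.
site_a = [0.0, 1.5, 2.0, 2.0, 1.0]
site_b = [0.0, 0.0, 1.0, 2.5, 2.5]
print(months_to_target([site_a, site_b], target=8))  # month 4
```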
3.2 Feature Generation
We generate features at the study, country, and site levels, as well as time-varying features at the site-month level, using similar methods to capture performance from relevant historical trials. Relevant historical trials include those that completed enrollment at the same clinical trial site, within the same country, and/or had similar trial characteristics such as the TA or phase. Performance metrics of interest include site enrollment rates, monthly site enrollment counts, site query rates, and other data quality metrics. For example, to compute the mean historical enrollment rate by TA and country for a U.S. site in an oncology trial with an eCRF finalization date of January 10, 2018, we would compute the mean site enrollment rate of all study-sites in the U.S. that completed enrollment in oncology trials prior to January 10, 2018.
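The date cutoff in this computation matters: only study-sites that completed enrollment before the prospective trial's eCRF finalization date may contribute, to avoid leakage from the future. A minimal sketch of the grouped historical-rate feature (records, field names and values are illustrative):

```python
from datetime import date
from statistics import mean

def hist_mean_rate(history, ta, country, cutoff):
    """Mean site enrollment rate over historical study-sites in the same
    TA and country that completed enrollment before the cutoff date.
    Returns None when no qualifying history exists (handled downstream
    by hierarchical imputation)."""
    rates = [h["rate"] for h in history
             if h["ta"] == ta and h["country"] == country
             and h["enroll_end"] < cutoff]
    return mean(rates) if rates else None

history = [
    {"ta": "Oncology", "country": "US", "enroll_end": date(2016, 5, 1), "rate": 0.30},
    {"ta": "Oncology", "country": "US", "enroll_end": date(2017, 8, 1), "rate": 0.50},
    # Completed after the cutoff, so it must be excluded to avoid leakage.
    {"ta": "Oncology", "country": "US", "enroll_end": date(2019, 2, 1), "rate": 0.90},
]
print(hist_mean_rate(history, "Oncology", "US", date(2018, 1, 10)))
```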
Additional features aiming to estimate disease prevalence are also derived using historical trial data. More specifically, the indication disease prevalence is estimated globally and within each country by computing the percentage of patients enrolled in historical trials for a given indication compared to patients enrolled historically across all indications.
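This prevalence estimate is a simple enrollment share; a sketch with hypothetical enrollment counts (indication names and numbers are illustrative):

```python
def indication_prevalence(enrollment, indication, country=None):
    """Estimate indication prevalence as the share of historically enrolled
    patients in that indication, globally or within one country.
    `enrollment` rows are (indication, country, n_patients)."""
    rows = [r for r in enrollment if country is None or r[1] == country]
    total = sum(n for _, _, n in rows)
    in_indication = sum(n for ind, _, n in rows if ind == indication)
    return in_indication / total if total else None

enrollment = [("NSCLC", "US", 300), ("NSCLC", "DE", 100), ("Psoriasis", "US", 600)]
print(indication_prevalence(enrollment, "NSCLC"))        # global share
print(indication_prevalence(enrollment, "NSCLC", "US"))  # within-country share
```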
These historical features are often missing due to lack of historical trials within a particular set of grouping variables. In general, we apply various forms of hierarchical imputation that use values from features with higher level groupings to impute missing values for features with lower level groupings. For example, for features that are grouped by indication, we impute missing values using the corresponding value at the indication group level, followed by the TA level if missing values still exist.
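The fallback chain can be sketched as a walk up the grouping hierarchy (feature names and values below are hypothetical):

```python
def impute_hierarchical(value, fallbacks):
    """Return the first non-missing value walking up the grouping hierarchy,
    e.g. indication -> indication group -> TA."""
    for v in [value, *fallbacks]:
        if v is not None:
            return v
    return None

# Hypothetical feature: mean historical enrollment rate at three grouping levels.
indication_rate = None        # no history for this exact indication
indication_group_rate = 0.35  # history exists one level up
ta_rate = 0.42
print(impute_hierarchical(indication_rate, [indication_group_rate, ta_rate]))  # 0.35
```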
To capture variation over time, we include a number of time related features at the study-site-month level such as the number of months since the study-site creation date, the calendar month and the number of days in a given month. In total, our starting feature set consists of roughly 100 study level, 30 country level, 100 site level and 15 site-month level features.
Baseline Models We develop two baseline models with constant study-site level enrollment rate predictions for performance comparison. The first uses the mean historical enrollment rate (defined as the study-site enrollment count divided by the study-site enrollment duration) observed for a given site within a given indication. If a site lacks history within a given indication, we impute the value hierarchically, starting from country and indication, then indication, indication group and finally TA.
The second baseline is an XGBoost (xgb) model trained on site enrollment rate with a Tweedie (tweedie) loss function and the study, country and site level features described in Section 3.2.
Monthly Enrollment Prediction Models We explored three methods for this prediction problem: LightGBM (lightGBM) with a Tweedie loss function, Zero-Inflated Poisson (ZIP) regression (zip), and a family of hurdle models (Hurdle) with Poisson, truncated Poisson or negative binomial count distributions. Selection of these methods was motivated primarily by the distinctive shape of the monthly site enrollment count distribution, which has a large mass at zero due to non-enrolling sites. LightGBM offers additional advantages in handling non-linear relationships, interactions and missing values, as well as a faster training time than other ensemble tree models such as XGBoost.
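For intuition on how these model families handle the zero mass, the expected monthly count under a zero-inflated Poisson is (1 - pi)*lambda, while a Poisson hurdle model combines a binary part with a zero-truncated Poisson mean. A small numeric sketch (the parameter values are hypothetical, not fitted):

```python
import math

def zip_mean(pi_zero, lam):
    """Expected count under a zero-inflated Poisson: with probability
    pi_zero the site-month is a structural zero, otherwise Poisson(lam)."""
    return (1.0 - pi_zero) * lam

def hurdle_poisson_mean(p_zero, lam):
    """Expected count under a Poisson hurdle model: the binary part gives
    P(Y = 0) = p_zero; positive counts follow a zero-truncated Poisson
    with mean lam / (1 - exp(-lam))."""
    return (1.0 - p_zero) * lam / (1.0 - math.exp(-lam))

# With 60% of site-months at zero, both models shrink the raw Poisson mean,
# but the hurdle model's truncated count part pushes its mean slightly higher.
print(round(zip_mean(0.6, 2.0), 3))
print(round(hurdle_poisson_mean(0.6, 2.0), 3))
```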
3.4 Feature Selection and Processing for Monthly Enrollment Prediction Models
Prior to model training, we apply a set of pre-processing steps, including winsorizing extreme values of the historical features, removing highly correlated features, and pooling categorical features such as indication. For the ZIP and hurdle models, additional steps are required, such as imputing missing data, one-hot encoding categorical variables, removing highly sparse variables, and normalizing feature values. The final models contain 90 features (about 50 study level, 10 country level and 25 site level features), and all site-month level features are preserved.
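The winsorization step can be sketched as clipping to empirical percentiles (a minimal stdlib version with a nearest-rank percentile; the cutoffs shown are illustrative, and production code would typically use numpy/scipy equivalents):

```python
def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clip extreme values to empirical percentiles (nearest-rank)."""
    s = sorted(values)
    n = len(s)
    lo = s[min(n - 1, int(lower_pct * (n - 1)))]
    hi = s[min(n - 1, int(upper_pct * (n - 1)))]
    return [min(max(v, lo), hi) for v in values]

values = list(range(100)) + [10_000]  # one extreme historical rate
print(max(winsorize(values)))  # the outlier is clipped to 99
```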
4 Result Summary
The modeling dataset is split randomly by study into discovery and hold-out test sets using a 3:1 ratio. The discovery set is then randomly split by study into 5 cross-validation folds. We conduct model hyper-parameter tuning, including the selection of ZIP and hurdle model distributions with cross-validation. Appendix B provides discussion on random versus time-based split.
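The splitting scheme can be sketched as follows (a simplified version: the fold assignment and seed are illustrative, not the exact procedure used in the paper):

```python
import random

def split_by_study(study_ids, test_frac=0.25, n_folds=5, seed=7):
    """Randomly split studies into discovery/hold-out sets (3:1) and assign
    discovery studies to cross-validation folds. Splitting by study keeps
    all sites and months of a trial in the same partition."""
    rng = random.Random(seed)
    ids = list(study_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    holdout, discovery = ids[:n_test], ids[n_test:]
    folds = {sid: i % n_folds for i, sid in enumerate(discovery)}
    return discovery, holdout, folds

discovery, holdout, folds = split_by_study(range(100))
print(len(discovery), len(holdout), len(set(folds.values())))
```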
To compare model performance, we evaluate the mean squared error (MSE) and mean absolute error (MAE) of predicted versus observed enrollment at the study, study-site and study-site-month levels. The study level prediction performance is the most important as it drives study enrollment planning and is less affected by noise from the more granular levels.
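The multi-level evaluation amounts to aggregating predictions before scoring; a sketch with hypothetical site-month records:

```python
def mae(pairs):
    """Mean absolute error over (prediction, observation) pairs."""
    return sum(abs(p - o) for p, o in pairs) / len(pairs)

def study_level_mae(records):
    """Aggregate site-month predictions and observations to study totals,
    then compute MAE. `records` rows are (study_id, pred, obs)."""
    totals = {}
    for sid, pred, obs in records:
        p, o = totals.get(sid, (0.0, 0.0))
        totals[sid] = (p + pred, o + obs)
    return mae(totals.values())

records = [("A", 1.2, 1), ("A", 0.8, 2), ("B", 2.5, 2), ("B", 1.5, 1)]
# Errors aggregate differently at the two levels, so the metrics differ.
print(round(mae([(p, o) for _, p, o in records]), 2))
print(round(study_level_mae(records), 2))
```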
tab:result in the appendix shows the detailed cross-validation results across models. The ZIP model has the best study level MAE and MSE, outperforming the historical rate baseline by 53% and 80%, respectively. The LightGBM model has the same study level MAE yet worse MSE compared to ZIP. The XGBoost site enrollment rate baseline has the best study-site-month level MAE. However, it tends to under-predict the enrollment rate, leading to inferior study level performance.
A secondary consideration is the model training and prediction time. The ZIP model takes over a day to train while the LightGBM model only takes a few minutes. Given comparable model performance, the LightGBM model is selected as the final model for evaluation on the held-out test set.
tab:result1 shows the study level MAE under cross-validation and on the hold-out set. We also report the test-set MAE of study enrollment at two interim milestones: 50% and 90% of patients enrolled. A large improvement over the baselines is observed at each milestone. Additionally, the MSE and MAE at all three levels in the hold-out set are consistent with the cross-validation results for LightGBM.
Study Level MAE                      CV Results    Hold-out Test Set Results
Models                               Last PE       50% PE    90% PE    Last PE
Mean historical enrollment rate      136           63        104       120
XGB site enrollment rate             93            51        93        98
High Impact Features We evaluate feature importance in the LightGBM model using the average gain associated with each feature. The top five features are trial indication group, average historical enrollment by month within the same country or TA, site country, and the number of months since enrollment started; together they account for 42% of the total feature importance. Appendix C shows the relative importance of the top 20 features.
5 Discussion and Future Work
In this work, we show the power of machine learning models for clinical trial enrollment prediction. Given a rich set of trial operational data, we are able to evaluate a range of machine learning models that generate time-varying predictions at the study-site-month level for prospective clinical trials, largely outperforming the more commonly used projections of site enrollment rate. These predictions can serve as guidelines in the trial design stage and can be refined with subject matter expert input and observed enrollment as the trial progresses.
Despite the volume and granularity of the data used, the present work does not account for all factors that impact the complex process of trial recruitment. We discuss some prominent limitations of the current work and areas for development in Appendix A.
We thank our colleagues from the Data Operations team and Data Science Engineering team at Medidata Acorn AI for their efforts in data standardization, and building and maintaining the data pipeline behind the modeling dataset. We would also like to thank Jef Benbanaste and Anthony Ford for their review and comments that greatly improved the manuscript. Additionally, we thank Hrishikesh Karvir, Fanyi Zhang and Laura Katz for their contributions to an earlier version of this work.
Medidata’s analyses are based upon anonymized and aggregated clinical trial data developed pursuant to customer authorization. Medidata’s cross-industry cohort is used only under these protections; collaborations with Medidata would apply similar aggregation requirements.
Appendix A Future Work
This appendix continues the discussion from the main text of prominent limitations of the current work and areas for development.
A.1 Site Activation Date
As mentioned in Section 3.1, we assume the study-site creation date is known when training the model and reporting model performance in this work. For trial planning, a separate model has been developed to predict the time from the trial initialization date to each site's creation date, which can be used in combination with the monthly enrollment predictions. Interestingly, study level enrollment prediction performance with the predicted site creation dates is almost identical to that with the observed dates.
Separately, the actual site activation date is heavily impacted by country regulations and a trial sponsor’s planned activation schedule which is not explicitly captured in our historical performance data. We are exploring other systems that better capture site activation events to gain more understanding of this process and further improve our predictions in the trial design stage.
A.2 Additional Features to Explore
While the data explored in this work provide useful information about enrollment patterns in historical trials, additional information about patient availability near clinical trial sites, including information about competing trials recruiting similar patients nearby, is of interest for further model development. Similarly, data about epidemiological events such as COVID-19 and the associated local regulations near sites could further improve model performance under such conditions. In addition, better usage of trial design information, e.g., disease severity or line of therapy, could help account for enrollment differences across trials.
A.3 Prediction Interval Generation
In addition to the predicted mean enrollment over time, prediction intervals are needed to evaluate uncertainty in the enrollment forecast. Common methods used in the industry are Monte-Carlo simulation (ABBAS2007220) or analytical solutions based on certain underlying stochastic process assumptions (ani2011). However, we find that a simulation approach under the Tweedie distribution assumption has poor calibration results and requires the dispersion parameter to be scaled up substantially to capture the uncertainty in the data. Additionally, the simulation approach is relatively slow, and it can become exponentially complex to account for different levels of uncertainty (e.g., the elapsed time on the trial) and/or the covariance structure of sites within the same study. As a result, we are considering data-driven approaches such as quantile regression to alleviate these issues.
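As a rough sketch of the simulation approach (deliberately simplified: independent Poisson months, no site covariance, and hypothetical monthly means), Monte-Carlo intervals can be generated like this:

```python
import math
import random

def poisson_sample(rng, lam):
    """Knuth's multiplicative Poisson sampler; adequate for small means."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def enrollment_interval(monthly_means, n_sims=5000, alpha=0.1, seed=3):
    """Monte-Carlo 90% interval for total study enrollment assuming
    independent Poisson months; realistic calibration would require
    richer distributions and the site covariance discussed in the text."""
    rng = random.Random(seed)
    totals = sorted(sum(poisson_sample(rng, m) for m in monthly_means)
                    for _ in range(n_sims))
    lo = totals[int(n_sims * alpha / 2)]
    hi = totals[int(n_sims * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = enrollment_interval([2.0, 3.5, 4.0, 3.0, 1.5])  # mean total = 14
print(lo, hi)
```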
Appendix B Challenges and Performance with Time-based Split
For modeling data with a time component, a time-based training, validation and test split often provides the most realistic estimate of model performance on new data. However, the trial enrollment data used in the present work pose special challenges for a time-based split. First, trials can enroll for multiple years, and splitting by either enrollment start time or end time would inevitably bias the test set toward shorter or longer enrolling studies. Second, trial enrollment duration is often correlated with the therapeutic area; for example, oncology trials often take years to enroll while dermatology trials typically finish recruitment within a year. Thus, in addition to a disproportionate enrollment duration distribution across the training, validation and test datasets, there would be bias in the trial disease area. These concerns could be alleviated given a sufficiently large number of historical trials; however, the majority of studies in our modeling dataset started between 2014 and 2019, a relatively short period to split three ways given an average 2-year enrollment duration. Therefore, we use a random split by study to avoid the potential data distribution bias and its impact on model and feature selection decisions.
To understand model performance on future studies, we compare performance under a time-based split and the random split with the selected LightGBM model, parameters and features. For the time-based split, we train a model on historical trials with a trial initiation date before a given date and evaluate performance on trials that started in the following quarter, repeating this process for each quarter between 2012 and 2020. As a comparison, we use the cross-validation out-of-fold predictions for the same set of studies, where studies are split randomly into the cross-validation folds. Not surprisingly, we observe a study-level MAE increase of about 10 patients prior to 2015 under the time-based split, when the training set contains only a few studies. The gap gradually narrows to less than 2 patients beyond 2018. While the performance metrics are worse under the time-based split, the performance difference between the LightGBM model and the historical rate baseline remains substantial.
Appendix C Feature Importance
Appendix D Additional Tables
tab:waterfall shows the step-wise cohort filtering criteria and the resulting change in the data size. tab:datasummary shows enrollment related summary statistics across trials in the modeling data. tab:result provides more detailed cross-validation results.