1 Introduction
Predictive models, such as disease risk scores, are typically trained on a single data source, or a few, but are often expected to work well in other environments that may vary in their population characteristics, clinical settings, and policies (steyerberg_prediction_2016). In many cases, model performance deteriorates significantly in these external environments, as demonstrated repeatedly (e.g., ohnuma2017prediction), and most recently for the widely implemented proprietary Epic Sepsis Model (wong_external_2021) and for COVID-19 risk models (reps_implementation_2021).
Model robustness – that is, its ability to provide accurate predictions despite changes, e.g., in the characteristics of input covariates – can be demonstrated using external validation, the process of evaluating model performance on data sources that were not used for its derivation. However, full access to medical datasets is often limited due to privacy, regulatory, and commercial factors. Therefore, we aim to estimate the performance of a given model on external sources using only their more commonly available statistical characteristics.
Here, we propose an algorithm which reweights individuals in an internal sample to match external statistics, potentially reported in preceding publications or characterization studies (e.g., recalde_characteristics_2021); then estimates the performance on the external sample using the reweighted internal one. We focus on cases that are common in the healthcare domain, where the size of samples (that is, the number of individuals) is much larger than the number of features. In such cases, an infinite number of weight sets may recapitulate the external statistics; therefore, the proposed algorithm searches for weights with a minimal divergence from a uniform distribution.
We first study the strengths and limitations of our suggested approach using simulated data; then split a sample from a primary care dataset into "internal" versus "external" subsets based on demographic information, and validate the approach using a prediction model for year risk of complications in ulcerative colitis patients; and, finally, use the entire primary care data to estimate the performance of three stroke risk scores in seven external resources and compare it to the actual performance, as reported in a recent study (reps_feasibility_2020).
2 Related Work
The task of evaluating model performance in external samples, often with (at least some) data shift (finlayson_clinician_2021), is tightly coupled with that of training robust models, as evaluation is a necessary step in model selection and optimization.
One line of work handling data shifts adopts ideas from causal inference. Specifically, causal models (bareinboim_causal_2016) can distinguish invariant relations between risk factors (e.g., biological or physiological) and outcomes from context- or environment-dependent mechanisms (subbaswamy_preventing_2019). subbaswamy_evaluating_2021 developed a method for analyzing model robustness (or stability) that, given a model, a set of distribution-fixed (immutable) variables and a set of distribution-varying (mutable) variables, identifies the subpopulation with the worst average loss; thus, enabling evaluation of model safety, with no external information.
Sample reweighting is commonly applied to adjust for confounders, either measured (hainmueller_entropy_2012) or unmeasured (streeter_adjusting_2017), and to account for selection bias (kalton_weighting_2003), typically leveraging fully accessible samples. Methodologically, the optimization problem we derive is similar to that studied for entropy balancing (hainmueller_entropy_2012), which attempts to reweight a sample (e.g., control group) so its pre-specified set of moments exactly matches that of another sample (e.g., treatment group), while maximizing the weight entropy (that is, keeping weights as close as possible to uniform). We note, however, that we explore a different use case and, consequently, optimize over moments of an otherwise inaccessible sample (rather than samples from an accessible data source).
3 Estimation Algorithm
The goal of the proposed method is to estimate the performance of a prediction model, e.g., a risk score, on an external sample, given some of its statistical properties, and using an internal, fully accessible dataset. Briefly, we reweight an internal sample to reproduce the external statistics, then compute model performance on the weighted sample as an estimate of the external performance.
3.1 Problem Formulation
Let $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$ denote an observation (or feature) vector and a binary outcome, respectively, for an individual $i$.²
²We focus here on binary outcomes, as these are commonly used – and reported – in healthcare applications; it is possible to extend the proposed approach to continuous outcomes, using an appropriate performance measure and statistical characteristics.
Suppose we have access to observations $\{(x_i, y_i)\}_{i=1}^{n}$ for $n$ individuals in an internal sample $S^{int}$, and summary statistics for an external sample $S^{ext}$ (with $m$ individuals):
$$\mu = \frac{1}{m} \sum_{j=1}^{m} t(x_j, y_j),$$
where $t$ is a set of transformations on individual-level observations. For example,
$$t(x, y) = \big(x \cdot y,\; x \cdot (1 - y),\; y\big)$$
allows computation of feature means in subsets of individuals with and without the outcome (as often reported in a study's Table 1).
We aim at estimating the performance of a model $f$ on the external sample $S^{ext}$, using $\mu$ and the observations from $S^{int}$. To this end, we search for weights $w = (w_1, \ldots, w_n)$, such that the statistical properties of the weighted internal sample approximate those of the external one. Let $\mathcal{W}$ denote the space of such weight sets:
$$\mathcal{W} = \left\{ w : w \ge 0,\ \sum_{i=1}^{n} w_i = 1,\ T^\top w = \mu \right\},$$
where $T$ is a matrix whose rows are $t(x_i, y_i)$. As $\mathcal{W}$ may be infinitely large, we propose to search for a set of weights in $\mathcal{W}$ that is also closest to uniform. This additional constraint is based on a proximity assumption: intuitively, that the external distribution is relatively similar to the distribution in $\mathcal{W}$ that is closest to the internal distribution.
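As a concrete sketch (variable names and the simulated data are ours, not the paper's), the matrix $T$ of transformed observations and the external moment vector can be assembled as follows, assuming the example transformation $t(x, y) = (x \cdot y,\ x \cdot (1-y),\ y)$:

```python
import numpy as np

def transform(X, y):
    """t(x, y) = (x*y, x*(1-y), y): per-class feature moments plus outcome rate."""
    y = y.reshape(-1, 1)
    return np.hstack([X * y, X * (1 - y), y])

rng = np.random.default_rng(0)

# Internal sample: n x d feature matrix and binary outcomes
X_int = rng.normal(size=(500, 3))
y_int = rng.integers(0, 2, size=500)
T = transform(X_int, y_int)  # rows are t(x_i, y_i)

# External statistics: in practice copied from a publication's Table 1;
# here we fabricate them from a second simulated sample, for illustration only.
X_ext = rng.normal(loc=0.3, size=(800, 3))
y_ext = rng.integers(0, 2, size=800)
mu = transform(X_ext, y_ext).mean(axis=0)
```

In practice, the external moments would be copied from a publication's baseline-characteristics table rather than computed from raw data.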
Using the reweighted sample we can now estimate two types of performance measures:

Measures that can be expressed as a pointwise loss function, $\ell(f(x), y)$, for which we estimate the expected loss of the model $f$ in the external sample as:
$$\hat{L}^{ext}(f) = \sum_{i=1}^{n} w_i \, \ell(f(x_i), y_i).$$
For example, for a model that computes the probability of an outcome, $f(x) = \hat{P}(y = 1 \mid x)$, we can estimate the expected negative log-likelihood by setting $\ell(f(x), y) = -\big[y \log f(x) + (1 - y) \log (1 - f(x))\big]$.
Non-decomposable measures that can be evaluated on weighted samples, for example, the area under the receiver operating characteristic curve (AUC).
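Both types of measures can be computed on a weighted sample with standard tooling; the following minimal Python sketch (the paper's experiments use R) evaluates a weighted negative log-likelihood and a weighted AUC for synthetic scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 1000
y = rng.integers(0, 2, size=n)
# Synthetic model scores, mildly informative of the outcome
p = np.clip(0.3 * y + 0.7 * rng.uniform(size=n), 1e-6, 1 - 1e-6)
w = np.full(n, 1.0 / n)  # weights produced by the reweighting step; uniform here

# Pointwise measure: weighted negative log-likelihood
nll = np.sum(w * -(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Non-decomposable measure: weighted AUC
auc = roc_auc_score(y, p, sample_weight=w)
```

With uniform weights these reduce to the usual internal estimates; plugging in the optimized weights yields the external estimates.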
Below we present a model-independent scheme, which minimizes an $f$-divergence function (for example, maximizes the weight entropy); and in app:modeldep, we derive a model- (and loss-) dependent scheme, which maximizes a weighted upper bound on the model loss and the regularized divergence function.
Model-independent optimization scheme.
To find a weighted representation of an internal sample that replicates the external expectations, we solve the following optimization problem:
(1) $\min_{w \in \mathcal{W}} D_f(w \,\|\, u)$

where $u = (1/n, \ldots, 1/n)$ denotes the uniform distribution; the $f$-divergence, for discrete measures $w$ and $u$, is:
$$D_f(w \,\|\, u) = \sum_{i=1}^{n} u_i \, f\!\left(\frac{w_i}{u_i}\right),$$
and $f$ is a convex function with $f(1) = 0$. For example, when $f(t) = t \log t$, Optimization Problem (1) becomes:

(2) $\max_{w \in \mathcal{W}} H(w)$

where $H(w) = -\sum_{i=1}^{n} w_i \log w_i$ is the entropy function.
We derive a dual formulation of Problem (2), similar to hainmueller_entropy_2012, in app:dual; and show that the optimal solution has the form:
$$w_i^* = \frac{\exp\!\left(\lambda^\top t_i\right)}{\sum_{j=1}^{n} \exp\!\left(\lambda^\top t_j\right)},$$
where $t_i = t(x_i, y_i)$ and $\lambda$ is a vector of optimal dual variables. In other words, the optimal weights are normalized exponents of a linear function of $t_i$. We note that, as the number of features is typically much smaller than the sample size, the solution to the dual problem is expected to be more numerically stable than the primal's.
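A minimal sketch of the dual approach, under the simplex-constrained entropy formulation above: minimizing the convex objective $\log \sum_i \exp(\lambda^\top t_i) - \lambda^\top \mu$ over $\lambda$ yields weights $w_i \propto \exp(\lambda^\top t_i)$ whose weighted moments match $\mu$ (function and variable names are ours):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def solve_dual(T, mu):
    """Entropy-balancing dual: min over lam of log(sum_i exp(T_i @ lam)) - lam @ mu.
    At the optimum, weights w_i ∝ exp(T_i @ lam) satisfy T.T @ w = mu."""
    def obj(lam):
        return logsumexp(T @ lam) - lam @ mu

    def grad(lam):
        z = T @ lam
        w = np.exp(z - logsumexp(z))
        return T.T @ w - mu  # gradient is the moment mismatch

    res = minimize(obj, np.zeros(T.shape[1]), jac=grad, method="L-BFGS-B")
    z = T @ res.x
    return np.exp(z - logsumexp(z))  # normalized weights

# Toy example: reweight standard-normal draws so their weighted mean is 0.5
rng = np.random.default_rng(2)
T = rng.normal(size=(2000, 1))
w = solve_dual(T, np.array([0.5]))
```

The dual has only as many variables as there are reported statistics, which illustrates the numerical-stability remark above.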
In cases where $\mathcal{W}$ is empty, Problem (1) can be adjusted to trade off, using a hyperparameter $\gamma$, the accuracy at which the weighted internal sample reproduces the external statistics against the proximity to uniform weights, and rewritten as:
(3) $\min_{w \ge 0,\ \sum_i w_i = 1} \; D_f(w \,\|\, u) + \gamma \left\| T^\top w - \mu \right\|$

where the norm can be $\ell_1$ or $\ell_2$.
3.2 Detecting Estimation Failure
To estimate the performance of a model in an external sample $S^{ext}$, with distribution $P^{ext}$ of transformed features, we assume that $P^{int}(t) > 0$ whenever $P^{ext}(t) > 0$. This condition is analogous to the positivity assumption in causal inference, except that it is one-sided; in other words, the support of $P^{ext}$ can be a strict subset of the support of $P^{int}$. Although this assumption cannot be verified, its violations can be detected when external expectations cannot be attained in the internal sample.
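A simple necessary check along these lines (our illustration, not the paper's procedure): a weighted mean of any transformed feature can only lie between the minimum and maximum observed internal values, so an external expectation outside that range certifies that the constraint set is empty and one-sided positivity is violated:

```python
import numpy as np

def detect_positivity_violations(T, mu, tol=1e-9):
    """A weighted mean of column j can only lie in [min_i T_ij, max_i T_ij];
    an external expectation outside that range cannot be attained by any
    weighting of the internal sample, flagging a positivity violation."""
    lo, hi = T.min(axis=0), T.max(axis=0)
    return np.where((mu < lo - tol) | (mu > hi + tol))[0]

T = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])  # feature 0 is constant zero
mu = np.array([0.4, 2.0])                           # but its external mean is 0.4
violating = detect_positivity_violations(T, mu)     # flags feature 0
```

Passing this check does not prove the assumption holds; failing it proves a violation.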
3.3 Implementation
We used R's CVXR (fu_cvxr_2020) library to solve the optimization problem and the WeightedROC library (hocking_weightedroc_2020) to compute weighted AUC. To deal with cases where the exact constraints could not be attained, we used the relaxed Problem (3). To alleviate numerical issues, we set a minimum weight parameter and removed features with a small standard deviation.

4 Empirical Evaluation
To evaluate the accuracy of the proposed algorithm, we estimated the external performance of various models using an internal sample and limited external statistical characteristics, in three scenarios: (a) simulating data using a structural equation model (bareinboim_causal_2016) for "internal" and "external" environments; training an outcome prediction model on the internal sample and evaluating its performance on the external one; (b) extracting a cohort of newly diagnosed ulcerative colitis individuals in IMRDUK data; synthetically splitting this cohort into "internal" and "external" samples; training a complication risk model on the internal sample and evaluating its performance on the external one; and (c) extracting atrial fibrillation patient cohorts in IMRDUK data as an internal sample; evaluating the performance of three stroke risk models in multiple inaccessible claims and EMR databases using their published statistical characteristics.
4.1 Synthetic Data
We simulated synthetic data using structural equation models that contain a hidden variable $H$, features $X$, a binary outcome $Y$, and a deterministic binary environment variable $E$, where $E = 0$ denotes an internal environment and $E = 1$ denotes an external one (fig:simmodel). This framework allows examining the strengths and limitations of the proposed algorithm subject to different types of data shifts. We defined the simulations using the following structural equations model:
where the coefficients of the feature vector are vectors, the rest of the coefficients are scalars, and the noise terms are independent sources of variability.
This model is similar to the anchor regression model (rothenhausler_anchor_2021), replacing the continuous outcome with a binary one. The dependency of $X$ on the hidden variable $H$ induces correlations between features, and the interaction term between $E$ and $H$ induces differences in the correlation structure between environments. Therefore, one coefficient controls the "strength" of the environment-dependent shift in correlations between features, and another controls the shift in the direct effect of $H$ on $Y$.
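The structural equations themselves are not reproduced above; as a rough illustrative sketch, the following simulator is patterned on an anchor-regression-style model with a binary outcome — all coefficients, loadings, and functional forms here are our assumptions, not the paper's:

```python
import numpy as np

def simulate(n, env, alpha=1.0, beta=0.5, rng=None):
    """Hypothetical anchor-regression-style SEM with a binary outcome.
    env: 0 for internal, 1 for external. alpha scales the environment-
    dependent shift in feature correlations (via the hidden variable H);
    beta scales the shift in the direct effect of H on Y."""
    rng = rng or np.random.default_rng()
    E = np.full(n, env)
    H = rng.normal(size=n)  # hidden variable, shared across features
    # the E*H interaction shifts the correlation structure across environments
    X = (np.outer(H, [1.0, 0.5, -0.5])
         + alpha * np.outer(E * H, [0.5, -0.5, 1.0])
         + rng.normal(size=(n, 3)))
    logit = X @ np.array([1.0, -1.0, 0.0]) + beta * E * H
    Y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
    return X, Y

X, Y = simulate(500, 1, rng=np.random.default_rng(7))  # one external sample
```

Varying `alpha` reproduces the weak-to-strong correlation-shift scenarios studied below.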
4.1.1 Implementation
Here, we set the dimension of $X$ and sampled the model coefficients; we let only a subset of the features affect the outcome, by setting the remaining outcome coefficients to zero.
As studies do not typically report correlations between features within each outcome class, we tested our algorithm under different scenarios of correlation shifts. Specifically, we used three configurations of the interaction coefficient, emulating weak, medium, and strong correlation shifts, respectively.
Given a specific simulation model, we generated three data sets, namely internal training and test sets and an external data set. We computed the mean and variance of every feature, separately for individuals with and without the outcome, in the external set. Next, we trained an elastic-net-regularized logistic regression model on the internal training set and computed the AUC on the internal test and external sets. Finally, we applied the performance estimation algorithm on the internal test set, using the external expectations, and compared the estimated AUC to the actual one.
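This evaluation protocol can be sketched end to end; in the sketch below, plain logistic regression stands in for the elastic net, and the data-generating process is a simplified mean shift rather than the full structural model:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)

def sample(n, shift):
    """Simplified data generator: a mean shift between environments."""
    X = rng.normal(loc=shift, size=(n, 3))
    p = 1 / (1 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]))))
    return X, (rng.uniform(size=n) < p).astype(int)

def transform(X, y):
    y = y.reshape(-1, 1)
    return np.hstack([X * y, X * (1 - y), y])

X_tr, y_tr = sample(2000, 0.0)   # internal training set
X_te, y_te = sample(2000, 0.0)   # internal test set
X_ex, y_ex = sample(2000, 0.4)   # external set (normally inaccessible)

model = LogisticRegression().fit(X_tr, y_tr)
s_te = model.predict_proba(X_te)[:, 1]
s_ex = model.predict_proba(X_ex)[:, 1]

# Reweight the internal test set to match external moments (entropy-balancing dual)
T, mu = transform(X_te, y_te), transform(X_ex, y_ex).mean(axis=0)
lam = minimize(lambda l: logsumexp(T @ l) - l @ mu,
               np.zeros(T.shape[1]), method="L-BFGS-B").x
w = np.exp(T @ lam - logsumexp(T @ lam))

auc_internal = roc_auc_score(y_te, s_te)
auc_external = roc_auc_score(y_ex, s_ex)                   # ground truth, for comparison
auc_estimated = roc_auc_score(y_te, s_te, sample_weight=w)  # our estimate
```

The comparison of `auc_estimated` against `auc_external` mirrors the evaluation reported in tab:simcharacteristics and fig:simerror.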
Supplementary fig:simexamples presents examples of generated samples with varying shift strengths. For each setting, we generated multiple models, and from each sampled data sets of varying sizes.
4.1.2 External Performance Estimation
The results of the proposed algorithm, in terms of divergence from uniform weights and AUC estimation accuracy, for different shift strengths and data sizes, are shown in tab:simcharacteristics. As expected, both the weight divergence from uniform and the estimation error grow with the strength of the shift.
fig:simerror presents the estimation error of the external AUC values as a function of correlation shift strength and sample size. Estimation quality is lower for strong shifts in correlations, which are not captured in the shared expectations, whereas milder differences result in good estimations. For comparison, the difference between the internal and actual external AUC values is reported in tab:simcharacteristics.
4.2 Synthetic Data Split: Complications of Ulcerative Colitis
Next, we studied the IMRDUK primary care data and synthetically split it into "internal" and "external" sets based on various demographic criteria. Specifically, we trained a model on the internal sample to predict the year risk of intestinal surgery (or death) in ulcerative colitis (UC) patients; estimated its performance on the external sample, using limited external statistics; and compared the estimated and observed performance.
4.2.1 Clinical Background
UC is a chronic inflammatory bowel condition with consistently increasing incidence rates in both newly industrialized and developed countries (benchimol_changing_2014; windsor_evolving_2019; kaplan_four_2021). The increase in its prevalence has a significant impact on healthcare financial burden due to chronically administered medications as well as hospitalizations and surgical procedures (windsor_evolving_2019).
UC pathogenesis is not well understood. Presumed risk factors for a more complicated disease include younger age at diagnosis, extensive disease, use of steroids and immunosuppressive drugs, and being a non-smoker (kolianipace_prognosticating_2019).
4.2.2 Implementation
The UC onset cohort includes individuals with at least two diagnoses of inflammatory bowel disease (IBD), or with an IBD diagnosis and a prescription for an IBD medication, who have an ulcerative colitis diagnosis and no Crohn's disease diagnosis. We set the index (or cohort entry) date to the first IBD diagnosis or medication prescription and required that individuals have a minimum observation period prior to the index date. We excluded subjects with insufficient follow-up.
For each individual in the ulcerative colitis cohort we extracted a set of features previously reported as associated with increased intestinal surgery risk (kolianipace_prognosticating_2019). These include age (and age$^2$), sex, smoking, being underweight or overweight, presence of perianal disease, and use of steroids. We also considered sets of predefined features (per OHDSI's Feature Extraction R library), e.g., drugs prescribed to at least 1,000 subjects up to 90 days after the index date. The outcome considers procedure codes for colostomy, colectomy, ileostomy, small intestinal resection, stricturoplasty, balloon dilation, drainage of perianal abscess, drainage of intraabdominal abscess, or death, within the years following the index date. Definitions of all concept sets and cohorts are available at https://atlasdemo.ohdsi.org/.
We split the IMRDUK data into internal and external sets based on individual age or country of living, as described below.
4.2.3 External Performance Estimation: Ulcerative Colitis, Split by Age
In the following experiments, we split the subset of individuals who live in England by their age. Specifically, in the first experiment, the internal set contained the youngest subjects and the external set the older ones; in the second experiment, the internal set contained the older individuals and the external set the remaining, older individuals. In each of these setups we randomly split the internal set into training and test sets, repeating the random training–test split multiple times. Next, we trained a linear model as well as a non-linear one, using XGBoost (chen_xgboost_2016); computed each model's AUC on the internal test and external sets; and estimated the external AUC using the internal set and the expectations of the external one. To maintain positivity and to emulate an environment-dependent hidden factor, we excluded age from the feature set. The populations differed in several observed characteristics, notably the percentages of women, underweight, and overweight individuals; see tab:uc.age for details.
Overall, the external performance estimations, using either elastic net or XGBoost, are close to the actual ones (fig:englandage), notably for external younger samples, where the difference between the internal and external performance is large (right panel).
4.2.4 External Performance Estimation: Ulcerative Colitis, Split by Country
Next, we split the UC cohort by country of residence and considered the subcohort of individuals living in England as the internal sample and those living in Scotland, Wales, and Northern Ireland as three distinct external samples. Similarly to the age-split analysis, we repeatedly split the internal sample into training and test sets; trained a model on each training set; extracted expectations for the external samples; and evaluated model performance on the internal test and external sets.
The characteristics of different subpopulations are presented in tab:uc_geog; fig:uk shows the external performance evaluation results, attesting to the (much) improved accuracy of the estimated AUC values, compared to internal performance.
4.3 Distinct Datasets: Stroke Risk Models
4.3.1 Clinical Background
Atrial fibrillation is a common cardiac rhythm disorder, associated with increased risk of stroke (sagris_atrial_2021). Risk factors associated with the occurrence of stroke include older age, various comorbidities (in particular, hypertension, diabetes, and renal disease), and smoking (singer_new_2013). To guide treatment, multiple risk scores have been devised and externally evaluated in several studies (van_den_ham_comparative_2015). Recently, reps_feasibility_2020 replicated five previously published prognostic models that predict stroke in females newly diagnosed with atrial fibrillation; and externally validated their performance across nine observational healthcare datasets. Below, we use our proposed algorithm and the limited per-database statistical characteristics, as they appear in reps_feasibility_2020, to estimate the external performance of these risk scores.
4.3.2 Implementation
We downloaded reps_feasibility_2020’s analysis package and applied it to the IMRDUK data, with the following modifications that adjust the study definitions to a primary care setting:
Target cohorts.
We considered ECG-related procedures and conditions, in addition to measurements, within 30 days prior to the atrial fibrillation diagnosis, as an optional inclusion criterion.
Outcome cohort.
As stroke, typically not diagnosed in a primary care setting, may be poorly recorded for deceased individuals, we added death as an entry event to the stroke cohort.
Feature definitions.
We extended the time window for extraction of model features to span the entire history of each individual until, and including, the date of the first atrial fibrillation event; included individuals with a low estimated glomerular filtration rate (eGFR, in mL/min/1.73m$^2$) in the end stage renal disease cohort, as originally defined in the ATRIA risk model (singer_new_2013); and defined former smokers as individuals with a recorded observation of smoking, as well as those diagnosed with tobacco dependence syndrome.
For each individual, the analysis package computed a stroke risk score given her set of features, as extracted from IMRDUK; it then calculated score performance vis-à-vis recorded stroke (and death) events. To estimate score performance in each external sample, we weighted individuals in the IMRDUK data using the proposed algorithm, to reproduce the sample's population characteristics as reported in reps_feasibility_2020, and computed the score performance for the weighted cohort. We computed confidence intervals using bootstrap iterations.
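The confidence-interval step can be sketched with a percentile bootstrap for a weighted AUC (the paper's exact bootstrap scheme is not detailed here; this is one plausible implementation, with illustrative data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_auc_ci(y, scores, w, n_boot=500, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a weighted AUC."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample individuals with replacement
        if len(np.unique(y[idx])) < 2:    # AUC needs both outcome classes
            continue
        aucs.append(roc_auc_score(y[idx], scores[idx], sample_weight=w[idx]))
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(6)
y = rng.integers(0, 2, size=400)
s = 0.4 * y + 0.8 * rng.uniform(size=400)  # synthetic risk scores
w = np.full(400, 1 / 400)                  # weights from the reweighting step
lo, hi = weighted_auc_ci(y, s, w)
```

Because AUC is invariant to the overall scale of the weights, the resampled weights need not be renormalized within each bootstrap iteration.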
Population attributes (reps_feasibility_2020) include the percentage of individuals in certain age groups, comorbidities (hypertension, congestive heart failure, congestive cardiac failure, coronary heart disease, valvular heart disease, chronic and end stage renal disease, proteinuria, diabetes, and rheumatoid arthritis), and being a former smoker.
4.3.3 External Performance Estimation
A comparison between risk score performance, as reported by reps_feasibility_2020, and the estimated performance is shown in fig:strokeallages. For the full cohort (top panel), in three out of six datasets, the confidence interval of the ATRIA estimation overlaps the actual AUC (fig:strokeallagesa); in two other datasets, the estimation is better than the internal, IMRDUK-based performance. Qualitatively similar results are observed for the CHADS2 and QStroke risk scores (fig:strokeallagesb and c, respectively), as well as for the older women subcohort (bottom panel).
We note that for two additional risk scores, Framingham and CHA2DS2-VASc, and two datasets, Ajou University School Of Medicine (AUSOM) and Integrated Primary Care Information (IPCI), reps_feasibility_2020 do not provide the necessary statistical characteristics. Additionally, AUC values of the IBM MarketScan® Medicare Supplemental Database (MDCR) were not provided for the full atrial fibrillation cohort, and those of the IBM MarketScan® Commercial Database (CCAE) are missing for the older female subcohort. Therefore, in all those cases, performance estimations are not reported.
5 Discussion
We presented an algorithm that estimates the performance of prediction models on external samples from their limited statistical characteristics; and demonstrated its utility using synthetic data, a synthetic split of an ulcerative colitis cohort from a single database by age group and by country of living, and a recent benchmark of stroke risk models on multiple external samples. Importantly, our proposed algorithm can help identify models that perform well across multiple clinical settings and geographies, even when detailed test data from such settings is not available. It can thus direct development of robust models and accelerate deployment to external environments.
The algorithm relies on two assumptions: one-sided positivity and proximity. Neither assumption can be fully tested, but clear violations of the former can be detected, for example, when the expected value of a feature is non-zero in the external distribution but all the individuals in the internal set have a zero value for that feature. Intuitively, proximity is more likely to be plausible when the statistical information becomes more detailed. Therefore, whereas our preliminary experiments involved only marginal statistics of features, it may be informative to test the performance of the algorithm when more detailed statistics are available, for example, interactions among features or information available in deep characterization studies (burn_deep_2020).
We believe that the proposed methodology can serve as a building block in network studies that aim to construct robust models across datasets when data sharing is limited, e.g., by regulatory constraints. Although federated learning methods may be a promising avenue for such scenarios, it would be interesting to explore in which cases the proposed algorithm can facilitate a one-shot federated learning scheme that does not require deployment of federated algorithm clients in all network nodes.
In future work, we will combine the proposed algorithm with methods that aim to construct robust models such as those that leverage distributionally robust optimization (buhlmann_invariance_2020); study methods that exploit the relations between calibration and robustness (wald_calibration_2022); and look into decomposing AUC (pmlrv54eban17a), so it can be optimized explicitly.
Institutional Review Board (IRB)
This study has been approved by IQVIA Scientific Review Committee (Reference numbers: 21SRC066, 22SRC002).
We thank Drs Roni Weisshof and Ramit Magen, Rambam Health Care Campus, Haifa, Israel, for their help in defining the ulcerative colitis predictive model; and Prof Seng Chan You, Yonsei University Health System, Seoul, Republic of Korea, for pointing us to the stroke external validation study. We are also grateful to KI Institute’s researchers and, specifically, to Nir Kalkstein and Yonatan Bilu, for many fruitful discussions.
References
Appendix A Modeldependent optimization scheme
An upper bound on a model $f$'s weighted loss, up to a finite-sample error, can be derived as follows:
The tightness of the bound may depend on the number of expectations we consider. Furthermore, as the set of transformations $t$, and consequently $\mathcal{W}$, may not represent all inter-feature dependencies existing in the data, an additional constraint may yield improved estimations:
(4) 
As we increase the number of constraints, the bound may become tighter but the confidence may decrease.
Appendix B Modelindependent dual optimization problem
Recall that optimization Problem (2) is defined as follows:

(5) $\min_{w} \sum_{i=1}^{n} w_i \log (n w_i) \quad \text{s.t.} \quad T^\top w = \mu,\ \mathbf{1}^\top w = 1,\ w \ge 0,$

where $T$ is the matrix whose rows are $t_i = t(x_i, y_i)$. Denoting

(6) $A = \begin{pmatrix} T^\top \\ \mathbf{1}^\top \end{pmatrix}, \qquad b = \begin{pmatrix} \mu \\ 1 \end{pmatrix},$

Problem (2) becomes:

(7) $\min_{w \ge 0} \sum_{i=1}^{n} w_i \log (n w_i) \quad \text{s.t.} \quad A w = b.$
Following Equation 5.11 in boyd2004convex, the dual function is:
$$g(\lambda) = -b^\top \lambda - f_0^*(-A^\top \lambda),$$
where $f_0^*$ is the conjugate of the (shifted) negative-entropy function $f_0(w) = \sum_i w_i \log (n w_i)$ (boyd2004convex, p. 222):
$$f_0^*(y) = \frac{1}{n} \sum_{i=1}^{n} e^{y_i - 1}.$$
Therefore,
$$g(\lambda) = -b^\top \lambda - \frac{1}{n} \sum_{i=1}^{n} e^{-(A^\top \lambda)_i - 1}.$$
The Lagrangian of the primal problem is:

(8) $L(w, \lambda) = \sum_{i=1}^{n} w_i \log (n w_i) + \lambda^\top (A w - b).$

Let $\lambda^*$ be the optimal solution of the dual problem. Then, following Section 5.5.3 of boyd2004convex, the solution of the primal problem minimizes the Lagrangian at $\lambda^*$:
$$w^* = \arg\min_{w} L(w, \lambda^*),$$
giving
$$w_i^* = \frac{\exp\!\left(-(A^\top \lambda^*)_i\right)}{\sum_{j=1}^{n} \exp\!\left(-(A^\top \lambda^*)_j\right)}.$$
This result shows that the optimal weights are normalized exponents of a linear function of the data points.
Appendix C Supplementary Figures
fig:simexamples: Examples of generated samples; each row shows a paired internal and external sample.