
Estimating Model Performance on External Samples from Their Limited Statistical Characteristics

02/28/2022
by   Tal El-Hay, et al.

Methods that address data shifts usually assume full access to multiple datasets. In the healthcare domain, however, privacy-preserving regulations as well as commercial interests limit data availability and, as a result, researchers can typically study only a small number of datasets. In contrast, limited statistical characteristics of specific patient samples are much easier to share and may be available from previously published literature or focused collaborative efforts. Here, we propose a method that estimates model performance in external samples from their limited statistical characteristics. We search for weights that induce internal statistics similar to the external ones and that are closest to uniform. We then use model performance on the weighted internal sample as an estimate of the external counterpart. We evaluate the proposed algorithm on simulated data as well as electronic medical record data for two risk models, predicting complications in ulcerative colitis patients and stroke in women diagnosed with atrial fibrillation. In the vast majority of cases, the estimated external performance is much closer to the actual one than the internal performance. Our proposed method may be an important building block in training robust models and detecting potential model failures in external environments.



1 Introduction

Predictive models, such as disease risk scores, are typically trained on a single, or a few, data sources but are often expected to work well in other environments that may vary in their population characteristics, clinical settings, and policies (steyerberg_prediction_2016). In many cases, model performance deteriorates significantly in these external environments, as demonstrated repeatedly (e.g., ohnuma2017prediction), and most recently for the widely implemented proprietary Epic Sepsis Model (wong_external_2021-1) and for COVID-19 risk models (reps_implementation_2021-2).

Model robustness – that is, its ability to provide accurate prediction despite changes, e.g., in the characteristics of input covariates – can be demonstrated using external validation, the process of evaluating model performance on data sources that were not used for its derivation. However, full access to medical datasets is often limited due to privacy, regulatory and commercial factors. Therefore, we aim to estimate the performance of a given model on external sources using only their more commonly available statistical characteristics.

Here, we propose an algorithm that reweights individuals in an internal sample to match external statistics, potentially reported in preceding publications or characterization studies (e.g., recalde_characteristics_2021); it then estimates the performance on the external sample using the reweighted internal one. We focus on cases that are common in the healthcare domain, where the size of the samples (that is, the number of individuals) is much larger than the number of features. In such cases, an infinite number of weight sets may recapitulate the external statistics; therefore, the proposed algorithm searches for weights with minimal divergence from a uniform distribution.

We first study the strengths and limitations of our suggested approach using simulated data; then split a sample from a primary care dataset into "internal" versus "external" subsets based on demographic information, and validate the approach using a prediction model for -year risk of complications in ulcerative colitis patients; and, finally, use the entire primary care data to estimate the performance of three stroke risk scores in seven external resources and compare it to the actual performance, as reported in a recent study (reps_feasibility_2020).

2 Related Work

The task of evaluating model performance in external samples, often with (at least some) data shift (finlayson_clinician_2021), is tightly coupled with that of training robust models, as evaluation is a necessary step in model selection and optimization.

One line of work handling data shifts adopts ideas from causal inference. Specifically, causal models (bareinboim_causal_2016) can distinguish invariant relations between risk factors (e.g., biological or physiological) and outcomes from context- or environment-dependent mechanisms (subbaswamy_preventing_2019). subbaswamy_evaluating_2021 developed a method for analyzing model robustness (or stability) that, given a model, a set of distribution-fixed (immutable) variables and a set of distribution-varying (mutable) variables, identifies the sub-population with the worst average loss, thus enabling evaluation of model safety with no external information.

Sample reweighting is commonly applied to adjust for confounders, either measured (hainmueller_entropy_2012) or unmeasured (streeter_adjusting_2017), and to account for selection bias (kalton_weighting_2003), typically leveraging fully-accessible samples. Methodologically, the optimization problem we derive is similar to that studied for entropy balancing (hainmueller_entropy_2012), which attempts to reweight a sample (e.g., a control group) so that its prespecified set of moments exactly matches that of another sample (e.g., a treatment group), while maximizing the weight entropy (that is, keeping the weights as close as possible to uniform). We note, however, that we explore a different use-case and, consequently, optimize over moments of an otherwise inaccessible sample (rather than samples from an accessible data source).

3 Estimation Algorithm

The goal of the proposed method is to estimate the performance of a prediction model, e.g., a risk score, on an external sample, given some of its statistical properties and using an internal, fully accessible dataset. Briefly, we reweight an internal sample to reproduce the external statistics, then compute model performance on the weighted sample as an estimate of the external performance.

3.1 Problem Formulation

Let $x_i$ and $y_i$ denote an observation (or feature) vector and a binary outcome, respectively, for an individual $i$. (We focus here on binary outcomes, as these are commonly used, and reported, in healthcare applications; it is possible to extend the proposed approach to continuous outcomes, using an appropriate performance measure and statistical characteristics.) Suppose we have access to observations for $n$ individuals in an internal sample,

$$D = \{(x_i, y_i)\}_{i=1}^{n},$$

and to summary statistics for an external sample (with $m$ individuals),

$$\mu = \frac{1}{m} \sum_{j=1}^{m} \phi(x_j, y_j),$$

where $\phi$ is a set of transformations on individual-level observations. For example,

$$\phi(x, y) = \big(x \cdot \mathbb{1}[y = 1], \; x \cdot \mathbb{1}[y = 0], \; \mathbb{1}[y = 1]\big)$$

allows computation of the feature means in the subsets of individuals with and without the outcome (as often reported in a study's Table 1).
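As an illustration of this construction, the following minimal Python sketch (the array names and toy data are hypothetical, and only class-gated feature means and the outcome rate are encoded) builds the per-individual transformations and the corresponding summary statistics:

```python
import numpy as np

def phi_matrix(X, y):
    """Per-individual transformations: features gated by outcome status,
    plus the outcome indicator itself.  Shape: (n, 2*d + 1)."""
    y = y.reshape(-1, 1).astype(float)
    return np.hstack([X * y, X * (1 - y), y])

def summary_stats(X, y):
    """Average of the per-individual transformations -- the kind of limited
    statistics an external site might publish."""
    return phi_matrix(X, y).mean(axis=0)

# toy illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (rng.random(500) < 0.3).astype(float)
mu = summary_stats(X, y)
```

Dividing the gated feature averages by the outcome rate (the last entry of the summary vector) recovers the class-conditional feature means, i.e., the quantities typically reported in a study's Table 1.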

We aim to estimate the performance of a model on the external sample, using $\mu$ and the observations in $D$. To this end, we search for weights $w = (w_1, \ldots, w_n)$, such that the statistical properties of the weighted internal sample approximate those of the external one. Let $\mathcal{W}$ denote the space of such weight sets:

$$\mathcal{W} = \Big\{ w \; : \; \Phi^{\top} w = \mu, \;\; \sum_{i=1}^{n} w_i = 1, \;\; w_i \ge 0 \Big\},$$

where $\Phi$ is a matrix whose rows are $\phi(x_i, y_i)$. As $\mathcal{W}$ may be infinitely large, we propose to search for the set of weights that is also closest to uniform. This additional constraint is based on a proximity assumption: intuitively, that the external distribution is relatively similar to the distribution, among those induced by weights in $\mathcal{W}$, that is closest to the internal distribution.

Using the reweighted sample we can now estimate two types of performance measures:

  • Measures that can be expressed as a pointwise loss function $\ell$, for which we estimate the expected loss of the model in the external sample as

    $$\widehat{L}^{\mathrm{ext}} = \sum_{i=1}^{n} w_i \, \ell\big(g(x_i), y_i\big).$$

    For example, for a model $g$ that computes the probability of an outcome, $g(x) \approx P(y = 1 \mid x)$, we can estimate the expected negative log-likelihood by setting $\ell\big(g(x), y\big) = -\big[\, y \log g(x) + (1 - y) \log\big(1 - g(x)\big) \big]$.

  • Non-decomposable measures that can be evaluated on weighted samples, for example, the area under the receiver operating characteristic curve (AUC); a minimal sketch of both types of estimators appears after this list.
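A minimal sketch of both estimators follows (plain NumPy; the paper's implementation uses R's WeightedROC for the weighted AUC, so the helper below is only an illustrative stand-in):

```python
import numpy as np

def weighted_log_loss(y, p, w, eps=1e-12):
    """Weighted negative log-likelihood: sum_i w_i * loss(g(x_i), y_i)."""
    p = np.clip(p, eps, 1 - eps)
    return np.sum(w * -(y * np.log(p) + (1 - y) * np.log(1 - p)))

def weighted_auc(y, p, w):
    """Weighted AUC: weight-proportional probability that a positive individual
    is ranked above a negative one, counting ties as one half."""
    pos, neg = (y == 1), (y == 0)
    total = 0.0
    for p_i, w_i in zip(p[pos], w[pos]):
        total += w_i * np.sum(w[neg] * ((p[neg] < p_i) + 0.5 * (p[neg] == p_i)))
    return total / (w[pos].sum() * w[neg].sum())
```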

Below we present a model-independent scheme, which minimizes an f-divergence function (for example, maximizes the weight entropy); and in Appendix A, we derive a model- (and loss-) dependent scheme, which maximizes a weighted upper bound on the model loss, regularized by the divergence function.

Model-independent optimization scheme.

To find a weighted representation of an internal sample that replicates the external expectations, we solve the following optimization problem:

$$\min_{w \in \mathcal{W}} \; D_f(w \,\|\, u) \qquad (1)$$

where $u = (1/n, \ldots, 1/n)$ denotes the uniform weights and $D_f$, the f-divergence for discrete measures $p$ and $q$, is

$$D_f(p \,\|\, q) = \sum_{i=1}^{n} q_i \, f\!\left(\frac{p_i}{q_i}\right),$$

and $f$ is a convex function, with $f(1) = 0$. For example, when $f(t) = t \log t$, Optimization Problem (1) becomes:

$$\max_{w \in \mathcal{W}} \; H(w) \qquad (2)$$

where $H$ is the entropy function.
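As an illustration, Problem (2) can be written almost verbatim in a convex-optimization modeling language; the sketch below uses CVXPY as a stand-in for the CVXR implementation described in Section 3.3, with `Phi` the matrix of transformed internal observations and `mu_ext` the external statistics:

```python
import cvxpy as cp
import numpy as np

def entropy_weights(Phi, mu_ext):
    """Problem (2): maximize the entropy of the weights subject to matching
    the external expectations exactly."""
    n = Phi.shape[0]
    w = cp.Variable(n, nonneg=True)
    constraints = [cp.sum(w) == 1, Phi.T @ w == mu_ext]
    problem = cp.Problem(cp.Maximize(cp.sum(cp.entr(w))), constraints)
    problem.solve()
    if w.value is None:
        raise RuntimeError("infeasible: the external statistics cannot be "
                           "attained by reweighting the internal sample")
    return np.asarray(w.value)
```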

We derive a dual formulation of Problem (2), similar to hainmueller_entropy_2012, in Appendix B; and show that the optimal solution has the form

$$w_i = \frac{\exp\!\big(\lambda^{\top} \phi(x_i, y_i)\big)}{\sum_{k=1}^{n} \exp\!\big(\lambda^{\top} \phi(x_k, y_k)\big)},$$

where $\lambda$ is a vector of coefficients determined by the optimal dual variables. In other words, the optimal weights are normalized exponents of a linear function of $\phi(x_i, y_i)$. We note that, as the number of features is typically much smaller than the sample size, the solution to the dual problem is expected to be more numerically stable than the primal's.
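The dual form also suggests a lightweight alternative that optimizes only over the low-dimensional dual variables; the following sketch (an illustration, not the paper's implementation) recovers the weights as normalized exponentials of a linear function of the transformed observations:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

def entropy_weights_dual(Phi, mu_ext):
    """Dual of the maximum-entropy problem: minimize the log-partition
    function minus the inner product with the external moments."""
    def objective(lam):
        return logsumexp(Phi @ lam) - lam @ mu_ext
    def gradient(lam):
        return Phi.T @ softmax(Phi @ lam) - mu_ext   # weighted moments - targets
    res = minimize(objective, x0=np.zeros(Phi.shape[1]), jac=gradient,
                   method="BFGS")
    return softmax(Phi @ res.x)   # weights: normalized exponents of Phi @ lam
```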

In cases where $\mathcal{W}$ is empty, Problem (1) can be adjusted to trade off, using a hyper-parameter $\gamma$, the accuracy at which the weighted internal sample reproduces the external statistics against the proximity to uniform weights, and rewritten as:

$$\min_{w} \; D_f(w \,\|\, u) + \gamma \, \big\| \Phi^{\top} w - \mu \big\| \quad \text{s.t.} \quad \sum_{i=1}^{n} w_i = 1, \;\; w_i \ge 0, \qquad (3)$$

where the norm can be $L_1$ or $L_2$.
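A corresponding sketch of the relaxed problem, under the assumption that the trade-off is expressed as a penalty on the norm of the moment mismatch (the placement and value of the hyper-parameter `gamma` are illustrative):

```python
import cvxpy as cp

def relaxed_entropy_weights(Phi, mu_ext, gamma=10.0, p=2):
    """Problem (3), sketched as a penalty: trade weight entropy against the
    L1/L2 distance between the weighted internal moments and the external ones."""
    n = Phi.shape[0]
    w = cp.Variable(n, nonneg=True)
    mismatch = cp.norm(Phi.T @ w - mu_ext, p)
    problem = cp.Problem(cp.Maximize(cp.sum(cp.entr(w)) - gamma * mismatch),
                         [cp.sum(w) == 1])
    problem.solve()
    return w.value
```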

3.2 Detecting Estimation Failure

To estimate the performance of a model in an external sample with distribution $p^{\mathrm{ext}}$ over the transformed features, we assume that $p^{\mathrm{int}}(z) > 0$ whenever $p^{\mathrm{ext}}(z) > 0$, where $p^{\mathrm{int}}$ denotes the corresponding internal distribution. This condition is analogous to the positivity assumption in causal inference, except that it is one sided: the support of $p^{\mathrm{ext}}$ can be a strict subset of the support of $p^{\mathrm{int}}$. Although this assumption cannot be verified directly, its violations can be detected when the external expectations cannot be attained by any reweighting of the internal sample.
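For example, when the external statistics are expectations of the transformed features, a simple feasibility check flags statistics that no reweighting of the internal sample can attain (a hypothetical helper):

```python
import numpy as np

def detect_positivity_violations(Phi, mu_ext, tol=1e-9):
    """Indices of external statistics that lie outside the range attainable by
    any convex reweighting of the internal transformed observations."""
    lo, hi = Phi.min(axis=0), Phi.max(axis=0)
    return np.where((mu_ext < lo - tol) | (mu_ext > hi + tol))[0]
```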

3.3 Implementation

We used R's CVXR library (fu_cvxr_2020) to solve the optimization problem and the WeightedROC library (hocking_weightedroc_2020) to compute the weighted AUC. To deal with cases where the external statistics could not be matched exactly, we used the relaxed Problem (3). To alleviate numerical issues, we set a minimum weight parameter and removed features with a small standard deviation.

4 Empirical Evaluation

To evaluate the accuracy of the proposed algorithm, we estimated the external performance of various models using an internal sample and limited external statistical characteristics, in three scenarios: (a) simulating data using a structural equation model (bareinboim_causal_2016) for "internal" and "external" environments, training an outcome prediction model on the internal sample, and evaluating its performance on the external one; (b) extracting a cohort of newly diagnosed ulcerative colitis individuals in the IMRD-UK data, synthetically splitting this cohort into "internal" and "external" samples, training a complication risk model on the internal sample, and evaluating its performance on the external one; and (c) extracting atrial fibrillation patient cohorts in the IMRD-UK data as an internal sample and evaluating the performance of three stroke risk models in multiple inaccessible claims and EMR databases using their published statistical characteristics.

4.1 Synthetic Data

We simulated synthetic data using structural equation models that contain a hidden variable $H$, features $X$, a binary outcome $Y$, and a deterministic binary environment variable $E$, where one value of $E$ denotes an internal environment and the other an external one (Figure 1). This framework allows examining the strengths and limitations of the proposed algorithm subject to different types of data shifts.

Figure 1: Graphical representation of the data-generating causal model. $E$ is an environment variable (e.g., in a clinical setting, a specific healthcare system), $H$ is a hidden variable (encoding, for example, an individual's healthcare status), $X$ is a set of observed features (e.g., prescribed medications or lab test results), and $Y$ is a binary outcome (e.g., disease onset or progression).

We defined the simulations using a structural equation model in which the hidden variable $H$ depends on the environment $E$; the features $X$ depend on $H$, $E$, and an interaction between them; and the binary outcome $Y$ is generated from the features and the environment. Two of the coefficients are vectors acting on the feature dimensions, the remaining coefficients are scalars, and independent noise terms provide additional sources of variability.

This model is similar to the anchor regression model (rothenhausler_anchor_2021), replacing the continuous outcome with a binary one. The dependency of $X$ on the hidden variable $H$ induces correlations between the features, and the interaction term induces differences in the correlation structure between environments. Therefore, one coefficient controls the "strength" of the environment-dependent shift in the correlations between the features, and another controls the shift in the direct effect of the environment on the outcome.
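For concreteness, the sketch below generates data in this spirit; the functional forms, coefficient values, and the logistic outcome link are assumptions for illustration and do not reproduce the paper's exact structural equations:

```python
import numpy as np

def simulate(n, d=5, shift=1.0, env=0, seed=0):
    """One environment: hidden variable h, features X depending on h, the
    environment e, and an h*e interaction (strength `shift`), and a binary
    outcome y driven by a few features through a logistic link."""
    rng = np.random.default_rng(seed)
    e = np.full(n, float(env))                      # environment indicator
    h = rng.normal(size=n) + 0.5 * e                # hidden variable
    beta = rng.normal(size=d)
    X = (np.outer(h, beta)                          # correlations induced by h
         + shift * np.outer(h * e, beta)            # environment-dependent shift
         + rng.normal(size=(n, d)))                 # independent noise
    logits = X[:, 0] + X[:, 1] + 0.5 * e            # only some features (and e) affect y
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return X, y
```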

4.1.1 Implementation

Here, we set the dimension of $X$ and sampled the model coefficients; we let only a subset of the variables (but not the remaining features) affect the outcome, by setting the corresponding outcome coefficients to zero.

As studies do not typically report correlations between features within each outcome class, we tested our algorithm in different scenarios of correlation shifts. Specifically, we used three configurations of the interaction coefficient, emulating weak, medium, and strong correlation shifts, respectively.

Given a specific simulation model, we generated three data sets, namely internal training and test sets and an external data set. We computed the mean and variance of every feature in $X$, separately for individuals with $Y = 1$ and $Y = 0$, in the external set. Next, we trained an elastic-net-regularized logistic regression model on the internal training set and computed the AUC on the internal test and external sets. Finally, we applied the performance estimation algorithm on the internal test set, using the external expectations, and compared the estimated AUC to the actual one.
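An end-to-end sketch of this evaluation loop, reusing the hypothetical helpers from the previous sections (`simulate`, `summary_stats`, `phi_matrix`, `relaxed_entropy_weights`, `weighted_auc`) and scikit-learn's elastic-net logistic regression; for brevity, only class-conditional means are shared as external statistics:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# internal training/test samples and an external sample
X_tr, y_tr = simulate(2000, env=0, seed=1)
X_te, y_te = simulate(2000, env=0, seed=2)
X_ex, y_ex = simulate(2000, env=1, seed=3)

model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=1.0, max_iter=5000)
model.fit(X_tr, y_tr)
p_te = model.predict_proba(X_te)[:, 1]
p_ex = model.predict_proba(X_ex)[:, 1]

mu_ext = summary_stats(X_ex, y_ex)        # the only external information used
w = relaxed_entropy_weights(phi_matrix(X_te, y_te), mu_ext)

print("internal AUC          ", roc_auc_score(y_te, p_te))
print("actual external AUC   ", roc_auc_score(y_ex, p_ex))
print("estimated external AUC", weighted_auc(y_te, p_te, w))
```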

Supplementary Figure 6 presents examples of generated samples with varying values of the interaction coefficient. For each setting, we generated multiple models and, from each, sampled data sets of varying sizes.

4.1.2 External Performance Estimation

The results of the proposed algorithm, in terms of the divergence from uniform weights and the AUC estimation accuracy, for different values of the interaction coefficient and data size, are shown in Table 1. As expected, the divergence of the weights from uniform (measured by the Kullback-Leibler divergence) and the estimation error grow with the strength of the correlation shift.

Figure 2 presents the estimation error of the external AUC values as a function of the correlation shift strength and the sample size. Estimation quality is lower for strong shifts in correlations, which are not captured in the shared expectations, whereas milder differences result in good estimations. For comparison, the difference between the internal and actual external AUC values is reported in Table 1.

Table 1: Algorithm performance in the simulated datasets, averaged over 200 sampling repetitions; rows correspond to weak, medium, and strong correlation shifts, and columns report the internal, external, and estimated external AUC values and the estimation error. The estimation error column presents the mean of the absolute values of the errors. The Kullback-Leibler divergence between the derived and uniform weights, as well as the estimation error, increase with stronger correlation shifts.

Figure 2: Estimation error (absolute value of the difference between the actual and estimated external AUC values; y-axis) for weak, medium, and strong correlation shifts, as a function of sample size (x-axis). Whiskers correspond to AUC percentiles over the generated models.

4.2 Synthetic Data Split: Complications of Ulcerative Colitis

Next, we studied the IMRD-UK primary care data and synthetically split it into "internal" and "external" sets based on various demographic criteria. Specifically, we trained a model on the internal sample to predict the -year risk of intestinal surgery (or death) in ulcerative colitis (UC) patients; estimated its performance on the external sample, using limited external statistics; and compared the estimated and observed performance.

4.2.1 Clinical Background

Table 2: Characteristics of the internal and external samples, split by age (two splits: internal younger with external older, and internal older with external younger). Rows report sample size; Townsend deprivation index (score and availability); and the percentages of women, smokers, steroid users, underweight and overweight individuals (body mass index), perianal disease, and complications.

UC is a chronic inflammatory bowel condition with consistently increasing incidence rates in both newly industrialized and developed countries (benchimol_changing_2014; windsor_evolving_2019; kaplan_four_2021). The increase in its prevalence has a significant impact on healthcare financial burden due to chronically administered medications as well as hospitalizations and surgical procedures (windsor_evolving_2019).

UC pathogenesis is not well understood. Presumed risk factors for a more complicated disease include younger age at diagnosis, extensive disease, use of steroids and immunosuppressive drugs, and being a non-smoker (koliani-pace_prognosticating_2019).

4.2.2 Implementation

The UC onset cohort includes individuals with at least two diagnoses of inflammatory bowel disease (IBD), or with an IBD diagnosis and a prescription for an IBD medication, who have an ulcerative colitis diagnosis and no Crohn's disease diagnosis. We set the index (or cohort entry) date to the first IBD diagnosis or medication prescription and required that individuals have a minimum observation period prior to the index date. We excluded subjects with insufficient follow-up.

For each individual in the ulcerative colitis cohort we extracted a set of features previously reported as associated with increased intestinal surgery risk (koliani-pace_prognosticating_2019). These include age (and age squared), sex, smoking, being underweight or overweight, presence of perianal disease, and use of steroids; we also considered sets of predefined features (per OHDSI's Feature Extraction R library), e.g., drugs prescribed to at least 1,000 subjects up to 90 days after the index date. The outcome considers procedure codes for colostomy, colectomy, ileostomy, small intestinal resection, stricturoplasty, balloon dilation, drainage of perianal abscess, drainage of intra-abdominal abscess, or death, within years following the index date. Definitions of all concept sets and cohorts are available at https://atlas-demo.ohdsi.org/.

We split the IMRD-UK data into internal and external sets based on individual age or country of residence, as described below.

Figure 3: Actual and estimated external performance in the England UC cohorts, split by age; panels show the elastic net and XGBoost models for each of the two age-based splits. Boxes show the external median AUC and inter-quantile range (IQR) over repetitions; the solid line represents the median internal AUC and the dashed lines represent the IQR.

4.2.3 External Performance Estimation: Ulcerative Colitis, Split by Age

In the following experiments, we split the subset of individuals who live in England by their age. Specifically, in the first experiment, the internal set contained the youngest subjects and the external set the older ones; in the second experiment, the internal set contained the older individuals and the external set the remaining, younger, ones. In each of these setups we randomly split the internal set into training and test sets; we repeated the training-test random split multiple times. Next, we trained a linear model as well as a non-linear one, using XGBoost (chen_xgboost_2016), computed each model's AUC on the internal test and external sets, and estimated the external AUC using the internal set and the expectations of the external one. To maintain positivity and to emulate an environment-dependent hidden factor, we excluded age from the feature set. The populations differed in several observed characteristics, notably the percentages of women and of underweight and overweight individuals; see Table 2 for details.

Overall, the external performance estimations, using either elastic net or XGBoost, are close to the actual ones (Figure 3), notably for the external younger samples, where the difference between the internal and external performance is large (right panel).

4.2.4 External Performance Estimation: Ulcerative Colitis, Split by Country

Table 3: Characteristics of the ulcerative colitis, country-based sub-cohorts (England, Wales, Northern Ireland, Scotland). Rows report sample size; age; Townsend deprivation index (score and availability); and the percentages of women, smokers, steroid users, underweight and overweight individuals (body mass index), perianal disease, and complications.

Figure 4: Actual and estimated performance in external, country-based UC samples; panels show the elastic net and XGBoost models. Boxes show the external median AUC and IQR over repetitions; solid and dashed lines represent the internal median AUC and IQR, respectively.

Next, we split the UC cohort by country of residence and considered the sub-cohort of individuals living in England as the internal sample and those living in Scotland, Wales, and Northern Ireland as three distinct external samples. Similarly to the age-split analysis, we repeatedly split the internal sample into training and test sets; trained a model on each training set; extracted expectations for the external samples; and evaluated model performance on the internal test and external sets.

The characteristics of the different sub-populations are presented in Table 3; Figure 4 shows the external performance evaluation results, attesting to the (much) improved accuracy of the estimated AUC values compared to the internal performance.

4.3 Distinct Datasets: Stroke Risk Models

4.3.1 Clinical Background

Atrial fibrillation is a common cardiac rhythm disorder, associated with an increased risk of stroke (sagris_atrial_2021). Risk factors associated with the occurrence of stroke include older age, various comorbidities (in particular, hypertension, diabetes, and renal disease), and smoking (singer_new_2013). To guide treatment, multiple risk scores have been devised and externally evaluated in several studies (van_den_ham_comparative_2015). Recently, reps_feasibility_2020 replicated five previously published prognostic models that predict stroke in females newly diagnosed with atrial fibrillation, and externally validated their performance across nine observational healthcare datasets. Below, we use our proposed algorithm and the limited per-database statistical characteristics, as they appear in reps_feasibility_2020, to estimate the external performance of these risk scores.

4.3.2 Implementation

We downloaded reps_feasibility_2020’s analysis package and applied it to the IMRD-UK data, with the following modifications that adjust the study definitions to a primary care setting:

Target cohorts.

We considered ECG-related procedures and conditions, in addition to measurements, within 30 days prior to the atrial fibrillation diagnosis, as an optional inclusion criterion.

Outcome cohort.

As stroke, which is typically not diagnosed in a primary care setting, may be poorly recorded for deceased individuals, we added death as an entry event to the stroke cohort.

Feature definitions.

We extended the time window for extraction of model features to span the entire history of each individual until, and including, the date of the first atrial fibrillation event; included individuals with an estimated glomerular filtration rate (eGFR) below a cutoff (in mL/min/1.73 m²) in the end stage renal disease cohort, as originally defined in the ATRIA risk model (singer_new_2013); and defined former smokers as individuals with an observation of smoker, as well as those diagnosed with tobacco dependence syndrome.

For each individual, the analysis package computed a stroke risk score given her set of features, as extracted from IMRD-UK, and then calculated score performance vis-à-vis the recorded stroke (and death) events. To estimate score performance in each external sample, we weighted individuals in the IMRD-UK data using the proposed algorithm to reproduce that sample's population characteristics, as reported in reps_feasibility_2020, and computed the score performance on the weighted cohort. We computed confidence intervals using bootstrap iterations.
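As an illustration of this step, a percentile-bootstrap sketch over the internal individuals (reusing the hypothetical `phi_matrix`, `relaxed_entropy_weights`, and `weighted_auc` helpers; the number of iterations and the percentile method are assumptions):

```python
import numpy as np

def bootstrap_auc_ci(X, y, scores, mu_ext, n_boot=200, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the weighted-AUC estimate
    of external performance (resampling internal individuals)."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        w = relaxed_entropy_weights(phi_matrix(X[idx], y[idx]), mu_ext)
        aucs.append(weighted_auc(y[idx], scores[idx], w))
    return tuple(np.quantile(aucs, [alpha / 2, 1 - alpha / 2]))
```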

Population attributes (reps_feasibility_2020) include the percentage of individuals in certain age groups, comorbidities (hypertension, congestive heart failure, congestive cardiac failure, coronary heart disease, valvular heart disease, chronic and end stage renal disease, proteinuria, diabetes, and rheumatoid arthritis), and being a former smoker.

Figure 5: Performance estimation for three stroke risk scores (ATRIA, CHADS2, Q-Stroke) across seven external datasets; the top panels correspond to all women diagnosed with atrial fibrillation and the bottom panels to the older sub-cohort of women diagnosed with atrial fibrillation. Blue circles represent the actual AUC values as reported by reps_feasibility_2020, red diamonds show the weighted estimations, and whiskers denote confidence intervals. Solid lines represent the internal AUC, as computed in the IMRD-UK cohorts, with dashed lines denoting confidence intervals.

4.3.3 External Performance Estimation

A comparison between the risk score performance, as reported by reps_feasibility_2020, and the estimated performance is shown in Figure 5. For the full cohort (top panel), in three out of six datasets, the confidence interval of the ATRIA estimation overlaps the actual AUC (Figure 5a); in two other datasets, the estimation is still closer to the actual value than the internal, IMRD-UK-based performance. Qualitatively similar results are observed for the CHADS2 and Q-Stroke risk scores (Figure 5b and c, respectively), as well as for the older women sub-cohort (bottom panel).

We note that for two additional risk scores, Framingham and CHA2DS2-VASc, and two datasets, Ajou University School Of Medicine (AUSOM) and Integrated Primary Care Information (IPCI), reps_feasibility_2020 do not provide the necessary statistical characteristics. Additionally, AUC values for the IBM MarketScan® Medicare Supplemental Database (MDCR) were not provided for the full atrial fibrillation cohort, and those of the IBM MarketScan® Commercial Database (CCAE) are missing for the older female sub-cohort. Therefore, in all those cases, performance estimations are not reported.

5 Discussion

We presented an algorithm that estimates the performance of prediction models on external samples from their limited statistical characteristics, and demonstrated its utility using synthetic data, a synthetic split of an ulcerative colitis cohort from a single database by age group and by country of residence, and a recent benchmark of stroke risk models on multiple external samples. Importantly, our proposed algorithm can help identify models that perform well across multiple clinical settings and geographies, even when detailed test data from such settings are not available. It can thus direct the development of robust models and accelerate deployment to external environments.

The algorithm relies on two assumptions: one-sided positivity and proximity. Neither assumption can be fully tested, but clear violations of the former can be detected, for example, when the expected value of a feature is non-zero in the external distribution but all the individuals in the internal set have a zero value for that feature. Intuitively, proximity is more likely to be plausible as the statistical information becomes more detailed. Therefore, whereas our preliminary experiments involved only marginal statistics of the features, it may be informative to test the performance of the algorithm when more detailed statistics are available, for example, interactions among features or the information available in deep characterization studies (burn_deep_2020-1).

We believe that the proposed methodology can serve as a building block in network studies that aim to construct robust models across datasets when data sharing is limited, e.g., by regulatory constraints. Although federated learning methods may be a promising avenue for such scenarios, it would be interesting to explore in which cases the proposed algorithm can facilitate a one-shot federated learning scheme that does not require deployment of federated-algorithm clients in all network nodes.

In future work, we will combine the proposed algorithm with methods that aim to construct robust models such as those that leverage distributionally robust optimization (buhlmann_invariance_2020); study methods that exploit the relations between calibration and robustness (wald_calibration_2022); and look into decomposing AUC (pmlr-v54-eban17a), so it can be optimized explicitly.

Institutional Review Board (IRB)

This study has been approved by IQVIA Scientific Review Committee (Reference numbers: 21SRC066, 22SRC002).

We thank Drs Roni Weisshof and Ramit Magen, Rambam Health Care Campus, Haifa, Israel, for their help in defining the ulcerative colitis predictive model; and Prof Seng Chan You, Yonsei University Health System, Seoul, Republic of Korea, for pointing us to the stroke external validation study. We are also grateful to KI Institute’s researchers and, specifically, to Nir Kalkstein and Yonatan Bilu, for many fruitful discussions.

References

Appendix A Model-dependent optimization scheme

An upper bound on a model's weighted loss, up to a finite-sample error, can be derived by maximizing the weighted loss over the weight sets that are consistent with the external statistics. The tightness of the bound may depend on the number of expectations we consider. Furthermore, as $\phi$, and consequently $\mathcal{W}$, may not represent all inter-feature dependencies existing in the data, an additional constraint on the divergence of the weights from the uniform distribution may yield improved estimations (Problem 4). As this constraint is tightened, the bound may become tighter but confidence may decrease.

Appendix B Model-independent dual optimization problem

Recall that optimization Problem (2) is defined as follows:

$$\max_{w} \; H(w) \quad \text{s.t.} \quad \Phi^{\top} w = \mu, \;\; \sum_{i=1}^{n} w_i = 1, \qquad (5)$$

where $w_i \ge 0$ for all $i$. Denoting

$$A = \begin{pmatrix} \Phi^{\top} \\ \mathbf{1}^{\top} \end{pmatrix}, \qquad b = \begin{pmatrix} \mu \\ 1 \end{pmatrix}, \qquad (6)$$

Problem (2) becomes:

$$\min_{w} \; \sum_{i=1}^{n} w_i \log w_i \quad \text{s.t.} \quad A w = b. \qquad (7)$$

Following Equation 5.11 in boyd2004convex, the dual function is:

$$g(\nu) = -b^{\top} \nu - f_0^{*}(-A^{\top} \nu),$$

where $f_0^{*}$ is the conjugate of the negative-entropy function (boyd2004convex, p. 222):

$$f_0^{*}(y) = \sum_{i=1}^{n} e^{\,y_i - 1}.$$

Therefore,

$$g(\nu) = -b^{\top} \nu - \sum_{i=1}^{n} e^{\,-(A^{\top} \nu)_i - 1}.$$

The Lagrangian of the primal problem is:

$$L(w, \nu) = \sum_{i=1}^{n} w_i \log w_i + \nu^{\top} (A w - b). \qquad (8)$$

Let $\nu^{*}$ be the optimal solution of the dual problem. Then, following Section 5.5.3 of boyd2004convex, the solution of the primal problem minimizes the Lagrangian at $\nu^{*}$:

$$\frac{\partial L(w, \nu^{*})}{\partial w_i} = \log w_i + 1 + (A^{\top} \nu^{*})_i = 0,$$

giving

$$w_i = \exp\!\big( -(A^{\top} \nu^{*})_i - 1 \big).$$

This result shows that the optimal weights are normalized exponents of a linear function of the data points.

Appendix C Supplementary Figures

Figure 6: Simulation examples (internal and external samples) for weak, medium, and strong correlation-shift settings. Dot colors denote the outcome class and diamonds represent class means. The shift in the correlation between the plotted features, given an outcome class, increases with the strength of the interaction coefficient.
