1 Introduction
Linked electronic medical record (EMR) systems provide a massive reservoir of information that can help researchers understand and treat both common and rare medical conditions. Specifically, EMR data includes both structured data, such as lab values and diagnostic codes, and unstructured data in the form of freetext medical notes and images. The majority of EMR data is natively captured in unstructured form, and this has fostered the development of learning algorithms to extract research ready variables (Pons et al. [2016], Wang et al. [2018]). Both structured and unstructured data may be used towards identification of key clinical indicators or outcomes, where accuratelyderived outcomes can greatly improve downstream analyses such as the generation of prognostic models for disease risk, or predictive scores that could guide treatment. Ultimately both structured data and carefully processed unstructured data are necessary, but extracting specific findings from unstructured data is often expensive and timeconsuming. The traditional approach requires highly trained clinicians or technicians who can transcribe medical notes into coded variables. The manual abstraction process is not scalable to massive EMR cohorts and has motivated using machinelearning methods, such as natural language processing (NLP) methods for medical text data (Chapman et al. [2001], Carroll et al. [2012]
), and deep learning strategies for medical images (
Esteva et al. [2017]), as scalable alternatives. Yet, any algorithm development relies on a base of training and validation data, and the purpose of this manuscript is to outline efficient study designs that can facilitate costeffective data collection for the development of new prediction tools.In order to both develop (i.e. train) and validate (i.e. test) datadriven machinelearning algorithms requires “labeled data”, which is a sample containing both feature (predictor) and outcome information. For example, classification of imaging findings from radiology report text may use word indicators as features and cliniciandefined actual case/control statuses as binary outcomes. In typical EMR settings, labeled data is not readily available, therefore a small subset of the underlying cohort needs to be selected for outcome abstraction. The information content in labeled samples is crucial towards efficient and accurate machinelearning modeling. For classification tasks, a wellknown challenge for training is statistical rarity, where cases (outcome=1) are disproportionately less frequent than controls (outcome=0), and the fact that the training sample outcome class distribution may affect classification accuracy has been demonstrated both empirically (Weiss and Provost [2001], Batista et al. [2004], Wei and Dunbrack Jr [2013]) and theoretically (Xue and Hall [2015]). Unsurprisingly, such class distributional “imbalance” is almost always observed for clinical outcomes. One approach in the machinelearning literature that has been proposed to address statistical rarity involves resampling the training sample to eliminate controls (undersampling) or replicating cases (oversampling), in order to rebalance the effective outcome class distribution in training, and hopefully to improve ultimate model prediction accuracy (Chawla et al. [2002], He and Garcia [2009]). However, such analysisbased resampling procedures assume that an initial labeled data sample is already available, and these strategies disregard the potential cost associated with labeled data collection (Weiss and Provost [2001]).
When data collection resources are scarce, targeted sampling methods in epidemiology have offered highly efficient research designs. In contrast to analysisbased resampling procedures, epidemiologic sampling methods are defined at the design stage of studies prior to data collection. One wellknown sampling method is the casecontrol design (Prentice and Pyke [1979]
), where collected samples may be appropriately analyzed using a logistic regression model, and has the attractive advantage that estimation proceeds as if a simple random sample were collected (although the regression intercept is biased). Sampling designs such as the casecontrol, where expensive data ascertainment is based on strata defined by values of a cheaper auxiliary variable, may be viewed as special cases of the general twophase sampling design (
Neyman [1934], Chatterjee et al. [2003]). In the context of effect estimation, targeted sampling through twophase has been shown to provide efficiency over simple random sampling (Zhao et al. [2009], McIsaac and Cook [2014]), especially when using sampling variables that are highly correlated and informative for the outcome (Zhao et al. [2012]). However, the effect of selectively sampled training data on ultimate machinelearning prediction accuracy has not been thoroughly investigated.For clinical outcome identification using EMR data, an imperfect alternative to abstracted outcomes may be based on summaries of related structured data elements, such as International Classification of Disease (ICD) codes and simple keyword searches queried within prespecified time frames. Such “surrogates” or “correlates” of actual clinical outcomes have been used in place of true clinical outcomes in machinelearning modeling tasks to reduce the dimensionality of EMRgenerated features (Yu et al. [2016]
), or directly as “noisy” imputed outcome labels for classifier development (
Agarwal et al. [2016]). However, model development with misclassified outcomes may seriously compromise validity of using the resulting model predictions for downstream analyses (Sinnott et al. [2014]), therefore using surrogates to replace abstracted clinical outcomes as labels may not be justified. Alternatively, surrogates could help guide selection of subjects for labeled data abstraction. In fact, subject selection based on nonnegated keywords and ICD codes has been described when assembling a labeled data sample for machinelearning classification of clinical outcomes (Pakhomov et al. [2005]). Yet, there remains little discussion of corresponding statistical rationale, and such heuristic decisions based on purposeful biased sampling may not create generalizable predictions, or valid summaries of accuracy.
This paper is motivated by the need for a formal statistical framework to guide sampling of subjects for labeled data abstraction, towards accurate and scalable machinelearning classification of clinical outcomes. We specifically focus on the rare outcome scenario, where model accuracy is often ratelimited by the number of outcome cases. As with conventional intuition, our proposed strategy targets caseenrichment of rare outcomes for selection of training data. The key contribution of our work is the formalization of heuristic sampling methodologies drawn from the fields of machinelearning and epidemiology, therefore filling a critical gap in EMR research methods. In Section 2.1, we frame the statistical problem and describe the proposed sampling framework. Then, in Section 2.2 we provide a representation of development sample composition on model discrimination and demonstrate direct connections with statistical efficiency. In Section 2.3, we describe configurations within the class of proposed designs that are best for learning, and characterize the sampling impact for both model development and validation in Section 2.4. We provide empirical evidence through simulations in Section 3. Finally, in Section 4 we illustrate the method on a data set of lumbar spine imaging reports that was obtained in a pragmatic trial of radiology decision support (Jarvik et al. [2015]), and provide a concluding discussion in Section 5.
2 Methods
2.1 Statistical motivation and proposed design
For subject denote
as the feature vector and
as the binary outcome. The general classification problem is to find function that maps from the features to outcomes, for example penalized logistic regression(1) 
where in (1),
refers to Lasso regression (
Tibshirani [1996]) andrefers to Ridge regression (
Le Cessie and Van Houwelingen [1992]). For a concrete example of application of classification models such as (1), consider the task of classifying radiology reports for subject vertebral fracture status. For this natural language processing (NLP) motivated task, simple features may be derived using bagofwords (BOW) representations, while outcomes obtained through abstraction. More sophisticated feature engineering is common in NLP, but for the characterization of efficient designs we illustrate using a simple learning approach. For subject , the BOW feature vector has binary elements with unique terms obtained by concatenating all reports, while the abstracted outcome label is the cliniciandefined indicator of vertebral fracture. For data required towards fitting (1), note that while extracting features is relatively cheap, obtaining outcome statuses is timeconsuming. Therefore, for clinical outcome identification tasks, a sample needs to be drawn from the EMR cohort for outcome abstraction so that both and are available for machinelearning development and evaluation.Denote the sample , having sample size and sampling design . Typically, the assumption on is simple random sampling (SRS) from . However, due to the expected low number of cases from naturally rare clinical outcomes, using SRS to select reports for abstraction is often inadequate. In machinelearning a common procedure is oversampling to artificially increase sample prevalence where cases are randomly replicated at the analysis stage. However, oversampling does not generate new information, rather simply reweighs existing data.
Alternatively, samples with higher outcome prevalence can be collected by design through stratified sampling. For the development of outcome classification algorithms using EMR data, recall that while true requires abstraction from unstructured data, there exists other structured data elements in the database that are related to . For instance, true outcome status for the example of vertebral fracture identification may be related to counts of related keywords in report text, or related to International Classification of Disease (ICD) codes recorded during the same subject visit. Denote summaries of such related structured data elements as , where it is reasonable to expect to be associated with true , and therefore a “surrogate” for the true outcome. However, due to the potential misclassification of for , instead of replacing by we suggest using as auxiliary variables for sampling to develop a prediction model.
Definition 1
Surrogateguided sampling (SGS) design class.
Denote the surrogateguided sampling (SGS) design class as the set of stratified sampling procedures based only on values of a binary enrichment surrogate
. Such designs would select an individual for sampling with probability
where typically when is positively correlated with .The surrogateguided sampling (SGS) design class (Definition 1) describes the class of stratified sampling designs based on values of an enrichment surrogate, and is a special case of twophase sampling. To conduct an SGS design, all subjects in the cohort are divided into two strata based on surrogate values: surrogate positives with , and surrogate negatives with . Then, subjects are selected into the sample based on surrogate values, and only selected subjects have true abstracted for. The intended benefit of SGS designs is that, for the same abstraction cost, resulting samples have higher expected outcome prevalences compared to using SRS. For illustration, consider an outcome prevalence of 10%, and assume that in the EMR, there exists a surrogate with 40% sensitivity and 95% specificity for the outcome of interest. For an abstraction budget of collecting labels, using an SGS design with threetimes as many surrogate positives as surrogate negatives yields about 185 true cases in expectation. In contrast, an SRS design would have required abstraction of almost subjects to yield actual cases, abstraction burden of close to four times. Note that cases identified using SGS designs are true cases collected from the cohort, and not replicates or synthetic data as resulting from using analysisbased rebalancing methods.
We now describe theoretical and analytical results that are key to understanding our proposed sampling framework. First, in order to demonstrate that the sampling design choice does affect learning performance, we provide a mathematical representation of how training sample composition affects prediction accuracy. Second, among the possible designs in the proposed SGS sampling class, we investigate configurations that improve sampling benefit for learning. Intuitively as illustrated in Figure 1, when the case proportion is higher in the surrogate positive () compared to the surrogate negative () stratum, overrepresenting the surrogate positive stratum provides higher expected sample case proportion by construction  we provide an analytical treatment of such intuition. Lastly, we characterize the impact of using the proposed SGS design on machinelearning performance, describing bias and modeling considerations for both model development and model validation.
2.2 Effect of training sample composition on prediction accuracy
To statistically motivate that sampling design choice does affect learning performance, we provide a mathematical representation of how training sample composition impacts prediction accuracy. For tractability we focus on a commonly used evaluation metric, the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). Model validation AUC can be interpreted as how well resulting continuous predictions discriminate between randomly selected pairs of case and control subjects in yet unseen data. Other performance metrics such as binary accuracy correspond to the sum of error values for an optimal point on the ROC curve. If we consider continuous model predictions as a “test” for true outcome statuses, then under a binormal assumption,
Pepe [2003] has shown the AUC to be(2) 
In (2), and
are the means and variances of the “test” among the cases (
) and controls (). The binormal AUC formula (2) was developed in Pepe [2003] for diagnostic testing applications, but may be generalized to the classification modeling setting. For classification model development, the continuous “test” is estimated using a development sample, but generalizable performance usually evaluated on a separate validation sample. Therefore, we introduce additional notation to express such differences between the classification modeling and diagnostic testing settings. Denote as the development sample collected using sampling design and having sample size , and assume that the validation sample is a large sample obtained through SRS from . Then, the validation AUC for model developed with may be represented using an indexing as shown in Definition 2.Definition 2
.
Let denote the validation AUC of a classification model for outcome developed using sample defined with sampling design and sample size .
Using the indexing as in Definition 2
, we may then represent validation AUC in terms of development sample composition, assuming binormally distributed features (
). Theorem 1 shows that validation AUC is inversely proportional to the estimation variance and the data signaltonoise ratio. Therefore, assuming use of the same modeling procedure, using a design with higher statistical information as measured by lower estimation variance results in higher validation AUC. To our knowledge, the results in Theorem 1 are the first to directly present an indexing of validation AUC in terms of development sample composition. Note that our argument for Theorem 1 may be generalized beyond binormal features and logistic regression. For example, the binormal features assumption may be relaxed to allow for monotone transformations of normal distributions (Pepe [2003]). In addition, the results in Theorem 1 may be applied to penalized logistic regression, as long as the estimation bias and variance of resulting coefficients can be well characterized.Theorem 1
Assume that in , for , , where and . Let be the estimated linear predictor, where model coefficients are estimated by logistic regression using development sample . Then,
(3) 
where is the approximate covariance matrix of estimating using , and and are parameters describing the data signaltonoise ratio.
We may use the results in Theorem 1, to explain the effect of outcome class imbalance on classifier discrimination. When modeling using logistic regression, it has been noted that samples with rare outcomes tend to result in more highly variable coefficient estimates compared to that of more prevalent outcomes (King and Zeng [2001]). As demonstrated in (3), such increased estimation variance directly results in lower discrimination. To increase discrimination in validation requires using an alternative sampling design that results in lower variability in estimating , a higher information design. In the context of twophase sampling designs, the choice of sampling variable as well as strata proportions may affect design information – we now turn to discussion on this point.
2.3 Quantifying SGS design benefit with .
For a parametric regression model we demonstrated that the development sample composition affects ultimate model validation AUC through the variance associated with the estimation of risk scores. In general, to directly quantify the effect of sampling design on estimation variance requires numerical approximations. Alternatively, motivated by empirical results in machinelearning, we turn to sample outcome prevalence as another simple measure of information. We now focus on characterizing the “effect” of a sampling design on the sample outcome prevalence. Let denote that subject was selected for sampling using design
. The sample case/control odds,
, compares the expected proportion of cases to controls among sampled subjects (), where higher odds indicate higher prevalence. To denote the sample case enrichment comparing SGS to SRS, we propose using the case/control odds ratio, a metric we denote as and mathematically define in Definition 3.Definition 3
.
Let denote the expected case/control odds ratio comparing surrogateguided sampling (SGS) to simple random sampling (SRS), where
.
The denominator of is the expected odds of cases for samples collected with SRS, and is assumed to be less than for rare outcomes. The numerator is the expected odds of cases for samples collected with SGS designs. Therefore, can be interpreted as the expected increase in cases comparing SGS to SRS, with higher values indicating that SGS provides more case enrichment, and indicating improvement using SGS relative to SRS. has similarities and differences to the term “odds ratio” which is often used in epidemiology. The epidemiological usage of “odds ratio” compares the case/control odds of a sample drawn from the exposed group to a sample drawn from the unexposed group, and provides a single estimate of exposure effect. Similar to the exposure odds ratio, also compares the case/control odds of two samples drawn from the same population. However, since the samples are defined by sampling design instead of exposure statuses, the provides a single estimate of design effect on sample outcome prevalence. Therefore, provides a onedimensional summary measure of case enrichment comparing SGS over SRS.
2.3.1 Properties of and the impact of surrogate specificity
An interesting property of is the connection to Likelihood Ratios (LRs) of the enrichment surrogate. Of note, LRs of a diagnostic test can be interpreted as slopes of Receiver Operating Characteristics (ROC) curves, are related to positive and negative predictive values (PPV & NPV), but are invariant to outcome prevalence (Choi [1998]). Therefore, by framing enrichment surrogates as “prior tests” of outcome , we may gain insight into what types of variables are the best surrogates for sampling.
Proposition 1
Properties of .
Let a surrogateguided sampling (SGS) design of sample size be defined with surrogate and sampling ratio , where has and operating characteristics: , . Then,
(4) 
Additionally, if the outcome is rare (), then
(5) 
Corollary 1
For a given , . Over the set of possible , and .
Equation (5) in Proposition 1 shows that is approximately the sum of positive and negative surrogate likelihood ratios ( and ), weighted by the sample proportion of surrogate positives. Corollary 1 follows directly from (5), and directly characterizes the effect of both strata allocation and sampling variable choice on sample case enrichment. For strata allocation, overrepresenting surrogate positives (higher ) results in higher values of . In terms of sampling variable choice, notice that for any given , while the rate of increase in is linear in it is inverse polynomial in . Therefore, a small change in specificity can have a much higher impact on compared to the same change in sensitivity. Another perspective on sampling variable choice may be obtained by translating sensitivities and specificities into likelihood ratios. In general, while a “good” test requires having high values of both and , having a high alone is sufficient to achieve a high : requirements for a variable to be a good enrichment surrogate for sampling are weaker than requirements for a good diagnostic test.
The illustration in Figure 2 emphasizes the impact of surrogate specificity on , where values of are indicated by different colors across possible ranges of surrogate marginal sensitivities and specificities for an SGS design with fixed . The nongray regions of Figure 2 illustrates operating characteristics of surrogates that constitute good candidates for stratified sampling variables. We excluded the presentation of surrogates with specificities less than , as we may redefine these surrogates to obtain a more specific variable. From Figure 2, note that when using surrogates with specificities of or higher, caseenrichment relative to SRS can be expected even with sensitivities as low as . Our mathematical analyses convey two important practical implications. First, if there exists a dichotomous variable in the EMR that predicts the outcome better than random noise, stratified sampling based on such a variable can provide a development sample that is more enriched for cases, for the same abstraction cost of a simple random sample. Second, to improve on case enrichment, optimizing the enrichment surrogate for high specificity provides much more value compared to optimizing for high sensitivity. By stratified sampling on the values of an enrichment surrogate that is highly specific for the outcome of interest, SGS designs result in development samples with higher outcome prevalence, which may correspond to increased statistical information, lower estimation variance, and therefore improved statistical learning.
2.4 Design impact on model development and model validation
To improve the information of samples selected for machinelearning, SGS designs intentionally overrepresent surrogate positives. A natural concern is whether such introduced selection bias may impact the validity of developed models. The impact of sample characteristics on machinelearning was first formalized in Zadrozny [2004], and can be formulated as a missing data problem (Little and Rubin [2014]). Recall that sampling in SGS only depends on surrogate values , which are assumed to be available for all subjects in . Therefore, for the SGS design, sampling is independent of outcome labels conditional on surrogate values, equivalently , an assumption also known as Missing At Random (MAR). Using the MAR assumption, we now describe the impact of using SGS designs for both model development and model validation.
2.4.1 Design impact on model development
To characterize design impact on model development, we consider distributional differences between the development sample and the cohort. For sample obtained with sampling design , Zadrozny [2004] suggested that may be used for developing model “validly” under the asymptotic equivalence criteria, with , where as the development sample size grows, the approaches the truth as if the full cohort were available. In particular, resulting in having outcomes MAR from are “valid” for model development of classifiers based on conditional means in the asymptotic “true model” sense (Zadrozny [2004]). Note that for logistic regression, , therefore, machinelearning model development with logistic regression using SGS samples will result in validly estimated models under this interpretation.
2.4.2 Design impact on model validation
Now, consider the impact of using sample , where is the SGS design, on model validation to assess prediction performance. This practically relevant scenario may arise, for example, when a single sampling design is used to select subjects for outcome abstraction, and then resulting sample split into separate subsamples for model development and model validation. On the validation sample, the developed model may be assessed for its prediction accuracy, using metrics such as sensitivity, specificity, and AUC.
In general, unless the validation sample is drawn randomly from the cohort (i.e. SRS), empirically estimated accuracy metrics are typically biased for the true values. However, for validation samples collected using SGS, due to the MAR assumption biascorrection methods are available. For example, the Inverse Probability Weighting (IPW) estimator first proposed by Horvitz and Thompson [1952] adjusts empirical estimates according to inverse sampling probabilities. To estimate generalizable AUC of the model on this intentionally biased sample, for pairs of subjects and , outcome and predicted probabilities , the IPWcorrected empirical estimator is
(6) 
In (6), is the sampling probability for subject , and may be estimated from observed data for any MAR sample. For the SGS design, is additionally known by construction to be
(7) 
The known sampling probabilities (7) may be directly used in IPWcorrected accuracy metrics such as (6). Note that the AUC indexing described in Section 2.3 is slightly different than the AUC estimator in (6). In Section 2.2, we assumed that the validation sample was large and representative of the cohort, and represented the effect of development sample composition on validation AUC. Here, in using (6
), we considered the developed model to be fixed, and studied the effect of validation sample composition towards unbiased estimation of true model accuracy measures of this fixed model on the target cohort. Our theoretical arguments demonstrate that any introduced bias from using SGS samples for model validation may be corrected with IPW towards unbiased estimation of model accuracy measures.
2.4.3 Theoretical requirements for design validity
By framing the proposed sampling design as a missing data problem, we have characterized sampling impact on modeling and outlined several analytic guidelines for design validity. For model development of classifiers based on conditional outcome distributions, the surrogate needs to be included as a predictor. For model validation, empirical accuracy measures may be corrected using IPW estimators, where required sampling probabilities are known exactly by design. For both model development and model validation, subjects representing surrogate positives (Z=1) and surrogate negatives are (Z=0) are required in the sample. For example, if only surrogate positives are available, model coefficients are estimable only on the stratum and is undefined for the stratum, without further parametric assumptions.
3 Simulations
To illustrate the benefit of using SGS designs for statistical machinelearning model development, we conducted simulations motivated by a realworld data set of radiology text reports from the Lumbar Imaging with Reporting of Epidemiology (LIRE) study (Jarvik et al. [2015]  for additional information see Section 4). For the simulation study, features were generated following a longtail distribution that is characteristic of bagofwords text data, modeling was based on penalized regression due to highdimensional feature assumptions, and we compared the effect of various sampling designs on validation prediction accuracy.
3.1 Simulation setup
For cohorts of size
, we generated conditional outcomes as independent Bernoulli random variables, having prevalence of either 5% or 10%. As most EMR datasets contain features of high dimensionality, we set the number of features to be
, of which only had nonzero coefficients. Specifically, the conditional outcome was generated as , where , with for the first most frequent features, for the features with frequencies closest to the outcome prevalence, and for the remaining features. Here, we used a simplifying assumption that the most predictive textbased features tend to occur as often as the outcome prevalence, frequent features are weakly predictive, but most features are irrelevant for predicting the outcome.Binary features were generated as independent Bernoulli random variables, with marginal feature frequencies following an exponential distribution simulating a long upper tail distribution, where the most common features are present in almost all reports but the majority of features have very low frequencies (
Sichel [1975]). Specifically, features were generated as , where simulated following an exponential distribution with mean = comparable to observed distributions in the LIRE dataset. For the binary enrichment surrogates, surrogate had a sensitivity of 0.40 and a specificity of 0.95, defined to have comparable operating characteristics with the realworld surrogate for the LIRE data set, while surrogate had a sensitivity of 0.67 and a specificity of 0.66, and may be viewed as a “weaker” surrogate for sampling. Note that both surrogates have the same discrimination for the outcome (AUC = 0.67) as computed according to the trapezoidal rule.We compared the sampling methods: simple random sampling (SRS) which we consider to be the “baseline”, surrogateguided sampling designs (SGS), as well as random oversampling (ROS) which is a commonly used analysisbased resampling procedure. For each simulated cohort, we set aside a large validation sample with sample size using SRS. From the remaining subjects, we simulated “abstraction samples” varying across a grid of sample sizes, and sampling methods of SRS, ROS, SGS 1:1 or SGS 3:1, where SGS may be based on surrogates or . For the SRS and SGS sampling designs, the abstraction sample size is exactly the development sample size. The ROS procedure replicates cases from an SRS sample of size until the number of cases and controls are equal. Therefore, even though both SRS and ROS have the same “abstraction sample size”, ROS results in a higher development sample size due to case replication. For a fair comparison, we used abstraction sample size rather than development sample size as the unit of cost measurement.
For each iterations we fit either Lasso or Ridge classification models, but coefficients for enrichment surrogates were assigned a zero penalty, which is a modification to the usual likelihood so that the surrogate is always included in the resulting model. Regularization parameters were selected based on values that maximized AUC using tenfold crossvalidation on development samples. Then, we apply resulting model estimates to the validation sample, calculating the empirical validation AUC using the WilcoxonMannWhitney formula. Over all iterations, we calculated average validation AUCs and illustrated results in the form of learning curves. Briefly, a learning curve is a type of plot in machinelearning to show the change in model prediction accuracy (here: discrimination) when cost (here: abstraction sample size) increases. In these experiments, since we compared prediction accuracy across different sampling designs conditioned on the same models and data generating mechanism, the difference in model performance is due to differences in the sampling design that gave rise to resulting samples.
3.2 Simulation results
Figure 3 illustrates simulation results when modeling with logistic lasso regression. First, consider the cohort with 5% outcome prevalence and SGS sampling using surrogate Z1 (Figure 3(a)(i)), where surrogate discrimination for outcome was low but machinelearning may improve model discrimination. Learning with samples obtained with SRS is difficult, requiring an abstraction sample size of n=3000 to achieve a validation AUC of 0.85 (94% of the maximum AUC of 0.90). In contrast, using SGS achieves such discrimination at lower sample sizes, with the SGS 1:1 design requiring n=1500 (50% of SRS cost) and the SGS 3:1 design requiring n=1000 (33% of SRS cost). Notice that while assigning a higher proportion to surrogate positives (SGS 3:1) resulted in slightly higher learning curves compared to equal proportions of surrogate positives and negatives (SGS 1:1), such differences were less substantial compared to the difference between using any SGS compared to SRS. We remark that the benefit of ROS on learning is inconsistent, where such case replication sometimes resulted in worse generalizable discrimination compared to no replication (SRS) at all.
Figure 3(b)(i) shows the same setting as described before, but SGS sampling was based on surrogate Z2 rather than Z1. Even though surrogates Z1 and Z2 had the same discrimination for the outcome (AUC = 0.675), Z2 was a worse variable for stratified sampling purposes. For a validation AUC of 0.85, using surrogate Z2 based on the SGS 1:1 and SGS 3:1 allocations required n=2500 (83% of SRS cost) and n=2000 (67% of SRS cost) respectively. Such differences may be attributed to the lower specificity of surrogate Z2 compared to Z1, resulting in overall lower sample prevalence and therefore reduced benefit for learning. Similar results on SGS design benefit and surrogate specificity were observed for the cohort having 10% outcome prevalence (Figures 3(a)(ii) and 3(b)(ii)), but SGS design benefit over SRS was less pronounced due to a less rare outcome. To achieve a validation AUC of 0.85 with lasso regression, using SRS required an abstraction sample size of n=1000 while using SGS 1:1 sampling based on surrogate Z1 and Z2 required n=700 and n=900 respectively, compared to sample size savings of 50% and 83% respectively for the 5% outcome scenario.
Figure 4 illustrates learning curves for modeling with logistic ridge regression, with SGS using surrogate Z1 (Figure 4(a)) and Z2 (Figure 4(b)). Compared to using lasso regression, the different shapes of learning curves reflected differences in choice of modeling using variable selection versus shrinkage. To achieve a validation AUC of 0.85 for the 5% outcome prevalence cohort, learning with SRS required an abstraction sample size of at least n=4000 while SGS sampling using surrogate Z1 required about n=2500 (less than 63% of SRS cost), where using SGS regardless of stratification allocation was a consistent improvement over SRS. On the other hand, SGS sampling using surrogate Z2 had almost the same sample size requirement as with SRS, again emphasizing the importance of surrogate specificity for sampling. Similar conclusions were observed for the 10% outcome prevalence cohort. When modeling with ridge regression, ROS was consistently worse than SRS without case replication. One possible explanation is due to that using ridge regression reduces the estimation variance of classification through intentionally biased estimates. With oversampling, while modeling bias increases, variation remains the same as case replication does not provide additional information, therefore resulting in lower generalizable prediction accuracy.
Therefore, our results suggest three consistent patterns associated with SGS sampling. First, using SGS for sampling was generally an improvement over using SRS for classification of rare outcomes, as observed for both lasso and ridge regression learning. Second, allocating higher proportions to the surrogate positive stratum resulted in improved learning compared to equal allocations, but only slightly. Third, using a more specific surrogate results in a sample with higher information and therefore improved learning compared to using a less specific surrogate. We acknowledge that quantification of the exact design benefit on learning depends on factors such as the specific modeling choice and the outcome prevalence.
4 Application: Fracture identification from radiology reports
4.1 Data set details
Vertebral fractures of the spine can lead to spinal deformity, loss of vertebral height, crowding of internal organs, and loss of muscles, resulting in acute back pain and potentially chronic pain. Diagnosis is usually made through radiographic imaging, such as with plain xray or magnetic resonance imaging (MRI). In EMR systems a vertebral fracture finding is natively captured in unstructured text form, and for research a definite fracture status variable requires clinical expert abstraction of associated radiology text reports. Therefore, sampling strategies alternative to the usual SRS may be reduce the abstraction burden towards accurate and scalable machinelearning classification of vertebral fracture outcomes.
The Lumbar Imaging with Reporting of Epidemiology (LIRE) study evaluated the effect of radiology report content on subsequent treatment decisions among adult subjects (Jarvik et al. [2015]). Subjects were eligible for the LIRE study if they had a diagnostic imaging test ordered by their Primary Care Physician (PCP), so all subjects in LIRE had at least one radiology report available from the EMR database. The prevalence of vertebral fractures is estimated to be relatively rare: 320% among primary care subjects seeking care for all reasons (Waterloo et al. [2012]) and expected to be similar among subjects from the LIRE study. Using LIRE data as the “cohort”, we evaluate the benefit of using SGS designs for outcome label abstraction and subsequent classification model development.
4.2 Surrogate creation and sampling design application
Together with clinicians, we identified a set of International Classification of Disease (ICD) codes that if present, are highly likely to indicate that a subject was diagnosed with a vertebral fracture; details are in Supplementary Material C. For each subject, we counted how many ICD codes were noted in the EMR within 90 days of cohort entry. In the cohort of 178,333 subjects, 171,592 (96%) did not have any relevant ICD codes, 3,275 (1.83%) had one code, 1,303 (0.73%) had two codes, 758 (0.42%) had three codes, and the remaining had more than three codes. Since most subjects did not have any relevant ICD codes and a count of one was the most common count, we defined the enrichment surrogate Z as , where 3.78% of the cohort were considered to be “surrogate positives”.
This abstraction task was nested within a larger abstraction setup for the LIRE study. The radiology reports of each selected subject were abstracted by two independent clinicians for the presence or absence of vertebral fractures. From the available dataset, data “marts” for model development and model validation were assembled, each having a sample size of n=500. The validation data mart was selected such that it was representative of the underlying cohort, while the development data mart was selected based on an SGS 1:1 configuration. Using the validation data mart, we estimated marginal characteristics of the surrogate as well as the of the resulting SGS design.
4.3 Modeling and analysis
Features were created by processing radiology report text data using the quanteda package in R. Features were bagofwords (BOW) unigrams excluding typical English stopwords as well as terms that were very rare ( of all reports) or common ( of all reports). We used the termfrequency inversedocument frequency (TFIDF) representation for BOW as described in Salton and Buckley [1988], which incorporates information about the importance of terms both locally (within a single report) as well as globally (across all reports). For a collection of reports denoted , the set of terms denoted was obtained from concatenating unique words from all reports. Then the TFIDF feature matrix contains elements , with term frequency TF defined as and inverse document frequency IDF defined as . In addition to textfeatures, we also included the binary enrichment surrogate as a predictor, for a total of features.
To investigate the design effect on model prediction accuracy, we drew bootstrap samples of sizes from the development data mart stratified by surrogate status. To simulate the SRS design, we drew samples according to an “inverse SGS” design from the development data mart, where surrogate positives were underincluded with the sampling probabilities (7). To simulate the SGS design, we drew samples randomly from the development data mart. For each simulated sample, we fitted Lasso logistic regression selecting regularization parameter
based on minimizing the average 10fold crossvalidated error using an AUC loss function. Resulting estimated model parameters were then applied to the validation sample to obtain estimates of the validation AUC. For each sampling design (SRS and SGS) and for each sample size, we reported mean validation AUC and 95% bootstrap confidence intervals.
Estimated data set characteristics are shown in Table 1. Note that the defined enrichment surrogate by itself had low AUC in discriminating the outcome. In fact, its operating characteristics were such that it was highly specific but only moderately sensitive for the finding, an overall high for the resulting SGS design. Even though only a weak predictor, the surrogate is beneficial as a sampling variable. Data analysis results are shown in Table 2. In general, when fitting a logistic lasso regression, average validation AUC increases with sample size. However, for the same sample size, using samples drawn with SGS resulted in higher average validation AUC. For example, using the same sample size of , the AUC of SGS was 0.86 while that of SRS was only 0.74, a difference of 0.12 suggesting that allocating a sample size of 250 is more resource efficient under SGS compared to SRS.
5 Discussion
In summary, motivated by sampling frameworks from epidemiology and machinelearning, we formalized a design strategy for abstraction selection and label collection of rare outcomes through a twophase stratified sampling framework. We have demonstrated that the specificity of a sampling variable used to guide sampling greatly affects design benefit for classification accuracy. We suggest that a specificity of or higher is ideal, is very good, and is the absolute minimum specificity that a surrogate needs to achieve in order to be considered as a sampling variable. To create highly specific surrogates, simple keyword searches may be supplemented with offtheshelf negation tools for example in Harkema et al. [2009], while related ICD codes may be defined with higher count thresholds within a shorter period of time. Additionally, keywords and ICD codes may be combined with an “AND” query to further increase specificity. To estimate the specificity of a candidate surrogate, a small initial sample may be collected using SGS, where we remark that appropriate estimators may be based on those described in the verification bias literature (Alonzo [2014]).
The are two practical tradeoffs of the proposed design worth considering. First, a concern may be whether sampling on a highly specific surrogate could result in a dataset that is sufficiently representative of all possible outcome subgroups. For example, in the vertebral fracture data application, while requiring at least two instead of one ICD codes may further increase surrogate specificity, such a strategy could have resulted in a sample with mostly chronic fractures and not acute fractures. A possible solution may implement a “tiered” surrogate, using subsamples defined by variables to balance specificities and case representativeness (e.g. , , counts of ICD codes). Another tradeoff relates to strata proportions. While increasing sample surrogate positives may result in slightly improved classification accuracy, resulting inflated inverse weights may greatly increase the variance of IPW estimators of validation accuracy measures.
Anchored in the proposed SGS design framework, future work may formally investigate methodological and practical questions related to full study planning such as formal sample size calculations. Once relevant tradeoffs are carefully defined, appropriate sample size calculations may then proceed taking into account the need of both model development and model validation. Other future work should include: investigating the appropriateness of the SGS framework for outcomes that are much rarer than what we considered (5%); characterizing design effects on prediction accuracy measures other than AUC; as well as determining best practices for sampling in the presence of site heterogeneity. Ultimately, our hope is to encourage careful statistical and study design thinking when assembling labeled data sets for machinelearning model development and validation, especially when considering the nontrivial abstraction cost in obtaining such labels.
References
 Agarwal et al. [2016] Vibhu Agarwal, Tanya Podchiyska, Juan M Banda, Veena Goel, Tiffany I Leung, Evan P Minty, Timothy E Sweeney, Elsie Gyang, and Nigam H Shah. Learning statistical models of phenotypes using noisy labeled training data. Journal of the American Medical Informatics Association, 23(6):1166–1173, 2016.
 Alonzo [2014] Todd A Alonzo. Verification bias  impact and methods for correction when assessing accuracy of diagnostic tests. REVSTAT–Statistical Journal, 12(1):67–83, 2014.
 Batista et al. [2004] Gustavo EAPA Batista, Ronaldo C Prati, and Maria Carolina Monard. A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd Explorations Newsletter, 6(1):20–29, 2004.
 Carroll et al. [2012] Robert J Carroll, Will K Thompson, Anne E Eyler, Arthur M Mandelin, Tianxi Cai, Raquel M Zink, Jennifer A Pacheco, Chad S Boomershine, Thomas A Lasko, Hua Xu, et al. Portability of an algorithm to identify rheumatoid arthritis in electronic health records. Journal of the American Medical Informatics Association, 19(e1):e162–e169, 2012.
 Chapman et al. [2001] Wendy Webber Chapman, Marcelo Fizman, Brian E Chapman, and Peter J Haug. A comparison of classification algorithms to automatically identify chest xray reports that support pneumonia. Journal of biomedical informatics, 34(1):4–14, 2001.
 Chatterjee et al. [2003] Nilanjan Chatterjee, YiHau Chen, and Norman E Breslow. A pseudoscore estimator for regression problems with twophase sampling. Journal of the American Statistical Association, 98(461):158–168, 2003.

Chawla et al. [2002]
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer.
Smote: synthetic minority oversampling technique.
Journal of artificial intelligence research
, pages 321–357, 2002. 
Choi [1998]
Bernard CK Choi.
Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test.
American Journal of Epidemiology, 148(11):1127–1132, 1998. 
Esteva et al. [2017]
Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter,
Helen M Blau, and Sebastian Thrun.
Dermatologistlevel classification of skin cancer with deep neural networks.
Nature, 542(7639):115–118, 2017.  Harkema et al. [2009] Henk Harkema, John N Dowling, Tyler Thornblade, and Wendy W Chapman. Context: an algorithm for determining negation, experiencer, and temporal status from clinical reports. Journal of biomedical informatics, 42(5):839–851, 2009.
 He and Garcia [2009] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
 Horvitz and Thompson [1952] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
 Jarvik et al. [2015] Jeffrey G Jarvik, Bryan A Comstock, Kathryn T James, Andrew L Avins, Brian W Bresnahan, Richard A Deyo, Patrick H Luetmer, Janna L Friedly, Eric N Meier, Daniel C Cherkin, et al. Lumbar imaging with reporting of epidemiology (lire)—protocol for a pragmatic cluster randomized trial. Contemporary clinical trials, 45:157–163, 2015.
 King and Zeng [2001] Gary King and Langche Zeng. Logistic regression in rare events data. Political analysis, 9(2):137–163, 2001.
 Le Cessie and Van Houwelingen [1992] Saskia Le Cessie and Johannes C Van Houwelingen. Ridge estimators in logistic regression. Applied statistics, pages 191–201, 1992.
 Little and Rubin [2014] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data. John Wiley & Sons, 2014.
 McIsaac and Cook [2014] Michael A McIsaac and Richard J Cook. Responsedependent twophase sampling designs for biomarker studies. Canadian Journal of Statistics, 42(2):268–284, 2014.
 Neyman [1934] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4):558–625, 1934.
 Pakhomov et al. [2005] Serguei V Pakhomov, James Buntrock, and Christopher G Chute. Prospective recruitment of patients with congestive heart failure using an adhoc binary classifier. Journal of biomedical informatics, 38(2):145–153, 2005.
 Pepe [2003] Margaret Sullivan Pepe. The statistical evaluation of medical tests for classification and prediction. Medicine, 2003.
 Pons et al. [2016] Ewoud Pons, Loes MM Braun, MG Myriam Hunink, and Jan A Kors. Natural language processing in radiology: a systematic review. Radiology, 279(2):329–343, 2016.
 Prentice and Pyke [1979] Ross L Prentice and Ronald Pyke. Logistic disease incidence models and casecontrol studies. Biometrika, 66(3):403–411, 1979.
 Salton and Buckley [1988] Gerard Salton and Christopher Buckley. Termweighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.
 Sichel [1975] Herbert S Sichel. On a distribution law for word frequencies. Journal of the American Statistical Association, 70(351a):542–547, 1975.
 Sinnott et al. [2014] Jennifer A Sinnott, Wei Dai, Katherine P Liao, Stanley Y Shaw, Ashwin N Ananthakrishnan, Vivian S Gainer, Elizabeth W Karlson, Susanne Churchill, Peter Szolovits, Shawn Murphy, et al. Improving the power of genetic association tests with imperfect phenotype derived from electronic medical records. Human genetics, 133(11):1369–1382, 2014.
 Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Wang et al. [2018] Yanshan Wang, Liwei Wang, Majid RastegarMojarad, Sungrim Moon, Feichen Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, et al. Clinical information extraction applications: a literature review. Journal of biomedical informatics, 77:34–49, 2018.
 Waterloo et al. [2012] Svanhild Waterloo, Luai A Ahmed, Jacqueline R Center, John A Eisman, Bente Morseth, Nguyen D Nguyen, Tuan Nguyen, Anne J Sogaard, and Nina Emaus. Prevalence of vertebral fractures in women and men in the populationbased tromsø study. BMC musculoskeletal disorders, 13(1):3, 2012.
 Wei and Dunbrack Jr [2013] Qiong Wei and Roland L Dunbrack Jr. The role of balanced training and testing data sets for binary classifiers in bioinformatics. PloS one, 8(7):e67863, 2013.
 Weiss and Provost [2001] Gary M Weiss and Foster Provost. The effect of class distribution on classifier learning: an empirical study. 2001.
 Xue and Hall [2015] JingHao Xue and Peter Hall. Why does rebalancing classunbalanced data improve auc for linear discriminant analysis? IEEE transactions on pattern analysis and machine intelligence, 37(5):1109–1112, 2015.

Yu et al. [2016]
Sheng Yu, Abhishek Chakrabortty, Katherine P Liao, Tianrun Cai, Ashwin N
Ananthakrishnan, Vivian S Gainer, Susanne E Churchill, Peter Szolovits,
Shawn N Murphy, Isaac S Kohane, et al.
Surrogateassisted feature extraction for highthroughput phenotyping.
Journal of the American Medical Informatics Association, 24(e1):e143–e149, 2016.  Zadrozny [2004] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twentyfirst international conference on Machine learning, page 114. ACM, 2004.
 Zhao et al. [2009] Yang Zhao, Jerald F Lawless, and Donald L McLeish. Likelihood methods for regression models with expensive variables missing by design. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 51(1):123–136, 2009.
 Zhao et al. [2012] Yang Zhao, Jerald F Lawless, and Donald L McLeish. Design and relative efficiency in twophase studies. Journal of Statistical Planning and Inference, 142(11):2953–2964, 2012.
Appendix A Proof of Theorem 2.1
Denote the cohort data as , consisting of features (implicitly also including the surrogate ), and binary outcomes . From , units (typically subjects) are selected to form development and validation samples.
a.1 Preliminaries
In , let the features follow a binormal distribution, so that for
(1) 
This is equivalent to a Linear Discriminant Analysis (LDA) setting, where
(2) 
The parameters in (2) are true parameters in . To estimate regression coefficients, a sample needs to be drawn from . Then, based on theory from generalized linear models, the resulting estimate
has the following first and second moments:
(3) 
In (3), , where estimates the average probabilities resulting from the sigmoidal transformation of development sample linear predictions. The terms in (3) are accurate up to second order approximations. In estimating the regression parameters, denote the bias as and variance as , then both and depend on the development sample through sample size and sampling design . To evaluate the resulting classification model, we use a large validation sample, obtained using simple random sampling from . Denote the true linear predictions in the validation sample as , with distribution
In the classification setting, coefficients are estimated from the development sample , where is generated with sampling design and with development sample size . We use to denote an indexing of resulting validation AUC, where
(4) 
In (5), the notation and indicates that the estimation of is from . This proof outlines in terms of development sample composition.
a.2 Mean and variances of validation sample linear predictions
In the large and representative validation sample, for , the mean of the estimated linear predictions is
(5) 
where the double expectation is due to the dependence on validation sample features as well as development sample estimated coefficients . Similarly, the variance of the estimated linear predictions is
(6) 
The first part of the right hand side of (6) is
(7) 
and the second part of the right hand side of (6) is
(8) 
where we have used properties of the expectation of a quadratic form: for . Therefore, combining (7) and (8), the variance of is
(9) 
a.3 Classifier validation AUC in terms of estimation variance
Now we plug in values for (4). WLOG assume that and that . Then, the means and variances of validation sample linear predictions among cases (Y=1) and controls (Y=0) are respectively
(10) 
Thus, the numerator in (4) is the square of
(11) 
while the denominator in (4) is
(12) 
When , then since , and are assumed to be “fixed” quantities in a large validation sample,
Appendix B Derivation of Proposition 2.1
For , . Denote subjects where as those included in , the SGS sample selected from the cohort only based on values of . Thus, . The expected case odds in samples collected using SGS is
where
The expected case odds in samples collected using SRS is
Then, the case/control odd ratio of samples obtained with SGS compared to that of SRS is:
Comments
There are no comments yet.