Many important problems in biomedical decision making can be expressed as binary classification problems. For example, one may wish to identify infants infected with hepatitis C virus from a sample of infants born to infected mothers (Shebl et al., 2009), screen for prostate cancer using prostate-specific antigen (Etzioni et al., 1999), or predict which breast cancer patients will respond to treatment based on genetic characteristics (Fan et al., 2011). In many applications, the costs of false positives and false negatives may differ and classification methods must allow for unequal weighting of these errors. We present an approach to estimating receiver operating characteristic (ROC) curves using a weighted support vector machine (SVM) and introduce a bootstrap method for constructing confidence bands for the SVM ROC curve.
Receiver operating characteristic curves plot the true positive fraction against the false positive fraction across values of the classification cut point (Zhou et al., 2002; Pepe, 2003). Various methods for modeling and estimating ROC curves have been proposed, including parametric regression models (Pepe, 1997; McIntosh and Pepe, 2002) and semiparametric regression models (Pepe, 2000; Cai and Pepe, 2002; Cai and Dodd, 2008). Existing methods for ROC curve confidence bands involve estimating the biomarker distributions in the diseased and nondiseased samples using parametric models (Ma and Hall, 1993)2000; Claeskens et al., 2003; Horváth et al., 2008), or using empirical distribution functions in combination with the bootstrap (Campbell, 1994). These methods assume a scalar biomarker. In the current setting, we apply the SVM of Cortes and Vapnik (1995) to classification with a multivariate biomarker (see also Krzyżak et al., 1996; Lin, 2002; Zhang, 2004; Steinwart and Christmann, 2008). The ROC curve is constructed by varying the weight placed on false positives and false negatives in the objective function rather than varying the classification cut point. Because the SVM classifier may vary across the range of the weight parameter, existing confidence band methods that assume a scalar biomarker cannot be directly applied.
Machine learning techniques that output a continuous score or predicted probability allow for straightforward application of ROC curve methodology (see, e.g., Spackman, 1989; Bradley, 1997; Provost and Fawcett, 1997, 1998). However, there are fewer examples of applying ROC curve methodology to classifiers that output only a class label, such as the SVM. Platt (1999) proposed a method to extract class probabilities from the output of the SVM (see also Vapnik, 1998; Lin et al., 2007). However, these methods rely on fitting parametric models to the SVM class labels. Example 2.5 in Steinwart and Christmann (2008) discusses classification using a weighted SVM but does not explore tuning these weights to achieve desired operating characteristics (e.g., a specific false positive fraction). Veropoulos et al. (1999) propose using weights to control the sensitivity and specificity of the SVM and to estimate an ROC curve, but provide no theoretical results or inference methods. As such, the weighted SVM has not yet been extensively applied in practice. We build on the work of Veropoulos et al. (1999) by deriving theoretical properties and developing a bootstrap method for constructing confidence bands for the SVM ROC curve.
There are numerous applications to motivate this work; however, we focus on two primary illustrative applications. The first is diagnostic testing for infant hepatitis C virus (HCV). Existing diagnostic tests exhibit poor sensitivity for predicting which infants will become chronically infected. A weighted SVM using multiple biomarkers is able to improve performance over standard HCV diagnostic tests. The second illustrative application we consider is predicting which breast cancer patients will respond to treatment. Genomic data provide a wealth of information for this purpose. However, it is difficult to specify a parametric model for response given genomic features because of the high dimension of genomic data. Because the SVM provides nonparametric classification (Steinwart and Christmann, 2008), it is a natural choice for this problem.
In Section 2, we describe ROC curve estimation using the weighted SVM and the bootstrap method for constructing confidence bands. In Section 3, we show a number of theoretical results, including that the risk of the estimated decision function is uniformly consistent across the weight parameter. In Section 4, we present a series of simulation experiments that compare the performance of the weighted SVM to standard methods in diagnostic medicine, including logistic regression (McIntosh and Pepe, 2002) and semiparametric ROC curves (Cai and Moskowitz, 2004), and that evaluate the operating characteristics of the proposed bootstrap confidence bands. In Section 5, we present illustrative case studies. We conclude and discuss future research in Section 6. Proofs and additional simulation results are provided in Appendix A and Appendix B.
2 Weighted Support Vector Machines
2.1 ROC Curve Estimation
Assume that the available data are , , which comprise i.i.d. copies of , where is a class label (e.g., in diagnostic medicine, corresponds to a diseased individual and corresponds to a nondiseased individual) and are covariates. The goal is to estimate a classifier that correctly identifies a patient’s class label based on that patient’s covariates. Consider minimizing the expected weighted misclassification, where each misclassification event is weighted by the cost function , where is the cost of misclassification when the true class label is . In diagnostic medicine, with corresponding to disease and corresponding to nondisease, determines the relative weight placed on the sensitivity and specificity of the test. When , sensitivity and specificity are given equal weight and the cost function reduces to zero-one misclassification error. Let denote a class of functions from into . Then, the optimal classifier with respect to cost function within is
For fixed and a classifier , the plug-in estimator of the weighted misclassification error is , where is the empirical measure of the observed data. Note that any classifier can be represented as for some decision function ; we will assume that the decision function is smooth and thus belongs to a class of smooth functions, . For example, we can let be the space of linear functions, the space of polynomial functions, or the reproducing kernel Hilbert space (RKHS) associated with the Gaussian kernel (Steinwart and Christmann, 2008). The weighted misclassification error associated with decision function is . Minimizing the empirical risk is difficult due to the discontinuity of the indicator function. Using the hinge loss,
as a surrogate loss function (Bartlett et al., 2006), an estimator for the optimal decision function is
where is a norm on and is a penalty parameter. We discuss how to choose a value of in Section 4. In the following, we write in place of to simplify notation. The problem of estimating the optimal classifier in (2) can be solved using the SVM introduced by Cortes and Vapnik (1995).
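As an illustration, the penalized weighted hinge-loss estimator described here can be written as follows. The symbols are assumed for this sketch and are not taken from the source: $\omega \in [0, 1]$ for the weight, $c_\omega$ for the cost function, $\phi$ for the hinge loss, $\lambda_n$ for the penalty parameter, $\mathbb{P}_n$ for the empirical measure, and $\mathcal{F}$ for the class of decision functions. The squared norm in the penalty is also an assumption.

```latex
\[
c_\omega(y) = \omega\, 1\{y = 1\} + (1 - \omega)\, 1\{y = -1\},
\qquad
\phi(u) = \max(0,\, 1 - u),
\]
\[
\hat{f}_\omega
  = \arg\min_{f \in \mathcal{F}}
    \; \mathbb{P}_n \, c_\omega(Y)\, \phi\{Y f(X)\} + \lambda_n \| f \|^2 .
\]
```

Under this parameterization, $\omega = 1/2$ weights both error types equally, so the weighted misclassification error is proportional to the zero-one error.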
We estimate the optimal classifier, , using . For any , we can estimate the sensitivity and specificity of the estimated classifier using the empirical estimators and . Plotting against as functions of will yield a nonparametric estimator of the optimal ROC curve. The ROC curve encodes a continuum of classifiers indexed by ; to select a single classifier, there are a number of methods for defining an optimal value, say , for . For example, one could choose the which leads to the point on the ROC curve closest to (0, 1) in Euclidean distance, the which maximizes the sum of estimated sensitivity and specificity, or the which maximizes the estimated sensitivity for a fixed minimum specificity estimate (López-Ratón et al., 2014). The choice of will depend on the clinical application of interest. We classify an individual presenting with covariates as . This is an equivalent formulation to the method proposed in Section 2.1 of Veropoulos et al. (1999).
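The weight sweep described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper (which relies on LIBSVM): the solver is plain subgradient descent on a penalized weighted hinge loss with a linear decision function, the cost is assumed to be `omega` for a true label of 1 and `1 - omega` for a true label of -1, and the function names, tuning constants, and toy data are all hypothetical.

```python
import random

def fit_weighted_linear_svm(X, y, omega, lam=0.01, epochs=200, lr=0.1):
    """Subgradient descent on mean(cost * hinge) + lam * ||w||^2.

    Assumed cost: omega for a true label of 1, 1 - omega for a true label of -1.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for epoch in range(epochs):
        step = lr / (1.0 + 0.01 * epoch)        # slowly decaying step size
        grad_w = [2.0 * lam * wj for wj in w]   # gradient of the penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:                # hinge loss is active here
                cost = omega if yi == 1 else 1.0 - omega
                for j in range(p):
                    grad_w[j] -= cost * yi * xi[j] / n
                grad_b -= cost * yi / n
        w = [wj - step * gj for wj, gj in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict(w, b, x):
    """Class label given by the sign of the linear decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

def sens_spec(w, b, X, y):
    """Empirical sensitivity and specificity of the fitted classifier."""
    preds = [predict(w, b, xi) for xi in X]
    n_pos = sum(1 for yi in y if yi == 1)
    n_neg = len(y) - n_pos
    tp = sum(1 for pi, yi in zip(preds, y) if yi == 1 and pi == 1)
    tn = sum(1 for pi, yi in zip(preds, y) if yi == -1 and pi == -1)
    return tp / n_pos, tn / n_neg

def svm_roc(X_train, y_train, X_test, y_test, omegas):
    """One (omega, false positive fraction, true positive fraction) per weight."""
    points = []
    for omega in omegas:
        w, b = fit_weighted_linear_svm(X_train, y_train, omega)
        se, sp = sens_spec(w, b, X_test, y_test)
        points.append((omega, 1.0 - sp, se))
    return points

def make_toy_data(n, rng):
    """Hypothetical two-class Gaussian data with alternating labels."""
    y = [1 if i % 2 == 0 else -1 for i in range(n)]
    X = [[rng.gauss(yi, 1.0), rng.gauss(yi, 1.0)] for yi in y]
    return X, y

rng = random.Random(0)
X_train, y_train = make_toy_data(200, rng)
X_test, y_test = make_toy_data(100, rng)
omegas = [0.1, 0.3, 0.5, 0.7, 0.9]
roc_points = svm_roc(X_train, y_train, X_test, y_test, omegas)
for omega, fpf, tpf in roc_points:
    print(f"omega={omega:.1f}  FPF={fpf:.2f}  TPF={tpf:.2f}")
```

Each weight yields one refitted classifier and therefore one point on the estimated curve, which is why the curve is traced by the weight rather than by a cut point on a single score.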
The optimal classifier over all functions mapping into , also known as the Bayes classifier (Duda et al., 2012), can be expressed as
Thus, is equal to 1 when
or, using Bayes' theorem,
and otherwise, where . Thus, the optimal classifier given in (3) has the same form as the Neyman–Pearson test of against . If we fix (or equivalently, fix ) to have fixed specificity , then the Neyman–Pearson lemma ensures that maximizes sensitivity across all classifiers with specificity equal to . Therefore, the ROC curve for , say , has the property that for all , where is the ROC curve corresponding to any other classifier. This is analogous to the result given by McIntosh and Pepe (2002) (see also page 71 of Pepe, 2003).
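The likelihood ratio form can be checked with a short derivation. The notation is assumed for illustration and is not taken from the source: $\omega$ is the cost of misclassifying a subject with $Y = 1$, $1 - \omega$ is the cost of misclassifying a subject with $Y = -1$, $\eta(x) = P(Y = 1 \mid X = x)$, $\pi = P(Y = 1)$, and $p_1$, $p_{-1}$ are the class-conditional densities of $X$.

```latex
% Expected cost at x of predicting 1 is (1 - \omega)\{1 - \eta(x)\};
% expected cost at x of predicting -1 is \omega\,\eta(x).
\[
\text{predict } 1
\iff \omega\, \eta(x) > (1 - \omega)\{1 - \eta(x)\}
\iff \eta(x) > 1 - \omega .
\]
% Bayes' theorem: \eta(x) = \pi p_1(x) / \{\pi p_1(x) + (1 - \pi) p_{-1}(x)\}.
\[
\eta(x) > 1 - \omega
\iff \frac{p_1(x)}{p_{-1}(x)} > \frac{(1 - \omega)(1 - \pi)}{\omega\, \pi} .
\]
```

The right-hand side is a constant that depends only on the weight and the prevalence, so varying the weight moves the likelihood ratio threshold, which is what links the rule to the Neyman–Pearson test.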
The optimal decision function in is
where and are the sensitivity and specificity of the decision rule . Thus, the true optimal decision function maximizes a weighted sum of sensitivity and specificity where the weights are determined by the population prevalence, , and a user chosen weight, .
2.2 Confidence Bands
In this section, we present a method for constructing confidence bands for the ROC curve of , which provide an indication of how well the estimated classifier will perform in future samples. The method relies on consistency results given in Section 3 along with the following result, which characterizes the asymptotic distribution of the estimated sensitivity and specificity of . A proof is provided in Appendix A.
Let be the true sensitivity, be the estimated sensitivity, be the true specificity, and be the estimated specificity of , where is defined in (2), and assume that is a space of linear or polynomial functions. Define , where the minimization is taken over all measurable functions mapping into , and assume that . Let be defined as in Remark 2 and let . Then,
as , where and are mean zero Gaussian processes with covariances
respectively, with cross-covariance
Let be the false positive fraction for the decision function . Define such that , i.e., is the weight such that . Let
be fixed. A quantile bootstrap algorithm for constructing an asymptotically correct confidence band for the ROC curve, , , is as follows:
Set a large number of bootstrap replications, B, a grid and a grid .
For , compute the estimated ROC curve, .
For , compute by linearly interpolating.
Generate a weight vector , where are independent standard exponential random variables and .
For , set
For , compute by linearly interpolating .
Let be the -th quantile of and let be the largest such that for all for at least bootstrap samples.
Set and .
One can also use alternate choices for the weights, for example, a multinomial weight vector with probabilities and trials. Let denote convergence in probability over , as defined in Section 2.2.3 and Chapter 10 of Kosorok (2008). The following result states the consistency of the bootstrap.
Let and define similarly. Let and let be as defined above. Then, for any , in .
Thus, will cover across with probability for large enough and . In addition to the linear and polynomial SVM, this procedure will work for any classifier such that the estimated decision function is in a VC class, such as a logistic regression classifier.
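The resampling idea behind the algorithm can be illustrated with a deliberately simplified sketch. It applies exponential bootstrap weights to the empirical sensitivity of a fixed set of predictions and reports a percentile interval at a single point. It does not refit the classifier under each weight vector and does not build the simultaneous band described above. Normalizing the weights by their mean is an assumption, and all names and the toy labels are hypothetical.

```python
import random

def weighted_sensitivity(y_true, y_pred, weights):
    """Sensitivity computed under a reweighted empirical measure."""
    num = sum(w for yt, yp, w in zip(y_true, y_pred, weights)
              if yt == 1 and yp == 1)
    den = sum(w for yt, w in zip(y_true, weights) if yt == 1)
    return num / den

def exponential_weights(n, rng):
    """Independent standard exponential draws, scaled to have mean one."""
    raw = [rng.expovariate(1.0) for _ in range(n)]
    mean = sum(raw) / n
    return [r / mean for r in raw]

def quantile(sorted_vals, q):
    """Quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def bootstrap_interval(y_true, y_pred, B=500, alpha=0.10, seed=0):
    """Percentile interval for sensitivity from B weighted replicates."""
    rng = random.Random(seed)
    n = len(y_true)
    reps = sorted(
        weighted_sensitivity(y_true, y_pred, exponential_weights(n, rng))
        for _ in range(B)
    )
    return quantile(reps, alpha / 2.0), quantile(reps, 1.0 - alpha / 2.0)

# Hypothetical labels: 40 diseased (30 detected) and 60 nondiseased subjects.
y_true = [1] * 40 + [-1] * 60
y_pred = [1] * 30 + [-1] * 10 + [-1] * 50 + [1] * 10
lo, hi = bootstrap_interval(y_true, y_pred)
print(f"90% interval for sensitivity: ({lo:.3f}, {hi:.3f})")
```

Continuous weights avoid the ties and empty classes that resampling with replacement can produce in small or imbalanced samples, which is one practical reason to prefer them in a sketch like this.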
3 Uniform Consistency
For any , the estimated classifier is the sign of , the minimizer of the empirical hinge loss in a class as defined in (2). For any function, , define to be the risk of , and define the risk of to be . Let and . Furthermore, define and , i.e., minimizes the risk over and minimizes the risk over all measurable functions mapping into . Define as in Theorem 2.1. Throughout, we assume that , i.e., that the function that minimizes the risk is contained within the chosen class. If this is not the case, the consistency results given here will not hold; however, the estimated decision function will still yield a reasonable approximation to due to the identity . When , the optimal classifier assigns uniformly and when , the optimal classifier assigns 1 uniformly. Focusing on will enable us to avoid these trivial extremes. Nonetheless, many of our results hold for all . We will make this distinction explicit as needed. Throughout, we assume that all requisite expectations exist.
The following result gives a bound on the excess weighted misclassification risk in terms of the excess hinge-loss risk. The proof is similar to that of Theorem 3.2 of Zhao et al. (2012) and uses Theorem 1 and Example 4 of Bartlett et al. (2006). We omit the proof here. This result will be used later to show uniform consistency of the risk of the estimated decision function.
For any measurable and any distribution of , .
This result implies that the difference between the hinge-loss risk of the estimated decision function and the optimal hinge-loss risk is no smaller than the corresponding difference for the weighted misclassification risk. Therefore, we can work with the hinge-loss risk when proving convergence results.
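For orientation, a bound of this type has the following shape. The notation is assumed here rather than quoted from the source: $R_\omega$ denotes the weighted misclassification risk, $R_{\phi,\omega}$ the weighted hinge-loss risk, and starred quantities the corresponding infima over all measurable functions.

```latex
\[
R_\omega(f) - R_\omega^{*} \;\le\; R_{\phi,\omega}(f) - R_{\phi,\omega}^{*}
\qquad \text{for every measurable } f .
\]
```

In particular, if the right-hand side converges to zero along a sequence of estimated decision functions, then so does the left-hand side.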
Next, we establish a number of consistency results for the risk of the estimated decision function. We begin with Fisher consistency. This result implies that estimation using either the hinge loss or the zero-one loss will yield the true optimal classifier given an infinite sample, providing justification for using the proposed surrogate loss function. The proof follows from an extension to the proof of Proposition 3.1 of Zhao et al. (2012) and is in Appendix A.
For any , if minimizes , then for almost all .
The following result establishes consistency of the risk of the estimated decision function when estimation takes place within a RKHS. We then extend this consistency by showing that it is uniform in . The proof of the following result closely follows the proof of Theorem 3.3 of Zhao et al. (2012) and is in Appendix A.
Let be fixed and let be a sequence of positive, real numbers such that and . Let be a RKHS with kernel function and let denote the closure of . Then, for any distribution of , we have that as .
We next strengthen the consistency stated above by showing that the convergence is uniform in when estimation uses a linear, quadratic, polynomial, or Gaussian kernel (see Steinwart and Christmann, 2008, for a discussion of kernel functions used with the SVM). The following lemma indicates that the estimated decision function lies in a Glivenko–Cantelli class (Kosorok, 2008) indexed by , which will help us to extend the consistency stated above to uniform consistency in . The proof is in Appendix A.
Let be estimated using a linear, quadratic, polynomial, or Gaussian kernel function. Then, is contained in a Glivenko–Cantelli (GC) class.
Given that and are contained in a GC class, we have by Corollary 9.27 (iii) of Kosorok (2008), that and are contained in a GC class because is continuous. By Corollary 9.27 (ii) of Kosorok (2008), and are contained in a GC class and thus, is contained in a GC class by Corollary 9.27 (i) of Kosorok (2008), where . It follows that , where . This convergence will be used in the proof of Theorem 3.5, which is given in Appendix A.
Assume that is estimated using a linear, quadratic, polynomial, or Gaussian kernel. For any sequence of positive, real numbers satisfying and and any distribution of ,
as , where is the RKHS associated with .
Note that we do not allow the sequence to depend on , which is reflected in the implementation in Section 4 below.
Here, we prove a number of continuity and convergence results regarding the ROC curve and risk function for and . We begin with the following result which indicates that the ROC curve of the Bayes classifier, , is continuous. We require
to be a continuous random variable; however, we do not require that the map be continuous. The proof is included in Appendix A.
Let and be the sensitivity and specificity of . Then, and are continuous in whenever is a continuous random variable with support .
Thus, is monotone nondecreasing and continuous except possibly at 0. It follows from Lemma 3.6 and Remark 2 that is continuous in . This is used in the proof of the following result, which is deferred to Appendix A.
Under the assumptions of Lemma 3.6, , is continuous in .
Finally, we state two corollaries pertaining to the sensitivity and specificity of the estimated decision rule. These results show that the ROC curve of the estimated decision function converges uniformly to the ROC curve of the optimal decision function in . The proof of Corollary 3.9 relies on a novel empirical process result which is included in Appendix A.
Let be estimated using a linear, quadratic, polynomial, or Gaussian kernel function. Let be the sensitivity and be the specificity of the decision rule . Then, there exist and such that and and as .
Note that Corollary 3.8 does not require to be unique. We can only say that the sensitivity and specificity of converge to the sensitivity and specificity of a function in the same equivalence class as , i.e., a function with optimal risk.
4 Simulation Experiments
To investigate the performance of classification using a weighted SVM and the resulting ROC curves and confidence bands, we use the following generative model. Let be generated according to , where is equal to a vector of ones with probability and a vector of negative ones with probability and is a identity matrix. Thus,
is a mixture of multivariate normal distributions with mixing probability. Let for a vector , where . Given , we let be equal to 1 with probability and with probability . Because depends on only through a linear function of , we refer to this model below as the linear generative model. We also consider a generalization of the above model where , which we refer to below as the nonlinear generative model.
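A data generator in the spirit of the linear generative model might look like the sketch below. Several specifics are not recoverable from the text and are assumptions made here for illustration only: unit variance for each covariate, a logistic link between the linear score and the probability that the label is 1, and the example coefficient vector. The function name and arguments are hypothetical.

```python
import math
import random

def simulate_linear_model(n, p, mix_prob, beta, rng):
    """Draw n (covariates, label) pairs from a two-component normal mixture.

    The mean vector is all ones with probability mix_prob and all negative
    ones otherwise. The label depends on the covariates only through the
    linear score; the logistic link used here is an assumption.
    """
    X, Y = [], []
    for _ in range(n):
        sign = 1.0 if rng.random() < mix_prob else -1.0
        x = [rng.gauss(sign, 1.0) for _ in range(p)]     # assumed unit variance
        score = sum(bj * xj for bj, xj in zip(beta, x))
        prob_one = 1.0 / (1.0 + math.exp(-score))        # assumed logistic link
        y = 1 if rng.random() < prob_one else -1
        X.append(x)
        Y.append(y)
    return X, Y

rng = random.Random(1)
X_sim, Y_sim = simulate_linear_model(n=50, p=3, mix_prob=0.25,
                                     beta=[1.0, 1.0, 1.0], rng=rng)
print(sum(1 for y in Y_sim if y == 1), "of", len(Y_sim), "labels equal 1")
```

A small mixing probability places most covariate vectors around the negative mean, which in turn makes one class label much rarer than the other.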
We implement the weighted SVM in MATLAB software using the LIBSVM library of Chang and Lin (2011). Each simulated data set is divided into training and testing sets with 70% of the data used for training the SVM and 30% used to estimate sensitivity and specificity. We use both linear and Gaussian kernels. The Gaussian kernel function is (Steinwart and Christmann, 2008). The bandwidth parameter, , and the penalty parameter, , are estimated using cross-validation within the training data for and the resulting tuning parameters are used to fit the weighted SVM for all on a grid over . Comparison methods are implemented in R software (R Core Team, 2016).
We compare the performance of the weighted SVM to standard methods in diagnostic medicine, including logistic regression (McIntosh and Pepe, 2002) and semiparametric ROC curves (Cai and Moskowitz, 2004). Logistic regression and the SVM combine multiple biomarkers while the semiparametric ROC curve is calculated for a single biomarker (the first component of ). These four methods are applied to simulated data from the linear and nonlinear generative models with , , , , and . When , we use and , respectively. When , we use , i.e., noise variables are introduced for the case where . We report the mean area under the ROC curve (AUC) and the Monte Carlo standard deviation of AUC as well as optimal sensitivity and specificity across 100 replications. Optimal sensitivity and specificity are taken from the point on the ROC curve closest to (0, 1) in Euclidean distance (see López-Ratón et al., 2014, for a discussion of different methods for selecting the optimal point on the ROC curve).
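The two summaries reported here, AUC and the operating point closest to (0, 1), can be computed from a list of estimated ROC points as in the following sketch. The trapezoidal rule and the added end points (0, 0) and (1, 1) are assumptions of this illustration, since the text does not state how the area is computed, and the example points are made up.

```python
import math

def auc_trapezoid(points):
    """Trapezoidal area under (FPF, TPF) points, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def closest_to_ideal(points):
    """Point with the smallest Euclidean distance to (FPF, TPF) = (0, 1)."""
    return min(points, key=lambda pt: math.hypot(pt[0], pt[1] - 1.0))

# Hypothetical estimated ROC points as (false positive fraction, sensitivity).
roc = [(0.1, 0.6), (0.2, 0.8), (0.5, 0.9)]
auc = auc_trapezoid(roc)
best = closest_to_ideal(roc)
print(f"AUC = {auc:.3f}, optimal point = {best}")
```

Because each point comes from a separately fitted classifier, the estimated points need not be monotone; sorting by false positive fraction before integrating is a pragmatic choice rather than a correction.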
Table 1 below contains estimated AUCs averaged across replications and Monte Carlo standard deviations of AUCs for the four methods when the true generative model is nonlinear.
| n | p | | Linear SVM | Gaussian SVM | Logistic | Semiparametric |
|---|---|---|---|---|---|---|
| 250 | 2 | 0.05 | 0.61 (0.07) | 0.78 (0.06) | 0.58 (0.08) | 0.58 (0.04) |
| 250 | 2 | 0.25 | 0.64 (0.07) | 0.81 (0.05) | 0.62 (0.06) | 0.62 (0.03) |
| 250 | 5 | 0.05 | 0.71 (0.06) | 0.75 (0.06) | 0.71 (0.07) | 0.56 (0.03) |
| 250 | 5 | 0.25 | 0.74 (0.05) | 0.77 (0.06) | 0.74 (0.06) | 0.62 (0.03) |
| 250 | 10 | 0.05 | 0.70 (0.06) | 0.56 (0.05) | 0.70 (0.06) | 0.57 (0.04) |
| 250 | 10 | 0.25 | 0.74 (0.06) | 0.56 (0.05) | 0.74 (0.06) | 0.62 (0.04) |
| 500 | 2 | 0.05 | 0.61 (0.05) | 0.81 (0.04) | 0.59 (0.05) | 0.58 (0.02) |
| 500 | 2 | 0.25 | 0.65 (0.04) | 0.81 (0.04) | 0.61 (0.05) | 0.61 (0.02) |
| 500 | 5 | 0.05 | 0.72 (0.04) | 0.78 (0.04) | 0.72 (0.05) | 0.57 (0.02) |
| 500 | 5 | 0.25 | 0.77 (0.04) | 0.80 (0.04) | 0.75 (0.03) | 0.62 (0.02) |
| 500 | 10 | 0.05 | 0.71 (0.04) | 0.60 (0.05) | 0.71 (0.04) | 0.56 (0.03) |
| 500 | 10 | 0.25 | 0.75 (0.04) | 0.60 (0.04) | 0.74 (0.04) | 0.62 (0.02) |
The Gaussian SVM outperforms the other methods except in the case where there are noise variables. The linear SVM slightly outperforms logistic regression in most cases. Table 4 in Appendix B contains optimal sensitivities and specificities for the four methods when the true generative model is nonlinear, averaged across replications. Table 5 contains estimated sensitivities and specificities of an unweighted SVM when the true model is nonlinear. The unweighted SVM often fails to achieve a balance between sensitivity and specificity. In particular, the linear SVM often achieves low specificity. The imbalance between sensitivity and specificity is often worse when is small, indicating that proper balance is difficult to achieve when there is an imbalance between true class labels in the data. These results highlight the importance of estimating the full ROC curve and selecting the weight to achieve the desired balance between sensitivity and specificity; unweighted classification may not achieve satisfactory performance in many settings. Tables 6, 7, and 8 in Appendix B contain results when the true generative model is linear.
Next, we examine the performance of the proposed bootstrap confidence band method for the linear SVM. Independent testing sets of size 100,000 were used to calculate and , giving us an approximation to the true ROC curve for each . The method introduced in Section 2.2 was used to construct 90% confidence bands using 1000 bootstrap samples. We report the proportion of 100 Monte Carlo replications for which the true ROC curve is fully contained within the confidence band across [0.01, 0.99] along with the average area between the upper and lower confidence bands. Table 2 contains these results.
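The two quantities reported for the bands, simultaneous coverage and area between the curves, can be computed on a common grid as sketched below. The helper names, the grid, the toy band, and the list of replication outcomes are hypothetical; checking containment only at grid points is an approximation to containment over the whole interval.

```python
def band_covers(lower, upper, truth):
    """True if the true curve lies inside the band at every grid point."""
    return all(lo <= t <= up for lo, up, t in zip(lower, upper, truth))

def area_between(grid, lower, upper):
    """Trapezoidal area between the upper and lower bands over the grid."""
    area = 0.0
    for i in range(1, len(grid)):
        width_left = upper[i - 1] - lower[i - 1]
        width_right = upper[i] - lower[i]
        area += (grid[i] - grid[i - 1]) * (width_left + width_right) / 2.0
    return area

# Toy example on a grid spanning [0.01, 0.99].
grid = [0.01 + 0.098 * i for i in range(11)]
truth = [t ** 0.5 for t in grid]        # stand-in for the true ROC curve
lower = [t - 0.05 for t in truth]       # stand-in lower band
upper = [t + 0.05 for t in truth]       # stand-in upper band

covered = band_covers(lower, upper, truth)
area = area_between(grid, lower, upper)

# Coverage is the fraction of replications whose band contains the truth;
# the last three flags are made-up outcomes for illustration.
flags = [covered, True, False, True]
coverage = sum(flags) / len(flags)
print(f"covered={covered}, area={area:.3f}, coverage={coverage:.2f}")
```

A band counts as covering only if the true curve stays inside it at every grid point, which is stricter than pointwise coverage.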
|Coverage probability||Area between curves|
|Linear model||Nonlinear model||Linear model||Nonlinear model|
We observe that, across , , and , the proposed quantile bootstrap method provides approximately 90% coverage with the area between curves decreasing for larger sample sizes.
Figure 1 below contains bootstrap confidence bands for one simulated replication for the linear and nonlinear generative model when , , and . The true ROC curve, calculated from a large testing set of size 100,000, is also plotted.
These figures demonstrate that the proposed quantile bootstrap produces confidence bands that capture the true ROC curve and are sufficiently narrow as to provide useful inference about the future performance of an estimated SVM classifier.
5 Applications to Data
5.1 Breast Cancer Genomics
We apply the weighted SVM to the problem of predicting treatment response among patients with breast cancer. The full data consist of 323 patients with complete data. For each patient, we calculated a collection of 512 gene expression signatures, called modules, each of which is a function of patient gene expression data, which can be used to predict response to neoadjuvant chemotherapy (Fan et al., 2011). We also observe a variety of clinical variables, e.g., age and tumor stage. Figure 2 contains ROC curves for predicting response to treatment using the linear and Gaussian SVM, logistic regression with LASSO penalty (Tibshirani, 1996), and random forests (Breiman, 2001), along with confidence bands for the linear SVM.
The methods perform comparably, with each ROC curve falling within the confidence bands for the linear SVM. Table 3 contains AUC and optimal sensitivity and specificity for each method along with the sensitivity and specificity of the unweighted versions of each method.
On these data, the linear SVM achieves the best AUC. Each method achieves a better balance between sensitivity and specificity after proper weighting. Unweighted classification results in close to perfect specificity at the expense of very low sensitivity for each method. This is likely due to the imbalance in the data (only 22% of patients in the sample respond).
5.2 Diagnosis of Infant Hepatitis C
We also applied the proposed methods to data from the cohort study of mother-to-infant hepatitis C transmission of Shebl et al. (2009). In this study, 1863 mother-infant pairs in three Egyptian villages were studied to assess risk factors for vertical transmission of hepatitis C virus (HCV). Of this sample, 33 infants were positive for both HCV RNA and HCV antibodies at the end of the study. We use data from infant follow-up visits at 2-4 months and 10-12 months. At each follow-up visit, infants were tested for HCV RNA using a polymerase chain reaction (PCR) test and HCV antibodies using an enzyme-linked immunosorbent assay (ELISA) test. Mothers in the study were also tested for HCV RNA and antibodies during pregnancy. In pediatric infectious diseases, it is important to correctly diagnose infected infants so that they will be retained in care for subsequent treatment. A test with high specificity is also important, as this allows for quickly and reliably reassuring families that their child is not infected and needs no further care. We use a weighted SVM to estimate a classifier based on the mother’s test results during pregnancy and the infant’s test results at 2-4 months. While a PCR test at 2-4 months detects HCV viremia, it cannot predict which children subsequently become chronically infected, and a PCR test at 10-12 months remains the gold standard.
In this study, the PCR test achieved a sensitivity of 0.4167 and a specificity of 0.9911. The ELISA test achieved a sensitivity of 0.5833 and a specificity of 0.9571. Due to a variety of factors, diagnosis during the early months of life is difficult. Both PCR and ELISA suffer from low sensitivity at 2-4 months for detecting which infants will become chronically infected later. It is of interest to see if diagnosis via a weighted SVM can provide even a modest improvement in performance thereby reducing the need for a repeat test after 10-12 months of age.
We apply the weighted SVM and evaluate performance using 5-fold cross validation. Averaging the estimated sensitivity and specificity for each value of over the 5 folds yields the ROC curve found in Figure 3, plotted with bootstrap confidence bands. We plot the sensitivity and specificity of the individual PCR and ELISA tests as points in the figure.
The closest point on the ROC curve to (0, 1) yields an estimated sensitivity of 0.6011 and an estimated specificity of 0.8000, which provides increased sensitivity and a better balance between sensitivity and specificity when compared to the usual diagnostic tests. Classification is difficult due to the imbalance of infections and non-infections in the data, but a weighted SVM provides increased performance compared to either diagnostic test available.
A wide variety of problems in biomedical decision making can be expressed as classification problems, such as diagnosing disease and predicting response to treatment. In some clinical applications, false positives may have very different consequences from false negatives; classification methods which can properly weight sensitivity and specificity and estimate the optimal ROC curve are needed, along with inference methods for the ROC curve. Estimating the optimal ROC curve using a weighted SVM has been considered by Veropoulos et al. (1999). We have established the theoretical justification for estimating the ROC curve with a weighted SVM, demonstrated its performance in simulation studies, and provided a bootstrap confidence band method for the SVM ROC curve.
The applications of the weighted SVM in diagnostic medicine are numerous. We have demonstrated, for example, that this method can be used to improve early infant diagnosis of hepatitis C. Early detection of childhood infectious diseases is an important public health problem; reliable early diagnosis identifies children who could transmit the virus and would benefit from treatment with antivirals. We have also demonstrated that the weighted SVM accommodates high dimensional data and can be used to predict response to neoadjuvant breast cancer treatment using genomic information.
Because machine learning techniques are well suited to binary classification, there is great potential for research in applying machine learning to diagnostic medicine and other biomedical decision making problems. Developing methods of variable selection for the weighted SVM (Dasgupta et al., 2015) is an important step forward for this research as our simulations indicate that the performance of the Gaussian SVM is hindered by noise variables. Other areas of future work may include developing methods to accommodate biomarker measurements that are taken at different time points from the same patient.
The authors gratefully acknowledge the following funding sources: NIH T32 CA201159, NIH P01 CA142538, National Center for Advancing Translational Sciences UL1 TR001111, NSF DMS-1555141, NSF DMS-1557733, NSF DMS-1513579, NIH 1R01DE024984, NIH U01HD39164, NIH U01AI58372, an investigator initiated grant from Merck, NCI Breast SPORE program (P50-CA58223-09A1), the Breast Cancer Research Foundation, and the V Foundation for Cancer Research.
In the case of a linear or polynomial decision function, is a Vapnik–Červonenkis class for any . Thus, is a Donsker class for any . Let . Then, we have that
where is a quantity converging to 0 in probability uniformly over , lies in a Donsker class, and is a mean zero Gaussian process with covariance
Similarly, for specificity, we have that
where , the is uniform over as before, is a mean zero Gaussian process with covariance