# Selecting Biomarkers for Building Optimal Treatment-Selection Rules Using Kernel Machines

Optimal biomarker combinations for treatment selection can be derived by minimizing the total burden to the population caused by the targeted disease and its treatment. However, when multiple biomarkers are present, including all of them in the model can be expensive and can hurt model performance. To remedy this, we incorporate feature selection into the optimization by minimizing an extended total burden that additionally accounts for biomarker measurement costs. Formulating this as 0-norm penalized weighted classification, we develop various procedures for estimating linear and nonlinear combinations. Through simulations and a real data example, we demonstrate the importance of incorporating feature selection and marker cost when deriving treatment-selection rules.


## 1 Introduction.

A considerable amount of recent biometric research is conducted in the ‘personalized medicine’ framework, since it is now well accepted that individual subjects’ responses to treatment can be heterogeneous in many disease settings. The characteristics contributing to this heterogeneity may include patient demographics, genetic/genomic information, or other biological markers, henceforth referred to as treatment-selection biomarkers. These biomarkers can be used to select optimal therapies for individuals in order to optimize the clinical outcome. It is also important to note that a single biomarker may not sufficiently explain this heterogeneity; multiple biomarkers may need to be combined within an appropriate statistical framework to optimize the process of treatment selection.

The direct approach to identifying optimal marker combinations for treatment selection involves parametric modeling of the disease risk conditional on biomarkers, treatment assignment, and other baseline patient characteristics, and recommending treatment based on whether the predicted risk under treatment is lower than the predicted risk under no treatment. This framework was first introduced by Song and Pepe (2004), and studied extensively in Foster et al. (2011); Qian and Murphy (2011); Lu et al. (2013). Treatment-selection rules based on parametric risk models rely heavily on the correct specification of the model, which is often challenging given the complexity of biological mechanisms. An alternate and much more robust approach is to build optimization algorithms that minimize (or maximize) a desired criterion. This criterion, often called the objective function, is formulated based on relevant goals for treatment selection, and the optimization algorithm searches for the marker combination that optimizes it. These methods (also called indirect approaches) can be made completely nonparametric and, being largely assumption-free, are much more robust to model misspecification. Zhang et al. (2012a, b) proposed finding the optimal marker combination within a pre-specified class by optimizing an estimator of the overall population mean outcome. Zhao et al. (2012) approached the same goal as an outcome-weighted learning problem and derived optimal treatment-selection rules using a weighted support vector machine method.

In Huang and Fong (2014), the authors proposed a new method to identify linear and nonlinear marker combinations by directly optimizing a targeted criterion function in the manner of Zhang et al. (2012a) and Zhao et al. (2012). However, their targeted criterion differed from others in that adverse side effects and/or costs incurred by the treatment were considered along with the event rates of the targeted disease when establishing the objective function. This was a crucial extension, as reducing safety events or the cost of administering the treatment is often a valid policy goal (see Vickers et al., 2007). The authors formulated the optimization problem as the minimization of a weighted sum of 0-1 losses, and used the ramp loss as an approximation of the 0-1 loss. In this article, we extend the idea of Huang and Fong (2014) by creating an augmented target criterion that takes on the additional challenge of controlling the number of markers in the risk model as well. This is crucial for two reasons: (a) measuring biomarkers can be expensive with respect to both time and money, and possibly invasive as well; it is therefore of significant interest to limit the number of biomarkers that need to be collected from an individual to select his/her optimal treatment. (b) As discussed earlier, our original problem requires finding optimal biomarker combinations that explain the heterogeneity in disease response, but finding this optimal combination can be difficult in the presence of redundant markers that do not contribute to treatment selection. This may lead to overfitting and result in overly complex selection rules with poor prediction performance. One way to deal with this ‘curse of dimensionality’ is marker selection.

Recently, considerable attention has been directed to effect-modifier selection in precision medicine. For example, Zhao (2017) studied selective inference in effect-modification models via LASSO; Liang (2017) constructed sparse decision rules in the context of concordance-assisted learning; and Shi (2018) proposed a penalized multi-stage A-learning algorithm for deriving the optimal dynamic treatment regime. Although these feature-selection methods are all relevant to the broad genre of precision medicine, each of them targets a very specific problem, quite different from our objective of penalized optimization of the single-stage targeted objective function, for which only limited research exists: marker selection in treatment-recommendation problems was briefly studied by Huang (2016) and Zhou et al. (2015), both of which targeted minimization of the disease rate in treatment selection.

To address these issues along with the original goals of treatment selection, we reformulate the criterion for treatment selection by associating a cost with measuring each biomarker used in the optimal rule. The optimization problem can be written as the minimization of a weighted sum of 0-1 losses, but with a penalty added for the number of markers in the model. In this article, we adopt the usual hinge-loss convex relaxation of the 0-1 loss and perform a comprehensive investigation and comparison of various algorithms for solving this optimization problem. In linear support vector machines, feature selection is a well-researched problem (see Bradley and Mangasarian, 1998; Weston et al., 2003; Mangasarian, 2006; Huang et al., 2010), although it is a much more challenging problem in nonlinear support vector machines (see Mangasarian and Wild, 2007). However, it is worth noting that our setting differs from a simple classification format in two vital aspects: (a) although the treatment-selection objective can be rewritten as a (weighted) classification problem (as shown in Section 2), it is in essence a fundamentally different problem from classification, and feature-selection techniques for SVMs have not been studied in this context; and (b) the weighted SVM is a more complicated optimization problem than the standard SVM, since the constraint on each support vector varies according to the weight associated with it, and research into feature selection in this setting has also been fairly limited to date. Some work has been done to extend the SCAD penalty to weighted linear support vector machines with special forms of such weights (see Jung, 2013), but beyond that, there has not been any targeted investigation of this problem to the best of our knowledge. These two gaps make the present explorations vitally important.

Thus, the biggest contribution of this article is to combine methodological advances from two different areas of research: indirect approaches to treatment selection through kernelized methods, and feature-selection techniques in nonparametric statistical learning methods such as SVMs. We believe that combining these two areas of research is a novel methodological formulation in itself. Additionally, each of these feature-selection methods has been translated from the standard SVM framework to the weighted SVM formulation to match the proposed objective. Moreover, adding a penalty for the number of markers in the model redefines the objective function itself, and this penalized objective is used in conjunction with these feature-selection techniques; for example, in choosing tuning parameters for a given feature-selection method, we propose a generalized cross-validation (GCV) technique that utilizes the whole objective criterion, including the penalty for the number of markers.

The article continues in the following manner: In Section 2, we establish the problem of minimizing the total disease, treatment, and marker-measurement cost in treatment selection, and discuss marker selection in conjunction with treatment selection. We briefly discuss various methods built for support vector machines, adapted specifically to the weighted classification setup, for deriving the best linear and nonlinear marker combinations that optimize the desired objective function. In Section 3, we set up different simulation examples to test the strength of these linear and nonlinear feature-selection methods, modified for our setting, and discuss the results. In Section 4, we illustrate the application of our methods using a real data example from an HIV vaccine trial. Finally, in Section 5, we discuss our findings regarding the use of these feature-selection methods and the impact of incorporating marker-measurement cost into treatment selection; links to additional materials are provided in the Supplementary Section 6.

## 2 Methods.

In this article, we consider the problem of finding optimal treatment-selection rules for a binary clinical outcome $Y$ (0 for nondiseased and 1 for diseased) based on a set of candidate markers collected from an individual’s biological characteristics (the solution we propose is applicable to clinical outcomes measured on a continuous scale as well). Denote this set of candidate markers by $\mathbf{X}_p$. For a given subset of markers $X \subseteq \mathbf{X}_p$, we consider marker-based treatment-selection rules of the form $A(X) = I\{f(X) > 0\}$, where $f \in \mathcal{F}_X$, a class of functions spanning over $X$, and $I$ is the indicator function. The above then translates to the rule: $A(X) = 0$ for not treating, and $A(X) = 1$ for treating. The treatment-selection benefit of a decision rule can be quantified by $E_{A(X)}(Y)$, the expected disease rate in the population as a result of treatment selection based on $A(X)$ (Song and Pepe, 2004; Qian and Murphy, 2011). This measure has been widely accepted as a crucial metric in recent literature (Zhao et al., 2012; Zhang et al., 2012a). Although $E_{A(X)}(Y)$ characterizes the burden of disease upon the population, one may be concerned with additional burden associated with a treatment regimen, such as its side effects and/or the monetary cost of its implementation. Hence it may be preferable to search for treatment-selection strategies that take these aspects into consideration as well. Huang and Fong (2014) proposed to incorporate the additional burden associated with a treatment regimen, such as that due to its side effects or monetary cost, by pre-specifying a treatment/disease harm ratio so that each burden type can be put on the same scale. Following the decision-theoretic framework of Vickers et al. (2007), let $\delta_1$ be a pre-specified ratio of the burden per treatment relative to the burden per disease event, and let $Y(1)$ and $Y(0)$ indicate the potential disease outcomes if a subject were to receive or not receive the treatment.
Then the total burden due to disease and treatment for $A(X)$, represented in the unit of burden per disease event (see Huang and Fong, 2014), is given as $E_{A(X)}(Y) + \delta_1 \times E\{A(X)\}$. The best treatment-selection rule is derived as the one that minimizes this total burden.

In this article we further take into consideration the cost of measuring biomarkers when deriving the optimal treatment-selection rule, under the expectation that including the burden of measuring biomarkers in the criterion function can lead to more cost-effective treatment-selection rules, in the sense of achieving the desired public health impact under the guidance of a parsimonious biomarker panel. We make the following assumptions: (i) measurement of each biomarker induces roughly equal cost, and (ii) the cost of measuring one biomarker is $\delta_2$ times the burden per disease event. Then the total burden due to disease, treatment, and biomarker measurement for a treatment-selection rule $A(X)$ can be represented in the unit of the burden per disease event as

$$
\begin{aligned}
\theta &= E_{A(X)}(Y) + E\{\delta_1 \times A(X)\} + \delta_2 \times \dim(X) \\
&= \sum_{a=0}^{1} E\big[Y(a) \times I\{A(X)=a\}\big] + \delta_1 \times E\{A(X)\} + \delta_2 \times \dim(X).
\end{aligned} \tag{1}
$$

Note: Although we assume a roughly equal cost for each biomarker for simplicity in this paper, it is straightforward to extend the method to a setting where each biomarker has a different cost. Biomarkers that are absolutely essential to collect can be considered to have no cost at all, or there may be several groups of biomarkers, such that each group has a different cost depending on its burden and importance for the population of interest. In such situations, the penalty $\delta_2 \times \dim(X)$ can be replaced by $\sum_{j} \delta_{2j} \times I(X_j \in X)$, where $\delta_{2j}$ is the cost associated with the biomarker $X_j$, and our methods can still be applied as described with only minor modifications.

We propose to derive an optimal treatment-selection rule by minimizing this quantity $\theta$. Suppose there exists an ‘optimal’ or ‘correct’ set of biomarkers $X_0$ within the list of biomarkers measured, such that $X_0 \subseteq \mathbf{X}_p$, which when combined through an oracle rule minimizes $\theta$. Thus, we can write the above problem as,

$$
f_0 = \operatorname*{argmin}_{X \subseteq \mathbf{X}_p} \; \min_{f \in \mathcal{F}_X} \; \theta \tag{2}
$$

In practice, it is important to have a sensible way to specify the values, or range of values, for $\delta_1$ and $\delta_2$. The choice of these cost ratios can be facilitated by information on the monetary cost of controlling the targeted disease, of applying the treatment, and of biomarker measurement, as in our data example of making recommendations for HIV vaccination, presented later in Section 4.

Now imagine data from a two-arm randomized trial with the treatment indicator $T$ taking values 0 and 1, referring to being untreated and treated, respectively. Let $n_0$ and $n_1$ indicate the numbers of subjects in the untreated and treated arms, respectively. Thus we have i.i.d. samples $(Y_i, T_i, X_i)$ for $i = 1, \ldots, n$ with $n = n_0 + n_1$. As in Huang and Fong (2014), we assume: (i) the stable unit treatment value assumption (SUTVA) (Rubin, 1980) and consistency: the potential outcome of one subject is independent of the treatment assignments of other subjects, and given the treatment a subject actually received, the subject’s potential outcome equals the observed outcome; (ii) the ignorable treatment assignment assumption: $T \perp \{Y(0), Y(1)\}$. Assumption (i) is plausible in trials where participants do not interact with one another, and assumption (ii) is ensured by randomization. Under the above assumptions, it can be shown that,

$$
\begin{aligned}
\theta &= E[Y \times \{1 - A(X)\} \mid T=0] + E[Y \times A(X) \mid T=1] + \delta_1 \times P\{A(X)=1\} + \delta_2 \times \dim(X) \\
&= E(Y \mid T=0) - E\big[A(X) \times \{\mathrm{Risk}_0(X) - \mathrm{Risk}_1(X) - \delta_1\}\big] + \delta_2 \times \dim(X),
\end{aligned} \tag{3}
$$

where $\mathrm{Risk}_0(X) = P(Y=1 \mid X, T=0)$ and $\mathrm{Risk}_1(X) = P(Y=1 \mid X, T=1)$ are the risks of disease conditional on X among the untreated and treated, respectively.

Huang and Fong (2014) showed that when $\delta_2 = 0$, an optimal rule can be obtained as $A(X) = 1$ if $\mathrm{Risk}_0(X) - \mathrm{Risk}_1(X) > \delta_1$, and $A(X) = 0$ otherwise. But when $\delta_2$ is positive, such a strategy alone would not work, as we also need to control the number of markers in the model. Since the quantity $\delta_2 \times \dim(X)$ imposes an $\ell_0$-type penalty on the set of markers $X$, one ad hoc method may be to use standard regression-based methods with $\ell_0$/$\ell_1$ penalization to estimate the quantities $\mathrm{Risk}_0(X)$ and $\mathrm{Risk}_1(X)$, and then derive the treatment-selection rule as $A(X) = I\{\widehat{\mathrm{Risk}}_0(X) - \widehat{\mathrm{Risk}}_1(X) > \delta_1\}$. But note that this procedure may lead to suboptimal rules with respect to our goal of minimizing $\theta$.
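To illustrate the $\delta_2 = 0$ case numerically, the following minimal numpy sketch checks that the rule that treats when the risk reduction exceeds the treatment burden attains a lower total burden than treating everyone or no one. The sigmoid risk curves, the grid, and the $\delta_1$ value are our own illustrative choices, not quantities from the paper:

```python
import numpy as np

# Hypothetical risk curves (illustrative only): risk of disease without and
# with treatment, as functions of a scalar marker x.
def risk0(x):
    return 1.0 / (1.0 + np.exp(-x))

def risk1(x):
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

delta1 = 0.05                        # assumed treatment/disease burden ratio
x = np.linspace(-3.0, 3.0, 601)      # grid standing in for the marker distribution

def total_burden(treat):
    # theta with delta2 = 0: disease burden plus treatment burden
    return np.mean(risk0(x) * (1 - treat) + (risk1(x) + delta1) * treat)

# Treat exactly when the risk reduction exceeds delta1
rule = (risk0(x) - risk1(x) > delta1).astype(float)

theta_rule = total_burden(rule)
theta_all = total_burden(np.ones_like(x))    # treat everyone
theta_none = total_burden(np.zeros_like(x))  # treat no one
```

Because the rule minimizes the integrand pointwise, its burden can never exceed that of the all-or-none policies.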

Alternatively, following the strategy in Zhang et al. (2012a), Zhao et al. (2012), and Huang and Fong (2014), we can consider a class of rules for treatment recommendation based on functionals $f(X)$ (belonging to some functional class $\mathcal{F}_X$ on X) and a given threshold (usually 0). In particular, we let $A(X) = I\{f(X) > 0\}$, with $I$ the indicator function and $f$ a function of the markers X. Then, assuming for simplicity that randomization does not depend on X, $\theta$ can be expressed as an expectation of a weighted 0-1 loss plus the marker-cost penalty. The optimal $f$ can be found by minimizing the empirical estimate of $\theta$, that is,

$$
\hat f = \operatorname*{argmin}_{X \subseteq \mathbf{X}_p} \; \min_{f \in \mathcal{F}_X} \left[ \sum_{i=1}^{n} \left\{ \frac{Y_i \times T_i}{n_1} - \frac{Y_i \times (1 - T_i)}{n_0} + \frac{\delta_1}{n} \right\} \times I\{f(X_i) \le 0\} + \delta_2 \times \dim(X) \right].
$$

Therefore, we can formulate this problem as the minimization of the sum of a weighted 0-1 loss and a term proportional to the number of biomarkers. That is, $\hat f$ can be found as the minimizer of

$$
\sum_{i=1}^{n} W_i \, I\{f(X_i) \le 0\}/n + \delta_2 \times \dim(X), \tag{4}
$$

with the case-specific weight $W_i = Y_i T_i \, n/n_1 - Y_i (1-T_i)\, n/n_0 + \delta_1$. Other types of weights, such as control-specific or case-control mixture weights or their robust substitutes, can also be adopted and are discussed in Huang and Fong (2014). For example, robust substitutes for these weights can be constructed using $\widehat{\mathrm{Risk}}_0(X)$ and $\widehat{\mathrm{Risk}}_1(X)$, risk estimates obtained from a working model. It is worthwhile to note that since the minimization of (2) is equivalent to the minimization of (4), any consistent estimate of the weights $W_i$ can also serve as weights in (4).

Note that if we are only interested in combining a set of candidate biomarkers without performing any variable selection, then it is not necessary to include the biomarker measurement cost in the criterion function as in Huang and Fong (2014). However, in the presence of a large number of biomarkers, incorporation of marker cost will impact the features selected in the estimated rule.
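As a concrete sketch, the case-specific weights can be computed directly from trial data and the problem recast in classification form. The numpy example below uses made-up outcomes and our reading of the weight formula preceding (4), $W_i = Y_iT_i\,n/n_1 - Y_i(1-T_i)\,n/n_0 + \delta_1$; treat it as an illustration rather than the paper's exact implementation:

```python
import numpy as np

# Toy two-arm trial data (made up for illustration)
T = np.array([1, 1, 0, 0])      # treatment arm indicator
Y = np.array([1, 0, 1, 0])      # binary disease outcome
delta1 = 0.1                    # treatment/disease burden ratio

n = len(T)
n1 = T.sum()
n0 = n - n1

# Case-specific weights (our reading of the formula preceding Eq. (4))
W = Y * T * n / n1 - Y * (1 - T) * n / n0 + delta1

# Recast as a weighted classification problem:
Y_star = np.sign(W)   # pseudo class labels
q = np.abs(W)         # nonnegative subject-specific weights
```

Subjects with negative weights become the $-1$ class, so minimizing the weighted 0-1 loss is equivalent to a weighted classification of the signs.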

### 2.1 Treatment selection as a weighted Support Vector Machines problem.

In this section, we consider the minimization of the regularized loss function (4), conditional on a pre-specified set of weights $W_i$. Since $W_i \times I\{f(X_i) \le 0\} = |W_i| \times I\{\mathrm{sgn}(f(X_i)) \ne \mathrm{sgn}(W_i)\}$ up to an additive constant not depending on $f$, (4) can be reformulated as

$$
\sum_{i=1}^{n} |W_i| \, I\{\mathrm{sgn}(f(X_i)) \ne \mathrm{sgn}(W_i)\}/n + \delta_2 \times \dim(X). \tag{5}
$$

Note that (5) is a weighted classification problem where $\mathrm{sgn}(W_i)$ gives the true binary class, $\mathrm{sgn}(f(X))$ is the predicted binary class based on X, and $|W_i|$ is the subject-specific weight. This type of problem can be solved using the weighted support vector machine (Lin and Wang, 2002), based on the labels $Y^*_i = \mathrm{sgn}(W_i)$, the inputs $X_i$, and the weights $q_i = |W_i|$. Since minimization of a weighted sum of 0-1 losses is non-convex and intractable, we replace the 0-1 loss with a convex surrogate; the hinge loss is often used in this context. The hinge loss, given as $\max\{1 - Y^* f(X), 0\}$, has been proven to be a useful surrogate for the classification loss; accordingly, the term $I\{\mathrm{sgn}(f(X_i)) \ne \mathrm{sgn}(W_i)\}$ in (5) is replaced by $\max\{1 - Y^*_i f(X_i), 0\}$. It is worth noting that the hinge loss penalizes the departure of a decision rule from its observed class label according to the extent of this departure, which the classification loss fails to do. The hinge-loss SVM formulation of the above problem is given as,

$$
\min_{X \subseteq \mathbf{X}_p} \; \min_{f \in \mathcal{F}_X} \; \sum_{i=1}^{n} |W_i| \max\{1 - \mathrm{sgn}(W_i) \times f(X_i),\, 0\}/n + \delta_2 \times \dim(X). \tag{6}
$$

This is the penalized weighted support vector machines framework. In SVMs, the functional space is generally restricted to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_X$, represented uniquely by its kernel $k_X(\cdot, \cdot)$. Before considering the solution to (6), we first review the weighted support vector machine formulation of Lin and Wang (2002), in which the square of the Hilbert-space norm is used in place of the $\ell_0$ penalty. For a given subspace $\mathcal{H}_X$, we define it as:

$$
\min_{f \in \mathcal{H}_X} \sum_{i=1}^{n} |W_i| \max\{1 - \mathrm{sgn}(W_i) \times f(X_i),\, 0\}/n + \lambda \|f\|^2_{\mathcal{H}_X}. \tag{7}
$$

In linear support vector machines, the optimization is carried out over the RKHS whose kernel is the Euclidean inner product, $k(x, x') = \langle x, x' \rangle$. The Hilbert norm in this case becomes the usual Euclidean norm on the linear combination weights $\beta$, and the estimated decision function can be expressed as a linear combination of the input marker set X. However, as many optimal marker combinations for treatment selection may not be among the linear combinations of the input markers, it is important to extend the function class to include more complex nonlinear functions. This can be achieved by considering transformations of the feature space through feature maps $\phi(\cdot)$, the appropriate RKHS for which is the one with kernel satisfying $k(x, x') = \langle \phi(x), \phi(x') \rangle$. Examples of popular nonlinear kernels include the polynomial kernel of degree $d$, $k(x, x') = (\langle x, x' \rangle + 1)^d$, and the radial basis function (RBF) kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, with $\gamma$ as a tuning parameter. The resulting SVM solution at a given covariate vector $x$ is given as $\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i Y^*_i k(X_i, x) + \hat\beta_0$, where the $\hat\alpha_i$'s are the trained weights on the support vectors and $\hat\beta_0$ is the estimated global constant.
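For concreteness, a minimal numpy sketch of these kernels and of the resulting decision-function form follows; the dual weights and offset below are arbitrary placeholders for illustration, not trained coefficients:

```python
import numpy as np

def linear_kernel(x, z):
    # Euclidean inner product kernel
    return np.dot(x, z)

def poly_kernel(x, z, d=2):
    # polynomial kernel of degree d
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, gamma=0.5):
    # radial basis function (Gaussian) kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

# SVM decision function: f(x) = sum_i alpha_i * Ystar_i * k(X_i, x) + beta0
X_train = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y_star = np.array([1.0, -1.0, 1.0])   # pseudo class labels
alpha = np.array([0.5, 0.5, 0.0])     # placeholder dual weights (not trained)
beta0 = 0.1

def decision(x, kernel=rbf_kernel):
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alpha, Y_star, X_train)) + beta0

f_val = decision(np.array([0.5, 0.5]))
```

A point is recommended for treatment according to the sign of the decision value, mirroring the rule $A(X) = I\{f(X) > 0\}$.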

### 2.2 Weighted versus unweighted support vector machines.

The weighted support vector machine is a well-known technique for classification. First proposed by Lin and Wang (2002), who called it the “fuzzy support vector machine” or FSVM, it has been further applied and studied in subsequent works (see, for example, Fan and Ramamohanarao, 2005; Yang et al., 2007; Zhao et al., 2012). In essence, the weighted support vector machine becomes an important tool in classification when some training points are more important than others for the given problem. The main difference between solving the weighted SVM and the standard unweighted SVM lies in the constraints placed on the support vectors in the dual formulation of the problem. To see this, let us write down the dual of the unweighted SVM below and note the changes that transform it into a weighted SVM problem. Denote $Y^*_i = \mathrm{sgn}(W_i)$ and $q_i = |W_i|$, and note that the unweighted, squared-Hilbert-norm-penalized SVM is given as $\min_{f \in \mathcal{H}_X} \sum_{i=1}^{n} \max\{1 - Y^*_i f(X_i), 0\}/n + \lambda \|f\|^2_{\mathcal{H}_X}$, the dual of which is given as,

$$
\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y^*_i Y^*_j k_X(x_i, x_j) \quad \text{subject to } \sum_{i=1}^{n} Y^*_i \alpha_i = 0, \;\; 0 \le \alpha_i \le C, \tag{8}
$$

for $i = 1, \ldots, n$. The dual of (7) differs from (8) only through the constraints placed on the $\alpha_i$'s. In the unweighted SVM, the upper bound for each $\alpha_i$ is the fixed constant $C$, while in the weighted SVM, the upper bound for each $\alpha_i$ is further multiplied by the weight $q_i$ and thus becomes $q_i C$ (that is, $0 \le \alpha_i \le q_i C$). Intuitively, this just means that $\alpha_i$, the coefficient associated with a given sample, is constrained differently according to that sample's weight.

### 2.3 Identifying optimal biomarker combinations in the linear space.

As noted before, variable selection is a key motivation in our pursuit to solve (6), but it is not an inherent feature of the support vector machines framework of (7). The Hilbert norm provides some control over the complexity of the estimated functions, but performs no variable selection. Many authors have proposed using the $\ell_1$ and $\ell_0$ penalties either as a replacement for, or in conjunction with, the $\ell_2$ norm in linear SVMs. First, note that solving (6) is a combinatorially hard problem (see Amaldi and Kann, 1998), but the zero norm is directly related to finding minimal subsets and can provide optimal feature extraction if properly implemented. There are, however, many feature-extraction algorithms for the unweighted linear support vector machine that do not rely on the $\ell_0$ norm for selection (see, for example, Mangasarian, 2006; Zhang et al., 2006). To this end, in this article we also look at alternative ways to perform feature selection in support vector machines, instead of focusing solely on solving (6). Here, we build our optimization approaches on some of the most commonly used algorithms for unweighted SVMs. That is, we modify each of them to cater to the weighted version of the algorithm, and we will see how this paves the way for directly or indirectly solving (6) in the linear space. We now briefly introduce each method below, along with its modified objective function under the weighted SVM setup; a more detailed description of each is reserved for the Supplementary Section (see Web Appendix LABEL:sec:appA).

1. $\ell_1$ weighted support vector machines ($\ell_1$ WSVM), which utilizes the $\ell_1$ norm for penalization instead of the squared Hilbert norm, built upon Mangasarian (2006), with objective function,

$$
\min_{\beta, \beta_0} \sum_{i=1}^{n} q_i \max\{1 - Y^*_i(\langle X_i, \beta \rangle + \beta_0),\, 0\}/n + \lambda \sum_{j=1}^{p} |\beta_j|. \tag{9}
$$
2. SCAD, which applies the smoothly clipped absolute deviation penalty (Zhang et al., 2006) to the WSVM objective function,

$$
\min_{\beta, \beta_0} \sum_{i=1}^{n} q_i \max\{1 - Y^*_i(\langle X_i, \beta \rangle + \beta_0),\, 0\}/n + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \tag{10}
$$

where $p_\lambda(\cdot)$ denotes the SCAD penalty function.

3. elastic SCAD, which uses a mixture of the SCAD penalty and the $\ell_2$ norm,

$$
\min_{\beta, \beta_0} \sum_{i=1}^{n} q_i \max\{1 - Y^*_i(\langle X_i, \beta \rangle + \beta_0),\, 0\}/n + \sum_{j=1}^{p} p_{\lambda_1}(|\beta_j|) + \lambda_2 \|\beta\|^2. \tag{11}
$$
4. feature selection concave (FSV), built on a concave minimization algorithm originally proposed by Bradley and Mangasarian (1998), which achieves approximate $\ell_0$ penalization in weighted SVMs by iteratively solving,

$$
\begin{aligned}
\min_{\beta, \beta_0, v} \quad & \frac{\sum_{Y^*_i=1} q_i \max\{1 - \langle X_i, \beta \rangle - \beta_0,\, 0\}}{\sum_{i=1}^{n} I(Y^*_i = 1)} + \frac{\sum_{Y^*_i=-1} q_i \max\{1 + \langle X_i, \beta \rangle + \beta_0,\, 0\}}{\sum_{i=1}^{n} I(Y^*_i = -1)} + \sum_{j} \alpha e^{-\alpha v^*_j}(v_j - v^*_j) \\
\text{subject to} \quad & -v \le \beta \le v,
\end{aligned} \tag{12}
$$
5. approximation of the zero-norm minimization ($\ell_2$-AROM), which achieves approximate $\ell_0$ penalization (see Weston et al., 2003) by solving the following objective function iteratively. It initializes a vector $z = (1, \ldots, 1)$ and, at each successive step, resets $z$ as $z * \hat\beta$,

$$
\min_{\beta, \beta_0} \sum_{i=1}^{n} q_i \max\{1 - Y^*_i(\langle z * X_i, \beta \rangle + \beta_0),\, 0\}/n + \lambda \sum_{j=1}^{p} |\beta_j|^2, \tag{13}
$$

where $*$ denotes the elementwise product.

6. $\ell_0$ weighted support vector machines ($\ell_0$ WSVM), built on the work of Huang et al. (2010), which achieves approximate $\ell_0$-norm penalization in weighted support vector machines through an iterative scheme, solving the following objective function at the $t$-th step,

$$
(\beta^{(t)}, \beta_0^{(t)}) = \operatorname*{argmin}_{\beta, \beta_0} \sum_{i=1}^{n} q_i \xi_i / n + \lambda \beta^T \Lambda^{(t-1)} \beta, \tag{14}
$$

subject to $\xi_i \ge 0$ and $\xi_i \ge 1 - Y^*_i(\langle X_i, \beta \rangle + \beta_0)$ for all $i$, where $\Lambda^{(t-1)}$ is a diagonal reweighting matrix constructed from the previous iterate $\beta^{(t-1)}$.
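As an illustration of the simplest of these methods, the $\ell_1$-penalized weighted hinge loss of method 1 can be minimized by proximal subgradient descent. The self-contained numpy sketch below uses a tiny symmetric toy dataset, and the data, step size, and $\lambda$ are our own illustrative choices; the paper's actual solvers are described in the Web Appendix:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_wsvm(X, y_star, q, lam=0.1, eta=0.1, iters=200):
    """Minimize sum_i q_i*max(1 - y_i(<x_i,b> + b0), 0)/n + lam*||b||_1
    via proximal subgradient descent (illustrative, not the paper's solver)."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(iters):
        margin = y_star * (X @ beta + beta0)
        active = (1.0 - margin > 0).astype(float)      # hinge subgradient support
        grad_beta = -(X.T @ (q * y_star * active)) / n
        grad_beta0 = -np.sum(q * y_star * active) / n
        beta = soft_threshold(beta - eta * grad_beta, eta * lam)
        beta0 -= eta * grad_beta0
    return beta, beta0

# Toy data: feature 0 is informative, feature 1 is near-constant noise
y_star = np.array([1., -1., 1., -1., 1., -1., 1., -1.])
X = np.column_stack([y_star, 0.01 * np.ones(8)])
q = np.array([1.5, 1.5, 0.5, 0.5, 1.5, 1.5, 0.5, 0.5])  # subject-specific weights

beta, beta0 = l1_wsvm(X, y_star, q)
```

On this toy example the noise coefficient is soft-thresholded to exactly zero while the informative coefficient settles near the margin value of one, illustrating how the $\ell_1$ penalty performs variable selection within the weighted hinge objective.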

### 2.4 Identifying optimal biomarker combinations in the nonlinear space.

In this section, we consider the derivation of the best biomarker combinations in the nonlinear space, as many optimal marker combinations for treatment-selection rules may not be among the linear combinations of the input markers. Feature selection is a fairly straightforward procedure for linear SVM classifiers, but it is a much more challenging problem in nonlinear support vector machines (see Mangasarian and Wild, 2007). For example, in linear support vector machines we can use the $\ell_1$ or $\ell_0$ norm instead of the Hilbert norm. However, when similar techniques are applied to nonlinear SVM classifiers, we obtain a reduction in the number of support vectors but not in the number of input-space features (Fung and Mangasarian, 2004). So although the dimensionality of the transformed space is reduced, there is no direct reduction of input-space features. A few procedures have recently been developed to deal with feature selection in the original input space under nonlinear feature maps (see Dasgupta et al., 2013; Allen, 2013), but few of them have been developed with the goal of $\ell_0$ penalization. We will investigate some of these newly developed methods here, but under the more general setting of weighted support vector machines. We briefly introduce each method below, along with its modified objective function under the weighted SVM setup; a more detailed description of each is reserved for the Supplementary Section (see Web Appendix LABEL:sec:appB).

1. risk recursive feature elimination (riskRFE), a newly developed, powerful wrapper technique based on recursive computation of the learning function (see Dasgupta et al., 2013), which works in both the linear and the nonlinear space (see Figure 1 for a flow chart of the algorithm).

2. kernel iterative feature extraction (KNIFE), which achieves feature selection in nonlinear SVMs by iteratively optimizing a feature-regularized loss function, weighting the features within the kernel (see Allen, 2013). For feature weights $w = (w_1, \ldots, w_p)$, the weighted version of the algorithm solves,

$$
\begin{aligned}
\min_{\beta_0, \alpha, w} \quad & \sum_{i=1}^{n} q_i \max\Big\{1 - Y^*_i\Big(\sum_{j=1}^{n} \alpha_j k_w(x_i, x_j) + \beta_0\Big),\, 0\Big\}/n + \lambda_1 \alpha^T k_w \alpha + \lambda_2 \|w\|_1 \\
\text{subject to} \quad & 0 \le w_j < 1 \text{ for all } j = 1, \ldots, p.
\end{aligned} \tag{15}
$$
3. kernel-penalized weighted support vector machines (KP-WSVM), built on the work of Maldonado et al. (2011) for feature selection in nonlinear SVMs, which relies on penalizing each feature's use in the dual formulation,

$$
\begin{aligned}
\min_{\beta_0, \alpha, w} \quad & \sum_{i=1}^{n} q_i \max\Big\{1 - Y^*_i\Big(\sum_{j=1}^{n} \alpha_j k_w(x_i, x_j) + \beta_0\Big),\, 0\Big\}/n + \lambda \sum_{j=1}^{p} \big(1 - e^{-\beta w_j}\big) \\
\text{subject to} \quad & w_j \ge 0 \;\; \forall j = 1, \ldots, p.
\end{aligned} \tag{16}
$$

The approximated $\ell_0$ penalty in the above objective function can also be replaced by an $\ell_1$ norm.
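Both KNIFE and KP-WSVM rely on a feature-weighted kernel $k_w$. A minimal numpy sketch of a feature-weighted RBF kernel (our own illustrative implementation, with made-up inputs) shows how setting a feature's weight to zero removes that feature from the kernel entirely:

```python
import numpy as np

def weighted_rbf(x, z, w, gamma=1.0):
    # RBF kernel with per-feature weights w_j >= 0:
    # k_w(x, z) = exp(-gamma * sum_j w_j * (x_j - z_j)^2)
    return np.exp(-gamma * np.sum(w * (x - z) ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.0, 2.5, -1.0])

w_full = np.array([1.0, 1.0, 1.0])   # all features in
w_drop = np.array([1.0, 1.0, 0.0])   # feature 3 weighted out

k_full = weighted_rbf(x, z, w_full)
k_drop = weighted_rbf(x, z, w_drop)
# k_drop equals the unweighted kernel computed on the first two features only
k_sub = weighted_rbf(x[:2], z[:2], np.ones(2))
```

Sparsity in the weight vector $w$ therefore translates directly into input-space feature selection, which is exactly what the penalties in (15) and (16) encourage.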

## 3 A simulation study

In this section, we investigate the performance of the aforementioned feature selection methods in conjunction with the weighted support vector machines algorithms in minimizing the penalized sum of the weighted classification loss as in (5) in various simulation settings that we describe below.

### 3.1 The simulation setup

We consider data from a 1:1 randomized trial for studying the performance of our proposed strategy. The linear feature-selection methods (riskRFE (L), $\ell_1$ WSVM, SCAD/eSCAD, FSV, $\ell_2$-AROM, $\ell_0$ WSVM) are used in conjunction with the linear-kernel weighted SVM, while the nonlinear marker-selection methods (riskRFE (G), KNIFE, and the KP-WSVMs) are used in conjunction with the Gaussian RBF kernel weighted SVM. For each simulation setting, we restrict ourselves to the case where the treatment/disease burden ratio $\delta_1$ is equal to 0, but consider three different values for the marker cost $\delta_2$. The feature-selection methods considered here can be broadly categorized into two groups: (a) embedded (penalized) methods, which have a separate tuning parameter for penalizing the $\ell_0$ or $\ell_1$ cost, including SCAD, eSCAD, $\ell_1$ WSVM, $\ell_0$ WSVM, FSV, KNIFE, and the KP-WSVMs. For these methods, $\delta_2$, the ratio of the cost of measuring one biomarker to the burden per disease event, is utilized directly in choosing the optimal value of the respective tuning parameter(s). Although $\ell_2$-AROM also belongs to this class, it achieves penalization through an iterative fitting of the $\ell_2$-norm SVM; thus in this case, $\delta_2$ is used for choosing the optimal amount of tuning. (b) Wrappers, which have no separate penalization for the $\ell_0$ or $\ell_1$ cost (both riskRFE (L) and riskRFE (G)) and thus are not tuned on the value of $\delta_2$. The other global SVM parameters, such as the width of the Gaussian kernel in nonlinear SVMs, are tuned based on the prediction performance of the weighted support vector machine in the full model. In this simulation study, we only consider the double-robustness weights of Huang and Fong (2014) for obtaining the weight of each individual.

As comparative methods, we use a linear logistic regression model with LASSO and a weighted SVM without feature selection to find the optimal treatment-selection rule. In the logistic regression, the disease risk is modeled as a function of the main effects of the treatment and the biomarkers and the interaction effects between the treatment and the biomarkers, with the amount of penalization tuned globally through the usual cross-validation. This is also the risk model used for constructing the double-robustness weights.

In each Monte Carlo simulation, we follow a 5-fold cross-validation procedure to identify the optimal tuning parameters in the model for a given method. As mentioned before, we perform this tuning in a two-step procedure: the global SVM parameters are tuned first using the full model in a weighted SVM analysis. The Hilbert-space norm of the estimated function (which for the linear kernel becomes the $\ell_2$ norm) is tuned on a grid of values, as is the width of the Gaussian kernel. Then, for each of the embedded methods, we use cross-validation to tune the additional penalization parameter(s) based on the performance of the method in question, using the selected global SVM parameters from step one. Each of these parameters is tuned on a grid over a viable range of values, and the value attaining the optimal GCV performance is chosen for that method. This two-step procedure is followed to save computation time, and also because the $\ell_2$ norm shrinks the estimated effects rather than controlling sparsity directly.
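The selection criterion itself is easy to state in code. The small numpy sketch below picks, among candidate fitted rules, the one minimizing the held-out weighted 0-1 loss plus the marker-cost penalty, mirroring the penalized criterion (4); the held-out decision values, weights, and candidate models are all made up for illustration:

```python
import numpy as np

W = np.array([2.1, 0.1, -1.9, 0.1, 1.3, -0.4])  # made-up subject weights
delta2 = 0.05                                    # assumed marker-cost ratio

# Hypothetical held-out decision values f(X_i) from three candidate models,
# each using a different number of markers.
candidates = {
    "3 markers": (np.array([0.9, 0.2, -0.5, 0.1, 0.8, -0.3]), 3),
    "2 markers": (np.array([0.8, 0.1, -0.4, 0.2, 0.7, -0.2]), 2),
    "1 marker":  (np.array([0.5, -0.1, 0.3, 0.1, 0.6, -0.1]), 1),
}

def penalized_criterion(f_vals, n_markers):
    # weighted 0-1 loss of Eq. (4) plus the marker-cost penalty
    n = len(W)
    return np.sum(W * (f_vals <= 0)) / n + delta2 * n_markers

scores = {name: penalized_criterion(f, m) for name, (f, m) in candidates.items()}
best = min(scores, key=scores.get)
```

Here the two-marker model wins: it matches the three-marker model's weighted loss while paying a smaller marker-cost penalty, which is precisely the trade-off the augmented criterion is designed to make.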

The optimal treatment-selection rule, based on the optimal set of markers chosen by a given method, is then estimated from the training sample, after retuning the global SVM parameters in the model with the selected markers. To evaluate the performance of the estimated rule, a large independent test set is generated in each simulation run, based on which we estimate the total burden of the rule. We evaluate the performance of each method over 100 Monte Carlo runs. In our results, we present estimates of this quantity averaged over the Monte Carlo runs. We compare the performance of the different methods in the following three settings.
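The test-set evaluation can be sketched as below. The criterion used here, mean disease probability under the rule plus a per-marker measurement cost, is an illustrative stand-in for the paper's extended total burden; the function name and all numbers are hypothetical.

```python
# Schematic evaluation of an estimated treatment-selection rule on a test set
# where the potential disease probabilities are known (as in a simulation).
import numpy as np

def total_cost(rule, p_disease_treated, p_disease_untreated,
               marker_cost, n_markers_used):
    """rule: boolean array, True = treat; p_*: potential disease probabilities."""
    p_under_rule = np.where(rule, p_disease_treated, p_disease_untreated)
    return float(np.mean(p_under_rule)) + marker_cost * n_markers_used

# Toy example: a rule that treats exactly when treatment lowers disease risk.
p_trt = np.array([0.1, 0.4, 0.2, 0.5])
p_untrt = np.array([0.3, 0.2, 0.6, 0.3])
rule = p_trt < p_untrt                  # oracle rule, for illustration only
print(total_cost(rule, p_trt, p_untrt, marker_cost=0.01, n_markers_used=2))
```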

In the first setting, we have in total a list of 27 markers, of which only 2 are significant in explaining treatment-marker interactions. The two significant markers are generated from a multivariate normal distribution, and each of the remaining markers is generated independently. We consider two subcases under this setting: (i) a 'linear' underlying model, with disease prevalence approximately 0.3 and 0.2 among the untreated and treated, respectively; and (ii) a polynomial 'nonlinear' underlying model, with disease prevalence approximately 0.18 and 0.17 among the untreated and treated, respectively.
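A generator with this structure might look as follows. The coefficients, the correlation, and the logistic link below are hypothetical placeholders used only to show the shape of the design; the paper's exact models are not reproduced here.

```python
# Illustrative generator for the first setting: 27 markers, of which only the
# first 2 drive the treatment-marker interaction. All parameter values are
# hypothetical stand-ins.
import numpy as np

def simulate_setting1(n, nonlinear=False, rho=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Two correlated "significant" markers plus 25 independent noise markers.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    X_sig = rng.multivariate_normal(np.zeros(2), cov, size=n)
    X_noise = rng.normal(size=(n, 25))
    X = np.hstack([X_sig, X_noise])
    A = rng.choice([0, 1], size=n)               # 1:1 randomized treatment
    if nonlinear:
        interaction = 0.8 * (X[:, 0] ** 2 - X[:, 1] ** 2)  # polynomial form
    else:
        interaction = 0.8 * X[:, 0] - 0.8 * X[:, 1]        # linear form
    logit = -1.0 - 0.3 * A + A * interaction
    p = 1.0 / (1.0 + np.exp(-logit))
    Y = rng.binomial(1, p)
    return X, A, Y

X, A, Y = simulate_setting1(1000)
print(X.shape, Y.mean())
```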

In the second setting, we have in total a list of 53 markers, of which only 3 are significant in explaining treatment-marker interactions. The three significant markers are generated from a multivariate normal distribution, and each of the remaining markers is generated independently from a normal distribution with mean 0 and standard deviation 1. Again we consider two subcases: (i) a 'linear' underlying model, with disease prevalence approximately 0.22 among both the untreated and treated groups; and (ii) a 'nonlinear' underlying model, with disease prevalence approximately 0.30 and 0.18 among the untreated and treated, respectively.

In the third setting, we consider a complex nonlinear setup, with a total of 27 markers, of which only 2 are significant in explaining treatment-marker interactions. All of the markers are generated from a uniform distribution. The model for the treatment-marker interaction considers a situation where the treatment has a beneficial effect for an individual only if his/her values of the two significant markers lie in a specific region of the covariate space, outside of which the treatment might yield a harmful result. This is achieved by dividing the square spanned by the two significant markers into two regions: (i) a 'harmful' zone given by a smaller concentric square covering one quarter of the area; and (ii) a 'beneficial' zone given by the region between the two concentric squares (Figure 2 plots the two significant marker values for 10,000 such hypothetical individuals). Thus, the probability that the treatment is beneficial for an individual in the population is 0.75. The probability of disease for an individual in the beneficial zone is 0 if the individual receives treatment, but 0.6 otherwise; these probabilities are reversed for an individual in the harmful zone. The global probability of disease in the population is thus fixed at 0.3. The disease prevalences are approximately 0.53 and 0.17 among the untreated and treated, respectively.
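The region-based mechanism can be sketched as below. The concrete square bounds (the outer square [-1, 1]^2 with an inner square [-0.5, 0.5]^2, i.e. one quarter of the area) are assumed for illustration and may differ from the paper's; the disease probabilities (0 and 0.6) and the 1:1 randomization follow the description above.

```python
# Sketch of the third setting: a harmful inner square covering a quarter of
# the marker space, a beneficial region outside it, and reversed treatment
# effects in the two zones. Square bounds are assumed, not from the paper.
import numpy as np

def simulate_setting3(n, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 27))       # all 27 markers uniform
    in_harmful = (np.abs(X[:, 0]) <= 0.5) & (np.abs(X[:, 1]) <= 0.5)
    A = rng.choice([0, 1], size=n)                 # 1:1 randomized treatment
    # Beneficial zone: disease prob 0 if treated, 0.6 if untreated;
    # probabilities reversed in the harmful zone.
    p = np.where(in_harmful,
                 np.where(A == 1, 0.6, 0.0),
                 np.where(A == 1, 0.0, 0.6))
    Y = rng.binomial(1, p)
    return X, A, Y, in_harmful

X, A, Y, in_harmful = simulate_setting3(10000)
print("P(beneficial) ~", 1 - in_harmful.mean())   # about 0.75 by construction
print("P(disease)    ~", Y.mean())                # about 0.30 by construction
```

With these bounds, the overall disease probability works out to 0.5(0.75)(0.6) + 0.5(0.25)(0.6) = 0.3, matching the stated global probability.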

### 3.2 Results

The results for the simulation exercise are summarized in Tables 1-5. We give an overview of these results below, while a more detailed discussion of each of these settings is provided in the Supplementary Materials (see Web Appendix LABEL:sec:appC).

1. While most biomarker-based treatment-selection rules reduce the total cost relative to the better of the two default strategies (treating everyone or treating no one), methods that allow for feature selection lead to substantial further improvement over the weighted SVM method without feature selection.

2. Embedded feature selection methods that are tuned by the marker measurement cost generally show a decreasing trend in sensitivity and an increasing trend in specificity as the marker cost increases, which is not the case for wrapper methods (such as riskRFE) or LASSO, which do not rely on the cost for tuning. As a result, for the latter type of methods, the total cost necessarily increases with the cost of measuring a marker; in contrast, for methods that incorporate the marker cost in tuning, the total cost can often decrease as the marker cost increases, especially when the improvement in specificity for a given method is more substantial than the decline in its sensitivity.

3. In the presence of a linear trend, the best-performing methods are typically feature selection methods with a linear kernel, although feature selection methods with a nonlinear kernel can perform comparably; in the presence of nonlinear trends, feature selection methods with a nonlinear kernel can yield substantial improvements over linear methods.

4. The relative performance of the various methods varies across settings. In general, riskRFE performs very well when the cost of marker measurement is low. Beyond that, WSVM, SCAD, and AROM are the best-performing linear methods, especially when the cost of marker measurement is high, while KP-WSVM stands out among the nonlinear feature selection methods, with robust performance across settings.
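The trade-off in point 2 can be illustrated numerically. All numbers below are hypothetical: the total cost is decomposed as disease burden plus marker cost times the number of markers measured, so a rule that ignores the cost pays linearly for every marker it keeps, while a cost-tuned rule can drop weak markers (accepting a slightly higher burden) as the per-marker cost rises.

```python
# Hypothetical illustration of point 2: total cost = disease burden +
# marker cost x number of markers measured.
def total_cost(burden, marker_cost, n_markers):
    return burden + marker_cost * n_markers

# A fixed rule keeps all 10 markers regardless of cost; a cost-tuned rule
# drops weak markers as the per-marker cost c rises.
for c in (0.005, 0.02, 0.05):
    fixed = total_cost(burden=0.20, marker_cost=c, n_markers=10)
    if c < 0.01:
        tuned = total_cost(burden=0.20, marker_cost=c, n_markers=10)
    elif c < 0.04:
        tuned = total_cost(burden=0.22, marker_cost=c, n_markers=3)
    else:
        tuned = total_cost(burden=0.25, marker_cost=c, n_markers=2)
    print(f"c={c}: fixed={fixed:.3f}, tuned={tuned:.3f}")
```

At the highest cost here, the fixed rule's total cost is roughly double the tuned rule's, mirroring the divergence described above.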

We also evaluate these algorithms in a few additional settings. We compare their performance against that of the Decision List method of Zhang et al. (2015). We also consider a setting where the true optimal decision rule is not sparse, but the effects of some biomarkers on the optimal treatment decision are so small that, given the cost consideration, a sparser decision rule is more optimal in terms of our cost-based criterion. Due to space constraints, these additional settings are discussed in detail in the Supplementary Materials (Web Appendix LABEL:sec:appD), and the results are provided in Supplementary Tables LABEL:tab:set4_decision, LABEL:tab:set4_linear and LABEL:tab:set5_linear.