# Augmented Outcome-weighted Learning for Optimal Treatment Regimes

Precision medicine is of considerable interest to clinical, academic, and regulatory communities. The key to precision medicine is the optimal treatment regime. Recently, Zhou et al. (2017) developed residual weighted learning (RWL) to construct optimal regimes that directly optimize the clinical outcome. However, RWL involves computationally intensive non-convex optimization, which cannot guarantee a global solution, and it is not fully semiparametrically efficient. In this article, we propose augmented outcome-weighted learning (AOL). The method is built on a doubly robust augmented inverse probability weighted estimator, and hence constructs semiparametrically efficient regimes. Our proposed AOL is closely related to RWL. The weights are obtained from counterfactual residuals; negative residuals are reflected to positive values and, accordingly, the corresponding treatment assignments are switched to their opposites. Convex loss functions can thus be applied, guaranteeing a global solution and reducing computation. We show that AOL is universally consistent, i.e., the estimated regime of AOL converges to the Bayes regime as the sample size approaches infinity, without requiring any specifics of the distribution of the data. We also propose variable selection methods for linear and nonlinear regimes, respectively, to further improve performance. The performance of the proposed AOL methods is illustrated in simulation studies and in an analysis of the Nefazodone-CBASP clinical trial data.


## 1 Introduction

Most medical treatments are designed for the “average patient”. Such a “one-size-fits-all” approach is successful for some patients but not always for others. Precision medicine, also known as personalized medicine, is an innovative approach to disease prevention and treatment that takes into account individual variability in clinical information, genes, environments and lifestyles. Currently, precision medicine is of considerable interest to clinical, academic, and regulatory communities. There are already several FDA-approved treatments that are tailored to specific characteristics of individuals. For example, ceritinib, a recently FDA-approved drug for the treatment of lung cancer, is highly active in patients with advanced, ALK-rearranged non-small-cell lung cancer (Shaw et al. 2014).

The key to precision medicine is the optimal treatment regime. Let $X$ be a patient's clinical covariates, $A \in \{-1, +1\}$ be the treatment assignment, and $R$ be the observed clinical outcome. Assume without loss of generality that larger values of $R$ are more desirable. A treatment regime $d$ is a function from the space of covariates to $\{-1, +1\}$. An optimal treatment regime is a regime that maximizes the expected outcome under this regime. Assuming that the data generating mechanism is known, the optimal treatment regime is related to the contrast

$$\delta(x) = \mu_{+1}(x) - \mu_{-1}(x),$$

where $\mu_{+1}(x) = E(R \mid X = x, A = +1)$ and $\mu_{-1}(x) = E(R \mid X = x, A = -1)$. The Bayes optimal regime is $d^*(x) = +1$ if $\delta(x) > 0$ and $d^*(x) = -1$ otherwise.

Most published optimal treatment strategies estimate the contrast by modelling either the conditional mean outcomes or the contrast directly, based on data from randomized clinical trials or observational studies (see Moodie et al. (2014); Murphy (2003); Robins (2004); Taylor et al. (2015) and references therein). They obtain treatment regimes indirectly by inverting the regression estimates; they are regression-based approaches for treatment regimes. For instance, Qian and Murphy (2011) proposed a two-step procedure that first estimates a conditional mean for the outcome and then determines the treatment regime by comparing conditional mean outcomes across various treatments. The success of these regression-based approaches depends on correct model specification and on the high precision of the model estimates. However, in practice, heterogeneity in the population complicates estimation of the regression model.

Alternatively, Zhao et al. (2012) proposed a classification-based approach, called outcome weighted learning (OWL), which utilizes weighted support vector machines (Vapnik 1995) to estimate the optimal treatment regime directly. Zhang et al. (2012a) also proposed a general framework for applying classification methods to the optimal treatment regime problem.

Indeed, the classification-based approaches follow Vapnik’s main principle (Vapnik 1995): “When solving a given problem, try to avoid solving a more general problem as an intermediate step.” As in Figure 1, the aim of optimal treatment regimes is to estimate the decision boundary $\{x : \delta(x) = 0\}$. Regression-based approaches find the decision boundary by solving a more general problem: estimating $\delta(x)$ for every $x$. For the optimal treatment regime, it is sufficient to find an accurate estimate of $\delta(x)$ only near its zeros. In general, finding the optimal regime is an easier problem than estimating the regression function. Classification-based approaches, which seek the decision boundary directly, provide a flexible framework from a different perspective.

Recently, Zhou et al. (2017) proposed residual weighted learning (RWL), which uses the residual from a regression fit of the outcome as the pseudo-outcome, to improve the finite sample performance of OWL. However, this method involves a non-convex loss function and presents numerous computational challenges, which hinders its practical use. For non-convex optimization, a global solution is not guaranteed, and the computation is generally intensive. Athey and Wager (2017) also pointed out that RWL is not fully semiparametrically efficient. In this article, we propose augmented outcome-weighted learning (AOL). The method is built on a doubly robust augmented inverse probability weighted estimator (AIPWE), and hence constructs semiparametrically efficient regimes. Although this article focuses on randomized clinical trials, the double robustness is particularly useful for observational studies. Our proposed AOL is closely related to RWL. The weights are obtained from counterfactual residuals; negative residuals are reflected to positive values and, accordingly, the corresponding treatment assignments are switched to their opposites. Convex loss functions are thus applied to reduce computation. AOL inherits almost all desirable properties of RWL. Similar to RWL, AOL is universally consistent, i.e., the estimated regime of AOL converges to the Bayes regime as the sample size approaches infinity, without requiring any specifics of the distribution of the data. The finite sample performance of AOL is demonstrated in numerical simulations.

The remainder of the article is organized as follows. In Section 2.1, we review outcome weighted learning and residual weighted learning. In Section 2.2 and 2.3, we propose augmented outcome-weighted learning. We discover the connection between augmented outcome-weighted learning and residual weighted learning in Section 2.4. We establish universal consistency for the proposed AOL in Section 2.5. The variable selection techniques for AOL are discussed in Section 2.6. We present simulation studies to evaluate finite sample performance of the proposed methods in Section 3. The method is then illustrated on the Nefazodone-CBASP clinical trial in Section 4. We conclude the article with a discussion in Section 5. All the technical proofs are provided in Appendix.

## 2 Method

### 2.1 Review of outcome weighted learning and residual weighted learning

In this article, random variables are denoted by uppercase letters, while their realizations are denoted by lowercase letters. Consider a two-arm randomized trial. Let $\pi(a, x) = \Pr(A = a \mid X = x)$ be the probability of being assigned treatment $a$ for patients with clinical covariates $x$. It is predefined in the trial design. We assume $\pi(a, x) > 0$ for all $a$ and $x$.

We use the potential outcomes framework (Rubin 1974) to precisely define the optimal treatment regime. Let $R^*(+1)$ and $R^*(-1)$ denote the potential outcomes that would be observed had a subject received treatment $+1$ or $-1$, respectively. There are two assumptions in the framework. The actually observed outcome and the potential outcomes are connected by the consistency assumption, i.e., $R = R^*(A)$. We further assume that, conditional on covariates $X$, the potential outcomes are independent of $A$, the treatment actually received. This is the assumption of no unmeasured confounders (NUC). This assumption holds automatically in a randomized clinical trial.

For an arbitrary treatment regime $d$, we can thus define its potential outcome $R^*(d) = R^*(+1)\, I(d(X) = +1) + R^*(-1)\, I(d(X) = -1)$, where $I(\cdot)$ is the indicator function. It would be the observed outcome if a subject from the population were assigned treatment according to regime $d$. The expected potential outcome under any regime $d$, defined as $V(d) = E(R^*(d))$, is called the value function associated with regime $d$. Thus, an optimal regime $d^*$ is a regime that maximizes $V(d)$. Let $m(x, d) = E(R \mid X = x, A = d(x))$. Under the consistency and NUC assumptions, it is straightforward to show that

$$V(d) = E\{m(X, d)\} = E\left\{\frac{R}{\pi(A, X)}\, I(A = d(X))\right\}. \tag{1}$$

Thus finding $d^*$ is equivalent to the following minimization problem:

$$d^* \in \operatorname*{argmin}_d\, E\left\{\frac{R}{\pi(A, X)}\, I(A \neq d(X))\right\}. \tag{2}$$

Zhao et al. (2012) viewed this as a weighted classification problem, and proposed outcome weighted learning (OWL) to apply statistical learning techniques to optimal treatment regimes. However, as discussed in Zhou et al. (2017), this method is not perfect. First, the estimated regime of OWL is affected by a simple shift of the outcome $R$, so estimates from OWL are unstable, especially when the sample size is small. Second, since OWL needs the outcome to be nonnegative to gain computational efficiency from convex programming, OWL behaves similarly to weighted classification in reducing the difference between the estimated and the actually received treatment assignments. Thus the regime estimated by OWL tends to retain the treatments that subjects actually received. This behavior is not ideal for data from a randomized clinical trial, since treatments are randomly assigned to patients.

To alleviate these problems, Zhou et al. (2017) proposed residual weighted learning (RWL), in which the misclassification errors are weighted by residuals of the outcome from a regression fit on clinical covariates $X$. The residuals are calculated as

$$R_g = R - g(X).$$

Zhou et al. (2017) used $g(x) = E(R/(2\pi(A, X)) \mid X = x)$ as a choice of $g$. Unlike OWL in (2), RWL targets the following optimization problem,

$$d^* \in \operatorname*{argmin}_d\, E\left\{\frac{R_g}{\pi(A, X)}\, I(A \neq d(X))\right\}.$$

Suppose that the realization data $\{(x_i, a_i, r_i)\}_{i=1}^n$ are collected independently. For any decision function $f$, let $d(x) = \operatorname{sign}(f(x))$ be the associated regime. RWL aims to minimize the following regularized empirical risk,

$$\frac{1}{n} \sum_{i=1}^n \frac{r_{g,i}}{\pi(a_i, x_i)}\, T(a_i f(x_i)) + \lambda \|f\|^2, \tag{3}$$

where $r_{g,i} = r_i - g(x_i)$, $T(\cdot)$ is a continuous surrogate loss function, $\|\cdot\|$ is some norm for $f$, and $\lambda$ is a tuning parameter.

Since some residuals are negative, convex surrogate loss functions are not appropriate in (3). Zhou et al. (2017) considered a non-convex loss, the smoothed ramp loss function. However, the non-convexity presents significant challenges for solving the optimization problem (3). Unlike convex functions, non-convex functions may possess local optima that are not global optima, and most efficient optimization algorithms, such as gradient descent and coordinate descent, are only guaranteed to converge to a local optimum. The theoretical properties of RWL are established for the global optimum. Although Zhou et al. (2017) applied a difference of convex (d.c.) algorithm that addresses the non-convex optimization problem by solving a sequence of convex subproblems, increasing the likelihood of reaching a global minimum, a global optimum is not guaranteed (Sriperumbudur and Lanckriet 2009), and the d.c. algorithm is still computationally intensive. In addition, RWL can be connected with the AIPWE, as discussed in Zhou et al. (2017), but it is not fully semiparametrically efficient (Athey and Wager 2017).

### 2.2 Augmented outcome-weighted learning (AOL)

Let us come back to equation (1). The first equality is the foundation of regression-based approaches, while the second inspired outcome weighted learning (Zhao et al. 2012). Zhang et al. (2012a) combined these two perspectives through a doubly robust augmented inverse probability weighted estimator (Bang and Robins 2005, AIPWE) of the value function.

Recall that $\mu_{+1}(x) = E(R \mid X = x, A = +1)$, $\mu_{-1}(x) = E(R \mid X = x, A = -1)$, and $m(x, d) = E(R \mid X = x, A = d(x))$. Following Zhang et al. (2012b), we start from the doubly robust AIPWE:

$$\mathrm{AIPWE}(d) = \frac{1}{n} \sum_{i=1}^n \left\{\frac{r_i - \hat m(x_i, d)}{\pi(a_i, x_i)}\, I(a_i = d(x_i)) + \hat m(x_i, d)\right\},$$

where $\hat m(x, d)$ is an estimator of $m(x, d)$. The AIPWE is an estimator of

$$V(d) = E\left\{\frac{R - m(X, d)}{\pi(A, X)}\, I(A = d(X)) + m(X, d)\right\}.$$

For an observational study, we are also required to estimate $\pi(a, x)$ from the data. $\mathrm{AIPWE}(d)$ is a consistent estimator of $V(d)$ if either $\hat m(x, d)$ or the estimator of $\pi(a, x)$ is correctly specified. This is the so-called double robustness. In a randomized clinical trial $\pi(a, x)$ is known, hence even if $\hat m(x, d)$ is inconsistent, $\mathrm{AIPWE}(d)$ is still consistent.

Noting that

$$\frac{R - m(X, d)}{\pi(A, X)}\, I(A = d(X)) + m(X, d) = \frac{R - \tilde g(X)}{\pi(A, X)}\, I(A = d(X)) + \mu_{-A}(X),$$

where

$$\tilde g(x) := \pi(-1, x)\, \mu_{+1}(x) + \pi(+1, x)\, \mu_{-1}(x), \tag{4}$$

maximizing $\mathrm{AIPWE}(d)$ is asymptotically equivalent to the following minimization problem,

$$\operatorname*{argmin}_d\, E\left\{\frac{R - \tilde g(X)}{\pi(A, X)}\, I(A \neq d(X))\right\}. \tag{5}$$

Let $\tilde R = R - \tilde g(X)$. As explained later in Section 2.4, $\tilde R$ is a form of residual. At this point, we may apply a similar non-convex surrogate loss in the regularization framework as RWL in (3). However, it still suffers from local optima and intensive computation.

To seek the optimal regime, we apply a finding in Liu et al. (2016) to take advantage of efficient convex optimization. Note that

$$E\left\{\frac{|\tilde R|}{\pi(A, X)}\, I\big(A \cdot \operatorname{sign}(\tilde R) \neq d(X)\big)\right\} = E\left\{\frac{\tilde R}{\pi(A, X)}\, I(A \neq d(X))\right\} + E\left\{\frac{\tilde R^-}{\pi(A, X)}\right\},$$

where $\tilde R^- = \max(-\tilde R, 0)$. Therefore finding $d^*$ in (5) is equivalent to the following optimization problem,

$$d^* \in \operatorname*{argmin}_d\, E\left\{\frac{|\tilde R|}{\pi(A, X)}\, I\big(A \cdot \operatorname{sign}(\tilde R) \neq d(X)\big)\right\},$$

where negative weights are reflected to positive, and accordingly their treatment assignments are switched to opposites.
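The reflection step above is purely mechanical, so it can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical names (`reflect` and its arguments are not from the paper): given pseudo-outcomes $\tilde r_i$, treatments $a_i$, and propensities $\pi(a_i, x_i)$, it produces the nonnegative weights $|\tilde r_i|/\pi(a_i, x_i)$ and the relabeled treatments $a_i \cdot \operatorname{sign}(\tilde r_i)$.

```python
import numpy as np

def reflect(r_tilde, a, pi):
    """Reflection trick: weight each subject by |r~|/pi and, when the
    pseudo-outcome r~ is negative, switch the treatment label to its opposite."""
    r_tilde = np.asarray(r_tilde, dtype=float)
    a = np.asarray(a)
    weights = np.abs(r_tilde) / np.asarray(pi, dtype=float)
    labels = np.where(r_tilde >= 0, a, -a)  # a * sign(r~)
    return weights, labels

# Two subjects in arm +1 with pi = 0.5: the second has a negative residual,
# so its classification label flips to -1 while its weight becomes positive.
w, lab = reflect([2.0, -1.5], [1, 1], [0.5, 0.5])
```

With both weights nonnegative, any off-the-shelf weighted binary classifier with a convex loss can be applied to `(labels, weights)`.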

Similar to OWL and RWL, we seek the decision function by minimizing a regularized surrogate risk,

$$\min_f\, \frac{1}{n} \sum_{i=1}^n \frac{|\tilde r_i|}{\pi(a_i, x_i)}\, \phi\big(a_i \cdot \operatorname{sign}(\tilde r_i)\, f(x_i)\big) + \lambda \|f\|^2, \tag{6}$$

where $\tilde r_i = r_i - \tilde g(x_i)$, $\phi$ is a continuous surrogate loss function, $\|\cdot\|$ is some norm for $f$, and $\lambda$ is a tuning parameter controlling the trade-off between the empirical risk and the complexity of the decision function $f$. This method is called augmented outcome-weighted learning (AOL) in this article, since the weights are derived from augmented outcomes.

As the weights are all nonnegative, convex surrogates can be employed for efficient computation. In this article, we apply the Huberized hinge loss function (Wang et al. 2008),

$$\phi(u) = \begin{cases} 0 & \text{if } u \geq 1, \\ \frac{1}{4}(1 - u)^2 & \text{if } -1 \leq u < 1, \\ -u & \text{if } u < -1. \end{cases} \tag{7}$$

Other convex loss functions, such as the hinge loss, can also be applied in AOL. Although the Huberized hinge loss has a similar shape to the hinge loss, it is smooth everywhere, which gives it computational advantages in optimization.
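The piecewise definition in (7) translates directly into code. The following sketch (the function name is ours, not the paper's) evaluates the Huberized hinge loss; the quadratic middle piece matches the linear piece in value and slope at $u = -1$ and flattens to zero at $u = 1$, which is what makes the loss smooth everywhere.

```python
import numpy as np

def huberized_hinge(u):
    """Huberized hinge loss of eq. (7):
    0 for u >= 1; (1-u)^2/4 for -1 <= u < 1; -u for u < -1."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1, 0.0,
           np.where(u >= -1, 0.25 * (1.0 - u) ** 2, -u))
```

For example, at the knot $u = -1$ both the quadratic piece $(1-(-1))^2/4 = 1$ and the linear piece $-(-1) = 1$ agree, confirming continuity.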

### 2.3 Implementation of AOL

We derive an algorithm for the linear AOL in Section 2.3.1, and then generalize it to the case of nonlinear learning through kernel mapping in Section 2.3.2. Both algorithms solve convex optimization problems, and global solutions are guaranteed.

#### 2.3.1 Linear Decision Rule for AOL

Consider a linear decision function $f(x) = w^T x + b$. The associated regime assigns a subject with clinical covariates $x$ to treatment $+1$ if $w^T x + b > 0$ and to treatment $-1$ otherwise. In (6), we define $\|f\|$ as the Euclidean norm of $w$. Then the minimization problem (6) can be rewritten as

$$\min_{w, b}\, \frac{1}{n} \sum_{i=1}^n \frac{|\tilde r_i|}{\pi(a_i, x_i)}\, \phi\big(a_i \cdot \operatorname{sign}(\tilde r_i)(w^T x_i + b)\big) + \frac{\lambda}{2} w^T w. \tag{8}$$

There are many efficient numerical methods for solving this smooth unconstrained convex optimization problem. One example is the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Nocedal 1980), a quasi-Newton method that approximates the BFGS algorithm using a limited amount of computer memory. When we obtain the solution $(\hat w, \hat b)$, the decision function is $\hat f(x) = \hat w^T x + \hat b$.
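Problem (8) can be prototyped in a few lines with SciPy's L-BFGS routine. The sketch below is illustrative rather than the authors' implementation; `linear_aol` and its arguments are hypothetical names, and gradients are left to finite differences for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def huberized_hinge(u):
    """Huberized hinge loss of eq. (7)."""
    return np.where(u >= 1, 0.0, np.where(u >= -1, 0.25 * (1 - u) ** 2, -u))

def linear_aol(x, a, r_tilde, pi, lam=0.1):
    """Solve (8): weighted surrogate risk with Huberized hinge + ridge penalty.
    x: (n, p) covariates; a: treatments in {-1, +1}; r_tilde: pseudo-outcomes;
    pi: propensities pi(a_i, x_i); lam: regularization parameter."""
    n, p = x.shape
    weights = np.abs(r_tilde) / pi          # nonnegative weights |r~|/pi
    labels = a * np.sign(r_tilde)           # reflected treatment labels
    def objective(theta):
        w, b = theta[:p], theta[p]
        margins = labels * (x @ w + b)
        return np.mean(weights * huberized_hinge(margins)) + 0.5 * lam * (w @ w)
    res = minimize(objective, np.zeros(p + 1), method="L-BFGS-B")
    return res.x[:p], res.x[p]              # (w_hat, b_hat)
```

On synthetic data where the pseudo-outcome is $\tilde r = a \cdot x_1$ (so the reflected label is $\operatorname{sign}(x_1)$), the fitted rule should put positive weight on the first covariate.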

#### 2.3.2 Nonlinear Decision rule for AOL

The nonlinear decision function can be represented by $f(x) = h(x) + b$ with $h \in \mathcal H_K$ and $b \in \mathbb R$, where $\mathcal H_K$ is a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel function $K(\cdot, \cdot)$. The kernel function is a positive definite function mapping from $\mathcal X \times \mathcal X$ to $\mathbb R$. The norm in $\mathcal H_K$, denoted by $\|\cdot\|_K$, is induced by the following inner product,

$$\left\langle \sum_{i=1}^n \alpha_i K(x_i, \cdot),\ \sum_{j=1}^m \beta_j K(x_j, \cdot) \right\rangle_K = \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j K(x_i, x_j),$$

for $\alpha_i, \beta_j \in \mathbb R$. The most widely used nonlinear kernel in practice is the Gaussian radial basis function (RBF) kernel,

$$K_\sigma(x, z) = \exp\big(-\sigma^2 \|x - z\|^2\big),$$

where $\sigma$ is a free parameter whose inverse is called the width of the kernel.

Then minimizing (6) can be rewritten as

$$\min_{h, b}\, \frac{1}{n} \sum_{i=1}^n \frac{|\tilde r_i|}{\pi(a_i, x_i)}\, \phi\big(a_i \cdot \operatorname{sign}(\tilde r_i)(h(x_i) + b)\big) + \frac{\lambda}{2} \|h\|_K^2. \tag{9}$$

Due to the representer theorem (Kimeldorf and Wahba 1971), the nonlinear problem can be reduced to finding finite-dimensional coefficients $v = (v_1, \dots, v_n)^T$, and $h$ can be represented as $h(x) = \sum_{j=1}^n v_j K(x, x_j)$. So the problem (9) becomes

$$\min_{v, b}\, \frac{1}{n} \sum_{i=1}^n \frac{|\tilde r_i|}{\pi(a_i, x_i)}\, \phi\Big(a_i \cdot \operatorname{sign}(\tilde r_i)\Big(\sum_{j=1}^n v_j K(x_i, x_j) + b\Big)\Big) + \frac{\lambda}{2} \sum_{i,j=1}^n v_i v_j K(x_i, x_j). \tag{10}$$

Again, it is a smooth unconstrained convex optimization problem. We apply the L-BFGS algorithm to solve (10). When we obtain the solution $(\hat v, \hat b)$, the decision function is $\hat f(x) = \sum_{j=1}^n \hat v_j K(x, x_j) + \hat b$.
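Problem (10) differs from the linear case only in that the design is the Gram matrix and the penalty is $v^T K v$. The following sketch (our own illustrative code, with hypothetical names) builds the Gaussian RBF Gram matrix and solves (10) with L-BFGS.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_gram(x, z, sigma=1.0):
    """Gaussian RBF kernel K_sigma(x, z) = exp(-sigma^2 ||x - z||^2), all pairs."""
    d2 = ((x[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.exp(-(sigma ** 2) * d2)

def kernel_aol(x, a, r_tilde, pi, lam=0.1, sigma=1.0):
    """Solve (10): coefficients v and intercept b of f(x) = sum_j v_j K(x, x_j) + b."""
    n = len(a)
    K = rbf_gram(x, x, sigma)
    weights = np.abs(r_tilde) / pi
    labels = a * np.sign(r_tilde)
    def objective(theta):
        v, b = theta[:n], theta[n]
        margins = labels * (K @ v + b)
        phi = np.where(margins >= 1, 0.0,
              np.where(margins >= -1, 0.25 * (1 - margins) ** 2, -margins))
        return np.mean(weights * phi) + 0.5 * lam * (v @ K @ v)
    res = minimize(objective, np.zeros(n + 1), method="L-BFGS-B")
    return res.x[:n], res.x[n]              # (v_hat, b_hat)
```

Note that for large $n$ the $n+1$ free parameters make a generic quasi-Newton solver slow; this sketch is meant only to make the structure of (10) concrete.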

### 2.4 Connection to residual weighted learning

Note that $\tilde g(x)$ in (4) is a weighted average of $\mu_{+1}(x)$ and $\mu_{-1}(x)$. Hence $\tilde R = R - \tilde g(X)$ is a form of residual. The use of residuals in optimal treatment regimes is justified in Zhou et al. (2017) as follows: for any measurable function $g$,

$$E\left\{\frac{R - g(X)}{\pi(A, X)}\, I(A \neq d(X))\right\} = E\left\{\frac{R}{\pi(A, X)} - g(X)\right\} - V(d).$$

For residual weighted learning in Zhou et al. (2017), the corresponding $g$ is

$$g_1(x) = E\left(\frac{R}{2\pi(A, X)} \,\Big|\, X = x\right) = \frac{1}{2}\mu_{+1}(x) + \frac{1}{2}\mu_{-1}(x). \tag{11}$$

Similarly, Liu et al. (2016) applied unweighted regression to calculate residuals, where the corresponding $g$ is

$$g_2(x) = E(R \mid X = x) = \pi(+1, x)\mu_{+1}(x) + \pi(-1, x)\mu_{-1}(x). \tag{12}$$

It is interesting to understand the implication of $\tilde g$ in (4). Under the consistency and NUC assumptions, we can check that

$$E(R^*(-A) \mid X = x) = \pi(-1, x)\mu_{+1}(x) + \pi(+1, x)\mu_{-1}(x) = \tilde g(x).$$

$\tilde g(x)$ is the expected outcome for subjects with covariates $x$ had they received the opposite treatments to the ones that they actually received. $R^*(-A)$ is counterfactual, and cannot be observed. $\tilde g(x)$ can be estimated by $\pi(-1, x)\hat\mu_{+1}(x) + \pi(+1, x)\hat\mu_{-1}(x)$, where $\hat\mu_{+1}$ and $\hat\mu_{-1}$ are estimates of $\mu_{+1}$ and $\mu_{-1}$, respectively. Noting that

$$\tilde g(x) = E\left(\frac{\pi(-A, X)}{\pi(A, X)}\, R \,\Big|\, X = x\right) = \pi(-1, x)\mu_{+1}(x) + \pi(+1, x)\mu_{-1}(x), \tag{13}$$

$\tilde g$ can also be estimated by weighted regression directly, where the weights are $\pi(-A, X)/\pi(A, X)$. In a randomized clinical trial with the usual equal allocation ratio 1:1, $g_1$, $g_2$ and $\tilde g$ coincide. If the allocation ratio is unequal, they differ. Compared with the regression weights $1/(2\pi(A, X))$ of $g_1$ in (11) and the unit weights of $g_2$ in (12), $\tilde g$ in (13) utilizes a more extreme set of weights. For example, in a randomized clinical trial with allocation ratio 3:1, i.e., the number of subjects in arm $+1$ is three times that in arm $-1$, the weights in (12) for the two arms are both 1 (unweighted), the weights in (11) are 2/3 and 2, and the weights in (13) are 1/3 and 3.
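The weighted-regression route to $\tilde g$ in (13) can be sketched as follows. This is illustrative code with hypothetical names; it assumes, for simplicity, that the propensity $\pi(+1, x)$ is a known constant, as in a randomized trial, and uses a plain weighted least-squares fit of $R$ on $(1, X)$.

```python
import numpy as np

def gtilde_weights(a, pi_plus):
    """Regression weights pi(-A, X)/pi(A, X) from (13).
    a: treatments in {-1, +1}; pi_plus: known probability of assignment to arm +1."""
    pi_a = np.where(a == 1, pi_plus, 1.0 - pi_plus)
    return (1.0 - pi_a) / pi_a

def weighted_linear_fit(x, r, w):
    """Weighted least squares of r on (1, x): min_beta sum_i w_i (r_i - [1, x_i] beta)^2,
    solved by rescaling rows with sqrt(w_i)."""
    xmat = np.column_stack([np.ones(len(r)), x])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * xmat, sw * r, rcond=None)
    return beta
```

With a 3:1 allocation ($\pi(+1, x) = 3/4$), `gtilde_weights` reproduces the weights 1/3 (arm $+1$) and 3 (arm $-1$) discussed above.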

Our proposed AOL is closely related to RWL, since, as just discussed, AOL uses counterfactual residuals. AOL possesses almost all the desirable properties of RWL. First, by using residuals, AOL stabilizes the variability introduced by the original outcome. Second, to minimize the empirical risk in (6), for subjects with positive residuals AOL tends to recommend the same treatment assignments that subjects actually received, while for subjects with negative residuals AOL is apt to recommend the opposite treatment assignments to what they received. Third, AOL is location-scale invariant with respect to the original outcomes. Specifically, the estimated regime from AOL is invariant to a shift of the outcome; it is invariant to scaling of the outcome by a positive number; and the regime from AOL that maximizes the outcome is the opposite of the one that minimizes the outcome. These properties are intuitively sensible. The only nice property of RWL that is not inherited by AOL is robustness to outliers, because of the unbounded convex loss in AOL. However, we may apply an appropriate method or model for estimating residuals to reduce the probability of outliers.

### 2.5 Theoretical properties

In this section, we establish theoretical properties for AOL. Recall that for any treatment regime $d$, the value function is defined as

$$V(d) = E\left\{\frac{R}{\pi(A, X)}\, I(A = d(X))\right\}.$$

Similarly, we define the risk function of a treatment regime $d$ as

$$\mathcal R(d) = E\left\{\frac{R}{\pi(A, X)}\, I(A \neq d(X))\right\}.$$

The regime that minimizes the risk is the Bayes regime $d^*$, and the corresponding risk $\mathcal R^* = \mathcal R(d^*)$ is the Bayes risk. Recall that the Bayes regime is $d^*(x) = +1$ if $\delta(x) > 0$ and $d^*(x) = -1$ otherwise.

Let $\phi(u)$, $u \in \mathbb R$, be a convex function. In this section, we investigate a general result, and do not limit $\phi$ to the Huberized hinge loss. A few popular convex surrogate examples are listed as follows:

• Hinge loss: $\phi(u) = (1 - u)_+$, where $(x)_+ = \max(x, 0)$,

• Squared hinge loss: $\phi(u) = \{(1 - u)_+\}^2$,

• Least squares loss: $\phi(u) = (1 - u)^2$,

• Huberized hinge loss as shown in (7),

• Logistic loss: $\phi(u) = \log(1 + e^{-u})$,

• Distance weighted discrimination (DWD) loss:

$$\phi(u) = \begin{cases} \dfrac{1}{u} & \text{if } u \geq 1, \\ 2 - u & \text{if } u < 1, \end{cases}$$

• Exponential loss: $\phi(u) = e^{-u}$.

The hinge loss and squared hinge loss are widely used in support vector machines (Vapnik 1995). The least squares loss is applied in regularization networks (Evgeniou et al. 2000). The loss function in logistic regression is just the logistic loss. The DWD loss is the loss function in distance-weighted discrimination (Marron et al. 2007). The exponential loss is used in AdaBoost (Freund and Schapire 1997).
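For reference, the surrogate losses listed above can be collected into one small Python dictionary (an illustrative sketch, not from the paper); each entry maps a margin $u$ to $\phi(u)$ exactly as defined in the list.

```python
import numpy as np

# Convex surrogate losses phi(u); all are nonincreasing with phi'(0) < 0.
losses = {
    "hinge":         lambda u: np.maximum(1.0 - u, 0.0),
    "squared_hinge": lambda u: np.maximum(1.0 - u, 0.0) ** 2,
    "least_squares": lambda u: (1.0 - u) ** 2,
    "logistic":      lambda u: np.log1p(np.exp(-u)),
    # DWD: 1/u for u >= 1, 2 - u otherwise (guard the unused 1/u branch at u <= 0)
    "dwd":           lambda u: np.where(u >= 1, 1.0 / np.maximum(u, 1.0), 2.0 - u),
    "exponential":   lambda u: np.exp(-u),
}
```

For instance, at the margin $u = 0$ the hinge, squared hinge, least squares and exponential losses all equal 1, the logistic loss equals $\log 2$, and the DWD loss equals 2.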

For any measurable function $g$, recall that $R_g = R - g(X)$. In this section, we do not require $g$ to be a regression fit of $R$; $g$ can be any arbitrary function. For a decision function $f$, we proceed to define a surrogate $\phi$-risk function:

$$\mathcal R_{\phi, g}(f) = E\left\{\frac{|R_g|}{\pi(A, X)}\, \phi\big(A \cdot \operatorname{sign}(R_g)\, f(X)\big)\right\}. \tag{14}$$

Similarly, we define the minimal $\phi$-risk as $\mathcal R_{\phi, g}^* = \inf_f \mathcal R_{\phi, g}(f)$, where the infimum is taken over all measurable functions.

The performance of the regime associated with $f$ is measured by the excess risk $\Delta \mathcal R(f) = \mathcal R(\operatorname{sign}(f)) - \mathcal R^*$. Similarly, we define the excess $\phi$-risk as $\Delta \mathcal R_{\phi, g}(f) = \mathcal R_{\phi, g}(f) - \mathcal R_{\phi, g}^*$.

Suppose that a sample $D_n = \{(X_i, A_i, R_i)\}_{i=1}^n$ is independently drawn from a probability measure on $\mathcal X \times \{-1, +1\} \times \mathbb R$, where $\mathcal X$ is compact. Let $f_{D_n, \lambda_n} = h_{D_n, \lambda_n} + b_{D_n, \lambda_n}$, where $h_{D_n, \lambda_n} \in \mathcal H_K$ and $b_{D_n, \lambda_n} \in \mathbb R$, be a global minimizer of the following optimization problem:

$$\min_{f = h + b \in \mathcal H_K + \{1\}}\, \frac{1}{n} \sum_{i=1}^n \frac{|R_{g,i}|}{\pi(A_i, X_i)}\, \phi\big(A_i \cdot \operatorname{sign}(R_{g,i})\, f(X_i)\big) + \frac{\lambda_n}{2} \|h\|_K^2, \tag{15}$$

where $R_{g,i} = R_i - g(X_i)$. Here we suppress $\phi$, $g$ and $K$ from the notation of $f_{D_n, \lambda_n}$, $h_{D_n, \lambda_n}$ and $b_{D_n, \lambda_n}$.

The purpose of the theoretical analysis is to investigate universal consistency of the regime associated with $f_{D_n, \lambda_n}$. The concept of universal consistency is given in Zhou et al. (2017). A universally consistent regime method eventually learns the Bayes regime, without knowing any specifics of the distribution of the data, when the sample size approaches infinity. Mathematically, a sequence of regimes $\hat d_n$ is universally consistent when $\mathcal R(\hat d_n) \to \mathcal R^*$ in probability.

#### 2.5.1 Fisher consistency

The first question is whether the loss function used is Fisher consistent. The concept of Fisher consistency is brought from pattern classification (Lin 2002). For optimal treatment regimes, a loss function is Fisher consistent if the loss function alone can be used to identify the Bayes regime when the sample size approaches infinity, i.e., if the population minimizer $f^*$ of the $\phi$-risk satisfies $\operatorname{sign}(f^*(x)) = d^*(x)$. We define

$$\begin{aligned} \eta_1(x) &= E(R_g^+ \mid X = x, A = +1) + E(R_g^- \mid X = x, A = -1), \\ \eta_2(x) &= E(R_g^+ \mid X = x, A = -1) + E(R_g^- \mid X = x, A = +1), \end{aligned} \tag{16}$$

where $R_g^+ = \max(R_g, 0)$ and $R_g^- = \max(-R_g, 0)$. We suppress the dependence on $g$ from the notation. Note that

$$\eta_1(x) - \eta_2(x) = E(R \mid X = x, A = +1) - E(R \mid X = x, A = -1) = \mu_{+1}(x) - \mu_{-1}(x).$$

The sign of $\eta_1(x) - \eta_2(x)$ is just the Bayes regime at $x$. After some simple algebra, the $\phi$-risk in (14) can be shown to be

$$\mathcal R_{\phi, g}(f) = E\big\{\eta_1(X)\phi(f(X)) + \eta_2(X)\phi(-f(X))\big\}.$$

Now we introduce the generic conditional $\phi$-risk,

$$Q_{\eta_1, \eta_2}(\alpha) = \eta_1 \phi(\alpha) + \eta_2 \phi(-\alpha),$$

where $\eta_1, \eta_2 \geq 0$ and $\alpha \in \mathbb R$. The notation suppresses the dependence on $x$ and $\phi$. We define the optimal conditional $\phi$-risk,

$$H(\eta_1, \eta_2) = Q_{\eta_1, \eta_2}(\alpha^*) = \min_{\alpha \in \mathbb R} Q_{\eta_1, \eta_2}(\alpha),$$

and furthermore define

$$H^-(\eta_1, \eta_2) = \min_{\alpha : \alpha(\eta_1 - \eta_2) \leq 0} Q_{\eta_1, \eta_2}(\alpha).$$

$H^-(\eta_1, \eta_2)$ is the optimal value of the conditional $\phi$-risk under the constraint that the sign of the argument disagrees with the Bayes regime. Fisher consistency is equivalent to $H^-(\eta_1, \eta_2) > H(\eta_1, \eta_2)$ for any $\eta_1, \eta_2 \geq 0$ with $\eta_1 \neq \eta_2$. This condition is similar to that of classification calibration in Bartlett et al. (2006). When $\phi$ is convex, this condition is equivalent to a simpler condition on the derivative of $\phi$ at 0.

###### Theorem 2.1.

Assume that $\phi$ is convex. Then $\phi$ is Fisher consistent if and only if $\phi'(0)$ exists and $\phi'(0) < 0$.

It is interesting to note that the necessary and sufficient condition for a convex surrogate loss function to yield a Fisher consistent regime concerns only its local property at $u = 0$. All surrogate loss functions listed above are Fisher consistent.

#### 2.5.2 Relating excess risk to excess ϕ-risk

We now turn to the excess risk and show how it can be bounded through the excess $\phi$-risk. It is easy to verify that the excess $\phi$-risk can be expressed as

$$\Delta \mathcal R_{\phi, g}(f) = E\Big\{Q_{\eta_1(X), \eta_2(X)}(f(X)) - \min_{\alpha \in \mathbb R} Q_{\eta_1(X), \eta_2(X)}(\alpha)\Big\}.$$

Let $\Delta Q_{\eta_1, \eta_2}(\alpha) = Q_{\eta_1, \eta_2}(\alpha) - H(\eta_1, \eta_2)$.

###### Theorem 2.2.

Assume $\phi$ is convex, $\phi'(0)$ exists and $\phi'(0) < 0$. In addition, suppose that there exist constants $s \geq 1$ and $C_s > 0$ such that

$$|\eta_1 - \eta_2|^s \leq C_s\, \Delta Q_{\eta_1, \eta_2}(0) \quad \text{for all } \eta_1, \eta_2 \geq 0.$$

Then

$$\Delta \mathcal R(f) \leq C_s^{1/s}\, \big(\Delta \mathcal R_{\phi, g}(f)\big)^{1/s}.$$

As shown in the examples below, the constant $C_s$ is often related to $\eta_1 + \eta_2$. The following theorem handles this situation.

###### Theorem 2.3.

Assume $\phi$ is convex, $\phi'(0)$ exists, and $\phi'(0) < 0$. Suppose $\eta_1(x) + \eta_2(x) \leq M_g$ almost everywhere. In addition, suppose that there exist a constant $s \geq 1$ and a concave increasing function $h(\cdot)$ such that

$$|\eta_1 - \eta_2|^s \leq h(\eta_1 + \eta_2)\, \Delta Q_{\eta_1, \eta_2}(0) \quad \text{for all } \eta_1, \eta_2 \geq 0.$$

Then

$$\Delta \mathcal R(f) \leq \big(h(M_g)\big)^{1/s}\, \big(\Delta \mathcal R_{\phi, g}(f)\big)^{1/s}.$$

We now examine the consequences of these theorems for the example loss functions. Here we only present the results briefly, and show details in Appendix A. Except for Examples 1 and 6, we assume that $\eta_1 + \eta_2$ is bounded by $M_g$ in all other examples.

Example 1 (hinge loss). As shown in Appendix A, $\phi'(0) = -1$, $s = 1$, and $C_1 = 1$. By Theorem 2.2, $\Delta \mathcal R(f) \leq \Delta \mathcal R_{\phi, g}(f)$.

Example 2 (squared hinge loss). Consider the loss function $\phi(u) = \{(1 - u)_+\}^2$. We have $s = 2$ and $h(z) = z$. By Theorem 2.3, $\Delta \mathcal R(f) \leq \sqrt{M_g}\,\big(\Delta \mathcal R_{\phi, g}(f)\big)^{1/2}$.

Example 3 (least squares loss). Now consider the loss function $\phi(u) = (1 - u)^2$. Both $s$ and $h$ are the same as those in the previous example. Hence the bound in the previous example also applies to the least squares loss.

Example 4 (Huberized hinge loss). We can simply obtain that $s = 2$ and $h(z) = 4z$. By Theorem 2.3,

$$\Delta \mathcal R(f) \leq 2\sqrt{M_g}\,\big(\Delta \mathcal R_{\phi, g}(f)\big)^{1/2}. \tag{17}$$

Example 5 (logistic loss). We consider the loss function $\phi(u) = \log(1 + e^{-u})$. This is a slightly more complicated case. As shown in Appendix A, $s = 2$ and $h(z) = 2z$. Then by Theorem 2.3, we have $\Delta \mathcal R(f) \leq \sqrt{2M_g}\,\big(\Delta \mathcal R_{\phi, g}(f)\big)^{1/2}$.

Example 6 (DWD loss). As shown in Appendix A, we obtain $s = 1$ and $C_1 = 1$. Then by Theorem 2.2, $\Delta \mathcal R(f) \leq \Delta \mathcal R_{\phi, g}(f)$.

Example 7 (exponential loss). Consider the loss function $\phi(u) = e^{-u}$. We have $\alpha^* = \frac{1}{2}\log(\eta_1/\eta_2)$, and $H(\eta_1, \eta_2) = 2\sqrt{\eta_1 \eta_2}$. Then $s = 2$ and $h(z) = 2z$. By Theorem 2.3, $\Delta \mathcal R(f) \leq \sqrt{2M_g}\,\big(\Delta \mathcal R_{\phi, g}(f)\big)^{1/2}$.

#### 2.5.3 Universal consistency

We will establish universal consistency of the regime $\operatorname{sign}(f_{D_n, \lambda_n})$. The following theorem shows the convergence of the $\phi$-risk of the sample dependent function $f_{D_n, \lambda_n}$. We apply empirical process techniques to show consistency.

###### Theorem 2.4.

Suppose $\phi$ is a Lipschitz continuous loss function. Assume that we choose a sequence $\lambda_n > 0$ such that $\lambda_n \to 0$ and $n\lambda_n \to \infty$. For any distribution of $(X, A, R)$ satisfying $|R_g| \leq M_g$ almost everywhere, we have that, in probability,

$$\lim_{n \to \infty} \mathcal R_{\phi, g}(f_{D_n, \lambda_n}) = \inf_{f \in \mathcal H_K + \{1\}} \mathcal R_{\phi, g}(f).$$

When the loss function satisfies Theorem 2.2 or 2.3, starting from Theorem 2.4, universal consistency follows if $\inf_{f \in \mathcal H_K + \{1\}} \mathcal R_{\phi, g}(f) = \mathcal R_{\phi, g}^*$. This condition involves the concept of universal kernels (Steinwart and Christmann 2008). A continuous kernel on a compact metric space is called universal if its associated RKHS is dense in the space of all continuous functions on the compact metric space endowed with the usual supremum norm. The next lemma shows that the RKHS of a universal kernel is rich enough to approximate arbitrary decision functions.

###### Lemma 2.5.

Let $K$ be a universal kernel, and $\mathcal H_K$ be the associated RKHS. Suppose that $\phi$ is a Lipschitz continuous loss function, and $g$ is measurable and bounded. For any distribution of $(X, A, R)$ satisfying $|R_g| \leq M_g$ almost everywhere with regular marginal distribution on $\mathcal X$, we have

$$\inf_{f \in \mathcal H_K + \{1\}} \mathcal R_{\phi, g}(f) = \mathcal R_{\phi, g}^*.$$

Our proposed AOL uses the Huberized hinge loss. Combining all the theoretical results and the excess risk bound in (17), the following proposition shows universal consistency of AOL with the Huberized hinge loss.

###### Proposition 2.6.

Let $K$ be a universal kernel, and $\mathcal H_K$ be the associated RKHS. Let $\phi$ be the Huberized hinge loss function. Assume that we choose a sequence $\lambda_n > 0$ such that $\lambda_n \to 0$ and $n\lambda_n \to \infty$. For any distribution of $(X, A, R)$ satisfying $|R_g| \leq M_g$ almost everywhere with regular marginal distribution on $\mathcal X$, we have that, in probability,

$$\lim_{n \to \infty} \mathcal R\big(\operatorname{sign}(f_{D_n, \lambda_n})\big) = \mathcal R^*.$$

In the proof of Proposition 2.6, we provide a bound on the excess $\phi$-risk of $f_{D_n, \lambda_n}$. A similar trick can be applied to the hinge loss, squared hinge loss, and least squares loss. Thus for these three loss functions, it is not hard to derive universal consistency. The exponential loss function is not Lipschitz continuous, so the regime learned with this loss is probably not universally consistent.

For the logistic loss and the DWD loss, Lemma 2.5 does not apply, since the minimizer of the conditional $\phi$-risk is not bounded. We require stronger conditions for consistency. First, we may assume that both $\eta_1(x)$ and $\eta_2(x)$ in (16) are continuous. This assumption is plausible in practice. Second, we still need an assumption that $|R_g|$ is bounded, as in Theorem 2.4, to exclude some trivial situations, for example, where the residual $R_g$ is zero almost everywhere. We present the result in the following proposition. The proof is simple and we omit it.

###### Proposition 2.7.

Let $K$ be a universal kernel, and $\mathcal H_K$ be the associated RKHS. Let $\phi$ be the logistic loss or the DWD loss. Assume that we choose a sequence $\lambda_n > 0$ such that $\lambda_n \to 0$ and $n\lambda_n \to \infty$. For any distribution of $(X, A, R)$ satisfying that (1) both $\eta_1(x)$ and $\eta_2(x)$ are continuous, (2) $|R_g| \leq M_g$ almost everywhere, and (3) $R_g \neq 0$ almost everywhere, with regular marginal distribution on $\mathcal X$, we have that, in probability,

$$\lim_{n \to \infty} \mathcal R\big(\operatorname{sign}(f_{D_n, \lambda_n})\big) = \mathcal R^*.$$

### 2.6 Variable selection for AOL

As demonstrated in Zhou et al. (2017), variable selection is critical for optimal treatment regimes when the dimension of clinical covariates is moderate or high. In this section, we apply the variable selection techniques in Zhou et al. (2017) to AOL.

#### 2.6.1 Variable selection for linear AOL

As in Zhou et al. (2017), we apply the elastic-net penalty (Zou and Hastie 2005),

$$\lambda_1 \|w\|_1 + \frac{\lambda_2}{2} w^T w,$$

where $\|w\|_1$ is the $\ell_1$-norm of $w$, to replace the $\ell_2$-norm penalty in (8) for variable selection. The elastic-net penalty selects informative covariates through the $\ell_1$-norm penalty, and tends to identify or remove highly correlated variables together, the so-called grouping property, as the $\ell_2$-norm penalty does.

The elastic-net penalized linear AOL minimizes

$$\min_{w, b}\, \frac{1}{n} \sum_{i=1}^n \frac{|\tilde r_i|}{\pi(a_i, x_i)}\, \phi\big(a_i \cdot \operatorname{sign}(\tilde r_i)(w^T x_i + b)\big) + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2} w^T w,$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters. We use projected scaled sub-gradient (PSS) algorithms (Schmidt 2010), which are extensions of L-BFGS to the optimization of a smooth function with an $\ell_1$-norm penalty. The obtained decision function is $\hat f(x) = \hat w^T x + \hat b$, and thus the estimated optimal treatment regime is the sign of $\hat f(x)$.
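The $\ell_1$ term is the nondifferentiable part that specialized solvers such as PSS are built around. The core operation in many $\ell_1$ solvers (including proximal-gradient methods, which we show here as a stand-in; this is not the Schmidt (2010) implementation) is coordinate-wise soft-thresholding, which shrinks small coefficients exactly to zero and thereby performs variable selection:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink each coordinate toward zero
    by t, setting coordinates with |w_j| <= t exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
```

For example, with threshold $t = 1$, the vector $(3, -0.5, 1)$ maps to $(2, 0, 0)$: the two small coefficients are eliminated from the fitted regime.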

#### 2.6.2 Variable selection for AOL with nonlinear kernels

Similarly to Zhou et al. (2017), taking the Gaussian RBF kernel as an example, we define the covariate-scaled Gaussian RBF kernel,

$$K_\eta(x, z) = \exp\Big(-\sum_{j=1}^p \eta_j (x_j - z_j)^2\Big),$$

where $\eta = (\eta_1, \dots, \eta_p)^T$ with $\eta_j \geq 0$. The $j$'th covariate is scaled by $\sqrt{\eta_j}$. Setting $\eta_j = 0$ is equivalent to discarding the $j$'th covariate. The hyperparameter $\sigma$ in the original Gaussian RBF kernel is discarded, as it is absorbed into the scaling factors. We seek to minimize the following optimization problem:

$$\min_{v, b, \eta}\ \frac{1}{n} \sum_{i=1}^n \frac{|\tilde r_i|}{\pi(a_i, x_i)}\, \phi\Big(a_i \cdot \operatorname{sign}(\tilde r_i)\Big(\sum_{j=1}^n v_j K_\eta(x_i, x_j) + b\Big)\Big) + \lambda_1 \|\eta\|_1 + \frac{\lambda_2}{2} \sum_{i,j=1}^n v_i v_j K_\eta(x_i, x_j), \tag{18}$$

subject to $\eta \geq 0$.