1 Introduction
Precision medicine has recently gained much attention in the treatment of complex diseases, such as cancer and mental disorders. The purpose of precision medicine is to tailor treatments to individual patients to maximize treatment benefit and safety. Unlike the traditional "one-size-fits-all" approach, which does not rigorously take treatment heterogeneity into account, modern precision medicine does so explicitly.
A major component of precision medicine is the treatment selection rule, or optimal treatment regime. A treatment regime is a decision rule that assigns a treatment to a patient based on his or her clinical or medical characteristics. A large number of approaches have been developed to estimate optimal treatment regimes from data in clinical trials or observational studies (see Murphy (2005); Qian and Murphy (2011); Zhang et al. (2012b); Taylor et al. (2015) and references therein). Most of these methods are regression-based: they model the conditional mean outcomes, and obtain the estimated treatment regime by comparing the regression estimates.
Several researchers have applied classification methods to optimal treatment regimes. For example, Zhao et al. (2012) viewed treatment regime estimation as a weighted classification problem, and proposed outcome weighted learning to construct an optimal treatment regime that optimizes the observed clinical outcome directly. Recently, Zhou et al. (2017) proposed residual weighted learning, which replaces outcomes with residuals to improve the finite sample performance of outcome weighted learning. Zhang et al. (2012a) also proposed a general framework that uses weighted classification methods to generate treatment regimes. As an illustrative example, they constructed a weighted classification problem through a doubly robust augmented inverse probability weighted estimator of the conditional mean outcome, and used classification and regression trees (Breiman et al. 1984) to produce interpretable regimes.

The
nearest neighbor rule is a simple and intuitively appealing classification approach, where a subject is classified by a majority vote of its neighbors. Since its conception
(Fix and Hodges 1951), it has attracted many researchers, and retains its popularity today (Stone 1977; Hastie and Tibshirani 1996; Wager and Athey 2015). The rationale of nearest neighbor rules is that close covariate vectors share similar properties more often than not.
In this article, we propose a causal nearest neighbor method for optimal treatment regimes. The method is rooted in the framework of causal inference, and compares the causal treatment effects within the nearest neighborhood. Although the method is simple, it possesses nice theoretical properties. First, we show that the causal nearest neighbor regime is universally consistent: without knowing any specifics about the distribution underlying the data, a universally consistent treatment regime eventually learns the Bayes regime as the sample size approaches infinity. Second, we establish its rate of convergence. With an appropriately chosen number of neighbors, the rate is fastest when the dimension of the covariates is 1 or 2, and it deteriorates as the dimension grows.
Similar to the nearest neighbor rule for classification, the causal nearest neighbor regime suffers from the curse of dimensionality, i.e., performance deteriorates as dimensionality increases. To alleviate this problem, we propose an adaptive causal nearest neighbor method, where the distance metric is adaptively determined from the data. Through adaptive metric selection, this adaptive method performs variable selection implicitly. The superior performance of the adaptive causal nearest neighbor regime over the original causal nearest neighbor regime is illustrated in the simulation studies. In practical settings, we recommend the adaptive causal nearest neighbor method.
2 Methods
2.1 Causal nearest neighbor rules
Consider a randomized clinical trial with $m$ treatment arms. Let $Y$ denote the observed clinical outcome, $A \in \mathcal{A} = \{1, \dots, m\}$ denote the treatment assignment received by the patient, and $X \in \mathcal{X}$, where $\mathcal{X} \subset \mathbb{R}^d$ is compact, denote the patient's clinical covariates. Assume that larger values of $Y$ are preferred. Let $\pi(a|x) = \Pr(A = a \mid X = x)$ denote the probability of being assigned treatment $a$ for a patient with clinical covariates $x$. This probability is predefined in the design.
We then introduce the potential outcomes framework to formally identify the optimal treatment regime. The potential outcomes, denoted $Y^*(1), \dots, Y^*(m)$, are defined as the outcomes that would be observed were a patient to receive treatment $1, \dots, m$, respectively (Robins 1986). As in the literature on potential outcomes, we require the following assumptions. The first is the consistency assumption (Robins 1994): the potential outcomes and the observed outcomes agree, i.e., $Y = Y^*(A)$. We also assume that, conditional on covariates $X$, the potential outcomes are independent of the treatment assignment actually received. This is called the assumption of no unmeasured confounders; it always holds in a randomized clinical trial.
A treatment regime $d$ is a function from the clinical covariates $\mathcal{X}$ to the treatment assignments $\mathcal{A}$. For a treatment regime $d$, we can thus define its potential outcome $Y^*(d)$: it would be the observed outcome if a patient from the population were assigned treatment according to regime $d$. The expected potential outcome under regime $d$, $V(d) = E[Y^*(d)]$, is called the value function associated with $d$. An optimal regime $d^*$ is a regime that maximizes $V(d)$; it is also called the Bayes regime. There is a positivity assumption that $\pi(a|x) > 0$ almost everywhere for any $a \in \mathcal{A}$. That is, any treatment option must be represented in the data in order to estimate an optimal regime. For simplicity, let $Q_a(x) = E[Y^*(a) \mid X = x]$. It is easy to obtain that
$$d^*(x) \in \arg\max_{a \in \mathcal{A}} Q_a(x). \qquad (1)$$
Note that $Q_a(x) = E[Y \mid A = a, X = x]$ by the consistency and no-unmeasured-confounders assumptions. It is identifiable from the observed data.
The nearest neighbor rule is a nonparametric method for classification and regression (Fix and Hodges 1951). In this article, we apply the nearest neighbor rule to optimal treatment regimes. The idea is simple: we use the $k$-nearest neighbor algorithm to find a neighborhood of $x$ in $\mathcal{X}$, estimate $Q_a(x)$ for each arm $a$ within this neighborhood, and plug the estimates into (1) to obtain the nearest neighbor estimate of the optimal treatment regime. Similar procedures have been proposed in the recent literature for tree-based nonparametric approaches (Athey and Imbens 2016; Wager and Athey 2015), which target a partition of $\mathcal{X}$ to estimate the treatment heterogeneity.
Let $(X_1, A_1, Y_1), \dots, (X_n, A_n, Y_n)$ denote the observed data. We fix $x \in \mathcal{X}$, and reorder the observed data according to increasing values of $\|X_i - x\|$. The reordered data sequence is denoted by
$$(X_{(1)}(x), A_{(1)}(x), Y_{(1)}(x)), \dots, (X_{(n)}(x), A_{(n)}(x), Y_{(n)}(x)).$$
Thus $X_{(1)}(x), \dots, X_{(k)}(x)$ are the $k$ nearest neighbors of $x$. $Q_a(x)$ can be approximated in the $k$-nearest neighborhood of $x$ by
$$\hat{Q}_a(x) = \frac{1}{k} \sum_{i=1}^{k} \frac{\mathbb{1}(A_{(i)}(x) = a)}{\pi(a \mid X_{(i)}(x))} Y_{(i)}(x), \qquad (2)$$
where $\mathbb{1}(\cdot)$ is the indicator function, as suggested in Murphy (2005). Let $R_k(x)$ be the distance of the $k$th nearest neighbor to $x$, and define $B_{x, R_k(x)}$ as the closed ball centered at $x$ of radius $R_k(x)$. It is straightforward, through the consistency and no-unmeasured-confounders assumptions, to see that $\mathbb{1}(A = a) Y / \pi(a \mid X)$ has conditional mean $Q_a(X)$ given $X$. Hence $\hat{Q}_a(x)$ is a reasonable approximation to $Q_a(x)$ within the neighborhood $B_{x, R_k(x)}$. The plug-in estimate of the Bayes regime in (1) is then
$$\hat{d}(x) \in \arg\max_{a \in \mathcal{A}} \hat{Q}_a(x). \qquad (3)$$
This is called the causal nearest neighbor regime because of its close relationship to causal effects.
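As a concrete illustration, the neighborhood estimate (2) and the plug-in regime (3) can be sketched in a few lines of numpy. This is a hypothetical implementation for illustration only (the function name and data layout are our own); it assumes the allocation probabilities are known constants, as in a randomized trial.

```python
import numpy as np

def cnn_regime(x, X, A, Y, pi, k):
    """Causal k-nearest-neighbor treatment choice at a query point x.

    X : (n, d) covariate matrix; A : (n,) treatment labels;
    Y : (n,) outcomes; pi : dict mapping arm -> allocation probability
    (known by design in a randomized trial); k : number of neighbors.
    Returns (chosen_arm, {arm: Q_hat}).
    """
    # Reorder the sample by increasing Euclidean distance to x.
    dist = np.linalg.norm(X - x, axis=1)
    idx = np.argsort(dist)[:k]          # indices of the k nearest neighbors
    q_hat = {}
    for a in pi:
        # Inverse-probability-weighted average over the neighborhood,
        # as in (2): only arm-a neighbors contribute.
        w = (A[idx] == a).astype(float) / pi[a]
        q_hat[a] = np.mean(w * Y[idx])
    # Plug-in regime (3): pick the arm with the largest estimate.
    best = max(q_hat, key=q_hat.get)
    return best, q_hat
```

Averaging the reweighted outcomes over one shared neighborhood, rather than fitting a global model, is what makes the rule purely local.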
We need to address the problem of distance ties, i.e., when $\|X_i - x\| = \|X_j - x\|$ for some $i \neq j$. Devroye et al. (1996, Section 11.2) discussed several methods for breaking distance ties. In practice, we adopt the tie-breaking method used in Stone (1977): subjects at the same distance from $x$ as the $k$th nearest neighbor share the remaining weight equally. Recall that $R_k(x)$ denotes the distance of the $k$th nearest neighbor to $x$, and define the sets $S = \{i : \|X_i - x\| < R_k(x)\}$ and $T = \{i : \|X_i - x\| = R_k(x)\}$. The revised rule of (2) is
$$\tilde{Q}_a(x) = \frac{1}{k} \left[ \sum_{i \in S} \frac{\mathbb{1}(A_i = a)}{\pi(a \mid X_i)} Y_i + \frac{k - |S|}{|T|} \sum_{i \in T} \frac{\mathbb{1}(A_i = a)}{\pi(a \mid X_i)} Y_i \right]. \qquad (4)$$
The corresponding causal nearest neighbor regime is the regime in (3) with $\hat{Q}_a$ replaced by $\tilde{Q}_a$. This is not strictly a $k$-nearest neighbor rule when there are distance ties at the $k$th nearest neighbor, since the estimate uses more than $k$ neighbors.
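One common way to implement Stone-style tie handling is to give full weight to subjects strictly closer than the $k$th nearest neighbor and to split the leftover weight evenly among the tied subjects. The sketch below is a hypothetical illustration of this scheme (names are our own), not the paper's code.

```python
import numpy as np

def cnn_q_with_ties(x, X, A, Y, pi, k, arm):
    """IPW neighborhood estimate of Q_arm(x) with Stone-style tie handling:
    subjects tied with the k-th nearest neighbor share the leftover
    weight (k - #strictly-closer) equally."""
    dist = np.linalg.norm(X - x, axis=1)
    r_k = np.sort(dist)[k - 1]            # distance of the k-th nearest neighbor
    strictly_closer = dist < r_k
    tied = dist == r_k
    w = np.zeros(len(Y))
    w[strictly_closer] = 1.0
    # Leftover weight spread evenly over all tied subjects.
    w[tied] = (k - strictly_closer.sum()) / tied.sum()
    contrib = (A == arm) / pi[arm] * Y    # inverse-probability-weighted outcomes
    return np.sum(w * contrib) / k
```

The weights sum to $k$, so the estimate reduces to the untied rule whenever the $k$th neighbor is unique.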
The causal nearest neighbor regimes are based on local averaging. Here, $k$ is a tuning parameter. It must be small enough that local changes of the distribution can be detected, yet large enough that averaging over each arm is effective. We may tune this parameter by a cross validation procedure to balance the two requirements.
In this article, we focus on applications in randomized clinical trials. However, the proposed methods can be easily extended to observational studies. We still require the three assumptions (consistency, no unmeasured confounders, and positivity). The only difference is that the assumption of no unmeasured confounders automatically holds in randomized clinical trials; in observational studies, it may hold when all relevant confounders have been measured, though this assumption cannot be verified in practice. One additional step for observational studies is to estimate the treatment allocation probabilities $\pi(a|x)$, which can be obtained through, for example, logistic regression.
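For observational data, the allocation probabilities must be estimated before the weighting in (2) can be applied. The following is a minimal numpy-only sketch of the logistic-regression step mentioned above, fit by Newton's method for two arms; the function name and the coding of arms are our own assumptions, and any standard logistic-regression routine would serve equally well.

```python
import numpy as np

def fit_propensity_logistic(X, A, iters=50):
    """Estimate pi(2 | x) for a two-arm observational study by plain
    logistic regression fit with Newton's method (IRLS).
    Arms are coded 1 and 2; returns estimated P(A = 2 | x) per row."""
    Z = np.column_stack([np.ones(len(A)), X])   # add an intercept column
    y = (A == 2).astype(float)
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        W = p * (1.0 - p)
        # Newton step: beta += (Z' W Z)^{-1} Z' (y - p); tiny ridge for stability
        H = Z.T @ (Z * W[:, None]) + 1e-8 * np.eye(Z.shape[1])
        beta += np.linalg.solve(H, Z.T @ (y - p))
    return 1.0 / (1.0 + np.exp(-Z @ beta))
```

The fitted probabilities replace the design constants $\pi(a|x)$ in the weighted estimator; with more than two arms, a multinomial logistic fit plays the same role.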
To our knowledge, no nearest neighbor related methods have been applied to optimal treatment regimes. Wager and Athey (2015) described a standard nearest neighbor matching procedure to estimate heterogeneous treatment effects, using a different estimator,
$$\check{Q}_a(x) = \frac{1}{k} \sum_{i \in N_a(x)} Y_i, \qquad (5)$$
where $N_a(x)$ is the set of the $k$ nearest neighbors of $x$ within treatment arm $a$. $\hat{Q}_a(x)$ in (2) and $\check{Q}_a(x)$ in (5) are two estimates of $Q_a(x)$ from different perspectives: $\hat{Q}_a(x)$ averages inverse-probability-weighted outcomes over one common neighborhood of $x$, while $\check{Q}_a(x)$ averages outcomes over an arm-specific neighborhood. Our proposed causal nearest neighbor method is distinct in two ways. First, the estimates for all arms are obtained from the same neighborhood of $x$, which makes the subsequent comparison more sensible. Second, for a fairly large neighborhood, the inverse probability weighting in (2) corrects for variations of the allocation probabilities inside the neighborhood. This is particularly useful for applications in observational studies.
2.2 Theoretical Properties
In machine learning, a classification rule is called universally consistent if its expected error probability approaches the Bayes error probability, in probability or almost surely, for any distribution underlying the data (Devroye et al. 1996). The nearest neighbor classifier was the first proved to possess such universal consistency (Stone 1977). Here, we extend the concept of universal consistency to optimal treatment regimes.

Definition 2.1.
Given a sequence of data, a regime $\hat{d}_n$ is universally (weakly) consistent if $V(\hat{d}_n) \to V(d^*)$ in probability for any distribution of the data, and universally strongly consistent if $V(\hat{d}_n) \to V(d^*)$ almost surely for any distribution of the data.
Denote the probability measure for $X$ by $\mu$, and recall that $B_{x,r}$ is the closed ball centered at $x$ of radius $r$. The collection of all $x$ with $\mu(B_{x,r}) > 0$ for all $r > 0$ is called the support of $\mu$ (Cover and Hart 1967), denoted $\mathrm{supp}(\mu)$.
The analysis of universal consistency requires some assumptions.

(A1) There exists a constant $\pi_0 > 0$ such that $\pi(a|x) \ge \pi_0$ for any $a \in \mathcal{A}$ and $x \in \mathcal{X}$;

(A2) $E(Y^2) < \infty$;

(A3) Distance ties occur with probability zero in $\mathrm{supp}(\mu)$.
These assumptions are quite weak. Assumption (A1) is just the positivity assumption, and can be ensured by design. Assumption (A2) is natural; it is automatically satisfied for bounded outcomes, i.e., $|Y| \le M$ for some constant $M$. Assumption (A3) avoids the messy problem of distance ties. When (A3) does not hold, we may add a small uniform random variable $U$, independent of $(X, A, Y)$, to the covariate vector. The resulting $(d+1)$-dimensional random vector $(X, U)$ satisfies Assumption (A3), and we may perform the nearest neighbor method on the modified data. Because of the independence of $U$, the corresponding conditional outcome satisfies $E(Y \mid X, U) = E(Y \mid X)$. Hence Assumption (A3) is reasonable, but at the cost of potentially compromising performance by introducing an artificial covariate into the regime. When $U$ is very small, we effectively break ties randomly. The difference from Stone's tie-breaking method is that Stone's method takes into account all subjects whose distance to $x$ equals that of the $k$th nearest neighbor, while the random tie-breaking here picks only one of them. The remark following the proof of Theorem 2.2 in Appendix A demonstrates that Stone's tie-breaking estimate in (4) is asymptotically better than the random tie-breaking method here.
The following theorem shows universal consistency of the causal nearest neighbor regime. The proofs of theorems are provided in Appendix A.
Theorem 2.2.
For any distribution of $(X, A, Y)$ satisfying Assumptions (A1)–(A3),
(i) the causal nearest neighbor regime in (3) is universally weakly consistent if $k \to \infty$ and $k/n \to 0$ as $n \to \infty$;
(ii) the causal nearest neighbor regime in (3) is universally strongly consistent if $k/\log n \to \infty$ and $k/n \to 0$ as $n \to \infty$.
If Assumption (A2) is tightened to $|Y| \le M$ for some constant $M$, the regime in (3) is universally strongly consistent if $k \to \infty$ and $k/n \to 0$.
The next natural question is whether the associated value of the causal nearest neighbor regime tends to the Bayes value at a specified rate. To establish the rate of convergence, we require stronger assumptions.

(A1′) $\pi(a|x) > 0$ for all $a \in \mathcal{A}$ and $x \in \mathcal{X}$, and there exists a constant $\pi_0 > 0$ such that $\pi(a|x) \ge \pi_0$ for all $a$ and $x$;

(A2′) there exists a constant $\sigma^2 < \infty$ such that $\mathrm{Var}(Y \mid A = a, X = x) \le \sigma^2$ for all $a$ and $x$;

(A3′) distance ties occur with probability zero in $\mathrm{supp}(\mu)$, and the support of $\mu$ is compact with diameter $\rho$;

(A4′) the $Q_a$'s are Lipschitz continuous, i.e., there exists a constant $\lambda$ such that $|Q_a(x) - Q_a(x')| \le \lambda \|x - x'\|$ for any $x$ and $x'$ in $\mathcal{X}$, and any $a \in \mathcal{A}$.
Assumption (A1′) implies that randomization is not extremely skewed with respect to the covariates. Assumptions (A2′)–(A4′) are standard in the literature on nearest neighbor rules (Györfi et al. 2002). The following theorem gives the convergence rate of causal nearest neighbor regimes; it is proved in Appendix B.

Theorem 2.3.
For any distribution of $(X, A, Y)$ satisfying Assumptions (A1′)–(A4′), there exists a sequence $k_n$ with $k_n \to \infty$ and $k_n/n \to 0$ such that $V(d^*) - E[V(\hat{d}_n)]$ converges to zero at a dimension-dependent rate: the rate is fastest when $d = 1$, can be made arbitrarily close to the $d = 1$ rate when $d = 2$, and slows as $d$ increases for $d \ge 3$.
The rate of convergence is thus fastest when the dimensionality is 1 or 2. As $d$ increases, the convergence rate decreases significantly. As with nearest neighbor rules in classification and regression, the causal nearest neighbor regime suffers from the curse of dimensionality.
2.3 Adaptive rules
The causal nearest neighbor regime is consistent as shown previously. However, it is well known that the curse of dimensionality can severely hurt nearest neighbor rules in finite samples. The rate of convergence in Theorem 2.3 is slow when dimensionality is high. Hence appropriate variable selection may improve performance. In this section, we propose an adaptive causal nearest neighbor method to estimate the optimal treatment regime, and to perform metric selection and variable selection simultaneously.
Let $w = (w_1, \dots, w_d)^{\top}$ with $w_j \ge 0$, and use the weighted distance metric $\|x - x'\|_w = \{\sum_{j=1}^{d} w_j (x_j - x'_j)^2\}^{1/2}$ to compute the distance between $x$ and $x'$. Here $w_j$ is the scaling factor for the $j$th covariate; setting $w_j = 0$ is equivalent to discarding the $j$th covariate. We intend to set a large $w_j$ if the $j$th covariate is important for treatment selection.
We apply the following univariate method to evaluate the importance of an individual covariate. It is related to a test statistic comparing two treatment regimes (Murphy 2005). One regime, $d_j$, involves only the $j$th covariate; the other, $d_0$, called the noninformative regime, assigns all patients to the treatment with the largest estimated potential outcome. For a specific regime $d$, let $d(X_i)$ be the treatment assignment for the $i$th subject according to $d$. The value function associated with $d$ is estimated by
$$\hat{V}(d) = \frac{\sum_{i=1}^{n} \mathbb{1}(A_i = d(X_i)) Y_i / \pi(A_i \mid X_i)}{\sum_{i=1}^{n} \mathbb{1}(A_i = d(X_i)) / \pi(A_i \mid X_i)}. \qquad (6)$$
For the two regimes $d_j$ and $d_0$, let $\hat{\sigma}_j^2$ be a consistent estimator of the variance of $\hat{V}(d_j) - \hat{V}(d_0)$ (Murphy 2005). The statistic
$$z_j = \frac{\hat{V}(d_j) - \hat{V}(d_0)}{\hat{\sigma}_j}$$
asymptotically has a standard normal distribution under the null hypothesis that the two regimes have equal values (Murphy 2005). When the statistic is greater than zero, regime $d_j$ is considered better than the noninformative regime $d_0$; otherwise $d_0$ is better. The statistic $z_j$ reflects the importance of the $j$th covariate for the optimal treatment regime. We estimate $d_j$ by the causal nearest neighbor method using only the $j$th covariate.

We set $w_j = (z_j - \delta)_+$ for each $j$, where $\delta$ is a predefined parameter and $(\cdot)_+$ denotes the positive part. The adaptive causal nearest neighbor regime follows the same procedure as the causal nearest neighbor regime described above, except that it uses the distance metric $\|\cdot\|_w$ to compute the distance between $x$ and $x'$. When $\delta$ is very large, all $w_j$ are zero, and the adaptive regime degenerates to the noninformative regime. On the other hand, when $\delta$ is very small, all $w_j$ are almost identical, and the adaptive regime is equivalent to the causal nearest neighbor one. Figure 1 illustrates the effect of $\delta$ on the construction of $w$.
Here is a summary of the adaptive causal nearest neighbor procedure:
1) Normalize each covariate to a similar scale.
2) Calculate $z_j$ and $w_j = (z_j - \delta)_+$ for $j = 1, \dots, d$.
3) Use the metric $\|\cdot\|_w$ to estimate a causal nearest neighbor regime.
The scaling in the first step prevents covariates with greater numeric ranges from dominating those with smaller numeric ranges. We recommend linearly scaling each covariate to the range $[0, 1]$ or $[-1, 1]$ (Hsu et al. 2003). The adaptive causal nearest neighbor regime has two tuning parameters, $k$ and $\delta$. We tune them using 10-fold cross validation.
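Putting the pieces together, the final step of the procedure might look as follows. This is a hypothetical sketch: `z` is assumed to be the vector of precomputed importance statistics, and the fallback for the degenerate all-zero-weight case simply estimates each arm's mean outcome, which is reasonable under randomization.

```python
import numpy as np

def adaptive_cnn_regime(x, X, A, Y, pi, k, z, delta):
    """Adaptive causal k-NN choice at x: covariate j is scaled by
    w_j = max(z_j - delta, 0), where z_j is its importance statistic."""
    w = np.maximum(np.asarray(z, dtype=float) - delta, 0.0)
    if np.all(w == 0):
        # Degenerate case (delta too large): noninformative regime, i.e.
        # the arm with the largest mean outcome (valid under randomization).
        return max(pi, key=lambda a: np.mean(Y[A == a]))
    # Weighted Euclidean metric: zero weight discards a covariate entirely.
    dist = np.sqrt(((X - x) ** 2 * w).sum(axis=1))
    idx = np.argsort(dist)[:k]
    q_hat = {a: np.mean((A[idx] == a) / pi[a] * Y[idx]) for a in pi}
    return max(q_hat, key=q_hat.get)
```

Because irrelevant covariates get weight zero, the neighborhood is formed only along informative directions, which is exactly the implicit variable selection described above.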
3 Simulation studies
We performed extensive simulations to evaluate empirical performance of the causal nearest neighbor and adaptive causal nearest neighbor methods.
We first considered simulations for two-arm data. In the simulations, we generated $d$-dimensional vectors of clinical covariates $X$. The first two covariates were independent Bernoulli random variables with a common success probability, and the remaining covariates were independent standard normal random variables. The treatment $A$ was generated from $\{1, 2\}$ independently of $X$ with equal probabilities, i.e., $\pi(a|x) = 1/2$ for any $x$. To mimic a well balanced trial, we generated simulation data such that $n_1 = n_2 = n/2$, where $n$ is the sample size and $n_1$ and $n_2$ are the numbers of patients in treatment arms 1 and 2, respectively. The response $Y$ was normally distributed with mean $\mu(X, A)$ and standard deviation 1. We considered three scenarios with different choices of $\mu$:
Scenario 1: $\mu(x, 1) = \dots$; $\mu(x, 2) = \dots$.
Scenario 2: $\mu(x, 1) = \dots$; $\mu(x, 2) = \dots$.
Scenario 3: $\mu(x, 1) = \dots$; $\mu(x, 2) = \dots$,
where $\dots$, for $\dots$.
We ran the simulations for two settings of the covariate dimension: low dimensional data and moderate dimensional data. On the low dimensional data, we compared the empirical performance of the following seven methods: (1) $l_1$-penalized least squares proposed by Qian and Murphy (2011); (2) Q-learning using random forests as described in Taylor et al. (2015); (3) residual weighted learning with the linear kernel, proposed in Zhou et al. (2017); (4) residual weighted learning with the Gaussian kernel; (5) augmented inverse probability weighted estimation proposed by Zhang et al. (2012a); (6) the causal nearest neighbor method; and (7) the proposed adaptive causal nearest neighbor method. When the dimension was moderate, the two residual weighted learning methods were replaced with their variable selection counterparts (Zhou et al. 2017).

In the simulation studies, $l_1$-penalized least squares estimated a linear model to approximate the conditional outcomes, and used the least absolute shrinkage and selection operator to carry out variable selection; the obtained regime chose the treatment arm with the larger estimated conditional mean. Q-learning using random forests is nonparametric: the conditional outcomes were approximated using the covariates and treatment as inputs to the random forests, with the number of trees set to 1000 as suggested in Taylor et al. (2015). Residual weighted learning is an improved version of outcome weighted learning (Zhao et al. 2012). Outcome weighted learning views treatment selection as a weighted classification problem and uses the original outcomes as weights; residual weighted learning instead uses residuals of the outcome from a regression fit on covariates excluding treatment assignment. Residual weighted learning with the linear kernel estimates linear treatment regimes, while the Gaussian kernel can detect nonlinear regimes; the method involves nonconvex programming, so its computational cost is high. For the augmented inverse probability weighted estimator, we first obtained the doubly robust version of the contrast function through linear regression, then set the propensity score to 0.5 and searched for the optimal treatment regime using a classification and regression tree.
We applied 10-fold cross-validation for parameter tuning. The sample size was varied over five values for each scenario, and we repeated each simulation 500 times. For comparison, we generated a large test set of 10,000 subjects to evaluate performance. The comparison criterion was the value function of the estimated optimal treatment regime on the test set; precisely, it is given by $E_n[\mathbb{1}(A = d(X)) Y / \pi(A|X)] / E_n[\mathbb{1}(A = d(X)) / \pi(A|X)]$ (Murphy 2005), where $E_n$ denotes the empirical average on the test data.
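A miniature version of such a simulation study, with a hypothetical mean function standing in for the scenarios above (whose exact formulas are not reproduced here), might look like the following sketch.

```python
import numpy as np

def simulate(n, rng):
    """Generate a two-arm trial with the *hypothetical* mean function
    mu(x, a) = 1 + x1 + 1(a = 2) * x1 and pi(a | x) = 1/2 by design."""
    X = rng.normal(size=(n, 3))
    A = rng.integers(1, 3, size=n)
    mu = 1 + X[:, 0] + (A == 2) * X[:, 0]
    Y = mu + rng.normal(size=n)          # unit-variance Gaussian noise
    return X, A, Y

def cnn_choice(x, X, A, Y, k):
    """Causal nearest neighbor choice at x with known pi = 1/2."""
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    q = {a: np.mean((A[idx] == a) / 0.5 * Y[idx]) for a in (1, 2)}
    return max(q, key=q.get)

def empirical_value(regime, X_test):
    """Value of a regime under the same hypothetical mean function."""
    a = np.array([regime(x) for x in X_test])
    return np.mean(1 + X_test[:, 0] + (a == 2) * X_test[:, 0])
```

Under this mean function the true optimal regime assigns arm 2 exactly when the first covariate is positive, so the estimated regime's value can be compared against a fixed one-size-fits-all assignment.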
Scenario 1 (Optimal value )  

PLS  2.01 (0.08)  2.04 (0.06)  2.06 (0.04)  2.08 (0.02)  2.08 (0.01) 
QRF  1.89 (0.08)  1.96 (0.06)  2.00 (0.04)  2.02 (0.02)  2.03 (0.01) 
RWLLinear  1.96 (0.09)  2.01 (0.06)  2.05 (0.04)  2.07 (0.02)  2.08 (0.01) 
RWLGaussian  1.97 (0.09)  2.00 (0.07)  2.03 (0.06)  2.06 (0.03)  2.08 (0.02) 
AIPWE  1.92 (0.13)  1.96 (0.10)  2.00 (0.06)  2.02 (0.04)  2.03 (0.03) 
CNN  1.89 (0.10)  1.95 (0.08)  1.99 (0.06)  2.02 (0.03)  2.04 (0.02) 
ACNN  1.88 (0.13)  1.94 (0.12)  1.99 (0.08)  2.02 (0.05)  2.04 (0.03) 
Scenario 2 (Optimal value )  
PLS  1.48 (0.08)  1.53 (0.09)  1.58 (0.09)  1.64 (0.06)  1.66 (0.03) 
QRF  1.64 (0.10)  1.74 (0.08)  1.82 (0.05)  1.87 (0.03)  1.90 (0.02) 
RWLLinear  1.55 (0.08)  1.58 (0.07)  1.61 (0.05)  1.64 (0.04)  1.66 (0.03) 
RWLGaussian  1.64 (0.11)  1.71 (0.11)  1.81 (0.08)  1.86 (0.05)  1.90 (0.02) 
AIPWE  1.63 (0.15)  1.74 (0.13)  1.81 (0.09)  1.87 (0.05)  1.90 (0.03) 
CNN  1.64 (0.11)  1.73 (0.09)  1.81 (0.06)  1.87 (0.03)  1.90 (0.02) 
ACNN  1.65 (0.14)  1.76 (0.12)  1.84 (0.08)  1.89 (0.04)  1.92 (0.03) 
Scenario 3 (Optimal value )  
PLS  1.88 (0.03)  1.89 (0.03)  1.89 (0.03)  1.89 (0.04)  1.90 (0.03) 
QRF  2.05 (0.08)  2.14 (0.06)  2.21 (0.04)  2.26 (0.02)  2.28 (0.01) 
RWLLinear  1.93 (0.05)  1.93 (0.05)  1.96 (0.06)  1.97 (0.06)  1.98 (0.06) 
RWLGaussian  2.04 (0.09)  2.13 (0.08)  2.20 (0.06)  2.26 (0.04)  2.30 (0.02) 
AIPWE  2.06 (0.13)  2.17 (0.11)  2.23 (0.05)  2.26 (0.03)  2.28 (0.02) 
CNN  2.02 (0.08)  2.11 (0.06)  2.18 (0.05)  2.24 (0.03)  2.28 (0.02) 
ACNN  2.09 (0.11)  2.19 (0.08)  2.26 (0.06)  2.31 (0.04)  2.33 (0.02) 

PLS, penalized least squares; QRF, Q-learning using random forests; RWLLinear, residual weighted learning with linear kernel; RWLGaussian, residual weighted learning with Gaussian kernel; AIPWE, augmented inverse probability weighted estimation; CNN, causal nearest neighbor; ACNN, adaptive causal nearest neighbor.
The simulation results on the low dimensional data are presented in Table 1. Let $V_a$, $a = 1, 2$, be the value function when all subjects are assigned to treatment $a$, and let $V^*$ denote the optimal value. For Scenario 1, the decision boundary was a linear combination of a binary covariate and a continuous covariate. $l_1$-penalized least squares performed very well since its model was correctly specified. Both residual weighted learning methods performed similarly to $l_1$-penalized least squares, especially when the sample size was large. Our proposed causal nearest neighbor and adaptive causal nearest neighbor methods showed performance similar to Q-learning using random forests and augmented inverse probability weighted estimation, and when the sample size was large they were close to $l_1$-penalized least squares and residual weighted learning. For Scenario 2, the decision boundary was nonlinear, and $l_1$-penalized least squares and residual weighted learning with the linear kernel both failed due to model misspecification. The adaptive causal nearest neighbor method yielded the best performance, and the causal nearest neighbor method performed similarly to Q-learning using random forests, residual weighted learning with the Gaussian kernel, and augmented inverse probability weighted estimation. For Scenario 3, the decision boundary was highly nonlinear. As in Scenario 2, our proposed adaptive causal nearest neighbor approach outperformed all other methods, and the causal nearest neighbor method performed similarly to the other nonlinear methods, including Q-learning using random forests, residual weighted learning with the Gaussian kernel, and augmented inverse probability weighted estimation.
We now move to the moderate dimensional cases. The simulation results are shown in Table 2. In Scenario 1, $l_1$-penalized least squares outperformed the other methods because of correct model specification and built-in variable selection. The residual weighted learning methods yielded similar performance to $l_1$-penalized least squares thanks to their variable selection mechanism. The causal nearest neighbor regime was not comparable with the others in this scenario because it lacks a variable selection procedure; it is well known that nearest neighbor rules deteriorate when irrelevant covariates are present in the data. The proposed adaptive causal nearest neighbor approach showed performance similar to Q-learning using random forests and augmented inverse probability weighted estimation, and when the sample size was large it was close to $l_1$-penalized least squares and the residual weighted learning methods. Their good performance can be explained by variable selection: the adaptive causal nearest neighbor approach carries out variable selection through adaptive metric selection, while Q-learning using random forests and augmented inverse probability weighted estimation, as two tree methods, have a built-in mechanism to perform variable selection (Breiman et al. 1984). In Scenarios 2 and 3, $l_1$-penalized least squares and residual weighted learning with the linear kernel failed due to misspecification; again, causal nearest neighbor failed due to the lack of variable selection. The four nonparametric methods with variable selection, Q-learning using random forests, augmented inverse probability weighted estimation, residual weighted learning with the Gaussian kernel, and the adaptive causal nearest neighbor approach, stood out. Among them, our proposed adaptive causal nearest neighbor method ranked first in both scenarios.
Scenario 1 (Optimal value )  

PLS  1.91 (0.12)  2.00 (0.07)  2.04 (0.04)  2.06 (0.02)  2.08 (0.02) 
QRF  1.83 (0.10)  1.91 (0.08)  1.97 (0.06)  2.01 (0.03)  2.04 (0.01) 
RWLVSLinear  1.84 (0.12)  1.97 (0.08)  2.03 (0.05)  2.06 (0.03)  2.08 (0.01) 
RWLVSGaussian  1.82 (0.13)  1.92 (0.10)  2.02 (0.07)  2.06 (0.04)  2.07 (0.03) 
AIPWE  1.80 (0.15)  1.90 (0.12)  1.97 (0.08)  2.01 (0.05)  2.03 (0.03) 
CNN  1.79 (0.09)  1.82 (0.08)  1.85 (0.06)  1.89 (0.05)  1.91 (0.04) 
ACNN  1.77 (0.12)  1.83 (0.13)  1.91 (0.12)  2.00 (0.07)  2.03 (0.04) 
Scenario 2 (Optimal value )  
PLS  1.44 (0.06)  1.44 (0.06)  1.44 (0.07)  1.45 (0.06)  1.46 (0.06) 
QRF  1.48 (0.09)  1.54 (0.09)  1.68 (0.08)  1.81 (0.06)  1.87 (0.03) 
RWLVSLinear  1.49 (0.07)  1.52 (0.07)  1.57 (0.07)  1.62 (0.05)  1.65 (0.04) 
RWLVSGaussian  1.51 (0.09)  1.61 (0.13)  1.77 (0.12)  1.87 (0.07)  1.91 (0.04) 
AIPWE  1.48 (0.09)  1.55 (0.12)  1.71 (0.13)  1.82 (0.08)  1.88 (0.04) 
CNN  1.49 (0.06)  1.52 (0.06)  1.56 (0.05)  1.60 (0.05)  1.65 (0.04) 
ACNN  1.52 (0.11)  1.62 (0.15)  1.76 (0.13)  1.86 (0.07)  1.90 (0.04) 
Scenario 3 (Optimal value )  
PLS  1.89 (0.02)  1.89 (0.02)  1.89 (0.02)  1.89 (0.02)  1.89 (0.02) 
QRF  1.92 (0.03)  1.94 (0.04)  1.99 (0.05)  2.07 (0.06)  2.18 (0.05) 
RWLVSLinear  1.90 (0.03)  1.90 (0.03)  1.91 (0.04)  1.92 (0.04)  1.93 (0.05) 
RWLVSGaussian  1.94 (0.07)  2.02 (0.13)  2.20 (0.12)  2.30 (0.07)  2.32 (0.06) 
AIPWE  1.92 (0.06)  2.00 (0.12)  2.15 (0.11)  2.24 (0.04)  2.27 (0.03) 
CNN  1.92 (0.01)  1.93 (0.02)  1.94 (0.02)  1.96 (0.02)  1.98 (0.02) 
ACNN  1.99 (0.11)  2.10 (0.12)  2.23 (0.09)  2.30 (0.04)  2.33 (0.02) 

PLS, penalized least squares; QRF, Q-learning using random forests; RWLVSLinear, residual weighted learning with variable selection and linear kernel; RWLVSGaussian, residual weighted learning with variable selection and Gaussian kernel; AIPWE, augmented inverse probability weighted estimation; CNN, causal nearest neighbor; ACNN, adaptive causal nearest neighbor.
Here, the covariates were independent. We also ran simulations to assess the performance of our proposed methods when the covariates were correlated. The results are similar to those of the independent cases presented above. Details are collected in Appendix D.
We then evaluated the performance of our proposed methods on data with more than two treatment arms (say, $m = 3$). The simulation setup was similar to that with two treatment arms. We generated $d$-dimensional vectors of clinical covariates as before. The treatment $A$ was generated from $\{1, 2, 3\}$ independently of $X$ with $\pi(a|x) = 1/3$ for any $x$. The response $Y$ was normally distributed with mean $\mu(X, A)$ and standard deviation 1. We considered two scenarios with different choices of $\mu$:

Scenario 4: $\mu(x, 1) = \dots$; $\mu(x, 2) = \dots$; $\mu(x, 3) = \dots$.
Scenario 5: $\mu(x, 1) = \dots$; $\mu(x, 2) = \dots$; $\mu(x, 3) = \dots$,
where $\dots$, for $\dots$.
We compared the performance of the following four methods: (1) $l_1$-penalized least squares proposed by Qian and Murphy (2011); (2) Q-learning using random forests as described in Taylor et al. (2015); (3) the proposed causal nearest neighbor method; and (4) the proposed adaptive causal nearest neighbor method. Residual weighted learning and augmented inverse probability weighted estimation have only been implemented for two treatment arms, and so are not included here. For each scenario, we varied the sample size over four values and repeated the simulation 500 times. The independent test set had a sample size of 30,000.
The simulation results on the low dimensional cases are presented in Table 3. For Scenario 4, the decision boundary is linear. $l_1$-penalized least squares produced the best performance because of correct model specification; the nonparametric methods, Q-learning using random forests and the causal and adaptive causal nearest neighbor methods, showed similar performance to one another. For Scenario 5, the decision boundary is nonlinear, and $l_1$-penalized least squares was not comparable with the nonparametric methods, as the postulated model was misspecified. Our proposed adaptive causal nearest neighbor method produced the best performance, and the causal nearest neighbor method performed similarly to Q-learning using random forests.
Scenario 4 (Optimal value )  

PLS  1.94 (0.08)  1.99 (0.06)  2.02 (0.02)  2.03 (0.02) 
QRF  1.82 (0.06)  1.87 (0.05)  1.91 (0.03)  1.95 (0.02) 
CNN  1.83 (0.08)  1.89 (0.06)  1.93 (0.04)  1.96 (0.03) 
ACNN  1.81 (0.09)  1.87 (0.09)  1.91 (0.18)  1.97 (0.04) 
Scenario 5 (Optimal value )  
PLS  1.82 (0.12)  1.90 (0.11)  1.98 (0.08)  2.02 (0.05) 
QRF  1.88 (0.07)  1.97 (0.05)  2.04 (0.03)  2.09 (0.02) 
CNN  1.85 (0.09)  1.94 (0.06)  2.02 (0.04)  2.07 (0.02) 
ACNN  1.89 (0.10)  1.99 (0.08)  2.07 (0.06)  2.12 (0.03) 

PLS, penalized least squares; QRF, Q-learning using random forests; CNN, causal nearest neighbor; ACNN, adaptive causal nearest neighbor.
We then increased the dimensionality to 25. The results are presented in Table 4. Again, $l_1$-penalized least squares produced the best performance in Scenario 4, and the adaptive causal nearest neighbor method in Scenario 5. The causal nearest neighbor approach was not comparable with the others in either scenario due to the curse of dimensionality.
Scenario 4 (Optimal value )  

PLS  1.85 (0.09)  1.94 (0.07)  2.00 (0.04)  2.02 (0.02) 
QRF  1.77 (0.06)  1.83 (0.06)  1.90 (0.04)  1.95 (0.03) 
CNN  1.72 (0.04)  1.76 (0.04)  1.78 (0.04)  1.82 (0.04) 
ACNN  1.72 (0.08)  1.77 (0.09)  1.83 (0.09)  1.91 (0.07) 
Scenario 5 (Optimal value )  
PLS  1.72 (0.08)  1.79 (0.09)  1.90 (0.08)  1.99 (0.05) 
QRF  1.72 (0.08)  1.82 (0.08)  1.92 (0.05)  1.99 (0.04) 
CNN  1.66 (0.04)  1.69 (0.03)  1.73 (0.03)  1.76 (0.03) 
ACNN  1.74 (0.12)  1.87 (0.13)  2.01 (0.10)  2.11 (0.04) 

PLS, penalized least squares; QRF, Q-learning using random forests; CNN, causal nearest neighbor; ACNN, adaptive causal nearest neighbor.
When the dimension is low, the causal nearest neighbor regime produced performance comparable to the other alternatives. The adaptive selection of the distance metric enhances the causal nearest neighbor regime: in the simulations, the adaptive causal nearest neighbor method performed at least as well as the causal nearest neighbor regime. As explained before, when the tuning parameter $\delta$ is very small, the adaptive approach is almost equivalent to the causal nearest neighbor approach. Considering the superior performance of the adaptive method over the original, especially when the dimensionality is large, we suggest the adaptive causal nearest neighbor method for general practical use.
4 Data analysis
We applied the proposed methods to analyze data from a chronic depression clinical trial (Keller et al. 2000). Patients with nonpsychotic chronic major depressive disorder were randomized in a 1:1:1 ratio to Nefazodone, cognitive behavioral-analysis system of psychotherapy, or the combination of the two therapies. The primary efficacy outcome was the score on the 24-item Hamilton rating scale for depression; lower scores are desirable. We considered 50 pretreatment covariates, as in Zhao et al. (2012). Patients with missing covariate values were excluded from the analyses, leaving 647 patients, of whom 216, 220, and 211 were assigned to the three arms, respectively. Each clinical covariate was scaled to a common range, as described in Hsu et al. (2003).
Since the trial had three treatment arms, we compared the performance of the adaptive causal nearest neighbor regime with penalized least squares and Q-learning using random forests; residual weighted learning and augmented inverse probability weighted estimation can only handle two treatments. Because the adaptive regime outperformed the causal nearest neighbor regime in the simulation studies, especially when the dimension of covariates was large, we considered only the adaptive regime in this section. Outcomes used in the analyses were the negatives of the scores on the 24-item Hamilton rating scale for depression. We used a nested 10-fold cross-validation procedure for an unbiased comparison (Ambroise and McLachlan 2002). To obtain reliable estimates, we repeated the nested cross-validation procedure 100 times with different fold partitions.
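The evaluation step can be illustrated with a simple (non-nested) cross-validation sketch: a regime is fit on the training folds, and its value is estimated on the held-out fold by inverse probability weighting, using the known randomization probabilities. A full nested procedure would add an inner loop for tuning; the function names and the scalar-propensity simplification below are our own.

```python
import numpy as np

def ipw_value(regime, X, A, Y, propensity):
    """Inverse-probability-weighted value estimate: only patients whose
    observed treatment matches the regime's recommendation contribute,
    reweighted by the (known) treatment probability."""
    rec = np.array([regime(x) for x in X])
    w = (A == rec).astype(float) / propensity
    return np.sum(w * Y) / np.sum(w)

def cross_validated_value(fit, X, A, Y, propensity, n_folds=10, seed=0):
    """Outer cross-validation loop: fit a regime on the training folds,
    evaluate its IPW value on the held-out fold, average over folds."""
    idx = np.random.default_rng(seed).permutation(len(Y))
    folds = np.array_split(idx, n_folds)
    values = []
    for test in folds:
        train = np.setdiff1d(idx, test)
        regime = fit(X[train], A[train], Y[train])
        values.append(ipw_value(regime, X[test], A[test], Y[test], propensity))
    return float(np.mean(values))
```

Repeating this with different fold partitions, as in the analysis above, stabilizes the estimate.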
The mean value functions over 100 repeats and the standard deviations are presented in Table 5. The adaptive causal nearest neighbor regime achieved a similar performance to penalized least squares and Qlearning using random forests. All methods assigned the combination treatment to almost every patient. The original analysis in Keller et al. (2000) indicated that the combination treatment is significantly more efficacious than either treatment alone. Our analysis confirmed that this is indeed true.
We also performed pairwise comparisons between treatment arms, adding the two residual weighted learning methods with variable selection and augmented inverse probability weighted estimation to the analysis. These results are also presented in Table 5. For the comparison between Nefazodone and cognitive behavioral-analysis system of psychotherapy, the adaptive causal nearest neighbor regime was slightly better than the other methods except residual weighted learning with the linear kernel. For the comparison between Nefazodone and the combination therapy, all methods produced similar performance. For the comparison between cognitive behavioral-analysis system of psychotherapy and the combination therapy, the adaptive causal nearest neighbor regime did not perform comparably to the other methods. We carried out the significance test described in Section 2.3 to compare the regimes from the adaptive causal nearest neighbor method and residual weighted learning with the linear kernel; the difference between them was not statistically significant.
                 NFZ vs CBASP vs COMB  NFZ vs CBASP  NFZ vs COMB   CBASP vs COMB
PLS              11.19 (0.15)          16.30 (0.39)  11.20 (0.16)  10.95 (0.09)
QRF              11.11 (0.13)          16.27 (0.44)  11.05 (0.18)  10.93 (0.09)
RWL-VS-Linear    n/a                   15.45 (0.37)  11.09 (0.29)  10.88 (0.05)
RWL-VS-Gaussian  n/a                   16.29 (0.44)  11.33 (0.25)  11.07 (0.28)
AIPWE            n/a                   16.45 (0.41)  10.97 (0.15)  10.96 (0.14)
ACNN             11.18 (0.27)          15.70 (0.39)  11.03 (0.27)  11.41 (0.28)

n/a: method handles only two treatment arms.
PLS, penalized least squares (Qian and Murphy 2011); QRF, Q-learning using random forests (Taylor et al. 2015); RWL-VS-Linear, residual weighted learning with variable selection and linear kernel (Zhou et al. 2017); RWL-VS-Gaussian, residual weighted learning with variable selection and Gaussian kernel (Zhou et al. 2017); AIPWE, augmented inverse probability weighted estimation (Zhang et al. 2012a); ACNN, adaptive causal nearest neighbor. NFZ, Nefazodone; CBASP, cognitive behavioral-analysis system of psychotherapy; COMB, combination of Nefazodone and cognitive behavioral-analysis system of psychotherapy.
The adaptive causal nearest neighbor regime showed a statistically equivalent performance to other methods on the chronic depression clinical trial data.
5 Discussion
In this article, we have proposed a simple causal nearest neighbor method for estimating optimal treatment regimes, and developed an adaptive method to determine the distance metric. As shown in the simulation studies and the data analysis, the adaptive method can rival, and even improve upon, more sophisticated methods, especially when the decision boundary is nonlinear.
Variable selection plays a critical role in identifying the optimal treatment regime when the dimension of covariates is large, as shown in the simulation studies. Penalized least squares methods use the least absolute shrinkage and selection operator for variable selection. Residual weighted learning performs variable selection through the elastic-net penalty for linear kernels and through covariate scaling for Gaussian kernels (Zhou et al. 2017). As a tree method, augmented inverse probability weighted estimation is equipped with a built-in variable selection mechanism (Breiman et al. 1984). Our proposed adaptive causal nearest neighbor method applies an adaptive distance metric to perform variable selection. Recently, several researchers have highlighted the importance of variable selection for optimal treatment regimes (Gunter et al. 2011; Zhou et al. 2017). Variable selection in optimal treatment regimes has its own characteristics: there are two different types of covariates related to outcomes, predictive and prescriptive covariates. Predictive covariates are useful for predicting outcomes, whereas prescriptive covariates are used to prescribe optimal treatment regimes (Gunter et al. 2011). Athey and Imbens (2016) discussed several ways of splitting on prescriptive rather than predictive covariates in causal trees. The variable selection in the adaptive causal nearest neighbor regime identifies prescriptive covariates through tuning with the additional parameter. As pointed out by an anonymous reviewer, in practice the number of predictive covariates may be much larger than the number of prescriptive covariates, so variable selection for optimal treatment regimes is both important and challenging.
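To illustrate how a distance metric can encode this kind of variable selection, the sketch below weights each covariate by a crude measure of treatment interaction (how much its per-arm regression slope on the outcome varies across arms), so that purely predictive covariates receive little weight in the metric. This is a heuristic stand-in, not the paper's adaptive tuning procedure; the function names are hypothetical, and `gamma` merely mimics the role of the tuning parameter (as it tends to zero the weights become equal, recovering the plain metric).

```python
import numpy as np

def prescriptive_weights(X, A, Y, gamma=1.0):
    """Heuristic covariate weights for a k-NN distance metric: a
    covariate gets more weight when its association with the outcome
    differs across treatment arms (a sign that it is prescriptive
    rather than merely predictive)."""
    slopes = []
    for a in np.unique(A):
        Xc = X[A == a] - X[A == a].mean(axis=0)
        Yc = Y[A == a] - Y[A == a].mean()
        # per-covariate least-squares slope of the outcome on each covariate
        slopes.append(Xc.T @ Yc / np.maximum((Xc ** 2).sum(axis=0), 1e-12))
    # spread of slopes across arms measures treatment interaction
    interaction = np.ptp(np.array(slopes), axis=0)
    w = interaction ** gamma
    return w / w.sum()  # normalize so the weights sum to one

def weighted_distance(x, z, w):
    """Weighted Euclidean distance used in place of the plain metric."""
    return np.sqrt(np.sum(w * (x - z) ** 2))
```

In a toy model where the first covariate interacts with treatment and the second only shifts the outcome, nearly all the weight lands on the first covariate.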
The causal nearest neighbor methods are simple and fast; they possess nice theoretical properties; as nonparametric methods, they are free of model specification; they naturally accommodate multiple-arm trials; and the variable selection in the adaptive causal nearest neighbor regime identifies prescriptive covariates to further improve finite sample performance.
Acknowledgement
This work was sponsored by the National Cancer Institute. We are grateful to the editors and the reviewers for their insightful comments, which have led to important improvements in this paper.
APPENDIX
Appendix A Proof of Theorem 2.2
The following lemma shows that consistency of , , guarantees consistency of the rule .
Lemma A.1.
The causal nearest neighbor rule in (3) of the main paper satisfies the following bound for any distribution for ,
Proof of Lemma A.1.
Note that the value function of any rule ,
Thus, by fixing , we have
where and , and the expectation is with respect to for . By the construction of , we have
The desired result follows by taking expectation over on both sides. ∎
Now it is sufficient to prove, for any ,
in probability or almost surely, as . We start from a simpler nearest neighbor rule, for ,
(8) 
The relationship between and is that
By the law of large numbers, the denominator
as . Thus it is now sufficient to prove, for any ,
in probability or almost surely, as .
From now on, we use . For weak consistency, we will prove a slightly stronger result. We rewrite