Causal nearest neighbor rules for optimal treatment regimes

11/22/2017, by Xin Zhou, et al.

The estimation of optimal treatment regimes is of considerable interest to precision medicine. In this work, we propose a causal k-nearest neighbor method to estimate the optimal treatment regime. The method is rooted in the framework of causal inference, and estimates the causal treatment effects within the nearest neighborhood. Although the method is simple, it possesses nice theoretical properties. We show that the causal k-nearest neighbor regime is universally consistent: it will eventually learn the optimal treatment regime as the sample size increases. We also establish its convergence rate. However, the causal k-nearest neighbor regime may suffer from the curse of dimensionality, i.e., performance deteriorates as dimensionality increases. To alleviate this problem, we develop an adaptive causal k-nearest neighbor method that performs metric selection and variable selection simultaneously. The performance of the proposed methods is illustrated in simulation studies and in an analysis of a chronic depression clinical trial.


1 Introduction

Precision medicine has recently gained much attention in treating complex diseases, such as cancer and mental disorders. The purpose of precision medicine is to tailor treatments to individual patients so as to maximize treatment benefit and safety. Modern precision medicine differs from the traditional “one-size-fits-all” approach, which does not rigorously account for treatment heterogeneity.

A major component of precision medicine is the treatment selection rule, or optimal treatment regime. A treatment regime is a decision rule that assigns a treatment to a patient based on his or her clinical or medical characteristics. A large number of approaches have been developed to estimate optimal treatment regimes based on data from clinical trials or observational studies (see Murphy (2005); Qian and Murphy (2011); Zhang et al. (2012b); Taylor et al. (2015) and references therein). Most of these methods are regression-based. They model the conditional mean outcomes, and obtain the estimated treatment regime by comparing the regression estimates.

Several researchers have applied classification methods to optimal treatment regimes. For example, Zhao et al. (2012) viewed treatment regime estimation as a weighted classification problem, and proposed outcome weighted learning to construct an optimal treatment regime that directly optimizes the observed clinical outcome. Recently, Zhou et al. (2017) proposed residual weighted learning, which replaces outcomes with residuals to improve the finite sample performance of outcome weighted learning. Zhang et al. (2012a) also proposed a general framework that uses weighted classification methods to generate treatment regimes. As an illustrative example, they constructed a weighted classification problem through a doubly robust augmented inverse probability weighted estimator of the conditional mean outcome, and used classification and regression trees (Breiman et al. 1984) to produce interpretable regimes.

The k-nearest neighbor rule is a simple and intuitively appealing classification approach, where a subject is classified by a majority vote of its neighbors. Since its conception (Fix and Hodges 1951), it has attracted many researchers, and it retains its popularity today (Stone 1977; Hastie and Tibshirani 1996; Wager and Athey 2015). The rationale of nearest neighbor rules is that close covariate vectors share similar properties more often than not.
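As a quick illustration of the majority-vote idea, here is a minimal k-nearest neighbor classifier; the Euclidean metric and the function name are illustrative choices only, not part of the methods discussed below.

```python
import numpy as np

def knn_classify(x0, X, labels, k=5):
    """Classify x0 by a majority vote among its k nearest neighbors.

    X: (n, p) array of training covariates; labels: (n,) array of class labels.
    """
    dist = np.linalg.norm(X - x0, axis=1)               # distance from x0 to every training point
    nbrs = np.argsort(dist)[:k]                         # indices of the k closest training points
    classes, counts = np.unique(labels[nbrs], return_counts=True)
    return classes[np.argmax(counts)]                   # most common label among the neighbors
```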

In this article, we propose a causal k-nearest neighbor method for optimal treatment regimes. The method is rooted in the framework of causal inference, and compares the causal treatment effects within the nearest neighborhood. Although the method is simple, it possesses nice theoretical properties. First, we show that the causal k-nearest neighbor regime is universally consistent: without knowing any specifics about the distribution underlying the data, a universally consistent treatment regime will eventually learn the Bayes regime as the sample size approaches infinity. Second, we establish its convergence rate: with an appropriately chosen k, the rate is fastest when the dimension of the covariates is 1 or 2, and it slows as the dimension grows.

Similar to the nearest neighbor rule for classification, the causal k-nearest neighbor regime suffers from the curse of dimensionality, i.e., performance deteriorates as dimensionality increases. To alleviate this problem, we propose an adaptive causal k-nearest neighbor method, where the distance metric is adaptively determined from the data. Through adaptive metric selection, this adaptive method performs variable selection implicitly. The superior performance of the adaptive causal k-nearest neighbor regime over the original causal k-nearest neighbor regime is illustrated in the simulation studies. In practical settings, we recommend the adaptive causal k-nearest neighbor method.

2 Methods

2.1 Causal nearest neighbor rules

Consider a randomized clinical trial with multiple treatment arms. For each patient we observe the clinical outcome, the treatment assignment received, and the patient's clinical covariates, which take values in a compact covariate space. Assume that larger values of the outcome are preferred. The probability of being assigned each treatment, given a patient's clinical covariates, is predefined in the design.

We then introduce the potential outcomes framework to formally identify the optimal treatment regime. The potential outcomes are defined as the outcomes that would be observed were a patient to receive each of the candidate treatments (Robins 1986). As in the literature on potential outcomes, we require the following assumptions. The first is the consistency assumption (Robins 1994): the potential outcome under the treatment actually received agrees with the observed outcome. We also assume that, conditional on the covariates, the potential outcomes are independent of the treatment assignment actually received. This is called the assumption of no unmeasured confounders; it always holds in a randomized clinical trial.

A treatment regime is a function from the clinical covariates to a treatment assignment. For a treatment regime we can thus define its potential outcome: the outcome that would be observed if a patient from the population were assigned treatment according to the regime. The expected potential outcome under a regime is called the value function associated with that regime. An optimal regime is one that maximizes the value function; it is also called the Bayes regime. We also make a positivity assumption: every treatment option has positive assignment probability almost everywhere over the covariate space, so that each treatment option is represented in the data, as is needed to estimate an optimal regime. Writing, for simplicity, the conditional mean outcome of each treatment given the covariates, it is easy to obtain that

(1)

Note that, by the consistency and no-unmeasured-confounders assumptions, this conditional mean outcome equals the conditional mean of the observed outcome among patients with the same covariates who received that treatment, so it is identifiable from the observed data.
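For concreteness, the identification behind (1) can be sketched as follows; the symbols (covariates X, treatment A, outcome Y, potential outcomes Y*(a), regime d, value V(d)) are our own shorthand and may differ from the paper's notation.

```latex
% Hedged sketch: under consistency, no unmeasured confounders and positivity,
%   E[Y^*(a) | X = x] = E[Y | X = x, A = a],
% so the value of a regime d and the Bayes regime can be written as
\[
  V(d) = E\bigl[Y^{*}\bigl(d(X)\bigr)\bigr]
       = E\Bigl[ E\bigl[Y \mid X,\, A = d(X)\bigr] \Bigr],
  \qquad
  d^{*}(x) \in \arg\max_{a} E\bigl[Y \mid X = x,\, A = a\bigr].
\]
```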

The k-nearest neighbor rule is a nonparametric method used for classification and regression (Fix and Hodges 1951). In this article, we apply the nearest neighbor rule to optimal treatment regimes. The idea is simple: we use the nearest neighbor algorithm to find a neighborhood of a target covariate point, estimate the mean outcome of each arm within this neighborhood, and plug these estimates into (1) to obtain the nearest neighbor estimate of the optimal treatment regime. Similar procedures have been proposed in the recent literature for tree-based nonparametric approaches (Athey and Imbens 2016; Wager and Athey 2015), which instead target a partition of the covariate space to estimate treatment heterogeneity.

Fix a target covariate point and reorder the observed data according to increasing distance from this point; the first k subjects in the reordered sequence are its k nearest neighbors. Within this k-nearest neighborhood, the mean outcome of each treatment arm can be approximated by the inverse-probability-weighted average

(2)

where the indicator function selects the neighbors who received that arm and the weights are the inverses of the treatment allocation probabilities, as suggested in Murphy (2005). It is straightforward, through the consistency and no-unmeasured-confounders assumptions, to see that this estimator is unbiased for the average conditional mean outcome of the arm over the neighborhood, and hence is a reasonable approximation to the conditional mean outcome at the target point. The plug-in estimate of the Bayes regime in (1) is then

(3)

This is called the causal k-nearest neighbor regime because of its close relationship to causal effects.
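A minimal sketch of the resulting rule in code, assuming a trial with known allocation probabilities; the function and variable names, the Euclidean metric, and the division by k are illustrative choices, not the paper's exact specification.

```python
import numpy as np

def causal_knn_regime(x0, X, A, Y, pi, arms, k=25):
    """Recommend a treatment at x0 by the causal k-nearest neighbor rule (sketch).

    X: (n, p) covariates; A: (n,) received treatments; Y: (n,) outcomes;
    pi: (n,) allocation probability of the treatment each subject actually received;
    arms: iterable of treatment labels; k: neighborhood size.
    """
    dist = np.linalg.norm(X - x0, axis=1)       # distances from every subject to x0
    nbrs = np.argsort(dist)[:k]                 # indices of the k nearest neighbors
    est = {}
    for a in arms:
        w = (A[nbrs] == a) / pi[nbrs]           # inverse-probability weights for arm a
        est[a] = np.sum(w * Y[nbrs]) / k        # neighborhood IPW estimate of the arm's mean outcome
    return max(est, key=est.get)                # assign the arm with the largest estimate
```

Distance ties and the choice of k (for example, by cross-validating an estimated value) are handled as discussed below.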

We need to address the problem of distance ties, that is, when two or more subjects are equally distant from the target point. Devroye et al. (1996, Section 11.2) discussed several methods for breaking distance ties. In practice we adopt the tie-breaking method used in Stone (1977): subjects whose distance from the target point equals that of the kth nearest neighbor have their contributions averaged. The revised rule of (2) is as follows:

(4)

The corresponding causal nearest neighbor regime is the regime in (3) with the estimator replaced by this tie-adjusted version. This is not strictly a k-nearest neighbor rule when there are distance ties at the kth nearest neighbor, since the estimate then uses more than k neighbors.

The causal nearest neighbor regimes are based on local averaging, and k is a tuning parameter. The neighborhood must be small relative to the sample so that local changes of the distribution can be detected; on the other hand, k needs to be large enough that averaging within each arm is effective. We may tune this parameter by a cross-validation procedure to balance the two requirements.

In this article, we focus on applications in randomized clinical trials. However, the proposed methods can easily be extended to observational studies. We still require the three assumptions (consistency, no unmeasured confounders, and positivity). The only difference is that the assumption of no unmeasured confounders holds automatically in randomized clinical trials; in observational studies it may hold when all relevant confounders have been measured, although this cannot be verified in practice. One additional step for observational studies is to estimate the treatment allocation probabilities, which can be done through, for example, logistic regression.

To our knowledge, nearest neighbor methods have not previously been applied to optimal treatment regimes. Wager and Athey (2015) described a standard k-nearest neighbor matching procedure to estimate heterogeneous treatment effects, using a different estimator

(5)

where, for each treatment arm, the average is taken over the outcomes of the k nearest neighbors of the target point within that arm. The estimators in (2) and (5) approximate the arm-specific mean outcome near the target point from different perspectives. Our proposed causal k-nearest neighbor method is distinct in two ways. First, the arm-specific estimates in our approach are obtained from the same neighborhood of the target point, which makes the subsequent comparison more sensible. Second, for a fairly large neighborhood, the inverse probability weighting in (2) corrects for covariate variation inside the neighborhood, which is particularly useful for applications in observational studies.
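To make the contrast concrete, one plausible form of the two estimators, consistent with the descriptions above but written in notation of our own choosing, is:

```latex
% Assumed notation: (X_(i), A_(i), Y_(i)) are the data reordered by distance to x,
% pi_a(.) is the allocation probability of arm a, and N_a(x) is the set of the
% k nearest neighbors of x within arm a.
\[
  \underbrace{\hat{\mu}_a(x)
    = \frac{1}{k}\sum_{i=1}^{k}
      \frac{\mathbf{1}\{A_{(i)} = a\}}{\pi_a(X_{(i)})}\,Y_{(i)}}_{\text{common neighborhood, in the spirit of (2)}}
  \qquad\qquad
  \underbrace{\tilde{\mu}_a(x)
    = \frac{1}{k}\sum_{i \in N_a(x)} Y_i}_{\text{arm-specific neighborhoods, in the spirit of (5)}}
\]
```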

2.2 Theoretical Properties

In machine learning, a classification rule is called universally consistent if its expected error probability approaches the Bayes error probability, in probability or almost surely, for any distribution underlying the data (Devroye et al. 1996). The k-nearest neighbor classifier was the first rule proved to possess such universal consistency (Stone 1977). Here, we extend the concept of universal consistency to optimal treatment regimes.

Definition 2.1.

Given a sequence of data, a regime is universally (weakly) consistent if its value converges to the Bayes value in probability for any probability measure underlying the data, and universally strongly consistent if the convergence holds almost surely for any such probability measure.

Denote the marginal distribution of the covariates by a probability measure, and recall that a closed ball is the set of points within a given radius of a center. A point belongs to the support of the covariate distribution if every closed ball around it has positive probability (Cover and Hart 1967).

The analysis of universal consistency requires some assumptions.

  1. (A1) The treatment allocation probabilities are bounded away from zero by a positive constant, uniformly over treatments and covariate values;

  2. (A2) The outcome satisfies a suitable moment condition;

  3. (A3) Distance ties occur with probability zero.

These assumptions are quite weak. Assumption (A1) is just the positivity assumption, and can be guaranteed by design. Assumption (A2) is natural, and is automatically satisfied for bounded outcomes. Assumption (A3) avoids the messy problem of distance ties. When (A3) does not hold, we may append a small uniform random variable, independent of the data, to the covariate vector; the augmented covariate vector then satisfies Assumption (A3), and we may perform the k-nearest neighbor method on the modified data. Because of the independence of the added variable, the conditional mean outcome is unchanged. Hence Assumption (A3) is reasonable, but at the cost of potentially compromising performance by introducing an artificial covariate into the regime. When the added variable is very small, this amounts to breaking ties randomly. The difference from Stone's tie-breaking method is that Stone's method takes into account all subjects whose distance to the target point equals that of the kth nearest neighbor, while random tie-breaking picks only one of them. The remark following the proof of Theorem 2.2 in Appendix A shows that Stone's tie-breaking estimate in (4) is asymptotically better than random tie-breaking.
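A small illustration of the tie-avoidance device described above: appending a tiny independent uniform covariate so that distance ties occur with probability zero. The jitter scale used here is an arbitrary illustrative choice.

```python
import numpy as np

def break_ties_by_jitter(X, scale=1e-6, seed=0):
    """Append a small independent uniform covariate so distance ties have probability zero."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, scale, size=(X.shape[0], 1))   # independent of the covariates and outcomes
    return np.hstack([X, u])                            # run the k-NN method on this augmented matrix
```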

The following theorem shows universal consistency of the causal nearest neighbor regime; its proof is provided in Appendix A.

Theorem 2.2.

For any distribution of the data satisfying Assumptions (A1)–(A3),

(i) the causal k-nearest neighbor regime in (3) is universally weakly consistent if k → ∞ and k/n → 0 as the sample size n → ∞;

(ii) the causal k-nearest neighbor regime in (3) is universally strongly consistent under an additional condition on how fast k grows with n.

If Assumption (A2) is tightened to bounded outcomes, the regime in (3) is universally strongly consistent under a modified growth condition on k.

The next natural question is whether the value of the causal k-nearest neighbor regime tends to the Bayes value at a specified rate. To establish the rate of convergence, we require stronger assumptions.

  1. (A1′) The treatment allocation probabilities are positive everywhere and, for some positive constant, bounded away from zero uniformly over treatments and covariate values;

  2. (A2′) The outcomes are uniformly bounded by a constant;

  3. (A3′) Distance ties occur with probability zero, and the support of the covariate distribution is compact with finite diameter;

  4. (A4′) The conditional mean outcome functions are Lipschitz continuous: there exists a constant such that, for every treatment, the difference in conditional mean outcomes at any two covariate points is bounded by that constant times the distance between the points.

Assumption (A1′) implies that randomization is not extremely skewed with respect to the covariates. Assumptions (A2′)–(A4′) are standard in the literature on nearest neighbor rules (Györfi et al. 2002). The following theorem gives the convergence rate of causal k-nearest neighbor regimes; it is proved in Appendix B.

Theorem 2.3.

For any distribution of the data satisfying Assumptions (A1′)–(A4′), there exists a sequence of neighborhood sizes k, with k → ∞ and k/n → 0, such that the value of the causal k-nearest neighbor regime converges to the Bayes value at a rate determined by the covariate dimension: the rate is fastest when the dimension is 1, can be made arbitrarily close to the one-dimensional rate when the dimension is 2, and deteriorates as the dimension grows beyond 2.

The rate of convergence is fastest when the dimensionality is 1 or 2; as the dimensionality increases, the convergence rate slows significantly. As with nearest neighbor rules in classification and regression, the causal k-nearest neighbor regime therefore suffers from the curse of dimensionality.

2.3 Adaptive rules

The causal k-nearest neighbor regime is consistent, as shown above. However, it is well known that the curse of dimensionality can severely hurt nearest neighbor rules in finite samples. The rate of convergence in Theorem 2.3 is slow when the dimensionality is high, so appropriate variable selection may improve performance. In this section, we propose an adaptive causal k-nearest neighbor method to estimate the optimal treatment regime, and to perform metric selection and variable selection simultaneously.

Assign each covariate a nonnegative scaling factor, and use the correspondingly weighted distance metric to compute distances between covariate vectors. The scaling factor determines the influence of a covariate on the metric: setting it to zero is equivalent to discarding that covariate, and we intend to set a large scaling factor if the covariate is important for treatment selection.

We apply the following univariate method to evaluate the importance of an individual covariate. It is related to a test statistic comparing two treatment regimes (Murphy 2005). One regime involves only the covariate under consideration; the other, called the non-informative regime, assigns all patients to the treatment with the largest estimated overall mean potential outcome. For a specific regime, each subject's treatment assignment under that regime is determined from the subject's covariates. The value function associated with the regime is estimated by

(6)

For the two regimes being compared, a consistent estimator of the variance of the difference between their estimated values is

(7)

The resulting standardized statistic asymptotically has a standard normal distribution under the null hypothesis that the two regimes have equal values (Murphy 2005). When the statistic is greater than zero, the covariate-specific regime is considered better than the non-informative regime; otherwise the non-informative regime is better. The statistic therefore reflects the importance of the corresponding covariate for the optimal treatment regime. We estimate the covariate-specific regime by the causal k-nearest neighbor method using only that covariate.
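A hedged sketch of how such a comparison statistic is commonly formed; the value estimator, the variance in the denominator, and the symbols below are our own notation and may differ in detail from (6) and (7).

```latex
% Assumed notation: subjects i = 1,...,n with covariates X_i, received treatment A_i,
% outcome Y_i and allocation probability pi_{A_i}(X_i); d_j is the regime built from
% covariate j alone and d_0 is the non-informative regime.
\[
  \hat{V}(d) = \frac{1}{n}\sum_{i=1}^{n}
      \frac{\mathbf{1}\{A_i = d(X_i)\}}{\pi_{A_i}(X_i)}\,Y_i ,
  \qquad
  z_j = \frac{\hat{V}(d_j) - \hat{V}(d_0)}
             {\sqrt{\widehat{\operatorname{Var}}\bigl(\hat{V}(d_j) - \hat{V}(d_0)\bigr)}} .
\]
```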

We set each scaling factor to the positive part of the difference between the corresponding test statistic and a predefined threshold parameter. The adaptive causal k-nearest neighbor regime then follows the same procedure as the causal k-nearest neighbor regime described above, except that it uses the adaptively weighted distance metric to compute distances between covariate vectors. When the threshold is very large, all scaling factors are zero and the adaptive regime degenerates to a non-informative regime. On the other hand, when the threshold is very small, all scaling factors are almost identical and the adaptive regime is equivalent to the causal k-nearest neighbor one. Figure 1 illustrates the effect of the threshold on the construction of the scaling factors.

Figure 1: Three examples of the construction of the scaling factors with different choices of the threshold. Suppose there are five covariates with test statistics from the individual comparison tests. In example (a), the threshold is greater than any test statistic, so all scaling factors are zero and the adaptive causal k-nearest neighbor regime degenerates to a non-informative regime. As the threshold decreases, some scaling factors become positive. In example (b), the threshold lies between the test statistics of the third and fourth covariates: the fourth and fifth scaling factors are positive while the first three remain zero, which is equivalent to discarding the first three covariates from the analysis. As the threshold decreases further, in example (c) it is smaller than any test statistic, so all scaling factors are nonzero and the adaptive causal k-nearest neighbor regime involves all five covariates; however, the fifth covariate contributes the most to the regime and the first the least.

Here is a summary of the adaptive causal nearest neighbor procedure:

1) Normalize each covariate to a similar scale.

2) Calculate the test statistic for each covariate and the corresponding scaling factor, defined as the positive part of the test statistic minus the threshold.

3) Use the weighted metric to estimate a causal k-nearest neighbor regime.

The scaling in the first step prevents covariates with greater numeric ranges from dominating those with smaller numeric ranges. We recommend linearly scaling each covariate to the range [0, 1] or [-1, 1] (Hsu et al. 2003). For the adaptive causal k-nearest neighbor regime, there are two tuning parameters, the neighborhood size k and the threshold; we tune them using 10-fold cross-validation.
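A small, self-contained sketch of the adaptive weighting and the weighted metric; the thresholding form, the squared-distance metric, and the names lam, adaptive_metric_weights and weighted_distance are our own illustrative choices.

```python
import numpy as np

def adaptive_metric_weights(z, lam):
    """Scaling factors: positive part of (importance statistic minus threshold lam)."""
    return np.maximum(np.asarray(z, dtype=float) - lam, 0.0)

def weighted_distance(x, x0, w):
    """One plausible form of the adaptively weighted distance between covariate vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    return np.sqrt(np.sum(w * diff ** 2))

# Example with five covariates: a large threshold zeroes every weight (non-informative
# regime); a small threshold keeps all covariates, with the most important weighted most.
z = [0.3, 0.8, 1.1, 1.9, 2.6]
print(adaptive_metric_weights(z, lam=2.0))   # only the fifth covariate survives
print(adaptive_metric_weights(z, lam=0.1))   # all five covariates contribute
```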

3 Simulation studies

We performed extensive simulations to evaluate the empirical performance of the causal k-nearest neighbor and adaptive causal k-nearest neighbor methods.

We first considered simulations with two treatment arms. In each simulation we generated vectors of clinical covariates in which the first two covariates were independent Bernoulli random variables and the remaining covariates were independent standard normal random variables. The treatment was generated independently of the covariates, with each arm assigned with equal probability. To mimic a well balanced trial, we generated the data so that the two arms contained equal numbers of patients. The response was normally distributed with standard deviation 1 and a mean function depending on the covariates and the treatment. We considered three scenarios with different choices of the mean function:

  1. Scenario 1: the mean function yields an optimal regime whose decision boundary is a linear combination of a binary covariate and a continuous covariate.

  2. Scenario 2: the mean function yields a nonlinear decision boundary.

  3. Scenario 3: the mean function yields a highly nonlinear decision boundary.

We ran the simulations for two different dimensions of covariates: low dimensional data and moderate dimensional data. On the low dimensional data, we compared the empirical performance of the following seven methods: (1) l1-penalized least squares proposed by Qian and Murphy (2011); (2) Q-learning using random forests as described in Taylor et al. (2015); (3) residual weighted learning proposed in Zhou et al. (2017) using the linear kernel; (4) residual weighted learning using the Gaussian kernel; (5) augmented inverse probability weighted estimation proposed by Zhang et al. (2012a); (6) the causal k-nearest neighbor method; and (7) the proposed adaptive causal k-nearest neighbor method. When the dimension was moderate, the two residual weighted learning methods were replaced with their variable selection counterparts (Zhou et al. 2017).

In the simulation studies, l1-penalized least squares estimated a linear model to approximate the conditional mean outcomes, and used the least absolute shrinkage and selection operator to carry out variable selection; the obtained regime assigned the treatment arm with the larger estimated conditional mean. Q-learning using random forests is nonparametric: the conditional mean outcomes were approximated by random forests with the covariates and treatment as inputs, and the number of trees was set to 1000 as suggested in Taylor et al. (2015). Residual weighted learning is an improved method for outcome weighted learning (Zhao et al. 2012). Outcome weighted learning views treatment selection as a weighted classification problem and treats the original outcomes as weights; residual weighted learning is similar except that the outcomes are replaced with residuals from a regression of the outcome on the covariates, excluding the treatment assignment. Residual weighted learning with the linear kernel estimates linear treatment regimes, while the Gaussian kernel version can detect nonlinear regimes; the method involves non-convex programming, and hence its computational cost is high. For the augmented inverse probability weighted estimator, we first obtained the doubly robust version of the contrast function through linear regression, then set the propensity score to 0.5 and searched for the optimal treatment regime using a classification and regression tree.

We applied 10-fold cross-validation for parameter tuning. Five sample sizes were considered for each scenario, and we repeated the simulation 500 times. For comparison, we generated a large test set with 10,000 subjects to evaluate performance. The comparison criterion was the value function of the estimated optimal treatment regime on the test set, computed as the inverse-probability-weighted empirical value on the test data (Murphy 2005).
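A hedged sketch of evaluating an estimated regime on a test set; the normalized (Hajek-type) form shown here is one common variant and may not be the exact expression cited from Murphy (2005).

```python
import numpy as np

def empirical_value(d, X_test, A_test, Y_test, pi_test):
    """Weighted empirical value of regime d on an independent test set.

    d maps an (m, p) covariate matrix to an (m,) array of treatment assignments.
    """
    match = (A_test == d(X_test)).astype(float)   # subjects whose received treatment agrees with d
    w = match / pi_test                           # inverse-probability weights
    return np.sum(w * Y_test) / np.sum(w)         # normalized empirical value estimate
```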

Scenario 1 (Optimal value )
l1-PLS 2.01 (0.08) 2.04 (0.06) 2.06 (0.04) 2.08 (0.02) 2.08 (0.01)
Q-RF 1.89 (0.08) 1.96 (0.06) 2.00 (0.04) 2.02 (0.02) 2.03 (0.01)
RWL-Linear 1.96 (0.09) 2.01 (0.06) 2.05 (0.04) 2.07 (0.02) 2.08 (0.01)
RWL-Gaussian 1.97 (0.09) 2.00 (0.07) 2.03 (0.06) 2.06 (0.03) 2.08 (0.02)
AIPWE 1.92 (0.13) 1.96 (0.10) 2.00 (0.06) 2.02 (0.04) 2.03 (0.03)
CNN 1.89 (0.10) 1.95 (0.08) 1.99 (0.06) 2.02 (0.03) 2.04 (0.02)
ACNN 1.88 (0.13) 1.94 (0.12) 1.99 (0.08) 2.02 (0.05) 2.04 (0.03)
Scenario 2 (Optimal value )
l1-PLS 1.48 (0.08) 1.53 (0.09) 1.58 (0.09) 1.64 (0.06) 1.66 (0.03)
Q-RF 1.64 (0.10) 1.74 (0.08) 1.82 (0.05) 1.87 (0.03) 1.90 (0.02)
RWL-Linear 1.55 (0.08) 1.58 (0.07) 1.61 (0.05) 1.64 (0.04) 1.66 (0.03)
RWL-Gaussian 1.64 (0.11) 1.71 (0.11) 1.81 (0.08) 1.86 (0.05) 1.90 (0.02)
AIPWE 1.63 (0.15) 1.74 (0.13) 1.81 (0.09) 1.87 (0.05) 1.90 (0.03)
CNN 1.64 (0.11) 1.73 (0.09) 1.81 (0.06) 1.87 (0.03) 1.90 (0.02)
ACNN 1.65 (0.14) 1.76 (0.12) 1.84 (0.08) 1.89 (0.04) 1.92 (0.03)
Scenario 3 (Optimal value )
l1-PLS 1.88(0.03) 1.89(0.03) 1.89(0.03) 1.89(0.04) 1.90(0.03)
Q-RF 2.05 (0.08) 2.14 (0.06) 2.21 (0.04) 2.26 (0.02) 2.28 (0.01)
RWL-Linear 1.93 (0.05) 1.93 (0.05) 1.96 (0.06) 1.97 (0.06) 1.98 (0.06)
RWL-Gaussian 2.04 (0.09) 2.13 (0.08) 2.20 (0.06) 2.26 (0.04) 2.30 (0.02)
AIPWE 2.06 (0.13) 2.17 (0.11) 2.23 (0.05) 2.26 (0.03) 2.28 (0.02)
CNN 2.02 (0.08) 2.11 (0.06) 2.18 (0.05) 2.24 (0.03) 2.28 (0.02)
ACNN 2.09 (0.11) 2.19 (0.08) 2.26 (0.06) 2.31 (0.04) 2.33 (0.02)
  • l1-PLS, l1-penalized least squares; Q-RF, Q-learning using random forests; RWL-Linear, residual weighted learning with linear kernel; RWL-Gaussian, residual weighted learning with Gaussian kernel; AIPWE, augmented inverse probability weighted estimation; CNN, causal k-nearest neighbor; ACNN, adaptive causal k-nearest neighbor.

Table 1: Mean (standard deviation) of empirical value functions evaluated on the test set for Scenarios 1-3 when the dimension is low. Columns correspond to increasing sample sizes. The best value function for each scenario and sample size combination is in bold.

The simulation results for the low dimensional data are presented in Table 1. For reference, we also computed the value obtained by sending all subjects to a single arm and the optimal value in each scenario. For Scenario 1, the decision boundary of the optimal regime was a linear combination of a binary covariate and a continuous covariate. l1-penalized least squares performed very well since its model was correctly specified, and both residual weighted learning methods performed similarly to it, especially when the sample size was large. Our proposed causal k-nearest neighbor and adaptive causal k-nearest neighbor methods showed similar performance to Q-learning using random forests and augmented inverse probability weighted estimation, and when the sample size was large they were close to l1-penalized least squares and residual weighted learning. For Scenario 2, the decision boundary was nonlinear. l1-penalized least squares and residual weighted learning with the linear kernel both failed due to model misspecification. The adaptive causal k-nearest neighbor method yielded the best performance, while the causal k-nearest neighbor method showed performance similar to Q-learning using random forests, residual weighted learning with the Gaussian kernel and augmented inverse probability weighted estimation. For Scenario 3, the decision boundary was highly nonlinear. As in Scenario 2, our proposed adaptive causal k-nearest neighbor approach outperformed all other methods, and the causal k-nearest neighbor method performed similarly to the other nonlinear methods, including Q-learning using random forests, residual weighted learning with the Gaussian kernel and augmented inverse probability weighted estimation.

We now move to the moderate dimensional cases. The simulation results are shown in Table 2. In Scenario 1, l1-penalized least squares outperformed the other methods because of correct model specification and its built-in variable selection. Residual weighted learning methods yielded similar performance to l1-penalized least squares due to their variable selection mechanism. The causal k-nearest neighbor regime was not comparable with the others in this scenario because it lacks a variable selection procedure; it is well known that nearest neighbor rules deteriorate when irrelevant covariates are present in the data. The proposed adaptive causal k-nearest neighbor approach showed similar performance to Q-learning using random forests and augmented inverse probability weighted estimation, and when the sample size was large it was close to l1-penalized least squares and the residual weighted learning methods. Their good performance can be explained by variable selection: the adaptive causal k-nearest neighbor approach carries out variable selection through adaptive metric selection, while Q-learning using random forests and augmented inverse probability weighted estimation, as two tree-based methods, have a built-in variable selection mechanism (Breiman et al. 1984). In Scenarios 2 and 3, l1-penalized least squares and residual weighted learning with the linear kernel failed due to misspecification; again, the causal k-nearest neighbor method suffered from the lack of variable selection. Four nonparametric methods with variable selection, Q-learning using random forests, augmented inverse probability weighted estimation, residual weighted learning with the Gaussian kernel and the adaptive causal k-nearest neighbor approach, stood out; among them, our proposed adaptive causal k-nearest neighbor method ranked first in both scenarios.

Scenario 1 (Optimal value )
l1-PLS 1.91 (0.12) 2.00 (0.07) 2.04 (0.04) 2.06 (0.02) 2.08 (0.02)
Q-RF 1.83 (0.10) 1.91 (0.08) 1.97 (0.06) 2.01 (0.03) 2.04 (0.01)
RWL-VS-Linear 1.84 (0.12) 1.97 (0.08) 2.03 (0.05) 2.06 (0.03) 2.08 (0.01)
RWL-VS-Gaussian 1.82 (0.13) 1.92 (0.10) 2.02 (0.07) 2.06 (0.04) 2.07 (0.03)
AIPWE 1.80 (0.15) 1.90 (0.12) 1.97 (0.08) 2.01 (0.05) 2.03 (0.03)
CNN 1.79 (0.09) 1.82 (0.08) 1.85 (0.06) 1.89 (0.05) 1.91 (0.04)
ACNN 1.77 (0.12) 1.83 (0.13) 1.91 (0.12) 2.00 (0.07) 2.03 (0.04)
Scenario 2 (Optimal value )
l1-PLS 1.44(0.06) 1.44(0.06) 1.44(0.07) 1.45(0.06) 1.46(0.06)
Q-RF 1.48 (0.09) 1.54 (0.09) 1.68 (0.08) 1.81 (0.06) 1.87 (0.03)
RWL-VS-Linear 1.49 (0.07) 1.52 (0.07) 1.57 (0.07) 1.62 (0.05) 1.65 (0.04)
RWL-VS-Gaussian 1.51 (0.09) 1.61 (0.13) 1.77 (0.12) 1.87 (0.07) 1.91 (0.04)
AIPWE 1.48 (0.09) 1.55 (0.12) 1.71 (0.13) 1.82 (0.08) 1.88 (0.04)
CNN 1.49 (0.06) 1.52 (0.06) 1.56 (0.05) 1.60 (0.05) 1.65 (0.04)
ACNN 1.52 (0.11) 1.62 (0.15) 1.76 (0.13) 1.86 (0.07) 1.90 (0.04)
Scenario 3 (Optimal value )
l1-PLS 1.89 (0.02) 1.89 (0.02) 1.89 (0.02) 1.89 (0.02) 1.89 (0.02)
Q-RF 1.92 (0.03) 1.94 (0.04) 1.99 (0.05) 2.07 (0.06) 2.18 (0.05)
RWL-VS-Linear 1.90(0.03) 1.90(0.03) 1.91(0.04) 1.92(0.04) 1.93(0.05)
RWL-VS-Gaussian 1.94 (0.07) 2.02 (0.13) 2.20 (0.12) 2.30 (0.07) 2.32 (0.06)
AIPWE 1.92 (0.06) 2.00 (0.12) 2.15 (0.11) 2.24 (0.04) 2.27 (0.03)
CNN 1.92 (0.01) 1.93 (0.02) 1.94 (0.02) 1.96 (0.02) 1.98 (0.02)
ACNN 1.99 (0.11) 2.10 (0.12) 2.23 (0.09) 2.30 (0.04) 2.33 (0.02)
  • l1-PLS, l1-penalized least squares; Q-RF, Q-learning using random forests; RWL-VS-Linear, residual weighted learning with variable selection and linear kernel; RWL-VS-Gaussian, residual weighted learning with variable selection and Gaussian kernel; AIPWE, augmented inverse probability weighted estimation; CNN, causal k-nearest neighbor; ACNN, adaptive causal k-nearest neighbor.

Table 2: Mean (standard deviation) of empirical value functions evaluated on the test set for Scenarios 1-3 when the dimension is moderate. Columns correspond to increasing sample sizes. The best value function for each scenario and sample size combination is in bold.

Here, the covariates were independent. We also ran simulations to assess the performance of our proposed methods when the covariates were correlated; the results were similar to those for the independent cases presented above. Details are collected in Appendix D.

We then evaluated the performance of our proposed methods on data with more than two treatment arms (three, say). The simulation setup was similar to that with two treatment arms. We generated vectors of clinical covariates as before. The treatment was generated independently of the covariates, with each arm assigned with equal probability. The response was normally distributed with standard deviation 1 and a mean function depending on the covariates and the treatment. We considered two scenarios with different choices of the mean function:

  1. Scenario 4: the mean function yields a linear decision boundary.

  2. Scenario 5: the mean function yields a nonlinear decision boundary.

We compared the performance of the following four methods: (1) l1-penalized least squares proposed by Qian and Murphy (2011); (2) Q-learning using random forests as described in Taylor et al. (2015); (3) the proposed causal k-nearest neighbor method; and (4) the proposed adaptive causal k-nearest neighbor method. Residual weighted learning and augmented inverse probability weighted estimation have only been implemented for two treatment arms, and so are not included here. For each scenario, we considered four sample sizes and repeated the simulation 500 times. The independent test set had a sample size of 30,000.

The simulation results for the low dimensional cases are presented in Table 3. For Scenario 4, the decision boundary is linear. l1-penalized least squares produced the best performance because of correct model specification, while the nonparametric methods, Q-learning using random forests and the causal k-nearest neighbor and adaptive causal k-nearest neighbor methods, showed similar performance to one another. For Scenario 5, the decision boundary is nonlinear. l1-penalized least squares was not comparable with the nonparametric methods as its postulated model was misspecified. Our proposed adaptive causal k-nearest neighbor method produced the best performance, and the causal k-nearest neighbor method showed similar performance to Q-learning using random forests.

Scenario 4 (Optimal value )
l1-PLS 1.94(0.08) 1.99(0.06) 2.02(0.02) 2.03(0.02)
Q-RF 1.82 (0.06) 1.87 (0.05) 1.91 (0.03) 1.95 (0.02)
CNN 1.83 (0.08) 1.89 (0.06) 1.93 (0.04) 1.96 (0.03)
ACNN 1.81 (0.09) 1.87 (0.09) 1.91 (0.18) 1.97 (0.04)
Scenario 5 (Optimal value )
l1-PLS 1.82(0.12) 1.90(0.11) 1.98(0.08) 2.02(0.05)
Q-RF 1.88 (0.07) 1.97 (0.05) 2.04 (0.03) 2.09 (0.02)
CNN 1.85 (0.09) 1.94 (0.06) 2.02 (0.04) 2.07 (0.02)
ACNN 1.89 (0.10) 1.99 (0.08) 2.07 (0.06) 2.12 (0.03)
  • l1-PLS, l1-penalized least squares; Q-RF, Q-learning using random forests; CNN, causal k-nearest neighbor; ACNN, adaptive causal k-nearest neighbor.

Table 3: Mean (standard deviation) of empirical value functions evaluated on the test set for Scenarios 4 and 5 when the dimension is low. Columns correspond to increasing sample sizes. The best value function for each scenario and sample size combination is in bold.

We then increased the dimensionality to 25. The results are presented in Table 4. Again, l1-penalized least squares produced the best performance in Scenario 4, and the adaptive causal k-nearest neighbor method in Scenario 5. The causal k-nearest neighbor approach was not comparable with the others in either scenario due to the curse of dimensionality.

Scenario 4 (Optimal value )
l1-PLS 1.85(0.09) 1.94(0.07) 2.00(0.04) 2.02(0.02)
Q-RF 1.77 (0.06) 1.83 (0.06) 1.90 (0.04) 1.95 (0.03)
CNN 1.72 (0.04) 1.76 (0.04) 1.78 (0.04) 1.82 (0.04)
ACNN 1.72 (0.08) 1.77 (0.09) 1.83 (0.09) 1.91 (0.07)
Scenario 5 (Optimal value )
l1-PLS 1.72(0.08) 1.79(0.09) 1.90(0.08) 1.99(0.05)
Q-RF 1.72 (0.08) 1.82 (0.08) 1.92 (0.05) 1.99 (0.04)
CNN 1.66 (0.04) 1.69 (0.03) 1.73 (0.03) 1.76 (0.03)
ACNN 1.74 (0.12) 1.87 (0.13) 2.01 (0.10) 2.11 (0.04)
  • l1-PLS, l1-penalized least squares; Q-RF, Q-learning using random forests; CNN, causal k-nearest neighbor; ACNN, adaptive causal k-nearest neighbor.

Table 4: Mean (standard deviation) of empirical value functions evaluated on the test set for Scenarios 4 and 5 when the dimension is moderate (25). Columns correspond to increasing sample sizes. The best value function for each scenario and sample size combination is in bold.

When the dimension is low, the causal k-nearest neighbor regime produced performance comparable to the other alternatives. The adaptive selection of the distance metric enhances the causal k-nearest neighbor regime: in the simulations, the adaptive causal k-nearest neighbor method produced results at least as good as the causal k-nearest neighbor regime. As explained earlier, when the threshold tuning parameter is very small, the adaptive approach is almost equivalent to the causal k-nearest neighbor approach. Considering the superior performance of the adaptive causal k-nearest neighbor method over the causal k-nearest neighbor approach, especially when the dimensionality is large, we recommend the adaptive causal k-nearest neighbor method for general practical use.

4 Data analysis

We applied the proposed methods to data from a chronic depression clinical trial (Keller et al. 2000). Patients with non-psychotic chronic major depressive disorder were randomized in a 1:1:1 ratio to Nefazodone, the cognitive behavioral-analysis system of psychotherapy, or the combination of the two therapies. The primary efficacy outcome was the score on the 24-item Hamilton rating scale for depression; a lower score is desirable. We considered 50 pre-treatment covariates as in Zhao et al. (2012), and excluded patients with missing covariate values from the analyses. The data used here consisted of 647 patients, of whom 216, 220, and 211 were assigned to the three arms, respectively. Each clinical covariate was linearly scaled to a common range, as described in Hsu et al. (2003).

Since the trial had three treatment arms, we compared the performance of the adaptive causal k-nearest neighbor regime with l1-penalized least squares and Q-learning using random forests; residual weighted learning and augmented inverse probability weighted estimation can only handle two treatments. In the simulation studies, the adaptive regime outperformed the causal k-nearest neighbor one, especially when the dimension of the covariates was large, so we considered only the adaptive regime in this section. The outcomes used in the analyses were the negatives of the scores on the 24-item Hamilton rating scale for depression. We used a nested 10-fold cross-validation procedure for an unbiased comparison (Ambroise and McLachlan 2002), and to obtain reliable estimates we repeated the nested cross-validation procedure 100 times with different fold partitions.

The mean value functions over the 100 repeats and the standard deviations are presented in Table 5. The adaptive causal k-nearest neighbor regime achieved performance similar to l1-penalized least squares and Q-learning using random forests. All methods assigned the combination treatment to almost every patient. The original analysis in Keller et al. (2000) indicated that the combination treatment is significantly more efficacious than either treatment alone; our analysis confirms this finding.

We also performed pairwise comparisons between treatment arms, including in the analysis the two residual weighted learning methods with variable selection and augmented inverse probability weighted estimation. The results are also presented in Table 5. For the comparison between Nefazodone and the cognitive behavioral-analysis system of psychotherapy, the adaptive causal k-nearest neighbor regime was slightly better than the other methods except residual weighted learning with the linear kernel. For the comparison between Nefazodone and the combination therapy, all methods produced similar performance. For the comparison between the cognitive behavioral-analysis system of psychotherapy and the combination therapy, the adaptive causal k-nearest neighbor regime did not perform comparably to the other methods. We carried out the significance test described in Section 2.3 to compare the regimes from the adaptive causal k-nearest neighbor method and residual weighted learning with the linear kernel, and the difference between them was not statistically significant.

NFZ vs CBASP vs COMB   NFZ vs CBASP   NFZ vs COMB   CBASP vs COMB
l1-PLS 11.19 (0.15) 16.30 (0.39) 11.20 (0.16) 10.95 (0.09)
Q-RF 11.11 (0.13) 16.27 (0.44) 11.05 (0.18) 10.93 (0.09)
RWL-VS-Linear 15.45 (0.37) 11.09 (0.29) 10.88 (0.05)
RWL-VS-Gaussian 16.29 (0.44) 11.33 (0.25) 11.07 (0.28)
AIPWE 16.45 (0.41) 10.97 (0.15) 10.96 (0.14)
ACNN 11.18 (0.27) 15.70 (0.39) 11.03 (0.27) 11.41 (0.28)
  • l1-PLS, l1-penalized least squares (Qian and Murphy 2011); Q-RF, Q-learning using random forests (Taylor et al. 2015); RWL-VS-Linear, residual weighted learning with variable selection and linear kernel (Zhou et al. 2017); RWL-VS-Gaussian, residual weighted learning with variable selection and Gaussian kernel (Zhou et al. 2017); AIPWE, augmented inverse probability weighted estimation (Zhang et al. 2012a); ACNN, adaptive causal k-nearest neighbor. NFZ, Nefazodone; CBASP, cognitive behavioral-analysis system of psychotherapy; COMB, combination of Nefazodone and cognitive behavioral-analysis system of psychotherapy.

Table 5: Mean score (standard deviation) on the Hamilton rating scale for depression from the cross-validation procedure using different methods. A lower score is better.

The adaptive causal k-nearest neighbor regime showed statistically equivalent performance to the other methods on the chronic depression clinical trial data.

5 Discussion

In this article, we have proposed a simple causal k-nearest neighbor method for optimal treatment regimes, and developed an adaptive method to determine the distance metric. As shown in the simulation studies and the data analysis, the adaptive method can rival and improve upon more sophisticated methods, especially when the decision boundary is nonlinear.

Variable selection plays a critical role in identifying the optimal treatment regime when the dimension of the covariates is large, as shown in the simulation studies. l1-penalized least squares uses the least absolute shrinkage and selection operator for variable selection. Residual weighted learning performs variable selection through the elastic-net penalty for linear kernels and through covariate scaling for Gaussian kernels (Zhou et al. 2017). As a tree-based method, augmented inverse probability weighted estimation is equipped with a built-in variable selection mechanism (Breiman et al. 1984). Our proposed adaptive causal k-nearest neighbor method applies an adaptive distance metric to perform variable selection. Recently, several researchers have highlighted the importance of variable selection for optimal treatment regimes (Gunter et al. 2011; Zhou et al. 2017). Variable selection in optimal treatment regimes has its own characteristics. There are two different types of covariates related to outcomes: predictive and prescriptive covariates. Predictive covariates are useful for predicting outcomes, while prescriptive covariates are used to prescribe optimal treatment regimes (Gunter et al. 2011). Athey and Imbens (2016) discussed several ways of splitting on prescriptive rather than predictive covariates in causal trees. The variable selection in the adaptive causal k-nearest neighbor regime identifies prescriptive covariates through tuning with the additional threshold parameter. As pointed out by an anonymous reviewer, in practice the number of predictive covariates may be much larger than the number of prescriptive covariates, so it is both important and challenging to carry out variable selection for optimal treatment regimes.

The causal k-nearest neighbor methods are simple and fast; they possess nice theoretical properties; as nonparametric methods, they are free of model specification; they naturally work with multiple-arm trials; and the variable selection in the adaptive causal k-nearest neighbor regime identifies prescriptive covariates to further improve finite sample performance.

Acknowledgement

This work was sponsored by the National Cancer Institute. We are grateful to the editors and the reviewers for their insightful comments, which have led to important improvements in this paper.

APPENDIX

We prove Theorems 2.2 and 2.3 of the main paper in Appendices A and B. The proofs are based on theoretical results for nearest neighbor rules in regression. For completeness, we collect the theorems and lemmas needed in the proofs in Appendix C. We present additional simulation results in Appendix D.

Appendix A Proof of Theorem 2.2

The following lemma shows that consistency of the arm-specific estimators guarantees consistency of the resulting rule.

Lemma A.1.

The causal k-nearest neighbor rule in (3) of the main paper satisfies the following bound for any distribution of the data:

Proof of Lemma A.1.

Note that the value function of any rule ,

Thus, by fixing , we have

where and , and the expectation is with respect to for . By the construction of , we have

The desired result follows by taking expectation over on both sides. ∎

Now it is sufficient to prove, for any ,

in probability or almost surely, as n → ∞. We start from a simpler k-nearest neighbor rule, for each treatment arm,

(8)

The relationship between and is that

By the law of large numbers, the denominator

as n → ∞. Thus it is now sufficient to prove, for any fixed covariate point,

in probability or almost surely, as n → ∞.

From now on, we use this simpler estimator. For weak consistency, we will prove a slightly stronger result. We rewrite