1 Introduction
Statistical methods are increasingly popular for optimizing drug doses in clinical trials. A typical dose-finding study is conducted as a double-blind randomized trial in which each patient is randomly assigned a dose among a few safe dose levels for a candidate drug (Chevret, 2006). At the end of the trial, the single dose leading to the best average treatment effect is recommended for future patients. However, different patients might respond differently to the same dose of a drug because of differences in physical conditions, genetic factors, environmental factors and medication histories. Taking these differences into consideration when making dose decisions is essential for achieving better treatment results.
Recently, there has been growing interest in personalized treatment optimization. However, most existing methods are restricted to a finite number of treatment options. In particular, much of the work focuses on finding individualized treatment rules, which output a treatment option from a finite set of available treatments based on patient-level information. Such treatment rules can thus be used to guide treatment decisions aiming to maximize the expected clinical outcome of interest, also known as the expected reward or value. An optimal treatment rule is defined to be the one that maximizes the value function in the population among a class of treatment rules. Various statistical learning methods have been proposed to infer optimal individualized treatment rules using data from randomized trials or observational studies. Existing methods include model-based approaches, such as Q-learning (Watkins and Dayan, 1992; Zhao et al., 2009; Qian and Murphy, 2011) and A-learning (Murphy, 2003; Robins, 2004; Henderson et al., 2010; Liang et al., 2018; Shi et al., 2018), direct value search methods that maximize a nonparametric estimator of the value function (Zhang et al., 2012a; Zhang et al., 2012b; Zhao et al., 2012; Fan et al., 2017; Shi et al., 2019; Zhou et al., 2019), and other semiparametric methods (Song et al., 2017; Kang et al., 2018; Xiao et al., 2019).
The above methods, however, are not directly applicable when the number of possible treatment levels is large. Let us illustrate with warfarin, an anticoagulant drug commonly used for the prevention of thrombosis and thromboembolism. Establishing the appropriate dosage for warfarin is known to be a difficult problem because the optimal dosage can vary by a factor of 10 among patients, from 10 mg to 100 mg per week (Consortium, 2009). Incorrect dosages contribute largely to the adverse effects of warfarin usage. Underdosing will fail to alleviate symptoms in patients and overdosing will lead to catastrophic bleeding. In this case, an individualized dose rule, where a dose level is suggested within a continuous safe dose range according to each individual's physical conditions, would be better at tailoring to patient heterogeneity in drug response.
Several methods have been proposed for finding optimal individualized dose rules. One way of extending existing methods to the continuous dose case is to discretize the dose levels. Laber and Zhao (2015) proposed a tree-based method and turned the problem into a classification problem by dividing patients into subgroups and assigning a single dose to each subgroup. Chen et al. (2018) extended the outcome weighted learning method (Zhao et al., 2012) from binary treatment settings to ordinal treatment settings. However, in cases where patient responses are sensitive to dose changes, a discretized dose rule with a small number of levels will fail to provide dose recommendations leading to optimal clinical results. On the other hand, a discretized dose rule with a large number of levels may leave few observations within each subgroup, and thus may be at risk of overfitting.
Alternatively, Rich et al. (2014) extended the Q-learning method by modeling the interactions between the dose level and covariates with both linear and quadratic terms in doses. However, such a parametric approach is sensitive to model misspecification and the estimated individualized dose rule might be far away from the true optimal dose rule. In addition, it cannot be guaranteed that the estimated optimal dose falls into the safe dose range. More recently, Chen et al. (2016) extended the outcome weighted learning method proposed by Zhao et al. (2012) and transformed the dose-finding problem into a weighted regression with individual rewards as weights. The optimal dose rule is then obtained by optimizing a nonconvex loss function. This method is robust to model misspecification and has appealing computational properties; however, statistical inference for the estimated dose rule is challenging.
In this article we propose a kernel assisted learning method to infer the optimal individualized dose rule in a manner which enables statistical inference. Our proposed method can be viewed as a direct value search method. Specifically, we first estimate the value function with a kernel-based estimator. Then we search for the optimal individualized dose rule within a prespecified class of rules where the suggested doses always lie in the safe dose range. The proposed method is robust to model misspecification and is applicable to data from both randomized trials and observational studies. We establish the consistency and asymptotic normality of the estimated parameters in the obtained optimal dose rule. In particular, the asymptotic covariance of the estimators is derived based on nontrivial calculations of the expectation of a U-statistic.
The remainder of the article is organized as follows. In Section 2, we present the problem setting and our proposed method. The theoretical results of the estimated parameters are established in Section 3. In Section 4, we demonstrate the empirical performance of the proposed method via simulations. In Section 5, the proposed method is further illustrated with an application to a warfarin study. Some discussions and conclusions are given in Section 6. Proofs of the theoretical results are provided in the appendix.
2 Method
2.1 Problem Setting
The observed data consist of $n$ independent and identically distributed observations $\{(X_i, A_i, Y_i)\}_{i=1}^{n}$, where $X_i \in \mathcal{X}$ is a $d$-dimensional vector of covariates for the $i$th patient, $A_i \in \mathcal{A}$ is the dose assigned to the patient with $\mathcal{A}$ being the safe dose range, and $Y_i$ is the outcome of interest. Without loss of generality, we assume that larger $Y$ means better outcome. Let $f$ denote an individualized dose rule, which is a deterministic mapping function from $\mathcal{X}$ to $\mathcal{A}$. To define the value function of an individualized dose rule, we use the potential outcome framework (Rubin, 1978). Specifically, let $Y^*(a)$ be the potential outcome that would be observed when a dose level $a \in \mathcal{A}$ is given. Define the value function as the expected potential outcome in the population if everyone follows the dose rule $f$, i.e. $V(f) = E[Y^*\{f(X)\}]$. The optimal individualized dose rule is defined as $f^{opt} = \arg\max_f V(f)$.
In order to estimate the value function from the observed data, we need to make the following three assumptions similar to those adopted in the causal inference literature (Robins, 2004). First, we assume $Y = \int_{\mathcal{A}} \delta(a - A)\, Y^*(a)\, da$, where $\delta(\cdot)$ is the Dirac delta function. This corresponds to the stable unit treatment value assumption (also known as the consistency assumption). It assumes that the observed outcome is the same as the potential outcome under the dose actually given to the patient. This assumption also implies that there is no interference among patients. Second, we assume that the potential outcomes $\{Y^*(a): a \in \mathcal{A}\}$ are conditionally independent of $A$ given $X$, which is also known as the no unmeasured confounders assumption. Third, we assume that there exists a $c > 0$ such that $p(a \mid x) \ge c$ for all $a \in \mathcal{A}$ and $x \in \mathcal{X}$, where $p(a \mid x)$ is the conditional density of $A = a$ given $X = x$. This is a generalization of the positivity assumption for continuous dosing. Under these assumptions, we can show that $V(f)$ can be estimated with the observed data:
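The displayed chain of equalities is not preserved in this copy. A sketch of its standard form is given below, in assumed notation: $V(f)$ for the value of a dose rule $f$, $Y^*(a)$ for the potential outcome at dose $a$, and $(X, A, Y)$ for covariates, assigned dose and observed outcome. It is written to match the step-by-step justification that follows and should be read as a reconstruction, not as the authors' exact display.

```latex
\begin{align*}
V(f) &= E\big[\,Y^*\{f(X)\}\,\big] \\
     &= E_X\big[\,E\{\,Y^*(f(X)) \mid X\,\}\,\big] \\
     &= E_X\big[\,E\{\,Y^*(f(X)) \mid X,\ A = f(X)\,\}\,\big] \\
     &= E_X\big[\,E\{\,Y \mid X,\ A = f(X)\,\}\,\big].
\end{align*}
```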
The second equation above is based on the basic property of conditional densities. The third equation above is valid because of the no unmeasured confounder assumption. The fourth equation is based on the consistency assumption. The positivity assumption ensures that the right side of the last equation can be estimated empirically. In the next section, we will propose a consistent estimator for based on kernel smoothing.
2.2 Method
To estimate the optimal individualized dose rule (IDR), we first estimate with a kernel-based estimator and then estimate by directly maximizing the estimated value function . We search for the optimal individualized dose rule within a class of dose rules of the form: , where , and is a predefined link function to ensure that the suggested dosage is within the safe dose range. Thus
Notice that is an estimator of the optimal IDR within : where . If the true optimal IDR lies in , then the proposed . To see this more clearly, we illustrate with a toy example. If the true model for takes the form: , where is an unspecified baseline function, is a nonnegative function and is a unimodal function which is maximized at 0, then is maximized at dose level for patients with covariates . Thus, the true optimal individualized dose rule is:
The last equation above is true because is maximized at for each . If a unique maximizer of exists, then
Therefore, . Notice that if, then . However, is still of interest as long as the form of is flexible enough, because it maximizes the value function among this set of treatment rules. can be estimated using , and the optimal IDR within can be thus estimated with . Notice that we do not need any model assumption on the form of the conditional expectation to apply this method.
Next, we propose a kernel based estimator for the value function. Let
where and is the marginal density of . Thus, . The function is estimated using the Nadaraya-Watson method, given by:
where is a univariate kernel function and is a dimensional kernel function. Here, and are bandwidths that go to 0 as . Note that for simplicity of notation, we use the same bandwidth for all dimensions of here. In practice, we can use different bandwidths for different dimensions of to increase the efficiency of the estimation. Moreover, the marginal density of is estimated by . The estimated value function can thus be written as:
Then is estimated with , where is a compact subset of containing .
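As a rough illustration of the estimator described in this section, the following sketch computes a Nadaraya-Watson estimate of the mean outcome at the dose suggested by a rule and averages it over the observed covariates. Several pieces are assumptions made only for the example: a single covariate, Gaussian kernels, a logistic link mapping into the dose range (0, 1), averaging over observed covariates instead of the integral against an estimated density, and the helper names `dose_rule`, `nw_outcome`, `estimated_value` and `toy_data`.

```python
import math
import random


def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def dose_rule(x, beta):
    """Hypothetical rule: logistic link maps beta0 + beta1 * x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))


def nw_outcome(x, a, data, h_x, h_a):
    """Nadaraya-Watson estimate of E[Y | X = x, A = a] with a product kernel."""
    num = 0.0
    den = 0.0
    for xi, ai, yi in data:
        w = gaussian_kernel((x - xi) / h_x) * gaussian_kernel((a - ai) / h_a)
        num += w * yi
        den += w
    return num / den if den > 0.0 else float("nan")


def estimated_value(beta, data, h_x, h_a):
    """Average the smoothed outcome at the suggested dose over observed covariates."""
    vals = [nw_outcome(xi, dose_rule(xi, beta), data, h_x, h_a) for xi, _, _ in data]
    return sum(vals) / len(vals)


def toy_data(n, rng):
    """Assumed toy model: the outcome is penalized by squared distance to the best dose."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        a = rng.random()  # randomized dose on (0, 1)
        best = dose_rule(x, (0.0, 1.0))
        y = 1.0 - 10.0 * (a - best) ** 2 + rng.gauss(0.0, 0.1)
        data.append((x, a, y))
    return data


if __name__ == "__main__":
    sample = toy_data(200, random.Random(0))
    print(estimated_value((0.0, 1.0), sample, 0.3, 0.1))  # rule used to generate the data
    print(estimated_value((3.0, 0.0), sample, 0.3, 0.1))  # nearly constant high dose
```

In this toy setup the rule that generated the data should receive a clearly higher estimated value than the constant high-dose rule; the bandwidths 0.3 and 0.1 are arbitrary illustrative choices.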
2.3 Computational Details
To implement the proposed method, the R function optimr() is used for optimization of the objective function. The integral in is estimated by taking the average over grid points in the covariate space. In our implementation, we chose . In order to find the global maximizer of , we start the optimization from different initial points and choose the solution that leads to the maximal objective function value. Denote the maximizer as . When there is only one continuous covariate included, following the theoretical rate of the bandwidth parameters, the bandwidths are chosen as , , where and are constants between and .
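The multi-start strategy described above can be sketched as follows. This is not the authors' R implementation: the local optimizer here is a simple derivative-free compass search used as a stand-in for optimr(), and the function names and the bimodal toy objective are hypothetical.

```python
import math


def compass_search(objective, start, step=0.5, tol=1e-6):
    """Derivative-free local maximization: try +/- step along each coordinate,
    move whenever the objective improves, and halve the step otherwise."""
    point = list(start)
    best = objective(point)
    while step > tol:
        improved = False
        for j in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                trial[j] += delta
                value = objective(trial)
                if value > best:
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return point, best


def multi_start_maximize(objective, starts):
    """Run the local search from every start and keep the best local maximum."""
    results = [compass_search(objective, s) for s in starts]
    return max(results, key=lambda r: r[1])


def bimodal(b):
    """Toy objective with a global maximum near (2, 0) and a local one near (-2, 0)."""
    return (math.exp(-(b[0] - 2.0) ** 2 - b[1] ** 2)
            + 0.5 * math.exp(-(b[0] + 2.0) ** 2 - b[1] ** 2))


if __name__ == "__main__":
    starts = [(-2.5, 0.0), (2.5, 0.0), (-1.0, 1.0)]
    print(multi_start_maximize(bimodal, starts))
```

A single start at (-2.5, 0) would stop at the inferior local maximum, which is why several initial points are compared on their final objective values.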
When the covariates consist of both continuous variables and categorical variables, the categorical variables are stratified for estimation of the value function. Specifically, assume that
, where is a dimensional vector of continuous variables and is a dimensional vector of categorical variables. The form of then becomes: where , ,
.
The R code for the proposed method is available at: https://github.com/lz2379/Kernel_Assisted_Learning
3 Theoretical Results
In this section, we establish the asymptotic properties of . To prove these results, we need to make the following assumptions. In the following equations, , and denote the first, second and third derivatives of the function with respect to ; and .
Assumption 1.
1 Assume that , and exist for , where is a subset of . For as and constants , such that , , , , , , , , , , , , , .
Assumption 2.
The function has a unique maximizer .
Assumption 3.
The function is uniformly bounded. The joint density function of and , , is uniformly bounded away from 0. In addition, the first, second, third and fourth order derivatives of and with respect to and exist and are uniformly bounded almost everywhere.
Assumption 4.
The covariate $X$ has bounded first, second and third moments.
Assumption 5.
The function is thrice differentiable almost everywhere and the corresponding derivatives are bounded almost everywhere.
Assumption 1 can be satisfied by most commonly used kernel functions such as the Gaussian kernel function and all sufficiently smooth bounded kernel functions. Assumption 2 is an identifiability condition for . Assumptions 3–5 ensure the existence of the limit of the expectation of and the existence of the covariance matrix of the limiting distribution. In the following two theorems, we establish the consistency and asymptotic normality of , respectively.
Theorem 1.
Under Assumptions 1–3, for , satisfying as , we have that converges in probability to , where is a compact region containing . Thus, converges in probability to .
Theorem 2.
Proofs of the above theorems are based on theory for kernel density estimators (Schuster et al., 1969) and M-estimation (Kosorok, 2008). Details of the proofs are given in the appendix. Note that the convergence rate is slower than $\sqrt{n}$ due to the kernel estimation of the value function.
4 Simulation Studies
In this section, we conduct simulations to show the capability of our proposed method in identifying the optimal individualized dose rule. We first simulate some simple settings with only one covariate. $X$ is generated randomly from the standard normal distribution. $A$ is generated from the uniform distribution on . We generate $X$ and $A$ independently to mimic a randomized dose trial where a random dose from the safe dose range is assigned to each patient. The optimal dose rule is , where . $Y$ is generated from a normal distribution with mean , where . We use two different baseline functions for and two different sets of as shown in Table 1. The sample sizes are $n = 400$ and $n = 800$, and each setting is replicated 500 times.
The average bias and the standard deviation of the estimated parameters from 500 simulations are summarized in the first half of Table 2. The estimated parameters were close to the true parameters. The third column shows the average of the standard errors estimated with the covariance function formula derived in Theorem 2 (see appendix for details). 95% confidence intervals were calculated with the estimated standard errors. The coverage probabilities are shown in the table. From the results, we can see that the bias and standard deviation of the estimated parameters decreased with larger sample sizes. The coverage probabilities of the confidence intervals were close to 95%, supporting the convergence results given in Section 3.
We also study the performance of our method when the training data are from observational studies, where the doses given to the patients may depend on the covariates $X$. The simulation settings are the same as settings 1–4 except that $A$ is generated from the distribution . The results are summarized in the second half of Table 2. The proposed method was still capable of giving good estimates of the parameters, and the coverage of the confidence intervals was close to 95%. These simulations imply that the proposed method performs well with data from both randomized trials and observational studies.
No baseline  With baseline  

Setting 1  Setting 3  
Setting 2  Setting 4 
Randomized trials
Setting  n    Bias  SD     SE     CP    Bias  SD     SE     CP
S1       400  2.5   46.6   47.5   95.6  17.3  53.5   54.5   92.8
S1       800  2.4   33.7   33.4   95.8  19.5  37.3   38.5   90.2
S2       400  2.1   52.2   54.4   95.6  38.9  91.0   93.7   94.6
S2       800  1.5   39.1   38.1   93.8  33.0  63.0   65.9   95.8
S3       400  2.7   54.1   55.7   95.2  20.4  64.5   64.1   90.8
S3       800  1.6   38.8   39.3   95.0  18.9  43.7   45.4   92.0
S4       400  2.4   61.8   63.4   95.4  39.4  103.5  111.2  96.2
S4       800  1.5   44.4   44.3   94.6  33.6  75.0   77.5   95.6

Observational studies
Setting  n    Bias  SD     SE     CP    Bias  SD     SE     CP
S1       400  13.9  80.5   82.4   96.0  32.4  97.7   102.1  94.6
S1       800  8.5   47.3   47.0   94.6  19.6  56.7   58.2   92.2
S2       400  21.9  83.3   88.1   96.4  7.6   146.4  150.4  95.2
S2       800  17.8  63.4   60.9   93.0  0.6   94.0   103.2  98.2
S3       400  13.4  89.4   90.2   95.6  33.2  109.3  112.3  94.8
S3       800  9.0   53.1   50.6   93.0  22.8  60.2   63.2   93.4
S4       400  21.6  91.3   97.1   96.2  5.0   165.4  169.3  95.8
S4       800  20.3  71.2   67.6   93.2  2.0   109.0  116.8  97.0

Note: * columns are in scale

Note: SD refers to the standard deviation of the estimated parameters from 500 replicates, SE refers to the mean of the estimated standard errors calculated by our covariance function, CP refers to the coverage probability of the confidence intervals calculated using the estimated standard errors.

Note: The worst case Monte Carlo standard error for proportions is .
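The numeric value in the note above is not preserved in this copy. As a worked check, assuming the worst case refers to a true coverage proportion of $p = 0.5$ and using the $R = 500$ replicates stated for these simulations, the Monte Carlo standard error of an estimated proportion would be:

```latex
\sqrt{\frac{p(1-p)}{R}}\;\Bigg|_{p = 0.5,\; R = 500}
  = \sqrt{\frac{0.25}{500}}
  \approx 0.0224 ,
```

that is, roughly 2.2 percentage points on the scale used for the CP columns.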
Under settings 1–4, we compare our method with linear-based O-learning (LOL) and kernel-based O-learning (KOL) proposed in Chen et al. (2016) and a discretized dose rule estimated using Q-learning. For discretized Q-learning, we divide the safe dose range into 10 equally spaced intervals: and create an indicator variable for each of the dose intervals , where , . The covariates included in the regression models are . In this way, an optimal dose interval is selected for each individual and the midpoint of the selected interval is suggested to the patient. The results from replicates are summarized in Table 3. Each column is the average value function of the dose rule estimated by the corresponding method. The value function is evaluated on a testing dataset. The numbers in parentheses are the standard deviations of the estimated value functions. From Table 3, we see that the proposed method performed the best under most settings. In the simulation for observational studies, O-learning performed the best when the sample size was small. However, the proposed method performed comparably well and improved as the sample size increased. The discretized Q-learning method did not provide a good dose suggestion in this case.
Randomized trials
Setting  n    DQ          LOL         KOL         KAL
S1       400  38.1(1.9)   7.7(7.0)    16.5(8.2)   2.7(2.7)
S1       800  32.7(0.9)   3.9(3.7)    9.3(4.3)    1.9(1.8)
S2       400  33.9(1.3)   18.1(10.5)  31.7(12.5)  3.3(3.7)
S2       800  20.0(0.8)   15.6(7.5)   20.4(7.4)   1.9(1.8)
S3       400  41.6(11.4)  8.5(14.1)   17.2(14.2)  3.7(12.4)
S3       800  61.2(11.5)  4.3(12.1)   10.0(12.7)  2.4(11.7)
S4       400  52.5(11.9)  21.3(17.7)  33.3(17.4)  4.2(12.5)
S4       800  23.3(11.9)  17.8(14.5)  22.4(13.9)  2.4(11.7)

Observational studies
Setting  n    DQ          LOL         KOL         KAL
S1       400  29.5(1.2)   7.4(6.3)    15.6(7.7)   8.1(8.0)
S1       800  24.4(0.9)   5.5(4.3)    10.3(5.3)   3.1(3.1)
S2       400  16.0(7.6)   14.1(6.6)   21.3(9.7)   8.2(8.3)
S2       800  32.0(1.1)   12.8(4.7)   12.2(4.5)   4.4(4.4)
S3       400  29.1(11.6)  8.1(13.5)   11.7(14.6)  9.8(15.7)
S3       800  34.2(11.8)  6.2(12.4)   11.2(13.3)  3.5(12.0)
S4       400  83.8(13.8)  14.7(13.0)  20.7(15.8)  10.0(15.8)
S4       800  34.1(11.3)  13.5(12.1)  11.2(12.3)  5.1(12.2)

Note: DQ refers to discretized Q-learning, LOL refers to linear O-learning, KOL refers to kernel-based O-learning, KAL refers to kernel assisted learning.

All columns are in scale. For settings 3 and 4, the numbers in the table are the value estimate for the purpose of comparison with the first two settings.
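The discretization step used for the Q-learning comparison in this section can be sketched as below. The sketch covers only the mapping from a continuous dose to one of 10 equally spaced intervals and back to the interval midpoint; the dose range [0, 1] in the usage lines and the function names are assumptions for illustration, and the regression model fitted on the interval indicators is not shown.

```python
def interval_index(dose, lo, hi, n_bins=10):
    """Index (0-based) of the equally spaced interval of [lo, hi] containing dose."""
    if not lo <= dose <= hi:
        raise ValueError("dose outside the safe dose range")
    j = int((dose - lo) / (hi - lo) * n_bins)
    return min(j, n_bins - 1)  # the upper endpoint belongs to the last interval


def interval_midpoint(j, lo, hi, n_bins=10):
    """Midpoint of interval j, which is the dose suggested to the patient."""
    width = (hi - lo) / n_bins
    return lo + (j + 0.5) * width


if __name__ == "__main__":
    j = interval_index(0.25, 0.0, 1.0)
    print(j, interval_midpoint(j, 0.0, 1.0))
```

With 10 intervals the suggested dose can differ from the best dose inside the selected interval by up to half an interval width, which is one source of the loss discussed for discretized rules.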
5 Warfarin Data Analysis
Warfarin is a widely used anticoagulant for prevention of thrombosis and thromboembolism. Although highly efficacious, dosing for warfarin is known to be challenging because of the narrow therapeutic index and the large variability among patients (Johnson et al., 2011). Overdosing of warfarin leads to bleeding and underdosing diminishes the efficacy of the medication. The international normalized ratio (INR) measures the clotting tendency of the blood. An INR between 2 and 3 is considered to be safe and efficacious for patients. Typically, the warfarin dosage is decided empirically: an initial dose is given based on the population average, and adjustments are made in the subsequent weeks while the INR of the patient is being tracked. A stable dose is decided in the end to achieve an INR of 2–3 (Johnson et al., 2011). The dosing process may take weeks to months, during which the patient is constantly at risk of bleeding or underdosing. Therefore, a quantitative method for warfarin dosing could greatly decrease the time, cost and risks for patients.
The following analysis uses the warfarin dataset collected by Consortium (2009). In the original paper, a linear regression was used to predict the stable dose using clinical results and pharmacogenetic information, including age, weight, height, gender, race, two kinds of medications (Cytochrome P450 and Amiodarone), and two genotypes (CYP2C9 genotype and VKORC1 genotype). This prediction method is based on the assumption that the stable doses received by the patients are optimal. However, later studies showed that the doses suggested by the International Warfarin Pharmacogenetics Consortium are suboptimal for elderly people, implying that the optimal dose assumption might not be valid (Chen et al., 2016).
We apply our proposed method to this dataset to estimate the optimal individualized dose rule for warfarin. Instead of using only the data of the patients with stabilized INR, we include all patients who received weekly doses between 6 mg and 95 mg. The medication information is missing for half of the observations and is therefore excluded from our analysis. Observations with missing values in the other variables are removed from the dataset, resulting in a total of patients. The outcome variable is defined as for the th observation. Stratification of the categorical variables is needed for the kernel density estimation. In order to ensure that there are enough observations in each stratified group, we consider only categorical variables that are distributed comparatively evenly among different groups. In our analysis, we use three variables: height, gender and the indicator variable for VKORC1 of type AG. Before we apply the proposed method, we normalize all the variables by , where , and sd is the standard deviation of the th variable.
The estimation results are shown in Table 4. A p-value is obtained for each of the parameters. The results imply that the optimal dose for males is higher than the optimal dose for females, all other variables being equal. They also imply that the patients with genotype VKORC1AG need higher doses than the patients with VKORC1 AG. We use the same variables and compare our method with O-learning and the discretized Q-learning method. For the discretized Q-learning method, we also divide the dose range into 10 equally spaced intervals. The doses suggested by the three methods are shown in Fig. 2. The result shows a tendency of the discretized Q-learning to suggest extreme doses, which is not ideal in real applications. This might be due to the fact that the higher dose intervals contain small numbers of observations, and thus the estimated models are dominated by a few subjects.
To evaluate the estimated dose rules of these methods, we randomly take two thirds of the data as training data and the rest of the data as testing data. The optimal individualized dose rule is estimated with the training data. The value function of the suggested individualized dose rule is estimated with the average of the Nadaraya-Watson estimator for in the testing dataset. The tuning parameters for the Nadaraya-Watson estimators are taken as and , where is the size of the testing dataset. The process is repeated 200 times. The distribution of the estimated value of the suggested dose is shown in Fig. 3. The individualized dose rule suggested by our proposed method led to better expected outcomes in the population compared to the other methods. The performance of the discretized Q-learning method was not stable, as shown in the result. However, this result was only based on the three variables selected, while in reality, the two medications (Cytochrome P450 enzyme and Amiodarone) and the genotype CYP2C9 are also of significant importance in warfarin dosing. The computational complexity of our proposed method restricted its capability of handling higher dimensional problems.
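The repeated random-split evaluation described above can be sketched generically as follows. The callables `fit_rule` and `evaluate_rule` are hypothetical placeholders for whichever dose-rule learner and value estimator are being compared; only the two-thirds split and the 200 repetitions come from the text.

```python
import random


def random_split(data, train_frac, rng):
    """Shuffle indices and split the data into training and testing parts."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = round(train_frac * len(data))
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    return train, test


def repeated_split_values(data, fit_rule, evaluate_rule,
                          n_repeats=200, train_frac=2 / 3, seed=0):
    """Fit on a random training part, evaluate on the held-out part, and repeat."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_repeats):
        train, test = random_split(data, train_frac, rng)
        rule = fit_rule(train)
        values.append(evaluate_rule(rule, test))
    return values
```

The resulting list of held-out values is what a plot such as a boxplot per method would summarize.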
Variable  Estimated Parameter  SE  p-value 

Intercept  0.463  0.064  0.000 
Height  0.263  0.101  0.005 
Gender  0.268  0.134  0.023 
VKORC1.AG  0.4682  0.094  0.000 
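As an arithmetic check on Table 4, the sketch below computes Wald-type z statistics, upper-tail normal p-values and 95% intervals from the printed estimates and standard errors. Two caveats are assumptions of this example rather than statements from the text: the magnitudes are used exactly as printed, and the reported p-values appear consistent with one-sided normal tests (for instance 0.263 / 0.101 gives an upper-tail probability of about 0.005).

```python
import math


def one_sided_p_value(estimate, std_error):
    """Upper-tail standard normal probability for z = |estimate| / std_error."""
    z = abs(estimate) / std_error
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def wald_interval(estimate, std_error, z_crit=1.96):
    """Normal-approximation 95% confidence interval."""
    half_width = z_crit * std_error
    return estimate - half_width, estimate + half_width


# (estimate, SE) magnitudes as printed in Table 4
table4 = {
    "Intercept": (0.463, 0.064),
    "Height": (0.263, 0.101),
    "Gender": (0.268, 0.134),
    "VKORC1.AG": (0.4682, 0.094),
}

if __name__ == "__main__":
    for name, (est, se) in table4.items():
        p = one_sided_p_value(est, se)
        lo, hi = wald_interval(est, se)
        print(f"{name:10s} z={est / se:5.2f} p={p:.3f} CI=({lo:.3f}, {hi:.3f})")
```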
6 Discussion and Conclusion
The proposed kernel assisted learning method for estimating the optimal individualized dose rule makes it possible to conduct statistical inference on estimated dose rules, thus providing insights into the importance of the covariates in the dosing process. In our simulation settings, our method was capable of identifying the optimal individualized dose rule when the optimal dose rule was inside the prespecified class of rules. In the warfarin dosing case, based on the three covariates selected, the suggested doses led to better expected clinical results compared to the other methods. Application of the proposed methodology is not limited to optimal dose finding. This method can also be applied to any scenario where continuous decision making is desired.
The proposed method has several possible extensions. Notice that the form of the prespecified rule class can be extended to a link function with a nonlinear predictor where are some prespecified basic spline functions and . The accuracy of the approximated value function might also be improved by extending the multivariate kernel to (Duong and Hazelton, 2005).
One weakness of the proposed method is that the accuracy of the estimated value function is sensitive to the choice of bandwidth. The kernel density estimator in the denominator of might lead to large bias when the bandwidths are not properly chosen. As the dimension of increases, the choice of the bandwidths becomes nontrivial. The criteria for choosing bandwidths need to be studied further.
In the future, we are interested in variable selection when dealing with high-dimensional data. Extensions to multistage dose-finding problems are also of interest. Personalized dose finding is still a relatively new problem. With the complicated mechanisms of various diseases, there are many more problems to be tackled in this realm.
References
 Chen et al. (2016) Guanhua Chen, Donglin Zeng, and Michael R Kosorok. 2016. Personalized dose finding using outcome weighted learning. J. Amer. Statist. Assoc. 111, 516 (2016), 1509–1521.
 Chen et al. (2018) Jingxiang Chen, Haoda Fu, Xuanyao He, Michael R Kosorok, and Yufeng Liu. 2018. Estimating individualized treatment rules for ordinal treatments. Biometrics 74, 3 (2018), 924–933.
 Chevret (2006) Sylvie Chevret. 2006. Statistical methods for dose-finding experiments. Vol. 24. Wiley Online Library.
 Consortium (2009) International Warfarin Pharmacogenetics Consortium. 2009. Estimation of the warfarin dose with clinical and pharmacogenetic data. New England Journal of Medicine 360, 8 (2009), 753–764.
 Duong and Hazelton (2005) Tarn Duong and Martin L Hazelton. 2005. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics 32, 3 (2005), 485–506.
 Fan et al. (2017) Caiyun Fan, Wenbin Lu, Rui Song, and Yong Zhou. 2017. Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 5 (2017), 1565–1582.
 Henderson et al. (2010) Robin Henderson, Phil Ansell, and Deyadeen Alshibani. 2010. Regret-regression for optimal dynamic treatment regimes. Biometrics 66, 4 (2010), 1192–1201.
 Johnson et al. (2011) Julie A Johnson, Li Gong, Michelle Whirl-Carrillo, Brian F Gage, Stuart A Scott, CM Stein, JL Anderson, Stephen E Kimmel, Ming Ta Michael Lee, M Pirmohamed, et al. 2011. Clinical Pharmacogenetics Implementation Consortium Guidelines for CYP2C9 and VKORC1 genotypes and warfarin dosing. Clinical Pharmacology & Therapeutics 90, 4 (2011), 625–629.
 Kang et al. (2018) Suhyun Kang, Wenbin Lu, and Jiajia Zhang. 2018. On estimation of the optimal treatment regime with the additive hazards model. Statistica Sinica 28, 3 (2018), 1539.
 Kosorok (2008) Michael R Kosorok. 2008. Introduction to empirical processes and semiparametric inference. Springer.
 Laber and Zhao (2015) EB Laber and YQ Zhao. 2015. Tree-based methods for individualized treatment regimes. Biometrika 102, 3 (2015), 501–514.
 Liang et al. (2018) Shuhan Liang, Wenbin Lu, and Rui Song. 2018. Deep advantage learning for optimal dynamic treatment regime. Statistical theory and related fields 2, 1 (2018), 80–88.
 Murphy (2003) Susan A Murphy. 2003. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 2 (2003), 331–355.
 Qian and Murphy (2011) Min Qian and Susan A Murphy. 2011. Performance guarantees for individualized treatment rules. Annals of statistics 39, 2 (2011), 1180.
 Rich et al. (2014) Benjamin Rich, Erica EM Moodie, and David A Stephens. 2014. Simulating sequential multiple assignment randomized trials to generate optimal personalized warfarin dosing strategies. Clinical Trials 11, 4 (2014), 435–444.
 Robins (2004) James M Robins. 2004. Optimal structural nested models for optimal sequential decisions. In Proceedings of the second seattle Symposium in Biostatistics. Springer, 189–326.
 Rubin (1978) Donald B Rubin. 1978. Bayesian inference for causal effects: The role of randomization. The Annals of statistics (1978), 34–58.

 Schuster et al. (1969) Eugene F Schuster et al. 1969. Estimation of a probability density function and its derivatives. The Annals of Mathematical Statistics 40, 4 (1969), 1187–1195.
 Shi et al. (2018) Chengchun Shi, Alin Fan, Rui Song, and Wenbin Lu. 2018. High-dimensional A-learning for optimal dynamic treatment regimes. Annals of statistics 46, 3 (2018), 925.
 Shi et al. (2019) Chengchun Shi, Rui Song, and Wenbin Lu. 2019. Concordance and value information criteria for optimal treatment decision. Annals of Statistics (2019).
 Song et al. (2017) Rui Song, Shikai Luo, Donglin Zeng, Hao Helen Zhang, Wenbin Lu, and Zhiguo Li. 2017. Semiparametric single-index model for estimating optimal individualized treatment strategy. Electronic journal of statistics 11, 1 (2017), 364.
 Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning 8, 3-4 (1992), 279–292.
 Xiao et al. (2019) Wei Xiao, Hao Helen Zhang, and Wenbin Lu. 2019. Robust regression for optimal individualized treatment rules. Statistics in medicine 38, 11 (2019), 2059–2073.
 Zhang et al. (2012a) Baqun Zhang, Anastasios A Tsiatis, Marie Davidian, Min Zhang, and Eric Laber. 2012a. Estimating optimal treatment regimes from a classification perspective. Stat 1, 1 (2012a), 103–114.
 Zhang et al. (2012b) Baqun Zhang, Anastasios A Tsiatis, Eric B Laber, and Marie Davidian. 2012b. A robust method for estimating optimal treatment regimes. Biometrics 68, 4 (2012b), 1010–1018.
 Zhao et al. (2009) Yufan Zhao, Michael R Kosorok, and Donglin Zeng. 2009. Reinforcement learning design for cancer clinical trials. Statistics in medicine 28, 26 (2009), 3294–3315.
 Zhao et al. (2012) Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. 2012. Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107, 499 (2012), 1106–1118.
 Zhou et al. (2019) Jie Zhou, Jiajia Zhang, Wenbin Lu, and Xiaoming Li. 2019. On restricted optimal treatment regime estimation for competing risks data. Biostatistics (2019).
Appendix A Proof of Theorem 1
We first prove the uniform convergence of to . For simplicity of notation, we define:
, , . Similarly, , . We write as
where
Notice that can be written as
where , and . Thus,
where
and , . It is trivial to prove that . Thus, under the boundedness of , we only need to show that:
To prove the first equation, notice that . By the dominated convergence theorem, it suffices to show the uniform convergence of the kernel density estimate to , which can be proved according to Schuster (1969).
For the second equation,