Constructing Stabilized Dynamic Treatment Regimes

08/03/2018 · Ying-Qi Zhao, et al.

We propose a new method, termed stabilized O-learning, for deriving stabilized dynamic treatment regimes, which are sequential decision rules for individual patients that not only adapt over the course of disease progression but also remain consistent in form over time. The method provides a robust and efficient learning framework for constructing dynamic treatment regimes by directly optimizing a doubly robust estimator of the expected long-term outcome. It can accommodate various types of outcomes, including continuous, categorical and potentially censored survival outcomes. In addition, the method is flexible enough to incorporate clinical preferences into a qualitatively fixed rule, where the parameters indexing the decision rules are shared across stages and can be estimated simultaneously. We conducted extensive simulation studies, which demonstrated the superior performance of the proposed method, and we analyzed data from the prospective Canary Prostate Cancer Active Surveillance Study (PASS) using the proposed method.

1 Introduction

Dynamic treatment regimes (DTRs), also called adaptive treatment strategies (Murphy, 2003, 2005a), are sequential decision rules that adapt over time to the time-varying characteristics of patients. A DTR takes patient health histories as inputs and recommends the next treatment strategy at each decision point. For example, treatment for lung cancer usually involves regimens with multiple lines (Socinski and Stinchcombe, 2007). Clinicians may update treatment for major depressive disorder according to factors emerging over time, such as side-effect severity and treatment adherence (Murphy et al., 2007). In these examples, the decision rules differ across stages, yet in practice it is not unusual to have a common decision rule shared across stages. For example, diabetes patients are recommended medication or lifestyle intervention when hemoglobin A1c (HbA1c) rises above a threshold, a criterion that is universal across the entire course of treatment. Lung transplantation may be initiated for a cystic fibrosis patient if the FEV (forced expiratory volume in one second) value falls below 30%, which, again, remains the same throughout the course of disease progression. Such shared decision rules are easier to implement in practice across multiple decision points, particularly when multivariate time-varying covariates are involved. We term these stabilized dynamic treatment regimes (SDTRs), which are also referred to as DTRs with shared parameters (Chakraborty et al., 2016). We will use both terms interchangeably throughout the article.

Estimating an optimal DTR without shared parameters has been widely studied in the past few years. A well-established approach is Q-learning (Watkins, 1989; Nahum-Shani et al., 2012; Zhao et al., 2009; Laber et al., 2014; Goldberg and Kosorok, 2012), which recursively estimates the conditional expectation of the outcome given the current patient history, assuming that optimal decisions are made in the future. These conditional expectations are known as Q-functions. Semiparametric methods have also been proposed, such as iterative minimization of regrets (Murphy, 2003) and G-estimation (Robins, 2004). These methods are methodologically more complex, but could potentially provide efficiency gains. Recently, direct methods have become popular in the literature. They circumvent modeling the conditional mean of the outcome given the treatment and covariates, and instead directly estimate the decision rule that maximizes the expected outcome. Examples include backward outcome weighted learning, simultaneous outcome weighted learning (Zhao et al., 2015), and others (Robins et al., 2008; Orellana et al., 2010; Zhang et al., 2013). We refer readers to Chakraborty and Moodie (2013) and Kosorok and Moodie (2015) for detailed reviews of the current literature.

SDTRs are analogous to stationary Markov decision processes with function approximation (Sutton and Barto, 1998a). For solving Markovian decision problems, Antos et al. (2008) proposed to minimize the squared Bellman error, where the Bellman error quantifies the difference between the estimated reward at any time point and the actual reward received. Chakraborty et al. (2016) proposed shared Q-learning to estimate the optimal shared-parameter DTR when the decision rule at each stage is the same function of one or more time-varying covariates. In particular, they formulated the decision rules as linear functions of covariates whose coefficients are assumed to be the same across stages. An alternative way of identifying an SDTR is simultaneous G-estimation (Robins, 2004), which can handle problems with shared parameters in principle, but its empirical performance is largely unknown (Moodie and Richardson, 2010). The simultaneous outcome weighted learning approach proposed in Zhao et al. (2015) could, with some modification, be used when the goal is to derive SDTRs. They converted the construction of DTRs into a simultaneous nonparametric classification problem, where a multi-dimensional hinge loss of the expected long-term outcome was employed. However, they did not explore modifying the method for SDTRs.

Moreover, none of the aforementioned methods can handle censored survival outcomes when constructing an SDTR. In general, methods for accommodating time-to-event outcomes are limited to the regular DTR setting, largely because of two challenges. First, the number of stages for each individual in the study is not fixed: the event time varies by individual, and treatment is usually stopped once the failure event happens. Second, the treatment/outcome status of a subject may be unknown when censoring occurs. To this end, Goldberg and Kosorok (2012) developed a Q-learning algorithm that adjusts for censored data and allows a flexible number of stages. However, it cannot be directly applied to solve for SDTRs.

In this paper, we propose two methods for solving SDTRs, termed censored shared-Q-learning and censored shared-O-learning (abbreviated from “outcome weighted learning”). The censored shared-Q-learning method generalizes the shared-Q-learning method (Chakraborty et al., 2016) by applying inverse probability of censoring weights to account for the uncertainty in the outcomes of censored subjects, where the censoring weights need to be carefully constructed for each stage due to the multi-stage nature of the problem. Like shared Q-learning, censored shared-Q-learning is an iterative procedure, which identifies the optimal decision rule for each stage in a sequential/iterative manner. The censored shared-O-learning method uses a non-iterative approach that directly maximizes a concave relaxation of the inverse-probability-of-censoring weighted estimator of the expected survival benefit. To the best of our knowledge, this is the first article that provides a thorough solution for deriving DTRs with shared parameters in the censored data setup.

The remainder of the paper is organized as follows. We introduce the general framework of SDTRs in Section 2. In Section 3, we introduce censored shared-Q-learning and censored shared-O-learning along with their computational algorithms. We conduct numerical studies comparing the proposed methods with Q-learning and shared Q-learning in Section 4. Section 5 applies the proposed methods to the Framingham Heart Study data. Finally, we provide a discussion of open questions in Section 6.

2 Statistical framework

In this section, we present definitions and notations used in the paper, where we follow Goldberg and Kosorok (2012) whenever possible. Throughout, we use uppercase letters to denote random variables and lowercase letters to denote realizations of the random variables.

Let be the maximal number of stages in a multistage study. Here, stages are referred to as clinical decision time points. A full trajectory of an observation sequence is . For any decision point , the information of a patient is represented by , where denotes the initial information, denotes the intermediate information collected between stages and when , and is the treatment assigned at the stage subsequent to observing . We assume that there are two possible treatments at each stage, i.e. . is a non-negative random variable that equals either the length of the interval between decision time point and , or the length between decision time point and failure time if the failure event occurs in that stage. can be viewed as the reward in the stage, and is the cumulative reward up to and including stage . In our context, the sum is the total survival time up to and including stage . As the total survival times are broken down based on the number of stages, this setup could introduce complexity to the trajectory structure. In particular, if a failure event occurs before the final decision point , the trajectory will not be of full length (Goldberg and Kosorok, 2012). Consequently, the number of stages can differ across observations. We denote the (random) total number of stages for an individual by . Thus, is the overall survival time.

In this study, the observations are subject to censoring. Let denote the censoring time, taking values in the segment . We assume that the censoring and failure times are independent given the covariates of all previous stages. Let be the censoring indicator at stage , where . If no censoring event happens before the decision time point, then and the outcome is observed. Conversely, if a censoring event occurs during stage , then , and . In this case, we cannot observe the failure time, which is censored by . On the other hand, if , the failure time is observed. While indicates that censoring has not occurred at stage , it does not necessarily mean that is the time of the event, which differs from the convention in standard survival analysis. Only is the true failure time indicator.

We use an overbar to denote the collection of history information, e.g. the sequence of treatments is represented by , and is the sequence of covariates up to . Let denote the accrued information at each stage, with the convention that , and . A DTR is a sequence of deterministic decision rules, , where is a function mapping from the space of accrued information to the space of available treatments. Under , a patient presenting with at time is recommended to treatment . We use to denote the class of all possible treatment regimes. Our goal is to identify the optimal dynamic treatment regime that maximizes the expected outcome if deployed to the whole population in the future. Given that no information on survival is available beyond in the observed data, the outcome of interest is the truncated-by- expected survival. Let denote the distribution of a trajectory given that . The optimal DTR is the sequence of rules that maximizes

(1)

where is the expectation with respect to . We refer to (1) as the value associated with a regime , denoted by .

Following Goldberg and Kosorok (2012), we modify the trajectories when they are not of full length or the overall survival time is greater than . The idea is to extend the information to full length by introducing noninformative values at stages after the failure. If a failure time occurs at stage , we let for and let the noninformative , for . Similarly, we let and , for . And for , we set and draw uniformly from as the noninformative values. If for some , , then we modify to be , and modify the trajectory to be noninformative at all stages after . For any DTR , we define a corresponding for the modified trajectories, where the same action is chosen for any triplet if , and a fixed action is chosen if . It has been shown in Goldberg and Kosorok (2012) that

(2)

Subsequently, remains the same after this modification. In the following, we omit the “prime” in the modified trajectories without the risk of ambiguity. Another complication is due to censoring, where the trajectories themselves may be censored and cannot be fully observed. This needs to be carefully handled when estimating SDTRs from the data.
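To make the trajectory modification concrete, the following minimal sketch pads an early-terminated trajectory to full length. The array layout, the uniform draws for noninformative covariates, and the fixed padded action are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def pad_trajectory(covariates, actions, rewards, T, rng=None, fixed_action=1):
    """Pad a trajectory that ended early (failure before the final stage)
    to full length T, in the spirit of the modification in Section 2.

    Rewards after the failure are set to zero, a fixed action is assigned,
    and noninformative covariates are drawn uniformly; the uniform range
    and the fixed action are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(rewards)                      # observed number of stages
    d = covariates.shape[1]               # covariate dimension
    pad = T - k
    if pad > 0:
        covariates = np.vstack([covariates, rng.uniform(size=(pad, d))])
        actions = np.concatenate([actions, np.full(pad, fixed_action)])
        rewards = np.concatenate([rewards, np.zeros(pad)])
    return covariates, actions, rewards
```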

3 Estimating DTRs with shared parameters for censored data

We often encounter time-varying covariates in longitudinal studies. To facilitate clinical implementation, it can be beneficial to have a decision rule whose functional form is shared across multiple stages, while the covariate values are allowed to change over time. In the following, we propose two methods to estimate DTRs with shared parameters using repeatedly measured covariate information when the outcome is subject to censoring. The proposed algorithms are based on Q-learning and O-learning in the multistage setup, respectively.

3.1 Censored shared-Q-learning

Due to delayed effects, we need to consider the entire treatment sequence in order to optimize the long-term outcome. Results from the dynamic programming literature show that , where and, recursively, for , when the underlying generative distribution is known (Bellman, 1957; Sutton and Barto, 1998b). Q-learning is an approximate dynamic programming algorithm that uses regression models to estimate the Q-functions and then estimates recursively. Note that if a failure occurred prior to the time point, i.e., , we set as zero. A commonly used strategy is to estimate the Q-functions via linear regression using working models , where and (including intercepts) are possibly different features of .

We focus on the setting where the decision rule parameters are shared across stages with , but s are left unshared. Let . Assuming the shared model is correctly specified, the pseudo-outcome at the stage is , and . Let . For any stage , we can write . Hence, can be estimated (Chakraborty et al., 2016) via

(3)

where denotes the empirical average of the data. However, in the censored data setup, the in may be censored and hence unknown, and the s are defined in terms of unknown parameters.

Let be the conditional survival function for the censoring time given the history information up to stage and the treatment received at that stage. In addition, we assume conditional independence of censoring (i.e. ). Then

Let if the censoring occurs in the stage, and otherwise. We further define for and . Consequently,

The quantity in the above equation involves only observed data; thus, instead of (3), can be estimated via a weighted least squares procedure, with

(4)

where is an estimator for .

We employ an iterative procedure for estimating s, which was also applied in Chakraborty et al. (2016) for shared Q-learning without censoring. The censored shared-Q-learning algorithm is presented below.

  1. Estimate the conditional survival function for the censoring time at stage , and construct the diagonal matrix , where .

  2. Set the initial value of , denoted by .

  3. At the iteration, :

    1. Construct the vector .

    2. Solve for , where

  4. Repeat steps (3a)-(3b) until for a prespecified value or until the maximum number of iterations is reached.

A simple choice for the initial value of is to set all parameters to zero, . In this paper, we let the initial values depend on the estimates from censored Q-learning (Goldberg and Kosorok, 2012), where the parameters are not shared. Denote the estimates as . We combine the distinct estimates of into a single estimate via an average. The initial values of can be set as , where .
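The following sketch illustrates one plausible implementation of the iterative procedure under simplifying assumptions: trajectories are padded to full length as in Section 2, treatments are coded -1/+1, the Q-functions are linear with a shared treatment-interaction block, and the data layout and variable names are ours rather than the authors'.

```python
import numpy as np

def censored_shared_q_learning(stages, p1, max_iter=50, tol=1e-6):
    """Illustrative sketch of censored shared-Q-learning (Section 3.1).

    Trajectories are assumed padded to full length T, so every stage has
    the same n rows.  stages[t] is a dict with
        H0 : (n, p0) main-effect design matrix (intercept included),
        H1 : (n, p1) features entering the shared linear rule,
        A  : (n,)    treatment at stage t, coded -1/+1,
        R  : (n,)    stage-t reward (survival increment; 0 after failure),
        W  : (n,)    inverse-probability-of-censoring weight for stage t.
    """
    T = len(stages)
    n = len(stages[0]["R"])
    p0 = stages[0]["H0"].shape[1]
    theta = np.zeros(T * p0 + p1)            # (beta_1, ..., beta_T, psi)

    for _ in range(max_iter):
        psi = theta[T * p0:]
        Xs, ys, ws = [], [], []
        future = np.zeros(n)                 # max_a Q_{t+1}; zero beyond stage T
        # Backward pass: pseudo-outcome = stage reward + optimal value of the
        # next-stage Q-function under the current parameter estimates.
        for t in reversed(range(T)):
            s = stages[t]
            y = s["R"] + future
            X = np.zeros((n, T * p0 + p1))
            X[:, t * p0:(t + 1) * p0] = s["H0"]          # stage-specific block
            X[:, T * p0:] = s["A"][:, None] * s["H1"]    # shared block
            Xs.append(X); ys.append(y); ws.append(s["W"])
            # With A in {-1, +1}, max_a Q_t = beta_t' H0 + |psi' H1|.
            beta_t = theta[t * p0:(t + 1) * p0]
            future = s["H0"] @ beta_t + np.abs(s["H1"] @ psi)
        X, y, w = np.vstack(Xs), np.concatenate(ys), np.concatenate(ws)
        # Stacked weighted least squares over all stages, cf. (4).
        sw = np.sqrt(w)
        theta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        if np.linalg.norm(theta_new - theta) < tol:
            theta = theta_new
            break
        theta = theta_new
    betas, psi = theta[:T * p0].reshape(T, p0), theta[T * p0:]
    return betas, psi          # shared rule: recommend A = sign(psi' H1)
```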

3.2 Censored shared-O-learning

The censored shared-Q-learning algorithm requires correct specifications of Q-functions at each stage. This could be unrealistic in the multistage setup since the underlying data generating mechanism is usually complicated. In this section, we propose censored shared-O-learning, which constructs the shared decision rules by directly targeting the overall benefit of the decision rule.

Define for . We assume the following conditions: (i) is independent of all potential values of the outcome and future variables conditional on ; and (ii) is strictly between 0 and 1. Assumption (i) holds in a sequential multiple assignment randomized trial (Murphy, 2005b) but is unverifiable in an observational study. As shown in Zhao et al. (2015),

(5)

where is the indicator function. The right-hand side indicates that equals the weighted average of outcomes among those who received treatments coinciding with those dictated by , with weights . The s are usually known in a sequential multiple assignment randomized trial. If they are unknown, we can estimate the s using methods such as logistic regression. Given the estimates , a plug-in estimator for based on (5) is

However, may not be fully observed due to censoring. Following the notation in Section 3.1, let and be the conditional survival function for the censoring time given the history information up to stage . Denote the estimator of as . We can estimate in the scenario with time-to-event outcomes using

(6)
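As a concrete illustration of this type of inverse-probability-weighted value estimate, a minimal sketch is given below. The weighting scheme, the argument names, and the coding of the censoring indicator are assumptions for the sketch and do not reproduce the exact expression in (6).

```python
import numpy as np

def ipcw_value_estimate(total_reward, event, censor_surv_at_Y,
                        actions, rule_actions, propensities):
    """Sketch of an IPCW value estimate in the spirit of (5)-(6).

    total_reward     : (n,)    observed (possibly truncated) survival time
    event            : (n,)    1 if the failure time is observed, 0 if censored
    censor_surv_at_Y : (n,)    estimated censoring survival probability at the
                               observed time given the history
    actions          : (n, T)  treatments actually received
    rule_actions     : (n, T)  treatments the candidate SDTR would assign
    propensities     : (n, T)  probability of the treatment actually received
    """
    # Only subjects whose received treatments coincide with the rule at every
    # stage (and whose failure time is observed) contribute to the estimate.
    agree = np.all(actions == rule_actions, axis=1).astype(float)
    w = event * agree / (censor_surv_at_Y * np.prod(propensities, axis=1))
    return np.mean(w * total_reward)   # a normalized variant divides by np.mean(w)
```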

The decision rules are formulated as fixed linear functions of the present variables at each stage. Mathematically, they take the form , where is a subset of important for determining the SDTRs, similar to the one defined in Section . In addition, we define . Consequently, we maximize over

where we use to denote , and substitute by .

It could be challenging to optimize directly due to the discontinuity of the indicator functions. A computationally efficient approach is to replace the indicator by a concave surrogate. This leads to an optimization problem of

(7)

where is a concave function. In this paper, we will use , which is an analog of the logistic loss in the machine learning literature. However, other choices of are available; for example, analogs of the exponential loss, the hinge loss and others can also be applied (Bartlett et al., 2006). The above objective function is not differentiable in . To account for the non-differentiability of the minimum function in (7), we instead consider a soft-minimum function of and to replace , which equals with being a positive constant. Hence, the term in (7) can be replaced by its soft-minimum counterpart. We maximize

The derivative with respect to the shared parameters of the decision rules can be written as

We can employ the orthant-wise limited-memory quasi-Newton algorithm proposed by Andrew and Gao (2007). We also note that it is possible to write out the Hessian matrix for our proposed objective function. However, we find that it does not improve the numerical performance, since the calculation of the second derivative is rather complicated and less efficient than a numerical approximation using the Sherman–Morrison updating formula, as implemented in the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm.

Censored shared-O-learning maximizes the estimated mean outcome of a DTR over the pre-specified class of DTRs with shared parameters. Hence, compared with censored shared-Q-learning, it circumvents the need to estimate Q-functions at each stage. Furthermore, it does not require modeling the censoring distribution at each stage but only needs to model the censoring distribution at the final stage. However, censored shared-O-learning involves an unknown parameter , which controls how closely the soft-minimum function approximates the minimum function. In practice, we can use cross-validation to select the best by grid search over a prespecified set of candidate values.
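A minimal sketch of the smoothed objective is given below, assuming a logistic-type concave surrogate, a log-sum-exp soft minimum, treatments coded -1/+1, and precomputed nonnegative subject-level weights. The function and argument names are illustrative, and numerical gradients are used within a quasi-Newton solver in the spirit of the discussion above.

```python
import numpy as np
from scipy.optimize import minimize

def soft_min(M, kappa):
    """Smooth approximation to the row-wise minimum (log-sum-exp form);
    larger kappa gives a tighter approximation."""
    return -np.logaddexp.reduce(-kappa * M, axis=1) / kappa

def fit_censored_shared_o_learning(H1, A, weights, kappa=1.0, psi0=None):
    """Illustrative sketch of the smoothed surrogate objective behind (7).

    H1      : (n, T, p) stage-level features entering the shared linear rule
    A       : (n, T)    received treatments, coded -1/+1
    weights : (n,)      nonnegative outcome weights, e.g. event indicator times
                        the truncated survival time divided by the estimated
                        censoring survival and treatment propensities
    """
    n, T, p = H1.shape
    psi0 = np.zeros(p) if psi0 is None else np.asarray(psi0, dtype=float)

    def neg_objective(psi):
        margins = A * (H1 @ psi)              # (n, T): A_t * psi' H1_t
        m = soft_min(margins, kappa)          # smooth minimum over stages
        phi = -np.logaddexp(0.0, -m)          # concave logistic-type surrogate
        return -np.mean(weights * phi)        # maximize <=> minimize the negative

    res = minimize(neg_objective, psi0, method="L-BFGS-B")
    return res.x                              # shared rule: treat if psi' H1_t > 0
```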

Remark 1.

In practice, interpretable and simple rules are preferable. A sparsity penalty such as the LASSO can be applied in both censored shared-Q-learning and censored shared-O-learning, shrinking the coefficients of unimportant variables to zero. For censored shared-Q-learning, we can solve for in step 3(b) using penalized weighted least squares. For censored shared-O-learning, we can maximize the penalized objective

where is the norm of and is a tuning parameter controlling the amount of penalization. The orthant-wise limited-memory quasi-Newton algorithm, which is a limited-memory BFGS algorithm that incorporates the regularization, can still be applied in this case.
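For illustration, the sketch below adds an l1 penalty to the smoothed objective from the previous sketch and solves it by proximal gradient descent with numerical gradients, a simple stand-in for the OWL-QN algorithm used in the paper; the step size, iteration count, and all names are assumptions.

```python
import numpy as np

def fit_penalized_shared_o_learning(H1, A, weights, lam, kappa=1.0,
                                    step=0.05, n_iter=2000):
    """Sketch of the penalized variant in Remark 1, solved by proximal
    gradient (ISTA) rather than OWL-QN; illustrative only."""
    n, T, p = H1.shape
    psi = np.zeros(p)

    def smooth_loss(psi):
        # Negative of the unpenalized smoothed objective (see previous sketch).
        margins = A * (H1 @ psi)
        m = -np.logaddexp.reduce(-kappa * margins, axis=1) / kappa
        return np.mean(weights * np.logaddexp(0.0, -m))

    def num_grad(f, x, eps=1e-6):
        g = np.zeros_like(x)
        for j in range(len(x)):
            e = np.zeros_like(x); e[j] = eps
            g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
        return g

    for _ in range(n_iter):
        psi = psi - step * num_grad(smooth_loss, psi)
        # Soft-thresholding: the proximal operator of the l1 penalty.
        psi = np.sign(psi) * np.maximum(np.abs(psi) - step * lam, 0.0)
    return psi
```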

4 Simulation Studies

One of the motivations for the current work derives from the long-term care of patients with diabetes. Patients are routinely examined for glycosylated hemoglobin (A1c) level every three months, and treatments are recommended to tightly control A1c and prevent adverse events such as hospitalization due to the disease. Our simulation mimics such a setting, using a generative model similar to that of Timbie et al. (2010) and Ertefaie and Strawderman (2018). We treat each check-up time as a decision point for determining treatment in the next three months. Our study consists of 10 decision points. The treatments include metformin, sulfonylurea, glitazone, and insulin. Patients start with metformin and augment with sulfonylurea, glitazone, and insulin during the study period. At each decision point, patients can either continue the current treatment or augment the treatment. A binary discontinuation indicator is generated to represent patients’ intolerance to treatment due to side effects, and patients who discontinue a treatment will take the next available treatment. is the number of augmented treatments by the end of interval where , and the number of augmented treatments increases by one if a treatment is augmented. The outcome of interest is time to hospitalization. Hence, each patient’s trajectory continues until either a failure time occurs or the study ends. A censoring variable is uniformly drawn from . When an event is censored, the trajectory ends at the time of censoring and only the censoring time is observed. Here are the steps we take to generate the dataset:

  • Baseline variables: Variables are generated from a multivariate normal distribution with mean and the covariance matrix , where BP is the systolic blood pressure. Also, , where is the discontinuation indicator at stage .

  • Treatments: Given , the sets of available treatments are , , , , and , where 0 means continue with the current treatment. The treatment is given as follows:

    • if , continue with the current treatment and .

    • if and , augment the current treatment and .

    • if and , then a binary variable is generated with probability , where is the discontinuation indicator. If , the patient continues with the current treatment, and we set . If , the treatment is augmented, and we set .

  • Treatment discontinuation indicator: A binary variable is generated from a Bernoulli distribution given the last augmented treatment. The treatment discontinuation rates are , and . We assume that .

  • A1c, BP and weight at time : we use the following generative model for A1c,

    where , and is the treatment effect of , where the treatment effects of metformin, sulfonylurea, glitazone and insulin are 0.14, 0.20, 0.12, and 0.14, respectively. For the other time-varying variables at time : and , .

  • Time to hospitalization: two generative mechanisms are considered. In Scenario 1, the survival time at stage , i.e., time to hospitalization, starting from the beginning of stage , is generated by

    where follows a standard normal distribution. In Scenario 2, the survival time at stage is generated by

The regret at each stage, i.e., the loss of reward incurred by not following the optimal treatment regime at each stage, for Scenario 1 is , and the regret for Scenario 2 is . Then the underlying optimal rule is the rule that yields zero regret for all stages. Hence, in both scenarios, the optimal DTR is shared across stages, and , where in our situation. In the first example, the difference between the treatment effects, also known as the contrast function, can be specified as . A linear model in censored Q-learning or censored shared-Q-learning could be close enough to a correctly specified model. However, this is not true in the second example, where a linear model is severely misspecified.

The proposed censored shared-Q-learning and censored shared-O-learning are compared with censored Q-learning (Goldberg and Kosorok, 2012), which does not take into account the shared data structure. We consider sample sizes of 2000 and 5000. We generate a validation dataset with 50000 observations. The experiment is performed 500 times independently. In each replicate, we calculate the mean response for all subjects in the validation dataset, had the whole population followed the estimated rule. The averaged outcome is used to compare different methods.

The censored shared-O-learning is implemented following (7) with , and in the soft-minimum function is set to 1. We examined other values of , which gave similar results. The propensity scores at each stage are estimated using the observed treatment proportions. For censored shared-Q-learning, we use a linear model for the Q-function, and the parameters are estimated via weighted least squares as presented in (4). For censored Q-learning, a linear model is also used for the Q-function, where we let . Weighted least squares is used at each stage to solve for and .

We use the Cox proportional hazards model to estimate and construct the weight at each stage. Note that in the censored shared-O-learning method, only needs to be fitted. For all stages, we use A1c, BP and weight at the baseline level to fit the Cox models. Let denote the regressors, and denote the hazard function of the censoring time for subject . Then, , where is the baseline hazard function for the censoring time. The estimator for , say , maximizes the partial likelihood

We use the Breslow estimator for the cumulative baseline hazard function . An estimator of for subject is , where is the estimator for .
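The sketch below illustrates this construction with the lifelines package, fitting a Cox model to the censoring times (censoring treated as the event), using its Breslow-type baseline cumulative hazard, and forming inverse-probability-of-censoring weights; the column names and the clipping safeguard are our assumptions.

```python
import numpy as np
from lifelines import CoxPHFitter

def censoring_weights(df, covariate_cols, time_col="time", event_col="event"):
    """Illustrative IPC weight construction via a Cox model for censoring.

    df[event_col] is 1 if the failure time is observed and 0 if censored;
    covariate_cols might be baseline A1c, BP and weight as in Section 4.
    """
    # Fit the Cox model treating censoring as the "event" of interest.
    cens_df = df[covariate_cols + [time_col]].copy()
    cens_df["censored"] = 1 - df[event_col]
    cph = CoxPHFitter()
    cph.fit(cens_df, duration_col=time_col, event_col="censored")

    # Breslow-type baseline cumulative hazard and per-subject partial hazards.
    H0 = cph.baseline_cumulative_hazard_          # indexed by event time
    h0_times = H0.index.values
    h0_vals = H0.iloc[:, 0].values
    ph = np.asarray(cph.predict_partial_hazard(df[covariate_cols])).ravel()

    # Evaluate the censoring survival S_C(Y_i | Z_i) at each subject's own
    # observed time via the step function exp(-H0(t) * partial hazard).
    idx = np.searchsorted(h0_times, df[time_col].values, side="right") - 1
    h0_at_y = np.where(idx >= 0, h0_vals[np.clip(idx, 0, None)], 0.0)
    s_c = np.exp(-h0_at_y * ph)

    # IPC weights: nonzero only for subjects whose failure time is observed.
    return df[event_col].values / np.clip(s_c, 1e-8, None)
```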

The means and the standard errors of the 500 values of the estimated DTRs on the validation set for both scenarios are presented in Table 1. In Scenario 1, where the regression model is close to being correctly specified, censored shared-Q-learning has the best performance. The censored Q-learning approach does not account for the shared data structure, so there is a large variation in the obtained results. In Scenario 2, both censored shared-Q-learning and censored Q-learning are sensitive to model misspecification. Conversely, censored shared-O-learning has a robust performance, though it performs slightly worse than censored shared-Q-learning in Scenario 1. In practice, it is unknown whether the regression model in censored shared-Q-learning is correctly specified. We can use a cross-validation approach to choose between censored shared-Q-learning and censored shared-O-learning, selecting the one that yields a better result (e.g. a larger estimated value).

5 Data Analysis

In this section, we apply the proposed methods to the Framingham Heart Study. The Framingham Heart Study, established in 1948, is the first large-scale prospective longitudinal cohort study of cardiovascular disease in the US. In the original cohort, 5,209 men and women were monitored prospectively for epidemiological and genetic risk factors for cardiovascular disease. There are a maximum of 32 examinations, which occurred biennially during the 65 years of follow-up (Tsao and Vasan, 2015). For illustration, we consider only information from the second to the sixth visits. In our dataset, 2,236 subjects are available with complete information on the risk factors at each measurement time and are free of cardiovascular disease at the time of examination. The long-term outcome of interest is time to the onset of the first major cardiovascular disease event or death, which has an event rate of 18.7% by the end of the study. The median follow-up time is 25 years, and ages at baseline range from 17 to 70 with a median of 43. Traditionally, hypertension medication is recommended based on the blood pressure level. However, outcomes might be improved if other information is also factored in. We utilize the Framingham Heart Study data to derive a decision rule that informs a patient whether hypertension medication should be taken at each decision point, aiming to reduce the long-term risk of cardiovascular disease. Risk factors considered in our prediction at each visit include age, diastolic blood pressure, cholesterol, high-density lipoprotein, presence of diabetes, and smoking.

We first carry out a cross-validation procedure to select between censored shared-Q-learning and censored shared-O-learning. At each run, we partition the whole dataset into two parts, with one part serving as training data to estimate the SDTRs using both methods and the other part serving as the validation set for implementing the estimated SDTRs. When estimating the SDTRs, the Kaplan–Meier method is used to fit the censoring probabilities. The constructed SDTRs from the training set are evaluated using the empirical value on the validation set, adjusted for censoring. Each part serves as the validation subset once, and the cross-validated values are obtained by averaging the empirical values over both validation subsets. The procedure is repeated 100 times.

Our implementation shows that both methods result in similar cross-validated values of mean residual survival time in years (censored shared-Q-learning: 16.53; censored shared-O-learning: 16.80). Hence, we carry out both methods on the whole dataset. The coefficients of the estimated SDTRs are presented in Table 2, and Figure 1 shows the treatment allocation rates from the constructed dynamic treatment regimes. The recommended rules may look different between the two methods. This can happen when many patients do not have large differential treatment effects. In general, patients who are currently on medication are more likely to continue taking it. Compared with the current data, there is a slight increase in the proportion of subjects recommended for medication in the later years under the censored shared-Q-learning recommendations. Conversely, fewer patients are recommended medication under the censored shared-O-learning rules. In either case, the survival benefit could be significantly improved under the recommended SDTRs. Figure 2 shows the Kaplan–Meier curves of time to first cardiovascular disease event for patients whose treatments are consistent with the censored shared-Q-learning recommendations (left panel) and the censored shared-O-learning recommendations (right panel) versus those whose treatments are not, at the first and subsequent time points. It is clear that subjects whose medication coincided with the recommendation had, on average, better survival outcomes.

6 Discussion

We proposed two new methods for constructing an SDTR with survival outcomes, that is, a regime that is a fixed function of time-varying covariates over time. Such a rule yields an optimal treatment strategy that can be easily implemented in practice, and we provide efficient computational algorithms to obtain the solution. Our decision rule is based on a linear combination of updated covariate information. It is also of interest to develop a more robust tree-structured decision rule, e.g. Zhu et al. (2017), without the assumption of linearity, which has the additional advantage of ease of interpretation and dissemination. Censored shared-O-learning is based on an inverse probability weighted estimator of the expected outcome that would be achieved under a particular DTR. However, such an estimator is potentially inefficient because it uses outcome information only from subjects whose treatment assignments coincide with those dictated by the DTR of interest. In the future, we can develop SDTRs via censored shared-O-learning using an augmented inverse probability weighted estimator (Tsiatis, 2006; Zhang et al., 2013). Such an approach can incorporate contributions from subjects who did not receive the specified treatment assignments at all stages by estimating their pseudo-outcomes using censored Q-learning, and hence will improve efficiency over the censored shared-O-learning proposed here.

References

  • Andrew and Gao (2007) Andrew, G. and Gao, J. “Scalable training of L1-regularized log-linear models.” In Proceedings of the 24th International Conference on Machine Learning, 33–40. ACM (2007).
  • Antos et al. (2008) Antos, A., Szepesvári, C., and Munos, R. “Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path.” Machine Learning, 71(1):89–129 (2008).
  • Bartlett et al. (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. “Convexity, classification, and risk bounds.” Journal of the American Statistical Association, 101(473):138–156 (2006).
  • Bellman (1957) Bellman, R. Dynamic Programming. Princeton: Princeton University Press (1957).
  • Chakraborty et al. (2016) Chakraborty, B., Ghosh, P., Moodie, E. E., and Rush, A. J. “Estimating optimal shared-parameter dynamic regimens with application to a multistage depression clinical trial.” Biometrics, 72(3):865–876 (2016).
  • Chakraborty and Moodie (2013) Chakraborty, B. and Moodie, E. Statistical methods for dynamic treatment regimes. Springer (2013).
  • Ertefaie and Strawderman (2018) Ertefaie, A. and Strawderman, R. L. “Constructing dynamic treatment regimes over indefinite time horizons.” Biometrika, 105(4):963–977 (2018).
  • Goldberg and Kosorok (2012) Goldberg, Y. and Kosorok, M. R. “Q-learning with censored data.” Annals of statistics, 40(1):529 (2012).
  • Kosorok and Moodie (2015) Kosorok, M. R. and Moodie, E. E. Adaptive treatment strategies in practice: planning trials and analyzing data for personalized medicine. SIAM (2015).
  • Laber et al. (2014) Laber, E. B., Linn, K. A., and Stefanski, L. A. “Interactive model building for Q-learning.” Biometrika, 101(4):831–847 (2014).
  • Moodie and Richardson (2010) Moodie, E. E. and Richardson, T. S. “Estimating optimal dynamic regimes: Correcting bias under the null.” Scandinavian Journal of Statistics, 37(1):126–146 (2010).
  • Murphy (2003) Murphy, S. A. “Optimal Dynamic Treatment Regimes.” Journal of the Royal Statistical Society, Series B, 65:331–366 (2003).
  • Murphy (2005a) ——–. “An experimental design for the development of adaptive treatment strategies.” Statistics in Medicine, 24:1455–1481 (2005a).
  • Murphy (2005b) ——–. “An experimental design for the development of adaptive treatment strategies.” Statistics in medicine, 24(10):1455–1481 (2005b).
  • Murphy et al. (2007) Murphy, S. A., Oslin, D. W., Rush, A. J., Zhu, J., and MCATS. “Methodological Challenges in Constructing Effective Treatment Sequences for Chronic Psychiatric Disorders.” Neuropsychopharmacology, 32:257–262 (2007).
  • Nahum-Shani et al. (2012) Nahum-Shani, I., Qian, M., Almirall, D., Pelham, W. E., Gnagy, B., Fabiano, G. A., Waxmonsky, J. G., Yu, J., and Murphy, S. A. “Q-learning: A data analysis method for constructing adaptive interventions.” Psychological methods, 17(4):478 (2012).
  • Orellana et al. (2010) Orellana, L., Rotnitzky, A., and Robins, J. M. “Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part I: main content.” The international journal of biostatistics, 6(2) (2010).
  • Robins et al. (2008) Robins, J., Orellana, L., and Rotnitzky, A. “Estimation and extrapolation of optimal treatment and testing strategies.” Statistics in medicine, 27(23):4678–4721 (2008).
  • Robins (2004) Robins, J. M. “Optimal structural nested models for optimal sequential decisions.” In Proceedings of the Second Seattle Symposium in Biostatistics, 189–326. Springer (2004).
  • Socinski and Stinchcombe (2007) Socinski, M. and Stinchcombe, T. “Duration of first-line chemotherapy in advanced non-small-cell lung cancer: less is more in the era of effective subsequent therapies.” Journal of Clinical Oncology, 25:5155–5157 (2007).
  • Sutton and Barto (1998a) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, volume 1. Cambridge, MA: MIT Press (1998a).
  • Sutton and Barto (1998b) ——–. Reinforcement Learning I: Introduction. Cambridge, MA: MIT Press (1998b).
  • Timbie et al. (2010) Timbie, J. W., Hayward, R. A., and Vijan, S. “Diminishing efficacy of combination therapy, response-heterogeneity, and treatment intolerance limit the attainability of tight risk factor control in patients with diabetes.” Health services research, 45(2):437–456 (2010).
  • Tsao and Vasan (2015) Tsao, C. W. and Vasan, R. S. “Cohort Profile: The Framingham Heart Study (FHS): overview of milestones in cardiovascular epidemiology.” International journal of epidemiology, 44(6):1800–1813 (2015).
  • Tsiatis (2006) Tsiatis, A. A. Semiparametric theory and missing data. Springer (2006).
  • Watkins (1989) Watkins, C. J. C. H. “Learning from delayed rewards.” Ph.D. thesis, King’s College, Cambridge (1989).
  • Zhang et al. (2013) Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. “Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions.” Biometrika, 100:681–695 (2013).
  • Zhao et al. (2009) Zhao, Y., Kosorok, M. R., and Zeng, D. “Reinforcement learning design for cancer clinical trials.” Statistics in medicine, 28(26):3294–3315 (2009).
  • Zhao et al. (2015) Zhao, Y.-Q., Zeng, D., Laber, E. B., and Kosorok, M. R. “New statistical learning methods for estimating optimal dynamic treatment regimes.” Journal of the American Statistical Association, 110(510):583–598 (2015).
  • Zhu et al. (2017) Zhu, R., Zhao, Y.-Q., Chen, G., Ma, S., and Zhao, H. “Greedy outcome weighted tree learning of optimal personalized treatment rules.” Biometrics, 73(2):391–400 (2017).