1 Introduction
The use of mobile devices in clinical care, called mobile health (mHealth), provides an effective and scalable platform to assist patients in managing their illness (Free et al., 2013; Steinhubl et al., 2013). Advantages of mHealth interventions include real-time communication between a patient and their healthcare provider as well as systems for delivering training, teaching, and social support (Kumar et al., 2013). Mobile technologies can also be used to collect rich longitudinal data to estimate optimal dynamic treatment regimes and to deliver treatment that is deeply tailored to each individual patient. We propose a new estimator of an optimal treatment regime that is suitable for use with longitudinal data collected in mHealth applications.
A dynamic treatment regime provides a framework to administer individualized treatment over time through a series of decision rules. Dynamic treatment regimes have been well-studied in the statistical and biomedical literature (Murphy, 2003; Robins, 2004; Moodie et al., 2007; Kosorok and Moodie, 2015; Chakraborty and Moodie, 2013); furthermore, statistical considerations in mHealth have been studied by, for example, Liao et al. (2015) and Klasnja et al. (2015). Although mobile technology has been successfully utilized in clinical areas such as diabetes (Quinn et al., 2011; Maahs et al., 2012), smoking cessation (Ali et al., 2012), and obesity (Bexelius et al., 2010), mHealth poses some unique challenges that preclude direct application of existing methodologies for dynamic treatment regimes. For example, mHealth applications typically involve a large number of time points per individual and no definite time horizon; the momentary signal may be weak and may not directly measure the outcome of interest; and estimation of optimal treatment strategies must be done online as data accumulate.
This work is motivated in part by our involvement in a study of mHealth as a management tool for type 1 diabetes. Type 1 diabetes is an autoimmune disease wherein the pancreas produces insufficient levels of insulin, a hormone needed to regulate blood glucose concentration. Patients with type 1 diabetes are continually engaged in management activities including monitoring glucose levels, timing and dosing insulin injections, and regulating diet and physical activity. Increased glucose monitoring and attention to self-management facilitate more frequent treatment adjustments and have been shown to improve patient outcomes (Levine et al., 2001; Haller et al., 2004; Ziegler et al., 2011). Thus, patient outcomes have the potential to be improved by diabetes management tools that are deeply tailored to the continually evolving health status of each patient. Mobile technologies can be used to collect data on physical activity, glucose, and insulin at a fine granularity in an outpatient setting (Maahs et al., 2012). There is great potential for using these data to create comprehensive and accessible mHealth interventions for clinical use. We envision application of this work for use before the artificial pancreas (Weinzimer et al., 2008; Kowalski, 2015; Bergenstal et al., 2016) becomes widely available.
The sequential decision-making process can be modeled as a Markov decision process (Puterman, 2014) and the optimal treatment regime can be estimated using reinforcement learning algorithms such as Q-learning (Murphy, 2005; Zhao et al., 2009; Tang and Kosorok, 2012; Schulte et al., 2014). Ertefaie (2014) proposed a variant of greedy gradient Q-learning (GGQ) to estimate optimal dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010). In GGQ, the form of the estimated Q-function dictates the form of the estimated optimal treatment regime. Thus, one must choose between a parsimonious model for the Q-function at the risk of model misspecification or a complex Q-function that yields unintelligible treatment regimes. Furthermore, GGQ requires modeling a non-smooth function of the data, which creates complications (Laber et al., 2014; Linn et al., 2017). We propose an alternative estimation method for infinite horizon dynamic treatment regimes that is suited to mHealth applications. Our approach, which we call V-learning, involves estimating the optimal policy among a pre-specified class of policies (Zhang et al., 2012, 2013). It requires minimal assumptions about the data-generating process and permits estimating a randomized decision rule that can be implemented online as data accumulate.
In Section 2, we describe the setup and present our method for offline estimation using data from a micro-randomized trial or observational study. In Section 3, we extend our method for application to online estimation with accumulating data. Theoretical results, including consistency and asymptotic normality of the proposed estimators, are presented in Section 4. We compare the proposed method to GGQ using simulated data in Section 5. A case study using data from patients with type 1 diabetes is presented in Section 6 and we conclude with a discussion in Section 7. Proofs of technical results are in the Appendix.
2 Offline estimation from observational data
We assume that the available data are , which comprise independent, identically distributed trajectories , where: denotes a summary of patient information collected up to and including time ; denotes the treatment assigned at time ; and denotes the (possibly random) patient follow-up time. In the motivating example of type 1 diabetes, could contain a patient’s blood glucose, dietary intake, and physical activity in the hour leading up to time and could denote an indicator that an insulin injection is taken at time . We assume that the data-generating model is a time-homogeneous Markov process so that and the conditional density is the same for all . Let denote an indicator that the patient is still in follow-up at time , i.e., if the patient is being followed at time and zero otherwise. We assume that is contained in so that and implies
with probability one. Furthermore, we assume a known utility function
so that measures the ‘goodness’ of choosing treatment in state and subsequently transitioning to state . In our motivating example, the utility at time could be a measure of how far the patient’s average blood glucose concentration deviates from the optimal range over the hour preceding and following time . The goal is to select treatments to maximize expected cumulative utility; treatment selection is formalized using a treatment regime (Schulte et al., 2014; Kosorok and Moodie, 2015) and the utility associated with any regime is defined using potential outcomes (Rubin, 1978).
Let
denote the space of probability distributions over
. A treatment regime in this context is a function so that, under , a decision maker presented with state at time will select action with probability . Define , and . The set of potential outcomes is where is the potential state and is the potential follow-up status at time under treatment sequence . Thus, the potential utility at time is . For any , define to be a sequence of independent, valued stochastic processes indexed by such that . The potential follow-up time under is
where . The potential utility under at time is
where . Thus, utility is set to zero after a patient is lost to follow-up. However, in certain situations, utility may be constructed so as to take a negative value at the time point when the patient is lost to follow-up, e.g., if the patient discontinues treatment because of a negative effect associated with the intervention. Define the state-value function (Sutton and Barto, 1998), where is a fixed constant that captures the trade-off between short and long-term outcomes. For any distribution on , define the value function with respect to reference distribution as ; throughout, we assume that this reference distribution is fixed. The reference distribution can be thought of as a distribution of initial states and we estimate it from the data in the implementation in Sections 5 and 6. For a pre-specified class of regimes, , the optimal regime, , satisfies for all .
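As a rough illustration of the quantity whose expectation defines the state-value function, the sketch below computes a discounted sum of utilities along a single trajectory, with utility set to zero once the patient is lost to follow-up. It is a minimal sketch under stated assumptions, not the authors' code: the function name `discounted_utility`, the argument names, and the use of a discount factor `gamma` strictly between zero and one are all hypothetical choices made for the example.

```python
import numpy as np

def discounted_utility(utilities, in_followup, gamma):
    """Discounted cumulative utility along one trajectory.

    utilities:   utilities observed at successive time points (hypothetical input)
    in_followup: 1 while the patient is followed, 0 afterwards
    gamma:       assumed discount factor in (0, 1)
    """
    # Utility contributes nothing once the patient is lost to follow-up.
    u = np.asarray(utilities, dtype=float) * np.asarray(in_followup, dtype=float)
    # Weight the utility k steps ahead by gamma ** k.
    discounts = gamma ** np.arange(len(u))
    return float(np.sum(discounts * u))
```

Averaging this quantity over trajectories that start in a given state and follow a given regime would approximate the state-value function at that state; a smaller `gamma` places more weight on short-term outcomes.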
To construct an estimator of , we make a series of assumptions that connect the potential outcomes in with the data-generating model.
Assumption 1.
Strong ignorability, for all .
Assumption 2.
Consistency, for all and .
Assumption 3.
Positivity, there exists so that for all , , and all .
In addition, we implicitly assume that there is no interference among the experimental units. These assumptions are common in the context of estimating dynamic treatment regimes (Robins, 2004; Schulte et al., 2014). Assumptions 1 and 3 hold by construction in a micro-randomized trial (Klasnja et al., 2015; Liao et al., 2015).
Let for each . In a micro-randomized trial, is a known randomization probability; in an observational study, it must be estimated from the data. The following lemma characterizes for any regime, , in terms of the data-generating model (see also Lemma 4.1 of Murphy et al., 2001). A proof is provided in the Appendix.
Lemma 2.1.
The preceding result will form the basis for an estimating equation for . Write the right-hand side of (1) as
from which it follows that
Subsequently, for any function defined on , the statevalue function satisfies
(2) 
which is an importance-weighted variant of the well-known Bellman optimality equation (Sutton and Barto, 1998).
Let denote a model for indexed by . We assume that the map is differentiable everywhere for each fixed and . Let denote the gradient of and define
(3) 
Given a positive definite matrix and penalty function , define , where is a tuning parameter. Subsequently, is the estimated state-value function under in state . Thus, given a reference distribution, , the estimated value of a regime, , is and the estimated optimal regime is . The idea of V-learning is to use estimating equation (3) to estimate the value of any policy and maximize estimated value over a class of policies; we will discuss strategies for this maximization in Section 5.
V-learning requires a parametric class of policies. Assuming that there are possible treatments, , we can define a parametric class of policies as follows. Define for , and . This defines a class of randomized policies parametrized by , where
is a vector of parameters for the
th treatment. Under a policy in this class defined by , actions are selected stochastically according to the probabilities , . In the case of a binary treatment, a policy in this class reduces to and for a vector . This class of policies is used in the implementation in Sections 5 and 6.
V-learning also requires a class of models for the state-value function indexed by a parameter, . We use a basis function approximation. Let be a vector of pre-specified basis functions and let . Let . Under this working model,
Computational efficiency is gained from the linearity of in ; flexibility can be achieved through the choice of . We examine the performance of V-learning using a variety of basis functions in Sections 5 and 6.
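To make the offline estimation step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a binary treatment, a logistic form for the policy, a linear basis-function model for the state-value function, an identity weight matrix, and a squared Euclidean penalty; under those assumptions the estimating function is linear in the value-function coefficients, so the penalized criterion has a closed-form minimizer. The names `policy_prob`, `fit_value_coef`, `estimated_value`, `gamma`, and `lam` are hypothetical, as is the pooling of all observed transitions into flat arrays.

```python
import numpy as np

def policy_prob(policy_params, S, A):
    """Probability of each observed binary action under an assumed logistic policy."""
    p1 = 1.0 / (1.0 + np.exp(-(S @ policy_params)))
    return np.where(A == 1, p1, 1.0 - p1)

def fit_value_coef(S, A, U, S_next, mu, policy_params, basis, gamma=0.9, lam=1e-3):
    """Coefficients of a linear state-value model for a fixed policy.

    S, S_next: (N, p) current and next states, pooled over patients and times
    A:         (N,) binary actions
    U:         (N,) utilities
    mu:        (N,) probabilities of the observed actions under the
               data-generating policy (known in a randomized trial)
    basis:     callable mapping an (N, p) array to an (N, q) array
    """
    Phi, Phi_next = basis(S), basis(S_next)
    w = policy_prob(policy_params, S, A) / mu          # importance weights
    # The estimating function equals b - M @ coef, which is linear in coef.
    b = (Phi * (w * U)[:, None]).mean(axis=0)
    M = (Phi * w[:, None]).T @ (Phi - gamma * Phi_next) / len(U)
    # Minimize ||b - M coef||^2 + lam * ||coef||^2 in closed form.
    return np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ b)

def estimated_value(coef, S_ref, basis):
    """Average the fitted state-value function over draws from a reference distribution."""
    return float(np.mean(basis(S_ref) @ coef))
```

Maximizing `estimated_value` over `policy_params`, refitting the coefficients for each candidate policy, would then give an estimated optimal policy within the assumed class; any general-purpose optimizer could be used for that outer step.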
3 Online estimation from accumulating data
Suppose we have accumulating data , where and represent the state and action for patient at time . At each time , we estimate an optimal policy in a class, , using data collected up to time , take actions according to the estimated optimal policy, and estimate a new policy using the resulting states. Let be the estimated policy at time , i.e., is estimated after observing state and before taking action . If is a class of randomized policies, we can select an action for a patient presenting with according to , i.e., we draw according to the distribution . If a class of deterministic policies is of interest, we can inject some randomness into to facilitate exploration. One way to do this is an ε-greedy strategy (Sutton and Barto, 1998), which selects the estimated optimal action with probability 1 − ε and otherwise samples equally from all other actions. Because an ε-greedy strategy can be used to introduce randomness into a deterministic policy, we can assume a class of randomized policies.
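The exploration strategy described above, which keeps the estimated optimal action with high probability and otherwise samples equally from the remaining actions, can be sketched as follows. This is only an illustration: the function names, the integer coding of actions, and the parameter name `epsilon` for the exploration probability are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy_probs(best_action, n_actions, epsilon):
    """Action probabilities that randomize a deterministic policy.

    The estimated optimal action keeps probability 1 - epsilon; the remaining
    epsilon is split equally among all other actions (requires n_actions >= 2).
    """
    probs = np.full(n_actions, epsilon / (n_actions - 1))
    probs[best_action] = 1.0 - epsilon
    return probs

def draw_action(probs, rng):
    """Sample an action index according to the given probabilities."""
    return int(rng.choice(len(probs), p=probs))
```

For example, `draw_action(epsilon_greedy_probs(2, 3, 0.1), np.random.default_rng())` returns action 2 most of the time and each of the other two actions occasionally. Recording these probabilities alongside the drawn action is useful because they play the role of the data-generating policy in later estimation.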
At each time , let , where , , and are as defined in Section 2 and
(5) 
with some initial randomized policy. We note that estimating equation (5) is similar to (3), except that replaces as the data-generating policy. Given the estimator of the value of at time , , the estimated optimal policy at time is . In practice, we may choose to update the policy in batches rather than at every time point. An alternative way to encourage exploration through the action space is to choose for some sequence , where is a measure of uncertainty in . An example of this is upper confidence bound sampling, or UCB (Lai and Robbins, 1985).
In some settings, when the data-generating process may vary across patients, it may be desirable to allow each patient to follow an individualized policy that is estimated using only that patient’s data. Suppose that patients are followed for an initial time points after which the policy is estimated. Then, suppose that patient follows until time , when a policy is estimated using only the states and actions observed for patient . This procedure is then carried out until time for some fixed with each patient following their own individual policy that is adapted to match the individual over time. We may also choose to adapt the randomness of the policy at each estimation. For example, we could select and, following estimation , have patient follow policy with probability and policy with probability . In this way, patients become more likely to follow their own individualized policy and less likely to follow the initial policy over time, reflecting increasing confidence in the individualized policy as more data become available. The same class of policies and model for the state-value function can be used as in Section 2.
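Following one of two policies at random is equivalent to drawing actions from a mixture of their action probabilities, which suggests a simple way to sketch the gradual shift from the initial policy to the individualized one. The weighting schedule below, in which the weight on the individualized policy after the k-th estimation is one minus a constant `delta` raised to the power k, is an assumption chosen for illustration, as are the function and parameter names.

```python
import numpy as np

def individual_weight(k, delta):
    """Hypothetical weight on the individualized policy after the k-th estimation.

    With delta in (0, 1), the weight increases toward one as k grows.
    """
    return 1.0 - delta ** k

def mixed_action_probs(initial_probs, individual_probs, k, delta):
    """Action probabilities when the individualized policy is followed with
    probability individual_weight(k, delta) and the initial policy otherwise."""
    w = individual_weight(k, delta)
    return w * np.asarray(individual_probs) + (1.0 - w) * np.asarray(initial_probs)
```

Under this assumed schedule, a smaller `delta` moves patients onto their individualized policies more quickly, while a value close to one keeps them near the initial randomized policy for longer.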
4 Theoretical results
In this section, we establish asymptotic properties of and for offline estimation. Throughout, we assume that Assumptions 1-3 from Section 2 hold.
Let . Thus, we use the squared Euclidean norm of as the penalty function; we will assume that . For simplicity, we let
be the identity matrix. Assume the working model for the state value function introduced in Section
2, i.e., . For fixed , denote the true by , i.e., . Let so that . Define , where denotes the empirical measure of the observed data. Let be a parametric class of policies and let where .
Our main results are summarized in Theorems 4.2 and 4.3
below. Because each patient trajectory is a stationary Markov chain, we need to use asymptotic theory based on stationary processes; consequently, some of the required technical conditions are more difficult to verify than those for i.i.d. data. Define the bracketing integral for a class of functions,
, by , where the bracketing number for , , is the number of brackets needed such that each element of is contained in at least one bracket (see Chapter 2 of Kosorok, 2008). For any stationary sequence of possibly dependent random variables,
, let be the σ-field generated by and define . We say that the chain is absolutely regular if as (also called β-mixing in Chapter 11 of Kosorok, 2008). We make the following assumptions.
Assumption 4.
There exists a such that

, , and .

The sequence is absolutely regular with .

The bracketing integral of the class of policies, .
Assumption 5.
There exists some such that
for all .
Assumption 6.
The map has a unique and well separated maximum over in the interior of ; let denote the maximizer.
Assumption 7.
The following condition holds: as .
Remark 4.1.
Assumption 4
requires certain finite moments and that the dependence between observations on the same patient vanishes as observations become further apart. In Lemma
7.2 in the Appendix, we verify part 3 of Assumption 4 and Assumption 7 for the class of policies introduced in Section 2. However, note that the theory holds for any class of policies satisfying the given assumptions, not just the class considered here. Assumption 5 is needed to show the existence of a unique uniformly over and Assumption 6 requires that the true optimal decision in each state is unique (see assumption A.8 of Ertefaie, 2014). Assumption 7 requires smoothness on the class of policies.
The main results of this section are stated below. Theorem 4.2 states that there exists a unique solution to uniformly over and that the estimator converges weakly to a mean zero Gaussian process in .
Theorem 4.2.
Under the given assumptions, the following hold.

For all , there exists a such that has a zero at . Moreover, and as .

Let be a tight, mean zero Gaussian process indexed by with covariance where
and
Then, in .

Let be as defined in part 2. Then, in .
Theorem 4.3 below states that the estimated optimal policy converges in probability to the true optimal policy over and that the estimated value of the estimated optimal policy converges to the true value of the estimated optimal policy.
Theorem 4.3.
Under the given assumptions, the following hold.

Let and . Then, .

Let . Then, .

A consistent estimator for is
where
and
Proofs of the above results are in the Appendix along with a result on bracketing entropy that is needed for the proof of Theorem 4.2 and a proof that the class of policies introduced above satisfies the necessary bracketing integral assumption.
5 Simulation experiments
In this section, we examine the performance of V-learning on simulated data. Section 5.2 contains results for offline estimation and Section 5.3 contains results for online estimation. We begin by discussing an existing method for infinite horizon dynamic treatment regimes in Section 5.1.
5.1 Greedy gradient Q-learning
Ertefaie (2014) introduced greedy gradient Q-learning (GGQ) for estimating dynamic treatment regimes in infinite horizon settings (see also Maei et al., 2010; Murphy et al., 2016). Here we briefly discuss this method.
Define . The Bellman optimality equation (Sutton and Barto, 1998) is
(6) 
Let
be a parametric model for
indexed by . In our implementation, we model as a linear function with interactions between all state variables and treatment. The Bellman optimality equation motivates the estimating equation