A treatment regime is a set of decision rules that determines a personalized treatment plan based on evolving treatment and covariate history (Murphy, 2003; Chakraborty and Murphy, 2014; Tsiatis et al., 2019). An optimal treatment regime maximizes the mean of some cumulative measure of patient health across a target population. Thus, there is keen interest in the development of statistical methodology for the estimation of optimal treatment regimes both to inform clinical practice and to generate new hypotheses about heterogeneous treatment effects (Athey and Imbens, 2016; Wager and Athey, 2018). Seminal methods for estimating optimal treatment regimes from observational or randomized studies included g-estimation (Murphy, 2003; Robins, 2004), Q-learning and its variants (Murphy, 2005; Moodie et al., 2007; Henderson et al., 2010; Schulte et al., 2014; Moodie et al., 2014; Song et al., 2015; Taylor et al., 2015; Zhang et al., 2018; Ertefaie, 2019), and inverse probability weighting (Robins, 1999; Murphy et al., 2001; van der Laan, 2006; Robins et al., 2008). More recently, there has been a surge of research on extending these methods to make them more flexible, e.g., through the use of machine learning methods (Zhao et al., 2012; Zhang et al., 2012; Rubin and van der Laan, 2012; Moodie et al., 2014; Zhao et al., 2009, 2015; Laber and Zhao, 2015; Luedtke and van der Laan, 2016; Xu et al., 2016; Wager and Athey, 2018; Tao et al., 2018; Jiang et al., 2019; Luckett et al., 2019; Liu et al., 2019), or to allow them to work with high-dimensional feature spaces or other complex data structures (Lu et al., 2013; McKeague and Qian, 2014; Tian et al., 2014; Song et al., 2015; Ciarleglio et al., 2015, 2016; Laber and Staicu, 2017; Shi et al., 2018; Ertefaie, 2019; Wallace et al., 2019; Shi et al., 2019). While the literature on treatment regimes is rich and growing rapidly, the types of data to which these methods apply are restrictive. Existing methods for finite-time-horizon decision problems require that one be able to align patient treatment decisions in time and that the conditional average treatment effect at each decision point be estimated using either regression or weighting methods. For indefinite-time-horizon problems, existing methods for estimating optimal treatment regimes require that the data-generating distribution have sufficient structure to allow pooling of data over time points and extrapolation to future decisions, e.g., the data-generating model might be assumed to be a contextual bandit or a homogeneous Markov decision process (MDP, Tewari and Murphy, 2017; Ertefaie, 2019; Luckett et al., 2019; Liao et al., 2019). Thus, existing methods do not apply to observational data with frequent and irregularly spaced treatment changes, as patients cannot be properly aligned nor can the data be reasonably assumed to be Markov. For an example of such data, see Figure 1, which displays patient treatment histories for a subset of patients from the STEP-BD observational care pathway.
We propose a method for estimating an optimal treatment regime in the indefinite-time-horizon setting when data are irregularly spaced, contain multiple treatment changes, and cannot be assumed to be Markov. Motivated by the underlying clinical science of bipolar depression and other episodic chronic illnesses, we assume that a patient’s health status is dictated by a latent (unobserved) state and a subset of their observable data; we assume that conditional on current patient information and this latent state, the evolution of a patient’s health status is Markov. Treatment is allowed to affect the transition dynamics of the latent process as well as patient observables. We show that under this model the optimal treatment regime is determined by the so-called information state, which comprises the conditional distribution of the latent state and current patient measurements. We subsequently derive estimators of the optimal treatment regime and establish their asymptotic operating characteristics.
The proposed model is an example of a partially observable MDP (POMDP, Monahan, 1982). POMDPs have been studied extensively in the computer science literature with applications in robotics, scheduling, video games, and wildlife management (see, for example, Kaelbling et al., 1998; Cassandra, 1998; Pineau et al., 2003; Hansen, 1998; Ji et al., 2007; Sutton et al., 2018). The primary contributions of this work include: a theory-driven construction of the latent-process model, the application of POMDPs to episodic chronic mental illness, and the development of valid statistical inference for clinically relevant estimands in this context. The proposed methodology is extensible and could be ported for estimation and inference with optimal treatment regimes in other contexts that have complex treatment and observation patterns, e.g., mobile health.
The remainder of this manuscript is organized as follows. In Section 2, we formally introduce the latent state model and show that the information state is, in some sense, minimally sufficient for the optimal treatment regime. In Section 3, we review estimation of optimal treatment regimes under an MDP model. In Section 4, we derive estimators of the optimal treatment regime based on a data-driven transformation of the observed process which makes it approximately Markov and thus amenable to the methods reviewed in Section 3. In Section 5, we derive the asymptotic distributions of the proposed estimators. In Section 6, we study the finite sample performance of the proposed methods through an extensive suite of simulation experiments. In Section 7, we provide an illustrative application using the observational care pathway of the STEP-BD bipolar disorder study. We provide a concluding discussion in Section 8.
2 Setup and preliminary results
We use uppercase letters to denote random variables and lowercase letters to denote instances of these random variables; a dedicated symbol is used to distinguish definitions from equalities. The observed data are assumed to comprise i.i.d. copies, one per patient, of a trajectory that records the number of clinic visits, the clinic visit times, the treatment assigned during each inter-visit period, and the patient's health status at each visit time. Thus, both the number and timing of clinic visits are treated as random quantities. The patient history available to the decision maker at a given clinic visit, before treatment is assigned, collects all of this information up to that visit.
Let the set of probability distributions over the treatment options be the probability simplex; we encode its elements as vectors whose entries give the probability of selecting each action. A treatment regime in this setting is a sequence of decision rules, one per clinic visit, so that under the regime a patient presenting at a clinic visit receives each treatment recommendation with the probability assigned by the corresponding rule to their history. Whereas the observed data comprise finite patient trajectories, we are interested in estimating treatment regimes that can be applied indefinitely; that is, they can be used to provide treatment recommendations for as long as the patient is receiving care. To this end, we consider treatment regimes composed of decision rules in which summary functions map each patient history to a summary and a stationary decision rule acts on these patient summaries; the composed regime is the composition of the stationary rule with the visit-specific summary functions. We will show below that restricting attention to composed regimes of this type incurs no loss of generality. Furthermore, because the stationary rule remains fixed in this representation, the regime can be vetted for clinical validity by domain experts when the summary functions provide ‘qualitatively similar’ summaries of the patient history (we show that the natural choice of summary function in our domain produces such summaries).
The immediate utility associated with a patient's history and treatment is thus assumed to depend on the history only through its summary; note that the summary function can always be chosen to ensure that this holds.
Expectations are taken with respect to the probability distribution induced by following the treatment recommendations given by the regime (for a formal development using potential outcomes, see the Supplemental Material; see also Tsiatis et al., 2019).
We consider two measures of cumulative utility: (i) discounted mean utility, defined with respect to a discount factor, and (ii) average utility.
These two cumulative measures are used almost exclusively in indefinite decision problems (Powell, 2007; Busoniu et al., 2017; Sutton et al., 2018), though the proposed methods could be extended to hyperbolic discounting or other notions of cumulative utility (Fedus et al., 2019). We write the cumulative utility without a subscript to generically denote either of these measures. A regime is said to be optimal with respect to a cumulative utility measure if it attains the maximal value of that measure over all regimes. Without imposing additional structure on the data-generating model, it is not possible in general to identify an optimal regime from data collected over a finite time horizon, even as the sample size diverges.
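To make the two criteria concrete, here is a minimal numeric sketch for a finite utility trajectory; the utility values and function names are illustrative choices of ours, not part of the formal development:

```python
import numpy as np

def discounted_utility(utilities, gamma):
    """Discounted cumulative utility: sum over t of gamma^t * u_t."""
    t = np.arange(len(utilities))
    return float(np.sum((gamma ** t) * np.asarray(utilities)))

def average_utility(utilities):
    """Average utility over the trajectory: (1/T) * sum over t of u_t."""
    return float(np.mean(utilities))

# Hypothetical utility trajectory (higher is better)
u = [1.0, 0.5, 0.8, 0.9]
print(discounted_utility(u, gamma=0.9))  # 1.0 + 0.45 + 0.648 + 0.6561 ≈ 2.7541
print(average_utility(u))                # ≈ 0.8
```

The discounted criterion front-loads early visits; the average criterion weights all visits equally, which is why the two can rank regimes differently.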
We assume that there exists a (latent) Markov process for all that represents a critical component of a patient’s health status, e.g., in bipolar depression this might represent whether the patient is in a depressive, manic, hypomanic, mixed, or stable episode. Furthermore, we assume that:
so that the process is conditionally Markov given the latent state, i.e., given a summary of a patient’s (observable) history, , their latent health state, and current treatment, the future is independent of the past. The following result shows that the conditional distribution of the latent state given the available history is sufficient for the optimal regime.
Assume (A1) and, for each visit, define the summary to comprise the conditional distribution of the latent state given the available history together with current patient measurements. If the utility depends on the history only through this summary (as noted previously, the utility is typically a function of observables, and the summary can thus always be defined so as to include it; this assumption also allows the utility to be the posterior of some latent patient characteristic given the history and treatment), then:
is a homogeneous MDP, and
The summary is minimally sufficient in that there exist generative models in which any further reduction of the history leads to degradation in the value of the learned strategy. See the Supplemental Material for a precise statement and example.
Lemma 2.1 establishes that we can characterize the optimal regime in terms of the MDP of the summary process, which admits a stationary optimal regime. Thus, the structure provided by the MDP reduces the problem of estimating an optimal treatment regime from a search over the space of countable sequences of functions, each acting on a different domain, to a search for a single function mapping summaries into treatment distributions (this is why, hereafter, we reference regimes using the unbolded notation; see Sutton et al., 2018, for additional discussion of the structure induced by MDPs). Were trajectories from this MDP observed, an estimated optimal regime could be obtained by solving estimating equations based on the Bellman optimality conditions; we review these estimating equations in the next section. As the states are not fully observed, we first construct estimators of the summaries and then plug them into the MDP estimating equations. Asymptotic results for estimators of this type are provided in Section 5.
3 Optimal treatment regimes in an MDP
Recall that our approach is to transform the observed data so that it mimics data collected under the homogeneous MDP of Lemma 2.1. To illustrate how this transformed data will be used, we briefly review two established methods for estimating an optimal treatment regime in an MDP. Our developments closely follow Murphy et al. (2016), Ertefaie (2019), and Luckett et al. (2019); for a more general treatment of MDPs see Sutton et al. (2018) and Wiering and Van Otterlo (2012). For the purpose of describing these methods, assume that the observed data consist of independent trajectories from a homogeneous MDP. We assume that the states take values in a common state space, that there is a finite number of treatment options, and that the utilities are coded so that higher values are better. We present estimating equations for the optimal treatment regime under both the discounted and average utility criteria. Technical conditions needed for unbiasedness of these estimating equations and asymptotic normality of the resultant estimators applied to the transformed data are provided in Section 5.
For any regime and state, define the discounted state-value function, which does not depend on the visit index because the MDP is assumed to be homogeneous (Puterman, 2014). For any distribution on states, termed a reference distribution, define the integrated state-value function as the average of the state-value function under that distribution; the natural choice of reference distribution is the initial state distribution. Because the initial state distribution is unknown, one might take the empirical distribution of initial states or some other reference distribution constructed from historical data (see Luckett et al., 2019). Define the optimal regime as the maximizer of the integrated state-value function over a class of regimes of interest. In the discounted utility case it can be shown (e.g., Luckett et al., 2019) that the state-value function satisfies the following recursion
for all and any , where the ratio is an importance sampling weight. Let be a parametric class of continuously differentiable maps from into ; we have overloaded the notation to reflect that each regime will be associated with a corresponding parameter vector . For each , define to be the solution to (1) at . An estimator of is given by the solution of the sample analogue of (1) with , i.e., the solution to
where denotes the empirical measure. The estimated optimal regime is obtained by maximizing the estimated integrated state-value function over the class of regimes.
Properties of this estimator, applied to data from a homogeneous MDP, are provided in Luckett et al. (2019). We have assumed for simplicity that the treatment propensities were known, as when the data come from a randomized clinical trial; if the propensities were unknown, they could be estimated from the observed data, e.g., using a multinomial logistic regression (see also Jiang and Li, 2015; Thomas and Brunskill, 2016; Hanna et al., 2018, for related ideas and discussion).
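Under the linear working model for the state-value function, the sample estimating equation (1) is, for a fixed regime, linear in the value parameters and can be solved directly. The following is a sketch in the spirit of this V-learning recursion; the function name, the ridge stabilizer, and the toy data are our illustrative choices, not the paper's implementation:

```python
import numpy as np

def v_learning_theta(phi, phi_next, utilities, weights, gamma, ridge=1e-8):
    """Solve the empirical estimating equation
        E_n[ w * (u + gamma * theta'phi(s') - theta'phi(s)) * phi(s) ] = 0
    for theta under the linear working model V(s) = theta'phi(s).

    phi, phi_next : (n, p) basis features at the current and next state
    utilities     : (n,) observed utilities
    weights       : (n,) importance weights pi(a | s) / mu(a | s)
    """
    n = len(utilities)
    w = weights[:, None]
    A = (w * phi).T @ (phi - gamma * phi_next) / n   # p x p moment matrix
    b = (w * phi).T @ utilities / n                  # weighted-utility moments
    return np.linalg.solve(A + ridge * np.eye(A.shape[0]), b)

# Sanity check on a degenerate chain: constant state and unit utility, so the
# discounted value must equal 1 / (1 - gamma) = 10 when gamma = 0.9.
n = 100
phi = np.ones((n, 1))
theta = v_learning_theta(phi, phi, np.ones(n), np.ones(n), gamma=0.9)
print(theta)  # ≈ [10.]
```

With nonparametric or richer bases, the same system is solved once per candidate regime, and the estimated integrated value is then maximized over regimes.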
An estimating equation for the average utility setting is derived using a similar strategy to the discounted case. For each define the differential value
for all and any . Let be a class of continuously differentiable maps from into . An estimator is obtained by jointly solving the sample analog of (3) for the differential-value parameters and the average utility.
The estimated optimal regime is thus given by the maximizer of the estimated average utility over the class of regimes.
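For the average utility criterion, the analogous sample system is linear jointly in the differential-value parameters and the average utility. A sketch under a normalization of our own choosing to pin down the level of the differential value (the paper's exact normalization is not shown here):

```python
import numpy as np

def differential_value_fit(phi, phi_next, utilities, weights):
    """Jointly solve, for (theta, eta), the empirical analog of
        E_n[ w * (u - eta + theta'phi(s') - theta'phi(s)) * phi(s) ] = 0,
    augmented with the normalization E_n[ theta'phi(s) ] = 0 (our choice),
    where h(s) = theta'phi(s) is the differential value and eta estimates
    the long-run average utility under the regime."""
    n, p = phi.shape
    w = weights[:, None]
    A = np.zeros((p + 1, p + 1))
    b = np.zeros(p + 1)
    A[:p, :p] = (w * phi).T @ (phi_next - phi) / n   # coefficients on theta
    A[:p, p] = -(w * phi).mean(axis=0)               # coefficient on eta
    b[:p] = -(w * phi).T @ utilities / n
    A[p, :p] = phi.mean(axis=0)                      # normalization row
    sol = np.linalg.solve(A, b)
    return sol[:p], sol[p]

# Degenerate sanity check: constant state and unit utility, so eta must be 1
# and the differential value is identically zero.
n = 100
phi = np.ones((n, 1))
theta, eta = differential_value_fit(phi, phi, np.ones(n), np.ones(n))
print(theta, eta)  # ≈ [0.] 1.0
```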
The remainder of this manuscript is focused on constructing the transformed process and examining the theoretical and empirical properties of the foregoing two estimators when applied to the transformed data. However, these are but two of many possible methods for estimating an optimal regime with MDPs; these were chosen because they have been used previously in clinical applications and, furthermore, are simple, extensible, and amenable to statistical inference (for alternative approaches see Szepesvári, 2010; Powell, 2007; Sutton et al., 2018, and references therein).
4 Estimation of sufficient summary functions
Recall that the sufficient summary functions are given by for . As is observed, constructing an estimator of is tantamount to constructing an estimator of , the conditional distribution of the latent state given history . We develop an estimator of under the assumption that the observables, , evolve under a latent-state-dependent autoregressive process. This choice is motivated by the clinical theory underpinning bipolar disorder as well as its robustness and utility in modeling chronic illness (for additional discussion on time series and mechanistic models for bipolar disorder, see Daugherty et al., 2009; Bonsall et al., 2011; Moore et al., 2012, 2014; Bonsall et al., 2015; Holmes et al., 2016, and references therein).
We assume that the latent state follows a homogeneous Markov process, the dynamics of which are described by the transition rate matrix for each , where
from which it can be seen that for (see Liu et al., 2015). The transition rate matrix, also known as the infinitesimal generator (e.g., Pyke, 1961a, b; Albert, 1962), induces the following transition probabilities
for and . To identify the transition rate matrix, we need to link the observed data to the latent state process. This link is established by the following assumption:
for all .
Furthermore, we posit parametric models for the dynamics of the observed data and assume that these models have densities of the following form: a density for the initial observation given the latent state, indexed by one parameter vector, and a density for each subsequent observation given its predecessor and the current latent state, indexed by another. For example, a Gaussian autoregressive model with linear mean models takes the form:
where the mean and variance parameters are unknown. We use this model in our simulation experiments and in the application to data from the STEP-BD study.
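The Gaussian autoregressive emission density can be written down directly; a brief sketch with hypothetical parameter values (the state-specific intercept, slope, and standard deviation below are not fitted values):

```python
import numpy as np

def gaussian_ar_density(y, y_prev, alpha, beta, sigma):
    """Density of Y_j given Y_{j-1} and latent state s under the linear Gaussian
    autoregressive model Y_j = alpha_s + beta_s * Y_{j-1} + eps, eps ~ N(0, sigma_s^2)."""
    mu = alpha + beta * y_prev
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Evaluate the new observation's density under two hypothetical latent states.
lik = np.array([gaussian_ar_density(1.2, 1.0, alpha=0.0, beta=1.0, sigma=0.5),
                gaussian_ar_density(1.2, 1.0, alpha=-1.0, beta=1.0, sigma=0.5)])
print(lik)  # the first state explains the observation far better
```

These per-state densities are exactly the emission likelihoods consumed by the forward-backward recursions.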
Let denote the unknown parameters indexing the latent Markov process. It can be seen that is determined by and , i.e., where is a deterministic map from into , the -dimensional probability simplex. We construct an estimator via maximum likelihood implemented using the forward-backward algorithm (for a review see Rabiner, 1989) and subsequently compute the plug-in estimator so that . The preceding estimator is used to convert i.i.d. trajectories of the form for into trajectories drawn from an (approximate) homogeneous MDP for , which can then be used with the estimators of an optimal regime described in the previous section.
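The conversion of each trajectory into (approximate) MDP form rests on the forward pass of the forward-backward algorithm, which alternates propagating the latent-state distribution over each inter-visit gap with conditioning on the new observation. A sketch that takes the per-gap transition matrices and per-visit emission likelihoods as inputs (all numbers below are made up):

```python
import numpy as np

def filtered_state_probs(init, trans_list, lik_matrix):
    """Forward pass returning the filtered latent-state distribution at each visit.

    init       : (k,) initial latent-state distribution
    trans_list : list of (k, k) transition matrices, one per gap between visits
                 (for irregular gaps dt_j these would be expm(Q * dt_j))
    lik_matrix : (T, k) emission density of each visit's observation under each state
    """
    b = init * lik_matrix[0]
    b = b / b.sum()                   # condition on the first observation
    beliefs = [b]
    for P, lik in zip(trans_list, lik_matrix[1:]):
        b = (P.T @ b) * lik           # propagate over the gap, then condition
        b = b / b.sum()
        beliefs.append(b)
    return np.array(beliefs)

# Two visits, two latent states, identity dynamics: the filtered distribution
# simply tracks the (normalized) emission likelihoods.
beliefs = filtered_state_probs(
    np.array([0.5, 0.5]),
    [np.eye(2)],
    np.array([[0.5, 0.5],
              [0.9, 0.1]]))
print(beliefs)
```

Stacking each visit's filtered distribution with the concurrent observables yields the transformed state used in the MDP estimating equations of Section 3.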
5 Theoretical properties
We establish consistency and asymptotic normality for the estimated optimal regime constructed by solving estimating equations as described in the preceding section. For simplicity, in Sections 5.2 and 5.3 we assume that the class of regimes is finite. However, this assumption is not limiting: one can approximate any separable collection of regimes by a finite mesh so that every regime in the collection is within an arbitrarily small distance of the mesh. We illustrate this approach in Section 5.4 in the derivation of confidence sets for the value of the optimal regime within a parametric class of regimes (see Zhang et al., 2018, for additional discussion).
5.1 Consistency of the estimated state probabilities
Consistency of the estimated latent state distribution is central to characterizing the large sample behavior of estimators of the optimal treatment regime constructed by solving the MDP estimating equations of Section 3. Consistency follows from existing results on maximum likelihood for latent Markov models and the continuous mapping theorem. We make the following assumptions.
Both the time process and the number of time points are independent of the latent process .
The true parameter vector is an interior point of , where is a compact subset of .
For all , .
There exist measures on which are bounded away from zero with
The transition kernel is continuous in in an open neighborhood of .
For any , for some integrable function , .
The preceding assumptions are relatively mild and standard in hidden Markov models. Assumption (B1) ensures that the distribution of the visit times factors out of the likelihood, i.e., the time process and latent process do not share parameters. Assumptions (B2)-(B8) ensure that the model is well-defined and that the maximum likelihood estimators are regular (Leroux, 1992; Bickel et al., 1998; Jensen and Petersen, 1999; Le Gland and Mevel, 2000; Douc and Matias, 2001; Douc et al., 2004). Consistency and asymptotic normality of the maximum likelihood estimators in general autoregressive hidden Markov models have been established under the preceding conditions (Douc et al., 2004). Moreover, in the Supplementary Material we show that Assumption (B9) holds for the Gaussian autoregressive hidden Markov model; this ensures the relevant class is Donsker, and consistency of the estimated state probabilities follows immediately.
Assume (A1) - (A2) and (B1) - (B8). Then as :
for each . Furthermore, if (B9) holds, then for each fixed , as ,
5.2 Asymptotic properties in the discounted utility setting
We consider linear working models for the state-value function , where
is a finite-dimensional set of basis functions; these basis functions might comprise custom features informed by domain expertise as well as nonlinear expansions such as B-splines or radial basis functions. Using this functional form, the population-level estimating equation for the state value-function, i.e., (1) from Section 3, is given by
let denote the solution to . The sample analog using the estimated states is thus
where ; let denote a solution to . For linear estimators such a root always exists; however, below we require only the weaker condition that is an approximate root. Let denote the Frobenius norm. We make the following assumptions.
For each , solves , where is an interior point of and is a compact subset of .
For each , there exists a sequence of such that .
Define , which attains its supremum at .
There exists a sequence of such that
There exists a constant , such that
for all and .
is uniformly continuous, where , is compact, and is finite almost surely. Furthermore, for some and all .
For each , , , , for some .
For each , , define
There exists a linear operator such that and, for all and , the following expansion holds
These conditions are standard for Z-estimators (van der Vaart and Wellner, 1996; Kosorok, 2008). Conditions (C1), (C2), (C6), and (C7) are used to establish the consistency of , while (C5) and (C8) are additionally used to establish asymptotic normality. A sufficient condition for (C8) is that is almost everywhere differentiable in , in which case can be chosen to be the gradient operator. We use (C3) and (C4) to show , which is a weaker but more general result than . Convergence of generally requires that is a unique and well-separated maximizer of , which need not hold for some commonly used classes of regimes (see Zhang et al., 2018).
Assume (A1) - (A2), (C1) - (C7), and that is finite. Then as :
for any fixed regime , ;
To define the limiting distribution of the estimated optimal value we make use of the following quantities:
Assume (A1) - (A2), (C1) - (C8), and that is finite. The following results hold as
where is a mean zero Gaussian process indexed by with covariance