Precision medicine is an approach to healthcare that involves tailoring treatment based on individual patient characteristics (Hamburg and Collins, 2010; Collins and Varmus, 2015). Accounting for heterogeneity by tailoring treatment has the potential to improve patient outcomes in many therapeutic areas. An individualized treatment rule formalizes precision medicine as a map from the space of patient covariates into the space of allowable treatments (Murphy, 2003; Robins, 2004). Almost all methods for estimating individualized treatment rules have been designed to optimize a scalar outcome (exceptions will be discussed shortly). However, in practice, clinical decision making often requires balancing trade-offs between multiple outcomes. For example, clinicians treating patients with bipolar disorder must manage both depression and mania. Antidepressants may help correct depressive episodes but may also induce manic episodes (Sachs et al., 2007; Ghaemi, 2008; Goldberg, 2008). We propose a novel method for using observational data to estimate a composite outcome and the associated optimal individualized treatment rule.
The estimation of optimal individualized treatment rules has been studied extensively leading to a wide range of estimators. These estimators include: regression-based methods like -learning (Murphy, 2005; Qian and Murphy, 2011; Schulte et al., 2014; Laber et al., 2014), -learning (Murphy, 2003; Robins, 2004; Blatt et al., 2004; Moodie et al., 2007; Wallace and Moodie, 2015), and regret regression (Henderson et al., 2010); direct-search methods (Zhang et al., 2012; Zhao et al., 2012; Zhang et al., 2013; Zhou et al., 2017)
based on inverse probability weighting(Robins, 1999; Murphy et al., 2001; van der Laan and Petersen, 2007; Robins et al., 2008); and hybrid methods (Taylor et al., 2015; Zhang et al., 2017). The preceding methods require specification of a single scalar outcome that will be used to define an optimal regime; were individual patient utilities known, then they could be used as the outcome in any of these methods. However, in general such utilities are not known though they can be elicited provided a high-quality instrument is available (Butler et al., 2017); in the absence of such an instrument, preference elicitation is difficult to apply.
We propose a new paradigm for estimating optimal individualized treatment rules from observational data without eliciting patient preferences. The key premise is that clinicians are attempting to act optimally with respect to each patient’s utility and thus the observed treatment decisions contain information about individual patient utilities. We construct estimators of individual patient utilities which do not require that clinicians are acting optimally, only that they approximately follow an optimal policy. This approach allows us to describe the goals of the decision maker and how these goals vary across patients, determine what makes a patient more or less likely to be treated optimally under standard care, and estimate the decision rule which optimizes patient-specific composite outcomes. We develop this approach in the context of a single-stage, binary decision in the presence of two outcomes.
Other methods for estimating individualized treatment rules in the presence of multiple outcomes include using an expert-derived composite outcome for all patients (Thall et al., 2002; Thall et al., 2007). However, this does not account for differences in the utility function across patients and in some cases it may not be possible to elicit a high-quality composite outcome from an expert. Alternatively, multiple outcomes can be incorporated using set-valued treatment regimes (Laber et al., 2014; Lizotte and Laber, 2016; Wu, 2016), constrained optimization (Linn et al., 2015; Laber et al., 2016), or inverse preference elicitation (Lizotte et al., 2012). Schnell et al. (2017) extend methods for estimating the benefiting subgroup to the case of multiple outcomes using the concept of admissibility (see also Schnell et al., 2016). However, none of these approaches provide a method for estimating an individual patient’s utility.
This work is closely related to inverse reinforcement learning (Kalman, 1964; Ng et al., 2000; Abbeel and Ng, 2004; Ratliff et al., 2006), which involves studying decisions made by an expert and constructing the utility function that is optimized by the expert’s decisions. Inverse reinforcement learning has been successfully applied in navigation (Ziebart et al., 2008) and human locomotion (Mombaur et al., 2009). Inverse reinforcement learning methods assume that decisions are made in a single environment. However, in the context of precision medicine, both the utility function and the probability of optimal treatment may vary across patients. Our approach is a version of inverse reinforcement learning with multiple environments.
In Section 2, we introduce a pseudo-likelihood method to estimate patient utility functions from observational data. In Section 3, we state a number of theoretical results pertaining to the proposed method, including consistency and inference for the maximum pseudo-likelihood estimators. Section 4 present a series of simulation experiments used to evaluate the finite sample performance of the proposed methods. Section 5 presents an illustrative application using data from the STEP-BD bipolar disorder study. Conclusions and a discussion of future research are given in Section 6. Proofs are given in the appendix.
2 Pseudo-likelihood estimation of utility functions
Assume the available data are , , which comprise independent and identically distributed copies of , where are patient covariates, is a binary treatment, and and are two real-valued outcomes for which higher values are more desirable. An individualized treatment rule is a function such that under a patient presenting with covariates will be assigned to treatment . Let denote the potential outcome under treatment and for any regime define . An optimal regime for the outcome , say , satisfies for any other regime . The optimal regime for the outcome , say , is defined analogously. In order to identify these optimal regimes, and subsequently to identify the optimal regime across the class of utility functions introduced below, we make the following assumptions.
Consistency, and .
Positivity, for some constant and all pairs .
Ignorability, and .
These assumptions are standard in causal inference (Robins, 2004; Hernan and Robins, 2010). Assumption 3 is not empirically verifiable in observational studies (Rosenbaum and Rubin, 1983; Rosenbaum, 1984).
Define , then under the preceding assumptions it can be shown that (Zhang et al., 2012). Similarly, it follows that where . In general, need not equal ; therefore, if both and are clinically relevant, neither nor may be acceptable. We assume that there exists an unknown and possibly covariate-dependent utility , where measures the “goodness” of the outcome pair . The optimal regime with respect to , say , satisfies for any other regime . The goal is to use the observed data to estimate the utility and subsequently . Define . For the class of utility functions we consider below, is (possibly covariate-dependent) convex combination of and and is therefore identifiable under the stated causal assumptions and furthermore .
We assume that clinicians act with the goal of optimizing each patient’s utility and that their success in identifying the optimal treatment depends on individual patient characteristics. Therefore, that the clinicians are approximately, i.e., imperfectly, assigning treatment according to . If the clinician were always able to correctly identify the optimal treatment and assign for each patient, there would be no need to estimate the optimal treatment policy (Wallace et al., 2016). Instead, we assume that the decisions of the clinician are imperfect and that where is an unknown parameter. We implicitly assume throughout that may contain higher order terms, interactions, or basis functions constructed from the covariates.
2.1 Fixed utility
We begin by assuming that the utility function is constant across patients and takes the form for some . Lemma 1 of Butler et al. (2017) states that, for a broad class of utility functions, the optimal individualized treatment rule is equivalent to the optimal rule for a utility function of this form. Define and define . Let and denote estimators of and obtained from regression models fit to the observed data (Qian and Murphy, 2011). For a fixed value of , let and subsequently let be the plug-in estimator of . Given and , can be computed for each .
The joint distribution ofis
Assuming that and do not depend on or , the likelihood for is
which depends on the unknown function . Plugging in for into (1) yields the pseudo-likelihood
If we let and denote the maximum pseudo-likelihood estimators obtained by maximizing (2), then an estimator of the utility function is and is an estimator of the probability that a patient presenting with covariates would be treated optimally under standard care. An estimator of the optimal policy at is .
Because the pseudo-likelihood given in (2) is non-smooth in , standard gradient-based optimization algorithms cannot be used. However, for a given , it is straightforward to compute the profile estimator . We can compute the profile pseudo-likelihood estimator over a grid of values for and select the point on the grid yielding the largest pseudo-likelihood. The algorithm to construct is as follows.
Set a grid
For each : compute
) can be accomplished using logistic regression. The theoretical properties of this estimator are discussed in Section3.
2.2 Patient-specific utility
Outcome preferences can vary widely across patients in some application domains including schizophrenia (Kinter, 2009; Strauss et al., 2010) and pain management (Gan et al., 2004). To accommodate this setting, we assume that the utility function takes the form where is a smooth function. For illustration, we let where is an unknown parameter. Define and define . Let and denote estimators of and obtained from regression models fit to the observed data. For a fixed value of , let and subsequently let be the plug-in estimator of . Assume that decisions are made according to the model . We compute the estimators of by maximizing the pseudo-likelihood
An estimator for the utility function is and an estimator for the optimal decision function is .
As before, the pseudo-likelihood given in (3) is non-smooth in and standard gradient-based optimization methods cannot be used. It is again straightforward to compute the profile pseudo-likelihood estimator for any . However, because it is computationally infeasible to compute for all on a grid if is of moderate dimension, we generate a random walk through the parameter space using the Metropolis algorithm as implemented in the metrop function in the R package mcmc (Geyer and Johnson, 2017) and compute the profile pseudo-likelihood for each on the random walk. Let . We can compute by estimating using logistic regression as described in Section 2.1. The algorithm to construct a random walk through the parameter space is as follows.
Set a chain length, , fix , and initialize to a starting value in
Generate ; if , set ; otherwise, set .
After generating a chain , we select the that leads to the largest value of
as the maximum pseudo-likelihood estimator. Standard practice is to choose the variance of the proposal distribution,, so that the acceptance proportion is between 0.25 and 0.5 (Geyer and Johnson, 2017).
3 Theoretical results
Here we state a number of theoretical results pertaining to the proposed pseudo-likelihood estimation method for utility functions. We state results for a patient-specific utility function; the setting where the utility function is fixed is a special case. All proofs are deferred to the appendix.
We assume that and that the true utility function is , where has bounded continuous derivative on compact sets and almost surely implies , i.e., the model introduced in Section 2.2 is well-defined and correctly specified with true parameters and . We further assume that the estimators and
are pointwise consistent for all ordered pairs. Along with assumptions 1-3, we implicitly assume that the densities and exist. The following result states the consistency of the maximum pseudo-likelihood estimators for the utility function and the probability of optimal treatment. The proof involves verifying the conditions of Theorem 2.12 of Kosorok (2008).
Theorem 3.1 (Consistency with patient-specific utility).
Let the maximum pseudo-likelihood estimators be as in Section 2.2, . Assume that is a compact set with and that . Then, and as .
Let be the mean composite outcome in a population where decisions are made according to . The following result establishes the consistency of the value of the estimated optimal policy. The proof uses general theory developed by Qian and Murphy (2011).
Theorem 3.2 (Value consistency with patient-specific utility).
Let be the maximum pseudo-likelihood estimator for and let be the associated estimated optimal policy. Then, under the given assumptions, as .
Next, we derive the convergence rate and asymptotic distribution of . Assume that is a bounded subset of and let be the sup norm over , i.e., for , . Let . Assume that and that . Define , , and . Similarly, define , , and . Let and . Note that and . Let , , , and . We use the following regularity conditions.
There exist independent and identically distributed influence vectors
There exist independent and identically distributed influence vectors, and , and vector basis functions and such that both
Let , , , and . Furthermore, assume that , , , and are bounded by some . Let be positive definite and finite, where .
The following conditions hold.
The random variablehas a continuous density function in a neighborhood of 0 with ;
The conditional distribution of given that converges to a non-degenerate distribution as ;
There exist such that
where is the -dimensional unit sphere.
Define, for , , and ,
Assume that has a unique, finite minimum over for all .
Let be the maximum pseudo-likelihood estimators given in Section 2.2. The following theorem states the asymptotic distribution of .
Theorem 3.3 (Asymptotic distribution).
Under the given regularity conditions
where , and .
Let denote convergence in probability over , as defined in Section 2.2.3 and Chapter 10 of Kosorok (2008). Theorem 3.4 below establishes the validity of a parametric bootstrap procedure for approximating the sampling distribution of .
Theorem 3.4 (Parametric bootstrap).
4 Simulation experiments
4.1 Fixed utility simulations
To examine the finite sample performance of the proposed methods, we begin with the following simple generative model. Let
be a vector of independent normal random variables with mean 0 and standard deviation 0.5. Let treatment be assigned according to, i.e., the probability that the clinician correctly identifies the optimal treatment is constant across patients. Let and be independent normal random variables with mean 0 and standard deviation 0.5 and let and . We estimated and using linear models, implemented the proposed method for a variety of , , and values, and examined , , and , across 500 Monte Carlo replications.
Table 1 contains mean estimates of and across replications along with their associated root mean squared errors (RMSE), and estimated error rate defined as the proportion of time the estimated optimal policy does not recommend the true optimal treatment.
|100||0.25||0.6||0.321 (0.2413)||0.613 (0.0654)||0.114|
|0.8||0.252 (0.0574)||0.803 (0.0386)||0.028|
|0.75||0.6||0.656 (0.2614)||0.610 (0.0718)||0.121|
|0.8||0.752 (0.0519)||0.805 (0.0410)||0.026|
|200||0.25||0.6||0.272 (0.1618)||0.613 (0.0404)||0.069|
|0.8||0.250 (0.0273)||0.802 (0.0267)||0.014|
|0.75||0.6||0.719 (0.1778)||0.610 (0.0441)||0.076|
|0.8||0.751 (0.0325)||0.800 (0.0280)||0.016|
|300||0.25||0.6||0.264 (0.1356)||0.610 (0.0300)||0.056|
|0.8||0.250 (0.0202)||0.802 (0.0214)||0.011|
|0.75||0.6||0.736 (0.1328)||0.608 (0.0308)||0.057|
|0.8||0.749 (0.0198)||0.803 (0.0231)||0.011|
|500||0.25||0.6||0.249 (0.0823)||0.607 (0.0221)||0.034|
|0.8||0.250 (0.0151)||0.800 (0.0176)||0.008|
|0.75||0.6||0.750 (0.0855)||0.606 (0.0208)||0.035|
|0.8||0.751 (0.0127)||0.800 (0.0187)||0.007|
The pseudo-likelihood method performs well at estimating both and , with estimation improving with larger sample sizes as expected. Table 2 contains estimated values of the true optimal policy, a policy where the utility function is known and the -functions are estimated, a policy where both the utility and the -functions are estimated (the proposed method), a policy where the -functions are estimated and the utility is misspecified, and the standard of care. The policy with misspecified utility assumes , which corresponds to Q-learning for optimizing , while ignoring . The value of the standard of care is the mean composite outcome under the generative model. For each policy, the value is estimated by generating a testing sample of size 500 with treatment assigned according to the policy and averaging utilities (calculated using the true ) in the testing set.
|Optimal||Known||Estimated||Misspecified||Standard of care|
The column labeled “estimated ” refers to the proposed method. We see that the proposed method produces values which increase with and generally come close to those of the policy where is known. In all settings, the proposed method offers significant improvement over the standard of care. The proposed method also offers significant improvement over the policy where the utility is misspecified.
To further examine the performance of the proposed method, we allow the probability of optimal treatment to depend on patient covariates. Let . This corresponds to the case where . Let , , and be generated as described above. In this generative model, the probability that a patient is treated optimally in standard care is larger for patients with positive values of and smaller for patients with negative values of . We applied the proposed method to 500 replications of this generative model for various and . Table 3 contains mean estimates of along with associated RMSE of , RMSE of , and the error rate.
|RMSE of||Error rate|
Estimation of the observational policy (as defined by ) improves with larger sample sizes, with the largest improvement coming from an increase from 100 patients to 200. The probability that the estimated policy assigns the optimal treatment also increases with the sample size. The true value of does not affect estimation of or .
Table 4 contains estimated values of the true optimal policy, a policy where the utility function is known and the -functions are estimated, a policy where both the utility and the -functions are estimated (the proposed method), a policy where the -functions are estimated and the utility is misspecified, and the standard of care. Values are estimated from independent testing sets of size 500 as described above. The policy with misspecified utility assumes . The value under the standard of care is the mean composite outcome under the generative model.
|Optimal||Known||Estimated||Misspecified||Standard of care|
The proposed method (found in the column labeled “estimated ”) produces values that are close to those of the policy where the utility function is known in large samples and a significant improvement over standard of care in small to moderate samples. Again, the proposed method offers significant improvement over the policy where the utility is misspecified. We note that value under the standard of care differs across . When is close to 1, the composite outcome places more weight on , for which the magnitude of the association with is larger. Because patients with larger values of are more likely to be treated optimally in this generative model, the standard of care produces larger composite outcomes when is closer to 1.
4.2 Patient-specific utility simulations
Next, we examine the case where the utility function is allowed to vary across patients. Let , , and be generated as above. Again, assume that , i.e., . Consider the composite outcome , where , i.e., . We implemented the proposed method for various and examined estimation of and
across 500 replications. Each replication is based on a simulated Markov chain of length 10,000 as described in Section2.2. Results are summarized in Table 5.
|RMSE of||RMSE of||Error rate|
Larger sample sizes produce marginal decreases in the RMSE of . Despite the large RMSE’s of , the estimated policy assigns the true optimal treatment more than 80% of the time for all sample sizes and the error rate decreases as the sample size increases. Table 6 contains estimated values of the true optimal policy, the policy estimated using the proposed method, and standard of care.
|Optimal||Estimated||Standard of care|
The proposed method produces policies that achieve close to optimal value in large samples and significant improvement over the standard of care in small to moderate samples.
Finally, we examine the performance of the parametric bootstrap as described in Section 3. Let be a bivariate vector of normal random variables with mean 0, standard deviation 0.5, and correlation zero. Let and be generated as above and let where the first element of is an intercept. We consider two values of , , and . Let be the vector
with the first element removed. We are interested in testing the null hypothesis, which corresponds to a test for heterogeneity of patient preferences. Table 7 below contains estimated power across 500 Monte Carlo replications under the null hypothesis, , and the alternative hypothesis, , at level for various sample sizes. Each replication is based on 1000 bootstrap samples.
|Power under null||Power under alternative|
The proposed bootstrap procedure demonstrated type I error rates near nominal levels under the null and high power in large samples under the alternative.
5 Case study: The STEP-BD standard care pathway
The Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) was a landmark study of the effects of antidepressants in patients with bipolar disorder (Sachs et al., 2007). In addition to a randomized trial assessing outcomes for patients given an antidepressant or placebo, the STEP-BD study also included a large-scale observational study, the standard care pathway. We apply the proposed method to data from the STEP-BD standard care pathway to estimate decision rules for the use of antidepressants in patients with bipolar disorder.
Although bipolar disorder is characterized by alternating episodes of depression and mania, recurrent depression is the leading cause of impairment among patients with bipolar disorder (Judd et al., 2002). However, the use of antidepressants has not become standard care in bipolar disorder due to the risk of antidepressants inducing manic episodes in certain patients (Ghaemi, 2008; Goldberg, 2008). Thus, the clinical decision in the treatment of bipolar disorder is whether to prescribe antidepressants to a specific patient in order to balance trade-offs between symptoms of depression, symptoms of mania, and other side effects of treatment.
We use the SUM-D score for depression symptoms and the SUM-M score for mania symptoms as outcomes. We consider a patient treated if they took any one of ten antidepressants that appear in the STEP-BD standard care pathway (Deseryl, Serzone, Citalopram, Escitalopram Oxalate, Prozac, Fluvoxamine, Paroxetine, Zoloft, Venlafaxine, or Bupropion). Covariates were screened using linear models and included if there was an interaction with treatment for either outcome. Covariates used for this analysis were age and substance abuse history (yes/no). Figure 1 contains box plots of SUM-D scores on the log scale by substance abuse and treatment. Figure 2 contains box plots of SUM-M scores on the log scale by substance abuse and treatment. For both outcomes, lower scores are more desirable.
Figure 1 indicates that those without a history of substance abuse benefit from treatment with antidepressants. However, among those with a history of substance abuse, patients treated with antidepressants appear to have worse symptoms of depression. Figure 2 indicates that treatment has no effect on symptoms of mania among those without a history of substance abuse. However, among those with a history of substance abuse, it appears that treatment may be inducing manic episodes. Thus, a sensible treatment policy would be one that tends to prescribe antidepressants only to patients without a history of substance abuse.
We analyzed these data using the proposed method for optimizing composite outcomes. Results are summarized in Table 8 below. We estimated policies where both utility and probability of optimal treatment are fixed (fixed-fixed), where utility is fixed but probability of optimal treatment is assumed to vary between patients (fixed-variable) and where both utility and probability of optimal treatment are assumed to vary between patients (variable-variable). For both the fixed-variable policy and the variable-variable policy, we report in place of and for the variable-variable policy, we report in place of . Thus, for parameters that are assumed to vary across patients, Table 8 contains the mean estimate in the sample. Mean outcomes and value functions are averaged over five fold cross validation. For both SUM-D and SUM-M, lower scores are preferred. Value is reported as the percent improvement over standard of care, calculated using the estimated utility function. Large percent improvements in value are preferred.
|Policy||SUM-D||SUM-M||Value (% improvement)|
|standard of care||2.480||0.868||0.0%|
All estimated policies produce more desirable SUM-D scores and SUM-M scores compared to standard of care and improved value according to the estimated utility. Allowing the probability of optimal treatment to vary between patients leads to further improvements in value, as does allowing the utility function to vary between patients. All policies produce similar estimates for the probability of optimal treatment averaged across patients.
The resulting decision rules can be written as the sign of a linear combination of the covariates. As an example, the fixed-fixed policy assigns treatment with antidepressants when is equal to 1. The negative coefficient for substance abuse means that a history of substance abuse indicates that a patient should not be prescribed antidepressants. Prior research has shown that patients with a history of substance abuse are more likely to abuse antidepressants (Evans and Sullivan, 2014). This may contribute to the poor outcomes experienced by STEP-BD patients with a history of substance abuse who were treated with antidepressants. A test for preference heterogeneity based on 10,000 bootstrap replications generated according to Theorem 3.4 yielded a p-value of .
As a secondary analysis, we use the SUM-D score and a side effect score as outcomes. Eight side effects were recorded in the STEP-BD standard care pathway (tremors, dry mouth, sedation, constipation, diarrhea, headache, poor memory, sexual dysfunction, and increased appetite). Patients rated the severity of each side effect from 0 to 4 with larger values indicating more severe side effects. We took the mean score across side effects as the second outcome. Covariates were screened as in the above analyses. Patients in the standard care pathway were asked to report the percent of days over the past week that they experienced mood elevation, irritability, and anxiety; these were included as covariates along with age and history of substance abuse. Results are summarized in Table 9, reported analogously to those in Table 8.
|Policy||SUM-D||Side effect score||Value (% improvement)|