Identifying optimally cost-effective dynamic treatment regimes with a Q-learning approach

07/07/2021 ∙ by Nicholas Illenberger, et al. ∙ 0

Health policy decisions regarding patient treatment strategies require consideration of both treatment effectiveness and cost. Optimizing treatment rules with respect to effectiveness may result in prohibitively expensive strategies; on the other hand, optimizing with respect to costs may result in poor patient outcomes. We propose a two-step approach for identifying an optimally cost-effective and interpretable dynamic treatment regime. First, we develop a combined Q-learning and policy-search approach to estimate an optimal list-based regime under a constraint on expected treatment costs. Second, we propose an iterative procedure to select an optimally cost-effective regime from a set of candidate regimes corresponding to different cost constraints. Our approach can estimate optimal regimes in the presence of commonly encountered challenges including time-varying confounding and correlated outcomes. Through simulation studies, we illustrate the validity of estimated optimal treatment regimes and examine operating characteristics under flexible modeling approaches.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Due to rising healthcare costs, there has been growing interest in improved methods for cost-effectiveness analyses. Cost-effectiveness research is concerned with identifying policies that balance treatment effectiveness and overall costs. The most effective treatments often have comparatively greater costs. For example, patients undergoing hysterectomy for the treatment of endometrial cancer sometimes receive adjuvant radiation or chemotherapy to reduce the risk of locoregional recurrence (van2021adjuvant). While adjuvant therapies have been shown to reduce the risk of recurrence and improve patient survival (keys2004phase; hogberg2010sequential), they are also associated with higher toxicity rates and greater costs (randall2006randomized). Identifying a treatment sequence which is both effective and cost-efficient is further complicated by heterogeneous treatment responses across individuals. Patients at high-risk of cancer recurrence are likely to benefit more from adjuvant therapies than are patients at low-risk of recurrence (van2021adjuvant). These complications underscore the need for treatment regime methodology that can assist with optimal allocation of limited resources towards patients most likely to benefit.

Traditional cost-effectiveness analyses compare the relative benefit of treatments using the incremental cost effectiveness ratio (ICER). The ICER is defined as the ratio of the difference in overall mean costs between two treatments to the difference in mean effectiveness as measured by some clinical outcome. This summary measure estimates the cost per unit change in effectiveness between two candidate interventions. Previously, regression approaches have been used to address treatment response heterogeneity and its effects on cost-effectiveness. willan2004regression

showed that if cost and effectiveness metrics are linear functions of treatment status and covariates, then ICER can be represented as a function of the parameters in a least-squares regression. The authors describe inferential procedures to assess cost-effectiveness within subgroups defined by patient covariates. Because cost distributions are often skewed,


developed Markov Chain Monte Carlo methods for inference with non-normal errors. However, both of these methods are limited to settings with point-exposures. Because we are interested in identifying cost-effective treatment sequences, methods that can account for time-varying treatments are imperative. Recently,

spieker2020nested proposed a novel nested g-computation approach for estimating and comparing expected treatment costs under static treatment regimes while accounting for time-varying treatment and confounding. Our approach instead identifies strategies which assign treatments based on individual and time-varying patient characteristics.

lakkaraju2017learning previously developed methodology to learn an optimal cost-effective individualized treatment rule in the form of a decision list for single decision point settings. They consider an objective function to maximize treatment effectiveness while penalizing both covariate ascertainment costs and overall treatment cost, and suggest using a validation set to select penalties resulting in estimated costs below pre-defined cost thresholds. laber2018identifying instead proposed policy search methodology to maximize treatment effectiveness within the set of regimes for which overall cost (defined by number of safety events) is below a pre-set threshold. For treatment rules taking the form of linear decision boundaries, they propose a coordinate decent algorithm to identify an optimal decision rule under a fixed cost constraint. Recent work by zhang2018interpretable argues that treatment rules based on decision lists provide both flexibility and interpretability. They developed an efficient algorithm for identifying optimal list-based regimes for time-varying treatments, but do not consider estimation under cost constraints.

In this paper, we propose an efficient algorithm for identifying optimal list-based decision rules under fixed cost constraints. We use Q-learning with policy search methodology to maximize treatment effectiveness within the class of list-based regimes with costs under pre-defined thresholds. We modify the algorithm proposed by zhang2018interpretable to efficiently estimate the clauses within the list-based decision rules. In addition, we propose a novel iterative cost-effectiveness analysis to select an optimally cost-effective treatment regime from a set of candidate treatment regimes. This approach uses ICER to sequentially compare the relative benefit of progressively more expensive treatment regimes. The proposed cost-effectiveness analysis better approximates traditional cost-effectiveness analyses than do prior methods and avoids complications arising from non-iterative analyses (cohen2008interpreting).

The remainder of this manuscript is organized as follows. In Section 2 we introduce methodology for estimating optimal list-based treatment regimes under a constraint on overall expected treatment costs. In Section 3, we propose a new approach for selecting an optimally cost-effective regime from a set of candidate regimes corresponding to increasing cost thresholds. We explore the operating characteristics of our proposed regime identification approach through simulations in Section 4. We describe additional considerations and avenues for future developments in Section 5.

2 Methods

2.1 Set-up and notation

In this section, we demonstrate how to estimate a set of treatment rules that maximize a measure of clinical effectiveness under a predefined cost constraint. Assume that treatment decisions are made at the beginning of distinct intervals indexed by . For individuals in interval , we observe data . Here, denotes confounding variables collected at the beginning of interval , denotes treatment status, and and are clinical effectiveness and cost outcomes collected at the end of interval , respectively. Within each interval, variables are observed in the order: . We use overbar notation to denote covariate history, e.g. , and underbar notation to denote future values of a covariate, e.g. . For , let denote a patient’s covariate and treatment history at the beginning of interval . A dynamic treatment regime is a set of decision rules where is a mapping

from the space of all possible covariate histories into that of treatment decisions. For a random variable,

, let denote the value of which would be observed if regime had been followed. If is a class of treatment regimes, then the optimal dynamic treatment regime within this class is the regime that satisfies the condition for all . If is the class of all treatment regimes, then can be identified through nonparametric Q-learning (kosorok2019precision).

Because a common goal in health policy is making decisions on treatment allocation in resource-limited settings, we restrict consideration to the class of list-based treatment rules with cost constraints. Compared with traditional Q-learning, this restricted search sacrifices regime flexibility in favor of interpretability. Following zhang2018interpretable, a list-based regime is a regime for which each treatment rule for consists of a series of if-else statements:

  • If then ;

  • else if then ;

  • else if then

Each treatment regime is determined by . For where is the maximum desired list length for a decision at interval , is a subset of . We restrict so that all units which are unassigned at the end of the list are assigned to treatment . For simplicity and to aid interpretability, we restrict to clauses involving thresholding of a single covariate:

Estimation of the optimal cost-restricted regime involves identifying optimal values for . Let and denote the collection of decision regions and treatment choices at each decision point. We employ an integrated Q-learning and policy search approach to define and estimate these optimal values. Policy search methodology is common when estimating interpretable regimes because it can decouple the form of a regime from the class of estimators for the Q-functions (kosorok2019precision). This approach will allow us to flexibly estimate an optimal list-based regime under a fixed cost constraint. Define the stage Q-functions for and :

Given a treatment regime , we may recursively define the Q-functions for intervals as follows:

Because the optimal regime maximizes the potential treatment effectiveness under a constraint on the potential cost, some identification assumptions are required to connect the distribution of the observed covariates to that of the potential outcomes. Specifically, we need to make the following assumptions:

  • (Positivity) If , then for all

  • (Consistency) and for

  • (Sequentially ignorable treatment assignment) for

Under these assumptions, schulte2014q show that for the general Q-learning setting:

and, for :

It follows that the expected potential effectiveness and cost of a regime, , are given by and , respectively. Because the stage Q-function for cost denotes the expected cost accrued in the interval and in future intervals under a pre-specified regime, decompose the overall cost constraint into components representing the interval specific cost constraints where . The choice of interval specific cost constraints should be driven by subject-level knowledge. One reasonable choice may set so that the costs accrued in each interval are equally constrained. By the principles of dynamic programming, the optimal treatment regime can be identified by optimizing over the Q-functions at each individual decision point (bellman1966dynamic). In the policy search context, the process of finding an optimal regime within a pre-specified class reduces to finding the optimal decision rule within this class at each decision point. The optimal choices of within the class of list-based and cost constrained decision rules are given by:

For decision intervals the optimal choices are given by:

In the next subsection, we propose a modification of the algorithm developed by zhang2018interpretable for identifying globally optimal list-based regimes. Our approach extends this approach to the cost-effectiveness setting by allowing the accommodation of cost constraints.

2.2 Estimating the Optimal Decision Rules

To estimate the optimal list-based and cost constrained regime, we use a backwards recursive procedure. The optimal decision rule for the final interval is estimated first, and earlier timepoints are estimated assuming optimal decisions are made at all future timepoints. For each decision point , we sequentially estimate the pairs for clauses .

For a unit entering the final interval with covariate history , define the unconstrained optimal rule . We want to find an interpretable list-based and cost constrained treatment rule which closely approximates this rule. First, consider approximating the final decision within the class of list-based rules with cost constraint . To estimate the first clause, consider the intermediate list-based decision rule:

  • If then ;

  • else if then

The expected effectiveness measure under this regime is given by:

Similarly, the expected cost is given by:

We search for the values of which maximize while ensuring that the expected cost, , is less than or equal to the cost constraint, . As zhang2018interpretable point out, maximizing the objective function, , is equivalent to minimizing the difference between this intermediate rule. Because we also impose the constraint , we are minimizing the difference between the intermediate rule and the optimal rule within the class of cost constrained rules.

Different choices of may lead to equivalent expected effectiveness and cost estimates. For example, suppose there exists a set , such that for all . In this case, the intermediate rules parameterized by and will lead to equivalent expected effectiveness and cost. When rules result in similar estimated effectiveness and cost, we want to reward those which assign treatment to a greater number of patients. Following the example of zhang2018interpretable, we add a complexity term, . to reward such regions:

Larger values of give greater weight to rules which include more units. However, this could potentially be at the expense of the mean effectiveness of an estimated rule. To ensure that rules are maximize effectiveness as much as possible this term may be selected using cross-validation. The optimal choices of for the intermediate rule are given by:

This procedure can be generalized to estimate optimal regions and treatment choices for each of the clauses in the list-based rule. The algorithm can be summarized as follows:

  • Let

  • Define . If , the expectation of equal to or under the intermediate rule is defined as:

  • Otherwise, if , force and define:

  • Define as:

    If set and return to Step 1, otherwise stop.

The algorithm above can be used to estimate . These estimates completely determine the optimal list-based and cost constrained decision rule for the interval, . To estimate treatment rules for decisions , let and we denote the Q-functions for interval assuming that the optimal list-based and cost constrained rule is followed in future intervals. The unconstrained optimal treatment decision at interval is . To approximate this rule within the class of list-based and cost constrained treatment rules, we may apply the steps listed above replacing the stage Q-functions with their stage equivalents, and the cost constraint with .

The proposed algorithm allows estimation of an optimal list-based regime under a set of interval-specific cost constraints. This work requires two important modifications of the algorithm described by zhang2018interpretable for identifying globally optimal list-based regimes. First, our procedure requires that the optimal values of result in a rule which satisfies some pre-specified cost constraint. To achieve this, we must be able to estimate the expected cost accrued under each intermediate decision rule. This motivates our second departure from the original algorithm. Equations 2.2 and 2.2 can be viewed as the sum of expected outcomes in three exhaustive groups: (1) those satisfying the current clause but not past clauses (), (2) those satisfying neither the current clause nor past clauses (), and (3) those satisfying at least one previous clause (). In order to estimate the overall expected outcomes, we must consider the expected outcome in each of these groups. Because zhang2018interpretable are interested in finding an optimal list-based regime without a cost constraint, at each clause they need only find that maximize effectiveness among unassigned units, i.e. those in groups (1) and (2). Because the outcomes for those in group (3) are fixed with respect to the choice of , these two methods will result in equivalent rules in cases where there is no cost constraint. This modification is necessary for ensuring the overall expected cost is below the cost constraint.

Estimates of the expected effectiveness and cost under fit regimes can be obtained from the identified treatment rules. Note that estimates the expected survival for a patient with baseline covariates , assuming the optimal cost constrained and list-based treatment is given at every interval. Similarly, estimates the expected total costs under the optimal regime for patients with baseline covariates . To estimate the expected performance of the regime accross the population, we may estimate:


2.3 Efficient optimization of treatment rules

We propose an algorithm for identifying optimal choices of and , at clause of decision rule . Let denote estimated optimal regions and treatments for clauses prior to . Recall that regions which are defined in terms of a threshold, , on a variable within . Let , for , denote a covariate contained within . First, consider decision regions of the form where patients with are given treatment . For the intermediate decision rule is given by:

  • If then ;

  • else if then ;

  • else if then ;

  • else if then ;

  • else

For each unit, define:

The values and represent expected effectiveness outcomes for patients included in and excluded from , respectively. Note that if a patient satisfies a previous clause, then so that the patient’s expected outcome is independent of the choice of and . It can be seen that . If we define and similarly, it follows that .

Although may take any real value, the unique values of and can be obtained by setting to the order statistics of . First, we relabel units so that . If there are no ties in the observed values of , then and follow the recursive relationships:

If there are ties in the observed values of

, as would be the case for ordinal or binary variables, then let

denote the unique values of , where . The recursive relationships can then be rewritten as:

These relationships allow us to quickly enumerate all possible values of and for regions of the form and assigned treatment . Analogous relationships can be defined for regions of the form . Optimal choices for and can be obtained by iterating over each variable in and each possible treatment option, and identifying which choices lead to maximal values of under the restriction .

3 Implementation of a Cost-effectiveness Analysis

In this section, we describe how our proposed optimal cost constrained regimes may be used to perform a cost-effectiveness analysis. To identify an optimally cost-effective treatment regime from the observed data, we use the incremental cost-effectiveness ratio (ICER) which is commonly utilized in health economics (cohen2008interpreting). Given two candidate regimes, and , ICER is a comparative measure defined as the ratio of the differnce in expected cost to the difference in expected effectiveness between the two regimes:

ICER can be interpreted as the cost per unit change in effectiveness obtained by switching from treatment regime to . Let denote a pre-selected willingness-to-pay (WTP) parameter. The WTP represents the maximum cost a payer is willing to incur for a unit change in effectiveness. Adopting regime over the comparator regime, , is considered cost-effective if ICER is less than the chosen WTP.

Given a set of candidate regimes and a pre-selected WTP, we may use ICER to select an optimally cost-effective regime. Let denote the estimated optimal treatment regime for a given cost constraint . Let be an ordered sequence of cost constraints, and be the set of candidate treatment regimes corresponding to these constraints. For these regimes, we know and . Because of this relation, if a regime fit with a lower cost constraint is used as the reference regime, then ICER will always be positive. We provide an algorithm that identifies the optimally cost-effective regime by sequentially comparing each candidate regime to the best existing alternative.

  1. Define the current most cost-effective regime: and set

  2. If , then , otherwise do not update .

  3. If , set and repeat from Step 2, otherwise is the optimally cost-effective regime.

We illustrate this procedure using a simulated example. Suppose we have a binary treatment variable, , which is assigned to patients at distinct time-points. Prior to treatment assignment at interval , we collect data on two continuous covariates, and , as well as data on each patient’s current treatment costs, , and survival time, . For this example, treatment costs are measured in units of USD , and survival is measured in years. Data are simulated as specified in Section 4. We wish to find a treatment regime which optimally balances treatment effectiveness and cost.

Regime Constraint Survival Cost ICER Comparator
I 26 1.35 25.98 NA NA
II 30 1.44 29.76 40.66 I
III 34 1.54 33.93 42.22 II
IV 38 1.64 37.86 41.84 III
V 42 1.71 41.81 53.77 IV
VI 46 1.76 45.62 61.01 IV
VII 50 1.79 47.46 63.32 IV
Table 1: Estimated mean survival (years) and cost (USD $) for patients treated according to regimes with different cost constraints and incremental cost-effectiveness ratios comparing subsequent candidate regimes. Results for optimally cost-effective treatment regime at WTP of /year bolded.

We first fit optimal list-based regimes with cost constraints ranging from a total cost of to over the three treatment intervals. Table 1 provides the estimated survival in years and cost in USD for each of the identified optimal cost constrained regimes. In this table, ICER is calculated comparing each regime and the current best existing alternative. If the WTP is , then the optimally cost-effective regime is the regime with a cost constraint of over the three year period. This regime satisfies the two conditions for optimality: (1) ICER comparing this regime to the best existing alternative (regime II) is below the selected WTP and (2) switching from this regime to any of the more expensive regimes would cost more than per unfit change in survival.

4 Simulation study

We perform a simulation study to evaluate the operating characteristics of our regime identification approach. Our goals in performing this study are two-fold. First, we demonstrate that our approach is able to fit valid cost constrained regimes in realistic settings where a cost-effectiveness analysis would be undertaken. Second, we illustrate the utility of flexible ensemble learning approaches for improving regime performance. We simulate data with decision points and sample units. For intervals , we simulate a vector of independent standard normal confounding variables, ; a binary treatment decision, ; an indicator of whether a unit has survived to the end of the interval, ; and cost accrued within the interval, . Specifically, baseline data are simulated as:

and data for intervals are simulated as,

We define , , , , , and . For follow-up intervals the corresponding parameter values are , , , , , , and . The parameter , which is the same for baseline and follow-up data, controls the level of correlation between the cost and effectiveness outcomes. We consider simulation settings with low, medium, and high levels of correlation between outcomes, corresponding to , and respectively. To ensure costs are similar across simulation settings we include the offset term which is equal to and in low, medium, and high correlation settings. We simulate data with , and sample units. To explore how flexible modeling of the Q-functions can improve regime performance, we compare two approaches: (1) linear models, and (2) SuperLearner. SuperLearner is an ensemble learner, combining predictions from a set of candidate learners to minimize cross-validated risk van2011targeted

. We specified random forests, neural nets, elastic net, generalized linear models, and an intercept-only regression as candidate learners in our implementation of Super Learner.

For each setting, we simulate 500 datasets and estimate an optimal cost constrained regime using linear models and another using SuperLearner. Regimes are fit with overall cost constraints of , , and . After regimes are identified, we simulate new units, treat them according to the estimated treatment rules, and record the mean survival and cost among these units. This procedure results in two estimates of the mean cost and survival under each fit regime; one arising from the Q-functions (as described in Equations 3 and 4), and another based on Monte Carlo simulation. The Monte Carlo approach allows us to determine how each regime performs when applied to new data and approximates the true cost and survival under that regime. On the other hand, estimates of cost and survival arising from the Q-functions are subject to bias due to model misspecification. If the Q-functions are appropriately modelled, then estimates of the expected cost and survival for a fit regime should be similar to those from the Monte Carlo approximations of the truth.

Table 2 summarizes our simulation results. For each setting and modeling approach, we provide the mean Q-function-based estimate of cost and survival across the identified treatment strategies. In addition, we provide the mean Monte Carlo estimated cost and survival across the fit regimes. In nearly all settings, the mean Monte Carlo estimated cost fall below the specified cost threshold. Only in scenarios where SuperLearner was used with large sample datasets, was the cost constraint not satisfied. In these settings, the mean Q-function based estimate of the total cost underestimated the true Monte-Carlo estimated cost. For example, when cost and survival outcomes were highly correlated, the cost constraint was set to , and data were simulated with units, the mean Q-function based estimate of cost across regimes was while the mean Monte Carlo estimated cost was . This contrasts with the results seen for regimes identified using linear models. Here, Q-learning based estimates of the mean consistently overestimated the costs observed when new units were treated according to the identified strategies. Because total regime costs are overestimated when using linear models, some regimes which would truly satisfy the cost constraint may be excluded from consideration. This results in more conservative treatment strategies and decreased survival when compared to SuperLearner-based regimes. In the moderate correlation setting, for sample sizes of and a cost constraint of , the average Q-learning based estimate of regime cost is . However, the regimes fit using linear models have a lower Monte Carlo estimated cost and survival than the corresponding SuperLearner-based regimes (costs: vs. ; survival: vs. ). Even when Monte Carlo regime costs are similar between regimes fit using linear models and SuperLearner, Monte Carlo estimates of patient survival are greater for SuperLearner-based regimes. This is the case in the high correlation scenario with a cost constraint of and a sample size of . While regimes fit under both modeling approaches lead to an average cost of when applied to new data, the average survival under linear model-based regimes is while for SuperLearner-based regimes it is .

Because SuperLearner uses cross-validation to create a weighted combination of candidate learners, predictions of survival and cost outcomes are dependent on random splits of the data. In the context of doubly robust estimation of the average treatment effect, benkeser2020practical illustrates the sensitivity of inference to setting different random seeds when outcome and propensity score models are fit using SuperLearner. For dynamic treatment regimes, this sensitivity manifests as different estimated treatment rules across random seeds. We perform additional simulations to assess the sensitivity of our list-based and cost-constrained regimes to random splits of the data. For sample sizes of , , , and , we simulate datasets under the set-up previously described. For each dataset, we use SuperLearner to fit two separate treatment regimes changing only the random seed. We summarise differences between the two regimes as the number of interval specific rules which differ between the regimes (e.g. if only the final decision rule differs between regimes, then the “distance” is one). We find that the stability of the fit regimes increases with sample size. The proportion of datasets for which fit regimes were identical or only differed at one decision interval is , , and at sample sizes of , , , and , respectively. In practice, sensitivity to random seed can be assessed by fitting treatment regimes using SuperLearner under multiple seeds. If fit regimes differ in meaningful ways, then modeling approaches that do not depend on random seeds should be used to fit Q-functions.

Linear Model SuperLearner
Corr. n MC. Cost MC. Surv. MC. Cost MC. Surv.
Low 26 500 25.75 1.30 25.53 1.25 25.81 1.36 25.81 1.34
1000 25.64 1.26 25.18 1.25 25.85 1.35 26.12 1.38
5000 25.55 1.23