1 Introduction
Due to rising healthcare costs, there has been growing interest in improved methods for costeffectiveness analyses. Costeffectiveness research is concerned with identifying policies that balance treatment effectiveness and overall costs. The most effective treatments often have comparatively greater costs. For example, patients undergoing hysterectomy for the treatment of endometrial cancer sometimes receive adjuvant radiation or chemotherapy to reduce the risk of locoregional recurrence (van2021adjuvant). While adjuvant therapies have been shown to reduce the risk of recurrence and improve patient survival (keys2004phase; hogberg2010sequential), they are also associated with higher toxicity rates and greater costs (randall2006randomized). Identifying a treatment sequence which is both effective and costefficient is further complicated by heterogeneous treatment responses across individuals. Patients at highrisk of cancer recurrence are likely to benefit more from adjuvant therapies than are patients at lowrisk of recurrence (van2021adjuvant). These complications underscore the need for treatment regime methodology that can assist with optimal allocation of limited resources towards patients most likely to benefit.
Traditional costeffectiveness analyses compare the relative benefit of treatments using the incremental cost effectiveness ratio (ICER). The ICER is defined as the ratio of the difference in overall mean costs between two treatments to the difference in mean effectiveness as measured by some clinical outcome. This summary measure estimates the cost per unit change in effectiveness between two candidate interventions. Previously, regression approaches have been used to address treatment response heterogeneity and its effects on costeffectiveness. willan2004regression
showed that if cost and effectiveness metrics are linear functions of treatment status and covariates, then ICER can be represented as a function of the parameters in a leastsquares regression. The authors describe inferential procedures to assess costeffectiveness within subgroups defined by patient covariates. Because cost distributions are often skewed,
nixon2005methodsdeveloped Markov Chain Monte Carlo methods for inference with nonnormal errors. However, both of these methods are limited to settings with pointexposures. Because we are interested in identifying costeffective treatment sequences, methods that can account for timevarying treatments are imperative. Recently,
spieker2020nested proposed a novel nested gcomputation approach for estimating and comparing expected treatment costs under static treatment regimes while accounting for timevarying treatment and confounding. Our approach instead identifies strategies which assign treatments based on individual and timevarying patient characteristics.lakkaraju2017learning previously developed methodology to learn an optimal costeffective individualized treatment rule in the form of a decision list for single decision point settings. They consider an objective function to maximize treatment effectiveness while penalizing both covariate ascertainment costs and overall treatment cost, and suggest using a validation set to select penalties resulting in estimated costs below predefined cost thresholds. laber2018identifying instead proposed policy search methodology to maximize treatment effectiveness within the set of regimes for which overall cost (defined by number of safety events) is below a preset threshold. For treatment rules taking the form of linear decision boundaries, they propose a coordinate decent algorithm to identify an optimal decision rule under a fixed cost constraint. Recent work by zhang2018interpretable argues that treatment rules based on decision lists provide both flexibility and interpretability. They developed an efficient algorithm for identifying optimal listbased regimes for timevarying treatments, but do not consider estimation under cost constraints.
In this paper, we propose an efficient algorithm for identifying optimal listbased decision rules under fixed cost constraints. We use Qlearning with policy search methodology to maximize treatment effectiveness within the class of listbased regimes with costs under predefined thresholds. We modify the algorithm proposed by zhang2018interpretable to efficiently estimate the clauses within the listbased decision rules. In addition, we propose a novel iterative costeffectiveness analysis to select an optimally costeffective treatment regime from a set of candidate treatment regimes. This approach uses ICER to sequentially compare the relative benefit of progressively more expensive treatment regimes. The proposed costeffectiveness analysis better approximates traditional costeffectiveness analyses than do prior methods and avoids complications arising from noniterative analyses (cohen2008interpreting).
The remainder of this manuscript is organized as follows. In Section 2 we introduce methodology for estimating optimal listbased treatment regimes under a constraint on overall expected treatment costs. In Section 3, we propose a new approach for selecting an optimally costeffective regime from a set of candidate regimes corresponding to increasing cost thresholds. We explore the operating characteristics of our proposed regime identification approach through simulations in Section 4. We describe additional considerations and avenues for future developments in Section 5.
2 Methods
2.1 Setup and notation
In this section, we demonstrate how to estimate a set of treatment rules that maximize a measure of clinical effectiveness under a predefined cost constraint. Assume that treatment decisions are made at the beginning of distinct intervals indexed by . For individuals in interval , we observe data . Here, denotes confounding variables collected at the beginning of interval , denotes treatment status, and and are clinical effectiveness and cost outcomes collected at the end of interval , respectively. Within each interval, variables are observed in the order: . We use overbar notation to denote covariate history, e.g. , and underbar notation to denote future values of a covariate, e.g. . For , let denote a patient’s covariate and treatment history at the beginning of interval . A dynamic treatment regime is a set of decision rules where is a mapping
from the space of all possible covariate histories into that of treatment decisions. For a random variable,
, let denote the value of which would be observed if regime had been followed. If is a class of treatment regimes, then the optimal dynamic treatment regime within this class is the regime that satisfies the condition for all . If is the class of all treatment regimes, then can be identified through nonparametric Qlearning (kosorok2019precision).Because a common goal in health policy is making decisions on treatment allocation in resourcelimited settings, we restrict consideration to the class of listbased treatment rules with cost constraints. Compared with traditional Qlearning, this restricted search sacrifices regime flexibility in favor of interpretability. Following zhang2018interpretable, a listbased regime is a regime for which each treatment rule for consists of a series of ifelse statements:

If then ;

else if then ;


else if then
Each treatment regime is determined by . For where is the maximum desired list length for a decision at interval , is a subset of . We restrict so that all units which are unassigned at the end of the list are assigned to treatment . For simplicity and to aid interpretability, we restrict to clauses involving thresholding of a single covariate:
Estimation of the optimal costrestricted regime involves identifying optimal values for . Let and denote the collection of decision regions and treatment choices at each decision point. We employ an integrated Qlearning and policy search approach to define and estimate these optimal values. Policy search methodology is common when estimating interpretable regimes because it can decouple the form of a regime from the class of estimators for the Qfunctions (kosorok2019precision). This approach will allow us to flexibly estimate an optimal listbased regime under a fixed cost constraint. Define the stage Qfunctions for and :
Given a treatment regime , we may recursively define the Qfunctions for intervals as follows:
Because the optimal regime maximizes the potential treatment effectiveness under a constraint on the potential cost, some identification assumptions are required to connect the distribution of the observed covariates to that of the potential outcomes. Specifically, we need to make the following assumptions:

(Positivity) If , then for all

(Consistency) and for

(Sequentially ignorable treatment assignment) for
Under these assumptions, schulte2014q show that for the general Qlearning setting:
and, for :
It follows that the expected potential effectiveness and cost of a regime, , are given by and , respectively. Because the stage Qfunction for cost denotes the expected cost accrued in the interval and in future intervals under a prespecified regime, decompose the overall cost constraint into components representing the interval specific cost constraints where . The choice of interval specific cost constraints should be driven by subjectlevel knowledge. One reasonable choice may set so that the costs accrued in each interval are equally constrained. By the principles of dynamic programming, the optimal treatment regime can be identified by optimizing over the Qfunctions at each individual decision point (bellman1966dynamic). In the policy search context, the process of finding an optimal regime within a prespecified class reduces to finding the optimal decision rule within this class at each decision point. The optimal choices of within the class of listbased and cost constrained decision rules are given by:
For decision intervals the optimal choices are given by:
In the next subsection, we propose a modification of the algorithm developed by zhang2018interpretable for identifying globally optimal listbased regimes. Our approach extends this approach to the costeffectiveness setting by allowing the accommodation of cost constraints.
2.2 Estimating the Optimal Decision Rules
To estimate the optimal listbased and cost constrained regime, we use a backwards recursive procedure. The optimal decision rule for the final interval is estimated first, and earlier timepoints are estimated assuming optimal decisions are made at all future timepoints. For each decision point , we sequentially estimate the pairs for clauses .
For a unit entering the final interval with covariate history , define the unconstrained optimal rule . We want to find an interpretable listbased and cost constrained treatment rule which closely approximates this rule. First, consider approximating the final decision within the class of listbased rules with cost constraint . To estimate the first clause, consider the intermediate listbased decision rule:

If then ;

else if then
The expected effectiveness measure under this regime is given by:
Similarly, the expected cost is given by:
We search for the values of which maximize while ensuring that the expected cost, , is less than or equal to the cost constraint, . As zhang2018interpretable point out, maximizing the objective function, , is equivalent to minimizing the difference between this intermediate rule. Because we also impose the constraint , we are minimizing the difference between the intermediate rule and the optimal rule within the class of cost constrained rules.
Different choices of may lead to equivalent expected effectiveness and cost estimates. For example, suppose there exists a set , such that for all . In this case, the intermediate rules parameterized by and will lead to equivalent expected effectiveness and cost. When rules result in similar estimated effectiveness and cost, we want to reward those which assign treatment to a greater number of patients. Following the example of zhang2018interpretable, we add a complexity term, . to reward such regions:
Larger values of give greater weight to rules which include more units. However, this could potentially be at the expense of the mean effectiveness of an estimated rule. To ensure that rules are maximize effectiveness as much as possible this term may be selected using crossvalidation. The optimal choices of for the intermediate rule are given by:
This procedure can be generalized to estimate optimal regions and treatment choices for each of the clauses in the listbased rule. The algorithm can be summarized as follows:

Let

Define . If , the expectation of equal to or under the intermediate rule is defined as:
(1) 
Otherwise, if , force and define:
(2) 
Define as:
If set and return to Step 1, otherwise stop.
The algorithm above can be used to estimate . These estimates completely determine the optimal listbased and cost constrained decision rule for the interval, . To estimate treatment rules for decisions , let and we denote the Qfunctions for interval assuming that the optimal listbased and cost constrained rule is followed in future intervals. The unconstrained optimal treatment decision at interval is . To approximate this rule within the class of listbased and cost constrained treatment rules, we may apply the steps listed above replacing the stage Qfunctions with their stage equivalents, and the cost constraint with .
The proposed algorithm allows estimation of an optimal listbased regime under a set of intervalspecific cost constraints. This work requires two important modifications of the algorithm described by zhang2018interpretable for identifying globally optimal listbased regimes. First, our procedure requires that the optimal values of result in a rule which satisfies some prespecified cost constraint. To achieve this, we must be able to estimate the expected cost accrued under each intermediate decision rule. This motivates our second departure from the original algorithm. Equations 2.2 and 2.2 can be viewed as the sum of expected outcomes in three exhaustive groups: (1) those satisfying the current clause but not past clauses (), (2) those satisfying neither the current clause nor past clauses (), and (3) those satisfying at least one previous clause (). In order to estimate the overall expected outcomes, we must consider the expected outcome in each of these groups. Because zhang2018interpretable are interested in finding an optimal listbased regime without a cost constraint, at each clause they need only find that maximize effectiveness among unassigned units, i.e. those in groups (1) and (2). Because the outcomes for those in group (3) are fixed with respect to the choice of , these two methods will result in equivalent rules in cases where there is no cost constraint. This modification is necessary for ensuring the overall expected cost is below the cost constraint.
Estimates of the expected effectiveness and cost under fit regimes can be obtained from the identified treatment rules. Note that estimates the expected survival for a patient with baseline covariates , assuming the optimal cost constrained and listbased treatment is given at every interval. Similarly, estimates the expected total costs under the optimal regime for patients with baseline covariates . To estimate the expected performance of the regime accross the population, we may estimate:
(3)  
(4) 
2.3 Efficient optimization of treatment rules
We propose an algorithm for identifying optimal choices of and , at clause of decision rule . Let denote estimated optimal regions and treatments for clauses prior to . Recall that regions which are defined in terms of a threshold, , on a variable within . Let , for , denote a covariate contained within . First, consider decision regions of the form where patients with are given treatment . For the intermediate decision rule is given by:

If then ;

else if then ;


else if then ;

else if then ;

else
For each unit, define:
The values and represent expected effectiveness outcomes for patients included in and excluded from , respectively. Note that if a patient satisfies a previous clause, then so that the patient’s expected outcome is independent of the choice of and . It can be seen that . If we define and similarly, it follows that .
Although may take any real value, the unique values of and can be obtained by setting to the order statistics of . First, we relabel units so that . If there are no ties in the observed values of , then and follow the recursive relationships:
If there are ties in the observed values of
, as would be the case for ordinal or binary variables, then let
denote the unique values of , where . The recursive relationships can then be rewritten as:These relationships allow us to quickly enumerate all possible values of and for regions of the form and assigned treatment . Analogous relationships can be defined for regions of the form . Optimal choices for and can be obtained by iterating over each variable in and each possible treatment option, and identifying which choices lead to maximal values of under the restriction .
3 Implementation of a Costeffectiveness Analysis
In this section, we describe how our proposed optimal cost constrained regimes may be used to perform a costeffectiveness analysis. To identify an optimally costeffective treatment regime from the observed data, we use the incremental costeffectiveness ratio (ICER) which is commonly utilized in health economics (cohen2008interpreting). Given two candidate regimes, and , ICER is a comparative measure defined as the ratio of the differnce in expected cost to the difference in expected effectiveness between the two regimes:
ICER can be interpreted as the cost per unit change in effectiveness obtained by switching from treatment regime to . Let denote a preselected willingnesstopay (WTP) parameter. The WTP represents the maximum cost a payer is willing to incur for a unit change in effectiveness. Adopting regime over the comparator regime, , is considered costeffective if ICER is less than the chosen WTP.
Given a set of candidate regimes and a preselected WTP, we may use ICER to select an optimally costeffective regime. Let denote the estimated optimal treatment regime for a given cost constraint . Let be an ordered sequence of cost constraints, and be the set of candidate treatment regimes corresponding to these constraints. For these regimes, we know and . Because of this relation, if a regime fit with a lower cost constraint is used as the reference regime, then ICER will always be positive. We provide an algorithm that identifies the optimally costeffective regime by sequentially comparing each candidate regime to the best existing alternative.

Define the current most costeffective regime: and set

If , then , otherwise do not update .

If , set and repeat from Step 2, otherwise is the optimally costeffective regime.
We illustrate this procedure using a simulated example. Suppose we have a binary treatment variable, , which is assigned to patients at distinct timepoints. Prior to treatment assignment at interval , we collect data on two continuous covariates, and , as well as data on each patient’s current treatment costs, , and survival time, . For this example, treatment costs are measured in units of USD , and survival is measured in years. Data are simulated as specified in Section 4. We wish to find a treatment regime which optimally balances treatment effectiveness and cost.
Regime  Constraint  Survival  Cost  ICER  Comparator 

I  26  1.35  25.98  NA  NA 
II  30  1.44  29.76  40.66  I 
III  34  1.54  33.93  42.22  II 
IV  38  1.64  37.86  41.84  III 
V  42  1.71  41.81  53.77  IV 
VI  46  1.76  45.62  61.01  IV 
VII  50  1.79  47.46  63.32  IV 
We first fit optimal listbased regimes with cost constraints ranging from a total cost of to over the three treatment intervals. Table 1 provides the estimated survival in years and cost in USD for each of the identified optimal cost constrained regimes. In this table, ICER is calculated comparing each regime and the current best existing alternative. If the WTP is , then the optimally costeffective regime is the regime with a cost constraint of over the three year period. This regime satisfies the two conditions for optimality: (1) ICER comparing this regime to the best existing alternative (regime II) is below the selected WTP and (2) switching from this regime to any of the more expensive regimes would cost more than per unfit change in survival.
4 Simulation study
We perform a simulation study to evaluate the operating characteristics of our regime identification approach. Our goals in performing this study are twofold. First, we demonstrate that our approach is able to fit valid cost constrained regimes in realistic settings where a costeffectiveness analysis would be undertaken. Second, we illustrate the utility of flexible ensemble learning approaches for improving regime performance. We simulate data with decision points and sample units. For intervals , we simulate a vector of independent standard normal confounding variables, ; a binary treatment decision, ; an indicator of whether a unit has survived to the end of the interval, ; and cost accrued within the interval, . Specifically, baseline data are simulated as:
and data for intervals are simulated as,
We define , , , , , and . For followup intervals the corresponding parameter values are , , , , , , and . The parameter , which is the same for baseline and followup data, controls the level of correlation between the cost and effectiveness outcomes. We consider simulation settings with low, medium, and high levels of correlation between outcomes, corresponding to , and respectively. To ensure costs are similar across simulation settings we include the offset term which is equal to and in low, medium, and high correlation settings. We simulate data with , and sample units. To explore how flexible modeling of the Qfunctions can improve regime performance, we compare two approaches: (1) linear models, and (2) SuperLearner. SuperLearner is an ensemble learner, combining predictions from a set of candidate learners to minimize crossvalidated risk van2011targeted
. We specified random forests, neural nets, elastic net, generalized linear models, and an interceptonly regression as candidate learners in our implementation of Super Learner.
For each setting, we simulate 500 datasets and estimate an optimal cost constrained regime using linear models and another using SuperLearner. Regimes are fit with overall cost constraints of , , and . After regimes are identified, we simulate new units, treat them according to the estimated treatment rules, and record the mean survival and cost among these units. This procedure results in two estimates of the mean cost and survival under each fit regime; one arising from the Qfunctions (as described in Equations 3 and 4), and another based on Monte Carlo simulation. The Monte Carlo approach allows us to determine how each regime performs when applied to new data and approximates the true cost and survival under that regime. On the other hand, estimates of cost and survival arising from the Qfunctions are subject to bias due to model misspecification. If the Qfunctions are appropriately modelled, then estimates of the expected cost and survival for a fit regime should be similar to those from the Monte Carlo approximations of the truth.
Table 2 summarizes our simulation results. For each setting and modeling approach, we provide the mean Qfunctionbased estimate of cost and survival across the identified treatment strategies. In addition, we provide the mean Monte Carlo estimated cost and survival across the fit regimes. In nearly all settings, the mean Monte Carlo estimated cost fall below the specified cost threshold. Only in scenarios where SuperLearner was used with large sample datasets, was the cost constraint not satisfied. In these settings, the mean Qfunction based estimate of the total cost underestimated the true MonteCarlo estimated cost. For example, when cost and survival outcomes were highly correlated, the cost constraint was set to , and data were simulated with units, the mean Qfunction based estimate of cost across regimes was while the mean Monte Carlo estimated cost was . This contrasts with the results seen for regimes identified using linear models. Here, Qlearning based estimates of the mean consistently overestimated the costs observed when new units were treated according to the identified strategies. Because total regime costs are overestimated when using linear models, some regimes which would truly satisfy the cost constraint may be excluded from consideration. This results in more conservative treatment strategies and decreased survival when compared to SuperLearnerbased regimes. In the moderate correlation setting, for sample sizes of and a cost constraint of , the average Qlearning based estimate of regime cost is . However, the regimes fit using linear models have a lower Monte Carlo estimated cost and survival than the corresponding SuperLearnerbased regimes (costs: vs. ; survival: vs. ). Even when Monte Carlo regime costs are similar between regimes fit using linear models and SuperLearner, Monte Carlo estimates of patient survival are greater for SuperLearnerbased regimes. This is the case in the high correlation scenario with a cost constraint of and a sample size of . While regimes fit under both modeling approaches lead to an average cost of when applied to new data, the average survival under linear modelbased regimes is while for SuperLearnerbased regimes it is .
Because SuperLearner uses crossvalidation to create a weighted combination of candidate learners, predictions of survival and cost outcomes are dependent on random splits of the data. In the context of doubly robust estimation of the average treatment effect, benkeser2020practical illustrates the sensitivity of inference to setting different random seeds when outcome and propensity score models are fit using SuperLearner. For dynamic treatment regimes, this sensitivity manifests as different estimated treatment rules across random seeds. We perform additional simulations to assess the sensitivity of our listbased and costconstrained regimes to random splits of the data. For sample sizes of , , , and , we simulate datasets under the setup previously described. For each dataset, we use SuperLearner to fit two separate treatment regimes changing only the random seed. We summarise differences between the two regimes as the number of interval specific rules which differ between the regimes (e.g. if only the final decision rule differs between regimes, then the “distance” is one). We find that the stability of the fit regimes increases with sample size. The proportion of datasets for which fit regimes were identical or only differed at one decision interval is , , and at sample sizes of , , , and , respectively. In practice, sensitivity to random seed can be assessed by fitting treatment regimes using SuperLearner under multiple seeds. If fit regimes differ in meaningful ways, then modeling approaches that do not depend on random seeds should be used to fit Qfunctions.
Linear Model  SuperLearner  

Corr.  n  MC. Cost  MC. Surv.  MC. Cost  MC. Surv.  
Low  26  500  25.75  1.30  25.53  1.25  25.81  1.36  25.81  1.34  
1000  25.64  1.26  25.18  1.25  25.85  1.35  26.12  1.38  
5000  25.55  1.23 
Comments
There are no comments yet.