Balanced Policy Evaluation and Learning for Right Censored Data

11/13/2019 · by Owen E. Leete, et al.

Individualized treatment rules can lead to better health outcomes when patients have heterogeneous responses to treatment. Very few individualized treatment rule estimation methods are compatible with a multi-treatment observational study with right censored survival outcomes. In this paper we extend policy evaluation methods to the right censored data setting. Existing approaches either make restrictive assumptions about the structure of the data, or use inverse weighting methods that increase the variance of the estimator resulting in decreased performance. We propose a method which uses balanced policy evaluation combined with an imputation approach to remove right censoring. We show that the proposed imputation approach is compatible with a large number of existing survival models and can be used to extend any individualized treatment rule estimation method to the right censored data setting. We establish the rate at which the imputed values converge to the conditional expected survival times, as well as consistency guarantees and regret bounds for the combined balanced policy with imputation approach. In simulation studies, we demonstrate the improved performance of our approach compared to existing methods. We also apply our method to data from the University of North Carolina Center for AIDS Research HIV Clinical Cohort.


1 Introduction

Differences in individual patient characteristics can result in significant heterogeneity in response to treatment. An individualized treatment rule (ITR) is a function which takes patient specific characteristics and recommends a treatment. The optimal ITR recommends the treatment that maximizes the benefit with respect to some clinical outcome. Treatment decisions are made based on published treatment guidelines, which often list several available treatments from which the physician must choose based on expert opinion and personal experience. Using data-driven approaches to help inform decision making can formalize the process and improve patient outcomes. There has been a considerable amount of effort devoted to the estimation of the optimal ITR (Qian and Murphy, 2011; Zhang et al., 2012; Zhao et al., 2012, 2014; Cui et al., 2017; Kallus, 2018; Athey and Wager, 2017).

To motivate our approach to estimating ITRs, we consider data from an observational cohort study of HIV+ patients who take different types of antiretroviral therapy (ART). The nature of this data presents several challenges to estimating ITRs. First, there are many treatment options. HIV infection is treated with combination therapies consisting of several drugs from multiple drug classes. Second, the data are frequently subject to right censoring. One measure of the effectiveness of treatment is the durability of an ART regimen, which measures how long a patient remains on the same ART. The durability is predictive of long-term patient morbidity and mortality, but it is prone to loss to follow-up. Third, the observational nature of the data means that the probability of a patient receiving a particular treatment may be related to measured and unmeasured factors including comorbid conditions and other therapies. This confounding by indication can make it difficult to establish causal relationships.

Some early methods for estimating ITRs involved a two step approach that first uses regression methodology to estimate a conditional mean for the response under each treatment and then recommends the treatment with the best estimated conditional mean value (Qian and Murphy, 2011). Since these methods are based on regression modeling, they can be easily extended to meet the specific requirements of a wide array of data sources. However, there are a number of concerns about the performance of these methods (Beygelzimer and Langford, 2009; Zhao et al., 2012). Much of the recent work in ITR estimation is based on outcome weighted learning (OWL), which reformulates the estimation procedure as weighted classification (Zhao et al., 2012). The original paper assumed fully observed, binary treatment data from a clinical trial. Several extensions have relaxed each of these assumptions individually (Zhao et al., 2014; Wang et al., 2016; Cui et al., 2017; Liang et al., 2018), but none of them relax all three assumptions simultaneously. Policy evaluation methods are designed to work with multiple treatments and observational data (Dudík et al., 2011; Kallus, 2018; Athey and Wager, 2017). However, current extensions of these methods to right censored data use multiple weighting approaches, which results in high variance estimators, or rely on strong assumptions.

In this article, we propose a new method which addresses the shortcomings of existing approaches with respect to our motivating example. We focus on balanced policy evaluation and learning, an approach proposed by Kallus (2018). Weighting approaches to policy evaluation can suffer from high variance when the weights are large. Existing variance reduction methods result in a biased estimator, but the amount of bias introduced is unknown, making it difficult to optimize the bias-variance tradeoff. Balanced policy evaluation seeks to quantify the bias-variance tradeoff nonparametrically by finding weights which minimize a measure of the conditional mean squared error (CMSE). We extend the balanced policy approach to right censored data by developing an imputation approach which replaces the censored observations with the conditional expected survival time. This allows us to define weighted and doubly robust estimators for policy evaluation for two or more treatments under right censoring. By more explicitly quantifying the bias-variance tradeoff and reducing the impact of censoring via imputation, the proposed approach is able to reduce the variance, leading to more efficient policy evaluation and learning.

While we focus on a single policy evaluation method, the imputation approach we propose could be used to extend any ITR estimation method to work with right censored data because it results in an estimated fully observed outcome vector. We demonstrate that the proposed approach results in improved performance when compared to existing methods. Nonetheless, the performance of balanced policy evaluation and learning can suffer in the presence of high dimensional, potentially noisy covariate information. We discuss a number of ways to mitigate this issue in the presence of right censored data, and demonstrate their use in simulation studies.

The remainder of this article is organized as follows. In Section 2, we discuss the policy evaluation approach to estimating ITRs, and introduce an imputation approach for right censored data. In Section 3, we develop the theory for the proposed method. Simulation studies are presented in Section 4. We also illustrate our method using data from the University of North Carolina Center for AIDS Research HIV Clinical Cohort in Section 5. The article concludes with a discussion of future work in Section 6. Some technical results are provided in the Appendix.

2 Methods

2.1 Problem Setup and Notation

Before discussing policy evaluation in the presence of right censoring, we first introduce some notation and describe policy evaluation and learning adapted to the fully observed survival time setting. Let $X \in \mathcal{X}$ be the observed subject-level covariate vector, where $\mathcal{X}$ is a $p$-dimensional vector space, and let $A$ be the treatment, where $A$ comes from a finite action space $\mathcal{A} = \{1, \dots, m\}$. For each $a \in \mathcal{A}$ define $\tilde{T}(a)$ as the unrestricted failure time that would be observed under treatment $a$, and let $\tilde{T} = \tilde{T}(A)$. In many studies, the unrestricted failure time is rarely observable due to the existence of a maximum follow-up time $\tau$. We instead consider $T(a)$, a truncated version of $\tilde{T}(a)$ at $\tau$, i.e., $T(a) = \min\{\tilde{T}(a), \tau\}$ for $a \in \mathcal{A}$. For a covariate vector $x$, treatment $a$, and failure time $T$, we define the mean-outcome function as $\mu(x, a) = E[T(a) \mid X = x]$.

In most survival studies there are some subjects for whom observation of the failure is precluded by the occurrence of censoring events. For any treatment $a \in \mathcal{A}$, consider a censoring time $C(a)$ which is independent of $T(a)$ given $X$ and $A$. Define the observed time $U = \min\{T, C\}$ and the failure indicator $\delta = I(T \le C)$. Let the failure time and censoring time be generated from distributions with conditional survival functions $S_T(t \mid x, a)$ and $S_C(t \mid x, a)$, respectively. The censored data consist of $n$ observations of covariate, treatment, observed time, and failure indicator: $\mathcal{D} = \{(X_i, A_i, U_i, \delta_i)\}_{i=1}^n$. We assume that the data are drawn iid from a fixed distribution, and that we observe $T = T(A)$. This definition of $T$ implies the causal consistency assumption of the potential outcomes framework and the stable unit treatment value assumption (SUTVA), which ensures that the observed outcome for one subject is unaffected by the treatment assignments of the other subjects (Rubin, 1986). We also assume that the data satisfy conditional exchangeability:

Assumption 1.

$T(a) \perp A \mid X$ for all $a \in \mathcal{A}$.

This is commonly referred to as the no unmeasured confounders assumption, where the treatment decision is not influenced by the outcomes except through their mutual relationship with the observed covariates $X$. In our setting, conditional exchangeability is equivalent to there being an unknown propensity function $\phi$ which, given covariate vector $x$, assigns treatment $a$ with probability $\phi_a(x) = P(A = a \mid X = x)$.

A policy $\pi$ is a map from the covariate space $\mathcal{X}$ to a probability vector in the $m$-simplex, $\Delta^{m-1} = \{p \in \mathbb{R}^m_{\ge 0} : \sum_{a=1}^m p_a = 1\}$. For an observation with covariate vector $x$, the policy specifies that treatment $a$ should be administered with probability $\pi_a(x)$. There are two main goals in this framework. The first goal, known as policy evaluation, is to estimate the average outcome that would be observed if treatments were assigned according to the policy $\pi$. Because the implementation of a bad policy can be costly or dangerous, new policies are generally evaluated using historical observations (Thomas et al., 2015). The second goal, known as policy learning, attempts to identify a policy which maximizes the expected outcome with respect to some reward. In this article we will assume that larger outcomes are preferable, and thus the reward would be $Y = T$ or $Y = \log T$.

To formally define policy evaluation, consider a policy $\pi$ for which we wish to estimate the sample-average policy effect (SAPE),

$$\mathrm{SAPE}(\pi) = \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi_a(X_i)\, \mu(X_i, a).$$

The SAPE quantifies the average outcome that would have been observed had the treatments been assigned according to a given policy $\pi$. The SAPE is strongly consistent for the population-average policy effect (PAPE), $\mathrm{PAPE}(\pi) = E\left[\sum_{a \in \mathcal{A}} \pi_a(X)\, \mu(X, a)\right]$. If $\pi^*$ is such that $\mathrm{PAPE}(\pi^*) = \sup_{\pi} \mathrm{PAPE}(\pi)$, then $\pi^*$ is an optimal policy. That is, $\pi^*$ maximizes the PAPE over all functions $\pi : \mathcal{X} \to \Delta^{m-1}$. For an optimal policy $\pi^*$, $R(\pi) = \mathrm{PAPE}(\pi^*) - \mathrm{PAPE}(\pi)$ is the regret of $\pi$. In policy learning, we wish to find a policy that minimizes the expected regret.

2.2 Existing Approaches to Policy Evaluation

Current approaches to policy evaluation generally fall into a few categories which are based on mean outcome modeling, weighting methods, or a combination thereof. The regression-based estimator first fits regression models $\hat\mu(x, a)$ of $\mu(x, a)$ for all $a \in \mathcal{A}$. These models are then used to estimate the SAPE in a plug-in fashion

$$\hat\tau_{\mathrm{reg}}(\pi) = \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \pi_a(X_i)\, \hat\mu(X_i, a). \qquad (1)$$

This approach assumes that the regression models for each treatment are correctly specified, which can be a strong assumption when the number of treatments is large (Zhao et al., 2012).
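To make the plug-in recipe concrete, a minimal Python sketch is given below; the `models` dictionary of per-treatment regressors and the `policy` callable are hypothetical stand-ins, and any regression method with a `predict` interface could play the role of $\hat\mu$.

```python
import numpy as np

def regression_policy_value(models, X, policy):
    """Plug-in (regression-based) SAPE estimate, equation (1).

    models : dict mapping each treatment a to a fitted regressor whose
             .predict(X) returns estimates of mu(x, a).
    X      : (n, p) covariate matrix.
    policy : callable mapping X to an (n, m) matrix of probabilities.
    """
    treatments = sorted(models)
    pi = policy(X)                                   # (n, m) pi_a(X_i)
    mu_hat = np.column_stack([models[a].predict(X) for a in treatments])
    return np.mean(np.sum(pi * mu_hat, axis=1))      # average plug-in value
```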

Weighting-based estimators use covariate and treatment data to find weights, $W_i$, that reweight the outcome data to make it look as though it were generated by the policy being evaluated. These estimators are generally of the form

$$\hat\tau_W(\pi) = \frac{1}{n} \sum_{i=1}^n W_i Y_i, \qquad (2)$$

where the weights are often normalized to sum to $n$. With weights $W$ and regression models $\hat\mu$ we can define a doubly robust (DR) estimator

$$\hat\tau_{W,\hat\mu}(\pi) = \frac{1}{n} \sum_{i=1}^n \left[ W_i \left\{ Y_i - \hat\mu(X_i, A_i) \right\} + \sum_{a \in \mathcal{A}} \pi_a(X_i)\, \hat\mu(X_i, a) \right], \qquad (3)$$

which can be thought of as denoising the weighted estimator by subtracting the conditional mean from $Y_i$, or debiasing the regression-based estimator using the reweighted residuals.
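The weighted and doubly robust estimators in (2) and (3) can be sketched in the same style; the helpers below reuse the hypothetical `models` and `policy` objects from the previous sketch.

```python
import numpy as np

def weighted_policy_value(W, Y):
    """Weighted SAPE estimate, equation (2); W normalized to sum to n."""
    return np.mean(W * Y)

def doubly_robust_policy_value(W, Y, X, A, policy, models):
    """Doubly robust SAPE estimate, equation (3): the plug-in estimate
    debiased by the reweighted residuals."""
    treatments = sorted(models)
    pi = policy(X)
    mu_hat = np.column_stack([models[a].predict(X) for a in treatments])
    idx = np.array([treatments.index(a) for a in A])
    mu_obs = mu_hat[np.arange(len(A)), idx]          # mu_hat(X_i, A_i)
    plug_in = np.sum(pi * mu_hat, axis=1)            # sum_a pi_a(X_i) mu_hat(X_i, a)
    return np.mean(W * (Y - mu_obs) + plug_in)
```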

A common weighting-based approach is inverse propensity weighting (IPW) (Imbens, 2000). By noting that $E\left[ \frac{\pi_A(X)}{\phi_A(X)}\, Y \right] = \mathrm{PAPE}(\pi)$, for $\hat\phi$ an estimate of the unknown propensity function $\phi$, one can construct weights

$$W_i^{\mathrm{IPW}} = \frac{\pi_{A_i}(X_i)}{\hat\phi_{A_i}(X_i)}, \qquad (4)$$

which gives rise to the estimator $\hat\tau_{\mathrm{IPW}}(\pi) = \frac{1}{n} \sum_{i=1}^n W_i^{\mathrm{IPW}} Y_i$. Since the propensities appear in the denominator, small values of $\hat\phi_{A_i}(X_i)$ can lead $\hat\tau_{\mathrm{IPW}}$ to have a large variance. Some ad hoc solutions for reducing the variance exist, such as clipping, where $\hat\phi_{A_i}(X_i)$ is replaced by $\max\{\hat\phi_{A_i}(X_i), \eta\}$ for some $\eta > 0$, but these methods generally have the effect of increasing the bias (Li et al., 2015).
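A sketch of the clipped IPW weights of equation (4) follows; the clipping threshold `eta` and the final normalization to sum to $n$ follow the common conventions noted above.

```python
import numpy as np

def ipw_weights(pi, phi_hat, A_idx, eta=0.0):
    """Clipped IPW weights, equation (4).

    pi      : (n, m) policy probabilities pi_a(X_i).
    phi_hat : (n, m) estimated propensities phi_a(X_i).
    A_idx   : length-n observed treatment indices in {0, ..., m-1}.
    eta     : clipping level; eta > 0 trades added bias for less variance.
    """
    rows = np.arange(len(A_idx))
    phi_obs = np.maximum(phi_hat[rows, A_idx], eta)
    W = pi[rows, A_idx] / phi_obs
    return W * len(W) / W.sum()        # normalize weights to sum to n
```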

Kallus (2018) proposed a weighting-based estimator which simultaneously controls both the bias and variance. For a generic weighted estimator $\hat\tau_W(\pi)$, consider the conditional mean squared error (CMSE),

$$\mathrm{CMSE}(W; \pi) = E\left[ \left\{ \hat\tau_W(\pi) - \mathrm{SAPE}(\pi) \right\}^2 \,\middle|\, X_{1:n}, A_{1:n} \right]. \qquad (5)$$

Balanced policy evaluation takes the CMSE and decomposes it into squared bias and variance components. The bias term depends on the weights $W$ and the unknown mean-outcome function $\mu$. Rather than directly estimating $\mu$, the balanced policy approach defines a worst case bias by choosing the mean-outcome function, from a bounded class of functions, that maximizes the squared bias. Once the worst case bias term has been identified, the variance term is estimated, and balanced policy weights are found by minimizing the sum of the worst case squared bias and the variance with respect to the weights $W$. Additional details for balanced policy evaluation are provided in Section 2.4.

2.3 Extensions to Right Censored Data

Analyzing right censored data without accounting for the missing information results in bias (Fleming and Harrington, 2011); consistent policy evaluation therefore requires methods designed to correct for the bias due to censoring.

Regression-based methods can be extended to right censored data by using regression models designed for censored data. Common model choices include parametric and semiparametric models such as the accelerated failure time and Cox proportional hazards models (Kalbfleisch and Prentice, 2011). However, these models make restrictive assumptions about the structure of the failure time distributions which are often difficult to verify. Non-parametric models of the survival function have been proposed, such as the kernel conditional Kaplan-Meier estimate (Dabrowska, 1987) or random forest based methods (Hothorn et al., 2004; Ishwaran et al., 2008; Zhu and Kosorok, 2012; Steingrimsson et al., 2016; Cui et al., 2019), which define localized versions of simple survival models, such as the Kaplan-Meier estimator. Non-parametric models make less restrictive assumptions, but they are less efficient and unlikely to produce good results with small sample sizes.

Inverse propensity weighted methods can be extended to right censored data by looking only at the observed failure times and weighting based on the probability of observation. Inverse probability of censoring weighting (IPCW) is motivated by an idea similar to IPW. Since $E\left[ \frac{\delta}{S_C(T \mid X, A)} \,\middle|\, X, A, T \right] = 1$, IPCW can be combined with IPW by defining weights

$$W_i^{\mathrm{IPCW}} = \frac{\delta_i\, \pi_{A_i}(X_i)}{\hat\phi_{A_i}(X_i)\, \hat{S}_C(U_i \mid X_i, A_i)}. \qquad (6)$$

As long as the models for $\phi$ and $S_C$ are correctly specified, the resulting estimator, $\hat\tau_{\mathrm{IPCW}}(\pi)$, will be consistent for the SAPE (Anstrom and Tsiatis, 2001). However, with two estimated probabilities in the denominator, the issues with high variance will be compounded.
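The combined weights in (6) simply multiply the IPW weight by the failure indicator over the estimated censoring survival probability, as in this sketch (building on `ipw_weights` above):

```python
import numpy as np

def ipw_ipcw_weights(W_ipw, delta, S_C_hat, eta=0.0):
    """Combined IPW-IPCW weights, equation (6): only uncensored subjects
    (delta = 1) receive positive weight, inflated by the inverse estimated
    probability of remaining uncensored at the observed time.

    W_ipw   : length-n IPW weights from equation (4).
    delta   : length-n failure indicators.
    S_C_hat : length-n estimates of S_C(U_i | X_i, A_i).
    """
    W = delta * W_ipw / np.maximum(S_C_hat, eta)
    return W * len(W) / W.sum()        # renormalize to sum to n
```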

Balanced policy weights are incompatible with IPCW. Depending on when the IPC weights are applied, balanced policy weights are either no longer identifiable, or they lose the property of minimizing the worst case bias and variance. To extend balanced policy to right censored data, we use an imputation approach similar to one proposed by Cui et al. (2017). Consider an imputed failure time vector $Y^*$ in which the censored times are replaced with the conditional expected failure time given the covariates, the treatment, and the knowledge that the subject survived at least until the observed censoring time.

Estimating the expected failure time for the censored subjects requires a conditional estimate of the survival function. In this article we focus on estimation of the conditional expected failure time using random forest methods for right censored data because they provide nonparametric estimates with good performance for higher dimensional data (Cui et al., 2019). Random survival forests (RSF) (Ishwaran et al., 2008) and recursively imputed survival trees (RIST) (Zhu and Kosorok, 2012) are based on extensions of random forests (RF) (Breiman, 2001) and extremely randomized trees (ERT) (Geurts et al., 2006), respectively, where the splitting rule for each node is chosen to maximize the log-rank test statistic. Cui et al. (2019) note that a splitting rule based on the standard log-rank test statistic can produce biased results because it does not appropriately control for the censoring distribution, and instead propose a splitting rule which corrects for the bias due to censoring.

Regardless of the method used to estimate the survival function, given an estimate $\hat{S}_T(t \mid x, a)$, the conditional expected survival time can be estimated as

$$\hat{E}[T \mid T > U_i, X_i, A_i] = U_i + \frac{\int_{U_i}^{\tau} \hat{S}_T(t \mid X_i, A_i)\, dt}{\hat{S}_T(U_i \mid X_i, A_i)},$$

and the estimated imputed failure time vector can be defined as

$$\hat{Y}^*_i = \delta_i U_i + (1 - \delta_i)\, \hat{E}[T \mid T > U_i, X_i, A_i]. \qquad (7)$$

By using $\hat{Y}^*$ in place of $Y$ in equation (2) we can now define a balanced policy approach to right censored data.
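Given any estimate of the conditional survival curve on a time grid (for example, from a fitted survival forest), the imputation in (7) amounts to one numerical integration per censored subject. A minimal sketch, assuming curves evaluated on a shared grid that extends to $\tau$:

```python
import numpy as np

def impute_censored(U, delta, surv_curves, times, tau):
    """Imputed failure time vector, equation (7).

    U           : length-n observed times.
    delta       : length-n failure indicators (1 = failure observed).
    surv_curves : (n, k) array; row i holds the estimated conditional
                  survival curve S_T(t | X_i, A_i) on the grid `times`.
    times       : length-k increasing time grid covering [0, tau].
    tau         : maximum follow-up time.
    """
    Y = U.astype(float).copy()
    for i in np.where(delta == 0)[0]:
        keep = times >= U[i]
        t, s = times[keep], surv_curves[i, keep]
        s0 = max(s[0], 1e-12)          # estimate of S_T(U_i | X_i, A_i)
        # E[T | T > U_i, X_i, A_i] = U_i + int_{U_i}^{tau} S(t) dt / S(U_i)
        Y[i] = U[i] + np.trapz(s, t) / s0
    return np.minimum(Y, tau)          # respect truncation at tau
```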

2.4 Balanced Policy Evaluation with Right Censored Data

Recall from equation (5) that we defined the CMSE as the squared difference between a weighted estimator and the SAPE. We define

$$B(W; f) = \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \left( W_i\, \delta_{A_i a} - \pi_a(X_i) \right) f(X_i, a),$$

where $\delta_{A_i a}$ is the Kronecker delta. Letting $\epsilon_i = Y_i - \mu(X_i, A_i)$, from Kallus (2018) Theorem 1, we have that

$$\hat\tau_W(\pi) - \mathrm{SAPE}(\pi) = B(W; \mu) + \frac{1}{n} \sum_{i=1}^n W_i\, \epsilon_i$$

and, under Assumption 1,

$$\mathrm{CMSE}(W; \pi) = B^2(W; \mu) + \frac{1}{n^2} \sum_{i=1}^n W_i^2\, \sigma^2(X_i, A_i),$$

where $\sigma^2(x, a) = \mathrm{Var}(Y \mid X = x, A = a)$. This decomposes the CMSE into squared bias and variance components. The measure of the variance will be given by the norm of the weights, $\|W\|^2_{\Sigma} = \frac{1}{n^2} W^\top \Sigma W$, for a positive semidefinite (PSD) matrix $\Sigma$. Instead of directly estimating the bias term, we find the worst case bias by maximizing $B(W; f)$ over all functions $f$ in a bounded function class $\mathcal{F}$.

To define the worst case bias, we look at functions, $f = (f_1, \dots, f_m)$, in the unit ball of a direct product of reproducing kernel Hilbert spaces (RKHS):

$$\mathcal{F} = \left\{ f : \|f\| \le 1 \right\}, \quad \|f\| = \left( \sum_{a=1}^m \gamma_a^2\, \|f_a\|^2_{\mathcal{K}_a} \right)^{1/2}, \qquad (8)$$

where $\|\cdot\|_{\mathcal{K}_a}$ is the norm of the RKHS given by the PSD kernel $\mathcal{K}_a$. Any choice of norm $\|\cdot\|$ gives rise to the worst case CMSE objective:

$$\mathcal{E}^2(W; \pi, \|\cdot\|) = \sup_{\|f\| \le 1} B^2(W; f) + \|W\|^2_{\Sigma}.$$

Balanced policy evaluation is given by the estimator $\hat\tau_{W^*}(\pi)$ where $W^*$ is the minimizer of $\mathcal{E}^2$ over the space of all nonnegative weights that sum to $n$, $\mathcal{W}_n = \{W \in \mathbb{R}^n_{\ge 0} : \sum_{i=1}^n W_i = n\}$. Specifically,

$$W^*(\pi; \|\cdot\|, \Sigma) = \operatorname*{argmin}_{W \in \mathcal{W}_n} \mathcal{E}^2(W; \pi, \|\cdot\|). \qquad (9)$$
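Under the reconstruction above, the inner problem (9) is a convex quadratic program: the worst case squared bias is a quadratic form in the per-treatment discrepancy vectors through the kernel Gram matrices, and the variance term is diagonal. The sketch below solves it with a general-purpose solver; the kernel scale factors $\gamma_a$ are assumed absorbed into the Gram matrices, and `sigma2` is a hypothetical plug-in for the diagonal of $\Sigma$.

```python
import numpy as np
from scipy.optimize import minimize

def balanced_weights(K_list, pi, A_idx, sigma2):
    """Balanced policy weights: a sketch of the quadratic program (9).

    K_list : list of m (n, n) PSD Gram matrices, one kernel per treatment.
    pi     : (n, m) policy probabilities pi_a(X_i).
    A_idx  : length-n observed treatment indices in {0, ..., m-1}.
    sigma2 : length-n plug-in for the diagonal of Sigma.
    """
    n, m = pi.shape

    def objective(W):
        val, grad = 0.0, np.zeros(n)
        for a in range(m):
            ind = (A_idx == a).astype(float)
            d = W * ind - pi[:, a]          # discrepancy vector for arm a
            Kd = K_list[a] @ d
            val += d @ Kd                   # worst case squared bias term
            grad += 2.0 * ind * Kd
        val += W @ (sigma2 * W)             # variance term ||W||_Sigma^2
        grad += 2.0 * sigma2 * W
        return val / n**2, grad / n**2

    cons = [{"type": "eq", "fun": lambda W: W.sum() - n,
             "jac": lambda W: np.ones(n)}]
    res = minimize(objective, x0=np.ones(n), jac=True, method="SLSQP",
                   bounds=[(0.0, None)] * n, constraints=cons)
    return res.x
```

A dedicated QP solver would be preferable at larger sample sizes, but a general solver suffices to illustrate the structure of the problem.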

With the imputed survival time vector, $\hat{Y}^*$, and the balanced policy weights, we can define the weighted and doubly robust balanced policy estimators for right censored data:

$$\hat\tau_{W^*}(\pi) = \frac{1}{n} \sum_{i=1}^n W^*_i\, \hat{Y}^*_i, \qquad (10)$$

$$\hat\tau_{W^*\!,\hat\mu}(\pi) = \frac{1}{n} \sum_{i=1}^n \left[ W^*_i \left\{ \hat{Y}^*_i - \hat\mu(X_i, A_i) \right\} + \sum_{a \in \mathcal{A}} \pi_a(X_i)\, \hat\mu(X_i, a) \right]. \qquad (11)$$

Next we consider policy learning. For a given policy class $\Pi$, let $\hat\pi$ be the policy which maximizes the balanced policy evaluation estimator using the weights defined in equation (9). We formulate this as a bilevel optimization problem, with a weighted and a doubly robust version:

$$\hat\pi_{W^*} = \operatorname*{argmax}_{\pi \in \Pi}\; \hat\tau_{W^*(\pi; \|\cdot\|, \Sigma)}(\pi), \qquad (12)$$

$$\hat\pi_{W^*\!,\hat\mu} = \operatorname*{argmax}_{\pi \in \Pi}\; \hat\tau_{W^*(\pi; \|\cdot\|, \Sigma),\hat\mu}(\pi), \qquad (13)$$

where in each case the inner problem (9) defines the weights for the candidate policy $\pi$.

It is difficult to maximize over the full policy class $\Pi_{\mathrm{all}} = \{\pi : \mathcal{X} \to \Delta^{m-1}\}$, so it is common to use a reduced class such as the parameterized logistic policy class

$$\Pi_{\Theta} = \left\{ \pi^\theta : \pi^\theta_a(x) = \frac{\exp(\theta_a^\top x)}{\sum_{a' \in \mathcal{A}} \exp(\theta_{a'}^\top x)},\; \theta \in \Theta \right\}. \qquad (14)$$

Using a reduced policy class limits the flexibility of balanced policy learning, but for moderate sample sizes the reduced complexity can allow for more accurate estimation of the optimal policy.
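The class (14) is straightforward to represent in code; a small sketch:

```python
import numpy as np

def softmax_policy(theta):
    """Parameterized (softmax-linear) policy class, equation (14).

    theta : (m, p) array; row a holds the coefficients for treatment a.
    Returns a callable mapping an (n, p) covariate matrix to an (n, m)
    matrix of treatment probabilities.
    """
    def pi(X):
        logits = X @ theta.T                          # (n, m) scores
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        expl = np.exp(logits)
        return expl / expl.sum(axis=1, keepdims=True)
    return pi
```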

A high-level description of the proposed method is given in Algorithm 1 below.

Algorithm 1.

Pseudo-algorithm for balanced policy evaluation and learning with imputation

Step 1: Use $\mathcal{D} = \{(X_i, A_i, U_i, \delta_i)\}_{i=1}^n$ to fit a model for the conditional survival function, $\hat{S}_T(t \mid x, a)$, and create the imputed survival time vector, $\hat{Y}^*$, by using $\hat{S}_T$ to estimate the conditional expected survival times for the censored observations.
Step 2: For a norm, $\|\cdot\|$, and diagonal matrix, $\Sigma$, define a policy class $\Pi$ and select an initial policy $\pi^{(0)} \in \Pi$.
Step 3: Find the balanced policy weights associated with the current policy $\pi^{(k)}$, $\|\cdot\|$, and $\Sigma$, and use the weights to calculate $\hat\tau_{W^*}(\pi^{(k)})$ according to equation (10).
Step 4: Calculate the gradient of the estimated policy value with respect to the policy parameters and use gradient ascent to find an improved policy $\pi^{(k+1)}$.
Step 5: Repeat steps 3-4 until convergence to an estimated optimal policy $\hat\pi$.
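Putting the pieces together, the following sketch mirrors Steps 3-5 for the logistic class (14), reusing the hypothetical helpers `softmax_policy` and `balanced_weights` from the earlier sketches; the gradient in Step 4 is approximated by finite differences purely for clarity, whereas an analytic or automatic-differentiation gradient would be used in practice.

```python
import numpy as np

def learn_policy(X, A_idx, Y_imputed, K_list, sigma2, theta0,
                 step=0.05, n_iter=200, fd_eps=1e-4):
    """High-level sketch of Steps 3-5 of Algorithm 1."""
    theta = theta0.copy()

    def value(th):
        pi = softmax_policy(th)(X)
        W = balanced_weights(K_list, pi, A_idx, sigma2)   # Step 3
        return np.mean(W * Y_imputed)                     # equation (10)

    for _ in range(n_iter):                               # Steps 4-5
        base = value(theta)
        grad = np.zeros_like(theta)
        for j in np.ndindex(*theta.shape):                # finite differences
            th = theta.copy()
            th[j] += fd_eps
            grad[j] = (value(th) - base) / fd_eps
        theta = theta + step * grad                       # gradient ascent step
        if np.linalg.norm(grad) * step < 1e-6:            # Step 5: converged
            break
    return theta
```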

2.5 High Dimensional Considerations

Because the balanced policy weights are defined using a kernel norm which depends only on the sample covariate space, balanced policy evaluation and learning can be sensitive to the presence of covariates which contain redundant or noisy information. The negative effects of high-dimensional noisy data can be reduced in several ways, including the choice of the kernel norm and feature elimination, both of which we detail here.

Recall from equation (8) that we defined the norm of a function with respect to the direct product of RKHSs. One commonly used kernel is the Mahalanobis radial basis function (RBF) kernel

$$\mathcal{K}(x, x') = \alpha \exp\left\{ -\tfrac{1}{2} (x - x')^\top D^{-1} (x - x') \right\}, \qquad (15)$$

where $D = \mathrm{diag}(\gamma_1^2, \dots, \gamma_p^2)$ holds covariate-specific bandwidths. In Lemma 1 point (3) of Kallus (2018), it was shown that, for the norm defined in (8), if the mean-outcome function lies in the corresponding RKHS and has a Gaussian process prior, then kernel hyperparameters, such as $\alpha$ and the $\gamma_j$, can be selected using the marginal likelihood principle. Choosing the hyperparameters in this manner reduces the influence of the unimportant predictors. In simulations this has greatly improved the accuracy of the balanced policy estimator with respect to evaluation. However, since the redundant or noisy predictors are not eliminated, the dimension of the search space remains high, which can still adversely impact policy learning.
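In practice this marginal likelihood step can be carried out with standard Gaussian process tooling; the sketch below uses scikit-learn with an anisotropic RBF kernel (one length-scale per covariate, playing the role of the $\gamma_j$), fit to the imputed outcomes:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def select_kernel_hyperparameters(X, Y_imputed):
    """Choose RBF hyperparameters by maximizing the GP marginal likelihood.

    With one length-scale per covariate, maximizing the marginal likelihood
    inflates the length-scales of unimportant predictors, which shrinks
    their influence on the kernel norm used for the worst case bias.
    """
    p = X.shape[1]
    kernel = (ConstantKernel(1.0) * RBF(length_scale=np.ones(p))
              + WhiteKernel(noise_level=1.0))
    gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
    gp.fit(X, Y_imputed)       # internally maximizes the log marginal likelihood
    return gp.kernel_          # kernel object carrying the tuned hyperparameters
```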

Reducing the dimension of the search space using variable selection (VS) methods can improve the performance of policy learning. Nearly any VS method could be used, but to keep our method as general as possible we examined non-parametric VS methods such as the risk-recursive feature elimination algorithm (risk-RFE) proposed by Dasgupta et al. (2019). Risk-RFE uses kernel machines to rank the importance of the features based on the regularized risk. Once the features are ranked in order of estimated importance, change point estimation methods can be used to identify the important predictors based on the increase in regularized risk at each step. For additional details see Dasgupta et al. (2019).

In our implementation for high dimensional data, we used methods where tuning hyperparameters and variable selection required a complete outcome vector. Therefore, these methods were implemented prior to Step 3 of Algorithm 1, but methods which are compatible with right censoring could be implemented prior to Step 1.

3 Theory

The properties of the proposed estimator will depend on the method used to estimate the conditional survival function $S_T(t \mid x, a)$, so to facilitate the discussion we make the following assumption:

Assumption 2 (Convergence rate).

There exists some sequence $r_n \to 0$ such that

$$\sup_{t \in [0, \tau]} \left| \hat{S}_T(t \mid x, a) - S_T(t \mid x, a) \right| = O_P(r_n).$$

Assumption 2 may seem strong, but it is met for a large number of methods for estimating $S_T$. For example, the assumption holds for correctly specified AFT and Cox PH models with the parametric rate $r_n = n^{-1/2}$. Assumption 2 is also met for non-parametric methods, such as kernel conditional Kaplan-Meier and random forest based methods. For the kernel conditional Kaplan-Meier, the rate depends on the covariate dimension (Dabrowska, 1989). The convergence rate for the bias corrected random survival forest method proposed by Cui et al. (2019) depends on the splitting criteria, with an explicit rate established for the theoretically optimal splitting criterion. Convergence rates for other random forest methods (Ishwaran et al., 2008; Zhu and Kosorok, 2012) have not yet been established, but we believe that the rates will be similar to those found in Cui et al. (2019).

We show that a convergence rate for $\hat{S}_T$ is enough to guarantee the convergence rate of the imputed values; however, Theorems 12.4 and 12.5 of Kosorok (2008) imply that it is also sufficient to have a convergence rate for an estimate of the cumulative hazard function.

Lemma 1.

If there exists some sequence $r_n \to 0$ such that

$$\sup_{t \in [0, \tau]} \left| \hat{S}_T(t \mid x, a) - S_T(t \mid x, a) \right| = O_P(r_n),$$

then

$$\frac{1}{n} \sum_{i=1}^n W_i \left( \hat{Y}^*_i - Y^*_i \right) = O_P(r_n),$$

where $W$ are the balanced policy weights.

To prove Lemma 1 we need an additional assumption.

Assumption 3.

Let $\hat{Y}^*_{(i)}$ denote the imputed value for subject $i$ when the survival function estimate is based on all observations except for the $i$th. For all $i$ and any $\epsilon > 0$,

$$P\left( \left| \hat{Y}^*_i - \hat{Y}^*_{(i)} \right| > \epsilon \right) \to 0.$$

Since $\hat{Y}^*_i$ is based on more information than $\hat{Y}^*_{(i)}$, this assumption should be met provided $\hat{S}_T$ is consistent. The proof of Lemma 1 can be found in the appendix.

Consistent evaluation requires a weak form of overlap between the unknown propensity function and the target policy being evaluated:

Assumption 4 (Weak overlap).

$\phi_a(x) > 0$ for all $a \in \mathcal{A}$ and all $x \in \mathcal{X}$, and $E\left[ 1/\phi_a(X)^2 \right] < \infty$.

Another requirement is well-specification of the mean-outcome function. For balanced policy evaluation the mean-outcome function is well-specified if it is in the RKHS product used to compute the worst case bias. Otherwise, consistency is also guaranteed if the RKHS product consists of $C_0$-universal kernels, defined below, such as the RBF kernel.

Definition 1.

A PSD kernel $\mathcal{K}$ on a Hausdorff space $\mathcal{X}$ (e.g., $\mathbb{R}^p$) is $C_0$-universal if, for any continuous function $g$ with compact support (i.e., $g(x) = 0$ for all $x \notin C$ for some compact $C \subseteq \mathcal{X}$) and any $\epsilon > 0$, there exists $f$ in the RKHS of $\mathcal{K}$ such that $\sup_{x \in \mathcal{X}} |f(x) - g(x)| \le \epsilon$.

Theorem 1.

Fix $\pi$ and let $\hat\tau_{W^*}(\pi)$ be as in (10) with $W^* = W^*(\pi; \|\cdot\|, \Sigma)$, where $\Sigma$ is PSD with $\Sigma_{ii} \ge \sigma^2(X_i, A_i)$ for each $i$. Suppose Assumptions 1-4 hold and $Y$ is a.s. bounded. Then the following two results hold:

  • If $\mu_a \in \mathcal{F}_a$ for all $a \in \mathcal{A}$, then $\hat\tau_{W^*}(\pi) - \mathrm{SAPE}(\pi) = O_P(r_n + n^{-1/2})$,

  • If $\mathcal{K}_a$ is $C_0$-universal for all $a \in \mathcal{A}$, then $\hat\tau_{W^*}(\pi) - \mathrm{SAPE}(\pi) \overset{p}{\to} 0$.

Corollary 2.

Suppose the assumptions of Theorem 1 hold. Then:

  • If $\hat\mu_a$ is consistent and $\mu_a \in \mathcal{F}_a$ for all $a \in \mathcal{A}$, then $\hat\tau_{W^*\!,\hat\mu}(\pi) - \mathrm{SAPE}(\pi) = O_P(r_n + n^{-1/2})$,

  • If $\mathcal{K}_a$ is $C_0$-universal for all $a \in \mathcal{A}$, then $\hat\tau_{W^*\!,\hat\mu}(\pi) - \mathrm{SAPE}(\pi) \overset{p}{\to} 0$.

Proof of Theorem 1.

Define

$$\hat\tau^{Y^*} = \frac{1}{n} \sum_{i=1}^n W^*_i\, Y^*_i \quad \text{and} \quad \hat\tau^{\hat{Y}^*} = \frac{1}{n} \sum_{i=1}^n W^*_i\, \hat{Y}^*_i$$

as the estimators which use the true and estimated imputed failure time vectors, respectively. We can then decompose the evaluation error as follows:

$$\hat\tau^{\hat{Y}^*} - \mathrm{SAPE}(\pi) = \left( \hat\tau^{\hat{Y}^*} - \hat\tau^{Y^*} \right) + \left( \hat\tau^{Y^*} - \mathrm{SAPE}(\pi) \right).$$

By Lemma 1 we have that $\hat\tau^{\hat{Y}^*} - \hat\tau^{Y^*} = O_P(r_n)$. If $\mu_a \in \mathcal{F}_a$ for all $a \in \mathcal{A}$, then by Kallus (2018) Theorem 3, we have that $\hat\tau^{Y^*} - \mathrm{SAPE}(\pi) = O_P(n^{-1/2})$. If $\mathcal{K}_a$ is $C_0$-universal for all $a \in \mathcal{A}$, then by Kallus (2018) Theorem 3, we have that $\hat\tau^{Y^*} - \mathrm{SAPE}(\pi) \overset{p}{\to} 0$. ∎

The proof of Corollary 2 follows immediately from Lemma 1, as well as Corollary 4 of Kallus (2018), in a manner similar to the proof of Theorem 1.

Next, we establish the consistency and learning rates of the balanced policy learner uniformly over policy classes. For a given policy class $\Pi$, we define the sample regret as

$$R_n(\hat\pi) = \sup_{\pi \in \Pi} \mathrm{SAPE}(\pi) - \mathrm{SAPE}(\hat\pi),$$

and the population regret as

$$R(\hat\pi) = \sup_{\pi \in \Pi} \mathrm{PAPE}(\pi) - \mathrm{PAPE}(\hat\pi).$$

Convergence of these regret quantities requires that the best-in-class policy is learnable. We quantify learnability using the Rademacher complexity of the policy class, but the results could be extended to VC dimension. Let us define

$$\mathfrak{R}_n(\Pi) = E\left[ \sup_{\pi \in \Pi} \frac{1}{n} \sum_{i=1}^n \sum_{a \in \mathcal{A}} \epsilon_{ia}\, \pi_a(X_i) \right],$$

where the $\epsilon_{ia}$ are iid Rademacher random variables. We also must use a stronger version of the overlap assumption.

Assumption 5 (Strong overlap).

There exists $\eta > 0$ such that $\phi_a(x) \ge \eta$ for all $a \in \mathcal{A}$ and all $x \in \mathcal{X}$.

Theorem 3.

Let $\hat\pi$ be the balanced policy learner with $W^* = W^*(\pi; \|\cdot\|, \Sigma)$, where $\Sigma$ is PSD with $\Sigma_{ii} \ge \sigma^2(X_i, A_i)$ for each $i$. Suppose Assumptions 1-2 and 5 hold, $Y$ is a.s. bounded, and $\mathfrak{R}_n(\Pi) \to 0$. If $\hat\pi$ is as in (12), then the following results hold:

  • If $\mu_a \in \mathcal{F}_a$ for all $a \in \mathcal{A}$, then $R(\hat\pi) = O_P\left( r_n + \mathfrak{R}_n(\Pi) + n^{-1/2} \right)$,

  • If $\mathcal{K}_a$ is $C_0$-universal for all $a \in \mathcal{A}$, then $R(\hat\pi) \overset{p}{\to} 0$.

If $\hat\pi$ is as in (13), then the following results hold:

  • If $\hat\mu_a$ is consistent and $\mu_a \in \mathcal{F}_a$ for all $a \in \mathcal{A}$, then $R(\hat\pi) = O_P\left( r_n + \mathfrak{R}_n(\Pi) + n^{-1/2} \right)$,

  • If $\mathcal{K}_a$ is $C_0$-universal for all $a \in \mathcal{A}$, then $R(\hat\pi) \overset{p}{\to} 0$.

The proof of Theorem 3 follows directly from Assumption 2 and Lemma 1, as well as Corollary 7 of Kallus (2018). All of the same results hold when replacing the population regret $R$ with the sample regret $R_n$.

4 Simulation Studies

We have conducted simulation studies to assess the performance of the proposed method in comparison with existing alternatives. The methods we compare are the simple weighted and doubly robust versions of the balanced policy approach with the imputed failure time vector defined in equation (7), a normalized clipped version of the approach which combines IPW with IPCW as defined in equation (6), and a normalized clipped version of the IPW approach combined with the imputed vector from (7). In all settings, both the optimal and estimated policies are members of the reduced policy class defined in (14). The kernel used to define the worst case bias in the balanced policy approach is the Mahalanobis RBF kernel in equation (15), where the hyperparameters are chosen by the marginal likelihood method using Gaussian process regression. The same kernel is used for all of the treatments. For each setting, the conditional expected survival times are estimated using RIST with the tuning parameter settings suggested in Zhu and Kosorok (2012). Propensity scores are estimated using a correctly specified Gaussian process classifier. Due to the different censoring mechanisms, the weights used for IPCW are estimated differently for each setting. For the first setting, the censoring weights are estimated using RIST with the recommended tuning parameter settings. For the second setting, the censoring weights are estimated using the correctly specified accelerated failure time (AFT) model. A testing dataset of size 10,000 is used to approximate the population regret. For each setting, the log survival time is used as the reward to estimate the optimal policy on simulated training datasets of increasing size. Each simulation is repeated 100 times.

4.1 Simulation Settings

Our first setting is a slight modification of a setting found in Kallus (2018). To begin, we sample the covariate vector from a multivariate normal distribution, $X \sim N(0, \Sigma_X)$, where $\Sigma_X$ is compound symmetric with diagonal elements of 1 and off diagonal elements of 0.2. The treatments are assigned from a multinoulli distribution whose probabilities depend on the first two elements of $X$. The unrestricted failure times are drawn from a log-normal distribution whose mean depends on the angle of the point $(X_1, X_2)$ about the origin, and the observed failure time is truncated at the maximum follow-up time $\tau$. This mean outcome process results in an optimal treatment policy consisting of 5 equal sized wedges arranged radially about the origin. The censoring times are drawn from a log-normal distribution. The censoring rate is approximately 45%.

In the second setting, the failure times are drawn from log-normal distributions where the treatment effect has both prognostic and prescriptive elements. The 10-dimensional covariate vector, $X$, is drawn from a uniform distribution with a compound symmetric dependence structure by first drawing from a multivariate normal distribution with the same covariance matrix as in setting 1, applying the normal CDF transformation to each coordinate, and then scaling the result to lie between -1 and 1. The treatments are assigned from a multinoulli distribution whose probabilities depend on the first three elements of $X$. The unrestricted failure times are drawn from a log-normal distribution whose mean includes both prognostic main effects and treatment-covariate interactions, and the failure times are truncated at $\tau$. The censoring times are drawn from a log-normal distribution which depends on the assigned treatment. The censoring rate is about 45%.

4.2 Simulation Results

The results for setting 1 can be found in Figure 1. This setting is designed to have mismatch between the unknown propensity function and the optimal policy such that the inverse propensity weights are large, resulting in high variability for the IPW with IPCW, and IPW with Imputation approaches. As expected, the increased variance of the IPW based estimators results in reduced performance. The more stable balanced policy with imputation approach performs better than the other approaches. In this setting the feature elimination method effectively eliminates noise variables, even for small sample sizes, which increases performance over the methods using a higher dimensional feature space. We also examined doubly robust approaches for each of the methods, but the mean model was not accurate enough to improve performance over the simple weighted methods. The results for the doubly robust approaches can be found in the supplemental material.

Figure 1: Boxplots showing estimated population regret for setting 1. Smaller is better. IPW+IPCW: IPW combined with IPCW; IPCW+: IPW with imputed outcome vector; BP+: Balanced policy with imputed outcome vector. Variable selection used riskRFE to remove variables.

The results for setting 2 can be found in Figure 2. Again, the higher variability of the IPW with IPCW and IPW with Imputation approaches reduced performance when compared to the balanced policy approach. The doubly robust versions of the balanced policy with imputation and IPW with Imputation approaches increased performance compared to the simple weighted versions for small sample sizes. We examined the effect of feature elimination in this setting, but the method frequently eliminated important predictors, especially at the smaller sample sizes. The results using feature elimination can be found in the supplemental material.

Figure 2: Boxplots showing estimated population regret for setting 2. Smaller is better. IPW+IPCW: IPW combined with IPCW; IPCW+: IPW with imputed outcome vector; BP+: Balanced policy with imputed outcome vector. Doubly robust versions used the correctly specified AFT model for .

5 Application to HIV Study

We apply the proposed method to analyze data from the UNC CFAR HIV Clinical Cohort (UCHCC). In the UCHCC study, patients were followed from initiation of an integrase strand transfer inhibitor (INSTI) in combination with at least two other antiretroviral (ARV) agents from at least one other ARV therapy class, representing the current standard of care for initial ART in high income clinical settings. Patients were followed until the first of ART modification or discontinuation, death, loss to follow-up, or administrative censoring. The study data included 957 HIV-infected patients who were 72% male, 62% black, 29% white, and 9% Hispanic or other races/ethnicities. The median age at treatment initiation was 44 years, 50% were men who have sex with men (MSM), and 9% had a history of injection drug use (IDU). At INSTI initiation (baseline), the median CD4 cell count was 514 cells/mm³ (range 9 to 2970) and the median HIV RNA level was 1.8 log10 copies/mL (range 1.3 to 7.6). The median number of prior ARV compounds was 3 (range 0 to 17), and 29% had no prior ARV experience. The INSTI regimen was chosen by providers and patients based on clinical indication and included in all cases one INSTI and two or more nucleoside reverse-transcriptase inhibitors (NRTIs).

The covariates include age, male (yes, no), race (black, white, or other), MSM (yes, no), IDU (yes, no), baseline CD4 count, baseline viral load (VL), an indicator of whether baseline RNA is undetectable, the number of prior ARV agents, and an indicator of whether the patient is treatment naïve. Categorical variables are transformed into dummy variables, resulting in an 11-dimensional covariate vector.

The treatment regimens of interest are Raltegravir (RAL), Elvitegravir (EVG), and Dolutegravir (DTG), each with 2+ NRTIs. The primary outcome of interest is time to the discontinuation of the initial INSTI regimen, which is defined as either a change in the INSTI agent or discontinuing ART for more than two weeks. The compounds under consideration received FDA approval at different times, so the maximum follow-up time is different for each regimen. The left plot in Figure 3 shows the Kaplan-Meier curves for the UCHCC data by treatment over the unrestricted follow-up window. This plot indicates that there may be large treatment differences at the longer follow-up times, but we chose to limit our analysis to 2.5 years (913 days) of follow-up time. This time period was chosen because it is near a high percentile of observed treatment discontinuations for the regimen with the least follow-up time, which ensures sufficient data coverage to support the analysis.

Among the 957 study subjects, 416 (43%) were observed to discontinue treatment, and 319 (33%) were censored due to loss to follow-up during the first 2.5 years. The 222 (23%) patients still being followed after 2.5 years were administratively censored.

We applied balanced policy evaluation and learning to the UCHCC data using all available variables with the kernel defined in (15) and the policy class in (14). The estimated optimal treatment strategy identified by balanced policy evaluation and learning was to treat every patient with DTG and 2+ NRTIs. Each of the three INSTIs evaluated in these analyses has a different barrier to resistance evolution, tolerability profile, and dosing frequency, likely affecting the durability of the regimen differentially across different patient groups. Additional analyses, stratified by whether patients were ART naïve at INSTI initiation or by age at INSTI initiation, supported the findings of the primary analyses; the details may be found in the supplemental material.

Figure 3: Estimated marginal survival curves for each treatment. The left plot shows the entire follow-up period. The right plot shows the restricted follow-up period.

In order to see how our method performs in comparison to existing methods on realistic data when meaningful subgroup effects are present, we modified the data to equalize baseline differences and highlight heterogeneity. Specifically, a constant scaling factor was applied to the observed times for each of the treatments so that the marginal treatment effect between any two treatments was negligible. Due to the effects of differential censoring rates between treatments, the average policy effect was estimated using the regression-based estimator (1), where the mean process was estimated using RIST.

For this analysis we used a reduced set of predictors which were identified by riskRFE as being most likely to have meaningful subgroup effects. The predictors were age, treatment naïve status, baseline CD4, baseline RNA, and the undetectable baseline RNA indicator. The conditional expected survival times and the censoring probabilities used for IPCW were estimated using RIST with the tuning parameter settings suggested in Zhu and Kosorok (2012). Propensity scores were estimated using a Gaussian process classifier. The balanced policy evaluation again used the kernel defined in (15) and the policy class in (14).

We compare the balanced policy with imputation approach to several other methods: the regression-based method as defined in equation (1), with the mean outcome estimated using RIST, and two weighted methods as defined in equation (2). The weighted methods are IPW with IPCW, using the weights defined in equation (6), and IPW with imputation, using $\hat{Y}^*$ and the weights defined in equation (4).

For comparison between methods, we estimate the optimal policy with a cross-validation type analysis. Specifically, the data are partitioned into ten roughly equal-sized parts. For each method under consideration, we estimate the optimal policy on nine parts of the data, and then compute several estimates of the SAPE using the remaining tenth part. By applying the above procedure, holding out a different part of the data each time, we were able to get out-of-sample predictions which should better represent expected optimal SAPE for future subjects. This approach was applied to 100 different partitions of the data and the results are presented in Table 1 and Figure 4.
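A sketch of this cross-validation comparison is given below; `learners` and `evaluators` are hypothetical dictionaries of policy learning methods and SAPE evaluation criteria (e.g., the regression, IPCW-weighted, and balanced-policy estimators described above).

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_sape(X, A_idx, Y_imputed, learners, evaluators, seed=0):
    """Out-of-sample policy comparison: learn each policy on nine folds,
    evaluate its SAPE on the held-out fold with each candidate criterion,
    and average over folds.

    learners   : dict name -> function(X, A_idx, Y) -> policy callable.
    evaluators : dict name -> function(policy, X, A_idx, Y) -> SAPE estimate.
    """
    results = {ln: {en: [] for en in evaluators} for ln in learners}
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        train = (X[tr], A_idx[tr], Y_imputed[tr])
        test = (X[te], A_idx[te], Y_imputed[te])
        for ln, learn in learners.items():
            policy = learn(*train)                    # fit on nine folds
            for en, evaluate in evaluators.items():
                results[ln][en].append(evaluate(policy, *test))
    # report mean and standard error over folds for each method/criterion
    return {ln: {en: (np.mean(v), np.std(v) / np.sqrt(len(v)))
                 for en, v in d.items()} for ln, d in results.items()}
```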

From Table 1, we observe that the proposed approach produces policies with larger mean estimated SAPE and smaller standard errors across all three evaluation criteria. The policies found by the regression-based estimator had high estimated SAPE with respect to its corresponding criterion, but low estimated SAPE with respect to the weight-based criteria. The proposed method was able to find policies with high estimated SAPE for all criteria, indicating that the policies found were truly the best in terms of SAPE, as their improvement was insensitive to the choice of validation estimation method.

Table 1: Mean (standard error) of the estimated SAPE for each policy learning method under each of the three evaluation criteria.