1 Introduction
In healthcare, the traditional treatment (intervention/action/etc.) paradigm has historically been a onesizefitsall framework, where all patients with a certain disease are given the same treatment regardless of their baseline attributes. Yet, we know that individuals can vary drastically in their response to treatments: some people might benefit substantially from a treatment, some may only benefit marginally or not at all, and others could be harmed. This heterogeneity in treatment effects has led to a push for replacing the onesizefitsall framework with personalized medicine, where individuals are given treatments personalized to their unique characteristics, in which case everyone, in theory, will be receiving the best treatment for their individual patient characteristics.
In order for medicine to move to this personalized healthcare framework, we must first understand how different individuals will respond to the same treatment, or treatment effect heterogeneity. Varadhan et al. defined treatment effect heterogeneity (TEH) as nonrandom explainable variability in the direction and magnitude of individual treatment effects [29]. TEH is commonly evaluated using either confirmatory (hypothesis testing) or exploratory (hypothesis generating) analyses. In confirmatory analyses, the goal is to rigorously test prespecified hypotheses. By comparison, analyses that lack a prespecified hypothesis are usually deemed exploratory, where the goal is to generate new hypotheses for future testing. However, the line between this distinction is often blurred and evidence from exploratory analyses are sometimes used as conclusive evidence.
Even when confirmatory analyses are done correctly, adhering to the “best practices for subgroup analysis”, there are still problems as outlined by Lipkovich et al. [19]: Under the conventional guidelines, all subgroups must be prespecified to prevent data dredging, but this forces scientists to choose only a few hypotheses to test, and could severely limit the scope of the scientific inquiry. Another common guideline is that no testing in a subgroup should be done unless the corresponding interaction test is significant. This assumes that modeling of the main effects and interactions is being performed in a onepredictoratatime fashion and fails to consider other more efficient methods of modeling. Finally, it’s often suggested that no testing in subgroups should be performed unless the overall effect is significant, which seems contrary to the idea of personalized medicine where the goal is to find subgroups where the treatment effect is significant, even if it’s not in other subgroups. Lipkovich et al. suggests that instead of an unnecessarily restrictive guidelinedriven approach, we should be using a prespecified and principled, but datadriven approach [19].
Currently, there are several existing methods in the literature that perform datadriven subgroup identification, including treebased methods, hybrid methods, and ensemble methods. Treebased methods range from older methods such as classification and regression trees (CART) [4], to newer methods such as honest causal trees [1]
. In honest causal trees, the “causal effects” refer to the individual treatment effects obtained by taking the difference between the outcome we observe under one treatment, and the counterfactual outcome that would have been observed had the individual received the other treatment. The “honesty” condition refers to the fact that the data used to predict which covariates dictate the splits in the trees are not used to estimate the covariates in the tree. The training set is split into two parts, and one is used to construct the tree, while the other is used to estimate the treatment effects within the leaves of the tree. Although there is a loss of precision due to the sample splitting in the training set, this method allows the elimination of bias that could offset some of the loss. Hybrid methods are methods that combine various existing models to create new models with better performance. FindIt
[20]adapts support vector machine (SVM) by placing 2 separate least absolute shrinkage and selection operator (LASSO) sparsity constraints over the pretreatment covariates: one for the predictive covariates (covariates that are predictive of the treatment effect), and another for prognostic covariates (covariates that are predictive of the disease/outcome prognosis). This approach accounts for the fact that the predictive effects are inherently weaker and need to be treated differently from the prognostic effects. Essentially, 2 separate LASSO penalties are added to an SVM loss function to perform simultaneous variable selection, which allows for a simpler interpretation in the outcome. The Simultaneous Threshold Interaction Modeling Algorithm (STIMA)
[8]combines additive models with treebased regression models to both capture linear main effects of continuous predictors and higher order interaction effects. A linear model is fit at the trunk of the tree, and a pruned tree is made into the form of an additive model by using indicators for all the splits that lead to every node. This creates a hybrid linear/treebased model that is interpretable and more parsimonious than a linear model. Simulation results for STIMA were comparable to existing methods such as multivariate adaptive regression splines (MARS)
[9] and Generalized, Unbiased, Interaction Detection and Estimation (GUIDE) [22]. Finally, there are ensemble methods such as Random Forests
[5] and Superlearner [13] that combine the results from many simpler machine learning methods to improve predictive accuracy.Generally, treebased and hybrid methods like those previously mentioned give interpretable results that could be used in practice by clinicians. However, these results are often not as accurate or robust as results from blackbox ensemble methods such as Superlearner. To combine the predictive accuracy of more complicated ensemble methods and the interpretability of simpler methods, the Virtual Twins framework (VT) was developed by Foster et al.[6]. VT is a flexible framework that separates estimation of the individual treatment effects from estimation of the decision rule model. This allows the use of a more accurate, albeit more complicated, model for the estimation of the treatment effects, while preserving the interpretability of a simpler method for delineating the final subgroups. But, while the flexible specification of models in VT can give both an interpretable and accurate final model, this also burdens the user with making appropriate modeling choices. As our simulations prove, this modeling choice selection is crucial in the performance of VT.
Since its publication, VT has been highly cited and applied in a variety of settings because of its intuitive framework. In the original Foster et al. paper, the authors demonstrated the promise of VT with a logistic model and random forest model. While Foster et al. noted that many other variations could be implemented, many researchers rely heavily on these initial suggestions without examining alternatives. To date, no one has detailed a comprehensive analysis of the different modeling choices that can affect the performance of VT in variety of realworld settings, or the performance of newer predictive methods in the VT framework. This leaves many researchers hesitant to deviate from methods proposed by Foster et al., and leaves much of the potential of VT untapped. To address this gap in knowledge, we conduct a simulation study to evaluate the performance of VT with different combinations of methods in each of its component steps, under a collection of linear and nonlinear problem settings in a comprehensive fashion.
2 Methods
We first provide a brief overview of the VT algorithm, first published by Foster et al. [6]. A complete description of the method can be found in the original manuscript. We then detail the modeling choices faced by researchers when using VT.
2.1 Notation
The data for each individual consist of a continuous outcome , a binary treatment indicator where denotes treatment and denotes control, and covariates denoted by the matrix . We define and for each individual i. The goal of elucidating TEH is to find subgroups
of individuals defined by a rule that subsets the overall population based on the vector
. Virtual Twins achieves this goal by looking for subgroups in the individual treatment effects, or , of a population. Although we will focus on the situation where we have a continuous outcome and binary treatment assignment, this framework can be easily extended to other types of endpoints as well.2.2 Virtual Twins
The Virtual Twins framework is a twostep estimation approach that has its roots in the potential outcomes models of causal inference. As outlined in Figure 2.2, the process consists of first building the individual response surfaces, and then identifying heterogeneous subgroups based on the estimated response surfaces. In the first step, we build the conditional expectation functions for predicting outcomes in each treatment arm using only the covariate information for individuals in each respective arm. With this function, we obtain the estimates and for each subject . We can then calculate an estimate for the individual treatment effect as , which is the difference in expected outcomes if an individual was in the treatment arm vs the control arm. In the second step of VT, we fit a tree structure with
as the response variable and covariates
. The tree is used to partition subjects into heterogeneous groups based on their estimated treatment effects . Thus, the resulting subgroups are defined by the splits in the tree.A wide variety of models can be applied in both steps of the VT framework. In our simulations, we will consider the different methods for performing step 1 and step 2 shown in Figure 2 to evaluate the effect on VT’s performance. We evaluate all possible combinations between the four step 1 and 2 models to understand the effect of each combination on the performance of VT.
2.3 Step 1 of Virtual Twins
This section describes the methods that were compared for step 1 of VT, where we are building the individual response surfaces. We will also clarify the modeling choices that were made within each particular method. In step 1 of VT, we explore a variety of different models, including:
Lasso
LASSO [27]
is a linear regression method that performs variable selection and regularization simultaneously by penalizing the absolute size of the regression coefficients. When LASSO is used in the first step, we estimate the underlying regression functions
using LASSO regression models. We used the
cv.glmnet function from the glmnet package in R with default settings [26], except that lambda.1se was used as the cross validated tuning parameter to encourage model sparsity.Random Forest
Random Forest [5]
is an ensemble method that averages the results of many relatively uncorrelated decision trees to output a robust solution. When Random Forest is used in the first step, we estimate the underlying regression functions
using Random Forest models. The tune function from the randomForestSRC package is used in R [21], so the forest is automatically tuned for the number of variables randomly sampled as candidates at each split (mtry), and the minimum size of terminal nodes (nodesize). All other default settings were kept.Mars
MARS [10]
is a nonparametric regression method that creates a model using spline basis functions with automatic feature selection. It can be seen as a generalization of CART methods that is better at handling numeric variables. When MARS is used in the first step, we estimate the underlying regression functions
using MARS models. The earth package in R was used with all default settings [12].Superlearner
Superlearner [28] is an ensemble method that uses crossvalidation to determine the optimal combination of several candidates models to form a single predictive model.When Superlearner is used in the first step, we estimate the underlying regression functions using Superlearner models. The Superlearner package in R [25] was used with all default settings, and with the following wrappers: SL.glmnet, SL.randomForest, and SL.earth. Note that the SL.earth wrapper was slightly modified to match the default settings of the earth package. These specific wrappers were used so that the Superlearner models would encompass all the other candidate models for step 1 of VT.
2.4 Step 2 of Virtual Twins
This section describes the methods that were compared for step 2 of VT, where we are identifying subgroups for TEH. We will also clarify the modeling choices that were made within each particular method. In step 2 of VT, we evaluated several cases:
None
No step 2 method was used. The estimated individual treatment effects were used without any subgroup assignments for the individual treatment benefit classification accuracy, and for the individual treatment benefit MSE. When prediction accuracy is of paramount importance, not using a second step would be expected to yield the best results. This also allows us to quantify the cost in accuracy for other models that favor interpretability.
Linear Model
A linear model was used. Only the top predictors from a crossvalidated LASSO are used in the linear model as main effects, and the number of top predictors is set to be the same as the number of predictors selected by the regression tree. Although linear models do not allow for a natural way to delineate multiple subgroups, they are one of the most commonly used and interpretable models available. This method also serves as a benchmark for the evaluation of the tree based step 2 methods. Only the top predictors from a crossvalidated LASSO (again using the glmnet package [26]) are used in the linear model, and the number of top predictors is set to be the same as the number of predictors selected by the regression tree.
Regression Tree
A regression tree was used. Regression trees are a natural choice for step 2 of VT since it allows us to easily create interpretable subgroups from the individual treatment effects. This is the primary method mentioned in the original VT paper for step 2 of VT. The caret package in R was used [11] with default settings for the rpart2 model. The maximum depth of the tree was tuned using a 10 fold crossvalidation, repeated 3 times.
Conditional Inference Tree
A conditional inference tree [18] was used. Conditional inference trees use a significant test procedure for variable selection to avoid selecting variables that have many possible splits or many missing values. Conditional inference trees are a variation on regression trees that should have better performance for variable selection. The caret package in R was used [11] with default settings for the ctree2 model. The maximum depth of the tree was tuned using a 10 fold crossvalidation, repeated 3 times.
2.4.1 Parameter Tuning
To control the step 2 model’s behavior in the absence of TEH we propose a permutationbased method to identity tuning parameters for the step 2 models so that a covariate is included in the model with probability
. This approach supports LASSO, regression trees, and conditional inference trees.To do so, we permute the treatment indicators () in our data and refit the selected step 1 model on this permuted data. We then identify the smallest possible penalty parameter for the stage 2 model (lambda for glmnet from glmnet package, cp for rpart from the rpart package, and mincriterion with testtype = "Teststatistic" for ctree from the party package) such that the model contains no covariates (e.g. an interceptonly LASSO). We repeat this process times and take the quantile of all returned penalty parameters to use in the step 2 model on the real data.
3 Simulations
We completed a thorough simulation study to evaluate the performance of the Virtual Twins framework under different data generation schemes and modeling choices.
3.1 Settings
We evaluated the performance of VT under both linear and nonlinear data generation scenarios. For both cases, 3 different data generation models were considered: 1. Regular case  where the covariates are linearly/nonlinearly related to the outcome. 2. Correlated covariates case  where some of the covariates are correlated to each other, but only one is related to the outcome. 3. Selection bias case  where the covariate distribution of the training sample and testing samples are different. For each data generation model we considered a case where 1. there is TEH and 2. there is no TEH. These problem settings are designed to mimic an ideal dataset, and real data problems that can frequently occur in datasets, such as correlated features in high dimensions, and selection bias. We are assuming a continuous outcome measure, with a binary treatment assignment and both continuous and categorical covariates. Specifics on the data generation can be found in Appendix A. In addition to the data generation scheme, the sample size was also varied at 3 different levels: N=1000, N=600 and N=200. Overall, there were a total of 9 different problem settings for each of the linear and nonlinear cases.
3.2 Summaries of Performance
To measure the performance of the different VT method combinations under the various problem settings, we used 3 metrics.
Classification Accuracy
For the individual treatment benefit classification, the estimated optimal treatment group for each subject at the end of the second step of VT is compared to their true optimal treatment group. For each subject , is taken to be the estimated optimal treatment, and is taken to be the true optimal treatment. The correct classification rate for each simulation is defined as: . The correct classification rate averaged over 1000 simulations are reported below.
Individual treatment effect MSE
In addition to the optimal treatment group classification, we are also interested in how accurate the individual treatment effect estimates, , are compared to the true individual treatment effects, . The individual treatment effect MSE for each simulation is defined as: and the MSEs reported in the tables below are averaged over 1000 simulations.
Proportion of variables correctly selected by each model
The final metric we are interested in is if the step 2 models are able to isolate the variables responsible for the TEH. To calculate the proportion of variables correctly selected by each model, the covariates selected into the crossvalidated tree models and linear models of step 2 are compared to the true predictive covariates. Since the step 2 method of “None” does not use any covariates, this was omitted. The pooled results for the variables selected from 1000 simulations are reported.
3.3 Results
We divide the simulation results into separate sections for the linear and nonlinear data generation cases. We present the simulation results for the linear case in Section 3.3.1 and the results for the nonlinear case in Section 3.3.2. Additional linear and nonlinear simulation results with a sample size of N=80 can be found in Appendix B.
3.3.1 Linear Case Results
In the linear case, the performance of VT is highly influenced by the method used in the first step of VT across all 3 metrics. The individual treatment benefit accuracy rates for all the problem settings and VT method combinations are displayed in Table 2(a). With a sample size of 1000 or 600, using the LASSO or Superlearner for step 1 of VT gave the most accurate classification of 80% when no step 2 was used. When a linear model is used for step 2 in conjunction with LASSO or Superlearner as the step 1 method, the classification accuracy drops to 75%. When a tree based method is used for step 2 in conjunction with LASSO or Superlearner as the step 1 method, the accuracy drops again to 70%. Using Random Forest or MARS for step 1 gave classification accuracy rates of 60%. With a sample size of 200, using the LASSO or Superlearner for step 1 and using no step 2 model yield a classification accuracy of 60%. Changing the step 1 or step 2 method in this case to any other method leads to a classification accuracy of 55%. The different problem settings did not have a large impact on the classification accuracy.
The individual treatment MSEs are displayed in Table 2(b). The pattern is consistent with the results for classification accuracy, where using LASSO or Superlearner for Step 1 had the lowest MSEs. Using no step 2 method was again best, with a small penalty for using a linear model, and a bigger penalty for using a treebased model.
The proportion of variables correctly selected by the models are displayed in Table 2(c). Here, we see a drastic difference between the step 1 methods of LASSO and Superlearner versus Random Forest or MARS: With a sample size of 1000 or 600, the variables picked out by both the LASSO and Superlearner are 9099% true predictors for all the problem settings, as opposed to the 2040% when Random Forest or MARS are used. Using conditional inference trees for step 2 gave marginally better results than the linear model and regression tree in the 1000 sample size case. However with a sample size of 600, the conditional inference trees and linear models outperform regression trees. With a sample size of 200, using the Superlearner with a linear model in step 2 yielded the best results, but all the method variations gave results of 1533%.
3.3.2 Nonlinear Case Results
In the nonlinear case, we can again see that the performance of VT is highly influenced by the method used in step 1 of VT. The individual treatment benefit accuracy rates for all scenarios and VT method combinations are displayed in Table 2(d). With a sample size of 1000 and 600, using the Random Forest or Superlearner for step 1 of VT gave the most accurate classification of 85% for every step 2 method except for the linear model, where the classification accuracy drops to 70%. All other method combinations performed similarly at 6575% classification accuracy. At a sample size of 200, using Random Forest for step 1 of VT still gave results of 80% accuracy for every step 2 method except for the linear model, where the classification accuracy drops to 70%. Superlearner performed well, with an accuracy of 70%. Using LASSO or MARS for step 1 gave similar performances of 60% classification accuracy.
The individual treatment MSEs are displayed in Table 2(e). Here again, the patterns of the results are consistent with the results from the classification accuracy, where using Random Forest or Superlearner for step 1 of VT gave the best results throughout the different sample sizes. The choice in step 2 method was also important here since the linear model performed drastically worse than the treebased methods.
The proportion of variables correctly selected by the models are displayed in Table 2(f). For sample sizes of 1000 and 600, when LASSO, Random Forest or Superlearner are used as the step 1 method, 95100% of variables selected were true predictors, with the exception of the correlated covariates scenario using conditional inference trees as the step 2 method. Here, the proportion of variables correctly selected is only 70%. When MARS is used as the step 1 method, only 5075% of the variables picked were true predictors for all three step 2 methods. At a sample size of 200, using Random Forest as the step 1 method and a treebased method for step 2 had the best results at a true predictor detection rate of 95%. When a linear model was used as the step 2 method after Random Forest, 80% of selected variables were true predictors. In the case of LASSO as the step 1 method, 7585% of variables selected were correct. Superlearner only performed well as a step 1 method when paired with conditional inference trees, giving a correct selection rate of 85%. For all other method combinations, only 4565% of selected variables were true predictors.
4 Data Application
In clinical trials, smokers randomized to receive very low nicotine content (VLNC) cigarettes versus normal nicotine content cigarettes (control) exhibit, on average, significantly reduced cigarette use, dependence, and biomarkers of harmful constituent exposure compared to the control [15, 2, 3, 14]. In our data application, we use VT to evaluate the existence of TEH in a randomized trial of VLNC cigarettes.
Variable  Median[IQR] or Count(%)  
Control  VLNC  
n  198  340 
Age  47 [35, 55]  49 [36, 57] 
Gender = Male  108 (54.5)  182 (53.5) 
Baseline Weight (kg)  81 [68, 92]  85 [71, 100] 
Race  
White  121 (61.1)  215 (63.2) 
Black  61 (30.8)  103 (30.3) 
Other  16 (8.1)  22 (6.5) 
Education  
HS or less  18 (9.1)  31 (9.1) 
HS Grad  65 (32.8)  105 (30.9) 
College or more  115 (58.1)  204 (60.0) 
Use of Menthol Cigarettes = Nonmenthol  113 (57.1)  182 (53.5) 
Baseline Total Cigarettes/Day  16 [10, 21]  16 [11, 22] 
Baseline Expired Carbon Monoxide (ppm)  18 [12, 25]  17 [13, 24] 
Baseline Total Nicotine Equivalents (nmol/mL)  64 [40, 96]  64 [40, 93] 
Biomarker for NNK Exposure (pmol/mg)  2 [1, 2]  1 [1, 2] 
Biomarker for Acrylonitrile Exposure (nmol/mg)  1 [0, 1]  1 [0, 1] 
Phenanthrene tetraol (pmol/mg)  2 [1, 4]  2 [1, 4] 
PGEM (pmol/mL)  43 [28, 84]  50 [33, 75] 
8isoPGF2a (pmol/mL)  1 [1, 2]  1 [1, 2] 
FTND Score without CPD  4 [3, 5]  4 [3, 6] 
CESD Score  5 [2, 9]  5 [2, 8] 
SMAST Score  3 [2, 4]  3 [1, 4] 
DAST Score  1 [0, 2]  2 [0, 2] 
Perceived Stress Scale  4 [2, 7]  4 [2, 6] 
WISDM Total Score  39 [30, 49]  38 [30, 47] 
WISDM Primary Dependence Motives  4 [3, 5]  4 [3, 5] 
WISDM Automaticity  4 [2, 5]  4 [2, 5] 
WISDM Loss of Control  4 [2, 5]  4 [2, 5] 
WISDM Craving  4 [3, 6]  4 [3, 6] 
WISDM Tolerance  5 [4, 6]  5 [4, 6] 
WISDM Secondary Dependence Motives  3 [2, 4]  3 [2, 4] 
WISDM Affiliative Attachment  2 [1, 3]  2 [1, 3] 
WISDM Cognitive Enhancement  3 [1, 4]  2 [1, 4] 
WISDM Cue Exposure  4 [3, 5]  4 [3, 5] 
WISDM Social Goads  4 [2, 6]  3 [2, 5] 
WISDM Taste  5 [3, 6]  5 [3, 6] 
WISDM Tolerance  5 [4, 6]  5 [4, 6] 
WISDM Weight Control  2 [1, 3]  1 [1, 3] 
WISDM Affective Enhancement  3 [2, 5]  3 [2, 4] 
Baseline QSUBrief, Factor 1 (QSU f1)  16 [10, 25]  16 [10, 24] 
Baseline QSUBrief, Factor 2 (QSU f2)  7 [5, 13]  7 [5, 12] 
Baseline QSUBrief, Total Score (Total QSU)  24 [15, 37]  24 [16, 35] 
MNWS Score  5 [3, 10]  5 [2, 9] 
PANAS Score (Positive)  32 [25, 37]  31 [26, 37] 
PANAS Score (Negative)  14 [11, 19]  13 [11, 17] 
CES Satisfaction  5 [4, 6]  5 [4, 6] 
CES Psychological Reward  3 [2, 4]  3 [2, 4] 
CES Aversion  1 [1, 1]  1 [1, 2] 
CES Enjoyment of Sensation  3 [2, 5]  4 [2, 4] 
CES Craving Reduction  5 [4, 6]  5 [3, 6] 
4.1 Context for Data Application
Smoking remains the leading cause of preventable death in the US [30, 23, 24]. Through the Family Smoking Prevention and Tobacco Control Act, the Food and Drug Administration (FDA) has the authority to mandate a reduction in the nicotine content of cigarettes to nonaddictive levels if it would benefit public health. Nicotine reduction is a promising method to facilitate quit attempts and consequently reduce cigarettecaused disease and death, a potential regulation that is currently being proposed and investigated. [17]. Clinical trials have exhibited, on average, a significant difference between cigarettes containing 15.5mg/g tobacco (which is similar to what is observed in most cigarettes on the market) compared to 0.4mg/g tobacco in reducing smoking behavior among smokers. However, to fully understand the full public health impact of a nicotine reduction policy, it is important to investigate whether subgroups of participants within the trial may have differential treatment effects.
We considered data from a randomized, doubleblind trial (n=1250) where adult smokers were assigned to 1 of 3 treatment conditions: 1. Immediate nicotine reduction, where participants were assigned to use cigarettes with a nicotine content of 0.4 mg/g tobacco. 2. Gradual nicotine reduction, where participants smoked progressively lower nicotine content cigarettes for a period of one month each until receiving the same 0.4 mg/g tobacco VLNC cigarettes as the immediate reduction group for the last month of the study. 3. Usual nicotine control, where participants were given their assigned cigarettes with a nicotine content of 15.5 mg/g tobacco, which is similar to what is observed in most cigarettes on the market. Participants were given study cigarettes for a period of 20 weeks and outcome measures assessed included biomarkers of smoke exposure, and participants’ measures of smoking behavior were recorded. The dataset contains 46 demographic and baseline covariates such as age, gender, race, education, cigarettes per day, total nicotine equivalents (TNE, a common measure of internal nicotine dose which includes nicotine and its metabolites), carbon monoxide levels, etc. Only the immediate nicotine reduction and the usual nicotine control groups were used in this analysis since the effect of VLNC cigarettes was much more pronounced in the immediate group than the gradual reduction group when compared to the control [16], and the average treatment effect in this case was 7 CPD. Table 4 provides summaries of baseline variables by treatment group; the two groups were well balanced for most variables, as would be expected after randomization. The outcome of interest in our analysis is the total cigarettes (study and nonstudy cigarettes) per day (CPD) smoked at week 20.
The goal of this analysis is to use the VT framework to predict individual treatment effects based on baseline covariates, and then segment the population into subgroups based on potential moderators of the effect of VLNC cigarettes. The goal is not to identify subgroups of the population where the intervention of VLNC would cause participants to smoke more or less CPD than they would have if they were smoking control cigarettes by the end of the 20 week study; we are instead interested in describing the heterogeneity of the treatment effect in a parsimonious way.
4.2 Results
LASSO  RF  MARS  Superlearner  
Regression Tree 





Conditional Inference Tree 




We applied the Step 1 VT methods discussed in Section 1 (LASSO, Random Forest, MARS and Superlearner) with the regression and conditional inference trees for Step 2 to evaluate TEH for this trial. The predictors that delineated the different subgroups from each of the VT analyses can be found in Table 5. The predictors selected from each Step 1 method are similar regardless of if regression trees or conditional inference trees were used for Step 2. Age was found to be an important variable in every method combination, as well as baseline CPD. Only the regression tree with individual treatment estimates from MARS identified more than four covariates to delineate the subgroups. An example regression tree with individual treatment estimates from Superlearner is shown in Figure 3. Additional regression and conditional inference trees can be found in Appendix C. We can see that the first split occurs at age 47, with younger participants having a larger reduction in the estimated number of CPD. Further splits on the baseline CPD, baseline TNE, baseline QSU f1 and age divided the population into 8 subgroups, where QSU f1 is Factor 1 of the QSUBrief, which is a measure of smoking craving. Factor 1 summarizes the craving to smoke with smoking perceived as rewarding, as compared to Factor 2, which summarizes the craving to smoke due to an anticipation of relief from negative affect [7].
All the different method combinations gave from 4 to 8 subgroups. In all the trees constructed from VT using the various prediction methods, all of the subgroups had a treatment effect from a reduction of 1.7 CPD to a reduction of 13 CPD. There were no subgroups where participants were estimated to smoke more CPD than at baseline if VLNC cigarettes were used instead of the control cigarettes. Overall, participants under the age of 47 and smoking heavily at baseline have a more drastic decrease in the number of cigarettes smoked per day as compared to older participants who are smoking less at baseline.
5 Discussion
In both the linear and nonlinear cases, the performance of VT is heavily dependent on the modeling choice of step 1. In the linear case, VT implemented with the LASSO or Superlearner with the LASSO as a candidate model for step 1 had the best performance for all three metrics we considered. From the simulation results we obtained, using a linear model for step 2 of VT only outperformed the treebased models slightly for the individual treatment benefit classification. In the nonlinear case, VT with Random Forest or Superlearner in step 1 yielded the best performance for all 3 metrics when paired with a regression tree for step 2. From these simulations, regression trees seems to have the most consistent performance in both the linear and nonlinear cases. Interestingly, when LASSO was used for step 1, almost all the variables selected were truly predictive of the outcome for sample sizes of 1000 and 600.
The sample sizes also had an impact on the performance of VT. From our simulations, it can be seen that samples of 1000 and 600 gave largely accurate and consistent results if the “correct” step 1 method, or a Superlearner with the “correct” model included was used. However, in the 200 sample cases, the performance of all 3 metrics was inconsistent. In the linear case, using LASSO is still comparable to using a Superlearner model with LASSO as a candidate model for all 3 metrics we evaluated. However, in the nonlinear case, the Random Forest outperformed the Superlearner with Random Forest as a candidate model for all 3 metrics. For both the linear and nonlinear cases, the different problem settings of “Regular”, “Correlated” and “Selection Bias” gave similar results in most cases of our simulations. This indicates that VT is fairly robust to these types of data problems, and thorough preprocessing of the data is not necessary.
However, our simulation study does have some limitations. The nonlinear data was generated using a tree structure. This biases the results towards the treebased methods in the simulation. Also, the correlated covariates and selection bias data scenarios we emulated in this paper do not usually exist in isolation, and there will usually be a combination of complicating factors. Our simulations break down the individual impact of each scenario, but a combination of these scenarios could cause VT to perform differently.
The conclusions drawn from the simulation study are reflected in our analysis of the clinical trial dataset. Here again, the choice of method to use in Step 1 is more influential than the choice of Step 2; the regression tree and causal tree from the same Step 1 method are similar if not identical in most cases. Overall, the results of the VT seem generally invariant to the choices of Step 1 and 2 methods for the dataset. Though the tree constructed using the different method combinations are not identical, they all show a very similar story: while most participants are being placed in subgroups with an estimated treatment effect size of a decrease in CPD of 7 compared to control, there are smaller subgroups of older participants with elevated baseline CPD or TNE that saw an even greater decrease in CPD of 12. For younger participants who have a lower baseline CPD or TNE, most subgroups have an estimated decrease in the number of CPD of 4. These results are consistent with our expectations; younger smokers have a shorter smoking duration and may be less dependent and more likely to change their behavior, while TEH due to the baseline rate of smoking is to be expected due to a ceiling on the treatment effect for light smokers at baseline.
In the future, it would be interesting to evaluate the nonlinear results of the simulation using different data generation schemes that were not tree based. In addition, since this study only examined the case where there is a binary treatment, it is unclear how these results would generalize to additional treatment groups.
In summary, the choice of method for Step 1 has substantial impact on VT performance and, ideally, the method used in Step 1 could be matched to the anticipated features of the data. If no prior information is known, then a blackbox model such as Superlearner should be considered since it performs just as well as the LASSO in the case where the data is linearly related to the outcome, and it performs adequately in nonlinear cases as well. Also, in cases where the sample size is less than 1000, special care should be taken with the interpretation of results.
Data Availability Statement
References
 [1] (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), pp. 7353–7360. External Links: ISSN 00278424 Cited by: §1.
 [2] (2012) Smoking behavior and exposure to tobacco toxicants during 6 months of smoking progressively reduced nicotine content cigarettes. Cancer Epidemiology and Prevention Biomarkers 21 (5), pp. 761–769. Cited by: §4.
 [3] (2015) Effect of reducing the nicotine content of cigarettes on cigarette smoking behavior and tobacco smoke toxicant exposure: 2year follow up. Addiction 110 (10), pp. 1667–1675. Cited by: §4.
 [4] (1984) Algorithm cart. Classification and Regression Trees. California Wadsworth International Group, Belmont, California. Cited by: §1.
 [5] (1999) Random forests. UC Berkeley TR567. Cited by: §1, §2.3.
 [6] (2011) Subgroup identification from randomized clinical trial data. Statistics in Medicine 30 (24), pp. 2867–2880. External Links: Document, Link Cited by: Document, §1, §2.
 [7] (200102) Evaluation of the brief questionnaire of smoking urges (QSUbrief) in laboratory and clinical settings. Nicotine & Tobacco Research 3 (1), pp. 7–16. External Links: ISSN 14622203, Document, Link Cited by: §4.2.
 [8] (2010) Combining an additive and treebased regression model simultaneously: stima. Journal of Computational and Graphical Statistics 19 (3), pp. 514–530. External Links: ISSN 10618600 Cited by: §1.
 [9] (1991) Multivariate adaptive regression splines. The annals of statistics 19 (1), pp. 1–67. External Links: ISSN 00905364 Cited by: §1.
 [10] (1991) Multivariate adaptive regression splines. The annals of statistics, pp. 1–67. Cited by: §2.3.
 [11] (2019) Caret: classification and regression training. Note: R package version 6.084 External Links: Link Cited by: §2.4, §2.4.
 [12] (2019) Earth: multivariate adaptive regression splines. Note: R package version 5.1.2 External Links: Link Cited by: §2.3.
 [13] (2017) Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods. Political Analysis 25 (4), pp. 413–434. External Links: ISSN 10471987 Cited by: §1.
 [14] (2013) Reduced nicotine content cigarettes and nicotine patch. Cancer Epidemiology and Prevention Biomarkers 22 (6), pp. 1015–1024. Cited by: §4.
 [15] (2010) Reduced nicotine content cigarettes: effects on toxicant exposure, dependence and cessation. Addiction 105 (2), pp. 343–355. Cited by: §4.
 [16] (2018) Effect of immediate vs gradual reduction in nicotine content of cigarettes on biomarkers of smoke exposure: a randomized clinical trial. Jama 320 (9), pp. 880–891. Cited by: §4.1.
 [17] (2010) Nicotine reduction revisited: science and future directions. Tobacco control 19 (5), pp. e1–e1. Cited by: §4.1.
 [18] (2006) Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical statistics 15 (3), pp. 651–674. Cited by: §2.4.
 [19] (2017) Tutorial in biostatistics: datadriven subgroup identification and analysis in clinical trials. Statistics in Medicine 36 (1), pp. 136–196. External Links: Document, Link Cited by: §1.
 [20] (2013) Estimating treatment effect heterogeneity in randomized program evaluation. Ann. Appl. Stat. 7 (1), pp. 443–470. External Links: ISSN 19326157, Document, Link Cited by: §1.
 [21] (2008) Random survival forests. Ann. Appl. Statist. 2 (3), pp. 841–860. External Links: Link Cited by: §2.3.
 [22] (2002) Regression tress with unbiased variable selection and interaction detection. Statistica Sinica, pp. 361–386. External Links: ISSN 10170405 Cited by: §1.
 [23] (2011) WHO report on the global tobacco epidemic, 2011: warning about the dangers of tobacco. Geneva: World Health Organization. Cited by: §4.1.
 [24] (2017) WHO report on the global tobacco epidemic, 2017: monitoring tobacco use and prevention policies. World Health Organization. Cited by: §4.1.
 [25] (2019) SuperLearner: super learner prediction. Note: R package version 2.025 External Links: Link Cited by: §2.3.
 [26] (2011) Regularization paths for cox’s proportional hazards model via coordinate descent. Journal of Statistical Software 39 (5), pp. 1–13. External Links: Link Cited by: §2.3, §2.4.
 [27] (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288. Cited by: §2.3.
 [28] (2007) Super learner. Statistical applications in genetics and molecular biology 6 (1). Cited by: §2.3.
 [29] (2013) A framework for the analysis of heterogeneity of treatment effect in patientcentered outcomes research. Journal of clinical epidemiology 66 (8), pp. 818–825. External Links: ISSN 08954356 Cited by: §1.
 [30] (2014) The 2014 surgeon general’s report:“the health consequences of smoking–50 years of progress”: a paradigm shift in cancer care. Cancer 120 (13), pp. 1914–1916. Cited by: §4.1.
Appendices
Appendix A Data Generation
REGULAR  CORRELATED  SELECTION BIAS  
CONTINUOUS COVARIATES 
For and  For and  Same as Regular Case  
BINARY COVARIATES 

Same as Regular Case  Same as Regular Case  
TREATMENT ASSIGNMENT 
Same as Regular Case  Same as Regular Case  
LINEAR OUTCOME VARIABLE 
This results in a of 0.63 for and 0.7 for .  Same as Regular Case, and this results in a of 0.65 for and 0.71 for .  Same as Regular Case  
REGULAR  CORRELATED  SELECTION BIAS  
LINEAR OUTCOME VARIABLE NO TEH 
This results in a of 0.67 for and 0.7 for .  Same as Regular Case, and this results in a of 0.63 for and 0.7 for .  Same as Regular Case  
NONLINEAR OUTCOME VARIABLE 
This results in a of 0.76 for both and .  Same as Regular Case  Same as Regular Case  
NONLINEAR OUTCOME VARIABLE NO TEH 
This results in a of ??? for both and .  Same as Regular Case  Same as Regular Case  
TRAINING SET 
Random Sampling  Random Sampling 


TESTING SET 
Random Sampling  Random Sampling  Random Sampling 
Comments
There are no comments yet.