ROC-Guided Survival Trees and Forests

09/15/2018
by Yifei Sun, et al.

Tree-based methods are popular nonparametric tools in studying time-to-event outcomes. In this article, we introduce a novel framework for survival trees and forests, where the trees partition the dynamic survivor population and can handle time-dependent covariates. Using the idea of randomized tests, we develop generalized time-dependent ROC curves to evaluate the performance of survival trees and establish the optimality of the target hazard function with respect to the ROC curve. The tree-growing algorithm is guided by decision-theoretic criteria based on ROC, specifically targeting prediction accuracy. While existing survival trees with time-dependent covariates have practical limitations due to ambiguous prediction, the proposed method provides a consistent prediction of the failure risk. We further extend the survival trees to random forests, where the ensemble is based on martingale estimating equations, in contrast with many existing survival forest algorithms that average the predicted survival or cumulative hazard functions. Simulation studies demonstrate strong performances of the proposed methods. We apply the methods to a study on AIDS for illustration.


1 Introduction

Tree-based methods are popular alternatives to semiparametric and parametric methods. The basic idea of trees is to partition the covariate space into subsets (nodes) where individuals belonging to the same node are alike regarding the outcome of interest. In the setting of classification and regression trees (CART) (Breiman et al., 1984), this is achieved by a top-down, greedy approach that minimizes a measure of node impurity in classification and the sum of squared deviations from the node mean in regression; a cost-complexity pruning algorithm is then applied to select a right-sized tree. Moreover, trees are ideal candidates for ensemble methods such as bagging (Breiman, 1996) and random forests (Breiman, 2001).

With the increasing focus on personalized risk prediction, tree-based methods for time-to-event data have received much attention. Survival trees have a few appealing features that make them a useful addition to conventional survival analysis. First, a partition arises in many applications where individuals in the same group share similar event risks, and the tree structure retains this natural interpretability. For example, a single survival tree can identify different prognostic groups, so that treatment or prevention strategies can be tailored for patients with different failure risks. Second, survival trees serve as building blocks for ensemble methods and can be easily transformed into powerful risk prediction tools. Gordon and Olshen (1985) first adapted the idea of CART to right-censored survival data and defined the node impurity as the minimum Wasserstein distance between the Kaplan-Meier curves of the current node and a pure node. Other works that provided a within-node impurity or an estimate of error include Davis and Anderson (1989), LeBlanc and Crowley (1992), Zhang (1995), Jin et al. (2004), Molinaro et al. (2004) and Steingrimsson et al. (2016). As an alternative to maximizing within-node homogeneity, Segal (1988) and LeBlanc and Crowley (1993), among others, adopted between-node separation approaches that quantify the dissimilarity between two children nodes using the log-rank statistic; other possible splitting criteria include Harrell's C-statistic (Schmid et al., 2016) and the integrated absolute difference between survival functions (Moradian et al., 2017). Readers are referred to Bou-Hamad et al. (2011b) for a comprehensive review of survival trees. In practice, a small perturbation in the training data may result in a large change in a tree-based prediction function. To address the instability of a single tree and improve prediction accuracy, researchers, including Hothorn et al. (2006), Ishwaran et al. (2008), Zhu and Kosorok (2012), Schmid et al. (2016) and Steingrimsson et al. (2018), have developed ensemble methods that enlarge the class of models and yield reduced variance.

In many applications, the use of time-dependent covariates offers opportunities for exploring the association between a failure event and risk factors that change over time. Applying standard methods such as the Cox model is technically challenging in the choice of covariate form and can often yield biased estimation (Fisher and Lin, 1999). In contrast, tree-based methods allow the event risk to depend on the covariates in a flexible way and thus can circumvent the challenge of specifying the functional form of the time-dependence. However, the aforementioned survival tree methods only deal with baseline covariates and cannot be directly applied. Bacchetti and Segal (1995) incorporated time-dependent covariates by using "pseudo-subjects", where the survival experience of one subject is viewed as the survival experiences of multiple pseudo-subjects on non-overlapping intervals, and the survival probability from the truncation product-limit estimator (Wang et al., 1986) is reported as the node summary. The idea of pseudo-subjects was employed by most of the existing works dealing with time-dependent covariates (Huang et al., 1998; Bou-Hamad et al., 2011a; Wallace, 2014; Fu and Simonoff, 2017). However, the pseudo-subject approach may have practical limitations because one subject can be classified into multiple nodes in a tree, leading to a loss of simple interpretation and possibly ambiguous prediction.

In this article, we propose a unified framework for tree-structured analysis with right-censored survival outcomes and possibly time-dependent covariates. Existing survival trees are typically implemented via a greedy algorithm that maximizes within-node homogeneity or between-node heterogeneity, which is not directly related to prediction accuracy. We define time-dependent ROC curves for survival trees using the idea of randomized tests and show that the target hazard function yields the highest ROC curve. The optimality of the target hazard function motivates us to use ROC-related summary measures for (i) evaluating the prediction performance of survival trees and (ii) guiding the tree-growing procedure. We further propose a novel risk prediction forest, where the ensemble is based on unbiased martingale estimating equations. Our forest algorithm shows great potential in reducing the bias compared to algorithms that directly average node summaries such as Kaplan-Meier and Nelson-Aalen estimates.

The article is organized as follows. In Section 2, we rigorously define the ROC curves to quantify the prediction performance of survival tree models. In Section 3, we develop ROC-guided splitting and pruning procedures for a right-censored survival outcome with time-dependent covariates. In Section 4, we extend the proposed survival trees to random forests to improve prediction accuracy. In Section 5, simulation studies are conducted to examine the performance of the proposed methods. In Section 6, the methods are applied to an AIDS study for illustration. We conclude the paper with a discussion in Section 7.

2 Evaluating Survival Trees with ROC Curves

In this section, we introduce a survival tree framework that can incorporate time-dependent covariates, and propose the use of ROC curves to evaluate prediction performance. The conventional definition of the incident/dynamic time-dependent ROC curve (Heagerty and Zheng, 2005) is appropriate for quantifying the prediction performance of a continuous disease marker, but may not be applicable to discrete-valued risk scores derived from a tree. For a discrete-valued marker, the ROC curve at time $t$ degenerates to a finite number of points, hence important summary measures such as the area under the curve (AUC) are not well defined. We fill this gap by generalizing the definition of time-dependent ROC curves to evaluate continuous, discrete and mixed types of markers.

2.1 Survival trees that partition the survivor population

Suppose $T$ is a continuous survival time outcome and $Z(t)$ is a $p$-dimensional vector of possibly time-dependent covariates. Denote by $\lambda(t \mid Z(t))$ the hazard function of $T$ given $Z(t)$, that is, $\lambda(t \mid Z(t)) = \lim_{\Delta t \to 0^+} \Pr\{t \le T < t + \Delta t \mid T \ge t, Z(t)\}/\Delta t$. The hazard function characterizes the instantaneous risk of failure at time $t$ among survivors.

At time $t$, let $\mathcal{Z}$ denote the covariate space of $Z(t)$ in the survivor population (i.e., the subpopulation satisfying $T \ge t$). We consider a tree that divides the survivor population into subgroups according to a partition on $\mathcal{Z}$, denoted by $\mathscr{T}$. The elements of the partition, $\tau_1, \ldots, \tau_M$, are disjoint subsets of $\mathcal{Z}$ satisfying $\cup_{m=1}^{M} \tau_m = \mathcal{Z}$, and are called terminal nodes of the tree. A subject enters terminal node $\tau_m$ at time $t$ if $T \ge t$ and $Z(t) \in \tau_m$. For ease of discussion, we restrict attention to a fixed time interval $[0, \tau]$ and assume $Z(t) \in \mathcal{Z}$ for $t \in [0, \tau]$. Thus the same partition can be applied at all $t \in [0, \tau]$. Since $Z(t)$ can vary over time, one subject can enter different terminal nodes at different time points. The partition induces the following model for the hazard function at $t$ given $Z(t)$,

$\lambda(t \mid Z(t)) = \sum_{m=1}^{M} \lambda_m(t)\, I\{Z(t) \in \tau_m\},$   (1)

where subjects in the same terminal node have the same hazard, and $\lambda_m(t)$ is the node-specific hazard function. Define the partition function $l: \mathcal{Z} \to \{1, \ldots, M\}$ so that $l(z) = m$ if and only if $z \in \tau_m$. The partition-based hazard function at $t$ can then be written as $\lambda(t \mid Z(t)) = \lambda_{l(Z(t))}(t)$. Under our framework, the time-dependent covariates are handled similarly as in the Cox model, in the sense that the hazard at time $t$ depends on the covariates at $t$. On one hand, a time-invariant partition on $\mathcal{Z}$ allows a sparse model and easy interpretation of the decision rule; on the other hand, as the number of terminal nodes in $\mathscr{T}$ increases, the partition-based hazard function approximates the true hazard function $\lambda(t \mid Z(t))$. In Section 3.3, we show that the time-invariant rule is asymptotically valid. Compared to the Cox model, which assumes multiplicative covariate effects and requires a correct choice of the functional form of the time-dependent covariates (Fisher and Lin, 1999), the partition-based hazard function can be more robust to model mis-specification.
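To fix ideas, the following minimal R sketch (our own illustration; the names and the axis-aligned box representation are assumptions, not the authors' code) encodes a partition with three terminal nodes and evaluates the partition-based hazard by routing the current covariate value to its node.

    # Partition on [0,1]^2 with three terminal nodes, encoded as boxes.
    nodes <- list(
      list(id = 1L, lower = c(0.0, 0.0), upper = c(0.5, 1.0)),  # z1 <= 0.5
      list(id = 2L, lower = c(0.5, 0.0), upper = c(1.0, 0.4)),  # z1 > 0.5, z2 <= 0.4
      list(id = 3L, lower = c(0.5, 0.4), upper = c(1.0, 1.0))   # z1 > 0.5, z2 > 0.4
    )

    # Partition function l(z): returns the terminal node containing z.
    partition_fn <- function(z, nodes) {
      for (nd in nodes) {
        if (all(z > nd$lower) && all(z <= nd$upper)) return(nd$id)
      }
      NA_integer_
    }

    # Partition-based hazard: subjects in the same node share lambda_m(t).
    hazard_fn <- function(t, z, nodes, lambda_list) {
      m <- partition_fn(z, nodes)
      lambda_list[[m]](t)
    }

    lambda_list <- list(function(t) 0.2, function(t) 0.5 * t, function(t) 1.0)
    hazard_fn(0.3, c(0.7, 0.2), nodes, lambda_list)  # node 2: 0.5 * 0.3 = 0.15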

For a fixed partition $\mathscr{T}$, we consider the estimation of $\lambda_m(t)$ with right-censored survival data. Let $C$ denote the censoring time, $X = \min(T, C)$ the observed survival time, and $\Delta = I(T \le C)$ the failure event indicator. We use $\bar{Z}(t) = \{Z(u): 0 \le u \le t\}$ to denote the covariate history up to $t$. The observed training data are $\{X_i, \Delta_i, \bar{Z}_i(X_i)\}_{i=1}^{n}$, which are assumed to be independent identically distributed replicates of $\{X, \Delta, \bar{Z}(X)\}$. For $t \in [0, \tau]$, define the at-risk indicator $Y(t) = I(X \ge t)$. By assuming independent censoring within each terminal node, that is, $\lambda(t \mid Z(t) \in \tau_m, C \ge t) = \lambda(t \mid Z(t) \in \tau_m)$ for $m = 1, \ldots, M$, we have

$\lambda_m(t) = \frac{E[\,I\{Z(t) \in \tau_m\}\, dN(t)\,]}{E[\,I\{Z(t) \in \tau_m\}\, Y(t)\,]\, dt},$   (2)

where $N(t) = I(X \le t, \Delta = 1)$ is the observed counting process. We estimate $\lambda_m(t)$ by the following kernel-type estimator,

$\hat{\lambda}_m(t) = \frac{\sum_{i=1}^{n} \int_0^{\tau} K_h(u - t)\, I\{Z_i(u) \in \tau_m\}\, dN_i(u)}{\sum_{i=1}^{n} \int_0^{\tau} K_h(u - t)\, I\{Z_i(u) \in \tau_m\}\, Y_i(u)\, du},$

where $K_h(x) = K(x/h)/h$, $K(\cdot)$ is a second-order kernel function with support on $[-1, 1]$, and $h$ is the bandwidth parameter. When $t$ falls within a bandwidth of the boundary of $[0, \tau]$, we either set $\hat{\lambda}_m(t)$ to the estimate at the nearest interior point or use the second-order boundary kernel in Müller (1991) to avoid biased estimation in the boundary region. For a fixed node $\tau_m$, $\hat{\lambda}_m(t)$ consistently estimates $\lambda_m(t)$ as $n \to \infty$, $h \to 0$ and $nh \to \infty$. The partition-based hazard in (1) is then estimated by

$\hat{\lambda}(t \mid Z(t)) = \sum_{m=1}^{M} \hat{\lambda}_m(t)\, I\{Z(t) \in \tau_m\},$

and, in the notation of the partition function, $\lambda_{l(z)}(t)$ is estimated by $\hat{\lambda}_{l(z)}(t)$.
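As a concrete illustration, the sketch below implements a kernel-smoothed hazard estimate of the Ramlau-Hansen type for the subjects classified to a node; it is a simplification (time-independent node membership, no boundary correction), assumed for illustration rather than the authors' exact estimator.

    # Epanechnikov kernel and its rescaled version K_h(x) = K(x/h)/h.
    epanechnikov <- function(x) 0.75 * (1 - x^2) * (abs(x) <= 1)
    K_h <- function(x, h) epanechnikov(x / h) / h

    # Kernel-smoothed hazard within a node: sum_i K_h(X_i - t) dN_i / Y(X_i),
    # where Y(s) counts node members still at risk at time s.
    node_hazard <- function(t, X, delta, in_node, h) {
      X <- X[in_node]; delta <- delta[in_node]
      risk_at <- vapply(X, function(s) sum(X >= s), numeric(1))
      sum(K_h(X - t, h) * delta / risk_at)
    }

    set.seed(1)
    T0 <- rexp(500, rate = 0.5); C0 <- rexp(500, rate = 0.2)
    X <- pmin(T0, C0); delta <- as.numeric(T0 <= C0)
    node_hazard(1, X, delta, in_node = rep(TRUE, 500), h = 0.5)  # approx 0.5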

If the domain of the time-dependent covariates among survivors changes over time, we can transform $Z(t)$ onto $[0, 1]^p$ via a one-to-one function $\psi_t(\cdot)$. Let $\tilde{Z}(t) = \psi_t(Z(t))$ and use $\tilde{\lambda}(t \mid \tilde{Z}(t))$ to denote the hazard function of $T$ given $\tilde{Z}(t)$. A tree can then be constructed using the transformed covariates $\tilde{Z}(t)$, with partition-based hazard estimates $\hat{\tilde{\lambda}}_m(t)$. Since $\psi_t$ is one-to-one, the estimated hazard given $Z(t)$ is obtained by evaluating the tree at $\psi_t(Z(t))$. For example, for the $j$th covariate $Z_j(t)$, let $\hat{F}_{t,j}(\cdot)$ be its empirical cumulative distribution function in the at-risk population; then $\psi_t(Z(t)) = \{\hat{F}_{t,1}(Z_1(t)), \ldots, \hat{F}_{t,p}(Z_p(t))\}$ is a possible choice. After transformation, the partition is time-varying on the original scale of $Z(t)$.
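The at-risk empirical CDF transformation can be sketched as follows (a minimal sketch with hypothetical helper names; one covariate shown).

    # \hat F_t(z): empirical CDF of a covariate among subjects at risk at t.
    at_risk_ecdf <- function(z, t, Z, X) {
      at_risk <- X >= t          # Y_i(t) = 1
      mean(Z[at_risk] <= z)      # fraction of at-risk subjects with Z <= z
    }

    set.seed(2)
    X <- rexp(100); Z <- runif(100)
    at_risk_ecdf(0.5, t = 0.7, Z, X)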

2.2 Generalized ROC curves for survival trees

Since different tree algorithms usually generate different partitions, questions arise as to how the accuracy of partition-based hazard functions can be properly measured. We propose the use of time-dependent ROC curves to evaluate $\lambda(t \mid Z(t))$. Heuristically, at time $t$, the ROC curve is defined on the survivor population, where a subject is considered a case if $T = t$ and a control if $T > t$. This corresponds to the incident/dynamic ROC curve in Heagerty and Zheng (2005). Let $g_t(\cdot)$ be a scalar function that summarizes information from $Z(t)$, and we predict case or control status based on $g_t(Z(t))$, with a larger value being more indicative of a case. Following Heagerty and Zheng (2005), the false positive rate is $\mathrm{FPR}_t(c) = \Pr\{g_t(Z(t)) > c \mid T > t\}$, the true positive rate is $\mathrm{TPR}_t(c) = \Pr\{g_t(Z(t)) > c \mid T = t\}$, and the ROC function is $\mathrm{ROC}_t(p) = \mathrm{TPR}_t\{\mathrm{FPR}_t^{-1}(p)\}$ for $p \in (0, 1)$. In the literature on binary classification, it has been recognized that, when predicting a binary disease outcome with multiple disease markers, the risk score (i.e., the probability of disease given the markers) yields the highest ROC curve (Green and Swets, 1966; McIntosh and Pepe, 2002). For survival outcomes, the hazard can be viewed as an analog of the risk score. Following arguments of the Neyman-Pearson lemma, it can be shown that setting $g_t(Z(t)) = \lambda(t \mid Z(t))$ yields the highest $\mathrm{ROC}_t(p)$. Thus a risk score with a higher ROC curve is desired.

With a finite number of terminal nodes, $\lambda(t \mid Z(t))$ at time $t$ is a discrete-valued scalar function of $Z(t)$, thus the function $\mathrm{ROC}_t(p)$ becomes a finite set of points rather than a continuous curve. More generally, if $g_t(Z(t))$ has a point mass at $c$, the function $\mathrm{FPR}_t^{-1}(p)$ is not defined for $p$ between the corresponding jump points. In this case, we construct a continuous curve, denoted by $\widetilde{\mathrm{ROC}}_t(p)$, via linear interpolation. Specifically, each interpolated point on the curve corresponds to the following randomized prediction rule with threshold $c$ and randomization probability $q \in [0, 1]$: if $g_t(Z(t)) > c$, predict a case; if $g_t(Z(t)) = c$, predict a case with probability $q$; and if $g_t(Z(t)) < c$, predict a control. In the special case where $g_t(Z(t))$ is a continuous variable, $\widetilde{\mathrm{ROC}}_t(p)$ reduces to $\mathrm{ROC}_t(p)$. The formal definition of $\widetilde{\mathrm{ROC}}_t(p)$ is given in the Appendix. We establish the optimality of the true hazard function with respect to $\widetilde{\mathrm{ROC}}_t(p)$ in Proposition 1. The proof is given in the Supplementary Material. Our result suggests that $\widetilde{\mathrm{ROC}}_t(p)$ can be used to evaluate the predictive ability of $g_t$, and a higher curve is favorable.

Proposition 1

At time $t$, among all scalar functions $g_t(\cdot)$, the hazard $\lambda(t \mid \cdot)$ is optimal in the sense that $g_t(Z(t)) = \lambda(t \mid Z(t))$ yields the highest $\widetilde{\mathrm{ROC}}_t$ curve.

The area under the curve is $\mathrm{AUC}(t) = \int_0^1 \widetilde{\mathrm{ROC}}_t(p)\, dp$, which has the interpretation of a concordance measure (Pepe, 2003). For survival time data, we can show that $\mathrm{AUC}(t)$ is equivalent to

$\mathrm{AUC}(t) = \Pr\{g_t(Z_1(t)) > g_t(Z_2(t)) \mid T_1 = t, T_2 > t\} + \frac{1}{2} \Pr\{g_t(Z_1(t)) = g_t(Z_2(t)) \mid T_1 = t, T_2 > t\},$

where $(T_1, Z_1)$ and $(T_2, Z_2)$ are i.i.d. replicates of $(T, Z)$. Based on Proposition 1, $\mathrm{AUC}(t)$ can be used as a summary measure to evaluate the predictive ability of $g_t$ at time $t$. It is worthwhile to point out that other summary measures of ROC curves, such as the partial AUC and specific ROC points (Pepe, 2003), can also be employed. Here we focus on the discussion of $\mathrm{AUC}(t)$, and the results can be extended to other summary measures. To evaluate the prediction performance over time, a global measure on $[0, \tau]$ is needed. We define a scalar function $g(t, Z(t))$ that combines $\{g_t: t \in [0, \tau]\}$ in a possibly time-dependent way. For survivors at $t$, we use $g(t, Z(t))$ to characterize the risk of failure at $t$. To derive a global measure, we integrate $\mathrm{AUC}(t)$ over $[0, \tau]$ with a weight function $w(t)$ satisfying $\int_0^{\tau} w(t)\, dt = 1$ and define

$\mathrm{CON}(g) = \int_0^{\tau} \mathrm{AUC}(t)\, w(t)\, dt.$

Following Proposition 1, the true hazard function maximizes CON. Motivated by this fact, we propose to use CON as guidance to build survival trees, and the goal is to construct a partition so that the CON of the partition-based hazard is as large as possible. A detailed discussion is given in Section 3.
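The tie-adjusted concordance underlying $\mathrm{AUC}(t)$ and CON can be computed directly; the sketch below (our own illustration, with assumed names) gives half credit to tied risk scores, matching the linear-interpolation construction, and combines AUC values over a time grid with user-supplied weights.

    # AUC(t) with half credit for ties: compare case scores (T = t) against
    # control scores (T > t) among survivors at t.
    auc_t <- function(score_case, score_ctrl) {
      mean(outer(score_case, score_ctrl,
                 function(a, b) (a > b) + 0.5 * (a == b)))
    }

    # CON: weighted average of AUC over a grid of time points; `w` sums to 1.
    con <- function(auc_values, w) sum(auc_values * w)

    auc_t(c(2, 2, 3), c(1, 2, 3))                  # 0.611: discrete scores with ties
    con(c(0.70, 0.65, 0.60), w = c(0.5, 0.3, 0.2)) # weighted global summary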

In practice, investigators can use their own weight functions to reflect the costs of misclassification at different time points. A simple example is to set $w(t) = 1/\tau$. Moreover, let $f(t)$ and $S(t)$ be the marginal density and survival function of $T$, respectively. By setting $w(t) \propto f(t) S(t)$, we have

$\mathrm{CON}(g) = \Pr\{g(T_1, Z_1(T_1)) > g(T_1, Z_2(T_1)) \mid T_1 < T_2\} + \frac{1}{2} \Pr\{g(T_1, Z_1(T_1)) = g(T_1, Z_2(T_1)) \mid T_1 < T_2\},$   (3)

which measures the probability that the subject who fails earlier has a higher risk at the failure time. In the special case where the risk score is a continuous random variable that does not depend on $t$, CON in (3) reduces to the global concordance summary in Heagerty and Zheng (2005).

Harrell's C-statistic (Harrell et al., 1982) has been commonly used to quantify the capacity of a baseline risk score in discriminating among subjects with different event times. When the event time is subject to right censoring, however, the population parameter corresponding to Harrell's C-statistic depends on the study-specific censoring distribution. Uno et al. (2011) studied a modified C-statistic that is consistent for a population concordance measure free of censoring under a working Cox model, where the linear combination of covariates maximizes the limiting value of Uno's C-statistic. However, without the Cox model assumption, it is not clear how to combine the covariates so that the limiting C-statistic is maximized. The proposed CON is defined in a different way so that CON is maximized when $g(t, Z(t)) = \lambda(t \mid Z(t))$, and thus it is appropriate for guiding the tree-building procedure.

3 ROC-Guided Survival Trees

In this section, we develop tree-growing algorithms based on $\mathrm{AUC}(t)$ and CON. We note that although the conditionally independent censoring assumption

$\lambda(t \mid Z(t), C \ge t) = \lambda(t \mid Z(t))$   (4)

is adopted for establishing the large-sample properties of a grown tree where the diameters of the terminal nodes tend to zero, stronger assumptions are often needed to understand the splitting criteria, especially in the early steps of the algorithm. For example, when selecting the optimal split at the root node, the log-rank splitting rule (Segal, 1988; LeBlanc and Crowley, 1993) implicitly assumes that the censoring time is independent of the covariates within the children nodes, which is not guaranteed by (4). For ease of discussion, we assume independent censoring in Sections 3.1 and 3.2, and assume (4) in Section 3.3.

3.1 Splitting based on CON

We first consider the use of CON as the splitting criterion. At each step, we choose the split that leads to the largest increase in the estimated CON of the tree. Under independent censoring, the estimation of $\mathrm{AUC}(t)$ is developed based on the following expression,

$\mathrm{AUC}(t) = \Pr\{g(t, Z_1(t)) > g(t, Z_2(t)) \mid T_1 = t, X_2 > t\} + \frac{1}{2} \Pr\{g(t, Z_1(t)) = g(t, Z_2(t)) \mid T_1 = t, X_2 > t\},$

in which the control group $\{T > t\}$ is replaced by the observable at-risk group $\{X > t\}$. Therefore, replacing the unknown quantities in $\mathrm{AUC}(t)$ with the corresponding estimators in Section 2.1 yields a consistent estimator of the concordance measure, denoted by $\widehat{\mathrm{AUC}}(t)$. To estimate CON, we use $\widehat{\mathrm{CON}} = \int_0^{\tau} \widehat{\mathrm{AUC}}(t)\, \hat{w}(t)\, dt$, where $\hat{w}(t)$ is a weight function that possibly depends on the data.

Consider the partition $\mathscr{T} = \{\tau_1, \ldots, \tau_M\}$ and a split of node $\tau_m$ into children nodes $\tau_{m_1}$ and $\tau_{m_2}$. The partition after splitting is denoted by $\mathscr{T}'$. Proposition 2 shows that CON is a proper splitting criterion and can detect the difference in the hazards of the two children nodes. The proof is given in the Supplementary Material.

Proposition 2

Splitting does not decrease the concordance. Specifically, $\mathrm{AUC}_{\mathscr{T}'}(t) \ge \mathrm{AUC}_{\mathscr{T}}(t)$, and the equality holds if and only if $\lambda_{m_1}(t) = \lambda_{m_2}(t)$. Moreover, $\mathrm{CON}_{\mathscr{T}'} \ge \mathrm{CON}_{\mathscr{T}}$, and the equality holds if and only if $\lambda_{m_1}(t) = \lambda_{m_2}(t)$ for almost every $t \in [0, \tau]$.

Based on Proposition 2, splitting a node never decreases CON. When $w(t) > 0$ on $[0, \tau]$, the equality holds if and only if the hazards of the two children nodes agree almost everywhere on $[0, \tau]$. The optimal split is thus defined as the candidate split that maximizes $\widehat{\mathrm{CON}}$ of the resulting tree. Note that the validity of Proposition 2 does not depend on the censoring distribution. In practice, $\widehat{\mathrm{CON}}$ may not correctly estimate CON if the independent censoring assumption is violated, but the true concordance usually increases after splitting.

One may also consider another splitting criterion where the optimal split on a node is chosen to maximize the increase of CON within the node; we refer to this criterion as ΔCON. For a node $\tau_m$, define $\mathrm{AUC}_m(t)$ and $\mathrm{CON}_m$ as the AUC and concordance measures restricted to the subpopulation with $Z(t) \in \tau_m$, and write $\pi_1(t) = \Pr\{Z(t) \in \tau_{m_1} \mid T = t, Z(t) \in \tau_m\}$ and $\pi_0(t) = \Pr\{Z(t) \in \tau_{m_1} \mid T > t, Z(t) \in \tau_m\}$ for the children nodes $\tau_{m_1}$ and $\tau_{m_2}$. It can be shown that $\mathrm{AUC}_m(t)$ is 0.5 before splitting and is $\{1 + |\pi_1(t) - \pi_0(t)|\}/2$ after splitting. Hence the increase in CON within $\tau_m$ after splitting is

$\Delta\mathrm{CON}_m = \frac{1}{2} \int_0^{\tau} |\pi_1(t) - \pi_0(t)|\, w_m(t)\, dt,$

where $w_m(t)$ is the weight function restricted to $\tau_m$. Moreover, $\Delta\mathrm{CON}_m$ can be estimated by replacing $\pi_1(t)$, $\pi_0(t)$ and $w_m(t)$ with their sample analogs based on the estimators in Section 2.1. The rule based on ΔCON can be viewed as maximizing a weighted average of the difference between the case and control classification rates over $[0, \tau]$. Compared with the log-rank splitting rule, which maximizes a weighted average of the difference in hazards over $[0, \tau]$, the ΔCON-based rule encourages more balanced children nodes and can better detect the difference, especially when the hazards in the children nodes cross each other.
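As a toy illustration of concordance-guided splitting (uncensored data, a single time point, and our own simplified names; not the full time-dependent algorithm), the following sketch scans cutpoints on one covariate and keeps the one maximizing the tie-adjusted concordance of the induced binary risk score.

    # Tie-adjusted concordance of a score for separating cases from controls.
    auc_pairs <- function(score, is_case) {
      cs <- score[is_case]; ct <- score[!is_case]
      mean(outer(cs, ct, function(a, b) (a > b) + 0.5 * (a == b)))
    }

    # Scan candidate cutpoints; the induced score is the node indicator z >= cp.
    best_cut <- function(z, is_case) {
      cps <- sort(unique(z))[-1]
      con <- vapply(cps, function(cp)
        auc_pairs(as.numeric(z >= cp), is_case), numeric(1))
      cps[which.max(con)]
    }

    set.seed(3)
    z <- runif(200); is_case <- rbinom(200, 1, plogis(4 * (z - 0.6))) == 1
    best_cut(z, is_case)   # recovers a cutpoint near 0.6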

3.2 Pruning

Although splitting increases the concordance, a large tree often overfits the data. Following Breiman et al. (1984), we continue splitting until the tree is fully grown, and then prune it from the bottom up to find the most predictive subtree. As guidance for pruning, we define a concordance-complexity measure,

$\mathrm{CON}_{\alpha}(\mathscr{T}) = \widehat{\mathrm{CON}}(\mathscr{T}) - \alpha |\mathscr{T}|,$

where $|\mathscr{T}|$ is the number of terminal nodes in $\mathscr{T}$ and $\alpha \ge 0$ is a complexity parameter. The second term is a penalty that controls the size of the tree. For each $\alpha$, the optimal subtree is defined as the tree that has the largest value of $\mathrm{CON}_{\alpha}$ among all possible subtrees. If more than one subtree attains the largest value of $\mathrm{CON}_{\alpha}$, we define the one with the smallest size to be the optimal subtree. In practice, $\alpha$ and the corresponding optimal subtree can be determined by cross-validation. Details of the splitting and pruning algorithm are given in the Supplementary Material.
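Once each candidate subtree's estimated concordance and size are available, concordance-complexity selection reduces to a simple comparison; a sketch with hypothetical inputs:

    # Pick the subtree maximizing CON_alpha = CON_hat - alpha * (#terminal nodes),
    # breaking ties in favor of the smaller subtree.
    prune_select <- function(con_hat, n_nodes, alpha) {
      score <- con_hat - alpha * n_nodes
      best <- which(score == max(score))
      best[which.min(n_nodes[best])]
    }

    prune_select(con_hat = c(0.60, 0.68, 0.70, 0.705),
                 n_nodes = c(1, 2, 4, 9), alpha = 0.005)  # selects the 4-node subtree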

3.3 Prediction

We use $\mathscr{T}_n$ to denote the partition constructed from the training sample $\{X_i, \Delta_i, \bar{Z}_i(X_i)\}_{i=1}^{n}$, with partition function $l(\cdot)$. Define $\hat{\lambda}(t \mid z) = \hat{\lambda}_{l(z)}(t)$. Given a new observation with covariate value $z$ at time $t$, we predict the hazard to be $\hat{\lambda}(t \mid z)$. Theorem 3.3 summarizes the large-sample property of the estimated partition-based hazard function $\hat{\lambda}(t \mid z)$; the proof is given in the Supplementary Material. Under conditions (A1)-(A5) in the Supplementary Material, for $t \in [0, \tau]$ and any $\epsilon > 0$, as $n \to \infty$, we have $|\hat{\lambda}(t \mid z) - \lambda(t \mid z)| < \epsilon$ eventually with probability 1. As with most of the existing works on the consistency of tree-based estimators, the convergence result is independent of the splitting and pruning algorithm. Theorem 3.3 is developed to provide justification for the time-invariant partition under the commonly adopted assumption that the diameters of the terminal nodes go to zero as $n$ increases (Breiman et al., 1984; LeBlanc and Crowley, 1993). Large-sample results incorporating the algorithm to grow the trees will be investigated in our future work.

Although the above discussion focuses on the hazard function, we can also predict the survival probability when $Z(t)$ is a vector of external time-dependent covariates (Kalbfleisch and Prentice, 2011). Assume the hazard at $t$ depends on the history $\bar{Z}(t)$ only through $Z(t)$. The prediction of the survival probability is based on the equation $S(t \mid \bar{Z}(t)) = \exp\{-\int_0^t \lambda(u \mid Z(u))\, du\}$. We predict the survival probability at $t$ for a subject with covariate path $\bar{z}(t) = \{z(u): 0 \le u \le t\}$ to be

$\hat{S}(t \mid \bar{z}(t)) = \exp\left\{-\int_0^t \hat{\lambda}(u \mid z(u))\, du\right\}.$   (5)
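Equation (5) amounts to numerically integrating the node-specific hazard along the covariate path; a minimal sketch, assuming a hypothetical node_path function that returns the terminal node occupied at each time:

    # S_hat(t | zbar) = exp(-int_0^t lambda_hat(u | z(u)) du), via a Riemann sum.
    predict_surv <- function(t, node_path, lambda_hat, ngrid = 200) {
      u <- seq(0, t, length.out = ngrid)
      lam <- vapply(u, function(s) lambda_hat[[node_path(s)]](s), numeric(1))
      exp(-mean(lam) * t)
    }

    lambda_hat <- list(function(u) 0.2, function(u) 0.8)  # node-specific hazards
    node_path <- function(u) if (u < 1) 1 else 2          # node switch at u = 1
    predict_surv(2, node_path, lambda_hat)                # approximately exp(-1)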

4 ROC-Guided Random Survival Forests

The proposed survival trees can be further transformed into powerful risk prediction tools by applying ensemble methods such as bagging (Breiman, 1996) and random forests (Breiman, 2001). The essential idea in bagging is to average many noisy but approximately unbiased tree models to reduce the variance. Random forests improve the variance reduction of bagging by reducing the correlation between the trees via random selection of predictors in the tree-growing process. In the original random forests for regression and classification (Breiman, 2001), the prediction for a new data point is the averaged prediction of all trees in the forest, and trees are grown sufficiently deep to achieve low bias. However, for right-censored survival data, averaging estimated survival or cumulative hazard functions from deeply grown trees may result in large bias, because the node-specific estimate is biased when a node contains a small number of observations. To address this, we propose to average unbiased martingale estimating equations rather than node summaries from the Kaplan-Meier or Nelson-Aalen estimates (Ishwaran et al., 2008; Zhu and Kosorok, 2012; Steingrimsson et al., 2016; Schmid et al., 2016). Moreover, in light of Meinshausen (2006) and Athey et al. (2018), we treat forests as a type of adaptive nearest-neighbor estimator and construct forest-based local estimation of the survival or hazard functions.

Let $\mathscr{T}_1, \ldots, \mathscr{T}_B$ be a collection of survival trees obtained by resampling the original training data. Each tree is constructed via a tree-growing process where, at each split, a random subset of predictors is selected as candidates for splitting. For the $b$th partition $\mathscr{T}_b$ and a node $\tau \in \mathscr{T}_b$, one can solve the following unbiased estimating equation for the node-specific hazard at $t$,

$\sum_{i=1}^{n} I\{Z_i(t) \in \tau\}\, \{dN_i(t) - Y_i(t)\, \lambda(t)\, dt\} = 0.$

Let $l_b(\cdot)$ be the partition function for $\mathscr{T}_b$, so that $l_b(z) = m$ if and only if $z$ belongs to the $m$th terminal node of $\mathscr{T}_b$. Then the $b$th tree induces the following estimating equation for $\lambda(t \mid z)$,

$\sum_{i=1}^{n} I\{l_b(Z_i(t)) = l_b(z)\}\, \{dN_i(t) - Y_i(t)\, \lambda(t \mid z)\, dt\} = 0.$   (6)

It is easy to see that solving Equation (6) yields the prediction of the hazard at $t$ given $z$ based on the single tree $\mathscr{T}_b$. Note that Equation (6) can be rewritten as the following weighted estimating equation,

$\sum_{i=1}^{n} w_{bi}(t, z) \left\{ \frac{dN_i(t)}{Y_i(t)} - \lambda(t \mid z)\, dt \right\} = 0,$   (7)

where $w_{bi}(t, z) = I\{l_b(Z_i(t)) = l_b(z)\}\, Y_i(t) \big/ \sum_{j=1}^{n} I\{l_b(Z_j(t)) = l_b(z)\}\, Y_j(t)$ and $\sum_{i=1}^{n} w_{bi}(t, z) = 1$. Specifically, the weight $w_{bi}(t, z)$ is a positive constant if the $i$th subject is at risk and falls in the same node as $z$; otherwise, the weight is zero.

To obtain a forest-based prediction, we average the estimating functions in (7) over all trees and obtain the following local martingale estimating equation for $\lambda(t \mid z)$,

$\sum_{i=1}^{n} \bar{w}_i(t, z) \left\{ \frac{dN_i(t)}{Y_i(t)} - \lambda(t \mid z)\, dt \right\} = 0,$

where the weight function is $\bar{w}_i(t, z) = B^{-1} \sum_{b=1}^{B} w_{bi}(t, z)$. The weight $\bar{w}_i(t, z)$ captures the frequency with which the $i$th observation falls into the same node as $z$, and $\sum_{i=1}^{n} \bar{w}_i(t, z) = 1$. Therefore, the forest-based estimator of the hazard function is

$\hat{\lambda}(t \mid z) = \sum_{i=1}^{n} \int_0^{\tau} K_h(u - t)\, \bar{w}_i(u, z)\, \frac{dN_i(u)}{Y_i(u)},$

where the kernel smoothing is as in Section 2.1. When $Z(t)$ are external time-dependent covariates, the survival probability at $t$ given the covariate path $\bar{z}(t)$ is predicted to be

$\hat{S}(t \mid \bar{z}(t)) = \exp\left\{-\int_0^t \hat{\lambda}(u \mid z(u))\, du\right\}.$   (8)
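The forest weights can be sketched as follows, under the simplifying assumptions of baseline covariates and precomputed node memberships (hypothetical names throughout). Aggregating the weights at observed failure times gives a Nelson-Aalen-type cumulative hazard, one natural finite-sample version of the averaged estimating equation; the article instead smooths with a kernel to estimate the hazard itself.

    # memb[b, i]: terminal node of subject i in tree b;
    # memb_new[b]: node of the new covariate value z in tree b.
    forest_weights <- function(t, X, memb, memb_new) {
      B <- nrow(memb); n <- ncol(memb); w <- numeric(n)
      for (b in 1:B) {
        same <- (memb[b, ] == memb_new[b]) & (X >= t)  # at risk, shares z's node
        if (any(same)) w <- w + same / sum(same)
      }
      w / B
    }

    # Cumulative hazard at t: sum the weight of each failure at its failure time.
    forest_cumhaz <- function(t, X, delta, memb, memb_new) {
      ev <- which(delta == 1 & X <= t)
      sum(vapply(ev, function(i)
        forest_weights(X[i], X, memb, memb_new)[i], numeric(1)))
    }

    set.seed(4)
    X <- rexp(50); delta <- rbinom(50, 1, 0.7)
    memb <- matrix(sample(1:3, 5 * 50, replace = TRUE), nrow = 5)  # 5 toy trees
    forest_cumhaz(1, X, delta, memb, memb_new = c(1, 2, 3, 1, 2))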

To achieve good prediction performance, our algorithm relies on subsampling (without replacement) and sample-splitting techniques (Athey et al., 2018). Specifically, we divide each subsample drawn from the original training data into two halves: the first half is used for tree growing and the second half is used to calculate the weights. The algorithm for forest-based estimation is given in the Appendix.

5 Simulation Studies

We performed simulations to investigate the performance of the proposed methods. To cover some of the common survival models, we considered the following scenarios, where the survival distribution depends only on baseline covariates $Z = (Z_1, \ldots, Z_p)^{\top}$:

(I) Proportional hazards model: $\lambda(t \mid Z) = \lambda_0(t) \exp(\beta^{\top} Z)$, where $\beta$ is a vector of regression coefficients and $Z$ follows a multivariate normal distribution with mean 0 and covariance matrix $\Sigma$.

(II) Proportional hazards model with noise variables: $\lambda(t \mid Z) = \lambda_0(t) \exp(\beta^{\top} Z)$, where only a subset of the components of $\beta$ is nonzero, so that the remaining covariates are noise variables.

(III) Proportional hazards model with nonlinear covariate effects: $\lambda(t \mid Z) = \lambda_0(t) \exp\{f(Z)\}$ for a nonlinear function $f$.

(IV) Accelerated failure time model: $\log T = \beta^{\top} Z + \epsilon$, where $\epsilon$ is an error term independent of $Z$.

(V) Generalized gamma family: $T$ follows a generalized gamma distribution whose parameters depend on $Z$.

Except for Scenario (I), the covariates were generated from uniform distributions. In Scenarios (I), (II), and (III), we set the baseline hazard $\lambda_0(t)$ to be the hazard function of a Weibull distribution. Additionally, we generated censoring times $C$ from a uniform distribution over $(0, c)$, where $c$ was tuned to yield censoring percentages of 25% and 50%. As described in Remark 2.1, we treat the baseline covariates as time-dependent covariates through the transformation $\hat{F}_t(Z)$, where $\hat{F}_t$ is the empirical cumulative distribution function of the covariates among the at-risk subjects. Given a new baseline observation $z$, we set $z(u) = z$ for all $u$ and predict the survival probability as in (5). We compare the proposed methods with the relative risk tree in LeBlanc and Crowley (1992) and the conditional inference survival tree in Hothorn et al. (2006), which are implemented in the R (R Core Team, 2018) functions rpart and ctree in packages rpart and party, respectively. We used ten-fold cross-validation in choosing the tuning parameter $\alpha$ in the concordance-complexity measure, as well as when selecting the right-sized tree in rpart. To evaluate the performance of different methods, we use the integrated absolute error

$\mathrm{IAE} = E \int_0^{t^*} \big| \hat{S}(t \mid z) - S(t \mid z) \big|\, dt,$

where the new observations $z$ are generated independently of the simulated training data, and $t^*$ is set to be approximately the 95% quantile of $T$. When predicting the survival probability, the proposed algorithm is not sensitive to the kernel function $K$ or the bandwidth parameter $h$, so we use the Epanechnikov kernel function and the same bandwidth in all scenarios.
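For reference, the integrated absolute error can be approximated on a time grid; a small sketch with hypothetical survival functions:

    # IAE between a predicted and a true survival curve over [0, tstar].
    iae <- function(S_hat, S_true, tstar, ngrid = 100) {
      u <- seq(0, tstar, length.out = ngrid)
      mean(abs(S_hat(u) - S_true(u))) * tstar
    }

    iae(function(u) exp(-0.5 * u), function(u) exp(-0.6 * u), tstar = 2)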

With sample sizes $n = 100$ and $n = 200$ and 500 replications, the top panel of Table 1 reports the mean integrated absolute errors of the proposed methods, rpart and ctree. For all scenarios considered, the integrated absolute errors decrease with the sample size but increase with the censoring percentage. Splitting by CON and by ΔCON yield similar integrated absolute errors. The proposed methods are as efficient as or more efficient than the competitors.

We next consider scenarios where the survival time depends on time-dependent covariates. In what follows, we consider a time-dependent covariate $Z_1(t)$ and a time-independent covariate $Z_2$, where the latter was generated from a uniform distribution.

(VI) Dichotomous time-dependent covariate with at most one change in value: survival times were generated from a hazard model depending on $Z_1(t)$ and $Z_2$, where $Z_1(t)$ jumps from 0 to 1 at a change time that follows an exponential distribution with rate 5.

(VII) Dichotomous time-dependent covariate with multiple changes: survival times were generated from a hazard model in which $Z_1(t)$ switches value at change points given by the first three terms of a stationary Poisson process with rate 10.

(VIII) Continuous time-dependent covariate: survival times were generated from a hazard model depending on a continuous process $Z_1(t)$ whose parameters follow independent uniform distributions.

In these scenarios, we continue to use the transformation in Remark 2.1 to transform the covariates onto $[0, 1]^2$. In the presence of time-dependent covariates, we define the integrated absolute error as

$\mathrm{IAE} = E \int_0^{t^*} \big| \hat{S}(t \mid \bar{z}(t)) - S(t \mid \bar{z}(t)) \big|\, dt,$

where $\bar{z}(t)$ denotes the covariate history up to $t$, and the new covariate paths are generated from the distribution of $\bar{Z}(t)$ independently of the training sample. To our knowledge, there is no available software for predicting the survival probability at time $t$ based on the covariate history up to $t$. Although there have been existing works on survival trees with time-dependent covariates, it is not clear how those methods can be applied for prediction. We therefore compare our methods with rpart and ctree, which can only handle the baseline covariates $Z_1(0)$ and $Z_2$. As expected, the lower panel of Table 1 shows that the proposed methods outperform rpart and ctree under Scenarios (VI)-(VIII). In particular, our methods show a substantial improvement over rpart and ctree in the presence of a continuous time-dependent covariate (i.e., Scenario (VIII)). This indicates that incorporating time-dependent covariates improves prediction accuracy.

              Proposed Methods
          CON             ΔCON            rpart           ctree
n   Sce   0%  25%  50%    0%  25%  50%    0%  25%  50%    0%  25%  50%
Scenarios with baseline covariates
100 I 122 126 126 122 126 125 118 127 130 115 124 125
II 91 99 108 90 96 107 96 105 114 96 103 111
III 89 107 112 90 107 112 87 102 110 89 101 106
IV 100 117 120 99 115 118 90 111 124 95 118 133
V 77 94 108 77 93 107 97 115 132 97 113 127
200 I 110 112 119 120 120 119 113 121 122 111 119 120
II 78 83 88 76 80 86 79 86 94 78 84 91
III 71 91 105 72 90 106 71 86 95 80 93 99
IV 74 94 108 73 100 103 77 95 107 75 96 112
V 67 80 93 66 78 93 91 109 118 89 106 114
Scenarios with a time-dependent covariate
100 VI 71 86 99 70 85 98 138 159 178 136 156 174
VII 81 92 110 80 90 108 149 200 238 150 198 234
VIII 87 89 92 86 90 93 388 309 212 402 352 259
200 VI 54 59 73 55 58 79 128 148 167 126 144 163
VII 73 78 85 74 77 84 145 201 233 150 202 228
VIII 79 78 80 78 78 80 402 329 220 447 396 274
Table 1: Summaries of integrated absolute errors. The numbers 0%, 25% and 50% correspond to different censoring rates; CON and ΔCON are the proposed methods with CON and ΔCON as the splitting criterion, respectively; rpart is the relative risk survival tree implemented in the R package rpart; ctree is the conditional inference survival tree implemented in the R package party.

We continue to use Scenarios (I)-(VIII) to investigate the performance of the proposed forest-based methods. To grow the trees in the forest, we used subsampling and set the subsample size to be 80 when $n = 100$ and 100 when $n = 200$. We randomly select a subset of the variables as candidates at each split. For each tree, the minimum number of failures that must exist in a node for a split to be attempted is 3, and the minimum number of failures in any terminal node is 1. For each dataset, we compute the integrated absolute error as before, but with the tree-based estimate replaced by the forest-based survival function in (8). We compare the proposed forest methods with the random survival forests in Ishwaran et al. (2008) and Schmid et al. (2016), which are implemented in the R functions rfsrc (Ishwaran and Kogalur, 2018) and ranger (Wright and Ziegler, 2017) in packages randomForestSRC and ranger, respectively. We fit rfsrc and ranger both using the default parameter settings and using smaller trees in the forest with larger minimum node sizes (i.e., with nodesize = 10 in rfsrc and min.node.size = 10 in ranger). Since rfsrc and ranger cannot handle time-dependent covariates, we fit them with the baseline covariates. Table 2 shows the averaged integrated absolute errors based on 500 replications. The proposed forest methods perform better than or similarly to their survival tree counterparts and outperform ranger and rfsrc in almost all of the settings. As expected, our methods have a substantial advantage over rfsrc and ranger under Scenarios (VI)-(VIII), echoing the importance of incorporating time-dependent covariates in prediction. Interestingly, for ranger and rfsrc, the default settings lead to larger errors than single trees, and their performance is substantially improved after increasing the size of the terminal nodes. We conjecture that this is because the within-node estimates are less biased with larger node sizes. In summary, the proposed methods are competitive with existing methods when all covariates are time-independent, and show superior performance in the presence of time-dependent covariates.

              Proposed Methods               ranger                          rfsrc
          CON             ΔCON              default        smaller trees    default        smaller trees
n   Sce   0%  25%  50%    0%  25%  50%      0%  25%  50%   0%  25%  50%     0%  25%  50%   0%  25%  50%
Scenarios with baseline covariates
100 I 120 124 124 120 124 124 122 132 137 120 129 131 151 156 157 122 131 136
II 65 67 69 65 67 69 119 123 126 71 75 79 117 120 122 72 77 82
III 79 86 93 80 88 95 136 139 125 99 109 108 139 143 142 84 94 96
IV 68 78 86 69 78 87 129 136 126 95 106 109 138 141 138 81 91 95
V 64 71 76 64 71 76 120 128 121 72 90 99 128 132 131 80 97 109
200 I 112 115 116 112 115 116 122 132 137 116 125 128 150 155 156 118 129 134
II 50 52 54 50 52 54 119 123 126 64 67 72 117 119 122 67 71 77
III 65 71 78 67 73 79 136 139 125 81 90 90 134 139 137 73 83 86
IV 54 62 71 54 63 71 129 136 126 77 85 88 134 137 133 66 75 78
V 48 54 59 49 55 59 120 128 121 67 83 92 126 129 127 72 89 101
Scenarios with a time-dependent covariate
100 VI 57 64 70 57 64 70 186 201 205 123 138 150 199 215 218 128 143 156
VII 61 71 80 61 71 80 205 238 251 151 189 209 164 203 220 220 249 260
VIII 68 74 75 69 74 76 408 357 262 206 197 174 240 229 197 405 354 261
200 VI 47 51 57 47 51 57 189 206 212 118 134 149 204 221 226 120 138 153
VII 48 57 65 48 57 65 209 249 260 148 196 216 225 261 271 153 203 223
VIII 54 58 59 54 58 59 436 389 286 254 244 206 432 386 285 291 277 229
Table 2: Summaries of integrated absolute errors. The numbers 0%, 25% and 50% correspond to different censoring rates; CON and ΔCON are the proposed methods with CON and ΔCON as the splitting criterion, respectively; default and smaller trees refer to the parameter settings of ranger and rfsrc described in the text.

6 Application

We illustrate the proposed methods through an application to a clinical trial conducted by the Terry Beirn Community Programs for Clinical Research on AIDS (Abrams et al., 1994; Fleming et al., 1995; Goldman et al., 1996). The trial was conducted to compare didanosine (ddI) and zalcitabine (ddC) treatments for HIV-infected patients who were intolerant to or had failed zidovudine treatment. Of the 467 patients recruited for the study, 230 were randomized to receive ddI and the other 237 received ddC. The average follow-up time was 15.6 months, and 188 patients had died by the end of the study. Although longitudinal measurements were collected at follow-up visits, Abrams et al. (1994) showed that ddC is as efficacious as ddI in prolonging survival, based on a proportional hazards model with covariates measured at the baseline visit. In what follows, we apply the proposed methods to investigate time-dependent risk factors for overall survival. We included baseline covariates at randomization such as age, gender, treatment received (ddI/ddC), and AIDS diagnosis (yes/no). We also included time-dependent covariates: CD4 count, Karnofsky score, and the cumulative number of recurrent opportunistic infections. The CD4 count and Karnofsky score were measured at the baseline visit and at bimonthly follow-up visits; we adopted the last-covariate-carried-forward approach between visit times when constructing these time-dependent covariates. For opportunistic infections, we used the cumulative number of infections prior to $t$ as the covariate value at $t$. Variables including the Karnofsky score and CD4 count were transformed onto $[0, 1]$ using the corresponding estimated cumulative distribution functions.
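The carried-forward construction used for the CD4 count and Karnofsky score can be sketched as follows (a hypothetical data layout with one value per bimonthly visit):

    # Carry the most recent visit value forward to time t.
    locf_value <- function(t, visit_times, values) {
      idx <- max(which(visit_times <= t))   # most recent visit at or before t
      values[idx]
    }

    locf_value(5.2, visit_times = c(0, 2, 4, 6),
               values = c(90, 80, 70, 60))  # covariate value at month 5.2: 70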

Figure 1 displays the proposed ROC-guided survival tree using the CON splitting criterion. The ΔCON splitting criterion yielded the same tree, so the result is not shown. With concordance-complexity pruning, there are three terminal nodes in the final tree, corresponding to {KSC ≤ 0.396}, {KSC > 0.396, OP = 0} and {KSC > 0.396, OP ≥ 1} (nodes 2, 6 and 7, respectively), where KSC is the transformed Karnofsky score at $t$ and OP is the cumulative number of opportunistic infections up to $t$. The estimated hazard rates of the three nodes are plotted in Figure 2(a), which clearly shows that a lower Karnofsky score is associated with a higher mortality risk. For those with a high Karnofsky score (KSC > 0.396), previous opportunistic infections are also associated with a higher mortality risk. Since a patient can enter different terminal nodes at different time points, one hazard curve does not necessarily represent the hazard of one patient. For example, a patient with a high Karnofsky score and no opportunistic infections enters node 6 and has a relatively low mortality risk. As time progresses, the patient could enter node 7 after experiencing an opportunistic infection, or enter node 2 if the patient's Karnofsky score decreases. Either progression leads to a higher mortality risk.

We also applied the proposed forest-based method, using the same settings as in our simulation studies. For prediction, we considered hypothetical patients with a Karnofsky score that is either increasing linearly