Interval censored recursive forests

12/20/2019, by Hunyong Cho, et al.

We propose interval censored recursive forests (ICRF), an iterative tree ensemble method for interval censored survival data. This nonparametric regression estimator makes full use of the censored information by iteratively updating its survival estimate, and can be viewed as a self-consistent estimator whose convergence is monitored using out-of-bag samples. Splitting rules optimized for interval censored data are developed, and kernel smoothing is applied. The ICRF displays the highest prediction accuracy among competing nonparametric methods in most of our simulations and in an application to avalanche data. An R package icrf is available for implementation.


1 Introduction

In this paper, we study a tree-based survival probability estimation problem for interval censored data. Censored survival data, where the event time is not exactly observed, are challenging for survival probability estimation. Any analysis using censored data requires key assumptions about the censoring distribution. For right-censored data, where the time to event is either exactly observed or only known to be greater than a censoring time, it is often assumed that the censoring mechanism is non-informative of the event time. In this case, the marginal survival curve can be estimated by ignoring the censored observations in risk sets after their censoring time. This implicitly assumes that censored observations experience the same hazard as uncensored observations subsequent to the censoring time.

A more realistic assumption is covariate-conditional non-informative censoring. Under this censoring mechanism, estimation can be done assuming that censored observations share the same hazard over their censoring period as other observations that have the same or similar covariate values. However, when the observations considered for a common hazard are not similar enough, the estimation could incur significant bias.

Most existing tree-based regression methods are subject to this bias. Often, at each node of a survival tree, censored data are supplemented with information borrowed from the marginal survival probability of the node. More specifically, most survival tree methods recursively partition each node using two-sample tests that implicitly estimate the survival distribution of the censored subjects from the marginal survival probability of the node, or of one of the candidate daughter nodes, to which they belong. By using such crude marginal information, however, the heterogeneity of individuals is not sufficiently accounted for, especially in the early phases of tree partitioning. Although trees utilize more covariate-conditional information as they grow toward their terminal nodes and eventually form a finer partition, the early splits, made with insufficient information, can adversely affect subsequent splits and introduce bias. Thus, utilizing covariate-conditional survival probabilities for censored data from the beginning of the partitioning procedure is essential for reducing potential bias.

However, the issue is that the finest covariate-conditional survival probabilities are available only after the trees grow well towards their terminal nodes. This problem was addressed in Zhu and Kosorok (2012), where the authors proposed recursively imputed survival trees (RIST). In this tree-based method, a recursion technique was used to provide covariate-conditional survival probability information for censored subjects.

Interval censored survival data, where only intervals containing the event times are observed rather than the exact event times, present harder challenges for inference. Interval censored data are less straightforward to handle, since they contain less information than right censored observations. There is no closed-form nonparametric maximum likelihood estimator (NPMLE) for general interval censored data, and an NPMLE algorithm with a convergence guarantee was only developed in the early 1990s (Groeneboom and Wellner, 1992; Jongbloed, 1998). Particular challenges arise in current status data (also known as case-I censored data), where the survival status of a subject is inspected at a single random monitoring time, yielding an extreme form of interval censoring.

Tree-based regression methods for interval censored data are sparse and relatively recent, and the issue of insufficient use of covariate-conditional information has yet to be fully addressed. To address this issue, we propose a tree-based nonparametric regression method for interval censored survival data that uses a recursion strategy. The proposed method shows high prediction accuracy both on simulated data and in our illustrative example describing the mortality of avalanche victims (Haegeli et al. (2011); Jewell and Emerson (2013)). We next review relevant tree and survival tree methods and then outline the idea and benefits of our new method.

Trees, such as classification and regression trees (CART, Breiman (2001)), are widely used due to their simplicity and effectiveness. Trees can be tuned for greater flexibility by setting terminal node sizes to be smaller, but as a consequence, they can suffer from greater uncertainty. In an attempt to reduce variability while not losing appropriate flexibility, various ensemble learners have been developed such as bootstrap aggregating (“bagging”) (Breiman, 1996), random forests (RF) (Breiman, 2001), and extremely randomized trees (ERT) (Geurts et al., 2006). For bagging, bootstrapped samples are generated to build multiple trees, and the trees are averaged to form a single estimator. For RF, in addition to bootstrap sampling, tree partitioning is done by considering a random subset of the variables at each node. For ERT, trees are constructed without resampling but by adding further randomization into the partitioning procedure. The common idea of these ensemble learners is “de-correlation” through randomization via resampling or random splitting. Using such randomization, base learners become more diverse or de-correlated with each other, and consequently the ensemble learner increases in stability.

Tree-based methods are attractive for survival analyses due to their flexibility: they do not impose restrictive assumptions such as linearity in features or proportional hazards. For right-censored or uncensored survival data, non-ensemble tree methods have been proposed based on various node-splitting rules (Gordon and Olshen (1985), Segal (1988), Ciampi et al. (1991), LeBlanc and Crowley (1992), and LeBlanc and Crowley (1993)), followed by ensemble methods (Hothorn et al. (2004), Hothorn et al. (2005), and Ishwaran et al. (2008)) and an iterative ensemble method (Zhu and Kosorok (2012)). Survival tree methods with parametric error models have also been proposed (Davis and Anderson (1989) and Zeileis et al. (2008)). However, for interval censored data, the literature on tree-based methods is sparse (Yin et al. (2002) and Fu and Simonoff (2017)), and to our knowledge there are no tree ensemble methods. For a comprehensive review of survival tree methods, see Bou-Hamad et al. (2011).

Developing a regression tree ensemble method for interval censored data is the main aim of this paper. We develop what we call interval censored recursive forests (ICRF), an iterative tree ensemble method that is self-consistent. As with RIST, ICRF can be regarded as an EM-algorithm-based estimator: at each iteration, the conditional expectation is updated and, based on the updated information, a random forest is built as a maximization step. Beyond this property, ICRF is built with the following features. Within each iteration, ERT serves as the basic framework, with bootstrap sampling slightly adapted to monitor convergence. At each node-splitting step, a splitting rule designed for interval censored data (e.g., the generalized Wilcoxon rank sum test, GWRS) is employed. For terminal node prediction, we propose quasi-honest and exploitative predictions and study their properties. Finally, smoothing in the time domain is applied. Using both simulated and real data, we show that ICRF has high predictive accuracy. An R package icrf is available on CRAN.

The rest of this paper is organized as follows. In Section 2, we describe the data structure and modelling assumptions. In Section 3, existing methods are briefly reviewed and the proposed methods are introduced and discussed in context. The predictive accuracy of the proposed method is evaluated using simulations and avalanche victim data in Sections 4 and 5, respectively. In Section 6, we discuss limitations and future areas for research.

2 Data setup and model

The proposed method is applicable to interval censored data, which include right censored and current status data as special cases. The event time $T$ is only known to lie within an interval $(L, R]$, where $L < T \le R$, with $L$ interpreted as $T^-$ (so that the interval degenerates to $\{T\}$) for an exactly observed event. Let $F(t)$, $F(t \mid x)$, $F(t \mid L, R)$, and $F(t \mid x, L, R)$ denote the marginal, covariate-conditional, interval-conditional, and full-conditional distribution functions at time $t$, respectively, where $x$ is a $p$-dimensional covariate with distribution function $G$. We use $S$ to represent the corresponding (conditional) survival functions. For the censoring mechanism, we consider covariate-conditional non-informative censoring, under which the interval does not provide any further information than the fact that the failure time lies in the interval, given the covariate (Oller et al., 2004); i.e., $F(t \mid x, L, R) = F(t \mid x, L < T \le R)$. The study length is denoted by $\tau$. A random vector $\mathbf{U} = (U_1, \ldots, U_K)$ denotes the monitoring times, at each element of which the survival status of the subject is identified, where $K$ is the maximum potential number of follow-up times. Among the monitoring times, only the pair of neighboring time points that brackets $T$ contributes to the likelihood. Thus we only consider $(L, R] = (U_{(J)}, U_{(J+1)}]$ in the data analysis, where $U_{(k)}$ denotes the $k$th order statistic of the elements of $\mathbf{U}$, with $U_{(0)} = 0$, $U_{(K+1)} = \infty$, and $J$ the index satisfying $U_{(J)} < T \le U_{(J+1)}$. Current status data correspond to $K = 1$. As a generalization to uncensored or right-censored survival data, $(L, R]$ can be a segment of the real line with random length: $(L, R] = (C, \infty)$ for an observation right-censored at a random time $C$, and $(L, R] = (T^-, T]$ for uncensored data.
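
To make the reduction from monitoring times to a single censoring interval concrete, here is a small R sketch in our notation (illustrative only; the function name and interface are not part of the icrf package).

```r
# Sketch (our notation): reduce an event time and K monitoring times U to the
# observed censoring interval (L, R], with U_(0) = 0 and U_(K+1) = Inf.
# K = 1 gives current status data.
make_interval <- function(t_event, U) {
  grid <- c(0, sort(U), Inf)
  j <- findInterval(t_event, grid, left.open = TRUE)  # grid[j] < t <= grid[j+1]
  c(L = grid[j], R = grid[j + 1])
}

# Case-II example: event at time 2.7, inspected at times 1, 2, and 4
make_interval(2.7, c(1, 2, 4))   # L = 2, R = 4
# Current status (K = 1): a single monitoring time at 3
make_interval(2.7, 3)            # L = 0, R = 3
```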

3 Interval censored recursive forests

3.1 Existing methods

Ishwaran et al. (2008) proposed random survival forests (RSF) for right censored data. For RSF, one possible splitting rule uses the log-rank test statistic, in which the hazard of a censored subject after the censoring time is implicitly assumed equal to the average hazard of the uncensored subjects in the same candidate daughter node. In other words, censored subjects do not contribute to the log-rank statistic after censoring.

In fact, the accuracy of the estimator could be enhanced if the covariate-conditional survival probability were available, so that a censored subject's survival probability could be conditioned on both the censoring time and the covariates, rather than assuming that censored subjects follow the same marginal distribution. However, the covariate-conditional survival probabilities are not available until a tree reaches its terminal nodes. To overcome this issue, Zhu and Kosorok (2012) proposed recursively imputed survival trees (RIST). The key idea of RIST is to recursively borrow the conditional survival prediction from the previous forest when "guessing" the exact survival times of the censored subjects. In other words, for censored subjects, RIST bases its two-sample tests on a better approximation to the conditional distribution given both the covariates and the censoring time, instead of the marginal distribution given only the censoring time. This innovative approach, however, is designed for right censored data. Thus a more general approach is required to accommodate more general survival data structures, such as the interval censored data studied herein.

In the interval censored survival data literature, there are two current tree-based methods. Yin et al. (2002) developed a tree model for interval censored data that uses the likelihood ratio test as a splitting criterion. Fu and Simonoff (2017) proposed an interval censored survival tree using a modified log-rank test.

These methods have two limitations. First, they are not ensemble learners; enhanced prediction accuracy can be achieved through the variance reduction of ensemble methods. Second, similarly to RSF, they do not fully utilize the covariate-conditional distribution information for censored observations, as RIST does for right censored data. While it is easy to expand these methods into corresponding ensemble learners, the second limitation is not trivial and requires the development of new methods.

We further focus on the fact that existing tree or random forest survival methods are based on Kaplan-Meier type discrete survival curves, while the true survival curve might be smoother. This discrepancy can be dramatic, when the terminal nodes of trees include only a small number of observations. To overcome this, we employ kernel smoothing as in Groeneboom et al. (2010).

3.2 Overview of the proposed method

Before giving a detailed description of the proposed method, we outline the overall idea. As an initial step, to provide rough information about the censored intervals, we estimate the marginal survival curve as a proxy for the covariate-conditional survival curve $S(t \mid x)$, and we obtain an estimate of the full conditional survival probability $S(t \mid x_i, L_i, R_i)$ for each subject $i$. Instead of imputing as in RIST, we store this conditional probability information for each subject and use it in the splitting tests. In this way we avoid the Monte Carlo error of imputation procedures, which can be significant for interval censored data. We use the generalized Wilcoxon rank sum (GWRS) test for splitting, generalizing the Wilcoxon rank sum (WRS) test to compare two groups of time intervals. We also consider other splitting rules, such as the generalized log-rank (GLR) statistic. With one of these splitting rules selected, a multitude of trees is built under a modified ERT algorithm. At each terminal node of a tree, a local survival probability estimate is obtained in one of two ways: 1) the NPMLE of the survival curve is computed from the raw interval data, without using the stored survival curve information, or 2) the full conditional survival curves of the node members are averaged. We call the former the "quasi-honest" approach and the latter the "exploitative" approach. The tree survival probability estimates formed in this manner are averaged to obtain the forest survival probability estimate $\hat{S}^{(1)}(t \mid x)$ for the first iteration. The estimate $\hat{S}^{(m-1)}(t \mid x)$ is then used to update the full conditional survival curve of each subject at the $m$th iteration. At each iteration $m$, $\hat{S}^{(m)}$ is kernel-smoothed. The final prediction is the smoothed survival curve at the iteration with the smallest out-of-bag error. A detailed pseudo-algorithm is given in Table 1.

 

  • Initialize:
    Initialize the survival estimate $\hat{S}^{(0)}$ by the marginal NPMLE;
    Kernel smooth $\hat{S}^{(0)}$, if INITIAL_SMOOTH is TRUE;

  • For m = 1, ..., M, estimate the mth forest:

    • Estimate the conditional probabilities:
      Update $\hat{S}^{(m-1)}(t \mid x_i, L_i, R_i)$ based on $\hat{S}^{(m-1)}(t \mid x_i)$ and $(L_i, R_i]$ for each subject $i$;

    • For b = 1, ..., B, construct the bth tree:

      • Obtain an n'-out-of-n subsample (without replacement) from the dataset, where n' is a large fraction of n (e.g., 95%);

      • Recursively partition the subsample using a splitting rule (e.g., GWRS) based on $\hat{S}^{(m-1)}(t \mid x_i, L_i, R_i)$:
        (At each node, randomly pick a subset of variables, pick a random cutoff for each selected variable, run the splitting test for each of the cutoffs, and choose the split that maximizes the test score.)

      • Prediction at each terminal node:
        If QUASIHONEST, apply the NPMLE to the raw intervals in the node;
        If not QUASIHONEST, average the full conditional survival curves of the node members;
        Kernel smooth the terminal node estimate;

      • Obtain the conditional survival function for the tree;

      • Calculate the out-of-bag error for the tree;

    • Obtain the conditional survival function for the forest by averaging the B tree estimates;

    • Calculate the out-of-bag error for the forest;

  • Finalize the estimator: return the forest estimate $\hat{S}^{(m^*)}$, where $m^*$ is the iteration with the smallest out-of-bag error.

 

Table 1: Pseudo-algorithm for ICRF
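
To make the conditional-probability step of Table 1 concrete, the following R sketch (illustrative only, not the icrf interface; all names are ours) truncates an estimated covariate-conditional survival curve to a subject's observed interval, which is how a full conditional curve is obtained under conditionally non-informative censoring.

```r
# Sketch: update a subject's full-conditional survival curve
# S(t | x, L < T <= R) from a covariate-conditional curve S(t | x)
# evaluated on a time grid.
conditional_survival <- function(grid, surv_x, L, R) {
  # survival at the interval endpoints (linear interpolation on the grid;
  # S(Inf | x) is taken to be 0 for right-censored observations)
  S_at <- function(t) {
    if (is.infinite(t)) return(0)
    approx(grid, surv_x, xout = t, rule = 2)$y
  }
  SL <- S_at(L); SR <- S_at(R)
  denom <- SL - SR
  if (denom <= 0) denom <- .Machine$double.eps  # guard against zero mass
  out <- ifelse(grid <= L, 1,
         ifelse(grid > R, 0, (surv_x - SR) / denom))
  pmin(pmax(out, 0), 1)  # clip numerical noise
}

# Example: a subject known to fail in (2, 4] under an exponential guess
grid <- seq(0, 10, by = 0.1)
plot(grid, conditional_survival(grid, exp(-0.3 * grid), L = 2, R = 4),
     type = "l", xlab = "t", ylab = "S(t | x, 2 < T <= 4)")
```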

3.3 Splitting rules

We develop two splitting rules and describe two existing rules. Peto and Peto (1972) compared two-sample test statistics including the WRS test and the log-rank test. They showed that the log-rank test is the locally most powerful test under Lehmann-type alternatives, while the WRS test also has strong power under log-normal mean-shift alternatives. Thus, these tests can be considered as potential splitting rules, with some modifications for interval censored data.

We develop extensions of the WRS and log-rank tests for interval censored data. Peto and Peto (1972) also proposed score-based tests for interval censored data that correspond to the WRS and log-rank tests. Descriptions of the four splitting rules are given below, and their performances are compared in Section 4.3.2.

A. Generalized Wilcoxon rank sum test (GWRS). The WRS test statistic,
$$W = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \mathbf{1}(T_{1i} < T_{2j}),$$
estimates $P(T_1 < T_2)$ for a randomly picked $T_1$ and $T_2$ from the two samples, where $n_1$ and $n_2$ are the sample sizes of the two groups. We generalize this statistic to interval censored data by replacing the indicator $\mathbf{1}(T_{1i} < T_{2j})$ with its estimated probability under the full conditional distributions of subjects $i$ and $j$. Suppose the full conditional probabilities of each subject are expressed as a vector of probability masses over a common set of disjoint intervals. Then the pairwise probability $\hat{P}(T_{1i} < T_{2j})$ is computed directly from the two probability-mass vectors, and the GWRS statistic aggregates these pairwise probabilities over all pairs across the two candidate daughter nodes.
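
As an illustration of how such pairwise probabilities can be computed from probability-mass vectors, the following R sketch gives one possible implementation (our own, not the exact icrf internals; the aggregation into a standardized test statistic is omitted).

```r
# Pairwise probability P(T_i < T_j) from two vectors of probability masses
# over the same ordered set of disjoint intervals; ties (same interval)
# are counted as 1/2.
prob_less <- function(p_i, p_j) {
  tail_j  <- rev(cumsum(rev(p_j)))   # tail_j[k]  = sum_{l >= k} p_j[l]
  later_j <- c(tail_j[-1], 0)        # later_j[k] = sum_{l >  k} p_j[l]
  sum(p_i * later_j) + 0.5 * sum(p_i * p_j)
}

# A GWRS-type score for a candidate split: average the pairwise probabilities
# over all pairs across the two daughter nodes (rows = subjects).
gwrs_score <- function(P_left, P_right) {
  s <- 0
  for (i in seq_len(nrow(P_left)))
    for (j in seq_len(nrow(P_right)))
      s <- s + prob_less(P_left[i, ], P_right[j, ])
  s / (nrow(P_left) * nrow(P_right))   # estimates P(T_left < T_right)
}

# Toy example with three intervals
P_left  <- rbind(c(0.7, 0.2, 0.1), c(0.5, 0.4, 0.1))
P_right <- rbind(c(0.1, 0.3, 0.6), c(0.2, 0.2, 0.6))
gwrs_score(P_left, P_right)   # values far from 0.5 indicate a strong split
```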

B. Generalized log-rank test (GLR). The log-rank test statistic for uncensored or right-censored data is given by
$$LR = \sum_{k=1}^{D} \left( d_{1k} - Y_{1k} \frac{d_k}{Y_k} \right),$$
where $D$ is the number of distinct observed time points; $Y_{gk}$ and $d_{gk}$ are the number of subjects at risk right before, and the number of events at, the $k$th time point in group $g$, respectively, for $g = 1, 2$; $Y_k = Y_{1k} + Y_{2k}$; and $d_k = d_{1k} + d_{2k}$.

Using the full-conditional survival probabilities from the previous iteration, the log-rank test can be extended to a generalized log-rank (GLR) test for interval censored data, with the at-risk and event quantities replaced by their expected values under the full-conditional distributions and accumulated over time by integration. As an approximation, we evaluate the integration only at the observed monitoring times.
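
For reference, a short R sketch of the classical two-sample log-rank numerator that the GLR statistic builds on (standard formula; all variable names are ours).

```r
# Classical two-sample log-rank numerator: 'time' = observed times,
# 'status' = 1 for events, 'group' in {1, 2}.
logrank_numerator <- function(time, status, group) {
  pts <- sort(unique(time[status == 1]))   # distinct event time points
  stat <- 0
  for (tk in pts) {
    Y  <- sum(time >= tk)                           # at risk overall
    Y1 <- sum(time >= tk & group == 1)              # at risk in group 1
    d  <- sum(time == tk & status == 1)             # events overall
    d1 <- sum(time == tk & status == 1 & group == 1)
    stat <- stat + (d1 - Y1 * d / Y)                # observed minus expected
  }
  stat
}

# Toy example with right-censored data
set.seed(1)
tm <- round(rexp(20), 2); st <- rbinom(20, 1, 0.7); gr <- rep(1:2, each = 10)
logrank_numerator(tm, st, gr)
```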

C. WRS-score test (SWRS). Peto and Peto (1972) introduced an asymptotic score statistic corresponding to the two-sample WRS test for interval censored data. Each subject $i$ receives the score $\hat{S}(L_i) + \hat{S}(R_i) - 1$, based on an estimated survival curve evaluated at the interval endpoints, and the test statistic is a standardized sum of the scores within a candidate daughter node.

D. Log-rank-score test (SLR). Another score statistic (SLR), based on the log-rank test, was also proposed by Peto and Peto (1972). It takes the same form, but with subject $i$'s score given by $\{\hat{S}(L_i)\log\hat{S}(L_i) - \hat{S}(R_i)\log\hat{S}(R_i)\}/\{\hat{S}(L_i) - \hat{S}(R_i)\}$, with the convention $0 \log 0 = 0$.

The best cut point is the one that maximizes the GWRS, GLR, SWRS, or SLR statistic. We use GWRS as our main splitting rule in the subsequent analyses. In Section 4.3.2, we illustrate how different splitting rules affect the prediction accuracy.

3.4 ERT

To overcome the high variability of trees, various ensemble learners such as Breiman (2001)’s random forest (RF) and Geurts et al. (2006)’s extremely randomized trees (ERT) have been proposed. A common strategy is to build multiple decorrelated trees and average them. In RF, decorrelation is achieved by using bootstrap samples for each tree and by considering only a random subset of variables for splitting at each node. In ERT, multiple different trees are generated without resampling but by randomly selecting a subset of variables and by randomly selecting one cutoff point for each selected variable.

However, this variance reduction comes at the price of higher bias; ensemble learners are used in the hope that the gain from variance reduction exceeds the loss from increased bias. In Geurts et al. (2006), ERT was shown to have enhanced accuracy compared to RF in certain settings, as measured by mean squared error. This is perhaps because, given the same terminal node size (the minimum number of subjects falling into a terminal node), the measure of a node is on average smaller for ERT than for RF, since ERT does not involve resampling. As a result, while the extra randomness brings lower variance and higher bias, the increase in bias is smaller because of the absence of resampling.

3.5 Self-consistency and convergence monitoring

RIST can be understood as a self-consistent estimator (Efron, 1967), where imputation acts as an expectation step and random forest estimation as a maximization step in the EM iterations. The proposed ICRF has analytically the same structure except that it is designed for interval censored data and the distributional information is saved and used in the tree-building procedures instead of being used for imputation.

The consistency of survival trees for uncensored data can be shown heuristically using the following strategy. Assume that the conditional survival function is smooth in the covariates, so that there exists a kernel function $K_h(\cdot, \cdot)$ with bandwidth $h$ for which $S(t \mid x)$ is well approximated by the kernel-weighted average $\int K_h(x, w)\, S(t \mid w)\, dG(w)$. Since the bandwidth is often a decreasing function of the sample size $n$, $n$ is used interchangeably as the subscript of the kernel when there is no confusion. What tree-based methods essentially do is estimate the kernel function (a form of partitioning) and the local survival functions inside the integral (a form of prediction in terminal nodes). The kernel function estimator of the trees can be shown to be consistent, provided the length of every side of any terminal node containing a signal becomes arbitrarily small in probability while, at the same time, every terminal node includes arbitrarily many sample points as the sample size grows. Note that the estimator of the kernel function is governed by the splitting rule, which we denote by $r_0$ for uncensored data. Most of the terminal node estimators used in survival tree methods are consistent: e.g., the EM-ICM algorithm-based (Wellner and Zhan, 1997) nonparametric maximum likelihood estimator (NPMLE) for interval censored data and the Kaplan-Meier estimator for uncensored and right-censored data are consistent estimators. Thus, a tree estimator constructed in this manner, combining the estimated kernel with consistent local survival estimators averaged over the empirical covariate distribution, can be shown to be consistent. The consistency of a tree estimator is trivially inherited by the corresponding ensemble estimators.

To obtain the consistency of survival trees for censored data, it is required that the splitting rule used for censored data give a kernel estimator close enough to the one given by $r_0$. Such a censored-data splitting rule uses the interval-conditional survival probabilities of each individual, computed under some working survival function. The distance between splitting rules only needs to be strong enough that convergence of the censored-data rule to $r_0$ implies convergence of the resulting kernel estimators. Since, given a consistent local estimator, a consistent estimator of the covariate-conditional survival function solves the corresponding self-consistency equation, such an estimator might be obtained by recursion. That is, once a covariate-conditional survival function estimate is obtained at the $m$th iteration, the $(m+1)$st forest can make partitions using the same splitting rule but with the updated survival information. For example, this is done when the full conditional probabilities derived from $\hat{S}^{(m)}(\cdot \mid x_i)$ are used in the GWRS splitting rule at the $(m+1)$st iteration, instead of those derived from a more naive estimate such as the marginal survival curve.

This self-consistent estimator makes sense when the iterates stabilize, i.e., $\hat{S}^{(m+1)} \approx \hat{S}^{(m)}$ for some large $m$. However, it is well known that self-consistency equations do not guarantee global convergence under interval censoring (Wellner and Zhan, 1997). For some initial guesses, the estimator may give an inconsistent survival estimate. Thus, it is crucial to make sure that an additional forest iteration brings a reduction in error. To monitor this in the absence of the true survival curve, the out-of-bag samples are used for estimating the accuracy. That is, for each tree in ERT, we randomly subsample a large fraction of the data, e.g., 95% instead of the 63% implied by the default bootstrap in RF, build a tree, and evaluate it using the small (5%) hold-out sample. Using a metric that will be discussed in Section 4.2, we monitor the performance of the forests over a prespecified number of iterations, e.g., ten.
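
A schematic R sketch of this monitoring loop follows (purely illustrative: the tree fitting and error evaluation are replaced by placeholders, and the names are ours).

```r
# Convergence monitoring via per-tree hold-out samples: each tree sees a
# 95% subsample (without replacement); its 5% out-of-bag portion is used
# to score the forest, and the best iteration is kept.
n <- 300; B <- 50; M <- 10; frac <- 0.95
set.seed(1)
oob_by_iteration <- numeric(M)
for (m in seq_len(M)) {
  errs <- numeric(B)
  for (b in seq_len(B)) {
    in_bag  <- sample.int(n, size = floor(frac * n), replace = FALSE)
    out_bag <- setdiff(seq_len(n), in_bag)
    # tree    <- fit_tree(data[in_bag, ], cond_surv_from_iteration(m - 1))
    # errs[b] <- oob_error(tree, data[out_bag, ])   # the metric of Section 4.2
    errs[b] <- runif(1)   # placeholder in this sketch; no real fitting is done
  }
  oob_by_iteration[m] <- mean(errs)
}
best_m <- which.min(oob_by_iteration)   # iteration returned as the final ICRF
```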

3.6 Quasi-honesty

Once the partitioning procedure is done, the terminal node survival curves are estimated either i) by applying the NPMLE to the raw interval data (quasi-honest prediction) or ii) by averaging the full conditional survival curves (exploitative prediction). The former approach is quasi-honest in that the survival probability from the previous iteration is used only in the partitioning procedure, not in the prediction procedure. It is not genuine honesty (Athey and Imbens, 2016), in the sense that ICRF still uses the same interval data in both partitioning and terminal node prediction.

The second approach is exploitative. This approach is efficient, as it utilizes all available information for terminal node prediction. Also, since the prediction does not require a complicated optimization procedure, it is computationally light. However, as discussed in the following paragraphs, this approach tends to suffer from bias, non-convergence, and dilution of signal. RIST, where imputed values containing information about the covariate-conditional survival curve are used for both partitioning and terminal node prediction, is hence exploitative.

The role of (quasi-)honesty in prediction accuracy should be understood in terms of the bias-variance trade-off. While honesty induces higher variability by not utilizing all of the information at each step, it relaxes the overfitting problem and makes trees less biased by imposing independence between the partitioning procedure and the terminal node prediction procedure. Hence, quasi-honesty may or may not be beneficial for interval censored survival analysis. A large amount of information about the true survival curve is lost due to interval censoring, so there might be room for an exploitative approach to make up for the information loss, since it utilizes the information more fully. However, interval censored data also constitute a highly noisy setting, and thus the quasi-honest approach may dampen the overfitting problem that the exploitative approach might have. If the estimation initially moves in a wrong direction, the exploitative approach may keep driving the estimation sequence in that direction, while the quasi-honest approach may suffer less from such non-convergence.

Another property of the exploitative approach is dilution of signal. As the initial survival probability starts with the marginal survival distribution, even after partitioning, two different points in a feature space share a significant amount of information about the survival distribution. This results in lower variance and hence, sometimes, underfitting. This exploitative approach should therefore be used when the features do not contain a large amount of information about the time distribution. We compare the performances of these two approaches in Sections 4 and 5.

3.7 Smoothed forests

Random forests are relatively smoother than their base learners with respect to the features. However, they are still discrete in the time domain, especially when based on the NPMLE for interval censored data. Since in reality the survival function is unlikely to be a step function, it can be beneficial to assume some smoothness of the true survival function. Groeneboom et al. (2010) proposed two ways of estimating smooth survival curves for current status data. Although their first method may not apply to general interval censored data, one can easily use the second method, the smoothed maximum likelihood estimator (SMLE), for such data. The idea is to find the non-smooth nonparametric maximum likelihood estimator (NPMLE) and apply kernel smoothing to it, yielding the SMLE $\tilde{S}(t) = \int \mathbb{K}_h(t - s)\, d\hat{S}(s)$, where $\mathbb{K}_h$ is an integrated kernel with bandwidth $h$. For survival forests, the SMLE is computed at each terminal node of each tree, and the smoothed random survival forest is the average of the resulting smoothed tree estimates. In this paper we use a Gaussian kernel, with the bandwidth chosen as half of the inter-quartile range of the marginal survival distribution estimate. For the kernel near the boundary of the time domain, we use a mirror (reflection) kernel.
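
A minimal R sketch of this smoothing step, assuming the terminal-node NPMLE is represented by its jump locations and jump sizes and using a simple reflection at zero as the boundary correction (the exact boundary treatment and bandwidth choice in icrf may differ).

```r
# Smooth a step-function survival estimate, given by its jump locations
# 'jump_t' and jump sizes 'jump_p', with a Gaussian kernel of bandwidth h;
# reflection at t = 0 serves as the boundary correction.
smle_survival <- function(t, jump_t, jump_p, h) {
  sapply(t, function(tt) {
    # integrated Gaussian kernel applied to each jump, plus its mirror image
    w <- pnorm((tt - jump_t) / h) + pnorm((tt + jump_t) / h) - 1
    1 - sum(jump_p * w)
  })
}

# Toy example: an NPMLE putting mass 0.3, 0.4, 0.3 at times 1, 2.5, and 4
tt <- seq(0, 6, by = 0.05)
plot(tt, smle_survival(tt, c(1, 2.5, 4), c(0.3, 0.4, 0.3), h = 0.5),
     type = "l", xlab = "t", ylab = "smoothed survival")
```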

4 Simulations

In this section, we run simulations to evaluate the prediction accuracy of ICRF from multiple angles. The first set of simulations compares the prediction accuracy of ICRF with that of existing methods under multiple scenarios. The second set compares the performance of different splitting rules for ICRF and of the quasi-honest versus exploitative prediction rules. The final set examines performance as the sample size grows.

The competitors considered include the Cox proportional hazards model (Finkelstein, 1986), implemented using the R package icenReg (Anderson-Bergman, 2017), and survival trees for interval-censored data (STIC) (Fu and Simonoff, 2017). As STIC is not an ensemble learner, its random forest version (SFIC) is also included in the simulation. All models except the Cox model are implemented using the R package icrf. Since, for the methods other than ICRF, the estimates are not identifiable at each time point but are uniquely obtained only as a set of probability masses over intervals, we interpolate the within-interval survival curve assuming a uniform density within each interval; when the length of an interval is not finite, an exponential density is assumed instead, so that the estimated probability mass of the last unbounded interval is spread over an exponential tail. We further include smoothed versions of STIC and SFIC for a fair comparison with ICRF. For the Cox model, we did not consider regularization, as there is no currently available scalable software for this. The code for the simulations is provided in the Supplementary Material.
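
One way to carry out this interpolation is sketched below in R (our illustration; the tail rate is a hypothetical choice that the text does not specify).

```r
# Turn probability masses 'p' on intervals with right endpoints 'ends'
# (the last possibly Inf) into a survival curve: linear (uniform density)
# within finite intervals, exponential over the unbounded tail.
interp_survival <- function(t, ends, p, rate = 1) {
  starts <- c(0, head(ends, -1))
  S_left <- 1 - c(0, head(cumsum(p), -1))   # survival at interval starts
  sapply(t, function(tt) {
    k <- findInterval(tt, starts)           # which interval tt falls in
    if (k == 0) return(1)
    if (is.infinite(ends[k]))               # exponential tail (assumed rate)
      return(S_left[k] * exp(-rate * (tt - starts[k])))
    frac <- (tt - starts[k]) / (ends[k] - starts[k])
    S_left[k] - p[k] * min(frac, 1)         # uniform density within interval
  })
}

# Toy example: masses 0.5, 0.3, 0.2 on (0,1], (1,3], (3,Inf)
tt <- seq(0, 6, by = 0.1)
plot(tt, interp_survival(tt, c(1, 3, Inf), c(0.5, 0.3, 0.2)),
     type = "l", xlab = "t", ylab = "interpolated survival")
```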

4.1 Generative models and tuning parameters

We first define the simulation settings by describing generative models and tuning parameters for the estimators. The basic framework for the generative models is largely taken from Zhu and Kosorok (2012).

Generative models. Six scenarios, each under two monitoring schemes (case-I and case-II censoring), are studied. Scenario 1 (PH-L) assumes a proportional hazards model with a linear hazard ratio, Scenario 2 (PH-NL) replaces it with a nonlinear hazard ratio, and Scenario 3 (non-PH) is a non-proportional hazards model; all three assume non-informative censoring. The fourth scenario (CNIC) has non-informative censoring conditional on the covariates, and the fifth (IC) has informative censoring. To study how smoothed estimators behave under a non-smooth true survival curve, we further adopt Scenario 6 (non-SM), in which the first scenario is modified so that the density of the event times is degenerate. The settings are defined more concretely in Table 2. The sample size of the training sets is 300, and samples are drawn independently. The study period ($\tau$) is set to 5 for all scenarios.

Scenario 1 (PH-L):    25   0.9    1.27
Scenario 2 (PH-NL):   10   -      0.24
Scenario 3 (non-PH):  25   0.75   0.68
Scenario 4 (CNIC):    25   0.75   0.41
Scenario 5 (IC):      10   0.2    0.46
Scenario 6 (non-SM):  25   0.9    1.27

Table 2: Simulation settings. Training samples of size 300 are drawn independently, with the covariate elements (conditionally) independent of each other. The generative models use multivariate normal, log-normal (unit variance), uniform, exponential, and Gamma distributions; Scenario 6 uses a semi-discretized exponential distribution, obtained by rounding an exponential variate up to a discrete grid.

Tuning parameters. The tuning parameters for the tree-based methods are summarized in Table 3. The minimum size of the terminal nodes is 6 for the ensemble learners and 20 for the non-ensemble tree method. For the ensemble learners, 300 trees are built, considering a random subset of candidate variables at each node. The default splitting rule for ICRF is GWRS, and both quasi-honest and exploitative predictions are used for terminal node prediction; other splitting rules are also compared. As for smoothing, the bandwidths are chosen based on the inter-quartile range of the marginal survival distribution estimate, as described in Section 3.7.

method   M    ntree   with replacement   nodesize
ICRF     10   300     no                 6
SFIC     -    300     yes                6
STIC     -    -       -                  20
Table 3: Tuning parameters for the tree-based methods. M is the maximum number of iterations for ICRF, ntree is the number of trees making up each random forest, mtry is the number of candidate features on which splitting tests are done at each node, the resample size is the size of the random resample used for each tree, "with replacement" indicates whether resampling is done with replacement, and nodesize is the minimum number of observations in terminal nodes.

4.2 Prediction Accuracy

To assess the prediction accuracy of the estimators, we use the integrated absolute error and the supremum absolute error over the study period, $\int_0^{\tau} |\hat{S}(t \mid x) - S(t \mid x)|\, dt$ and $\sup_{t \in [0, \tau]} |\hat{S}(t \mid x) - S(t \mid x)|$, averaged over the covariate distribution. These error measurements are obtainable only when the true survival curve is available. To measure the error in the absence of the true survival curve, we use the integrated mean squared errors of type 1 (IMSE$_1$) and type 2 (IMSE$_2$) (Banerjee et al., 2019). IMSE$_1$ is defined as the squared discrepancy between the estimate and the actual survival status, averaged over the time interval on which the survival status is known (i.e., $[0, L_i]$, where the subject is known to be alive, and, when $R_i$ is finite, $(R_i, \tau]$, where the subject is known to have failed) and then averaged over the test set. This can be regarded as a modified integrated Brier score (Graf et al., 1999).

IMSE$_2$ is defined over the whole time domain up to the study length, where the discrepancy over the censored interval is computed as the difference between the estimated covariate-conditional survival curve and the full-conditional survival curve. As mentioned in the previous section, the out-of-bag IMSE$_1$ is used for convergence monitoring of ICRF, as it is a model-free measure: the out-of-bag samples serve as a test set, and the error used for monitoring is the average over trees of each tree's error evaluated on the sample left out of that tree's subsample.
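
The following R sketch reflects our reading of the verbal IMSE$_1$ definition above (the exact weighting in the paper may differ; the prediction function is a placeholder).

```r
# IMSE-type-1 style error: squared discrepancy between the estimated survival
# curve and the known survival status, averaged over [0, L_i] (known alive)
# and (R_i, tau] (known dead, if R_i is finite), then averaged over subjects.
imse1 <- function(surv_hat, L, R, tau, n_grid = 200) {
  tgrid <- seq(0, tau, length.out = n_grid)
  errs <- sapply(seq_along(L), function(i) {
    known <- tgrid <= L[i] | (is.finite(R[i]) & tgrid > R[i])
    if (!any(known)) return(NA_real_)
    status <- as.numeric(tgrid <= L[i])        # 1 alive, 0 dead, on known set
    mean((surv_hat(tgrid[known], i) - status[known])^2)
  })
  mean(errs, na.rm = TRUE)
}

# Toy example with a common exponential prediction for every subject
surv_hat <- function(t, i) exp(-0.5 * t)
imse1(surv_hat, L = c(1, 2), R = c(3, Inf), tau = 5)
```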

4.3 Simulation results

4.3.1 Comparison with other methods

Simulations are replicated independently for each distinct setting. The simulation results based on quasi-honest prediction and the GWRS rule are illustrated in Figure 1. The results in the left column are for case-I censoring and those in the right column are for case-II censoring. For convenience, we denote the ICRF estimator at the $m$th iteration by ICRF-$m$, and the iteration with the best out-of-bag error among the ten iterations by ICRF-A. In the results, ICRF-1, ICRF-2, ICRF-3, ICRF-5, ICRF-10, and ICRF-A are presented.

Comparison with other methods. For most of the scenarios, the ICRF's have minimum or close-to-minimum integrated and supremum absolute errors. For Scenario 5 (under both censoring schemes), where the Cox models have better integrated absolute errors than the ICRF's, the ICRF's have supremum errors as good as those of the Cox models. Noting also that simpler models such as the STIC's and the Cox models have better accuracy than the SFIC's under Scenario 5, there is evidence that underfitting might be beneficial in settings where the features carry only weak signals.

In Scenario 1, where data are generated under the proportional hazards model, ICRF’s have better average accuracy than that of the Cox models. Although the Cox models eventually have higher accuracy for larger samples (see Figure 3), the results indicate that ICRF methods have a relatively high prediction accuracy.

Convergence monitoring. The ICRF error generally becomes smaller as the number of iterations increases, although it fluctuates and sometimes increases. However, ICRF-A, the ICRF at the best iteration as measured on the out-of-bag samples, has integrated and supremum absolute errors close to the minimum most of the time.

Figure 1: Prediction errors of the methods under different simulation settings (the ICRF's are built in a quasi-honest manner). The boxes in the left column are for case-I censoring and those in the right column are for case-II censoring. For each setting, the horizontal line indicates the minimum of the mean error levels across methods.

4.3.2 Splitting rules and quasi-honesty

Four splitting rules (GWRS, GLR, SWRS, SLR) with quasi-honest versus exploitative predictions are compared in Figure 2 under the six scenarios with case-I censoring. Most of the time, the new splitting rules (GWRS and GLR) have, on average, less error than the score-based rules (SWRS and SLR). GWRS and GLR have about the same prediction accuracy as each other. The gap between the new splitting rules and the score-based rules might reflect the fact that the score-based rules rely on approximation, while GWRS and GLR do not.

On the other hand, the comparison between quasi-honest and exploitative predictions is less consistent. One does not always beat the other. In Scenarios 2, 4, and 5, the exploitative prediction has lower integrated absolute error, and in Scenario 3 and early iterations of 1 and 6, it has higher error rates. As mentioned in the last paragraph of Section 3.6, exploitative prediction tends to make weak contrasts between two feature values and is expected to perform well when the true distribution has faint signals. In contrast, quasi-honest prediction provides more precise estimates when the signal is strong.

Figure 2: Mean and first and third quartiles of the prediction errors for different splitting rules and prediction rules under case-I censoring.

4.3.3 Varying sample sizes

The prediction accuracy of each method is evaluated under different sample sizes for current status data under Scenario 1 (the proportional hazards model). The integrated and supremum absolute errors are measured. For ICRF, the last (10th) iteration estimate is used for illustration. The mean and the first and third quartiles of the error measurements across 100 replicates are illustrated in Figure 3.

The Cox model, although it does not have the smallest errors for small sample sizes, has rapidly decreasing errors as the sample size grows, for both the integrated and supremum absolute errors. Among the nonparametric models, ICRF shows the highest prediction accuracy in terms of all error measures and for all sample sizes. Not only is the level of error the lowest, but the rate of decrease in error with larger sample sizes is also the highest.

Figure 3: Prediction errors under different sample sizes for Scenario 1 with current status data.

5 Avalanche data analysis

5.1 Data description and assumptions

In this section, we apply ICRF (using 10 iterations), three other methods (STIC, SFIC, and the Cox model), and the corresponding smoothed estimators to data on 1,247 avalanche victims buried in Switzerland and Canada between October 1980 and September 2005 (Jewell and Emerson, 2013). The dataset includes the duration of burial and the survival status of the subjects, and thus can be regarded as current status data. The covariates include location, burial depth, and the type of outdoor activity involved. Approximately 10% of the observations have missing burial depth. The main quantity of interest is the covariate-conditional survival probability, where the event time is defined as time from burial to death. The event time here is counterfactual in the sense that it is the time until death had the person not been discovered (prior to death).

We use the following assumptions. First, burial duration is independent of the time to event. This assumption is plausible, as avalanche recovery is usually performed in the absence of knowledge of the victims' survival status. Second, burial depth is missing completely at random. Although this assumption may not be fully valid, we also analyze the data using complete cases only for comparative purposes. Third, the survival times of individual victims are independent. Since a single avalanche may involve multiple burials due to group activities, this assumption may not be valid without a sufficient number of covariates. However, the point estimator of the survival function remains valid for grouped survival data.

We randomly partition the complete data into training and test datasets 300 times. The training sets are used for estimating the survival curves, and the fitted models are evaluated using the corresponding test sets. The avalanche burial durations are highly skewed (median = 30, mean = 2,932, 3rd quartile = 110, max = 342,720 minutes). To make the estimation computationally feasible, a log-transformed time domain is used, and the prediction accuracy is evaluated in the transformed time domain. The study length is set at 14,400 minutes (10 days). The analyses are implemented using the R package icrf, and the code is provided in the Supplementary Material. Preliminary parametric and semi-parametric regression analyses of these data are provided in Jewell and Emerson (2013).

5.2 Results

The prediction accuracy (IMSE) of the fitted models is summarized in Table 4. Among the methods compared, the smoothed Cox model has the best prediction accuracy. Among the nonparametric methods, ICRF with exploitative prediction has the best prediction accuracy. For most methods, smoothing improves the prediction accuracy.

ICRF (quasi-honest) 0.036 (0.0052) 0.038 (0.0065)
ICRF (exploitative) 0.023 (0.0025) 0.019 (0.0015)
STIC 0.037 (0.0065) 0.041 (0.0087)
STIC (smoothed) 0.036 (0.0065) 0.039 (0.0086)
SFIC 0.034 (0.0031) 0.036 (0.0033)
SFIC (smoothed) 0.035 (0.0031) 0.036 (0.0034)
Cox 0.020 (0.0021) 0.018 (0.0015)
Cox (smoothed) 0.020 (0.0021) 0.018 (0.0014)
Table 4: Mean (and Monte Carlo standard error) of the IMSE of the avalanche survival models.

Figure 4 illustrates the expected truncated log survival time of avalanche victims estimated by each smoothed model. While the Cox model, by assumption, yields an expected survival time that is monotone in each covariate, the nonparametric models show non-monotone curves. The expected truncated survival time curves under the two prediction rules differ noticeably in their variability: quasi-honest ICRF, compared to exploitative ICRF, has a wigglier curve along burial depth and wider gaps among the different group activities.

For most models, burial depth appears to be the most important covariate. In general, the mean truncated survival time decreases as the burial depth increases. However, for the ensemble methods (ICRF, SFIC), the mean survival time increases for depths greater than 350 cm; this is considered an overfitting artifact in a sparse data region. In many models, location plays as important a role as burial depth; in the Cox model, the mean survival time in Canada is on average smaller than in Switzerland. Unlike the Cox model, the nonparametric models show very different patterns of expected survival time curves for the two countries.

Variable importance is formally quantified by measuring the increase in IMSE when the values of a covariate are randomly permuted across the sample. The increase in IMSE is averaged across ten sets of random permutations; a larger increase for a variable indicates higher importance. The variable importance calculated for the model fitted on the first training set of the avalanche data is presented in Table 5. For either type of measurement (IMSE$_1$ or IMSE$_2$), burial depth is the most important variable for explaining the survival probability. The second most important variable differs between the two prediction rules: group activity for the exploitative rule and location for the quasi-honest rule.

Figure 4: Estimated mean truncated log survival time in the avalanche data. The size of dots at the bottom of each box represents the number of sample data points.
Quasi-honest exploitative Quasi-honest exploitative
Burial depth 1.000 0.385 0.947 1.000
Group activity -0.017 0.302 0.185 0.643
Location 0.096 0.152 0.680 0.300
multiplier 0.0101 0.0035
Table 5: Variable importance of the ICRF model fitted on the first training set of the avalanche data. The importance values are rescaled so that maximum values for each measure become 1. The multiplier is the original importance scale.
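
The permutation-based importance described in Section 5.2 can be sketched in R as follows (illustrative only; the error function stands in for the IMSE computation on a fitted model's predictions, and all names are ours).

```r
# Permutation variable importance: permute one covariate at a time and record
# the average increase in error over n_perm random permutations.
perm_importance <- function(data, covariates, error_fn, n_perm = 10) {
  base_err <- error_fn(data)
  sapply(covariates, function(v) {
    inc <- replicate(n_perm, {
      perturbed <- data
      perturbed[[v]] <- sample(perturbed[[v]])   # permute one covariate
      error_fn(perturbed) - base_err             # increase in error
    })
    mean(inc)
  })
}

# Toy usage with a made-up error function on a data frame
set.seed(1)
d <- data.frame(depth = rnorm(50), activity = rnorm(50), y = rnorm(50))
toy_err <- function(dd) mean((dd$y - 0.8 * dd$depth)^2)  # pretend model uses depth only
perm_importance(d, c("depth", "activity"), toy_err)
```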

6 Discussion

In this paper, we proposed a new tree-based iterative ensemble method for interval censored survival data. As interval censoring masks a large amount of information, maximizing the use of the available information can significantly improve estimator performance. Using an iterative fitting algorithm with convergence monitoring, ICRF addresses the potential bias issue shared by most existing tree-based survival estimators. This bias arises from not fully utilizing the covariate-conditional survival probabilities in the early phases of the tree partitioning procedure, which causes the implied kernel estimate to incur significant bias. The WRS and log-rank tests were generalized for interval censored data and used as splitting rules to fully utilize the hidden information. Quasi-honest and exploitative rules were discussed for terminal node prediction, and smoothing in the time domain adds another feature to ICRF.

We suggested defaults for many of the modeling hyper-parameters, such as using GWRS or GLR as the splitting rule, the bandwidth of the kernel smoothing, and the selection of the best iteration by the out-of-bag error measurement. However, the choice of the terminal node prediction rule remains unspecified. The quasi-honest and exploitative prediction rules have complementary features: the quasi-honest rule induces higher model variability, while the exploitative rule tends to favor simpler models. Thus, they perform well under strong and weak signal settings, respectively.

The challenge is that the IMSE measurements are not always a good replacement for the true error measurements (the integrated and supremum absolute errors). The out-of-bag measurement recommends the exploitative prediction rule for most of the simulation settings, including Scenario 3, where the quasi-honest rule is in fact more accurate. Although the exploitative rule beats the quasi-honest rule in five out of six scenarios, so that a decision rule based on out-of-bag measurements may make sense, care must be taken.

This problem can be seen as a model selection problem between parsimony and flexibility. If the true model is thought to be smooth and simple, the exploitative rule should be employed; if the true model is believed to be complicated, the quasi-honest rule should be used. Unfortunately, the complexity or smoothness of the true model is usually unknown. Just as model selection criteria such as AIC, BIC, and Mallows' Cp have been developed for linear regression, new model selection criteria for interval censored survival models might greatly improve the prediction accuracy.

The signal dilution property of the exploitative prediction rule might be caused by the fact that the marginal survival probability is shared by all censored subjects and the shared information is again carried forward to the next conditional survival probability estimate. This property might be mitigated by using non-marginal survival curves as the initial estimate. For example, the Cox model estimate or the first iteration of the quasi-honest ICRF estimate can be used as the initial estimate.

Acknowledgements

The authors thank Pascal Haegeli for providing the avalanche data from Haegeli et al. (2011). The third author was funded in part by grant P01 CA142538 from the National Cancer Institute.

References

  • C. Anderson-Bergman (2017) icenReg: regression models for interval censored data in R. Journal of Statistical Software 81 (12), pp. 1–23. Cited by: §4.
  • S. Athey and G. Imbens (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), pp. 7353–7360. Cited by: §3.6.
  • A. Banerjee, N. Yoganandan, F. Hsu, S. Gayzik, and F. A. Pintar (2019) Evaluating performance of survival regression models with interval censored data in motor vehicle crash experiments. Draft paper. Cited by: §4.2.
  • I. Bou-Hamad, D. Larocque, H. Ben-Ameur, et al. (2011) A review of survival trees. Statistics Surveys 5, pp. 44–71. Cited by: §1.
  • L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §1.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §1, §3.4.
  • A. Ciampi, R. du Berger, H. G. Taylor, and J. Thiffault (1991) RECPAM: a computer program for recursive partition and amalgamation for survival data and other situations frequently occurring in biostatistics. III. Classification according to a multivariate construct. Application to data on Haemophilus influenzae type b meningitis. Computer Methods and Programs in Biomedicine 36 (1), pp. 51–61. Cited by: §1.
  • R. B. Davis and J. R. Anderson (1989) Exponential survival trees. Statistics in Medicine 8 (8), pp. 947–961. Cited by: §1.
  • B. Efron (1967) The two sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Cited by: §3.5.
  • D. M. Finkelstein (1986) A proportional hazards model for interval-censored failure time data.. Biometrics 42 (4), pp. 845–854. Cited by: §4.
  • W. Fu and J. S. Simonoff (2017) Survival trees for interval-censored survival data. Statistics in medicine 36 (30), pp. 4831–4842. Cited by: §1, §3.1, §4.
  • P. Geurts, D. Ernst, and L. Wehenkel (2006) Extremely randomized trees. Machine learning 63 (1), pp. 3–42. Cited by: §1, §3.4, §3.4.
  • L. Gordon and R. A. Olshen (1985) Tree-structured survival analysis.. Cancer treatment reports 69 (10), pp. 1065–1069. Cited by: §1.
  • E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher (1999) Assessment and comparison of prognostic classification schemes for survival data. Statistics in medicine 18 (17-18), pp. 2529–2545. Cited by: §4.2.
  • P. Groeneboom, G. Jongbloed, B. I. Witte, et al. (2010) Maximum smoothed likelihood estimation and smoothed maximum likelihood estimation in the current status model. The Annals of Statistics 38 (1), pp. 352–387. Cited by: §3.1, §3.7.
  • P. Groeneboom and J. A. Wellner (1992) Information bounds and nonparametric maximum likelihood estimation. Vol. 19, Springer Science & Business Media. Cited by: §1.
  • P. Haegeli, M. Falk, H. Brugger, H. Etter, and J. Boyd (2011) Comparison of avalanche survival patterns in Canada and Switzerland. CMAJ 183 (7), pp. 789–795. Cited by: §1, Acknowledgements.
  • T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. J. Van Der Laan (2005) Survival ensembles. Biostatistics 7 (3), pp. 355–373. Cited by: §1.
  • T. Hothorn, B. Lausen, A. Benner, and M. Radespiel-Tröger (2004) Bagging survival trees. Statistics in medicine 23 (1), pp. 77–91. Cited by: §1.
  • H. Ishwaran, U. B. Kogalur, E. H. Blackstone, M. S. Lauer, et al. (2008) Random survival forests. The Annals of Applied Statistics 2 (3), pp. 841–860. Cited by: §1, §3.1.
  • N. P. Jewell and R. Emerson (2013) Current status data: an illustration with data on avalanche victims. Handbook of Survival Analysis, pp. 391–412. Cited by: §1, §5.1, §5.1.
  • G. Jongbloed (1998) The iterative convex minorant algorithm for nonparametric estimation. Journal of Computational and Graphical Statistics 7 (3), pp. 310–321. Cited by: §1.
  • M. LeBlanc and J. Crowley (1992) Relative risk trees for censored survival data. Biometrics, pp. 411–425. Cited by: §1.
  • M. LeBlanc and J. Crowley (1993) Survival trees by goodness of split. Journal of the American Statistical Association 88 (422), pp. 457–467. Cited by: §1.
  • R. Oller, G. Gómez, and M. L. Calle (2004) Interval censoring: model characterizations for the validity of the simplified likelihood. Canadian Journal of Statistics 32 (3), pp. 315–326. Cited by: §2.
  • R. Peto and J. Peto (1972) Asymptotically efficient rank invariant test procedures. Journal of the Royal Statistical Society: Series A (General) 135 (2), pp. 185–198. Cited by: §3.3, §3.3, §3.3, §3.3.
  • M. R. Segal (1988) Regression trees for censored data. Biometrics, pp. 35–47. Cited by: §1.
  • J. A. Wellner and Y. Zhan (1997) A hybrid algorithm for computation of the nonparametric maximum likelihood estimator from censored data. Journal of the American Statistical Association 92 (439), pp. 945–959. Cited by: §3.5, §3.5.
  • Y. Yin, S. J. Anderson, G. Parran Hall, D. Street, et al. (2002) Nonparametric tree-structured modeling for interval-censored survival data. In Joint Statistical Meeting, Cited by: §1, §3.1.
  • A. Zeileis, T. Hothorn, and K. Hornik (2008) Model-based recursive partitioning. Journal of Computational and Graphical Statistics 17 (2), pp. 492–514. Cited by: §1.
  • R. Zhu and M. R. Kosorok (2012) Recursively imputed survival trees. Journal of the American Statistical Association 107 (497), pp. 331–340. Cited by: §1, §1, §3.1, §4.1.