An Ensemble Method for Interval-Censored Time-to-Event Data

01/14/2019
by Weichi Yao, et al.
New York University

Interval-censored data analysis is important in biomedical statistics for any type of time-to-event response where the time of response is not known exactly, but rather only known to occur between two assessment times. Many clinical trials and longitudinal studies generate interval-censored data; one common example occurs in medical studies that entail periodic follow-up. In this paper we propose a survival forest method for interval-censored data based on the conditional inference framework. We describe how this framework can be adapted to the situation of interval-censored data. We show that the tuning parameters have a non-negligible effect on the survival forest performance and guidance is provided on how to tune the parameters in a data-dependent way to improve the overall performance of the method. Using Monte Carlo simulations we find that the proposed survival forest is at least as effective as a survival tree method when the underlying model has a tree structure, performs similarly to an interval-censored Cox proportional hazards model fit when the true relationship is linear, and outperforms the survival tree method and Cox model when the true relationship is nonlinear. We illustrate the application of the method on a tooth emergence data set.


1 Introduction

Most statistical methods for the analysis of survival time (time-to-event) data have been developed in the situation where the observations could be right-censored. In many situations, however, the survival time cannot be directly observed and it is only known to have occurred in an interval obtained from a sequence of examination times. In this situation, we say that the survival time is interval-censored.

Interval-censored data are encountered in many medical and longitudinal studies, and various methods have been developed for their analysis. Finkelstein (1986) provided the first method for estimation of the Cox proportional hazards model from interval-censored data. Surveys of later approaches to the estimation of the Cox model and other semiparametric or parametric survival models for interval-censored data can be found in Sun (2006) and Bogaerts et al. (2017). However, these methods rely on restrictive assumptions such as proportional hazards and a log-linear relationship between the hazard function and covariates. Furthermore, because these methods are often parametric, nonlinear effects of variables must be modeled by transformations or by expanding the design matrix to include specialized basis functions for more complex data structures in real-world applications.

Recently, Fu and Simonoff (2017) proposed a nonparametric recursive-partitioning (tree) method for interval-censored survival data, as an extension of the conditional inference tree method for right-censored data of Hothorn et al. (2006b). As is well known, tree estimators are nonparametric and as such often exhibit low bias and high variance. Compared to simple models like trees, ensemble methods like bagging and random forests can reduce variance while preserving low bias. These methods average over the predictions of the base learners (the trees) that have been fit to bootstrap samples, remain stable in high-dimensional settings, and therefore can substantially improve prediction performance (Breiman, 2001). Ishwaran et al. (2008) proposed the random survival forest (RSF), which extends random forests (Breiman, 2001) to right-censored survival data. Hothorn et al. (2006a) proposed the conditional inference survival forest (with the conditional inference survival tree as the base learner) by incorporating weights into random forest-like algorithms and extending gradient boosting in order to minimize a weighted form of the empirical risk.

In this paper, we propose a conditional inference survival forest method appropriate for interval-censored data (we will refer to this method as the IC cforest method). The goal of this ensemble tree algorithm is to lower the variance compared to an individual tree and therefore stabilize and improve the prediction performance. The proposed method is an extension of the conditional inference forest method (which is designed to handle right-censored survival data, and will be referred as the cforest method) with the base learner being the conditional inference survival tree proposed by Fu and Simonoff (2017) (we will refer to this as the IC ctree method).

2 An interval-censored survival forest

2.1 Extending the survival forest of Hothorn et al. (2006a)

The recursive partitioning proposed in Hothorn et al. (2006b) for building the ctree is based on a test of the global null hypothesis of independence between the response variable $Y$ and any of the covariates $X_1, \ldots, X_p$. As a decision tree-based ensemble method, cforest induces randomness into each node of each individual tree (which is built from a bootstrap sample) when selecting a variable to split on: only a random subset of covariates is considered for splitting at each node. The recursive partitioning in cforest is therefore based on a test of the global null hypothesis of independence between the response variable $Y$ and any of the elements in a random subset of the covariates (the size of this random subset is prespecified, with further discussion given in Section 2.2). In each node, after such a random subset is selected, permutation-based multiple testing procedures are applied. The recursion stops if the global null hypothesis of independence cannot be rejected at a prespecified level $\alpha$. If it can be rejected, the association between $Y$ and each of the covariates in the subset is measured in order to select the covariate with the strongest association to the response variable (the one with the minimum $p$-value, indicating the largest deviation from the partial null hypotheses). Once a covariate is selected, the permutation test framework is again used to find the optimal binary split.
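As a concrete illustration, here is a minimal sketch of this machinery for right-censored data, using the partykit package on which the proposed method builds; the data frame dat and its time/status columns are placeholders, alpha is the significance level of the permutation-test stopping rule, and mtry is the size of the random covariate subset tried at each node.

library(partykit)
library(survival)

# Right-censored conditional inference forest (the method that IC cforest extends).
cf <- cforest(Surv(time, status) ~ ., data = dat,
              ntree = 500, mtry = 3,
              control = ctree_control(alpha = 0.05, minsplit = 20,
                                      minbucket = 7, minprob = 0.01))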

The $p$-dimensional covariate vector $X = (X_1, \ldots, X_p)$ takes values in a sample space denoted by $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$, and the response $Y$ takes values in $\mathcal{Y}$. The association of the response variable $Y$ and a predictor $X_j$, based on a random sample $(Y_i, X_i)$, $i = 1, \ldots, n$, is measured by linear statistics of the form
$$
T_j(\mathbf{w}) = \operatorname{vec}\Bigl(\sum_{i=1}^{n} w_i\, g_j(X_{ji})\, h\bigl(Y_i, (Y_1, \ldots, Y_n)\bigr)^{\top}\Bigr),
$$
where $\mathbf{w} = (w_1, \ldots, w_n)$ is a vector of non-negative integer-valued case weights having nonzero elements when the corresponding observations are elements of the node and zero otherwise, $g_j$ is a nonrandom transformation of covariate $X_j$, and $h$ is the influence function, which depends on the responses $(Y_1, \ldots, Y_n)$ in a permutation-symmetric way. In their extension of ctree to IC ctree, Fu and Simonoff (2017) specified the influence function to be the log-rank score for interval-censored data proposed by Pan (1998). This score assigns a univariate scalar value to the bivariate response $(L_i, R_i]$, where $L_i$ and $R_i$ are the left and right endpoints of the censoring interval for the $i$-th observation. It is defined as

$$
h(L_i, R_i) = \frac{\hat{S}(L_i)\log \hat{S}(L_i) - \hat{S}(R_i)\log \hat{S}(R_i)}{\hat{S}(L_i) - \hat{S}(R_i)}
$$
for an interval-censored observation with $R_i < \infty$, and
$$
h(L_i, R_i) = \log \hat{S}(L_i)
$$
for a right-censored observation ($R_i = \infty$), where $\hat{S}$ is the nonparametric maximum likelihood estimator (NPMLE) of the survival function. We similarly use this log-rank score in our proposed extension of cforest to IC cforest.

The aggregation scheme of cforest is different from that of the random survival forest. Instead of averaging predictions directly as in the random survival forest, it works by averaging observation weights extracted from each of the individual trees, and it estimates the conditional survival probability function by computing a single Kaplan-Meier curve based on the weighted observations identified by the leaves of the bootstrap survival trees. The idea of averaging weights instead of predictions is advocated in Meinshausen (2006) for quantile regression. Athey et al. (2019) also adopt the same scheme for more general settings in proposing the generalized random forest. These weights can be viewed as “adaptive nearest neighbor weights,” a term borrowed from Lin and Jeon (2006), where such weights were studied theoretically for the estimation of conditional means with regression forests. The core idea is to obtain a “distance” or “similarity” measure based on the number of times a pair of observations is assigned to the same terminal node across the different trees of the forest. For conditional mean estimation, the averaging and weighting views of forests are equivalent; however, in more general settings, such as constructing a nonparametric method for complex data situations, the weighting scheme has been shown to be more efficient (Athey et al., 2019).
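As a small illustration of the weighting view, the sketch below computes such “adaptive nearest neighbor” weights from terminal-node memberships, using the leaf-size-normalized convention of Meinshausen (2006) and Athey et al. (2019); the inputs (a matrix of terminal-node ids for the training observations and a vector of node ids for the new point) are assumed to have been extracted from a fitted forest, and all names are illustrative.

# train_nodes: n x B matrix, terminal-node id of each training observation in each tree
# new_nodes:   length-B vector, terminal-node id of the new point x in each tree
forest_weights <- function(train_nodes, new_nodes) {
  B <- ncol(train_nodes)
  w <- numeric(nrow(train_nodes))
  for (b in seq_len(B)) {
    same_leaf <- train_nodes[, b] == new_nodes[b]  # co-membership with x in tree b
    w <- w + same_leaf / sum(same_leaf)            # per-tree weights sum to one
  }
  w / B                                            # average over trees; still sums to one
}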

Consider a cforest in which a set of $B$ trees is grown, indexed by $b = 1, \ldots, B$. Each leaf of a tree corresponds to a rectangular subspace of $\mathcal{X}$. For any new observation $x$, for each tree there is one and only one leaf such that $x$ falls into it. Denote the corresponding rectangular subspace of this leaf in the $b$-th tree as $R_b(x)$. The weight of each observation in the original sample, $w_{bi}(x)$, measures the “similarity” of the $i$-th observation to the new observed value $x$ according to whether $X_i$ falls into the same leaf as $x$ in the $b$-th tree:
$$
w_{bi}(x) = \frac{\mathbf{1}\{X_i \in R_b(x)\}}{\#\{j : X_j \in R_b(x)\}}.
$$
Averaging over trees, the weights are
$$
w_i(x) = \frac{1}{B}\sum_{b=1}^{B} w_{bi}(x), \qquad i = 1, \ldots, n,
$$
which sum to one. The survival function can then be constructed by using a weighted version of the nonparametric maximum likelihood estimator (NPMLE). Since the weights can be viewed as replications of the corresponding observations, the corresponding log likelihood function to be maximized can be written as
$$
\ell(S) = \sum_{i=1}^{n} w_i(x) \log\bigl(S(L_i) - S(R_i)\bigr).
$$

In practice, such an estimator can be constructed using the algorithm proposed by Turnbull (1976). Denote the Turnbull intervals as $(q_j, p_j]$ and the mass assigned to $(q_j, p_j]$ as $s_j$, for $j = 1, \ldots, m$. Maximization of $\ell(S)$ reduces to maximization of the following log likelihood function:
$$
\ell(s_1, \ldots, s_m) = \sum_{i=1}^{n} w_i(x) \log\Bigl(\sum_{j=1}^{m} \alpha_{ij} s_j\Bigr), \tag{1}
$$
where $\alpha_{ij} = \mathbf{1}\{(q_j, p_j] \subseteq (L_i, R_i]\}$ and the parameters are subject to the constraints $s_j \geq 0$ and $\sum_{j=1}^{m} s_j = 1$. Since the weights $w_i(x)$ define the forest-based adaptive neighborhood of $x$, the resulting estimator from the weighting scheme can be viewed as a locally adaptive maximum likelihood estimator.

The weighted version of Turnbull’s self-consistent estimator of $(s_1, \ldots, s_m)$ can be obtained as the solution of the simultaneous equations
$$
s_j = \frac{1}{\sum_{i=1}^{n} w_i(x)} \sum_{i=1}^{n} w_i(x)\, \frac{\alpha_{ij} s_j}{\sum_{k=1}^{m} \alpha_{ik} s_k}, \qquad j = 1, \ldots, m.
$$
Turnbull’s estimator uses a self-consistency argument to motivate an iterative algorithm for the NPMLE, which turns out to be a special case of the EM algorithm. Anderson-Bergman (2017) recently proposed an efficient implementation of the EMICM algorithm to fit the NPMLE, which greatly improves computational efficiency and therefore enables efficient prediction from the forest for interval-censored data. In the case of weighted observations, the EM step uses the same log likelihood function as in (1), and the ICM step, which reparameterizes the problem in terms of the cumulative probabilities, updates the likelihood function accordingly; this update is then approximated with a second-order Taylor expansion for maximization (Anderson-Bergman, 2017).
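The following is a minimal sketch of the weighted self-consistency (EM) update described above; it takes precomputed Turnbull intervals as input, omits the ICM step of the EMICM algorithm, and uses illustrative names throughout.

# L, R: censoring interval endpoints; w: forest weights for the target point x;
# tb: two-column matrix of Turnbull intervals (q_j, p_j], assumed precomputed.
weighted_turnbull <- function(L, R, w, tb, tol = 1e-8, max_iter = 1000) {
  m <- nrow(tb)
  # alpha[i, j] = 1 if Turnbull interval j is contained in (L_i, R_i]
  alpha <- outer(seq_along(L), seq_len(m),
                 function(i, j) (tb[j, 1] >= L[i]) & (tb[j, 2] <= R[i])) * 1
  s <- rep(1 / m, m)                                # initial mass on each Turnbull interval
  for (iter in seq_len(max_iter)) {
    denom <- as.vector(alpha %*% s)                 # current probability of each observed interval
    mu <- alpha * rep(s, each = length(L)) / denom  # E-step: expected interval membership
    s_new <- colSums(w * mu) / sum(w)               # M-step: weighted update of the masses
    if (max(abs(s_new - s)) < tol) { s <- s_new; break }
    s <- s_new
  }
  data.frame(time = tb[, 2], surv = 1 - cumsum(s))  # survival estimate after each interval
}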

2.2 Regulating the construction of the IC ctrees in the IC cforest

As discussed in Section 2.1, only a random subset of covariates is considered for splitting at each node. The size of this random subset is denoted by mtry; it will be shown later that mtry is a very important tuning parameter. Other parameters, such as minsplit (the minimum sum of weights in a node for it to be considered for splitting), minprob (the minimum proportion of observations needed to establish a terminal node), and minbucket (the minimum sum of weights in a terminal node), control whether or not a split is implemented (and thereby regulate the size of the individual trees); they can potentially be essential in avoiding overfitting, and therefore may improve the overall performance.

The recommended values for these parameters are usually given as defaults in the software. For example, mtry is usually set to be $\sqrt{p}$, where $p$ is the number of covariates (Hothorn et al., 2006a; Ishwaran et al., 2008). However, in practice, we find that the choice of these parameters has a non-negligible effect on the overall performance of the proposed ensemble method. Hastie et al. (2001) suggest that the best values for these parameters depend on the problem and that they should be treated as tuning parameters. How these parameters affect the performance of the proposed IC cforest, and further guidelines on how to set their values, are discussed in Section 3.3.

3 Properties of the conditional inference forest method

In this section, we use computer simulations to investigate the properties of the proposed IC cforest estimation method. The event time is generated from a specified distribution, and the gaps between consecutive examination times are generated from a separate distribution; the examination times are the cumulative sums of these gaps, and each censoring interval spans two consecutive examination times, with width equal to the corresponding gap. The censoring interval assigned to an observation is the one that contains its event time. The event times and the examination times are generated independently, and therefore the survival times and the censoring mechanism are independent. This mechanism also allows some observations to be right-censored, i.e., to have an event time beyond the last examination time.
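As an illustration of this censoring mechanism, the sketch below generates interval-censored observations; the event-time and gap distributions and the number of examinations are placeholders, not the actual choices used in the simulations.

simulate_ic <- function(n, n_exams = 6) {
  t_event <- rexp(n, rate = 0.3)                    # placeholder event-time distribution
  L <- R <- numeric(n)
  for (i in seq_len(n)) {
    exams <- cumsum(runif(n_exams, 0.5, 1.5))       # examination times from placeholder gaps
    k <- findInterval(t_event[i], exams)            # number of examinations before the event
    L[i] <- if (k == 0) 0 else exams[k]
    R[i] <- if (k == n_exams) Inf else exams[k + 1] # beyond the last examination: right-censored
  }
  data.frame(left = L, right = R, true_time = t_event)
}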

We will study the properties of the proposed cforest method in terms of its estimation performance. The simulation setups are similar to those in Fu and Simonoff (2017).

3.1 Model setup

We use three simulation setups, each with five distributions of the survival (event) time, to test the prediction performance of the proposed IC cforest. The three setups are as follows:

  1. Tree structured data:
    There are ten covariates , where , , and randomly take values from the set , , , and are binary and , , , are .

  2. .

  3. .

Figure 1: Tree structure used in simulations.

In the first setup, only the first three covariates determine the distribution of the survival (event) time. The survival time has one of the distributions according to the values of these three covariates, following the tree structure given in Figure 1.

The survival time is generated from one of five different possible distributions:

  1. Exponential with four different values of from {0.1, 0.23, 0.4, 0.9}.

  2. Weibull distribution with shape parameter , which corresponds to decreasing hazard with time. The scale parameter takes the values {7.0, 3.0, 2.5, 1.0}.

  3. Weibull distribution with shape parameter , which corresponds to increasing hazard with time. The scale parameter takes the values {2.0, 4.3, 6.2, 10.0}.

  4. Log-normal distribution with location parameter and scale parameter with 4 different pairs .

  5. Bathtub-shaped hazard model (Hjorth, 1980). The survival function is given by

    with , and set to take values {0.01, 0.15, 0.20, 0.90}.

The second and third setups are similar to those in Hothorn et al. (2004). Here is a location parameter whose value is determined by covariates and . In these settings six independent covariates

serve as predictor variables, with

binary {0, 1} and uniform . The survival time again depends on with five different possible distributions:

  1. Exponential with parameter ;

  2. Weibull with increasing hazard, scale parameter and shape parameter ;

  3. Weibull with decreasing hazard, scale parameter and shape parameter ;

  4. Log-normal distribution with location parameter and scale parameter ;

  5. Bathtub-shaped hazard model (Hjorth, 1980). The survival function is given by

    with , and .

To see how the IC cforest compares with a (semi-)parametric model and the corresponding tree model, we also include in the simulations, for comparison, the Cox proportional hazards model implemented in the R package icenReg (Anderson-Bergman, 2016) (we will refer to this as IC Cox) and the IC ctree model implemented in the R package LTRCtrees (Fu and Simonoff, 2018). To see the amount of information loss due to interval censoring, the oracle versions of all three models (Cox, ctree and cforest), which are fitted using the actual event times, are also included, as in Hothorn et al. (2006b).
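For reference, the two comparison methods can be fit along the following lines; the formula and the column names left and right are placeholders, and the exact calls may differ from the settings used in the simulations.

library(icenReg)
library(LTRCtrees)
library(survival)

# Semiparametric Cox PH model for interval-censored data (IC Cox)
fit_cox  <- ic_sp(cbind(left, right) ~ x1 + x2, data = dat, model = "ph")
# Conditional inference survival tree for interval-censored data (IC ctree)
fit_tree <- ICtree(Surv(left, right, type = "interval2") ~ x1 + x2, data = dat)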

In the second setup, the linear proportional hazards assumption is satisfied, so the Cox PH model should perform best. The third setup is similar to the second, except that the location parameter has a more complex nonlinear structure in terms of the covariates, which is potentially more like a real-world application. This complex structure can make the distributions of the survival time satisfy neither the Cox PH model nor the tree structure.

In all three simulation setups with five distributions of the survival time, we consider three different distributions of the censoring interval width, each a Uniform distribution with its own parameter setting.

Notice that censoring interval widths generated by should be around three times wider than those generated by , and censoring interval widths generated by should be around seven times wider than those generated by . Intuitively, as the width of the censoring interval gets wider, less information about the actual survival time is available.

We also consider three possible right-censoring rates: no right-censoring, light censoring with a smaller proportion of observations being right-censored, and heavy censoring with a larger proportion of observations being right-censored.

The simulation setup is designed to investigate the extent to which estimation performance of the proposed IC cforest deteriorates with the loss of information due to widening of censoring intervals, and also due to the increasing rate of right censoring.

3.2 Evaluation methods

To evaluate estimation performance, the average integrated distance between the true and the estimated survival curves

(2)

is used, where $T_i$ is the (actual) event time of the $i$-th observation and $\hat{S}_i$ ($S_i$) is the estimated (true) survival function for the $i$-th observation from a particular estimator.
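Since the exact form of the distance in (2) is not reproduced above, the sketch below only illustrates one plausible version for a single observation, an integrated absolute difference computed by numerical integration up to the event time; in the simulations this quantity is averaged over observations.

integrated_distance <- function(S_hat, S_true, t_event, n_grid = 1000) {
  tt <- seq(0, t_event, length.out = n_grid)  # time grid from 0 to the event time
  dt <- tt[2] - tt[1]
  sum(abs(S_hat(tt) - S_true(tt))) * dt       # Riemann approximation of the integral
}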

3.3 Evaluation of tuning parameters

3.3.1 mtry as a tuning parameter

In the cforest algorithm, a random selection of mtry input variables is used in each node of each tree. A split is established when all of the following criteria are met: 1) the sum of the weights in the current node is larger than minsplit, 2) each daughter node contains more than a fraction minprob of the sum of the weights, 3) the sum of the weights in all daughter nodes exceeds minbucket, and 4) the depth of the tree is smaller than maxdepth. Default values of mtry, minsplit, minprob, minbucket and maxdepth are given in the R package partykit (Hothorn et al., 2018), where mtry is set to be $\lceil\sqrt{p}\,\rceil$ (with $p$ the number of covariates), and the other four parameters are set to 20, 0.01, 7, and $\infty$ (no restriction on tree depth), respectively. Since typically unstopped and unpruned trees are used in random forests, we do not treat maxdepth as a tuning parameter in the proposed IC cforest method.

The value of mtry can be fine-tuned using the “out-of-bag” observations. The “out-of-bag” observations for the $b$-th tree are those observations that are left out of the $b$-th bootstrap sample and not used in the construction of the $b$-th tree (about one-third of the observations in the original sample are “out-of-bag” for each bootstrap sample). The response for the $i$-th observation can then be predicted using each of the trees for which that observation was “out-of-bag” (this yields around $B/3$ predictions for the $i$-th observation, where $B$ is the number of trees). The resulting prediction error is a valid estimate of the test error for the ensemble method. The idea of tuning mtry on the out-of-bag observations is borrowed from the function tuneRF() in the R package randomForest (Breiman et al., 2018). A version of tuneRF() for interval-censored data starts with the default value of mtry and then searches for the value of mtry for IC cforest that optimizes the out-of-bag error estimate, moving by a prespecified step factor. The integrated Brier score (Graf et al., 1999), the most popular measure of prediction error in survival analysis, serves as the error measure for right-censored data; Tsouprou (2015) adapted the integrated Brier score to interval-censored data,

(3)

with and estimated by

where $\hat{S}_i$ is the estimated survival function for the $i$-th observation. Using this evaluation measure, we can tune mtry by the “out-of-bag” tuning procedure given in Appendix A.

Figure 2: Integrated difference of IC cforest with different mtry values, with , no right censoring and the interval censoring width generated by . The default value in cforest function is . The value of mtry tuned by the “out-of-bag” tuning procedure is given in the last column in each boxplot. Top row gives results for the first setup (tree structure), middle row gives results for the second setup (linear model), and bottom row gives results for the third setup (nonlinear model).

Figure 2 gives an example of how IC cforest performs with different values of mtry. The mtry values are chosen using stepFactor in the algorithm given in Appendix A. In this example, the default value of mtry in the cforest function is not always optimal and sometimes the performance can be significantly improved by setting a larger value (values smaller than the default value never had better performance, so they are not given). In fact, different distributions with different underlying models favor different values of mtry. The “out-of-bag” tuning procedure provides a relatively reliable choice of mtry that gives relatively good performance overall.

The sample size with no right censoring and the censoring interval width generated by is used in the simulations presented in Figure 2; results with and were similar and are given in Appendix B.1 and Appendix C.1.

3.3.2 minsplit, minprob and minbucket as tuning parameters

The optimal values that determine the split vary from case to case. As fixed numbers, the default values may not affect the splitting at all when the sample size is large, while having a noticeable effect in smaller data sets. This inconsistency can potentially result in good performance on some data sets and poor performance on others. Here we wish to determine a rule that automatically adjusts those values to the size of the data set, with performance that is relatively stable and better than that of the default values.

The values of minsplit, minprob and minbucket determine whether a split in a node will be implemented. We design our experiments to explore the individual effect of each parameter. Based on the results, we propose the “15%-Default-6% Rule”: set minsplit to 15% of the sample size $n$, minprob to its default value, and minbucket to 6% of the sample size $n$.
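As a small helper, the rule can be written as a function of the sample size n (minprob stays at its default value of 0.01); for n = 200 this gives minsplit = 30 and minbucket = 12, the values used in Figure 3.

node_params <- function(n) {
  list(minsplit  = ceiling(0.15 * n),  # 15% of the sample size
       minprob   = 0.01,               # default value
       minbucket = ceiling(0.06 * n))  # 6% of the sample size
}
node_params(200)  # minsplit = 30, minprob = 0.01, minbucket = 12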

Figure 3: Example: integrated difference of IC cforest with different minsplit, minprob and minbucket values, with , no right-censoring and the interval-censoring width generated by . 1-(minsplit, minprob, minbucket) . 2-(minsplit, minprob, minbucket) , 3-(minsplit, minprob, minbucket) , 4-(minsplit, minprob, minbucket) , 5-(minsplit, minprob, minbucket) , 6-(minsplit, minprob, minbucket) , 7-(minsplit, minprob, minbucket) . 8-The “15%-Default-6% Rule”: (minsplit, minprob, minbucket) . Top row gives results for the first setup (tree structure), middle row gives results for the second setup (linear model), and bottom row gives results for the third setup (nonlinear model).

Figure 3 gives an example of the sensitivity of IC cforest to different values of minsplit, minprob, and minbucket. The choices of minsplit are 20 (default value), 30 (15% of the sample size), and 40 (20% of the sample size). The choices of minprob are 0.01 (default value), 0.05, and 0.10. The choices of minbucket are 7 (default value), 12 (6% of the sample size), and 16 (8% of the sample size). In each plot of Figure 3, column 1 shows the integrated difference under the default setting, columns 2-7 show the integrated differences when changing the value of one parameter at a time while holding the others the same, and column 8 shows the results of the proposed “15%-Default-6% Rule.” Here the performance of IC cforest is shown for a limited number of values, selected to give as much insight as possible into how performance changes with the tuning parameters. We can see that overall the value of minprob does not change the performance much (as expected, since we set the equivalent parameter, minbucket, to be a much larger proportion of the size of the data set), while changing minsplit and minbucket can improve the overall performance. Empirically, the “15%-Default-6% Rule” has been shown to improve the overall performance over the default setting under different models with different distributions. The simulation results show that a slightly larger leaf size is favored, since the smaller default size makes the forest more prone to capturing noise and overfitting, and therefore leads to worse performance.

The sample size with no right censoring and the censoring interval width generated by is used in the simulations presented here; results with and were similar and are given in Appendix B.2 and Appendix C.2.

3.4 Estimation performance

We run 500 simulation trials for each setting to see how well the proposed IC cforest performs compared to the IC Cox model and the corresponding IC ctree model. The parameter mtry in IC cforest is tuned following the “out-of-bag” tuning procedure, and the values for minsplit, minprob and minbucket are chosen using the “15%-Default-6% Rule” described in Section 3.3. The sample size with censoring interval width generated by is used in the simulations presented here; results with and were similar and are given in Appendix D and Appendix E, respectively.

Figure 4: True tree model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.
Figure 5: True linear model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.
Figure 6: True nonlinear model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.

Figures 4 to 6 give side-by-side integrated difference boxplots for all three setups, for the sample size and censoring interval width generating distribution given above. We can see that the “out-of-bag” tuning procedure and the “15%-Default-6% Rule” improve the IC cforest performance over the parameters set by default. Figure 4 shows that in the presence of right-censoring, the proposed IC cforest performs at least as well as the IC ctree method in the first setup, where the true model is a tree. In addition, for all five distributions, the IC cforest outperforms the IC Cox model.

As expected, the IC Cox model can outperform the IC cforest method in the second setup (where the true model is a linear model). This occurs when the underlying distribution is the Weibull-Increasing distribution, but for the other distributions, and up to a certain right-censoring rate, the proposed IC cforest can represent a linear model as well as, or even better than, the IC Cox model.

IC ctree outperforms the IC Cox model in the third setup due to its flexible structure (Fu and Simonoff, 2017), and we can see in Figure 6 that the proposed IC cforest further improves the performance, showing its advantage for a relatively complex survival relationship.

The censoring interval width generating distribution is used in the simulations presented here. Intuitively, a wider censoring interval, meaning less information and more uncertainty, will result in poorer performance in the forest.

Figure 7: Integrated difference boxplots with , no right-censoring. 1-Oracle, 2-censoring interval width generated from , 3-Censoring interval width generated from , 4-Censoring interval width generated from . Methods that give results in columns 2-4 are IC cforest with mtry chosen through “out-of-bag” tuning procedure and minsplit, minprob, minbucket chosen following “15%-def-6% Rule.” Top row gives results for tree model, middle row gives results for linear model, and bottom row gives results for nonlinear model.
Figure 8: Integrated difference boxplots with , no right-censoring. In each boxplot, 1-3 give results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively, 4-6 gives results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively, 7-9 give results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively. Top row gives results for tree model, middle row gives results for linear model, and bottom row gives results for nonlinear model.

Figure 7 shows how the censoring interval width affects the performance of IC cforest. When the censoring interval width is small, IC cforest can perform as well as the “Oracle,” for which the true survival times are known and there is no right-censoring. When the censoring interval width is roughly three times wider, the loss of information starts to affect the IC cforest performance, but not greatly. When the censoring interval width is roughly seven times wider, the IC cforest performance deteriorates considerably more.

In fact, this loss of information due to the increased censoring interval widths affects all three methods, and the patterns across methods seen in Figures 4 to 6 for one censoring interval width generating distribution are similar to those for the other two. That is, the proposed IC cforest can still outperform the IC ctree method even under the tree model and outperform the IC Cox model under a linear model. Figure 8, for example, demonstrates that the patterns across the three methods for each model are well preserved under changes of the censoring interval widths in the situation with no right-censoring.

4 Real data set

The Signal Tandmobiel® study is a longitudinal prospective oral health study that was conducted in the Flanders region of Belgium from 1996 to 2001. In this study, 4430 first-year primary school children were randomly sampled at the beginning of the study and were examined annually by trained dentists. The data consist of at most six dental observations for each child, including time of tooth emergence, caries experience, and data on dietary and oral hygiene habits. The details of the study design and research methodology can be found in Vanobbergen et al. (2000). The data are provided as the tandmob2 data set in the R package bayesSurv (Komárek, 2015). The tandmob2 data set provides the time to emergence of 28 teeth in total. Each of the tooth emergence times can be taken as a response variable, allowing us to test the prediction performance of the proposed IC cforest method against the corresponding IC ctree and IC Cox methods. Potential predictors of a tooth's emergence time include gender, province, evidence of fluoride intake, type of educational system, starting age of toothbrushing, whether each of the twelve deciduous teeth was decayed, missing due to caries, or filled, whether each of the twelve deciduous teeth was removed for orthodontic reasons, and whether each of the twelve deciduous teeth was removed for orthodontic reasons or decayed at or before the last examination preceding the one at which the emergence of the permanent successor was first recorded. These potential predictors cover all of the variables in the data set.

To compare the different methods, we conducted leave-one-out cross-validation on the entire data set, and then computed the average absolute prediction distance below or above the observed interval when the predicted median emergence time falls outside of it; this measures how far such predictions are from the interval (if a predicted emergence time falls within the observed emergence interval, it is impossible to say what the prediction error is, so such observations are not considered).
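A sketch of this evaluation measure: given predicted median emergence times and the observed censoring intervals, it reports the proportion of predictions falling outside their interval and, for those predictions, the average absolute distance to the nearest interval endpoint (all names are illustrative).

outside_interval_summary <- function(pred_median, left, right) {
  below   <- pred_median < left
  above   <- pred_median > right
  outside <- below | above
  dist <- ifelse(below, left - pred_median,
          ifelse(above, pred_median - right, 0))
  list(prop_outside = mean(outside),        # share of predictions outside their interval
       avg_distance = mean(dist[outside]))  # average distance for those predictions
}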

Tooth      IC Cox (% / dist)      IC ctree (% / dist)      IC cforest (% / dist)
11 33.7 0.3558 33.0 0.3489 32.1 0.3732
21 34.2 0.3428 33.2 0.3439 33.7 0.3639
31 23.6 84.1325 21.5 0.3195 20.9 0.3312
41 21.4 71.1985 17.4 0.6236 18.0 0.6019
12 54.0 0.5259 52.6 0.5369 54.3 0.5187
22 51.0 0.5215 50.3 0.5232 52.1 0.5026
32 38.1 0.4036 37.4 0.4050 37.7 0.4010
42 39.4 0.4004 38.1 0.4110 39.5 0.3969
13 57.8 0.6894 57.6 0.6236 56.7 0.6564
23 59.1 1.3304 60.6 0.5863 60.1 0.5822
33 64.4 0.6454 71.3 0.6279 65.6 0.6926
43 63.6 0.6386 63.6 0.6434 64.6 0.6304
14 66.8 0.7321 65.6 0.7479 67.0 0.7311
24 67.0 0.7082 68.0 0.6934 66.8 0.7176
34 66.1 0.6976 66.4 0.7012 66.3 0.7109
44 65.0 0.7108 65.8 0.7022 66.6 0.7221
15 55.6 0.7141 58.7 0.6602 56.4 0.6382
25 55.9 2.0519 60.1 0.6635 58.5 0.6629
35 52.6 0.7245 56.6 0.6670 55.9 0.6401
45 51.5 0.7221 52.4 0.6866 54.7 0.6374
16 25.5 0.3138 22.0 0.3765 23.3 0.3470
26 26.4 0.3250 22.8 0.3300 22.8 0.3237
36 27.5 0.4036 28.0 0.3274 27.0 0.3304
46 26.6 0.3125 24.1 0.3277 24.3 0.3234
17 28.8 55.2018 28.5 28.0678 28.0 11.4780
27 30.6 96.5333 31.3 43.3953 30.9 30.2143
37 46.3 0.5876 48.2 0.5157 47.2 0.5436
47 43.1 6.1757 46.3 0.5615 43.7 0.5935
%: proportion of the predicted median emergence times lying outside the censoring intervals.
dist: average absolute prediction distance below or above the interval, for those predictions.
In each row, the smallest of the three average distances indicates the best-performing method for that tooth.
Table 1: Evaluation on 28 tooth data sets in Signal Tandmobiel® Study.

The IC cforest method (with mtry chosen through the “out-of-bag” tuning procedure and minsplit, minprob, minbucket chosen by the “15%-Default-6% Rule”), the IC ctree method, and the IC Cox model are applied to each of the tooth data sets. Table 1 shows that the proportion of the time the predicted median emergence falls outside the observed intervals is roughly the same for the three methods, although it varies greatly from tooth to tooth. Among these 28 tooth data sets, IC cforest gives the smallest average absolute prediction distance away from the observed intervals (for those observations that fall outside of them) for 54% of the teeth; IC ctree follows (32%) and the IC Cox model trails both (14%). Thus, the IC cforest method does a good job of predicting the actual emergence times.

5 Conclusion

In this paper, we have proposed a new ensemble algorithm based on the conditional inference survival forest and designed to handle interval-censored data. Through a simulation study, we see that, in terms of prediction performance, the proposed IC cforest method can outperform the IC ctree method and the IC Cox proportional hazards model even when the underlying true model has the tree structure or the linear relationship for which those methods are respectively designed, and it clearly outperforms both in the nonlinear situation that neither is designed for.

The tuning parameters in the proposed IC cforest affect the overall performance of the method. In this paper, we have provided guidance on how to choose those parameters to improve on the potentially poor performance of the default settings. Further investigation of the best way to choose these parameters in a data-dependent way would be useful. It would also be interesting to extend these results to competing risks data.

An R package, ICcforest, that implements the IC cforest method is available at CRAN.

Acknowledgements

Data collection of the Signal Tandmobiel data was supported by Unilever, Belgium. The Signal-Tandmobiel project comprises the following partners: Dominique Declerck (Department of Oral Health Sciences, KU Leuven), Luc Martens (Dental School, Gent Universiteit), Jackie Vanobbergen (Oral Health Promotion and Prevention, Flemish Dental Association and Dental School, Gent Universiteit), Peter Bottenberg (Dental School, Vrije Universiteit Brussel), Emmanuel Lesaffre (L-Biostat, KU Leuven), and Karel Hoppenbrouwers (Youth Health Department, KU Leuven; Flemish Association for Youth Health Care).

References

  • Anderson-Bergman (2016) C. Anderson-Bergman. icenReg: Regression models for interval censored data. Version 2.0.8. 2016.
  • Anderson-Bergman (2017) C. Anderson-Bergman. An efficient implementation of the EMICM algorithm for the interval censored NPMLE. Journal of Computational and Graphical Statistics, 26(2):463–467, 2017.
  • Athey et al. (2019) S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.
  • Bogaerts et al. (2017) K. Bogaerts, A. Komárek, and E. Lesaffre. Survival Analysis with Interval-Censored Data: A Practical Approach with examples in R, SAS and BUGS. Chapman and Hall/CRC, Boca Raton, FL, 2017.
  • Breiman (2001) L. Breiman. Random forests. Machine Learning, 45(1):5–22, 2001.
  • Breiman et al. (2018) L. Breiman, A. Cutler, A. Liaw, and M. Wiener. randomForest: Breiman and Cutler’s random forests for classification and regression. Version 4.6-14. 2018.
  • Finkelstein (1986) D. M. Finkelstein. A proportional hazards model for interval-censored failure time data. Biometrics, 42(4):845–854, 1986.
  • Fu and Simonoff (2017) W. Fu and J. S. Simonoff. Survival trees for interval-censored survival data. Statistics in Medicine, 36(30):4831–4842, 2017.
  • Fu and Simonoff (2018) W. Fu and J. S. Simonoff. LTRCtrees: Survival trees to fit left-truncated and right-censored and interval-censored survival data. Version 1.1.0. 2018.
  • Graf et al. (1999) E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18):2529–2545, 1999.
  • Hastie et al. (2001) T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
  • Hothorn et al. (2004) T. Hothorn, B. Lausen, A. Benner, and M. Radespiel-Tröger. Bagging survival trees. Statistics in Medicine, 23(1):77–91, 2004.
  • Hothorn et al. (2006a) T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. J. Van Der Laan. Survival ensembles. Biostatistics, 7(3):355–373, 2006a.
  • Hothorn et al. (2006b) T. Hothorn, K. Hornik, and A. Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674, 2006b.
  • Hothorn et al. (2018) T. Hothorn, H. Seibold, and A. Zeileis. partykit: A toolkit with infrastructure for representing, summarizing, and visualizing tree-structured regression and classification models. Version 1.2-2. 2018.
  • Ishwaran et al. (2008) H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer. Random survival forests. The Annals of Applied Statistics, 2(3):841–860, 2008.
  • Komárek (2015) A. Komárek. bayesSurv: Bayesian survival regression with flexible error and random effects distributions. Version 2.6. 2015.
  • Lin and Jeon (2006) Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590, 2006.
  • Meinshausen (2006) N. Meinshausen. Quantile regression forests. The Journal of Machine Learning Research, 7:983–999, 2006.
  • Pan (1998) W. Pan. Rank invariant tests with left truncated and interval censored data. Journal of Statistical Computation and Simulation, 61(1-2):163–174, 1998.
  • Sun (2006) J. Sun. The Statistical Analysis of Interval-Censored Failure Time Data. Statistics for Biology and Health. Springer-Verlag New York Inc., New York, NY, 2006.
  • Tsouprou (2015) S. Tsouprou. Measures of discrimination and predictive accuracy for interval censored survival data. Master’s thesis, Leiden University, 2015.
  • Turnbull (1976) B. W. Turnbull. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society. Series B (Methodological), 38(3):290–295, 1976.
  • Vanobbergen et al. (2000) J. Vanobbergen, L. Martens, E. Lesaffre, and D. Declerck. The Signal-Tandmobiel® project – a longitudinal intervention health promotion study in Flanders (Belgium): Baseline and first year results. European Journal of Paediatric Dentistry, 2:87–96, 2000.

Appendix A Algorithm of “out-of-bag” tuning procedure

1: procedure tuneICCF(data, stepFactor)
2:     construct a set of candidate mtry values, starting from the default value and increasing by the factor stepFactor
3:     for each candidate value mtry do
4:         iccf.obj <- ICcforest(data, mtryTest = mtry)
5:         pred.oob <- predict(iccf.obj, OOB = TRUE)
6:         err.oob[mtry] <- sbrier_IC(data, pred.oob)    # the integrated Brier score defined in (3)
7:     end for
8:     return the candidate mtry value with the smallest err.oob
9: end procedure
Algorithm 1 “Out-of-bag” tuning procedure for mtry
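The following is a hedged R rendering of Algorithm 1. The function and argument names (ICcforest, mtryTest, OOB = TRUE, sbrier_IC) follow the pseudocode above; the exact signatures in the released ICcforest package may differ, and obj (the observed interval-censored response) and the candidate grid are illustrative.

library(ICcforest)

tune_mtry_oob <- function(formula, data, obj, mtry_default, step_factor = 2, n_steps = 3) {
  mtry_grid <- unique(round(mtry_default * step_factor^(0:n_steps)))
  err_oob <- sapply(mtry_grid, function(m) {
    fit  <- ICcforest(formula, data = data, mtryTest = m)  # grow an IC cforest with this mtry
    pred <- predict(fit, OOB = TRUE)                       # out-of-bag survival estimates
    sbrier_IC(obj, pred)                                   # integrated Brier score, as in (3)
  })
  mtry_grid[which.min(err_oob)]                            # candidate with the smallest OOB error
}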

Appendix B Evaluation of tuning parameters for

B.1 mtry as a tuning parameter

Figure 9: Integrated difference of IC cforest with different mtry values, with , no right censoring and the interval censoring width generated by . The default value in cforest function is . The value of mtry tuned by the “out-of-bag” tuning procedure is given in the last column in each boxplot. Top row gives results for the first setup (tree structure), middle row gives results for the second setup (linear model), and bottom row gives results for the third setup (nonlinear model).

B.2 minsplit, minprob and minbucket as tuning parameters

Figure 10: Example: integrated difference of IC cforest with different minsplit, minprob and minbucket values, with , no right-censoring and the interval-censoring width generated by . 1-(minsplit, minprob, minbucket) . 2-(minsplit, minprob, minbucket) , 3-(minsplit, minprob, minbucket) , 4-(minsplit, minprob, minbucket) , 5-(minsplit, minprob, minbucket) , 6-(minsplit, minprob, minbucket) , 7-(minsplit, minprob, minbucket) . 8-The “15%-Default-6% Rule”: (minsplit, minprob, minbucket) . Top row gives results for the first setup (tree structure), middle row gives results for the second setup (linear model), and bottom row gives results for the third setup (nonlinear model).

Appendix C Evaluation of tuning parameters for

C.1 mtry as a tuning parameter

Figure 11: Integrated difference of IC cforest with different mtry values, with , no right censoring and the interval censoring width generated by . The default value in cforest function is . The value of mtry tuned by the “out-of-bag” tuning procedure is given in the last column in each boxplot. Top row gives results for the first setup (tree structure), middle row gives results for the second setup (linear model), and bottom row gives results for the third setup (nonlinear model).

C.2 minsplit, minprob and minbucket as tuning parameters

Figure 12: Example: integrated difference of IC cforest with different minsplit, minprob and minbucket values, with , no right-censoring and the interval-censoring width generated by . 1-(minsplit, minprob, minbucket) . 2-(minsplit, minprob, minbucket) , 3-(minsplit, minprob, minbucket) , 4-(minsplit, minprob, minbucket) , 5-(minsplit, minprob, minbucket) , 6-(minsplit, minprob, minbucket) , 7-(minsplit, minprob, minbucket) . 8-The “15%-Default-6% Rule”: (minsplit, minprob, minbucket) . Top row gives results for the first setup (tree structure), middle row gives results for the second setup (linear model), and bottom row gives results for the third setup (nonlinear model).

Appendix D Estimation performance for

D.1 Method performance under three different underlying true models

Figure 13: True tree model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.
Figure 14: True linear model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.
Figure 15: True nonlinear model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.

D.2 Method performance under different censoring interval widths

Figure 16: Integrated difference boxplots with , no right-censoring. 1-Oracle, 2-censoring interval width generated from , 3-Censoring interval width generated from , 4-Censoring interval width generated from . Methods that give results in columns 2-4 are IC cforest with mtry chosen through “out-of-bag” tuning procedure and minsplit, minprob, minbucket chosen following “15%-def-6% Rule.” Top row gives results for tree model, middle row gives results for linear model, and bottom row gives results for nonlinear model.
Figure 17: Integrated difference boxplots with , no right-censoring. In each boxplot, 1-3 give results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively, 4-6 gives results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively, 7-9 give results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively. Top row gives results for tree model, middle row gives results for linear model, and bottom row gives results for nonlinear model.

Appendix E Estimation performance for

E.1 Method performance under three different underlying true models

Figure 18: True tree model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.
Figure 19: True linear model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.
Figure 20: True nonlinear model with censoring interval width generated from : integrated difference boxplots with . Methods are numbered as 1-IC Cox model, 2-IC ctree, 3-IC cforest with parameters set by default, 4-IC cforest with parameters set through “out-of-bag” tuning procedure and the “15%-Default-6% Rule.” Top row gives results without right-censoring, middle row gives results for light (right-)censoring, and bottom row gives results for heavy (right-)censoring.

E.2 Method performance under different censoring interval widths

Figure 21: Integrated difference boxplots with , no right-censoring. 1-Oracle, 2-censoring interval width generated from , 3-Censoring interval width generated from , 4-Censoring interval width generated from . Methods that give results in columns 2-4 are IC cforest with mtry chosen through “out-of-bag” tuning procedure and minsplit, minprob, minbucket chosen following “15%-def-6% Rule.” Top row gives results for tree model, middle row gives results for linear model, and bottom row gives results for nonlinear model.
Figure 22: Integrated difference boxplots with , no right-censoring. In each boxplot, 1-3 give results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively, 4-6 gives results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively, 7-9 give results of IC Cox, IC ctree and IC cforest for censoring interval width generated from respectively. Top row gives results for tree model, middle row gives results for linear model, and bottom row gives results for nonlinear model.