Optimal survival trees ensemble

05/18/2020 ∙ by Naz Gul, et al. ∙ University of Essex 0

Recent studies have adopted an approach of selecting accurate and diverse trees based on individual or collective performance within an ensemble for classification and regression problems. This work follows in the wake of these investigations and considers the possibility of growing a forest of optimal survival trees. Initially, a large set of survival trees are grown using the method of random survival forest. The grown trees are then ranked from smallest to highest value of their prediction error using out-of-bag observations for each respective survival tree. The top ranked survival trees are then assessed for their collective performance as an ensemble. This ensemble is initiated with the survival tree which stands first in rank, then further trees are tested one by one by adding them to the ensemble in order of rank. A survival tree is selected for the resultant ensemble if the performance improves after an assessment using independent training data. This ensemble is called an optimal trees ensemble (OSTE). The proposed method is assessed using 17 benchmark datasets and the results are compared with those of random survival forest, conditional inference forest, bagging and a non tree based method, the Cox proportional hazard model. In addition to improve predictive performance, the proposed method reduces the number of survival trees in the ensemble as compared to the other tree based methods. The method is implemented in an R package called "OSTE".



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Survival Analysis concerns the estimation of time until the occurrence of an event of interest. For example, an event of interest could be the follow up time of an individual from entry into a study until recovery from a certain type of disease. Survival analysis not only estimates and compares survival probabilities of individuals from a specific time point to the endpoint of interest but also finds the ratio of the probability density function to the survival function and assesses the relationship between explanatory variables and survival time.

The presence of unobserved events i.e. censoring reduces the predictive performance of survival modelling. To address this issue, many techniques are developed in the literature. Kaplan and Meier Meier and Kaplan (1958) computed successive probabilities of non-occurrence of an event at certain points in time, then obtained the product of these probabilities and earlier similarly calculated probabilities for a resultant estimate. Nelson-Aalen Borgan (1997) used the estimation of the cumulative hazard function for censored data. Cox David (1972)

introduced a large family of semi-parametric models by focusing on hazard functions via the Cox proportional hazard models. In the search for increased predictive performance in survival models, tree based approaches offer one of many attempts. A stand-alone survival tree provides some advantages in respect to interpretation and requires fewer assumptions than simple regression techniques.

The main aim in the process of risk prediction model building, is to build a model that predicts future risk accurately. Usually, a model is fitted using given data and then the performance of the model is checked using same data. However, using the same dataset for constructing a model and for assessing its performance creates over-optimization problems with low generalizability Zhou (2012)

. In any model based upon decision trees this level of over-optimisation becomes high due to large-scale searching in each and every node of a tree

Breiman and others (1996). To alleviate this problem, the given data may be partitioned into training and testing data sub-sets. In this approach, a training part is used for building a model and a testing part is then used for assessing the performance of the derived model Zhou (2012). This technique decreases sample size and hence the model power is also decreased.

Alternatively, a model is built on a number of random bootstrap samples drawn from the given data. The model performance is then tested using the same data by reporting means and standard deviations.

Random survival forest is one of these aggregation methods which draws multiple bootstrap samples from the given data. In addition to that, to construct each tree node, a random sample of the independent variables is selected and only these are used in the construction of the tree. Combining the results from these trees i.e. forming an ensemble of survival trees, increases the predictive power of the decision model. Similarly, a survival forest consisting of accurate and diverse survival trees may perform better than a forest grown by combining simple survival trees. Breiman’s Breiman et al. (1984) forest of regression and classification trees also holds to this intuition, which further led to the method of optimal trees ensemble (OTE) for regression and classification Khan et al. (2016b, a).

OTE combines the best trees i.e. accurate and diverse, from a large number of trees grown initially by Breiman’s Breiman et al. (1984)random forest and thus refines bagging and random forest approaches. On the bases of individual performance with out-of-bag observations, OTE selects a proportion of top ranked trees in the first phase and uses Brier scores to assess these trees on independent training data for their collective performance. For the resultant ensemble, this method selects trees one by one from the trees ordered with respect to the highest prediction accuracy to the lowest prediction accuracy. A tree is discarded, if the predictive performance decreases or does not improve Khan et al. (2016b, a). OTE uses fewer trees and gives comparable results to some other state-of-the-art methods.

The extension of OTE to survival data is the main aim for this paper. The objective is to select the best survival trees, in terms of their individual and collective predictive accuracy and integrate them together to develop a new ensemble. This ensemble will be called an optimal survival trees ensemble (OSTE). The results from OSTE are compared with those of Cox proportional hazard model, bagging survival trees, random survival forest and conditional inference forest, using 17 benchmark survival datasets.

2 Optimal ensemble of survival tree (OSTE)

In survival analysis an improvement to a single base model , namely a survival tree, is bagging survival trees Hothorn et al. (2004), which combines a number of single survival trees by using the aggregated Kaplan–Meier curve for each new observation.

Random survival forest Ishwaran et al. (2008) broadens this idea by selecting a subset of features instead of choosing from the whole set of features while splitting the nodes of the tree. OSTE is an attempt to refine this idea by assessing survival trees both on their collective and individual performance.

To obtain the ensemble of optimal survival trees divide the given training data randomly into two non overlapping parts and . After partitioning, draw bootstrap samples from . Grow a survival tree on each sample by randomly selecting a subset of features at each node of the tree to induce additional randomness. Some of the observations are left out of samples during bootstrapping which are called out-of-bag (OOB) observations. These observations play no role in training the corresponding model, however, they could be used as test data to determine a prediction error for the corresponding individual survival tree. According to their C-index (introduced in Section 2.1) the grown survival trees are arranged in ascending order and the top trees are selected. The selected trees are tested one by one for diversity as follows:

  • The collective performance of survival trees is assessed on independent training data to get the resultant ensemble of survival trees. From the arranged list of trees, an ensemble of two i.e. the second best survival tree combined with the top best survival tree is assessed by using the training data . Similarly, in next step, the third best survival tree is added to the ensemble of size two, obtained in the previous step, and the performance is again measured. The decision to add or discard a tree in the final ensemble is dependent on whether or not the prediction error of the ensemble is decreased or increased respectively. The same steps are repeated for all survival trees.

  • A survival tree, where is chosen as an optimal survival tree for the intended ensemble of optimal survival trees if its addition to the ensemble with out the survival tree fulfils the following criterion

where is the integrated brier score (IBS) of the ensemble including the tree and is the integrated brier score of the ensemble in which the tree is not yet included.

2.1 Concordance index

In survival analysis, for each subject under study, there is a survival time and prediction of it. Therefore, for the evaluation of the survival model, instead of the absolute survival time for each subject the relative risk of an event for different subjects is considered. The concordance index (C-index) Harrell et al. (1982), an extension of the concept of the receiver operating characteristic (ROC) curve area, is one of the most reported measures used for checking survival model performance.

The C-index is actually the relative frequency of concordant pairs. In other words, a pair of two observations is said to be concordant if the observation which is predicted to have an earlier outcome is observed to fail earlier than the other.

The C-index can also be described as the probability that a subject of interest, whose survival time is short is associated with a high value of an indicator (biomarker) and vice versa. In other words C-index measures the discriminative ability of a biomarker. For example, in biomedical research, where grouping of patients into good or poor prognosis groups is required Mayr and Schmid (2014).

The C-index usually calculated as

  1. All given observations are paired.

  2. Discard those pairs of observations that have same survival time for an event and those pairs where there is a censored observation at the lower time point.

  3. Score each of the permissible pairs (the set of remaining pairs) as

    A: 1 if

    • A pair consists the observations with unequal survival time and the outcome of an observation with the shorter survival time is correctly predicted.

    • The survival times are equal, while the event is observed for only one observation with the lower predicted survival time.

    • The survival times and their predicted outcomes are equal for both observations.

    B: 0.5 if

    • The outcome of the observations are predicted to be equal for their unequal survival time.

    • The outcome of the observations are predicted to be unequal for their equal survival time.

    • For equal survival times the survival time is predicted to be lower for the observation for whom the event remains unobserved.

  4. The error rate is then calculated using the formula where = Concordance.

In right censored survival settings, the C-index is defined as:


where are the predicted biomarker values while, represent event timesMayr and Schmid (2014). The biomarker does not perform well for .

2.2 Integrated Brier score (IBS)

In the context of survival analysis, the squared difference between the survival function indicator and the predicted survival probability is called Brier score. In case of non-censored data, it can easily be calculated by averaging the squared distances between the survival function indicator for the subject and the predicted survival probability given by the model for that subject. Tsouprou et al. (2015) In other words, for non-censored data the prediction error is assessed by taking average of the squared residual . However, for right censored data, which is the main concern in the OSTE method, the squared residual needs to be weighted at each time point, t. Hence, the integrated brier score (IBS) technique is used. IBS is simply the integration of the Brier score.

Let be the time of the event of interest of subject and be the outcome while the estimated probability of a subject at risk surviving beyond ( the follow-up time) is then the BS is given as

For an individual that is censored before i.e , the observation is consider to be unknown. To overcome this issue, Inverse-probability-of-censoring weighting (IPCW) Robins and Finkelstein (2000) is used . Hence, the Brier score can be written as:



where is the Kaplan-Meier estimate of the probability of being uncensored at time . In right censored data over a range of time points a measure of predictive accuracy is required. Therefore, an integrated Brier score (IBS) with respect to some weight function for can be estimated by integrating the BS calculated via the method shown above

where a weight function at individual time points is represented by , integration of the area under the prediction curve. .

2.3 OSTE Algorithm

The algorithm of the proposed method OSTE consists of the following steps:

  • Partition the training data randomly into two non-overlapping parts and i.e = and

  • Draw bootstrap samples from the data .

  • On the bootstrap samples grow survival trees in such a way that at each node features are chosen.

  • On the bases of individual prediction error using OOB data, arrange the grown trees in ascending order and chose the top trees. The prediction error is estimated via the concordance index given in Section 2.1.

  • Add the selected trees one by one starting from the single top tree and calculate the integrated Brier score. Select the survival tree if the result is improved after testing performance on the validation data .

  • Predict new data by combining the results of the selected trees into the resultant ensemble.

2.4 Related work

In the literature there are many other techniques used by researchers to resolve right censored survival data, for example regularized Cox regression models Wright et al. (2017). However, these models violate the proportional hazard assumptions and hence, miss-specify the model Wright et al. (2017)

. Alternatively, machine learning methods such as Conditional inference forests (CIF), have been suggested as have random survival forest (RSF) and the bagging of survival trees. CIF uses different statistical approaches while selecting the split variables and the split points. This approach creates problems in the case of non-linear covariates. Maximally selected rank statistic

Lausen and Schumacher (1992) is used to avoid this difficulty and to reduce bias Wright et al. (2017). The proposed statistic is applied to obtain an exact p-value. Moreover, to get the exact distribution of maximally selected rank statistics a lower bound is calculated by extending an algorithm originating from linear rank statistics Hothorn and Lausen (2003).

In the literature, three types of artificial neural network (ANN) are proposed for survival analysis problems. These are the prediction of the survival time of a subject from the given data directly, neural network survival analysis has been employed

Dunn and Clark (2009), where the survival status of a subject has been taken as the output of the neural network Dunn and Clark (2009) and the extension of the Cox PH model Faraggi and Simon (1995) to the non-linear ANN predictor with a suggestion for fitting of a neural network.

3 Experiment and results

3.1 Benchmark datasets

To assess and compare the predictive performance of the proposed OSTE approach, with other state-of-the-art methods a total of 17 benchmark datasets are considered. A brief summary of these chosen datasets are given in Table 1. which details the number of observations and the number of the features the type of the features, whether real, integer or nominal is given against each dataset.

Datasets No. of No. of Features Censored Source
observations features type observations
kidney 119 3 (2/1/0) 26 by Klein et al. (2012)
twins 24 4 (4/0/0) 8 by Klein et al. (2012)
kidtran 863 5 (5/0/0) 140 by Klein et al. (2012)
channing 462 5 (5/0/0) 176 by Klein et al. (2012)
hodg 43 6 (6/0/0) 26 by Klein et al. (2012)
myeloid 646 6 (5/0/1) 320 Therneau (2015)
veteran 137 8 (0/7/1) 128 Therneau (2015)
retinopathy 394 9 (5/1/3) 155 Therneau (2015)
bfeed 927 10 (10/0/0) 892 by Klein et al. (2012)
GBSG2 686 10 (7/0/3) 299 Mogensen et al. (2012)
NKI 295 14 (0/8/6) 79 nlm.nih.gov/gap/?
cgd 203 15 (6/4/5) 76 Therneau (2015)
colon 1858 15 (0/15/1) 920 Therneau (2015)
cost 518 15 (4/1/10) 404 Mogensen et al. (2012)
burn 154 17 (17/0/0) 99 by Klein et al. (2012)
Pbc 418 19 (11/7/1) 347 Therneau (2015)
BMT 137 22 (22/0/0) 81 by Klein et al. (2012)
Table 1: Datasets description: Number of observations, number of features, type of features whether integer, real, or nominal (I/N/F) and data source are given against each dataset.

A brief description of these datasets are now follows.

The dataset Veteran has been taken from a randomized trial of two treatment procedures for lung cancer. A total of 137 patients are observed, measuring their survival time since the start of the treatment with a status 1 for dead and 0 for others. Covariates also considered are the type of treatment whether standard or test drug, type of cell, the time since the diagnosis, age, the Karnofsky score. any prior therapy i.e 0 if none and 1 for yes. The survial time has been recorded for days while the time since diagnosis has been recorded for months.

Kidtran dataset is taken from the study designed to assess the time to first clinically apparent infection in a group of patients with renal insufficiency. Total 863 cases are observed on 5 features i.e. gender (male, female), race( white, black), age in years, time which shows period of study, death indicator delta (0 if alive otherwise 1). The original source of the dataset is Klein and Moeschberger (2005).

myeloid dataset consists of 646 observations with features treatment arm A or B, time to death represented as futime is 1 for a death and 0 for censoring, time to transplant of hematropetic stem cell, time until complete response and time to relapse of disease.

The hodg dataset has 43 observations made on 6 features i.e, graft type 1 for allogenic and 2 for autologous, disease type 1 and 2 for Non Hodgkin lymphoma and Hodgkins disease respectively, time to death, delta (death/relapse indicator), Karnofsky score and waiting time in months to transplant.

The retinopathy dataset is based on a trial to delay diabetic retinopathy through laser coagulation treatment. A data frame consists of 394 observations on 9 variables, type of laser used, treated eye, person age at diagnosis time, type of diabetes, trt that is 0 for control eye and 1 for treated eye, time to loss of vision and eye risk score. A variable status is recorded as censoring indicator. For each patient there are two observations in the dataset, one for the eye received laser treatment and the other for the untreated eye. The time when s treatment starts to the time when visual acuity dropped below 5/200 is considered as the event of interest for each eye. The difference between actual time when vision is lost and minimum possible time to event is considered as a survival time.

bfeed dataset consists of breast feeding related information collected from 927 mothers (with first born children) who choose breast feeding. This dataset is actually the main section of the survey conducted by “ the National Longi tudinal Survey of Youth“ in 1983. The main aim of the survey is to collect information from females about any pregnancies that have occurred in 1983. The response feature in the dataset is duration followed by an indicator feature showing that weaning to infant is completed or not. Other covariates observed during this study are year of child’s birth, race, education, age, poverty, smoking status, alcohol-drinking and lack of prenatal care status in first trimester of pregnancy.

The datasets kidney and cgd are records of assessment time to first exit site infection in patients with renal insufficiency and time to serious infections observed in granulotomous disease (CGD) through to the end of the study respectively. A total of 119 patients are observed in kidney dataset in two groups whether the catheter is placed surgically or percutaneously with a delta as the censored indicator. This dataset is available free in KMsurv R package by Klein et al. (2012). On the other side, a total of 203 cases on 15 features are observed in the cgd dataset. These features are enrolling centre, treatment whether placebo or gamma interferon, sex, age, at study entry, height (in cm), weight (in kg), inheritance pattern, use of steroids and prophylactic antibiotics, a categorization of the centres into 4 groups, days to last follow-up, start and end of each time interval and observation number within subject with the status 1 if the interval ends with an infection as censoring indicator. The original source of the data is “Counting Processes and Survival Analysis“Fleming and Harrington (2011).

Dataset twins is the record of 24 cases who died from coronary heart disease (CHD). The survival time of each individual is recorded in months with the indicator death sets as 1 if the cause of death is CHD. The other two features show the identification number and gender of an individual under the study.

burn dataset consists of information about the methods used to take care of the burned patients, the infections in their wound and other medical concerns. This dataset consists of the medical records of 154 burned patients treated during the 18-months study period. During study the time until staphylococcus infection was recorded in days with an indicator whether an infection had occurred or not. Gender, race, severity of the burn, burn site and type of burn, time to excision and time to prophylactic antibiotic treatment administration along with the two indicator features, namely, whether the patient’s wound had been removed or not and whether the patient had been treated with an antibiotics during the course of the study.

The information about those women who suffered from breast cancer is collected in GBSG2 dataset. Originally the dataset is available in “Building multivariable prognostic and diagnostic models transformation of the predictors by using fractional polynomials“Sauerbrei and Royston (1999). A total of 686 women are observed on features, age, time of recurrence free survival time (in days), hormonal therapy as a two level factor whether no or yes, menopausal status as a factor at two levels, premenopausal and postmenopausal recorded as horTh and menostat respectively, tumor size and grade, number of positive nodes and progesterone receptor, estrogen receptor with censoring indicator i.e 0 if censored. The dataset is freely available in the pec R package Mogensen et al. (2012).

The dataset Channing has observations of a total of 462 individuals allowed easy access to medical care without any additional financial burden, as they were covered by a health care program provided by the centre. The original source of the data is the ”Survival Analysis Techniques for Censored and Truncated Data“ Klein and Moeschberger (2005). There are a total of 5 features in the dataset, death status (1 or 0), age of entry into retirement home, age of death or left retirement home, difference between the above two ages and gender.

Pbc dataset consists 424 patients who were eligible for the controlled randomized placebo trial of the D-penicillamine drug. The observed features in this study are alkaline phosphotase, age, serum albumin and serum bilirunbin, presence of ascites, aspartate aminotransferase, serum cholesterol, urine copper, edema as 0, 0.5 and for no edema, untreated or successfully treated and edema despite diuretic therapy respectively and presence of hepatomegaly or enlarged liver.

colon is the collection of information about colon cancer disease observed in the first successful adjuvant chemotherapy. Two records per individual i.e, one for recurrence and one for death are recorded. There are a total of 1858 subjects observed on 15 features. These features are study, rx as Treatment - Obs(ervation), Lev(amisole) and Lev(amisole)+5-FU, sex and age of patient, obstruction of colon by tumour, perforation of colon, adherence to nearby organs, number of cancer detectable lymph nodes, days until event or censoring, differentiation of tumour, Extent of local spread, time from surgery to registration, more than 4 positive lymph nodes, event type and censoring status. The BMT dataset consists of information about the recovery process from a bone marrow transplantation of 137 patients. At the time of transplantation several risk factors were measured. For each disease, patients were grouped into risk categories based on their status. Risk factors denoted by z1 to z10 consists of recipient and donor age, gender, cytomegalovirus immune status (CMV) status, waiting time from diagnosis to transplantation, their French-American-British (FAB) classification based on standard morphological criteria, Hospital and MTX as a Graft-Versus-Host- Prophylactic that is 1 if Yes. t1 and t2 represent time to death and a disease free survival time. d1, d2 and d3 are recorded as death indicator, relapse indicator and disease free survival indicator while da, dc and dp features shows acute and chronic GVHD indicator and platelet recovery indicator respectively. ta, tc and tp show time to acute Graft-Versus-Host and chronic Graft-Versus-Host disease respectively.

In the NKI dataset the gene expression measurements of 337 lymph node positive breast cancer patients are recorded. The computational burden is reduced and missing data is eliminated by excluding all SNPs with a call fraction below 100% keeping 151,346 SNPs. The endpoint relapse-free survival are analysed. This dataset is available at dbGaP and has ID phs000547.v1.p1 (https://www.ncbi.nlm.nih.gov/gap/?term=phs000547.v1.p1).

The cost dataset contains a subset of the data from the Copenhagen stroke study which observed 518 stroke patients. There are a total of 14 features i.e. age and sex, Hypertension, History of ischemic heart disease at admission, history of previous strokes before admission, history of other disabling diseases, daily alcohol consumption, diabetes mellitus status indicating if the glucose level was higher than 11 mmol/L, daily smoking status, atrial fibrillation, stroke subtype, stroke score, cholesterol level with the survival time

3.2 Experimental setup

A random sub-set of 70% of each dataset is taken for the training while the remaining 30% is considered as testing data for all the methods considered in the analysis.

In the case of OSTE, an initial ensemble of a total of 1000 independent survival trees is grown on bootstrap samples. At total of 95% of the training data is used for bootstrapping. At each node of a tree features are randomly selected from the total of features as a splitting criterion using the log-rank statistic. The diversity is checked on remaining 5% of the training data. Usually i.e , the default value in the standard random survival forest is considered for all datasets. Based on individual accuracy, of total grown trees are selected while keeping the terminal node size fixed at 3. For each dataset a total of 1000 runs is performed and using the remaining 30% test data, the final results are averaged.

For RSF, the number of unproned trees from the set is tuned using 10-fold cross validation. The corresponding training part is fine tuned using 10-fold cross validation for all possible values of the number of features variable mtry. .The Log-rank statistic is used while growing the trees in package ranger Wright and Ziegler (2017) by keeping terminal node size equal to 3. Unpruned survival trees are grown on for the bagging survival trees method using the package ipred Peters and Hothorn (2015). 10-fold cross validation is used for tuning while a log-rank statistic is used for measuring the between nodes distance.

For conditional inference forest, the r package party Hothorn et al. (2006)

is used. The total number of trees and number of features selected at each node for splitting is denoted by

ntree and mtry respectively, these are the only hyper-parameters that are tuned using values from the set .

For Cox-proportional hazard model, a non-tree based method, the package survival Therneau (2015) is used. For all parameters the default values are considered. For init the default is zero for all features. Until the relative change in the log partial likelihood is smaller than , i.e. eps the iterations will remain continue. Value 20 is a default for maximum iteration attempts for convergence. The Efron approximationEfron (1977) is used to handle ties.

4 Results and discussion

Using the experimental settings given in the previous section, the integrated Brier scores for all the methods are calculated on the introduced datasets. The results are given in Table 2. The final results are the average integrated Brier scores from a total of 1000 runs for each of the methods. Each time, given data is randomly divided into training and testing parts as described in the above section. It is clear from the table that OSTE is giving better average results than the alternative methods in 5 out of 17 data sets. On 5 datasets bagging outperformed others while, on 1 data set RSF gives better results. On the twins dataset the results of bagging and RSF are same. CIF gave better results on 4 and Coxph on 3 datasets.

Datasets n d E Cox bagging RSF CIF OSTE
kidney 119 3 26 0.1272 0.1296 0.1296 0.1257 0.1291
twins 24 4 8 0.0144 0.0132 0.0131 0.0147 0.0139
kidtran 863 5 140 0.0341 0.0324 0.0345 0.0203 0.0357
channing 462 5 176 0.0584 0.0512 0.0554 0.0664 0.0550
Hodg 43 6 26 0.1521 0.1885 0.1836 0.1703 0.2067
myeloid 646 6 320 0.1393 0.1348 0.1349 0.1360 0.2474
veteran 137 8 128 0.2571 0.1707 0.1692 0.1582 0.1683
retinopathy 394 9 155 0.1757 0.1795 0.1765 0.1714 0.1762
bfeed 927 10 892 0.1925 0.2397 0.1942 0.1941 0.1478
GBSG2 686 10 299 0.0148 0.0151 0.0149 0.0182 0.0170
NKI 295 14 79 0.1510 0.1154 0.1113 0.1077 0.1110
cgd 203 15 76 0.2831 0.0819 0.0862 0.0831 0.0844
colon 1858 15 920 0.1737 0.1534 0.1605 0.1735 0.1897
cost 518 15 404 0.1764 0.1825 0.1807 0.1851 0.1789
burn 154 17 99 0.1661 0.1477 0.1474 0.1527 0.1469
Pbc 418 19 347 0.0669 0.0669 0.0504 0.0523 0.0082
BMT 137 22 81 0.0799 0.0450 0.0560 0.0511 0.0299
Table 2: Integrated Brier scores of the methods against each data set. The best score is highlighted in bold font.

Figures 2-4 give the results in the form of box plots for 17 datasets. Colors brown, gray, red, yellow and blue are used for the box plots of Cox, bagging, RSF, CIF and OSTE receptively. Figure 2 gives the integrated Brier scores for all the methods for the datasets veteran, kidtran, bfeed, twins, GBSG2 and burn. For two datasets, kidtran and bfeed OSTE shows better performance while for other datasets the results of OSTE are similar to the alternative methods.

Figure 1: The boxplots showing IBS on the datasets veteran, kidtran, bfeed, twins, GBSG2 and burn. Cox, Bagging, RSF, CIF and OSTE are shown by brown, gray, red, yellow and blue colors, respectively. OSTE shows better performance on kidtran and bfeed datasets while for others datasets the results are similar to the alternative methods.

Figure 2: The boxplots showing IBS on the datasets retinophty, cgd, cost, myeliod, channing and NKI. Cox, Bagging, RSF, CIF and OSTE are shown by brown, gray, red, yellow and blue colors, respectively. OSTE shows similar performance on all the datasets except myeliod.

The boxplots given in Figure 3 show the IBS on the datasets retinophty, cgd, cost, myeliod, channing and NKI. Cox, Bagging, RSF, CIF and OSTE are shown by brown, gray, red, yellow and blue colors, respectively. The results of OSTE are almost same on all datasets except myeliod. On cgd and NKI datasets the performance of Cox is poor while on other datasets the results of the methods are almost similar to the rest of the methods.

Figure 3: The boxplots showing IBS on the datasets BMT, colon, Hodg, kidney and Pbc. Cox, Bagging, RSF, CIF and OSTE are shown by brown, gray, red, yellow and blue colors, respectively. For the kidney dataset OSTE shows similar performance while on Pbc and BMT datasets OSTE shows better results.

The boxplots given in Figure 4 showing IBS on the datasets BMT, colon, Hodg, kidney and Pbc. Cox, Bagging, RSF, CIF and OSTE are shown by brown, gray, red, yellow and blue colours, respectively. For kidney dataset OSTE shows similar performance with the rest of methods while for Pbc and BMT datasets the results of OSTE are superior.

On some random splits of the data into training and testing parts OSTE might give comparatively larger error estimates as can be seen, for example in Figure 4.4 for Pbc data. This may happen when patterns in the selected trees are not in-line with those in the test data as OSTE selects trees with specific patterns. Furthermore, a comparison of OSTE and RSF methods is given in terms of feature importance as OSTE improves RSF by removing trees from the original random survival forest with adverse effects on its overall efficiency. The permutation method is used fo this purpose. For survival tree, a given variable in the out-of-bag data is randomly permuted to estimate a variables permutation importance. After permutation, this OOB data is dropped down the tree and the OOB estimate of prediction error is calculated. The estimate of the variable importance is the difference between this estimate and the OOB error without permutation, averaged over all trees. The larger the permutation importance of a variable, the more predictive the variable.

The estimate of the variable importance is checked on 4 data sets, burn, bmt, GBSG2 and colon for both the methods, OSTE and RSF as shown in Figure 4.5. It can be seen from the figure that for burn and bmt data sets OSTE give larger importance values to predictive features compared to random survival forest which might be due to the removal of harmful trees (i.e. the tree might have the effects of non-informative features) from the initially grown forest. For colon and GBSG2 datasets fails to give more importance to predictive features which might be the reason for the out performance of OSTE.

Figure 4: The plots showing feature importance for RSF and OSTE. The dots and + sign shows RSF and OSTE respectively.

4.1 Hyper-parameters assessment

The effect of various number of trees () grown initially in the ensemble, proportion of trees (M) chosen on individual accuracy and the number of features have been assessed on the results of the proposed method i.e. OSTE. The effect of , is assessed on various values in initial set. The results are given in Figure 5. It can be seen that the Brier scores on the given datasets shows no/little effect with any increase in the number of trees from 1000, while for kidtran dataset, the error is increased by growing more than 1000 trees.

Figure 5: The boxplot showing a comparison of IBS on four datasets for different number of trees in the initial set.

The proposed approach OSTE is also examined for various values of M i.e. . The results are shown in Figure 6. As can be seen from the results, OSTE shows same performance by only selecting 5% of trees from the total grown trees initially. This selection of trees is based on individual accuracy against higher values of as shown in the figure. This has led to a resultant ensemble of sizes 25, 23, 31 and 24 for veteran, kidtran, bfeed and twins datasets respectively. This reveals that a significant reduction in the number of trees used for the final ensemble can be accomplished by using OSTE.

Figure 6: The boxplot showing a comparison of IBS on the datasets for different percentages of total number of trees () selected in the first phase. The trees selected by OSTE for the final ensemble are given on the x-axis.

The effect of the number of features that we chose randomly for splitting the nodes of the trees on IBS are shown in Figure 7. As seen in the figure that for changing value of the results shows variations. The results suggest the tuning of this parameter for the corresponding data set.

Figure 7: Boxplots showing a comparison of IBS on veteran, kidtran and bfeed datasets for different values of .

4.2 Size comparison

In terms of the number of survival trees used a comparative analysis of ensemble sizes has also been done. The number of used survival trees in the resultant ensemble by the methods are given in Table 3. The table shows that a comparable performance could be achieved by selecting a total of 103, 92, 109, 1, 87, 35, 99, 104, 95, 102, 105, 203, 109, 97, 34, 46, 39 and 51 trees for veteran-Pbc datasets, receptively by selecting only as compared to the selection of hundreds of survival trees in the corresponding final ensembles by other state-of-the-art methods. This might be very helpful in reducing computational cost of the ensemble in terms of storage resources.

Dataset Bagging RSF CIF OSTE
veteran 1000 1500 1000 103
kidtran 1500 1000 1000 92
bfeed 500 1000 500 109
twins 1000 1000 1500 1
VA 1000 1000 1500 87
BMT 1000 1000 1000 35
retinophty 1000 1000 1000 99
cgd 1000 1000 1500 104
channing 1000 1500 1000 95
Burn 1500 1000 1500 102
GBSG2 1500 1500 1500 105
Cost 1500 1000 1000 203
myeliod 1000 1000 1500 109
NKI 1000 1500 1000 97
colon 1500 1500 1500 34
Hodg 1500 1000 1000 46
Kidney 1000 1500 1000 39
Pbc 1000 1000 1500 51
Table 3: Table showing sizes of ensemble for the datasets. Size of OSTE is shown for .

5 Conclusion

The main aim of this paper is to lessen the number of survival trees in the final ensemble in addition to improving its performance. The idea of OSTE “optimal survival trees ensemble“ is proposed to achieve this goal.

To find trees that showed better performance based on the C-index, out-of-bag (OOB) observations from the bootstrap samples are used as the test subjects. Ensemble predictive accuracy is checked by assessing top ranked survival trees on independent training data. For the final ensemble those survival trees who performed well both individually and collectively have been selected. The results, in terms of the integrated Brier score (IBS), of 17 benchmark datasets after applying OSTE, are compared with Cox proportional hazard model, random survival forest, conditional inference forest and bagging survival trees.

From the integrated Brier scores, after applying all the methods to the datasets discussed above, average IBS values are calculated. It has been observed from the final results in term of boxplots, that OSTE, the proposed method, is giving better or comparable results to the best of the other methods.

It has also been observed that OSTE reduced the number of survival trees in the resultant ensemble and improved predictive performance. OSTE consisting of less than 20 survival trees is seen to give comparable results to those ensembles which select hundreds of survival trees.

Furthermore, OSTE performance has also been checked for various hyper-parameters. In this regard, the effect of changing the selected features number i.e, at the nodes of the survival trees, number of trees in the initial ensemble and proportion of the top ranked trees have been assessed. is considered as a tuning parameter of the proposed method and should be fine tuned for a given dataset accordingly. needs not to be greater than 20% as higher values of M show no improvements and only increase the size of the ensemble. A total of 1000 survival trees are suggested to be grown in an initial set, for best results. The proposed ensemble is implemented in an package “OSTE“.

For the purposes of internal validation, OSTE leaves some observations from the training data during bootstrapping, therefore, some information is lost in the learning process. Whereas the remaining methods use the whole training set. Thus the performance of OSTE may be negatively affected. These out-of-bag observations could be used for internal validation as a future direction for research into this problem and further improve OSTE.

While growing a survival tree, a maximally selected rank statistic Lausen and Schumacher (1992) could be used instead of the log-rank statistic for the selection of split points as the log-rank test favours many split points.

For the proposed method to work well in high dimensional settings, some state-of-the-art feature selection/dimensionality reduction techniques could be used with OSTE.


  • Ø. Borgan (1997) Three contributions to the encyclopedia of biostatistics: the nelson-aalen, kaplan-meier, and aaalen-johansen estimators. Department of Mathematics, University of Oslo. Cited by: §1.
  • L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen (1984) Classification and regression trees. CRC press. Cited by: §1, §1.
  • L. Breiman et al. (1996) Heuristics of instability and stabilization in model selection. The annals of statistics 24 (6), pp. 2350–2383. Cited by: §1.
  • O. by Klein, Moeschberger, and modifications by Jun Yan (2012) KMsurv: data sets from klein and moeschberger (1997), survival analysis. Note: R package version 0.1-5 External Links: Link Cited by: §3.1, Table 1.
  • C. David (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society 34, pp. 187–220. Cited by: §1.
  • O. J. Dunn and V. A. Clark (2009) Basic statistics: a primer for the biomedical sciences. John Wiley & Sons. Cited by: §2.4.
  • B. Efron (1977) The efficiency of cox’s likelihood function for censored data. Journal of the American statistical Association 72 (359), pp. 557–565. Cited by: §3.2.
  • D. Faraggi and R. Simon (1995) A neural network model for survival data. Statistics in medicine 14 (1), pp. 73–82. Cited by: §2.4.
  • T. R. Fleming and D. P. Harrington (2011) Counting processes and survival analysis. Vol. 169, John Wiley & Sons. Cited by: §3.1.
  • F. E. Harrell, R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati (1982) Evaluating the yield of medical tests. Jama 247 (18), pp. 2543–2546. Cited by: §2.1.
  • T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. J. Van Der Laan (2006) Survival ensembles. Biostatistics 7 (3), pp. 355–373. Cited by: §3.2.
  • T. Hothorn, B. Lausen, A. Benner, and M. Radespiel-Tröger (2004) Bagging survival trees. Statistics in medicine 23 (1), pp. 77–91. Cited by: §2.
  • T. Hothorn and B. Lausen (2003) On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis 43 (2), pp. 121–137. Cited by: §2.4.
  • H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer (2008) Random survival forests. The annals of applied statistics, pp. 841–860. Cited by: §2.
  • Z. Khan, A. Gul, A. Perperoglou, M. Miftahuddin, O. Mahmoud, W. Adler, and B. Lausen (2016a) An ensemble of optimal trees for classification and regression (ote). External Links: Link Cited by: §1, §1.
  • Z. Khan, A. Gul, O. Mahmoud, M. Miftahuddin, A. Perperoglou, W. Adler, and B. Lausen (2016b) An ensemble of optimal trees for class membership probability estimation. In Analysis of Large and Complex Data, pp. 395–409. Cited by: §1, §1.
  • J. P. Klein and M. L. Moeschberger (2005) Survival analysis: techniques for censored and truncated data. Springer Science & Business Media. Cited by: §3.1, §3.1.
  • B. Lausen and M. Schumacher (1992) Maximally selected rank statistics. Biometrics, pp. 73–85. Cited by: §2.4, §5.
  • A. Mayr and M. Schmid (2014) Boosting the concordance index for survival data–a unified framework to derive and evaluate biomarker combinations. PloS one 9 (1), pp. e84483. Cited by: §2.1, §2.1.
  • P. Meier and E. Kaplan (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53 (282), pp. 457–481. Cited by: §1.
  • U. B. Mogensen, H. Ishwaran, and T. A. Gerds (2012) Evaluating random forests for survival analysis using prediction error curves. Journal of Statistical Software 50 (11), pp. 1–23. External Links: Link Cited by: §3.1, Table 1.
  • A. Peters and T. Hothorn (2015) Ipred: improved predictors. Note: R package version 0.9-5 External Links: Link Cited by: §3.2.
  • J. M. Robins and D. M. Finkelstein (2000) Correcting for noncompliance and dependent censoring in an aids clinical trial with inverse probability of censoring weighted (ipcw) log-rank tests. Biometrics 56 (3), pp. 779–788. Cited by: §2.2.
  • W. Sauerbrei and P. Royston (1999) Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society: Series A (Statistics in Society) 162 (1), pp. 71–94. Cited by: §3.1.
  • T. M. Therneau (2015) A package for survival analysis in s. Note: version 2.38 External Links: Link Cited by: §3.2, Table 1.
  • S. Tsouprou, H. Putter, and M. Fiocco (2015) Measures of discrimination and predictive accuracy for interval censored survival data. Ph.D. Thesis, Universiteit Leiden. Cited by: §2.2.
  • M. N. Wright, T. Dankowski, and A. Ziegler (2017) Unbiased split variable selection for random survival forests using maximally selected rank statistics. Statistics in Medicine. Cited by: §2.4.
  • M. N. Wright and A. Ziegler (2017)

    ranger: a fast implementation of random forests for high dimensional data in C++ and R

    Journal of Statistical Software 77 (1), pp. 1–17. External Links: Document Cited by: §3.2.
  • Z. Zhou (2012) Ensemble methods: foundations and algorithms. CRC press. Cited by: §1.