1 Introduction
Survival analysis concerns the estimation of the time until an event of interest occurs. For example, an event of interest could be recovery from a certain type of disease, with the follow-up time measured from an individual's entry into a study. Survival analysis not only estimates and compares the survival probabilities of individuals from a specific time point to the endpoint of interest, but also computes the hazard, i.e. the ratio of the probability density function to the survival function, and assesses the relationship between explanatory variables and survival time.
The presence of unobserved events, i.e. censoring, reduces the predictive performance of survival modelling. Many techniques have been developed in the literature to address this issue. Kaplan and Meier (1958) computed successive probabilities of non-occurrence of an event at certain points in time, then multiplied these probabilities by the earlier, similarly calculated probabilities to obtain the resultant estimate. The Nelson–Aalen estimator Borgan (1997) estimates the cumulative hazard function for censored data. Cox (1972) introduced a large family of semiparametric models by focusing on hazard functions via the Cox proportional hazards model. In the search for increased predictive performance in survival models, tree-based approaches are one of many attempts. A standalone survival tree offers some advantages with respect to interpretation and requires fewer assumptions than simple regression techniques.
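As a concrete illustration of the Kaplan–Meier product idea mentioned above, the following minimal Python sketch computes the product-limit estimate from raw times and event indicators (the function name and data layout are our own, not taken from any survival package):

```python
def kaplan_meier(times, events):
    """Product-limit estimator: at each distinct event time t, multiply
    the running survival probability by (1 - d_t / n_t), where d_t is the
    number of events at t and n_t the number of subjects still at risk."""
    surv, s = {}, 1.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n_t = sum(1 for u in times if u >= t)                  # at risk at t
        d_t = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        s *= 1.0 - d_t / n_t
        surv[t] = s
    return surv  # survival probability just after each event time

# toy data: the third subject is censored at time 3
estimate = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
```

With this toy data the estimate drops to 0.75 at time 1 and 0.50 at time 2; the censored subject then leaves the risk set without forcing a drop in the curve.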
The main aim in risk prediction model building is to construct a model that predicts future risk accurately. Usually, a model is fitted to given data and its performance is then checked on the same data. However, using the same dataset both for constructing a model and for assessing its performance creates over-optimisation problems with low generalizability Zhou (2012). In any model based upon decision trees this over-optimisation becomes severe due to the large-scale searching in each and every node of a tree Breiman and others (1996). To alleviate this problem, the given data may be partitioned into training and testing subsets: the training part is used for building a model and the testing part is then used for assessing the performance of the derived model Zhou (2012). This technique, however, decreases the sample size and hence the power of the model. Alternatively, a model is built on a number of random bootstrap samples drawn from the given data; its performance is then tested on the same data by reporting means and standard deviations.
Random survival forest is one such aggregation method: it draws multiple bootstrap samples from the given data and, in addition, at each tree node a random sample of the independent variables is selected and only these are considered for splitting. Combining the results from these trees, i.e. forming an ensemble of survival trees, increases the predictive power of the decision model. Similarly, a survival forest consisting of accurate and diverse survival trees may perform better than a forest grown by combining simple survival trees. Breiman's Breiman et al. (1984) forest of regression and classification trees also follows this intuition, which further led to the method of optimal trees ensemble (OTE) for regression and classification Khan et al. (2016b, a).
OTE combines the best trees, i.e. those that are accurate and diverse, from a large number of trees grown initially by Breiman's Breiman et al. (1984) random forest, and thus refines the bagging and random forest approaches. On the basis of individual performance on out-of-bag observations, OTE selects a proportion of top-ranked trees in the first phase and then uses Brier scores to assess these trees on independent training data for their collective performance. For the resultant ensemble, the method considers the trees one by one, ordered from the highest prediction accuracy to the lowest. A tree is discarded if the predictive performance of the ensemble decreases or does not improve Khan et al. (2016b, a). OTE uses fewer trees and gives results comparable to some other state-of-the-art methods.
The main aim of this paper is to extend OTE to survival data. The objective is to select the best survival trees, in terms of their individual and collective predictive accuracy, and integrate them together into a new ensemble, called the optimal survival trees ensemble (OSTE). The results of OSTE are compared with those of the Cox proportional hazards model, bagging survival trees, random survival forest and conditional inference forest on 17 benchmark survival datasets.
2 Optimal survival trees ensemble (OSTE)
In survival analysis, an improvement to a single base model, namely a survival tree, is bagging survival trees Hothorn et al. (2004), which combines a number of single survival trees by using the aggregated Kaplan–Meier curve for each new observation.
Random survival forest Ishwaran et al. (2008) broadens this idea by selecting a random subset of features, instead of the whole set, while splitting the nodes of each tree. OSTE attempts to refine this idea by assessing survival trees on both their individual and their collective performance.
To obtain the ensemble of optimal survival trees, divide the given training data L randomly into two non-overlapping parts, L1 and L2. After partitioning, draw B bootstrap samples from L1. Grow a survival tree on each sample, randomly selecting a subset of features at each node of the tree to induce additional randomness. Some of the observations are left out of the samples during bootstrapping; these are called out-of-bag (OOB) observations. They play no role in training the corresponding model, but they can be used as test data to estimate the prediction error of the corresponding individual survival tree. According to their C-index (introduced in Section 2.1), the grown survival trees are arranged in ascending order of prediction error and the top M trees are selected. The selected trees are then tested one by one for diversity as follows:

The collective performance of the survival trees is assessed on the independent training part L2 to obtain the resultant ensemble. From the ordered list of trees, an ensemble of two, i.e. the second best survival tree combined with the best one, is assessed using L2. In the next step, the third best survival tree is added to the ensemble of size two obtained in the previous step, and the performance is measured again. A tree is added to the final ensemble if the prediction error of the ensemble decreases, and discarded otherwise. The same steps are repeated for all M selected trees.

A survival tree T_k, k = 2, 3, ..., M, is chosen as an optimal survival tree for the intended ensemble of optimal survival trees if its addition to the ensemble E without T_k fulfils the following criterion:

IBS(E ∪ T_k) < IBS(E),

where IBS(E ∪ T_k) is the integrated Brier score (IBS) of the ensemble including the tree T_k and IBS(E) is the integrated Brier score of the ensemble in which the tree is not yet included.
2.1 Concordance index
In survival analysis there is, for each subject under study, a survival time and a prediction of it. Therefore, for the evaluation of a survival model, the relative risk of an event for different subjects is considered instead of the absolute survival time of each subject. The concordance index (C-index) Harrell et al. (1982), an extension of the area under the receiver operating characteristic (ROC) curve, is one of the most widely reported measures of survival model performance.
The C-index is the relative frequency of concordant pairs: a pair of observations is said to be concordant if the observation that is predicted to have the earlier outcome is indeed observed to fail earlier than the other.
The C-index can also be described as the probability that a subject whose survival time is short is associated with a high value of an indicator (biomarker), and vice versa. In other words, the C-index measures the discriminative ability of a biomarker, for example in biomedical research where patients must be grouped into good or poor prognosis groups Mayr and Schmid (2014).
The C-index is usually calculated as follows:

1. All given observations are paired.

2. Discard those pairs of observations that have the same survival time ending in an event, and those pairs in which the observation at the lower time point is censored.

3. Score each of the permissible pairs (the set of remaining pairs) with a value of 1 if:

- the pair consists of observations with unequal survival times and the outcome of the observation with the shorter survival time is correctly predicted;
- the survival times are equal, while the event is observed only for the observation with the lower predicted survival time;
- the survival times and their predicted outcomes are equal for both observations;

and with a value of 0.5 if:

- the outcomes of the observations are predicted to be equal although their survival times are unequal;
- the outcomes of the observations are predicted to be unequal although their survival times are equal;
- for equal survival times, the survival time is predicted to be lower for the observation for which the event remains unobserved.

4. The C-index is the sum of these scores divided by the number of permissible pairs.

The error rate is then calculated using the formula Error = 1 − C, where C = Concordance.
In right-censored survival settings, the C-index is defined as:

C = Σ_{i≠j} I(T_i < T_j) I(η_i > η_j) δ_i / Σ_{i≠j} I(T_i < T_j) δ_i,    (1)

where η_i are the predicted biomarker values, T_i represent the event times and δ_i is the event indicator Mayr and Schmid (2014). The biomarker does not perform well for C < 0.5.
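The pairwise counting behind the C-index can be sketched as follows. This is a simplified Harrell-style Python implementation that skips tied survival times; the function and variable names are illustrative, not from any package:

```python
def c_index(times, events, risk):
    """Harrell-style concordance: a pair (i, j) is usable when subject i
    fails (event observed) strictly before subject j's observed time; it
    counts as concordant when i also has the higher predicted risk.
    Tied predictions score 0.5; tied survival times are skipped."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:   # permissible pair
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable

# higher predicted risk should accompany shorter survival
c = c_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1])
```

Here c is 1.0 because the risk ordering matches the failure ordering exactly; reversing the risk column gives 0.0, and values near 0.5 correspond to random predictions.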
2.2 Integrated Brier score (IBS)
In the context of survival analysis, the squared difference between the survival function indicator and the predicted survival probability is called the Brier score. In the case of non-censored data, it can easily be calculated by averaging the squared distances between the survival function indicator for each subject and the predicted survival probability given by the model for that subject Tsouprou et al. (2015). In other words, for non-censored data the prediction error is assessed by averaging the squared residuals. However, for right-censored data, which is the main concern of the OSTE method, the squared residual needs to be weighted at each time point t. Hence, the integrated Brier score (IBS), which is simply the integration of the Brier score over time, is used.
Let T_i be the time of the event of interest for subject i and I(T_i > t) be the outcome, while Ŝ(t | x_i) is the estimated probability that a subject at risk survives beyond t (the follow-up time). Then the BS is given as

BS(t) = (1/N) Σ_{i=1}^{N} ( I(T_i > t) − Ŝ(t | x_i) )².

For an individual that is censored before t, i.e. T_i ≤ t with δ_i = 0, the outcome is considered unknown. To overcome this issue, inverse-probability-of-censoring weighting (IPCW) Robins and Finkelstein (2000) is used. Hence, the Brier score can be written as:

BS(t) = (1/N) Σ_{i=1}^{N} [ Ŝ(t | x_i)² I(T_i ≤ t, δ_i = 1) / Ĝ(T_i) + (1 − Ŝ(t | x_i))² I(T_i > t) / Ĝ(t) ],    (2)

where Ĝ is the Kaplan–Meier estimate of the probability of being uncensored at the given time. For right-censored data, a measure of predictive accuracy over a range of time points is required. Therefore, an integrated Brier score (IBS) with respect to some weight function w(t) for t ∈ [0, t_max] can be estimated by integrating the BS calculated as above,

IBS = ∫_0^{t_max} BS(t) w(t) dt,

where w(t) is a weight function at individual time points, e.g. w(t) = 1/t_max, in which case the IBS is the normalised area under the prediction error curve.
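The IPCW weighting and the integration step can be illustrated with a short Python sketch. The censoring survival function G is passed in as a callable (here assumed to be precomputed, e.g. by a Kaplan–Meier fit on the censoring times); all names are our own:

```python
def brier_score(t, times, events, surv_prob, G):
    """IPCW Brier score at time t (Graf-style weighting): subjects with an
    observed event before t are weighted by 1/G(T_i), subjects still at
    risk at t by 1/G(t); subjects censored before t are weighted out."""
    total = 0.0
    for T, d, S in zip(times, events, surv_prob):
        if T <= t and d == 1:          # event already happened: true state 0
            total += (0.0 - S) ** 2 / G(T)
        elif T > t:                    # still at risk at t: true state 1
            total += (1.0 - S) ** 2 / G(t)
    return total / len(times)

def integrated_brier_score(time_grid, bs_values):
    """Trapezoid-rule integration of BS(t) over the grid, normalised by
    the grid span, i.e. the uniform weight w(t) = 1 / t_max."""
    area = 0.0
    for k in range(1, len(time_grid)):
        area += 0.5 * (bs_values[k] + bs_values[k - 1]) * (time_grid[k] - time_grid[k - 1])
    return area / (time_grid[-1] - time_grid[0])
```

With no censoring (G ≡ 1), a perfect prediction gives a Brier score of 0 and a maximally wrong one gives 1, matching the uncensored definition above.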
2.3 OSTE Algorithm
The algorithm of the proposed method OSTE consists of the following steps:

Partition the training data L randomly into two non-overlapping parts L1 and L2, i.e. L = L1 ∪ L2.

Draw B bootstrap samples from the data L1.

On the bootstrap samples grow B survival trees such that at each node mtry features are randomly chosen as candidates for splitting.

On the basis of individual prediction error estimated on the OOB data, arrange the grown trees in ascending order and choose the top M trees. The prediction error is estimated via the concordance index given in Section 2.1.

Add the selected trees one by one, starting from the single top tree, and calculate the integrated Brier score. A survival tree is retained if the result improves when performance is tested on the validation data L2.

Predict new data by combining the results of the trees selected into the resultant ensemble.
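The ranking and greedy selection steps above can be sketched generically. In this illustrative Python sketch the trees are abstract objects scored by user-supplied callables; none of the names come from the authors' actual code:

```python
def oste_select(trees, oob_error, ensemble_ibs):
    """Greedy OSTE-style selection: rank trees by individual OOB error
    (ascending), seed the ensemble with the best tree, then keep each
    further candidate only if it strictly lowers the validation IBS."""
    ranked = sorted(trees, key=oob_error)
    selected = [ranked[0]]
    best = ensemble_ibs(selected)
    for tree in ranked[1:]:
        score = ensemble_ibs(selected + [tree])
        if score < best:               # improvement: keep the tree
            selected.append(tree)
            best = score
    return selected

# toy stand-ins: 'trees' are numbers, and the 'IBS' of an ensemble is the
# distance of its mean prediction from a target value of 2
chosen = oste_select([1, 2, 3, 10],
                     oob_error=lambda t: t,
                     ensemble_ibs=lambda sel: abs(sum(sel) / len(sel) - 2))
```

In the toy run, chosen is [1, 2, 3]: the candidate 10 is discarded because adding it would pull the ensemble mean away from the target, mirroring how a tree that raises the IBS is dropped.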
2.4 Related work
In the literature there are many other techniques used to model right-censored survival data, for example regularized Cox regression models Wright et al. (2017). However, these models can violate the proportional hazards assumption and hence mis-specify the model Wright et al. (2017). Alternatively, machine learning methods have been suggested, such as conditional inference forests (CIF), random survival forest (RSF) and the bagging of survival trees. CIF uses different statistical approaches for selecting the split variables and the split points; this creates problems in the case of non-linear covariates. The maximally selected rank statistic Lausen and Schumacher (1992) is used to avoid this difficulty and to reduce bias Wright et al. (2017). The proposed statistic is applied to obtain an exact p-value. Moreover, to get the exact distribution of maximally selected rank statistics, a lower bound is calculated by extending an algorithm originating from linear rank statistics Hothorn and Lausen (2003). In the literature, three types of artificial neural network (ANN) approaches have been proposed for survival analysis problems: the direct prediction of the survival time of a subject from the given data; neural networks in which the survival status of a subject is taken as the output Dunn and Clark (2009); and the extension of the Cox PH model to a non-linear ANN predictor, with a suggested procedure for fitting the neural network Faraggi and Simon (1995).
3 Experiment and results
3.1 Benchmark datasets
To assess and compare the predictive performance of the proposed OSTE approach with other state-of-the-art methods, a total of 17 benchmark datasets are considered. A brief summary of these datasets is given in Table 1, which details, for each dataset, the number of observations, the number of features, the type of the features (whether real, integer or nominal), the number of censored observations and the source.
Datasets      No. of observations   No. of features   Features type   Censored observations   Source
kidney        119    3    (2/1/0)    26    Klein et al. (2012)
twins         24     4    (4/0/0)    8     Klein et al. (2012)
kidtran       863    5    (5/0/0)    140   Klein et al. (2012)
channing      462    5    (5/0/0)    176   Klein et al. (2012)
hodg          43     6    (6/0/0)    26    Klein et al. (2012)
myeloid       646    6    (5/0/1)    320   Therneau (2015)
veteran       137    8    (0/7/1)    128   Therneau (2015)
retinopathy   394    9    (5/1/3)    155   Therneau (2015)
bfeed         927    10   (10/0/0)   892   Klein et al. (2012)
GBSG2         686    10   (7/0/3)    299   Mogensen et al. (2012)
NKI           295    14   (0/8/6)    79    https://www.ncbi.nlm.nih.gov/gap/?term=phs000547.v1.p1
cgd           203    15   (6/4/5)    76    Therneau (2015)
colon         1858   15   (0/15/1)   920   Therneau (2015)
cost          518    15   (4/1/10)   404   Mogensen et al. (2012)
burn          154    17   (17/0/0)   99    Klein et al. (2012)
Pbc           418    19   (11/7/1)   347   Therneau (2015)
BMT           137    22   (22/0/0)   81    Klein et al. (2012)
A brief description of these datasets now follows.
The veteran dataset has been taken from a randomized trial of two treatment procedures for lung cancer. A total of 137 patients are observed, measuring their survival time from the start of the treatment with status 1 for dead and 0 otherwise. The covariates considered are the type of treatment (standard or test drug), type of cell, time since diagnosis, age, the Karnofsky score and any prior therapy (0 if none and 1 for yes). The survival time is recorded in days while the time since diagnosis is recorded in months.
The kidtran dataset is taken from a study designed to assess the time to first clinically apparent infection in a group of patients with renal insufficiency. A total of 863 cases are observed on 5 features: gender (male, female), race (white, black), age in years, time, which shows the period of study, and the death indicator delta (0 if alive, otherwise 1). The original source of the dataset is Klein and Moeschberger (2005).
The myeloid dataset consists of 646 observations with the features treatment arm (A or B), time to death (futime) with status 1 for a death and 0 for censoring, time to transplant of hematopoietic stem cells, time until complete response and time to relapse of disease.
The hodg dataset has 43 observations on 6 features: graft type (1 for allogenic and 2 for autologous), disease type (1 for non-Hodgkin lymphoma and 2 for Hodgkin's disease), time to death, delta (death/relapse indicator), Karnofsky score and waiting time in months to transplant.
The retinopathy dataset is based on a trial to delay diabetic retinopathy through laser coagulation treatment. The data frame consists of 394 observations on 9 variables: the type of laser used, the treated eye, the person's age at diagnosis, the type of diabetes, trt (0 for the control eye and 1 for the treated eye), time to loss of vision and an eye risk score. A status variable is recorded as the censoring indicator. For each patient there are two observations in the dataset, one for the eye that received laser treatment and one for the untreated eye. The event of interest for each eye is the time from the start of treatment until visual acuity dropped below 5/200. The difference between the actual time when vision is lost and the minimum possible time to event is taken as the survival time.
The bfeed dataset consists of breast-feeding related information collected from 927 mothers (with first-born children) who chose breast feeding. This dataset is the main section of a survey conducted by the National Longitudinal Survey of Youth in 1983, whose aim was to collect information from females about any pregnancies that occurred in 1983. The response feature in the dataset is the duration, followed by an indicator feature showing whether weaning of the infant is completed or not. Other covariates observed during this study are the year of the child's birth, race, education, age, poverty, smoking status, alcohol-drinking status and lack of prenatal care in the first trimester of pregnancy.
The datasets kidney and cgd record, respectively, the time to first exit-site infection in patients with renal insufficiency and the time to serious infection observed in chronic granulomatous disease (CGD) through to the end of the study. A total of 119 patients are observed in the kidney dataset in two groups, according to whether the catheter is placed surgically or percutaneously, with delta as the censoring indicator. This dataset is freely available in the KMsurv R package Klein et al. (2012). On the other side, a total of 203 cases on 15 features are observed in the cgd dataset. These features are the enrolling centre, treatment (placebo or gamma interferon), sex, age at study entry, height (in cm), weight (in kg), inheritance pattern, use of steroids and prophylactic antibiotics, a categorization of the centres into 4 groups, days to last follow-up, the start and end of each time interval and the observation number within subject, with status 1 if the interval ends with an infection as the censoring indicator. The original source of the data is "Counting Processes and Survival Analysis" Fleming and Harrington (2011).
The twins dataset is the record of 24 cases who died from coronary heart disease (CHD). The survival time of each individual is recorded in months, with the death indicator set to 1 if the cause of death is CHD. The other two features are the identification number and gender of the individual under study.
The burn dataset consists of information about the methods used to care for burned patients, the infections in their wounds and other medical concerns. It contains the medical records of 154 burned patients treated during an 18-month study period. During the study, the time until staphylococcus infection was recorded in days, with an indicator of whether an infection had occurred. Gender, race, severity of the burn, burn site and type of burn, time to excision and time to prophylactic antibiotic treatment administration are recorded, along with two indicator features, namely, whether the patient's wound had been excised and whether the patient had been treated with antibiotics during the course of the study.
Information about women who suffered from breast cancer is collected in the GBSG2 dataset, originally published in "Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials" Sauerbrei and Royston (1999). A total of 686 women are observed on the features age, recurrence-free survival time (in days), hormonal therapy as a two-level factor (horTh: no or yes), menopausal status as a two-level factor (menostat: premenopausal or postmenopausal), tumor size and grade, number of positive nodes, progesterone receptor and estrogen receptor, with a censoring indicator that is 0 if censored. The dataset is freely available in the pec R package Mogensen et al. (2012).
The channing dataset has observations on a total of 462 individuals who were allowed easy access to medical care without any additional financial burden, as they were covered by a health care program provided by the centre. The original source of the data is "Survival Analysis: Techniques for Censored and Truncated Data" Klein and Moeschberger (2005). There are a total of 5 features in the dataset: death status (1 or 0), age at entry into the retirement home, age at death or on leaving the retirement home, the difference between the above two ages, and gender.
The Pbc dataset consists of 424 patients who were eligible for a controlled randomized placebo trial of the drug D-penicillamine. The observed features in this study are alkaline phosphatase, age, serum albumin and serum bilirubin, presence of ascites, aspartate aminotransferase, serum cholesterol, urine copper, edema (0, 0.5 and 1 for no edema, untreated or successfully treated, and edema despite diuretic therapy, respectively) and the presence of hepatomegaly or enlarged liver.
The colon dataset is a collection of information about colon cancer observed in one of the first successful adjuvant chemotherapy trials. Two records per individual, one for recurrence and one for death, are recorded. There are a total of 1858 records observed on 15 features: study, rx as treatment (Obs(ervation), Lev(amisole) and Lev(amisole)+5-FU), sex and age of patient, obstruction of the colon by tumour, perforation of the colon, adherence to nearby organs, number of cancer-detectable lymph nodes, days until event or censoring, differentiation of the tumour, extent of local spread, time from surgery to registration, more than 4 positive lymph nodes, event type and censoring status.
The BMT dataset consists of information about the recovery process after bone marrow transplantation of 137 patients. Several risk factors were measured at the time of transplantation. For each disease, patients were grouped into risk categories based on their status. The risk factors denoted by z1 to z10 consist of recipient and donor age, gender, cytomegalovirus (CMV) immune status, waiting time from diagnosis to transplantation, the French-American-British (FAB) classification based on standard morphological criteria, hospital and MTX as a graft-versus-host prophylactic (1 if yes). t1 and t2 represent the time to death and the disease-free survival time. d1, d2 and d3 are recorded as the death indicator, relapse indicator and disease-free survival indicator, while the da, dc and dp features show the acute GVHD, chronic GVHD and platelet recovery indicators, respectively. ta, tc and tp show the corresponding times to acute graft-versus-host disease, chronic graft-versus-host disease and platelet recovery, respectively.
In the NKI dataset the gene expression measurements of 337 lymph-node-positive breast cancer patients are recorded. The computational burden is reduced and missing data eliminated by excluding all SNPs with a call fraction below 100%, keeping 151,346 SNPs. The endpoint relapse-free survival is analysed. This dataset is available at dbGaP with ID phs000547.v1.p1 (https://www.ncbi.nlm.nih.gov/gap/?term=phs000547.v1.p1).
The cost dataset contains a subset of the data from the Copenhagen stroke study, which observed 518 stroke patients. There are a total of 14 features: age and sex, hypertension, history of ischemic heart disease at admission, history of previous strokes before admission, history of other disabling diseases, daily alcohol consumption, diabetes mellitus status (indicating whether the glucose level was higher than 11 mmol/L), daily smoking status, atrial fibrillation, stroke subtype, stroke score and cholesterol level, along with the survival time.
3.2 Experimental setup
A random subset of 70% of each dataset is taken for training, while the remaining 30% is used as testing data for all the methods considered in the analysis.
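The repeated random 70/30 evaluation protocol used throughout the experiments can be sketched as follows; the fit and score callables are placeholders for any of the compared methods, and all names are illustrative:

```python
import random

def repeated_holdout(X, y, fit, score, p_train=0.7, runs=10, seed=1):
    """Average a model's test score over repeated random train/test
    splits (70/30 by default), mirroring the evaluation protocol."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    cut = int(p_train * len(y))
    scores = []
    for _ in range(runs):
        rng.shuffle(idx)                      # fresh random split each run
        tr, te = idx[:cut], idx[cut:]
        model = fit([X[i] for i in tr], [y[i] for i in tr])
        scores.append(score(model, [X[i] for i in te], [y[i] for i in te]))
    return sum(scores) / len(scores)

# toy check: a 'model' that predicts the training mean, scored by the
# absolute gap to the test mean; a constant response gives a zero gap
avg = repeated_holdout([[i] for i in range(10)], [3.0] * 10,
                       fit=lambda X, y: sum(y) / len(y),
                       score=lambda m, X, y: abs(m - sum(y) / len(y)))
```

In the experiments reported below the same loop runs 1000 times per dataset, and the averaged scores are what the tables report.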
In the case of OSTE, an initial ensemble of a total of 1000 independent survival trees is grown on bootstrap samples. A total of 95% of the training data is used for bootstrapping. At each node of a tree, mtry features are randomly selected from the total of d features, with the log-rank statistic used as the splitting criterion. Diversity is checked on the remaining 5% of the training data. The default value of the standard random survival forest, mtry = √d, is used for all datasets. Based on individual accuracy, the top M of the total grown trees are selected while keeping the terminal node size fixed at 3. For each dataset a total of 1000 runs is performed and the final results, obtained on the remaining 30% test data, are averaged.
For RSF, the number of unpruned trees is tuned over a grid of values using 10-fold cross-validation. The corresponding training part is also used to fine-tune, via 10-fold cross-validation, the number-of-features parameter mtry over all its possible values. The log-rank statistic is used while growing the trees in the package ranger Wright and Ziegler (2017), keeping the terminal node size equal to 3. For the bagging survival trees method, unpruned survival trees are grown using the package ipred Peters and Hothorn (2015); 10-fold cross-validation is used for tuning, while a log-rank statistic is used for measuring the between-node distance.
For conditional inference forest, the R package party Hothorn et al. (2006) is used. The total number of trees and the number of features selected at each node for splitting, denoted by ntree and mtry respectively, are the only hyperparameters tuned, using values from a grid of candidates. For the Cox proportional hazards model, a non-tree-based method, the package survival Therneau (2015) is used with the default values for all parameters: the default for init is zero for all features; the iterations continue until the relative change in the log partial likelihood is smaller than the tolerance eps; and 20 is the default maximum number of iteration attempts for convergence. The Efron approximation Efron (1977) is used to handle ties.
4 Results and discussion
Using the experimental settings given in the previous section, the integrated Brier scores of all the methods are calculated on the datasets introduced above. The results are given in Table 2. The final results are the average integrated Brier scores over a total of 1000 runs for each of the methods; each time, the given data is randomly divided into training and testing parts as described in the previous section. It is clear from the table that OSTE gives better average results than the alternative methods on 5 of the 17 datasets. On 5 datasets bagging outperformed the others, while on 1 dataset RSF gives the best results. On the twins dataset the results of bagging and RSF are nearly the same. CIF gave the best results on 4 datasets and Coxph on 3.
Datasets  n (observations)  d (features)  E (censored)  Cox  bagging  RSF  CIF  OSTE 

kidney  119  3  26  0.1272  0.1296  0.1296  0.1257  0.1291 
twins  24  4  8  0.0144  0.0132  0.0131  0.0147  0.0139 
kidtran  863  5  140  0.0341  0.0324  0.0345  0.0203  0.0357 
channing  462  5  176  0.0584  0.0512  0.0554  0.0664  0.0550 
Hodg  43  6  26  0.1521  0.1885  0.1836  0.1703  0.2067 
myeloid  646  6  320  0.1393  0.1348  0.1349  0.1360  0.2474 
veteran  137  8  128  0.2571  0.1707  0.1692  0.1582  0.1683 
retinopathy  394  9  155  0.1757  0.1795  0.1765  0.1714  0.1762 
bfeed  927  10  892  0.1925  0.2397  0.1942  0.1941  0.1478 
GBSG2  686  10  299  0.0148  0.0151  0.0149  0.0182  0.0170 
NKI  295  14  79  0.1510  0.1154  0.1113  0.1077  0.1110 
cgd  203  15  76  0.2831  0.0819  0.0862  0.0831  0.0844 
colon  1858  15  920  0.1737  0.1534  0.1605  0.1735  0.1897 
cost  518  15  404  0.1764  0.1825  0.1807  0.1851  0.1789 
burn  154  17  99  0.1661  0.1477  0.1474  0.1527  0.1469 
Pbc  418  19  347  0.0669  0.0669  0.0504  0.0523  0.0082 
BMT  137  22  81  0.0799  0.0450  0.0560  0.0511  0.0299 
Figures 2 to 4 give the results in the form of box plots for the 17 datasets. The colours brown, gray, red, yellow and blue are used for the box plots of Cox, bagging, RSF, CIF and OSTE, respectively. Figure 2 gives the integrated Brier scores of all the methods for the datasets veteran, kidtran, bfeed, twins, GBSG2 and burn. For two datasets, kidtran and bfeed, OSTE shows better performance, while for the other datasets the results of OSTE are similar to those of the alternative methods.
The boxplots in Figure 3 show the IBS on the datasets retinopathy, cgd, cost, myeloid, channing and NKI, with the same colour coding. The results of OSTE are almost the same on all datasets except myeloid. On the cgd and NKI datasets the performance of Cox is poor, while on the other datasets the methods give similar results.
The boxplots in Figure 4 show the IBS on the datasets BMT, colon, Hodg, kidney and Pbc, again with the same colour coding. For the kidney dataset OSTE shows performance similar to the rest of the methods, while for the Pbc and BMT datasets the results of OSTE are superior.
On some random splits of the data into training and testing parts, OSTE might give comparatively larger error estimates, as can be seen, for example, in Figure 4.4 for the Pbc data. This may happen when patterns in the selected trees are not in line with those in the test data, since OSTE selects trees with specific patterns. Furthermore, a comparison of the OSTE and RSF methods is given in terms of feature importance, as OSTE improves RSF by removing, from the original random survival forest, trees with adverse effects on its overall efficiency. The permutation method is used for this purpose. For a survival tree, a given variable in the out-of-bag data is randomly permuted to estimate the variable's permutation importance. After permutation, the OOB data is dropped down the tree and the OOB estimate of the prediction error is calculated. The estimate of the variable importance is the difference between this estimate and the OOB error without permutation, averaged over all trees. The larger the permutation importance of a variable, the more predictive the variable.
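The permutation-importance recipe just described can be sketched in a few lines of Python. The predictor and error function here are placeholders, and for brevity the sketch operates on a single OOB matrix rather than per tree as in a real forest; all names are our own:

```python
import random

def permutation_importance(predict, error, X_oob, y_oob, j, n_repeats=10, seed=0):
    """Permutation importance of feature j: the average rise in OOB
    prediction error after randomly shuffling column j of the OOB data."""
    rng = random.Random(seed)
    base = error(predict(X_oob), y_oob)
    rise = 0.0
    for _ in range(n_repeats):
        Xp = [row[:] for row in X_oob]        # copy so the shuffle is local
        col = [row[j] for row in Xp]
        rng.shuffle(col)
        for row, v in zip(Xp, col):
            row[j] = v
        rise += error(predict(Xp), y_oob) - base
    return rise / n_repeats

# feature 0 drives the response, feature 1 is a constant dummy
X = [[1, 5], [2, 5], [3, 5], [4, 5]]
y = [1, 2, 3, 4]
predict = lambda rows: [r[0] for r in rows]
mae = lambda p, t: sum(abs(a - b) for a, b in zip(p, t)) / len(t)
```

Shuffling the constant dummy column changes nothing, so its importance is exactly zero, while the informative column's importance is non-negative and typically positive.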
The variable importance estimates are examined on 4 datasets, burn, bmt, GBSG2 and colon, for both methods, OSTE and RSF, as shown in Figure 4.5. It can be seen from the figure that for the burn and bmt datasets OSTE gives larger importance values to predictive features than random survival forest does, which might be due to the removal of harmful trees (i.e. trees affected by non-informative features) from the initially grown forest. For the colon and GBSG2 datasets OSTE fails to give more importance to predictive features, which might be the reason it is outperformed there.
4.1 Hyperparameters assessment
The effects of the number of trees grown initially in the ensemble, the proportion M of trees chosen on individual accuracy and the number of features mtry on the results of the proposed OSTE method have been assessed. The effect of the initial number of trees is assessed over a range of values; the results are given in Figure 5. It can be seen that the Brier scores on the given datasets show little or no change as the number of trees is increased beyond 1000, while for the kidtran dataset the error increases when more than 1000 trees are grown.
The proposed approach, OSTE, is also examined for various values of M. The results are shown in Figure 6. As can be seen, OSTE achieves the same performance by selecting, on individual accuracy, only 5% of the initially grown trees as it does with higher values of M. This led to resultant ensembles of sizes 25, 23, 31 and 24 for the veteran, kidtran, bfeed and twins datasets, respectively, revealing that a significant reduction in the number of trees used for the final ensemble can be accomplished by using OSTE.
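The ranking step behind this selection can be sketched as follows. This is an illustrative helper under assumed names, not the package's API: it takes each tree's individual OOB accuracy (e.g. C-index) and returns the indices of the top M% of trees.

```python
import numpy as np

def top_m_percent(oob_accuracy, m_percent):
    """Indices of the top m_percent% of trees, ranked by individual OOB
    accuracy (e.g. C-index). A sketch of the ranking step, not the OSTE API."""
    scores = np.asarray(oob_accuracy, dtype=float)
    k = max(1, int(round(scores.size * m_percent / 100.0)))
    return np.argsort(scores)[::-1][:k]               # best trees first
```

With 1000 initially grown trees and M = 5, this keeps the 50 individually best trees as candidates for the final ensemble.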
The effect on IBS of the number of features chosen randomly for splitting the nodes of the trees is shown in Figure 7. As the figure shows, the results vary as this number changes, suggesting that the parameter should be tuned for the dataset at hand.
4.2 Size comparison
A comparative analysis of ensemble sizes, in terms of the number of survival trees used, has also been carried out. The numbers of survival trees used in the resultant ensembles by the methods are given in Table 3. The table shows that comparable performance could be achieved by selecting a total of 103, 92, 109, 1, 87, 35, 99, 104, 95, 102, 105, 203, 109, 97, 34, 46, 39 and 51 trees for the veteran to Pbc datasets, respectively, as compared to the hundreds of survival trees selected in the corresponding final ensembles by the other state-of-the-art methods. This can be very helpful in reducing the computational cost of the ensemble in terms of storage resources.
Dataset  Bagging  RSF  CIF  OSTE 

veteran  1000  1500  1000  103 
kidtran  1500  1000  1000  92 
bfeed  500  1000  500  109 
twins  1000  1000  1500  1 
VA  1000  1000  1500  87 
BMT  1000  1000  1000  35 
retinophty  1000  1000  1000  99 
cgd  1000  1000  1500  104 
channing  1000  1500  1000  95 
Burn  1500  1000  1500  102 
GBSG2  1500  1500  1500  105 
Cost  1500  1000  1000  203 
myeliod  1000  1000  1500  109 
NKI  1000  1500  1000  97 
colon  1500  1500  1500  34 
Hodg  1500  1000  1000  46 
Kidney  1000  1500  1000  39 
Pbc  1000  1000  1500  51 
5 Conclusion
The main aim of this paper is to reduce the number of survival trees in the final ensemble while improving its performance. The idea of OSTE, the "optimal survival trees ensemble", is proposed to achieve this goal.
To find trees that perform well according to the C-index, out-of-bag (OOB) observations from the bootstrap samples are used as the test subjects. Ensemble predictive accuracy is then checked by assessing the top-ranked survival trees on independent training data. Those survival trees that performed well both individually and collectively are selected for the final ensemble. The results of applying OSTE to 17 benchmark datasets, in terms of the integrated Brier score (IBS), are compared with the Cox proportional hazard model, random survival forest, conditional inference forest and bagging survival trees.
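The two-stage selection just summarised can be sketched as a greedy procedure. All names here are assumptions for illustration (the actual OSTE implementation is the R package, and the error measure stands in for the integrated Brier score): trees are ranked by individual OOB C-index, and a candidate is kept only if adding it does not worsen the ensemble's error on independent data.

```python
import numpy as np

def oste_select(trees, oob_cindex, ensemble_error, X_val, y_val, m_percent=20):
    """Sketch of a two-stage optimal-tree selection (names are illustrative).

    1. Rank trees by individual OOB C-index and keep the top m_percent%.
    2. Greedily add each candidate, keeping it only if the ensemble error
       (e.g. integrated Brier score) on independent data does not increase.
    """
    order = np.argsort(oob_cindex)[::-1]              # best individual trees first
    k = max(1, int(round(len(trees) * m_percent / 100.0)))
    candidates = [trees[i] for i in order[:k]]

    selected = [candidates[0]]                        # start from the best tree
    best_err = ensemble_error(selected, X_val, y_val)
    for tree in candidates[1:]:
        err = ensemble_error(selected + [tree], X_val, y_val)
        if err <= best_err:                           # collective performance check
            selected.append(tree)
            best_err = err
    return selected
```

The greedy check is what allows the final ensemble to shrink well below the initial forest size: a tree that is individually accurate but redundant with already-selected trees is simply skipped.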
From the integrated Brier scores obtained after applying all the methods to the datasets discussed above, average IBS values are calculated. It has been observed from the final results, in terms of boxplots, that OSTE, the proposed method, gives better or comparable results relative to the best of the other methods.
It has also been observed that OSTE reduces the number of survival trees in the resultant ensemble while improving predictive performance. An OSTE consisting of fewer than 20 survival trees is seen to give results comparable to ensembles that use hundreds of survival trees.
Furthermore, OSTE's performance has also been checked for various hyperparameters. In this regard, the effects of the number of features selected at the nodes of the survival trees, the number of trees in the initial ensemble, and the proportion M of top-ranked trees have been assessed. The number of features tried at each node is a tuning parameter of the proposed method and should be fine-tuned for a given dataset. M need not be greater than 20%, as higher values of M show no improvement and only increase the size of the ensemble. Growing a total of 1000 survival trees in the initial set is suggested for best results. The proposed ensemble is implemented in an R package, "OSTE".
For the purposes of internal validation, OSTE leaves out some observations from the training data during bootstrapping; therefore, some information is lost in the learning process, whereas the remaining methods use the whole training set. Thus the performance of OSTE may be negatively affected. Using these out-of-bag observations for internal validation is a direction for future research that could further improve OSTE.
While growing a survival tree, a maximally selected rank statistic Lausen and Schumacher (1992) could be used instead of the log-rank statistic for the selection of split points, as the log-rank test favours variables with many split points.
For the proposed method to work well in high dimensional settings, some stateoftheart feature selection/dimensionality reduction techniques could be used with OSTE.
References
 Three contributions to the encyclopedia of biostatistics: the Nelson-Aalen, Kaplan-Meier, and Aalen-Johansen estimators. Department of Mathematics, University of Oslo. Cited by: §1.
 Classification and regression trees. CRC press. Cited by: §1, §1.
 Heuristics of instability and stabilization in model selection. The annals of statistics 24 (6), pp. 2350–2383. Cited by: §1.
 KMsurv: data sets from klein and moeschberger (1997), survival analysis. Note: R package version 0.15 External Links: Link Cited by: §3.1, Table 1.
 Regression models and life tables (with discussion). Journal of the Royal Statistical Society 34, pp. 187–220. Cited by: §1.
 Basic statistics: a primer for the biomedical sciences. John Wiley & Sons. Cited by: §2.4.
 The efficiency of cox’s likelihood function for censored data. Journal of the American statistical Association 72 (359), pp. 557–565. Cited by: §3.2.
 A neural network model for survival data. Statistics in medicine 14 (1), pp. 73–82. Cited by: §2.4.
 Counting processes and survival analysis. Vol. 169, John Wiley & Sons. Cited by: §3.1.
 Evaluating the yield of medical tests. Jama 247 (18), pp. 2543–2546. Cited by: §2.1.
 Survival ensembles. Biostatistics 7 (3), pp. 355–373. Cited by: §3.2.
 Bagging survival trees. Statistics in medicine 23 (1), pp. 77–91. Cited by: §2.
 On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis 43 (2), pp. 121–137. Cited by: §2.4.
 Random survival forests. The annals of applied statistics, pp. 841–860. Cited by: §2.
 An ensemble of optimal trees for classification and regression (ote). External Links: Link Cited by: §1, §1.
 An ensemble of optimal trees for class membership probability estimation. In Analysis of Large and Complex Data, pp. 395–409. Cited by: §1, §1.
 Survival analysis: techniques for censored and truncated data. Springer Science & Business Media. Cited by: §3.1, §3.1.
 Maximally selected rank statistics. Biometrics, pp. 73–85. Cited by: §2.4, §5.
 Boosting the concordance index for survival data–a unified framework to derive and evaluate biomarker combinations. PloS one 9 (1), pp. e84483. Cited by: §2.1, §2.1.
 Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53 (282), pp. 457–481. Cited by: §1.
 Evaluating random forests for survival analysis using prediction error curves. Journal of Statistical Software 50 (11), pp. 1–23. External Links: Link Cited by: §3.1, Table 1.
 Ipred: improved predictors. Note: R package version 0.95 External Links: Link Cited by: §3.2.
 Correcting for noncompliance and dependent censoring in an aids clinical trial with inverse probability of censoring weighted (ipcw) logrank tests. Biometrics 56 (3), pp. 779–788. Cited by: §2.2.
 Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society: Series A (Statistics in Society) 162 (1), pp. 71–94. Cited by: §3.1.
 A package for survival analysis in s. Note: version 2.38 External Links: Link Cited by: §3.2, Table 1.
 Measures of discrimination and predictive accuracy for interval censored survival data. Ph.D. Thesis, Universiteit Leiden. Cited by: §2.2.
 Unbiased split variable selection for random survival forests using maximally selected rank statistics. Statistics in Medicine. Cited by: §2.4.
 ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77 (1), pp. 1–17. External Links: Document Cited by: §3.2.
 Ensemble methods: foundations and algorithms. CRC press. Cited by: §1.