Experimental comparison of semi-parametric, parametric and machine learning models for time-to-event analysis through the concordance index
In this paper, we make an experimental comparison of semi-parametric (Cox proportional hazards model, Aalen's additive regression model), parametric (Weibull AFT model), and machine learning models (Random Survival Forest, Gradient Boosting with Cox Proportional Hazards Loss, DeepSurv) through the concordance index on two different datasets (PBC and GBCSG2). We present two comparisons: one with the default hyper-parameters of these models and one with the best hyper-parameters found by randomized search.
Time-to-event analysis originated from the idea to predict the time until a certain critical event occurs. For example, in healthcare, the goal is usually to predict the time until a patient with a certain disease dies. Another example is maintenance where the objective is to predict the time until a component fails. There are many other examples that are of interest to time-to-event analysis such as predicting customer churn, predicting the time until a convicted criminal reoffends, etc. One of the main challenges of time-to-event analysis is right censoring, which means that the event of interest has only occurred for a subset of the observations, making the problem different from typical regression problems in machine learning.
In this paper, we use two datasets to perform this analysis. The first contains patients diagnosed with breast cancer (GBCSG2) and the second contains patients diagnosed with primary biliary cirrhosis (PBC). For the first dataset the critical event of interest is the recurrence of cancer, while for the second it is the death of the patient.
In each dataset, every sample has an observed time that is either the survival time or the censoring time. A time is censored when the event of interest (for example, the death of the patient) has not been observed; in that case the observed time corresponds to the last medical record of the patient. The censoring time is therefore a lower bound for the survival time.
The fundamental task of time-to-event analysis is to estimate the probability distribution of time until some event of interest happens.
Consider a covariate/feature vector $X$, a random variable that takes values in the covariate/feature space $\mathcal{X}$. Consider a survival time $T$, a non-negative real-valued random variable. Then, for a feature vector $x \in \mathcal{X}$, our aim is to estimate the conditional survival function:
$$S(t|x) := P(T > t \mid X = x), \tag{1}$$
where $t \geq 0$ is the time and $P$ is the probability function.
In order to estimate the conditional survival function $S(\cdot|x)$, we assume that we have access to $n$ training samples, in which for the $i$-th sample we have: $X_i$, the feature vector; $\delta_i$, the survival time indicator, which indicates whether we observe the survival time or the censoring time; and $Y_i$, which is the survival time if $\delta_i = 1$ and the censoring time otherwise.
Many models have been proposed to estimate the conditional survival function $S(\cdot|x)$. The most standard approaches are the semi-parametric and parametric models, which assume a given structure of the hazard function:
$$h(t|x) := -\frac{\partial}{\partial t} \log S(t|x). \tag{2}$$
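As a reminder of how the hazard and survival functions determine each other, here is a short worked derivation. It assumes the usual definition of the hazard as the negative time-derivative of the log-survival function and $S(0|x) = 1$; the cumulative hazard symbol $H$ is introduced here only for illustration.
$$H(t|x) := \int_0^t h(s|x)\,ds = -\log S(t|x) \quad\Longrightarrow\quad S(t|x) = \exp\big(-H(t|x)\big).$$
For instance, a constant hazard $h(t|x) = c$ gives $S(t|x) = e^{-ct}$, the exponential survival function.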
The concordance index, introduced by Harrell et al. (1996) in [7], is the most widely used performance metric for time-to-event analysis. It measures the fraction of pairs of subjects that are correctly ordered among the pairs that can be ordered. The highest (and best) value that can be obtained is 1, which means that there is complete agreement between the order of the observed and predicted times. The lowest value that can be obtained is 0, which denotes a perfectly wrong model, while a value of 0.5 means that the model is no better than random.
To calculate the concordance index we first take every pair in the test set such that the earlier observed time is not censored. Then we consider only pairs $(i,j)$ such that $i < j$, and we also eliminate the pairs for which the times are tied unless at least one of them has an event indicator value of 1. Next, we compute for each pair $(i,j)$ a score $C_{ij}$ which, for $Y_i \neq Y_j$, is 1 if the subject with the earlier time (between $Y_i$ and $Y_j$) has the higher predicted risk (between $R_i$ and $R_j$), is 0.5 if the risks are tied and 0 otherwise. For $Y_i = Y_j$ and $\delta_i = \delta_j = 1$ we set $C_{ij} = 1$ if the risks are tied and 0.5 otherwise. If only one of $\delta_i$ or $\delta_j$ is 1 we set $C_{ij} = 1$ if the predicted risk is higher for the subject with $\delta = 1$ and 0.5 otherwise.
Finally, we compute the concordance index as follows:
$$C = \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} C_{ij}, \tag{3}$$
where $\mathcal{P}$ represents the set of eligible pairs $(i,j)$.
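The pair rules above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in the experiments; the function name `concordance_index` and its argument names are chosen here for illustration, and the scores 1, 0.5 and 0 follow the usual Harrell convention.

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Concordance index from observed times, event indicators
    (1 = event observed, 0 = censored) and predicted risk scores
    (a higher risk means the event is expected earlier)."""
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] != times[j]:
            # Order the pair so that `a` has the earlier observed time.
            a, b = (i, j) if times[i] < times[j] else (j, i)
            if not events[a]:
                continue  # earlier time is censored: pair cannot be ordered
            if risks[a] > risks[b]:
                score = 1.0
            elif risks[a] == risks[b]:
                score = 0.5
            else:
                score = 0.0
        else:
            if not (events[i] or events[j]):
                continue  # tied times, both censored: pair is not eligible
            if events[i] and events[j]:
                score = 1.0 if risks[i] == risks[j] else 0.5
            else:
                # Exactly one event: that subject should have the higher risk.
                a, b = (i, j) if events[i] else (j, i)
                score = 1.0 if risks[a] > risks[b] else 0.5
        total += score
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("no eligible pairs")
    return total / n_pairs

# Toy data (made up for illustration): 4 eligible pairs, 3 of them concordant.
print(concordance_index([5, 3, 8, 1], [1, 0, 1, 1], [0.4, 0.9, 0.1, 0.7]))  # 0.75
```

The double loop is quadratic in the number of samples, which is acceptable for datasets of a few hundred subjects such as the ones used here.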
The German Breast Cancer Study Group (GBCSG2) dataset, made available by Schumacher et al. (1994) in [14], studies the effects of hormone treatment on recurrence-free survival time. The event of interest is the recurrence of cancer. The dataset has 686 samples and 8 covariates/features: age, estrogen receptor, hormonal therapy, menopausal status (premenopausal or postmenopausal), number of positive nodes, progesterone receptor, tumor grade, and tumor size. At the end of the study, there were 387 patients (56.4%) who were right censored (recurrence-free). In our experiments, we reserve 25% of the dataset as the test set.
The Mayo Clinic Primary Biliary Cirrhosis dataset, made available by Therneau and Grambsch (2000) in [15], studies the effects of the drug D-penicillamine on the survival time. The event of interest is the death of the patient. The dataset has 276 samples and 17 covariates/features: age, serum albumin, alkaline phosphatase, presence of ascites, aspartate aminotransferase, serum bilirubin, serum cholesterol, urine copper, edema, presence of hepatomegaly or enlarged liver, case number, platelet count, standardised blood clotting time, sex, blood vessel malformations in the skin, histologic stage of disease, treatment and triglycerides. At the end of the study, there were 165 patients (59.8%) who were right censored (alive). In our experiments, we reserve 25% of the dataset as the test set.
Cox in [4] proposes a semi-parametric model, also known as the Cox proportional hazards model, to estimate the conditional survival function. This model assumes that the log-hazard of a subject is a linear function of their static covariates/features $x$, and a population-level baseline hazard function $h_0(t)$ that changes over time:
$$h(t|x) = h_0(t) \exp\big(\beta^\top x\big), \tag{4}$$
where $\beta$ is the vector of regression coefficients.
The term ‘proportional hazards’ refers to the assumption that the ratio between the hazards of any two subjects is constant over time: the covariates scale the baseline hazard by a factor that depends on the regression coefficients but not on time. Also, this model is semi-parametric in the sense that the baseline hazard function does not have to be specified, and it can vary freely, allowing a different parameter to be used for each unique survival time.
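The proportionality can be checked with a one-line derivation, sketched here under the assumption of the standard Cox form $h(t|x) = h_0(t)\exp(\beta^\top x)$, where $\beta$ is the coefficient vector. For two subjects with feature vectors $x$ and $x'$:
$$\frac{h(t|x)}{h(t|x')} = \frac{h_0(t)\exp(\beta^\top x)}{h_0(t)\exp(\beta^\top x')} = \exp\big(\beta^\top (x - x')\big).$$
The baseline hazard cancels, so the ratio does not depend on $t$.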
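Aalen’s additive model, proposed by Aalen (1989) in [1], also estimates the hazard function, but instead of being multiplicative like the Cox proportional hazards model, it is additive. The hazard function estimator is the following:
$$h(t|x) = b_0(t) + b_1(t)\,x_1 + \dots + b_d(t)\,x_d, \tag{5}$$
where $x_1, \dots, x_d$ are the covariates/features and $b_0(t), \dots, b_d(t)$ are time-varying coefficients.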
Consider two survival functions, one for each of two different populations, $S_A(t)$ and $S_B(t)$, and an accelerated failure rate $\lambda$ such that $S_A(t) = S_B(t/\lambda)$, where $\lambda$ can be modeled as a function of the covariates/features and describes the stretching out or contraction of the survival time:
$$\lambda(x) = \exp\Big(b_0 + \sum_{i=1}^{d} b_i x_i\Big). \tag{6}$$
Then, we suppose a Weibull form for the survival function, leading us to assume
$$S(t|x) = \exp\Big(-\Big(\frac{t}{\lambda(x)}\Big)^{\rho}\Big), \tag{7}$$
where $\rho$ is an unknown parameter that must be fitted. This model is called the Weibull accelerated failure time model, shortened to Weibull AFT model.
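To make the time-stretching interpretation concrete, here is a minimal sketch that evaluates a Weibull AFT survival function. It assumes the common parametrization $S(t|x) = \exp(-(t/\lambda(x))^{\rho})$ with $\lambda(x) = \exp(b_0 + \sum_i b_i x_i)$; the function name and the coefficient values are hypothetical and are not fitted to any dataset.

```python
import math

def weibull_aft_survival(t, x, b0, b, rho):
    """Survival probability at time t for feature vector x under an
    assumed Weibull AFT parametrization (see the lead-in)."""
    # lambda(x) rescales time: a larger lambda stretches the survival curve.
    lam = math.exp(b0 + sum(b_i * x_i for b_i, x_i in zip(b, x)))
    return math.exp(-((t / lam) ** rho))

# Hypothetical coefficient ln(2): switching the feature from 0 to 1 doubles
# lambda, so the survival at time 2t with x = 1 matches the survival at
# time t with x = 0.
b = [math.log(2.0)]
print(weibull_aft_survival(2.0, [0.0], 0.0, b, 1.5))
print(weibull_aft_survival(4.0, [1.0], 0.0, b, 1.5))
```

The two printed values agree up to floating-point error, which is the acceleration relation between the two populations written out numerically.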
The random survival forest model, proposed by Ishwaran et al. (2008) in [9], is an extension of the random forest model, introduced by Breiman et al. (2001) in [2], that can take censoring into account. Randomness is introduced in two ways: first, we use bootstrap samples of the dataset to grow the trees and second, at each node of a tree, we randomly choose a subset of variables as candidates for the split. The quality of a split is measured by the log-rank splitting rule. Then, we average the results of the trees, which improves accuracy and helps avoid overfitting.
We also consider a random survival forest variation from Chen (2019) in [3]. Each leaf is associated with a different subset of the dataset, to which a Kaplan-Meier survival estimator is applied, so each leaf is associated with a survival function estimate. Then, for a test point $x$, we select all the leaves that $x$ belongs to and average only the results of these leaves to obtain our final estimate.
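The Kaplan-Meier estimate attached to a leaf can be sketched as follows. This is a minimal illustration of the standard product-limit estimator on the samples that fall in one leaf, not the code used in the experiments; the function name is chosen here for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time for the samples of
    one leaf. events[i] is 1 for an observed event and 0 for censoring."""
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for y, d in zip(times, events) if y == t and d)
        leaving = sum(1 for y in times if y == t)
        if deaths:
            # Product-limit update: multiply by the fraction surviving at t.
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Events and censored samples both leave the risk set after t.
        at_risk -= leaving
    return curve

# Toy leaf with one censored sample at time 2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored sample does not lower the curve at time 2, but it does shrink the risk set, which is why the drop at time 3 is from 0.75 to 0.375.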
The idea of gradient boosting originated with Breiman and was later developed by J. H. Friedman (2001) in [6]. Gradient boosting is an additive model that, at each step, adds a new weak learner chosen to minimize a loss function. The model has three main components: the loss function, the weak learner and the additive model. The loss function we aim to minimize is the negative Cox log partial likelihood, as proposed by Ridgeway (1999) in [13]. At each step $m$ we have an estimator $F_m$ and we add an estimator $h_m$, produced by a decision tree, chosen so that it minimizes the loss function. Then, our estimator at stage $m+1$ is
$$F_{m+1}(x) = F_m(x) + h_m(x). \tag{8}$$
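One boosting step with the Cox loss can be sketched in plain Python. The sketch below computes the negative Cox log partial likelihood and its negative gradient with respect to the current scores, which is the target a regression tree would be fitted to in the usual gradient boosting recipe. Several simplifications are assumptions made here: ties are handled naively, the tree is replaced by the exact negative gradient at the training points, and a learning rate (shrinkage), which is a common practical addition, is applied. All function names are illustrative.

```python
import math

def _risk_set_sums(times, f):
    # For each subject i, sum exp(f_j) over subjects j still at risk at Y_i.
    return [sum(math.exp(f[j]) for j, y_j in enumerate(times) if y_j >= y_i)
            for y_i in times]

def cox_loss(times, events, f):
    """Negative Cox log partial likelihood of the scores f."""
    sums = _risk_set_sums(times, f)
    return -sum(f[i] - math.log(sums[i]) for i, d in enumerate(events) if d)

def cox_negative_gradient(times, events, f):
    """Pseudo-responses -dLoss/df_k that the next weak learner would fit."""
    sums = _risk_set_sums(times, f)
    grad = []
    for k, y_k in enumerate(times):
        # Subject k contributes to the risk set of every event at or before Y_k.
        expected = sum(math.exp(f[k]) / sums[i]
                       for i, d in enumerate(events) if d and times[i] <= y_k)
        grad.append(events[k] - expected)
    return grad

def boosting_step(f, update, learning_rate=0.1):
    """Additive update of the scores; `update` stands in for the tree output."""
    return [f_i + learning_rate * u_i for f_i, u_i in zip(f, update)]

# Toy data (made up for illustration).
times, events = [5, 3, 8, 1], [1, 0, 1, 1]
scores = [0.0, 0.0, 0.0, 0.0]
for _ in range(3):
    print(round(cox_loss(times, events, scores), 4))
    scores = boosting_step(scores, cox_negative_gradient(times, events, scores))
```

In a real implementation the update would come from a decision tree fitted to the pseudo-responses, so that the model can also score unseen feature vectors.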
DeepSurv, proposed by Katzman et al. (2018) in [10], is a nonlinear version of the Cox proportional hazards model. Cox proportional hazards is a semi-parametric model that calculates the effects of observed covariates on the risk of an event occurring, and it supposes that this risk is a linear combination of the covariates. However, this linear assumption may be too simplistic and not accurate enough. DeepSurv proposes to use deep neural networks to learn a nonlinear relationship between covariates/features and an individual’s risk of failure. DeepSurv is a multi-layer perceptron that estimates, for each feature vector $x$, the risk function $\hat{h}_\theta(x)$ parametrized by the weights of the network $\theta$. This function plays the same role as the risk function in the Cox proportional hazards model, but in this case it is not assumed to be linear, and it is obtained by minimizing the loss function of the neural network
$$\ell(\theta) := -\frac{1}{N_{\delta=1}} \sum_{i \,:\, \delta_i = 1} \Big( \hat{h}_\theta(X_i) - \log \sum_{j \in \mathcal{R}(Y_i)} e^{\hat{h}_\theta(X_j)} \Big) + \lambda \|\theta\|_2^2, \tag{9}$$
where $\lambda$ is a regularization parameter, $N_{\delta=1}$ is the number of uncensored subjects and $\mathcal{R}(Y_i)$ is the set of subjects at risk at time $Y_i$.
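The structure of this loss can be sketched without a deep learning framework. The code below is a minimal illustration under stated assumptions: the network is reduced to a single hidden layer with ReLU activations, the loss is the averaged negative log partial likelihood over uncensored subjects plus an L2 penalty on a flat list of weights, and no training loop is shown. The function names and the layout of the weights are hypothetical.

```python
import math

def mlp_risk(x, w1, b1, w2, b2):
    """One-hidden-layer perceptron with ReLU; returns a scalar risk score."""
    hidden = [max(0.0, sum(w * x_i for w, x_i in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def deepsurv_loss(times, events, risks, weights, lam):
    """Averaged negative log partial likelihood plus lam * ||weights||^2."""
    n_events = sum(events)
    if n_events == 0:
        raise ValueError("at least one uncensored subject is required")
    total = 0.0
    for y_i, d_i, r_i in zip(times, events, risks):
        if d_i:
            # Risk set: subjects whose observed time is at least Y_i.
            log_risk_set = math.log(sum(math.exp(r_j)
                                        for y_j, r_j in zip(times, risks)
                                        if y_j >= y_i))
            total += r_i - log_risk_set
    return -total / n_events + lam * sum(w * w for w in weights)

# Toy usage with made-up weights and data.
w1, b1, w2, b2 = [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], [2.0, 3.0], 0.5
features = [[1.0, 2.0], [0.5, 0.1], [0.0, 1.0]]
risks = [mlp_risk(x, w1, b1, w2, b2) for x in features]
flat_weights = [w for row in w1 for w in row] + b1 + w2 + [b2]
print(deepsurv_loss([2, 5, 7], [1, 0, 1], risks, flat_weights, lam=0.01))
```

Replacing `mlp_risk` with a linear function of the features recovers the (regularized) Cox partial likelihood, which is the sense in which DeepSurv generalizes the Cox model.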
We compared all the models described for the two datasets through the concordance index. To do this analysis we used Scikit-learn [11], Lifelines [5], Scikit-survival [12], and Matplotlib [8]. For each dataset, the experiment we performed is the following: we choose different seeds for splitting the dataset, which generates different partitions between training and validation sets. Then we run the model once for each partition and make a boxplot of the distribution of the concordance indexes obtained. In the figures, the red lines represent the median of the obtained concordance indexes and the red triangles represent the average.
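The repeated-split protocol can be sketched as follows. This is an illustration using only the Python standard library, not the Scikit-learn based code of the experiments; `split_indices`, `repeated_evaluation` and the callback `fit_and_score` are hypothetical names, and the list of seeds is made up.

```python
import math
import random
import statistics

def split_indices(n_samples, test_fraction, seed):
    """Shuffle the sample indices with a seed and hold out a test fraction."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = math.ceil(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

def repeated_evaluation(n_samples, seeds, fit_and_score, test_fraction=0.25):
    """Run fit_and_score(train_idx, test_idx) once per seed and summarize.

    fit_and_score is a user-supplied callback that trains a model on the
    training indices and returns its concordance index on the test indices.
    """
    scores = [fit_and_score(*split_indices(n_samples, test_fraction, s))
              for s in seeds]
    return {"scores": scores,
            "median": statistics.median(scores),
            "mean": statistics.mean(scores)}

# Toy usage: a dummy callback that only reports the test-set proportion.
summary = repeated_evaluation(686, [0, 1, 2], lambda tr, te: len(te) / 686)
print(summary["median"], summary["mean"])
```

The median and mean returned here correspond to the red line and the red triangle of each boxplot.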
Figure 1 shows the comparison of the concordance indexes for the PBC dataset, where we can see that the random survival forest model fitted with a random search of the hyperparameters outperforms the other models. Figure 2 shows the comparison of the concordance indexes for the GBCSG2 dataset, where random survival forest with adaptive nearest neighbors outperforms the other models.
Furthermore, we can observe that traditional methods performed reasonably well for the small dataset PBC (see Cox proportional hazards with randomized search), but they underperformed against machine learning methods for the larger dataset (GBCSG2). We can also observe that the deep learning method (DeepSurv) did not perform better than the random survival forest model; therefore, the progress made by deep learning in other areas (computer vision, NLP, etc.) has not yet been replicated for time-to-event analysis.
Classical methods for predicting survival time are easier to interpret, and they make it easier to analyze how each covariate/feature influences the model. For the case of the PBC dataset, random survival forest with random search outperforms Cox proportional hazards with random search by less than while in GBCSG2 the method RSF with adaptive nearest neighbors increases the performance by with respect to the randomized search Cox proportional hazards model. Therefore, whether this increment in performance is significant enough to compensate for the loss of interpretability will highly depend on the application.
The work presented in this paper has been partially carried out at LINCS (http://www.lincs.fr).
[1] Aalen, O. O. (1989). A linear regression model for the analysis of life times. Statistics in Medicine, vol. 8, pp. 907–925.