1 Introduction
Time-to-event analysis, also called survival analysis, is needed in many areas. This branch of statistics is heavily used in engineering, economics and finance, insurance, marketing, health and many other application areas. Most of the literature approaches time-to-event analysis by modeling the time until an event of interest occurs, e.g. cardiovascular death after a treatment intervention, tumor recurrence, or failure of an aircraft air system. The event may nevertheless not be observed within the study period and may occur after the last recorded time, which produces so-called right-censored data. The main objective of survival analysis is to identify the relationship between the distribution of the time-to-event and the covariates of the observations, such as the features of a given patient, or the characteristics of an electronic device or a mechanical system together with information about the environment in which it must operate. The Weibull distribution can be used as a lifetime distribution in survival analysis, where the goal is to estimate its parameters while taking the right-censored data into account. Several previous works focused on the estimation of a Weibull distribution with right-censored data (see Bacha and Celeux [1], Ferreira and Silva [5], Shuo-Jye Wu [14], etc.).
Among the first estimators widely used in this field is the Kaplan-Meier estimator [8], which estimates the probability that an event of interest occurs at a given point in time. However, it cannot adjust this probability for covariates, i.e. it does not incorporate the covariates of the observations. The semi-parametric Cox proportional hazards model (CPH) [3] estimates covariate-adjusted survival, but it assumes that the subject's risk is a linear function of their covariates, which may be too simplistic for many real-world data. Since neural networks can learn nonlinear functions, many researchers have tried to model the relationship between the covariates and the time that passes before an event occurs, including the Faraggi-Simon network [4], which uses a simple feed-forward network as the basis for a non-linear proportional hazards model. After that, several works focused on combining neural networks and survival analysis, notably DeepSurv [9], whose architecture is deeper than that of Faraggi and Simon and which minimizes the negative log Cox partial likelihood with a risk that is not necessarily linear. These models use a multilayer perceptron, which can learn nonlinear models but is sensitive to feature scaling (a necessary data preprocessing step) and has limitations with unstructured data (e.g. images). A number of other models approach survival analysis with right-censored data using machine learning, namely Random Survival Forests [7], dependent logistic regressors [15] and Liao's model [11], all of which can incorporate the covariates of individual observations.
This paper proposes a novel approach to survival analysis: we assume that the survival times follow a finite mixture of Weibull distributions (with at least one component), whose parameters depend on the covariates of a given observation, in the presence of right-censored data. Like Luck [12], we propose a deep learning model that learns the survival function, but we do so by estimating the Weibull parameters. DeepHit [10] discretizes time over a predefined maximum time horizon; in contrast, because we estimate the parameters of the distribution, we can model a continuous survival function and thus estimate the risk at any survival time horizon. For this purpose, we construct a deep neural network model under the assumption that the survival times follow a finite mixture of two-parameter Weibull distributions. This model, which we call DeepWeiSurv, estimates the parameters that maximize the likelihood of the distribution. To demonstrate the usefulness of our method, we compare its predictive performance with that of state-of-the-art methods on two real-world datasets. DeepWeiSurv outperforms the previous state-of-the-art methods.
2 Weibull Mixture Distribution for survival analysis
2.1 Survival Analysis with right-censored data
Let $X = \{(x_i, t_i, \delta_i) \mid i \leq n\}$ be a set of $n$ observations, where $x_i \in \mathbb{R}^d$ is the $i$-th observation of the baseline data (covariates), $t_i$ is its associated survival time, and $\delta_i$ indicates whether the observation is censored ($\delta_i = 0$) or not ($\delta_i = 1$). As can be seen in Figure 1, a blue point represents an uncensored observation ($\delta_i = 1$) and a red point represents a censored observation ($\delta_i = 0$). In order to characterize the distribution of the survival times $T$, the aim is to estimate, for each observation, the probability that the event occurs at or after a certain survival time horizon $t_{STH}$, defined by:
$$S(t_{STH} \mid x_i) = P(T \geq t_{STH} \mid x_i)$$
Note that $t_{STH}$ may differ from the censoring threshold time $t_c$. An alternative characterization of the distribution of $T$ is given by the hazard function $\lambda(t)$, defined as the event rate at time $t$ conditional on survival until time $t$ or beyond. It can be expressed as $\lambda(t) = f(t) / S(t)$, $f(t)$ being the density function.
Instead of estimating $S(t_{STH} \mid x_i)$, it is common to estimate the survival time $\hat{t}_i$ directly. In this case, we can measure the quality of the estimates with the concordance index [6], defined as follows:
$$C_{index} = \frac{\sum_{i,j} \mathbb{1}_{t_i > t_j} \cdot \mathbb{1}_{\hat{t}_i > \hat{t}_j} \cdot \delta_j}{\sum_{i,j} \mathbb{1}_{t_i > t_j} \cdot \delta_j} \tag{1}$$
$C_{index}$ measures the proportion of concordant pairs of observations among all the comparable pairs $(t_i, t_j)$, that is, pairs with $t_i > t_j$ whose earlier event is observed ($\delta_j = 1$). It estimates the probability $P(\hat{t}_i > \hat{t}_j \mid t_i > t_j)$, which compares the rankings of two independent pairs of survival times $t_i, t_j$ and associated predictions $\hat{t}_i, \hat{t}_j$.
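As an illustration of the concordance index, the following minimal sketch counts concordant pairs among comparable ones. It assumes the usual Harrell-style convention that a pair is comparable when the shorter of the two survival times corresponds to an observed event; the function name `concordance_index` and the quadratic loop are illustrative choices, not the authors' implementation.

```python
def concordance_index(times, preds, events):
    """Fraction of comparable pairs whose predictions are ranked correctly.

    Assumption: a pair (i, j) is comparable when times[i] > times[j] and the
    earlier observation j is uncensored (events[j] == 1).
    """
    concordant = 0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] > times[j] and events[j] == 1:
                comparable += 1
                # Concordant: the prediction preserves the observed ordering.
                if preds[i] > preds[j]:
                    concordant += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A perfect ranking gives 1, a fully reversed ranking gives 0, and ties in the predictions count as non-concordant under this strict-inequality convention.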
2.2 Weibull distribution for censored data
From now on, we consider that $T$ follows a finite mixture of two-parameter Weibull distributions (possibly reduced to a single Weibull distribution), independently of $x_i$ (i.e. $S(t \mid x_i) = S(t)$). In this case, $S$ and $\lambda$ have analytical expressions in terms of the mixture parameters. This leads to a parameter estimation problem for a mixture of Weibull distributions with right-censored observations.
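For reference, the closed-form survival, density and hazard functions of a two-parameter Weibull distribution can be sketched as below. The parameter names `shape` and `scale` and the function names are illustrative; the formulas are the standard ones, valid for `t > 0`.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, shape, scale):
    """f(t) = (shape / scale) * (t / scale) ** (shape - 1) * S(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)


def weibull_hazard(t, shape, scale):
    """Hazard as the ratio f(t) / S(t)."""
    return weibull_density(t, shape, scale) / weibull_survival(t, shape, scale)
```

With `shape = 1` the hazard is constant (exponential distribution); with `shape > 1` it increases with time.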
2.2.1 Single Weibull case
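Here, we deal with the particular case where $T$ follows a single two-parameter Weibull distribution $W(\beta, \eta)$, with shape parameter $\beta$ and scale parameter $\eta$. We can estimate these parameters by solving the following likelihood optimization problem:
$$(\hat{\beta}, \hat{\eta}) = \arg\max_{\beta, \eta} LL(\beta, \eta \mid X)$$
where:
$$LL(\beta, \eta \mid X) = \sum_{i=1}^{n} \delta_i \log\left[\frac{\beta}{\eta}\left(\frac{t_i}{\eta}\right)^{\beta - 1} \exp\left(-\left(\frac{t_i}{\eta}\right)^{\beta}\right)\right] - \sum_{i=1}^{n} (1 - \delta_i)\left(\frac{t_c}{\eta}\right)^{\beta}$$
and $t_c$ being the censoring threshold time. $LL$ is the log-likelihood of the Weibull distribution with right-censored data. To ensure that $LL$ is concave, we choose to restrict the shape parameter to be greater than 1 ($\beta > 1$).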
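A minimal sketch of the right-censored Weibull log-likelihood follows. It assumes the standard construction in which an uncensored observation contributes its log-density and a censored one contributes its log-survival at the recorded (censoring) time; the function name is hypothetical.

```python
import math


def weibull_censored_loglik(times, events, shape, scale):
    """Log-likelihood of right-censored data under a Weibull(shape, scale).

    times[i]  : observed time (the censoring time when events[i] == 0)
    events[i] : 1 if the event was observed, 0 if right-censored
    """
    ll = 0.0
    for t, d in zip(times, events):
        z = (t / scale) ** shape  # cumulative hazard at t
        if d == 1:
            # log f(t) = log(shape / scale) + (shape - 1) * log(t / scale) - z
            ll += math.log(shape / scale) + (shape - 1) * math.log(t / scale) - z
        else:
            # log S(t) = -z
            ll += -z
    return ll
```

Maximizing this function over `shape` and `scale` (for instance with a generic numerical optimizer) gives the parameter estimates for the single-Weibull case.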
2.2.2 Mixture case
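Now, we suppose that $T$ follows $W_p = [(W(\beta_k, \eta_k))_{k \leq p}, (\alpha_k)_{k \leq p}]$, a mixture of $p$ Weibull distributions with weighting coefficients $\alpha_k$ ($\sum_{k} \alpha_k = 1$, $\alpha_k \geq 0$). The associated density is defined by:
$$f_{W_p}(t) = \sum_{k=1}^{p} \alpha_k f_{\beta_k, \eta_k}(t), \qquad f_{\beta, \eta}(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} \exp\left(-\left(\frac{t}{\eta}\right)^{\beta}\right)$$
Thus, the log-likelihood of $W_p$ can be written as follows:
$$LL(\alpha, \beta, \eta \mid X) = \sum_{i=1}^{n} \delta_i \log\left[\sum_{k=1}^{p} \alpha_k f_{\beta_k, \eta_k}(t_i)\right] + \sum_{i=1}^{n} (1 - \delta_i) \log\left[\sum_{k=1}^{p} \alpha_k \exp\left(-\left(\frac{t_c}{\eta_k}\right)^{\beta_k}\right)\right] \tag{2}$$
In addition to the mixture parameters $(\beta, \eta)$, we need to estimate the weighting coefficients $\alpha$, considered as probabilities. Therefore, we estimate the tuple $(\alpha, \beta, \eta)$ by solving the following problem:
$$(\hat{\alpha}, \hat{\beta}, \hat{\eta}) = \arg\max_{\alpha, \beta, \eta} LL(\alpha, \beta, \eta \mid X)$$
Knowing the formula for the mean of a Weibull distribution, and given that the mean of a mixture is a weighted combination of the means of its component distributions (more precisely, $E[T] = \sum_{k} \alpha_k E[T_k]$), the mean lifetime can be estimated as follows:
$$\hat{\mu} = \sum_{k=1}^{p} \hat{\alpha}_k \, \hat{\eta}_k \, \Gamma\left(1 + \frac{1}{\hat{\beta}_k}\right) \tag{3}$$
where $\Gamma$ is the Gamma function. $\hat{\mu}$ can be used as the survival time estimate for the computation of the concordance index (with $\hat{t}_i = \hat{\mu}$ when the parameters of the distribution are independent of $x_i$).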
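The mean lifetime of a Weibull mixture can be computed directly from the component parameters, using the standard Weibull mean `scale * Gamma(1 + 1 / shape)`. The sketch below is illustrative; the function name and argument order are assumptions.

```python
import math


def weibull_mixture_mean(weights, shapes, scales):
    """Mean of a Weibull mixture: sum_k weight_k * scale_k * Gamma(1 + 1 / shape_k)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        a * s * math.gamma(1.0 + 1.0 / b)
        for a, b, s in zip(weights, shapes, scales)
    )
```

For example, a single component with shape 1 and scale 2 is an exponential distribution with mean 2, which the function reproduces.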
3 Neural network for estimating conditional Weibull mixture
We now consider that the parameters of the Weibull mixture depend on the covariates $x_i$. We propose to use a neural network to model this dependence.
Model description
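We name $g_p$ the function that models the relationship between $x_i$ and the parameters of the conditional Weibull mixture:
$$g_p : x_i \mapsto (\alpha, \beta, \eta)$$
where $\alpha = (\alpha_k)_{k \leq p}$ and $(\beta, \eta) = (\beta_k, \eta_k)_{k \leq p}$. Note that, when $p = 1$, it is no longer necessary to estimate $\alpha$. This function is represented by the network named DeepWeiSurv, described in Figure 2. Hence, our goal is to train the network to learn $g_p$ and thus the vector of parameters $(\beta, \eta)$ that maximises the likelihood of the time-to-event distribution ($\alpha$ as well if $p > 1$). DeepWeiSurv is therefore a multi-task network. It consists of a shared sub-network, a classification sub-network (clf) and a regression sub-network (reg).
The shared sub-network takes as input the baseline data $x$ and computes a latent representation $z$ of the data. When $p > 1$, clf and reg take $z$ as input and produce $\hat{\alpha}$ and $(\hat{\beta}, \hat{\eta})$ respectively. For the reg sub-network, we use ELU (with its constant set to 1) as the activation function of both output layers. We use this function to make sure that there is enough gradient to learn the parameters, because, unlike ReLU, it flattens out only gradually for negative inputs. However, the codomain of ELU is $]-1, +\infty[$, which is problematic given the constraints on the parameters mentioned in the previous section ($\beta > 1$ and $\eta > 0$). To get around this problem, the network learns offset versions of $\beta$ and $\eta$; the offset is then applied in the opposite direction to recover the parameters concerned. For the classification part, we need to learn $\alpha$. To ensure that $\sum_k \alpha_k = 1$ and $\alpha_k \geq 0$, we use a softmax activation in the output layer of clf. For each $x_i$, clf produces $\hat{\alpha} = (\hat{\alpha}_1, \dots, \hat{\alpha}_p)$, where each $\hat{\alpha}_k$ is a probability estimate, whereas reg outputs $\hat{\beta} = (\hat{\beta}_1, \dots, \hat{\beta}_p)$ and $\hat{\eta} = (\hat{\eta}_1, \dots, \hat{\eta}_p)$. Otherwise, i.e. when $p = 1$, we have $\alpha = 1$, so we do not need to train clf.
To train DeepWeiSurv, we minimize the following loss function:
$$Loss = -LL(\hat{\alpha}, \hat{\beta}, \hat{\eta} \mid X) = -\langle \delta, L_1 \rangle - \langle 1 - \delta, L_2 \rangle$$
where $\delta$ is the vector of event indicators and $L_1 = (L_1^{(i)})_{i \leq n}$, $L_2 = (L_2^{(i)})_{i \leq n}$, with:
$$L_1^{(i)} = \log\left[\sum_{k=1}^{p} \hat{\alpha}_k \frac{\hat{\beta}_k}{\hat{\eta}_k}\left(\frac{t_i}{\hat{\eta}_k}\right)^{\hat{\beta}_k - 1} \exp\left(-\left(\frac{t_i}{\hat{\eta}_k}\right)^{\hat{\beta}_k}\right)\right]$$
and
$$L_2^{(i)} = \log\left[\sum_{k=1}^{p} \hat{\alpha}_k \exp\left(-\left(\frac{t_c}{\hat{\eta}_k}\right)^{\hat{\beta}_k}\right)\right]$$
the estimates $\hat{\alpha}_k$, $\hat{\beta}_k$, $\hat{\eta}_k$ being evaluated at $x_i$. $L_1$ exploits uncensored data, whereas $L_2$ exploits censored observations by using the knowledge that the event will occur after the given censoring threshold time $t_c$. Figure 3 illustrates the computational graph of our training loss: the inputs are the covariates $x$, the observed values of the time and the event indicator, and the outputs are the estimates $(\hat{\alpha}, \hat{\beta}, \hat{\eta})$.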
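The training objective described above can be sketched as a negative log-likelihood with per-observation mixture parameters. This plain-Python version is only meant to make the two contributions explicit (log-density for uncensored observations, log-survival for censored ones); the function name and the list-of-lists layout are assumptions, and a practical implementation would work on tensors and use a log-sum-exp formulation for numerical stability.

```python
import math


def mixture_nll(times, events, weights, shapes, scales):
    """Negative log-likelihood of right-censored data under Weibull mixtures.

    times[i], events[i]                    : observed time and event indicator
    weights[i], shapes[i], scales[i] (lists): mixture parameters predicted for
                                             observation i (one entry per component)
    """
    total = 0.0
    for t, d, a, b, e in zip(times, events, weights, shapes, scales):
        # Survival of each component at the observed (or censoring) time.
        surv = [math.exp(-((t / ek) ** bk)) for bk, ek in zip(b, e)]
        if d == 1:
            # Uncensored: log of the mixture density at t.
            dens = sum(
                ak * (bk / ek) * (t / ek) ** (bk - 1) * sk
                for ak, bk, ek, sk in zip(a, b, e, surv)
            )
            total -= math.log(dens)
        else:
            # Censored: log of the mixture survival at the censoring time.
            total -= math.log(sum(ak * sk for ak, sk in zip(a, surv)))
    return total
```

A mixture whose components are identical gives the same value as the corresponding single Weibull distribution, which is a convenient sanity check.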
Experiment on SYNTHETIC dataset
The main objective of this section is to validate DeepWeiSurv mathematically, that is, to show that it is able to estimate the parameters. For this purpose, we perform an experiment on simulated data. We treat the case of a single Weibull distribution ($p = 1$) and of a mixture of two Weibull distributions ($p = 2$), using three different types of functions of the covariate: linear, quadratic and cubic. For each function we generate the corresponding survival times and event indicators. We compare the predicted likelihood with the real, optimal one; the two are equal when the estimated parameters correspond to the real ones. Let $x$ be a vector of 10000 observations generated from a uniform distribution. We select 50% of the observations to be right-censored at the median of the survival times ($\delta_i = 0$ if $t_i$ is greater than or equal to the median). The parameters are set to be linear, quadratic and cubic functions of $x$.
The bar plot in Figure 4 displays, for each distribution, the predicted likelihood and the real one. In every case the real and predicted values are very close to each other, which means that the model identifies the parameters of the conditional distributions very precisely. We now test DeepWeiSurv on real-world datasets.
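The simulation protocol can be sketched as follows: draw a covariate uniformly, sample a Weibull survival time whose parameters depend on it, then right-censor at the median. The uniform range `[0, 1]`, the parameter functions passed as `shape_fn` and `scale_fn`, and all function names are hypothetical placeholders, since the exact settings are not given here.

```python
import math
import random
import statistics


def sample_weibull(shape, scale, rng):
    """Inverse-transform sampling: T = scale * (-log(1 - U)) ** (1 / shape)."""
    u = rng.random()
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def make_synthetic(n, shape_fn, scale_fn, seed=0):
    """Simulate covariates, censored times and event indicators.

    shape_fn and scale_fn map a covariate x to the Weibull parameters
    (hypothetical choices made by the caller).
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]  # assumed range
    raw = [sample_weibull(shape_fn(x), scale_fn(x), rng) for x in xs]
    t_c = statistics.median(raw)  # censoring threshold at the median
    times = [min(t, t_c) for t in raw]
    events = [1 if t < t_c else 0 for t in raw]
    return xs, times, events, t_c
```

For instance, `make_synthetic(10000, lambda x: 1.0 + 2.0 * x, lambda x: 1.0 + x)` would produce a dataset with linear parameter functions in which half of the observations are censored.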
4 Experiments
We perform two sets of experiments based on real survival data: METABRIC and SEER. We briefly describe the datasets below; Table 1 gives an overview of some descriptive statistics of both real-world datasets. We train DeepWeiSurv on these datasets and compare its predictive performance with that of CPH [3], the most widely used model in the medical field, and DeepHit [10], which appears to outperform previous methods. These models are tested under the same experimental protocol as DeepWeiSurv.
4.0.1 Metabric
The METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset comes from a Canada-UK project that aims to classify breast tumours into further subcategories. It contains gene expression profiles and clinical features used for this purpose. The data cover 1981 patients, of whom 44.8% died during the study and 55.2% were right-censored. We used 21 clinical variables including tumor size, age at diagnosis, Progesterone Receptor (PR) status, etc. (see Bilal et al. [2]).
4.0.2 Seer
The Surveillance, Epidemiology, and End Results (SEER, https://seer.cancer.gov) Program [13] provides information on cancer statistics for 1975-2016. We focused on the patients (33387 in total) recorded between 1998 and 2002 who died from breast cancer (BC, 42.8%) or heart disease (HD, 49.6%), or who were right-censored (57.2% and 50.4% respectively). We extracted 30 covariates including gender, race, tumor size, number of malignant or benign tumors, Estrogen Receptor (ER) status, PR status, etc. For evaluation, we separated the data into two datasets according to the cause of death (BC and HD), keeping the censored patients in both of them.
Datasets   No. uncensored   No. censored    No. qualitative features   No. quantitative features
METABRIC   888 (44.8%)      1093 (55.2%)    15                         6
SEER BC    9152 (42.8%)     12221 (57.2%)   23                         11
SEER HD    12014 (49.6%)    12221 (50.4%)   23                         11
Network Configuration
DeepWeiSurv consists of three blocks. The shared sub-network is a 4-layer network made of three fully connected layers (128, 64 and 32 nodes respectively) and one batch normalization layer. The second and third blocks (reg and clf respectively) each consist of two fully connected layers (16 and 8 nodes) and one batch normalization layer. In addition, the network ends with one softmax layer and two ELU layers as outputs. The hidden layers use the ReLU activation function. DeepWeiSurv is trained with the Adam optimizer and implemented in PyTorch.
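The configuration above could be laid out in PyTorch roughly as follows. This is a sketch under several assumptions that the text does not settle: the position of each batch normalization layer, the use of one linear output layer per parameter head, and the offsets `+2` and `+1` applied after ELU so that the shape stays above 1 and the scale above 0. The class name `DeepWeiSurvSketch` is hypothetical.

```python
import torch
from torch import nn


class DeepWeiSurvSketch(nn.Module):
    """Shared trunk + regression (shape, scale) and classification (weights) heads."""

    def __init__(self, n_features, p=2):
        super().__init__()
        # Shared block: 3 fully connected layers (128, 64, 32) + 1 batch norm.
        # Placing the batch norm after the first layer is an assumption.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.reg = self._branch()
        self.clf = self._branch()
        self.beta_head = nn.Linear(8, p)   # shape parameters
        self.eta_head = nn.Linear(8, p)    # scale parameters
        self.alpha_head = nn.Linear(8, p)  # mixture weights

    @staticmethod
    def _branch():
        # 2 fully connected layers (16, 8) + 1 batch norm (position assumed).
        return nn.Sequential(
            nn.Linear(32, 16), nn.ReLU(),
            nn.BatchNorm1d(16),
            nn.Linear(16, 8), nn.ReLU(),
        )

    def forward(self, x):
        z = self.shared(x)
        r = self.reg(z)
        # ELU maps to (-1, +inf); the offsets below are assumed values.
        beta = nn.functional.elu(self.beta_head(r)) + 2.0
        eta = nn.functional.elu(self.eta_head(r)) + 1.0
        alpha = torch.softmax(self.alpha_head(self.clf(z)), dim=1)
        return alpha, beta, eta
```

Training would then pair this module with `torch.optim.Adam(model.parameters())` and a censored negative log-likelihood loss; with `p=1` the softmax output is identically 1, so the classification branch carries no information.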
4.0.3 Experimental Protocol
We applied 5-fold cross-validation: the data is randomly split into a training set (80% of the data, of which 20% is reserved for validation) and a test set (20%). We use the predicted parameter values to calculate the mean lifetime $\hat{\mu}$ and then the $C_{index}$ defined by equation (1). The latter is calculated on the validation set. We tested DeepWeiSurv with $p = 1$ and $p = 2$ (higher values of $p$ did not improve performance).
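One way to realize this splitting scheme is sketched below: each fold serves once as the 20% test set, and 20% of the remaining observations are held out for validation. The function name, the interleaved fold construction and the fixed seed are illustrative assumptions.

```python
import random


def five_fold_splits(n, n_folds=5, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index lists for each fold."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Interleaved slicing gives folds whose sizes differ by at most one.
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_val = int(round(val_fraction * len(rest)))
        yield rest[n_val:], rest[:n_val], test
```

Every observation appears in exactly one test fold, and the three index lists of a given split are disjoint.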
4.0.4 Results
Algorithms            METABRIC           SEER BC            SEER HD
CPH                   0.658              0.833              0.784
                      (0.646 - 0.671)    (0.829 - 0.838)    (0.779 - 0.788)
DeepHit               0.651              0.875              0.846
                      (0.641 - 0.661)    (0.867 - 0.883)    (0.842 - 0.851)
DeepWeiSurv (p = 1)   0.805              0.877              0.857
                      (0.782 - 0.829)    (0.864 - 0.891)    (0.85 - 0.866)
DeepWeiSurv (p = 2)   0.819              0.908              0.863
                      (0.812 - 0.837)    (0.906 - 0.909)    (0.86 - 0.868)
Table 2: $C_{index}$ performance tested on METABRIC and SEER (mean and 95% confidence interval)
Table 2 displays the results of the experiments on the SEER and METABRIC datasets. For METABRIC, the performance of DeepWeiSurv far exceeds that of DeepHit and CPH. For the SEER data, DeepWeiSurv with $p = 1$ outperforms CPH (in both the BC and HD cases) and shows a slight improvement over DeepHit, especially on SEER HD, but the difference is not significant (their confidence intervals overlap). However, the improvement of DeepWeiSurv with $p = 2$ over CPH and DeepHit is statistically significant on all three datasets (the confidence intervals do not overlap). We suspect that the good performance of DeepWeiSurv comes from its ability to learn implicitly the relationship between the covariates and the parameters without making any restrictive assumption.
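The comparison of confidence intervals in Table 2 can be checked mechanically. The sketch below copies the 95% intervals from the table and lists the datasets on which two methods have non-overlapping intervals; treating non-overlap as evidence of a significant difference is an informal heuristic, not a formal test, and the helper names are hypothetical.

```python
def intervals_overlap(a, b):
    """True when the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% confidence intervals copied from Table 2.
CI = {
    "CPH": {"METABRIC": (0.646, 0.671), "SEER BC": (0.829, 0.838), "SEER HD": (0.779, 0.788)},
    "DeepHit": {"METABRIC": (0.641, 0.661), "SEER BC": (0.867, 0.883), "SEER HD": (0.842, 0.851)},
    "DeepWeiSurv p=1": {"METABRIC": (0.782, 0.829), "SEER BC": (0.864, 0.891), "SEER HD": (0.85, 0.866)},
    "DeepWeiSurv p=2": {"METABRIC": (0.812, 0.837), "SEER BC": (0.906, 0.909), "SEER HD": (0.86, 0.868)},
}


def separated_from(model, baseline):
    """Datasets on which the two models' intervals do not overlap."""
    return [
        name for name in CI[model]
        if not intervals_overlap(CI[model][name], CI[baseline][name])
    ]
```

With these numbers, the single-Weibull model is separated from DeepHit only on METABRIC, while the two-component model is separated from both CPH and DeepHit on all three datasets.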
4.0.5 Censoring threshold sensitivity
In the previous experiments the survival time horizon and the censoring threshold coincide, but this is not always the case. Since DeepWeiSurv predicts the conditional Weibull distributions with respect to the covariates, it is able to consider any survival time horizon given a censoring threshold. We add another experiment on the METABRIC dataset (chosen because it is small compared with the SEER dataset, which avoids long computations), where we assess the performance of DeepWeiSurv with respect to the censoring threshold time $t_c$. The aim of this experiment is to check whether DeepWeiSurv can handle data in a highly censored setting for different survival time horizons. For this purpose, we apply the same experimental protocol as before, but change the censoring threshold. We do this for several values of $t_c$ far below the one used in the previous experiment. These values, chosen as quantiles of the vector of survival times, are carefully selected so that each one adds a significant portion of censored observations compared with the adjacent value that precedes it. Since an observation may change from censored to uncensored status (and vice versa) when the censoring threshold changes, each value of $t_c$ yields a new set of observed events $\{\delta_i \text{ if } t_i < t_c, \text{ else } 0\}$ (i.e. comparable events, which contribute to the calculation of $C_{index}$). The training set, as selected, itself contains censored observations. Table 3 gives the number of censored and uncensored observations for each selected value of $t_c$. For each value of $t_c$, we apply 5-fold cross-validation and then calculate the average $C_{index}$ for every survival time horizon $t_{STH}$. The results are displayed in Figure 5.
No. censored   No. uncensored   Added portion of censored observations (w.r.t. the previous experiment)
1026           558              160
1127           457              261
1248           336              382
1338           246              472
Each curve in Figure 5 represents the $C_{index}$ scores calculated for a given censoring threshold at different survival time horizons (x-axis). The average score decreases when $t_c$ decreases, which is expected: there are fewer and fewer uncensored observations, so it becomes more and more difficult to model the distribution of survival times. However, DeepWeiSurv still performs well in the highly censored setting.
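Lowering the censoring threshold amounts to truncating the observed times and resetting the event indicators. The sketch below shows one way to do this, with the threshold taken as a quantile of the survival times; the helper names and the linear-interpolation quantile are illustrative assumptions.

```python
def quantile(values, q):
    """Quantile of a list using linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def recensor(times, events, t_c):
    """Apply a new censoring threshold t_c.

    Observations with a time at or beyond t_c become censored at t_c;
    earlier observations keep their original event indicator.
    """
    new_times = [min(t, t_c) for t in times]
    new_events = [d if t < t_c else 0 for t, d in zip(times, events)]
    return new_times, new_events
```

For example, `recensor(times, events, quantile(times, 0.25))` censors every observation whose time reaches the first quartile, which sharply increases the share of censored data.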
5 Conclusion
In this paper, we described a new approach to survival analysis, DeepWeiSurv. The key role of DeepWeiSurv is to predict the parameters of a mixture of Weibull distributions with respect to the covariates in the presence of right-censored data. In addition to the fact that Weibull distributions are known to be a good representation for this kind of problem, this approach makes it possible to consider any survival time horizon given a censoring threshold. Experiments on simulated data show that DeepWeiSurv converges to the real parameters when the survival times follow a mixture of Weibull distributions whose parameters are a simple function of the covariates. On real datasets, DeepWeiSurv clearly outperforms the state-of-the-art approaches and demonstrates its ability to consider any survival time horizon.
References
[1] Bacha and Celeux (1996) Bayesian estimation of a Weibull distribution in a highly censored and small sample setting. Ph.D. Thesis, INRIA.
[2] Bilal et al. (2013) Improving breast cancer survival analysis through competition-based multidimensional modeling. PLoS Computational Biology 9 (5), pp. e1003047.
[3] Cox (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34, pp. 187–220.
[4] Faraggi and Simon (1995) A neural network model for survival data. Statistics in Medicine 14 (1), pp. 73–82.
[5] Ferreira and Silva (2017) Parameter estimation for Weibull distribution with right censored data using EM algorithm. Eksploatacja i Niezawodność 19 (2).
[6] Harrell et al. (1982) Evaluating the yield of medical tests. JAMA 247 (18), pp. 2543–2546.
[7] Ishwaran et al. (2008) Random survival forests. The Annals of Applied Statistics 2 (3), pp. 841–860.
[8] Kaplan and Meier (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53 (282), pp. 457–481.
[9] Katzman et al. (2016) Deep survival: a deep Cox proportional hazards network. stat 1050, pp. 2.
[10] Lee et al. (2018) DeepHit: a deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence.
[11] Liao and Ahn (2016) Combining deep learning and survival analysis for asset health management. International Journal of Prognostics and Health Management.
[12] Luck et al. (2017) Deep learning for patient-specific kidney graft survival analysis. arXiv preprint arXiv:1705.10245.
[13] Surveillance, Epidemiology, and End Results (SEER) Program (2019) (www.seer.cancer.gov) SEER*Stat Database: Incidence - SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2018 Sub (1975-2016 varying) - Linked To County Attributes - Total U.S., 1969-2017 Counties, released April 2019, based on the November 2018 submission.
[14] Wu (2002) Estimations of the parameters of the Weibull distribution with progressively censored data. Journal of the Japan Statistical Society 32 (2), pp. 155–163.
[15] Yu et al. (2011) Learning patient-specific cancer survival distributions as a sequence of dependent regressors. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 1845–1853.