Deep Survival Machines: Fully Parametric Survival Regression and Representation Learning for Censored Data with Competing Risks

03/02/2020 · by Chirag Nagpal, et al.

We describe a new approach to estimating relative risks in time-to-event prediction problems with censored data in a fully parametric manner. Our approach does not require the strong assumption of a constant baseline hazard for the underlying survival distribution, as required by the Cox proportional hazards model. By jointly learning deep nonlinear representations of the input covariates, we demonstrate the benefits of our approach for estimating survival risks through extensive experimentation on multiple real-world datasets with different levels of censoring. We further demonstrate the advantages of our model in the competing risks scenario. To the best of our knowledge, this is the first work involving fully parametric estimation of survival times with competing risks in the presence of censoring.




1 Introduction

Survival regression is a field of statistics and machine learning that deals with the estimation of a survival function representing the probability of an event of interest, typically a failure, occurring beyond a certain time in the future. Survival regression models time-to-event by estimating the survival function S(t|x) = P(T > t | X = x), conditional on the input covariates x. Examples include estimating the survival times of patients after a certain treatment using clinical variables, or predicting the failure times of machines from their usage histories. Survival regression differs from standard regression due to censoring of the data, i.e., observation of some subjects stops before the event of interest occurs. In practical settings there may be multiple different events that can lead to failure; this generalized setting is known as the competing risks scenario.

Classical statistical learning techniques for survival regression rely on non-parametric or semi-parametric methods for survival function estimation, primarily because these make working with censored data relatively straightforward. However, non-parametric methods may suffer from the curse of dimensionality, and semi-parametric approaches usually depend on strong modelling assumptions. In particular, the prevailing assumption of constant proportional hazard over a lifetime, as proposed by Cox (1972) in the Proportional Hazards model, is very likely to be unrealistic in many practical scenarios encountered in healthcare, predictive maintenance, econometrics, or operations research. This and similar assumptions have recently attracted much controversy.

In this paper, we propose Deep Survival Machines, a novel approach to estimating time-to-event in the presence of censoring. By leveraging a hierarchical graphical model parameterized by neural networks, we learn distributional representations of the input covariates and mitigate existing challenges in survival regression.

Our main contributions can be summarized as follows:

  1. Our approach estimates the conditional survival function as a mixture of individual parametric survival distributions.

  2. We do not make strong assumptions of proportional hazards and enable learning with time-varying risks.

  3. Finally, our approach allows for learning of rich, distributed representations of the input covariates, helping knowledge transfer across multiple competing risks.

Through extensive experimentation on multiple datasets, we demonstrate the superiority of our approach in both the single event and competing risks scenarios as compared to classic survival analysis techniques as well as more modern competitive baselines.

2 Related Work

The Cox proportional hazards regression model (CPH) is a popular choice for survival regression. In the Cox model, the estimator of the survival function conditional on the covariates x is assumed to have constant proportional hazard; thus, the relative hazard between any two individuals is constant across time. Another way of stating this assumption is that if an individual is at a higher risk of death at a certain time than another individual, then the relative risk associated with that individual remains higher at any time in their lifetime. This is a very strong assumption, which may not hold in many practical scenarios where risks are time-varying.

A significant amount of recent research has gone into improving the Cox model. Researchers have tried to incorporate structural sparsity, regularization, and active and multitask learning when available data is scarce (Vinzamuri et al., 2014; Vinzamuri & Reddy, 2013; Li et al., 2016). Other efforts have involved incorporating non-linear interactions between the covariates in the original Cox model. Rosen & Tanner (1999) proposed using a mixture of linear experts for the original Cox model. Nagpal et al. (2019) recently improved this approach with a variational inference based objective and demonstrated state-of-the-art results. Other approaches for incorporating non-linearities have replaced the linear interaction terms in the Cox model with deep neural networks, as explored first by Faraggi & Simon (1995), followed by Xiang et al. (2000) and again recently by Katzman et al. (2018) with the DeepSurv approach. Extensions to this work have involved convolutional neural networks and active learning for healthcare applications in oncology (Mobadersany et al., 2018; Nezhad et al., 2019). However, these approaches are still subject to the same strong assumption of proportional hazards as the original Cox model.

More recently, Lee et al. (2018; 2019a) proposed a deep learning approach, DeepHit, to model survival outcomes in the competing risks scenario. Their approach is similar to ours in that they also aim to learn a fully parametric model; however, their architecture only allows prediction of failure times over a discrete set of fixed size. This has a major drawback: for problems with long survival horizons, accurate prediction of actual failure times would require a very large discrete output space, resulting in an extremely large number of parameters to be learnt and making parameter inference intractable. Another drawback of this approach is that its performance is sensitive to events at shorter horizons and it does not model long event horizons well. To mitigate this, Lee et al. (2019b) proposed using black-box optimization to adaptively select the best model from a large ensemble for a given event horizon. In this paper we explicitly demonstrate robust performance of our model at different quantiles of event times with varying amounts of censoring.

Recent research also includes Deep Survival Analysis proposed by Ranganath et al. (2016), which models survival problems with deep exponential families and aligns all observations by their failure time; Chapfuwa et al. (2018) proposed to use adversarial training methods by adapting a conditional GAN (Mirza & Osindero, 2014) to survival regression problems. However, these approaches do not consider competing risks scenarios.

In addition to these approaches, non-parametric methods have also been popular for survival estimation. These methods include improvements over the Kaplan-Meier (KM) estimator Kaplan & Meier (1958) by fitting a KM Estimator in a small neighbourhood around an individual observation to accommodate conditioning. Chen (2019) recently presented non-asymptotic error bounds with strong consistency results for these methods, and found that the use of forest ensembles for building conditional estimators of the survival function (Ishwaran et al., 2008) is an appropriate choice of kernel for such methods. Yet more recent approaches have involved Gaussian Processes (Alaa & van der Schaar, 2017) with a similar intuition in the competing risks scenario.

Figure 1: The proposed Deep Survival Machines pipeline. The input features are passed through a deep multilayer perceptron followed by a softmax that determines the mixture weights. The conditional distribution of the event time is then described as a mixture of primitive distributions drawn from some prior.

Existing literature on survival regression can thus be divided into two groups: 1) semi-parametric approaches involving fitting proportional hazards (Coxian models), and 2) non-parametric models requiring some notion of similarity or kernel between individuals. To the best of our knowledge, the proposed approach is the first fully-parametric method for survival regression in the presence of competing risks.

3 Approach: Deep Survival Machines

In this section we describe the Deep Survival Machines (DSM) architecture and inference in further detail. Fig. 1 is a visual representation of our approach, while Fig. 2 describes the model in plate notation.

3.1 Survival Data

We assume that the survival data we have access to is right-censored. This implies that our data is a set of tuples (x_i, t_i, delta_i), where x_i are the features associated with individual i, t_i is the time at which the event of interest took place or the censoring time, and delta_i is an indicator of whether t_i is an event time or a censoring time. For a given individual, we observe either the actual failure time or the censoring time, but never both. For simplicity it is assumed that the true data-generating process is such that the censoring process is independent of the actual time to failure. We denote the uncensored subset of the data as D_U and the censored subset as D_C.

3.2 Primitive Distributions

We choose to model the conditional distribution of the event time as a mixture over well-defined parametric distributions, which we refer to as Primitive distributions for the remainder of this paper. Given that we are modelling survival times, a natural requirement is that these primitive distributions have support only on the positive reals. Another property of interest is a closed-form expression for the CDF; this enables the use of gradient-based optimization for maximum likelihood estimation.
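To make the closed-form requirement concrete, the Weibull survival function and density can be written directly, so the log-likelihood is differentiable in the parameters without any numerical integration. A minimal stdlib sketch (parameter names `shape`, `scale` are assumptions, not the paper's notation):

```python
import math

def weibull_log_survival(t, shape, scale):
    # log S(t) = -(t / scale) ** shape  -- closed form, no numerical integration
    return -((t / scale) ** shape)

def weibull_log_pdf(t, shape, scale):
    # log f(t) = log(shape / scale) + (shape - 1) * log(t / scale) - (t / scale) ** shape
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape
```

With shape = 1 the Weibull reduces to the exponential distribution, which provides a quick sanity check for any implementation.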

Distribution  PDF f(t)                                        Survival S(t) = 1 − CDF
Weibull       (k/b) (t/b)^(k−1) exp(−(t/b)^k)                 exp(−(t/b)^k)
Log-Normal    1 / (t σ √(2π)) · exp(−(ln t − μ)² / (2σ²))     (1/2) erfc((ln t − μ) / (σ √2))
Table 1: Distributional choices for the Primitive distributions (standard parameterizations: shape k and scale b for the Weibull; location μ and scale σ for the Log-Normal).

For DSM we experiment with two distributions that satisfy these properties, the Weibull and the Log-Normal. The Weibull has a closed-form PDF and CDF. For the Log-Normal, we compute the CDF using the standard approximation of the complementary error function erfc in PyTorch. The full functional forms of the distributions are listed in Table 1. We parameterize the shape and scale parameters of each primitive distribution as functions of the representation produced by a multilayer perceptron over the input covariates x, where the activation function is SELU for the Weibull and Tanh for the Log-Normal. The parameters of the MLP and of the primitive distributions are all learnt during training, as is a further set of parameters that determines the mixture weights for each data point. The following Section 3.3 introduces the proposed model in plate notation (Fig. 2) and the corresponding generative story.
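The erfc route for the Log-Normal survival function mentioned above can be sketched as follows; PyTorch exposes the same function as torch.erfc, while the stdlib version below keeps the example dependency-free (parameter names `mu`, `sigma` assumed):

```python
import math

def lognormal_log_survival(t, mu, sigma):
    # S(t) = P(T > t) = 0.5 * erfc((ln t - mu) / (sigma * sqrt(2)))
    # Note: for large z this underflows; a stable training implementation
    # would switch to a log-erfc asymptotic expansion in the tail.
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return math.log(0.5 * math.erfc(z))
```

At the distribution's median, t = exp(mu), the survival probability is exactly 0.5, which makes a convenient unit test.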


3.3 The Generative Story


Figure 2: Deep Survival Machines in Plate Notation

  1. We draw the covariates of the individual, x.

  2. The parameters of the neural network are drawn from a zero-mean Gaussian distribution.

  3. Conditioned on the covariates and the parameters, we draw the latent mixture assignment.

  4. The shape and scale parameters of the selected primitive distribution are drawn from their respective priors.

  5. Finally, the event time is drawn conditioned on the latent assignment and the primitive's parameters.

3.4 Parameter Estimation

In order to accommodate heterogeneity arising in the data, we propose to model the survival distribution of each individual as a fixed-size mixture of survival distribution primitives. At test time, the survival function of a held-out individual is described as a weighted mixture of the survival distribution primitives, where the weights are a softmax over the output of a deep neural network. At training time, the parameters of the deep neural network and of the survival distribution primitives are learnt jointly.
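A minimal sketch of this mixture construction with Weibull primitives (scalar, stdlib-only; a real implementation would batch this in PyTorch, and all names here are illustrative):

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_survival(t, logits, shapes, scales):
    # S(t | x) = sum_k w_k(x) * S_k(t), with w = softmax(network output for x)
    weights = softmax(logits)
    return sum(w * math.exp(-((t / sc) ** sh))
               for w, sh, sc in zip(weights, shapes, scales))
```

Because each Weibull survival term is a valid survival function and the weights sum to one, the mixture itself is a valid survival function: it equals 1 at t = 0 and decreases monotonically.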

Uncensored Loss. We consider the maximum likelihood estimator for the uncensored data, which can be written as

Censoring Loss. Proceeding as above, we can write a lower bound on the likelihood of the censored observations as
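Under the mixture notation above — writing $w_k(x)$ for the softmax mixture weights and $f_k$, $S_k$ for the PDF and survival function of the $k$-th primitive (notation assumed, not taken from this excerpt) — the two terms can be sketched via Jensen's inequality as:

```latex
\mathcal{L}_{\text{uncensored}}
  = \sum_{i \in \mathcal{D}_U} \ln \sum_{k=1}^{K} w_k(x_i)\, f_k(t_i)
  \;\ge\; \sum_{i \in \mathcal{D}_U} \sum_{k=1}^{K} w_k(x_i)\, \ln f_k(t_i),
\qquad
\mathcal{L}_{\text{censored}}
  \;\ge\; \sum_{i \in \mathcal{D}_C} \sum_{k=1}^{K} w_k(x_i)\, \ln S_k(t_i).
```

The lower bounds move the logarithm inside the mixture sum, decoupling the primitives and simplifying gradient-based optimization.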

Mitigating Long Tail Bias. Survival distributions with positive support typically have long tails, which adds bias when performing maximum likelihood estimation. Note that for the censored instances of the data we are maximizing the probability that survival exceeds the censoring time. One reasonable way of adjusting for this bias is to instead maximize the probability that the event time lies between the censoring time and some arbitrarily large value that can be tuned as a hyper-parameter. However, for simplicity we choose to directly discount the censoring loss by multiplying it with a discounting factor, which has a similar effect of diminishing the bias arising from long tails.

Prior Loss. We additionally include a term penalizing deviation of the primitive distributions' parameters from their prior.

Combined Loss. We finally combine the individual losses described above into a single objective. A scalar hyperparameter trades off the contribution of the regression loss vis-à-vis the evidence lower bound of the uncensored observations in the combined objective function. For a complete formulation of the loss function in terms of functions and parameters, please refer to the Appendix.


Dataset    Type             Dataset Dim.  Feature Dim.  No. Events                                        No. Censoring
SUPPORT    Single Risk      9,105         30            6,201 (68.1%)                                     2,904 (31.9%)
METABRIC   Single Risk      1,904         9             1,103 (57.9%)                                     801 (42.1%)
SYNTHETIC  Competing Risks  30,000        12            Event 1: 7,600 (25.3%); Event 2: 7,400 (24.7%)    15,000 (50.0%)
SEER       Competing Risks  65,481        21            BC: 13,564 (20.7%); CVD: 4,245 (6.5%)             47,672 (72.8%)
Table 2: Descriptive statistics of the datasets used in the experiments.

3.5 Handling Multiple Competing Risks

We adapt Deep Survival Machines to scenarios with multiple competing risks by learning a common representation: the covariates are passed through a single MLP (Fig. 1), and the resulting representation then interacts with a separate set of mixture and primitive-distribution parameters for each competing risk. Maximum likelihood estimation is performed by treating the occurrence of a competing event as a form of independent censoring for the other event. This strategy allows the model to leverage knowledge across the competing tasks through parameter sharing in a single intermediate representation.
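The shared-representation idea can be sketched as below; the layer sizes, initialization, and class/method names are illustrative assumptions, not the paper's architecture:

```python
import random

def linear(x, W):
    # y = W @ x, written out for a dependency-free sketch
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

class SharedRiskModel:
    """One shared encoder; a separate head of mixture parameters per
    competing risk (illustrative sketch)."""

    def __init__(self, d_in, d_rep, n_risks, n_primitives, rng=None):
        rng = rng or random.Random(0)
        rand_mat = lambda r, c: [[rng.gauss(0.0, 0.1) for _ in range(c)]
                                 for _ in range(r)]
        self.encoder = rand_mat(d_rep, d_in)            # shared across all risks
        self.heads = [rand_mat(n_primitives, d_rep)     # one head per risk
                      for _ in range(n_risks)]

    def mixture_logits(self, x, risk):
        rep = relu(linear(x, self.encoder))  # knowledge transfers via this shared rep
        return linear(rep, self.heads[risk])
```

Gradients from every risk's likelihood flow into the shared encoder, which is what enables transfer between the competing tasks.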

4 Experiments

We evaluate Deep Survival Machines (DSM) on its ability to estimate relative risks for a single event of interest in the presence of censoring, and we then consider ablation experiments in which we artificially increase the amount of censoring to demonstrate the robustness of the approach. Finally, we demonstrate DSM's ability to learn representations of the covariates for transferring knowledge across two events in the competing risks scenario with censoring.

4.1 Datasets

Single Event/Single Risk. We evaluated performance on the following real-world medical datasets with single events: the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT) (Knaus et al., 1995), and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (Curtis et al., 2012). A brief introduction of each dataset is provided below.


SUPPORT: The SUPPORT study was conducted to build a prognostic model estimating survival over a 180-day period for 9,105 seriously ill hospitalized patients. Of the 9,105 patients, 6,201 (68.1%) were followed to death, with a median survival time of 58 days. We used 30 patient covariates, including age, gender, race, education, income, physiological measurements, and co-morbidity information. Missing values of certain physiological measurements were imputed using the suggested normal values, and other missing values were imputed using the mean value for numerical features and the mode for categorical features.

METABRIC: The METABRIC study was conducted to determine new breast cancer subgroups and facilitate treatment improvement using patients' gene expressions and clinical variables. The dataset consists of 1,904 patients and 9 features; 1,103 patients (57.9%) were followed to death, with a median survival time of 115.9 months. The dataset was preprocessed as in Katzman et al. (2018) and downloaded from the PySurvival library.

Competing Risks. We also evaluated the performances on two datasets with competing risks: a synthetic dataset and the Surveillance, Epidemiology, and End Results (SEER) dataset.

SYNTHETIC: In order to demonstrate the effectiveness of DSM as a representation learning framework, we experiment with synthetic data generated in the spirit of Alaa & van der Schaar (2017) and Lee et al. (2018), using the same generative process they describe.

Here each individual i is represented by a tuple of covariates x_i. The event times T1 and T2 are exponentially distributed around functions that are both linear and quadratic in the covariates. We generate 30,000 patients from this distribution, of which 50% are subjected to random right censoring by uniformly sampling censoring times over a fixed interval. By construction, the event-time distributions are not independent, which should allow a model to leverage knowledge of one event to better predict the other; this is what we intend to demonstrate.
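The flavour of such a generative process can be sketched as follows; the coefficients, censoring scheme, and exact functional forms here are placeholders, not the ones used in the paper:

```python
import math
import random

def synth_patient(d=12, rng=random):
    # Covariates
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    lin = sum(x) / d                   # a linear function of x
    quad = sum(v * v for v in x) / d   # a quadratic function of x
    # Two dependent event times, exponential around functions of the SAME x,
    # so knowledge of one event is informative about the other.
    t1 = rng.expovariate(1.0 / math.exp(lin))
    t2 = rng.expovariate(1.0 / math.exp(quad))
    t, event = min((t1, 1), (t2, 2))
    if rng.random() < 0.5:             # 50% random right censoring
        return x, rng.uniform(0.0, t), 0   # event = 0 marks censoring
    return x, t, event                 # event = 1 or 2: observed competing risk
```

The dependence between the two event times is what a shared representation can exploit, mirroring the transfer experiment in Section 5.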

SEER: The SEER dataset provides information on cancer statistics among the U.S. population. We focused on the breast cancer patients in the registries of Alaska, San Jose-Monterey, Los Angeles, and Rural Georgia during the years 1992 to 2007, with the follow-up period restricted to 10 years. Among the 65,481 patients, 13,564 (20.7%) died due to breast cancer (BC) and 4,245 (6.5%) died due to cardiovascular disease (CVD); these were treated as the two competing risks in our experiments. We used 21 patient covariates, including age, race, gender, diagnostic confirmation, morphology information (primary site, laterality, histologic type, etc.), tumor information (size, type, number, etc.), and surgery information. Missing values were imputed using the mean value for numerical features and the mode for categorical features.

4.2 Baselines

We compare the performance of DSM to the following competing baseline approaches:

Cox Proportional Hazards (CPH): This is the standard semi-parametric Cox Proportional Hazards model, making the assumption of constant baseline hazard. The features interact with the learnt set of weights in a log-linear fashion in order to determine the hazard for a held out individual.

Random Survival Forests (RSF): This is a popular non-parametric approach involving learning an ensemble of trees, adapted to censored survival data (Ishwaran et al., 2008).

DeepSurv (DS): Proposed by Katzman et al. (2018), DeepSurv involves learning a non-linear function that describes the relative hazard of a test instance. It makes the same assumption of constant baseline hazard as CPH.

DeepHit (DH) (Lee et al., 2018): This approach involves learning the joint distribution of all event times by jointly modelling all competing risks and discretizing the output space of event times.

Fine-Gray (FG)  (Fine & Gray, 1999): This is a classic approach used for modelling competing risks that focuses on the Cumulative Incidence function by extending the proportional hazards model to sub-distributions.

For the SYNTHETIC and SEER datasets with competing risks, we compare performance of DSM to cause-specific (cs-) versions of CPH and RSF that involve learning separate survival regressions for each competing event by treating the other event as censored.

4.3 Performance Metrics

We evaluate DSM by assessing the ordering of pairwise relative risks using the Concordance-Index (C-Index) (Harrell, 1982). To demonstrate the superiority of our approach over methods subject to the Coxian assumption, we compare performance using the time-dependent Concordance-Index (Antolini et al., 2005).
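Harrell's C-index counts, over all comparable pairs, how often the subject who fails earlier was assigned the higher predicted risk. A naive O(n²) sketch, without the inverse-propensity censoring weights the paper also applies (function name assumed):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs ordered correctly by predicted risk.
    A pair (i, j) is comparable when i has an observed event and fails
    before j; ties in predicted risk count as 0.5."""
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                    # subject i must have an observed event
        for j in range(n):
            if times[i] < times[j]:     # comparable pair: i fails first
                den += 1
                if risks[i] > risks[j]:
                    num += 1.0
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den
```

A perfect risk ordering yields 1.0, a reversed ordering 0.0, and random risks about 0.5.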

Here, the time-dependent concordance compares the model's estimated CDF at a truncation time, given the features, across pairs of individuals: the probability of a correct pairwise ordering of relative risks is estimated at each truncation time. In order to obtain an unbiased estimate of this quantity, we adjust the estimate with an inverse propensity of censoring weighting (Gerds et al., 2013), as is common practice in the survival analysis literature.

Evaluating the time-dependent concordance at different time horizons lets us measure how well the models capture possible changes in risk over time, thus alleviating the restrictive assumption of constant proportional hazards implicit in the standard C-Index. For completeness, we report the time-dependent concordance at truncation horizons corresponding to the 25%, 50%, and 75% quantiles of event times.

4.4 Experimental Setup

Hyperparameters: For all the experiments described subsequently we train DSM with the Adam optimizer (Kingma & Ba, 2014). The number of experts (primitive distributions) for each event and the discounting factor on the censoring loss are tuned over a grid, while the prior strength is fixed across all experiments and not tuned. We report results for the best performing set of parameters over the grid in cross validation, for both DSM and the baselines. The representation learning function is a fully connected multilayer perceptron with 1 or 2 hidden layers and ReLU6 activations; the number of hidden nodes is also tuned. The choice of Log-Normal or Weibull outcome distribution is further tuned as a hyperparameter. All experiments were conducted in PyTorch (Paszke et al., 2019).

Evaluation Protocol: All reported errors are 90% confidence intervals obtained via 5-fold cross validation, except for METABRIC, where we perform 10-fold cross validation to get tighter confidence bounds. For full details of hyperparameter choices for the baselines, please refer to Appendix C.

Figure 3: Time-dependent concordance for the SUPPORT dataset at different quantiles of event times and different levels of censoring.
Figure 4: Time-dependent concordance for the METABRIC dataset at different quantiles of event times and different levels of censoring.
Figure 5: Time-dependent concordance for competing risks on SYNTHETIC.
Figure 6: Time-dependent concordance for competing risks on SEER.

4.5 Single Event Survival Regression

Parameter inference for DSM exploits the closed form of the CDF, which makes DSM amenable to gradient-based optimization. Naturally, one would expect that a greater amount of censoring reduces the information available to the model, adding bias and leading to poorer estimates of the survival function.

In this section we empirically investigate DSM's robustness to censoring and compare it to the relevant baselines by artificially censoring the event times. We uniformly sample a censoring time for a randomly chosen subset of the uncensored training data. This is applied only to the uncensored instances of the training splits, with the same experimental protocol as in Section 4.4; by not censoring the test splits we are able to better estimate the true concordance. We perform this artificial censoring on the single-event METABRIC and SUPPORT datasets, reducing the uncensored training data to 50% and 25% of its original amount.
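The artificial-censoring ablation can be sketched as follows; since the paper's exact sampling interval is not given in this excerpt, the censoring time below is simply drawn uniformly before the observed event time:

```python
import random

def artificially_censor(rows, keep_frac, rng=random):
    """Censor a random subset of the *uncensored* training rows.
    rows: tuples (x, time, delta) with delta = 1 for an observed event.
    keep_frac: fraction of uncensored rows left untouched (sketch)."""
    out = []
    for x, t, delta in rows:
        if delta == 1 and rng.random() > keep_frac:
            out.append((x, rng.uniform(0.0, t), 0))  # event becomes censored
        else:
            out.append((x, t, delta))
    return out
```

Already-censored rows and the held-out test split are left untouched, matching the protocol described above.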

Figure 3 summarizes the performance of DSM on the SUPPORT dataset in 5-fold cross validation. Notice that RSF is comparable to DSM in the 25% quantile of event-time horizons across all levels of censoring; however, DSM significantly outperforms RSF at the longer event quantiles. Similarly, although DeepSurv was competitive at longer event horizons, DSM significantly outperformed it at the shorter horizons.

For METABRIC, we observed that DSM outperformed the Deep Learning baselines significantly. Although RSF was competitive, DSM outperformed RSF on average in 10-fold cross validation. For both METABRIC and SUPPORT, the actual performance numbers and CIs are in Appx. B.

4.6 Competing Risks Scenario

For the SYNTHETIC dataset, we observe in Fig. 5 that DSM is competitive with DeepHit and outperforms all the other baselines in the 25%, 50%, 75% quantiles of event horizons. For comparison, we also report the performance at 100% quantile and observe that DSM is significantly superior to DeepHit for both events, thus confirming its robustness to events at longer horizons.

From Fig. 6, on the SEER dataset we observe that for the majority risk, breast cancer, DSM significantly outperformed all the other baselines. The results for CVD were less conclusive, with DeepHit being competitive at the 25% quantile; we attribute this to the class imbalance between the two risks. Note that for visual clarity we do not plot Fine-Gray and cs-RSF, since their performance was poor. We defer the actual numbers and confidence intervals to the Appendix.


5 Representation Learning and Knowledge Transfer

In this section we conduct a set of experiments to evaluate the performance of Deep Survival Machines (DSM) as a representation learning framework in the competing risks scenario. We compare DSM's ability to transfer knowledge across multiple competing risks against other deep learning based approaches.

Model  C-Index (90% CI)
DSM    0.7724 ± 0.0025
Table 3: Knowledge transfer across tasks and representation learning capability on the SYNTHETIC dataset. Representations were trained on Event 1 and used to predict relative risks for a held-out set on Event 2 using a Cox Proportional Hazards (CPH) model.

We divide the SYNTHETIC data into two equal subsets of 15,000 samples each. From the first subset we discard all rows in which Event 2 occurred before Event 1; from the second, we similarly discard all rows in which Event 1 occurred before Event 2. This effectively turns the two subsets into single-event censored datasets for Event 1 and Event 2 respectively. We train DSM, DeepSurv, and DeepHit on the first subset for the prediction of Event 1. Each learnt model is then used to extract representations for the second subset, with the output of the final layer used as an overcomplete representation of the original covariates of each individual observation. For all models, we tune one and two hidden-layer variants, with the dimensionality of the hidden layers tuned over a grid.

For completeness, we also experiment with Kernel-PCA (K-PCA) (Schölkopf et al., 1997), Non-Negative Matrix Factorization (NNMF) (Lee & Seung, 2001), and modern Variational Autoencoders (VAE) to learn latent representations. Note that, in contrast to DeepSurv and DSM, K-PCA, NNMF, and VAE are unsupervised methods that do not have access to the label of the original risk (Event 1) at training time, and hence are somewhat limited in their expressive capability.

Once the representations are extracted for the second subset of the data, a linear Cox Proportional Hazards (CPH) Model is trained on them for the competing risk (Event 2). Table 3 presents the result of concordance of the learnt CPH model on the extracted embeddings. DSM outperforms the competing baselines.

6 Model Complexity and Scalability

We stress again that the advantage of Deep Survival Machines (DSM) lies not only in predictive performance, but also in computational and inference complexity. Since DSM makes reasonable parametric assumptions, inference requires learning fewer parameters than the competing baselines. In this section we compare the training time and the model complexity, in terms of number of parameters, of DSM vis-à-vis the established deep learning baselines DeepHit and DeepSurv, as well as the linear Cox Proportional Hazards regression (CPH).

Figure 7: Training time of Deep Survival Machines in comparison to the Baselines. Parameter Inference with DSM is faster than other deep learning approaches, and scales better with dataset size.
Figure 8: Number of learnable parameters in best DSM architecture in comparison to the Baselines. DSM requires inference over a smaller set of parameters as compared to other approaches.

From Figures 7 and 8, the advantage of DSM in runtime and space complexity is abundantly clear. Note that while RSF is faster to train on METABRIC, it scales poorly with increasing amounts of data, as evidenced by its slower runtime on the larger SUPPORT dataset. Specifications of the machine used to benchmark performance are in Appendix D.

7 Conclusion and Future Work

We proposed Deep Survival Machines, a novel fully-parametric approach to estimating time-to-event in the presence of censoring and competing risks. Our approach models the survival function as a weighted mixture of individual parametric survival distributions, and is trained with a loss function designed to handle both censored and uncensored data. We demonstrated the benefits of our approach by comparing its performance to other classical and state-of-the-art survival regression approaches on multiple diverse datasets, and showed that the representations learnt by the deep neural networks in our approach can be leveraged for knowledge transfer across different competing risks.

Future directions include extending our approach to multiple censoring scenarios: in this paper we assumed that the data is right-censored, but our framework is readily amenable to left truncation and interval censoring. Additional research directions include further relaxing parametric assumptions on the survival distributions.


We thank the anonymous reviewers for taking the time to review this manuscript.


  • Alaa & van der Schaar (2017) Alaa, A. M. and van der Schaar, M. Deep multi-task gaussian processes for survival analysis with competing risks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2326–2334. Curran Associates Inc., 2017.
  • Antolini et al. (2005) Antolini, L., Boracchi, P., and Biganzoli, E. A time-dependent discrimination index for survival data. Statistics in Medicine, 24(24):3927–3944, 2005.
  • Chapfuwa et al. (2018) Chapfuwa, P., Tao, C., Li, C., Page, C., Goldstein, B., Carin, L., and Henao, R. Adversarial time-to-event modeling. arXiv preprint arXiv:1804.03184, 2018.
  • Chen (2019) Chen, G. H. Nearest neighbor and kernel survival analysis: Nonasymptotic error bounds and strong consistency rates. arXiv preprint arXiv:1905.05285, 2019.
  • Cox (1972) Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972.
  • Curtis et al. (2012) Curtis, C., Shah, S. P., Chin, S.-F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D., Lynch, A. G., Samarajiwa, S., and Yuan, Y. e. a. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012. doi: 10.1038/nature10983.
  • Faraggi & Simon (1995) Faraggi, D. and Simon, R. A neural network model for survival data. Statistics in medicine, 14(1):73–82, 1995.
  • Fine & Gray (1999) Fine, J. P. and Gray, R. J. A proportional hazards model for the subdistribution of a competing risk. Journal of the American statistical association, 94(446):496–509, 1999.
  • Gerds et al. (2013) Gerds, T. A., Kattan, M. W., Schumacher, M., and Yu, C. Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Statistics in Medicine, 32(13):2173–2184, 2013.
  • Harrell (1982) Harrell, F. E. Evaluating the yield of medical tests. JAMA: The Journal of the American Medical Association, 247(18):2543, 1982. doi: 10.1001/jama.1982.03320430047030.
  • Ishwaran et al. (2008) Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S., et al. Random survival forests. The annals of applied statistics, 2(3):841–860, 2008.
  • Kaplan & Meier (1958) Kaplan, E. L. and Meier, P. Nonparametric estimation from incomplete observations. Journal of the American statistical association, 53(282):457–481, 1958.
  • Katzman et al. (2018) Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., and Kluger, Y. Deepsurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC medical research methodology, 18(1):24, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Knaus et al. (1995) Knaus, W. A., Harrell, F. E., Lynn, J., Goldman, L., Phillips, R. S., Connors, A. F., Dawson, N. V., Fulkerson, W. J., Califf, R. M., Desbiens, N., et al. The support prognostic model: objective estimates of survival for seriously ill hospitalized adults. Annals of internal medicine, 122(3):191–203, 1995.
  • Lee et al. (2018) Lee, C., Zame, W. R., Yoon, J., and van der Schaar, M. Deephit: A deep learning approach to survival analysis with competing risks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Lee et al. (2019a) Lee, C., Yoon, J., and Van Der Schaar, M. Dynamic-deephit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Transactions on Biomedical Engineering, 2019a.
  • Lee et al. (2019b) Lee, C., Zame, W., Alaa, A., and van der Schaar, M. Temporal quilting for survival analysis. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 596–605, 2019b.
  • Lee & Seung (2001) Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pp. 556–562, 2001.
  • Li et al. (2016) Li, Y., Wang, J., Ye, J., and Reddy, C. K. A multi-task learning formulation for survival analysis. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1715–1724. ACM, 2016.
  • Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • Mobadersany et al. (2018) Mobadersany, P., Yousefi, S., Amgad, M., Gutman, D. A., Barnholtz-Sloan, J. S., Vega, J. E. V., Brat, D. J., and Cooper, L. A. Predicting cancer outcomes from histology and genomics using convolutional networks. Proceedings of the National Academy of Sciences, 115(13):E2970–E2979, 2018.
  • Nagpal et al. (2019) Nagpal, C., Sangave, R., Chahar, A., Shah, P., Dubrawski, A., and Raj, B. Nonlinear semi-parametric models for survival analysis. arXiv preprint arXiv:1905.05865, 2019.
  • Nezhad et al. (2019) Nezhad, M. Z., Sadati, N., Yang, K., and Zhu, D. A deep active survival analysis approach for precision treatment recommendations: Application of prostate cancer. Expert Systems with Applications, 115:16–26, 2019.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.
  • Ranganath et al. (2016) Ranganath, R., Perotte, A., Elhadad, N., and Blei, D. Deep survival analysis. Proceedings of the 1st Machine Learning for Healthcare Conference, PMLR, 56:101–114, 2016.
  • Rosen & Tanner (1999) Rosen, O. and Tanner, M. Mixtures of proportional hazards regression models. Statistics in Medicine, 18(9):1119–1131, 1999.
  • Schölkopf et al. (1997) Schölkopf, B., Smola, A., and Müller, K.-R. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Springer, 1997.
  • Vinzamuri & Reddy (2013) Vinzamuri, B. and Reddy, C. K. Cox regression with correlation based regularization for electronic health records. In 2013 IEEE 13th International Conference on Data Mining, pp. 757–766. IEEE, 2013.
  • Vinzamuri et al. (2014) Vinzamuri, B., Li, Y., and Reddy, C. K. Active learning based survival regression for censored data. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 241–250. ACM, 2014.
  • Xiang et al. (2000) Xiang, A., Lapuerta, P., Ryutov, A., Buckley, J., and Azen, S. Comparison of the performance of neural network methods and cox regression for censored survival data. Computational statistics & data analysis, 34(2):243–257, 2000.

Appendix A Loss Function Formulation

At test time, Deep Survival Machines (DSM) describes the survival function of a test individual as a weighted mixture of survival-distribution primitives, where the weights are a softmax over the output of a deep neural network. The loss function of DSM is designed to handle both censored and uncensored data.
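The mixture described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact architecture: the layer sizes, the number of primitives K, and the use of input-independent Weibull shape/scale parameters are simplifying assumptions (DSM shifts these parameters per input, as detailed below).

```python
import torch
import torch.nn as nn


class MixtureSurvival(nn.Module):
    """Sketch: survival function as a softmax-weighted mixture of K
    Weibull primitives. Sizes and architecture are illustrative."""

    def __init__(self, in_features, k=3, hidden=100):
        super().__init__()
        self.embedding = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU())
        self.gate = nn.Linear(hidden, k)               # mixture weights (pre-softmax)
        self.log_shape = nn.Parameter(torch.zeros(k))  # Weibull shape per primitive
        self.log_scale = nn.Parameter(torch.zeros(k))  # Weibull scale per primitive

    def survival(self, x, t):
        phi = self.embedding(x)
        w = torch.softmax(self.gate(phi), dim=-1)      # (n, k) mixture weights
        shape = self.log_shape.exp()                   # (k,)
        scale = self.log_scale.exp()                   # (k,)
        # Weibull survival: S_k(t) = exp(-(t / scale_k) ** shape_k)
        s_k = torch.exp(-((t.unsqueeze(-1) / scale) ** shape))  # (n, k)
        return (w * s_k).sum(-1)                       # (n,) mixture survival
```

The returned value is the weighted mixture S(t | x) used for risk prediction at test time.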

Uncensored Loss. The maximum likelihood estimator for the uncensored observations can be written as

$$\mathcal{L}_{\text{uncensored}} = \sum_{i:\,\delta_i = 1} \ln \sum_{k=1}^{K} w_k(x_i)\, f\left(t_i \mid \beta_k(x_i), \eta_k(x_i)\right).$$

Here $x_i$ are the input covariates of the $i$-th observation, $t_i$ is its observed event time, $w_k(x_i)$ are the softmax mixture weights, and $f(\cdot \mid \beta, \eta)$ is the probability density function (PDF) of the primitive distribution with shape $\beta$ and scale $\eta$.

$\beta_k$ and $\eta_k$ for the $i$-th observation are parameterized as

$$\beta_k(x_i) = \beta_k + \text{act}\left(\zeta_{\beta}^{k} \cdot \Phi_\theta(x_i)\right), \qquad \eta_k(x_i) = \eta_k + \text{act}\left(\zeta_{\eta}^{k} \cdot \Phi_\theta(x_i)\right),$$

where act is the SELU activation function if Weibull is used as the primitive distribution and the Tanh activation function if Log-Normal is used as the primitive distribution, and $\Phi_\theta$ is a multilayer perceptron.

Censoring Loss. As above, the lower bound on the likelihood of the censored observations can be written as

$$\mathcal{L}_{\text{censored}} = \sum_{i:\,\delta_i = 0} \sum_{k=1}^{K} w_k(x_i)\, \ln S\left(t_i \mid \beta_k(x_i), \eta_k(x_i)\right),$$

where $S(\cdot \mid \beta, \eta)$ is the survival function of the primitive distribution; by Jensen's inequality, this is a lower bound on the log-likelihood of a censored observation surviving beyond its censoring time.

For the scenario of competing risks, $f$ and $S$ are computed for the $j$-th competing risk by treating all other events as censoring. The total loss can be written as

$$\mathcal{L} = \mathcal{L}_{\text{uncensored}} + \lambda\, \mathcal{L}_{\text{censored}},$$

where $\lambda$ is a hyperparameter that discounts the contribution of the censored observations.
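The two loss terms can be sketched as follows for a Weibull primitive. This is an illustrative implementation, not the paper's reference code: the tensor names and the value of the censoring discount `lam` are assumptions.

```python
import torch


def dsm_loss(w, shape, scale, t, delta, lam=0.5):
    """Sketch of the DSM loss for a Weibull mixture.
    w: (n, k) softmax mixture weights; shape, scale: (n, k) per-primitive
    Weibull parameters; t: (n,) event/censoring times; delta: (n,) event
    indicator (1 = event observed, 0 = censored). `lam` discounts the
    censored term; its value here is illustrative."""
    z = (t.unsqueeze(-1) / scale) ** shape              # (t / scale) ** shape, (n, k)
    # Weibull log-pdf: log(k/s) + (k - 1) * log(t/s) - (t/s)^k
    log_f = (torch.log(shape) - torch.log(scale)
             + (shape - 1) * (torch.log(t.unsqueeze(-1)) - torch.log(scale))
             - z)
    log_s = -z                                          # Weibull log-survival
    # Uncensored: exact log-likelihood of the mixture density.
    ll_uncensored = torch.logsumexp(torch.log(w) + log_f, dim=-1)
    # Censored: Jensen lower bound on the log mixture survival.
    lb_censored = (w * log_s).sum(-1)
    return -(delta * ll_uncensored + lam * (1 - delta) * lb_censored).mean()
```

Minimizing this quantity maximizes the uncensored likelihood plus the discounted censored lower bound.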

Appendix B Results in Tabular Format

In this section, we compare the performance of Deep Survival Machines (DSM) with the baseline approaches using the time-dependent concordance index $C^{\mathrm{td}}$ at different event-time horizons. $C^{\mathrm{td}}$ was evaluated at the 25%, 50%, and 75% quantiles of event times. The mean and the 90% confidence interval of $C^{\mathrm{td}}$ were computed using 5-fold cross-validation.

The results on the two single-risk datasets, SUPPORT and METABRIC, are shown in Table 4 and Table 5, respectively. To investigate the models' robustness to censoring, we also artificially increased the amount of censoring in the training set of both SUPPORT and METABRIC, by censoring a randomly chosen subset comprising 25% or 50% of the originally uncensored observations in the training data. The results with added censoring are also shown.
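The artificial-censoring scheme above can be sketched as follows. Note that resampling the censoring time uniformly before the true event time is one plausible choice; the exact resampling rule and the function name are assumptions, not taken from the paper.

```python
import numpy as np


def add_censoring(times, events, frac, rng=None):
    """Sketch: mark a random fraction `frac` of the uncensored training
    observations as censored. The censoring time is resampled uniformly
    before the true event time (an illustrative assumption)."""
    rng = np.random.default_rng(rng)
    times, events = times.copy(), events.copy().astype(bool)
    uncensored = np.flatnonzero(events)
    chosen = rng.choice(uncensored, size=int(frac * len(uncensored)),
                        replace=False)
    events[chosen] = False
    times[chosen] = rng.uniform(0, times[chosen])  # censor before the event
    return times, events
```

Applying this with `frac=0.25` or `frac=0.50` reproduces the two added-censoring settings in the tables.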

The results on the two datasets with competing risks, SYNTHETIC and SEER, are shown in Table 6 and Table 7, respectively. cs-CPH and cs-RSF denote the cause-specific versions of the CPH and RSF models.

Table 4: $C^{\mathrm{td}}$ for the SUPPORT dataset at different quantiles of event times, for different levels of censoring.
Table 5: $C^{\mathrm{td}}$ for the METABRIC dataset at different quantiles of event times, for different levels of censoring.
Table 6: $C^{\mathrm{td}}$ for competing risks on SYNTHETIC.
Table 7: $C^{\mathrm{td}}$ for competing risks on SEER.

Appendix C Hyperparameter Tuning for the Baselines

We compared the performance of Deep Survival Machines (DSM) to several competing baseline approaches. In this section, we provide details of the hyperparameter tuning for each baseline. The hyperparameters tuned for Random Survival Forests (RSF) (Ishwaran et al., 2008) and DeepHit (Lee et al., 2018) are described below; the best set of hyperparameters was chosen based on the time-dependent concordance index $C^{\mathrm{td}}$ (Antolini et al., 2005) on the validation set. For the Cox Proportional Hazards (CPH) model (Cox, 1972), we used the default settings in the Python PySurvival library. For DeepSurv (Katzman et al., 2018), we directly used the hyperparameters provided in the DeepSurv GitHub repository. For the Fine-Gray (FG) model (Fine & Gray, 1999), we used the default settings in the R cmprsk package.

Random Survival Forests (RSF): The number of trees in the forest was selected from a list of candidate values, and the maximum depth of the trees was set to 4.

DeepHit (DH): We followed the experimental settings provided in the DeepHit GitHub repository. The number of layers in the shared sub-network and in each cause-specific (CS) sub-network, the number of nodes in each layer, the activation function (selected from [ReLU, ELU, Tanh]), and the coefficients trading off the ranking losses of the competing risks were each selected from a list of candidate values. We generated 10 settings by randomly sampling each hyperparameter from its list of candidates, and selected the set of hyperparameters with the highest validation $C^{\mathrm{td}}$. The hyperparameters for each dataset are shown in Table 8.
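The random search above can be sketched as follows. The candidate values in the grid below are placeholders (the original lists were elided from the text); only the sampling procedure itself is taken from the description above.

```python
import random


def sample_settings(n=10, seed=0):
    """Sketch: draw n hyperparameter settings by sampling each
    hyperparameter independently from its candidate list. The candidate
    values below are illustrative placeholders."""
    grid = {
        "n_layers_shared": [1, 2, 3],
        "n_layers_cs": [1, 2, 3],
        "n_nodes": [50, 100, 200],
        "activation": ["ReLU", "ELU", "Tanh"],
    }
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in grid.items()}
            for _ in range(n)]
```

Each of the 10 sampled settings is then trained, and the one with the highest validation $C^{\mathrm{td}}$ is retained.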

Dataset Type Shared Sub-network (No. Layers, No. Nodes) CS Sub-network (No. Layers, No. Nodes) Activation
SUPPORT Single Risk ELU
METABRIC Single Risk Tanh
SYNTHETIC Competing Risks ELU
SEER Competing Risks ELU
Table 8: The hyperparameters of DeepHit for each dataset.

Appendix D Benchmarking Machine Specifications

All experiments except those for DeepHit were run on a Linux machine (kernel version 3.10.0-1062.9.1.el7.x86_64) with an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz (8 cores) and 32 GB of RAM. The DeepHit experiments were run on a TITAN X (Pascal) GPU cluster (1 GPU) with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 cores), NVIDIA driver version 418.74, and CUDA 10.1.