
Siamese Survival Analysis with Competing Risks

by Anton Nemchenko, et al.

Survival analysis in the presence of multiple possible adverse events, i.e., competing risks, is a pervasive problem in many industries (healthcare, finance, etc.). Since only one event is typically observed, the incidence of an event of interest is often obscured by other related competing events. This nonidentifiability, or inability to estimate true cause-specific survival curves from empirical data, further complicates competing risk survival analysis. We introduce Siamese Survival Prognosis Network (SSPN), a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks. SSPN circumvents the nonidentifiability problem by avoiding the estimation of cause-specific survival curves and instead determines pairwise concordant time-dependent risks, where longer event times are assigned lower risks. Furthermore, SSPN is able to directly optimize an approximation to the C-discrimination index, rather than relying on well-known metrics which are unable to capture the unique requirements of survival analysis with competing risks.



1 Introduction

1.1 Motivation

Survival analysis is a method for analyzing data where the outcome variable is the time to the occurrence of an event of interest (death, disease, stock liquidation, mechanical failure, etc.). Competing risks are additional possible events or outcomes that "compete" with and may preclude or interfere with the observation of the event of interest. Though survival analysis is practiced across many disciplines (epidemiology, econometrics, manufacturing, etc.), this paper focuses on healthcare applications, where competing risk analysis has recently emerged as an important analytical tool in medical prognosis [9, 26, 22]. With an increasingly aging population, the presence of multiple coexisting chronic diseases (multimorbidities) is on the rise, with more than two-thirds of people aged over 65 considered multimorbid. Developing optimal treatment plans for patients with multimorbidities is a challenging problem, as the best treatment or intervention for a patient may depend on the existence of, and susceptibility to, other competing risks. Consider oncology and cardiovascular medicine, where the risk of a cardiac disease may alter the decision on whether a cancer patient should undergo chemotherapy or surgery. Examples like this involving competing risks are pervasive throughout healthcare, yet they are insufficiently addressed by the field in its current state.

1.2 Related Works

Previous work on classical survival analysis has demonstrated the advantages of deep learning over statistical methods [18, 14, 27]. The Cox proportional hazards (PH) model [6] is the baseline statistical model for survival analysis, but it is limited: its risk function is the product of a linear function of the covariates and a time-dependent function, which is insufficient for modeling complex non-linear medical data. [14] replaced the linear covariate function with a feed-forward neural network as the input to the Cox PH model and demonstrated improved performance. The current literature addresses competing risks with statistical methods (the Fine-Gray model [8]), classical machine learning (Random Survival Forests [13, 12]), multi-task learning [1], etc., with limited success. These existing competing risk models face computational scalability issues on datasets with many patients and multiple covariates. To address this challenge, we propose a deep learning architecture for survival analysis with competing risks that optimizes the time-dependent discrimination index. This is not trivial and is elaborated in the next section.

1.3 Contributions

In both machine learning and statistics, predictive models are compared in terms of the area under the receiver operating characteristic (ROC) curve or, in the survival analysis literature, the time-dependent discrimination index; the equivalence of the two metrics was established in [11]. Numerous works on supervised learning [4, 20, 19, 23] have shown that training models to directly optimize the AUC, rather than the error rate (or accuracy), improves out-of-sample (generalization) performance in terms of the AUC. In this work, we adapt this idea to survival analysis with competing risks. We develop a novel Siamese feed-forward neural network [3] designed to optimize concordance and to account for competing risks by directly targeting the time-dependent discrimination index [2]. This is achieved by estimating risks in a relative fashion, so that the risk for the "true" event of a patient (i.e., the event which actually took place) must be higher than (i) all other risks for the same patient and (ii) the risks assigned, for the same true event, to other patients who experienced it at a later time. Furthermore, the risks for all causes are estimated jointly, so that a unified representation captures the latent structure of the data while producing cause-specific risks. Because our neural network issues a joint risk for all competing events, it compares the risks of the different events at different times and arranges them concordantly (for any pair of patients, an earlier event time means a higher risk).

Unlike previous Siamese neural network architectures [5, 3, 25], which were developed for purposes such as learning the pairwise similarity between different inputs, our architecture aims to maximize the distance between the output risks for different inputs. We overcome the discontinuity of the above metric by introducing a continuous approximation of the time-dependent discrimination function. This approximation is evaluated only at the survival times observed in the dataset. However, training a neural network only on the observed survival times results in poor generalization and undesirable out-of-sample performance (in terms of the discrimination index computed at other times). In response, we add a loss term which, for any pair of patients, penalizes cases where the longer event time does not receive the lower risk.

The nonidentifiability problem in competing risks arises from the inability to estimate the true cause-specific survival curves from empirical data [24]. We address this issue by bypassing the estimation of individual cause-specific survival curves and using concordant risks instead. Our implementation is agnostic to any underlying causal assumptions and is therefore immune to nonidentifiability.

We report statistically significant improvements over state-of-the-art competing risk survival analysis methods on both synthetic and real medical data.

2 Problem Formulation

We consider a dataset comprising time-to-event information about subjects who are followed up for a finite amount of time. Each subject (patient) experiences an event $D \in \{\emptyset, 1, \dots, M\}$, where $D$ is the event type. $D = \emptyset$ means the subject is censored (lost to follow-up, or the study ended). If $D \in \{1, \dots, M\}$, the subject experiences one of the $M$ events of interest (for instance, the subject develops cardiac disease). We assume that a subject can experience only one of the above events and that the censoring times are independent of them [17, 22, 8, 7, 10, 24]. $T$ is defined as the time-to-event, where we assume that time is discrete and $T \in \{t_1, \dots, t_K\}$ ($T$ denotes the time elapsed since enrollment). Let $\mathcal{D} = \{(x_i, T_i, D_i)\}_{i=1}^{N}$ denote the dataset, where $T_i$ is the time-to-event for subject $i$, $D_i$ is the event experienced by the subject, and $x_i$ are the covariates of the subject (the covariates are measured at baseline and may include age, gender, genetic information, etc.).

The Cumulative Incidence Function (CIF) [8] computed at time $t$ for event $d$ is the probability that event $d$ occurs before time $t$, conditioned on the covariates $x$ of a subject, and is given as $F(t \mid d, x) = \Pr(T \le t, D = d \mid x)$. The cumulative incidence function evaluated at a certain point can thus be understood as the risk of experiencing event $d$ before time $t$.

In this work, our goal is to develop a neural network that learns the complex interactions in the data, specifically addressing competing risks survival analysis. In choosing our loss function, we note that the time-dependent discrimination index is the most commonly used metric for evaluating models in survival analysis [2]. Multiple publications in the supervised learning literature demonstrate that directly approximating the area under the curve (AUC) and training a classifier on that approximation leads to better generalization performance in terms of the AUC (see, e.g., [4, 20, 19, 23]). However, these ideas have not been explored in the context of survival analysis with competing risks. We follow the same principles to construct an approximation of the time-dependent discrimination index with which to train our neural network. We first describe the time-dependent discrimination index below.

Consider an ordered pair of two subjects $(i, j)$ in the dataset. If subject $i$ experiences event $d$, i.e., $D_i = d$, and subject $j$'s time-to-event exceeds that of subject $i$, i.e., $T_j > T_i$, then $(i, j)$ is a comparable pair. The set of all such comparable pairs is defined as the comparable set for event $d$, denoted $\mathcal{C}^d$.

A model outputs the risk $R^d(t, x_i)$ of subject $i$ experiencing event $d$ before time $t$, which is given as $R^d(t, x_i) = F(t \mid d, x_i)$. The time-dependent discrimination index for a certain cause is the probability that the model correctly orders the risks of the comparable pairs of subjects in the comparable set for that cause. The time-dependent discrimination index [2] for cause $d$ is defined as

$$C^d = \Pr\big( R^d(T_i, x_i) > R^d(T_i, x_j) \;\big|\; D_i = d,\ T_j > T_i \big). \tag{1}$$

The discrimination index in (1) cannot be computed exactly, since the distribution that generates the data is unknown. It can, however, be estimated with a standard estimator that takes as input the risk values of the subjects in the dataset. [2] defines the estimator of (1) as

$$\hat{C}^d = \frac{\sum_{(i,j) \in \mathcal{C}^d} \mathbf{1}\big\{ R^d(T_i, x_i) > R^d(T_i, x_j) \big\}}{|\mathcal{C}^d|}. \tag{4}$$

Note that in the above (4) only the numerator depends on the model. Henceforth, we consider only the quantity in the numerator, which we write as

$$\tilde{C}^d = \sum_{(i,j) \in \mathcal{C}^d} \mathbf{1}\big\{ R^d(T_i, x_i) > R^d(T_i, x_j) \big\}. \tag{5}$$

Enumerating the comparable pairs, this can be simplified as

$$\tilde{C}^d = \sum_{k=1}^{|\mathcal{C}^d|} \mathbf{1}\big\{ R^d(T_{k,1}, x_{k,1}) > R^d(T_{k,1}, x_{k,2}) \big\}, \tag{6}$$

where $\mathbf{1}\{\cdot\}$ is the indicator function, $x_{k,1}$ ($x_{k,2}$) is the left (right) element of the $k$-th comparable pair in the set $\mathcal{C}^d$, and $T_{k,1}$ ($T_{k,2}$) is the respective time-to-event. In the next section, we use the simplification (6) to construct the loss function for the neural network.
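The counting at the heart of the estimator can be sketched directly. In this illustration, `risk_fn(t, k)` is a hypothetical interface (not from the paper) returning a model's risk $R^d(t, x_k)$ for subject $k$ at time $t$:

```python
def concordance_counts(times, events, risk_fn, cause):
    """Count concordant and comparable pairs for one cause.

    A pair (i, j) is comparable for `cause` if subject i experienced
    `cause` and T_j > T_i; it is concordant if the model assigns i a
    higher risk than j, with both risks evaluated at time T_i.
    """
    concordant, comparable = 0, 0
    for i in range(len(times)):
        if events[i] != cause:
            continue
        for j in range(len(times)):
            if times[j] > times[i]:          # (i, j) is a comparable pair
                comparable += 1
                if risk_fn(times[i], i) > risk_fn(times[i], j):
                    concordant += 1
    return concordant, comparable

# Toy check: a model that ranks earlier event times as riskier is
# perfectly concordant on these three subjects (0 = censored).
times, events = [1, 2, 3], [1, 0, 1]
risk_fn = lambda t, k: -times[k]
print(concordance_counts(times, events, risk_fn, cause=1))  # (2, 2)
```

The ratio of the two returned counts is the estimator (4); the first count alone is the numerator (6).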

Figure 1: Illustration of the architecture.

3 Siamese Survival Prognosis Network

In this section, we will describe the architecture of the network and the loss functions that we propose to train the network.

Denote by $H$ the feed-forward neural network visualized in Fig. 1. It is composed of a sequence of fully connected hidden layers with "scaled exponential linear unit" (SELU) activations. The last hidden layer is fed to $M$ cause-specific layers, each of width $K$. Each neuron in these $M$ layers estimates the probability $\hat{P}^d_{t_k}(x)$ that a subject with covariates $x$ experiences cause $d$ in the time interval $t_k$. For an input covariate vector $x$, the output from all the neurons is the vector of probabilities $\big(\hat{P}^d_{t_k}(x)\big)_{d \in \{1,\dots,M\},\, k \in \{1,\dots,K\}}$. The estimate of the cumulative incidence function for cause $d$ at time $t_k$ is given as $\hat{F}(t_k \mid d, x) = \sum_{m \le k} \hat{P}^d_{t_m}(x)$, and the final output of the network for input $x$ is the vector of cumulative incidence function estimates $\big(\hat{F}(t_k \mid d, x)\big)_{d,k}$.
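Concretely, the per-interval probabilities accumulate into CIF estimates via a running sum over time. The array below is a made-up example output for a single subject, not the paper's parameterization:

```python
import numpy as np

# Made-up network output for one subject: probs[d, k] is the estimated
# probability that cause d+1 occurs in time interval t_{k+1}
# (M = 2 causes, K = 3 intervals; the total mass over all entries is <= 1).
probs = np.array([[0.10, 0.05, 0.05],    # cause 1
                  [0.20, 0.10, 0.05]])   # cause 2

# CIF estimate: F(t_k | d, x) is the running total of cause d's
# interval probabilities up to and including interval k.
cif = np.cumsum(probs, axis=1)
print(cif[0])   # risk of cause 1 at t_1, t_2, t_3: approx [0.10, 0.15, 0.20]
```

By construction each row of `cif` is non-decreasing in time, as a cumulative incidence function must be.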

The loss function is composed of three terms: a discrimination term, an accuracy term, and an additional loss term.

We cannot use the metric in (6) directly to train the network, because it is a discontinuous function (composed of indicators), which impedes gradient-based training. We overcome this problem by approximating the indicator function with a scaled sigmoid, $\sigma(\alpha z) = 1/(1 + e^{-\alpha z})$. The approximated discrimination index is given as

$$L_1 = \sum_{d=1}^{M} \sum_{k=1}^{|\mathcal{C}^d|} \sigma\Big( \alpha \big( R^d(T_{k,1}, x_{k,1}) - R^d(T_{k,1}, x_{k,2}) \big) \Big). \tag{7}$$

The scaling parameter $\alpha$ determines the sensitivity of the loss function to discrimination: the larger $\alpha$ is, the more sharply errors in discrimination are penalized. Higher values of $\alpha$ therefore push the subjects in a comparable pair toward concordant risk values.
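The effect of the scaling parameter can be seen directly in a minimal sketch of the sigmoid surrogate (the particular values of alpha here are illustrative, not the paper's settings):

```python
import math

def smooth_indicator(r_left, r_right, alpha):
    """Differentiable surrogate for the indicator 1[r_left > r_right]:
    a sigmoid of the risk difference, sharpened by the scale alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * (r_left - r_right)))

# A concordant pair (left risk 0.6 > right risk 0.4):
print(smooth_indicator(0.6, 0.4, alpha=5.0))    # soft score, ~0.73
print(smooth_indicator(0.6, 0.4, alpha=500.0))  # ~1.0: close to the hard indicator
```

As alpha grows, the surrogate approaches the exact (but non-differentiable) indicator, so discordant pairs are penalized almost as harshly as under the original metric.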

The discrimination part defined above captures a model's ability to discriminate subjects for each cause separately. We also need to ensure that the model predicts the cause itself accurately. We define the accuracy of a model in terms of a scaled sigmoid with scaling parameter $\kappa$ as follows


The accuracy term penalizes the risk functions only at the event times of the left subjects in comparable pairs. However, it is important that the neural network also produces risk values that interpolate well to the other time intervals. We therefore introduce the loss term

$$L_3 = \sum_{d=1}^{M} \sum_{k=1}^{|\mathcal{C}^d|} \sum_{t_m \le T_{k,1}} R^d(t_m, x_{k,2}). \tag{9}$$

This loss term keeps the risk of each right subject small at all times before the time-to-event of the left subject in the respective comparable pair. Intuitively, it can be justified as follows: the right subject of a pair does not experience an event before time $T_{k,1}$, so the estimated probability of experiencing an event before $T_{k,1}$ should be small.
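A minimal sketch of this term for one comparable pair, under the assumption that the risk is the estimated CIF evaluated on the discrete time grid (the numbers are illustrative):

```python
import numpy as np

def pair_interpolation_loss(cif_right, k_left):
    """Loss term for one comparable pair: cif_right[k] is the right
    subject's estimated risk (CIF) for the cause at interval k, and
    k_left is the index of the interval containing the left subject's
    event time. The right subject is event-free up to that time, so its
    risk over the intervals up to k_left is summed and thereby penalized."""
    return float(np.sum(cif_right[: k_left + 1]))

# Right subject's CIF over four intervals; the left subject's event
# falls in interval index 1, so only the first two risks are penalized.
print(pair_interpolation_loss(np.array([0.05, 0.10, 0.40, 0.70]), k_left=1))
```

Minimizing this quantity pushes the right subject's early risks toward zero without touching its risks after the left subject's event time.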

The final loss function is the sum of the discrimination terms (one per cause, as described above), the accuracy term, and the loss term, and is given as


Finally, we adjust for the event imbalance and the time-interval imbalance caused by the unequal numbers of comparable pairs across events and time intervals, using inverse propensity weights. The loss term of each comparable pair is multiplied by a weight inversely proportional to the frequency with which the pair's event and time interval occur in the data.
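The weighting scheme can be sketched as follows, under the assumption that a pair's propensity is the empirical frequency of its (event, time-interval) combination among all comparable pairs (the pair list below is made up):

```python
from collections import Counter

# Hypothetical comparable pairs, each tagged with its event d and the
# time-interval index k of the left subject's event time.
pairs = [(1, 0), (1, 0), (1, 1), (2, 0), (2, 2), (2, 2), (2, 2)]

counts = Counter(pairs)
total = len(pairs)

# Inverse propensity weight: rare (event, interval) combinations get
# larger weights, so their loss terms are not drowned out by the
# frequent ones.
weights = [total / counts[p] for p in pairs]
print(weights)  # the single (1, 1) pair gets the largest weight, 7.0
```

Each pair's contribution to the loss is then multiplied by its weight before summation.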

We train the feed-forward network using the above loss function (10) and regularize it with the dropout variant designed for SELU networks (alpha dropout) [16]. Since the loss function involves the discrimination term, each term in the loss involves a pairwise comparison, which makes the training similar to that of a Siamese network [3]: the backpropagated gradients depend on each comparable pair.

4 Experiments

This section includes a discussion of hyper-parameter optimization, followed by competing risk and survival analysis experiments (code is available online). We compare against the Fine-Gray model (the "cmprsk" R package), Competing Random Forest (CRF) (the "randomForestSRC" R package), and the cause-specific (cs) extension of two single-event (non-competing-risks) methods, the Cox PH model and [14]. In the cause-specific extension of a single-event model, the occurrence of any event other than the event of interest is marked as censoring, decoupling the problem into separate single-event problems (one per cause); this is a standard way of extending single-event models to competing risks. In the following results we refer to our method with the acronym SSPN.
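The cause-specific reduction just described amounts to a simple relabeling of the event column, sketched below (event code 0 denotes censoring; the data are made up):

```python
import numpy as np

def cause_specific_labels(times, events, cause):
    """Cause-specific (cs) reduction: treat any event other than `cause`
    as censoring, yielding a single-event dataset that standard survival
    models (e.g. a Cox PH model) can then fit."""
    times = np.asarray(times)
    events = np.asarray(events)
    indicator = (events == cause).astype(int)  # 1 only for the cause of interest
    return times, indicator

# Four subjects: causes 1, 2, censored, 1 -> single-event labels for cause 1.
t, e = cause_specific_labels([2.0, 5.0, 3.0, 7.0], [1, 2, 0, 1], cause=1)
print(e)  # [1 0 0 1]
```

Repeating this once per cause produces the decoupled single-event problems used by the cs baselines.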

Dataset          Batch size   # hidden layers   Hidden layer width   Dropout rate
SEER             2048         3                 50                   0.4
Synthetic data   2048         2                 40                   0.35
Table 1: Summary of hyper-parameters

4.1 Hyper-parameter Optimization

Optimization was performed using 5-fold cross-validation with a fixed censoring rate in each fold. We chose a 60-20-20 division into training, validation, and test sets. A standard grid search was used to determine the batch size, the number of hidden layers, the width of the hidden layers, and the dropout rate. The optimal values of the two scaling parameters were consistently 500 and 0.01 across all datasets. As previously mentioned, the comparable sets consist of patient pairs. In each training iteration, a batch of pairs was sampled with replacement from the training set; sampling with replacement reduces convergence speed but does not lower performance relative to regular batches [21]. We note that the training sets commonly contain tens of millions of pairs, with each patient appearing multiple times on both sides of pairs. A standard definition of an epoch would be a single iteration over all patients. In our case, however, we learn not only patient-specific characteristics but also patient-comparison relationships, so an epoch consisting of as many iterations as there are patients is not sufficient; on the other hand, defining an epoch as an iteration over all pairs is impractical. Our best empirical results were attained after 100K iterations with TensorFlow on an 8-core Xeon E3-1240, using the Adam optimizer [15] and a decaying learning rate. Table 1 summarizes the optimal hyper-parameters.

Table 2: Time-dependent discrimination index on SEER.
Method      CVD                   Breast Cancer         Other
cs-Cox PH   0.656 [0.629-0.682]   0.634 [0.626-0.642]   0.695 [0.675-0.714]
cs-[14]     0.645 [0.625-0.664]   0.697 [0.686-0.708]   0.675 [0.644-0.706]
Fine-Gray   0.659 [0.605-0.714]   0.636 [0.622-0.650]   0.691 [0.673-0.708]
CRF         0.601 [0.565-0.637]   0.705 [0.692-0.718]   0.636 [0.624-0.648]
SSPN        0.663 [0.625-0.701]   0.735 [0.678-0.793]   0.699 [0.681-0.716]
*p-value < 0.05

Table 3: Time-dependent discrimination index on synthetic data.
Method      Cause 1               Cause 2
cs-Cox PH   0.571 [0.554-0.588]   0.581 [0.570-0.591]
cs-[14]     0.580 [0.556-0.603]   0.593 [0.576-0.611]
Fine-Gray   0.574 [0.559-0.590]   0.586 [0.577-0.594]
CRF         0.591 [0.575-0.606]   0.573 [0.557-0.588]
SSPN        0.603 [0.593-0.613]   0.613 [0.598-0.627]
*p-value < 0.05

4.2 SEER

The Surveillance, Epidemiology, and End Results (SEER) Program dataset provides information on breast cancer patients during the years 1992-2007. A total of 72,809 patients experienced breast cancer, cardiovascular disease (CVD), other diseases, or were right-censored. The cohort consists of 23 features, including age, race, gender, morphology information, diagnostic information, therapy information, tumor size, and tumor type. Missing values were replaced by the mean for real-valued features and by the mode for categorical features. 1.3% of the patients experienced CVD and 15.6% experienced breast cancer. Table 2 displays the results for this dataset. We notice that for the infrequent adverse event, CVD, the performance gain is negligible, while for the frequent breast cancer event the gain is significant. However, we remind the reader that our focus is healthcare, where even minor gains have the potential to save lives: with 72,809 patients, a performance improvement even as small as 0.1% can translate into multiple lives saved and should not be disregarded.

4.3 Synthetic Data

Due to the relative scarcity of competing risks datasets and methods, we created an additional synthetic dataset to further validate the performance of our method. We constructed two stochastic processes and corresponding event times as follows: the feature vector of patient $i$ is partitioned as $x_i = (x_i^{(1)}, x_i^{(2)}, x_i^{(3)})$, where for $d \in \{1, 2\}$ the features $x^{(d)}$ affect only the time of event $d$, while $x^{(3)}$ affects the times of both events. The event times are exponentially distributed, with a mean parameter depending on both linear and non-linear (quadratic) functions of the features. Given the parameters, we first generated the patient cohort; among those patients, we randomly selected 50% to be right-censored at a time drawn from a uniform distribution. (This censoring fraction was chosen to be roughly the same as in the real dataset, and hence to present the same difficulty.) Table 3 displays the results for this dataset. We demonstrate the same consistent performance gain as in the previous case.
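A generator in the spirit of this construction might look as follows; the feature split, coefficients, rate functions, and cohort size are illustrative assumptions, not the exact parameters used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x1 = rng.normal(size=(n, 3))   # affects event 1 only
x2 = rng.normal(size=(n, 3))   # affects event 2 only
x3 = rng.normal(size=(n, 3))   # shared: affects both events

g1, g2, g3 = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

# Exponential event times; the mean combines a (positive) linear term in
# the event-specific features and a quadratic term in the shared ones.
t1 = rng.exponential(np.exp(x1 @ g1) + (x3 @ g3) ** 2)
t2 = rng.exponential(np.exp(x2 @ g2) + (x3 @ g3) ** 2)

event = np.where(t1 < t2, 1, 2)        # observed (earliest) event
time = np.minimum(t1, t2)

# Right-censor a random ~50% of subjects at a uniform time before their
# event, mirroring the censoring rate described above (code 0 = censored).
cens = rng.random(n) < 0.5
time[cens] = rng.uniform(0.0, time[cens])
event[cens] = 0
```

Each subject thus carries nine covariates, an observed time, and an event label in {0, 1, 2}, matching the competing-risks format of Section 2.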

5 Conclusion

Competing risks settings are pervasive in healthcare. They are encountered in cardiovascular diseases, in cancer, and in the geriatric population suffering from multiple diseases. To solve the challenging problem of learning the model parameters from time-to-event data while handling right censoring, we have developed a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks based on the well-known Siamese network architecture. Our method is able to capture complex non-linear representations missed by classical machine learning and statistical models. Experimental results show that our method is able to outperform existing competing risk methods by successfully learning representations which flexibly describe non-proportional hazard rates with complex interactions between covariates and survival times that are common in many diseases with heterogeneous phenotypes.


  • [1] Alaa, A.M., van der Schaar, M.: Deep multi-task Gaussian processes for survival analysis with competing risks (2017)
  • [2] Antolini, L., Boracchi, P., Biganzoli, E.: A time-dependent discrimination index for survival data. Statistics in Medicine 24(24), 3927–3944 (2005)
  • [3] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems, pp. 737–744 (1994)
  • [4] Chen, Y., Jia, Z., Mercola, D., Xie, X.: A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Computational and Mathematical Methods in Medicine 2013 (2013)
  • [5] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 539–546. IEEE (2005)

  • [6] Cox, D.R.: Regression models and life-tables. J. R. Stat. Soc. Ser. B 34, 187–220 (1972)
  • [7] Crowder, M.J.: Classical competing risks. CRC Press (2001)
  • [8] Fine, J.P., Gray, R.J.: A proportional hazards model for the subdistribution of a competing risk. Journal of the American statistical association 94(446), 496–509 (1999)
  • [9] Glynn, R.J., Rosner, B.: Comparison of risk factors for the competing risks of coronary heart disease, stroke, and venous thromboembolism. American journal of epidemiology 162(10), 975–982 (2005)
  • [10] Gooley, T.A., Leisenring, W., Crowley, J., Storer, B.E., et al.: Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Statistics in medicine 18(6), 695–706 (1999)
  • [11] Heagerty, P.J., Zheng, Y.: Survival model predictive accuracy and roc curves. Biometrics 61(1), 92–105 (2005)
  • [12] Ishwaran, H., Gerds, T.A., Kogalur, U.B., Moore, R.D., Gange, S.J., Lau, B.M.: Random survival forests for competing risks. Biostatistics 15(4), 757–773 (2014)
  • [13] Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. The annals of applied statistics pp. 841–860 (2008)
  • [14] Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T., Kluger, Y.: Deep survival: A deep cox proportional hazards network. arXiv preprint arXiv:1606.00931 (2016)
  • [15] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [16] Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. arXiv preprint arXiv:1706.02515 (2017)
  • [17] Lambert, P., Dickman, P., Nelson, C., Royston, P.: Estimating the crude probability of death due to cancer and other causes using relative survival models. Statistics in medicine 29(7-8), 885–895 (2010)
  • [18] Luck, M., Sylvain, T., Cardinal, H., Lodi, A., Bengio, Y.: Deep learning for patient-specific kidney graft survival analysis. arXiv preprint arXiv:1705.10245 (2017)
  • [19] Mayr, A., Hofner, B., Schmid, M.: Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC bioinformatics 17(1),  288 (2016)
  • [20] Mayr, A., Schmid, M.: Boosting the concordance index for survival data–a unified framework to derive and evaluate biomarker combinations. PloS one 9(1), e84483 (2014)
  • [21] Recht, B., Re, C.: Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences (2012)
  • [22] Satagopan, J., Ben-Porat, L., Berwick, M., Robson, M., Kutler, D., Auerbach, A.: A note on competing risks in survival data analysis. British journal of cancer 91(7), 1229–1235 (2004)
  • [23] Schmid, M., Wright, M.N., Ziegler, A.: On the use of harrell’s c for clinical risk prediction via random survival forests. Expert Systems with Applications 63, 450–459 (2016)
  • [24] Tsiatis, A.: A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences 72(1), 20–22 (1975)
  • [25] Wang, J., Fang, Z., Lang, N., Yuan, H., Su, M.Y., Baldi, P.: A multi-resolution approach for spinal metastasis detection using deep siamese neural networks. Computers in Biology and Medicine 84, 137–146 (2017)
  • [26] Wolbers, M., Koller, M.T., Witteman, J.C., Steyerberg, E.W.: Prognostic models with competing risks: methods and application to coronary risk prediction. Epidemiology 20(4), 555–561 (2009)
  • [27] Yousefi, S., Amrollahi, F., Amgad, M., Dong, C., Lewis, J.E., Song, C., Gutman, D.A., Halani, S.H., Vega, J.E.V., Brat, D.J., et al.: Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. bioRxiv p. 131367 (2017)