1 Introduction
1.1 Motivation
Survival analysis is a method for analyzing data in which the outcome variable is the time to the occurrence of an event of interest (death, disease, stock liquidation, mechanical failure, etc.). Competing risks are additional possible events or outcomes that "compete" with, and may preclude or interfere with, the observation of the desired event. Though survival analysis is practiced across many disciplines (epidemiology, econometrics, manufacturing, etc.), this paper focuses on healthcare applications, where competing risk analysis has recently emerged as an important analytical tool in medical prognosis [9, 26, 22]. With an increasingly aging population, the presence of multiple coexisting chronic diseases (multimorbidities) is on the rise, with more than two-thirds of people aged over 65 considered multimorbid. Developing optimal treatment plans for patients with multimorbidities is a challenging problem, where the best treatment or intervention for a patient may depend upon the existence of, and susceptibility to, other competing risks. Consider oncology and cardiovascular medicine, where the risk of cardiac disease may alter the decision on whether a cancer patient should undergo chemotherapy or surgery. Examples like this involving competing risks are pervasive throughout the healthcare industry and insufficiently addressed in its current state.
1.2 Related Works
Previous work on classical survival analysis has demonstrated the advantages of deep learning over statistical methods [18, 14, 27]. The Cox proportional hazards (PH) model [6] is the baseline statistical model for survival analysis, but it is limited because its risk function is the product of a linear covariate function and a time-dependent function, which is insufficient for modeling complex nonlinear medical data. [14] replaced the linear covariate function with a feedforward neural network as input to the Cox PH model and demonstrated improved performance. The current literature addresses competing risks with statistical methods (the Fine-Gray model [8]), classical machine learning (Random Survival Forests [13, 12]), multitask learning [1], etc., with limited success. These existing competing risk models are challenged by computational scalability issues for datasets with many patients and multiple covariates. To address this challenge, we propose a deep learning architecture for survival analysis with competing risks that optimizes the time-dependent discrimination index. This is not trivial and will be elaborated in the next section.

1.3 Contributions
In both machine learning and statistics, predictive models are compared in terms of the area under the receiver operating characteristic (ROC) curve or, in the survival analysis literature, the time-dependent discrimination index; the equivalence of the two metrics was established in [11]. Numerous works on supervised learning [4, 20, 19, 23] have shown that training models to directly optimize the AUC improves out-of-sample (generalization) performance (in terms of the AUC) relative to optimizing the error rate (or accuracy). In this work, we adopt and apply this idea to survival analysis with competing risks. We develop a novel Siamese feedforward neural network [3] designed to optimize concordance and account for competing risks by specifically targeting the time-dependent discrimination index [2]. This is achieved by estimating risks in a relative fashion, so that the risk for the "true" event of a patient (i.e., the event which actually took place) must be higher than: (i) all other risks for the same patient and (ii) the risks for the same true event of other patients that experienced it at a later time. Furthermore, the risks for all the causes are estimated jointly, in an effort to generate a unified representation capturing the latent structure of the data while estimating cause-specific risks. Because our neural network issues a joint risk for all competing events, it compares the risks for the different events at different times and arranges them in a concordant fashion (an earlier event time means a higher risk for any pair of patients).

Unlike previous Siamese neural network architectures [5, 3, 25], developed for purposes such as learning the pairwise similarity between different inputs, our architecture aims to maximize the gap between the output risks for the different inputs. We overcome the discontinuity problem of the above metric by introducing a continuous approximation of the time-dependent discrimination function. This approximation is only evaluated at the survival times observed in the dataset. However, training a neural network only over the observed survival times results in poor generalization and undesirable out-of-sample performance (in terms of the discrimination index computed at different times). In response, we add a loss term to the loss function which, for any pair of patients, penalizes cases where the longer event time does not receive a lower risk.
The non-identifiability problem in competing risks arises from the inability to estimate the true cause-specific survival curves from empirical data [24]. We address this issue by avoiding the estimation of the individual cause-specific survival curves altogether and utilizing concordant risks instead. Our implementation is agnostic to any underlying causal assumptions and is therefore immune to non-identifiability.
We report statistically significant improvements over state-of-the-art competing risk survival analysis methods on both synthetic and real medical data.
2 Problem Formulation
We consider a dataset comprising time-to-event information about subjects who are followed up for a finite amount of time. Each subject (patient) $i$ experiences an event $D_i \in \{\varnothing, 1, \ldots, M\}$, where $D_i$ is the event type. $D_i = \varnothing$ means the subject is censored (lost to follow-up or the study ended). If $D_i \in \{1, \ldots, M\}$, then the subject experiences one of the $M$ events of interest (for instance, the subject develops cardiac disease). We assume that a subject can experience only one of the above events and that the censorship times are independent of them [17, 22, 8, 7, 10, 24]. $T_i$ is defined as the time-to-event, where we assume that time is discrete and $T_i \in \{t_1, \ldots, t_{\max}\}$ ($t$ denotes the elapsed time since baseline). Let $\mathcal{D} = \{(x_i, T_i, D_i)\}_{i=1}^{N}$, where $T_i$ is the time-to-event for subject $i$, $D_i$ is the event experienced by the subject and $x_i$ are the covariates of the subject (the covariates are measured at baseline, which may include age, gender, genetic information, etc.).
The Cumulative Incidence Function (CIF) [8] computed at time $t$ for a certain event $k$ is the probability of occurrence of that event before time $t$, conditioned on the covariates $x$ of a subject, and is given as
$$F_k(t \mid x) = \Pr(T \le t, D = k \mid x).$$
The cumulative incidence function evaluated at a certain point can be understood as the risk of experiencing a certain event before a specified time.

In this work, our goal is to develop a neural network that can learn the complex interactions in the data, specifically addressing competing risks survival analysis. In determining our loss function, we consider that the time-dependent discrimination index is the most commonly used metric for evaluating models in survival analysis [2]. Multiple publications in the supervised learning literature demonstrate that approximating the area under the curve (AUC) directly and training a classifier on this approximation leads to better generalization performance in terms of the AUC (see e.g. [4, 20, 19, 23]). However, these ideas have not been explored in the context of survival analysis with competing risks. We follow the same principles to construct an approximation of the time-dependent discrimination index to train our neural network. We first describe the time-dependent discrimination index below.

Consider an ordered pair of two subjects $(i, j)$ in the dataset. If subject $i$ experiences event $k$, i.e., $D_i = k$, and if subject $j$'s time-to-event exceeds the time-to-event of subject $i$, i.e., $T_j > T_i$, then the pair $(i, j)$ is a comparable pair. The set of all such comparable pairs is defined as the comparable set for event $k$, and is denoted $\mathcal{X}_k$.

A model outputs the risk $R_k(t, x_i)$ of subject $i$ for experiencing event $k$ before time $t$. The time-dependent discrimination index for a certain cause $k$ is the probability that a model accurately orders the risks of the comparable pairs of subjects in the comparable set for event $k$. The time-dependent discrimination index [2] for cause $k$ is defined as
$$C_k^{td} = \frac{\sum_{t} \mathrm{AUC}_k(t)\, w_k(t)}{\sum_{t} w_k(t)} \qquad (1)$$
where
$$\mathrm{AUC}_k(t) = \Pr\big( R_k(t, x_i) > R_k(t, x_j) \;\big|\; T_i = t,\; T_j > t,\; D_i = k \big) \qquad (2)$$
$$w_k(t) = \Pr\big( T_i = t,\; T_j > t,\; D_i = k \big) \qquad (3)$$
The discrimination index in (1) cannot be computed exactly since the distribution that generates the data is unknown. However, the discrimination index can be estimated using a standard estimator, which takes as input the risk values associated with subjects in the dataset. [2] defines the estimator for (1) as
$$\hat{C}_k^{td} = \frac{\sum_{i \ne j} \mathbf{1}\{ R_k(T_i, x_i) > R_k(T_i, x_j) \}\, \mathbf{1}\{ T_i < T_j,\; D_i = k \}}{\sum_{i \ne j} \mathbf{1}\{ T_i < T_j,\; D_i = k \}} \qquad (4)$$
Note that in the above (4) only the numerator depends on the model. Henceforth, we will only consider the quantity in the numerator and we write it as
$$\sum_{i \ne j} \mathbf{1}\{ R_k(T_i, x_i) > R_k(T_i, x_j) \}\, \mathbf{1}\{ T_i < T_j,\; D_i = k \} \qquad (5)$$
The above equation can be simplified as
$$\sum_{(l, r) \in \mathcal{X}_k} \mathbf{1}\{ R_k(T_l, x_l) > R_k(T_l, x_r) \} \qquad (6)$$
where $\mathbf{1}\{\cdot\}$ is the indicator function, $l$ ($r$) is the left (right) element of the comparable pair in the set $\mathcal{X}_k$ and $T_l$ ($T_r$) is the respective time-to-event. In the next section, we will use the above simplification (6) to construct the loss function for the neural network.
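As a concrete illustration of the quantities above, the comparable set and the simplified concordance count (6) can be computed directly; the function names and the encoding of censoring as 0 are assumptions made here for the sketch, not notation from the paper:

```python
import numpy as np

def comparable_pairs(T, D, k):
    """Comparable set X_k: pairs (l, r) with D[l] == k and T[r] > T[l].

    D uses 0 for censored subjects (an encoding chosen here); censored
    subjects can only appear on the right side of a pair.
    """
    return [(l, r)
            for l in range(len(T)) if D[l] == k
            for r in range(len(T)) if T[r] > T[l]]

def discrimination_numerator(R, T, pairs):
    """Simplified estimator (6): count of concordant comparable pairs.

    R is a (subjects x time) risk matrix for one cause, so R[i, T[l] - 1]
    is subject i's risk evaluated at the left subject's event time.
    """
    return sum(R[l, T[l] - 1] > R[r, T[l] - 1] for l, r in pairs)

# toy data: event 1 occurs twice, 0 encodes censoring
T = np.array([1, 3, 2])
D = np.array([1, 0, 1])
R = np.array([[0.8, 0.9],   # subject 0: high risk early
              [0.1, 0.2],   # subject 1: low risk, censored late
              [0.4, 0.7]])
pairs = comparable_pairs(T, D, 1)             # [(0, 1), (0, 2), (2, 1)]
print(discrimination_numerator(R, T, pairs))  # 3: all pairs concordant
```

Dividing this count by the number of comparable pairs gives the estimator (4).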
3 Siamese Survival Prognosis Network
In this section, we will describe the architecture of the network and the loss functions that we propose to train the network.
Denote by $H$ a feedforward neural network, which is visualized in Fig. 1. It is composed of a sequence of fully connected hidden layers with "scaled exponential linear unit" (SELU) activations. The last hidden layer is fed to $M$ layers of width $t_{\max}$, one per cause. Each neuron in these latter layers estimates the probability that a subject experiences cause $k$ in a time interval $t$, which is given as $\Pr(T = t, D = k \mid x)$. For an input covariate vector $x$, the output from all the neurons is a vector of such probabilities. The estimate of the cumulative incidence function computed for cause $k$ at time $t$ is given as $\hat{F}_k(t \mid x) = \sum_{t' \le t} \Pr(T = t', D = k \mid x)$. The final output of the neural network for input $x$ is the vector of estimates of the cumulative incidence function for all causes, $\big(\hat{F}_1(\cdot \mid x), \ldots, \hat{F}_M(\cdot \mid x)\big)$.
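The cumulative-sum construction of the CIF estimates can be sketched as follows; normalizing with a single softmax over all causes and intervals is an assumption of this sketch (it guarantees the per-cause CIFs sum to at most one), not a detail stated in the text:

```python
import numpy as np

def cif_from_logits(logits):
    """Turn raw network outputs into CIF estimates.

    logits: array of shape (M, t_max) -- one row per cause, one column
    per discrete time interval.  A softmax over all M * t_max entries
    yields a joint distribution Pr(T = t, D = k | x); the CIF for cause
    k at time t is the cumulative sum over time:
        F_k(t | x) = sum_{t' <= t} Pr(T = t', D = k | x).
    """
    z = np.exp(logits - logits.max())
    probs = z / z.sum()                 # joint distribution over causes and times
    return np.cumsum(probs, axis=1)     # CIF per cause, nondecreasing in t

cif = cif_from_logits(np.zeros((2, 4)))  # uniform case: 2 causes, 4 intervals
print(cif[0])  # [0.125 0.25  0.375 0.5  ]
```

Each row of the result is nondecreasing in time, as a CIF must be, and the terminal values across causes sum to one.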
The loss function is composed of three terms: discrimination, accuracy, and a loss term.
We cannot use the metric in (6) directly to train the network because it is a discontinuous function (composed of indicators), which impedes gradient-based training. We overcome this problem by approximating the indicator function with a scaled sigmoid function $\sigma(\alpha z) = 1/(1 + e^{-\alpha z})$. The approximated discrimination index is given as
$$\tilde{C}_k = \sum_{(l, r) \in \mathcal{X}_k} \sigma\big( \alpha \,[ R_k(T_l, x_l) - R_k(T_l, x_r) ]\big) \qquad (7)$$
The scaling parameter $\alpha$ determines the sensitivity of the loss function to discrimination. If the value of $\alpha$ is high, then the penalty for an error in discrimination is also very high. Therefore, higher values of $\alpha$ push the network to assign concordant risk values to the subjects in a comparable pair.
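The effect of the scaling parameter can be seen in a minimal sketch of the sigmoid surrogate (the function name is illustrative):

```python
import numpy as np

def smooth_indicator(risk_left, risk_right, alpha):
    """Continuous surrogate for the indicator 1{risk_left > risk_right}.

    The hard indicator is replaced by a sigmoid of the scaled risk gap;
    large alpha makes the surrogate approach the 0/1 indicator.
    """
    return 1.0 / (1.0 + np.exp(-alpha * (risk_left - risk_right)))

# the same small concordant gap of 0.05, under increasing alpha
for alpha in (1, 10, 500):
    print(alpha, smooth_indicator(0.55, 0.50, alpha))
```

With alpha = 1 the surrogate barely distinguishes the pair (about 0.51), while with alpha = 500 it is essentially the hard indicator, so even small discordances incur a near-maximal penalty.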
The discrimination part defined above captures a model's ability to discriminate subjects for each cause separately. We also need to ensure that the model can predict the cause accurately, i.e., that the risk for the true cause of the left subject dominates its risks for the other causes. We define the accuracy of a model in terms of a scaled sigmoid function with scaling parameter $\kappa$ as follows
$$A = \sum_{k=1}^{M} \sum_{(l, r) \in \mathcal{X}_k} \sum_{k' \ne k} \sigma\big( \kappa \,[ R_k(T_l, x_l) - R_{k'}(T_l, x_l) ]\big) \qquad (8)$$
The accuracy term penalizes the risk functions only at the event times of the left subjects in comparable pairs. However, it is important that the neural network is optimized to produce risk values that interpolate well to other time intervals. Therefore, we introduce the loss term below
$$L_3 = \sum_{k=1}^{M} \sum_{(l, r) \in \mathcal{X}_k} \sum_{t \le T_l} \big( R_k(t, x_r) \big)^2 \qquad (9)$$
The loss term ensures that the risk of each right subject is minimized for all times before the time-to-event of the left subject in the respective comparable pair. Intuitively, the loss term can be justified as follows: the right subjects do not experience an event before time $T_l$; hence, the probability that they experience an event before $T_l$ should take a small value.
The final loss function is the sum of the discrimination terms (described above), the accuracy term and the loss term, and is given as
$$L = -\sum_{k=1}^{M} \tilde{C}_k \; - \; A \; + \; L_3 \qquad (10)$$
where $\tilde{C}_k$ is the approximated discrimination index (7), $A$ the accuracy term (8) and $L_3$ the loss term (9); the first two enter with negative signs because they are maximized.
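To make the three-part structure concrete, here is a minimal numpy sketch of a single comparable pair's contribution to the training loss; the exact functional forms below (the sigmoid surrogate for discrimination, the cross-cause comparison for accuracy, and the squared-risk penalty for interpolation) are assumptions layered on the description above, not the published equations:

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow in exp for very large scaling parameters
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def pair_loss(R, l, r, k, T_l, alpha=500.0, kappa=0.01):
    """Loss contribution of one comparable pair (l, r) for cause k.

    R has shape (subjects, causes, t_max): R[i, k, t] is subject i's
    risk for cause k at discrete time t + 1.
    """
    t = T_l - 1  # index of the left subject's event interval
    # discrimination: reward R_k(T_l, x_l) > R_k(T_l, x_r)
    disc = 1.0 - sigmoid(alpha * (R[l, k, t] - R[r, k, t]))
    # accuracy: the true cause k should dominate the other causes for l
    acc = sum(1.0 - sigmoid(kappa * (R[l, k, t] - R[l, kk, t]))
              for kk in range(R.shape[1]) if kk != k)
    # interpolation: the right subject's risk should stay low up to T_l
    interp = np.sum(R[r, k, : t + 1] ** 2)
    return disc + acc + interp
```

Summing `pair_loss` over all comparable pairs of all causes gives the total training objective; a concordant pair (left risk above right risk at the left event time) incurs a strictly smaller loss than the discordant arrangement of the same risks.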
Finally, we adjust for the event imbalance and the time-interval imbalance caused by the unequal number of pairs for each event and time interval using inverse propensity weights. These weights are computed from the frequency of occurrence of the various events at the various times, and they multiply the loss terms of the corresponding comparable pairs.
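One simple way to realize such weights (the exact weighting scheme is an assumption of this sketch) is to take the reciprocal of each (event, event-time) combination's count:

```python
import numpy as np
from collections import Counter

def pair_weights(T, D):
    """Inverse-propensity weight for each (event, event-time) combination.

    Weighs each comparable pair by the reciprocal of how often its
    (event, time) combination occurs, so rare events and rare intervals
    are not drowned out by frequent ones.
    """
    counts = Counter(zip(D.tolist(), T.tolist()))
    return {key: 1.0 / c for key, c in counts.items()}

T = np.array([1, 1, 2, 1])
D = np.array([1, 1, 2, 1])
w = pair_weights(T, D)
print(w[(1, 1)], w[(2, 2)])  # frequent combination -> 1/3, rare -> 1.0
```

Each pair's loss term is then multiplied by the weight of its left subject's (event, time) combination.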
We train the feedforward network using the above loss function (10) and regularize it using SELU dropout [16]. Since the loss function is built from pairwise comparisons, training resembles that of a Siamese network [3]: the backpropagated gradients depend on each comparable pair.
4 Experiments
This section includes a discussion of hyperparameter optimization followed by competing risk and survival analysis experiments (code available at https://github.com/santon834/SiameseCompetingRisks). We compare against the Fine-Gray model ("cmprsk" R package), Competing Random Forest (CRF) ("randomForestSRC" R package) and the cause-specific (cs) extension of two single-event (non-competing risks) methods, the Cox PH model and
[14]. In the cause-specific extension of single-event models, we mark the occurrence of any event apart from the event of interest as censorship and decouple the problem into separate single-event problems (one for each cause); this is a standard way of extending single-event models to competing risk models. In the following results we refer to our method with the acronym SSPN.

Parameter        batch size   # hidden layers   hidden layer width   dropout rate
SEER             2048         3                 50                   0.4
Synthetic data   2048         2                 40                   0.35
4.1 Hyperparameter Optimization
Optimization was performed using 5-fold cross-validation with fixed censorship rates in each fold. We chose a 60-20-20 division into training, validation and testing sets. A standard grid search was used to determine the batch size, the number of hidden layers, the width of the hidden layers and the dropout rate. The optimal values of $\alpha$ and $\kappa$ were consistently 500 and 0.01 for all datasets. As previously mentioned, the sets are comprised of patient pairs. In each training iteration, a batch of pairs was sampled with replacement from the training set, which reduces convergence speed but does not lower performance relative to regular batches [21]. We note that the training sets commonly contain tens of millions of pairs, with patients appearing multiple times on both sides of a pair. A standard definition of an epoch would consist of a single iteration over all patients. However, in our case, we learn not only patient-specific characteristics but also patient comparison relationships, which means an epoch with a number of iterations equal to the number of patients is not sufficient. On the other hand, defining an epoch as an iteration over all pairs is impractical. Our best empirical results were attained after 100K iterations with Tensorflow on an 8-core Xeon E3-1240, the Adam optimizer [15] and a decaying learning rate. Table 1 summarizes the optimal hyperparameters.

Dataset     CVD                  Breast Cancer        Other
cs-Cox PH   0.656 [0.629-0.682]  0.634 [0.626-0.642]  0.695 [0.675-0.714]
cs-[14]     0.645 [0.625-0.664]  0.697 [0.686-0.708]  0.675 [0.644-0.706]
Fine-Gray   0.659 [0.605-0.714]  0.636 [0.622-0.650]  0.691 [0.673-0.708]
CRF         0.601 [0.565-0.637]  0.705 [0.692-0.718]  0.636 [0.624-0.648]
SSPN        0.663 [0.625-0.701]  0.735 [0.678-0.793]  0.699 [0.681-0.716]
* p-value < 0.05
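The sampling-with-replacement scheme described above can be sketched as follows (the function name and the use of numpy's generator API are choices made here, not details from the paper):

```python
import numpy as np

def sample_pair_batch(pairs, batch_size, seed=None):
    """Sample a training batch of comparable pairs with replacement.

    With tens of millions of pairs, iterating over every pair in each
    epoch is impractical; sampling with replacement keeps each step
    cheap at some cost in convergence speed.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(pairs), size=batch_size)
    return [pairs[i] for i in idx]

pairs = [(0, 1), (0, 2), (3, 1)]
batch = sample_pair_batch(pairs, batch_size=2048, seed=0)
print(len(batch))  # 2048
```

Every sampled element is one of the original comparable pairs, and the same pair may appear many times within a batch.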
Method                   Cause 1              Cause 2
cs-Cox PH                0.571 [0.554-0.588]  0.581 [0.570-0.591]
cs-[14]                  0.580 [0.556-0.603]  0.593 [0.576-0.611]
Fine-Gray                0.574 [0.559-0.590]  0.586 [0.577-0.594]
Competing Random Forest  0.591 [0.575-0.606]  0.573 [0.557-0.588]
SSPN                     0.603 [0.593-0.613]  0.613 [0.598-0.627]
* p-value < 0.05
4.2 SEER
The Surveillance, Epidemiology, and End Results Program (SEER) dataset provides information on breast cancer patients during the years 1992-2007. A total of 72,809 patients experienced breast cancer, cardiovascular disease (CVD), other diseases, or were right-censored. The cohort consists of 23 features, including age, race, gender, morphology information, diagnostic information, therapy information, tumor size, tumor type, etc. Missing values were replaced by the mean value for real-valued features and by the mode for categorical features. 1.3% of the patients experienced CVD and 15.6% experienced breast cancer. Table 3 displays the results for this dataset. We notice that for the infrequent adverse event, CVD, the performance gain is negligible, while for the frequent breast cancer event, the gain is significant. However, we wish to remind the reader that our focus is on healthcare, where even minor gains have the potential to save lives. Considering there are 72,809 patients, a performance improvement even as low as 0.1% has the potential to save multiple lives and should not be disregarded.
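The imputation step described above can be sketched with pandas; the column names and the choice of pandas itself are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def impute(df, categorical):
    """Mean-impute real-valued columns, mode-impute categorical ones."""
    out = df.copy()
    for col in out.columns:
        if col in categorical:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            out[col] = out[col].fillna(out[col].mean())
    return out

df = pd.DataFrame({"age": [60.0, np.nan, 70.0],
                   "race": ["a", "a", None]})
clean = impute(df, categorical={"race"})
print(clean)
```

After imputation no missing values remain: the missing age becomes the column mean (65.0) and the missing race becomes the most frequent category.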
4.3 Synthetic Data
Due to the relative scarcity of competing risks datasets and methods, we created an additional synthetic dataset to further validate the performance of our method. We constructed two stochastic processes with parameters $\gamma_1, \gamma_2, \gamma_3$ and the event times as follows
$$T_i^1 \sim \exp\!\big( \gamma_3^{\top} x_i^3 + (\gamma_1^{\top} x_i^1)^2 \big), \qquad T_i^2 \sim \exp\!\big( \gamma_3^{\top} x_i^3 + (\gamma_2^{\top} x_i^2)^2 \big) \qquad (11)$$
where $x_i = (x_i^1, x_i^2, x_i^3)$ is the vector of features for patient $i$. For $k = 1, 2$, the features $x_i^k$ only have an effect on the event time for event $k$, while $x_i^3$ has an effect on the event times of both events. Note that we assume event times are exponentially distributed, with a mean parameter depending on both a linear and a nonlinear (quadratic) function of the features. Given the parameters, we first produced the patients; among those, we randomly selected 50% to be right-censored at a time randomly drawn from a uniform distribution. (This censoring fraction was chosen to be roughly the same as in the real datasets, and hence to present the same difficulty as found in those datasets.) Table 3 displays the results for the above dataset. We demonstrate the same consistent performance gain as in the previous case.

5 Conclusion
Competing risks settings are pervasive in healthcare. They are encountered in cardiovascular diseases, in cancer, and in the geriatric population suffering from multiple diseases. To solve the challenging problem of learning the model parameters from timetoevent data while handling right censoring, we have developed a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks based on the wellknown Siamese network architecture. Our method is able to capture complex nonlinear representations missed by classical machine learning and statistical models. Experimental results show that our method is able to outperform existing competing risk methods by successfully learning representations which flexibly describe nonproportional hazard rates with complex interactions between covariates and survival times that are common in many diseases with heterogeneous phenotypes.
References
 [1] Alaa, A.M., van der Schaar, M.: Deep multitask gaussian processes for survival analysis with competing risks (2017)
 [2] Antolini, L., Boracchi, P., Biganzoli, E.: A time-dependent discrimination index for survival data. Statistics in medicine 24(24), 3927–3944 (2005)
 [3] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems. pp. 737–744 (1994)

 [4] Chen, Y., Jia, Z., Mercola, D., Xie, X.: A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Computational and mathematical methods in medicine 2013 (2013)
 [5] Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. vol. 1, pp. 539–546. IEEE (2005)
 [6] Cox, D.R.: Regression models and life-tables. JR Stat. Soc. Ser. B 34, 187–220 (1972)
 [7] Crowder, M.J.: Classical competing risks. CRC Press (2001)
 [8] Fine, J.P., Gray, R.J.: A proportional hazards model for the subdistribution of a competing risk. Journal of the American statistical association 94(446), 496–509 (1999)
 [9] Glynn, R.J., Rosner, B.: Comparison of risk factors for the competing risks of coronary heart disease, stroke, and venous thromboembolism. American journal of epidemiology 162(10), 975–982 (2005)
 [10] Gooley, T.A., Leisenring, W., Crowley, J., Storer, B.E., et al.: Estimation of failure probabilities in the presence of competing risks: new representations of old estimators. Statistics in medicine 18(6), 695–706 (1999)
 [11] Heagerty, P.J., Zheng, Y.: Survival model predictive accuracy and roc curves. Biometrics 61(1), 92–105 (2005)
 [12] Ishwaran, H., Gerds, T.A., Kogalur, U.B., Moore, R.D., Gange, S.J., Lau, B.M.: Random survival forests for competing risks. Biostatistics 15(4), 757–773 (2014)
 [13] Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. The annals of applied statistics pp. 841–860 (2008)
 [14] Katzman, J., Shaham, U., Bates, J., Cloninger, A., Jiang, T., Kluger, Y.: Deep survival: A deep cox proportional hazards network. arXiv preprint arXiv:1606.00931 (2016)
 [15] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [16] Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self-normalizing neural networks. arXiv preprint arXiv:1706.02515 (2017)
 [17] Lambert, P., Dickman, P., Nelson, C., Royston, P.: Estimating the crude probability of death due to cancer and other causes using relative survival models. Statistics in medicine 29(7-8), 885–895 (2010)
 [18] Luck, M., Sylvain, T., Cardinal, H., Lodi, A., Bengio, Y.: Deep learning for patientspecific kidney graft survival analysis. arXiv preprint arXiv:1705.10245 (2017)
 [19] Mayr, A., Hofner, B., Schmid, M.: Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC bioinformatics 17(1), 288 (2016)
 [20] Mayr, A., Schmid, M.: Boosting the concordance index for survival data–a unified framework to derive and evaluate biomarker combinations. PloS one 9(1), e84483 (2014)

 [21] Recht, B., Re, C.: Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences (2012)
 [22] Satagopan, J., Ben-Porat, L., Berwick, M., Robson, M., Kutler, D., Auerbach, A.: A note on competing risks in survival data analysis. British journal of cancer 91(7), 1229–1235 (2004)
 [23] Schmid, M., Wright, M.N., Ziegler, A.: On the use of Harrell's C for clinical risk prediction via random survival forests. Expert Systems with Applications 63, 450–459 (2016)
 [24] Tsiatis, A.: A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences 72(1), 20–22 (1975)
 [25] Wang, J., Fang, Z., Lang, N., Yuan, H., Su, M.Y., Baldi, P.: A multiresolution approach for spinal metastasis detection using deep siamese neural networks. Computers in Biology and Medicine 84, 137–146 (2017)
 [26] Wolbers, M., Koller, M.T., Witteman, J.C., Steyerberg, E.W.: Prognostic models with competing risks: methods and application to coronary risk prediction. Epidemiology 20(4), 555–561 (2009)
 [27] Yousefi, S., Amrollahi, F., Amgad, M., Dong, C., Lewis, J.E., Song, C., Gutman, D.A., Halani, S.H., Vega, J.E.V., Brat, D.J., et al.: Predicting clinical outcomes from large scale cancer genomic profiles with deep survival models. bioRxiv p. 131367 (2017)