Dynamic prediction of time to event with survival curves

by Jie Zhu, et al.

With the ever-growing complexity of the primary health care system, proactive patient failure management is an effective way to enhance the availability of health care resources. One key enabler is the dynamic prediction of time-to-event outcomes. Conventional explanatory statistical approaches lack the capability of making precise individual-level predictions, while data-adaptive binary predictors do not provide nominal survival curves for biologically plausible survival analysis. The purpose of this article is to elucidate that the knowledge of explanatory survival analysis can significantly enhance current black-box data-adaptive prediction models. We apply our recently developed counterfactual dynamic survival model (CDSM) to static and longitudinal observational data and verify that the inflection point of its estimated individual survival curves provides reliable prediction of the patient failure time.






1 Introduction

Time-to-event (TTE) predictions are extensively used by medical statisticians. Traditional logistic regression is not suited to include both the event and time aspects as the outcome in the model. Non-parametric models such as the Kaplan-Meier estimator Kaplan and Meier (1958) and the semi-parametric Cox proportional hazards model and its extensions Cox (1972); Recknor and Gross (1994) face the challenge of adjusting for multiple/time-varying covariates. The recent development of data-adaptive models such as deep neural networks Gensheimer and Narasimhan (2019) and the Super-Learner Golmakani and Polley (2020) enables the efficient estimation of individual survival curves with static and longitudinal data, yet relatively little has been written about the implication of these explanatory techniques in the context of event time prediction.

The strength of explanatory survival analysis has been applied in data-adaptive predictive models to improve the estimation accuracy of survival curves. In DeepHit Lee et al. (2020), a rank loss function is designed to evaluate whether the model can order observations by their expected time to failure; in DeepSurv Katzman et al. (2018), the authors approximate the Cox proportional hazard function using a densely connected neural network; and in WTTE-RNN Martinsson (2016), the predicted event time is assumed to follow a Weibull distribution whose parameters are estimated using a recurrent neural network. These models contrast with conventional binary predictors such as the recurrent neural networks proposed in the 2019 PhysioNet Challenge Reyna et al. (2019), where the prediction of TTE was equated to a longitudinal binary classification problem.

In our recently proposed counterfactual dynamic survival model (CDSM), we relaxed major limitations of the three models mentioned above. Specifically, we do not assume a Cox proportional hazard ratio or make any parametric assumption in our model. At the same time, we allow longitudinal covariates and quantify the uncertainty of the neural network estimates using Bayesian dense layers. The focus of our previous work was the model development and its application to causal inference. In this study, we focus on the predictive power of CDSM as an outcome model and explore how biologically plausible survival curve estimates can improve TTE predictions.

In Section 2, we describe the methodology to estimate the survival outcomes and predict the time to event. Section 3 introduces a set of case studies and model evaluation techniques. Results are presented in Section 4. We end our study with a discussion.

2 Predicting the time to event with survival curves

To formalize the framework for longitudinal survival outcomes, we follow the notations in previous studies Imai and Strauss (2011); Zhu and Gallego (2020). Suppose we observe a sample $\{(\bar{X}_{i,T_i}, \bar{A}_{i,T_i}, \bar{Y}_{i,T_i})\}_{i=1}^{N}$ independently generated by an unknown distribution $\mathcal{P}$,

where $t \in \{1, \dots, T\}$ indexes the time at the upper limit of each time interval (i.e., hours, months or years), $T$ is the maximum of patients' follow-up times, and $X_{i,t}$ are the covariates of observation $i$ at time $t$; $A_{i,t}$ is the exposure condition at time $t$, with $A_{i,t} = 1$ if observation $i$ receives the treatment and $A_{i,t} = 0$ otherwise; $Y_{i,t}$ denotes the outcome at time $t$, with $Y_{i,t} = 1$ if $i$ experienced an event and $Y_{i,t} = 0$ otherwise; the individual follow-up time $T_i$ is determined by the event or censoring time, whichever happened first.

For each individual $i$, we define the conditional hazard rate $\lambda_{i,t}$ as the probability of failure in interval $(t-1, t]$:

$$\lambda_{i,t} = \Pr(Y_{i,t} = 1 \mid Y_{i,t-1} = 0, \bar{A}_{i,t}, \bar{X}_{i,t}), \qquad (1)$$

where $\bar{A}_{i,t}$ and $\bar{X}_{i,t}$ are the history of treatments and covariates until time $t$. The conditional probability of surviving to the end of interval $t$ is given by the probability chain rule:

$$S_{i,t} = \Pr(T_i > t \mid \bar{A}_{i,t}, \bar{X}_{i,t}) = \prod_{s=1}^{t} (1 - \lambda_{i,s}). \qquad (2)$$

We define our target outcome similar to a multivariate logistic regression but with an additional term to capture the event and censoring:

$$\mathbf{Y}_i = (Y_{i,1}, \dots, Y_{i,T}, \delta_i), \qquad (3)$$

where $\delta_i$ is the indicator of the event/censor time of $i$ and $T$ is the maximum follow-up time. We use Equation (3) to estimate the survival curve in Equation (2) as $\hat{S}_{i,t} = \prod_{s=1}^{t} (1 - \hat{Y}_{i,s})$, where $t$ is the time index of vector $\mathbf{Y}_i$.
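The probability chain rule in Equation (2) turns per-interval hazards into a survival curve; a minimal NumPy sketch (the hazard values below are toy numbers, not taken from any of the study data sets):

```python
import numpy as np

def survival_from_hazards(hazards):
    """Survival curve via the probability chain rule:
    S(t) = prod_{s<=t} (1 - lambda_s)."""
    return np.cumprod(1.0 - np.asarray(hazards, dtype=float), axis=-1)

# Toy per-interval hazards that rise over follow-up time.
lam = [0.01, 0.02, 0.05, 0.10, 0.30, 0.60]
S = survival_from_hazards(lam)  # monotonically decreasing survival curve
```

The same one-liner applies row-wise to a matrix of predicted hazards, giving one curve per patient.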


Figure 1: Prediction of event time based on hazard curves.

Conventional predictive models fit the multivariate logistic outcome in Equation (3) using binary classifiers, where researchers have to set an optimal probability threshold to classify whether an event will occur (see the hazard threshold in Figure 1). For instance, one can use the Nelder-Mead method to locate the optimal probability threshold by minimizing the distance between the actual and predicted event times, yet this in-sample threshold might not be optimal for predicting the TTE on a new cohort.
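As a concrete illustration, such an in-sample threshold search can be run with SciPy's Nelder-Mead optimizer; the hazards below are synthetic, and the first-crossing rule mapping a threshold to a predicted time is our own assumption, not a detail specified by the study:

```python
import numpy as np
from scipy.optimize import minimize

def first_crossing_time(hazards, threshold):
    """Predicted event time: first interval whose hazard exceeds the
    threshold (assumed rule; never-crossing rows get the last interval)."""
    crossed = hazards > threshold
    times = np.argmax(crossed, axis=1)
    times[~crossed.any(axis=1)] = hazards.shape[1] - 1
    return times

def mean_timing_error(x, hazards, true_times):
    # Distance between actual and predicted event times at threshold x[0].
    return np.mean(np.abs(first_crossing_time(hazards, x[0]) - true_times))

rng = np.random.default_rng(0)
hazards = np.sort(rng.uniform(size=(200, 20)), axis=1)  # rising synthetic hazards
true_times = rng.integers(0, 20, size=200)

res = minimize(mean_timing_error, x0=[0.5], args=(hazards, true_times),
               method="Nelder-Mead")
best_threshold = float(res.x[0])
```

The threshold found this way is tuned to the training cohort, which is exactly the transferability problem the inflection-point approach below Equation (2) avoids.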

This study attempts to learn from the biological survival curve and uses the inflection point, $t^*_i$, of the survival curve in Equation (2) to signify the event time, which we define as the time point equating the second derivative of the estimated survival curve to zero:

$$t^*_i = \left\{ t : \frac{\partial^2 \hat{S}_{i,t}}{\partial t^2} = 0 \right\}. \qquad (4)$$

In Figure 1, we can see the hazard rate increases rapidly after $t^*_i$, which means a high probability of experiencing an event. The uncertainty of the estimated survival curve quantifies the uncertainty of the predicted event time.
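On a discrete time grid, the second derivative in Equation (4) becomes a second difference; a minimal sketch of locating the inflection point (the sign-change search and the logistic toy curve are our own illustration):

```python
import numpy as np

def inflection_index(surv_curve):
    """Index where the second difference of S(t) first changes sign --
    a discrete analogue of setting the second derivative to zero."""
    d2 = np.diff(np.asarray(surv_curve, dtype=float), n=2)
    sign_flips = np.where(np.diff(np.sign(d2)) != 0)[0]
    if sign_flips.size == 0:
        return len(surv_curve) - 1      # no inflection found: last interval
    return int(sign_flips[0]) + 1       # re-centre after the two differences

# Toy sigmoid-shaped survival decline with its inflection near t = 10.
t = np.arange(20)
S = 1.0 / (1.0 + np.exp(t - 10.0))
t_star = inflection_index(S)
```

No probability threshold appears anywhere in this computation, which is the point of the approach.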

3 Study design and databases

We built and then validated a survival outcome model based on the retrospective analysis of three static databases and three dynamic longitudinal databases. A summary of these data sets is presented in Table 1.


Database Sample Covariates Unique Time Points % Censored
SUPPORT 8873 14 1714 32%
METABRIC 1904 9 1686 42%
GBSG 2232 7 1230 43%
PhysioNet 40336 40 20*2 hours 93%
MIMIC-III 20938 44 20*2 hours 86%
CPRD AF 18102 53 20*3 months 82%
Table 1: Summary of the clinical data sets.

The static data sets were provided by the DeepSurv python package Katzman et al. (2018) which includes:

  1. The Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments (SUPPORT);

  2. The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC); and

  3. The Rotterdam tumor bank and German Breast Cancer Study Group (GBSG).

The longitudinal data sets are:

  1. The Medical Information Mart for Intensive Care version III (MIMIC-III), an open-access, anonymized database of 61,532 admissions from 2001–2012 in six ICUs at a Boston teaching hospital Johnson et al. (2016).

  2. The 2019 PhysioNet Sepsis prediction challenge data set Reyna et al. (2019) (PhysioNet), containing more than 3.3 million admissions from 2003–2016 in 459 ICUs across the United States.

  3. The Clinical Practice Research DataLink data set Herrett et al. (2015) (CPRD AF) comparing Vitamin K Antagonists (VKAs) and Non-Vitamin K antagonist oral anticoagulants (NOAC) in preventing three combined outcomes (ischemic attack, major bleeding and death) of patients with non-valvular atrial fibrillation (AF).

In both MIMIC-III and PhysioNet, we define a Sepsis event as a suspected infection (prescription of antibiotics and sampling of bodily fluids for microbiological culture) combined with evidence of organ dysfunction, defined by a two-point deterioration of the SOFA score Seymour (2016). We follow previous papers Reyna et al. (2019); Komorowski et al. (2018) for data extraction and processing. For PhysioNet, we combined data from hospitals A and B, and used hospital location (A or B) as the synthetic treatment condition. For MIMIC-III, we define the treatment as the usage of mechanical ventilation (MV). For the CPRD AF data set, the outcome of interest is the first occurrence of the combined outcomes of major bleeding, death and stroke. The treatment is the usage of NOAC versus the control of using VKAs.

For the static data sets, we discretized the time points into windows of 50 time steps and censored all steps that do not form a complete window in the SUPPORT, METABRIC and GBSG data sets. For the longitudinal data sets, we considered the first 20 time stamps for each patient (i.e., the first 20 2-hour intervals for PhysioNet and MIMIC-III, and the first 20 months for the AF data set). We split each database into an estimation data set (70% of the original data for training and 10% for validation) and a testing data set (the remaining 20%).
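The 70/10/20 split described above can be sketched as a patient-level random partition (function name and seed are our own, for illustration):

```python
import numpy as np

def split_patients(n_patients, train_frac=0.7, val_frac=0.1, seed=42):
    """Partition patient indices into training (70%), validation (10%)
    and testing (remaining 20%) index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_patients)
    n_train = int(n_patients * train_frac)
    n_val = int(n_patients * val_frac)
    return (idx[:n_train],                     # estimation: training
            idx[n_train:n_train + n_val],      # estimation: validation
            idx[n_train + n_val:])             # testing

train_idx, val_idx, test_idx = split_patients(1000)
```

Splitting by patient (rather than by time step) keeps all intervals of one patient on the same side of the split, so testing scores reflect performance on unseen individuals.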

3.1 Model evaluation

We performed an evaluation of the estimations of survival curves and predictions of event time using the three metrics described below:

The area under the receiver operating characteristic curve (AUROC) and C-Index: we use the AUROC and Harrell's C-index Harrell et al. (1982) to evaluate the models' discrimination performance. Both indicators are calculated using the multivariate logistic outcome in Equation (3).
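For reference, a minimal, unweighted version of Harrell's C-index over predicted event times can be written as follows; production evaluations typically compute it from risk scores with full tie handling (e.g., via a survival library such as lifelines), so treat this as an illustrative sketch:

```python
def harrell_c_index(pred_times, true_times, event):
    """Concordance: among comparable pairs (the earlier subject had an
    observed event), count pairs whose predicted ordering agrees."""
    concordant, comparable = 0.0, 0
    n = len(true_times)
    for i in range(n):
        if not event[i]:
            continue            # a censored subject cannot anchor a pair
        for j in range(n):
            if true_times[i] < true_times[j]:
                comparable += 1
                if pred_times[i] < pred_times[j]:
                    concordant += 1.0
                elif pred_times[i] == pred_times[j]:
                    concordant += 0.5   # ties get half credit
    return concordant / comparable
```

A value of 1.0 means the model orders every comparable pair correctly; 0.5 is chance-level ordering.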

Utility distance (Distance Score): we define the distance metric to evaluate the predicted event time as:

$$d = \frac{1}{N} \sum_{i=1}^{N} \left| t^*_i - T_i \right|,$$

where $t^*_i$ is defined in Equation (4) and $T_i$ is the true event/censoring time.
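A minimal sketch, assuming the utility distance is the mean absolute difference between the predicted inflection time and the true event/censoring time:

```python
import numpy as np

def distance_score(pred_times, true_times):
    """Mean absolute gap between predicted and true event/censor times."""
    pred = np.asarray(pred_times, dtype=float)
    true = np.asarray(true_times, dtype=float)
    return float(np.mean(np.abs(pred - true)))
```

Lower is better: a score of 0 would mean every predicted time coincides with the observed event or censoring time.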

We compared following algorithms on the estimation of survival curves and the prediction of event time:

  • Dynamic Bayesian survival causal model (D-Surv): the model targets the outcome defined in Equation (3) by training two counterfactual sub-networks for treated and controlled observations. If no treatment variable is defined, we create two copies of the original data set, with the first one marked as receiving the treatment and the second one as under control. The loss function of D-Surv has three components: 1) the partial log-likelihood loss of the joint distribution of the first hitting time and the corresponding event or right-censoring; 2) the rank loss function to capture the concordance score defined in survival analysis; and 3) the calibration loss function that minimizes the selection bias in treatment assignment. Please refer to our previous paper for details.

  • Plain recurrent neural network with survival outcomes (RNN): the model modifies the D-Surv by removing the counterfactual sub-networks and the third loss function in D-Surv. No treatment variable has to be specified in this model.

  • Plain recurrent neural network with binary outcomes (RNN Binary): the model provides the direct prediction on the longitudinal outcome in Equation (3) using the mean squared error loss function.

  • DeepHit Lee et al. (2018): the model uses the same loss functions as the RNN but does not capture the history of covariates and is only evaluated for static databases.

The model construction and training uses Python 3.8.0 with Tensorflow 2.3.0 and Tensorflow-Probability 0.11.0 Abadi et al. (2015) (code available at https://github.com/EliotZhu/DSurv).

4 Results

In Table 2, we confirmed that CDSM, RNN and DeepHit had similar performance on the estimation of survival curves (see the concordance index) and the prediction of event time (see the distance score) in the three static testing data sets. However, in terms of the AUROC, we noticed that RNN Binary outperformed the others, although it had a lower C-Index.

The counterfactual sub-networks and the selection bias calibration loss function in CDSM did not affect the estimation accuracy, resulting in the equivalence among CDSM, RNN and DeepHit in the static non-causal survival estimations.


Dataset Metabric
Metrics CDSM RNN RNN Binary DeepHit
AUROC 0.869 0.877 0.885 0.874
C-Index 0.685 0.655 0.590 0.683
Distance Score 4.186 4.034 3.944 4.097
AUROC 0.781 0.780 0.817 0.798
C-Index 0.617 0.593 0.559 0.613
Distance Score 4.384 4.974 4.464 4.668
AUROC 0.792 0.788 0.802 0.820
C-Index 0.653 0.650 0.550 0.633
Distance Score 4.434 5.474 5.672 4.231
  • All metrics are averaged over estimation windows using testing data sets. The best value in each metric is in bold.

Table 2: Model performance on static datasets

A similar trend was observed when we evaluated CDSM, RNN, and RNN Binary using the longitudinal databases (see the estimation data set evaluations in Table 3). However, in the corresponding testing data sets, D-Surv significantly outperformed RNN Binary, especially on the C-Index and distance score. The imposition of the survival outcome in Equation (3) and the concordance loss functions defined in D-Surv/RNN produced nominal survival curves, whereas RNN Binary only maximized the discrimination performance on the binary indicator of whether sepsis has occurred (i.e., the estimated survival probabilities for the AF testing data set were stacked at zeros and ones, as shown in Figure 2 (a)).

All metrics are averaged over 20 estimation windows using either estimation (the default) or testing data sets (specified in brackets). The best value in each metric is in bold.


Dataset PhysioNet (hours) MIMIC-III (hours) AF (months)
Model CDSM RNN RNN Binary CDSM RNN RNN Binary CDSM RNN RNN Binary
AUROC 0.980 0.984 0.997 0.959 0.953 0.988 0.978 0.986 0.995
AUROC (test) 0.869 0.858 0.824 0.969 0.941 0.983 0.984 0.983 0.933
C-Index 0.991 0.985 0.992 0.749 0.851 0.823 0.871 0.880 0.885
C-Index (test) 0.874 0.837 0.776 0.751 0.653 0.682 0.877 0.863 0.785
Distance Score 3.388 4.034 2.017 2.230 2.410 2.191 1.331 1.082 0.751
Distance Score (test) 3.047 3.635 3.767 2.291 2.375 2.339 1.116 1.035 1.240
Score Std 11.586 7.584 0.248 1.793 0.498 0.041 3.294 1.753 0.009
Score Std (test) 11.522 6.813 0.080 1.743 0.557 0.038 3.118 1.469 0.011
Table 3: Model performance on dynamic datasets
Figure 2: Diagnostic plots for event time prediction with AF testing data set (a) distribution of estimated survival probabilities across all time points for AF testing data set by benchmark algorithms; (b) average difference between the predicted and true event/censor time estimated using probability threshold approach for AF testing data set; and (c) scatter plot of predicted and true event/censor time estimated using the inflection point approach for AF testing data set.

The nominal survival curves from CDSM made it possible to apply Equation (4) to locate the inflection point as the event time. This is a better approach than choosing a probability threshold to construct a binary classifier. In Figure 2 (b), we see that the error of the predicted event time is sensitive to the chosen probability threshold: the average timing difference ranged from -4.3 to 9.5 months over a small threshold range of 0.99 to 0.9999. In contrast, after applying the inflection point to determine the event time, the predicted time accurately tracked the true time in Figure 2 (c), with most predictions occurring ahead of the true AF event time. On average, the predicted time is 1.720 months ahead of the true AF event time, and 1.014 months ahead of the true AF censoring time. CDSM thus allows threshold-free prediction of the individual event time and early intervention on patients who might be prone to event occurrence.

5 Discussion

This study demonstrated that injecting the knowledge of survival analysis into the design of recurrent neural networks can significantly improve the prediction of time-to-event outcomes. Our proposed outcome model, CDSM, fits the joint distribution of both failure and censored observations. Conventional machine learning algorithms for binary discrimination can maximize evaluation scores such as the AUROC, but fail to provide meaningful survival curves and reliable predictions of event time. The major drawback of these algorithms, as identified by our empirical study, is that they do not take account of censoring and suffered a significant drop in accuracy when evaluated on the testing databases.


This work was supported by National Health and Medical Research Council, project grant no. 1125414.


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, et al. (2015) TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Cited by: §3.1.
  • D. R. Cox (1972) Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, pp. 187–220. Cited by: §1.
  • F. E. Harrell, R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati (1982) Evaluating the yield of medical tests. Journal of the American Medical Association 247 (18), pp. 2543–2546. Cited by: §3.1.
  • M. F. Gensheimer and B. Narasimhan (2019) A scalable discrete-time survival model for neural networks. PeerJ 7, pp. e6257–e6257. External Links: Document, Link Cited by: §1.
  • M. K. Golmakani and E. C. Polley (2020) Super Learner for Survival Data Prediction. The International Journal of Biostatistics 0 (0). External Links: Document, Link Cited by: §1.
  • E. Herrett, A. M. Gallagher, K. Bhaskaran, H. Forbes, R. Mathur, T. van Staa, and L. Smeeth (2015) Data Resource Profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology 44 (3), pp. 827–836. External Links: Document Cited by: item 3.
  • K. Imai and A. Strauss (2011) Estimation of Heterogeneous Treatment Effects from Randomized Experiments, with Application to the Optimal Planning of the Get-Out-the-Vote Campaign. Political Analysis 19 (1), pp. 1–19. External Links: Document, Link Cited by: §2.
  • A. Johnson, T. Pollard, and L. Shen (2016) MIMIC-III, a freely accessible critical care database. Sci Data 3, pp. 160035–160035. Cited by: item 1.
  • E. L. Kaplan and P. Meier (1958) Nonparametric estimation from incomplete observations. Journal of the American statistical association 53 (282), pp. 457–481. Cited by: §1.
  • J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger (2018) DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology 18 (1), pp. 24. Cited by: §1, §3.
  • M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal (2018) The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 24 (11), pp. 1716–1720. External Links: Document, Link Cited by: §3.
  • C. Lee, J. Yoon, and M. van der Schaar (2020) Dynamic-DeepHit: A Deep Learning Approach for Dynamic Survival Analysis With Competing Risks Based on Longitudinal Data. IEEE Transactions on Biomedical Engineering 67 (1), pp. 122–133. External Links: Document, Link Cited by: §1.
  • C. Lee, W. R. Zame, J. Yoon, and M. van der Schaar (2018) DeepHit: A Deep Learning Approach to Survival Analysis With Competing Risks.. In AAAI, pp. 2314–2321. Cited by: item 4.
  • E. Martinsson (2016) Wtte-rnn: Weibull time to event recurrent neural network. Cited by: §1.
  • J. C. Recknor and A. J. Gross (1994) Fitting Survival Data to a Piecewise Linear Hazard Rate in the Presence of Covariates. Biometrical Journal. External Links: Document Cited by: §1.
  • M. Reyna, C. Josef, R. Jeter, S. Shashikumar, B. Moody, M. B. Westover, A. Sharma, S. Nemati, and G. Clifford (2019) Early Prediction of Sepsis from Clinical Data – the PhysioNet Computing in Cardiology Challenge. Vol. (version 1.0.0). External Links: Document, Link Cited by: §1, item 2, §3.
  • C. W. Seymour (2016) Assessment of clinical criteria for sepsis: For the third international consensus definitions for sepsis and septic shock (sepsis-3). J. Am. Med. Assoc 315, pp. 762–774. Cited by: §3.
  • J. Zhu and B. Gallego (2020) Targeted Estimation of Heterogeneous Treatment Effect in Observational Survival Analysis. Journal of Biomedical Informatics, pp. 103474. External Links: Document Cited by: §2.