A Bayesian Approach to Modelling Longitudinal Data in Electronic Health Records

12/19/2019 ∙ by Alexis Bellot, et al.

Analyzing electronic health records (EHR) poses significant challenges because often few samples are available describing a patient's health and, when available, their information content is highly diverse. The problem we consider is how to integrate sparsely sampled longitudinal data, missing measurements informative of the underlying health status and fixed demographic information to produce estimated survival distributions updated through a patient's follow up. We propose a nonparametric probabilistic model that generates survival trajectories from an ensemble of Bayesian trees that learns variable interactions over time without specifying beforehand the longitudinal process. We show performance improvements on Primary Biliary Cirrhosis patient data.




1 Introduction

Clinical prognostic models derived from electronic medical records are an important support for many critical diagnostic and therapeutic decisions. The majority of these models, however, do not leverage the information contained in a patient’s history, such as tests, vital signs, and biomarkers. Longitudinal EHR data is increasingly relevant to better understand and predict the health state of patients Stanziano et al. (2010), but it remains a challenge for analysis because of irregular, sporadic sampling and high, often informative, missingness Ferrer et al. (2017). As an example of a typical trajectory found in EHRs, Figure 1 depicts two longitudinal variables describing a patient suffering from liver disease.

Figure 1: Illustration of the available patient’s record and subsequent model survival predictions for an exemplary patient with primary biliary cirrhosis. The left and middle panel suggest that the patient gradually transitions from being healthy to becoming sick as elevations in bilirubin measurements over time signal an advanced disease stage; a fact reflected in our model’s survival predictions (right panel). The times at which survival predictions are updated are given by the red dashed lines.

In this paper we consider the problem of analyzing the survival of patients on the basis of scarce longitudinal data sampled irregularly over time. Modelling sparsely sampled data, which arises when intervals between observations are often large, is becoming increasingly relevant for managing elderly patient populations due to the greater prevalence of chronic diseases that develop over a period of years or decades Licher et al. (2019); Wills et al. (2011). A substantial portion of machine learning research now investigates prognosis with time series data, typically focusing on patients in the hospital, where information is densely collected. Predictions in this setting tend to target a binary event and require structured and regularly sampled data Choi et al. (2016); Lipton (2017). Modelling survival data differs from the above because data is often recorded with censoring: a patient may drop out of a study, but information up to that time is known. Moreover, the sparsity of recorded information tends to lead to substantial uncertainty about model predictions, which needs to be captured for reliable inference in practice Begoli et al. (2019). We discuss these issues and develop a Bayesian nonparametric model that aims to overcome these difficulties by (1) quantifying the uncertainty around model predictions in a principled manner at the individual level, while (2) avoiding assumptions on the data generating process and adapting model complexity to the structure of the data. Both contributions are particularly important for personalizing health care decisions, as predictions may be uncertain due to lack of data, while we can expect the underlying heterogeneous physiological time series to vary widely across patients.

2 Problem Formulation

Our goal is to analyze the occurrence of the event of interest in a medically-relevant time frame $(t, t + \Delta t]$. The survival probability given survival up to time $t$ is given by,

$$S(t + \Delta t \mid t) = P\big(T > t + \Delta t \mid T > t, \mathcal{X}(t)\big).$$

Each individual $i$ in our population is described by an irregularly sampled time series, defined as a sequence of multidimensional observations $x_{ij}$ (in $\mathbb{R}^d$) with irregular intervals between their observation times $t_{ij}$. We denote by $\mathcal{X}_i(t)$ all information available for patient $i$ up to time $t$. Our target for prediction is the time until the occurrence of the event of interest $T_i$. For some patients this outcome may be censored (with censoring time denoted $C_i$), for example if they drop out of the study. We write $\delta_i = 1$ for patients experiencing the event of interest and $\delta_i = 0$ for censored patients. Let $\{(\mathcal{X}_i, T_i, \delta_i)\}_{i=1}^{n}$ denote $n$ independent samples of the random tuple $(\mathcal{X}, T, \delta)$ on which a data-driven model is to be fitted.

Note that it will also be useful to define missing data as those missing entries in the multidimensional feature vector $x_{ij}$ recorded at an observation time $t_{ij}$. This is because when other information is observed at time $t_{ij}$, missing risk factors may indicate a perception of reduced relevance in certain patients, as is the case with routine measurements such as the Body Mass Index Bhaskaran et al. (2013).
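As a minimal sketch of how such a record could be held in memory, the following Python snippet stores irregularly timed visits with missing entries and a censoring flag. The class names `Visit` and `PatientRecord`, the field names and the example values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Visit:
    time: float                    # observation time, e.g. years since enrolment
    values: List[Optional[float]]  # one entry per variable; None marks a missing measurement


@dataclass
class PatientRecord:
    visits: List[Visit]   # irregularly spaced visits
    time_to_event: float  # observed follow-up time (event time or censoring time)
    event: bool           # True if the event was observed, False if censored

    def history(self, t: float) -> List[Visit]:
        """All visits recorded up to and including time t, in time order."""
        return sorted((v for v in self.visits if v.time <= t), key=lambda v: v.time)

    def missing_mask(self, t: float) -> List[List[bool]]:
        """For each visit up to time t, flag which variables are missing."""
        return [[x is None for x in v.values] for v in self.history(t)]


# Hypothetical patient with two variables and three irregular visits.
record = PatientRecord(
    visits=[Visit(0.0, [1.1, 3.5]), Visit(0.7, [None, 3.2]), Visit(2.4, [4.8, None])],
    time_to_event=3.0,
    event=True,
)
```

Keeping missing entries as explicit `None` values, rather than imputing them, preserves the missingness pattern so that it can later be used as a signal.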

3 Model description

In this section we introduce our proposed modelling approach, termed Bayesian Nonparametric Dynamic Survival (BNDS) model.

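We represent a patient trajectory explicitly as a discrete sequence of tuples $(t_{ij}, x_{ij})$ consisting of the observation times $t_{ij}$ and corresponding measurements $x_{ij}$. For each one of these sequences, we define event indicators $y_{ij} \in \{0, 1\}$, which represent the survival status of a patient (i.e. $y_{ij} = 0$ if the patient was alive at time $t_{ij}$). The probability of an event is described by $p_{ij} = P(y_{ij} = 1 \mid t_{ij}, x_{ij})$.

This interpretation as conditional probabilities is especially useful to define full survival distributions. Define a fine sequence of time windows $t < \tau_1 < \dots < \tau_K$ in the future. The survival probability at a time $\tau_k$ in the future given survival up to time $t$ is then given by Singer and Willett (1993),

$$S(\tau_k \mid t) = \prod_{l=1}^{k} \big(1 - p(\tau_l, \mathcal{X}(t))\big),$$

where $p(\tau_l, \mathcal{X}(t))$ is the probability of an event in window $l$ given survival up to its start and the information available at time $t$.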
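The discrete-time construction of Singer and Willett (1993) can be illustrated with a short Python sketch: survival over successive future windows is the running product of one minus the per-window event probabilities. The function name and the example probabilities are hypothetical.

```python
from typing import List, Sequence


def conditional_survival(event_probs: Sequence[float]) -> List[float]:
    """Survival probability at the end of each future window.

    event_probs[k] is the probability of the event in window k,
    conditional on having survived up to the start of that window.
    """
    survival = 1.0
    curve = []
    for p in event_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("event probabilities must lie in [0, 1]")
        survival *= 1.0 - p  # survive this window given survival so far
        curve.append(survival)
    return curve


# Hypothetical per-window event probabilities for three future windows.
curve = conditional_survival([0.1, 0.2, 0.5])
```

Because each factor lies between 0 and 1, the resulting curve is non-increasing by construction.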


Prior model. The nonparametric nature of inference follows from first assigning a prior over a random basis of regression trees that captures the interaction of time and the values of the patient variables at that time, denoted by $\pi$ (the prior on the trees itself is composed of priors on the tree structure, that is the depth of the tree and the splitting variables and values in each node, and on the leaf node outputs; see Chipman et al. (2010) and the Supplement for more details). For a given prior $\pi$, the generative model proceeds as follows: we sample $M$ regression trees $g_1, \dots, g_M$, combine them to form an ensemble and transform their output to issue probabilities $p_{ij}$ for each tuple $(t_{ij}, x_{ij})$. In a nutshell,

$$g_1, \dots, g_M \sim \pi, \qquad p_{ij} = \Phi\Big(\sum_{m=1}^{M} g_m(t_{ij}, x_{ij})\Big),$$

where $\Phi$ denotes the cumulative distribution function of a standard Gaussian random variable. In the medical setting this setup is useful because prior beliefs on the importance of risk factors for a disease, such as diabetes, can be included in $\pi$ to encourage trees to split on that variable or to regulate the depth of the trees based on the believed complexity of associations. Refer to the Supplement for more details.
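To make the sum-of-trees construction concrete, the sketch below evaluates a small, hand-written ensemble on a (time, measurements) tuple and maps the summed output to a probability with the standard Gaussian CDF. It is only an illustration: in the model the trees are drawn from a prior and updated by MCMC, whereas here they are fixed. The tuple-based tree encoding, the function names and all numbers are hypothetical.

```python
from statistics import NormalDist

# A tree is either a float (leaf output) or a tuple
# (feature_index, threshold, left_subtree, right_subtree).
# Feature 0 is the observation time; features 1.. are the patient variables.


def tree_output(tree, features):
    """Follow the splitting rules down to a leaf and return its value."""
    while not isinstance(tree, float):
        index, threshold, left, right = tree
        tree = left if features[index] <= threshold else right
    return tree


def event_probability(trees, time, x):
    """Probit link applied to the summed output of all trees."""
    features = [time] + list(x)
    score = sum(tree_output(tree, features) for tree in trees)
    return NormalDist().cdf(score)


# Hypothetical ensemble of two shallow trees.
trees = [
    (0, 2.0, -0.8, (1, 3.0, -0.2, 0.6)),  # split on time, then on the first variable
    (1, 1.5, -0.4, 0.3),                  # split on the first variable only
]

p_early = event_probability(trees, 1.0, [4.0])  # summed score of about -0.5
p_late = event_probability(trees, 5.0, [4.0])   # summed score of about 0.9
```

Including time as an ordinary splitting feature is what lets a tree ensemble of this kind represent interactions between elapsed time and measurement values without a prespecified longitudinal model.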

Posterior distribution. Our observation model guides our probability model to a posterior distribution of parameters that agrees with the observed data. It is defined as follows,

$$y_{ij} \mid p_{ij} \sim \text{Bernoulli}(p_{ij}).$$


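We compute the posterior (intractable because of the large parameter space) of the set of tree structures and leaf node parameters given the observed data by repeatedly sampling from a tailored version of the Backfitting MCMC algorithm introduced in Chipman et al. (2010). In a given iteration, we subtract from the observed outcomes the fit of the current ensemble and draw the subsequent tree conditional on the resulting residual with a Gibbs sampler and a Metropolis step. Because we target binary outcomes, we use a data augmentation approach Albert and Chib (1993), which introduces Gaussian latent variables $z_{ij}$ (one for each observed patient $i$ and observation time $t_{ij}$) simulated in an additional step in the MCMC algorithm. Let $f(t_{ij}, x_{ij})$ denote the output of the current tree ensemble at the tuple $(t_{ij}, x_{ij})$. $z_{ij}$ is generated conditional on the observed data as follows,

$$z_{ij} \mid y_{ij} \sim \begin{cases} \mathcal{N}\big(f(t_{ij}, x_{ij}), 1\big) \text{ truncated to } (0, \infty) & \text{if } y_{ij} = 1, \\ \mathcal{N}\big(f(t_{ij}, x_{ij}), 1\big) \text{ truncated to } (-\infty, 0] & \text{if } y_{ij} = 0. \end{cases}$$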
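A minimal sketch of the latent-variable draw in the data augmentation scheme of Albert and Chib (1993) is given below. It samples from a unit-variance Gaussian centred at the current ensemble output, truncated to the positive half-line when the event indicator is 1 and to the non-positive half-line otherwise, using inverse-CDF sampling. The function name and the sign convention (positive latent value for an event) are assumptions made for illustration, and the clamping step is only adequate for moderate values of `mean`.

```python
import random
from statistics import NormalDist


def sample_latent(y: int, mean: float, rng: random.Random) -> float:
    """Draw z ~ N(mean, 1) truncated to z > 0 if y == 1, and to z <= 0 if y == 0."""
    std_normal = NormalDist()
    # Probability that an untruncated N(mean, 1) draw falls at or below zero.
    p_below = std_normal.cdf(-mean)
    if y == 1:
        u = rng.uniform(p_below, 1.0)  # upper part of the distribution
    else:
        u = rng.uniform(0.0, p_below)  # lower part of the distribution
    # inv_cdf is undefined at exactly 0 and 1, so keep u strictly inside (0, 1).
    eps = 1e-12
    u = min(max(u, eps), 1.0 - eps)
    return mean + std_normal.inv_cdf(u)


rng = random.Random(0)
z_event = sample_latent(1, -0.3, rng)     # positive latent value
z_no_event = sample_latent(0, -0.3, rng)  # non-positive latent value
```

Conditional on these latent values the problem becomes a Gaussian regression on the tree ensemble, which is what makes the backfitting updates tractable.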

Draws from this posterior are used to sample real-valued latent variables that link the probit model for binary outcomes with the tree regression model. This step can be performed exactly with a Gibbs sampler, as the posterior has a well-defined closed form Kapelner and Bleich (2013). For an individual having survived up to $t$, the survival probability beyond time $t$ is estimated by extrapolating along the time dimension on this surface,

$$\hat{S}(\tau_k \mid t) = \prod_{l=1}^{k} \Big(1 - \Phi\big(\hat{f}(\tau_l, \mathcal{X}(t))\big)\Big),$$

where $t < \tau_1 < \dots < \tau_k$ are future time windows, $\hat{f}$ is the fitted tree ensemble and $\mathcal{X}(t)$ is all patient information up to time $t$. We quantify uncertainty around model predictions by considering the quantiles of the posterior distribution, providing credible intervals. These are especially important for heterogeneous data samples, as flexible algorithms might overfit the training data and give unreliable predictions at test time.
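One way to turn posterior draws into credible intervals is to take pointwise quantiles across the sampled survival curves. The sketch below assumes each posterior sample yields one survival curve on a common time grid; the function name, the linear-interpolation quantile rule and the example draws are hypothetical choices rather than the authors' procedure.

```python
def credible_band(draws, level=0.95):
    """Pointwise equal-tailed credible band from posterior survival curves.

    draws: list of curves, one per posterior sample, all on the same time grid.
    Returns a list of (lower, median, upper) tuples, one per grid point.
    """
    def quantile(sorted_values, q):
        # Linear interpolation between neighbouring order statistics.
        position = q * (len(sorted_values) - 1)
        low = int(position)
        high = min(low + 1, len(sorted_values) - 1)
        weight = position - low
        return sorted_values[low] + weight * (sorted_values[high] - sorted_values[low])

    alpha = (1.0 - level) / 2.0
    band = []
    for values in zip(*draws):  # iterate over grid points
        ordered = sorted(values)
        band.append((quantile(ordered, alpha),
                     quantile(ordered, 0.5),
                     quantile(ordered, 1.0 - alpha)))
    return band


# Three hypothetical posterior draws of a survival curve on a two-point grid.
draws = [[0.9, 0.7], [0.8, 0.6], [1.0, 0.5]]
band = credible_band(draws, level=1.0)  # level 1.0 returns the min, median and max
```

A wide band at a given horizon signals that the data carry little information about that patient at that time, which is the situation the paragraph above warns about.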

Informative missing values. As we have noted, missing measurements may often reflect decisions by clinicians which implicitly provide information about the health status of the patient; this setting corresponds to data missing not at random (MNAR), see Little and Rubin (2014) for more details. To include this information, we augment the space of splitting rules in the stochastic tree construction to include missingness of a variable as a splitting criterion, similarly to the approach of Twala et al. (2008); Kapelner and Bleich (2013). Such a split, if informative for survival (in the sense of increasing the likelihood of observed event times under the current model), will be encouraged by the MCMC sampling algorithm and result in meaningful partitions of the data based on unobserved measurements. For a possible splitting threshold $c$ in the range of a patient variable $x$ with missing values, three splitting rules are considered: (1) $x \le c$ or $x$ missing versus $x > c$, (2) $x \le c$ versus $x > c$ or $x$ missing, and (3) $x$ missing versus $x$ observed.

These are uniformly sampled as a potential splitting rule in the MCMC procedure. No imputation of training or testing data is required.
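The three missingness-aware splitting rules can be sketched as a routing function, following the general idea in Twala et al. (2008) and Kapelner and Bleich (2013). The encoding of a missing value as `None`, the rule numbering and the function names are assumptions made for illustration.

```python
import random


def goes_left(value, threshold, rule):
    """Decide whether an observation is routed to the left child of a split.

    rule 1: missing values join the "value <= threshold" branch
    rule 2: missing values join the "value > threshold" branch
    rule 3: split on missingness alone (missing goes left, observed goes right)
    """
    missing = value is None
    if rule == 1:
        return missing or value <= threshold
    if rule == 2:
        return (not missing) and value <= threshold
    if rule == 3:
        return missing
    raise ValueError("rule must be 1, 2 or 3")


def propose_rule(rng: random.Random) -> int:
    """Pick one of the three rules uniformly at random, as a proposal step would."""
    return rng.choice([1, 2, 3])


# Hypothetical measurements of one variable, with two missing entries.
values = [0.4, None, 2.7, None, 1.1]
left_rule_1 = [goes_left(v, 1.5, 1) for v in values]
left_rule_3 = [goes_left(v, 1.5, 3) for v in values]
```

Because missing entries are routed by the rule itself, the same function applies unchanged to training and test data, consistent with the statement that no imputation is needed.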

4 Experiments: Predictive performance comparisons

Data description. Primary Biliary Cirrhosis (PBC) is a slowly progressing disease of the liver that affects approximately of the population, the majority of which are women. Early identification is difficult because many biomarkers are involved in its development with poorly understood interactions Lindor et al. (2009). We considered data from the Mayo Clinic trial used previously in Therneau and Grambsch (2000). A total of patients were analyzed in this study of which experienced liver failure, the end point of interest. On average follow-up visits were recorded during an average time to event of years. Summary statistics are included in the Supplement.

Baseline prediction algorithms. Modelling longitudinal data to inform survival has been considered in two statistical approaches: Joint models Andrinopoulou et al. (2012); Rizopoulos (2011); Hickey et al. (2016) and Landmarking Van Houwelingen and Putter (2008); Van Houwelingen (2007); van Houwelingen and Putter (2011). Joint models (JM) specify separate longitudinal and survival models related by shared random effects, inducing correlation between the two and propagating the uncertainty from future longitudinal estimates into survival predictions. Landmarking, in contrast, uses the last longitudinal measurement carried forward as an estimate for the current longitudinal value and builds a (static) prediction model based on these values. Its implementation consists of two steps: the construction of a cross-sectional data set, and the development of a Cox model based on that landmark data set. We evaluate joint models with the implementation by Hickey et al. (2016). We used a multivariate normal linear mixed model for longitudinal variables and a Cox model to relate these and fixed variables to survival estimates. We implemented two variants of the landmarking approach: one using a traditional Cox survival model Cox (1972) and one using Random Survival Forests (RSF) Ishwaran et al. (2008). We discuss other related work in the Supplement.
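The first step of landmarking, building the cross-sectional data set with the last observation carried forward, can be sketched as follows. The dictionary layout, the function name and the example patients are hypothetical; fitting a Cox model or a Random Survival Forest to the resulting rows is left out.

```python
def landmark_dataset(patients, landmark_time):
    """Build a cross-sectional data set at a landmark time.

    patients: list of dicts with keys
        "visits": non-empty list of (time, values) pairs, values may contain None
        "time":   observed follow-up time (event or censoring)
        "event":  True if the event was observed
    Only patients still at risk at the landmark time are kept, and each variable
    takes its most recent observed value at or before the landmark.
    """
    rows = []
    for patient in patients:
        if patient["time"] <= landmark_time:
            continue  # event or censoring happened before the landmark
        n_variables = len(patient["visits"][0][1])
        carried = [None] * n_variables
        for time, values in sorted(patient["visits"], key=lambda visit: visit[0]):
            if time > landmark_time:
                break  # measurements after the landmark must not be used
            for k, value in enumerate(values):
                if value is not None:
                    carried[k] = value  # last observation carried forward
        rows.append({
            "covariates": carried,
            "residual_time": patient["time"] - landmark_time,
            "event": patient["event"],
        })
    return rows


# Two hypothetical patients; the second leaves follow-up before the landmark.
patients = [
    {"visits": [(0.0, [1.0, None]), (1.0, [None, 5.0]), (3.0, [2.0, 6.0])],
     "time": 4.0, "event": True},
    {"visits": [(0.0, [0.5, 4.0])], "time": 1.5, "event": False},
]
rows = landmark_dataset(patients, landmark_time=2.0)
```

The sketch makes the limitation discussed in the text visible: everything before the most recent observed value is discarded, so the shape of the trajectory plays no role in the prediction.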

Performance evaluation. We use the time-dependent concordance index (-index) Gerds et al. (2013) to assess the discriminative ability of all models,


The C-index corresponds to the probability that predicted time-to-event probabilities at a time of interest $t + \Delta t$ are ranked in accordance with the actual observed survival times, but evaluated only for those patients still alive at the follow-up time $t$ at which predictions are made. It thus serves as a measure of a model’s discriminative power. The C-index is bounded between 0 and 1, with 0.5 indicating the performance of random guesses and 1 indicating perfect ordering of event times. In all experiments we implemented BNDS with trees and samples after a burn-in of samples, our default specification.
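A simplified version of a time-dependent concordance computation is sketched below. It counts, among patients still at risk at the prediction time, the pairs in which the patient with the earlier observed event also received the higher predicted risk. This is an assumption-laden illustration: it omits the inverse-probability-of-censoring weights used by Gerds et al. (2013), and the function name, argument names and example values are hypothetical.

```python
def c_index(t, horizon, times, events, risk):
    """Unweighted time-dependent concordance at prediction time t.

    times:  observed follow-up times
    events: True where the event was observed, False where censored
    risk:   predicted probability of the event in (t, t + horizon]
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        # Patient i must be at risk at t and have the event inside the window.
        if not (events[i] and t < times[i] <= t + horizon):
            continue
        for j in range(n):
            if times[j] > times[i]:  # patient j is known to outlive patient i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs at this prediction time")
    return concordant / comparable


# Hypothetical follow-up times, event flags and predicted risks for four patients.
times = [1.0, 2.0, 3.0, 4.0]
events = [True, True, False, True]
good_risk = [0.9, 0.7, 0.2, 0.4]
c_good = c_index(0.5, 2.0, times, events, good_risk)
c_flat = c_index(0.5, 2.0, times, events, [0.5, 0.5, 0.5, 0.5])
```

In this toy example the well-ordered risks give a value of 1 and the constant risks give 0.5, matching the interpretation of the two reference points of the index.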

Results. We observe in Table 1 that BNDS significantly outperforms all other benchmarks. BNDS is the only method modelling longitudinal data agnostically and using the whole patient history. Both Cox-Landmark and JM require specification of variable interactions which strongly limits the structure they are able to recover from data. This limitation is further illustrated by the performance gain of RSF-Landmark over Cox-Landmark, arising because the former models variable interactions more flexibly. Notice also that the performance gap between JM and BNDS (both using the whole patient history) widens for predictions at later follow ups with respect to cross-sectional models, thereby highlighting the importance of considering a patient’s history as well as current measurements.

C-index on PBC data
Joint Model
BNDS (ours)
Table 1: Median C-index (higher is better) and standard deviation using 5-fold cross-validation.

5 Conclusion

We have introduced a Bayesian nonparametric method that is able to use sparse, longitudinal data to give personalized survival predictions that are updated as new information is recorded. Our modelling approach has the advantage of providing uncertainty estimates and has the flexibility to model highly heterogeneous data without a priori modelling choices.


  • [1] J. H. Albert and S. Chib (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association 88 (422), pp. 669–679. Cited by: §3.
  • [2] E. Andrinopoulou, D. Rizopoulos, R. Jin, A. J. Bogers, E. Lesaffre, and J. J. Takkenberg (2012) An introduction to mixed models and joint modeling: analysis of valve function over time. The Annals of thoracic surgery 93 (6), pp. 1765–1772. Cited by: §4.
  • [3] E. Begoli, T. Bhattacharya, and D. Kusnezov (2019) The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence 1 (1), pp. 20. Cited by: §1.
  • [4] K. Bhaskaran, H. J. Forbes, I. Douglas, D. A. Leon, and L. Smeeth (2013) Representativeness and optimal use of body mass index (bmi) in the uk clinical practice research datalink (cprd). BMJ open 3 (9), pp. e003389. Cited by: §2.
  • [5] H. A. Chipman, E. I. George, R. E. McCulloch, et al. (2010) BART: bayesian additive regression trees. The Annals of Applied Statistics 4 (1), pp. 266–298. Cited by: §3, footnote 1.
  • [6] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun (2016) Doctor ai: predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pp. 301–318. Cited by: §1.
  • [7] D. R. Cox (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society. Series B 34, pp. 187–220. Cited by: §4.
  • [8] L. Ferrer, H. Putter, and C. Proust-Lima (2017) Individual dynamic predictions using landmarking and joint modelling: validation of estimators and robustness assessment. arXiv preprint arXiv:1707.03706. Cited by: §1.
  • [9] T. A. Gerds, M. W. Kattan, M. Schumacher, and C. Yu (2013) Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Statistics in Medicine 32 (13), pp. 2173–2184. Cited by: §4.
  • [10] G. L. Hickey, P. Philipson, A. Jorgensen, and R. Kolamunnage-Dona (2016) Joint modelling of time-to-event and multivariate longitudinal outcomes: recent developments and issues. BMC medical research methodology 16 (1), pp. 117. Cited by: §4.
  • [11] H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer (2008) Random survival forests. The annals of applied statistics, pp. 841–860. Cited by: §4.
  • [12] A. Kapelner and J. Bleich (2013) Bartmachine: machine learning with bayesian additive regression trees. arXiv preprint arXiv:1312.2171. Cited by: §3, §3.
  • [13] S. Licher, A. Heshmatollah, K. D. van der Willik, B. H. C. Stricker, R. Ruiter, E. W. de Roos, L. Lahousse, P. J. Koudstaal, A. Hofman, L. Fani, et al. (2019) Lifetime risk and multimorbidity of non-communicable diseases and disease-free life expectancy in the general population: a population-based cohort study. PLoS medicine 16 (2), pp. e1002741. Cited by: §1.
  • [14] K. D. Lindor, M. E. Gershwin, R. Poupon, M. Kaplan, N. V. Bergasa, and E. J. Heathcote (2009) Primary biliary cirrhosis. Hepatology 50 (1), pp. 291–308. Cited by: §4.
  • [15] Z. C. Lipton (2017) The doctor just won’t accept that!. arXiv preprint arXiv:1711.08037. Cited by: §1.
  • [16] R. J. Little and D. B. Rubin (2014) Statistical analysis with missing data. Vol. 333, John Wiley & Sons. Cited by: footnote 2.
  • [17] D. Rizopoulos (2011) Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data. Biometrics 67 (3), pp. 819–829. Cited by: §4.
  • [18] J. D. Singer and J. B. Willett (1993) It’s about time: using discrete-time survival analysis to study duration and the timing of events. Journal of educational statistics 18 (2), pp. 155–195. Cited by: §3.
  • [19] D. C. Stanziano, M. Whitehurst, P. Graham, and B. A. Roos (2010) A review of selected longitudinal studies on aging: past findings and future directions. Journal of the American Geriatrics Society 58, pp. S292–S297. Cited by: §1.
  • [20] T. M. Therneau and P. Grambsch (2000) Extending the cox model. Edited by P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, pp. 51. Cited by: §4.
  • [21] B. Twala, M. Jones, and D. J. Hand (2008) Good methods for coping with missing data in decision trees. Pattern Recognition Letters 29 (7), pp. 950–956. Cited by: §3.
  • [22] H. C. Van Houwelingen and H. Putter (2008) Dynamic predicting by landmarking as an alternative for multi-state modeling: an application to acute lymphoid leukemia data. Lifetime data analysis 14 (4), pp. 447. Cited by: §4.
  • [23] H. C. Van Houwelingen (2007) Dynamic prediction by landmarking in event history analysis. Scandinavian Journal of Statistics 34 (1), pp. 70–85. Cited by: §4.
  • [24] H. van Houwelingen and H. Putter (2011) Dynamic prediction in clinical survival analysis. CRC Press. Cited by: §4.
  • [25] A. K. Wills, D. A. Lawlor, F. E. Matthews, A. A. Sayer, E. Bakra, Y. Ben-Shlomo, M. Benzeval, E. Brunner, R. Cooper, M. Kivimaki, et al. (2011) Life course trajectories of systolic blood pressure using longitudinal data from eight uk cohorts. PLoS medicine 8 (6), pp. e1000440. Cited by: §1.