Retarded kernels for longitudinal survival analysis and dynamic prediction

10/21/2021 ∙ by Annabel L. Davies, et al. ∙ The University of Manchester

Predicting patient survival probabilities based on observed covariates is an important assessment in clinical practice. These patient-specific covariates are often measured over multiple follow-up appointments. It is then of interest to predict survival based on the history of these longitudinal measurements, and to update predictions as more observations become available. The standard approaches to this so-called ‘dynamic prediction’ task are joint models and landmark analysis. Joint models involve high-dimensional parameterisations, and their computational complexity often prohibits including multiple longitudinal covariates. Landmark analysis is simpler, but discards a proportion of the available data at each ‘landmark time’. In this work we propose a ‘retarded kernel’ approach to dynamic prediction that sits somewhere in between the two standard methods in terms of complexity. By conditioning hazard rates directly on the covariate measurements over the observation time frame, we define a model that takes into account the full history of covariate measurements but is more practical and parsimonious than joint modelling. Time-dependent association kernels describe the impact of covariate changes at earlier times on the patient’s hazard rate at later times. Under the constraints that our model (i) reduces to the standard Cox model for time-independent covariates, and (ii) contains the instantaneous Cox model as a special case, we derive two natural kernel parameterisations. Upon application to three clinical data sets, we find that the predictive accuracy of the retarded kernel approach is comparable to that of the two existing standard methods.







1 Introduction

Survival analysis is a well established field of medical statistics that involves modelling the probability of survival until some specified irreversible event such as death or the onset of disease. Of particular clinical interest is the prediction of patient-specific survival based on a set of observed biomarkers or ‘covariates’.vanH:2007 Such predictions aid clinicians in making treatment and testing decisions, and provide personalised information for patients about their health.Rizopoulos:2017

Cox’s proportional hazards (PH) modelCox:1972 remains the most widely used model in survival analysis.Tian:2005 ; Lee:2019 In this context, survival is assumed to depend on a set of covariates, x = (x_1, …, x_p), measured at some baseline time. The hazard, h(t|x), is the probability per unit time of the event happening at time t given that no event has occurred up to that time. In Cox’s PH model this hazard is defined as

  h(t|x) = h₀(t) exp( ∑_{μ=1}^p β_μ x_μ ),   (1)

where the β_μ (with μ = 1, …, p) are the so-called association parameters. The base hazard rate, h₀(t), is the value of the hazard for covariate values x_μ = 0. The name ‘proportional hazards’ refers to the fact that, due to the exponential form of the hazard function, the effect of each covariate on the hazard is multiplicative. In this work we will call the model in Equation (1) the ‘standard Cox model’.

Survival prediction in the standard Cox model is based on the survival function,

  S(t|x) = exp( − ∫_0^t ds h(s|x) ),   (2)

that describes the probability that an individual with hazard function h(s|x) experiences the event after time t.
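Equations (1) and (2) can be made concrete with a short numerical sketch. The code below is illustrative only (not from the paper): the constant baseline hazard and the parameter values are arbitrary choices, and the survival integral is approximated with the trapezoidal rule.

```python
import math

def cox_hazard(t, x, beta, h0):
    """Hazard h(t|x) = h0(t) * exp(sum_mu beta_mu * x_mu), as in Eq. (1)."""
    return h0(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

def cox_survival(t, x, beta, h0, n_steps=1000):
    """S(t|x) = exp(-∫_0^t h(s|x) ds), as in Eq. (2), via the trapezoidal rule."""
    if t == 0:
        return 1.0
    ds = t / n_steps
    vals = [cox_hazard(i * ds, x, beta, h0) for i in range(n_steps + 1)]
    integral = ds * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return math.exp(-integral)

# Example: constant baseline hazard and two covariates (arbitrary values)
h0 = lambda t: 0.1
beta = [0.5, -0.2]
x = [1.0, 2.0]
print(cox_survival(5.0, x, beta, h0))
```

With all association parameters set to zero, the survival probability reduces to that of the baseline hazard alone, which provides a simple sanity check.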

In reality, covariates are often measured repeatedly over time. This means that multiple observations of time-dependent covariates are made for any particular patient. A simple extension to the standard Cox model involves modelling the hazard rate as dependent on the instantaneous value of the covariates,Kleinbaum:2005 ; Kalbfleisch:2012 ; Mills:2012 that is

  h(t|x(t)) = h₀(t) exp( ∑_{μ=1}^p β_μ x_μ(t) ).   (3)

We refer to this as the ‘instantaneous Cox model’.

However, in practice, one does not have access to the full covariate trajectories x_μ(t). Instead, observations are made at discrete follow-up times until some subject-specific final observation time. Since we do not have access to covariate measurements after this time, we cannot make predictions about future survival probabilities based on Equation (3). Of particular difficulty is the inclusion of so-called ‘endogenous’ covariates.Rizo:2012

Due to these difficulties, survival predictions are commonly evaluated by treating the baseline covariate measurements as fixed values in a standard Cox model.Rizopoulos:2017 By not including the follow-up observations, this standard practice discards a potentially considerable proportion of the available patient data.

Recently, there has been much interest in so-called ‘dynamic prediction’.vanH:2007 ; Rizopoulos:2011 ; vanH:2011 These methods aim to make survival predictions based on the longitudinal history of biomarker data, and to update these predictions as more data become available. Such analysis is clinically valuable as it allows patients and clinicians to review disease progression over time and update the prognosis at each follow-up visit.Zhu:2018 Currently, there are two main approaches to dynamic prediction: joint modelling and landmarking.

Landmarking was an early approach to the problem,Anderson:1983 whereby a standard Cox model is fitted to patients in the original data set who are still at risk at the time point of interest, using their most recent covariate measurements.

More recently, joint modelling has become an established method.Tsiatis:1995 ; Ibrahim:2010 ; Wu:2012 ; Rizo:2012

Here one models the time-dependent covariate trajectory using a parameterised longitudinal model, and this complete trajectory is then inserted into an instantaneous Cox-type survival model. A joint likelihood of the longitudinal and survival sub-models is constructed, and the model parameters are estimated via maximum likelihood or Bayesian inference.

Both methods have limitations. In particular, joint models are demanding both conceptually and computationally. Correctly modelling the longitudinal trajectories can be difficult when patient measurements exhibit varied non-linear behaviour,Zhu:2018 and misspecification of this trajectory has been found to lead to bias.Arisido:2019 Furthermore, the number of model parameters increases rapidly with the inclusion of multiple longitudinal markers. This means that many software packages cannot handle more than one longitudinal covariate,Moreno-Betancur:2018 ; Gould:2015 ; Hickey:2016 and those that can quickly become computationally intensive.Mauff:2020 ; Li:2017 For these reasons, the landmarking model is often seen as the only practical option.Zhu:2018 However, the relative simplicity of the landmarking approach comes with its own drawbacks. By using only the ‘at risk’ data set to make predictions at a certain time (discarding patients who had an event before the landmark time), landmarking makes use of only a subset of the available data. In standard landmarking approaches, the history of the covariate values is not taken into account directly, and a new model must be fitted every time one wishes to update the predictions.

In this work we present a new approach to dynamic prediction that conceptually, and in terms of computational complexity, lies somewhere in between the joint modelling and landmarking methods. Rather than modelling the covariate trajectory at future times, as in the joint modelling approach, we model the probability of survival conditioned directly on the observed covariates measured from the baseline time up to a subject-specific final observation time. Unlike the landmark approach, a single model is fitted to all of the available data, using the full history of the covariate values. We do, however, maintain well-established and desirable features of the Cox model, so that our model contains the instantaneous Cox model as a special case, and reduces automatically to the standard Cox model for covariates that are observed to be fixed over time. Within these constraints, we define time-dependent parametric association kernels, β_μ(s, t, τ), that describe the impact of changes of covariate μ at time s on patient risk at some later time t. The kernel can also depend on the final observation time τ for the patient. Building on ideas from weighted cumulative exposure models,Breslow:1983 ; Thomas:1988 these kernels allow us to assign smaller effects to covariates that were measured further in the past. We refer to our method as the ‘retarded kernel’ approach.

The remainder of this article is set out as follows. In Section 2 we introduce the motivating data sets. In Section 3 we then provide details of the dynamic prediction models. We begin by describing the longitudinal and time-to-event data, and briefly outline the standard methods: joint modelling (Section 3.2) and landmarking (Section 3.3). In Section 3.4, we introduce the retarded kernel approach. We start by defining the hazard rate conditioned on the observed data, and then develop two natural parameterisations for the association kernels that meet our requirements. We outline the maximum likelihood method for parameter estimation for these models, and show how the retarded kernel approach can be used to make dynamic predictions. Via application to the real data sets, in Section 4 we compare the performance of the retarded kernel approach to the standard methods using an established measure of predictive accuracy. Finally, we discuss and summarise our results in Section 5.

2 Motivating data sets

In our work we will assess the predictive capabilities of the different models for dynamic prediction using three clinical data sets that contain both longitudinal covariate measurements and time-to-event data. All three data sets are publicly available in the JMbayes package,Rizo:2016 and were used in Rizopoulos (2012)Rizo:2012 to illustrate the joint modelling method.

2.1 Primary biliary cirrhosis

Figure 1:   The longitudinal profiles of the time-dependent covariates log serum bilirubin, log serum albumin and log prothrombin time for the 312 patients in the PBC data set described in Section 2.1. For clarity, the trajectories of 6 individuals are highlighted. Time, t, on the horizontal axis is measured in years.

The first motivating data set is from a study conducted by the Mayo Clinic from 1974 to 1984 on patients with primary biliary cirrhosis (PBC), a progressive chronic liver disease.Murtaugh:1994 We will refer to this as the PBC data. The study involved 312 patients who were randomly assigned either a placebo (154 patients) or the D-penicillamine treatment (158 patients). Time-to-event data is available for the outcome of interest (death) or the censoring event (either the time at which the patient receives a liver transplant or the final follow-up time at which they were still alive). By the end of follow-up, 140 patients had died, 29 had received a transplant and 143 were still alive. As well as baseline covariate measurements such as age at baseline and gender, multiple longitudinal biomarker measurements were collected for each patient over an average of 6.2 visits from study entry to some subject-specific final observation time (prior to their event time). While the original aim of the study was to investigate the effect of the drug D-penicillamine, no effect was found and the data has since been used to study the progression of the disease based on longitudinal biomarkers.Schoop:2008 With this in mind, we include age at baseline as our only fixed covariate, and focus on the longitudinal covariates log serum bilirubin, log serum albumin and log prothrombin time, which have previously been found to be indicators of patient survival.Schoop:2008 Serum bilirubin and serum albumin indicate concentrations of these substances in the blood, measured in mg/dl and g/dl respectively. Prothrombin time measures the time (in seconds) it takes for blood to clot in a sample. Time series of these three longitudinal biomarkers are plotted in Figure 1.

2.2 Aids

The second data set involves HIV-infected patients who had failed to respond, or were intolerant to, zidovudine (previously called ‘azidothymidine’) therapy (AZT).Abrams:1994 The aim of the study was to compare two antiretroviral drugs, didanosine (ddI) and zalcitabine (ddC). Patients were randomly assigned one of these drugs at baseline. Patients’ CD4 cell counts were recorded at baseline and follow-up measurements were planned at 2, 6, 12 and 18 months. CD4 cells are white blood cells that fight infections. A decrease in the number of CD4 cells over time indicates a worsening of the immune system and higher susceptibility to infection. Therefore, the number of CD4 cells in a blood sample is an important marker of immune strength and hence a covariate of interest in HIV-infected patients. By the end of the study 118 patients had died, and the time to event (death) or censoring was recorded for all patients. Final observation times were always less than their corresponding event times, such that there is a time gap between when a subject was last observed and when they experienced an event. Following Guo and Carlin (2004),Guo:2004 we included, in addition to the longitudinal CD4 counts and the patients’ drug group, three other binary fixed covariates in our analysis: gender, PrevOI (previous opportunistic infection – AIDS diagnosis – at study entry), and Stratum (whether the patient experienced AZT failure or AZT intolerance). We will refer to this data as the AIDS data set. The longitudinal profiles of the CD4 count for all patients are plotted in Figure 2.

Figure 2:   The longitudinal profiles of CD4 count for the patients in the AIDS data set described in Section 2.2. For clarity, the trajectories of 6 individuals are highlighted. Time, t, is measured in months.

2.3 Liver cirrhosis

The third data set is from a trial conducted between 1962 and 1974, involving patients with liver cirrhosis, a general term including all forms of chronic diffuse liver disease.Andersen:1993 We call this the Liver data set. At baseline, 251 patients were randomly assigned a placebo and 237 were assigned treatment with the drug prednisone. Follow-up appointments were scheduled at 3, 6 and 12 months and then yearly thereafter, though actual follow-up times varied considerably. At these follow-up appointments, multiple longitudinal biomarkers were measured. However, only the prothrombin index measurements are available from the JMbayes package. This is a measure of liver function based on a blood test of coagulation factors produced by the liver. For reproducibility, and following previous analyses of the Liver data set,Henderson:2002 ; Rizo:2012 we include the prothrombin index as our only time-dependent biomarker. The drug group is included as a fixed baseline covariate. By the end of the study, 150 prednisone-treated and 142 placebo-treated patients had died. Their time-to-event data was recorded. Of the 488 subjects, 120 were observed until their event time while all others were observed until some subject-specific final observation time before their event time. Figure 3 shows the longitudinal prothrombin measurements for all patients.

Figure 3:   The longitudinal profiles of the prothrombin index as measured in the 488 patients in the Liver data set described in Section 2.3. For clarity, the trajectories of 6 individuals are highlighted. Time, t, is measured in years.

3 Dynamic prediction models

3.1 Setup and notation

In this work, we consider longitudinal survival data of the form D = {(T_i, Δ_i, X_i)}, i = 1, …, N, where T_i = min(T*_i, C_i) is the observed event time of individual i, with T*_i denoting the true event time and C_i denoting the censoring time. The event indicator Δ_i = 𝟙[T*_i ≤ C_i] is equal to 1 when the true event time is observed and 0 when it is censored. Throughout this article we use the indicator function 𝟙[A], defined as 𝟙[A] = 1 if A holds, and 𝟙[A] = 0 otherwise. X_i denotes the set of time-dependent covariate observations of individual i. Individual i has measurements of p longitudinal covariates from time t = 0 up to some subject-specific final observation time τ_i ≤ T_i. These measurements are taken at discrete (subject-specific) observation times t^i_1, …, t^i_{n_i}, where t^i_1 = 0 and t^i_{n_i} = τ_i. We write x_{iμ}(s), μ = 1, …, p, for the ‘true’ (but non-accessible) continuous trajectories of the covariates over the interval [0, τ_i] for individual i. We develop our theory based on the assumption that we have access to these trajectories. As we will see later, we estimate x_{iμ}(s) from the discrete observations X_i.

We are interested in predicting survival probabilities for some new subject j with longitudinal measurements X_j. The quantity we wish to estimate is the probability that this subject survives until some future time u, conditional on their survival to t, and on their covariate observations up to t. That is,

  π_j(u|t) = P( T*_j ≥ u | T*_j > t, X_j ),   (4)

where T*_j denotes the true event time of subject j. The quantity π_j(u|t) is referred to as a ‘dynamic predictor’ due to the fact that it can be updated as more measurements become available at later times.vanH:2011 ; Rizopoulos:2017

3.2 Joint Models

In joint modelling one specifies two model components: a longitudinal model for the trajectory of the time-dependent covariates, and a survival model which relates to the covariate trajectory via shared parameters. In the JMbayes package, joint models are fitted using Bayesian inference by specifying a joint likelihood distribution for the two model components and a set of prior distributions on the model parameters. Details of this package and the joint modelling framework we follow are described in Rizopoulos (2016)Rizo:2016 and Rizopoulos (2012).Rizo:2012 In this section we briefly outline the model.

3.2.1 Longitudinal modelling component.

Mixed-effects models are typically specified for the longitudinal covariate trajectories, where it is assumed that the observed value y_{iμ}(t) of covariate μ at time t deviates from the true (unobserved) value x_{iμ}(t) by an amount ε_{iμ}(t). The error terms ε_{iμ}(t) of all subjects are assumed to be statistically independent, and normally distributed with variance σ_μ²,

  y_{iμ}(t) = x_{iμ}(t) + ε_{iμ}(t),   x_{iμ}(t) = u_{iμ}(t)·β_μ + v_{iμ}(t)·b_{iμ}.   (5)

Between-subject variability is modelled via estimation of subject-specific random effects b_{iμ}, whereas effects that are shared between all subjects are modelled by the fixed effects β_μ. The vectors u_{iμ}(t) and v_{iμ}(t) denote the design vectors for these fixed and random effects respectively. For multivariate models, one can allow for association between the different longitudinal markers via their corresponding random effects. In particular, we assume that the complete vector of random effects b_i = (b_{i1}, …, b_{ip}) follows a multivariate normal distribution with mean zero and a variance-covariance matrix that describes the correlations between and variances of the random effects. For details we refer to References Rizopoulos:2017 ; Rizopoulos:2011 ; Rizo:2012 ; Cekic:2021 ; Mauff:2020 .
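The random-intercept structure underlying such models can be illustrated with a small simulation. The sketch below is not the inference used by JMbayes: the parameter values are invented, and the crude two-stage estimate (per-subject least squares, then the spread of the fitted intercepts) merely illustrates how between-subject variability separates from measurement noise.

```python
import random
random.seed(7)

# Simulate y_i(t) = (beta0 + b_i) + beta1 * t + noise: the simplest
# random-intercept mixed model for a longitudinal marker (values invented).
beta0, beta1, sd_b, sd_eps = 2.0, -0.3, 1.0, 0.2
times = [0.0, 0.5, 1.0, 1.5, 2.0]
subjects = []
for _ in range(200):
    b_i = random.gauss(0.0, sd_b)  # subject-specific random intercept
    traj = [(t, beta0 + b_i + beta1 * t + random.gauss(0.0, sd_eps)) for t in times]
    subjects.append(traj)

def ls_intercept_slope(traj):
    """Per-subject least-squares fit of intercept and slope."""
    n = len(traj)
    sx = sum(t for t, _ in traj); sy = sum(y for _, y in traj)
    sxx = sum(t * t for t, _ in traj); sxy = sum(t * y for t, y in traj)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - slope * sx) / n, slope

# The spread of the fitted intercepts estimates the between-subject
# (random-effect) variability sd_b; their mean estimates beta0.
intercepts = [ls_intercept_slope(tr)[0] for tr in subjects]
mean_int = sum(intercepts) / len(intercepts)
sd_int = (sum((a - mean_int) ** 2 for a in intercepts) / len(intercepts)) ** 0.5
print(round(mean_int, 2), round(sd_int, 2))
```

In practice one would fit such models with dedicated mixed-model software rather than this two-stage shortcut; the point here is only the decomposition into shared fixed effects and subject-specific random effects.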

3.2.2 Survival modelling component.

The hazard at time t is assumed to depend on the value of the longitudinal covariate at time t without measurement error, that is

  h_i(t | M_i(t)) = h₀(t) exp( ∑_{μ=1}^p α_μ x_{iμ}(t) ),   (6)

where M_i(t) = {x_{iμ}(s): 0 ≤ s ≤ t, μ = 1, …, p} denotes the history of the ‘true’ (unobserved) longitudinal covariates up to time t. Note that in Equation (6) the hazard rate depends only on the instantaneous values of the covariates, but this can be generalised as briefly highlighted below. Unlike in the Cox model, the baseline hazard h₀(t) cannot be expressed analytically in terms of the other model parameters during the maximum likelihood procedure, but must instead be specified. Often this is done using a flexible parametric model, for example using penalised spline functions.Rizo:2016 Dependence of the hazard function on time-independent covariates can be included through an additional term in the exponent in Equation (6), with an association parameter for each fixed covariate. Alternative extensions allow the hazard to depend on the slope of the covariate trajectory, or on its cumulative effect, by replacing the term α_μ x_{iμ}(t) with α_μ (d/dt) x_{iμ}(t) or with α_μ ∫_0^t ds x_{iμ}(s), respectively.Rizo:2012 ; Rizopoulos:2017

It has also been proposed to introduce a weight function to capture cumulative effects, writing α_μ ∫_0^t ds ω(t − s) x_{iμ}(s) in the exponent, with weight functions ω defined such that earlier covariate values have a smaller effect on the hazard than recent values.Rizo:2012 ; Mauff:2017 This idea is connected to the concept of ‘weighted cumulative exposure’ (WCE).Breslow:1983 ; Thomas:1988 WCE models were developed in etiological research to describe the complex cumulative effect of time-dependent ‘exposure’, e.g. to a drug, on health outcomes.Abrahamowicz:2012 In a survival context, these models rely on continuous knowledge of the exposure all the way up to the event time, and have hence been used almost exclusively for measuring the effects of external exposures such as treatments or environmental factors.Sylvestre:2009 Joint modelling allows the principles of WCE to be used for any longitudinal covariate. Through the prediction of future covariate trajectories, it is also possible for these ideas to be integrated into dynamic predictions.Mauff:2017 As we will explain later, we build on the principles of WCE to develop our retarded kernel approach.

3.2.3 Dynamic prediction.

Finally, in the joint modelling framework we use, the dynamic predictor π_j(u|t) in Equation (4) is estimated using a Bayesian approach, based on the posterior parameter distribution p(θ|D), where D denotes the training data and θ is the vector of all the model parameters in the joint model. This leads to the estimator

  π̂_j(u|t) = ∫ dθ p(θ|D) P( T*_j ≥ u | T*_j > t, X_j, θ ).   (7)

The parameter average in Equation (7) can generally not be evaluated analytically, and is computed via Monte Carlo methods. Again, we refer to References Rizopoulos:2017 ; Rizo:2012 ; Rizopoulos:2011 for details.

3.2.4 Limitations.

Joint models have undergone much development over recent years, with various extensions making the approach flexible in a range of different scenarios. However, the joint modelling approach requires the ability to correctly specify both the longitudinal and survival model. This can involve modelling assumptions which are not always easy to verify. Indeed, simulations have demonstrated that the joint modelling approach is biased under misspecification of the longitudinal model.Arisido:2019 In addition, as more longitudinal outcomes are included, the dimensionality of the random effects increases, and fitting the joint model becomes computationally intensive. Depending on the longitudinal model specified, it can be difficult to include more than three or four longitudinal covariates.Mauff:2020 ; Zhu:2018 This is amplified when cumulative or weighted cumulative effects are used in the survival model (as numerical integration of the longitudinal model is required). As a result, there are cases where joint models are not a viable option and, instead, one must rely on approaches such as landmarking.Zhu:2018

3.3 Landmarking

3.3.1 Description of the landmarking procedure.

The landmarking approach to dynamic prediction is based on the standard Cox model.Anderson:1983 ; vanH:2007 ; vanH:2011 Upon denoting with R(t_LM) the set of individuals in the original data set who are still at risk at time t_LM, the landmarking model assumes that for a subject in the risk set the distribution of survival times, conditioned on the covariate measurements at that time, follows a standard Cox model.Zhu:2018 In general one does not have covariate measurements for all individuals at time t_LM. Instead, one uses for each individual the last observation x_{iμ}(t_LM) of the covariates before time t_LM, and treats these as fixed covariates in a standard Cox model with t_LM as the baseline time. That is, for the so-called ‘landmark time’ t_LM one defines the hazard rate at times t ≥ t_LM as

  h_i(t) = h₀(t; t_LM) exp( ∑_{μ=1}^p β_μ(t_LM) x_{iμ}(t_LM) ).   (8)

The baseline hazard function h₀(t; t_LM) is unspecified, and is estimated as in standard Cox models, via partial likelihood arguments, or via functional maximisation of the data log-likelihood. Subsequently the association parameters are estimated. This procedure is carried out for each choice of the landmark time t_LM, and leads to the Breslow estimatorBreslow:1972 ĥ₀(t; t_LM) and the association parameters β̂_μ(t_LM). The main difference compared to standard Cox models is the dependence of the association parameters and the base hazard rate on the landmark time t_LM.

To estimate the dynamic predictor π_j(u|t), the landmark time in Equation (8) is set equal to t. Once this model is fitted, survival prediction to time u is performed using the standard Cox survival probability,

  π̂_j(u|t) = exp( − ∫_t^u ds ĥ₀(s; t) exp( ∑_{μ=1}^p β̂_μ(t) x_{jμ}(t) ) ),   (9)

where x_{jμ}(t) denotes the last observation of covariate μ before time t.
3.3.2 Limitations.

Landmarking is computationally and conceptually much simpler than the joint modelling approach. For data sets with multiple longitudinal covariates, disparate non-linear covariate trajectories or categorical time-dependent covariates, landmarking is often the preferred approach.Zhu:2018 However, it also has limitations. For example, the model focuses only on the most recent covariate value observed before the landmark time, and does not account for the earlier history of covariates. Furthermore, data from individuals who experience the event before the landmark time is not used for the parameter estimation at that landmark time. Therefore, the landmark approach uses only a subset of the available data. In addition, a new Cox model has to be specified and fitted for each landmark time. Therefore, in order to update predictions after each time point at which a subject is observed, one must refit the model using a new risk set. The longer a subject is observed, the fewer individuals remain in the risk set and the less data is available to do this.

3.4 Retarded kernel approach

We now introduce our retarded kernel approach to dynamic prediction. It aims to overcome some of the limitations of the standard joint modelling and landmarking methods. Unlike landmarking, the retarded kernel approach incorporates the entire data set, including the full history of covariate values, while at the same time remaining conceptually and computationally simpler than joint models.

3.4.1 General setup.

The starting point for the retarded kernel approach is an expression for the hazard rate that resembles that of weighted cumulative exposure models,Breslow:1983 ; Thomas:1988

  h_i(t) = h₀(t) exp( ∑_{μ=1}^p ∫_0^{min(t, τ_i)} ds β_μ(s, t, τ_i) x_{iμ}(s) ).   (10)

In this expression the x_{iμ}(s) are time-dependent covariates, which we assume to be known from time s = 0 up to time s = τ_i. To keep the notation compact we have left out time-independent covariates, as these can always be included trivially. This model differs from the joint model approach to WCE in how we deal with covariates that are only observed up to some final observation time τ_i before the event time. When t ≤ τ_i (i.e. when t is a point in time prior to the last observation of covariates) the hazard rate in Equation (10) only depends on covariates up to time t. For times t > τ_i covariates up to time τ_i enter into the hazard rate.

The kernel β_μ(s, t, τ) describes (potentially) retarded effects of covariates. More precisely, β_μ(s, t, τ) quantifies the effect of the value of covariate μ at time s on the hazard rate at a later time t, for a patient whose covariates are known up to time τ. The form of Equation (10) ensures causality, since only covariate values at times s ≤ t contribute to the hazard at time t. We set β_μ(s, t, τ) = 0 for s > t. In principle, the precise form of β_μ(s, t, τ) could be chosen from a wide range of functions. We reduce this freedom via the following requirements, which must hold for all μ:

  1. Exponential decay of covariate impact. We assume that the impact of each covariate at time s on the hazard rate at a later time t decays exponentially with the time difference t − s. How fast the effect of the covariate decays is governed by a covariate-specific impact time scale τ_μ.

  2. Equivalence with standard Cox model for stationary covariates. Our second requirement is that expression (10) reduces to the standard Cox model in Equation (1) in the case of a constant covariate, i.e. when x_{iμ}(s) = x_{iμ} for all s. This is achieved when there is a constant β_μ, which is independent of t and τ, such that

  ∫_0^{min(t, τ)} ds β_μ(s, t, τ) = β_μ.   (11)

  3. Equivalence with instantaneous Cox model for short impact time scales. Finally, for t ≤ τ we require that expression (10) reduces to the instantaneous Cox model in Equation (3) in the limit τ_μ → 0, i.e. when the covariate impact on risk decays immediately. This is achieved, without violating (ii), if we have

  lim_{τ_μ → 0} β_μ(s, t, τ) = β_μ δ(t − s).   (12)

From (i) it follows that our kernel must have the following form:


where the two quantities appearing in this form can depend on t and τ in general. Requirements (ii) and (iii) then translate into, respectively,


3.4.2 Two models within this family.

Finally, from the remaining family of models, i.e. those that satisfy Equations (14) and (15), we choose the two simplest members. These are defined by demanding that one of the two quantities in Equation (13) vanishes for any t (model A), or that the other vanishes for any t (model B). Working out the details for these choices via Equations (14, 15) then leads to the following formulae:

Model A: (16)
Model B: (17)

Both models are built around the time-translation invariant factor e^{−(t−s)/τ_μ} and satisfy conditions (i), (ii), and (iii). So both reproduce the standard Cox model for time-independent covariates, as well as the instantaneous Cox model for longitudinal covariates with vanishing impact time scales, but they achieve this in distinct ways. We could have ensured a time-translation invariant kernel by choosing in Equation (13) expressions that are independent of t and τ. However, our models would then not reduce to the standard Cox model when covariates are constant. For t > τ we find that the model A kernel is independent of t. This describes an anomalous response: the system ‘remembers’ early changes in covariates without decay. This could describe e.g. irreversible damage to the organism. In contrast, the model B kernel retains a decaying dependence on t when t > τ. This could describe, for example, fluctuations in hormone levels that impact the hazard mostly in the short term, but also with persistent long-term effects.

Equations (16, 17) only hold for τ > 0. In the data sets we study below there are some individuals whose longitudinal covariates are observed only once at the baseline time (i.e. their final observation time is τ_i = 0). Given that Equations (16) and (17) cannot be used for such individuals, we must specify their association parameters in some other way. Two possible options are a constant association, β_μ, or an association that decays over time with time scale τ_μ. Throughout the main paper we choose the former in the retarded kernel models. Results for the decaying association are presented in Section S5 of the Supplementary Material.

We note that we condition on knowledge of the covariates observed over a specific time interval in the model setup. As a consequence, the parameters in the retarded kernel models cannot necessarily be interpreted directly in terms of biophysical mechanisms. For example, the impact time scale τ_μ encapsulates both the possible decay in the physical effect of covariate μ, and the decay that occurs from conditioning on knowledge of the covariate in the past. Parameter interpretations for the model therefore only make sense in a prediction context.
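Requirement (ii) can be checked numerically. The sketch below uses an exponentially decaying kernel normalised over [0, t], which is an illustrative choice in the spirit of the kernels above for t ≤ τ, not the paper's exact Equations (16, 17). For a constant covariate, the exponent in Equation (10) then reduces to β_μ x_μ, recovering the standard Cox form.

```python
import math

def kernel(s, t, tau_mu, beta_mu):
    """Illustrative exponentially decaying kernel for t <= tau, normalised
    over [0, t] so that requirement (ii) holds: for a constant covariate the
    integral of the kernel equals beta_mu. Not the paper's exact Eqs. (16, 17)."""
    norm = tau_mu * (1.0 - math.exp(-t / tau_mu))
    return beta_mu * math.exp(-(t - s) / tau_mu) / norm

def hazard_exponent(t, x_of_s, tau, tau_mu, beta_mu, n=2000):
    """Trapezoidal approximation of the exponent in Eq. (10):
    ∫_0^min(t, tau) ds beta_mu(s, t, tau) x_mu(s)."""
    upper = min(t, tau)
    ds = upper / n
    vals = [kernel(i * ds, t, tau_mu, beta_mu) * x_of_s(i * ds) for i in range(n + 1)]
    return ds * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# For a constant covariate x(s) = 3.0 the exponent reduces to
# beta_mu * x = 0.4 * 3.0 = 1.2, as required by condition (ii).
print(hazard_exponent(2.0, lambda s: 3.0, tau=5.0, tau_mu=1.0, beta_mu=0.4))
```

Replacing the constant trajectory with a time-varying one shows the retardation at work: values of x near s = t dominate the exponent, while earlier values are suppressed by the factor e^{−(t−s)/τ_μ}.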

3.4.3 Maximum likelihood inference.

As in the standard Cox model, we use maximum likelihood inference to determine the most plausible values of the model parameters based on the observed data. For simplicity, in this section we will mostly omit the superscript RK from the hazard function. We write θ for the full set of parameters, i.e. the model parameters described in Section 3.4.2 and the base hazard rate h₀(t), and assume initially that for each sample i the covariates are known over the full time interval [0, τ_i]. The optimal parameters are those for which the data likelihood is maximised. For non-censored data this likelihood is given by

  p(T_1, …, T_N | θ) = ∏_{i=1}^N p(T_i | θ),   (18)

where p(T_i | θ) is the probability density for individual i experiencing an event at time T_i given their covariate measurements. This probability density is expressed in terms of the parameterised hazard rate h_i(t|θ) and the survival probability S_i(t|θ) = exp( − ∫_0^t ds h_i(s|θ) ) via

  p(T_i | θ) = h_i(T_i|θ) S_i(T_i|θ).   (19)

For right-censored data there are two contributions to the likelihood. Individuals for whom an event is observed at time T_i contribute a density p(T_i|θ). Those that are censored at time T_i contribute the survival probability S_i(T_i|θ). Using the primary event indicator Δ_i, the likelihood for censored data is then

  p(T_1, …, T_N | θ) = ∏_{i=1}^N [h_i(T_i|θ)]^{Δ_i} S_i(T_i|θ).   (20)

Upon defining ℓ(θ) = − log p(T_1, …, T_N | θ), we can write the maximum likelihood parameter estimators as θ̂ = argmin_θ ℓ(θ).

A full derivation of the maximum likelihood equations for models of the form in Equation (10) is provided in Section S1.1 of the Supplementary Material. Here we present only the results. The maximum likelihood estimator of the base hazard rate is the direct analogue of the Breslow estimator:Breslow:1972


recalling from Section 3.1 that the indicator function equals one if its argument holds, and zero otherwise. The remaining parameters in Equations (16) and (17) are found by minimisation of


where we have disregarded terms that are independent of the remaining parameters. As in all Cox-type models, the final minimisation of Equation (22) with respect to these parameters (here, the associations and time-scales) must be performed numerically, for example using Powell’s method.Powell:1964
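To illustrate the kind of numerical optimisation involved, the following sketch fits a deliberately simple constant-hazard (exponential) survival model to toy right-censored data by minimising the negative log-likelihood with Powell’s method, here via scipy. The data values and the constant-hazard model are illustrative stand-ins, not the retarded kernel objective of Equation (22).

```python
import numpy as np
from scipy.optimize import minimize

# Toy right-censored data: event/censoring times and event indicators
# (1 = event observed, 0 = censored). Values are illustrative only.
times = np.array([1.2, 0.4, 2.5, 0.8, 1.9, 0.3, 1.1, 2.2])
delta = np.array([1,   1,   0,   1,   0,   1,   1,   0  ])

def neg_log_likelihood(log_rate):
    """Negative log-likelihood for a constant-hazard model.

    Each subject contributes delta_i * log h(t_i) minus the integral of the
    hazard over [0, t_i]; for a constant hazard that integral is rate * t_i.
    """
    rate = np.exp(log_rate[0])  # optimise on the log scale to keep rate > 0
    return -(np.sum(delta) * np.log(rate) - rate * np.sum(times))

# Powell's method, as referenced in the text, requires no gradients.
result = minimize(neg_log_likelihood, x0=[0.0], method="Powell")
rate_hat = np.exp(result.x[0])

# For this toy model the analytic MLE is events / total time at risk.
print(rate_hat, delta.sum() / times.sum())
```

For the retarded kernel models the objective in Equation (22) would replace `neg_log_likelihood`, with the association and time-scale parameters in place of the single rate.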

3.4.4 Dynamic Prediction.

Using the maximum likelihood estimates for the model parameters, we can use the retarded kernel models to estimate the quantity in Equation (4), representing the probability that a subject has not experienced an event by the prediction time, conditional on their survival to the base time and on their covariate values up to that time. That is,

\[
S(u \mid t) = \exp\!\left(-\int_t^u h^{\mathrm{RK}}(s)\,\mathrm{d}s\right),
\]
with the hazard rate as defined by Equation (10), with kernels of the form in Equations (16, 17), and with the ML estimators inserted for the parameters in those kernels. Using the ML estimator of the base hazard rate in Equation (21), we can perform the integration in Equation (23) to find


where the association kernels are evaluated at the ML estimators of their parameters. In the numerator we have used the fact that the prefactor in Equation (21) restricts the sum to the relevant event times.
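The core of the dynamic prediction step is the identity S(u | t) = exp(−[H(u) − H(t)]), where H is the cumulative hazard. A minimal sketch, with a generic hazard function standing in for the fitted retarded kernel hazard:

```python
import numpy as np

def conditional_survival(hazard, t, u, n_grid=1000):
    """Probability of surviving to u given survival to t (u >= t).

    Computes S(u | t) = exp(-(H(u) - H(t))) by trapezoidal integration of
    the hazard function over [t, u].
    """
    s = np.linspace(t, u, n_grid)
    h = hazard(s)
    cumulative = np.sum(0.5 * (h[1:] + h[:-1]) * np.diff(s))
    return np.exp(-cumulative)

# Constant hazard 0.5: conditional survival over a window of length 2
# should equal exp(-0.5 * 2) = exp(-1).
p = conditional_survival(lambda s: 0.5 * np.ones_like(s), t=2.0, u=4.0)
print(p, np.exp(-1.0))
```

In the retarded kernel setting, `hazard` would be the fitted hazard of Equation (10), so that updating a prediction as new covariate measurements arrive only changes the integrand.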

3.4.5 Covariate interpolation.

So far, we have defined the retarded kernel models conditional on covariate trajectories over the entire observation interval. In reality, we do not have full knowledge of these trajectories. Instead, for each subject we have a finite number of discrete measurements that coincide with follow-up appointments. In order to perform the integrals in Equations (22) and (24) we must interpolate between these discrete observed values.

We choose a simple ‘nearest neighbour’ approach: the value of a covariate at any time is set equal to its value at the observation time closest to that time. The approximate covariate trajectory is then a step function that changes value halfway between each pair of consecutive observation times. Figure 4 illustrates this idea. Using this method, the integrals in Equations (22) and (24) can be evaluated analytically (see Section S1.3 of the Supplementary Material). This reduces the computational effort required to perform the minimisation and the dynamic prediction. Other, smoother interpolation procedures such as Gaussian convolutionsPress:1992 ; VanDenBoomgaard:2001 are also possible and may improve estimates (at some computational cost). While interpolation makes assumptions about the values of a covariate within the observation interval, we do not make assumptions about the covariates after the final observation time.
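The nearest-neighbour rule can be sketched in a few lines; the observation times and values below are hypothetical:

```python
import numpy as np

def nearest_neighbour(obs_times, obs_values, t):
    """Step-function covariate trajectory: return the covariate value at
    the observation time closest to t."""
    obs_times = np.asarray(obs_times)
    idx = np.argmin(np.abs(obs_times - t))
    return obs_values[idx]

obs_times = [0.0, 2.0, 5.0]
obs_values = [1.0, 3.0, 2.0]
# The step changes halfway between consecutive observations (at 1.0, 3.5).
print(nearest_neighbour(obs_times, obs_values, 0.9))   # -> 1.0
print(nearest_neighbour(obs_times, obs_values, 1.1))   # -> 3.0
print(nearest_neighbour(obs_times, obs_values, 3.6))   # -> 2.0
```

Because the interpolant is piecewise constant, integrals of the kernel against the trajectory split into sums over the intervals between the halfway points, which is what permits the analytic evaluation mentioned above.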

Figure 4: An illustration of the interpolation method for covariates. For each subject there is a discrete number of covariate observations. The observation times are labelled on the horizontal axis, and the covariate measurement at each observation time is indicated by a cross. The solid line shows the interpolated covariate trajectory based on these discrete observations. The value of a covariate at any given time is taken to be equal to the observed value at the nearest observation time. This yields a step function that changes value halfway between each pair of consecutive observations.

4 Application to clinical data

4.1 Methods

4.1.1 Training and test data.

For each of the data sets we assess the predictive accuracy of the different dynamic prediction models by splitting the data in half, into a training data set and a test data set. Each model is fitted to the training data, and the resulting fitted model is used to make survival predictions about individuals in the test data set. Predictive accuracy is assessed by comparing these predictions to the true event times in the test data (see Section 4.1.2). The procedure is repeated for 20 random splits of the data, and the corresponding prediction error is averaged over these repetitions.
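This evaluation scaffold can be sketched as follows; the model-fitting and error-evaluation steps are placeholders for whichever method is being assessed:

```python
import numpy as np

def half_splits(n_subjects, n_repeats=20, seed=0):
    """Yield (train, test) index arrays for repeated random half-splits."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_subjects)
        half = n_subjects // 2
        yield perm[:half], perm[half:]

errors = []
for train_idx, test_idx in half_splits(n_subjects=100):
    # In practice: fit the model on train_idx, compute the prediction
    # error on test_idx. A dummy value keeps this sketch self-contained.
    errors.append(len(test_idx) / 100.0)
mean_error = np.mean(errors)
print(mean_error)
```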

4.1.2 Measuring predictive accuracy.

Following Rizopoulos et al. (2017),Rizopoulos:2017 we quantify the predictive accuracy of the different models using the expected error of predicting future events. Dynamic prediction is concerned with predicting the survival of individuals to a given prediction time, based on their survival to some earlier base time and on covariate measurements for the individual up to this time. The expected prediction error for a given ‘prediction time’ and ‘base time’ is then defined asSchoop:2011

\[
\mathrm{PE}(u \mid t) = \mathbb{E}\!\left[\, L\big( N_i(u) - \pi_i(u \mid t) \big) \,\right],
\]
where N_i(u) denotes the true survival status of subject i at the prediction time u, and π_i(u | t) is the model’s predicted survival probability for subject i based on information about this subject (covariate measurements and survival status) up to the base time t. The expectation stands for an average over the distribution of covariates and event times.

The loss function L(·) defines how we measure the difference between survival status and predicted survival probability. Common choices are the absolute loss and the squared loss;Rizopoulos:2017 ; Henderson:2002 ; Schoop:2011 we choose the latter. The definition of the prediction error is such that it equals zero if the survival status of all individuals is predicted with full accuracy (i.e. a predicted survival probability of one for all subjects who are alive at the prediction time, and of zero for subjects who are dead by that time). If the reverse is true (a predicted probability of zero for subjects who are alive, and of one for subjects who are dead) then the prediction error takes its maximal value of one. A prediction error of one quarter is obtained if every individual has a predicted survival probability of one half.
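The three limiting cases above are easy to verify for the squared loss:

```python
import numpy as np

def squared_loss_error(status, predicted):
    """Mean squared loss between survival status (1 = alive, 0 = dead)
    and predicted survival probability."""
    status = np.asarray(status, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((status - predicted) ** 2)

status = np.array([1, 1, 0, 0])
print(squared_loss_error(status, [1.0, 1.0, 0.0, 0.0]))  # perfect: 0.0
print(squared_loss_error(status, [0.0, 0.0, 1.0, 1.0]))  # reversed: 1.0
print(squared_loss_error(status, [0.5, 0.5, 0.5, 0.5]))  # uninformative: 0.25
```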

Again following Rizopoulos et al. (2017),Rizopoulos:2017 in this paper we use the overall prediction error proposed by Henderson et al. (2002),Henderson:2002 that in addition takes into account censoring,


The sum extends over the subjects in the test data set who are still at risk at the base time. The first term of Equation (26) corresponds to individuals in the test data who are still alive after the prediction time. These have survival status one, and therefore contribute a loss based on the difference between their estimated survival probability and one. The second term refers to individuals who have experienced an event by the prediction time. Their survival status is zero, and they therefore contribute a loss based on the difference between their estimated survival probability and zero. The final term represents individuals who were censored before the prediction time, so that we do not know their survival status at that time. Here the estimated probability of survival based on information up to the base time is compared with the probability of survival given that the subject survived up until their censoring time.
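A sketch of this three-term structure is given below. Since Equation (26) is not reproduced in this excerpt, the weighting of the censored term here follows the verbal description only (averaging the alive/dead losses with the conditional survival probability as the weight); it is an assumption, not a transcription of the equation.

```python
import numpy as np

def overall_prediction_error(pred_alive, pred_dead, pred_censored):
    """Sketch of an overall prediction error with squared loss.

    pred_alive:    predictions pi for subjects known to be alive at the
                   prediction time (loss (1 - pi)^2).
    pred_dead:     predictions for subjects who had an event (loss pi^2).
    pred_censored: pairs (pi, pi_c), where pi_c is the predicted
                   probability of surviving to the prediction time given
                   survival to the censoring time; the unknown status is
                   averaged over with weight pi_c.
    """
    pred_alive = np.asarray(pred_alive, dtype=float)
    pred_dead = np.asarray(pred_dead, dtype=float)
    n = len(pred_alive) + len(pred_dead) + len(pred_censored)
    total = np.sum((1.0 - pred_alive) ** 2) + np.sum(pred_dead ** 2)
    for pi, pi_c in pred_censored:
        total += pi_c * (1.0 - pi) ** 2 + (1.0 - pi_c) * pi ** 2
    return total / n

# Perfect predictions in all three groups give zero error.
print(overall_prediction_error([1.0], [0.0], [(1.0, 1.0)]))  # -> 0.0
```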

To compare the predictive accuracy of joint modelling, landmarking and the retarded kernel approach, we insert into Equation (26) the respective estimators of the survival probability. This requires that we calculate the probability of a subject’s survival to the prediction time, based on survival and covariate observations up to a general base time that need not be the individual’s final observation time. For the joint model and landmarking estimators we replace the final observation time with the base time in Equations (7) and (9). For the retarded kernel estimator in Equation (24) we likewise replace the final observation time with the base time, since we know the subject is alive until then. However, we only have covariate observations up to the latest observation time at or before the base time. In line with our chosen interpolation procedure, we only integrate the covariate trajectory up to this time. Specifically, for any general base time we have


where one index labels the individual (in the test data) for whom we are making predictions, while the remaining sums run over individuals in the training data set used for inference. The upper integral limit is the last observation time of the individual before (or at) the base time.

The final term in Equation (26) involves the probability of survival to the prediction time given that the subject survived to their censoring time. To calculate this using the retarded kernel model, we replace the base time with the censoring time in Equation (27). For joint modelling, it is obtained by replacing the base time with the censoring time and adjusting the survival condition accordingly in Equation (7). Since this term is only calculated for censored individuals, the condition means that the true event time of the individual is greater than their censoring time. Finally, for landmarking we use the analogous estimator, which is equivalent to replacing the base time with the censoring time in Equation (9), except in the integral limits.

To perform the prediction error calculation for the retarded kernel models we use our own C++ code following Equations (26) and (27). For the joint model and the landmarking model we use a version of the function prederrJM in the JMbayes package, subject to minor modifications (see Section S4 of the Supplementary Material for details).

4.1.3 Fixed base time.

First we compare the predictive accuracy of the three methods by specifying a fixed base time and varying the prediction time. Based on Figures 1 and 3, for the PBC and Liver data sets we choose a fixed base time of 3 years. This value is chosen so that a large number of individuals are still alive after this time (and we can hence make predictions about them), but also so that these individuals have had their covariates measured multiple times before this time. We then vary the prediction time from the base time of 3 years, in steps of 0.2 years, up to 8 years for the PBC data and up to a later time for the Liver data. For the AIDS data set we choose 6 months as the base time, so that most individuals have been observed three times. We then vary the prediction time from this base time in regular increments up to 18 months.

4.1.4 Fixed prediction window.

In our second test, we vary the base time while keeping the prediction window (i.e., the time difference between prediction time and base time) fixed. Since we are varying the base time, we must fit a new landmark model for each choice of base time (the landmark time being equal to the base time). For the retarded kernel approach and the joint model, on the other hand, we need only fit the model once, and can make the error assessments at each iteration using this single fitted model.

Based on previous analysis of the PBC and Liver dataRizo:2012 ; Henderson:2002 we choose three prediction windows: 1 year, 2 years, and 3 years. Given the event time distributions, we do not make predictions for either data set beyond 10 years. Therefore, for the 1-year window we vary the base time from 0 to 9 years, for the 2-year window from 0 to 8 years, and for the 3-year window from 0 to 7 years. In all cases we increase the base time in increments of 0.2 years.

Based on the event time distribution of the AIDS data, we choose three prediction windows (in months). Here covariates are observed at 0, 2, 6, 12, and 18 months only. As a result, predictions are only updated at these time steps, and we can only make a small number of distinct measurements of predictive accuracy. Due to the event times in the AIDS data set, we do not make predictions past 18 months; the longer the window, the fewer base times are therefore available.

4.2 PBC data set

We fit each model to the PBC training data set using three time-dependent covariates and a single fixed covariate: log serum bilirubin, log serum albumin, and log prothrombin time, with the subject’s age at baseline as the fixed covariate.

The PBC data set contains event-time information for two events, death and liver transplant. The most appropriate way of analysing this data is to use a competing risks model. However, for simplicity we here treat the transplant event as a censoring event. Another simple way to analyse this data is to treat the two events as a single composite event. We provide the results of the latter analysis in Section S3 of the Supplementary Material. The two analyses are found to give similar results.

4.2.1 Models.

For the joint model we first fit a simple multivariate linear mixed model to each of the three time-dependent covariates,


where the random effects are assumed to follow a joint multivariate normal distribution with mean zero and a common variance-covariance matrix.
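As an illustration of this longitudinal sub-model, the sketch below simulates one subject’s trajectory under a linear mixed model with multivariate normal random intercept and slope. All numerical values (fixed effects, covariance matrix, noise level) are illustrative assumptions, not estimates from the PBC data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative fixed effects (intercept, slope) and random-effects
# covariance matrix; not fitted values.
beta = np.array([2.0, -0.1])
D = np.array([[0.5, 0.1],
              [0.1, 0.05]])
sigma_eps = 0.2  # residual (measurement-error) standard deviation

def simulate_trajectory(t):
    """One subject's longitudinal covariate under a linear mixed model:
    y(t) = (beta0 + b0) + (beta1 + b1) * t + noise, (b0, b1) ~ N(0, D)."""
    b = rng.multivariate_normal(np.zeros(2), D)
    noise = rng.normal(0.0, sigma_eps, size=len(t))
    return (beta[0] + b[0]) + (beta[1] + b[1]) * t + noise

t = np.linspace(0, 10, 6)
y = simulate_trajectory(t)
print(y.shape)
```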

Figure 1 suggests that the covariate trajectories in the PBC data may be non-linear for some individuals. Hence, for extra flexibility we also fit a second joint model that includes natural cubic splines in both the fixed and random effects parts of the model. Following Rizopoulos (2016),Rizo:2016 the log serum bilirubin is modelled using natural cubic B-splines with 2 degrees of freedom,


where the fixed and random effects involve the B-spline basis matrix for a natural cubic spline of time.Rizo:2012 ; Perperoglou:2019 We write analogous equations for both the log albumin and the log prothrombin covariates. Again, the random effects of all three longitudinal covariates are assumed to follow a joint multivariate normal distribution.

For both the linear and spline longitudinal models, the hazard function of the survival sub-model in the joint modelling framework is


where we recall from Section 3.2 that the hazard is conditioned on the history of the ‘true’ (unobserved) longitudinal covariates of the subject up to the given time. For the landmark model the hazard is instead specified for a given landmark time,


where each time-dependent covariate again enters through its last observed value for the patient before the landmark time.

For the retarded kernel approach, we specify the hazard function as


The parameterisations of the time-dependent association parameters are given in Equations (16) and (17) for models A and B, respectively.
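Equations (16) and (17) are not reproduced in this excerpt. Purely as an illustration of how a time-dependent association kernel enters the hazard, the sketch below assumes a hypothetical exponential-decay kernel (beta / tau) * exp(-(t - s) / tau) and integrates it against a covariate trajectory; this form is an assumption, not the paper’s models A or B.

```python
import numpy as np

def kernel(t, s, beta=0.8, tau=2.0):
    """Hypothetical exponential-decay association kernel (NOT Eq. 16/17):
    covariate values at times s close to t receive the largest weight,
    and the 1/tau factor normalises the kernel so that a constant
    covariate yields a bounded, Cox-like contribution."""
    return (beta / tau) * np.exp(-(t - s) / tau)

def log_hazard_contribution(t, covariate, n_grid=2000):
    """Trapezoidal approximation to the integral of
    kernel(t, s) * z(s) over s in [0, t]."""
    s = np.linspace(0.0, t, n_grid)
    f = kernel(t, s) * covariate(s)
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s))

# For a constant covariate z = 1 the contribution tends to beta = 0.8 as
# t grows, i.e. the retarded model approaches a time-independent Cox-like
# contribution, consistent with the constraint described in the abstract.
print(log_hazard_contribution(50.0, lambda s: np.ones_like(s)))
```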

4.2.2 Results.

Figure 5 shows plots of the overall prediction error against the prediction time for a fixed base time of 3 years, averaged over the 20 random splits of the data. Results for the linear joint model, spline joint model, landmarking model and models A and B of the retarded kernel approach are plotted on the same graph. All five models make similarly accurate predictions at early prediction times. For later prediction times, the standard approaches have a lower average prediction error than the retarded kernel models, with the largest disparity observed at the latest prediction times between the spline joint model and the two retarded kernel models (which perform almost identically).

Figure 5:  Overall prediction error as a function of prediction time (in years) for the PBC data with a fixed base time of 3 years. Prediction error is calculated for prediction times from 3 to 8 years, in 0.2 year increments. A squared loss function was used in Equation (26). The prediction error plotted at each time is an average over values calculated for 20 random splits of the data into training and test data sets. The results from models A and B of the retarded kernel approach are plotted alongside the landmarking model and two joint models (one that uses a linear longitudinal model for the time-dependent covariates, and another that uses cubic splines).
Figure 6:   Overall prediction error versus base time (in years) for the PBC data, with prediction windows of 1 year, 2 years and 3 years. The prediction time is the base time plus the window. The prediction error is calculated for base times ranging from 0 to 9, 8 or 7 years for the three windows respectively, in 0.2 year increments. A squared loss function was used in Equation (26). The prediction error plotted at each time is an average over values calculated for 20 random splits of the data into training and test data sets. Results from models A and B of the retarded kernel approach are plotted alongside the landmarking model and two joint models; one that uses a linear longitudinal model for the time-dependent covariates, and another that uses cubic splines.

Plots of the average prediction error against the base time are shown in Figure 6, for fixed prediction windows of 1 year, 2 years, and 3 years. Again, results for the five different models are plotted in the same graphs. For the shortest prediction window of 1 year, all models are similarly accurate for early base times, after which the landmarking model performs slightly worse than the others. For the larger prediction windows, models A and B of the retarded kernel approach show slightly larger prediction errors than the other models over a range of intermediate base times. At larger base times the joint models and retarded kernel models again exhibit similar prediction errors, while landmarking has the largest error. The largest difference in performance occurs for the 3-year prediction window, between the spline joint model and the landmarking model.

The results of the above tests suggest that, for the PBC data, the retarded kernel approach performs as well as existing methods for short prediction windows, but less well for larger windows. However, as the base time increases, the retarded kernel models behave similarly to the joint models, while the landmarking model exhibits the highest prediction error. Care should be taken when interpreting these results, as we have not used a competing risks model in our analysis. This data set does, however, serve as an illustration that, with only a modest drop in accuracy, the retarded kernel model can serve as a simpler alternative to joint modelling when considering multiple longitudinal covariates. Unlike the landmarking approach, the retarded kernel model takes into account the full history of covariate observations, which, along with the fact that landmarking discards more data as the base time increases, may explain why the landmarking model performs worst for later base times.

4.3 AIDS data set

In the AIDS data set we focus on a single longitudinal covariate, the CD4 count. We also include four fixed binary covariates: drug group (ddI versus ddC), gender (male versus female), PrevOI  (