Introduction
Joint models are a powerful approach to analyzing data where event times are measured alongside a longitudinal outcome and were first suggested by Wulfsohn and Tsiatis [29]. When dealing with such a common data structure, one naive approach would be separate modeling, i.e. fitting a suitable longitudinal model and, in addition, a Cox regression for the event times. The disadvantage of this approach is that separate modeling neither corrects for event-dependent dropout in the longitudinal analysis nor quantifies the relation between a time-dependent covariate and the risk of an event in the survival analysis, as shown by simulations in Guo and Carlin [7]. To overcome these issues, various modeling approaches have been proposed, e.g. the extended Cox model including time-dependent covariates or two-stage approaches, where longitudinal models are fit in order to carry out the survival analysis. These approaches, however, can produce biased results, as Rizopoulos [20] and Sweeting and Thompson [22] showed for the extended Cox model and the two-stage approach, respectively. The solution is to combine both the survival and the longitudinal submodel within one single joint likelihood. An introduction to this joint modeling framework can be found in [20], including the JM package discussed in Rizopoulos [19]. Furthermore, Tsiatis and Davidian [24] give a summary of the evolution of joint models up to 2004.
One main drawback of current estimation approaches for joint models is that they are inappropriate for high-dimensional data, especially when the number of covariates exceeds the number of observations, and lack clear strategies for variable selection. Waldmann et al. [27] therefore proposed an estimation procedure based on gradient boosting techniques. Gradient boosting evolved from machine learning as an approach to classification problems, initially proposed by Freund and Schapire [6], and can be seen as conventional gradient descent techniques transferred into function space. This connection between boosting and functional gradient descent was first discovered by Breiman [2, 3]. For a general summary of the statistical perspective on boosting see Bühlmann and Hothorn [4]; for theoretical convergence results using specific loss functions see Bühlmann and Yu [5]. Based on the work by Waldmann et al. [27] it is for the first time possible to estimate and select high-dimensional joint models with respect to minimizing the prediction error. However, their initial joint model boosting approach did not include baseline covariates exclusive to the survival submodel. The aim of this work is to extend the discussed algorithm by incorporating another boosting step focusing solely on the newly introduced survival predictor, as well as to update and improve the simulation method by a different approach to generating event times.
The remainder of this paper is structured as follows: Section 1 gives a detailed description of the considered joint model, while boosting in general and the extended boosting algorithm for joint models are discussed in Section 2. Sections 3 and 4 deal with applying the algorithm to different setups of simulated data as well as to the AIDS dataset included in the JM package (see Abrams et al. [1]). Finally, the results and possible extensions are discussed.
1. The Joint Model
In the following we will give a detailed description of the two submodels in order to formulate the joint likelihood function.
Suppose we have individuals with longitudinal measurements per individual . The longitudinal part has the form
where denotes the th longitudinal outcome of the th individual. The longitudinal predictor includes linear functions of possibly time-varying covariates and an intercept, thus with . The shared predictor contains, in addition to linear functions of time-constant covariates, individual-specific random effects and a fixed linear time effect. Hence we have with and . The error terms
are assumed to follow a normal distribution with
and . The survival model is given by the hazard function
with denoting linear functions of covariates fixed in time and a baseline hazard chosen to be constant. Furthermore, the scaled shared predictor introduced in the longitudinal model reappears in the survival part. Thus , also called the association parameter, quantifies the relation between the two submodels.
Now we can formulate the log-likelihood. Let denote the event time of individual with censoring indicator
and introduce the vectors
and as the collection of all longitudinal and survival measurements. Set the parameter vector to be , so we can calculate the log-likelihood (1.1) 
since the error terms are normally distributed. Observe that can be seen as a mapping . We are interested in the maximum likelihood estimator
(1.2) 
for given longitudinal data and event times . This maximization is achieved via componentwise gradient boosting methods discussed in the following section.
2. Methods
In this section we give a brief overview of the concept of gradient boosting in general and then focus on developing an approach for the class of joint models with regard to the optimization problem posed in formula (1.2).
2.1. Componentwise Gradient Boosting
Though having its roots in machine learning as an approach to classification problems [6], from a mathematical point of view gradient boosting can be seen as a conventional steepest descent algorithm transferred into function space [2, 3, 14]. If we want to minimize a function , one basic iterative method is to compute the negative gradient  evaluated at the current position. We then walk in the direction indicated by  for a specific step length to obtain a new value and eventually converge to a local minimum of the cost function . So we have
(2.1) 
where the optimal is usually chosen via line search methods, see e.g. [28].
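As an illustration, the update rule (2.1) can be sketched in a few lines of Python. This is a minimal sketch of our own: the backtracking (Armijo-type) line search below is a simple stand-in for the Wolfe-type conditions cited above, and all names are illustrative, not taken from the paper.

```python
def gradient_descent(f, grad, x0, n_iter=100, c=1e-4):
    """Steepest descent for a scalar function f: walk along the
    negative gradient; the step length kappa is found by simple
    backtracking until a sufficient-decrease condition holds."""
    x = float(x0)
    for _ in range(n_iter):
        g = grad(x)
        d = -g                                  # descent direction
        kappa = 1.0
        # halve kappa until the step yields sufficient decrease
        while f(x + kappa * d) > f(x) + c * kappa * g * d:
            kappa *= 0.5
            if kappa < 1e-12:
                return x                        # no further progress
        x += kappa * d
    return x

# Example: minimising f(x) = (x - 3)^2 converges to x = 3
x_min = gradient_descent(lambda x: (x - 3.0) ** 2,
                         lambda x: 2.0 * (x - 3.0),
                         x0=0.0)
```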
This idea remains the same for gradient boosting. We are now interested in finding an optimal predictor function that minimizes a pre-chosen loss function between the evaluation of the predictor at the given covariates and the observed data ; popular examples of such loss functions are the Euclidean distance and the negative log-likelihood. The task hence is to find the function minimizing the functional . In addition, the predictor is split into several base-learners
. Usually one base-learner models the effect of one single covariate; e.g. base-learners in a regular linear regression model would take the form
as the linear effect of the th covariate. For componentwise gradient boosting it is crucial to compute the functional derivative
in order to fit the single base-learners separately to the negative gradient . This can be seen as an approximation of the gradient in order to obtain a steepest descent update in analogy to formula (2.1). Then only the best-performing base-learner is updated to obtain the new predictor
The step length is chosen as the value minimizing the loss function, but shrunken by a constant factor in order to give every base-learner an equal chance of being picked and to avoid overfitting. Since the optimal step length for gradient boosting methods using the Euclidean distance in fact equals 1, a popular choice in the literature for the shrunken step length is . Gradient boosting techniques are also capable of handling smooth effects, see Schmid and Hothorn [21]. A wide class of models and base-learners is included in the R package mboost available on CRAN; a tutorial on mboost can be found in [13].
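To make the componentwise scheme concrete, here is a minimal sketch with linear base-learners, squared-error loss and a shrinkage factor `nu`; all function and variable names are ours and the fixed intercept offset is a simplifying assumption, not part of the mboost implementation.

```python
import numpy as np

def componentwise_boost(X, y, n_iter=300, nu=0.1):
    """Componentwise L2-boosting: each covariate is fit separately
    (by simple least squares) to the current negative gradient,
    which under squared-error loss is just the residual vector, and
    only the best-fitting base-learner is updated, shrunken by nu."""
    n, p = X.shape
    coef = np.zeros(p)
    offset = y.mean()                    # intercept as fixed offset
    eta = np.full(n, offset)             # current predictor
    for _ in range(n_iter):
        u = y - eta                      # negative gradient of 0.5*||y - eta||^2
        beta = X.T @ u / (X ** 2).sum(axis=0)         # per-covariate LS fits
        rss = ((u[:, None] - X * beta) ** 2).sum(axis=0)
        j = int(rss.argmin())            # best-performing component
        coef[j] += nu * beta[j]          # shrunken update of that component only
        eta = offset + X @ coef
    return offset, coef

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 0]                        # only the first covariate is informative
offset, coef = componentwise_boost(X, y)
```

On such data the fitted coefficient vector concentrates on the informative component, which is exactly the variable selection property exploited later in the paper.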
2.2. Boosting Joint Models
We now formulate the explicit algorithm for optimizing the likelihood (1.1) and give a detailed description. The proposed algorithm for carrying out gradient boosting for joint models is given in Appendix A. The implementation of the algorithm as well as the simulation function discussed in the next section are provided in the new R add-on package JMboostE, whose source code is openly hosted at
http://www.github.com/cgriesbach/JMboostE.
Since both submodels contain different, but not disjoint, compositions of the predictors and , a separate boosting step for every submodel alone would not work. Thus the general concept is to create an outer loop in which every predictor is boosted separately. In a fourth step, the parameters and are updated according to the current likelihood. This approach is an extension of the boosting methods in multiple dimensions including nuisance parameters discussed in [15]. Since the exact sequence of substeps in the three main steps is essentially the same, we give a general explanation instead of spelling out every single step.
Initializing base-learners. Although the computation of the gradient depends on the exact definition of the base-learner functions , there are no restrictions on their specific form. Yet it is necessary to account for the different degrees of freedom, as otherwise e.g. non-linear base-learners approximated via splines have a higher chance of being picked in the variable selection process, as discussed in Hofner et al. [12]. Note in addition that the random effects are formulated as a single base-learner appearing in the step for the longitudinal predictor and are updated all at once, if selected.
Computing the gradient. The primary goal in every boosting iteration is computing the gradient as the functional derivative of the loss function with respect to the current predictor. This gradient indicates the direction to which the base-learners need to be fitted in order to obtain the optimal update. Since we have chosen the negative log-likelihood as our loss function , the gradient for the survival part takes the form
(2.2) 
where denotes the time-independent part of the shared predictor. A detailed derivation of formula (2.2) can be found in Appendix C. For the shared predictor a straightforward computation leads to the desired gradients as well. An explicit formulation can be found in the algorithm; the Appendix of [27] shows the calculation.
Stopping iterations. The main tuning parameters of the algorithm are the stopping iterations , and . Since the single predictors are updated separately, the best-performing triple has to be found. This is done via cross-validation, where the coefficients are fit to a set of training data, while the evaluation of the different choices for the stopping iterations is performed on the remaining individuals. The training and test datasets are obtained by randomly splitting the data into different subsets.
Step lengths. Boosting three predictor functions not only leads to three different upper boundaries for the number of total iterations, but also to three different choices for the step length in each updating scheme. For the step lengths used in the longitudinal and shared predictor we use the established value mentioned in the previous section. The step length of the survival submodel is chosen to be , since the algorithm needs more iterations to converge to the overall maximum of the survival likelihood. This choice is of course not compulsory, as lowering the weakness of the learner technically reduces estimation accuracy. One has to find a reasonable choice for with respect to a feasible computational effort, trading off computation time against the quality of the results.
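The tuning scheme for the stopping iterations can be sketched as a plain grid search over triples; `fit` and `loglik` are placeholders of our own for the boosting fit and the likelihood evaluation described above, not functions of the JMboostE package.

```python
import itertools

def select_mstop(fit, loglik, train, test, grids):
    """Grid search over triples of stopping iterations: for every
    triple a model is fit on the training individuals and the triple
    maximising the (joint) log-likelihood on the held-out
    individuals is kept."""
    best_triple, best_ll = None, float("-inf")
    for triple in itertools.product(*grids):
        theta = fit(train, mstop=triple)     # boosting fit with these stops
        ll = loglik(theta, test)             # out-of-sample evaluation
        if ll > best_ll:
            best_triple, best_ll = triple, ll
    return best_triple

# Toy check with a synthetic likelihood peaked at (10, 20, 30)
fit = lambda train, mstop: mstop
loglik = lambda theta, test: -((theta[0] - 10) ** 2
                               + (theta[1] - 20) ** 2
                               + (theta[2] - 30) ** 2)
best = select_mstop(fit, loglik, None, None,
                    ([5, 10, 15], [10, 20, 30], [20, 30]))
```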
Time and random effects. Although the time and random effects are part of the shared predictor, they are estimated solely on longitudinal data alongside the predictor . Since it has been shown in [9] that gradient boosting is not capable of handling time effects in survival analysis, this choice is essential for the reliability of the estimates. They nevertheless have an important impact on the joint model through their scaled appearance in the hazard function.
3. Simulation
In order to evaluate the method proposed in Section 2, we wrote a simulation algorithm which generates data following the joint model described above. The simulation evolved from the approach described in the Appendix of [27] but comes with some crucial changes in the part where event times are generated. This leads to bias reduction, as the representation of the true distribution of the event times is more accurate.
3.1. Data generation
The algorithm in Appendix B briefly describes the method for simulating data suitable for joint modeling. We now give a detailed description of the single steps.
Time points. The setup simulates a survey over years where every individual attends the survey randomly once a year. The first longitudinal measurement time is set to avoid missing-not-at-random censoring, which would occur if extremely early events were ignored completely.
Covariate matrices. We have and , where the first column of is the unit vector and the last column the time vector, with respect to the design of and . Note that although containing data for multiple longitudinal time points, only contains time-independent covariates. For the survival covariates we have
Event times. The event times are generated via inversion sampling. Given the distribution function of the event times, we can generate event times with distribution function
by drawing uniformly distributed random numbers
and setting . The inverse has a closed-form solution and can be obtained via a straightforward computation. We consider an event time as censored if it exceeds the last longitudinal measurement time, since this indicates the end of the survey and the individual is no longer under observation.
3.2. Results
Overall, we considered two different simulation setups: one low-dimensional situation mimicking a more common dataset and one high-dimensional situation to evaluate performance in cases where holds.
In both cases two different data sets were generated: training data with and test data with individuals. In a first step a model was fit to the training data, while the test data was used to evaluate this model's performance. Fitting and evaluating the model was done via grid search. Based on a three-dimensional grid containing possible triples of stopping iterations, a model was fit to the training data for each of these triples. The parameter estimates were then used to compute the overall likelihood function (1.1) on the test data, in order to evaluate the performance of every predefined triple of stopping iterations.
We used the grids and for the low- and high-dimensional setup, respectively. In both cases the selected stopping iterations did not exceed the grid, i.e. they did not concentrate on the upper boundary for any predictor. Tables 1 and 2 summarize the estimates of the single coefficients as well as the parameters , and for both simulation runs S1 (low-dimensional) and S2 (high-dimensional), which are discussed in more detail in the following two subsections. The variable selection properties are depicted in Table 3, where TP/FP stands for true/false positives, indicating the rate of correctly/incorrectly picked variables.
Truth    2              1              2
S1 (sd)  2.006 (0.012)  0.996 (0.009)        (0.008)
S2 (sd)  2.007 (0.017)  0.994 (0.010)        (0.010)

Truth    2              1
S1 (sd)        (0.090)  1.822 (0.105)  0.906 (0.087)
S2 (sd)        (0.092)  1.492 (0.084)  0.693 (0.080)

Truth    1              2
S1 (sd)  1.000 (0.007)        (0.007)  1.980 (0.029)
S2 (sd)  0.986 (0.016)        (0.020)  1.972 (0.045)

Truth    0.5            0.1            0.1
S1 (sd)  0.469 (0.038)  0.133 (0.025)  0.093 (0.003)
S2 (sd)  0.412 (0.033)  0.217 (0.030)  0.090 (0.006)

     TP    FP     TP    FP     TP    FP
S1   1.00  0.012  1.00  0.598  1.00  0.990
S2   1.00  0.013  1.00  0.035  1.00  0.022
3.2.1. Simulation 1: Low dimensional setting
For the first simulation, individuals with longitudinal measurements per individual were simulated. We chose the coefficients
where the first component of is the intercept and the last component the time effect . The remaining parameters were set as
Furthermore we added 15 non-informative covariates, six in each predictor, leading to 24 covariates overall. The coefficient estimates of 100 simulation runs in total can be seen in Figure 1.
The estimates are very close to the true values, which also holds for the remaining parameters. The coefficients of the survival predictor experience shrinkage, which leads to a small trade-off between and . The mean stopping iterations for each predictor are
(averaged over all 100 simulation runs). However, in a low-dimensional setup false positive rates tend to be higher; in particular, non-informative covariates are selected at a rate of in the survival and in the shared predictor. This is because gradient boosting algorithms are designed for predictive risk minimization rather than variable selection, so they tend to have higher false positive rates in low-dimensional settings. Since the random effects are boosted alongside the longitudinal predictor, the algorithm tends to pick these instead of non-informative longitudinal covariates, hence we have a false positive rate of in this case. True positives, on the other hand, are recognized 100% of the time.
3.2.2. Simulation 2: High dimensional setting
In the high-dimensional setup we used the same parameters as in S1 (Section 3.2.1), but now with 731 non-informative covariates per predictor overall, leading to . Technically, up to longitudinal measurements are conceivable, but since the death of an individual leads to the removal of the remaining values, this case is highly unlikely and never occurred in the simulations; hence the number of covariates exceeded the number of measurements in every single simulation run. The coefficient estimates are visualized in Figure 3. The coefficients in experience higher shrinkage than in the low-dimensional case, which also increases the trade-off between and , see Table 1. This is due to the fact that the loss function of the survival part has a higher complexity than the quadratic loss appearing in the longitudinal part. The coefficients in the shared predictor, on the other hand, are not shrunken: the shared gradient is mainly driven by the quadratic loss, since the longitudinal measurements clearly outnumber the survival data.
Variable selection performs considerably better in the high-dimensional setup. While informative variables are picked of the time, false positives occur at a rate of in the longitudinal, in the survival and in the shared predictor. The averaged stopping iterations are
The low value of compared to the low-dimensional setup explains the relatively high shrinkage in the survival predictor.
4. Example with AIDS data
We test our boosting algorithm on the AIDS dataset consisting of 467 patients with advanced human immunodeficiency virus infection during antiretroviral treatment, see Abrams et al. [1]. The aim of the study was to compare two antiretroviral drugs, labeled ddC and ddI, with respect to survival time. After study entry, patients had several follow-ups after 2, 6, 12 and 18 months where, amongst other things, the CD4 cell count was recorded. By the end of the study, 188 (40.26%) of the patients had died. In addition we have the baseline covariates gender, prev (whether or not the patient had acquired immunodeficiency syndrome) and azt (whether a previous treatment had failed or the patient was intolerant). More details regarding the study can be found in [1]; the dataset itself is quite popular and included in various R packages, e.g. JM. Since the CD4 cell count is a time-varying covariate, we formulate the following joint model
as we are interested in investigating the association between a time-dependent marker and the instantaneous risk of death. To ensure the model contains random effects, we start with a random intercept estimated via maximum likelihood and , . The grid search was done with equally chosen step lengths on the grid defined by , and , since in the case convergence of the model would already have been obtained. This leads to a total of 1000 estimated models, where the best-performing triple based on 10-fold cross-validation is
The coefficient paths of the survival model with this specific triple of stopping iterations are depicted in Figure 4. The algorithm picks the variable prev right away and thus detects an unsurprisingly strong link between suffering from AIDS and an increased risk of death. While azt is left fairly untouched by the algorithm, gender also gets picked after a while. The stepwise optimized association parameter takes the value , indicating a negative influence of the CD4 cell count on the hazard ratio: a decrease in CD4 cells leads to a higher risk of death, which is in line with the known facts, see [18].
5. Discussion and Outlook
Complementing the work by Waldmann et al. [27], where longitudinal analysis was improved by incorporating information gained through the observed event times, the extended algorithm offers a new approach to survival analysis by combining the advantages of both joint modeling and gradient boosting. Especially for high-dimensional data this turns out to be a powerful tool, since gradient boosting not only makes joint modeling of high-dimensional data manageable at all, but also offers good variable selection properties. In the application, the algorithm builds stepwise coefficient paths and allows stopping early in order to prevent overfitting and minimize the predictive risk.
Still, in its current simple form the model is not directly applicable to a more general class of datasets. In further work we plan not only to extend the baseline hazard to be time-varying, but also to include a wide range of smooth effects in the longitudinal as well as in the survival submodel. Particularly in the latter this is quite a challenge, since gradient boosting is not capable of estimating time-varying effects in survival analysis. We therefore intend to move on to likelihood-based boosting techniques [25], which address exactly these issues and have already proven to be practicable in very flexible survival models [11].
Another possible extension would focus on including an allocation mechanism, where the algorithm allocates single covariates to each predictor instead of boosting three different predictor functions with predetermined covariates. This not only improves the model's flexibility but also considerably reduces the computational effort, since the grid search is simplified to just one overall . A similar concept for boosting in multiple dimensions has been carried out by Thomas et al. [23]. The allocation approach for joint models is being investigated in [26] and can also be extended to the present model.
Furthermore, the variable selection properties, particularly the rate of false positives in low-dimensional setups, leave room for improvement. Since the algorithm is not specifically designed for variable selection, other approaches might be helpful here. Stability selection [17], being one of them, has already proved useful for boosting longitudinal and survival models, see [10, 16]. Probing, on the other hand, leads to better results regarding variable selection, as investigated in [8]. The idea of probing is to add non-informative phantom variables to each predictor and stop the boosting procedure as soon as a phantom variable gets selected.
In this article the inference scheme proposed in [27] has been extended to a more general class of joint models, which allows considerably more focus on survival analysis. Nevertheless, extending the manifold classes of joint models via tools from statistical learning is still far from its potential limits and open to further investigation.
References
[1] Donald I. Abrams, Anne I. Goldman, Cynthia Launer, Joyce A. Korvick, James D. Neaton, Lawrence R. Crane, Michael Grodesky, Steven Wakefield, Katherine Muth, Sandra Kornegay, David L. Cohn, Allen Harris, Roberta Luskin-Hawk, Norman Markowitz, James H. Sampson, Melanie Thompson, and Lawrence Deyton, A comparative trial of didanosine or zalcitabine after treatment with zidovudine in patients with human immunodeficiency virus infection, New England Journal of Medicine 330 (1994), 657–662.

[2] Leo Breiman, Arcing classifiers (with discussion), Ann. Statist. 26 (1998), 801–849.
[3] Leo Breiman, Prediction games and arcing algorithms, Neural Computation 11 (1999), 1493–1517.
 [4] Peter Bühlmann and Torsten Hothorn, Boosting algorithms: Regularization, prediction and model fitting, Statistical Science 22 (2007), no. 4, 477–505.
[5] Peter Bühlmann and Bin Yu, Boosting with the L2 loss, Journal of the American Statistical Association 98 (2003), no. 462, 324–339.
 [6] Yoav Freund and Robert E. Schapire, Experiments with a new boosting algorithm, Proceedings of the Thirteenth International Conference on Machine Learning Theory, Morgan Kaufmann, San Francisco, 1996, pp. 148–156.
 [7] X. Guo and B. P. Carlin, Separate and joint modeling of longitudinal and event time data using standard computer packages, The American Statistician 58 (2004), 16–24.
 [8] Tobias Hepp, Janek Thomas, Andreas Mayr, and Bernd Bischl, Probing for sparse and fast variable selection with modelbased boosting, Computational and Mathematical Methods in Medicine 2017 (2017), 422–430.
[9] Benjamin Hofner, Variable selection and model choice in survival models with time-varying effects, Diploma thesis, Ludwig-Maximilians-Universität München, 2008.
 [10] Benjamin Hofner, Luigi Boccuto, and Markus Göker, Controlling false discoveries in highdimensional situations: boosting with stability selection., BMC Bioinformatics 16 (2015), no. 12.
 [11] Benjamin Hofner, Torsten Hothorn, and Thomas Kneib, Variable selection and model choice in structured survival models, Computational Statistics 28 (2013), 1079–1101.
 [12] Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid, A framework for unbiased model selection based on boosting, Journal of Computational and Graphical Statistics 20 (2011), 956–971.
[13] Benjamin Hofner, Andreas Mayr, Nikolay Robinzonov, and Matthias Schmid, Model-based boosting in R: A hands-on tutorial using the R package mboost, Computational Statistics 29 (2014), no. 1–2, 3–35.
 [14] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean, Functional gradient techniques for combining hypotheses, Advances in Large Margin Classifiers (1999).
[15] Andreas Mayr, Nora Fenske, Benjamin Hofner, Thomas Kneib, and Matthias Schmid, Generalized additive models for location, scale and shape for high dimensional data: a flexible approach based on boosting, Journal of the Royal Statistical Society: Series C (Applied Statistics) 61 (2012), no. 3, 403–427.
[16] Andreas Mayr, Benjamin Hofner, and Matthias Schmid, Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection, BMC Bioinformatics 17 (2016).
[17] Nicolai Meinshausen and Peter Bühlmann, Stability selection, Journal of the Royal Statistical Society: Series B 72 (2010), 417–473.
 [18] Online Mendelian Inheritance in Man, OMIM, CD4 antigen. MIM number: 186940, 2016.
[19] Dimitris Rizopoulos, JM: An R package for the joint modelling of longitudinal and time-to-event data, Journal of Statistical Software 35 (2010), no. 9.
[20] Dimitris Rizopoulos, Joint models for longitudinal and time-to-event data: With applications in R, Chapman & Hall/CRC Biostatistics Series, vol. 6, CRC Press, Boca Raton, 2012.
[21] Matthias Schmid and Torsten Hothorn, Boosting additive models using component-wise P-splines, Computational Statistics & Data Analysis 53 (2008), no. 2, 298–311.
[22] Michael J. Sweeting and Simon G. Thompson, Joint modelling of longitudinal and time-to-event data with application to predicting abdominal aortic aneurysm growth and rupture, Biometrical Journal 53 (2011), no. 5, 750–763.
 [23] Janek Thomas, Andreas Mayr, Bernd Bischl, Matthias Schmid, Adam Smith, and Benjamin Hofner, Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates, Statistics and Computing (2017), 1–15.
[24] Anastasios A. Tsiatis and Marie Davidian, Joint modeling of longitudinal and time-to-event data: An overview, Statistica Sinica 14 (2004), no. 3, 809–834.
[25] Gerhard Tutz and Harald Binder, Generalized additive modeling with implicit variable selection by likelihood-based boosting, Biometrics 62 (2006), no. 4, 961–971.
[26] Elisabeth Waldmann, Colin Griesbach, and Andreas Mayr, Variable selection and allocation in joint models for longitudinal and time-to-event data via boosting, in preparation, 2018.
[27] Elisabeth Waldmann, David Taylor-Robinson, Nadja Klein, Thomas Kneib, Tania Pressler, Matthias Schmid, and Andreas Mayr, Boosting joint models for longitudinal and time-to-event data, Biometrical Journal (2017).
 [28] Philip Wolfe, Convergence conditions for ascent methods, SIAM Review 11 (1969), no. 2, 226–235.
[29] Michael S. Wulfsohn and Anastasios A. Tsiatis, A joint model for survival and longitudinal data measured with error, Biometrics 53 (1997), no. 1, 330–339.
Appendix A Boosting algorithm

Initialize the predictors , and define base-learners , and ; specify for the random effects and for the time effect. Initialize the association parameter , the baseline hazard and the model error . Choose stopping iterations , and with overall maximum and step lengths , and .

for to do

step1: Update longitudinal predictor
if set , , and skip this step
else
Compute as

Fit the negative gradient vector separately to every base-learner specified for the longitudinal predictor :

Select the component
that best fits and update .


step2: Update survival predictor
if set , and skip this step
else
Compute as

Fit the negative gradient vector separately to every base-learner specified for the survival predictor :

Select the component
that best fits and update .


step3: Update shared predictor
if set and skip this step
else
Compute as

Fit the negative gradient vector separately to every base-learner specified for the shared predictor :

Select the component
that best fits and update .


step4: Update and

if set and skip this update
else for the number of total longitudinal observations and covariates set.

if set and skip this update
else


end when .
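Stripped of all model detail, the control flow of the algorithm above is a single outer loop in which each predictor keeps its own stopping iteration. The following sketch is ours: the `update_*` callables are abstract placeholders for the gradient-fitting steps 1 to 3 and the nuisance update of step 4, not functions from the JMboostE package.

```python
def joint_boost(update_l, update_s, update_sh, update_nuisance,
                m_l, m_s, m_sh):
    """Outer loop of the joint-model boosting algorithm: in iteration
    m the longitudinal, survival and shared predictors are each
    boosted by their own componentwise step until their individual
    stopping iteration is reached (after which that step is skipped),
    and the nuisance parameters are refreshed in a fourth step."""
    for m in range(1, max(m_l, m_s, m_sh) + 1):
        if m <= m_l:
            update_l()            # step 1: longitudinal predictor
        if m <= m_s:
            update_s()            # step 2: survival predictor
        if m <= m_sh:
            update_sh()           # step 3: shared predictor
        update_nuisance()         # step 4: association parameter,
                                  #         baseline hazard, error variance
```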
Appendix B Simulation algorithm

Choose values for and .

Generate time points for longitudinal measurements the following way:

Sample and set for and .

For each shift observation times to the left, so we have .

Standardize time points to the unit interval by

Set as the collection of all longitudinal time points.


Generate covariate matrices using uniformly distributed random numbers and random effects following a normal distribution with mean 0.

Calculate the predictor vectors

Simulate longitudinal measurements by
where denotes the dimensional multivariate normal distribution and the corresponding unit matrix.

Draw event times by generating random numbers and setting
according to inversion sampling. Set to obtain censored data with censoring indicator and obtain the observed survival outcome .

Delete all entries from corresponding to times for every individual .

Add columns of i.i.d. generated random numbers as non-informative covariates to and .
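The inversion-sampling step above can be illustrated with the simplest special case of a constant hazard lam, where F(t) = 1 - exp(-lam t) and the closed-form inverse is F^{-1}(u) = -log(1 - u)/lam. This is a simplified sketch of our own; in the actual simulation the hazard additionally carries the scaled shared predictor.

```python
import numpy as np

def sample_event_times(lam, size, rng):
    """Inversion sampling for a constant hazard lam: draw u ~ U(0, 1)
    and set T = F^{-1}(u) = -log(1 - u) / lam, so T follows an
    exponential distribution with rate lam."""
    u = rng.uniform(size=size)
    return -np.log(1.0 - u) / lam

rng = np.random.default_rng(1)
times = sample_event_times(lam=2.0, size=100_000, rng=rng)
# the sample mean approaches the theoretical mean 1 / lam = 0.5
```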
Appendix C Gradient of the survival predictor
Since computing the functional derivative with respect to the longitudinal predictor is straightforward and the computation of the shared gradient is analogous to the explanation in the appendix of [27], we sketch the calculation of the functional derivative with respect to the survival predictor.
As mentioned in Section 2, we set the negative log-likelihood as our loss function, thus . Since appears only in the survival component of the likelihood, we get
as does not depend on . We get
with again denoting the time-constant part of , so we have
for the final gradient.