1 Introduction
The terms joint models or joint modelling have been used in various contexts to describe the modelling of a combination of different outcomes. This article deals with joint models for longitudinal and survival outcomes, in which the predictors for both are composed of individual as well as shared subpredictors (i.e. a part of the predictor which is used in both the longitudinal and the survival part of the model). The shared subpredictor is scaled by an association parameter which quantifies the relationship between the two parts of the model. This type of model was first suggested by Wulfsohn and Tsiatis (1997) in order to prevent the bias resulting from the independent estimation of the two entities, and this approach has subsequently been modified and extended in various ways. Simulation studies comparing results from separate and joint modelling analyses of survival and longitudinal outcomes were undertaken by Guo and Carlin (2004). The simplest possible formulation is the shared random effects model, where the shared subpredictor consists of random effects. Choosing a random intercept and slope model, the resulting joint model is
$$y_{ij} = \eta_l(x_{ij}) + \gamma_{0i} + \gamma_{1i} t_{ij} + \varepsilon_{ij},$$
$$\lambda_i(t) = \lambda_0(t) \exp\big(\eta_s(x_i) + \alpha(\gamma_{0i} + \gamma_{1i} t)\big),$$
where the $y_{ij}$ are the longitudinal measurements recorded per individual $i$ at time points $t_{ij}$ with $j = 1, \ldots, n_i$, where $n_i$ is the number of observations recorded for individual $i$. The hazard function $\lambda_i(t)$ evaluated at time $t$ is based on the baseline hazard $\lambda_0(t)$. The coefficients $\gamma_{0i}$ and $\gamma_{1i}$ are individual-specific random intercept and slope, while $\eta_l$ and $\eta_s$ are the longitudinal and the survival subpredictor respectively. Both are functions of independent sets of covariates, possibly varying over time in the longitudinal subpredictor. The association parameter $\alpha$ quantifies the relationship between the two parts of the model. This type of model has been used in many biomedical settings, see for example Gao (2004) or Liu et al. (2007). For a critical review of shared random effects models in multivariate joint modelling see Fieuws and Verbeke (2004). Many extensions have been suggested for the shared random effects model above, as well as a general approach with a universal shared subpredictor. Ha et al. (2003) used a generalized model for the longitudinal component, while Li et al. (2009) suggested an approach for robust modelling of the longitudinal part and included competing risks in the survival part. Chan (2016) recently included binary outcomes modelled by an autoregressive function, while the model proposed by Armero et al. (2016) accounts for heterogeneity between subjects, serial correlation, and measurement error. Interpretation of the separate parts of the model was simplified by the work of Efendi et al. (2013). A computationally less burdensome approach has recently been suggested by Barrett et al. (2015). For theoretical results on approximation and exact estimation, see Sweeting and Thompson (2011), and for an overview of the development up to 2004 see Tsiatis and Davidian (2004). Since then, Rizopoulos (2012) has substantially contributed to the research field with the R add-on package JM (for an introduction to the package see Rizopoulos, 2010).
One of the main limitations of classical estimation procedures for joint models in modern biomedical settings is that they are infeasible for high-dimensional data (with more explanatory variables than patients or even observations). But even in low-dimensional settings, the lack of a clear variable selection strategy poses further challenges. In order to deal with these problems, we propose a new inferential scheme for joint models based on gradient boosting
(Bühlmann and Hothorn, 2007). Boosting algorithms emerged from the field of machine learning and were originally designed to enhance the performance of weak classifiers (base-learners) with the aim of yielding a perfect discrimination of binary outcomes (Freund and Schapire, 1996). This was done by iteratively applying them to reweighted data, giving higher weights to observations that were misclassified previously. This powerful concept was later extended for use in statistics in order to estimate additive statistical models using simple regression functions as base-learners (Friedman et al., 2000; Friedman, 2001). The main advantages of statistical boosting algorithms (Mayr et al., 2014a) are (i) their ability to carry out automated variable selection, (ii) their ability to deal with high-dimensional data and (iii) that they result in statistical models with the same straightforward interpretation as common additive models estimated via classical approaches (Tutz and Binder, 2006; Bühlmann and Hothorn, 2007). The aim of this paper is the extension of statistical boosting to simultaneously estimate and select multiple additive predictors for joint models in potentially high-dimensional data situations. Our algorithm is based on gradient boosting and cycles through the different subpredictors, iteratively carrying out boosting updates on the longitudinal and the shared subpredictor. The model variance, the association parameter and the baseline hazard are optimized simultaneously by maximizing the log-likelihood. To the best of our knowledge, this is the first statistical-learning algorithm to estimate joint models and the first approach to introduce automated variable selection for potentially high-dimensional data in the framework of joint models.
We apply our new algorithm to data on repeated measurements of lung function in patients with cystic fibrosis in Denmark. Cystic fibrosis is the most common serious genetic disease in white populations. Most patients with cystic fibrosis die prematurely as a result of progressive respiratory disease (Taylor-Robinson et al., 2012). Loss of lung function in cystic fibrosis is accelerated for patients who are affected by pulmonary infections such as Pseudomonas aeruginosa. However, it may also be the case that more rapid lung function decline predisposes patients to infection (Qvist et al., 2015). This makes the data well suited for a joint modelling approach: we aim to model lung function decline jointly with the onset of infection with Pseudomonas aeruginosa and to select the most relevant covariates from the data set, in order to better understand how lung function influences the risk of infection onset, whilst taking into account other covariates such as sex and age that influence both processes.
The remainder of the paper is structured as follows: In the second section we present a short introduction to joint modelling in general and describe how we approach estimation with a boosting algorithm. In the next section we conduct a simulation study in order to evaluate the ability to estimate and select effects of various variables for both low-dimensional and high-dimensional data. In the fourth section we apply our approach to cystic fibrosis data with a focus on variable selection. Finally, we discuss further extensions of joint models made possible by boosting.
2 Methods
In this section we describe the model and the associated likelihood as used in the rest of the paper. There are two different dependent variables: a longitudinal outcome and the time of event for the endpoints of interest. The predictor for the longitudinal outcome divides into two parts:
$$y_{ij} = \eta_l(x_{l,ij}) + \eta_{ls}(x_{ls,i}, t_{ij}) + \varepsilon_{ij},$$
where $i$ refers to the $i$th individual, $j$ to the $j$th observation, and $\varepsilon_{ij}$ is the model error, which is assumed to follow a normal distribution with zero mean and variance $\sigma^2$. The two functions $\eta_l(\cdot)$ and $\eta_{ls}(\cdot)$, which will be referred to as the longitudinal and the shared subpredictor, are functions of two separate sets of covariates: $x_{l,ij}$ are possibly time-varying covariates included only in the longitudinal subpredictor; $x_{ls,i}$ are covariates varying over individuals yet not over time and are included in the shared subpredictor. In the setup we are using throughout this paper, the shared subpredictor will also include a random intercept and slope, denoted by $\gamma_{0i}$ and $\gamma_{1i}$ respectively. The shared subpredictor reappears in the survival part of the model:
$$\lambda(t \mid x_{ls,i}) = \lambda_0 \exp\big(\alpha \, \eta_{ls}(x_{ls,i}, t)\big),$$
where the baseline hazard is chosen to be constant ($\lambda_0(t) \equiv \lambda_0$) in this paper. The subpredictor with subscript 'ls' refers to both the longitudinal and survival part of the model, whereas it is assumed that the covariates in $\eta_l$ only have an impact on the longitudinal structure. The relation between both parts of the model is quantified by the association parameter $\alpha$. Consequently, the two predictor equations can be summarized in the joint likelihood:
$$\ell = \sum_{i=1}^{n} \bigg[ \sum_{j=1}^{n_i} \log \phi\big(y_{ij};\, \eta_l(x_{l,ij}) + \eta_{ls}(x_{ls,i}, t_{ij}),\, \sigma^2\big) + \log L_{\mathrm{surv},i} \bigg] \quad (2)$$
where $T_i$ is the observed time of event for individual $i$ and where the distribution for the longitudinal part is the Gaussian error distribution $\phi$. The parameter $\delta_i$ is the censoring indicator for the $i$th individual, taking the value $0$ in the case of censoring and $1$ in the case of an event. The likelihood for the survival part is:
$$L_{\mathrm{surv},i} = \lambda(T_i \mid x_{ls,i})^{\delta_i} \exp\bigg(-\int_0^{T_i} \lambda(u \mid x_{ls,i}) \, du\bigg).$$
2.1 Componentwise gradient boosting
In the following section we briefly highlight the general concept of boosting before we adapt it to the class of joint models. The basic idea of statistical boosting algorithms is to find an additive predictor $\eta$ for a statistical model that optimizes the expected loss $\mathbb{E}[\rho(y, \eta)]$ for a specific loss function $\rho(\cdot, \cdot)$. The loss function describes the type of regression setting and denotes the discrepancy between realizations $y$ and the model $\eta$. The most typical example for a loss function is the $L_2$ loss $\rho(y, \eta) = (y - \eta)^2$ for classical regression of the expected value.
For a given set of observations $(y_i, x_i)$, $i = 1, \ldots, n$, the algorithm searches for the best solution to minimize the empirical loss (often referred to as empirical risk) for this sample:
$$\hat{\eta} = \operatorname*{argmin}_{\eta} \; \frac{1}{n} \sum_{i=1}^{n} \rho\big(y_i, \eta(x_i)\big).$$
In the case of the classical $L_2$ loss, the empirical risk simply refers to the mean squared error. While there exist different approaches to statistical boosting (Mayr et al., 2014a), in this work we will focus on componentwise gradient boosting (Bühlmann and Hothorn, 2007). The main concept is to iteratively apply base-learners $h_1(x_1), \ldots, h_p(x_p)$, which are typically simple regression-type functions that use only one component of the predictor space (i.e., one covariate $x_j$). The base-learners are fitted one-by-one (componentwise), not on the observations $y_i$ but on the negative gradient vector of the loss
$$u^{[m]} = \bigg( -\frac{\partial \rho(y_i, \eta)}{\partial \eta} \bigg|_{\eta = \hat{\eta}^{[m-1]}(x_i)} \bigg)_{i = 1, \ldots, n}$$
at the $m$th iteration step. In the case of the $L_2$ loss, this means that the base-learners in iteration $m$ are actually fitted on the residuals from iteration $m-1$.
In classical boosting algorithms from the field of machine learning, base-learners are simple classifiers for binary outcomes. In the case of statistical boosting, where the aim is to estimate additive predictors, the base-learners themselves are regression models: the base-learner $h_j(x_j)$ represents the potential partial effect of variable $x_j$ on the outcome. Examples are simple linear models ($h_j(x_j) = \beta_j x_j$) or smooth non-parametric terms estimated via splines.
In every boosting iteration, only the best-performing base-learner is selected to be updated in the additive predictor (see Algorithm 1). Typically, one base-learner is used for each variable. The specification of the base-learner defines the type of effect the variable is assumed to have in the predictor. One could, for example, use simple linear models for linear effects (Bühlmann, 2006) or P-splines for non-linear effects (Schmid and Hothorn, 2008).
The stopping iteration $m_{\mathrm{stop}}$ is the main tuning parameter, as it controls variable selection and shrinkage of effect estimates. If the algorithm is stopped before convergence, base-learners that have never been selected for the update are effectively excluded from the final model. Higher values of $m_{\mathrm{stop}}$ hence lead to larger, more complex models, while smaller values lead to sparser models with less complexity. In practice, $m_{\mathrm{stop}}$ is often selected via cross-validation or resampling methods, by choosing the value that leads to the smallest empirical risk on test data (Hofner et al., 2014).
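To make the selection-and-shrinkage mechanism concrete, here is a minimal Python sketch of componentwise gradient boosting with the $L_2$ loss and simple linear base-learners. The function name and data are purely illustrative and not part of the JMboost implementation.

```python
import numpy as np

def componentwise_l2_boost(X, y, m_stop=100, nu=0.1):
    """Componentwise gradient boosting with the L2 loss and simple
    linear base-learners. m_stop controls sparsity and shrinkage."""
    n, p = X.shape
    beta = np.zeros(p)
    offset = y.mean()            # initialise with the unconditional mean
    fit = np.full(n, offset)
    for _ in range(m_stop):
        u = y - fit              # negative gradient = residuals for the L2 loss
        # fit every univariate linear base-learner to the residuals
        coefs = X.T @ u / np.sum(X**2, axis=0)
        rss = ((u[:, None] - X * coefs)**2).sum(axis=0)
        j = int(np.argmin(rss))  # select the best-performing base-learner
        beta[j] += nu * coefs[j] # weak update with step-length nu
        fit += nu * X[:, j] * coefs[j]
    return offset, beta
```

Covariates whose coefficient stays exactly zero were never selected before $m_{\mathrm{stop}}$, which is how early stopping performs variable selection.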
For theoretical insights on the general concept of boosting algorithms, we refer to the work of Zhang and Yu (2005), who studied numerical convergence and consistency with different loss functions. For the $L_2$ loss, Bühlmann (2006) proved the consistency of gradient boosting with simple linear models as base-learners in the context of high-dimensional data (cf. Hepp et al., 2016).
2.2 Boosting for multiple dimensions
The general concept of componentwise gradient boosting was later extended to numerous regression settings (for a recent overview, see Mayr et al., 2014b). Some of these extensions focused on loss functions that can be optimized with respect to multiple dimensions simultaneously (Schmid et al., 2010; Mayr et al., 2012). This can refer either to regression settings where multiple distribution parameters should be modelled, as in distributional regression (Rigby and Stasinopoulos, 2005), or to settings where, in addition to the main model, some nuisance parameter (e.g., a scale parameter for negative binomial regression) should be optimized simultaneously.
Boosting the latter model can be achieved by first carrying out the classical gradient-fitting and updating steps for the additive predictor $\eta$ (see Algorithm 1) and second by updating the nuisance parameter $\theta$, both in each iteration step. Updating the nuisance parameter is done by optimizing the loss function with respect to $\theta$, keeping the current additive predictor fixed:
$$\hat{\theta}^{[m]} = \operatorname*{argmin}_{\theta} \; \sum_{i=1}^{n} \rho\big(y_i, \hat{\eta}^{[m]}(x_i), \theta\big). \quad (3)$$
A typical example of a regression setting where several parameters are modelled simultaneously by additive predictors is given by generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos, 2005). The idea is to model every parameter of a conditional distribution by its own additive predictor with its own associated link function. This involves not only the location (e.g., the mean), but also scale and shape parameters (e.g., variance, skewness).
In the case of boosting GAMLSS, the algorithm needs to estimate several statistical models simultaneously, one per distribution parameter. This is achieved by incorporating an outer loop that circles through the different distribution parameters, carrying out one boosting iteration for each and updating them one by one (see Algorithm 2 for details). As a result, the algorithm can provide intrinsic shrinkage and variable selection for all these models simultaneously.
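As a concrete, simplified illustration of this outer loop, the following Python sketch boosts a Gaussian location-scale model (a two-parameter GAMLSS): each outer iteration carries out one componentwise update for the mean predictor and one for the predictor of $\log\sigma$. All names and the gradient expressions are our own derivation for this toy case; this is not the gamboostLSS or JMboost implementation.

```python
import numpy as np

def boost_location_scale(X_mu, X_sg, y, m_stop=200, nu=0.1):
    """Cyclic boosting of a Gaussian location-scale model: the outer loop
    circles through the two distribution parameters, one update each."""
    n = len(y)
    b_mu = np.zeros(X_mu.shape[1])
    b_sg = np.zeros(X_sg.shape[1])
    eta_mu = np.full(n, y.mean())   # offset: empirical mean
    eta_sg = np.zeros(n)            # models log(sigma), offset 0

    def cw_update(X, u, beta, eta):
        # componentwise fit of univariate linear base-learners to gradient u
        coefs = X.T @ u / np.sum(X**2, axis=0)
        rss = ((u[:, None] - X * coefs)**2).sum(axis=0)
        j = int(np.argmin(rss))
        beta[j] += nu * coefs[j]
        eta += nu * X[:, j] * coefs[j]   # in-place update of the predictor

    for _ in range(m_stop):
        sigma2 = np.exp(2 * eta_sg)
        # location step: negative log-likelihood gradient wrt eta_mu
        cw_update(X_mu, (y - eta_mu) / sigma2, b_mu, eta_mu)
        sigma2 = np.exp(2 * eta_sg)
        # scale step: negative log-likelihood gradient wrt eta_sg = log(sigma)
        cw_update(X_sg, (y - eta_mu)**2 / sigma2 - 1.0, b_sg, eta_sg)
    return b_mu, b_sg
```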
2.3 Boosting Joint Models
The algorithm we suggest for estimating the subpredictors for joint modelling is based on the boosting algorithm for multiple dimensions as outlined in the previous part of this section, but it differs from Algorithm 2 in a range of aspects. The main reason for these differences is that the additive predictors for the two dependent variables (the longitudinal outcome and the time-to-event) are neither entirely different nor completely identical. While the longitudinal outcome is modelled by the set of the two subpredictors $\eta_l$ and $\eta_{ls}$, the survival outcome in our model is solely based on the shared subpredictor $\eta_{ls}$ and the corresponding association parameter $\alpha$. It is hence not possible to construct the algorithm such that it circles between the two outcomes, as was done for the distribution parameters in the preceding section. Translating the joint likelihood into one empirical loss function poses difficulties, since the losses of the different observations differ highly in their dimensionality. We therefore suggest an updating scheme at the predictor stage (i.e. $\eta_l$ and $\eta_{ls}$) rather than at the level of the dependent variables. More specifically, we define an outer loop that cycles in every boosting iteration through the following three steps:

(step 1) update $\eta_l$ in a boosting iteration,

(step 2) update $\eta_{ls}$ in a boosting iteration,

(step 3) update $\alpha$, $\lambda_0$ and $\sigma^2$ by maximizing the likelihood.
We omit the arguments of the subpredictors to ensure readability in the following sections. The subpredictors will hence be denoted by the indexed values $\eta_{l,ij}$ for the longitudinal and $\eta_{ls,ij}$ for the shared subpredictor. The derivation of the parts necessary for the three steps will be described in the following, with exact calculations in Appendix A. In the tradition of shared random effects models, we consider a setup including random effects in the shared subpredictor and furthermore allow for a fixed effect linear in time. The structure of $\eta_{ls}$ is hence
$$\eta_{ls,ij} = \beta_t t_{ij} + \gamma_{0i} + \gamma_{1i} t_{ij} + \eta_{ls}^{0}(x_{ls,i}),$$
where $\eta_{ls}^{0}(\cdot)$ can include various non-time-dependent covariate functions.
(step 1) Boosting the exclusively longitudinal predictor:
The optimization problem for the longitudinal predictor is straightforward, since it is basically the same as for a quadratic loss function. The gradient vector at iteration $m$ hence consists of the (scaled) differences, i.e. residuals, from the iteration before:
$$u_l^{[m]} = \bigg( \frac{y_{ij} - \hat{\eta}_{l,ij}^{[m-1]} - \hat{\eta}_{ls,ij}^{[m-1]}}{\hat{\sigma}^{2\,[m-1]}} \bigg)_{ij}$$
The vector $u_l^{[m]}$ has the same length as the vector of longitudinal outcomes $y$, including all longitudinal measurements before the event/censoring. Note that there is a slight difference to the gradient vector for classical Gaussian regression: the variance $\sigma^2$ has to be included to ensure that the weighting between the two parts of the model is correct.
(step 2) Boosting the joint predictor:
The optimization problem for the joint predictor is more complex. To account for the different scales of the two outcomes, we construct the loss function in such a way that there is an entry for each individual rather than for each observation. This leads to $n$ entries in the vector of loss functions, each consisting of two parts, one per outcome, for individual $i$. The entries of the loss vector hence are the negative individual contributions to the joint log-likelihood (2), with $i = 1, \ldots, n$. The resulting entries of the gradient vector at iteration $m$, after the update of $\eta_l$, follow by differentiating these loss entries with respect to $\eta_{ls}$, where $\beta_t$ and $\gamma_{1i}$ are the coefficients for the fixed time effect and the random slope in the shared subpredictor and $\eta_{ls}^{0}$ is the part of the subpredictor not depending on time. For the complete derivation of this result see Appendix A.
(step 3) Estimating the association parameter, the baseline hazard and the variance:
In order to estimate $\alpha$, $\lambda_0$ and $\sigma^2$, we maximize the likelihood (2) simultaneously in every iteration.
This third step is carried out after every boosting iteration for the joint predictor (step 2), even if boosting the longitudinal predictor has already been stopped and the corresponding steps are hence skipped.
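A simplified Python sketch of this third step, under the strong extra assumption that the shared predictor entering the hazard is time-constant per individual, so that the cumulative hazard is $\lambda_0 T_i \exp(\alpha \eta_{ls,i})$: then $\sigma^2$ has a closed-form update, $\lambda_0$ can be profiled out, and $\alpha$ can be found by a one-dimensional grid search. The function name and the grid are our own illustration, not the exact optimization used in JMboost.

```python
import numpy as np

def update_nuisance(resid, eta_ls, T, delta, alpha_grid=np.linspace(-3, 3, 601)):
    """Step-3 sketch: update sigma^2, the constant baseline hazard lambda0
    and the association alpha by maximising the joint log-likelihood,
    keeping the current predictors (residuals resid, shared part eta_ls) fixed."""
    sigma2 = np.mean(resid**2)      # closed-form Gaussian variance update
    best_ll, alpha, lam0 = -np.inf, None, None
    for a in alpha_grid:
        w = T * np.exp(a * eta_ls)  # cumulative-hazard weights per individual
        lam = delta.sum() / w.sum() # lambda0 profiled out in closed form
        # survival log-likelihood with lambda0 profiled out
        ll = np.sum(delta * (np.log(lam) + a * eta_ls)) - lam * w.sum()
        if ll > best_ll:
            best_ll, alpha, lam0 = ll, a, lam
    return sigma2, alpha, lam0
```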
The complete proposed algorithm for boosting JM is presented as Algorithm 3 and its implementation is provided with the new R add-on package JMboost, whose source code is openly hosted at
http://www.github.com/mayrandy/JMboost.
Model tuning:
Tuning of the algorithm is similar to that of classical boosting algorithms for multidimensional loss functions: in general, both the step-length $\nu$ and the different stopping iterations have an influence on the variable selection properties, the convergence speed and the final complexity of the additive predictors. However, in practice there exists a quasi-linear relation between the step-length and the required number of boosting iterations (Schmid and Hothorn, 2008). As a result, it is often recommended to use a fixed small value for the step-length $\nu$ and to optimize the stopping iteration instead (Mayr et al., 2012; Hofner et al., 2014).
In the case of boosting JM, where two additive predictors are fitted and potentially two boosting updates are carried out in each iteration of the algorithm, it is hard to justify why both predictors should be optimal after the same number of boosting iterations (i.e. $m_{\mathrm{stop},l} = m_{\mathrm{stop},ls}$). This special case was referred to as one-dimensional early stopping, in contrast to the more flexible multi-dimensional early stopping, where instead of one single value a vector of stopping iterations has to be optimized via a grid (Mayr et al., 2012). This computationally burdensome approach is necessary to allow for different levels of complexity in the additive predictors. The range of the grid (minimum and maximum $m_{\mathrm{stop}}$) has to be specified ad hoc; however, it might be necessary to adapt it based on the results (adaptive grid search).
For a more detailed discussion on how to select this grid in practice, we refer to Hofner et al. (2016). Finally, the combination of stopping iterations from the grid which yields the smallest empirical risk on test data (e.g., via cross-validation or resampling procedures) is selected.
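The two-dimensional grid search itself is generic; here is a sketch in Python, where the hypothetical callback `fit_and_risk(m_l, m_ls)` stands in for fitting the joint model with the given stopping iterations and returning the predictive risk on test data:

```python
import itertools
import numpy as np

def grid_search_mstop(fit_and_risk, grid_l, grid_ls):
    """Multi-dimensional early stopping: evaluate every combination of
    stopping iterations and return the pair with the smallest risk."""
    best_risk, best_pair = np.inf, None
    for m_l, m_ls in itertools.product(grid_l, grid_ls):
        risk = fit_and_risk(m_l, m_ls)
        if risk < best_risk:
            best_risk, best_pair = risk, (m_l, m_ls)
    return best_pair
```

In an adaptive grid search one would additionally inspect whether the selected pair lies on the boundary of the grid and, if so, extend or refine the grid in that direction.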
3 Simulations
To evaluate the performance of the new boosting algorithm, a generic simulation setup was created to mimic a typical joint modelling situation; it is described in the following. The technical details of the complete simulation algorithm are explained in depth in Appendix B; the underlying R code to reproduce the results is included in the Supporting Information.
3.1 Simulation Setup
The purpose of the simulation study is threefold:

evaluate estimation accuracy,

test variable selection properties, and

determine the appropriate stopping iterations $m_{\mathrm{stop},l}$ and $m_{\mathrm{stop},ls}$.
The evaluation of estimation accuracy has to be carried out both for the coefficients of the subpredictors and for the association parameter, which plays a crucial role in the interpretation of the model. Three different setups were chosen to give insight into all three questions from different angles. We will first describe the basis for all three models and then point out the differences in the simulation concepts. All three models are based on the following subpredictors:
The matrices $X_l$ and $X_{ls}$ are the collections of the standardised covariates, $\beta_l$ and $\beta_{ls}$ the corresponding linear effects with sub-vectors of different lengths, $\beta_t$ is the impact of time, $\gamma_0$ the random intercept and $\gamma_1$ the random slope. The same individuals were generated for all three setups. For each individual, we drew five time points, which were reduced to fewer than five in cases with events before the last observation. If there were no incident for any of the simulated individuals, the number of observations would thus equal five times the number of individuals; however, the simulations are constructed in such a way that this case never occurs. The first of the three setups (S1) was constructed to mimic the data situation of the application in Section 4 more closely and thus to better demonstrate the ability of the algorithm to perform variable selection in a lower dimension; S1 had four non-informative covariates in each predictor. In the second simulation setup (S2), the number of non-informative covariates per subpredictor exceeds the number of individuals. In the third setup (S3), the number of non-informative covariates over both subpredictors even exceeds the theoretical maximum number of observations. Please note that we first simulated a genuine informative part of the model and afterwards drew the non-informative covariates for each setup individually. The three setups are hence based on the same data, and the informative covariates of the three models are the same. In all three setups the association parameter $\alpha$ was fixed at the same value.
3.2 Results
We ran repeated simulation runs of each setup; the results are summarized in Table 1 and described in detail in the following.
Table 1: True positive (TP) and false positive (FP) rates for variable selection in the longitudinal and the shared subpredictor, and the estimated association parameter, for the setups S1, S2 and S3. (The numeric entries of this table could not be recovered from the extracted source.)
(i) Evaluation of estimation accuracy
As can be seen in Table 1, the estimates for the coefficients of the subpredictors were very close to the true values. Standard deviations were small, except for the estimation of the time effect, where the variation between the simulation runs is slightly higher. A graphical illustration is provided in Figure 1, which displays boxplots of the resulting coefficients for S3 (the results for the other setups differ only slightly). The slight tendency towards underestimation can be attributed to the shrinkage behaviour of boosting algorithms in general; this underestimation can be observed in all three models in both parts. The estimate of the crucial association parameter $\alpha$ was also very close to the true value in all setups (see the last column of Table 1 and Figure 2). The slight tendency to overestimate can be attributed to the compensatory behaviour of the association parameter: $\alpha$ is estimated via optimisation and is hence not subject to shrinkage but adapts directly to the data.
(ii) Variable selection properties
All informative variables were selected in all runs in all three setups. In S1, more non-informative variables were selected in the longitudinal predictor than in the shared predictor. Boosting tends to select a higher percentage of non-informative variables in a small setup, which explains the higher selection rate in the longitudinal component. The fact that the shared part is less prone to this greedy behaviour of the algorithm can be attributed to the option of choosing the random effects over one of the non-informative variables. In the high-dimensional setups, non-informative effects are selected in very few of the runs. The longitudinal subpredictor does even better than the shared subpredictor in this case. There are slight differences, which can be attributed to the smaller number of iterations it required (see the paragraph below). Overall, the selection property works very well for both parts of the model, especially in the high-dimensional scenarios.
(iii) Stopping iterations
The two (possibly different) stopping iterations $m_{\mathrm{stop},l}$ and $m_{\mathrm{stop},ls}$ were selected by evaluating the fitted models on adaptively adjusted grids using a separate evaluation data set. In all setups, the grid initially ran through an equally spaced sequence in both directions, for $m_{\mathrm{stop},l}$ and $m_{\mathrm{stop},ls}$. When analysing the results, we noticed that the steps between the values on the grid were too wide for setups S2 and S3. The grid was consequently adapted with different, setup-specific sequences for $m_{\mathrm{stop},l}$ and $m_{\mathrm{stop},ls}$. The optimal stopping iterations were chosen such that the joint log-likelihood was maximal for the patients left out of the fitting process (predictive risk). For an overview of the resulting stopping iterations see Table 2.
Table 2: Selected stopping iterations $m_{\mathrm{stop},l}$ and $m_{\mathrm{stop},ls}$ for the setups S1, S2 and S3. (The numeric entries of this table could not be recovered from the extracted source.)
4 Cystic Fibrosis data
Cystic fibrosis is the most common life-limiting inherited disease among Caucasian populations, with most patients dying prematurely from respiratory failure. Children with cystic fibrosis in the UK are usually diagnosed in the first year of life, and subsequently require intensive support from family and healthcare services (Taylor-Robinson et al., 2013). Though cystic fibrosis affects other organs such as the pancreas and liver, it is the lung function that is of most concern, with gradual decline in function for most patients leading to the necessity of lung transplantation for many. Lung function decline in cystic fibrosis is accelerated if patients become infected with a range of infections (Qvist et al., 2015). However, the direction of causation is unclear; it may also be the case that accelerated lung function decline predisposes patients to risk of lung infection. There is thus utility in analysing lung function decline and time to lung infection in a joint model in order to gain more insight into the structure of the process.
We analyzed part of a dataset from the Danish cystic fibrosis registry. To achieve comparability between patients, we chose to use one observation per patient per year. Lung function, measured as forced expiratory volume in one second (FEV1), is the longitudinal outcome of our joint model. The event in the survival part of the model is the onset of the pulmonary infection Pseudomonas aeruginosa (PA). After selecting the patients with at least two observations before the infection, the data set contained repeated observations on patients, a subset of whom became infected with PA in the course of the study. Since patients are included in the study at the age of five, the mean age at infection equals the mean waiting time until infection plus five years. The covariates for the longitudinal predictor were height and weight of the patient as well as two binary indicators: one states whether the patient had one of three different additional lung infections, and the second indicates whether the patient had diabetes. The covariates possibly having an impact on the shared part of the model were time (i.e. age of the individuals), pancreatic insufficiency, sex, and age at which cystic fibrosis was diagnosed. Covariates were standardized to have mean zero and standard deviation one, except for age, which was normed to an interval from zero to one.
We then ran our boosting algorithm on the data set in order to simultaneously estimate and select the most influential variables for the two subpredictors while optimizing the association parameter $\alpha$. As recommended, we used a fixed step-length of $\nu = 0.1$ for both predictors and optimized $m_{\mathrm{stop}}$ instead. The stopping iterations ($m_{\mathrm{stop},l}$ and $m_{\mathrm{stop},ls}$) were chosen based on ten-fold cross-validation via a two-dimensional grid with equidistant steps. The resulting coefficient paths for both subpredictors are displayed in Figure 3.
The variables selected as informative for the joint model were height, weight, (additional) infection, age, pancreatic insufficiency and diabetes. Age at diagnosis and sex were not found to have an impact and were not selected. Diabetes turns out to have a negative impact on lung function; the same holds for having an additional lung infection or pancreatic insufficiency. The negative impact of age on lung function was to be expected, since lung function typically declines as patients get older. The negative effect of height together with the positive effect of weight indicates that, according to our joint model, being underweight has a negative influence on lung function.
The estimated association parameter $\alpha$ is negative. The risk of becoming infected is hence implicitly, i.e. via the shared predictor, negatively associated with the value of the lung function. Lower lung function thus leads to a higher risk of being infected with Pseudomonas, emphasizing the need to model both outcomes simultaneously in the framework of joint models.
5 Discussion and Outlook
Joint modelling is a powerful tool for analysing longitudinal and survival data, as documented by Rizopoulos (2012). It is increasingly applied in biomedical research, yet it has been infeasible for high-dimensional data and there had been no suggestions for automated variable selection until now. Statistical boosting, on the other hand, is a modern machine-learning method for variable selection and shrinkage in statistical models for potentially high-dimensional data. We present a way to merge these two state-of-the-art statistical fields. Our suggested method leads to an automated selection of informative variables, can deal with high-dimensional data and, probably even more importantly, opens up a whole new range of possibilities in the field of joint modelling.
Boosting algorithms are very flexible when it comes to incorporating different types of effects in the final models via the underlying base-learners. Due to the modular nature of boosting (Bühlmann et al., 2014), almost any base-learner can be included in any algorithm, although the regression settings defined by the underlying loss functions might be extremely different. In our context, the proposed boosting algorithm for joint models could facilitate the construction of joint models including not only linear effects and splines but, for example, also spatial and regional effects.
A second area in which the suggested model calls for extension is variable selection itself. While informative variables are selected reliably in most cases, non-informative variables are also included in too many iterations. As this is a structural problem of boosting algorithms in low-dimensional settings (Bühlmann and Yu, 2007), the solution could be stability selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013), which has proved helpful in other cases (Hofner et al., 2015; Mayr et al., 2016). The difference in the proportion of falsely selected variables between the longitudinal and the shared subpredictor, however, could be a problem inherent to joint modelling and should be the subject of future analysis. Further research is also warranted on theoretical insights, as it remains unclear whether the existing findings on the consistency of boosting algorithms (Zhang and Yu, 2005; Bühlmann, 2006) also hold for the adapted version for boosting joint models.
We plan to extend the model by incorporating a third subpredictor $\eta_s$, i.e. including covariates which only have an impact on the survival time and are independent of the longitudinal structure. Once this is incorporated, we can make even better use of the features of boosting and implement variable selection and allocation between the predictors. The latter is especially useful if no prior knowledge on the structure of association exists, because covariates can be suggested for all three subpredictors and the algorithm automatically assigns the informative ones to the appropriate subpredictor.
In conclusion, we propose a new statistical inference scheme for joint models that also provides a starting point for a much wider framework of boosting joint models, covering a broad range of potential models and types of predictor effects.
Acknowledgements
The work on this article was supported by the German Research Foundation (DFG), grant SCHM 2966/12, grant KN 922/42 and the Interdisciplinary Center for Clinical Research (IZKF) of the Friedrich-Alexander-University Erlangen-Nürnberg (Projects J49 and J61). We furthermore want to thank the Editor and the reviewers, especially the one who saved us from falling into the trap of oversimplifying the gradient in our first draft of the algorithm.
Appendix A Gradient of the shared predictor
In the following we will sketch out the calculus for the gradient of the likelihood of the shared subpredictor; the indices for individuals/observations will be omitted. Note that the derivative of the first part in equation (2.3) with respect to the shared subpredictor is the same as for the purely longitudinal predictor. The additional part in the likelihood of the shared subpredictor is simply that of a pure survival model.
The formulation of the log-likelihood is thus:
This again splits into two parts. The derivative of the first part is straightforward and reduces to a simple expression. The part of the likelihood that is less straightforward to differentiate is the one including the integral:
Note that the shared subpredictor is a function of time, so that standard differentiation cannot be used. To avoid confusion between the upper integral bound and the integration variable above, we suppress the time argument in the following. Applying the rules for functional derivatives, we obtain
In the case of interest in this paper, where only the second part of the predictor is time-dependent, the functional derivative has the following form:
The whole derivative of the likelihood for the shared subpredictor hence is
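The display equations dropped from this appendix can be sketched in generic joint-model notation; the following is a reconstruction under stated assumptions, not necessarily the authors' exact notation, with event indicator $\delta$, observed survival time $T$, baseline hazard $\lambda_0$, association parameter $\alpha$ and shared subpredictor $\eta_s$:

```latex
% survival part of the log-likelihood for the shared subpredictor
\ell_{\mathrm{surv}}(\eta_s)
  = \delta \bigl( \log \lambda_0(T) + \alpha\, \eta_s(T) \bigr)
  - \int_0^{T} \lambda_0(t) \exp\bigl( \alpha\, \eta_s(t) \bigr)\, \mathrm{d}t .

% gradient in the simplified case of a time-constant shared subpredictor
\frac{\partial \ell_{\mathrm{surv}}}{\partial \eta_s}
  = \delta\, \alpha
  - \alpha \exp( \alpha\, \eta_s ) \int_0^{T} \lambda_0(t)\, \mathrm{d}t .
```

The first term stems from the point evaluation at the event time; the second from differentiating the integrated hazard, which is the step requiring functional derivatives when $\eta_s$ varies with time.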
Appendix B Simulation

1. Choose a number of patients and a number of observations per patient.
2. Simulate the time points:
   - draw days in {1, …, 365},
   - generate the sequence of time points per patient (one per year) by taking the simulated values as days of the year,
   - to make calculations easier, divide everything by 365, so that time is measured in years,
   - example: a day d drawn for year k corresponds to the time point (k − 1) + d/365, meaning the patient was seen in year k at day d.
3. Generate the longitudinal predictor:
   - generate the fixed effects by choosing appropriate values for the covariate matrix (which can also include the time vector) and for the parameter vector,
   - calculate the longitudinal subpredictor from these components.
4. Generate the shared subpredictor based on the time points (just as in a usual longitudinal setup):
   - generate a random intercept and a random slope per patient by drawing from a standard Gaussian distribution,
   - generate the fixed effects by choosing appropriate values for the covariate matrix (note that the values are not allowed to be time-varying) and for the parameter vector,
   - calculate the shared subpredictor from these components,
   - draw the measurement errors from a Gaussian distribution and add them to the sum of the subpredictors to obtain the longitudinal outcomes.
5. Simulate the event times (based on the previously simulated time points and random effects):
   - choose a baseline hazard (for simplicity chosen to be constant) and an association parameter,
   - calculate the probability of an event having happened up to each time point,
   - draw uniformly distributed random variables,
   - if a drawn value falls below the event probability at a time point, consider an event to have happened before that time point,
   - define the time of the event by interpolating between the two adjacent time points, proportionally to the position of the drawn value between the two event probabilities,
   - for every individual, only the first event is considered,
   - define the censoring vector with an entry of 1 if an event happened for the individual and 0 otherwise (the chosen parameters determine the average censoring rate),
   - for every individual, delete all observations recorded after the event time from the list of observations.
The probabilities for the events are taken from the connection between the hazard rate and the survival function.
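The steps above can be sketched in Python (a reconstruction under assumptions, not the authors' code: the constant baseline hazard admits a closed-form cumulative hazard, and a single uniform draw per patient replaces the per-visit draws, which is equivalent for locating the first event):

```python
import numpy as np

def cum_hazard(t, lam, alpha, g0, g1):
    """Closed-form cumulative hazard for a constant baseline hazard `lam` and
    a linear shared subpredictor g0 + g1 * t entering the hazard via `alpha`."""
    a = alpha * g1
    if abs(a) < 1e-12:
        return lam * np.exp(alpha * g0) * t
    return lam * np.exp(alpha * g0) * (np.exp(a * t) - 1.0) / a

def simulate_events(n=100, n_obs=5, lam=0.3, alpha=1.0, seed=7):
    """Simulate visit times and event/censoring information per patient,
    mimicking the scheme of Appendix B: one visit per year on a random day,
    event probabilities at each visit, and linear interpolation of the
    event time within the interval in which the event occurs."""
    rng = np.random.default_rng(seed)
    records = []
    for i in range(n):
        days = rng.integers(1, 366, size=n_obs)   # day of the year per visit
        t = np.arange(n_obs) + days / 365.0       # visit times in years
        g0, g1 = rng.normal(size=2)               # random intercept and slope
        F = 1.0 - np.exp(-cum_hazard(t, lam, alpha, g0, g1))  # P(event <= t_j)
        u = rng.uniform()
        hit = np.flatnonzero(u < F)
        if hit.size == 0:
            records.append((i, t[-1], 0))         # censored at the last visit
        else:
            j = hit[0]
            t_lo = 0.0 if j == 0 else t[j - 1]
            F_lo = 0.0 if j == 0 else F[j - 1]
            # event time proportional to the position of u between F_lo and F[j]
            T = t_lo + (t[j] - t_lo) * (u - F_lo) / (F[j] - F_lo)
            records.append((i, T, 1))
    return records
```

Observations after the simulated event time would then be deleted, as in the last step of the appendix; the censoring rate can be tuned through `lam` and `alpha`.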
References
 Armero et al. (2016) Armero, C., A. Forte, H. Perpinan, M. J. Sanahuja, and S. Agusti (2016). Bayesian joint modeling for assessing the progression of chronic kidney disease in children. Statistical Methods in Medical Research online first, 1–17.
 Barrett et al. (2015) Barrett, J., P. Diggle, R. Henderson, and D. Taylor-Robinson (2015). Joint modelling of repeated measurements and time-to-event outcomes: flexible model specification and exact likelihood inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(1), 131–148.
 Bühlmann (2006) Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34(2), 559–583.
 Bühlmann et al. (2014) Bühlmann, P., J. Gertheiss, S. Hieke, T. Kneib, S. Ma, M. Schumacher, G. Tutz, C. Wang, Z. Wang, and A. Ziegler (2014). Discussion of ’The evolution of boosting algorithms’ and ’Extending statistical boosting’. Methods of Information in Medicine 53(6), 436–445.
 Bühlmann and Hothorn (2007) Bühlmann, P. and T. Hothorn (2007). Boosting algorithms: Regularization, prediction and model fitting (with discussion). Statistical Science 22, 477–522.
 Bühlmann and Yu (2007) Bühlmann, P. and B. Yu (2007). Sparse boosting. Journal of Machine Learning Research 7, 1001–1024.
 Chan (2016) Chan, J. S. K. (2016). Bayesian informative dropout model for longitudinal binary data with random effects using conditional and joint modeling approaches. Biometrical Journal 58(3), 549–569.
 Efendi et al. (2013) Efendi, A., G. Molenberghs, E. Njagi, and P. Dendale (2013). A joint model for longitudinal continuous and timetoevent outcomes with direct marginal interpretation. Biometrical Journal 55(4), 572–588.
 Fieuws and Verbeke (2004) Fieuws, S. and G. Verbeke (2004). Joint modelling of multivariate longitudinal profiles: pitfalls of the random-effects approach. Statistics in Medicine 23, 3093–3104.
 Freund and Schapire (1996) Freund, Y. and R. Schapire (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning Theory, San Francisco, CA, pp. 148–156. San Francisco: Morgan Kaufmann Publishers Inc.
 Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29, 1189–1232.

 Friedman et al. (2000) Friedman, J. H., T. Hastie, and R. Tibshirani (2000). Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics 28, 337–407.
 Gao (2004) Gao, S. (2004). A shared random effect parameter approach for longitudinal dementia data with nonignorable missing data. Statistics in Medicine 23, 211–219.
 Guo and Carlin (2004) Guo, X. and B. P. Carlin (2004). Separate and joint modeling of longitudinal and event time data using standard computer packages. The American Statistician 58, 16–24.
 Ha et al. (2003) Ha, I. D., T. Park, and Y. Lee (2003). Joint modelling of repeated measures and survival time data. Biometrical Journal 45(6), 647–658.
 Hepp et al. (2016) Hepp, T., M. Schmid, O. Gefeller, E. Waldmann, and A. Mayr (2016). Approaches to regularized regression–a comparison between gradient boosting and the lasso. Methods of Information in Medicine 55(5), 422–430.
 Hofner et al. (2015) Hofner, B., L. Boccuto, and B. Göker (2015). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics 16(144).
 Hofner et al. (2014) Hofner, B., A. Mayr, N. Robinzonov, and M. Schmid (2014). Model-based boosting in R: a hands-on tutorial using the R package mboost. Computational Statistics 29(1–2), 3–35.
 Hofner et al. (2016) Hofner, B., A. Mayr, and M. Schmid (2016). gamboostLSS: An R package for model building and variable selection in the GAMLSS framework. Journal of Statistical Software 74(1), 1–31.
 Li et al. (2009) Li, N., R. M. Elashoff, and G. Li (2009). Robust joint modeling of longitudinal measurements and competing risks failure time data. Biometrical Journal 51(1), 19–30.
 Liu et al. (2007) Liu, L., R. A. Wolfe, and J. D. Kalbfleisch (2007). A shared random effects model for censored medical costs and mortality. Statistics in Medicine 26, 139–155.
 Mayr et al. (2014a) Mayr, A., H. Binder, O. Gefeller, and M. Schmid (2014a). The evolution of boosting algorithms – from machine learning to statistical modelling. Methods of Information in Medicine 53(6), 419–427.
 Mayr et al. (2014b) Mayr, A., H. Binder, O. Gefeller, and M. Schmid (2014b). Extending statistical boosting – an overview of recent methodological developments. Methods of Information in Medicine 53(6), 428–435.
 Mayr et al. (2012) Mayr, A., N. Fenske, B. Hofner, T. Kneib, and M. Schmid (2012). Generalized additive models for location, scale and shape for high-dimensional data – a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics) 61(3), 403–427.
 Mayr et al. (2012) Mayr, A., B. Hofner, and M. Schmid (2012). The importance of knowing when to stop. Methods of Information in Medicine 51(2), 178–186.
 Mayr et al. (2016) Mayr, A., B. Hofner, and M. Schmid (2016). Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC Bioinformatics 17(288). doi: 10.1186/s12859-016-1149-8.
 Meinshausen and Bühlmann (2010) Meinshausen, N. and P. Bühlmann (2010). Stability selection (with discussion). Journal of the Royal Statistical Society. Series B 72, 417–473.
 Qvist et al. (2015) Qvist, T., D. Taylor-Robinson, E. Waldmann, H. Olesen, C. Rønne Hansen, M. I. Mathiesen, N. Høiby, T. L. Katzenstein, R. L. Smyth, D. P., and P. T (2015). Infection and lung function decline; comparing the harmful effects of cystic fibrosis pathogens. in press.
 Rigby and Stasinopoulos (2005) Rigby, R. A. and D. M. Stasinopoulos (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(3), 507–554.
 Rizopoulos (2010) Rizopoulos, D. (2010). JM: An R package for the joint modelling of longitudinal and time-to-event data. Journal of Statistical Software 35(9), 1–33.
 Rizopoulos (2012) Rizopoulos, D. (2012). Joint Models for Longitudinal and Time-to-Event Data: With Applications in R. Chapman & Hall/CRC Biostatistics Series. Taylor & Francis.
 Schmid and Hothorn (2008) Schmid, M. and T. Hothorn (2008). Boosting additive models using componentwise P-splines. Computational Statistics & Data Analysis 53, 298–311.
 Schmid et al. (2010) Schmid, M., S. Potapov, A. Pfahlberg, and T. Hothorn (2010). Estimation and regularization techniques for regression models with multidimensional prediction functions. Statistics and Computing 20, 139–150.
 Shah and Samworth (2013) Shah, R. D. and R. J. Samworth (2013). Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(1), 55–80.
 Sweeting and Thompson (2011) Sweeting, M. J. and S. G. Thompson (2011). Joint modelling of longitudinal and time-to-event data with application to predicting abdominal aortic aneurysm growth and rupture. Biometrical Journal 53(5), 750–763.

 Taylor-Robinson et al. (2013) Taylor-Robinson, D., R. L. Smyth, P. Diggle, and M. Whitehead (2013). The effect of social deprivation on clinical outcomes and the use of treatments in the UK cystic fibrosis population: A longitudinal study. Lancet Respiratory Medicine 1, 121–128.
 Taylor-Robinson et al. (2012) Taylor-Robinson, D., M. Whitehead, F. Diderichsen, H. V. Olesen, T. Pressler, R. L. Smyth, and P. Diggle (2012). Understanding the natural progression in %FEV1 decline in patients with cystic fibrosis: a longitudinal study. Thorax 67, 860–866.
 Tsiatis and Davidian (2004) Tsiatis, A. and M. Davidian (2004). Joint modelling of longitudinal and time-to-event data: an overview. Statistica Sinica 14, 809–834.
 Tutz and Binder (2006) Tutz, G. and H. Binder (2006). Generalized additive modeling with implicit variable selection by likelihoodbased boosting. Biometrics 62, 961–971.
 Wulfsohn and Tsiatis (1997) Wulfsohn, M. and A. Tsiatis (1997). A joint model for survival and longitudinal data measured with error. Biometrics 53, 330–339.
 Zhang and Yu (2005) Zhang, T. and B. Yu (2005). Boosting with early stopping: convergence and consistency. The Annals of Statistics 33(4), 1538–1579.