1 Introduction
Clinical studies often collect three types of data on each patient: the time to the event of interest (possibly censored), the longitudinal measurements on a continuous response (for example, some sort of biomarker viewed as clinically important), and an additional set of covariates (possibly time-varying) about the patient. The clinical studies then focus on analyzing the relationship between the time-to-event and the longitudinal responses, using the additional covariates. A common approach is to jointly model the time-to-event by a survival model, while modeling the longitudinal responses using a linear mixed-effects model, with both the survival and the linear mixed-effects models potentially making use of the additional covariates.
The first and most common approach for the joint modeling problem is the shared random effects model (SREM) (Wulfsohn and Tsiatis, 1997; Henderson et al., 2000; Tsiatis and Davidian, 2004; Rizopoulos, 2010). The name “shared random effects” comes from the modeling assumption that a set of random effects accounts for the association between longitudinal outcomes and time-to-event. In particular, the longitudinal outcomes are modeled by linear mixed-effects models (Laird and Ware, 1982), with random effects for each patient. Meanwhile, these random effects affect the hazards of the event through some parametric survival model, for example a proportional hazards model. SREM is limited in the use of covariates in the survival model, as it only allows time-invariant baseline covariates, such as a treatment indicator. To fit a SREM, the parameters of the survival model and the linear mixed-effects model are estimated together via maximum likelihood. SREM proposes a straightforward way of jointly modeling the two quantities, but it comes with high computational cost.
A different line of research focuses on the joint latent class model (JLCM) (Lin et al., 2002; Proust-Lima and Taylor, 2009; Proust-Lima et al., 2009, 2017). JLCM assumes that the population of patients consists of multiple latent classes. A patient’s time-to-event and longitudinal responses are independent conditioning on his or her latent class membership. In addition, JLCM assumes the latent classes are homogeneous, and thus patients within a latent class follow the same survival and linear mixed-effects model. Finally, the latent class membership is modeled by a multinomial logistic regression model.
What makes JLCM interesting and different from SREM is the idea of latent class membership, which (among other things) can be used to describe disease progression in clinical studies (Lin et al., 2002; Garre et al., 2008; Proust-Lima et al., 2009). It is well-known that many diseases have different stages; examples include dementia, AIDS, cancer, and chronic obstructive pulmonary disease (COPD) (Dicker et al., 2006). From a clinical point of view, it is important to identify those stages, since treatment could change with membership in different stages (Hajiro et al., 2000). Currently the clinical definitions of the stages of a disease consist of using diagnostic findings (such as biomarkers) to produce clusters of patients. However, it is possible that by jointly studying biomarker trajectories and survival experiences, one can find data-dependent latent classes that uncover new, meaningful stages.
JLCM, like most frequentist parametric approaches, estimates the parameters for all modeling components by maximizing the log-likelihood function, a computationally demanding process. Although the computational cost of JLCM is less than that of SREM, it is still prohibitively slow for large-scale data. In addition, JLCM cannot use time-varying covariates in its latent class membership and survival models, which can greatly reduce its prediction performance.
A nonparametric approach that addresses some of these problems would be desirable, and tree-based approaches (Breiman et al., 1984) are natural candidates. It is very efficient to fit a tree, and the terminal nodes of a tree naturally represent a partition of the population. A tree-based approach for joint latent class modeling also addresses the time-invariant limitation of JLCM, since time-varying covariates can be used as the splitting variables to construct the tree. Furthermore, once a tree is constructed, it is up to the user to decide which type of survival model and which covariates to use within each terminal node, providing additional flexibility to the analyst.
In this work, we propose the joint latent class tree (JLCT) method. JLCT, like JLCM, is based on the key assumption that conditioning on latent class membership, time-to-event and longitudinal responses are independent. JLCT therefore looks for a tree-based partitioning such that within each estimated latent class defined by a terminal node, the time-to-event and longitudinal responses display a lack of association. Once the tree is constructed, we assign each observation to a latent class (i.e., terminal node), and independently fit survival and linear mixed-effects models using the class membership information.
The rest of the paper is organized as follows. In Section 2 we introduce the setup of the joint modeling problem, and present our joint latent class tree (JLCT) method. In Sections 3 and 4 we use simulations to compare JLCT with JLCM in terms of prediction performance and running time. Finally, in Section 5 we apply JLCT to a real dataset, and demonstrate that JLCT admits competitive (or superior) prediction performance, while being orders of magnitude faster than the JLCM approach.
2 Constructing a tree to uncover conditional independence
2.1 Joint modeling setup
Assume there are $n$ subjects in the sample. For each subject $i$, we observe $n_i$ repeated measurements of a longitudinal outcome $y_{ij}$ at times $t_{i1} < \cdots < t_{in_i}$. We denote the vector of longitudinal outcomes by $y_i = (y_{i1}, \ldots, y_{in_i})^T$. In addition, for each subject $i$ we observe a vector of covariates $x_i(t_{ij})$ at each measurement time $t_{ij}$, $j = 1, \ldots, n_i$. These covariates can be either time-invariant or time-varying. Each subject is also associated with a time-to-event tuple $(T_i, \delta_i)$, where $T_i$ is the time of the event, and $\delta_i$ is the censoring indicator, with $\delta_i = 0$ if subject $i$ is censored at $T_i$, and $\delta_i = 1$ otherwise. We assume there exist $G$ latent classes, and let $g_i(t) \in \{1, \ldots, G\}$ denote the latent class membership of subject $i$ at time $t$. We assume the latent class membership is determined by a subset of covariates, denoted by $x^c$.
The joint latent class modeling problem makes the key assumption that a patient’s time-to-event and longitudinal outcomes are independent conditioning on his or her latent class membership $g_i(t)$. Without controlling for the latent class membership, time-to-event and longitudinal outcomes may appear to be correlated because each is related to the latent class, but given $g_i(t)$ the two are independent of each other, and therefore the longitudinal outcomes have no prognostic value for time-to-event given the latent class. The modeling of $(T_i, \delta_i)$ and $y_i$ is therefore separated conditioning on $g_i(t)$.
We assume the longitudinal outcomes come from a linear mixed-effects model:
$$ y_i(t) = x_i^{f}(t)^T \beta + x_i^{r}(t)^T b_{g_i(t)} + \alpha_i + \varepsilon_i(t). \qquad (1) $$
Here we assume the longitudinal outcomes depend on two subsets of $x_i(t)$: the subset $x_i^{f}(t)$ of covariates associated with a fixed effect vector $\beta$, and the subset $x_i^{r}(t)$ of covariates associated with a latent class-specific random effect vector $b_g$. We assume the random effect vectors $b_g$ are independent across latent classes $g = 1, \ldots, G$. In addition, there is a subject-specific random intercept $\alpha_i$ that is independent across subjects and independent of the random effects $b_g$. Finally, the errors $\varepsilon_i(t)$ are assumed to be independent and normally distributed with mean $0$ and variance $\sigma^2$, and independent of all the random effects as well.

Conditioning on latent class membership, we assume the time-to-event tuple $(T_i, \delta_i)$ depends on a subset $x_i^{s}(t)$ of covariates observed at time $t$, through the extended Cox model for time-varying covariates (Cox, 1972):
$$ \lambda_i(t \mid g_i(t) = g) = \lambda_{0g}(t) \exp\left( x_i^{s}(t)^T \gamma_g \right), \qquad (2) $$
where $\gamma_g$ is the vector of class-specific slope coefficients. Equation (2) shows a general form of the survival model given latent class $g$. In practice one can restrict either the baseline hazards $\lambda_{0g}$ or the slopes $\gamma_g$, or both, to be identical across latent classes.
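To illustrate the data-generating structure implied by model (1), the following sketch simulates longitudinal outcomes from a toy class-specific linear mixed-effects model. All numerical settings (two latent classes, the fixed-effect slope `beta`, the class-specific slopes `b`, and the variance parameters) are illustrative assumptions, not values used in this paper.

```python
import random

def simulate_longitudinal(n_subjects=5, n_times=4, sigma=0.5, seed=0):
    """Simulate y = beta*x + b_g*z + alpha_i + eps, a toy instance of the
    class-specific linear mixed-effects model (all settings assumed)."""
    rng = random.Random(seed)
    beta = 1.0                       # fixed-effect slope (assumed)
    b = {0: -0.8, 1: 0.8}            # class-specific random slopes (assumed)
    data = []
    for i in range(n_subjects):
        g = rng.randrange(2)                 # latent class of subject i
        alpha = rng.gauss(0.0, 0.3)          # subject-specific random intercept
        for j in range(n_times):
            t = float(j)
            x = rng.gauss(0.0, 1.0)          # fixed-effect covariate
            z = t                            # random-effect covariate (time)
            y = beta * x + b[g] * z + alpha + rng.gauss(0.0, sigma)
            data.append({"id": i, "time": t, "class": g, "y": y})
    return data

rows = simulate_longitudinal()
```

A class-specific survival model in the spirit of (2) would then generate event times whose hazards depend on the same latent class labels, so that the two outcomes are associated marginally but independent within a class.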
Tree construction requires fitting the extended Cox model using the longitudinal outcome variable as a predictor, and perhaps other time-varying covariates. In order to implement that, we convert the original data into left-truncated right-censored (LTRC) data (Andersen and Gill, 1982; Fu and Simonoff, 2017). For each subject $i$ and measurement time $t_{ij}$, there is a “pseudo-observation” with covariates $x_i(t_{ij})$, and a time-to-event triplet $(L_{ij}, R_{ij}, \delta_{ij})$, where $L_{ij}$ is the left truncation time, $R_{ij}$ is the right-censored time, and $\delta_{ij}$ is the censoring indicator.
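The LTRC conversion can be sketched as follows. Each measurement time opens an interval that is left-truncated at that time and right-censored at the next measurement (or at the observed event/censoring time); the convention that the event indicator attaches only to the interval containing the observed end time is our assumption about the layout.

```python
def to_ltrc_pseudo_observations(times, event_time, event):
    """Convert one subject's measurement times and (event_time, event)
    into left-truncated right-censored pseudo-observations: one triplet
    (L, R, delta) per interval [t_j, t_{j+1}), with the event indicator
    attached only to the interval that ends at the observed end time."""
    pseudo = []
    boundaries = list(times) + [event_time]
    for j in range(len(times)):
        start = boundaries[j]
        end = min(boundaries[j + 1], event_time)
        if end <= start:
            continue                     # interval begins after the event: drop
        delta = 1 if (event == 1 and end == event_time) else 0
        pseudo.append((start, end, delta))
    return pseudo
```

For example, a subject measured at times 0, 1, 2 who has the event at time 5 yields three pseudo-observations, with the event flag on the last interval only.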
We have introduced four subsets of covariates so far: $x^c$ for the latent classes, $x^f$ for the fixed effects, $x^r$ for the random effects, and $x^s$ for the time-to-event. Each of the four subsets can contain time-varying covariates, and the four subsets can be identical, share some common covariates, or share no covariates at all.
2.2 Joint Latent Class Tree (JLCT) methodology
In Appendix A, we give a brief introduction to JLCM, and discuss at length its strengths and weaknesses. In this section we formally introduce JLCT, the joint latent class tree approach. Like JLCM, JLCT assumes conditional independence of the longitudinal outcomes and the time-to-event within each latent class. Under this assumption, JLCT looks for a tree-based partitioning such that within each estimated class defined by a terminal node, the time-to-event and longitudinal outcomes display a lack of association.
Tree-based methods are powerful modeling tools in statistics and data mining (Breiman et al., 1984; Hastie et al., 2001), especially because they are fast to construct, able to uncover nonlinear relationships between covariates, and intuitive and easy to explain. We consider binary trees, where each node is recursively split into two child nodes based on a splitting criterion. Often the splitting criterion ensures that the two child nodes are more “homogeneous” than their parent node. The measure of “homogeneity” varies by method: in a classification tree, it could be the misclassification error, the Gini index, or cross-entropy, while in a regression tree it could be the residual sum of squares. The tree stops splitting when a node is “pure,” or when some stopping criteria are met.
The measure of “homogeneity” in JLCT is quite different from those commonly used in regression and classification trees. Our measure is based on the conditional independence between the time-to-event and the longitudinal outcomes: the more apparently independent the two variables are conditioning on the node, the more “homogeneous” the node is. To be more concrete, the splitting criterion repeatedly uses the test statistic for the hypothesis test
$$ H_0: \eta = 0 \quad \text{versus} \quad H_1: \eta \neq 0 \qquad (3) $$
under the extended Cox model
$$ \lambda_i(t) = \lambda_0(t) \exp\left( \eta\, y_i(t) + x_i(t)^T \gamma \right). $$
In the model above, the coefficient $\eta$ is associated with the longitudinal outcomes $y_i(t)$, and the vector of coefficients $\gamma$ is associated with the set of covariates $x_i(t)$. Thus, $\eta = 0$ corresponds to the longitudinal outcome having no relationship with the time-to-event in the node, given the other covariates.
We have two obvious choices of test for the hypothesis (3): the log-likelihood ratio test (LRT) or the Wald test. In this work we use the log-likelihood ratio test in all experiments; simulations indicate that the Wald test gives similar results. We denote the test statistic of the hypothesis test by TS. The smaller the value of TS, the less related the longitudinal outcomes are to the time-to-event data given the covariates and the current node. JLCT seeks to partition observations using the covariates in $x^c$, such that TS is small within each leaf node, but stops partitioning when TS is less than a specified stopping parameter $c$. More formally, the tree splitting procedure works as follows:

1. At the current node, compute the test statistic TS.
2. If TS $< c$, stop splitting. Otherwise proceed.
3. For every split variable $x_j$ in $x^c$ and possible split point $s$:
   (a) Define the two child nodes $\{x_j \le s\}$ and $\{x_j > s\}$.
   (b) Ignore this split if either node violates the restrictions specified in the control parameters. (Details are given near the end of this section.) Otherwise proceed.
   (c) At each child node, compute the test statistics TS$_L$ and TS$_R$, respectively.
   (d) Compute the score of the split from TS$_L$ and TS$_R$.
4. Scan through all pairs $(x_j, s)$ to find the split with the best score, and split the current node on that variable at that split point.
JLCT recursively splits nodes according to the above procedure, until none of the terminal nodes can be further split.
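The recursive procedure above can be sketched as follows. Here `test_stat` is a placeholder for the log-likelihood ratio statistic of (3) fit on a node's pseudo-observations; taking the score of a split to be the larger of the two child statistics, and choosing the split with the smallest score, are illustrative assumptions rather than the paper's exact scoring rule.

```python
def grow_tree(obs, covariates, test_stat, c=3.84, min_size=10):
    """Recursively split observations (a list of dicts) so that the
    association test statistic is small within each node. `test_stat`
    is supplied by the caller (e.g. an LRT statistic for H0: eta = 0)."""
    ts = test_stat(obs)
    if ts < c or len(obs) < 2 * min_size:
        return {"leaf": True, "n": len(obs), "ts": ts}
    best = None
    for var in covariates:
        values = sorted({o[var] for o in obs})
        for split in values[:-1]:
            left = [o for o in obs if o[var] <= split]
            right = [o for o in obs if o[var] > split]
            if len(left) < min_size or len(right) < min_size:
                continue                       # violates control parameters
            score = max(test_stat(left), test_stat(right))  # assumed score
            if best is None or score < best[0]:
                best = (score, var, split, left, right)
    if best is None:
        return {"leaf": True, "n": len(obs), "ts": ts}
    _, var, split, left, right = best
    return {"leaf": False, "var": var, "split": split,
            "left": grow_tree(left, covariates, test_stat, c, min_size),
            "right": grow_tree(right, covariates, test_stat, c, min_size)}
```

Plugging in a real extended-Cox LRT for `test_stat` (e.g. via the R `survival` package) recovers the intended procedure; the skeleton only shows the control flow.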
The stopping criterion, TS $< c$, is based on the following property of the test statistic: under the null model in (3), the distribution of TS is approximately a $\chi^2_1$ distribution. Therefore, we do not reject the null hypothesis $H_0$ when TS is smaller than a threshold associated with a desired significance level. We set $c = 3.84$ by default, which is the 5% tail of the $\chi^2_1$ distribution.

Standard control parameters, such as the minimal number of observations in any terminal node, the minimal improvement in “purity” from a split, etc., also apply to the JLCT method. In addition, we introduce two control parameters that are specific to JLCT:

- Minimum number of events in any terminal node. This parameter ensures that the survival model in each terminal node is fit with meaningful survival data with enough occurrences of events (that is, uncensored observations). By default, this parameter is set to the number of covariates used for the Cox PH model.
- Upper bound on the variance of the estimated coefficients in all survival models at tree nodes. This parameter ensures that the fitted coefficients of the survival models are stable, and is set to a fixed default value.
These two new control parameters enforce reliability of the fitted survival models at all nodes, which further produces reliable test statistics that can accurately reflect the relationships between the timetoevent and the longitudinal outcomes.
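The default threshold mentioned above can be checked directly: for one degree of freedom, the chi-squared survival function has the closed form $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, so $c = 3.84$ leaves roughly 5% in the upper tail.

```python
import math

def chi2_1df_sf(x):
    """Survival function P(X > x) for a chi-squared variable with one
    degree of freedom: P(Z^2 > x) = 2 * P(Z > sqrt(x)) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

# The default stopping threshold c = 3.84 leaves about 5% in the tail.
tail = chi2_1df_sf(3.84)
```

For a different significance level, one would invert this function (or use a chi-squared quantile routine) to obtain the corresponding threshold.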
3 Simulation results: time-invariant covariates
In this section, we use simulations to study the behavior of JLCT, and compare JLCT with JLCM. The data are generated such that the time-to-event is correlated with the longitudinal outcomes, but the two are independent conditioning on the latent classes. Thus, the key assumption of conditional independence holds for the simulated data.
We begin with a relatively simple situation where all of the covariates are time-invariant, and thus the time-to-event and latent class memberships are determined by time-invariant covariates. This scenario agrees with the setup of JLCM. In the next section (Section 4) we consider the general scenario where all covariates are time-varying.
The simulation setup varies in the structure of the latent classes and in the distribution of the time-to-event. We consider four structures of latent classes (as functions of two covariates): tree partition, linear separation, nonlinear separation, and asymmetric tree partition, which are shown in Figure 1. Class membership for each subject is determined based on an underlying probability vector, with the concentration level $\rho$ defined as the probability of falling into the most probable class; see Appendix B.2 for details. We consider three distributions for the baseline hazards of the time-to-event data: exponential, Weibull with hazards decreasing with time (WeibullD), and Weibull with hazards increasing with time (WeibullI). We give a full description of the simulation setup with time-invariant covariates in Appendix B.3.

3.1 Methods and measures
We compare four methods on the simulated datasets.

1. A baseline model that treats the entire dataset as one latent class. This is equivalent to JLCT with stopping threshold $c = \infty$. After the tree is constructed, we fit a Cox PH model.
2. A JLCT with the default stopping threshold. The tree is pruned to have no more than six terminal nodes, which coincides with the maximal number of latent classes we allow for JLCM. After the tree is constructed, we fit a Cox proportional hazards (PH) model with the same baseline hazard function but different Cox PH slopes across terminal nodes.
3. A JLCT that is the same as the second JLCT, except that both the baseline hazard functions and the Cox PH slopes differ across terminal nodes.
4. A JLCM, with class-specific slopes and class-specific baseline hazard functions in the survival model. The optimal number of latent classes is chosen using BIC.
All JLCT methods fit the linear mixed-effects model (1) to the longitudinal outcomes after the tree is constructed, taking the terminal nodes to be the latent classes. Given the fitted JLCT and JLCM, we predict longitudinal outcomes and time-to-event (in terms of survival curves) on both in-sample and out-of-sample data. The details of the prediction procedure are given in Appendix B.5.
We measure the performance of the four methods by prediction accuracy (both in-sample and out-of-sample), and by the difference between fitted and true parameters. To compute the out-of-sample measures, at each simulation run we generate a new random sample of subjects, using the same data generating process as for the in-sample data. We consider the following measures:

- The integrated squared error (ISE) measures the accuracy of the predicted survival curve. The ISE is defined as follows for a set of subjects $\mathcal{S}$:
$$ \mathrm{ISE} = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} \int \left( \hat{S}_i(t) - S_i(t) \right)^2 \, dt, $$
where $\hat{S}_i(t)$ and $S_i(t)$ are the predicted and the true survival probability for subject $i$ at time $t$, respectively. We compute ISE on in-sample subjects (ISE$_{\mathrm{in}}$) and on out-of-sample subjects (ISE$_{\mathrm{out}}$).
- MSE$_y$ measures the accuracy of the longitudinal predictions, and is defined for a set of subjects $\mathcal{S}$ with longitudinal outcomes:
$$ \mathrm{MSE}_y = \frac{1}{\sum_{i \in \mathcal{S}} n_i} \sum_{i \in \mathcal{S}} \sum_{j=1}^{n_i} \left( \hat{y}_{ij} - y_{ij} \right)^2, $$
where we denote by $\hat{y}_{ij}$ and $y_{ij}$ the predicted and the true $j$th longitudinal outcome of subject $i$, respectively. We compute MSE$_y$ on in-sample subjects (MSE$_{y,\mathrm{in}}$) and on out-of-sample subjects (MSE$_{y,\mathrm{out}}$).
- MSE$_\gamma$ measures the difference between the estimated and the true Cox PH slope coefficients, on a set of subjects $\mathcal{S}$ with observations:
$$ \mathrm{MSE}_\gamma = \frac{1}{\sum_{i \in \mathcal{S}} n_i} \sum_{i \in \mathcal{S}} \sum_{j=1}^{n_i} \left\| \hat{\gamma}_{\hat{g}_{ij}} - \gamma_{g_{ij}} \right\|^2, $$
where $\hat{\gamma}_g$ and $\gamma_g$ are the estimated and the true Cox PH slopes for latent class $g$, and where $\hat{g}_{ij}$ and $g_{ij}$ are the predicted and the true latent class membership for the $j$th observation of subject $i$, respectively.
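A minimal sketch of the ISE computation, assuming each survival curve is available as a function of time and approximating the integral by the trapezoidal rule on a user-supplied grid (the grid and the equal weighting of subjects are our assumptions):

```python
def integrated_squared_error(pred_curves, true_curves, grid):
    """Average integrated squared error between predicted and true
    survival curves; each curve is a callable S(t), and the integral is
    approximated by the trapezoidal rule on `grid`."""
    total = 0.0
    for s_hat, s_true in zip(pred_curves, true_curves):
        ise_i = 0.0
        for a, b in zip(grid[:-1], grid[1:]):
            fa = (s_hat(a) - s_true(a)) ** 2
            fb = (s_hat(b) - s_true(b)) ** 2
            ise_i += 0.5 * (fa + fb) * (b - a)
        total += ise_i
    return total / len(pred_curves)
```

The longitudinal and slope MSE measures are ordinary averages of squared differences and need no special treatment.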
It is worth emphasizing that JLCM uses extra information, namely the longitudinal outcomes and time-to-event, to predict latent class membership for in-sample subjects. The quality of the latent class membership prediction directly affects that of the time-to-event and longitudinal outcome predictions, and thus JLCM is advantaged relative to JLCT for in-sample performance. A comparison between the two is only fair on out-of-sample subjects, since the longitudinal outcomes and time-to-event are no longer available to JLCM at prediction time, and therefore the two methods use the same amount of information for prediction. In view of this, we focus on comparing the out-of-sample measures in this section, and present the in-sample prediction results in Appendix D.
3.2 Results
Figures 1(a) to 1(c) show the boxplots of ISE$_{\mathrm{out}}$, MSE$_{y,\mathrm{out}}$, and MSE$_\gamma$, respectively, on a log scale, for light censoring and the WeibullI distribution, over the 16 combinations of latent class structure (Tree, Linear, Nonlinear, Asymmetric) and concentration level ($\rho \in \{0.5, 0.7, 0.85, 1\}$). The experiments are repeated 100 times for JLCT, and 20 times for JLCM, under each setting. (Since JLCM is prohibitively time-consuming, we only perform 20 runs per setting.) We only consider large concentration levels, for which the most probable class of a particular subject dominates the remaining classes in the probability that the subject will actually be a member of that class. The results for other baseline hazard distributions and performance measures are given in Appendix D.
Figure 1(a) shows that the accuracies of the survival predictions of JLCT (Methods 2 and 3) and JLCM (Method 4) are comparable, with JLCT being slightly worse, under the settings where latent classes are generated from probabilistic models ($\rho < 1$). In particular, when the latent class structure comes from a linear separation (Linear), so that the simulated data match exactly the model of JLCM, JLCT still performs comparably to JLCM. The gap between JLCT and JLCM decreases as the concentration level $\rho$ increases. In the case where $\rho = 1$, and thus latent classes are generated by a deterministic partitioning, JLCT is more effective than JLCM in all settings except nonlinear partitioning. In particular, when the latent class model follows the deterministic partitioning by a tree ($\rho = 1$, “Tree”), and thus matches the model of JLCT, JLCT outperforms JLCM by a significant margin. Furthermore, JLCM tends to have higher variance in ISE values, due to occasionally unconverged JLCM runs.
A similar pattern appears for MSE$_{y,\mathrm{out}}$ in Figure 1(b), with JLCT and JLCM comparable for most settings, and JLCT significantly outperforming JLCM when the simulated data come from the underlying model of JLCT. Regarding the estimation accuracy of the Cox PH slopes (Figure 1(c)), the same pattern shows up as well, although there are more outliers for all methods.
Methods 2 and 3 give very similar results. Indeed, the constructed JLCT trees of the two methods coincide, and only the survival models that are fitted afterwards differ. Method 2 assumes a single baseline hazard function shared across terminal nodes, which agrees with the true data generating scheme. Nevertheless, Methods 2 and 3 have almost identical out-of-sample performances (ISE$_{\mathrm{out}}$ and MSE$_{y,\mathrm{out}}$), which suggests that introducing additional parameters for node-specific baseline hazard functions does not hurt the prediction performance on the simulated data. In practice, it is often unclear whether the latent classes share a single baseline hazard function. Our results suggest that in such a case, we can assume a separate baseline hazard function for each terminal node without worrying about overfitting the data. In the later simulations and the real data application, we only consider the Cox PH model with separate baseline hazard functions across terminal nodes.
Table 1 shows the average running time of JLCT (over 100 runs) and JLCM (over 20 runs) under the same settings as the plots. The running time of JLCT includes constructing the tree and fitting the two survival models of Methods 2 and 3. The running time of JLCM includes fitting models for all candidate numbers of latent classes. Clearly, JLCT is orders of magnitude faster than JLCM across all settings: JLCT completes one run within a minute, while JLCM typically takes 25 to 45 minutes. The running times for other censoring levels and baseline hazard distributions are similar. All of the simulations (including those in Section 4) are performed on high-performance computing nodes with 3.0GHz CPUs and 62 GB of memory.
The simulation results demonstrate that when all covariates are time-invariant, which satisfies the assumptions of JLCM, JLCT is comparable to JLCM in prediction performance under most simulation settings. Furthermore, as the latent class membership model approaches a deterministic partitioning, JLCT can significantly outperform JLCM in certain settings. More importantly, JLCT takes much less time to fit, which enables fast model selection for even better performance. In the next section, we study the scenario where all covariates are time-varying, and thus JLCM does not apply directly, and demonstrate that JLCT gains additional advantage over JLCM by fully exploiting the time-varying nature of the data.
Structure    ρ     JLCT          JLCM
Tree         0.5   34.37 (2.45)  1853.02 (301.94)
             0.7   28.27 (2.22)  2151.44 (426.60)
             0.85  19.16 (1.92)  1921.36 (381.52)
             1     9.66 (0.46)   1691.30 (261.07)
Linear       0.5   46.48 (3.91)  1674.99 (136.63)
             0.7   38.86 (3.43)  2170.75 (369.54)
             0.85  28.62 (2.80)  1826.12 (510.56)
             1     30.97 (3.79)  1794.20 (369.24)
Nonlinear    0.5   36.74 (3.09)  1565.98 (368.03)
             0.7   28.68 (2.41)  1627.64 (260.65)
             0.85  31.02 (2.75)  1744.02 (251.78)
             1     19.50 (2.60)  3302.90 (358.73)
Asymmetric   0.5   35.26 (2.61)  1555.53 (382.61)
             0.7   31.46 (2.64)  1680.21 (309.16)
             0.85  22.99 (2.90)  1781.39 (196.74)
             1     9.62 (1.26)   2757.40 (627.39)
Table 1: The average running time in seconds (standard deviation in parentheses) on a dataset with time-invariant covariates, light censoring, and the WeibullI distribution.

4 Simulation results: time-varying covariates
In this section, we compare JLCT with JLCM under the general scenario where all covariates are time-varying. JLCT can work with time-varying covariates in a straightforward way, but JLCM can only use a subset of the available time-varying data, which limits its prediction performance.
The setup greatly resembles that of the time-invariant scenario in Section 3, except that all covariates are now time-varying. As a result, a subject might change from one latent class to another over time, which directly affects the generation of the longitudinal outcomes and the time-to-event. In Appendix C we give a full description of the setup with time-varying covariates.
4.1 Methods and measures
Given the simulated data, JLCT and JLCM use the same choice of covariates in each modeling component (latent class, fixed effects, random effects, and time-to-event) as in the time-invariant scenario. We can directly fit JLCT to the simulated data using the time-varying covariates. On the other hand, JLCM does not support time-varying covariates in either the latent class membership model or the time-to-event model. If a time-varying covariate is used in either model, by default JLCM will take the first encountered value for each subject and use it as if the covariate were time-invariant; that is, it takes that value as a baseline value and uses it throughout. Thus we can still fit JLCM to the simulated data, but JLCM automatically replaces each time-varying covariate in these two models with its first encountered value.
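JLCM's conversion of a time-varying covariate to a single baseline value can be mimicked explicitly; here records are dictionaries, and the key names `id` and `time` are assumptions for illustration:

```python
def first_encountered_values(records, id_key="id", time_key="time"):
    """Collapse time-varying records to one baseline record per subject,
    mirroring how JLCM keeps only the first observed covariate values."""
    first = {}
    for rec in sorted(records, key=lambda r: (r[id_key], r[time_key])):
        first.setdefault(rec[id_key], rec)   # keep earliest record per subject
    return list(first.values())
```

All later measurements of the covariate are discarded, which is exactly the information loss the comparison below tries to quantify.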
Since JLCM only uses the first encountered values per subject for these covariates, it is not clear whether differences in the performance of JLCM and JLCT are due to the additional information contained in later values of the time-varying covariates, or to differences in the methodologies themselves. To decompose the difference, we also examine versions of JLCT that are restricted to the first encountered values. To summarize, we fit the following five methods to the simulated data:

1. A baseline model that treats the entire dataset as one latent class. This is equivalent to JLCT with stopping threshold $c = \infty$.
2. A JLCT model, using the converted time-invariant (first encountered) covariates to model the latent class and the time-to-event, while using the original time-varying covariates to model the longitudinal outcomes.
3. A JLCT model, using the converted time-invariant covariates to model the latent class, while using the original time-varying covariates to model the time-to-event and the longitudinal outcomes.
4. A JLCT model, using the original time-varying covariates for all of the modeling components. This is the default JLCT.
5. JLCM, using the converted time-invariant covariates to model the latent class and the time-to-event, while using the original time-varying covariates to model the longitudinal outcomes. This is the default JLCM, and it uses the same amount of information as the second JLCT (Method 2). In JLCM, we allow class-specific coefficients in the survival model, as well as class-specific Weibull baseline risk functions. The optimal number of latent classes is chosen using BIC.
All JLCT models use the default stopping threshold $c = 3.84$ and are pruned to have no more than six terminal nodes. Once a tree is constructed, we fit an extended Cox model with different baseline hazard functions and different slopes across terminal nodes. We use the same prediction procedure and performance measures as in the time-invariant scenario to compare these five methods.
4.2 Results
Figures 2(a) to 2(c) show the boxplots of ISE$_{\mathrm{out}}$, MSE$_{y,\mathrm{out}}$, and MSE$_\gamma$, respectively, on a log scale, for light censoring and the WeibullI distribution, over the 16 combinations of latent class structure (Tree, Linear, Nonlinear, Asymmetric) and concentration level ($\rho \in \{0.5, 0.7, 0.85, 1\}$). The experiments are repeated 100 times for JLCT, and 20 times for JLCM, under each setting. The results for other baseline hazard distributions and performance measures are given in Appendix E.
Figure 2(a) demonstrates a clear decomposition of the differences in ISE between JLCT and JLCM. The first JLCT method, with no split, serves as a benchmark. The second JLCT method uses the converted “time-invariant” (baseline) data, and it has slightly larger ISE values than JLCM (Method 5), which uses the same converted data. If we allow JLCT to use the original time-varying covariates in the survival model (Method 3), then the ISE becomes smaller than that of JLCM. This demonstrates the potential of time-varying covariates for making accurate survival predictions. When we further allow JLCT to use the original time-varying covariates in the class membership model as well (Method 4), the ISE decreases even more and becomes significantly better than that of JLCM. That is, if each subject is allowed to switch between latent classes throughout the time of the study, and the estimated membership is also allowed to switch, we can achieve additional improvement in survival predictions when such switching actually occurs in the population.
The ISE of the default JLCT (Method 4) improves when the latent class membership becomes less noisy, i.e., as $\rho$ increases, and JLCT is much more favorable than JLCM when the latent classes are generated by a nearly deterministic partitioning. In particular, when the partitioning is deterministic ($\rho = 1$) and the underlying structure is a tree (Tree and Asymmetric Tree), JLCT is expected to perform well, since the data are generated according to its underlying model, and it does indeed outperform JLCM by a much larger margin in the deterministic tree setups.
Similar patterns appear in MSE$_{y,\mathrm{out}}$ (Figure 2(b)) and MSE$_\gamma$ (Figure 2(c)), where the default JLCT (Method 4) outperforms all other methods when the concentration level is reasonably large. The other two measures have less variability relative to their overall levels than does ISE.
We report the average running times of the default JLCT (Method 4) and JLCM (Method 5) in Table 2. The running times of the three JLCT methods (Methods 2 to 4) are comparable, so we only keep the default JLCT (Method 4) in the comparison. Table 2 shows again that JLCT is orders of magnitude faster than JLCM. Compared to the time-invariant scenario (Table 1), the change in JLCT’s running time is negligible, while JLCM takes much longer to run. This observation is surprising, since internally JLCM still fits a time-invariant dataset. The longer running time of JLCM suggests that the data generated with time-varying covariates contain more complex signals, which results in longer fitting times for the (incorrect) JLCM model.
Structure    ρ     JLCT          JLCM
Tree         0.5   42.66 (2.43)  1928.86 (506.13)
             0.7   33.92 (4.33)  1863.05 (481.95)
             0.85  20.62 (3.04)  2037.55 (564.91)
             1     7.96 (0.58)   2189.12 (468.41)
Linear       0.5   54.18 (4.75)  2109.98 (313.72)
             0.7   42.78 (3.24)  2114.97 (479.38)
             0.85  35.23 (4.21)  2140.96 (269.62)
             1     27.97 (3.83)  2366.95 (718.97)
Nonlinear    0.5   40.54 (4.86)  2044.40 (201.28)
             0.7   33.25 (3.09)  2142.03 (177.44)
             0.85  30.70 (6.02)  2150.83 (249.17)
             1     18.87 (2.72)  2175.97 (455.68)
Asymmetric   0.5   40.60 (5.31)  2154.53 (240.17)
             0.7   32.34 (2.47)  2219.78 (207.58)
             0.85  25.59 (3.06)  2233.15 (219.98)
             1     7.00 (1.05)   2165.21 (43.30)
5 Application
In this section, we apply JLCT to a real dataset, the PAQUID dataset, which was also examined in Proust-Lima et al. (2017), and compare its performance to that of JLCM.
We use the PAQUID (Personnes Agees Quid) dataset from the R package lcmm. The provided PAQUID dataset consists of 2250 records of 500 subjects from the PAQUID study (Letenneur et al., 1994). The PAQUID dataset collects five time-varying values (MMSE, IST, BVRT, HIER, CESD) along with age at visit (age). The event in the dataset is the dementia diagnosis (dem), and the time-to-event is the age at dementia diagnosis or last visit, agedem. The PAQUID dataset also collects three time-invariant covariates: education (CEP), gender (male), and age at entry into the study (age_init). As suggested by Proust-Lima et al. (2017), we normalize the highly asymmetric covariate MMSE and only consider its normalized version normMMSE; we also construct a new covariate age65, a recentered and rescaled version of age. Our goal for this dataset is to jointly model the trajectories of normMMSE (longitudinal outcomes) and the risk of dementia (time-to-event), using the remaining covariates.
We consider two JLCM and three JLCT models:

(JLCM) We adopt the time-invariant JLCM model in Proust-Lima et al. (2017): the trajectories of normMMSE depend on fixed effects age65, , CEP, male, and random effects age65, . The risk of dementia depends on CEP, male, with class-specific Weibull baseline hazard functions. The class membership is modeled by CEP, male.

(JLCM) We extend JLCM to use additional covariates: the survival model uses covariates CEP, male, age_init, , , , ; the class membership model uses covariates CEP, male, , , , , . The rest of the model remains the same as in the time-invariant JLCM. Note that JLCM automatically uses the first encountered value of any time-varying covariate.

(JLCT) The first JLCT model uses the same sets of covariates as JLCM.

(JLCT) The second JLCT model uses the same sets of covariates as JLCM. In particular, this JLCT model also uses the first encountered value of any time-varying covariate.

(JLCT) The last JLCT model adopts the same sets of covariates as JLCM, but uses the original values of any time-varying covariate: CEP, male, age_init, BVRT, IST, HIER, CESD; and CEP, male, age65, BVRT, IST, HIER, CESD. The rest of the model remains the same as in the time-invariant JLCM. This JLCT model is our main model, and it has no comparable competitor.
For the two JLCM models, the number of latent classes is chosen from 2 to 6 according to the BIC selection criterion. For the three JLCT models, we set the stopping threshold to 3.84 and prune the trees to have no more than 6 terminal nodes. The survival models of all five methods assume class-specific baseline hazard functions and the same slope coefficients across classes, as adopted by Proust-Lima et al. (2017).
We use the root mean squared error (RMSE) to measure the accuracy of the predicted longitudinal outcomes. To evaluate the accuracy of the time-to-event predictions, we take a commonly used measure, the Brier score, and its integrated version, the IBS (Graf et al., 1999). The Brier score (BS) at a fixed time is defined as
where is the predicted probability of survival at time conditional on subject 's predictor vector , and is the time-to-event of subject . The integrated Brier score (IBS) is then defined as
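A concrete illustration may help: the Python sketch below computes BS(t) and the IBS on uncensored data. The function names are ours, and the inverse-probability-of-censoring weights of Graf et al. (1999) are omitted for brevity, so this is a simplified sketch rather than the full estimator.

```python
import numpy as np

def brier_score(t, event_times, surv_prob):
    """BS(t): mean squared difference between the predicted survival
    probability S_i(t) and the observed survival status 1{T_i > t}.
    Censoring weights (Graf et al., 1999) are omitted in this sketch."""
    still_alive = (np.asarray(event_times) > t).astype(float)
    return float(np.mean((still_alive - np.asarray(surv_prob)) ** 2))

def integrated_brier_score(grid, event_times, surv_prob_fn):
    """IBS: average of BS(t) over a uniform time grid, approximating
    (1 / t_max) * integral of BS(t) dt over the grid's range."""
    return float(np.mean([brier_score(t, event_times, surv_prob_fn(t))
                          for t in grid]))
```

A perfect predictor (predicted survival equal to the observed survival indicator at every grid time) attains BS(t) = 0 for all t, and hence IBS = 0.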
We compute the accuracy measures on out-of-sample subjects using 10-fold cross-validation as follows. We first randomly divide the dataset into 10 folds of equal size, taking care that all observations of a single subject belong to the same fold. Next, we hold out one fold of data and run the model on the remaining nine folds. The performance of the model is then evaluated on the held-out data. This procedure is repeated 10 times, so that each of the 10 folds is used once for out-of-sample evaluation.
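The subject-level fold assignment can be sketched as follows (Python, illustrative; the function name is ours). Subjects, not rows, are randomized to folds, so repeated measurements of one subject never straddle the train/test split.

```python
import numpy as np

def grouped_kfold(subject_ids, n_folds=10, seed=0):
    """Assign each row to a fold via its subject, so that all
    observations of one subject end up in the same fold."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    # Randomly permute the subjects, then deal them round-robin into folds.
    fold_of = dict(zip(rng.permutation(subjects),
                       np.arange(len(subjects)) % n_folds))
    return np.array([fold_of[s] for s in subject_ids])
```

In the cross-validation loop, fold k is held out with `test = folds == k` and the model is fit on the remaining rows.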
We report the average prediction measures and running time over the 10 folds in Table 3. When using only time-invariant covariates, JLCT performs similarly to its counterpart JLCM in prediction accuracy (IBS and RMSE). By adding four "time-invariant" covariates (converted from time-varying ones) to the class membership and survival models, the performance of JLCT remains similar, but the performance of JLCM becomes much worse, mainly because JLCM failed to converge when optimizing the log-likelihood function. When using the original time-varying covariates in the class membership and survival models, however, JLCT improves its time-to-event prediction accuracy and outperforms all other methods by a significant margin on that measure. In terms of running time, JLCT is much faster than JLCM: fitting JLCT took no more than 2 minutes even for the most complex model, while fitting JLCM took from 40 to 60 minutes. The experiments were performed on a desktop with a 2.26 GHz CPU and 32 GB of memory.
The results in Table 3 demonstrate two key advantages of the tree-based approach JLCT over the parametric JLCM: JLCT can provide significantly better prediction performance by using time-varying covariates in all of its modeling components, and it can be orders of magnitude faster to fit than JLCM.
             JLCM       JLCM       JLCT     JLCT     JLCT
IBS          0.1731     0.4467     0.1611   0.169    0.0966
RMSE         14.7588    18.3544    14.5503  14.2912  14.5014
Time (secs)  2448.7026  4107.6193  1.6998   40.9126  87.9072
Figures 3(a), 3(b), and 3(c) give the tree structures of the three JLCT models, respectively, fit using the entire PAQUID dataset. The numbers in each box display the test statistic TS and the proportion of observations contained in the current node. We make the following observations:

When fitting with only the time-invariant covariates CEP and male, JLCT first splits on CEP, then splits on gender within one of the resulting nodes. Two of the three terminal nodes have a final TS greater than the stopping criterion, 3.84, which indicates potential association between the longitudinal and survival data within these two nodes. However, since CEP and male take binary values, JLCT cannot split further. Thus, it seems unlikely that using only time-invariant covariates provides an adequate fit for these data.

The second JLCT model uses more splitting covariates, CEP, male, , , , , , with some of the covariates converted from time-varying ones. It makes multiple splits on , and then a final split on . With additional covariates to split on, the tree ends up with five terminal nodes, each having a test statistic less than 3.84; the model has thus uncovered a good partitioning, in the sense that the terminal nodes lack evidence of association between the longitudinal and survival outcomes. However, the prediction performance shows no significant improvement over the first JLCT model based on cross-validation, suggesting that the JLCT models have reached a limit with only time-invariant covariates to use.

The last JLCT model uses the original time-varying values of the covariates. The tree splits into three nodes based only on age (splitting at ages 82 and 90), suggesting that people transition into different dementia statuses as they get older, which is reflected in both the cognitive test score (normMMSE) and the time until a dementia diagnosis. Furthermore, all three nodes obtain a final test statistic less than 3.84, which indicates a good partitioning of the population in terms of preserving conditional independence within groups. Finally, with time-varying latent class memberships and time-varying covariates in the survival models, this JLCT model achieves a significant improvement in prediction performance for time to dementia diagnosis.
6 Conclusion
In this paper, we have proposed a tree-based approach (JLCT) to jointly model longitudinal outcomes and time-to-event with latent classes. Simulations and a real application to the PAQUID dataset show that when the covariates are time-invariant, JLCT performs comparably to the common parametric joint latent class modeling approach, JLCM. When the covariates are time-varying, JLCT makes full use of the time-varying information and can demonstrate a significant advantage over JLCM, which can only use a subset of the available time-varying data. In addition, JLCT is orders of magnitude faster than JLCM under both scenarios.
There are several interesting extensions of the JLCT method that could be explored. The PAQUID dataset discussed in Proust-Lima et al. (2017) and in Section 5 actually exhibits competing risks, since there is a risk of death before dementia occurs, but this was ignored in the analysis; it would be of interest to generalize JLCT to account for this. Often survival times are only known to lie within an interval of time (interval censoring), and JLCT could be adapted to that situation as well. In addition, the analysis here allows for only one longitudinal outcome, but sometimes several biomarkers are available for a patient, and it would be useful to generalize JLCT to allow for that.
An R package, jlctree, that implements JLCT is available at CRAN. Appendix F provides code illustrating its use.
References
 Andersen and Gill [1982] P. K. Andersen and R. D. Gill. Cox's regression model for counting processes: A large sample study. The Annals of Statistics, 10(4):1100–1120, 1982.
 Bates et al. [2015] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
 Breiman et al. [1984] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, USA, 1984.
 Cox [1972] D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B, 34(2):187–220, 1972.
 Crowley and Hu [1977] J. Crowley and M. Hu. Covariance analysis of heart transplant survival data. Journal of the American Statistical Association, 72(357):27–36, 1977.
 Dekker et al. [2008] F. W. Dekker, R. De Mutsert, P. C. Van Dijk, C. Zoccali, and K. J. Jager. Survival analysis: Time-dependent effects and time-varying risk factors. Kidney International, 74(8):994–997, 2008.
 Dicker et al. [2006] R. Dicker, F. Coronado, D. Koo, and R. G. Parrish. Principles of epidemiology in public health practice. US Department of Health and Human Services, 2006.
 Fu and Simonoff [2017] W. Fu and J. S. Simonoff. Survival trees for left-truncated and right-censored data, with application to time-varying covariate data. Biostatistics, 18(2):352–369, 2017.
 Garre et al. [2008] F. G. Garre, A. H. Zwinderman, R. B. Geskus, and Y. W. Sijpkens. A joint latent class change-point model to improve the prediction of time to graft failure. Journal of the Royal Statistical Society: Series A, 171(1):299–308, 2008.
 Graf et al. [1999] E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18):2529–2545, 1999.
 Hajiro et al. [2000] T. Hajiro, K. Nishimura, M. Tsukino, A. Ikeda, and T. Oga. Stages of disease severity and factors that affect the health status of patients with chronic obstructive pulmonary disease. Respiratory Medicine, 94(9):841–846, 2000.
 Hastie et al. [2001] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
 Henderson et al. [2000] R. Henderson, P. Diggle, and A. Dobson. Joint modelling of longitudinal measurements and event time data. Biostatistics, 1(4):465–480, 2000.
 Jongerden et al. [2015] I. P. Jongerden, B. Speelberg, C. L. Satizábal, A. G. Buiting, M. A. Leverstein-van Hall, J. Kesecioglu, and M. J. Bonten. The role of systemic antibiotics in acquiring respiratory tract colonization with gram-negative bacteria in intensive care patients: A nested cohort study. Critical Care Medicine, 43(4):774–780, 2015.
 Kovesdy et al. [2007] C. P. Kovesdy, J. E. Anderson, and K. Kalantar-Zadeh. Paradoxical association between body mass index and mortality in men with CKD not yet on dialysis. American Journal of Kidney Diseases, 49(5):581–591, 2007.
 Laird and Ware [1982] N. M. Laird and J. H. Ware. Random-effects models for longitudinal data. Biometrics, 38(4):963–974, 1982.
 Letenneur et al. [1994] L. Letenneur, D. Commenges, J.-F. Dartigues, and P. Barberger-Gateau. Incidence of dementia and Alzheimer's disease in elderly community residents of southwestern France. International Journal of Epidemiology, 23(6):1256–1261, 1994.
 Lin et al. [2002] H. Lin, B. W. Turnbull, C. E. McCulloch, and E. H. Slate. Latent class models for joint analysis of longitudinal biomarker and event process data: Application to longitudinal prostate-specific antigen readings and prostate cancer. Journal of the American Statistical Association, 97(457):53–65, 2002.
 Munoz-Price et al. [2016] L. S. Munoz-Price, J. F. Frencken, S. Tarima, and M. Bonten. Handling time-dependent variables: Antibiotics and antibiotic resistance. Clinical Infectious Diseases, 62(12):1558–1563, 2016.
 Proust-Lima and Taylor [2009] C. Proust-Lima and J. M. Taylor. Development and validation of a dynamic prognostic tool for prostate cancer recurrence using repeated measures of post-treatment PSA: A joint modeling approach. Biostatistics, 10(3):535–549, 2009.
 Proust-Lima et al. [2009] C. Proust-Lima, P. Joly, J.-F. Dartigues, and H. Jacqmin-Gadda. Joint modelling of multivariate longitudinal outcomes and a time-to-event: A nonlinear latent class approach. Computational Statistics & Data Analysis, 53(4):1142–1154, 2009.
 Proust-Lima et al. [2014] C. Proust-Lima, M. Sène, J. M. Taylor, and H. Jacqmin-Gadda. Joint latent class models for longitudinal and time-to-event data: A review. Statistical Methods in Medical Research, 23(1):74–90, 2014.
 Proust-Lima et al. [2017] C. Proust-Lima, V. Philipps, and B. Liquet. Estimation of extended mixed models using latent classes and latent processes: The R package lcmm. Journal of Statistical Software, 78(2):1–56, 2017.
 Proust-Lima et al. [2018] C. Proust-Lima, V. Philipps, A. Diakite, and B. Liquet. lcmm: Extended Mixed Models Using Latent Classes and Latent Processes, 2018. URL https://cran.r-project.org/package=lcmm. R package version: 1.7.9.
 Rizopoulos [2010] D. Rizopoulos. JM: An R package for the joint modelling of longitudinal and time-to-event data. Journal of Statistical Software, 35(9):1–33, 2010.
 Therneau [2015] T. M. Therneau. A Package for Survival Analysis in S, 2015. URL https://CRAN.R-project.org/package=survival. version 2.38.
 Therneau and Grambsch [2000] T. M. Therneau and P. M. Grambsch. Modeling Survival Data: Extending the Cox Model. Springer, New York, 2000. ISBN 0387987843.
 Tsiatis and Davidian [2004] A. A. Tsiatis and M. Davidian. Joint modeling of longitudinal and time-to-event data: An overview. Statistica Sinica, 14(3):809–834, 2004.
 Tsiatis et al. [1995] A. A. Tsiatis, V. Degruttola, and M. S. Wulfsohn. Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association, 90(429):27–37, 1995.
 Wulfsohn and Tsiatis [1997] M. S. Wulfsohn and A. A. Tsiatis. A joint model for survival and longitudinal data measured with error. Biometrics, 53(1):330–339, 1997.
Appendix A Joint Latent Class Models (JLCM)
In this section, we give a brief introduction to JLCM and discuss its strengths and weaknesses. More details about JLCM can be found in Proust-Lima et al. [2014].
In JLCM, the latent class membership for subject is determined by a set of covariates (which must be time-invariant in JLCM, so we drop the time indicator), through the following probabilistic model:
where are the class-specific intercept and slope parameters for class .
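In a standard JLCM parameterization, this elided class-membership model takes the multinomial logistic form below; the symbols ($\pi_{ig}$, $\xi_{0g}$, $\xi_{1g}$, $x_{ci}$, number of classes $G$) are introduced here for concreteness and may differ from the authors' original notation:

```latex
\pi_{ig} = P(c_i = g \mid x_{ci})
         = \frac{\exp\!\left(\xi_{0g} + x_{ci}^{\top}\xi_{1g}\right)}
                {\sum_{l=1}^{G}\exp\!\left(\xi_{0l} + x_{ci}^{\top}\xi_{1l}\right)},
         \qquad g = 1,\dots,G,
```

where $c_i$ is subject $i$'s latent class and $x_{ci}$ is the time-invariant covariate vector of the class-membership model.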
The longitudinal outcomes in JLCM are assumed to follow a slightly different linear mixed-effects model than (1):
where is the fixed-effect vector for class , and is the random-effect vector for subject and class . The random-effect vectors are independent across latent classes and subjects, and normally distributed with mean and variance-covariance matrix . The errors are assumed to be independent and normally distributed with mean and variance , and independent of all of the random effects. Let denote the likelihood of the longitudinal outcomes given that subject belongs to latent class .
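In commonly used notation (again assumed for concreteness, not necessarily the authors'), this class-specific linear mixed-effects model can be written as:

```latex
y_{ij} \mid c_i = g \;=\; x_{ij}^{\top}\beta_g + z_{ij}^{\top}b_{ig} + \epsilon_{ij},
\qquad b_{ig} \sim N(0, B_g), \quad \epsilon_{ij} \sim N(0, \sigma^2),
```

where $y_{ij}$ is the $j$-th longitudinal measurement of subject $i$, $\beta_g$ is the fixed-effect vector for class $g$, and $b_{ig}$ is the random-effect vector for subject $i$ in class $g$.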
The time-to-event is assumed to follow a proportional hazards model with time-invariant covariates:

\lambda_i(t \mid c_i = g) = \lambda_{0g}(t; \zeta_g)\,\exp\!\left(x_i^{\top}\gamma_g\right),   (4)

where \zeta_g parameterizes the class-specific baseline hazard \lambda_{0g}, and \gamma_g is associated with the set of covariates x_i (we drop the time indicator from x_i since it must be time-invariant in JLCM). Let S_i(t \mid c_i = g) denote the survival probability at time t if subject i belongs to latent class g. Note that the extended Cox model (2) of JLCT extends the proportional hazards model (4) to allow for time-varying covariates.
Let θ denote the entire vector of parameters of JLCM. These parameters are estimated together by maximizing the log-likelihood function
The log-likelihood function uses the assumption that, conditional on the latent class membership, the longitudinal outcomes and the time-to-event are independent.
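Under this conditional independence assumption, the elided log-likelihood takes the standard JLCM mixture form (notation assumed here: $\theta$ is the full parameter vector, $\delta_i$ the event indicator, $f$ the class-conditional longitudinal density, and $\lambda_i$, $S_i$ the class-conditional hazard and survival functions):

```latex
\ell(\theta) \;=\; \sum_{i=1}^{N} \log
\left\{ \sum_{g=1}^{G} \pi_{ig}\,
  f\!\left(y_i \mid c_i = g; \theta\right)\,
  \lambda_i\!\left(T_i \mid c_i = g; \theta\right)^{\delta_i}
  S_i\!\left(T_i \mid c_i = g; \theta\right) \right\}.
```

Each subject contributes a mixture over latent classes, with the longitudinal and survival contributions factoring inside the class-conditional term.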
As mentioned in the introduction, the concept of latent class membership is of particular interest in clinical studies. JLCM is designed to give a parametric description of subjects' tendency to belong to each latent class, and therefore JLCM is a suitable model when the true latent class is indeed a random outcome with unknown probabilities for each class. The multinomial logistic regression that JLCM uses is a flexible tool for modeling these unknown probabilities.
Despite the usefulness of latent classes, JLCM has several weaknesses. First, the running time of JLCM does not scale well because of its complicated likelihood function: simulation results show that the running time of JLCM increases exponentially fast as a function of the number of observations, the number of covariates, and the number of assumed latent classes. Second, the modeling of the time-to-event in JLCM is restricted to time-invariant covariates. However, time-varying covariates are helpful in modeling the time-to-event, especially when treatment or important covariates change during the study; for instance, a patient receives a heart transplant [Crowley and Hu, 1977], or the longitudinal CD4 counts change during a study of AIDS [Tsiatis et al., 1995]. Research shows that using time-varying covariates can uncover short-term associations between the time-to-event and covariates [Dekker et al., 2008, Kovesdy et al., 2007], and that ignoring the time-varying nature of the covariates leads to time-dependent bias [Jongerden et al., 2015, Munoz-Price et al., 2016]. The third restriction of JLCM is that the latent class membership model only uses time-invariant covariates, which implies that the latent class membership of a subject is assumed to be fixed throughout the study. However, a patient's disease stage is very likely to change during the course of a clinical study; for instance, the disease may move from its early stages to its peak, and then to its resolution. When the goal of joint modeling is to uncover a meaningful clustering of the population that leads to definitions of disease stages, it is important to allow time-varying covariates in the latent class membership model, so that the model reflects this real-world situation.
Appendix B Simulation setup: timeinvariant covariates
In this section we give details of the data-generating scheme in the simulation study with time-invariant covariates only (Section 3).
At each simulation run, for each subject we randomly generate five independent, time-invariant covariates. We assume there are four latent classes, which are determined by the covariates. Once the latent classes are determined for each subject, the time-to-event and the longitudinal outcomes are conditionally independent given the latent classes, and are therefore generated separately. In particular, the survival outcomes (time-to-event) depend on , and the longitudinal outcomes depend on the latent classes. We give more details below.
B.1 Covariates
At each simulation run, for each subject we draw , , , , uniformly from , , , , and respectively.
B.2 Latent classes
We determine class membership based on a multinomial logistic model of , with an increasing level of concentration on one class. Our latent class membership generation model matches the setup of JLCM, and it approaches the setup of JLCT as the concentration level approaches 1.
For subject , we compute the value of a "score" function , where denotes the parameters associated with latent class , and denote the first two covariates of subject . The latent class membership for subject is generated according to two key values: the "majority" class and the "concentration" level .
Define the "majority" class as the latent class with the largest score for sample ,
We consider four types of score functions , which correspond to four underlying structures of latent classes: tree partition, linear separation, nonlinear separation, and asymmetric tree partition. The structure is reflected by the dependency of on , which is shown in Figure 1.
The latent class membership of subject is drawn according to the probabilities
where the parameter is chosen such that the probability of the "majority" class is approximately equal to a prespecified concentration level . At one extreme, the latent class membership is randomly determined and independent of the covariates; at the other extreme, the latent class membership corresponds to a deterministic partitioning based on the covariates, which is consistent with the assumptions underlying a tree partitioning.
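Under our reading of this setup, the mechanism can be sketched in Python as a softmax over the class scores with a temperature parameter tuned by bisection so that the majority class receives probability equal to the concentration level. The score values, bracket bounds, and iteration count below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sample_latent_class(scores, rho, rng):
    """Draw a latent class with probabilities p_g proportional to
    exp(theta * s_g), where theta is tuned by bisection so that the
    highest-scoring ("majority") class has probability rho."""
    scores = np.asarray(scores, dtype=float)
    g_star = int(np.argmax(scores))

    def p_major(theta):
        w = np.exp(theta * (scores - scores.max()))  # stabilized softmax
        return w[g_star] / w.sum()

    lo, hi = 0.0, 1.0
    while p_major(hi) < rho and hi < 1e6:   # grow hi until it brackets rho
        hi *= 2.0
    for _ in range(100):                    # bisection on the temperature
        mid = 0.5 * (lo + hi)
        if p_major(mid) < rho:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    w = np.exp(theta * (scores - scores.max()))
    return rng.choice(len(scores), p=w / w.sum())
```

As the concentration level approaches 1, the draw becomes the deterministic argmax of the scores, matching the tree-partitioning limit described above.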
The choices of parameters are given below.

Tree. Consider the following coefficients :
Define the score function
and let denote the latent class with the largest score for sample . See Figure 1(a) for the dependency of on . The latent classes are drawn according to the probabilities
where the parameter is chosen such that , and is some prespecified level . In particular, when , ; when , .

Linear. Consider the following coefficients drawn randomly from the unit sphere,
Define the score function
and let . See Figure 1(b) for the dependency of on . The latent classes are drawn according to the following probabilities
with again chosen to control the value of .

Nonlinear. Here we skip the step of defining a score function and instead work directly with and . For each observation, its "most likely" latent class is determined by whether belongs to the circles centered at and with radius :
See Figure 1(c) for a visualization of . The latent classes are drawn according to the following probabilities:
where .

Asymmetric. Here we skip the step of defining a score function and instead work directly with and . For each observation, its "most likely" latent class is determined by the following asymmetric tree:
See Figure 1(d). The latent classes are drawn according to the following probabilities:
where .
B.3 Time-to-event
The survival time (time-to-event) of subject follows the proportional hazards model
where the slope coefficients depend on latent class for subject .
We use three different distributions for the baseline hazards: exponential, Weibull with hazards decreasing in time (Weibull-D), and Weibull with hazards increasing in time (Weibull-I). We select parameters and slopes such that the mean survival times across latent classes remain similar across the different distributions. The distributions and corresponding parameters for generating the time-to-event data are listed below.

Exponential with , and slopes are

Weibull distribution with shape parameter , which corresponds to hazards decreasing with time (Weibull-D). The scale parameter is , and the slopes are

Weibull distribution with shape parameter , which corresponds to hazards increasing with time (Weibull-I). The scale parameter is , and the slopes are
Left-truncation times are generated independently from a uniform distribution. Right-censoring times are generated independently from an exponential distribution, with parameters chosen to reflect light censoring (approximately 20% of observations censored) and heavy censoring (approximately 50% of observations censored).
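The survival-data generation can be sketched in Python via inverse-transform sampling. The Weibull parameterization H0(t) = (t/scale)**shape, the redraw-until-observed treatment of left truncation, and all numeric values are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def draw_event_time(lin_pred, shape, scale, rng):
    """Inverse-transform sampling from a proportional hazards model with
    Weibull baseline cumulative hazard H0(t) = (t / scale)**shape, so that
    T = scale * (-log(U) * exp(-lin_pred))**(1/shape).
    shape < 1 gives decreasing hazards, shape > 1 increasing hazards,
    and shape == 1 reduces to the exponential baseline."""
    u = rng.uniform()
    return scale * (-np.log(u) * np.exp(-lin_pred)) ** (1.0 / shape)

def ltrc_observation(lin_pred, shape, scale, trunc_max, cens_rate, rng):
    """Generate one left-truncated, right-censored (LTRC) record: redraw
    until the event time exceeds the truncation time (the subject is
    observed), then apply independent exponential right censoring."""
    while True:
        left = rng.uniform(0.0, trunc_max)        # left-truncation time
        t = draw_event_time(lin_pred, shape, scale, rng)
        if t > left:                              # subject enters the study
            break
    c = left + rng.exponential(1.0 / cens_rate)   # censoring after entry
    return left, min(t, c), int(t <= c)           # (entry, observed, event)
```

Raising `cens_rate` shortens the censoring times and thus moves the design from light toward heavy censoring.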
B.4 Longitudinal outcomes
The longitudinal outcome comes from the following linear mixed-effects model: for subject at time , let denote the latent class membership, and
where , and are class-specific random intercepts. We assume each subject is measured at its entry (left-truncation) time, together with multiple intermediate measurement times. The number of intermediate measurements for subject is generated independently from , so each subject has at least 2 measurements and has 3 measurements on average. The intermediate measurement times are then sampled independently and uniformly between subject 's left-truncation and right-censoring times. Finally, the data are converted to the LTRC format.
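A Python sketch of this measurement process follows. The 1 + Poisson(1) form for the number of intermediate measurements is our assumption, chosen to match "at least 2 measurements, 3 on average"; the intercept and noise values passed in are placeholders.

```python
import numpy as np

def longitudinal_measurements(entry, exit_time, class_intercept, sigma, rng):
    """Simulate one subject's longitudinal outcomes: a measurement at the
    entry (left-truncation) time plus n_i intermediate times drawn
    uniformly on (entry, exit_time). n_i = 1 + Poisson(1) gives at least
    2 measurements in total and 3 on average. The outcome is a
    class-specific intercept plus Gaussian noise."""
    n_intermediate = 1 + rng.poisson(1.0)
    times = np.sort(np.concatenate((
        [entry], rng.uniform(entry, exit_time, size=n_intermediate))))
    y = class_intercept + rng.normal(0.0, sigma, size=times.size)
    return times, y
```

Each returned record can then be split at consecutive measurement times to obtain the LTRC (left-truncated, right-censored) format used by JLCT.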
Observe that the covariates determine the latent classes, and thus affect both the time-to-event and the longitudinal outcomes. Therefore, the time-to-event is correlated with the longitudinal outcomes if these covariates are unknown. On the other hand, conditional on one of the four latent classes, the time-to-event and the longitudinal outcomes are independent: the former follows a class-specific proportional hazards model and depends only on the covariates, while the latter is a constant plus random noise. Therefore, the simulated data satisfy the conditional independence assumption made by both JLCM and JLCT.
B.5 Methods and predictions
Given the simulated data, JLCT and JLCM use almost the same subsets of covariates in their modeling components: to model the time-to-event, and for the fixed and random effects to model the longitudinal outcomes. JLCT uses to model the latent class membership. Since tree-based approaches automatically use interactions between covariates to partition the population, for a fair comparison we allow JLCM to include an additional interaction term in modeling the latent classes: