Clinical studies often collect three types of data on each patient: the time to the event of interest (possibly censored), the longitudinal measurements on a continuous response (for example, some sort of biomarker viewed as clinically important), and an additional set of covariates (possibly time-varying) about the patient. The clinical studies then focus on analyzing the relationship between the time-to-event and the longitudinal responses, using the additional covariates. A common approach is to jointly model the time-to-event by a survival model, while modeling the longitudinal responses using a linear mixed-effects model, with both the survival and the linear mixed-effects models potentially making use of the additional covariates.
The first and most common approach for the joint modeling problem is the shared random effects model (SREM) (Wulfsohn and Tsiatis, 1997; Henderson et al., 2000; Tsiatis and Davidian, 2004; Rizopoulos, 2010). The name “shared random effects” comes from the modeling assumption that a set of random effects accounts for the association between longitudinal outcomes and time-to-event. In particular, the longitudinal outcomes are modeled by linear mixed-effects models (Laird and Ware, 1982)
, with random effects for each patient. Meanwhile, these random effects affect the hazards of the event through some parametric survival model, for example a proportional hazards model. SREM is limited in the use of covariates in the survival model, as it only allows time-invariant baseline covariates, such as a treatment indicator. To fit a SREM, the parameters of the survival model and the linear mixed-effects model are estimated together via maximum likelihood. SREM proposes a straightforward way of jointly modeling the two quantities, but it comes with high computational cost.
. JLCM assumes that the population of patients consists of multiple latent classes. A patient’s time-to-event and longitudinal responses are independent conditioning on his or her latent class membership. In addition, JLCM assumes the latent classes are homogeneous and thus patients within a latent class follow the same survival and linear mixed-effects model. Finally, the latent class membership is modeled by a multinomial logistic regression model.
What makes JLCM interesting and different from SREM is the idea of latent class membership, which (among other things) can be used to describe disease progression in clinical studies (Lin et al., 2002; Garre et al., 2008; Proust-Lima et al., 2009). It is well-known that many diseases have different stages; examples include dementia, AIDS, cancer, and chronic obstructive pulmonary disease (COPD) (Dicker et al., 2006). From a clinical point of view, it is important to identify those stages, since treatment could change with membership in different stages (Hajiro et al., 2000). Currently the clinical definitions of stages of a disease consist of using diagnostic findings (such as biomarkers) to produce clusters of patients. However, it is possible that by jointly studying biomarker trajectories and survival experiences, one can find data-dependent latent classes that uncover new, meaningful stages.
JLCM, like most frequentist parametric approaches, estimates the parameters for all modeling components via maximizing the log-likelihood function, which is a very complex process. Although the computational cost for JLCM is less than that of SREM, it is still prohibitively slow for large scale data. In addition, JLCM cannot use time-varying covariates in its latent class membership and survival models, which could greatly reduce its prediction performance.
A nonparametric approach that addresses some of these problems would be desirable, and tree-based approaches (Breiman et al., 1984) are natural candidates. It is very efficient to fit a tree, and the terminal nodes of a tree naturally represent a partition of the population. A tree-based approach for joint latent class modeling also addresses the time-invariant limitation of JLCM, since time-varying covariates can be used as the splitting variables to construct the tree. Furthermore, once a tree is constructed, it is up to the user to decide which type of survival models and which covariates to use within each terminal node, providing additional flexibility to the analyst.
In this work, we propose the joint latent class tree (JLCT) method. JLCT, like JLCM, is based on the key assumption that conditioning on latent class membership, time-to-event and longitudinal responses are independent. JLCT therefore looks for a tree-based partitioning such that within each estimated latent class defined by a terminal node, the time-to-event and longitudinal responses display a lack of association. Once the tree is constructed, we assign each observation to a latent class (i.e. terminal node), and independently fit survival and linear mixed-effects models, using the class membership information.
The rest of the paper is organized as follows. In Section 2 we introduce the setup of the joint modeling problem, and present our joint latent class tree (JLCT) method. In Sections 3 and 4 we use simulations to compare JLCT with JLCM in terms of prediction performance and running time. Finally, in Section 5 we apply JLCT to a real dataset, and demonstrate that JLCT admits competitive (or superior) prediction performance, while being orders of magnitude faster than the JLCM approach.
2 Constructing a tree to uncover conditional independence
2.1 Joint modeling setup
Assume there are subjects in the sample. For each subject , we observe repeated measurements of a longitudinal outcome at times
. We denote the vector of longitudinal outcomes by. In addition, for each subject we observe a vector of covariates at each measurement time , . These covariates can be either time-invariant or time-varying. Each subject is also associated with a time-to-event tuple , where is the time of the event, and is the censor indicator with if subject is censored at , and otherwise.
We assume there exist latent classes, and let denote the latent class membership of subject at time . We assume the latent class membership is determined by a subset of covariates, denoted by .
The joint latent class modeling problem makes the key assumption that a patient’s time-to-event and longitudinal outcomes are independent conditioning on his or her latent class membership . Without controlling the latent class membership , time-to-event and longitudinal outcomes may appear to be correlated because each is related to the latent class, but given the two are independent of each other, and therefore the longitudinal outcomes have no prognostic value for time-to-event given the latent class. The modeling of and are therefore separated conditioning on .
We assume the longitudinal outcomes come from a linear mixed-effects model:
Here we assume the longitudinal outcomes depend on two subsets of , with the subset of covariates associated with a fixed effect vector , and the subset of covariates associated with a latent class-specific random effect vector . We assume the random effect vector is independent across latent classes . In addition, there is a subject-specific random intercept that is independent across subjects and independent of the random effects . Finally, the errors
are assumed to be independent and normally distributed with mean
and variance, and independent of all the random effects as well.
Conditioning on latent class membership, we assume the time-to-event tuple depends on a subset of covariates observed at time , through the extended Cox model for time-varying covariates (Cox, 1972):
where is the vector of class-specific slope coefficients. Equation (2) shows a general form of the survival model given latent classes . In practice one can restrict either or , or both, to be identical across latent classes.
Tree construction requires fitting the extended Cox model using the longitudinal outcome variable as a predictor, and perhaps other time-varying covariates. In order to implement that, we convert the original data into left-truncated right-censored (LTRC) data (Andersen and Gill, 1982; Fu and Simonoff, 2017). For each subject and measurement time , there is a “pseudo-observation” with , and a time-to-event triplet , where is the left-truncated time, is the right-censored time, and is the censor indicator.
We have introduced four subsets of covariates so far: for the latent classes, for the fixed effects, for the random effects, and for the time-to-event. Each of the four subsets can contain time-varying covariates, and the four subsets can be either identical, or share common covariates, or share no covariates at all.
2.2 Joint Latent Class Tree (JLCT) methodology
In Appendix A, we give a brief introduction to JLCM, and discuss at length its strengths and weaknesses. In this section we formally introduce JLCT, the joint latent class tree approach. Like JLCM, JLCT also assumes conditional independence of the longitudinal outcomes and the time-to-event within each latent class. Under this assumption, JLCT looks for a tree-based partitioning such that within each estimated class defined by a terminal node the time-to-event and longitudinal outcomes display a lack of association.
Tree-based methods are powerful modeling tools in statistics and data mining (Breiman et al., 1984; Hastie et al., 2001), especially because they are fast to construct, able to uncover nonlinear relationships between covariates, and intuitive and easy to explain. We consider binary trees, where each node is recursively split into two children nodes based on a splitting criterion. Often the splitting criterion ensures that the two children nodes are more “homogeneous” than their parent node. The measure of “homogeneity” varies by the method, for instance in a classification tree, the measure could be misclassification error, Gini index, or cross-entropy, while in a regression tree it could be residual sum of squares. The tree stops splitting when the node is “pure,” or when some stopping criteria are met.
The measure of “homogeneity” in JLCT is quite different from those commonly used in regression and classification trees. Our measure is based on the conditional independence between the time-to-event and the longitudinal outcomes: the more apparently independent the two variables are conditioning on the node, the more “homogeneous” the node is. To be more concrete, the splitting criterion repeatedly uses the test statistic for the hypothesis test
under the extended Cox model
In the model above, the coefficient is associated with the longitudinal outcomes , and the vector of coefficients is associated with the set of covariates . Thus, corresponds to the longitudinal outcome having no relationship with the time-to-event in the node given the other covariates .
We have two obvious choices for the hypothesis test (3): the log-likelihood ratio test (LRT) or the Wald test. In this work we use the log-likelihood ratio test in all experiments; simulations indicate that the Wald test gives similar results. We will denote the test statistic of the hypothesis test as TS. The smaller the value of TS is, the less related longitudinal outcomes are to the time-to-event data given the covariates and current node. JLCT seeks to partition observations using covariates in , such that TS is small within each leaf node, but stops partitioning when TS is less than a specified stopping parameter . More formally, the tree splitting procedure works as follows:
At the current node, compute the test statistic TS.
If TS, stop splitting. Otherwise proceed.
For every split variable , and possible split point ,
Define two children nodes
Ignore this split if either node violates the restrictions specified in the control parameters. (Details are given near the end of this section.) Otherwise proceed.
At each child node, determine the test statistics TS, TS respectively.
Compute the score .
Scan through all pairs of to find , and split the current node on variable at split point .
JLCT recursively splits nodes according to the above procedure, until none of the terminal nodes can be further split.
The stopping criterion, , is based on the following property of the test statistic: under the null model in (3), the distribution of TS is approximately a
distribution. Therefore, we do not reject the null hypothesiswhen TS is smaller than a threshold associated to a desired significance level. We set by default, which is the 5% tail of the distribution.
Standard control parameters, such as the minimal number of observations in any terminal node, the minimal improvement in the “purity” by a split, etc., also apply to the JLCT method. In addition, we introduce two control parameters that are specific to JLCT:
Minimum number of events in any terminal node. This parameter ensures that the survival model in each terminal node is fit with meaningful survival data with enough occurrences of events (that is, uncensored observations). By default, this parameter is set to the number of covariates used for the Cox PH model.
Upper bound on the variance of the estimated coefficients in all survival models at tree nodes. This parameter ensures that the fitted coefficients of the survival models are stable. By default this parameter is set to .
These two new control parameters enforce reliability of the fitted survival models at all nodes, which further produces reliable test statistics that can accurately reflect the relationships between the time-to-event and the longitudinal outcomes.
3 Simulation results: time-invariant covariates
In this section, we use simulations to study the behavior of JLCT, and compare JLCT with JLCM. The data are generated such that the time-to-event is correlated with the longitudinal outcomes, but the two are independent conditioning on the latent classes. Thus, the key assumption of conditional independence holds on the simulated data.
We begin with a relatively simple situation where all of the covariates are time-invariant, and thus the time-to-event and latent class memberships are determined by time-invariant covariates. This scenario agrees with the setup of JLCM. In the next section (Section 4) we consider the general scenario where all covariates are time-varying.
The simulation setup varies in the structure of latent classes and in the distribution of time-to-event. We consider four structures of latent classes (as functions of covariates and ): tree partition, linear separation, non-linear separation, and asymmetric tree partition, which are shown in Figure 1
. Class membership for each subject is determined based on an underlying probability vector, with the concentration levelbeing defined as the probability of falling into the most probable class; see Appendix B.2 for details. We consider three distributions for baseline hazards of time-to-event data: exponential, Weibull with decreasing hazards with time (Weibull-D), and Weibull with increasing hazards with time (Weibull-I). We give a full description of the simulation setup with time-invariant covariates in Appendix B.
3.1 Methods and measures
We compare four methods on the simulated datasets.
A baseline model that treats the entire dataset as one latent class. This is equivalent to JLCT with stopping threshold . After the tree is constructed, we fit a Cox PH model.
A JLCT with the stopping threshold . The tree is pruned to have no more than six terminal nodes, which coincides with the maximal number of latent classes we allow for JLCM. After the tree is constructed, we fit a Cox proportional hazards (PH) model with the same baseline hazard function but different Cox PH slopes across terminal nodes.
A JLCT that is the same as the second JLCT, except that both baseline hazard functions and Cox PH slopes differ across terminal nodes.
A JLCM, with class-specific slopes and class-specific baseline hazard functions in the survival model. The optimal number of latent classes is chosen from using BIC.
All JLCT methods fit the linear mixed-effects model (1) to the longitudinal outcomes after the tree is constructed, taking the terminal nodes to be the latent classes. Given the fitted JLCT and JLCM, we predict longitudinal outcomes and time-to-event (in terms of survival curves) on both in-sample and out-of-sample data. The details of the prediction procedure are given in Appendix B.5.
We measure performance of the four methods by prediction accuracy (both in-sample and out-of-sample), and by the difference between fitted and true parameters. To compute the out-of-sample measures, at each simulation run we generate a new random sample of subjects, using the same data generating process for the in-sample data. We consider the following measures:
The integrated squared error (ISE) measures the accuracy of the predicted survival curve. The ISE is defined as follows for a set of subjects :
where and are the predicted and the true survival probability for subject at time , respectively. We compute ISE on in-sample subjects (ISE) and on out-of-sample subjects (ISE).
MSE measures the accuracy of longitudinal prediction, which is defined for a set of subjects with longitudinal outcomes:
where we denote by and the predicted and the true -th longitudinal outcome of subject , respectively. We compute MSE on in-sample subjects (MSE) and on out-of-sample subjects (MSE).
MSE measures the difference between the estimated and the true Cox PH slope coefficients, on a set of subjects with observations:
where and are the estimated and the true Cox PH slopes for latent class , and where and are the predicted and the true latent class membership for the -th observation of subject , respectively.
It is worth emphasizing that JLCM uses extra information, such as the longitudinal outcomes and time-to-event, to predict latent class membership for in-sample subjects. The quality of latent class membership prediction directly affects that of time-to-event and longitudinal outcomes predictions, and thus JLCM is advantaged compared to JLCT for in-sample performance. A comparison between the two is only fair on out-of-sample subjects, since the longitudinal outcomes and time-to-event are no longer available to JLCM at prediction time, and therefore the two methods use the same amount of information for prediction. In view of this, we focus on comparing the out-of-sample measures in this section, and present the in-sample prediction results in Appendix D.
Figures 1(a) to 1(c) show the boxplots of ISE, MSE, and MSE respectively, on scale for , light censoring, and Weibull-I distributions, for 16 combinations of latent class structure and concentration level: Tree, Linear, Nonlinear, Asymmetric. The experiments are repeated 100 times for JLCT, and 20 times for JLCM under each setting. (Since JLCM is prohibitively time-consuming, we only perform 20 runs per setting.) We only consider large concentration levels , where the majority class of a particular subject is dominating the remaining classes in probability that the subject will actually be a member of that class. The results for other baseline hazard distributions and performance measures are given in Appendix D.
Figure 1(a) shows that the accuracies of survival predictions of JLCT (Methods 2 & 3) and JLCM (Method 4) are comparable, with JLCT being slightly worse, under the settings where latent classes are generated from probabilistic models (). In particular, when the latent class structure comes from a linear separation (Linear) and therefore the simulated data matches exactly the model of JLCM, JLCT still performs comparably to JLCM (e.g. , “Linear”). The gap between JLCT and JCLM decreases as the concentration level increases from to . In the case where and thus latent classes are generated by a deterministic partitioning, JLCT is more effective than JLCM in all settings except nonlinear partitioning. In particular, when the latent class model follows the deterministic partitioning by a tree (,“Tree”) and thus matches the model of JLCT, JLCT outperforms JLCM by a significant margin. Furthermore, JLCM tends to have higher variance in ISE values, due to occasionally unconverged JLCM runs.
A similar pattern appears for MSE in Figure 1(b), with JLCT and JLCM comparable for most settings, and JLCT significantly outperforming JLCM when the simulated data come from the underlying model of JLCT. Regarding estimation accuracy of Cox PH slopes (Figure 1(c)
), the same pattern shows up here as well, although there exist more outliers in all methods.
Methods 2 and 3 give very similar results. Indeed, the constructed JLCT trees of the two methods coincide, and only the survival models that are fitted afterwards differ. Method 2 assumes a single baseline hazard function shared across terminal nodes, which agrees with the true data generating scheme. Nevertheless, Methods 2 and 3 have almost identical out-of-sample performances (ISE and MSE), which suggests that introducing additional parameters for node-specific baseline hazard functions does not hurt the prediction performance for the simulated data. In practice, it is often unclear whether the latent classes share a single baseline hazard function. Our results suggest that in such a case, we can assume a separate baseline hazard function for each terminal node, without worrying about over-fitting the data. In later simulations and real application, we only consider the Cox PH model with separate baseline hazard functions across terminal nodes.
Table 1 shows the average running time of JLCT (over 100 runs) and JLCM (over 20 runs) under the same settings as the plots. The running time of JLCT includes constructing the tree and fitting the two survival models of Methods 2 and 3. The running time of JLCM includes fitting with all numbers of latent classes . Clearly, JLCT is orders of magnitude faster than JLCM across all settings: JLCT completes one run within a minute, while JLCM typically takes 25 to 45 minutes. The running time for other censoring levels and baseline hazard distributions are similar. All of the simulations (including those in Section 4) are performed on high performance computing nodes with 3.0GHz CPU and 62 GB of memory.
The simulation results demonstrate that when all covariates are time-invariant, which satisfies the assumptions of JLCM, JLCT is comparable to JLCM in prediction performance under most simulation settings. Furthermore, as the latent class membership model approaches a deterministic partitioning, JLCT can significantly outperform JLCM in certain settings. More importantly, JLCT takes much less time to fit, which enables fast model selection for even better performance. In the next section, we study the scenario where all covariates are time-varying, and thus JLCM does not apply directly, and demonstrate that JLCT gains additional advantage over JLCM by fully exploiting the time-varying nature of the data.
|Tree||0.5||34.37 (2.45)||1853.02 (301.94)|
|0.7||28.27 (2.22)||2151.44 (426.60)|
|0.85||19.16 (1.92)||1921.36 (381.52)|
|1||9.66 (0.46)||1691.30 (261.07)|
|Linear||0.5||46.48 (3.91)||1674.99 (136.63)|
|0.7||38.86 (3.43)||2170.75 (369.54)|
|0.85||28.62 (2.80)||1826.12 (510.56)|
|1||30.97 (3.79)||1794.20 (369.24)|
|Nonlinear||0.5||36.74 (3.09)||1565.98 (368.03)|
|0.7||28.68 (2.41)||1627.64 (260.65)|
|0.85||31.02 (2.75)||1744.02 (251.78)|
|1||19.50 (2.60)||3302.90 (358.73)|
|Asymmetric||0.5||35.26 (2.61)||1555.53 (382.61)|
|0.7||31.46 (2.64)||1680.21 (309.16)|
|0.85||22.99 (2.90)||1781.39 (196.74)|
|1||9.62 (1.26)||2757.40 (627.39)|
The average running time in seconds (standard deviation in parentheses) on a dataset with time-invariant covariates,, light censoring, and Weibull-I distribution.
4 Simulation results: time-varying covariates
In this section, we compare JLCT with JCLM under the general scenario where all covariates are time-varying. JLCT can work with time-varying covariates in a straightforward way, but JLCM can only use a subset of available time-varying data, which limits its prediction performance.
The setup greatly resembles that of the time-invariant scenario in Section 3, except that all covariates are now time-varying. As a result, subject might change from one latent class to another at time , which directly affects the generation of longitudinal outcomes and time-to-event. In Appendix C we give a full description of the setup with time-varying covariates.
4.1 Methods and measures
Given the simulated data, JLCT and JLCM use the same choice of covariates in each modeling component (, , , ) as in the time-invariant scenario. We can directly fit JLCT to the simulated data using the time-varying covariates. On the other hand, JLCM does not support time-varying covariates in either the latent class membership model or the time-to-event model. If a time-varying covariate is used in either model, by default JLCM will take the first encountered value of each subject and use it as if the covariate is time-invariant; that is, it takes the value as a baseline value and uses that. We denote by and the first encountered values of the time-varying covariates and . Thus we can still fit JLCM to the simulated data, but JLCM automatically replaces with .
Since JLCM only uses the first encountered values per subject for covariates , it is not clear whether differences in performance of JLCM and JLCT are due to the additional information contained in later values of time-varying covariates, or because of the difference in the methodology itself. For the purpose of decomposing the difference, we also examine a version of JLCT that is restricted to using and . To summarize, we fit the following five methods to the simulated data:
A baseline model that treats the entire dataset as one latent class. This is equivalent to JLCT with stopping threshold .
A JLCT model, using converted time-invariant covariates and to model latent class and time-to-event, while using to model longitudinal outcomes.
A JLCT model, using converted time-invariant covariates to model latent class, while using , to model time-to-event and longitudinal outcomes.
A JLCT model, using original time-varying covariates for all of the modeling components. This is the default JLCT.
JLCM, using converted time-invariant covariates and to model latent class and time-to-event, meanwhile using to model longitudinal outcomes. This is the default JLCM, and it uses the same amount of information as the second JLCT (Method 2). In JLCM, we allow class-specific coefficients for in the survival model, as well as class-specific Weibull baseline risk functions. The optimal number of latent classes is chosen from using BIC.
All JLCT models use stopping threshold and are pruned to have no more than 6 terminal nodes. Once a tree is constructed, we fit an extended Cox model with different hazard functions and different slopes across terminal nodes. We use the same prediction procedure and performance measures as in the time-invariant scenario to compare these five methods.
Figures 2(a) to 2(c) show the boxplots of ISE, MSE, and MSE respectively, on scale for , light censoring, and Weibull-I distributions, for 16 combinations of latent class structure and concentration level: Tree, Linear, Nonlinear, Asymmetric . The experiments are repeated 100 times for JLCT, and 20 times for JLCM under each setting. The results for other baseline hazards distributions and performance measures are given in Appendix E.
Figure 2(a) demonstrates a clear decomposition of the differences in ISE between JLCT and JLCM. The first JLCT method with no split serves as a benchmark. The second JLCT method uses the converted “time-invariant”(baseline) data, and it has slightly larger ISE values than JLCM (Method 5), which uses the same converted data. If we allow JLCT to use the original time-varying covariates in the survival model (Method 3), then the ISE becomes smaller than that of JLCM. This demonstrates the potential of time-varying covariates in making accurate survival predictions. When we further allow JLCT to use the original time-varying covariates in the class membership model as well (Method 4), the ISE decreases even more and becomes significantly better than that of JLCM. That is, if each subject is allowed to switch between latent classes throughout the time of study, and the estimated membership is also allowed to switch, we can achieve additional improvement in survival predictions when such switching actually occurs in the population.
The ISE of the default JLCT (Method 4) improves when the latent class membership becomes less noisy, i.e. increases, and JLCT is much more favorable than JLCM when the latent classes are generated by a nearly deterministic partitioning (). In particular, when the partitioning is deterministic () and the underlying structure is a tree (Tree, and Asymmetric Tree), JLCT is expected to perform well since data are generated according to its underlying model, and does indeed outperform JLCM by a much larger margin in deterministic tree setups.
Similar patterns appear in MSE (Figure 2(b)), and MSE (Figure 2(c)), where the default JLCT (Method 4) outperforms all other methods when the concentration level is reasonably large (). The other two measures have less variability relative to overall levels than does ISE.
We report average running times of the default JLCT (Method 4) and JLCM (Method 5) in Table 2. The running times of all three JLCT methods (Methods 2 to 4) are comparable, thus we only keep the default JLCT (Method 4) in our comparison. Table 2 shows again that JLCT is orders of magnitude faster than JLCM. Compared to the time-invariant scenario (Table 1), the change in JLCT’s running time is negligible, while JLCM takes much longer to run. This observation is surprising, since internally JLCM still fits a time-invariant dataset. The longer running time of JLCM suggests that the data generated by time-varying covariates contain more complex signals, which results in longer running times in fitting the (incorrect) JLCM model.
|Tree||0.5||42.66 (2.43)||1928.86 (506.13)|
|0.7||33.92 (4.33)||1863.05 (481.95)|
|0.85||20.62 (3.04)||2037.55 (564.91)|
|1||7.96 (0.58)||2189.12 (468.41)|
|Linear||0.5||54.18 (4.75)||2109.98 (313.72)|
|0.7||42.78 (3.24)||2114.97 (479.38)|
|0.85||35.23 (4.21)||2140.96 (269.62)|
|1||27.97 (3.83)||2366.95 (718.97)|
|Nonlinear||0.5||40.54 (4.86)||2044.40 (201.28)|
|0.7||33.25 (3.09)||2142.03 (177.44)|
|0.85||30.70 (6.02)||2150.83 (249.17)|
|1||18.87 (2.72)||2175.97 (455.68)|
|Asymmetric||0.5||40.60 (5.31)||2154.53 (240.17)|
|0.7||32.34 (2.47)||2219.78 (207.58)|
|0.85||25.59 (3.06)||2233.15 (219.98)|
|1||7.00 (1.05)||2165.21 (43.30)|
In this section, we apply JLCT to a real dataset, the PAQUID dataset, which was also examined in Proust-Lima et al. (2017), and compare its performance to that of JLCM.
We use the PAQUID (Personnes Agees Quid) dataset from the R package lcmm. The provided PAQUID dataset consists of 2250 records of 500 subjects from the PAQUID study (Letenneur et al., 1994). The PAQUID dataset collects five time-varying values (MMSE, IST, BVRT, HIER, CESD) along with age at visit (age). The event in the dataset is the dementia diagnosis (dem), and the time-to-event is the age at dementia diagnosis or last visit, agedem. The PAQUID dataset also collects three time-invariant covariates: education (CEP), gender (male), and age at the entry of the study (age_init). As suggested by Proust-Lima et al. (2017), we normalize the highly asymmetric covariate MMSE and only consider its normalized version normMMSE; we also construct a new covariate . Our goal for this dataset is to jointly model the trajectories of normMMSE (longitudinal outcomes) and the risk of dementia (time-to-event), using the remaining covariates.
We consider two JLCM and three JLCT models:
(JLCM) We adopt the time-invariant JLCM model in Proust-Lima et al. (2017): The trajectories of normMMSE depend on fixed effects age65, , CEP, male, and random effects age65, . The risk of dementia depends on CEP, male, with class-specific Weibull baseline hazards function. The class membership is modeled by CEP, male.
(JCLM) We extend JLCM to using additional covariates: the survival model uses covariates CEP, male, age_init, , , , ; the class membership model uses covariates CEP, male, , , , , . The rest of the model remains the same as in the time-invariant JLCM. Note that JLCM automatically uses the first encountered value of any time-varying covariate .
(JLCT) The first JLCT model uses the same sets of covariates as JLCM.
(JCLT) The second JLCT model uses the same sets of covariates as JLCM. In particular, JLCT also uses for any time-varying covariate .
(JLCT) The last JLCT model adopts the same sets of covariates as JLCM, but using the original values of any time-varying covariate: CEP, male, age_init, BVRT, IST, HIER, CESD; and CEP, male, age65, BVRT, IST, HIER, CESD. The rest of the model remains the same as in the time-invariant JLCM. JLCT is our main model with no comparable competitors.
For the two JLCM models, the number of latent classes is chosen from 2 to 6 according to the BIC selection criterion. For the three JLCT models, we set the stopping threshold to and prune the trees to have no more than 6 terminal nodes. The survival models of all five methods assume class-specific baseline hazards functions and the same slope coefficients across classes, which is adopted by Proust-Lima et al. (2017).
We use the root mean squared error (RMSE) to measure the accuracy of the predicted longitudinal outcomes. To evaluate the accuracy of the time-to-event predictions, we take the commonly used measure, the Brier score and its integrated version, IBS (Graf et al., 1999). The Brier score (BS) at a fixed time is defined as
where is the predicted probability of survival at time conditioning on subject ’s predictor vector , and is the time-to-event of subject . The Integrated Brier score (IBS) is therefore defined as
We compute the accuracy measure on out-of-sample subjects using 10-fold cross-validation as follows. We first randomly divide the dataset into 10 folds of equal size, where we take care such that observations of a single subject belong to the same fold. Next, we hold out one fold of data and run the model on the remaining nine folds. The performance of the model is then evaluated on the held out data. The procedure is repeated 10 times, where each of the 10 folds is used for out-of-sample evaluation.
We report the average prediction measure and running time over the 10 folds in Table 3. When using only time-invariant covariates, JLCT performs similarly to its counterpart JLCM in prediction accuracy (IBS and RMSE). By adding four “time-invariant” covariates (which are converted from time-varying ones) to the class membership and survival models, the performance of JLCT remains similar, but the performance of JLCM becomes much worse, mainly because JLCM failed to converge when optimizing the log-likelihood function. When using the original time-varying covariates in the class membership and survival models, however, JLCT improves its time-to-event prediction accuracy and outperforms all other methods by a significant margin on that measure. When we look at the running time, JLCT is much faster than JLCM: fitting using JLCT took no more than 2 minutes even for the most complex model (JLCT), while fitting using JLCM took from 40 to 60 minutes. The experiments are performed on a desktop with 2.26GHz CPU and 32GB of memory.
The results in Table 3 demonstrate two key advantages of the tree-based approach JLCT over the parametric JLCM: JLCT is capable of providing significantly better prediction performance with the use of time-varying covariates in all of its modeling components, and it can be orders of magnitude faster to fit JLCT than to fit JLCM.
Figures 3(a), 3(b), and 3(c) give the JLCT tree structure from JLCT, JLCT and JLCT respectively, which are fit using the entire PAQUID dataset. The numbers in each box display the test statistics TS, and the proportion of observations contained in the current node. We make the following observations:
When fitting with only time-invariant covariates CEP and male, JLCT first splits into CEP and CEP, then splits on gender within the node of . Two of the three terminal nodes have final TS greater than the stopping criterion, 3.84, which indicates potential association between longitudinal and survival data within these two nodes. However, since CEP and male take binary values , JLCT cannot split further. Thus, it seems unlikely that using only time-invariant covariates provides adequate fit for these data.
JLCT uses more splitting covariates, CEP, male, , , , , , with some of the covariates converted from time-varying ones. JLCT makes multiple splits on , and then makes a final split on . With additional covariates to split on, JLCT ends up with five terminal nodes, each having a test statistic less than 3.84, and thus JLCT has uncovered a good partitioning in the sense that the terminal nodes lack evidence of association between longitudinal and survival outcomes. However, the prediction performance of JLCT shows no significant improvement over JLCT based on cross-validation, suggesting that the JLCT models have reached a limit with only time-invariant covariates to use.
JLCT uses the original time-varying values of the covariates in JLCT. The tree splits into three nodes based only on age (splitting at ages 82 and 90), suggesting that people transition into different dementia statuses as they get older, which are reflected in both cognitive test score (normMMSE) and time until a dementia diagnosis. Furthermore, all three nodes obtain a final test statistic less than 3.84, which indicates a good partitioning of the population in terms of preserving conditional independence within groups. Finally, with time-varying latent class memberships and time-varying covariates in the survival models, JLCT achieves significant improvement in prediction performance of time to dementia diagnosis.
In this paper, we have proposed a tree-based approach (JLCT) to jointly model longitudinal outcomes and time-to-event with latent classes. Simulations and real application on the PAQUID dataset show that when covariates are time-invariant, JLCT performs comparably to the common parametric joint latent class modeling approach JLCM. When the covariates become time-varying, JLCT makes full use of the time-varying information and can demonstrate significant advantage over JLCM, which can only use a subset of the available time-varying data. In addition, JLCT is orders of magnitude faster than JLCM under both scenarios.
There are several interesting extensions of the JLCT method that could be explored. The PAQUID data set discussed in Proust-Lima et al. (2017) and in Section 5 is actually one exhibiting competing risks, since there is a risk of death before dementia is exhibited, but this was ignored in the analysis. It would be of interest to generalize JLCT to account for this. Often survival values are only known to within an interval of time (interval-censoring), and JLCT could be adapted to that situation as well. In addition, the analysis here only allows for one longitudinal outcome, but sometimes several biomarkers are available for a patient, and it would be useful to generalize JLCT to allow for that.
An R package, jlctree, that implements JLCT is available at CRAN. Appendix F provides code illustrating its use.
- Andersen and Gill  P. K. Andersen and R. D. Gill. Cox’s regression model for counting processes: A large sample study. The Annals of Statistics, 10(4):1100–1120, 1982.
- Bates et al.  D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
- Breiman et al.  L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, USA, 1984.
- Cox  D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B, 34(2):187–220, 1972.
- Crowley and Hu  J. Crowley and M. Hu. Covariance analysis of heart transplant survival data. Journal of the American Statistical Association, 72(357):27–36, 1977.
- Dekker et al.  F. W. Dekker, R. De Mutsert, P. C. Van Dijk, C. Zoccali, and K. J. Jager. Survival analysis: Time-dependent effects and time-varying risk factors. Kidney International, 74(8):994–997, 2008.
- Dicker et al.  R. Dicker, F. Coronado, D. Koo, and R. G. Parrish. Principles of epidemiology in public health practice. US Department of Health and Human Services, 2006.
- Fu and Simonoff  W. Fu and J. S. Simonoff. Survival trees for left-truncated and right-censored data, with application to time-varying covariate data. Biostatistics, 18(2):352–369, 2017.
- Garre et al.  F. G. Garre, A. H. Zwinderman, R. B. Geskus, and Y. W. Sijpkens. A joint latent class changepoint model to improve the prediction of time to graft failure. Journal of the Royal Statistical Society: Series A, 171(1):299–308, 2008.
- Graf et al.  E. Graf, C. Schmoor, W. Sauerbrei, and M. Schumacher. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine, 18(17-18):2529–2545, 1999.
- Hajiro et al.  T. Hajiro, K. Nishimura, M. Tsukino, A. Ikeda, and T. Oga. Stages of disease severity and factors that affect the health status of patients with chronic obstructive pulmonary disease. Respiratory Medicine, 94(9):841–846, 2000.
- Hastie et al.  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
- Henderson et al.  R. Henderson, P. Diggle, and A. Dobson. Joint modelling of longitudinal measurements and event time data. Biostatistics, 1(4):465–480, 2000.
- Jongerden et al.  I. P. Jongerden, B. Speelberg, C. L. Satizábal, A. G. Buiting, M. A. Leverstein-van Hall, J. Kesecioglu, and M. J. Bonten. The role of systemic antibiotics in acquiring respiratory tract colonization with gram-negative bacteria in intensive care patients: A nested cohort study. Critical Care Medicine, 43(4):774–780, 2015.
- Kovesdy et al.  C. P. Kovesdy, J. E. Anderson, and K. Kalantar-Zadeh. Paradoxical association between body mass index and mortality in men with CKD not yet on dialysis. American Journal of Kidney Diseases, 49(5):581–591, 2007.
- Laird and Ware  N. M. Laird and J. H. Ware. Random-effects models for longitudinal data. Biometrics, 38(4):963–974, 1982.
- Letenneur et al.  L. Letenneur, D. Commenges, J.-F. Dartigues, and P. Barberger-Gateau. Incidence of dementia and Alzheimer’s disease in elderly community residents of south-western France. International Journal of Epidemiology, 23(6):1256–1261, 1994.
- Lin et al.  H. Lin, B. W. Turnbull, C. E. McCulloch, and E. H. Slate. Latent class models for joint analysis of longitudinal biomarker and event process data: Application to longitudinal prostate-specific antigen readings and prostate cancer. Journal of the American Statistical Association, 97(457):53–65, 2002.
- Munoz-Price et al.  L. S. Munoz-Price, J. F. Frencken, S. Tarima, and M. Bonten. Handling time-dependent variables: Antibiotics and antibiotic resistance. Clinical Infectious Diseases, 62(12):1558–1563, 2016.
- Proust-Lima and Taylor  C. Proust-Lima and J. M. Taylor. Development and validation of a dynamic prognostic tool for prostate cancer recurrence using repeated measures of posttreatment PSA: A joint modeling approach. Biostatistics, 10(3):535–549, 2009.
- Proust-Lima et al.  C. Proust-Lima, P. Joly, J.-F. Dartigues, and H. Jacqmin-Gadda. Joint modelling of multivariate longitudinal outcomes and a time-to-event: A nonlinear latent class approach. Computational Statistics & Data Analysis, 53(4):1142–1154, 2009.
- Proust-Lima et al.  C. Proust-Lima, M. Séne, J. M. Taylor, and H. Jacqmin-Gadda. Joint latent class models for longitudinal and time-to-event data: A review. Statistical Methods in Medical Research, 23(1):74–90, 2014.
- Proust-Lima et al.  C. Proust-Lima, V. Philipps, and B. Liquet. Estimation of extended mixed models using latent classes and latent processes: The R package lcmm. Journal of Statistical Software, 78(2):1–56, 2017.
- Proust-Lima et al.  C. Proust-Lima, V. Philipps, A. Diakite, and B. Liquet. lcmm: Extended Mixed Models Using Latent Classes and Latent Processes, 2018. URL https://cran.r-project.org/package=lcmm. R package version: 1.7.9.
- Rizopoulos  D. Rizopoulos. JM: An R package for the joint modelling of longitudinal and time-to-event data. Journal of Statistical Software, 35(9):1–33, 2010.
- Therneau  T. M. Therneau. A Package for Survival Analysis in S, 2015. URL https://CRAN.R-project.org/package=survival. version 2.38.
- Therneau and Grambsch  T. M. Therneau and P. M. Grambsch. Modeling Survival Data: Extending the Cox Model. Springer, New York, 2000. ISBN 0-387-98784-3.
- Tsiatis and Davidian  A. A. Tsiatis and M. Davidian. Joint modeling of longitudinal and time-to-event data: An overview. Statistica Sinica, 14(3):809–834, 2004.
- Tsiatis et al.  A. A. Tsiatis, V. Degruttola, and M. S. Wulfsohn. Modeling the relationship of survival to longitudinal data measured with error. Applications to survival and CD4 counts in patients with AIDS. Journal of the American Statistical Association, 90(429):27–37, 1995.
- Wulfsohn and Tsiatis  M. S. Wulfsohn and A. A. Tsiatis. A joint model for survival and longitudinal data measured with error. Biometrics, 53(1):330–339, 1997.
Appendix A Joint Latent Class Models (JLCM)
In this section, we give a brief introduction to JLCM, and discuss its strengths and weaknesses. More details about JLCM can be found in Proust-Lima et al. .
In JLCM, the latent class membership for subject is determined by the set of covariates , ( must be time-invariant in JLCM, so we drop the time indicator ), through the following probabilistic model:
where are class-specific intercept and slope parameters for class .
The longitudinal outcomes in JLCM are assumed to follow a slightly different linear mixed-effects model than in (1):
where is the fixed effect vector for class , and is the random effect vector for subject and class . The random effect vector is independent across latent classes and subjects, and normally distributed with mean and variance-covariance matrix . The errors are assumed to be independent and normally distributed with mean and variance , and independent of all of the random effects as well. Let denote the likelihood of longitudinal outcomes given that subject belongs to latent class .
The time-to-event is considered to follow the proportional hazards model with time-invariant covariates :
where parameterizes the class-specific baseline hazards , and is associated with the set of covariates (we drop the time indicator from since it must be time-invariant in JLCM). Let denote the survival probability at time if subject belongs to latent class . Note that the extended Cox model (2) of JLCT extends the proportional hazards model (4) to allow for time-varying covariates.
Let , , , , , , , , : , be the entire vector of parameters of JLCM. These parameters are estimated together via maximizing the log-likelihood function
The log-likelihood function above uses the assumption that conditioning on the latent class membership (), longitudinal outcomes () and time-to-event () are independent.
As mentioned in the introduction, the concept of latent class membership is of particular interest in clinical studies. JLCM is designed to give parametric descriptions of subjects’ tendency of belonging to each latent class, and therefore JLCM is a suitable model when the true latent class is indeed a random outcome with unknown probabilities for each class. The multinomial logistic regression that JLCM uses is a flexible tool to model these unknown probabilities.
Despite the usefulness of latent classes, JLCM has several weaknesses. First of all, the running time of JLCM does not scale well due to its complicated likelihood function. Simulation results show that the running time of JLCM increases exponentially fast as a function of the number of observations, the number of covariates, and the number of assumed latent classes. Secondly, the modeling of time-to-event in JLCM is restricted to the use of time-invariant covariates. However, time-varying covariates are helpful in modeling the time-to-event, especially when treatment or important covariates change during the study, for instance the patient receives a heart transplant [Crowley and Hu, 1977], or the longitudinal CD4 counts change during the study of AIDS [Tsiatis et al., 1995]. Research shows that using time-varying covariates can uncover short-term associations between time-to-event and covariates [Dekker et al., 2008, Kovesdy et al., 2007], and ignoring the time-varying nature of the covariates will lead to time-dependent bias [Jongerden et al., 2015, Munoz-Price et al., 2016]. The other restriction of JLCM is that the latent class membership model only uses time-invariant covariates, which implies that the latent class membership of a subject is assumed to be fixed throughout the time of study. However, the stage of a disease of a patient is very likely to change during the course of clinical study, for instance the disease would move from its early stages to its peak, and then move to its resolution. When the goal of joint modeling is to uncover meaningful clustering of the population that leads to definitions of disease stages, it is important to allow time-varying covariates in the latent class membership model, so that the model reflects this real world situation.
Appendix B Simulation setup: time-invariant covariates
In this section we give details of the data generating scheme in the simulation study of time-invariant covariates only (Sections 3).
At each simulation run, for each subject we randomly generate five independent, time-invariant covariates . We assume there are four latent classes , which are determined by covariates . Once the latent classes are determined for each subject , the time-to-event and the longitudinal outcomes are conditionally independent given the latent classes, and therefore generated separately. In particular, the survival outcomes (time-to-event) depend on , and the longitudinal outcomes depend on the latent classes. We give more details below.
At each simulation run, for each subject we draw , , , , uniformly from , , , , and respectively.
b.2 Latent classes
We determine class membership based on a multinomial logistic model of , with increasing level of concentration on one class. Our latent class membership generation model matches the setup of JLCM, and it approaches the setup of JLCT as the concentration level approaches .
For subject , we compute the value of a “score” function , where denotes the parameters associated with latent class , and denote the first two covariates of subject . The latent class membership for subject , , is generated according to two key values: the “majority” class , and the “concentration” level .
Define the “majority” class as the latent class with largest score for sample ,
We consider four types of score functions , which correspond to four underlying structures of latent classes: tree partition, linear separation, non-linear separation, and asymmetric tree partition. The structure is reflected by the dependency of on , which is shown in Figure 1.
The latent class membership of subject is drawn according to the probabilities
where the parameter is chosen such that the probability of “major” class is approximately equal to a pre-specified concentration level . That is . In particular, when , , and therefore the latent class membership is randomly determined and independent of . On the other hand, when , , and therefore the latent class membership corresponds to a deterministic partitioning based on , which is consistent with the assumptions underlying a tree partitioning.
The choices of parameters are given below.
Tree. Consider the following coefficients :
Define the score function
and let denote the latent class with largest score for sample . See Figure 0(a) for the dependency of on . The latent classes are drawn according to the probabilities
where the parameter is chosen such that , and is some pre-specified level . In particular, when , ; when , .
Linear. Consider the following coefficients drawn randomly from the unit sphere,
Define the score function
and let . See Figure 0(b) for the dependency of on . The latent classes are drawn according to the following probabilities
with again chosen to control the value of .
Nonlinear. We can skip the step of defining the score function and the value, but directly work with and . For each observation, its “most likely” latent class is determined by whether belongs to the circles centered at and with radius :
See Figure 0(c) for visualization of . The latent classes are drawn according to the following probabilities:
Asymmetric. We can skip the step of defining the score function and the value, but directly work with and . For each observation, its “most likely” latent class is determined by the following asymmetric tree:
See Figure 0(d). The latent classes are drawn according to the following probabilities:
The survival time (time-to-event) of subject follows the proportional hazards model
where the slope coefficients depend on latent class for subject .
We use three different distributions for baseline hazards : exponential, Weibull with decreasing hazards with time (Weibull-D), and Weibull with increasing hazards with time (Weibull-I). We select parameters and slopes such that the mean values of survival time across latent classes remain similar across different distributions. The distributions and corresponding parameters for generating time-to-event data are listed below.
Exponential with , and slopes are
Weibull distribution with shape parameter , which corresponds to decreasing hazards with time (Weibull-D). The scale parameter is , and slopes are
Weibull distribution with shape parameter , which corresponds to increasing hazards with time (Weibull-I). The scale parameter is , and slopes are
Left truncation times are generated independently from uniform
. Right censoring times are generated independently from an exponential distribution, with parameters chosen to reflect light censoring (approximately 20% observations are censored), and heavy censoring (approximately 50% observations are censored).
b.4 Longitudinal outcomes
The longitudinal outcome comes from the following linear mixed-effects model: for subject at time , let denote the latent class membership, and
where , and are class-specific random intercepts. We assume each subject is measured at its entry (left truncation) time, together with multiple intermediate measurement times , etc. The number of intermediate measurements for subject is generated independently from , thus each subject has at least 2 measurements, and has 3 measurements on average. The intermediate measurement time is then sampled independently and uniformly between subject ’s left-truncated and right-censored time, . Finally, the data are converted to the LTRC format.
Observe that covariates and determine the latent classes, and thus affect the time-to-event and longitudinal outcomes . Therefore, time-to-event is correlated with if and are unknown. On the other hand, conditioning on one of the four latent classes , time-to-event and longitudinal outcomes are independent: the former follows a class-specific proportional hazards model and only depends on , while the latter is a constant plus random noise. Therefore, the simulated data satisfy the conditional independence assumption made by both JLCM and JLCT.
b.5 Methods and predictions
Given the simulated data, JLCT and JLCM use almost the same subsets of covariates in their modeling components: to model time-to-event, and for fixed and random effects to model longitudinal outcomes. JLCT uses to model latent class membership. Since tree-based approaches automatically use interactions between covariates to partition the population, for fair comparison we allow JLCM to include an additional interaction term in modeling latent classes: