Random forests (RF henceforth), introduced by Breiman (2001)
, is a non-parametric statistical learning method from the ensemble decisions trees approaches which is one of the state-of-the-art machine learning method for prediction and classification(Fernández-Delgado et al., 2014). It shows a good behavior for applications in high-dimensional settings where the number of predictors is larger than the number of observations (e.g., Cutler et al., 2007; Chen and Ishwaran, 2012).
Theoretical background of RF has also been recently improved (Scornet et al., 2015; Mentch and Hooker, 2016; Wager, 2014; Biau and Scornet, 2016). Scornet et al. (2015) proved a consistency result for RF in the context of additive regression models. Mentch and Hooker (2016) and Wager (2014)
study asymptotic normality of RF predictions and hence proposed confidence intervals for those predictions. We refer toBiau and Scornet (2016) for further reading on that matter.
However, some limitations may happen when is much larger than (Statnikov et al., 2008), i.e. . In this situation, parameters, specifically the number of variables randomly picked at each node of a tree, must be carefully tuned to control the randomness of the method (Genuer et al., 2008). Some recent improvements have been suggested to deal specifically with high-dimensional data (Zhu et al., 2015; Linero, 2018). Zhu et al. (2015)
used ideas from reinforcement learning with the tree-based model framework to focus on relevant variables.Linero (2018) developed Bayesian regression trees (Chipman et al., 1998) where sparsity was implemented by using appropriate priors.
Besides methodological developments to deal with the issue of high dimension, the situation may be improved by additional observations available in each individual from repeated measurements leading to longitudinal data. There is already a large bulk of work on RF for survival (i.e. censored) data Hothorn et al. (2005); Ishwaran et al. (2008, 2010); Steingrimsson et al. (2018). Hothorn et al. (2005)2008, 2010) introduced random survival forests by using a split criterion adapted to censored data, and then derive variable selection strategies for high-dimensional survival data. More recently, Steingrimsson et al. (2018) proposed a new survival trees and forests based on the censoring unbiased transformations theory.
Less has been done to adapt random forest approaches to repeated measurement settings. The analysis of repeated measurements requires to take into account the specific correlation structure as done with mixed effects models (Laird and Ware, 1982; Verbeke and Molenberghs, 2009). The first approaches dealing with longitudinal and clustered data involved tree-based methods (Segal, 1992; Hajjem et al., 2011; Sela and Simonoff, 2012).
Then, Hajjem et al. (2014)
proposed a method based on RF. The main idea was to iterate between the fixed part and the random part to estimate the parameters through an Expectation Maximization (EM) algorithm. All these approaches can be viewed as semi-parametric mixed effects model where the fixed effects part is modeled through tree-based methods. Of note, other approaches based on smoothing splines have been proposed for semi-parametric mixed effects model such as inJacqmin-Gadda et al. (2002); Zhang et al. (1998); Wang (1998).
The main contributions of this paper are three-fold: first we extend existing methods by using a stochastic semi-parametric mixed effects model (Section 2). Secondly, we introduce a new method based on RF to handle high-dimensional longitudinal data and derive theoretical guarantees for predictions of our model in an asymptotic framework (Section 3). Third, we compare existing methods with ours in an extensive simulation study (Section 5) and analyze a therapeutic vaccine trial in HIV-infected patients in Section 6.
All existing and proposed methods have been implemented together in an R package called longituRF111Available at https://github.com/Lcapitaine/longituRF.
2 The semi-parametric stochastic mixed effects model
Let us consider longitudinal data with individuals, the th individual having observations over time. Suppose (for all and ), the response of the th individual at time , satisfies
where is the vector of covariates, is the unknown mean behavior function, is a vector of random effects associated with the vector of covariates , is a stochastic process used to model serial correlation and denotes a measurement error. The , are independent, as well as the and the for all . We also assume that and are mutually independent.
We suppose that the
are normally distributed aswhere is a positive definite matrix; is a centered Gaussian process with covariance function depending on a parameter ; and the are normally distributed as .
We consider in model (1) that the evolution of the response variable for theth individual over time varies around a mean behavior function that is given by . These variations specify the individual trajectories around and are given by the random effects and the stochastic process for the th individual.
We suppose here that the number of covariates for the mean behavior is much larger than the total number of observations, this leads us in a high dimensional context. Zhang et al. (1998) considered a semi-parametric stochastic model close to model (1) in small dimension with a dimensional function of the time, hence (1) can be seen as a generalization of their model. Hajjem et al. (2014) considered a model similar to (1) but without the stochastic processes , and developed a method based on the EM algorithm (McCulloch, 1997) to estimate the mean behavior function with any regression method (splines, RF, kernel-based methods, etc…). Note that if the function is assumed linear then model (1) reduces to the linear stochastic model of Diggle and Hutchinson (1989).
We write model (1) in the vectorized form as follows, for all :
We suppose that the covariance matrix of the process verifies with a positive definite matrix only depending on observations time for all , . We also consider the case where depends on an additional parameter in section 3.2.
In the same spirit of Hajjem et al. (2014), we use a variant of the EM algorithm to estimate parameters. The main principle of the method is given in the Algorithm 1, while a detailed version can be found in the supplementary materials.
, estimate in the standard regression mode:
where . Then for given , and
for given , and , update , and
3.1 Mean behavior function estimation
At step 1 of Algorithm 1, we consider variance parameters known and given by estimations of the previous iteration. The mean behavior functioncan be estimated with any regression method. When is estimated with CART tree, Hajjem et al. (2011) refer to MERT (Mixed Effects Random Trees). CART (Breiman et al., 1984) consists in partitioning in a recursive way the explanatory variable space to obtain the best partition for prediction. At each step of the partitioning, the space is cut into two sub-parts. Hence, the obtained partition can naturally be associated to a binary tree which is called CART tree. Furthermore, we stress that each split is optimized among all explanatory variables and that the CART algorithm works with two steps: the maximal tree building following by the pruning step, in order to give the best predictor in terms of prediction error.
Similarly Hajjem et al. (2014) refer to MERF (Mixed Effects Random Forest) when is estimated with RF. RF are an aggregation of multiple randomized CART trees, where the aggregation consists in tacking the mean of individual trees predictions. Each tree is a maximal tree built using a random perturbation: first, it is built on a bootstrap sample of the learning set, and secondly, at each step of the partitioning, the best split is optimized among a randomly drawn subset of explanatory variables. The size of the subset of variables, often called mtry, is the most important parameter of the method. RF naturally estimate the prediction error with the Out-Of-Bag (OOB) error as the following: to predict the response of one particular observation of the learning set, only trees built on bootstrap samples not containing this observation are aggregated. Furthermore, OOB samples (made of observations not selected in bootstrap samples) are also used to compute a variable importance (VI) score. For a fixed variable, the VI score of this variable is defined as the mean increase of the error of a tree on its associated OOB sample after a random permutation of this variable values.
For the considered model (1) we denote by SMERF (Stochastic Mixed Effects Random Forests) the generalization of MERF that we propose.
When the mean behavior function is estimated with a CART tree , Sela and Simonoff (2012) proposed to use the partition associated with to fit a linear mixed effects model. Let the indicator matrix defined as where is the th leaf of the tree . Considering the following model
the estimation of the values of the associated leafs of at step 1 are given by
with for all .
With this method, the values associated with the leafs of are computed by taking in account intra-individual covariance matrix instead of taking the simple mean of values in the leaf. The fitted tree is called REEMtree.
Following this work, we propose a novel method, named REEMforest, which aggregate a collection of REEMtree. Let randomized trees , the indicator matrix associated with the th random tree and the fitted leafs of estimated with the stochastic linear mixed-effects model . The REEMforest estimator is given by the mean of the fitted REEMtree:
Finally, when the considered model is (1) we denote by SREEMforest the method with the additional estimation of the stochastic process.
Once has been computed, the predictions for the random effects and the stochastic processes for known parameters are obtained by taking their conditional expectations given the data , the best linear unbiased predictors BLUP are:
3.2 Variance components estimation
At step 2 of Algorithm 1, the estimation of the variance parameters are obtained by taking the conditional expectation of their maximum likelihood estimators given the data . Thanks to the conditional independence between the individuals we can write, for fixed the likelihood function associated to model (2) as follows:
the density function on the vector . Moreover, since , by using the independence of , and we can easily write the likelihood function as:
Using that , the maximum likelihood estimators of and are:
Because , and are unknown these estimators are not computable, this is why we take the expectation given the data . The conditional expectations of the estimators and are given in Wu and Zhang (2006):
The conditional expectation of the maximum likelihood estimator given the data is
Estimators of variance parameters , and at step 2 are given by , and .
Gaussian processes such as Ornstein-Uhlenbeck process and fractional Brownian motion have a variance-covariance function which depends on two parameters and . This covariance function can be written as with depending on . There is no analytic maximum likelihood estimator of . However, for a fixed value of , the estimation procedure described in this section holds. Thus for an ensemble of possible values of parameter, the estimator of is
where is the log-likelihood function.
4 Asymptotic analysis
4.1 Convergence of the expected outcome estimation
Given a fitted MERF forest , the associated random effects and the stochastic processes obtained with the algorithm 1, the fitted response variable for the th individual is
Considering the variance components known, the expectations of random effects and are related to the bias of by
This leads to
Under some regularity conditions on the function , thanks to Scornet et al. (2015) we conclude that .
This means that when the number of individuals is large enough, the mean fit tends to the true mean behavior .
4.2 Convergence of out-of-sample outcome predictions
The prediction of the response variable for the th individual at time is
with and the fixed and random effects explanatory variables for the th individual at time and
With this definition, if it is easy to check that
Similarly, for or the expectation of is a linear function of
. According to Scornet et al. (2015), for a new observation for the individual at time ,
5 Simulation study
5.1 Simulation model
5.1.1 Explanatory variables
In this section we present the approach used to simulate the data matrix of the explanatory variables . Our choices are motivated by the characteristics of the data coming from our application, which are transcriptomics data in a phase 1/2 vaccine trial (see Section 6 for more details).
Since we are in a high dimensional context, we assume that a large majority of variables are not associated to the response variable (also known as a sparsity
assumption). In our study, those variables are simulated as i.i.d. random draws from a multivariate Gaussian distribution, where
denotes the identity matrix of size(recall that denotes the total number of observations).
Moreover, since we deal with longitudinal data in the context of gene expression, we assume that some explanatory variables vary over time and that some explanatory variables are clustered into groups (which correspond to genes involved in the same biological pathway). Hejblum et al. (2015) highlighted ten examples of groups of genes with different temporal behaviors in the DALIA trial, and we mimic some of these trends by setting the following six behaviors over time in our simulations:
The explanatory variables with a temporal behavior are then simulated as follows:
where is the group of the th covariate ; corresponds to a random translation at the group level and is an additional time-dependent variability.
In the following, we investigate two situations with different values of the total number of variables, , as well as different sizes of each group of variables with temporal behavior.
5.1.2 Outcome variable
The two following models, which are special cases of model 2, are used to simulate the outcome variable . For all :
where with , is a Brownian motion with volatility and with . In these models, the random effects is associated with an exogenous variable , where for .
The mean behavior function depends on the dimension of the simulated data:
In the low-dimensional case (with ):
In the high-dimensional case (with and with at least 20 variables in the first two groups of explanatory variables):
where and represent two sets of 20 genes randomly picked from the group and respectively.
The mean behavior function is actually quite the same in the two situations. The difference lies in the fact that in the high-dimension case, 40 variables are related to the response variable, against 2 in the low-dimension case. It is indeed reasonable, in high-dimension, to assume that several genes coming from the same group are linked to the mean behavior function .
5.2 Squared bias and prediction error
The different methods are compared in terms of squared bias (associated to each estimated quantity) and prediction performance, computed among repetitions of the simulation.
Squared biases are defined as follows:
a fixed grid of times (different from the times of measurements),
the fitted random forest returned by the algorithm after convergence, associated with the -th repetition.
the estimation of on the -th repetition and similarly for and .
To evaluate prediction performance, data associated to one simulation are randomly split times into a learning set and a test set. Each test set is obtained by randomly picking two measurements of each individual. With denoting the index of the -th individual measurements in the -th test set, we define the prediction error as:
where is the predicted response variable, defined in (3), of the -th measure of the -th individual, given by the fitted random forest returned by the algorithm after convergence.
The number of individuals , is fixed to (the same as in the DALIA trial) all along the simulation study, and the number of measurements for the
th individual is randomly chosen (with uniform distribution) betweenand for every , leading to an unbalanced design. We recall that the total number of observations is denoted by .
5.3.1 A low-dimensional case
We start by considering a low-dimensional example where . Hence, we have 6 explanatory variables in the dataset and all of them have a temporal behavior (respectively given by Equation (4
)). This framework allows to compare different tree-based methods as well as a linear mixed model for longitudinal data in a standard framework. First, we simulate one dataset under model (5) using the mean behavior function defined in Equation (7) and study the behavior of the MERF method on that dataset.
Figure 1 shows that the convergence of the EM algorithm for the MERF method is quite affected by the randomness aspect of the random forests. Indeed, the MERF method converge to the maximum likelihood only for high values of the mtry parameter (the number of variables randomly drawn before optimizing the split of a node of a tree composing the RF). It is expected that in an iterative algorithm such as the EM algorithm, an update with too much randomness compromises the convergence of the algorithm. Hence, we advocate for large values of mtry in those EM-based RF methods, even if the trees in the RF are then more similar with each other.
We now simulate 100 datasets (again under model (5) with mean behavior function (7)) and study squared biases on estimations of quantities of interest (, , and when appropriate), given by the four tree-based methods (MERT and REEMtree for trees, MERF and REEMforest for forests) and also compare with the Linear Mixed Effects Model (LMEM) method. In addition, we simulate 100 additional datasets under the stochastic model (6) (still with defined by Equation (7)) and compare stochastic counterparts of the five methods described above, respectively denoted by: SMERT, SREEMtree, SMERF, SREEMforest and SLMEM.
As shown in Table 1, LMEM, MERT and REEMtree methods are close to each other in terms of bias on while MERF and REEMforest which use random forests, provide a much better mean behavior estimation. Moreover, the squared bias on for REEMforest is about 25% lower than MERF whereas the squared bias on of REEMtree is only 4% lower than the one obtained with MERT. Hence, in this framework, taking into account the intra-individual covariance structure to evaluate the tree leafs values generates a much more important decrease of the squared bias on when random forests are used compared with CART. Furthermore, the squared bias obtained on the random effects covariance matrix and the residual variance parameter are lower for all four tree-based methods compared to LMEM, with forests estimating much better than trees. Finally REEMforest gives slightly lower bias than MERF.
For stochastic methods, tree methods (SMERT and SREEMtree) led to a higher squared bias of and lower squared bias of the estimated parameters compared to SLMEM. However, forests methods (SMERF and SREEMforests) still perform better than trees and SLMEM with squared biases almost all much lower.
Finally, we compare the different methods on their prediction capacity by computing prediction errors on 100 simulated datasets, either under model (5) or (6). For each dataset, a test set if formed by randomly picking, for each individual , two observations among its observations. This gives a test set containing observations and a learning set with observations. Breiman’s RF is also included in this study in addition to the five methods already mentioned to illustrate the gain of tacking into account the intra-individual correlation.
As expected, left part of Figure 2 shows that Breiman’s RF performed the worse in terms of prediction error, because it assumes that all observations are independent, and that LMEM reached a poor prediction ability, because is not linear. MERT and REEMtree gave intermediate performance, whereas MERF and REEMforest reached the lowest prediction errors, with similar performances. Overall, those comments remains valid for the stochastic case, illustrated on the right part of Figure 2.
As a conclusion, we demonstrate the benefits of RF approaches for longitudinal data analysis in a low-dimensional case, especially in terms of prediction error. REEMforest exhibited a slight advantage compared to MERF in terms of validity of the estimation of the mean behavior function and of the other parameters , and .
5.4 A high-dimensional case
For the high-dimensional context, we kept but set . We also set the size of each of the six groups containing explanatory variables with temporal behaviors (given by Equation 4) to 266, leading to a total of 1596 variables that changed over time among the 8000 variables in the dataset.
First of all, according to Figure 1 for the low-dimensional case and some preliminary experiments, we fix the mtry parameter of RF to in all RF runs. This ensures convergence of EM-based methods and avoid a too heavy computational burden compared with a systematic optimization of mtry on several values for each RF.
We simulated 50 datatsets under model (5) (and 50 other dataset under model (6)) with the mean behavior function given by Equation (8), and computed squared biases on estimations given by the four tree-based methods: (S)MERT, (S)REEMtree, (S)MERF and (S)REEMforest. We did not compare anymore with LMEM which is not adapted to the high-dimensional setting.
For the non-stochastic scheme, the squared bias on and all parameters obtained with REEMforest were slightly lower than the one obtained with the existing MERF method. For the stochastic model (18), the two forest-based methods gave similar bias on all parameters. As in the low dimensional context, forests led systematically to lower biases on all estimations compared to trees, especially for the estimation of . However in this high-dimensional setting, (S)MERF and (S)REEMforest performed approximately the same.
We estimated prediction errors of the various methods, by randomly selecting test sets in each of the simulated datasets (in the same manner as in the low-dimensional case in the previous section). As illustrated in Figure 3, forests reached very low prediction error estimations compared to trees. This last result was expected because it is well-known that RF performs better than trees for high-dimensional data. It can also be seen that Breiman’s RF (which assume independence between all observations in the data) are competitive compared with trees, especially compared with (S)MERT. Hence, in that case, the gain of using RF instead of trees roughly compensates the fact that Breiman’s RF do not take into account the longitudinal feature of the data.
Finally, variable importance (VI) scores computed with the RF returned after convergence of the REEMforest method are plotted in decreasing order of VI in Figure 4 (only the 65 most important variables are plotted for the sake of clarity). This graph shows that the most important variables belong to one of the first three groups of explanatory variables. This result is satisfactory because the mean behavior function (defined by Equation 8) depends on variables that belongs to the first two groups only and the third group is very close to the second one in terms of dynamics (see Equations 4).
6 Application to the DALIA vaccine trial
DALIA is a therapeutic phase 1/2 vaccine trial including 19 HIV-infected patients who received an HIV vaccine before stopping their antiretroviral treatment (HAART). For a full description of the DALIA vaccine trial we refer to Lévy et al. (2014).
At each harvest time, 32979 gene transcripts were measured as well as the plasma HIV RNA concentration (which was log-transformed) for every patient. In this application, we were interested in finding the gene transcripts associated to the HIV viral load dynamics after antiretroviral treatment interruption. The analysis was performed on the 17 patients with available data at the time of treatment interruption.
Figure 5 illustrates the dynamics of the viral replication after antiretroviral treatment interruption with a large between-individuals variability.
A random intercept and a Gaussian process were included in the model:
and we will refer to methods only using a random intercept as non-stochastic methods (MERF and REEMforest in the following).
Prediction errors were evaluated with 25 training/test sets random splits. As in the simulation study, a test set was obtained by randomly drawing two observations for each individual. We chose the stochastic process (between an Ornstein-Uhlenbeck’s process and a fractional Brownian motion) that minimized the estimated prediction error. Hence, the fractional Brownian motion with Hurst exponent which is the standard Brownian motion was selected. Finally, the mtry parameter was fixed to in all experiments of this section, according to the likelihood profile (Figure 6).
As illustrated in Figure 7, Breiman’s RF were comparable in terms of prediction error with MERF and REEMforest which only included a random intercept. However, SMERF and SREEMforest outperformed simple RF and trees, with a slight advantage to SREEMforest. This confirms, in this real dataset, that taking precisely into account the longitudinal aspect of the data in RF leads to a significant drop of the prediction error.
6.1 Variable selection using random forests
Once the algorithm (e.g., SREEMforest) has converged, a variable selection process is applied to select the genes the most associated with the viral load dynamics. More precisely, the last estimations of and (which are outputs of the algorithm) are subtracted from the output variable , for all (as in step 1 of Algorithm 1) to come back to a classical regression framework (i.e., with independent observations). Hence, the Variable Selection Using Random Forests method from Genuer et al. (2010) can be apply by using the VSURF package (Genuer et al., 2015).
This method is a fully automatic variable selection procedure based on RF and designed to deal with high dimensional data in a regression framework as well as in supervised classification. It works in three steps: i) first, the variables are sorted in decreasing order of RF variable importance (VI), then a data-driven threshold is computed to eliminate variables with low VI; ii) variables left are then introduced (one by one according to the previous order) in nested RF models and the one minimizing the OOB error is selected; iii) a refined ascending sequence of RF models (obtained in a stepwise way) is then built and finally the last model of this sequence is returned.
6.2 Stability of the selected variables set
We illustrate the stability of the selected variables set by introducing a stability score and studying the behavior of this score against the RF parameter mtry.
Let and be the decreasing ordered variables respectively to the variable importance obtained with two runs of the SREEMforest method. Due to the randomness aspect of the RF, SREEMforest is random and the sequences and may be different. Hence, we introduce a stability score which measures the difference between two ordered sequence and :
This score measures the proportion of variables ranked in a same neighborhood ( handles the size of the neighborhood). To stabilize the results, the score was computed with 30 pairs of sequences and and the mean of the obtained stability scores was provided.
The computation of these stability scores was restricted to the 50 most important variables given by different runs of SREEMforest applied to the DALIA vaccine trial dataset. In Figure 8, we note that, except for mtry set to 1000, we obtained a stability score around 0.5 for a neighborhood size of 4. This means that for two lists of the 50 most important variables obtained with SREEMforest, approximately 50% of them were at the same rank ( ranks). For a neighborhood size larger than 8, the score can exceed 75%. In conclusion, for a wide range of mtry values, variable ranking results were quite consistent.
6.3 Biological results
The 21 variables selected by VSURF (after convergence of SREEMforest) were mainly transcripts (OAS, LY6E, HERC5, IFI/IFIT, EPSTI1, MX1, RSAD2, EIF2AK2, XAF1) associated to the interferon- pathway. For instance, they all belongs to the Chaussabel’s modules M1.2 and M3.4 annotated "Interferon" (Chaussabel and Baldwin, 2014). Interferon pathway is highly correlated to the viral replication as demonstrated previously (Bosinger et al., 2009). Only, two transcripts were not associated to the interferon pathway (EPSTI1 and SAMD9L). The commitment of the interferon pathway reflects the immune response to viral infection. The relevance of these results is another argument for the validation of the proposed approach.
In this article, we introduced a new RF-based methodology suited for the analysis of high-dimensional longitudinal data. We also generalized existing methods so they can now be used in the stochastic semi-parametric mixed-effects model. The simulation study revealed the superiority of our approach and of the generalizations we introduced, and the method has been applied to a complex dataset coming from an HIV vaccine trial, illustrating the effectiveness and interest of our approach in such high-dimensional longitudinal context.
One important aspect of the results of our study is the choice of the mtry parameter. Our advice is to choose a large value for mtry–roughly between and –, not smaller than , and this for two reasons: first, as we are in an (very) high-dimensional context, the number of variables selected at each node of trees must not be too small–preventing to choose only non-informative variables too often–; and second, since the different approaches are based on an iterative EM-algorithm, the inner RF model must be stabilized–preventing the EM-algorithm to diverge–. Taking a too small value for mtry could lead to the non-convergence of the method and hence to very sub-optimal results, as illustrated by Figure 6.
Another key step in these approaches is the choice of the model, and more particularly the choice of the random effects. Driven by our application, we only use a random intercept (in addition to the stochastic process) regarding the number of individuals and the number of time-points we had in the vaccine trial. However, in a context with more individuals and/or less time points, it could be interesting to add random effects on the different time points. This should make the model more flexible and hence increase the capacity of the method to well estimate the inter-individual variability.
There are several issues that require further research. First, the theoretical analysis of such complex methods (non-parametric estimates plugged into an EM algorithm) seems rather difficult and remains, to the extend of our knowledge, an open issue. Finally, for the methodological part, another way of adapting RF to longitudinal data could be to change the split criterion of a node in the tree building, as it has been done e.g., for survival RF.
, for given , and estimate on the model ; build matrices for every tree composing
for given , and , update , and
- Biau and Scornet (2016) Biau, G. and Scornet, E. (2016). A random forest guided tour. Test, 25(2):197–227.
- Bosinger et al. (2009) Bosinger, S. E., Li, Q., Gordon, S. N., Klatt, N. R., Duan, L., Xu, L., Francella, N., Sidahmed, A., Smith, A. J., Cramer, E. M., et al. (2009). Global genomic analysis reveals rapid control of a robust innate response in siv-infected sooty mangabeys. The Journal of clinical investigation, 119(12):3556–3572.
- Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
- Breiman et al. (1984) Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Chapman & Hall.