Longitudinal or clustered data, where observations within a unit (cluster) are more correlated than observations from other units (clusters), are very common in areas such as social science and medical research. Further, the data may contain a large number of correlated features relative to the number of observations (high dimensional data). The goal of this paper is to extend tree-based algorithms to high dimensional longitudinal data with correlated features and to develop a relatively interpretable data mining technique for feature selection and prediction.
Tree based algorithms began to gain momentum with the appearance of the CART (classification and regression trees) algorithm (Breiman et al., 1984) [breiman1984classification]
. They are widely used in statistical machine learning due to their interpretability, relatively high computational efficiency, and their nonparametric and nonlinear nature. Briefly, a binary decision tree-based algorithm recursively partitions the parameter space into relatively pure nodes using a splitting criterion such as entropy or Gini impurity by searching every possible variable, and each possible split point, until it meets a pre-specified stopping criteria, and builds a piece-wise model on each subset of the data. The algorithm is greedy by nature and does not take into account correlation or longitudinal structure.
Segal [segal1992tree] made the first attempt to deal with longitudinal data by using regression trees, and proposing a new split function depending on the covariance structure of multiple responses. However, this method cannot deal with time-varying covariates (only the responses, and not the covariates, vary with time in his setting) and all the observations within a unit end up in one terminal node. Mixed-effects longitudinal trees (MELT)(Cho et al., 2014) [eo2014tree] fully explore the shape of the data with respect to time by fitting low degree polynomials and splits on the coefficients. The objective of MELT is to identify different shapes of time among units. However, MELT only deals with time-invariant covariates, and is not optimized for prediction.
Sela and Simonoff (2012) [sela2012re]
proposed the RE-EM tree, which uses a random effects model to deal with longitudinal structure, where the fixed effect is modelled as a standard regression tree CART. The random effects and fixed effects are estimated alternatively, which is similar to the EM algorithm. Later, a new version of RE-EM tree was proposed by Simonoff and Fu (2015)[fu2015unbiased] where the implementation of the fixed effect was replaced by the conditional inference trees of Hothorn et al. (2006) [hothorn2006unbiased] to reduce bias. RE-EM tree can deal with time-varying covariates, and observations within a unit can end up in different terminal nodes.
The generalized linear mixed-effects model trees (GLMM tree) algorithm (M. Fokkema et al., 2017) [fokkema2018detecting] adopts a more general approach than the RE-EM tree. The GLMM tree also uses a random effects approach, but with the fixed effect modelled as a piece-wise generalized linear model, that is, as a regression model tree with a generalized linear model, instead of a constant, at each leaf. The fixed effects and random effects are estimated in an alternative way, with one estimated after the other until convergence. The GLMM tree provides more flexibility in the model of fixed effect and can be used to detect treatment effects (see 4.1). The GLMM tree approach will be discussed in more detail in 2.2.
It is known that Random Forest variable selection is biased when there is correlation among the features. Fuzzy Forests (Conn, Ramirez et al., 2015) [conn2015fuzzy] was developed to address correlation within the predictors in the setting where the number of parameters is much greater than the number of observations (). The first step in Fuzzy Forests is to explicitly cluster features using weighted correlation networks [zhang2005general] (reviewed in 2.1). Then a feature screening step is conducted within each cluster using Recursive Feature Elimination Random Forests (RFE-RFs) [diaz2006gene]. Finally a feature selection step is done within the features selected from the screening step, allowing clusters to interact with each other. The screening step and the selecting step enables Fuzzy Forests to select features in a relatively unbiased way in the presence of highly correlated features. The Fuzzy Forests methodology has been used in a number of applied research articles, for example [conn_ramirez2016, kim-alvarez-ramirez, ramirez-abrajano-alvarez2019].
This paper proposes the Fuzzy Random Effect Estimation tree (FREEtree), which takes advantage of the powerful feature selection approach of Fuzzy Forests, as well as the flexible framework of the GLMM tree, to deal with the longitudinal structure of data.
The remainder of the article is organized as follows: Section 2 reviews the building blocks of FREEtree before section 3 explains the FREEtree algorithms in detail. Section 4 provides simulation results of FREEtree on two simulated data sets, one with a time-treatment interaction and one without. Section 7 discusses future research development of FREEtree and the last section concludes the paper.
2 A review of WGCNA and GLMM tree
Weighted correlation networks (WGCNA) have been used in many applications to examine the network structure of covariates [langerfelder2008wgcna, pei2017wgcna, Gudenas2015wgcna]
. This is an unsupervised learning method. In order to construct the network, WGCNA does the following: (1) choose a similarity function for featureand , denoted by . A common choice is where Corr is the Pearson correlation. Then compute the similarity matrix . (2) Transform the similarity matrix X by the adjacency matrix where which results in a soft-thresholding network. The is chosen according to the scale-free criterion [zhang2005general].(3) Convert the adjacency matrix A to the topological overlap matrix (TOM) W through Eq.(1) where and
. (4) Use a hierarchical clustering tree algorithm to find clusters using TOM. The reason that hierarchical clustering algorithm uses TOM instead of the adjacency matrixis that using TOM may lead to more distinct modules [zhang2005general].
Weighted correlation network analysis (WGCNA)[zhang2005general] can be used for clustering covariates where covariates within each module are highly correlated and features (or g, we will often use the term genes in this paper to remain consistent with the genetics literature) from different modules are approximately uncorrelated. Covariates that are not assigned to any clusters are placed in the grey module. That is, each grey covariate in the grey module is roughly uncorrelated to any other covariates and can be viewed as a cluster on its own. Thus, note that in the context of machine learning, we can view each feature as a gene and therefore WGCNA can identify modules of highly correlated features.
2.2 GLMM tree
The rational behind the Generalized Linear Mixed-Effects Model tree (GLMM tree) [fokkema2018detecting] is that a global generalized linear mixed-effect model may not fit the data well. However, if additional splitting variables are available, we can fit the data with piece-wise models by partitioning the data with these splitting variables.
For example, suppose that in our dataset the th observation of cluster consists of covariates and response . Cluster may stand for the ith patient and , the time of the measurement. Then a global Generalized Linear Mixed-Effects model (GLMM) is given by
where is the link function and
is a vector of fixed-effect regression coefficients (as opposed to the power function described in WGCNA). For a mixed-effect model with only a random intercept,is just constant (1) and is the random intercept associated with cluster . When random slopes are involved, is the design vector which is a subset of and is the random vector with each component corresponding to the random deviation of the slope from the fixed-effect. For simplicity, we assume that the link function is the identity function and the mixed-effects model with only random intercept is adopted. That is, we are using a linear mixed-effect model with only a random intercept from now on, as the following:
In many cases, the Linear Mixed-Effect model (LMM) in Eq(3) may not fit the data well because the assumption that the underlying fixed-effect model is a linear function is too restrictive. It often makes more sense to approximate the fixed-effect structure with a piece-wise linear model instead of a global linear model. GLMM tree uses a model-based recursive partitioning (MOB) algorithm [zeileis2008model]
that partitions the dataset using splitting variables and find better-fitting local LMM models. MOB iterates the following: fit a parametric model (such as LMM) to the dataset and then adopt parameter stability tests on each of splitting variables by computing a p-value for every splitting variable. If the smallest p-value is below the significant level, the dataset is split into two subsets using the splitting variable value with the smallest p-value, with the split point for that variable chosen to minimize the instability. Therefore only significant splitting variables will be used for splitting at the node of a GLMM tree. More details of the parameter stability test are described by Zeileis [zeileis2008model]. The resulting GLMM tree has the following form,
where is the index of terminal node that th observation of cluster belongs to. Note that the fixed-effect is now a piece-wise linear function of covariates and the random intercept is global in the sense that it only depends on the cluster, instead of the terminal node. The GLMM tree is trained by iteratively estimating the fixed-effect (a linear mixed-effects tree) assuming random effects are known and estimating random effects by assuming fixed effects are known until convergence.
The R package, glmertree [fokkema2018detecting] implements the GLMM tree. In the following section, we use LMM tree for simplicity. That is, we assume the link function is the identity function. The function lmertree() is used in this package.
3 The FREEtree estimation method
The goal of FREEtree is feature selection and then using selected features to make predictions. The advantage of having fewer features is parsimony and increased interpretability. At the heart of the algorithm lies a binary decision tree splitting strategy that is easily interpretable. While CART and many other methods are usually biased towards selecting correlated features while ignoring independent ones in feature selection, FREEtree reduces this bias by clustering features by their correlation pattern and screening features within each cluster, while allowing for features to interact. The resulting features are used to fit a LMM tree, which includes a linear regression model at the end of each leaf that also considers a random effect at the patient level. The predictive power mostly comes from LMM tree, which fits the data with a piecewise linear function of covariates plus a random effect, instead of a piece-wise constant function like CART and RE-EM tree. However, in order to regress on covariates, feature selection is necessary because linear regression requires that the sample size be sufficiently larger than the number of parameters for identifiability. FREEtree integrates feature selection and prediction in a natural way and is particularly useful whenis larger than .
The training dataset consists of patients , who are measured at time . To simplify the notation, we assume balanced data here, though this is not required for the FREEtree algorithm. Each patient has three types of features:
var_select X: Features of length that will be chosen from.
fixed_regress R: Features that will be used for regression in every tree. In longitudinal settings, this could be time or higher order of time.
fixed_split S: Features that will be considered as splitting variables in every tree.
The value of features of patient at time is denoted by , and respectively. Note that var_select, fixed_regress and fixed_split can be empty. The selection of which type each features belongs in is left up to the user. The goal of FREEtree is to select important features from var_select and use the selected features as well as fixed_regress and fixed_split to give the final prediction.
3.2 The FREEtree algorithm
The FREEtree algorithm consists of a feature selection step and a prediction step. First assume that fixed_regress is not empty. The case where it is empty will be discussed in section 3.4.
The feature selection step has three steps: clustering, screening, and selection. During the clustering step, features in var_select are clustered by WGCNA into modules, which includes a grey module and non-grey modules whose number is not known a priori. The grey module includes all covariates that have low connectivity and can be viewed as roughly independent. Features within the same non-grey module are highly correlated/connected with each other and have lower correlation or connectivity with the features from other modules. Let there be modules selected by WGCNA. Denote the modules of var_select by and let so that . Without loss of generality, denote the last module as the grey module.
For the screening step, features are selected within each module as follows: For module (), use fixed_regress as regression variables and use as well as consider fixed_split for splitting variables to fit a LMM tree. The selected features from module are the set of features used in the LMM tree that are not included in fixed_split. The result of the screening step is a set of screened features .
The final selection step allows the selected features from each modules to interact with each other. FREEtree uses all of the screened features from the screening step and treats fixed_split as splitting variables, then uses fixed_regress as regression variables to fit a LMM tree. The final selected features from var_select are the features used by this LMM tree that are not included in fixed_split, denoted by .
Finally, at the prediction step, a LMM tree is fitted using fixed_split and as splitting variables and using fixed_regress and as regression variables. The prediction is provided by this final LMM tree. Note that the final selected features from var_select are used both as splitting and regression variables, which fits the data in a more flexible way than just regressing on fixed_regress.
3.3 Another strategy for feature selection
The screening and selection steps help reduce bias in feature selection by eliminating features in correlated modules and thus protecting independent features from being ignored by LMM tree. However, if the number of non-grey modules is large and there are many correlated features after screening step, the independent features are still in the danger of being ignored at the selection step. In order to help protect independent features, another strategy of feature selection is proposed, which is particularly helpful if the number of correlated feature is large compared with independent features. Users can set Fuzzy=False to use this strategy. If Fuzzy=True, the strategy in section 3.2 will be adopted.
At the screening step, features within each non-grey modules are screened into . That is, use () and fixed_split as splitting variables and use fixed_regress as regression variables to fit a LMM tree and choose features used by the tree and not contained in fixed_split. Note that for now we don’t screen within the grey module . Then we select features from within the set of screened features from the non-grey groups by using all of the screened features and fixed_split as splitting variables, and fixed_regress as regression variables to fit a LMM tree. The selection step allows the non-grey modules to interact with each other producing with for . Then we fit a LMM tree using fixed_split and features in the grey module as splitting variables, and regress on fixed_regress as well as . The set of selected features from the grey module are the ones used in this LMM tree that are not included in fixed_split, which is denoted by . The final result of feature selection is , denoted by . A final LMM tree for prediction is fitted using fixed_split and as splitting variables and using fixed_regress and as regression variables.
3.4 Use principal components in the absence of regressors
Suppose that we do not have a natural choice for fixed_regress and set it to empty. One obvious way to do feature selection and prediction is to use RE-EM tree [sela2012re] with an averaged value at each leaf instead of a linear regression model. The disadvantage is that the assumption of the underlying true model being a RE-EM tree, a piece-wise constant function plus random intercept, can be too restrictive.
It is more flexible to fit the underlying model with a piece-wise linear function in addition to a random intercept. Therefore another method that is proposed, which could have more power in feature selection and prediction, is to use the dominant principal components (PC) of the non-grey modules as intermediate regressors. The idea here is that in linear regression, using the dominant principle components as regressors has a comparable power in terms of prediction as using all the covariates as regressors, although interpretability is lost. However, FREEtree, even if it uses PCs, is still interpretable because PCs are used only in the step of feature selection and the selected features are determined by the non-terminal nodes of the tree, instead of PCs or any other regressors. The first PCs of non-grey modules are used for simplicity, though more dominant features can be used. Note that we do not use PCs of grey module since features within grey module are roughly independent and thus it is likely that there may be no dominant PCs.
For the screening step, features from non-grey modules (l=1,2,..,m-1) are selected by fitting a LMM tree using the first PC of as regression variables and use and fixed_split as splitting variables. If Fuzzy=True, for the grey module , a RE-EM tree is fitted using fixed_split and the features used in the node of RE-EM tree are selected. Denote the screened features by . For the selection step, final features are obtained by selecting from the screened features. That is, fit a RE-EM tree using and select those appeared in the nodes of the RE-EM tree. In the prediction step, a LMM tree is fitted using and fixed_split as splitting variables and as regression variables.
If Fuzzy=False, final non-grey features are obtained by selecting from screened features from non-grey modules. That is, use all the as splitting variables to fit a RE-EM tree and select features used in the node of RE-EM tree and not contained in fixed_split. Then the selected grey-features are obtained by fitting a LMM tree using the grey module and fixed_split as splitting variables and as regression variables. The final set of selected features is . The prediction is given by a LMM tree using and fixed_split as splitting variables and using as regression variables.
4.1 Design of simulations
We provide simulations to examine the utility of FREEtree in terms of feature selection, prediction and estimation of the underlying model structure. In all simulations, the training dataset has subjects (we will allow to vary) and each subject has features to be selected along with fixed_split and fixed_regress. The features, , are grouped into 4 modules , , as well as . Each feature
is generated from a multivariate normal distribution with mean 0 and variance 1. The features from different modules are uncorrelated and features within the first three modules are correlated with correlation 0.8, while features within the last module are uncorrelated. Therefore, the first three modules are called non-grey modules and the final module is the grey module, according to the conventions in WGCNA.
The first simulation includes a time by treatment interaction where different treatments corresponds to different patterns of response with respect to time. For simplicity, we assume two treatments here, and . The true model for subject at time is given by
where is the indicator function, is the error is drawn from normal distribution and is given by
Here, only 6 variables out of 300 are important. The other variables are noise. We use as fixed_split and use and () as fixed_regress. Here var_select is with important features being and . Since we have a natural choice for fixed_regress, in and , we adopt the method described in section 3.2 and section 3.3.
In a second simulation, we consider a mixed effects model given by
where and are the same as in the first simulation and is the random intercept corresponding to subject which is drawn from normal distribution with mean 0 and variance 3. Random intercepts of different subjects are independent. Since now we do not have a natural choice for fixed_regress, we adopt the method described in section 3.4. That is, during the screening step, we regress on the first principal components of non-grey modules to select features from non-grey modules.
In both simulations, a validation set of 100 subjects is used for tuning parameters and a test set of 100 subjects is used for measuring root mean squared error on future observations. The prediction does not include random intercepts because they cannot be estimated from unknown patients. The performances of Random Forests and Fuzzy Forests in the following sections are measured by running the simulation 50 times using different random seeds.
4.2 Predictive performance
In this section, we first consider the dataset with the time-treatment interaction detailed in the previous section. We compare the predictive performance of FREEtree, Random Forests, Fuzzy Forests and LMM tree. For Random Forests and Fuzzy Forests, var_select , fixed_regress and and fixed_split are used as covariates. , and are manually put into the ”grey” module in Fuzzy Forests because the time variables are uncorrelated with in the generating process and treatment is categorical which WGCNA cannot deal with directly. For LMM tree, treatment and are specified as splitting variables and , are used as the regression variables. Note that unlike FREEtree, we can not use all as regression variables because linear regression requires that the sample size be greater than the number of parameters in the linear regression model.
Fig.1 shows the results on this dataset. FREEtree outperforms other methods when the sample size is relatively large. When the sample size is relatively small, FREEtree does not have an strong advantage since it has a linear regression model at each leaf, and thus there are many more parameters to estimate, necessitating a larger sample size.
Fig.2 gives the results of the performance on the simulated dataset with only random intercepts, a special case of longitudinal structure. The RMSE of Random Forests, Fuzzy Forests, RE-EM tree and FREEtree are given. Only are used in these algorithms. This analysis shows that FREEtree has better predictive performance than other algorithms and performs better when the sample size is larger. Note, that unlike the case in the previous simulation, FREEtree does well even when is relatively small because the dataset structure here is much simpler.
4.3 Feature selection performance
In this section, we compare the performance of feature selection from FREEtree and Fuzzy Forests, which is designed for feature selection. For Fuzzy Forests, we computed the proportion of times each feature was selected as important over 50 simulation runs on the same training set with different seeds and/or tuning parameters. In each run, the top 12 features are selected in the first simulation with time-treatment interaction dataset and top 10 features are chosen in the second simulation using dataset with only random intercepts. For FREEtree, the final chosen features are presented.
In the first simulation, shown in Fig.4, where the true features are
, , and , Fuzzy Forests successfully identified
with probability 1 but missedand completely (selected 0 times) regardless of the overall sample size. Fuzzy Forests identifies with probability 1 when . As for FREEtree, since treatment, time and time2 are explicitly specified to use as splitting and regression variable respectively, we only need to examine the final selected features from var_select . Fig.3 gives results for this simulation and it shows that in this dataset FREEtree can recover the true important features when .
In the second simulation where the true generating process only includes random intercepts, the feature selection performance of Fuzzy Forests and FREEtree were also studied. Fig.6 shows the results of Fuzzy Forests, which recovers all the important features correctly. Fig.5 shows that FREEtree can also recover all the important features for all of the sample sizes tested.
4.4 Estimation of the underlying pattern
The advantage of FREEtree is not only in its in higher prediction accuracy, but also in how it fits the underlying structure due to the models at its leaves. Recall that in the first simulation, the dataset has a time-treatment interaction. That is, the treatment-time components will first drop then increase for treatment1 and will first increase and then drop for treatment2. In this section we will examine whether FREEtree can recover the true time pattern for different treatments. The underlying true pattern should have the following form:
FREEtree was able to successfully detect the time-treatment interaction in this simulation. Table 1 shows that FREEtree gives a reasonable estimation of the time pattern function. However, note that patterns like this cannot be directly observed using tree-based methods such as RE-EM tree because the leaves in RE-EM tree correspond to an averaged value instead of a model.
We illustrate a real data application of FREEtree in a wide longitudinal dataset of World Bank, IMF and Penn World Table country level economic and developmental indicators. Using the adoption of inflation targeting by a nation’s central bank as a treatment variable, we wish to predict the percentage change in a country’s consumer price index (CPI) as a measure of the inflation rate. Merging together 15 different data sources111IMF World Economic Outlook (October 2019), IMF Financial Development Index Database, Penn World Table version 9.1, and the following World Bank databases: World Development Indicators, Education Statistics, Doing Business, Health Nutrition and Population Statistics, Gender Statistics, Global Financial Development, Health Equity and Financial Protection Indicators, Worldwide Governance Indicators, Worldwide Bureaucracy Indicators, Statistical Capacity Indicators, Global Jobs Indicators and Environment, Social and Governance Data.
, we obtain a final data set of 120 countries with 393 features observed for a 12 year period between 2005 and 2016 inclusively. The data series mostly comprise of population ratios, per capita metrics, year-over-year rates of change, proportions of national accounts, and scaled indicators, before being normalized to have mean zero and unit standard deviations.
Country level indicators are often highly correlated across time, with many series being very related to or subsets of others. Although tree-based techniques like Random Forests and Fuzzy Forests can process large numbers of series through feature selection, they do not have the capabilities to model mixed effects or give a single interpretable tree. glmertree can manage such effects while directly incorporating treatment variables into the analysis using GLM at each final node, it cannot handle the number of features in the dataset given the dimensionality problems inherit in linear regression. We compare results obtained by FREEtree to Random Forests and Fuzzy Forests in an example that takes advantage of individual country-level effects, as well as the central bank price targeting policy country_id was declared to be the subgroup cluster, while fixed regress
included a linear and quadratic temporal term. Inflation targeting adoption, a binary variable, was declared forconsider split, the rest of the features were included for the screening and selecting process. The formula takes the form:
where include all the other features to be screened and selected by FREEtree’s algorithm.
The resulting tree has three nodes from two split variables (investment price index and GDP volatility) and 9 explanatory variables, including GNP per capital, fuel and GDP volatility (Figure 7). The mixed effect paint a picture of volatile frontier economies in various states of high inflation or deflation, while industrialized nations tend to be closer to the mean (Figure 8).
FREEtree differentiates itself from other model trees in its ability to accept a very large number of features, addressing dimensionality issues when . Although this is a feature it shares in common with Random Forest and Fuzzy Forest, it distinguishes itself from these ensemble methods by being able to produce a single tree the user can readily interpret and understand while also providing superior predictions.
The production of a single decision tree also lets the user specify persistent features that will make it into the regression nodes, inherited from LMM tree. In addition, the user can specify subgroup cluster indicators and which features are guaranteed to make it past the screening process as regressors in the linear model. This flexibility caters well to researchers seeking to understand the impact of their variables of interest among a high number of other features, allowing them to effectively customize the output tree while taking advantage of WGCNA-based feature selection.
7 Discussion and Future Research
At the feature clustering step, FREEtree uses Pearson correlation as the similarity function, which may not be optimal when the measurements of each feature of any patients are time series. That is, for patient and feature , is a time series. In order to cluster features in this case, we have to cluster time series. According to our simulations, where Auto-Regressive and Compound-Symmetric structure were imposed on each feature , WGCNA still works when the correlation between features are relatively large. However, when the correlation is relatively low, WGCNA may not find strong associations and assign all the features to the grey group. One way to get around this is that when doing WGCNA analysis, instead of using correlation of features when building the similarity matrix, we use time series distance measures such as Dynamic time warping (DTW) and average them with respect to each patient and finally transform it into a similarity measure. In this case, the adapted WGCNA can detect module distinctions even if the correlation between features is relatively low. However, it is most be pointed out that computing time series distance measure such as DTW requires a lot of computational resources and in applications where is really large, replacing correlation with a time series distance measure may not be practical computationally.
In this paper we have presented Fuzzy Random Effect Estimation tree (FREEtree) algorithm that can provide a relatively unbiased way to do feature selection in the presence of correlation between features. Also, it deals with longitudinal data by using a random effect model tree, where the fixed effect is modelled as a piece-wise linear model, which has greater fitting and predicting power than RE-EM tree. It is expected that FREEtree can be widely used in application where the data has longitudinal structure as well as many correlated features.