Longitudinal Multi-Level Factorization Machine (AAAI'20)
Selecting important variables and learning predictive models from high-dimensional longitudinal data is challenging due to the need to account for complex data correlation and expensive computation. In this work, we propose an extension of factorization machines, LMLFM, to deal with such longitudinal data. LMLFM is efficient, sparse, provably convergent and explainable. Specifically, LMLFM is the first multi-level model that can simultaneously select fixed effects and random effects while accounting for complex correlations in the data and non-linear interactions among variables. Experimental results with both simulated and real-world longitudinal data show that LMLFM outperforms the state-of-the-art longitudinal methods in terms of prediction accuracy with significantly lower false positive, using substantially less computational resources.READ FULL TEXT VIEW PDF
We consider the problem of learning predictive models from longitudinal ...
We consider the problem of learning predictive models from longitudinal ...
This paper proposes FREEtree, a tree-based method for high dimensional
High-order parametric models that include terms for feature interactions...
In crowdsourced preference aggregation, it is often assumed that all the...
For complex nonlinear supervised learning models, assessing the relevanc...
Opioid Use Disorder (OUD) is a public health crisis costing the US billi...
Longitudinal Multi-Level Factorization Machine (AAAI'20)
Longitudinal data, also called panel data, consists of repeated observations from a set of individuals over time. Such data are common in many areas, including health sciences, social sciences and economics. Longitudinal data often exhibit longitudinal correlation (LC) (correlations among observations for each individual over time), cluster correlation (CC) (correlations among individuals that have similar characteristics), or multi-level correlation where LC and CC both exist [finch2016multilevel]. An example is presented in fig. 1. To predict Mike’s health status when he is 38, we should take into account both Mike’s history of physical examinations, as well as the results of other individuals with similar age. Due to such correlations, longitudinal data is typically not independent and identically distributed. Hence, analyses that do not account for correlations can lead to misleading statistical inferences [gibbons1997random].
To account for data correlations, research on longitudinal data analysis [gibbons1997random, lozano2012multi, groll2014variable] often relies on mixed effects models that include fixed effects and random effects, where the fixed effects capture the regression parameters that are shared by all individuals, whereas random effects capture those parameters that vary across individuals. In practice, the design of mixed effects models generally relies on expert input to decide which variables are subject to random effects as opposed to fixed effects. This typically involves trial and error. However, existing mixed effects models are exceptionally expensive [gibbons1997random, lozano2012multi, groll2014variable], the computational cost usually scales to , where is the number of variables that are subject to random effects. This limits the models to be applied only on the low-dimensional data. With the advent of big data, variants of dimensionality reduction methods such as LASSO for longitudinal data [schelldorfer2011estimation, lozano2012multi, schelldorfer2014glmmlasso, groll2014variable, xu2015longitudinal, ratkovic2017sparse, marino2017covariate, lu2017multilevel, goplerud2018sparse, finch2018modeling] have received much attention. But most such methods are limited to selecting only fixed effects. There are limited literature on jointly selecting fixed and random effects, for example, penalized likelihood methods [ibrahim2011fixed, bondell2010joint, hui2017joint] and Bayesian models [chen2003random, yang2019bayesian]. However, such methods are still computationally intensive and generally limited to linear models.
Contributions. Against this background, this paper is aimed at addressing the urgent need for effective models that can handle high-dimensional longitudinal data where the number of variables is very large compared to the the size of the population, the interactions among variables can be nonlinear, the fixed and random effects are a priori unspecified, and the data exhibit complex correlation (LC, CC, or both). Specifically, we introduce longitudinal multi-level factorization machines (LMLFM), a novel, efficient, provably convergent extension of Factorization Machine (FM) [freudenthaler2011bayesian, rendle2012factorization]
for predictive modeling of high-dimensional longitudinal data. LMLFM inherits the advantages of FM, such as the ability to reliably estimate the model parameters from high-dimensional data, and model non-linear interactions. Further, LMLFM can automatically select fixed and random effects in the presence of multi-level correlation (if any), and reduce the need for tuning hyperparameters in FM through a novel hierarchical Bayesian model. Specifically, LMLFM adopts two layers of Laplace prior, where the first layer produces sparse latent representation and the second layer identifies fixed effects and random effects. We solve the LMLFM using the iterated conditional modes (ICM) algorithm[besag1986statistical] which allows for efficient optimization with strong convergence guarantees. Experimental results with simulated data show that LMLFM can readily handle longitudinal data with over 5000 variables whereas the existing mixed effects models fail when the number of variables exceeds 100. Experiments with two real-world data sets show that LMLFM compares favorably with the state-of-the-art baselines in terms of predictive accuracy. The proposed model is sparse and interpretable. It can select the relevant variables, which are consistent with the findings in domain-specific literature [bromberger2011mood, dolan2008we].
Approaches to modeling longitudinal data include Generalized Estimating Equations (GEE) [liang1986longitudinal] and Generalized Mixed Effects Models (GMEM) [mcculloch1997maximum, fitzmaurice2012applied]. GEE are marginal models which only estimate the average outcome (or fixed effects) over the population [liang1986longitudinal]. In contrast, GMEM are conditional models that provide the expectation of the conditional distribution of the outcome given the random effects.
With the advent of big data, there is a growing interest in the problem of variable selection in longitudinal data [schelldorfer2011estimation, schelldorfer2014glmmlasso, groll2014variable]. Most of them focus only on selecting fixed effects under the assumption that the type of correlation is correctly specified and the random and fixed effects are correctly identified. Because of the computational complexity, they are generally limited to data with only a handful of variables [Chen2018LatentSM]. In contrast to fixed effects selection, the problem of simultaneous selection of both fixed effects and random effects is more challenging. Most of the existing methods rely on adding SCAD, LASSO or group LASSO penalty to the GMEM objective function [bondell2010joint, ibrahim2011fixed, muller2013model, hui2017hierarchical, hui2017joint]. Bayesian methods e.g., [chen2003random, yang2019bayesian]
, offer a conceptually attractive alternative to penalized likelihood methods for variable selection. However, such methods are currently applicable only to 2-level data which exhibit only LC or only CC but not both. Most assume a linear mixed model, and hence cannot accommodate non-linear interactions among variables. Because of their reliance on matrix decomposition and matrix inversion in parameter estimation, their computational complexity is, making them generally unsuitable for high-dimensional longitudinal data.
There have been a few attempts at applying factorization models to longitudinal data, e.g., [zhou2014micro] that applies non-negative matrix factorization (NNMF) to construct patient phenotypes from longitudinal electronic health records data; [stamile2016longitudinal, stamile2017multiparametric]
that uses NNMF to identify the outliers from longitudinal health data; and[Kidzinski2018LongitudinalDA]
that utilizes singular value decomposition to predict longitudinal progression of diseases. However, these models focus on prediction without accounting for different types of correlations or random effects. In contrast to these models, LMLFM is able to exploit and identify the complex correlation structure (if any) and jointly select both fixed and random effects.
Scalars are denoted by lowercase letters and vectors by bold lowercase letters. All vectors are column vectors.refers to the length of and is the norm of . Matrices are denoted by uppercase letters, e.g., and a set of objects by a bold uppercase letter, e.g., . The calligraphy letter and is reserved to denote information related to the individuals and the observations respectively. For example, refers to the submatrix of associated to the individuals. We reserve letter to denote an arbitrary individual and observation respectively. We denote as a diagonal matrix where the diagonal components are defined by . Conversely, the vector of diagonal components of a square matrix is denoted by . As an observation is taken on a particular discrete time point, the word observation and time point will be used interchangeably throughout this paper.
Factorization Machines (FM). Given a real valued design feature vector corresponding to an individual and observation , factorization machines (FM) [rendle2012factorization] model all nested interactions up to order . For example, the prediction of FM of order is given by:
where is a squared matrix with zeros on the diagonal. The off-diagonal component is parameterized as a dot product of two low dimensional embeddings . FM can be readily solved by coordinate descent [rendle2012factorization]. The time and space complexity is and respectively. and is the total number of observations across all individuals and the size of training data respectively.
Linear Mixed Model (LMM). We now introduce LMM since it motivates the design of our model. Let denote the scalar outcome of individual measured at observation . Let , be variables associated with fixed effects (denoted by ) and random effects (denoted as ) respectively. LMM assumes that the outcome is predicted by111The error term is incorporated to the random effects.:
The random effects matrix captures the time-invariant patterns for each individual. For all , , where is the covariance matrix. The random effects serve two purposes: i) regularizing the effects (similar to the norm); and ii) inducing correlation between the longitudinal observations, i.e., .
The structures of multi-level models are depicted in fig. 2. Standard regression models assume that given the variables and regression parameters, the outcomes are i.i.d. and hence suffer from biased estimates of parameters in the presence of LC or CC. 2-level models account for the LC or CC by either directly or indirectly specifying a correlation matrix that models the corresponding correlations. Mixed effects models introduce individual (or observation) specific random effects as proxies for the relevant information (see fig. 2 for illustration). A natural approach to extend this design is to incorporate both individual factors and observation factors ,222It is worth noting that individual/observation factors are distinct from individual/observation random effects, in this work, we use the former to estimate the later. as proxies for the individual and observation specific information. The pairwise interactions between such factors in the latent space (see eq. (1)) are shown by arrows in fig. 2. However, in its current form, the model does not provide a way to relate latent factors to the random effects or to explicitly accommodate for any type of correlation. Given the observed design matrix and outcomes , our goals are to: i) predict the unknown outcomes , ii) jointly select both fixed and random effects and iii) uncover the type of correlation in the data.
The prediction layer of LMLFM (see fig. 3) is motivated by both LMM and FM. Since we follow the Bayesian framework, all variables are first assumed random. Under such assumption, the LMM prediction in (2) is reduced to with . Further, to accommodate for multi-level correlation, we decompose the random effects as the summation of two parts of latent factors , where is the individual factors and observation factors respectively. Taking the interaction between individual and observation factors into account, we introduce the following prediction function:
Recall that in FM, the design feature vector
contains individual one-hot encoding, observation one-hot encoding and observed features, with latent factors associated with each individual, each observation and each feature dimension. Note that FM (eq. (1)) embeds each dimension of the design feature vector into a latent vector. In contrast, LMLFM (eq. (3)) only embeds the individuals and observations into latent vectors, and only considers the interactions among individuals, observations, and features, thereby simplifying the model. Instead of setting the latent dimension empirically as in the original FM, we initially let the latent dimension be as large as the feature dimension , and then apply variable selection to identify the relevant subset of latent factors (see below). This makes the proposed model almost as interpretable as a simple linear model while making use of factorization to incorporate nonlinear interactions.
The hierarchical Bayesian structure of LMLFM is shown in fig. 3
. We decompose the task of joint selection for fixed and random effects into two sub-tasks: i) identifying fixed and random effects by shrinking the variance of some variables to 0; and ii) selecting the relevant latent factors by shrinking some factors to 0. We handle the first sub-task by imposing a Laplace prior onand , respectively (see Laplace layer 2 in fig. 3). For the second sub-task, we enforce sparsity in by imposing Laplace prior on (see Laplace layer 1 in fig. 3
). We denote the model parameters and hyperpriors of LMLFM byand respectively, yielding the generative model:
and Poisson distribution for count data[xu2015longitudinal]
) In this paper, for the ease of exposition, we choose the Gaussian distribution.
We adopt the iterated conditional modes (ICM) algorithm [besag1986statistical] to estimate the parameters of LMLFM. Basically, ICM updates blocks of parameters with the modes of their conditional posterior while keeping the remaining parameters fixed. Our choice of priors permits the modes of the conditional posterior density to be derived analytically, yielding substantial convergence speedup. Specifically, we consider the Maximum A Posteriori (MAP) formulation . Due to space constraints, we include only the update equations for here. Detailed derivations can be found in the supplementary material. To minimize notational clutter, we omit the superscript .
Update of : The posterior of
is a Gamma distribution, the mode of which is computed by.
Update of : For each model parameter , the prediction is a linear combination of two functions and that are independent of the value of :
where is the matrix of latent factors constructed by the observations associated to . We have the following results:
is the ReLU function;, with denoting the set of integers ranging from 1 to excluding and is the -th column of .
Update of : The problem of finding the optimal can be converted to find the weighted median of the vector with the weights . Efficient algorithm can be found in [gurwitz1990weighted], which solves the problem in linear time.
Update of : The optimal is updated by .
It is worth noting that the computational complexity of LMLFM for one complete iteration is , strictly linear in the size of the training data. Our approach is more efficient than FM [rendle2012factorization], of which the computational complexity is ( is the latent dimension). The space complexity of LMLFM is the same with FM, which is .
We now present the way to compute random effects and fixed effects given the latent factors. Temporal Individual-Specific Random Effects (TISE). As shown in (3
), we can rewrite the prediction function of LMLFM as a form similar to the ordinary linear regression wherewith coefficients and error . Therefore, we let be the estimator of TISE.
Averaged Individual-Specific Random Effects (AISE). AISE is computed by integrating out the observation effects, such that:
Solving is non-trivial. We compute the approximation using the estimated observation factors, where .
Temporal Population Averaged Random Effects (TPAE). Similar to solving AISE, TPAE is solved by integrating out the individual factors, which is computed by .
Fixed Effects. We claim that a variable has fixed effect if and only if . The fixed effect for variable is computed by .
Details of correlation estimation can be found in the supplementary material.
We establish two important properties in LMLFM: Ascent property and Convergence (See supplementary material for detailed proofs). Let denote the set of all positive integers. We denote as the full conditional posterior of .
Ascent property. holds for all iteration .
The key idea behind the proof of Proposition 1 is that the joint posterior density of LMLFM is non-decreasing with the update of each component of .
Convergence. If is bounded above, there exists an iteration , such that holds for .
The proof of Proposition 2 is obtained by contradiction, i.e., the negation of the proposition implies that , thus it leads to a contradiction.
We report results of experiments with simulated data to answer the following questions. (RQ1) Can LMLFM handle high-dimensional data? (RQ2) Can LMLFM accurately select relevant variables? (RQ3) How does LMLFM perform in the presence of LC, CC and both?
Simulated data. We construct simulated longitudinal data sets with individuals and observations per individual, constituting 1600 data samples.333Though the simulation sample size seems small, this is still a general setting, see e.g., [groll2014variable, hui2017joint]. We consider several choices of from . Three types of correlation are considered, i.e., LC, CC and multi-level correlation. (Further details concerning the generation of synthetic data, can be found in the supplementary material).
Comparisons. We compare LMLFM with the following baseline methods: i) State-of-the-art multi-level linear mixed model (M-LMM) [bates2014fitting]; ii) State-of-the-art 2-level models: LMMLASSO, a linear mixed model with adaptive LASSO penalty on the fixed effects [schelldorfer2011estimation]; GLMMLASSO, a generalized linear mixed model with standard LASSO penalty on the fixed effects [groll2014variable] and rPQL [hui2017joint], a joint selection mixed model with adaptive LASSO penalty on the fixed effects and group LASSO penalty on the random effects; iii) Factorization-based multilevel Lasso (MLLASSO,444Despite the fact that MLLASSO is named with the term Multi-level it only works for 2-level cases and does not provide intuitive way to relate the latent factors to random effects.) which factorizes the fixed effects by the product of global effects and individual effects, with both regularized by norm [lozano2012multi]
; iv) the standard LASSO regression (LASSO)[tibshirani1996regression]. Implementation details are outlined in the supplementary material. We report performance statistics obtained from 100 independent runs. Hyperparameters are selected using cross validation. Evaluation scores are computed on the held-out data set. We report execution failure if an algorithm fails to converge within 48 hours or generates an execution error.
|R (%)||f.p.||f.n.||R (%)||f.p.||f.n.|
Evaluation Measures. We evaluate the performance of all methods on both prediction accuracy and parameter estimation. For prediction accuracy, we use the r-squared () score. For assessing the ability to deal with high-dimensional features, we use false positive (f.p.), the number of variables that are incorrectly selected and false negative (f.n.) and the number of variables that are falsely discarded.
Results. Only a part of the results on the synthetic data is summarized in table 1. RQ1: The performance of the state-of-the-art MEM are highly sensitive to the number of random effects, i.e., M-LMM, LMMLASSO and GLMMLASSO fail when exceeds 100 due to execution error; Only LMLFM, LASSO and MLLASSO ran to completion. RQ2: Selecting random effects is more challenging compared to selecting fixed effects. Among all models, only LMLFM and rPQL are designed to select random effects. To achieve sparsity on variable , LMLFM and rPQL need to shrink the vector to 0 while others need only to shrink a scalar to 0. Our results indicate that albeit LMLFM is running on a harder task, it still achieves the best performance in terms of f.p.. Though our results suggest that LASSO has attractive R when . However, it is noted that LASSO lacks the ability to deal with random effects, leading to an incredibly high f.p. in the presence of LC, CC and both. . RQ3: Our results (omitted due to space constraints) indicate that 2-level models work poorly when applying CC model on data with LC or vice versa. In contrast, multi-level models (M-LMM and LMLFM) can better fit the data that exhibit LC, CC, or both without trial-and-error. Furthermore, LMLFM consistently outperforms M-LMM in terms of accuracy, variable selection ability and computational efficiency. We conclude that LMLFM is the only model that can effectively model high-dimensional longitudinal data, and select the relevant variables regardless of whether they are associated with random effects or fixed effects, in the presence of LC, CC, or both.
|# of Individuals||3,300||4,510|
|# of Observation||1-22||1-30|
|# of Variables||137||1,553|
|# of Samples||28,405||59,599|
We compare LMLFM with the state-of-the-art baselines on two real-world longtudinal data sets: i) Study of Women’s Health Across the Nation (SWAN) 555https://www.swanstudy.org/ [sutton2005sex] (on predicting depression) and ii) General social survey (GSS)666http://gss.norc.org/ [smith2017general] (on predicting general happiness). Brief description of the two data sets is listed in table 2
. We use the same settings of hyperparamters for LMLFM as in the experiments with simulated data. We exclude rPQL since it fails on all real life experiments due to memory issue. In addition to the aforementioned baselines, we also included some popular 1-level models: Random Forest (RF), FM and Penalized GEE (PGEE)[inan2017pgee].
We aim to answer the following questions. (RQ4) Can LMLFM correctly select the relevant variables in the real-world setting? (RQ5) How does LMLFM perform as compared with the state-of-the-art baselines? To answer RQ4, for each data set, we choose 5 positive, 5 negative variables as identified in the existing literature (see below for specifics) and add 5 additional variables that are believed to be relatively uninformative as the ground truth. To answer RQ5, we use the entire set of variables.
In the first real-world experiment (SWAN data), we consider the task of predicting the CESD score [dugan2015association], which is used for screening for depression. The variables of interest include aspects of physical and mental health, and demographic factors such as race and income. The outcomes we aim to predict are defined by ( is individual and is the age of the individual) since CESD is found to be highly indicative of depression. Existing research [dugan2015association, prairie2015symptoms] suggests that hispanic ethnicity, depressed or fluctuating mood and low household income are highly positively correlated with depression, whereas Caucasian/white ethnicity, stable mood and high income are negatively correlated with depression. The variables used to answer RQ4 and the experimental results are summarized in fig. 4(a) and Table 3. We note that LMLFM outperforms all other methods in R score and correctly recovers the relevant variables. Performance of the factorization baselines (FM, MLLASSO) is dissatisfactory. This is because of the lack of intuitive way to relate estimated latent factors to the corresponding effects. Note that the variables related to depressed mood are generally selected by our baselines, which is consistent with the findings in [prairie2015symptoms].777Variable selection results on the entire data are outlined in the supplementary material. FM renders menopausal status as strong factors to depression, a finding supported by existing literature [bromberger2011mood, prairie2015symptoms]. However, we argue that depressed and fluctuated mood are more likely to be direct causes to depressive symptoms because menopausal status usually causes abnormal hormone level, which could further affect the mood of the patients. Results of LMLFM show that and 98.3% of the components in are exactly zero, which further implies that the random effects related to the individuals are less predictive. Nonetheless, the analysis on reveals a different story: 59 out of 136 are non-zero and the sparsity rate on is 56.4%. Therefore, we conclude that the depressive symptoms for multiple individuals with the same age (CC) as well as for single individual across time (LC) are both correlated with CC dominating LC.
In the second real-world experiment (GSS data), we consider predicting the self-reported happiness. We define by individual reports happy at year and as the opposite. Existing research [dolan2008we, oishi2011income] indicates that, being married, good physical and psychological health, satisfactory with financial situation, having strong religious beliefs and being trusted are positively correlated with happiness, whereas their absence as well as unemployment are negatively correlated. The variables used to answer RQ4 and the results of our experiments are presented in fig. 4(b) and table 3. Though we see that PGEE and LASSO has the lowest f.p., their R is relatively low. They tend to shrink the negative effects to zero, yet perform poorly even on the training set, showing strong evidents of underfitting. LMLFM is competitive with the best performing competitors in recovering the relevant variables, while significantly outperforming them in terms of predictive performance. Compared to SWAN, variable selection in GSS is more challenging not only due to the prohibitively high variable dimension but also due to the intricate colinearily in the feature space and the lack of consistent understanding about what truly affects happiness. The low R of FM and MLLASSO indicate that they significantly underfit the data. Though RF has a high R score, variables selected by RF are less reasonable and hard to explain compared to that of the other baselines (see supplementary material for details). In contrast, LASSO achieves lower R, but the selected variables are more consistent with the results given by LMLFM. For LMLFM, all selected variables are agreed with the findings in [dolan2008we]. By examining the random effects and correlation estimated by LMLFM, we find that all , thus no LC is detected; the sparsity rate on is 84.9%. Therefore, we conclude that GSS data is purely CC. Such conclusion not surprising because of the huge gap between consecutive observations (the survey is taken once per one to three years) within which many unobserved factors could potentially affect the subjective happiness. The fixed effects analysis points that 126 out of 1199 effects are non-zero and among which, only 6 features have absolute effects greater than 0.1, thus vast majority of variables are uninformative. As a final remark, LMLFM outperforms all the baselines and is the only multi-level MEM that can handle high-dimensional data.
We have introduced LMLFM, for predictive modeling from longitudinal data when the number of variables is large compared to the population size, the fixed and random effects are a priori unspecified, the interactions among variables are nonlinear, and the data exhibit complex correlation (LC, CC, or both). LMLFM, a natural generalization of FM to longitudinal data setting, adopts a novel hierarchical Bayesian model with two layers of Laplace prior, where the first layer produces sparse latent representation and the second layer identifies fixed effects and random effects. We train LMLFM using iterated conditional modes algorithm which offers both computational efficiency and strong convergence guarantees. Compared to the state-of-the-art alternatives, LMLFM yields more compact and fully interpretable, rapidly trainable - and hence scalable, models with only one tuning hyperparameters. Our experiments with simulated data with thousands of variables, and two widely studied real-world longitudinal datasets have shown that LMLFM outperforms the 1-level baselines and state-of-the-art 2-level and multi-level longitudinal models in terms of prediction as well as parameter estimation, using substantially less computational resources.