Longitudinal Multi-Level Factorization Machines (AAAI'20)
We consider the problem of learning predictive models from longitudinal data, consisting of irregularly repeated, sparse observations from a set of individuals over time. Such data often exhibit longitudinal correlation (LC) (correlations among observations for each individual over time), cluster correlation (CC) (correlations among individuals that have similar characteristics), or both. These correlations are often accounted for using mixed effects models that include fixed effects and random effects, where the fixed effects capture the regression parameters that are shared by all individuals, whereas random effects capture those parameters that vary across individuals. However, the current state-of-the-art methods are unable to select the most predictive fixed effects and random effects from a large number of variables while accounting for complex correlation structure in the data and non-linear interactions among the variables. We propose the Longitudinal Multi-Level Factorization Machine (LMLFM), to the best of our knowledge the first model to address these challenges in learning predictive models from longitudinal data. We establish the convergence properties, and analyze the computational complexity, of LMLFM. We present results of experiments with both simulated and real-world longitudinal data which show that LMLFM outperforms the state-of-the-art methods in terms of predictive accuracy, variable selection ability, and scalability to data with a large number of variables. The code and supplemental material are available at <https://github.com/junjieliang672/LMLFM>.
Longitudinal data consist of repeated observations from a set of individuals over time. Such data are common in many areas, including health sciences, social sciences and economics. Consider, for example, the scenario shown in Fig. 1. To predict an individual's health status at the age of 38, we should take into account both that individual's history of physical examinations and the histories of other individuals who are similar in age and other characteristics. Such longitudinal data often exhibit longitudinal correlation (LC) (correlations among observations for each individual over time), cluster correlation (CC) (correlations among individuals that have similar characteristics), or both (multi-level correlation) [finch2016multilevel], and hence are not independent and identically distributed. Analysis that does not account for such correlations can lead to misleading statistical inferences [gibbons1997random]. To account for data correlations, state-of-the-art longitudinal data analysis [gibbons1997random, lozano2012multi, groll2014variable] often relies on mixed effects models that include fixed effects and random effects, where the fixed effects capture the regression parameters that are shared by all individuals, whereas random effects capture those parameters that vary across individuals. In practice, the design of mixed effects models relies on expert input, or a process of trial and error, to decide which variables are subject to random effects as opposed to fixed effects. Moreover, existing mixed effects models are very computationally intensive: their computational cost grows steeply with the number of variables that are subject to random effects, which limits their applicability to relatively low-dimensional data.
With the advent of big data, variants of dimensionality reduction methods such as LASSO have been explored in the longitudinal setting [schelldorfer2011estimation, lozano2012multi, groll2014variable, xu2015longitudinal, ratkovic2017sparse, marino2017covariate, lu2017multilevel, finch2018modeling], but most such methods are limited to selecting only fixed effects. There is limited work on jointly selecting fixed and random effects, e.g., penalized likelihood methods [ibrahim2011fixed, bondell2010joint, hui2017joint] and Bayesian models [chen2003random, yang2019bayesian]. However, the applicability of these methods is limited by their high computational cost and reliance on linear models.
Contributions. This paper aims to address the urgent need for effective models that can handle high-dimensional longitudinal data where the number of variables is very large compared to the size of the population, the interactions among variables can be nonlinear, the fixed and random effects are a priori unspecified, and the data exhibit correlation structure (LC, CC, or both). Specifically, we introduce the Longitudinal Multi-Level Factorization Machine (LMLFM), a novel, efficient, provably convergent extension of Factorization Machines (FM) [rendle2012factorization] for predictive modeling of high-dimensional longitudinal data. LMLFM inherits the advantages of FM, e.g., the ability to reliably estimate the model parameters from high-dimensional data and to model non-linear interactions. Further, LMLFM can automatically select fixed and random effects even in the presence of multi-level correlation, and greatly reduces the need for hyper-parameter tuning using a novel hierarchical Bayesian formulation. Specifically, LMLFM adopts two layers of Laplace priors, one for sparsifying the latent representation and one for identifying fixed effects and random effects. We solve LMLFM using the iterated conditional modes (ICM) algorithm [besag1986statistical], which offers efficient optimization with strong convergence guarantees. Experimental results with simulated data show that LMLFM can readily handle longitudinal data with over 5000 variables, whereas the existing mixed effects models fail when the number of variables exceeds 100. Experiments with two real-world data sets show that LMLFM (i) compares favorably with the state-of-the-art baselines in terms of predictive accuracy; (ii) yields sparse and easy-to-interpret predictive models; and (iii) effectively selects the relevant variables, which are consistent with the published findings [bromberger2011mood, dolan2008we].
Popular longitudinal data analysis methods include Generalized Estimating Equations (GEE) [liang1986longitudinal] and Generalized Mixed Effects Models (GMEM) [fitzmaurice2012applied]. GEE are marginal models which only estimate the average outcome (or fixed effects) over the population [liang1986longitudinal]. In contrast, GMEM are conditional models that provide the expectation of the conditional distribution of the outcome given the random effects.
There is much interest in the problem of variable selection in longitudinal data [schelldorfer2011estimation, groll2014variable]. Existing techniques focus on selecting only the fixed effects, under the assumption that the type of correlation is correctly specified and the random and fixed effects are correctly identified. Their high computational cost limits their applicability to data with small numbers of variables [chen2018latent]. There is limited work on the more challenging problem of selecting both fixed effects and random effects. Existing methods typically rely on adding a sparsity-inducing penalty, e.g., LASSO or its variants, to the GMEM objective function [bondell2010joint, ibrahim2011fixed, muller2013model, hui2017hierarchical, hui2017joint]. While Bayesian methods, e.g., [chen2003random, yang2019bayesian], offer a conceptually attractive alternative to penalized likelihood methods for variable selection, they are currently applicable only to 2-level data which exhibit only LC or only CC but not both. Furthermore, most assume a linear mixed model, and hence cannot accommodate non-linear interactions among variables. Because they rely on matrix decomposition and matrix inversion for parameter estimation, their computational cost grows superlinearly (cubically, for standard dense matrix inversion) in the dimension of the matrices involved, making them unsuitable for high-dimensional longitudinal data.
While there have been a few attempts at applying factorization techniques [zhou2014micro, stamile2017multiparametric, Kidzinski2018LongitudinalDA], and deep representation learning techniques [xu2019adaptive, xu2019spatio], their primary focus is to improve the predictive accuracy. These techniques do not explicitly account for the complex correlation structure in the data or distinguish between random effects and fixed effects. In contrast, LMLFM efficiently accounts for complex correlation structure in the data and selects the most predictive fixed and random effects.
Scalars are denoted by lowercase letters and vectors by bold lowercase letters. All vectors are column vectors. $|\mathbf{x}|$ refers to the length of $\mathbf{x}$ and $\|\mathbf{x}\|$ is the norm of $\mathbf{x}$. Matrices are denoted by uppercase letters, e.g., $X$, and a set of objects by a bold uppercase letter, e.g., $\mathbf{X}$. The calligraphic letters $\mathcal{I}$ and $\mathcal{T}$ denote information related to the individuals and the observations respectively. For example, $X^{\mathcal{I}}$ refers to the sub-matrix of $X$ associated with the individuals. We use the letters $i$ and $t$ to denote an arbitrary individual and observation respectively. We use $\mathrm{diag}(A)$ to denote the vector of diagonal components of a square matrix $A$. Because observations occur at discrete time points, we use observation and time point interchangeably.
Factorization Machines (FM). Given a real-valued design feature vector $\mathbf{x}_{it} \in \mathbb{R}^{p}$ corresponding to an individual $i$ and observation $t$, factorization machines (FM) [rendle2012factorization] model all nested interactions up to a chosen order. For example, the prediction of FM of order 2 is given by:
$$\hat{y}_{it} = w_0 + \mathbf{w}^\top \mathbf{x}_{it} + \sum_{j=1}^{p} \sum_{l=j+1}^{p} W_{jl}\, x_{it,j}\, x_{it,l} \tag{1}$$
where $W$ is a square matrix with zeros on the diagonal. Each off-diagonal component $W_{jl}$ is parameterized as a dot product $\langle \mathbf{v}_j, \mathbf{v}_l \rangle$ of two low-dimensional embeddings. FM can be readily solved by coordinate descent [rendle2012factorization]. Its time and space complexity are expressed in terms of the total number of observations across all individuals and the data size; the time complexity additionally scales with the latent dimension.
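The order-2 FM prediction can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the names `fm_predict_naive`, `fm_predict_fast`, `w0`, `w`, and `V` are assumptions chosen for this example, with `V[j]` holding the embedding of feature `j`. The second function uses the well-known reformulation from [rendle2012factorization] that avoids the double loop over feature pairs.

```python
def fm_predict_naive(x, w0, w, V):
    """Order-2 FM: w0 + <w, x> + sum over j < l of <V[j], V[l]> * x[j] * x[l]."""
    p = len(x)
    linear = sum(w[j] * x[j] for j in range(p))
    pairwise = 0.0
    for j in range(p):
        for l in range(j + 1, p):
            dot = sum(a * b for a, b in zip(V[j], V[l]))
            pairwise += dot * x[j] * x[l]
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Same value in O(p * d) time, where d is the embedding length."""
    p, d = len(x), len(V[0])
    linear = sum(w[j] * x[j] for j in range(p))
    pairwise = 0.0
    for f in range(d):
        s = sum(V[j][f] * x[j] for j in range(p))            # sum of v_jf * x_j
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(p))  # sum of squares
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

Both functions return the same value; the fast variant is what makes the cost per prediction linear in the number of features times the latent dimension.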
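Linear Mixed Model (LMM). We introduce LMM to motivate the design of LMLFM. Let $y_{it}$ denote the scalar outcome of individual $i$ measured at observation $t$. Let $\mathbf{x}_{it}$ and $\mathbf{z}_{it}$ be the variables associated with fixed effects (denoted by $\boldsymbol{\beta}$) and random effects (denoted by $\mathbf{b}_i$) respectively. LMM assumes that the outcome is predicted by (we omit the error term since it can be readily incorporated into the random effects):
$$y_{it} = \mathbf{x}_{it}^\top \boldsymbol{\beta} + \mathbf{z}_{it}^\top \mathbf{b}_i \tag{2}$$
The random effects matrix $B = [\mathbf{b}_1, \mathbf{b}_2, \dots]$ captures the time-invariant patterns for each individual. For all $i$, $\mathbf{b}_i \sim \mathcal{N}(\mathbf{0}, \Sigma)$, where $\Sigma$ is the covariance matrix. The random effects serve two purposes: (i) regularizing the effects (similar to the $\ell_2$ norm); and (ii) inducing correlation between the longitudinal observations, i.e., $\mathrm{Cov}(y_{it}, y_{it'}) = \mathbf{z}_{it}^\top \Sigma\, \mathbf{z}_{it'}$.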
The structures of multi-level models are shown in Fig. 2. Standard regression models assume that, given the variables and regression parameters, the outcomes are i.i.d., and hence yield biased estimates of parameters in the presence of LC or CC. 2-level models account for LC or CC by either directly or indirectly specifying a correlation matrix that models the corresponding correlations. Mixed effects models introduce individual (or observation) specific random effects as proxies for the relevant information (see Fig. 2). A natural approach to extend this design is to incorporate both individual factors and observation factors as proxies for the individual and observation specific information. (Note that individual/observation factors are distinct from individual/observation random effects; in this work, we use the former to estimate the latter.) The pairwise interactions between such factors in the latent space (see Eq. (1)) are shown by arrows in Fig. 2. However, in its current form, the model does not provide a way to relate latent factors to the random effects or to explicitly accommodate complex correlation structures. Given the observed design matrix and outcomes, our goals are to: (i) predict the unknown outcomes, (ii) jointly select both fixed and random effects, and (iii) recover the correlation structure from the data.
The prediction layer of LMLFM (see Fig. 3) is inspired by both LMM and FM. Within a Bayesian framework, all variables are first assumed random. Hence, the LMM prediction in Eq. (2) reduces to a form that contains only random effects. Further, to accommodate multi-level correlation, we decompose the random effects as the summation of two subsets of latent factors, namely the individual factors and the observation factors. Considering the interaction between individual and observation factors, we introduce the following prediction function (Eq. (3)):
Recall that in FM, the design feature vector includes the individual one-hot encoding, the observation one-hot encoding and the observed features. The original design of FM as in Eq. (1) embeds each component of the design feature vector into a latent vector. In contrast, LMLFM (Eq. (3)) embeds only the individuals and observations into latent vectors, and considers only the interactions among individuals, observations, and the design feature vector, thus simplifying the model. Instead of setting the latent dimension empirically as in the case of an FM, we initially set the latent dimension to be as large as the feature dimension, and then apply variable selection to identify the relevant subset of latent factors (see below). This makes the proposed model almost as interpretable as a simple linear model while making use of factorization to accommodate nonlinear interactions among variables.
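As an illustration only, the sketch below assumes one prediction function that is consistent with the description above: each individual and each observation has a latent vector of the same length as the feature vector, the features interact with the sum of the two vectors, and the two vectors interact with each other. The names `lmlfm_predict`, `u_i`, and `v_t` are hypothetical, and the exact form may differ from the authors' Eq. (3).

```python
def lmlfm_predict(x, u_i, v_t):
    """Assumed form: x^T (u_i + v_t) + u_i^T v_t.

    x   : observed feature vector for individual i at observation t
    u_i : latent factors of individual i (hypothetical name)
    v_t : latent factors of observation t (hypothetical name)
    All three sequences have the same length.
    """
    coefficients = [a + b for a, b in zip(u_i, v_t)]   # plays the role of a regression vector
    linear = sum(c * xj for c, xj in zip(coefficients, x))
    interaction = sum(a * b for a, b in zip(u_i, v_t))  # individual-observation interaction
    return linear + interaction
```

Under this assumed form, a latent component that is zero for both the individual and the observation removes the corresponding feature from the prediction, which is what makes sparsity in the factors interpretable as variable selection.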
The hierarchical Bayesian structure of LMLFM is shown in Fig. 3. We decompose the task of joint selection of fixed and random effects into two sub-tasks: (i) identifying fixed and random effects by shrinking the variance of some variables towards 0; and (ii) selecting the relevant latent factors by shrinking some components towards 0. We handle the first sub-task by imposing a Laplace prior on each of the two corresponding parameters (see Laplace layer 2 in Fig. 3). For the second sub-task, we enforce sparsity in the latent factors by imposing a Laplace prior (see Laplace layer 1 in Fig. 3). Together, the model parameters and hyper-priors of LMLFM yield the following generative model:
The distribution of the outcome can be application-dependent. For ease of exposition, we assume that the outcome variable follows a Gaussian distribution.
We adopt the iterated conditional modes (ICM) algorithm [besag1986statistical] to estimate the parameters of LMLFM. ICM updates blocks of parameters with the modes of their conditional posterior while keeping the remaining parameters fixed. Our choice of priors permits the analytical closed-form derivation of the modes of the conditional posterior density, yielding substantial speedup. Specifically, we consider the Maximum A Posteriori (MAP) formulation:
Due to space constraints, we include only the update equations for one block of parameters here, omitting its superscript to minimize notational clutter.
Update of the first parameter. The conditional posterior of this parameter is a Gamma distribution, so the update sets the parameter to the mode of that distribution.
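For reference, the mode of a Gamma distribution has a simple closed form. The sketch below assumes the shape/rate parameterization; the function name `gamma_mode` is hypothetical, and the shape and rate that arise in LMLFM's conditional posterior are not specified here.

```python
def gamma_mode(shape, rate):
    """Mode of Gamma(shape, rate): (shape - 1) / rate when shape >= 1.

    For shape < 1 the density is unbounded near zero, so the mode is
    conventionally taken to be 0.
    """
    if shape <= 0 or rate <= 0:
        raise ValueError("shape and rate must be positive")
    return (shape - 1.0) / rate if shape >= 1.0 else 0.0
```

An ICM step for such a parameter therefore costs only the time needed to accumulate the posterior shape and rate.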
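Update of the latent factors. For each model parameter $\theta$, the prediction is a linear combination of two functions $g_\theta$ and $h_\theta$ that are independent of the value of $\theta$:
$$\hat{y}_{it} = g_\theta(\mathbf{x}_{it}) + \theta\, h_\theta(\mathbf{x}_{it})$$
where $g_\theta$ and $h_\theta$ are determined by the remaining parameters and by the matrix of latent factors constructed from the observations associated with the individual in question. The resulting update is expressed through the ReLU function, with the remaining quantities computed over the set of integers ranging from 1 to the latent dimension excluding the index being updated, and over the corresponding column of the factor matrix. Clearly, sparsity is achieved whenever the argument of the ReLU is non-positive, in which case the component is set exactly to zero.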
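How a ReLU produces exact zeros under a Laplace prior can be seen in the generic one-dimensional case. The sketch below is an assumption-laden illustration rather than LMLFM's actual update: it maximizes `-0.5 * a * theta**2 + b * theta - lam * abs(theta)` with `a > 0`, where `a`, `b`, and `lam` are hypothetical stand-ins for the curvature, the data-fit term, and the Laplace rate.

```python
def relu(z):
    """Rectified linear unit."""
    return z if z > 0 else 0.0


def laplace_map_coordinate(a, b, lam):
    """Maximizer of -0.5*a*theta^2 + b*theta - lam*|theta| for a > 0.

    The solution is a soft-thresholding rule: the data-fit term b is shrunk
    toward zero by lam, and set exactly to zero when |b| <= lam.
    """
    sign = 1.0 if b >= 0 else -1.0
    return sign * relu(abs(b) - lam) / a
```

Whenever the data-fit term is no larger in magnitude than the penalty, the coordinate lands exactly on zero, which is the mechanism behind the sparsity condition described above.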
Update of the next parameter. Finding its optimal value reduces to computing the weighted median of a vector under a given set of weights, which can be done in linear time [gurwitz1990weighted].
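A weighted median minimizes a weighted sum of absolute deviations, which is why it appears as the mode under a Laplace-type density. The following sketch uses a simple sort-based approach in O(n log n) for clarity; it is not the linear-time selection algorithm of [gurwitz1990weighted], and the name `weighted_median` is chosen for this example. Weights are assumed non-negative with a positive sum.

```python
def weighted_median(values, weights):
    """Return a value m minimizing sum(w * |v - m|) over the given values.

    Sort-based version: walk the sorted values until the accumulated
    weight reaches half of the total weight.
    """
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    accumulated = 0.0
    for value, weight in pairs:
        accumulated += weight
        if accumulated >= total / 2.0:
            return value
    raise ValueError("weights must have a positive sum")
```

Replacing the sort with a selection procedure removes the logarithmic factor while returning the same minimizer.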
Update of the remaining parameter. Its optimal value is likewise available in closed form.
We note that the computational complexity of LMLFM for one complete iteration is linear in the size of the training data. Our approach is more efficient than FM [rendle2012factorization], whose computational complexity additionally scales with the latent dimension. The space complexity of LMLFM is the same as that of FM.
We proceed to describe how to estimate random effects, fixed effects and outcomes for seen and unseen individuals and observations from the model.
Temporal Individual-Specific Random Effects (TISE). As shown in Eq. (3), we can rewrite the prediction function of LMLFM in a form resembling ordinary linear regression, with coefficients specific to the individual and observation and an error term. Hence, we let these coefficients be the estimator of TISE.
Averaged Individual-Specific Random Effects (AISE). AISE is computed by integrating out the observation effects:
Solving this integral is non-trivial. Hence, we approximate it using the estimated observation factors.
Temporal Population Averaged Random Effects (TPAE). Similar to AISE, TPAE is obtained by integrating out the individual factors.
Fixed Effects. Whether a variable has a fixed effect is decided by a condition on its estimated parameters, from which the fixed effect for that variable is then computed.
Outcomes. The outcome for a seen individual and observation is computed using Eq. (3). In the case of unseen individuals, we replace the posterior of the latent factors with their prior. Thus, for an unseen individual, the outcome is given by:
We establish two important properties of LMLFM: the ascent property and convergence (see supplemental material for detailed proofs). We index the iterations by the positive integers, and consider the value of each parameter at a given iteration together with its full conditional posterior.
Proposition 1 (Ascent property). The joint posterior density is non-decreasing from one iteration to the next, for all iterations.
The proof of Proposition 1 follows from the observation that the joint posterior density of LMLFM is non-decreasing with the update of each component of .
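Proposition 2 (Convergence). If the objective is bounded above, there exists an iteration from which the convergence criterion holds for all subsequent iterations.
We prove Proposition 2 by contradiction, i.e., the negation of the proposition implies that the objective grows without bound, contradicting the assumption that it is bounded above.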
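The ascent property is easy to observe on a toy problem. The sketch below is not LMLFM's objective: it assumes a concave quadratic `-0.5 * theta^T A theta + b^T theta` with a symmetric positive definite `A`, and replaces each coordinate in turn by the maximizer of the objective with the other coordinates held fixed, which mirrors the structure of ICM. All names are chosen for this example.

```python
def objective(theta, A, b):
    """Concave quadratic: -0.5 * theta^T A theta + b^T theta."""
    n = len(theta)
    quad = sum(theta[j] * A[j][k] * theta[k] for j in range(n) for k in range(n))
    return -0.5 * quad + sum(bj * tj for bj, tj in zip(b, theta))


def icm_sweep(theta, A, b):
    """Set each coordinate to its conditional maximizer, one at a time."""
    theta = list(theta)
    n = len(theta)
    for j in range(n):
        rest = sum(A[j][k] * theta[k] for k in range(n) if k != j)
        theta[j] = (b[j] - rest) / A[j][j]  # exact maximizer for symmetric A
    return theta


def run_icm(A, b, theta0, sweeps=20):
    """Run several sweeps and record the objective after each one."""
    theta = list(theta0)
    history = [objective(theta, A, b)]
    for _ in range(sweeps):
        theta = icm_sweep(theta, A, b)
        history.append(objective(theta, A, b))
    return theta, history
```

Because every coordinate update is an exact conditional maximization, the recorded objective values never decrease, and since the objective is bounded above the sequence settles at the maximizer.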
We report results of experiments with simulated data to answer the following questions: (RC1) Can LMLFM handle high-dimensional data? (RC2) Can LMLFM accurately select the relevant variables? (RC3) How does LMLFM perform in the presence of LC and CC?
Simulated data. Following [hui2017joint], we construct simulated longitudinal data sets with a fixed number of individuals and of observations per individual, and consider several choices of the number of variables. We consider three types of correlation, i.e., pure LC, pure CC, and both (see supplemental material for details).
Methods compared. We compare LMLFM with several baseline methods: (i) the state-of-the-art multi-level linear mixed model (M-LMM) [bates2014fitting]; (ii) state-of-the-art 2-level models: LMMLASSO, a linear mixed model with adaptive LASSO penalty on the fixed effects [schelldorfer2011estimation]; GLMMLASSO, a generalized linear mixed model with standard LASSO penalty on the fixed effects [groll2014variable]; and rPQL [hui2017joint], a joint selection mixed model with adaptive LASSO penalty on the fixed effects and group LASSO penalty on the random effects; (iii) factorization-based multi-level Lasso (MLLASSO), which factorizes the fixed effects as a product of global effects and individual effects, both regularized by the $\ell_1$ norm [lozano2012multi] (despite its name, MLLASSO works only as a 2-level model and does not provide a simple way to associate the latent factors with random effects); and (iv) the standard LASSO regression (LASSO) [tibshirani1996regression]. We report performance statistics obtained from 100 independent runs. Hyper-parameters are selected using cross validation on the training data. Evaluation scores are computed on the held-out data set. We report execution failure if an algorithm fails to converge within 48 hours or generates an execution error.
Table 1: $R^2$ (%), false positives (f.p.), and false negatives (f.n.) of the compared methods on the simulated data.
Evaluation Measures. We evaluate the performance of all methods in terms of both predictive accuracy and the ability to select random and fixed effects. We measure the predictive accuracy using the r-squared ($R^2$) score. To assess a method's ability to select the relevant variables, we consider a variable to be selected if the corresponding coefficient is non-zero. We use false positives (f.p.), the number of variables that are incorrectly selected, and false negatives (f.n.), the number of variables that are incorrectly discarded, to assess the models.
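These measures can be computed as follows. This is a minimal sketch with hypothetical function names; `relevant` is assumed to be the set of indices of the truly relevant variables, which is known only for simulated data or for a hand-picked reference set.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def selection_errors(coefficients, relevant):
    """Count false positives and false negatives of a variable selection.

    A variable counts as selected when its coefficient is non-zero.
    """
    selected = {j for j, c in enumerate(coefficients) if c != 0}
    false_positives = len(selected - relevant)
    false_negatives = len(relevant - selected)
    return false_positives, false_negatives
```

For example, with coefficients `[0.5, 0, -0.2, 0, 0.1]` and relevant indices `{0, 1}`, variables 2 and 4 are false positives and variable 1 is a false negative.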
Results. A subset of our results, summarized in Table 1, answers RC1-RC3. RC1: the performance of the state-of-the-art mixed effects models is highly sensitive to the number of random effects, i.e., M-LMM, LMMLASSO and GLMMLASSO fail due to execution errors when the number of variables exceeds 100; only LMLFM, LASSO and MLLASSO ran to completion. RC2: Selecting random effects is more challenging than selecting fixed effects. Among all models, only LMLFM and rPQL are designed to select random effects. To enforce model sparsity on a given variable, LMLFM and rPQL shrink a vector to 0, whereas the other methods shrink a scalar to 0. Our results show that LMLFM achieves the best f.p. score. Although LASSO attains an attractive $R^2$ in some settings, it is unable to select variables with random effects, leading to very high f.p. in the presence of LC, CC and both. RC3: Our results (omitted due to space constraints) show that 2-level models work poorly when a CC model is used on data with LC or vice versa. In contrast, multi-level models (M-LMM and LMLFM) achieve a better fit on data that exhibit LC, CC, or both. LMLFM consistently outperforms M-LMM in terms of accuracy, variable selection ability and computational efficiency. We conclude that LMLFM is the only method among those compared in this study that can effectively model high-dimensional longitudinal data and select the relevant variables, regardless of whether they are associated with random effects or fixed effects, in the presence of LC, CC, or both.
We compare LMLFM with the state-of-the-art baselines on two real-world longitudinal data sets: (i) Study of Women's Health Across the Nation (SWAN, https://www.swanstudy.org/) [sutton2005sex] (on predicting depression); and (ii) General Social Survey (GSS, http://gss.norc.org/) [smith2017general] (on predicting general happiness). We chose these two data sets because they have attracted much interest in the field of social sciences. We use the same settings of hyper-parameters for LMLFM as in our experiments with simulated data. We exclude rPQL because it fails on all of the experiments due to memory issues. In addition to the aforementioned baselines, we include some popular 1-level models in our comparison: Random Forest (RF), FM and Penalized GEE (PGEE) [inan2017pgee].
We seek answers to the following question: (RC4) How does LMLFM compare with the state-of-the-art baselines with respect to its ability to correctly identify the fixed and random effects and its predictive accuracy? To answer RC4, for each data set, we choose as "ground truth" 5 "positive" and 5 "negative" variables identified in the existing literature (see below for specifics) and add 5 additional variables that are believed to be relatively uninformative.
Evaluation on SWAN Data. In the case of SWAN data, we consider the task of predicting the CESD score [dugan2015association], which is used for screening for depression. The variables of interest include aspects of physical and mental health, and demographic factors, such as race and income. The data set includes 3,300 individuals, with 1-22 observations per individual, and 137 variables. The outcome we aim to predict is defined from the CESD score of each individual at each age, since CESD has been observed to be highly indicative of depression. Existing research [dugan2015association, prairie2015symptoms] suggests that Hispanic ethnicity, depressed or fluctuating mood and low household income are highly positively correlated with depression, whereas Caucasian/white ethnicity, stable mood and high income are negatively correlated with depression. The variables used to answer RC4 and the experimental results are summarized in Fig. 4(a) and Table 2 respectively. We note that LMLFM outperforms all other methods in $R^2$ score and correctly recovers the relevant variables. The performance of the factorization baselines (FM, MLLASSO) is unsatisfactory, because they lack an intuitive way to relate estimated latent factors to the corresponding effects. Note that the variables related to depressed mood are generally selected by our baselines (not shown), which is consistent with the findings in [prairie2015symptoms]. FM identifies menopausal status as a strong factor in depression, a finding supported by existing literature [bromberger2011mood, prairie2015symptoms]. However, we argue that depressed and fluctuating mood are more likely to be direct causes of depressive symptoms, because menopausal status usually causes abnormal hormone levels, which could in turn affect the mood of the patients.
Results of LMLFM show that the sparsity rate of the individual-related factors (i.e., the number of zero components divided by the total number of components) is 98.3%, which further implies that the random effects related to the individuals are less predictive. Nonetheless, the analysis of the observation-related parameters reveals a different story: 59 out of 136 are non-zero and the sparsity rate of the observation-related factors is 56.4%. This suggests that the depressive symptoms for multiple individuals with similar age are correlated (CC), as are the depressive symptoms for a single individual across time (LC), with CC dominating LC.
Evaluation on GSS Data. In our experiments with the GSS data, we consider the problem of predicting self-reported happiness. We define the outcome as a binary label indicating whether or not the individual reports being happy in a given year. The GSS data consist of 4,510 individuals, 1-30 observations per individual and 1,553 variables. Existing research [dolan2008we, oishi2011income] indicates that being married, good physical and psychological health, satisfaction with one's financial situation, having strong religious beliefs and being trusted are positively correlated with happiness, whereas the absence of these characteristics and unemployment are negatively correlated with happiness. The variables used to answer RC4 and the results of our experiments are shown in Fig. 4(b) and Table 2 respectively. Though PGEE and LASSO have the lowest f.p., their $R^2$ is relatively low. They tend to shrink the negative effects to zero, and perform poorly even on the training set, which strongly suggests that they underfit the data. LMLFM is competitive with the best performing methods in recovering the relevant variables, while significantly outperforming them in predictive performance. We note that variable selection with the GSS data is far more challenging than with the SWAN data because of the substantially larger number of variables and the collinearity of the variables. The low $R^2$ of FM and MLLASSO indicates that they significantly underfit the data. Though RF has a high $R^2$ score, the variables selected by RF are harder to explain than those of the other baselines (not shown). In contrast, LASSO achieves lower $R^2$, but selects variables that are consistent with those selected by LMLFM. We further find that all of the variables selected by LMLFM are consistent with the findings reported in [dolan2008we]. We find no evidence of individual-specific random effects, thus ruling out LC. This is perhaps explained by the large gap between consecutive observations (the survey is taken once every one to three years), within which many unobserved factors could potentially affect subjective happiness. We find that the sparsity rate of the remaining latent factors is 84.9%. Our analysis of the fixed effects shows that 126 out of 1199 effects are non-zero, of which only 6 features have absolute effects greater than 0.1; thus the vast majority of variables are uninformative.
Summary of Experimental Results. We conclude that LMLFM outperforms all the baselines and is the only multi-level mixed effects model that can reliably select variables associated with fixed as well as random effects from high-dimensional longitudinal data.
We have introduced LMLFM for predictive modeling from longitudinal data when the number of variables is large compared to the population size, the fixed and random effects are a priori unspecified, the interactions among variables are nonlinear, and the data exhibit complex correlation (LC, CC, or both). LMLFM, a natural generalization of FM to the longitudinal data setting, adopts a novel hierarchical Bayesian model with two layers of Laplace priors, where the first layer induces a sparse latent representation and the second layer identifies fixed effects and random effects. We train LMLFM using the iterated conditional modes algorithm, which offers both computational efficiency and strong convergence guarantees. Compared to the state-of-the-art alternatives, LMLFM yields more compact, easy-to-interpret, rapidly trainable, and hence scalable, models with minimal need for hyper-parameter tuning. Our experiments with simulated data with thousands of variables, and with two widely studied real-world longitudinal data sets, have shown that LMLFM outperforms the 1-level baselines and state-of-the-art 2-level and multi-level longitudinal models in terms of predictive accuracy, variable selection ability, and scalability to data with a large number of variables.
This work was funded in part by the NIH NCATS through the grant UL1 TR002014 and by the NSF through the grants 1518732, 1640834, and 1636795, the Edward Frymoyer Endowed Professorship at Pennsylvania State and the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science funded by the Pratiksha Trust at the Indian Institute of Science (both held by Vasant Honavar). The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.