1. Introduction
Longitudinal studies, which involve repeated observations of a chosen set of variables for the same set of individuals over a period of time, taken at not necessarily regularly spaced time points, are ubiquitous in many fields, including the health sciences, learning sciences, and social and economic sciences. Longitudinal studies are effective in identifying the time-varying as well as time-invariant risk factors associated with a particular outcome of interest, e.g., market crash or disease onset (hedeker2006longitudinal). Longitudinal data typically exhibit longitudinal correlation (LC), i.e., correlations among the repeated observations of a given individual over time; cluster correlation (CC), i.e., correlations among observations across individuals, e.g., due to characteristics such as age or demographic factors that they share; or both, i.e., multilevel correlation (MC). In practice, the specific multilevel correlation structure in the data can be quite complex and unknown a priori. In predictive modeling from longitudinal data, failure to account for such multilevel correlations can lead to misleading statistical inferences (gibbons1997random; liang2019lmlfm). As we shall see below, it can be nontrivial to choose a suitable correlation structure that reflects the correlations present in the data. Furthermore, the relationships between the covariates and outcomes of interest can be highly complex and nonlinear, and modern longitudinal data analysis (LDA) applications often call for methods that scale gracefully with the number of variables, the number of individuals, and the number of longitudinal observations per individual.
1.1. Related Work
Conventional LDA Methods. Predictive modeling for longitudinal data has been extensively studied for decades (hedeker2006longitudinal; verbeke2014analysis). Conventional longitudinal data analysis methods can be grouped into two broad categories: (i) marginal models and (ii) conditional models. Marginal models rely on assumptions about the marginal association among the observed outcomes. A typical example is generalized estimating equations (GEE) (liang1986longitudinal), where a working correlation matrix of a specified form (independent, exchangeable, autoregressive, unstructured, etc.) models the marginal association among the observed outcomes. The parameters of marginal models are often shared by all individuals in the population, yielding population-averaged effects or fixed effects. Conditional models, on the other hand, avoid the direct specification of the full correlation matrix by distinguishing between fixed effects of the sort described above and random effects, i.e., parameters that differ across individuals, so as to solve for the parameters of each individual conditioned on the parameters of all other individuals. A popular example of conditional models is the generalized linear mixed effects model (GLMM) (mcculloch1997maximum), in which the correlation of random effects across individuals specifies the correlation between their observed outcomes, e.g., health status. While both marginal and conditional models continue to be of interest (fitzmaurice2012applied; wang2014generalized; xiong2019mixed; liang2019lmlfm), many of the challenges remain to be addressed, especially the choice of correlation structure, the selection of the model parameters to be associated with random as opposed to fixed effects, and scalability.

Nonparametric LDA Methods. Nonparametric models, e.g., Gaussian processes (GP) (quintana2016bayesian; cheng2019additive)
offer an attractive approach to predictive modeling from longitudinal data for several reasons: (i) GPs make fewer assumptions about the underlying data distribution than their parametric counterparts; (ii) being nonparametric, GPs circumvent the need to choose a particular parametric form of the nonlinear predictive model; (iii) GPs leverage a parameterized kernel function that uses the proximity between observations as a proxy for the similarity of observed outcomes, while also providing the smoothing and interpolation needed for coping with data sampled at irregularly spaced time points; (iv) GPs can be made modular and interpretable by using a kernel function composed of multiple simpler kernel functions, typically each capturing the correlation structure within a subset of covariates that share a similar structure. Such models can also flexibly account for both longitudinal and cluster correlations in the data. For example, Quintana et al. (quintana2016bayesian) define a Dirichlet process prior on the kernel parameters, which induces a clustering structure across individuals, and employ a tridiagonal autoregressive kernel to approximate the longitudinal correlation, thereby substantially speeding up the computation of the inverse of the correlation matrix during model inference. Cheng et al. (cheng2019additive) utilize an additive kernel for Gaussian data and employ a stepwise search strategy to select the kernel components and covariates that optimize the predictive accuracy of the model. Timonen et al. (timonen2019interpretable) consider a heterogeneous kernel to model individual-specific (random) effects in the case of non-Gaussian data.

Limitations of GP-based LDA Methods. Despite these advantages, existing GP-based approaches to LDA suffer from several shortcomings that limit their applicability in real-world settings: (i) the choice of an appropriate kernel often involves tedious, expensive, and unreliable trial and error (rasmussen2003gaussian) or heuristics-guided (cheng2019additive) identification of a kernel, or a combination of several kernels, from a pool of candidate kernels; (ii) existing GP-based approaches to predictive modeling from longitudinal data do not scale up to longitudinal data sets with thousands of covariates and millions of individuals, which are not uncommon in many modern applications.
1.2. Overview of Contributions
A key challenge in predictive modeling of longitudinal data is modeling the complex correlation structure in the data. We posit that the observed correlation structure is induced by the interactions between time-invariant, individual-specific effects and time-varying population effects. Hence, we can divide the task of predictive modeling from longitudinal data into three subtasks: (i) given an observed data set, how do we estimate the time-varying and time-invariant effects? (ii) given the learned effects, how do we estimate the correlation structure present in the data? (iii) given the correlation structure, how do we predict as yet unobserved, e.g., future, outcomes?
We present LDKGPR, an effective and scalable deep kernel Gaussian process for regression that overcomes the key limitations of existing GP-based approaches to predictive modeling from longitudinal data. LDKGPR inherits the attractive features of GP (summarized above) while overcoming their key limitations. Specifically, LDKGPR eliminates the need for trial and error or ad hoc heuristics in choosing a kernel function by using a deep kernel learning technique (wilson2016deep), which combines the advantages of modern deep neural networks (DNN) with the nonparametric flexibility of kernel methods, to automate the discovery of the rich correlation structure from the data. LDKGPR adopts a multilevel model, conceptually similar to GLMM, to model the time-invariant individual-specific random effects and the time-varying fixed effects. We introduce a variant of the stochastic variational method (wilson2016stochastic) for efficient and scalable learning of LDKGPR in a regression setting. (Though our work focuses primarily on continuous outcomes, it is fairly easy to extend it to other outcome types such as binary and count outcomes; see Section 3.3 for details.) We report results of extensive experiments using both simulated and real-world benchmark longitudinal data sets that demonstrate the superior performance of LDKGPR over the popular GLMM and GEE baselines as well as several state-of-the-art nonlinear LDA methods (liang2019lmlfm; timonen2019interpretable).
2. Preliminaries
2.1. Longitudinal Data
We denote a longitudinal data set by $\mathcal{D} = (\mathbf{X}, \mathbf{y})$, where $\mathbf{X} \in \mathbb{R}^{N \times P}$ is an $N \times P$ covariate matrix and $\mathbf{y} \in \mathbb{R}^{N}$ is the vector of measured outcomes. We denote a row in $\mathbf{X}$ by $\mathbf{x}_{ij}$, with $i$ and $j$ indexing the individual and the time of the observation, respectively. Because the observations for each individual are irregularly sampled over time, we have for each individual $i$ a submatrix $\mathbf{X}_i$ with dimension $N_i \times P$, where $N_i$ is the number of observations available for individual $i$. If we denote by $I$ the number of individuals in $\mathcal{D}$, the covariate matrix is given by $\mathbf{X} = [\mathbf{X}_1^\top, \dots, \mathbf{X}_I^\top]^\top$. Accordingly, the outcomes are given by $\mathbf{y} = [\mathbf{y}_1^\top, \dots, \mathbf{y}_I^\top]^\top$.
2.2. Gaussian Processes
A Gaussian process (GP) is a stochastic process, i.e., a collection of random variables indexed by time and/or space where every finite collection of the random variables has a multivariate normal distribution
(williams2006gaussian). The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and hence a distribution over functions with a continuous domain, e.g., time or space. A kernel (or covariance function) describes the covariance of the Gaussian process random variables. Together with the mean function, the kernel completely defines a Gaussian process. More precisely, if a function $f$ has a GP prior $f \sim \mathcal{GP}(\mu(\cdot), k_{\theta}(\cdot,\cdot))$, where $\mu(\cdot)$ is the mean function and $k_{\theta}(\cdot,\cdot)$ is a (positive semidefinite) kernel function parameterized by $\theta$, then any finite collection of components of $f$ (denoted as $\mathbf{f}$) has a multivariate Gaussian distribution
$$\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}), \qquad (1)$$
where $\boldsymbol{\mu}$ is the mean vector and $\mathbf{K}$ is the covariance matrix. In a GP regression model, the function $f$ is treated as an unobserved signal that is linked to the outcomes through a likelihood function, which is typically Gaussian, such that $y \mid f(\mathbf{x}) \sim \mathcal{N}(f(\mathbf{x}), \sigma^2)$. Given $\mathbf{X}^{*}$ as the covariate matrix for the test data points, the predictive distribution is computed as
$$\mathbf{f}^{*} \mid \mathbf{X}, \mathbf{y}, \mathbf{X}^{*} \sim \mathcal{N}\big(\boldsymbol{\mu}^{*} + \mathbf{K}_{*}^{\top}(\mathbf{K} + \sigma^2 \mathbf{I})^{-1}(\mathbf{y} - \boldsymbol{\mu}),\ \mathbf{K}_{**} - \mathbf{K}_{*}^{\top}(\mathbf{K} + \sigma^2 \mathbf{I})^{-1}\mathbf{K}_{*}\big), \qquad (2)$$
where $\mathbf{K}_{*}$ is the covariance matrix between $\mathbf{X}$ and $\mathbf{X}^{*}$ and $\mathbf{K}_{**}$ is the covariance matrix of $\mathbf{X}^{*}$, both of which can be computed using the same kernel function $k_{\theta}$.
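As a concrete illustration, the predictive distribution in Eq. (2) can be computed directly for a small problem. The sketch below assumes a zero mean function and an RBF kernel (introduced next); the function names (`rbf`, `gp_posterior`) and hyperparameter values are illustrative choices, not part of LDKGPR.

```python
import numpy as np

def rbf(A, B, alpha=1.0, ell=1.0):
    # Squared-exponential (RBF) kernel matrix between rows of A (n,d) and B (m,d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return alpha**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xs, noise=0.1, ell=1.0):
    # Eq. (2) with a zero mean function: posterior mean and covariance at Xs.
    K = rbf(X, X, ell=ell) + noise**2 * np.eye(len(X))   # K + sigma^2 I
    Ks = rbf(X, Xs, ell=ell)                             # K_*  (train vs. test)
    Kss = rbf(Xs, Xs, ell=ell)                           # K_** (test vs. test)
    A = np.linalg.solve(K, Ks)                           # (K + sigma^2 I)^{-1} K_*
    return A.T @ y, Kss - Ks.T @ A

X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
mean, cov = gp_posterior(X, y, X, noise=0.05, ell=0.3)
```

Note that the posterior covariance does not depend on $\mathbf{y}$, a well-known property of Gaussian likelihoods.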
The radial basis function (RBF) is a popular choice of kernel function (williams2006gaussian). For a given pair of data points $(\mathbf{x}, \mathbf{x}')$, the RBF kernel is given by:
$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \alpha^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right), \qquad (3)$$
with kernel parameters $\theta = \{\alpha, \ell\}$.
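A quick numerical check of Eq. (3): the kernel matrix it produces is symmetric, has $\alpha^2$ on the diagonal, and is positive semidefinite. This is a generic sketch; the parameter values are arbitrary.

```python
import numpy as np

def rbf_kernel(X, alpha=2.0, ell=0.5):
    # Eq. (3) evaluated on all pairs of rows of X.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return alpha**2 * np.exp(-d2 / (2 * ell**2))

X = np.random.default_rng(0).normal(size=(10, 3))
K = rbf_kernel(X)
```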
2.3. Additive Gaussian Processes
Additive GP extends GP by decomposing the unobserved signal into a sum of independent signal components, i.e., $f = \sum_{d=1}^{D} w_d f^{(d)}$, where the $w_d$ are the coefficients associated with the individual components. In practice, each signal component $f^{(d)}$ is computed on a subset of the observed covariate vector $\mathbf{x}$. For example, in (cheng2019additive; timonen2019interpretable), each signal component is a function of only one or two covariates. Since every signal component has a GP prior, the joint signal is guaranteed to be a GP, such that Eq. (1) is rewritten as
$$\mathbf{f} \sim \mathcal{N}\Big(\sum_{d=1}^{D} w_d \boldsymbol{\mu}^{(d)},\ \sum_{d=1}^{D} w_d^2 \mathbf{K}^{(d)}\Big). \qquad (4)$$
Additive GP offers several advantages over the original GP: (i) it allows the choice of a different kernel function for each signal component; (ii) each signal component typically encodes information from a small set of homogeneous covariates, yielding a more interpretable model; (iii) more importantly, it permits modeling time-varying and time-invariant effects using different kernel functions, which is especially attractive in modeling longitudinal data.
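The additive construction in Eq. (4) amounts to summing kernel matrices computed on (possibly disjoint) covariate subsets; since the sum of positive semidefinite matrices is again positive semidefinite, the result is a valid GP covariance. A minimal sketch, in which the subset split and weights are illustrative:

```python
import numpy as np

def rbf(A, ell=1.0):
    # Unit-amplitude RBF kernel on the rows of A.
    d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
w = np.array([0.7, 0.3])                      # component weights w_d
# Each component sees only a subset of the covariates (Eq. 4 with D = 2).
K = w[0]**2 * rbf(X[:, :2]) + w[1]**2 * rbf(X[:, 2:])
```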
While it is possible to learn the kernel parameters of both GP and Additive GP using either the maximum likelihood framework (williams2006gaussian) or Bayesian approaches (cheng2019additive; timonen2019interpretable), both require an iterative algorithm that inverts an $N \times N$ matrix at each iteration, yielding $\mathcal{O}(N^3)$ time complexity and $\mathcal{O}(N^2)$ space complexity per iteration in the training phase. This prohibitive computational complexity limits the applicability of such approaches in real-world settings with large numbers of observations. Moreover, existing approaches to predictive modeling of longitudinal data lack clear guidance for choosing an optimal set of kernels best suited to the data at hand. Hence, current methods for choosing kernels (cheng2019additive; timonen2019interpretable) tend to be heuristic in nature, e.g., based on the data types of the covariates, with the choice fixed throughout the subsequent analysis. Such practice fails to capitalize on the expressive power of GP, leading to a suboptimal correlation structure.
3. Longitudinal Deep Kernel Gaussian Process Regression
We proceed to describe LDKGPR, an additive GP with a learnable deep kernel for regression with longitudinal data, before introducing an efficient algorithm for predictive modeling from longitudinal data. (Note that although our model is designed for regression problems with Gaussian outcomes, it is straightforward to extend it to binary or count outcomes by replacing the Gaussian likelihood with a logistic or Poisson/negative binomial likelihood.)
3.1. Modeling the Correlation Structure
Recall that longitudinal data exhibit complex correlations arising from the interaction between time-varying effects and time-invariant effects. Hence, we decompose the signal function into two components, i.e., $f = f_t + f_i$, where $f_t$ models the time-varying effects and $f_i$ models the time-invariant effects. The resulting probabilistic model is given by:
$$f_t \sim \mathcal{GP}\big(\mu_t(\cdot), k_{\theta_t}(\cdot,\cdot)\big), \qquad (5)$$
$$f_i \sim \mathcal{GP}\big(\mu_i(\cdot), k_{\theta_i}(\cdot,\cdot)\big), \qquad (6)$$
$$f(\mathbf{x}) = f_t(\mathbf{x}) + f_i(\mathbf{x}), \qquad (7)$$
$$y \mid f(\mathbf{x}) \sim \mathcal{N}\big(f(\mathbf{x}), \sigma^2\big). \qquad (8)$$
We denote the kernel parameters (to be learned from data) for the time-varying effects and the time-invariant effects by $\theta_t$ and $\theta_i$, respectively (see below for details). The mean functions can be estimated from data or, if known, fixed to their known values. In this study, following (williams2006gaussian; wilson2016deep; wilson2016stochastic; cheng2019additive; timonen2019interpretable), we set $\mu_t = \mu_i = 0$. Assuming that $f_t$ and $f_i$ are conditionally independent given the data, we can express the distribution of $\mathbf{f}$ as follows:
$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}), \qquad (9)$$
where $\mathbf{K} = \mathbf{K}_t + \mathbf{K}_i$.
3.1.1. Timevarying Kernel
The time-varying kernel is designed to capture the longitudinal correlation in the data. The structure of our time-varying kernel is shown in Figure 1(a). Let $h_{\theta_t}(\cdot)$ be a nonlinear mapping function given by a deep architecture parameterized by $\theta_t$. Given the covariate vectors of a pair of data points $(\mathbf{x}_{ij}, \mathbf{x}_{i'j'})$, where $i, i'$ index the individuals and $j, j'$ index the observations, the time-varying kernel is given by:
$$k_{\theta_t}(\mathbf{x}_{ij}, \mathbf{x}_{i'j'}) = k_{\text{RBF}}\big(h_{\theta_t}(\mathbf{x}_{ij}), h_{\theta_t}(\mathbf{x}_{i'j'})\big). \qquad (10)$$
Note that the RBF kernel is based on Euclidean distance, which is not a useful measure of distance in a high-dimensional input space (aggarwal2001surprising). Hence, we use a deep neural network (goodfellow2016deep), specifically a nonlinear encoder, to map the input space to a low-dimensional latent space, and then apply the RBF kernel in the latent space.
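A minimal sketch of this deep kernel: a small fully connected encoder (here with fixed random weights standing in for the learned parameters $\theta_t$) maps inputs to a 2-dimensional latent space, and the RBF kernel of Eq. (10) is applied there. The architecture and activation are illustrative, not the exact ones used in our implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 2))

def encoder(X):
    # Two-layer fully connected encoder with tanh activations: R^5 -> R^2.
    return np.tanh(np.tanh(X @ W1) @ W2)

def deep_rbf_kernel(X, alpha=1.0, ell=1.0):
    # Eq. (10): the RBF kernel applied in the latent space h(X).
    Z = encoder(X)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return alpha**2 * np.exp(-0.5 * d2 / ell**2)

X = rng.normal(size=(12, 5))
K = deep_rbf_kernel(X)
```

Because the RBF kernel is applied to the encoder outputs, the resulting matrix is automatically a valid (positive semidefinite) covariance for any encoder.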
3.1.2. Timeinvariant Kernel
We encode the individual-specific characteristics that are invariant with respect to time using the time-invariant kernel $k_{\theta_i}$, whose structure is shown in Figure 1(b). Let $e(\cdot)$ be a function that identifies the individual associated with a data point, and let $\mathbf{v}(\cdot)$ be an embedding function that maps each individual to a vector in the latent space. Then for any pair of data points with arbitrary observation indices, the time-invariant kernel is given by:
$$k_{\theta_i}(\mathbf{x}_{ij}, \mathbf{x}_{i'j'}) = k_{\text{RBF}}\big(\mathbf{v}(e(\mathbf{x}_{ij})), \mathbf{v}(e(\mathbf{x}_{i'j'}))\big) = k_{\text{RBF}}(\mathbf{v}_i, \mathbf{v}_{i'}). \qquad (11)$$
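A sketch of the time-invariant kernel: every observation inherits the latent embedding of its individual, so any two observations of the same individual attain the maximal kernel value regardless of when they were taken. The embedding matrix here is random, standing in for the learned $\mathbf{v}$.

```python
import numpy as np

def time_invariant_kernel(ids, V, alpha=1.0, ell=1.0):
    # ids[n] is the individual index of observation n; V holds one embedding per individual.
    Z = V[ids]                                  # v(e(x)) for every observation
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return alpha**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
ids = np.array([0, 0, 0, 1, 1, 2])              # 6 observations from 3 individuals
V = rng.normal(size=(3, 2))                     # 2-dimensional individual embeddings
K = time_invariant_kernel(ids, V)
```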
3.2. Learning LDKGPR from data
We draw on the large body of existing work on efficient approximation techniques to ensure the scalability of LDKGPR to large-scale problems. Specifically, we draw on (wilson2016stochastic), which combines inducing points with variational inference. The idea of inducing points is to reduce the effective number of input data points from $N$ to $M$ ($M \ll N$), where $M$ is the number of inducing points, thus greatly simplifying the computation of the GP posterior (see Section 3.3 for details). Let $\mathbf{Z}$ be the collection of inducing points and $\mathbf{u}$ their corresponding signal values. Instead of identifying the inducing points in the input space, we construct inducing points that lie in the latent space. In addition, we adopt separate notation to distinguish the inducing points from the actual data points. We can now express the joint signal distribution as follows:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}_{ff} & \mathbf{K}_{fu} \\ \mathbf{K}_{uf} & \mathbf{K}_{uu} \end{bmatrix}\right). \qquad (12)$$
Therefore, the conditional signal distribution is
$$\mathbf{f} \mid \mathbf{u} \sim \mathcal{N}\big(\mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{u},\ \mathbf{K}_{ff} - \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{K}_{uf}\big). \qquad (13)$$
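The conditional in Eq. (13) can be verified numerically: when an input coincides with an inducing point, the conditional mean reproduces the corresponding entry of $\mathbf{u}$ and the conditional variance collapses to (near) zero. A sketch with a generic RBF kernel and a small jitter term added for numerical stability:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def conditional_f_given_u(Kff, Kfu, Kuu, u, jitter=1e-8):
    # Eq. (13): mean = Kfu Kuu^{-1} u, cov = Kff - Kfu Kuu^{-1} Kuf.
    Kj = Kuu + jitter * np.eye(len(Kuu))
    mean = Kfu @ np.linalg.solve(Kj, u)
    cov = Kff - Kfu @ np.linalg.solve(Kj, Kfu.T)
    return mean, cov

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))
Z = X[:5]                                  # inducing inputs (here, a subset of the data)
u = rng.normal(size=5)
mean, cov = conditional_f_given_u(rbf(X, X), rbf(X, Z), rbf(Z, Z), u)
```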
Let $\Theta$ denote the model parameters; we seek to maximize the log marginal likelihood $\log p(\mathbf{y} \mid \Theta)$. By assuming a variational posterior $q(\mathbf{f}, \mathbf{u})$ over the joint signals, we have:
$$\log p(\mathbf{y}) \geq \mathbb{E}_{q(\mathbf{f})}\big[\log p(\mathbf{y} \mid \mathbf{f})\big] - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big). \qquad (14)$$
Eq. (14) provides the variational evidence lower bound (ELBO). We define the proposal posterior $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{S})$. Following (wilson2016stochastic), we iteratively update the variational parameters $\mathbf{m}$ and $\mathbf{S}$ using stochastic gradient ascent, where a noisy approximation of the gradient of the ELBO is computed on minibatches of the training data. Specifically, with $T$ samples of $\mathbf{u}$ and a minibatch $\mathcal{B}$ of data points, the ELBO is estimated as
$$\widehat{\mathrm{ELBO}} = \frac{N}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} \frac{1}{T} \sum_{t=1}^{T} \log p\big(y_n \mid f_n^{(t)}\big) - \mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big). \qquad (15)$$
Note that since the KL term in Eq. (15) involves two multivariate Gaussian distributions, it can be computed in closed form without the need for Monte Carlo estimation. Specifically, we have:
$$\mathrm{KL}\big(q(\mathbf{u}) \,\|\, p(\mathbf{u})\big) = \frac{1}{2}\left[\operatorname{tr}\big(\mathbf{K}_{uu}^{-1}\mathbf{S}\big) + \mathbf{m}^{\top}\mathbf{K}_{uu}^{-1}\mathbf{m} - M + \log\frac{|\mathbf{K}_{uu}|}{|\mathbf{S}|}\right]. \qquad (16)$$
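Eq. (16) is straightforward to implement; a quick sanity check is that the divergence vanishes when $q(\mathbf{u})$ equals the prior ($\mathbf{m} = \mathbf{0}$, $\mathbf{S} = \mathbf{K}_{uu}$) and is positive otherwise. A sketch:

```python
import numpy as np

def kl_gaussians(m, S, Kuu):
    # Eq. (16): KL( N(m, S) || N(0, Kuu) ) for an M-dimensional Gaussian.
    M = len(m)
    tr = np.trace(np.linalg.solve(Kuu, S))
    quad = m @ np.linalg.solve(Kuu, m)
    _, logdet_K = np.linalg.slogdet(Kuu)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (tr + quad - M + logdet_K - logdet_S)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
Kuu = A @ A.T + 6 * np.eye(6)          # a positive definite prior covariance
```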
To estimate the Monte Carlo likelihood, we need to first sample $\mathbf{u}$ from $q(\mathbf{u})$ and then sample $\mathbf{f}$ from $p(\mathbf{f} \mid \mathbf{u})$. Both of these tasks can be accomplished using Cholesky decomposition and reparameterization. For example, to sample $\mathbf{u}$, we first compute the Cholesky factor $\mathbf{L}$ such that $\mathbf{L}\mathbf{L}^{\top} = \mathbf{S}$; a sample of $\mathbf{u}$ is then obtained as $\mathbf{u} = \mathbf{m} + \mathbf{L}\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The cost of this sampling step is dominated by the Cholesky decomposition of $\mathbf{S}$. Using the local kernel interpolation trick on $\mathbf{f}$ and Kronecker decomposition (wilson2016stochastic), one can further reduce the sampling cost. To perform gradient ascent, we compute the derivatives with respect to the model parameters using the chain rule and matrix derivatives (petersen2008matrix). We omit the detailed derivation of the parameter updates using backpropagation, since they can be realized using modern toolboxes that support automatic differentiation, e.g., PyTorch (NEURIPS2019_9015).

3.3. Prediction
A common approximation assumption associated with the inducing-point idea is that the signals of the training data and the test data are conditionally independent given $\mathbf{u}$ (quinonero2005unifying). Hence,
$$q(\mathbf{f}^{*}) = \int p(\mathbf{f}^{*} \mid \mathbf{u})\, q(\mathbf{u})\, d\mathbf{u}. \qquad (17)$$
This is particularly useful during the test phase. Given the covariate matrix $\mathbf{X}^{*}$ for the test data, the prediction distribution is given by:
$$p(\mathbf{y}^{*} \mid \mathbf{X}^{*}, \mathbf{y}) = \mathbb{E}_{q(\mathbf{f}^{*})}\big[p(\mathbf{y}^{*} \mid \mathbf{f}^{*})\big]. \qquad (18)$$
In the case of Gaussian likelihood, the closed-form solution is given by:
$$\mathbf{y}^{*} \sim \mathcal{N}\big(\mathbf{K}_{*u}\mathbf{K}_{uu}^{-1}\mathbf{m},\ \mathbf{K}_{**} - \mathbf{K}_{*u}\mathbf{K}_{uu}^{-1}(\mathbf{K}_{uu} - \mathbf{S})\mathbf{K}_{uu}^{-1}\mathbf{K}_{u*} + \sigma^2\mathbf{I}\big). \qquad (19)$$
In general, the expectation in Eq. (18) may not have a closed-form solution. In such cases, we can employ Monte Carlo sampling to estimate the mean and variance of the prediction distribution.
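The closed-form predictive in Eq. (19) depends on the training data only through the $M \times M$ inducing-point quantities, which is what makes test-time prediction cheap. A sketch with a generic RBF kernel and arbitrarily chosen variational parameters: when $\mathbf{S} = \mathbf{K}_{uu}$, the middle term vanishes and the predictive covariance reduces to the prior $\mathbf{K}_{**} + \sigma^2\mathbf{I}$.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def svgp_predict(Xs, Z, m, S, noise):
    # Eq. (19): predictive mean/cov from the variational posterior q(u) = N(m, S).
    Kuu, Ksu, Kss = rbf(Z, Z), rbf(Xs, Z), rbf(Xs, Xs)
    A = Ksu @ np.linalg.inv(Kuu)                   # K_{*u} K_{uu}^{-1}
    mean = A @ m
    cov = Kss - A @ (Kuu - S) @ A.T + noise**2 * np.eye(len(Xs))
    return mean, cov

rng = np.random.default_rng(0)
Xs, Z = rng.normal(size=(8, 2)), rng.normal(size=(4, 2))
m, S = rng.normal(size=4), rbf(Z, Z)               # S = Kuu: posterior equals prior
mean, cov = svgp_predict(Xs, Z, m, S, noise=0.1)
```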
4. Experiments
We proceed to describe the results of experiments on simulated as well as realworld benchmark data to compare LDKGPR with several stateoftheart longitudinal models. The experiments are designed to answer the research questions regarding three aspects of LDKGPR: prediction accuracy, scalability and interpretability: (RQ1) How does the performance of LDKGPR compare with the stateoftheart methods on standard longitudinal regression tasks? (RQ2) How does the scalability of LDKGPR compare with that of the stateoftheart longitudinal regression models? (RQ3) Can LDKGPR reliably recover the rich correlation structure from the data? (RQ4) How do the different components of LDKGPR contribute to the overall performance of LDKGPR?
4.1. Data Sets
We used one simulated data set and three realworld longitudinal data sets in our experiments:
Simulated data. We construct simulated longitudinal data sets that exhibit longitudinal correlation (LC) and multilevel correlation (MC) as follows. The outcome is generated as $\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon}$, where $f$ is a nonlinear transformation of the observed covariate matrix $\mathbf{X}$ and the residual $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$. To simulate longitudinal correlation, we simply set $\boldsymbol{\Sigma}$ to a block diagonal matrix. For each individual, we use a first-order autoregressive (AR(1)) correlation structure with a fixed decay factor. To simulate a data set that exhibits multilevel correlation, we first split the individuals into clusters. We then define the cluster correlation matrix by setting the correlation associated with data points in the same cluster to a fixed constant. Finally, we compute the multilevel correlation by summing the longitudinal correlation and the cluster correlation. Following (cheng2019additive; timonen2019interpretable), we simulate a fixed number of individuals, observations per individual, and covariates. To simulate correlation among the covariates, we first generate base features independently from a uniform distribution; $\mathbf{X}$ is then computed by passing the base features through an encoder network. We vary the number of clusters across simulation settings.

Study of Women's Health Across the Nation (SWAN) (sutton2005sex). SWAN is a multi-site longitudinal study designed to examine the health of women during their midlife years. We consider the task of predicting the CESD score, which is used for screening for depression. Similar to (liang2019lmlfm), we define the adjusted CESD score by subtracting the screening threshold from the raw score, so that a positive adjusted score indicates depression. The variables of interest include aspects of physical and mental health, and demographic factors such as race and income. The resulting data set has 3,300 individuals, 137 variables and 28,405 records.
General Social Survey (GSS) (smith2017general). The GSS data set includes data on contemporary American society, collected with the goal of understanding and explaining trends and constants in attitudes, behaviors, and attributes. Since 1972, it has recorded responses to survey questions about demography, behavior, and attitudes that elicit information about how Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions. In our experiment, we consider the task of predicting the self-reported general happiness of 4,510 individuals using 1,553 features and 59,599 records. The task is identical to that considered in (liang2019lmlfm), where the outcome encodes whether or not the individual reports being happy.
The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) (marinescu2018tadpole). TADPOLE includes data from individuals who participated in the Alzheimer's Disease Neuroimaging Initiative (ADNI) studies ADNI 1 and ADNI 2 and who agreed to participate in ADNI 3. The TADPOLE challenge aims at predicting the symptoms related to AD in the short to medium term (1-5 years) for a group of people considered to be at risk of AD. In our experiment, we focus on predicting the ADAS-Cog13 score using the demographic features and MRI measures (Hippocampus, Fusiform, WholeBrain, Entorhinal and MidTemp). The resulting data set has 1,681 individuals, 24 variables and 8,771 records.
4.2. Experimental Setup
We proceed to describe our experimental setup designed to answer the research questions RQ1RQ4.
To answer RQ1, we use both simulated data and real-world data. To evaluate regression performance, we compute the mean and standard deviation of the coefficient of determination ($R^2$) between the actual (ground truth) and predicted outcomes of each method on each data set across 10 independent runs. All methods are trained on the training set, hyperparameters are tuned on the validation set, and performance is evaluated on the test set. We use 50%, 20%, and 30% of the data for training, validation and testing, respectively.

To answer RQ2, we take from each real-world data set a subset consisting of the individuals with the largest numbers of observations. We record the run time per iteration of each method on both the individual subset and the full data set. Because not all baseline methods implement GPU acceleration, we compare the run times of all methods without GPU acceleration. We report execution failure if a method fails to converge within 48 hours or generates an execution error (liang2019lmlfm).
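For reference, the $R^2$ metric reported in our tables (as a percentage) is $100 \cdot (1 - \mathrm{SS}_{\text{res}}/\mathrm{SS}_{\text{tot}})$; it is negative whenever a model predicts worse than the constant mean predictor. A minimal sketch:

```python
import numpy as np

def r2_pct(y_true, y_pred):
    # Coefficient of determination as a percentage; can be arbitrarily negative.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 100.0 * (1.0 - ss_res / ss_tot)
```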
To answer RQ3, we rely mainly on the simulated data since the actual correlation structures underlying the real world data sets are not known. We evaluate the performance of each method by visualizing the learned correlation matrix. Additionally, we provide a case study of realworld data by comparing and interpreting the correlation matrix recovered by the different methods.
To answer RQ4, we compare the performance of LDKGPR with LRBFGPR, a variant that replaces the learned deep kernel with a simple RBF kernel, and with a variant of LDKGPR without the time-invariant effects. Though we could construct a variant of LDKGPR without the time-varying effects, such a model would ignore the observed covariates by taking as input only the individual indices, thus giving the same prediction for a given individual regardless of time. This design is unrealistic in LDA applications; therefore, we do not compare LDKGPR to this variant.
All experiments are conducted on a desktop machine with an Intel Core i7-7700K CPU, 16GB RAM and a GTX 1060 6GB graphics card.
4.3. Baseline Methods
We compare LDKGPR with conventional as well as state-of-the-art methods for predictive modeling from longitudinal data:

GLMM (bates2014fitting), a conventional multilevel mixed-effect model for longitudinal data. GLMM accounts for data correlations by specifying the variables associated with random effects and fixed effects. Because in the general setting we do not know the exact set of variables that are subject to fixed effects as opposed to random effects, we assume that all variables are subject to random effects conditioned on the individual.

GEE (inan2017pgee), a conventional longitudinal model based on generalized estimating equations. GEE accommodates data correlations by factorizing the covariance matrix into a product of a diagonal variance matrix and a parameterized working correlation matrix. In our experiments, we use the first-order autoregressive correlation, as it delivers the best performance among all choices.

LMLFM (liang2019lmlfm), a state-of-the-art multilevel mixed-effect model based on factorization machines. In contrast to GLMM, which requires expert input to specify the variables associated with random effects and fixed effects, LMLFM automatically distinguishes random effects from fixed effects and selects a subset of predictive variables by optimizing a suitably specified objective function.

LGPR (timonen2019interpretable), a state-of-the-art Bayesian longitudinal model based on additive GP. LGPR first fits each GP component using a chosen kernel; the final signal function is a weighted sum of all GP components. In our experiments, a zero-sum kernel is used for categorical covariates and a heterogeneous kernel is used for continuous covariates. Unlike the other baselines, LGPR is solved using MCMC, which generally requires a large number of iterations to reach the stationary distribution. In our experiments, the number of MCMC sampling iterations of LGPR is set to 2000.
4.4. Implementation Details
We implement LDKGPR using PyTorch. We formulate the encoder $h_{\theta_t}$ using a deep neural network (DNN) consisting of multiple fully connected layers with leaky ReLU activations (xu2015empirical) between layers. The latent dimension is chosen based on cross-validation performance on the validation set. Note that the implementation is flexible enough to allow more advanced DNN structures such as CNNs and RNNs. The embedding function $\mathbf{v}$ is realized as a parameter matrix with one row per individual. We use a diagonal matrix for the variational posterior covariance $\mathbf{S}$. All model parameters are updated iteratively using the Adam optimizer. The learning rates are chosen based on cross-validation performance on the validation set. The training and testing batch sizes are fixed across data sets. The maximum number of training epochs of LDKGPR is set to 300 for all data sets.
We use the implementations of GLMM, GEE and LGPR available in the lme4, PGEE and lgpr packages, respectively, from CRAN (https://cran.r-project.org/). We use the LMLFM implementation from https://github.com/junjieliang672/LMLFM. The implementation of LDKGPR and the data used in our experiments will be made publicly available upon acceptance of the manuscript.
5. Results
We proceed to describe the results of our experiments designed to answer the research questions RQ1RQ4.
5.1. LDKGPR vs. the stateoftheart
Table 1. $R^2$ (%, mean ± std over 10 runs) on the simulated data; the four MC settings differ in the number of simulated clusters.

Method  | LC          | MC             | MC             | MC             | MC
LDKGPR  | 74.4 ± 3.5  | 90.5 ± 15.5    | 97.4 ± 2.1     | 94.9 ± 2.4     | 99.4 ± 0.4
LGPR    | 37.1 ± 19.1 | -123.6 ± 162.0 | 26.3 ± 43.2    | 9.1 ± 14.8     | 0.1 ± 5.9
LMLFM   | 54.7 ± 15.1 | -138.3 ± 121.9 | 48.3 ± 123.6   | 22.6 ± 49.0    | 36.2 ± 41.1
GLMM    | 5.3 ± 27.9  | -656.3 ± 719.8 | -801.4 ± 507.4 | -684.1 ± 491.3 | -528.7 ± 313.5
GEE     | 59.0 ± 24.5 | -636.1 ± 606.0 | -703.6 ± 465.8 | -665.6 ± 554.3 | -516.5 ± 457.5
RQ1: How does the performance of LDKGPR compare with the baselines on standard longitudinal regression tasks?
The results of our experiments that evaluate the various models based on their estimated regression accuracy are reported in Table 1 and Table 2 for simulated and realworld data sets respectively.
In the case of simulated data, we find that GEE and GLMM fail in the presence of multilevel correlations (MC) in the data, with their mean $R^2$ being negative. These results can be explained by the fact that GEE is designed to handle only pure LC and thus fails to account for CC. While GLMM is capable of handling MC (and hence CC), it requires practitioners to specify the cluster structure responsible for CC prior to model fitting; in our experiments, the cluster structure is assumed unknown. When the cluster structure is unspecified, the performance of GLMM is, not surprisingly, degraded. Moreover, we find that although LMLFM outperforms GLMM and GEE in the presence of MC, its $R^2$ is still quite low. This is because LMLFM accounts for only a special case of MC, namely CC among individuals observed at the same time points, and not all of the CC present in the data.
We note that the performance of both LMLFM and GEE tends to vary across data sets, whereas GLMM fails on data sets that include data from a large population or measurements over a large number of variables, as well as data that exhibit MC with an unknown cluster structure.
We find that LGPR has relatively poor performance on both simulated and real-world data. This might be due to the fact that LGPR treats the variables independently, building a kernel component on each variable before taking their weighted sum. Though it is possible to incorporate higher-order interactions between variables into LGPR, doing so results in large numbers of interaction parameters to be estimated, which presents challenges, especially when working with data from small populations.
On the simulated data, we see that LDKGPR consistently and significantly outperforms the baselines by a large margin. On the realworld data sets, we find that LDKGPR outperforms the longitudinal baselines in a vast majority of the cases (except when the number of data samples is small).
We see that all methods benefit from larger training data sets, provided they can process the data. Since LDKGPR relies on learning a deep kernel, it can be expected to perform better when trained on large amounts of data.
We believe that two factors contribute to the overall superior performance of LDKGPR. First, because LDKGPR uses a learned deep kernel, it is capable of accommodating any nonlinear relationships that might exist between the covariates and the outcomes. Second, LDKGPR does not rely on specific assumptions on how the data correlation is structured. It flexibly learns the correlation from the data.
5.2. Scalability of LDKGPR
RQ2: How does the scalability of LDKGPR compare with that of the state-of-the-art longitudinal regression models? The CPU run times and failures to complete execution on the real-world data sets are reported in Table 2. We see that LGPR, GLMM and GEE are exceptionally sensitive to the number of variables; indeed, their computational complexity grows rapidly with the number of variables. In contrast, LDKGPR and LMLFM scale gracefully with increases in the number of individuals as well as the number of variables measured per individual.
| Data set | Obs. | Indiv. | Vars. | LDKGPR (%) | LGPR (%) | LMLFM (%) | GLMM (%) | GEE (%) | LDKGPR (s) | LGPR (s) | LMLFM (s) | GLMM (s) | GEE (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TADPOLE | 595 | 50 | 24 | 44.0±5.6 | 261.1±9.0 | 8.7±5.1 | 50.8±5.5 | 11.4±4.8 | 0.03 | 6.39 | 0.01 | 0.01 | 0.13 |
| SWAN | 550 | 50 | 137 | 46.8±4.9 | 16.6±12.7 | 38.6±4.2 | 40.1±7.7 | 46.4±8.0 | 0.03 | 26.1 | 0.02 | 0.06 | 0.59 |
| GSS | 1,500 | 50 | 1,553 | 19.1±3.7 | N/A | 15.3±1.4 | N/A | 4.6±3.5 | 0.12 | N/A | 0.30 | N/A | 30.1 |
| TADPOLE | 8,771 | 1,681 | 24 | 64.9±1.4 | N/A | 10.4±0.6 | 61.9±1.9 | 17.6±0.7 | 1.48 | N/A | 0.25 | 0.03 | 4.66 |
| SWAN | 28,405 | 3,300 | 137 | 52.5±0.4 | N/A | 48.6±2.0 | N/A | N/A | 4.48 | N/A | 1.74 | N/A | N/A |
| GSS | 59,599 | 4,510 | 1,553 | 56.9±0.1 | N/A | 54.8±2.2 | N/A | N/A | 5.31 | N/A | 24.35 | N/A | N/A |
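The scaling difference discussed above can be illustrated with a standard linear-algebra identity. Exact GP inference requires solving an n x n system at O(n^3) cost; with an n x m low-rank kernel factor (m << n), the Woodbury identity reduces the solve to an m x m system. This sketch is a generic illustration of that principle under stated assumptions, not LDKGPR's actual solver.

```python
# Sketch: why low-rank kernel structure scales better than exact GP
# inference. Solving (noise*I + U U^T) x = y directly costs O(n^3);
# the Woodbury identity needs only an m x m solve, costing O(n m^2).
import numpy as np

def woodbury_solve(U, noise, y):
    """Solve (noise*I_n + U U^T) x = y via the Woodbury identity."""
    n, m = U.shape
    A = noise * np.eye(m) + U.T @ U              # small m x m system
    return (y - U @ np.linalg.solve(A, U.T @ y)) / noise

rng = np.random.default_rng(1)
n, m = 500, 10
U = rng.normal(size=(n, m))                      # low-rank factor
y = rng.normal(size=n)

direct = np.linalg.solve(1e-1 * np.eye(n) + U @ U.T, y)  # O(n^3)
fast = woodbury_solve(U, 1e-1, y)                        # O(n m^2)
assert np.allclose(direct, fast)
```

The same identity underlies most scalable GP approximations, including the low-rank structure LDKGPR exploits.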
5.3. Recovery of Correlation Structure
RQ3: Can LDKGPR reliably recover the rich correlation structure from the data?
5.3.1. Correlation Structure from Simulated Data
The outcome correlations estimated by all methods on the simulated data are shown in Figure 2. It is easy to see that LMLFM, GLMM and GEE are incapable of recovering MC. Though LGPR can model data which exhibit MC, the resulting correlation is overly simple: in the presence of MC, we see that only one of the known clusters is correctly recovered. We conjecture that the inferior performance of LGPR is due to the limited expressive power of the kernel functions it uses. We note that LDKGPR is able to recover most of the correlation structure present in the data. We further note that LDKGPR, despite being the best performer among the methods compared in this study, tends to underestimate the number of clusters because the full data correlation is approximated by a low-rank matrix (see Eq. (3.3)), resulting in information loss.
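The cluster-underestimation effect conjectured above can be reproduced in miniature: truncating the eigendecomposition of a block-structured covariance to a rank smaller than the number of blocks simply drops the weakest block. The block sizes and weights below are hypothetical, chosen only to make the effect visible.

```python
# Sketch: a rank-r approximation of a block (multi-cluster) covariance
# loses clusters when r is smaller than the number of blocks.
import numpy as np

def low_rank_approx(C, r):
    """Best rank-r approximation of a symmetric matrix via eigh."""
    w, V = np.linalg.eigh(C)
    idx = np.argsort(w)[::-1][:r]                # keep r largest eigenvalues
    return (V[:, idx] * w[idx]) @ V[:, idx].T

# Three clusters of four individuals, with decreasing within-cluster weight.
C = np.kron(np.diag([3.0, 2.0, 1.0]), np.ones((4, 4)))
C2 = low_rank_approx(C, 2)                       # rank < number of clusters

assert np.allclose(C2[:4, :4], 3.0)              # strongest block preserved
assert np.allclose(C2[8:, 8:], 0.0)              # weakest cluster erased
```

Raising the rank of the approximation recovers the missing cluster, at the cost of the extra parameters and computation discussed in Section 5.2.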
5.3.2. Correlation Structure in RealWorld Data
We now show how the correlation structure discovered by LDKGPR can offer useful insights into longitudinal data gathered in real-world settings. Because of space constraints, we analyze only the correlation structure uncovered by LDKGPR from the SWAN data.
| Data set | Vars. | LDKGPR | LDKGPR− | LRBFGPR |
|---|---|---|---|---|
| TADPOLE | 24 | 64.9±1.4 | 13.2±1.1 | 55.5±2.4 |
| SWAN | 137 | 52.5±0.4 | 29.0±3.2 | 5.4±1.6 |
| GSS | 1,553 | 56.9±0.1 | 56.2±0.1 | 140.4 |
Figure 3(a) displays the time-invariant cluster correlation after fitting the model to the full SWAN data. We find that the individuals can be roughly grouped into two clusters, of which the first shows a clear cluster structure. The distributions of the adjusted CESD score of the two clusters are presented in Figure 3(b). Individuals assigned to cluster #1 tend to have a lower risk of depression, whereas those assigned to cluster #2 tend to have a higher risk of depression. Closer examination of individuals assigned to the clusters reveals potentially useful directions for further analysis. In our case study, we can identify at least two individuals who stand out for further in-depth investigation (see Figure 4). Individual #1: Judging from the longitudinal correlation shown in Figure 4(a), we detect a clear transition from non-depression to depression starting around age 52. Comparison of the covariates before and after this age reveals potential reasons for the transition, e.g., family and financial issues that were not present before age 52. In addition, while we have two observations at one particular age, the first observation tends to be uncorrelated with the others, and hence is more likely to be an outlier. Another interesting finding is that, although our model detects a potential transition from non-depression to depression, the predictions remain consistently below the threshold, making the individual non-depressive overall. This could further imply that the depression symptom associated with individual #1 is mild and could be temporary. Perhaps mental health services could help individual #1 successfully get through what appears to be a temporary depression, likely triggered by family and financial issues.
Individual #2: In the case of individual #2, shown in Figure 4(b), we find a transition from non-depression to depression. The transition is perhaps explained by the fact that individual #2 is going through a change of menopausal status around that time. It is worth noting that a sudden rise in CESD score is observed at one visit (the discordant point), which, surprisingly, is consistently ignored by our model. To understand why, we searched for clues by comparing the covariates before and after that visit, but found no clear evidence to support the sudden depression. Therefore, we conjecture that the observed adjusted CESD score at that visit is likely unreliable, and a more careful examination might have been warranted. In summary, the correlation structure revealed by LDKGPR offers useful insights into longitudinal data.
5.4. Deconstructing LDKGPR
RQ4: How do the different components of LDKGPR contribute to the overall performance of LDKGPR?
A regression accuracy comparison carried out on the complete real-world data sets is shown in Table 3. LDKGPR vs. LDKGPR−: We see a dramatic drop in regression performance when the time-invariant effects are not included; the average score across the three data sets drops substantially. The results imply that the time-independent components of LC and CC modeled by the time-invariant effects are essential for accurate modeling of longitudinal data. Decomposing the task of correlation estimation into time-varying and time-invariant components simplifies the task: the correlation estimated through the time-invariant component is analogous to estimating the mean correlation, while that estimated using the time-varying effects contributes the residual correlation. This two-phase process helps to reduce the variance of the estimated correlations. The time-invariant components are estimated using latent factors, which is helpful when there are unobserved covariates in the data. LDKGPR vs. LRBFGPR: We see that LDKGPR consistently outperforms LRBFGPR, and the performance gap increases with the number of covariates. A possible explanation of this result has to do with the meaninglessness of Euclidean distance (and hence the RBF kernel) as a measure of similarity between data points in a high-dimensional space (aggarwal2001surprising), and the ability of the learned deep kernel to map the data to a low-dimensional latent space where distances are meaningful.
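The distance-concentration phenomenon cited above (aggarwal2001surprising) is easy to demonstrate: as dimensionality grows, the gap between the nearest and farthest neighbour shrinks relative to the nearest distance, so an RBF kernel on raw high-dimensional inputs discriminates poorly between points. The sample sizes and dimensions below are arbitrary choices for illustration.

```python
# Sketch: relative contrast (d_max - d_min) / d_min between one point
# and the rest of a Gaussian sample collapses as dimension grows,
# which is why an RBF kernel on raw high-dimensional inputs becomes
# uninformative.
import numpy as np

rng = np.random.default_rng(2)

def relative_contrast(dim, n=200):
    X = rng.normal(size=(n, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)     # distances to one point
    return (d.max() - d.min()) / d.min()

low_dim, high_dim = relative_contrast(2), relative_contrast(1000)
assert high_dim < low_dim                        # contrast collapses
```

Mapping the data to a low-dimensional latent space first, as the learned deep kernel does, restores meaningful distances before the RBF kernel is applied.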
6. Summary
We have presented LDKGPR, a novel longitudinal deep kernel Gaussian process regression model that overcomes the key limitations of existing approaches to predictive modeling from longitudinal data. Specifically, LDKGPR eliminates the need for trial and error or ad hoc heuristics in choosing a kernel function by using deep kernel learning, which combines the advantages of modern deep neural networks (DNN) with the non-parametric flexibility of kernel methods, to automate the discovery of the rich correlation structure in longitudinal data. LDKGPR adopts a multilevel model to account for the time-invariant individual-specific random effects and the time-varying fixed effects. We have shown how LDKGPR can be efficiently trained using a variant of the stochastic variational method. We report results of extensive experiments using both simulated and real-world benchmark longitudinal data sets that demonstrate the superior performance of LDKGPR over state-of-the-art LDA methods, not only in terms of regression accuracy but also in scalability. A case study with a real-world data set shows how the approach can be used to obtain useful insights from longitudinal data.