Longitudinal Deep Kernel Gaussian Process Regression

05/24/2020 ∙ by Junjie Liang, et al. ∙ Penn State University 26

We consider the problem of learning predictive models from longitudinal data, consisting of irregularly repeated, sparse observations from a set of individuals over time. Effective approaches to this problem have to account for the complex multi-level correlation structure in the data. Gaussian process models offer an attractive framework for longitudinal data analysis (LDA) because they require fewer assumptions about the underlying data distribution compared to parametric models and model complex correlation structure using a composition of kernels. However, such methods have two key shortcomings: (i) The kernel often relies on ad hoc heuristics or a tedious process of trial and error; and (ii) The methods do not scale with increasing number of individuals, observations per individual, or the number of covariates. We present L-DKGPR, an effective and scalable longitudinal deep kernel Gaussian process regression model that overcomes the key limitations of existing GP based approaches to predictive modeling from longitudinal data. Specifically, L-DKGPR eliminates the need for trial and error or ad hoc heuristics in choosing a kernel function using a deep kernel learning technique which combines the advantages of modern deep neural networks (DNN) with the non-parametric flexibility of kernel methods, to automate the discovery of the rich correlation structure from the data. L-DKGPR adopts a multilevel model to account for the time-invariant individual-specific random effects and the time-varying fixed effects. We show how L-DKGPR can be efficiently trained using a variant of the stochastic variational method. We report the results of extensive experiments using both simulated and real-world benchmark longitudinal data sets that demonstrate the superior performance of L-DKGPR over the state-of-the-art methods.



There are no comments yet.


page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. introduction

Longitudinal studies, which involve repeated observations of a chosen set of variables, taken at not necessarily regularly spaced time points, for the same set of individuals, over a period of time, are quite ubiquitous in many fields including health sciences, learning sciences, social and economic sciences. Longitudinal studies are effective in identifying the time-varying as well as the time-invariant risk factors associated to a particular outcome of interest, e.g., market crash, disease onset (hedeker2006longitudinal). Longitudinal data typically exhibit longitudinal correlation (LC), i.e., correlations among the repeated observations of a given individual over time; and cluster correlation (CC), i.e., correlations among observations across individuals, e.g., due to the characteristics that they share among themselves e.g., age, demographics factors; or both, i.e., multilevel correlation (MC). In practice, the specific multilevel correlation structure in the data can be quite complex, and unknown a priori. In predictive modeling from longitudinal data, failure to account for such multi-level correlations can lead to misleading statistical inferences (gibbons1997random; liang2019lmlfm). As we shall see below, it can be non-trivial to choose a suitable correlation structure that reflects the correlations present in the data. The relationships between the covariates and outcomes of interest can be highly complex and non-linear. Furthermore, modern LDA applications often call for methods that scale gracefully with increasing number of variables, the number of individuals, and the number of longitudinal observations per individual.

1.1. Related Work

Conventional LDA Methods Predictive modeling for longitudinal data has been extensively studied for decades (hedeker2006longitudinal; verbeke2014analysis)

. Conventional longitudinal data analysis methods can be grouped into to two broad categories: (i) marginal models and (ii) conditional models. Marginal models rely on assumptions about the marginal association among the observed outcomes. A typical example of marginal models is the generalized estimating equations (GEE)

(liang1986longitudinal), where a working correlation matrix of a specified form (independent, exchangeable, auto-regressive, unstructured, etc.) is specified to model the marginal association among the observed outcomes. The parameters of marginal models are often shared among by all individuals in the population, yielding population-averaged effects or fixed effects. Conditional models on the other hand avoid the direct specification of the full correlation matrix by distinguishing between fixed effects of the sort described above and random effects, i.e., parameters that differ across individuals, so as to solve the parameters for each individual conditioned on the parameters of all other individuals. A popular example of conditional models is the generalized linear mixed effects models (GLMM) (mcculloch1997maximum). Hence, the correlations of random effects across individuals specifies the correlations between their observed outcomes e.g., health status. While both marginal and conditional models continue to be of interest (fitzmaurice2012applied; wang2014generalized; xiong2019mixed; liang2019lmlfm), many of the challenges, especially the choice of correlation structure, and the selection of the model parameters to be associated with random as opposed to fixed effects and the scalability issues remain to be addressed.

Non-parametric LDA Methods.Non-parametric models, e.g., Gaussian processes (quintana2016bayesian; cheng2019additive)

offer an attractive approach to predictive modeling from longitudinal data for several reasons: (i) GP make fewer assumptions about the underlying data distribution compared to their parametric counterparts; (ii) Being non-parametric, GP circumvent the need to choose a particular parametric form of the nonlinear predictive model. (iii) GP leverages a parametertized kernel function that uses the proximity between observations as a proxy for the similarity of observed outcomes, while also providing smoothing and interpolation needed for coping with data that are sampled at irregularly spaced time points; (iv) GP can be made modular and interpretable by using a kernel function that is composed of multiple simpler kernel functions, typically for capturing the correlation structure within a subset of covariates that share similar covariate structure. Such models can also flexibly account for both longitudinal and cluster correlations in the data. For example, Quintana

et al. (quintana2016bayesian) define a Dirichlet Process prior on the kernel parameters, which enables a clustering structure across individuals; and employ a tri-diagonal auto-regressive kernel to approximate the longitudinal correlation,thereby substantially speeding up the computation of the inverse of the correlation matrix during model inference. Cheng et al. (cheng2019additive) utilize an additive kernel for Gaussian data and employ a step-wise search strategy to select the kernel components and covariates that optimize the predictive accuracy of the model. Timonen et al. (timonen2019interpretable) consider a heterogeneous kernel to model individual-specific (random) effects in the case of non-Gaussian data.

Limitations of GP-based LDA Methods Despite these advantages, existing GP based approaches to LDA suffer from several shortcomings that limit their applicability in real-world settings: (i) The choice of an appropriate kernel often involves a tedious, often expensive and unreliable trial and error (rasmussen2003gaussian) or heuristics guided (cheng2019additive) identification of a kernel or a combination of several kernels from a pool of candidate kernels. (ii) Existing GP based approaches to predictive modeling from longitudinal data do not scale up to longitudinal data sets with thousands of covariates and millions of individuals that are not uncommon in many modern applications.

1.2. Overview of Contributions

A key challenge in predictive modeling of longitudinal data has to do with modeling the complex correlation structure in the data. We posit that the observed correlation structure is induced by the interactions between time-invariant, individual-specific effects, and time varying population effects. Hence, we can divide the task of predictive modeling from longitudinal data into three sub-tasks: (i) Given an observed data set, how do we estimate the time-varying and time-invariant effects? (ii) Given the learned effects, how do we estimate the correlation structure present in the data? (iii) Given the correlation structure, how do we predict as yet unobserved, e.g., future outcomes?

We present L-DKGPR, an effective and scalable deep kernel Gaussian process for regression that overcomes the key limitations of existing GP based approaches to predictive modeling from longitudinal data. L-DKGPR inherits the attractive features of GP (summarized above) while overcoming their key limitations. Specifically, L-DKGPR eliminates the need for trial and error or ad hoc heuristics in choosing a kernel function using a deep kernel learning technique (wilson2016deep) which combines the advantages of modern deep neural networks (DNN) with the non-parametric flexibility of kernel methods, to automate the discovery of the rich correlation structure from the data. L-DKGPR adopts a multilevel model that is conceptually similar to GLMM, to model the time-invariant individual-specific random effects and the time-varying fixed effects. We introduce a variant of the stochastic variational method (wilson2016stochastic) for efficient and scalable learning of L-DKGPR in a regression setting.111Though our work focuses primarily on continuous outcome, it is fairly easy to extend our work to other outcome types such as binary and count outcomes. See Section §3.3 for details. We report results of extensive experiments using both simulated and real-world benchmark longitudinal data sets that demonstrate the superior performance of L-DKGPR over the popular GLMM, GEE baselines as well as several state-of-the-art non-linear LDA methods (liang2019lmlfm; timonen2019interpretable).

2. Preliminaries

2.1. Longitudinal Data

We denote a longitudinal data set by , where is an covariate matrix and

is the vector of measured outcomes. We denote a row in

by , with indexing the individual and the time for the observation respectively. Because the observations for each individual are irregularly sampled over time, we have for each individual , a submatrix with dimension , where is the number of observations available for the individual . If we denote by be the number of individuals in , the covariate matrix is given by . Accordingly, the outcomes are given by .

2.2. Gaussian Processes

A Gaussian process (GP) is a stochastic process, i.e., a collection of random variables indexed by time and/or space where every finite collection of the random variables has a multivariate normal distribution


. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and hence a distribution over functions with a continuous domain, e.g. time or space. A kernel (or covariance function) describes the covariance of the Gaussian process random variables. Together with the mean function the kernel completely defines a Gaussian process. More precisely, if a function

has a GP prior where is the mean function and is a (positive semi-definite) kernel function parameterized by , then any finite collection of components of (denoted as f

) has a multivariate Gaussian distribution


where is the mean vector, and is the covariance matrix. In GP regression model, function is treated as an unobserved signal that is linked to the outcomes through a likelihood function, which is typically Gaussian, such that . Given as the covariate matrix for the test data points, the predictive distribution is computed as


where is the covariance matrix between and , which can be computed using the same kernel function .

The radial basis function (RBF) is a popular choice as a kernel function

(williams2006gaussian). For a given pair of data points , the RBF kernel is given by:


with kernel parameters .

2.3. Additive Gaussian Processes

Additive GP extends GP by decomposing the unobserved signal into a summation of independent signal components, i.e., , where are the the coefficients associated with the individual components. In practice, each signal component is computed on a subset of the observed covariate vector . For example, in (cheng2019additive; timonen2019interpretable), the signal component is a function of only one or two covariates. Since every signal component has a GP prior, the joint signal is guaranteed to be a GP, such that Eq. (1) is rewritten as


Additive GP offers several additional advantages over the original GP: (i) It allows the choice of different kernel functions for each signal component; (ii) Each signal component typically encodes information from a small set of homogeneous covariates, yielding a more interpretable model; (iii) More importantly, it permits the modeling time-varying and time-invariant effects using different kernel functions, which is especially attractive in modeling longitudinal data.

While it is possible to learn the kernel parameters of both GP and Additive GP using either MLE framework (williams2006gaussian) or Bayesian approaches (cheng2019additive; timonen2019interpretable), both require an iterative algorithm that requires inverting an by matrix at each iteration, yielding time complexity and space complexity per iteration in the training phase. The prohibitive computational complexity of the model limits the applicability of this approach in real-world settings with large numbers of observations. Moreover, existing approaches for predictive modeling of longitudinal data suffer from lack of clear guidance for choosing an optimal set of kernels that are best suited to analyses of the data at hand. Hence, current methods for choosing kernels (cheng2019additive; timonen2019interpretable) tend to be heuristic in nature, e.g., based on the data types of the covariates and is fixed throughout the subsequent analysis, which leads to sub-optimal correlation structure. Such practice fails to capitalize the expressive power of GP, thus leading to a sub-optimal correlation structure.

3. Longitudinal Deep Kernel Gaussian Process Regression

We proceed to describe L-DKGPR, an additive GP with a learnable deep kernel for regression with longitudinal data before introducing an efficient algorithm for predictive modeling from longitudinal data.222Note that although our model is designed for regression problem with Gaussian outcomes, it is straightforward to extend it for binary or count outcomes by replacing the Gaussian likelihood with Logistic or Poisson/Negative Binomial likelihood.

3.1. Modeling the Correlation Structure

Recall that longitudinal data exhibit complex correlations arising from the interaction between time-varying effects and time-invariant effects. Hence, we decompose the signal function into two components, i.e.,  which models the time-varying effects and , which models the time-invariant effects. The resulting probabilistic model is given by:


We denote the kernel parameters (to be learned from data) for time-varying effects and time-invariant effects respectively by and (see below for details). The mean functions can be estimated from data, or if known, fixed to their known values. In this study, following (williams2006gaussian; wilson2016deep; wilson2016stochastic; cheng2019additive; timonen2019interpretable), we set . Assuming that and are conditionally independent given , we can express the distribution of as follows:


where .

(a) Time-varying kernel
(b) Time-invariant kernel
Figure 1. Structure of the time-varying and time-invariant kernels

3.1.1. Time-varying Kernel

Time-varying kernel is designed to capture the longitudinal correlation in the data. The structure of our time-varying kernel is shown in Figure 1(a). Let be a non-linear mapping function given by a deep architecture parameterized by . Given the covariate vectors of a pair of data points , where index the individuals and index the observations, the time-varying kernel is given by:


Note that RBF kernel is based on Euclidean distance, which is not a useful measure of distance in the high dimensional input space (aggarwal2001surprising). Hence, we use a deep neural network (goodfellow2016deep), specifically, a nonlinear encoder to map the input space to a low-dimensional latent space and then apply the RBF kernel to the latent space.

3.1.2. Time-invariant Kernel

We encode the individual-specific characteristics that are invariant with respect to time, using the time-invariant kernel , whose structure is shown in Figure 1(b). Let be function that identifies the individuals, and be an embedding function that maps each individual to a vector in the latent space. Then for any pair of data points with arbitrary observation indices, the time-invariant kernel is given by:


3.2. Learning L-DKGPR from data

We draw on the large body of existing work on efficient approximation techniques to ensure the scalability of L-DKGPR to large-scale problems. Specifically, we draw on (wilson2016stochastic), which combines the inducing points with variational inference. The idea of inducing points is to reduce the effective number of input data points in going down from to (), where is the number of inducing points, thus greatly simplifying the computation of the GP posterior (see Section §3.3 for details). Let be the collection of inducing points, and u their corresponding signal function. Instead of identifying the inducing points in the input space, we construct inducing points that lie in the latent space, such that . In addition, we define to distinguish the inducing points from the actual data points. We can now express the joint signal distribution as follows:


Therefore, the conditional signal distribution is


Let be the model parameters, we seek to maximize the log of marginal likelihood . By assuming a variational posterior over the joint signals , we have:


Eq (3.2) provides the variational evidence lower bound (ELBO). We define the proposal posterior . We follow (wilson2016stochastic), to iteratively update the variational parameters and using stochastic gradient ascent, where a noisy approximation of the gradient of the ELBO is computed on minibatches of the training data. Specifically, with samples of u and a minibatch of data points , the ELBO is estimated as


Note that since the KL term in Eq. (15) is constructed on two multivariate Gaussian distributions, it can be computed in closed form without the need for Monte Carlo estimation. Specifically, we have:


To estimate the Monte Carlo likelihood, we need to first obtain u from and then sample f from . Both these tasks can be completed using Cholesky decomposition and reparameterization. For example, to sample u, we first let , then sample u is computed using with . The sampling cost is . Using the local kernel interpolation trick on f and Kronecker decomposition on (wilson2016stochastic), one can further reduce the sampling cost to , where . To perform gradient ascent, we could compute the derivatives w.r.t. the model parameters

using the chain rule and matrix derivatives


. We omit the detailed derivation of the parameter updates using back propagation since they can be realized using modern toolboxes that support automatic differentiation, e.g., PyTorch


3.3. Prediction

A common approximation assumption associated with the inducing points idea is that the signals between training data and test data are conditionally independent given u (quinonero2005unifying). Hence,


This is particularly useful during the test phase. Given the covariate matrix for the test data, the prediction distribution is given by:


In the case of Gaussian likelihood, the closed form solution is given by:


In general, the expectation in Eq. (3.3) may not have a closed form solution. In such cases, we can employ Monte Carlo sampling to estimate the mean and variance of the prediction distribution.

4. Experiments

We proceed to describe the results of experiments on simulated as well as real-world benchmark data to compare L-DKGPR with several state-of-the-art longitudinal models. The experiments are designed to answer the research questions regarding three aspects of L-DKGPR: prediction accuracy, scalability and interpretability: (RQ1) How does the performance of L-DKGPR compare with the state-of-the-art methods on standard longitudinal regression tasks? (RQ2) How does the scalability of L-DKGPR compare with that of the state-of-the-art longitudinal regression models? (RQ3) Can L-DKGPR reliably recover the rich correlation structure from the data? (RQ4) How do the different components of L-DKGPR contribute to the overall performance of L-DKGPR?

4.1. Data Sets

We used one simulated data set and three real-world longitudinal data sets in our experiments:

Simulated data. We construct simulated longitudinal data sets that exhibit i.e., longitudinal correlation (LC) and multilevel correlation (MC) as follows: The outcome is generated using where

is a non-linear transformation based on the observed covariate matrix

and the residual . To simulate longitudinal correlation, we simply set to a block diagonal matrix. For each individual, we use a first-order auto-regressive correlation structure () with decaying factor fixed at . To simulate a data set that exhibits multilevel correlation, we first split the individuals into clusters. We then define the cluster correlation matrix by setting the correlation associated to data points in the same cluster to . Finally, we compute the multilevel correlation by summing up the longitudinal correlation and cluster correlation. Following (cheng2019additive; timonen2019interpretable), we simulate individuals, observations, and covariates for each individual. To simulate correlation among the covariates, we first generate base features independently from uniform distribution, then is computed using an encoder network with architecture . The network architecture for is . We vary from .

Study of Women’s Health Across the Nation (SWAN) (sutton2005sex). SWAN is a multi-site longitudinal study designed to examine the health of women during the midlife years. We consider the task of predicting the CESD score, which is used for screening for depression. Similar to (liang2019lmlfm), we define the adjusted CESD score by , thus indicates depression. The variables of interest include aspects of physical and mental health, and demographic factors such as race and income. The resulting data set has 3,300 individuals, 137 variables and 28,405 records.

General Social Survey (GSS) (smith2017general). The GSS data set includes data on contemporary American society collected with the goal of understanding and explaining trends and constants in attitudes, behaviors, and attributes. It records responses to survey questions about demography, behavior, and attitudes that elicit information about how Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions, since 1972. In our experiment, we consider the task of predicting the self-reported general happiness of 4,510 individuals using 1,553 features and 59,599 records. The task is identical to that considered in (liang2019lmlfm), where indicates happy and indicates the opposite.

The Alzheimer’s Disease Prediction Of Longitudinal Evolution (TADPOLE) (marinescu2018tadpole). TADPOLE includes data from individuals who have provided in the Alzheimer’s disease neuroimaging initiative (ADNI) studies ADNI 1 and ADNI 2 who have agreed to participate in ADNI 3. The TADPOLE challenge aims at predicting the symptoms related to AD in the short to medium term (1-5 years) of a group of people who are considered to be at risk of AD. In our experiment, we focus on predicting the ADAS-Cog13 score using the demographic features and MRI measures (Hippocampus, Fusiform, WholeBrain, Entorhinal and MidTemp). The resulting data set has 1,681 individuals, 24 variables and 8,771 records.

4.2. Experimental Setup

We proceed to describe our experimental setup designed to answer the research questions RQ1-RQ4.

To answer RQ1, we use both simulated data and real-world data. To evaluate the regression performance, we compute the mean and standard deviation of

between the actual (ground truth) and predicted outcomes of each method on each data set across 10 independent runs. All methods are trained on the training set, hyper-parameters are tuned on the validation set and their performance is evaluated on the test set. We use 50%, 20%, 30% of data for training, validation and testing respectively.

To answer RQ2, we take data from a subset consisting of individuals with the largest number of observations from each real-world data. We record the run time per iteration of each method on both the -individual subset and full data set. Because not all baseline methods implement GPU acceleration, we compare the run times of all the methods without GPU acceleration. We report execution failure if a method fails to converge within 48 hours or generates an execution error (liang2019lmlfm).

To answer RQ3, we rely mainly on the simulated data since the actual correlation structures underlying the real world data sets are not known. We evaluate the performance of each method by visualizing the learned correlation matrix. Additionally, we provide a case study of real-world data by comparing and interpreting the correlation matrix recovered by the different methods.

To answer RQ4, we compare the performance of L-DKGPR with L-RBF-GPR, a variant that replaces the learned deep kernel with a simple RBF kernel; and L-DKGPR-, a variant of L-DKGPR without the time-invariant effects. Though we could achieve a variant of L-DKGPR without the time-varying effects, but in that case we ignore the observed covaraites by taking as input only the individual indexes, thus giving the same predictions for the same individual regardless of time. This design is unrealistic in LDA applications. Therefore, we do not compare L-DKGPR to this variant.

All experiments are conducted on a desktop machine with Intel Core i7-7700K CPU, 16GB RAM and GTX 1060 6GB graphics card.

4.3. Baseline Methods

We compare L-DKGPR with the conventional as well as the state-of-the-art methods for predictive modeling from longitudinal data:

  • GLMM (bates2014fitting), a conventional multilevel mixed-effect model for longitudinal data. GLMM accounts for data correlations by specifying variables associated with random effects and fixed effects. Because in the general setting, we do not know the exact set of variables that are subject to fixed effects as opposed to random effects, we assume that all variables are subject to random effects conditioned on the individual.

  • GEE (inan2017pgee), a conventional longitudinal model that is based on generalized estimation equation. GEE accommodates data correlations by factorizing the covariance matrix into a product of a diagonal variance matrix and a working parameterized correlation matrix. In our experiment, we use the first-order auto-regressive correlation as it delivers the best performance among all choices.

  • LMLFM (liang2019lmlfm), a state-of-the-art multilevel mixed-effect model based on factorization machines. In contrast to GLMM that requires expert input to specify the variables associated with random effects and fixed effects, LMLFM automatically distinguishes random effects from fixed effects and selects a subset of predictive variables by optimizing a suitably specified objective function.

  • LGPR (timonen2019interpretable), a state-of-the-art Bayesian longitudinal model based on addictive GP. LGPR first fits each GP component using a chosen kernel. The final signal function is achieved by a weighted sum of all GP components. In our experiments, zero-sum kernel is used for categorical covariates and heterogeneous kernel is used for the continuous covariates. Unlike other baselines, LGPR is solved using MCMC, which generally requires a large number of iterations to reach to the stationary distribution. In our experiments, the number of MCMC sampling iterations of LGPR is set to 2000.

4.4. Implementation Details

We implement L-DKGPR using PyTorch. We formulate using a deep neural network (DNN) consisting of multiple fully connected layers. Specifically, the structure of is

with leaky ReLU activation

(xu2015empirical) between layers. The latent dimension is chosen from the range based on cross-validation performance on the validation set. Note that the implementation is flexible enough to allow more advanced DNN structure such as CNN and RNN. The embedding function is a -by- parameter matrix. We set . We use a diagonal matrix for the variational posterior, such that . All the model parameters, i.e., , are updated iteratively using Adam optimizer. The learning rate for and the remaining parameters are chosen from and respectively, based on cross-validation performance on the validation set. The training and testing batch sizes are set to

. The maximum training epoch of L-DKGPR is set to 300 for all data sets.

We use the implementations of GLMM, GEE and LGPR available in the lmer4, PGEE and lgpr packages, respectively from CRAN.333https://cran.r-project.org/. We use the LMLFM implementation from https://github.com/junjieliang672/LMLFM. The implementation of L-DKGPR and the data used in our experiments will be made publicly available upon acceptance of the manuscript.

5. Results

We proceed to describe the results of our experiments designed to answer the research questions RQ1-RQ4.

Figure 2. Outcome correlation estimated by all methods on simulated data.

5.1. L-DKGPR vs. the state-of-the-art

LC MC () MC () MC () MC ()
L-DKGPR 74.43.5 90.515.5 97.42.1 94.92.4 99.40.4
LGPR -37.119.1 -123.6162.0 -26.343.2 -9.114.8 -0.15.9
LMLFM 54.715.1 -138.3121.9 -48.3123.6 22.649.0 36.241.1
GLMM 5.327.9 -656.3719.8 -801.4507.4 -684.1491.3 -528.7313.5
GEE 59.024.5 -636.1606.0 -703.6465.8 -665.6554.3 -516.5457.5
Table 1. Regression accuracy comparison on simulated data with different correlation structures

RQ1: How does the performance of L-DKGPR compare with the baselines on standard longitudinal regression tasks?

The results of our experiments that evaluate the various models based on their estimated regression accuracy are reported in Table 1 and Table 2 for simulated and real-world data sets respectively.

In the case of simulated data, we find that GEE and GLMM fail in the presence of multi-level correlations (MC) in the data with the mean being negative. The results can be explained by the fact that GEE is designed only to handle pure LC, thus fails to account for CC. While GLMM is capable of handling MC (and hence CC), it requires practitioners to specify the cluster structure responsible for CC prior to model fitting. However, in experiments, cluster structure is assumed unknown. When cluster structure is unspecified not surprisingly, the performance of GLMM is degraded. Moreover, we find that although LMLFM outperforms GLMM and GEE in the presence of MC, its is still quite low. This is because LMLFM accounts for only a special case of MC, namely, for CC among individual observed at the same time points, and not all of the CC presented in the data.

We note that the performance of both LMLFM and GEE tends vary across data sets whereas GLMM fails on data sets that include data from a large population or measurements over a large number of variables as well as data that exhibit MC but the cluster structure is unknown.

We find that LGPR has relatively poor performance on both simulated and real-world data. This might due to the fact that LGPR treats the variables independently by building kernel component on each variables before taking their weighted sum. Though it is possible to incorporate higher order interactions between variables into LGPR, doing so requires results in large numbers of interaction parameters to be estimated which presents challenges, especially when working with data from small populations.

On the simulated data, we see that L-DKGPR consistently and significantly outperforms the baselines by a large margin. On the real-world data sets, we find that L-DKGPR outperforms the longitudinal baselines in a vast majority of the cases (except when the number of data samples is small).

While we see that that all methods benefit from larger training data sets, if they can process the data. L-DKGPR, since it relies on learning a deep kernel, can be expected to perform better when it is trained on large amounts of data.

We believe that two factors contribute to the overall superior performance of L-DKGPR. First, because L-DKGPR uses a learned deep kernel, it is capable of accommodating any nonlinear relationships that might exist between the covariates and the outcomes. Second, L-DKGPR does not rely on specific assumptions on how the data correlation is structured. It flexibly learns the correlation from the data.

5.2. Scalability of L-DKGPR

RQ2: How does the scalability of L-DKGPR compare with that of the state-of-the-art longitudinal regression models? The CPU run times and failure to complete execution on the real-world data sets are reported in Table 2. We see that LGPR, GLMM and GEE are exceptionally sensitive to the number of variables. Indeed, their computational complexity increases proportional to where is the number of variables. In contrast, L-DKGPR and LMLFM scale gracefully with increase in the number of individuals as well as the number of variables measured per individual in the data set.

Data sets (%) Runtime/iteration (sec.)
TADPOLE 595 50 24 44.05.6 -261.19.0 8.75.1 50.85.5 -11.44.8 0.03 6.39 0.01 0.01 0.13
SWAN 550 50 137 46.84.9 -16.612.7 38.64.2 40.17.7 46.48.0 0.03 26.1 0.02 0.06 0.59
GSS 1,500 50 1,553 19.13.7 N/A 15.31.4 N/A -4.63.5 0.12 N/A 0.30 N/A 30.1

8,771 1,681 24 64.91.4 N/A 10.40.6 61.91.9 17.60.7 1.48 N/A 0.25 0.03 4.66
SWAN 28,405 3,300 137 52.50.4 N/A 48.62.0 N/A N/A 4.48 N/A 1.74 N/A N/A
GSS 59,599 4,510 1,553 56.90.1 N/A 54.82.2 N/A N/A 5.31 N/A 24.35 N/A N/A
Table 2. Regression accuracy and run time comparison on real-world data sets. We use ‘N/A’ to denote execution error.

5.3. Recovery of Correlation Structure

RQ3: Can L-DKGPR reliably recover the rich correlation structure from the data?

5.3.1. Correlation Structure from Simulated Data

The outcome correlations estimated by all methods on the simulated data are shown in Figure 2. It is easy to see that LMLFM, GLMM and GEE are incapable of recovering MC. Though LGPR can, model data which exhibit MC, the resulting correlation is overly simple. In the presence of MC with , we see that only one known cluster is correctly recovered. We conjecture that the inferior performance of LGPR is due to the limited expressive power of the kernel functions it used. We note that L-GKDPR is able to recover most of the correlation structure present in the data. We further note that L-DKGPR, despite being the best performer among the methods compared in this study, it tends to underestimate the number of clusters because the full data correlation is approximated by a low-rank matrix (see Eq. (3.3)) resulting in information loss.

5.3.2. Correlation Structure in Real-World Data

We show how the correlation structure discovered by L-DKGPR can offer useful insights into longitudinal data gathered in real-world settings. Because of space constraints, we only analyze the correlation structure uncovered by L-DKGPR from the SWAN data.

(a) Cluster Correlation
(b) Score Density
Figure 3. Cluster correlation analysis on SWAN Data. (a) We find that individuals can be roughly split into two clusters; (b) Adjusted CESD score density for the two clusters.
(a) Individual #1
(b) Individual #2
Figure 4. Case study for two selected individuals from cluster #1. (Top figures) Observed and predicted trajectories for the adjusted CESD scores; (Bottom figures) Longitudinal correlation among the predictions.
Data sets
TADPOLE 24 64.91.4 13.21.1 55.52.4
SWAN 137 52.50.4 29.03.2 5.41.6
GSS 1,553 56.90.1 56.20.1 -140.4
Table 3. Effect on the regression accuracy of different components of L-DKGPR

Figure 3(a) displays the time-invariant cluster correlation after fitting the model to the full SWAN data. We find that the individuals can be roughly grouped into two clusters of which the first shows a clear cluster structure. The distributions of the adjusted CESD score of the two clusters are presented in Figure 3(b). We find that individuals assigned to cluster #1 tend to have lower risk of depression whereas those assigned to cluster #2 tend to have higher risk of depression. Closer examination of individuals assigned to the clusters reveals potentially useful suggestions for further analysis. In our case study, we can identify at least two individuals stand out for further in-depth investigation (See Figure 4). Individual #1: Judging from the longitudinal correlation as shown in Figure 4(a), we detect a clear transition from non-depression to depression starting around age . Comparison of the covariates between age and reveals potential reasons for the transition e.g., having family and financial issues at that were not present before age 52. In addition, while we have two observations at the age of

, the first observation tends to be uncorrelated with the others, and hence more likely to be an outlier. Another interesting finding is that, although our model detects a potential transition from non-depression to depression, the predictions are consistently below the threshold, making the individual non-depressive overall. This could further imply that the depression symptom associated with individual #1 is on mild and could be temporary. Perhaps mental health services could help individual #1 to successfully get through what appears to be a temporary depression, likely triggered by family and financial issues.

Individual #2: In the case of individual #2 shown in Figure 4(b) we find a transition from non-depression to depression around age . The transition is perhaps explained by fact that individual #2 is going through a change of menopausal status between age of and . It is worth noting that a sudden rise in CESD score is observed at the age of (the dissident point), which, surprisingly, is consistently ignored by our model. To understand why, we search for clues by comparing the covariates between age of and . Results turn out that no clear evidence are found to support the sudden depression. Therefore, we conjecture that the observed adjusted CESD score at is likely unreliable and a more careful examination might have been warranted. In summary, the correlation structure revealed by L-DKGPR offers useful insights into longitudinal data.

5.4. Deconstructing L-DKGPR

RQ4: How do the different components of L-DKGPR contribute to the overall performance of L-DKGPR?

Regression accuracy comparison carried out on complete real-world data sets is shown in Table 3. L-DKGPR vs. L-DKGPR-: We see dramatic drop on regression performance when time-invariant effects are not included. The average across three data sets drops from to . The results imply that the time-independent component of LC and CC modeled by the time-invariant effects are essential for accurate modeling of longitudinal data. The decomposition of the the task of correlation estimation into the time- variant and time-invariant components simplifies the task. The correlation estimated through time-invariant component is analogous to estimating the mean correlation and that estimated using the time-varying effects contributes to the residual correlation. The two phase process helps to reduce the variance of the estimated correlations. The time-invariant components are estimated using latent factors, which is helpful when there are unobserved covariates in the data. L-DKGPR vs. L-RBF-GPR: We see that L-DKGPR consistently outperforms L-RBF-GPR. The performance gap between L-DKGPR and L-RBF-GPR increases with the number of covariates. A possible explanation of this result has to do with the meaninglessness of Euclidean distance (and hence RBF kernel) as a measure of similarity between data points in a high dimensional space (aggarwal2001surprising); and the ability of the learned deep kernel to map the data to a low-dimensional latent space where distances are meaningful.

6. Summary

We have presented L-DKGPR, a novel longitudinal deep kernel Gaussian process regression model that overcomes the key limitations of existing approaches to predictive modeling from longitudinal data. Specifically, L-DKGPR eliminates the need for trial and error or ad hoc heuristics in choosing a kernel function using a deep kernel learning which combines the advantages of modern deep neural networks (DNN) with the non-parametric flexibility of kernel methods, to automate the discovery of the rich correlation structure from longitudinal data. L-DKGPR adopts a multilevel model to account for the time-invariant individual-specific random effects and the time-varying fixed effects. We have shown how L-DKGPR can be efficiently trained using a variant of the stochastic variational method. We report results of extensive experiments using both simulated and real-world benchmark longitudinal data sets that demonstrate the superior performance of L-DKGPR over the state-of-the-art LDA methods, not only in terms of accuracy of regression, but also scalability of the model. A case study with a real-world data set shows the how the approach can be used to obtain useful insights from longitudinal data.