Scalable mixed-domain Gaussian processes

11/03/2021
by   Juho Timonen, et al.

Gaussian process (GP) models that combine both categorical and continuous input variables have found use e.g. in longitudinal data analysis and computer experiments. However, standard inference for these models has the typical cubic scaling, and common scalable approximation schemes for GPs cannot be applied since the covariance function is non-continuous. In this work, we derive a basis function approximation scheme for mixed-domain covariance functions, which scales linearly with respect to the number of observations and total number of basis functions. The proposed approach is naturally applicable to Bayesian GP regression with arbitrary observation models. We demonstrate the approach in a longitudinal data modelling context and show that it approximates the exact GP model accurately, requiring only a fraction of the runtime compared to fitting the corresponding exact model.


1 Introduction

Gaussian processes (GPs) offer a flexible nonparametric way of modeling unknown functions. While Gaussian process regression and classification are commonly used in problems where the domain of the unknown function is continuous, recent work has seen use of GP models also in mixed domains, where some of the input variables are categorical or discrete and some are continuous. Applications of mixed-domain GPs are found e.g. in Bayesian optimization (Garrido-Merchán and Hernández-Lobato, 2020), computer experiments (Zhang and Notz, 2015; Deng et al., 2017; Roustant et al., 2020; Wang et al., 2021) and longitudinal data analysis (Cheng et al., 2019; Timonen et al., 2021). For example in biomedical applications, the modeled function often depends on categorical covariates, such as treatment vs. no treatment, and accounting for such time-varying effects is essential. Since all commonly used kernel functions (i.e. covariance functions) are defined for either purely continuous or purely categorical input variables, kernels for mixed-domain GPs are typically obtained by combining continuous and categorical kernels through multiplication. Additional modeling flexibility can be obtained by summing the product kernels as has been done in the context of GP modeling for longitudinal data (Cheng et al., 2019; Timonen et al., 2021).

It is well known that exact GP regression has a theoretical complexity of O(n^3) and requires O(n^2) memory, where n is the number of observations. This poses a computational problem which in practice renders exact GP regression infeasible for large data sets. Various scalable approximation approaches for GPs have been proposed (see e.g. (Liu et al., 2020) for a review). However, many popular approaches, such as the inducing point (Snelson and Ghahramani, 2006; Titsias, 2009) and kernel interpolation (Wilson and Nickisch, 2015) methods, can only be applied directly if the kernel (i.e. covariance) function is continuous and differentiable. In addition, they typically require a Gaussian observation model, which is not appropriate for modeling for example discrete, categorical or ordinal response variables. See Section 3 for a review of previous methods.

In this work, we present a scalable approximation scheme for mixed-domain GPs and additive mixed-domain GPs, where the covariance structure depends on both continuous and categorical variables. We extend the Hilbert space reduced-rank approximation (Solin and Särkkä, 2019) to such additive mixed-domain GPs, making it applicable to e.g. the analysis of large longitudinal data sets. The approach

  • scales linearly with respect to the data set size n

  • allows a wide variety of different categorical kernels

  • allows product kernels that consist of any number of continuous and categorical kernels, as well as sums of such products

  • allows an arbitrary observation model and full Bayesian inference for the model hyperparameters

To our knowledge, there are no existing approaches that satisfy these conditions.

2 Gaussian Processes

2.1 Definition

A Gaussian process (GP) is a collection of random variables, any finite number of which has a multivariate normal distribution (Rasmussen and Williams, 2006). A function f is a GP,

f(x) ~ GP(m(x), k(x, x')),        (1)

with mean function m and kernel (or covariance) function k, if for any finite number of inputs x_1, ..., x_n, the vector of function values f = [f(x_1), ..., f(x_n)]^T follows a multivariate normal distribution with mean vector m = [m(x_1), ..., m(x_n)]^T and covariance matrix K with entries K_{il} = k(x_i, x_l). The mean function is commonly the constant zero function m(x) = 0, and we follow this convention throughout the paper. The kernel function encodes information about the covariance of function values at different points, and therefore crucially affects the properties of the model.

2.2 Bayesian GP regression

In GP regression, the conditional distribution of the response variable y given covariates x is modeled as some parametric distribution p(y | f(x), θ_obs), where θ_obs collects the possible parameters of the observation model. The function f has a zero-mean GP prior with a covariance function k that has hyperparameters θ_GP. We focus on Bayesian GP modeling, where in addition to the GP prior for f, we have a parameter prior distribution p(θ) for θ = {θ_obs, θ_GP}. Given observations D = {(x_i, y_i)}_{i=1}^n, our goal is to infer the posterior

p(f, θ | D) ∝ p(y | f, θ_obs) p(f | θ_GP) p(θ),        (2)

where f = [f(x_1), ..., f(x_n)]^T and y = [y_1, ..., y_n]^T. The part


p(f | θ_GP) p(θ)        (3)

is the prior and

p(y | f, θ_obs) = ∏_{i=1}^{n} p(y_i | f(x_i), θ_obs)        (4)

is the likelihood. This task often has to be done by sampling from the posterior using MCMC methods, which requires evaluating the right-hand side of Eq. 2 (and possibly its gradient) thousands of times. As the likelihood and parameter prior usually factorize over parameters and data points, they scale linearly and are not a bottleneck. Instead, computing the GP prior density

p(f | θ_GP) = N(f | 0, K) = (2π)^{-n/2} det(K)^{-1/2} exp(-(1/2) f^T K^{-1} f),        (5)

where the n × n matrix K has entries K_{il} = k(x_i, x_l | θ_GP), is a costly operation, as evaluating the (log) density of an n-dimensional multivariate normal distribution generally has O(n^3) complexity (see Suppl. Section 4). Furthermore, storing the matrix K takes O(n^2) memory.

An often exploited fact is that if the observation model (and therefore the likelihood) is Gaussian, f can be analytically marginalized and only the marginal posterior p(θ | D) needs to be sampled. This reduces the MCMC dimension by n and likely improves sampling efficiency, but one still needs to evaluate the marginal likelihood p(y | θ), which is again an n-dimensional multivariate Gaussian density. The O(n^3) complexity and O(n^2) memory requirements therefore remain. In this paper, we generally assume an arbitrary observation model, and defer the details of the Gaussian observation model to Suppl. Section 5.
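To make the cost concrete, the following standalone Python sketch (our own illustration, not the paper's implementation, which uses Stan) evaluates the exact zero-mean GP prior density of Eq. 5 for an example covariance function via a Cholesky factorization, which is the O(n^3) step.

```python
import numpy as np

def example_kernel(x, alpha=1.0, ell=1.0):
    """An example covariance matrix (exponentiated quadratic) for 1-D inputs x."""
    d = x[:, None] - x[None, :]
    return alpha**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_prior_logpdf(f, x, jitter=1e-8):
    """Log density of the zero-mean GP prior N(f | 0, K) from Eq. 5."""
    n = len(x)
    K = example_kernel(x) + jitter * np.eye(n)   # n x n matrix: O(n^2) memory
    L = np.linalg.cholesky(K)                    # the O(n^3) bottleneck
    z = np.linalg.solve(L, f)                    # so that z @ z = f^T K^{-1} f
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (z @ z) - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi)
```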

2.3 Additive GP regression

In additive GP regression, the modeled function consists of J additive components so that f(x) = Σ_{j=1}^{J} f^{(j)}(x), and each component has a GP prior

f^{(j)}(x) ~ GP(0, k^{(j)}(x, x')),        (6)

independently from the other components. This means that the total GP prior is f ~ GP(0, k) with

k(x, x') = Σ_{j=1}^{J} k^{(j)}(x, x').        (7)

Furthermore, for each j = 1, ..., J, the vector of component values

f^{(j)} = [f^{(j)}(x_1), ..., f^{(j)}(x_n)]^T ~ N(0, K^{(j)}),        (8)

where the matrix K^{(j)} is defined so that its elements are K^{(j)}_{il} = k^{(j)}(x_i, x_l). This means that the prior for f = Σ_{j=1}^{J} f^{(j)} is N(0, K), where K = Σ_{j=1}^{J} K^{(j)}.

Bayesian inference with MCMC for additive GP models requires sampling all f^{(j)}, meaning that adding one component increases the number of parameters by n (plus the possible additional kernel hyperparameters). Moreover, the multivariate normal prior (Eq. 5) needs to be evaluated for each component, adding to the computational burden. In the case of a Gaussian likelihood, adding more components does not add any multivariate normal density evaluations, as only p(y | θ) needs to be evaluated. Also the marginal posteriors of each f^{(j)} are analytically available (see Suppl. Section 5).

2.4 Mixed-domain kernels for longitudinal data

Longitudinal data is common in biomedical, psychological, social and other studies, and consists of multiple measurements of several subjects at multiple time points. In addition to time (often expressed as subject age), other continuous covariates can be measured. Moreover, in addition to subject id, other categorical covariates, such as treatment, sex or country, can be available. In the statistical methods literature, such data is commonly modeled using generalized linear mixed-effects models (Verbeke and Molenberghs, 2000). In recent work (Quintana et al., 2016; Cheng et al., 2019; Timonen et al., 2021), longitudinal data has been modeled using additive GPs, where, similarly to commonly used linear models, each component is a function of at most one categorical and one continuous variable. Each variable is assigned a one-dimensional base kernel, and for components that contain both a continuous and a categorical kernel, the component kernel is their product. As the total kernel is composed of the simpler categorical and continuous kernels through multiplication and addition, it has a mixed domain.

These models have the very beneficial property that the effects of individual covariates are interpretable. The marginal posterior distribution of each component can be studied to infer the marginal effect of the corresponding covariates. As an example, if k^{(j)} is just the exponentiated quadratic (EQ) kernel

k_EQ(x, x') = α² exp( -(x - x')² / (2ℓ²) )        (9)

and x is age, the component f^{(j)} can be interpreted as the shared effect of age. On the other hand, if k^{(j)} is the product kernel k_ZS(z, z') · k_EQ(x, x'), where

k_ZS(z, z') = 1 if z = z', and -1/(R-1) if z ≠ z',        (10)

is the zero-sum (ZS) kernel (Kaufman and Sain, 2010) for a categorical variable z that has R categories, then f^{(j)} can be interpreted as the category-specific effect of the continuous covariate x. This kernel also has the property that the effect sums to zero over the categories at all values of x (see Timonen et al. (2021) for a proof), which helps in separating the category effect from the shared effect, if a model contains both.
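As a quick illustration (our own sketch, not code from the paper), the ZS kernel matrix over R categories can be constructed and checked numerically: each row sums to zero and the matrix is positive semi-definite.

```python
import numpy as np

def zs_kernel_matrix(R):
    """Zero-sum (ZS) kernel of Eq. 10 over R categories, as an R x R matrix."""
    return np.where(np.eye(R, dtype=bool), 1.0, -1.0 / (R - 1))

K_zs = zs_kernel_matrix(4)
print(K_zs.sum(axis=1))          # rows sum to zero -> effects sum to zero over categories
print(np.linalg.eigvalsh(K_zs))  # eigenvalues 0 and R/(R-1) >= 0, so the kernel is valid (PSD)
```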

Sometimes it is necessary to mask effects that are present for only a subset of the individuals, for example the case individuals when the data also contains a control group. In kernel language, the effect of component f^{(j)} can be masked by multiplying k^{(j)} with a binary kernel that returns 0 if either z or z' takes a value in any of the masked categories, and 1 otherwise.

3 Related Research

GPs and categorical inputs

A suggested approach to handle GPs with categorical covariates is to use a one-hot encoding, which turns a variable with R categories into R binary variables, of which only one is on at a time, and then apply a continuous kernel to them. Garrido-Merchán and Hernández-Lobato (2020) highlight that the resulting covariance structure is problematic because it does not take into account that only one of the binary variables can be one at a time. This poorly motivated approach might have originated merely from the fact that common GP software has lacked support for categorical kernels. We find it more sensible to define kernels directly on the categorical covariates, as that way we can always impose the desired covariance structure.

Category-specific effects of a continuous covariate can also be achieved by assigning independent GPs to the different categories. This way we have only continuous kernel functions, and can possibly use scalable approaches designed for them. This limited approach, however, cannot define any additional covariance structure between the categories, such as the zero-sum constraint (Eq. 10). The ZS kernel is a special case of compound symmetry (CS), and for example Roustant et al. (2020) concluded that a CS covariance structure was more justified than using only independent GPs in their nuclear engineering application.

Chung et al. (2020) developed a deep mixed-effect GP model that facilitates individual-specific effects and scales with the number of individuals and the number of time points per individual. Zhang et al. (2020) handled categorical inputs by mapping them to a continuous latent space and then using a continuous kernel. While this approach can detect interesting covariance structures, it does not remove the need to perform statistical modeling with a predefined covariance structure as in Section 2.4. Another related non-parametric way to model group effects is to use hierarchical generalized additive models (Pedersen et al., 2019), as smoothing splines can be seen as a special case of GP regression (Kimeldorf and Wahba, 1970).

Scalable GP approximations

A number of approximation methods exist that reduce the complexity of GP regression to O(n m²), where m < n controls the accuracy of the approximation. Popular approaches rely on global sparse approximations (Quiñonero-Candela and Rasmussen, 2005) of the covariance matrix between all pairs of data points, using m inducing points. The locations of these inducing points are generally optimized using gradient-based continuous optimization simultaneously with the model hyperparameters, which cannot be done when the domain is not continuous. In Fortuin et al. (2021), the inducing-point approach was studied in purely discrete domains, and Cao et al. (2015) presented an optimization algorithm that alternates between discrete optimization of the inducing points and continuous optimization of the hyperparameters. Disadvantages of this method are that it cannot find inducing points outside of the training data, does not perform full Bayesian inference for the hyperparameters, and assumes a Gaussian observation model.

A Hilbert space basis function approach for reduced-rank GP approximation in continuous domains, on which this work builds, was proposed by Solin and Särkkä (2019). Its use in practice in the Bayesian setting was studied further by Riutort-Mayol et al. (2020).

4 Mixed-Domain Covariance Function Approximation

4.1 Basic idea

We continue with the notation established in Section 2, and note that x denotes a general input that can consist of both continuous and categorical dimensions. We consider approximations that decompose the GP kernel function as

k(x, x') ≈ Σ_{m=1}^{M} φ_m(x) φ_m(x'),        (11)

where the functions φ_1, ..., φ_M have to be designed so that the approximation is accurate but easy to compute. This is useful in GP regression, because we get a low-rank approximate decomposition K ≈ Φ Φ^T for the kernel matrix K, where Φ is the n × M matrix with elements Φ_{im} = φ_m(x_i). Using this approximation, we can write the approximate GP prior using parameters ξ ∈ R^M with independent standard normal priors, connected to f through the reparametrization f = Φ ξ, where ξ ~ N(0, I_M). Evaluating the prior density now has only O(M) cost. After obtaining posterior draws of ξ, we can obtain posterior draws of f with O(nM) cost, which comes from computing the matrix Φ and the product Φ ξ. The likelihood (Eq. 4) can then be evaluated one data point at a time, and the total complexity of the approach is only O(nM). Furthermore, the memory requirement is reduced from O(n²) to O(nM), since we only need to store Φ and never compute K. This is the approach used throughout this paper, and the focus is on how to design the functions φ_m for different kernel functions so that the approximation is accurate with M << n.
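The following Python sketch (ours, with hypothetical helper names) illustrates this reparametrized prior: for any feature map phi implementing Eq. 11, a draw from the approximate GP prior costs O(nM) and no n × n matrix is ever formed.

```python
import numpy as np

def feature_matrix(X, phi, M):
    """n x M matrix Phi with elements Phi[i, m] = phi_m(x_i).

    phi(x) is assumed to return the length-M vector [phi_1(x), ..., phi_M(x)]
    of some decomposition k(x, x') ≈ sum_m phi_m(x) phi_m(x').
    """
    Phi = np.empty((len(X), M))
    for i, x in enumerate(X):
        Phi[i] = phi(x)
    return Phi                                   # O(nM) time and memory

def draw_from_approximate_prior(Phi, rng):
    """One draw of f = Phi @ xi with xi ~ N(0, I_M), as in the reparametrization above."""
    xi = rng.standard_normal(Phi.shape[1])       # evaluating p(xi) would cost only O(M)
    return Phi @ xi                              # O(nM); the n x n matrix K is never built
```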

4.2 Continuous isotropic covariance functions

A continuous stationary covariance function depends only on the difference r = x - x' and can therefore be written as k(x, x') = k(r). Such covariance functions can be approximated by methods that utilize the spectral density

S(ω) = ∫ k(r) exp(-i ω^T r) dr.        (12)

If the covariance function is isotropic, meaning that it depends only on the Euclidean norm ||r||, also S is isotropic and can be written as S(||ω||), i.e. as a function of one variable. As shown in (Solin and Särkkä, 2019), an isotropic covariance function can be approximated as

k(x, x') ≈ Σ_{b=1}^{B} S(√λ_b) ψ_b(x) ψ_b(x'),        (13)

where λ_1, ..., λ_B and ψ_1, ..., ψ_B are the B first eigenvalues and eigenfunctions of the Dirichlet boundary value problem

-∇² ψ_b(x) = λ_b ψ_b(x), x ∈ Ω,    ψ_b(x) = 0, x ∈ ∂Ω,        (14)

for a compact set Ω ⊂ R^d. We see that this approximation has the same form as Eq. 11 with M = B and φ_m(x) = √(S(√λ_m)) ψ_m(x). The spectral density has a closed form for many kernels, and the domain Ω can be selected so that the eigenvalues and eigenfunctions have one too. The functions φ_m are therefore easy to evaluate, and the computation strategy described in Section 4.1 can then be used. As an example, when d = 1 and Ω = [-L, L] with L > 0, we have

ψ_b(x) = (1/√L) sin( π b (x + L) / (2L) ),    λ_b = ( π b / (2L) )²,        (15)

and it was proven in (Solin and Särkkä, 2019) that in this case the approximation in Eq. 13 converges uniformly to k(x, x') as B and L grow, for any stationary k that has a regular enough spectral density. For example, for the EQ kernel (Eq. 9) with d = 1, the spectral density is S(ω) = α² √(2π) ℓ exp(-ℓ² ω² / 2).
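To make Eqs. 13-15 concrete, here is a small NumPy sketch (ours) of the reduced-rank approximation of the one-dimensional EQ kernel on Ω = [-L, L]; the maximum absolute error is small once B and L are large enough.

```python
import numpy as np

def eigfun(x, b, L):
    """b-th Laplacian eigenfunction on [-L, L] with Dirichlet boundaries (Eq. 15)."""
    return np.sin(np.pi * b * (x + L) / (2 * L)) / np.sqrt(L)

def eigval(b, L):
    """b-th eigenvalue lambda_b = (pi * b / (2L))^2 (Eq. 15)."""
    return (np.pi * b / (2 * L)) ** 2

def eq_spectral_density(w, alpha=1.0, ell=1.0):
    """Spectral density of the 1-D EQ kernel."""
    return alpha**2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * w) ** 2)

def approx_eq_kernel(x1, x2, B=16, L=4.0, alpha=1.0, ell=1.0):
    """Reduced-rank approximation of Eq. 13 for the 1-D EQ kernel."""
    b = np.arange(1, B + 1)
    S = eq_spectral_density(np.sqrt(eigval(b, L)), alpha, ell)
    Phi1 = eigfun(x1[:, None], b[None, :], L) * np.sqrt(S)   # n1 x B feature matrix
    Phi2 = eigfun(x2[:, None], b[None, :], L) * np.sqrt(S)   # n2 x B feature matrix
    return Phi1 @ Phi2.T

x = np.linspace(-2, 2, 5)
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
print(np.max(np.abs(approx_eq_kernel(x, x) - K_exact)))      # small for large enough B and L
```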

4.3 Kernels for categorical variables

Let us study a kernel k defined on a finite set of R possible values (categories), which we encode numerically as integers 1, ..., R. Because there are only R² possible input combinations (z, z'), and therefore R² possible values of k(z, z'), we can list them in the matrix K_Z ∈ R^{R×R}, which has elements [K_Z]_{rs} = k(r, s). If k is symmetric, the symmetric square matrix K_Z has the orthogonal eigendecomposition

K_Z = V Λ V^T,        (16)

where Λ is the diagonal matrix containing the eigenvalues λ_1, ..., λ_R on the diagonal and V has the corresponding orthonormal eigenvectors v_1, ..., v_R as its columns. For each column v_r, we can define a function φ_r so that φ_r(z) = √λ_r [v_r]_z. We see that

k(z, z') = [V Λ V^T]_{z z'} = Σ_{r=1}^{R} λ_r [v_r]_z [v_r]_{z'}        (17)
         = Σ_{r=1}^{R} φ_r(z) φ_r(z'),        (18)

meaning that we have written k in the form of Eq. 11 with M = R. Note that this is an exact decomposition of k and not an approximation. The complexity of computing the eigendecomposition is O(R³), but in typical applications R << n and this is not a bottleneck. Actually, for example for the ZS kernel and other CS kernels, the eigenvalues have a closed form and the corresponding eigenbasis is known (see Suppl. Section 2). Furthermore, if k does not depend on any hyperparameters, the eigendecomposition needs to be computed only once before parameter inference. If the kernel is of the form k = α² k_0, where the magnitude α is the only parameter, the decomposition can obviously be done just for k_0, which again has no parameters. Evaluating the functions φ_r is easy, as it corresponds to just looking up a value from the R × R matrix V Λ^{1/2}.

4.4 Mixed-domain product kernels

We now consider approximating a product kernel k = ∏_{d=1}^{D} k_d, where for each k_d we have an available decomposition

k_d(x, x') ≈ Σ_{m=1}^{M_d} φ^{(d)}_m(x) φ^{(d)}_m(x'),        (19)

which might be an approximation or an exact decomposition of k_d. The total approximation is

k(x, x') ≈ ∏_{d=1}^{D} Σ_{m_d=1}^{M_d} φ^{(d)}_{m_d}(x) φ^{(d)}_{m_d}(x')        (20)
         = Σ_{m_1=1}^{M_1} ... Σ_{m_D=1}^{M_D} ∏_{d=1}^{D} φ^{(d)}_{m_d}(x) φ^{(d)}_{m_d}(x'),        (21)

where the total number of terms is M = ∏_{d=1}^{D} M_d. We now have a representation of the product kernel in the form of Eq. 11 with M sum terms. Note that since the individual kernels in the product can be both categorical and continuous, Eq. 21 provides a kernel representation for mixed-domain GPs with product kernels. Also note that M grows exponentially with D.
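A direct way to implement Eq. 21 (our sketch) is to take every combination of one column from each factor's feature matrix and multiply them elementwise; the result has ∏_d M_d columns, which makes the exponential growth in D explicit.

```python
import numpy as np
from itertools import product

def product_kernel_features(factor_features):
    """Combine per-factor n x M_d feature matrices into the n x (M_1 * ... * M_D) matrix of Eq. 21.

    Column (m_1, ..., m_D) of the result is the elementwise product of column m_d of factor d,
    so that Phi @ Phi.T approximates the product kernel matrix.
    """
    n = factor_features[0].shape[0]
    cols = []
    for idx in product(*(range(F.shape[1]) for F in factor_features)):
        col = np.ones(n)
        for F, m in zip(factor_features, idx):
            col = col * F[:, m]
        cols.append(col)
    return np.stack(cols, axis=1)
```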

4.5 Mixed-domain sum kernels

The most general kernels that we consider are of the form k = Σ_{j=1}^{J} k^{(j)}, where each k^{(j)} = ∏_{d=1}^{D_j} k^{(j)}_d and D_j is the number of product factors in the sum term j. If each k^{(j)} has a (possibly approximate) decomposition

k^{(j)}(x, x') ≈ Σ_{m=1}^{M_j} φ^{(j)}_m(x) φ^{(j)}_m(x')        (22)

with M_j sum terms, we can approximate k with

k(x, x') ≈ Σ_{j=1}^{J} Σ_{m=1}^{M_j} φ^{(j)}_m(x) φ^{(j)}_m(x'),        (23)

where M_j = ∏_{d=1}^{D_j} M_{j,d} for each j = 1, ..., J. Now we have a sum representation (Eq. 11) of the kernel with M = Σ_{j=1}^{J} M_j terms.
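For the sum kernel of Eq. 23, the per-component feature matrices are simply stacked side by side (our sketch), giving M = Σ_j M_j columns in total.

```python
import numpy as np

def sum_kernel_features(component_features):
    """Features for a sum kernel: concatenate the per-component n x M_j feature matrices.

    The result has M_1 + ... + M_J columns, and Phi @ Phi.T approximates the total
    kernel matrix of Eq. 23.
    """
    return np.concatenate(component_features, axis=1)
```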

4.6 Mixed kernels for longitudinal data

In our framework, we consider mixed kernels k defined on a mixed space of both continuous and categorical dimensions, composed through multiplication and addition so that

k(x, x') = Σ_{j=1}^{J} ∏_{c=1}^{C_j} k^{con}_{j,c}(x, x') ∏_{d=1}^{D_j} k^{cat}_{j,d}(x, x'),        (24)

where each continuous kernel k^{con}_{j,c} is isotropic and depends only on one continuous dimension of x, and each categorical kernel k^{cat}_{j,d} depends only on one categorical dimension of x, which has R_{j,d} different categories. For each k^{con}_{j,c}, we use the basis function approximation (Eq. 13) with B_{j,c} basis functions and domain [-L_{j,c}, L_{j,c}], and for each k^{cat}_{j,d} the exact decomposition (Eq. 18). Using Eq. 23, we can write k in the format of Eq. 11 with

M = Σ_{j=1}^{J} ( ∏_{c=1}^{C_j} B_{j,c} ) ( ∏_{d=1}^{D_j} R_{j,d} )        (25)

terms. In each term, the function φ_m is a product of C_j continuous-type factors and D_j categorical-type factors.

As an example, if C_j = D_j = 1 for each j and we use B basis functions for all continuous kernels, then the scalability is O(n B Σ_{j=1}^{J} R_j), where R_j is the number of categories of the categorical covariate in component j. Further, if each categorical variable has at most R different values, then the scalability is O(n J B R), where J is the number of additive components.
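Putting the pieces together, the following sketch (ours; it reuses the hypothetical helper functions from the sketches in Sections 4.2-4.5) assembles the feature matrix for a simple longitudinal kernel of the form k_EQ(age) + k_ZS(id) · k_EQ(age); the resulting number of columns, B + B·R, is the count given by Eq. 25 for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R, B, L = 200, 5, 16, 4.0
age = rng.uniform(-2, 2, size=n)                 # continuous covariate (standardized)
subj = rng.integers(0, R, size=n)                # categorical covariate with R categories

# Continuous features for k_EQ(age): Hilbert space basis of Section 4.2.
b = np.arange(1, B + 1)
S = eq_spectral_density(np.sqrt(eigval(b, L)))
Phi_age = eigfun(age[:, None], b[None, :], L) * np.sqrt(S)                # n x B

# Categorical features for k_ZS(id): exact decomposition of Section 4.3.
K_zs = np.where(np.eye(R, dtype=bool), 1.0, -1.0 / (R - 1))
Phi_id = categorical_features(K_zs)[subj]                                 # n x R

# Shared age effect plus id-specific age effect (Sections 4.4 and 4.5).
Phi = sum_kernel_features([Phi_age, product_kernel_features([Phi_age, Phi_id])])
print(Phi.shape)                                 # (n, B + B * R), matching Eq. 25
```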

5 Results

We demonstrate the scalability and accuracy of the presented approach using experiments with simulated and real data. In all experiments, we use the dynamic HMC algorithm of Stan (version 2.27) (Carpenter et al., 2017), with a fixed target acceptance rate, for MCMC sampling of the parameters of our approximate models (code will be made available at https://github.com/jtimonen/scalable-mixed-domain-GPs). All models are fitted by running four independent MCMC chains for 2000 iterations each, discarding the first half of each chain as warmup. In all experiments, we use Student-t priors for the kernel magnitude parameters and log-normal priors with mean 0 and scale 1 for the kernel lengthscale parameters. In Experiments 1 and 2, the noise variance parameter of the Gaussian observation model has an Inverse-Gamma prior. Priors are on the normalized data scale, meaning that continuous variables are standardized to zero mean and unit variance during inference.

In all experiments, we use the same number of basis functions B for each approximate continuous kernel. We also use the same domain scaling factor c for all approximate components, defined so that Ω = [-L, L] with L equal to c times the half-range of the continuous covariate of the approximated kernel (Riutort-Mayol et al., 2020).

Experiments 1-2 are run on a modern CentOS 7 computing cluster and Experiment 3 on a laptop computer.

Figure 1: The posterior predictive mean for four different approximate models in one replication of Experiment 1. The yellow line shows the posterior predictive mean of the corresponding exact GP model as a reference. Black dots are training data and red crosses are test data.

5.1 Experiment 1: Simulation study

In the first experiment, we create simulated longitudinal data consisting of two categorical variables, id and group, and a continuous variable, age. We create data for 9 individuals, divided into groups according to their id. For each of individuals 1-6, we create observations at time points drawn uniformly from a fixed interval, and the number of time points per individual is varied to control the total data size n. For individuals 7-9, observations are created similarly.

Figure 2: Mean log predictive density for test data in Experiment 1. Black dashed line corresponds to the exact model. Results are averages over 30 replications of the experiment.

We consider an additive GP model with kernel

(26)

and we simulate a realization of the data using fixed values of the kernel parameters. We then generate response variable measurements y_i = f(x_i) + ε_i, where ε_i ~ N(0, σ²) and the realization f represents the ground-truth signal.

Data from individuals 1-6 is used for training, while data from individuals 7-9 is left for testing. Using the training data, we fit an exact and an approximate model with the correct covariance structure from Eq. 26, using a Gaussian observation model. The exact model is fitted with lgpr (Timonen et al., 2021), which also uses Stan for MCMC. The exact model utilizes the marginalization approach for GPs, since a Gaussian observation model is specified.

Figure 1 shows the posterior predictive mean of the exact model and of approximate models with different numbers of basis functions B, using a fixed domain scaling factor c. We see that already with moderate values of B the mean predictions are indistinguishable from those of the exact model. We fit the approximate model using different values of B and c, and repeat the experiment using different data sizes n. The results in Figure 3 validate empirically that the runtime scales linearly as a function of both n and B.

We compute the mean log predictive density (MLPD) at the test points (see Suppl. Section 3 for details about out-of-sample prediction and MLPD). The results in Figure 2 show that the MLPD of the approximate model approaches that of the exact model as B grows. With small data sizes and small B, the predictive performance can actually be better than that of the exact model, possibly because the coarser approximation is a simpler model that generalizes better in this case.

Figure 3: Runtimes of fitting the exact and approximate models in Experiment 1. a) Exact model vs. the approximation. b) Approximations using different values of B. The markers show the average time taken to run one MCMC chain of 2000 iterations. The vertical error bars show one standard deviation (not shown for the approximate model in panel a). Note the smaller y-axis scale in panel b. We see empirically that the runtime of the approximate model scales linearly in both n and B.

5.2 Experiment 2: Canadian weather data

We analyse data that consists of average temperature measurements at 35 Canadian weather stations (Ramsay and Silverman, 2005). There is a total of n = 35 × 365 = 12,775 data points, which are the daily temperatures at the 35 locations, averaged over the years 1960-1994. We fit an additive GP model with a Gaussian observation model, using the EQ kernel for one component and the product EQ × ZS kernel for the two other components.

Figure 4: Results for the Canadian weather data experiment. Panels a)-c) show the marginal posterior distribution of each of the three model components (mean ± two standard deviations). The standard deviation is not shown for one of the components for clarity. The functions are on the standardized scale (response variable normalized to zero mean and unit variance). We see for example that the regions tend to have larger differences during winter.

We used the same domain scaling factor c for all components and ran the 4 MCMC chains in parallel using 4 CPU cores. This was repeated with different values of B, the number of basis functions for each component. Total runtimes for fitting the models were on the order of hours. The posterior distributions of each model component are shown in Figure 4. The posterior predictive distribution for each station separately is visualized in Suppl. Figure 1.

5.3 Experiment 3: US presidential election prediction

In the last example, we demonstrate a beta-binomial observation model and model the two-party vote share of the Republican Party in each state in US presidential elections. By two-party vote share we mean the number of votes cast for the Republican candidate divided by the sum of the votes cast for the Republican and Democratic candidates (data from MIT Election Data and Science Lab, 2017). Following Trangucci (2017), Washington DC is excluded from the analysis. We use data from the 1976-2016 elections as training data, meaning that n = 50 × 11 = 550 observations.

We fit an additive GP model with a beta-binomial observation model, using the EQ kernel for one component and the product EQ × ZS kernel for the two other components. The observation model is

y_{R,i} ~ Beta-Binomial(n_i, θ_i γ, (1 - θ_i) γ),        (27)

where y_{R,i} and y_{D,i} are the numbers of votes for the Republican and Democratic parties in observation i, n_i = y_{R,i} + y_{D,i}, γ > 0 is a dispersion parameter, and θ_i = inv-logit(f(x_i) + w_0). The intercept w_0 is given its own prior and γ a Log-Normal(1, 1) prior.

Fitting the model on a 2018 MacBook Pro computer (2.3 GHz Quad-Core Intel i5 CPU), running the 4 chains in parallel, took approximately 18 minutes. The posterior distributions of each model component are shown in Figure 5. See also Supplementary Figure 2, where we have also visualized the data from the 2020 election to validate that the model predicts well into the future.

6 Conclusion

Gaussian processes offer an attractive framework for specifying flexible models using a kernel language. The computational cost of their exact inference, however, limits possible applications to small data sets. Our scalable framework opens up a rich class of GP models to be used in large-scale applications in various fields of science, as its computational complexity is linear with respect to the data size. We have presented a scalable approximation scheme for mixed-domain covariance functions and demonstrated its use in the context of Bayesian GP regression. However, it can also be applied in GP applications where the kernel hyperparameters are optimized using a marginal likelihood criterion.

We recall that we have assumed that the categorical kernels are symmetric and the continuous kernels are stationary. Non-stationary effects can still be modeled by first applying a warping to the input and then using a stationary kernel (see for example (Cheng et al., 2019)). Another limitation of the approach is that when the number of factors in a product kernel grows, the total number of basis functions required for that component grows exponentially and can become too large. This still leaves us with a large class of mixed-domain GP models that are scalable.

Figure 5: Results for the US election prediction experiment. Panels a)-c) show the marginal posterior distribution of each of the three model components (mean ± two standard deviations). The standard deviation is not shown for two of the components for clarity.

Acknowledgements

We thank Aki Vehtari and Gleb Tikhonov for useful comments on draft versions of this manuscript, and acknowledge the computational resources provided by Aalto Science-IT, Finland. This work was supported by the Academy of Finland and Bayer Oy.

References

  • Y. Cao, M. A. Brubaker, D. J. Fleet, and A. Hertzmann (2015) Efficient optimization for sparse Gaussian process regression. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (12), pp. 2415–2427.
  • B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017) Stan: a probabilistic programming language. Journal of Statistical Software 76 (1), pp. 1–32.
  • L. Cheng, S. Ramchandran, T. Vatanen, N. Lietzen, R. Lahesmaa, A. Vehtari, and H. Lähdesmäki (2019) An additive Gaussian process regression model for interpretable non-parametric analysis of longitudinal data. Nature Communications 10.
  • I. Chung, S. Kim, J. Lee, K. J. Kim, S. J. Hwang, and E. Yang (2020) Deep mixed effect model using Gaussian processes: a personalized and reliable prediction for healthcare. The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
  • X. Deng, C. D. Lin, K.-W. Liu, and R. K. Rowe (2017) Additive Gaussian process for computer models with qualitative and quantitative factors. Technometrics 59 (3), pp. 283–292.
  • V. Fortuin, G. Dresdner, H. Strathmann, and G. Rätsch (2021) Sparse Gaussian processes on discrete domains. IEEE Access 9, pp. 76750–76758.
  • E. C. Garrido-Merchán and D. Hernández-Lobato (2020) Dealing with categorical and integer-valued variables in Bayesian optimization with Gaussian processes. Neurocomputing 380, pp. 20–35.
  • C. G. Kaufman and S. R. Sain (2010) Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Analysis 5 (1), pp. 123–149.
  • G. S. Kimeldorf and G. Wahba (1970) A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics 41 (2), pp. 495–502.
  • H. Liu, Y. Ong, X. Shen, and J. Cai (2020) When Gaussian process meets big data: a review of scalable GPs. IEEE Transactions on Neural Networks and Learning Systems 31 (11), pp. 4405–4423.
  • MIT Election Data and Science Lab (2017) U.S. President 1976–2020, V6.
  • E. J. Pedersen, D. L. Miller, G. L. Simpson, and N. Ross (2019) Hierarchical generalized additive models in ecology: an introduction with mgcv. PeerJ (5).
  • J. Quiñonero-Candela and C. E. Rasmussen (2005) A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, pp. 1939–1959.
  • F. A. Quintana, W. O. Johnson, L. E. Waetjen, and E. B. Gold (2016) Bayesian nonparametric longitudinal data analysis. Journal of the American Statistical Association 111 (515), pp. 1168–1181.
  • J. Ramsay and B. W. Silverman (2005) Functional Data Analysis. 2nd edition, Springer, New York, NY.
  • C. E. Rasmussen and C. K. I. Williams (2006) Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts.
  • G. Riutort-Mayol, P. Bürkner, M. R. Andersen, A. Solin, and A. Vehtari (2020) Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. arXiv:2004.11408.
  • O. Roustant, E. Padonou, Y. Deville, A. Clément, G. Perrin, J. Giorla, and H. Wynn (2020) Group kernels for Gaussian process metamodels with categorical inputs. SIAM/ASA Journal on Uncertainty Quantification 8 (2), pp. 775–806.
  • E. Snelson and Z. Ghahramani (2006) Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, Vol. 18.
  • A. Solin and S. Särkkä (2019) Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing.
  • J. Timonen, H. Mannerström, A. Vehtari, and H. Lähdesmäki (2021) lgpr: an interpretable non-parametric method for inferring covariate effects from longitudinal data. Bioinformatics 37 (13), pp. 1860–1867.
  • M. Titsias (2009) Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 5, pp. 567–574.
  • R. Trangucci (2017) Hierarchical Gaussian processes in Stan.
  • G. Verbeke and G. Molenberghs (2000) Linear Mixed Models for Longitudinal Data. Springer, New York, NY.
  • L. Wang, S. Yerramilli, A. Iyer, D. Apley, P. Zhu, and W. Chen (2021) Scalable Gaussian processes for data-driven design using big data with categorical factors. arXiv:2106.15356.
  • A. G. Wilson and H. Nickisch (2015) Kernel interpolation for scalable structured Gaussian processes (KISS-GP). arXiv:1503.01057.
  • Y. Zhang, S. Tao, W. Chen, and D. W. Apley (2020) A latent variable approach to Gaussian process modeling with qualitative and quantitative factors. Technometrics 62 (3), pp. 291–302.
  • Y. Zhang and W. I. Notz (2015) Computer experiments with qualitative and quantitative variables: a review and reexamination. Quality Engineering 27, pp. 2–13.
