1 Introduction
A Gaussian process regression (GPR) model is a kernelbased Bayesian nonparametric model that represents the correlation/similarity of the data using a kernel for performing nonlinear probabilistic regression. A limitation of such a fullrank GPR model is its poor scalability to big data since computing its predictive belief and learning the kernel hyperparameters incur cubic time in the data size. To overcome this limitation, a number of sparse GPR (SGPR) models have been proposed [Chen et al.2012, Chen et al.2013, Chen et al.2015, Gal and Turner2015, Hensman et al.2015, Hoang, Hoang, and Low2015, Hoang, Hoang, and Low2016, Hoang, Hoang, and Low2017, LázaroGredilla et al.2010, Low et al.2015a, Low et al.2015b, QuiñoneroCandela and Rasmussen2005, Titsias2009, Titsias and LázaroGredilla2013, Xu et al.2014, Yu et al.2019b]. These SGPR models exploit a small set of inducing variables or a spectral representation of the kernel to derive a lowrank GP approximation for achieving scalable training and prediction to big data.
All the SGPR models mentioned above are either designed only for the widelyused squared exponential (SE) kernel [Titsias and LázaroGredilla2013, Yu et al.2019b] or assume the kernel type to be specified by the user a priori. However, in the era of big data, it has become all but impossible for nonexperts to manually select an appropriate kernel for a GP model since the underlying correlation structures of massive datasets are usually too complex to be captured by the commonlyused base kernels (e.g., SE and periodic kernels). Such an issue can be resolved by kernel selection algorithms [Duvenaud et al.2013, Kim and Teh2018, Lu et al.2018, Malkomes, Schaff, and Garnett2016] which are designed to automatically find a kernel with the highest model evidence (e.g., marginal likelihood, Bayesian information criterion) given a dataset. These algorithms have demonstrated success in improving the GP predictive performance over that of using the manually specified kernel. But, selecting only one kernel with the highest model evidence and ignoring the uncertainty of it being the true kernel may result in overconfident predictions, especially if other kernels yield similar model evidences [Hoeting et al.1999]. This motivates the need to design and develop a Bayesian kernel selection (BKS) algorithm that, instead of searching for the best kernel, considers the kernel as a random variable defined over a kernel space and learns a belief of the probabilistic kernel from data such that the uncertainty of the kernel can be interpreted and exploited to avoid overconfident GP predictions, which is the focus of our work here.
Most existing BKS algorithms for GP models approximate the kernels using their spectral density representation and thus only work for stationary kernels [Benton et al.2019, Oliva et al.2016, Wilson and Adams2013]. The BKS algorithm of malkomes16b malkomes16b caters to any kernel but is designed for the fullrank GPR model only. So, its approach of updating the kernel belief scales poorly in the data size. This paper presents a variational BKS (VBKS) algorithm for SGPR models without any stationary assumption on the kernels. In particular, we represent the probabilistic kernel as an additional variational variable in a variational inference (VI) framework for the SGPR models where its posterior belief is learned together with that of the other variational variables (i.e., inducing variables and hyperparameters) by minimizing the KullbackLeibler (KL) divergence between a variational approximation and their exact posterior belief or, equivalently, maximizing a variational lower bound of the logmarginal likelihood (Section 3). Unfortunately, the existing variational SGPR models [Hensman et al.2015, Hoang, Hoang, and Low2015, Hoang, Hoang, and Low2016, Titsias2009, Titsias and LázaroGredilla2013, Yu et al.2019b]
cannot be straightforwardly used by our VBKS algorithm since (a) they commonly use continuous distributions (e.g., normal distribution) that cannot directly accommodate the discrete kernel belief involving further constraints, and (b) the VBKS algorithm has to jointly learn the posterior belief of a large number of hyperparameters from big data as
many kernels of different types are evaluated simultaneously, which is computationally more expensive than approximating the posterior belief of hyperparameters for only one specified kernel.To address the above challenges, we first reparameterize the discrete kernel belief using variational variables to a continuous parametric distribution (Section 3) and then decompose the variational lower bound into an additive form such that each additive term depends only on a disjoint subset of the variational variables and can thus be maximized independently (Section 4
). To maximize the variational lower bound, stochastic optimization is used to iteratively improve the variational approximation of the exact posterior belief via stochastic gradient ascent where the stochastic gradient is estimated by first sampling from the variational distributions and then from the observed data (Section
4.1). The former sampling step makes the gradient estimation applicable to any differentiable kernel function in the kernel space (rather than only the SE kernel adopted in the works of Titsias2013 Titsias2013 and Yu2019 Yu2019) while the latter step enables our VBKS algorithm to incur constant time per iteration and hence scale to big data. Consequently, the kernel posterior belief can be updated as and when more training data is used, which makes it possible to perform BKS without using the full massive dataset and thus achieve competitive predictive performance fast. We empirically evaluate the performance of our VBKS algorithm on synthetic and two massive realworld datasets.2 Background and Notations
2.1 Gaussian Process Regression (GPR)
Let denote a
dimensional input domain such that each input vector
is associated with a noisy output observed from corrupting the function evaluated at by an additive noise whereis the noise variance. Let
denote a Gaussian process (GP), that is, every finite subset offollows a multivariate Gaussian distribution. Such a GP is fully specified by its
prior mean and covariance for all , the latter of which is usually defined by one of the widelyused kernels (e.g., squared exponential (SE), periodic (PER)) with a vector of hyperparameters . In this paper, is assumed to be zero and is used to denote a function with GP prior covariance for notational simplicity.Supposing a column vector of noisy outputs are observed by evaluating function at a set of training inputs, a fullrank GPR model can perform probabilistic regression by providing a GP predictive belief for any test input with the following posterior mean and variance:
(1) 
where , , and . Due to the inversion of , computing the above predictive belief incurs time and thus scales poorly in the size of observed data.
2.2 Sparse Gaussian Process Regression (SGPR)
To improve the scalability of the GP model, a number of SGPR models have been proposed. These SGPR models exploit a vector of inducing variables^{1}^{1}1Let denote a vector of inducing variables whose prior covariance is computed using the kernel function . for a small set of inducing inputs (i.e., ) for approximating the GP predictive belief:
(2) 
where is a vector of variables that can be set as either (i.e., is assumed to be a point estimate) [Hoang, Hoang, and Low2015, Hoang, Hoang, and Low2016, QuiñoneroCandela and Rasmussen2005, Titsias2009] or [Hensman et al.2015, Titsias and LázaroGredilla2013, Yu et al.2019b]. Variational inference has been used to derive by minimizing the KL divergence between and the exact posterior belief . Various conditional independence assumptions of and given have been imposed for computing , which result in different sparse GP approximations [Hoang, Hoang, and Low2015, Hoang, Hoang, and Low2016, QuiñoneroCandela and Rasmussen2005].
2.3 Automatic Kernel Selection
All the GPR and SGPR models mentioned above assume the kernel type to be specified by the user and learn and from the data. However, the kernel choice is critical to the performance of the (sparse) GP models since various kernel types can capture different underlying correlation structures of the data (see Chapter in [Rasmussen and Williams2006] for a detailed discussion of various kernels). Let be a set of candidate kernels (e.g., SE, PER). The automatic kernel selection algorithms [Duvenaud et al.2013, Kim and Teh2018, Lloyd et al.2014, Lu et al.2018, Malkomes, Schaff, and Garnett2016] aim to automatically find a kernel with the highest model evidence (e.g., marginal likelihood, Bayesian information criterion). Since the sum or product of two valid kernels (i.e., positive semidefinite kernels that define valid covariance functions) is still a valid covariance function, the kernel space can be constructed by repeatedly applying the following composition rules:
where and can be either one of the base kernels (i.e., SE, PER, linear (LIN), and rationalquadratic (RQ)) or a composite kernel [Duvenaud et al.2013].
3 Variational Bayesian Kernel Selection (VBKS) for SGPR Models
As mentioned in Section 1, most existing kernel selection algorithms aim to find only one kernel with the highest model evidence [Duvenaud et al.2013, Kim and Teh2018, Lloyd et al.2014, Lu et al.2018, Malkomes, Schaff, and Garnett2016]. However, if several kernels achieve similar model evidences, ignoring the uncertainty among the kernels and selecting only one kernel with the highest model evidence may result in overconfident inferences/predictions or overfitting, especially if some composite kernel structures in the kernel space are complex. To resolve this issue, the Bayesian kernel selection (BKS) problem considers as a random variable and introduces a kernel belief over . Then, the GP predictive belief (2) has to consider an additional variable of the kernel, which yields
(3) 
where , , and and still, respectively, denote the vectors of inducing variables and hyperparameters of kernel to ease notations, even though is now probabilistic. The exact definitions of , , and will be introduced later. Next, the key issue is to approximate the intractable posterior belief in (3) such that the GP predictive belief can be computed analytically and efficiently. To achieve this, the active structure discovery algorithm of malkomes16b malkomes16b has proposed to approximate and via Laplace approximation such that the posterior belief of the kernel can be computed by applying Bayes’ rule. However, such a BKS algorithm is designed for the fullrank GPR model only. So, it still incurs time and scales poorly in the size of observed data.
To scale BKS of GP models to big data, we propose to approximate the posterior belief in (3) via variational inference (VI) for SGPR models, which we call variational BKS (VBKS)^{2}^{2}2Though our proposed VBKS algorithm performs Bayesian kernel inference instead of “selecting” a specific kernel(s), we call it “kernel selection” to be consistent with the Bayesian model selection framework [Rasmussen and Williams2006].. In particular, the intractable posterior belief in (3) can be approximated with a variational distribution:
(4) 
by minimizing the KL divergence between and the exact posterior belief . The variational approximation in (4) is similar to those used by existing VI frameworks of SGPR models [Titsias and LázaroGredilla2013, Yu et al.2019b] except that an additional term is included due to the probabilistic kernel whose posterior belief needs to be learned from data together with that of . We will consider a finite kernel space and hence a discrete distribution over in our work here.
Let be a set of kernels where each represents a kernel which can either be the base kernel or the composite kernel, as described in Section 2.3. The kernel belief can be defined as a vector where for . Similarly, the variational distribution can be represented by a vector of variational parameters where for . Then, the objective of VBKS is to minimize w.r.t. and the variational parameters of with the following constraints:
A commonlyused method to solve such a constrained optimization problem is to convert it to that of unconstrained optimization via some tricks (e.g., substitution, Lagrange multiplier). In this work, we achieve this by introducing a dimensional vector of continuous variables and reparameterizing using
where is the th element of . Then, let the variational distribution . The abovementioned constrained optimization problem is transformed to that of minimizing w.r.t. the variational parameters (i.e., detailed later) of and without any constraints. An additional benefit of introducing as a vector of variational variables is that we can then place parametric multivariate distributions on and such that the prior knowledge of the kernel set can be included in and the true correlations between different kernels can be learned from data by learning the covariance parameters of , which is useful in interpreting the relationship between kernels (e.g., a high correlation between and can be interpreted as a high similarity between and with potentially similar learning performances).
Then, minimizing is equivalent to maximizing a variational evidence lower bound (ELBO):
(5) 
since the log marginal likelihood is a constant. The derivation of (5) is in Appendix A. Unfortunately, the evaluation of is intractable since it contains the inverse of the prior covariance matrix of the inducing variables which depends on but cannot be analytically integrated over . Some works [Titsias and LázaroGredilla2013, Yu et al.2019b] have resolved this issue by introducing a standardized kernel and reparameterizing the prior covariance matrix such that its inversion does not depend on the hyperparameters. However, such a reparameterization trick cannot be applied to our work here since the standardized kernel is only defined for the SE kernel and cannot be generalized to cater to the other kernels, especially the composite ones. The doubly stochastic VI framework of titsias2014 titsias2014 has proposed to generalize the optimization of to any differentiable kernel function by sampling from the variational distribution, which cannot be straightforwardly used to optimize in VBKS since it is designed for GP models with only one kernel and does not consider the scalability in when many kernels have to be evaluated simultaneously. Next, a scalable stochastic optimization method for VBKS will be introduced to circumvent the abovementioned issues.
4 Scalable Stochastic Optimization for VBKS
Let and where are the hyperparameters of kernel and is a vector of inducing variables whose prior covariance is computed using kernel . Optimizing jointly w.r.t. and is computationally challenging since they are both highdimensional, especially if is large. To improve the scalability of the optimization in , we assume that (a) for are independent and also independent of and , and (b) is conditionally independent of given . Then,
(6) 
The graphical model in Fig. 1 shows the relationship between the variables of our SGPR model with the probabilistic kernel.
Let the variational distribution be factorized as
(7) 
where and are the exact posterior beliefs of and . The ELBO in (5) can be decomposed as
(8) 
where
(9) 
includes only the variational variables and is thus called a local ELBO. The derivation of (8) is in Appendix B. Interestingly, to maximize (8), we can maximize each local ELBO over independently for and then maximize (8) over since and is independent of and the variational variables of for , which makes the optimization of incur linear time in the kernel size and thus easily parallelizable. Next, we will discuss how to maximize and each via stochastic gradient ascent, which incurs only constant time per iteration and hence makes our VBKS algorithm scale well to big data.
4.1 Stochastic Optimization
In this section, we will first discuss the optimization of the local ELBOs for and then the optimization of via stochastic gradient ascent (SGA).
Optimizing the local ELBOs.
To optimize via stochastic optimization, we reparameterize the variational variables and by assuming that they are affine transformations of a vector of variables which follow a standard distribution. In particular, let and where are variational parameters that are independent of and , and follows a standard multivariate Gaussian distribution^{3}^{3}3We use the widelyused standard multivariate Gaussian distribution as an example here. Similar to doubly stochastic VI [Titsias and LázaroGredilla2014], can be any standard continuous density function which yields a different .: . We can factorize the variational distribution and obtain
(10) 
where and . Then, the gradient of the local ELBOs can be computed with respect to the variational parameters using
(11) 
The derivation of (11) can be obtained by applying the results in Appendix D of hoang2017 hoang2017. Note that the reparameterization trick introduced above is used to make the first expectation operator in (11) independent of the variational parameters such that the gradient operator can be moved inside the first expectation w.r.t. . Otherwise, if we assume (10) and optimize each in (9) directly w.r.t. the variational parameters , then the gradient operator cannot be moved into the first expectation in (9) since depends on , which makes the estimation of the gradient nontrivial.
Given (11), the gradient of w.r.t. can be approximated by sampling , as detailed in Appendix C. Unfortunately, the approximation of (11) is still computationally expensive for massive (e.g., millionsized) datasets since the estimation of incurs linear time in the data size per SGA update. To resolve this issue, we exploit the deterministic training conditional (DTC) assumption of conditional independence among for given the inducing variables for deriving
(12) 
Then, we can obtain an unbiased stochastic gradient estimate of w.r.t. by uniformly sampling a minibatch of the observed data where and the (fixed) batch size is much smaller than . As a result, the SGA update of incurs only constant time per iteration.
Optimizing w.r.t. .
The optimization of can be performed by applying similar reparameterization trick and SGA. Specifically, let be the maximal local ELBO achieved using the above SGA and with . The gradient of can be derived in the same manner as that of (11):
where . Then, the variational parameters can also be updated via SGA by approximating the expectation operator in from sampling .
5 Kernel Posterior Belief and Predictive Belief of SGPR Model
The optimized variational parameters and can be computed from the abovementioned stochastic optimization and used to induce the optimal variational distributions and for via (10). Then, the posterior belief of the probabilistic kernel can be approximated by Monte Carlo (MC) sampling:
(13) 
where are i.i.d. samples from . Such an approximated kernel posterior belief is a discrete distribution where each for
can be (a) easily interpreted to identify the “best” kernel whose posterior probability is much larger than that of the other kernels, and (b) exploited to avoid overconfident predictions by Bayesian model averaging, as will be shown in Section
6.Recall from Section 3 that we approximate the posterior belief using the variational distribution for computing the predictive belief . Given the optimal variational distributions and for , the predictive belief in (3) can be approximated using
(14) 
which yields the following approximated predictive mean and variance for :
(15) 
where and are the predictive mean and variance of approximated by MC sampling. The derivations of (14), (15), and the steps for computing and are in Appendix D.
As can be seen from (15), the time incurred to compute the predictive mean and variance is linear in the kernel size , which is still expensive for GP prediction if is large. To overcome this issue, we can consider constructing a much smaller kernel set including only the kernels with topranked posterior probabilities. Then, a new kernel posterior belief over can be easily computed by optimizing a new ELBO constructed from reusing the optimized local ELBOs of the kernels in . As a result, the GP prediction can incur less time by pruning away kernels with low posterior probabilities while still maintaining the kernel uncertainty to avoid overconfident predictions.
In addition, one may notice that the GP predictive belief with the probabilistic kernel in (14) is in fact performing Bayesian model averaging (BMA) [Hoeting et al.1999] over multiple GP models, each of which considers only one kernel . Though it may seem straightforward to use BMA for GP prediction when multiple kernels are considered, our VBKS algorithm provides a principled way of doing so by deriving (14) from (3) using the VI framework and can scale the GP inference with the probabilistic kernel to big data, which is the main contribution of our work here.
6 Experiments and Discussion
This section empirically evaluates the performance of our VBKS algorithm on two small synthetic datasets and two massive realworld datasets. The kernel posterior belief is computed using (13) with . The realworld experiments are performed on a Linux system with Nvidia GeForce GTX GPUs. Stochastic optimization for VBKS is performed in a distributed manner over the GPUs using GPflow [Matthews et al.2017].
6.1 Synthetic Experiments
We will first demonstrate the performance of our VBKS algorithm in identifying kernel(s) that can capture the underlying correlation structure of two synthetic datasets. These two synthetic datasets are generated using the respective true composite kernels and with fixed hyperparameters (Appendix E.1). We set the input dimension as and input domain as . A set of inputs are randomly sampled from and their corresponding outputs are sampled from a fullrank GP prior. Then, we sample another set of inputs from , compute their predictive mean via (1), and use as the synthetic dataset. In all the synthetic experiments, we use inducing inputs and a batch size to perform the SGA update per iteration. Two kernel sets and are used to evaluate the performance of our VBKS algorithm where contains kernels constructed from the base kernels (i.e., SE, RQ, LIN, PER) by applying the grammar rules of duvenaud13 duvenaud13 until level while includes kernels randomly sampled from and the two true kernels, as shown in Appendix E.1. The small kernel set is constructed such that the posterior probabilities of all the kernels in the kernel space can be easily observed and visualized.
Figs. 2a and 2b show the kernel posterior belief over that is produced by our VBKS algorithm. Figs. 2c and 2d include only the kernels whose posterior probabilities have ever been ranked as the top two among during stochastic optimization for VBKS. It can be observed in all the experiments that the posterior probability of the true kernel is small at the beginning and, with a growing data size, is increased by our VBKS algorithm to be around which is much larger than that of the other kernels. Though the SGPR model used by our VBKS algorithm produces a lowrank approximation of the true covariance structure of the data using a small set of inducing variables, our algorithm can find the kernel that correctly interprets the underlying correlation structure. It can be observed from Figs. 2a and 2b that the kernel uncertainty is large (i.e., no kernel has a much larger posterior probability than the others) when less than half of the data is used in these experiments. In such cases, the high kernel uncertainty shows that the current observed data is insufficient in identifying a single kernel that fits the true correlation structure much better than the other kernels. This implies the need to collect more data (as has been done in the experiments) or perform GP prediction with the kernel uncertainty.
6.2 RealWorld Experiments
This section empirically evaluates the performance and time efficiency of our VBKS algorithm on two massive realworld datasets: (a) Swissgrid dataset^{4}^{4}4https://www.swissgrid.ch contains records of the total energy consumed by end users in the Swiss control block from January , to December , in every minutes, and (b) indoor environmental quality (IEQ) dataset^{5}^{5}5http://db.csail.mit.edu/labdata/labdata.html contains temperature (C) taken in every seconds between February and April , by sensors deployed in the Intel Berkeley Research lab.
The candidate kernel set is used in the experiments for both datasets. The predictive performance of our VBKS algorithm is obtained using BMA over kernels with topranked posterior probabilities and compared with that of (a) Bayesian optimization (BO) for kernel selection [Malkomes and Garnett2015] which automatically finds the kernel with the highest model evidence using BO, (b) random algorithm which randomly selects a kernel from at the beginning of each experiment, and (c) VBKS performing GP prediction with a single kernel that yields the highest posterior probability (VBKSs). For each dataset, observations are randomly selected to form the test set . The root mean squared error (RMSE) metric is used to evaluate the predictive performance of the tested algorithms. The RMSEs of the VBKS(s) and random algorithms are averaged over and
independent runs, respectively. The error bars are computed in the form of standard deviation. To save some time from sampling
, the hyperparameters used by all tested algorithms are point estimates by settingas a zero matrix.
Swissgrid Dataset.
We use inducing inputs and a batch size for SGA update per iteration. The kernel posterior belief produced by our VBKS algorithm is evaluated after every 1.28% of data are used, as shown in Fig. 3a. For clarity, we only visualize the kernels whose posterior probabilities have ever been ranked as the top three during stochastic optimization for VBKS. It can be observed that a group of three kernels (rather than a single kernel) achieve comparable posterior probabilities which are larger than that of the other kernels. This implies that the uncertainty among the kernels need to be considered when the selected kernels are used to interpret the correlation structure or perform GP prediction, which is a key benefit of using our VBKS algorithm. The predictive performance of the tested algorithms for Swissgrid dataset is shown in Fig. 3b: Both the VBKS and VBKSs algorithms converge faster to a smaller RMSE than the other tested algorithms. VBKS performs better than VBKSs since BMA can benefit from different kernels in modeling the data when no kernel truly stands out. The BO algorithm performs poorly because it has to approximate the distance between kernels using a small subset of data (i.e., of size here).^{6}^{6}6We also tried a larger subset of data of size to better approximate the kernel distances in BO, which yields similar predictive performance but incurs much more time. If the small subset of data is not large enough to approximate the kernel distances well, the BO performance will degrade, which is the case in our experiments.
IEQ Dataset.
In this experiment, the time and locations of the sensors for producing the temperature readings are used jointly as the input (i.e., ). The first million valid data points are selected for our experiments. We use inducing inputs and a batch size for SGA update per iteration. Fig. 4 shows the kernel posterior belief and RMSE of the tested algorithms for IEQ dataset. Different from the results in Fig. 3a for Swissgrid dataset, Fig. 4a shows that the posterior probability of a specific kernel (i.e., ) increases fast to be much larger than that of the other kernels. However, it can be observed from Fig. 4b that VBKS still outperforms VBKSs because the training of GP model using overfits to the training data and our VBKS algorithm with BMA helps to reduce the overfitting. VBKS also converges to a smaller RMSE than all other tested algorithms, as shown in Fig. 4b. In addition, for both Swissgrid and IEQ datasets, we observe that VBKS has consistently achieved smaller RMSE than VBKSs (albeit slightly) in all independent runs, which demonstrates the benefit of applying BMA. The results of all these runs (instead of the averaged RMSE) are in Appendix E.2.
Scalability.
Fig. 5 shows the time efficiency of our VBKS algorithm for different batch sizes for SGA update per iteration. As expected, the total time incurred by VBKS increases linearly in the number of iterations of SGA updates.
7 Conclusion
This paper describes a novel VBKS algorithm for SGPR models that considers a probabilistic kernel and exploits the kernel uncertainty to avoid overconfident predictions. A stochastic optimization method for VBKS is proposed for learning the kernel variational distribution together with the other variational variables (i.e., inducing variables and kernel hyperparameters). Our VBKS algorithm achieves scalability in the kernel size by decomposing the variational lower bound into an additive form such that each additive term (i.e., local ELBO) depends on the variational variables of only one kernel and can thus be maximized independently. Then, each additive local ELBO is optimized via SGA by sampling from both the variational distribution and the data, which scales to big data since it incurs constant time per iteration of SGA update. The predictive performance of our VBKS algorithm with BMA is shown to outperform the other tested kernel selection algorithms. A limitation of VBKS is that it uses a finite and fixed kernel space, which does not allow flexible exploration. In our future work, we will consider expanding the kernel space during stochastic optimization according to the intermediate learning outcomes and extending VBKS to operate with an infinite kernel space and a deep GP model [Yu et al.2019a].
Acknowledgments.
This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) program, SingaporeMIT Alliance for Research and Technology (SMART) Future Urban Mobility (FM) IRG, the Singapore Ministry of Education Academic Research Fund Tier , MOET, and the Joint Funds of the National Natural Science Foundation of China under Key Program Grant U.
References
 [Benton et al.2019] Benton, G. W.; Maddox, W. J.; Salkey, J. P.; Albinati, J.; and Wilson, A. G. 2019. Functionspace distributions over kernels. In Proc. NeurIPS.
 [Chen et al.2012] Chen, J.; Low, K. H.; Tan, C. K.Y.; Oran, A.; Jaillet, P.; Dolan, J. M.; and Sukhatme, G. S. 2012. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, 163–173.
 [Chen et al.2013] Chen, J.; Cao, N.; Low, K. H.; Ouyang, R.; Tan, C. K.Y.; and Jaillet, P. 2013. Parallel Gaussian process regression with lowrank covariance matrix approximations. In Proc. UAI, 152–161.
 [Chen et al.2015] Chen, J.; Low, K. H.; Jaillet, P.; and Yao, Y. 2015. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobilityondemand systems. IEEE Transactions on Automation Science and Engineering 12(3):901–921.
 [Duvenaud et al.2013] Duvenaud, D.; Lloyd, J. R.; Grosse, R.; Tenenbaum, J. B.; and Ghahramani, Z. 2013. Structure discovery in nonparametric regression through compositional kernel search. In Proc. ICML, 1166–1174.
 [Gal and Turner2015] Gal, Y., and Turner, R. 2015. Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In Proc. ICML, 655–664.
 [Hensman et al.2015] Hensman, J.; Matthews, A. G.; Filippone, M.; and Ghahramani, Z. 2015. MCMC for variationally sparse Gaussian processes. In Proc. NeurIPS, 1648–1656.
 [Hoang, Hoang, and Low2015] Hoang, T. N.; Hoang, Q. M.; and Low, K. H. 2015. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, 569–578.
 [Hoang, Hoang, and Low2016] Hoang, T. N.; Hoang, Q. M.; and Low, K. H. 2016. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, 382–391.
 [Hoang, Hoang, and Low2017] Hoang, Q. M.; Hoang, T. N.; and Low, K. H. 2017. A generalized stochastic variational Bayesian hyperparameter learning framework for sparse spectrum Gaussian process regression. In Proc. AAAI, 2007–2014.
 [Hoeting et al.1999] Hoeting, J. A.; Madigan, D.; Raftery, A. E.; and Volinsky, C. T. 1999. Bayesian model averaging: a tutorial. Statistical science 382–401.
 [Kim and Teh2018] Kim, H., and Teh, Y. W. 2018. Scaling up the automatic statistician: Scalable structure discovery using Gaussian processes. In Proc. AISTATS.
 [LázaroGredilla et al.2010] LázaroGredilla, M.; QuiñoneroCandela, J.; Rasmussen, C. E.; and FigueirasVidal, A. R. 2010. Sparse spectrum Gaussian process regression. JMLR 11:1865–1881.
 [Lloyd et al.2014] Lloyd, J. R.; Duvenaud, D.; Grosse, R.; Tenenbaum, J. B.; and Ghahramani, Z. 2014. Automatic construction and naturallanguage description of nonparametric regression models. In Proc. AAAI.
 [Low et al.2015a] Low, K. H.; Chen, J.; Hoang, T. N.; Xu, N.; and Jaillet, P. 2015a. Recent advances in scaling up Gaussian process predictive models for large spatiotemporal data. In Ravela, S., and Sandu, A., eds., Proc. Dynamic Datadriven Environmental Systems Science Conference (DyDESS’14). LNCS 8964, Springer.
 [Low et al.2015b] Low, K. H.; Yu, J.; Chen, J.; and Jaillet, P. 2015b. Parallel Gaussian process regression for big data: Lowrank representation meets Markov approximation. In Proc. AAAI.
 [Lu et al.2018] Lu, X.; Gonzalez, J.; Dai, Z.; and Lawrence, N. 2018. Structured variationally autoencoded optimization. In Proc. ICML, 3273–3281.
 [Malkomes and Garnett2015] Malkomes, G., and Garnett, R. 2015. Active structure discovery for Gaussian processes. In ICML Workshop on AutoML.
 [Malkomes, Schaff, and Garnett2016] Malkomes, G.; Schaff, C.; and Garnett, R. 2016. Bayesian optimization for automated model selection. In Proc. NeurIPS, 2900–2908.

[Matthews et al.2017]
Matthews, A. G. d. G.; van der Wilk, M.; Nickson, T.; Fujii, K.;
Boukouvalas, A.; LeónVillagrá, P.; Ghahramani, Z.; and Hensman,
J.
2017.
GPflow: A Gaussian process library using TensorFlow.
JMLR 18(40):1–6.  [Oliva et al.2016] Oliva, J. B.; Dubey, A.; Wilson, A. G.; Póczos, B.; Schneider, J.; and Xing, E. P. 2016. Bayesian nonparametric kernellearning. In Proc. AISTATS, 1078–1086.
 [QuiñoneroCandela and Rasmussen2005] QuiñoneroCandela, J., and Rasmussen, C. E. 2005. A unifying view of sparse approximate Gaussian process regression. JMLR 6:1939–1959.

[Rasmussen and Williams2006]
Rasmussen, C. E., and Williams, C. K.
2006.
Gaussian processes for machine learning
. MIT Press.  [Titsias and LázaroGredilla2013] Titsias, M. K., and LázaroGredilla, M. 2013. Variational inference for Mahalanobis distance metrics in Gaussian process regression. In Proc. NeurIPS, 279–287.
 [Titsias and LázaroGredilla2014] Titsias, M. K., and LázaroGredilla, M. 2014. Doubly stochastic variational Bayes for nonconjugate inference. In Proc. ICML, 1971–1979.
 [Titsias2009] Titsias, M. K. 2009. Variational learning of inducing variables in sparse Gaussian processes. In Proc. AISTATS.
 [Wilson and Adams2013] Wilson, A. G., and Adams, R. P. 2013. Gaussian process kernels for pattern discovery and extrapolation. In Proc. ICML.
 [Xu et al.2014] Xu, N.; Low, K. H.; Chen, J.; Lim, K. K.; and Özgül, E. B. 2014. GPLocalize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, 2585–2592.
 [Yu et al.2019a] Yu, H.; Chen, Y.; Dai, Z.; Low, K. H.; and Jaillet, P. 2019a. Implicit posterior variational inference for deep Gaussian processes. In Proc. NeurIPS.
 [Yu et al.2019b] Yu, H.; Hoang, T. N.; Low, K. H.; and Jaillet, P. 2019b. Stochastic variational inference for Bayesian sparse Gaussian process regression. In Proc. IJCNN.
Appendix A Derivation of (5)
Since for any ,
(16) 
Let
be an arbitrary probability density function. Then, we can take an expectation w.r.t.
on both side of (16), which yieldswhere the fourth equality is due to the fact that is conditionally independent of given since and the fifth equality follows from the definition of KL divergence.
Appendix B Derivation of (8)
Appendix C Derivation of (11)
c.1 Derivation of
Firstly, is a Gaussian with the following mean and covariance:
where , , and denotes computed using hyperparameters . Then,
The gradient can thus be computed automatically using TensorFlow based on the above analytic equation.
c.2 Derivation of
Using the definition of KL divergence,
where the entropy and the KL divergence terms can be computed analytically such that their gradients w.r.t. can be computed automatically using TensorFlow. Though the last term cannot be computed analytically, we can approximate its gradient w.r.t. via the reparameterization trick (Section 4.1) by drawing i.i.d. samples for from : .
Appendix D Derivation of the Approximated Predictive Belief
d.1 Derivation of (14)
From (3),