The Gaussian process (GP) is a non-parametric probabilistic model that can be used to model an unknown nonlinear function $f$ from observed input data $x$ and (noisy) output data $y$. No explicit form of $f$ is assumed, but some assumptions on $f$ are encoded through the GP prior: a mean function $m(\cdot)$, a covariance function $k(\cdot,\cdot)$, and their so-called hyperparameters $\theta$. In mathematical terms, $f$ is a priori modeled to be distributed as
$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \tag{1}$$
i.e., an infinite-dimensional Gaussian distribution. See Rasmussen and Williams [1] for a more general introduction to GPs.
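To make the closed-form GP machinery concrete, here is a minimal Python sketch (illustrative only; the paper's accompanying code is in Matlab). It assumes a squared-exponential covariance function, a zero mean function, and hyperparameters $\theta = (\ell, \sigma_f^2, \sigma_n^2)$; all function and parameter names are ours:

```python
import numpy as np

def sq_exp_kernel(x1, x2, ell=1.0, sf2=1.0):
    """Squared-exponential covariance k(x, x'); ell (length-scale) and
    sf2 (signal variance) are hyperparameters."""
    d = x1[:, None] - x2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x, y, xs, ell, sf2, sn2):
    """Closed-form GP posterior mean/variance at test inputs xs, for a
    zero prior mean and noise variance sn2."""
    K = sq_exp_kernel(x, x, ell, sf2) + sn2 * np.eye(len(x))
    Ks = sq_exp_kernel(x, xs, ell, sf2)
    Kss = sq_exp_kernel(xs, xs, ell, sf2)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                       # posterior mean
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)  # posterior variance
    return mean, var
```

Note that the posterior depends on the values chosen for `ell`, `sf2`, and `sn2`; this dependence is what the rest of the paper is about.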
The posterior distribution over $f$ given data $\mathcal{D}$ is also a GP, due to the conjugacy property of the Gaussian distribution. The posterior is often greatly influenced by the choice of hyperparameters $\theta$, which are typically unknown. We therefore propose a method to marginalize the hyperparameters in GPs. Marginalization can be seen as averaging over the range of hyperparameters supported by the data and by the prior;
$\theta$ can be integrated out by treating it as a random variable with prior $p(\theta)$ and likelihood $p(\mathcal{D} \mid \theta)$, giving rise to the posterior $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)$. For example, the predictive distribution is computed by
$$p(y^\star \mid \mathcal{D}) = \int p(y^\star \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta, \tag{2}$$
which unfortunately is analytically intractable. However, using a Monte Carlo method to obtain (weighted) samples $\{\theta^i, w^i\}_{i=1}^N$ of the distribution $p(\theta \mid \mathcal{D})$, the predictive distribution (2) can be approximated by
$$p(y^\star \mid \mathcal{D}) \approx \sum_{i=1}^{N} w^i\, p(y^\star \mid \theta^i, \mathcal{D}), \tag{3}$$
where the weights $w^i$ are normalized, i.e., $\sum_{i=1}^N w^i = 1$.
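Since each term in the sum is a Gaussian predictive density, the approximation is a Gaussian mixture, and its first two moments can be computed in closed form. A small Python sketch (our own illustration; `predict_fn` is a hypothetical callback returning the fixed-$\theta$ predictive mean and variance):

```python
import numpy as np

def marginalized_predictive(theta_samples, weights, predict_fn):
    """Approximate the marginalized predictive as a weighted Gaussian
    mixture over hyperparameter samples. predict_fn(theta) returns
    (mean, var) of the predictive for fixed hyperparameters theta."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # enforce normalization
    means, variances = zip(*(predict_fn(th) for th in theta_samples))
    means, variances = np.asarray(means), np.asarray(variances)
    # Moments of the mixture sum_i w_i * N(mean_i, var_i)
    mix_mean = np.sum(weights[:, None] * means, axis=0)
    mix_var = (np.sum(weights[:, None] * (variances + means ** 2), axis=0)
               - mix_mean ** 2)
    return mix_mean, mix_var
```

The mixture variance contains a between-samples term, which is exactly the extra predictive uncertainty that a single point estimate of $\theta$ would ignore.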
A common alternative to marginalization is to choose a point estimate of $\theta$ using an optimization procedure that maximizes the likelihood (sometimes referred to as empirical Bayes). This may be difficult if the likelihood is multimodal. The small toy example in Figure 3 illustrates the robustness of marginalization compared to point estimates. There are also situations where point estimates are not sufficient and marginalization is necessary, such as the change point detection problem in Section III-C.
Our contribution is a method for sampling from the hyperparameter posterior distribution $p(\theta \mid \mathcal{D})$, based on sequential Monte Carlo (SMC) samplers [2]. SMC samplers and their convergence properties are well studied [2, 3].
Several methods have previously been proposed in the literature for marginalization of the GP hyperparameters: Bayesian Monte Carlo (BMC) [4], slice sampling [5], Hamiltonian Monte Carlo [6, 7], and adaptive importance sampling (AIS) [8]. Particle learning, which is closely related to SMC, has been proposed for this purpose by [9]. The work in [9], however, does not target the hyperparameters directly, and makes (possibly restrictive) assumptions on conjugate priors and model structure.
In this paper, we compare our proposed method to some of these methods, and apply it to two real-data problems: the first demonstrates that marginalization does not have to be more computationally demanding than finding point estimates; the second, a fault detection problem from industry, is feasible only with an efficient method for marginalization. Our proposed method (and all examples) are available as Matlab code via the first author's homepage.
From the experiments, we conclude that the advantages of the proposed method are (i) robustness towards multimodal hyperparameter posteriors, (ii) simplified tuning (compared to some other alternatives), (iii) competitive computational load, and (iv) online updating of hyperparameters as the data record grows.
II Sampling hyperparameters using SMC
For the numerical marginalization (3), we require samples, known as particles, from the posterior. In this section, we discuss how to use an SMC sampler [2] to generate such a particle system $\{\theta^i, w^i\}_{i=1}^N$, where $w^i$ is the weight of particle $\theta^i$. The underlying idea is to construct a sequence of probability distributions $\pi_0, \dots, \pi_T$, starting from the prior $\pi_0(\theta) = p(\theta)$ and ending up in the posterior $\pi_T(\theta) = p(\theta \mid \mathcal{D})$. The particles are then ‘guided’ through the sequence.
To construct a sequence $\{\pi_t\}_{t=0}^T$, we use the fact that $p(\theta \mid \mathcal{D})$ depends on the data $\mathcal{D}$, by partitioning the data points into $T$ disjoint batches $d_1, \dots, d_T$ and adding them sequentially as $\pi_t(\theta) = p(\theta \mid d_1, \dots, d_t)$.
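The construction of this data-tempered sequence can be sketched in a few lines of Python (an illustration of the idea, not the paper's implementation; the even split into batches is our assumption):

```python
import numpy as np

def partial_data_sequence(n, T):
    """Index sets of the data used by each intermediate target pi_t:
    pi_0 uses no data (the prior), pi_T uses all n points (the posterior)."""
    batches = np.array_split(np.arange(n), T)   # T disjoint batches
    seq = [np.array([], dtype=int)]
    for b in batches:
        # pi_t conditions on everything pi_{t-1} saw, plus one new batch
        seq.append(np.concatenate([seq[-1], b]))
    return seq
```

Each intermediate target differs from its predecessor only by one batch of data, which keeps consecutive distributions close and makes the sequence easy for the particles to follow.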
To guide the particles through the smooth sequence $\{\pi_t\}$, we iteratively apply the three steps weighting, resampling, and propagation, akin to a particle filter.
In the weighting step, the ‘usefulness’ of each particle is evaluated. To retain convergence properties, the particles can be weighted as [2, Section 3.3.2]
$$w_t^i \propto w_{t-1}^i\, \frac{\pi_t(\theta_{t-1}^i)}{\pi_{t-1}(\theta_{t-1}^i)}.$$
To avoid numerical problems caused by weight degeneracy, the particles have to be resampled. The idea is to duplicate particles with large weights, and discard particles with small weights.
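The paper does not specify which resampling scheme is used; systematic resampling is one common low-variance choice, sketched here in Python for illustration:

```python
import numpy as np

def systematic_resample(weights, rng):
    """Systematic resampling: returns N ancestor indices, duplicating
    high-weight particles and discarding low-weight ones, with expected
    count N * w_i for particle i."""
    N = len(weights)
    # One uniform draw, N evenly spaced positions in [0, 1)
    positions = (rng.random() + np.arange(N)) / N
    return np.searchsorted(np.cumsum(weights), positions)
```

After resampling, all particles are assigned equal weight $1/N$.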
To propagate the particles from $\pi_{t-1}$ to $\pi_t$, a Metropolis-Hastings (MH) kernel with invariant distribution $\pi_t$ can be used. The procedure of propagating $\theta_{t-1}^i$ (a sample of $\pi_{t-1}$) to $\theta_t^i$ (a sample of $\pi_t$) is as follows: (i) Sample a new particle $\theta'$ from a proposal $q(\cdot \mid \theta_{t-1}^i)$, e.g., a random walk with some variance. (ii) Compensate for the discrepancy between $q$ and $\pi_t$ by setting $\theta_t^i = \theta'$ with probability
$$\alpha = \min\left(1,\; \frac{\pi_t(\theta')\, q(\theta_{t-1}^i \mid \theta')}{\pi_t(\theta_{t-1}^i)\, q(\theta' \mid \theta_{t-1}^i)}\right),$$
and $\theta_t^i = \theta_{t-1}^i$ otherwise. To improve the mixing, this procedure can be repeated $K$ times.
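For a symmetric random-walk proposal, the proposal densities cancel in the acceptance ratio. A minimal Python sketch of this propagation step (our illustration; `log_target` stands for the unnormalized log of $\pi_t$):

```python
import numpy as np

def mh_kernel(theta, log_target, step, rng, K=5):
    """K random-walk Metropolis-Hastings moves; the symmetric proposal
    cancels in the acceptance ratio, leaving the target invariant."""
    for _ in range(K):
        prop = theta + step * rng.standard_normal(theta.shape)
        # Accept with probability min(1, pi_t(prop) / pi_t(theta))
        if np.log(rng.random()) < log_target(prop) - log_target(theta):
            theta = prop
    return theta
```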
We now have an SMC sampler to obtain samples from the hyperparameter posterior, summarized in Algorithm 1 and illustrated by Figure 6. From the figure, the suitability to online applications is clear: if another data point is added to the data, the sequence can be extended to $\pi_{T+1}$ including the new data point, and only the transition from $\pi_T$ to $\pi_{T+1}$ has to be performed.
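The full weight/resample/propagate loop can be sketched as follows (an illustrative Python rendering under our own simplifications: multinomial resampling at every step rather than adaptively, a fixed random-walk step, and user-supplied log-density callbacks; the paper's actual Matlab implementation differs):

```python
import numpy as np

def smc_hyperparameter_sampler(log_prior, log_lik_batch, batches, sample_prior,
                               N=300, K=5, step=0.5, seed=0):
    """Guide N particles through pi_t = p(theta | d_1, ..., d_t),
    starting from the prior pi_0 = p(theta)."""
    rng = np.random.default_rng(seed)
    theta = sample_prior(N, rng)             # draws from pi_0
    for t, batch in enumerate(batches):
        # Weighting: with resampling every step, the incremental weight
        # pi_t / pi_{t-1} reduces to the likelihood of the new batch
        logw = np.array([log_lik_batch(th, batch) for th in theta])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resampling (multinomial, every step, for simplicity)
        theta = theta[rng.choice(N, size=N, p=w)]
        # Propagation: K random-walk MH moves leaving pi_t invariant
        processed = batches[:t + 1]
        def log_target(th):
            return log_prior(th) + sum(log_lik_batch(th, b) for b in processed)
        for i in range(N):
            th = theta[i].copy()
            for _ in range(K):
                prop = th + step * rng.standard_normal(th.shape)
                if np.log(rng.random()) < log_target(prop) - log_target(th):
                    th = prop
            theta[i] = th
    return theta, np.full(N, 1.0 / N)        # equal weights after resampling
```

Extending the data record only requires appending one more batch and executing one more iteration of the loop, which is what makes the online updating cheap.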
We make use of the adaptive SMC sampler by [10] in the numerical examples to adapt the proposal $q$ automatically.
The computational cost of Algorithm 1 is in practice governed by the evaluations of the likelihood $p(\mathcal{D} \mid \theta)$. Hence, it is important to choose the number of particles $N$, SMC steps $T$, and MH moves $K$ per SMC step sensibly. An idea of sensible numbers is given along with the examples in the next section.
III Examples and results
We consider three examples to demonstrate our proposed approach: first a small simulated example, also comparing to alternative sampling methods, and thereafter two applications with real-world data. The first real-data example is a benchmark problem to compare the marginalization approach in Algorithm 1 to the point estimates obtained using optimization. In the third example, we illustrate how our solution can be used within a GP-based online change point detection algorithm. That algorithm requires marginalization of the hyperparameters, so an efficient hyperparameter posterior sampler is indeed a key enabler. The online nature of the problem also fits well with the possibility to update the samples in Algorithm 1 online, as discussed in Section II.
III-A Simulated example
We consider the small problem of marginalizing out 7 hyperparameters (of the covariance and mean functions) in a GP prior, given 5 data points. Here, we are interested in comparing the performance of our SMC sampler (Algorithm 1) with some popular alternative methods: BMC [4], AIS [8], and (deterministic) gridding.
The results for 15 runs are presented in Figure 7; a reliable method should give similar results across repeated runs of the same algorithm. The variation between runs decreases faster for Algorithm 1 than for the comparable methods. We conclude that AIS and gridding might be competitive when the GP prior has few hyperparameters. We have not managed to obtain competitive results with BMC for any problem size, but it should be noted that the computational load of BMC can be substantially decreased if the hyperparameter prior is independent between the dimensions.
The results for the conceptually different point estimates are also presented in Figure 7. The initialization point of the optimization algorithm is drawn from the prior: although the optimization itself is deterministic, it is obviously very sensitive to the initialization.
III-B Learning a robot arm model
We consider the problem of learning the inverse dynamics of a seven degrees-of-freedom SARCOS anthropomorphic robot arm [11, 1]. We use the same setup as [1, Section 2.5], i.e., a non-trivial setting involving 23 hyperparameters.
To handle the size of the data set, we make use of a subset of (i) data points and (ii) regressors, as discussed in [1, Section 8.3.1]. To use our method, we sample the hyperparameters from the posterior based on a subset of $m$ data points. For comparability, we have also reproduced the results using point estimates as in [1]. The results are reported in Table I. Gaussian priors were placed on the logarithms of the length-scales and the signal variance, as well as on the noise variance.
|Subset of datapoints| | | | |
|Method|m|SMSE|MSLL|Time|
|Point est.|256|8.36 ± 0.80|−1.38 ± 0.04|6.8|
|SMC|256|8.10 ± 1.32|−1.38 ± 0.56|7.1|
|Point est.|512|6.36 ± 1.13|−1.51 ± 0.05|26.4|
|SMC|512|6.13 ± 0.91|−1.49 ± 0.04|22.3|
|Point est.|1024|4.31 ± 0.16|−1.66 ± 0.02|101|
|SMC|1024|4.54 ± 0.33|−1.61 ± 0.03|92.5|
|Point est.|2048|2.99 ± 0.08|−1.78 ± 0.03|423|
|SMC|2048|3.33 ± 0.28|−1.69 ± 0.06|405|
|Subset of regressors| | | | |
|Method|m|SMSE|MSLL|Time|
|Point est.|256|3.67 ± 0.17|−1.63 ± 0.02|6.8|
|SMC|256|3.55 ± 0.28|−1.65 ± 0.05|7.1|
|Point est.|512|2.77 ± 0.44|−1.79 ± 0.07|26.4|
|SMC|512|2.89 ± 0.20|−1.77 ± 0.03|22.3|
|Point est.|1024|2.03 ± 0.11|−1.95 ± 0.03|101|
Table I presents the results in the same way as [1, Table 8.1]. SMSE is the standardized mean squared error (i.e., the mean squared error normalized by the variance of the target), and MSLL is the mean standardized log loss, which is 0 when predicting with a Gaussian density whose mean and variance match the training data, and negative if ‘better’. The time refers to the time required to sample or optimize the hyperparameters, respectively (not including the test evaluation). Numerical problems were experienced for the largest subsets; runs where no interval can be reported are left out of the table.
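The two metrics are straightforward to compute; a short Python sketch (our own, following the definitions above):

```python
import numpy as np

def smse(y_true, y_pred, y_train):
    """Standardized mean squared error: MSE normalized by the variance of
    the training targets, so the trivial mean predictor scores about 1."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_train)

def msll(y_true, mean, var, y_train):
    """Mean standardized log loss: negative predictive log density minus the
    log loss of a Gaussian fitted to the training targets (0 = trivial model,
    negative = better than trivial)."""
    nll = 0.5 * np.log(2 * np.pi * var) + (y_true - mean) ** 2 / (2 * var)
    m0, v0 = np.mean(y_train), np.var(y_train)
    nll0 = 0.5 * np.log(2 * np.pi * v0) + (y_true - m0) ** 2 / (2 * v0)
    return np.mean(nll - nll0)
```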
Table I indicates no significant difference between the performance of our method and point estimates. It is, however, also worth noting the computational load: since Algorithm 1 apparently does an equally good job of finding relevant hyperparameters as the optimization, this confirms that our proposed method is indeed a competitive alternative to point estimates even for large problems.
III-C Fault detection of oxygen sensors
We now consider data from the wastewater treatment plant Käppalaverket, Sweden. An oxygen sensor measures the dissolved oxygen (in mg/l) in a bioreactor, but the sensor gets clogged when its cleaning is suspended. The identification of such events is relevant to the control of wastewater treatment plants [12]. We apply the GP-based online change point detection algorithm by [7], where the hyperparameters are marginalized using our proposed method.
The GP-based change point detection presented in [7] can be summarized as follows: if the data undergo a change at some time point, it is of interest to detect it online, i.e., to estimate the location of the change as data arrive. The algorithmic idea is a recursive message passing scheme, updating the probability $p(r_t \mid y_{1:t})$ of the run length $r_t$, where $t - r_t$ is the last change point at time $t$.
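One step of such a run-length recursion can be written compactly; the following Python sketch is a generic, simplified illustration of the message passing idea (with a constant hazard rate as our assumption), not the cited algorithm's exact implementation:

```python
import numpy as np

def runlength_update(log_rl, log_pred, hazard):
    """One step of the run-length recursion. log_rl[r] is the current
    log-probability of run length r, log_pred[r] the log predictive density
    of the new observation given run length r, and hazard the per-step
    change point probability."""
    # No change: run length r grows to r + 1
    growth = log_rl + log_pred + np.log(1.0 - hazard)
    # Change: all run lengths collapse to r = 0
    cp = np.logaddexp.reduce(log_rl + log_pred) + np.log(hazard)
    new = np.concatenate(([cp], growth))
    return new - np.logaddexp.reduce(new)      # normalize in log space
```

The predictive density entering this recursion is exactly the hyperparameter-marginalized GP predictive, which is where Algorithm 1 comes in.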
To make predictions using a GP model, the hyperparameters either have to be fixed across all data segments, or marginalized. As it is not sensible to use fixed hyperparameters here, an efficient sampling algorithm is a key enabler in solving this problem. The algorithm needs consecutive predictions as the data record grows, hence our approach fits this problem well, as discussed in Section II. On average, sampling the hyperparameters, i.e., one run of Algorithm 1, took 0.55 seconds on a standard desktop computer.
The results are presented in Figure (a). The expected change points, at the suspension and the resumption of the cleaning, are indeed indicated. An interpretation of the result is obtained by converting the results to point estimates of the change points by thresholding, and plotting the GP regression for each individual segment, see Figure (b).
Note the data-driven nature of the algorithm: no explicit model of the sensor is used at all. The tuning parameters are the covariance and mean functions, the prior on the change points, and the hyperparameter priors.
We have proposed and demonstrated an SMC-based method to marginalize hyperparameters in GP models. The observed benefits are robustness towards multimodal posteriors (Figure 3) and a competitive computational load (Section III-B), also compared to the commonly used point estimates of the hyperparameters. We have been able to cope with a hyperparameter space of dimension 23 (Section III-B), and have also observed sound convergence behavior (Section III). Finally, the online update of the hyperparameters has been shown useful within the industry-relevant data-driven fault detection application (Section III-C). As a future direction, it would be interesting to apply our method to the challenging GP optimization problem of system identification [13].
This work was supported by the project Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524) funded by the Swedish Research Council (VR). We would also like to thank Oscar Samuelsson and Dr. Jesús Zambrano for providing the sensor data in Section III-C.
-  C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
-  P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo samplers,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 3, pp. 411–436, 2006.
-  N. Whiteley, “Sequential Monte Carlo samplers: error bounds and insensitivity to initial conditions,” Stochastic Analysis and Applications, vol. 30, no. 5, pp. 774–798, 2012.
-  M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings, “Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes,” in Proceedings of the 7th international conference on information processing in sensor networks, St. Louis, MO, USA, Apr. 2008, pp. 109–120.
-  D. K. Agarwal and A. E. Gelfand, “Slice sampling for simulation based fitting of spatial data models,” Statistics and Computing, vol. 15, no. 1, pp. 61–69, 2005.
-  R. M. Neal, “MCMC using Hamiltonian dynamics,” in Handbook of Markov Chain Monte Carlo, S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Eds. Chapman & Hall/CRC Press, 2010.
-  Y. Saatçi, R. D. Turner, and C. E. Rasmussen, “Gaussian process change point models,” in Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, Jun. 2010, pp. 927–934.
-  D. Petelin, M. Gašperin, and V. Šmídl, “Adaptive importance sampling for Bayesian inference in Gaussian process models,” in Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, Aug. 2014, pp. 5011–5015.
-  R. B. Gramacy and N. G. Polson, “Particle learning of Gaussian process models for sequential design and optimization,” Journal of Computational and Graphical Statistics, vol. 20, no. 1, pp. 102–118, 2011.
-  P. Fearnhead and B. M. Taylor, “An adaptive sequential Monte Carlo sampler,” Bayesian Analysis, vol. 8, no. 2, pp. 411–438, 2013.
-  S. Vijayakumar and S. Schaal, “Locally weighted projection regression: Incremental real time learning in high dimensional space,” in Proceedings of the 7th International Conference on Machine Learning (ICML), Stanford, CA, USA, Jun. 2000, pp. 1079–1086.
-  G. Olsson, B. Carlsson, J. Comas, J. Copp, K. V. Gernaey, P. Ingildsen, U. Jeppsson, C. Kim, L. Rieger, I. Rodriguez-Roda, J.-P. Steyer, I. Takács, P. A. Vanrolleghem, A. Vargas Casillas, Z. Yuan, and L. Åmand, “Instrumentation, control and automation in wastewater – from London 1973 to Narbonne 2013.” Water Science and Technology, vol. 69, no. 7, pp. 1373–1385, 2014.
-  J. Dahlin and F. Lindsten, “Particle filter-based Gaussian process optimisation for parameter inference,” in Proceedings of the 19th IFAC World Congress, Cape Town, South Africa, Aug. 2014, pp. 8675–8680.